---
title: "Algorithm and Mathematical Framework"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Algorithm and Mathematical Framework}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  fig.align = "center"
)
```

## Introduction

TorchDecon implements a deep learning-based approach for cell type deconvolution, originally proposed by Menden et al. (2020) in their Scaden algorithm. This vignette provides a comprehensive overview of the mathematical framework and algorithmic principles underlying TorchDecon.

**Author**: Zaoqu Liu (liuzaoqu@163.com)

## Problem Formulation

### The Deconvolution Problem

Cell type deconvolution aims to estimate the cellular composition of bulk tissue samples. Given a bulk expression profile $\mathbf{x} \in \mathbb{R}^G$ (where $G$ is the number of genes), we seek to estimate the cell type fraction vector $\mathbf{f} \in \mathbb{R}^K$ (where $K$ is the number of cell types) such that:

$$\sum_{k=1}^{K} f_k = 1, \quad f_k \geq 0 \quad \forall k$$

### Traditional Approaches vs. Deep Learning

Traditional deconvolution methods (e.g., CIBERSORT, MuSiC) rely on:

1. **Signature matrices**: Pre-defined gene expression signatures for each cell type
2. **Linear mixing models**: Assumption that bulk expression is a linear combination of cell type signatures

$$\mathbf{x} = \mathbf{S} \cdot \mathbf{f} + \boldsymbol{\epsilon}$$

where $\mathbf{S} \in \mathbb{R}^{G \times K}$ is the signature matrix.

**TorchDecon's deep learning approach** instead learns a non-linear mapping directly from bulk expression to cell type fractions:

$$\mathbf{f} = \mathcal{F}_\theta(\mathbf{x})$$

where $\mathcal{F}_\theta$ is a neural network with parameters $\theta$.

## Algorithmic Pipeline

### Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                    TorchDecon Pipeline                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐      │
│  │  scRNA-seq   │───▶│    Bulk      │───▶│   Training   │      │
│  │  Reference   │    │  Simulation  │    │     Data     │      │
│  └──────────────┘    └──────────────┘    └──────────────┘      │
│                                                 │               │
│                                                 ▼               │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐      │
│  │  Cell Type   │◀───│   Trained    │◀───│   Training   │      │
│  │  Fractions   │    │   Ensemble   │    │    Loop      │      │
│  └──────────────┘    └──────────────┘    └──────────────┘      │
│         ▲                                                       │
│         │                                                       │
│  ┌──────────────┐                                              │
│  │  Real Bulk   │                                              │
│  │    Data      │                                              │
│  └──────────────┘                                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Step 1: Bulk RNA-seq Simulation

#### Mathematical Formulation

For each simulated sample $s$:

1. **Generate random fractions**: Sample $\mathbf{f}^{(s)} \sim \text{Dirichlet}(\boldsymbol{\alpha})$ or uniform distribution, then normalize:
   
   $$f_k^{(s)} = \frac{u_k}{\sum_{j=1}^K u_j}, \quad u_k \sim \text{Uniform}(0, 1)$$

2. **Sample cells**: For each cell type $k$, sample $n_k = \lfloor f_k^{(s)} \cdot N \rfloor$ cells (with replacement), where $N$ is the total cells per sample.

3. **Aggregate expression**: Sum the expression across sampled cells:
   
   $$x_g^{(s)} = \sum_{k=1}^{K} \sum_{i \in C_k^{(s)}} c_{gi}$$
   
   where $C_k^{(s)}$ is the set of sampled cells from type $k$, and $c_{gi}$ is the count for gene $g$ in cell $i$.

#### Sparse Sample Generation

To improve model generalization, TorchDecon generates "sparse" samples where some cell types are absent:

$$f_k^{(s)} = \begin{cases} \tilde{f}_k / \sum_{j \in \mathcal{A}} \tilde{f}_j & \text{if } k \in \mathcal{A} \\ 0 & \text{otherwise} \end{cases}$$

where $\mathcal{A} \subset \{1, ..., K\}$ is a randomly selected subset of cell types.

### Step 2: Data Preprocessing

#### Log Transformation

$$\tilde{x}_g = \log_2(x_g + 1)$$

This transformation:
- Reduces the dynamic range of expression values
- Stabilizes variance
- Makes the data more normally distributed

#### Sample-wise Min-Max Normalization

For each sample $s$:

$$\hat{x}_g^{(s)} = \frac{\tilde{x}_g^{(s)} - \min_j(\tilde{x}_j^{(s)})}{\max_j(\tilde{x}_j^{(s)}) - \min_j(\tilde{x}_j^{(s)})}$$

This ensures:
- All features are in $[0, 1]$
- Sample-specific technical variations are mitigated
- Neural network training stability is improved

#### Gene Filtering

Genes are filtered based on variance across samples:

$$\text{Var}(x_g) = \frac{1}{n-1} \sum_{s=1}^{n} (x_g^{(s)} - \bar{x}_g)^2 > \tau$$

where $\tau$ is the variance threshold (default: 0.1).

### Step 3: Neural Network Architecture

#### Fully Connected Network

Each model in the ensemble is a fully connected neural network:

$$\mathbf{h}^{(l)} = \sigma(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)})$$

where:
- $\mathbf{h}^{(l)}$ is the hidden representation at layer $l$
- $\mathbf{W}^{(l)}, \mathbf{b}^{(l)}$ are learnable weights and biases
- $\sigma(\cdot)$ is the ReLU activation: $\sigma(z) = \max(0, z)$

#### Output Layer (Softmax)

The final layer uses softmax to ensure valid probability distribution:

$$f_k = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)}$$

This guarantees:
- $f_k \in (0, 1)$ for all $k$
- $\sum_k f_k = 1$

#### Dropout Regularization

During training, dropout randomly zeroes elements:

$$\tilde{h}_i = \begin{cases} h_i / (1-p) & \text{with probability } 1-p \\ 0 & \text{with probability } p \end{cases}$$

### Step 4: Training

#### Loss Function

Mean Squared Error (MSE) between predicted and true fractions:

$$\mathcal{L}(\theta) = \frac{1}{n \cdot K} \sum_{s=1}^{n} \sum_{k=1}^{K} (\hat{f}_k^{(s)} - f_k^{(s)})^2$$

#### Adam Optimizer

Parameter updates using Adam:

$$\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
\theta_{t+1} &= \theta_t - \eta \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
\end{aligned}$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\eta = 10^{-4}$

### Step 5: Ensemble Prediction

Final prediction is the arithmetic mean across three models:

$$\hat{\mathbf{f}} = \frac{1}{3} \sum_{m \in \{256, 512, 1024\}} \mathcal{F}_{\theta_m}(\mathbf{x})$$

Ensemble benefits:
- Reduced variance in predictions
- Improved robustness to initialization
- Better generalization

## Architecture Specifications

| Model | Layer Dimensions | Dropout Rates | Total Parameters* |
|-------|-----------------|---------------|-------------------|
| M256  | G → 256 → 128 → 64 → 32 → K | 0, 0, 0, 0 | ~G×256 + 50K |
| M512  | G → 512 → 256 → 128 → 64 → K | 0, 0.3, 0.2, 0.1 | ~G×512 + 200K |
| M1024 | G → 1024 → 512 → 256 → 128 → K | 0, 0.6, 0.3, 0.1 | ~G×1024 + 800K |

*Approximate; depends on number of genes (G) and cell types (K)

## Theoretical Considerations

### Universal Approximation

Deep neural networks are universal function approximators (Hornik, 1991). Given sufficient capacity, TorchDecon can theoretically approximate any continuous mapping from expression space to fraction space.

### Advantages over Linear Models

1. **Non-linear relationships**: Can capture complex gene-cell type associations
2. **Feature learning**: Automatically learns relevant features from data
3. **Scalability**: Handles high-dimensional data efficiently
4. **No signature matrix**: Eliminates bias from pre-defined signatures

### Limitations

1. **Training data dependency**: Performance depends on quality of simulated training data
2. **Batch effects**: May be sensitive to technical differences between reference and target data
3. **Novel cell types**: Cannot predict cell types not present in training data

## References

1. Menden, K., et al. (2020). Deep learning-based cell composition analysis from tissue expression profiles. *Science Advances*, 6(30), eaba2619.

2. Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. *Neural Networks*, 4(2), 251-257.

3. Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. *ICLR*.

---

**Package Author**: Zaoqu Liu  
**Contact**: liuzaoqu@163.com  
**GitHub**: https://github.com/Zaoqu-Liu/TorchDecon