--- title: "Algorithm and Mathematical Framework" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Algorithm and Mathematical Framework} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, fig.align = "center" ) ``` ## Introduction TorchDecon implements a deep learning-based approach for cell type deconvolution, originally proposed by Menden et al. (2020) in their Scaden algorithm. This vignette provides a comprehensive overview of the mathematical framework and algorithmic principles underlying TorchDecon. **Author**: Zaoqu Liu (liuzaoqu@163.com) ## Problem Formulation ### The Deconvolution Problem Cell type deconvolution aims to estimate the cellular composition of bulk tissue samples. Given a bulk expression profile $\mathbf{x} \in \mathbb{R}^G$ (where $G$ is the number of genes), we seek to estimate the cell type fraction vector $\mathbf{f} \in \mathbb{R}^K$ (where $K$ is the number of cell types) such that: $$\sum_{k=1}^{K} f_k = 1, \quad f_k \geq 0 \quad \forall k$$ ### Traditional Approaches vs. Deep Learning Traditional deconvolution methods (e.g., CIBERSORT, MuSiC) rely on: 1. **Signature matrices**: Pre-defined gene expression signatures for each cell type 2. **Linear mixing models**: Assumption that bulk expression is a linear combination of cell type signatures $$\mathbf{x} = \mathbf{S} \cdot \mathbf{f} + \boldsymbol{\epsilon}$$ where $\mathbf{S} \in \mathbb{R}^{G \times K}$ is the signature matrix. **TorchDecon's deep learning approach** instead learns a non-linear mapping directly from bulk expression to cell type fractions: $$\mathbf{f} = \mathcal{F}_\theta(\mathbf{x})$$ where $\mathcal{F}_\theta$ is a neural network with parameters $\theta$. ## Algorithmic Pipeline ### Overview ``` ┌─────────────────────────────────────────────────────────────────┐ │ TorchDecon Pipeline │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ scRNA-seq │───▶│ Bulk │───▶│ Training │ │ │ │ Reference │ │ Simulation │ │ Data │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │ │ │ ▼ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ Cell Type │◀───│ Trained │◀───│ Training │ │ │ │ Fractions │ │ Ensemble │ │ Loop │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ ▲ │ │ │ │ │ ┌──────────────┐ │ │ │ Real Bulk │ │ │ │ Data │ │ │ └──────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘ ``` ### Step 1: Bulk RNA-seq Simulation #### Mathematical Formulation For each simulated sample $s$: 1. **Generate random fractions**: Sample $\mathbf{f}^{(s)} \sim \text{Dirichlet}(\boldsymbol{\alpha})$ or uniform distribution, then normalize: $$f_k^{(s)} = \frac{u_k}{\sum_{j=1}^K u_j}, \quad u_k \sim \text{Uniform}(0, 1)$$ 2. **Sample cells**: For each cell type $k$, sample $n_k = \lfloor f_k^{(s)} \cdot N \rfloor$ cells (with replacement), where $N$ is the total cells per sample. 3. **Aggregate expression**: Sum the expression across sampled cells: $$x_g^{(s)} = \sum_{k=1}^{K} \sum_{i \in C_k^{(s)}} c_{gi}$$ where $C_k^{(s)}$ is the set of sampled cells from type $k$, and $c_{gi}$ is the count for gene $g$ in cell $i$. #### Sparse Sample Generation To improve model generalization, TorchDecon generates "sparse" samples where some cell types are absent: $$f_k^{(s)} = \begin{cases} \tilde{f}_k / \sum_{j \in \mathcal{A}} \tilde{f}_j & \text{if } k \in \mathcal{A} \\ 0 & \text{otherwise} \end{cases}$$ where $\mathcal{A} \subset \{1, ..., K\}$ is a randomly selected subset of cell types. ### Step 2: Data Preprocessing #### Log Transformation $$\tilde{x}_g = \log_2(x_g + 1)$$ This transformation: - Reduces the dynamic range of expression values - Stabilizes variance - Makes the data more normally distributed #### Sample-wise Min-Max Normalization For each sample $s$: $$\hat{x}_g^{(s)} = \frac{\tilde{x}_g^{(s)} - \min_j(\tilde{x}_j^{(s)})}{\max_j(\tilde{x}_j^{(s)}) - \min_j(\tilde{x}_j^{(s)})}$$ This ensures: - All features are in $[0, 1]$ - Sample-specific technical variations are mitigated - Neural network training stability is improved #### Gene Filtering Genes are filtered based on variance across samples: $$\text{Var}(x_g) = \frac{1}{n-1} \sum_{s=1}^{n} (x_g^{(s)} - \bar{x}_g)^2 > \tau$$ where $\tau$ is the variance threshold (default: 0.1). ### Step 3: Neural Network Architecture #### Fully Connected Network Each model in the ensemble is a fully connected neural network: $$\mathbf{h}^{(l)} = \sigma(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)})$$ where: - $\mathbf{h}^{(l)}$ is the hidden representation at layer $l$ - $\mathbf{W}^{(l)}, \mathbf{b}^{(l)}$ are learnable weights and biases - $\sigma(\cdot)$ is the ReLU activation: $\sigma(z) = \max(0, z)$ #### Output Layer (Softmax) The final layer uses softmax to ensure valid probability distribution: $$f_k = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)}$$ This guarantees: - $f_k \in (0, 1)$ for all $k$ - $\sum_k f_k = 1$ #### Dropout Regularization During training, dropout randomly zeroes elements: $$\tilde{h}_i = \begin{cases} h_i / (1-p) & \text{with probability } 1-p \\ 0 & \text{with probability } p \end{cases}$$ ### Step 4: Training #### Loss Function Mean Squared Error (MSE) between predicted and true fractions: $$\mathcal{L}(\theta) = \frac{1}{n \cdot K} \sum_{s=1}^{n} \sum_{k=1}^{K} (\hat{f}_k^{(s)} - f_k^{(s)})^2$$ #### Adam Optimizer Parameter updates using Adam: $$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\ \hat{m}_t &= m_t / (1 - \beta_1^t) \\ \hat{v}_t &= v_t / (1 - \beta_2^t) \\ \theta_{t+1} &= \theta_t - \eta \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) \end{aligned}$$ Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\eta = 10^{-4}$ ### Step 5: Ensemble Prediction Final prediction is the arithmetic mean across three models: $$\hat{\mathbf{f}} = \frac{1}{3} \sum_{m \in \{256, 512, 1024\}} \mathcal{F}_{\theta_m}(\mathbf{x})$$ Ensemble benefits: - Reduced variance in predictions - Improved robustness to initialization - Better generalization ## Architecture Specifications | Model | Layer Dimensions | Dropout Rates | Total Parameters* | |-------|-----------------|---------------|-------------------| | M256 | G → 256 → 128 → 64 → 32 → K | 0, 0, 0, 0 | ~G×256 + 50K | | M512 | G → 512 → 256 → 128 → 64 → K | 0, 0.3, 0.2, 0.1 | ~G×512 + 200K | | M1024 | G → 1024 → 512 → 256 → 128 → K | 0, 0.6, 0.3, 0.1 | ~G×1024 + 800K | *Approximate; depends on number of genes (G) and cell types (K) ## Theoretical Considerations ### Universal Approximation Deep neural networks are universal function approximators (Hornik, 1991). Given sufficient capacity, TorchDecon can theoretically approximate any continuous mapping from expression space to fraction space. ### Advantages over Linear Models 1. **Non-linear relationships**: Can capture complex gene-cell type associations 2. **Feature learning**: Automatically learns relevant features from data 3. **Scalability**: Handles high-dimensional data efficiently 4. **No signature matrix**: Eliminates bias from pre-defined signatures ### Limitations 1. **Training data dependency**: Performance depends on quality of simulated training data 2. **Batch effects**: May be sensitive to technical differences between reference and target data 3. **Novel cell types**: Cannot predict cell types not present in training data ## References 1. Menden, K., et al. (2020). Deep learning-based cell composition analysis from tissue expression profiles. *Science Advances*, 6(30), eaba2619. 2. Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. *Neural Networks*, 4(2), 251-257. 3. Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. *ICLR*. --- **Package Author**: Zaoqu Liu **Contact**: liuzaoqu@163.com **GitHub**: https://github.com/Zaoqu-Liu/TorchDecon