--- title: "Algorithm and Mathematical Framework" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Algorithm and Mathematical Framework} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 6, message = FALSE, warning = FALSE ) ``` ## Overview scGate implements a marker-based cell type purification algorithm that combines: 1. **UCell scoring** for robust signature quantification 2. **k-Nearest Neighbor (kNN) smoothing** for noise reduction 3. **Hierarchical decision trees** for multi-level gating This document provides a detailed explanation of the mathematical framework behind scGate. ```{r load-packages} library(scGate) library(Seurat) library(ggplot2) library(patchwork) ``` ## Algorithm Pipeline ### Step 1: Signature Scoring with UCell scGate uses the UCell algorithm for computing signature scores. UCell is a rank-based method that is robust to technical variation and batch effects. #### Mathematical Formulation For a cell $c$ and a gene signature $S = \{g_1, g_2, ..., g_n\}$: 1. **Rank genes** by expression in cell $c$: $R_c(g)$ is the rank of gene $g$ 2. **Compute UCell score**: $$U_c(S) = 1 - \frac{\sum_{g \in S} \min(R_c(g), R_{max})}{|S| \cdot R_{max}}$$ Where $R_{max}$ is the maximum rank considered (default: 1500). #### Key Properties - **Range**: UCell scores range from 0 to 1 - **Robustness**: Rank-based scoring is robust to outliers - **Interpretability**: Higher scores indicate stronger signature expression ```{r ucell-demo, eval=FALSE, fig.width=10, fig.height=4} # Load example data data(query.seurat) # Create a simple model model <- gating_model(name = "Tcell", signature = c("CD3D", "CD3E", "CD2")) # Apply scGate (this computes UCell scores internally) query.seurat <- scGate(query.seurat, model = model, reduction = "pca") # Visualize UCell scores p1 <- FeaturePlot(query.seurat, features = "Tcell_UCell", cols = c("gray90", "darkblue")) + ggtitle("T cell Signature Score (UCell)") p2 <- DimPlot(query.seurat, group.by = "is.pure", cols = c("Pure" = "#00ae60", "Impure" = "gray80")) + ggtitle("scGate Classification") p1 + p2 ``` ### Step 2: kNN Smoothing Single-cell data is inherently sparse. scGate applies kNN smoothing to reduce noise and improve classification accuracy. #### Mathematical Formulation For each cell $c$, let $N_k(c)$ be its k-nearest neighbors in the reduced dimensional space. The smoothed score is: $$\tilde{U}_c(S) = \sum_{c' \in N_k(c)} w_{cc'} \cdot U_{c'}(S)$$ Where the weights $w_{cc'}$ are computed using exponential decay: $$w_{cc'} = (1 - \alpha)^{d(c, c')}$$ - $\alpha$ is the decay parameter (default: 0.1) - $d(c, c')$ is the rank of neighbor $c'$ (1 for nearest, k for furthest) #### Effect of Smoothing ```{r smoothing-effect, eval=FALSE, fig.width=10, fig.height=4} # Compare raw vs smoothed scores # The smoothed scores are already computed by scGate # Create histogram of scores score_data <- data.frame( score = query.seurat$Tcell_UCell, classification = query.seurat$is.pure ) p1 <- ggplot(score_data, aes(x = score, fill = classification)) + geom_histogram(bins = 30, alpha = 0.7, position = "identity") + scale_fill_manual(values = c("Pure" = "#00ae60", "Impure" = "gray60")) + labs(title = "Distribution of T cell Scores", x = "UCell Score (kNN-smoothed)", y = "Count") + theme_minimal() + theme(legend.position = "bottom") p2 <- ggplot(score_data, aes(x = classification, y = score, fill = classification)) + geom_violin(alpha = 0.7) + geom_boxplot(width = 0.2, fill = "white", alpha = 0.8) + scale_fill_manual(values = c("Pure" = "#00ae60", "Impure" = "gray60")) + labs(title = "Score Distribution by Class", x = "Classification", y = "UCell Score") + theme_minimal() + theme(legend.position = "none") p1 + p2 ``` ### Step 3: Hierarchical Decision Trees scGate models can have multiple levels, forming a hierarchical decision tree similar to flow cytometry gating strategies. #### Gating Logic At each level $l$, cells are classified based on: 1. **Positive signatures** $S^+$: Must exceed threshold $\theta^+$ 2. **Negative signatures** $S^-$: Must be below threshold $\theta^-$ A cell passes level $l$ if: $$\max_{s \in S^+} \tilde{U}_c(s) > \theta^+ \quad \text{AND} \quad \max_{s \in S^-} \tilde{U}_c(s) < \theta^-$$ #### Parameter Decay For multi-level models, parameters decay at each level to handle progressively smaller cell populations: $$k_l = k_0 \cdot (1 - \delta)^{l-1}$$ $$n_l = n_0 \cdot (1 - \delta)^{l-1}$$ Where: - $k_0$ is the initial number of neighbors - $n_0$ is the initial number of features - $\delta$ is the decay parameter (default: 0.25) ```{r hierarchical-demo, fig.width=10, fig.height=4} # Create a hierarchical model hierarchical_model <- gating_model( level = 1, name = "Immune", signature = c("PTPRC") # CD45 ) hierarchical_model <- gating_model( model = hierarchical_model, level = 2, name = "Tcell", signature = c("CD3D", "CD3E") ) # View model structure print(hierarchical_model) ``` ## Performance Metrics scGate provides functions to evaluate classification performance. ### Matthews Correlation Coefficient (MCC) MCC is a balanced measure that works well even with imbalanced classes: $$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$ Where: - TP = True Positives - TN = True Negatives - FP = False Positives - FN = False Negatives ### Numerical Stability scGate implements numerically stable calculations: ```{r performance-demo} # Simulate classification results set.seed(42) actual <- sample(c(0, 1), 100, replace = TRUE, prob = c(0.7, 0.3)) predicted <- actual # Add some noise predicted[sample(which(actual == 1), 5)] <- 0 predicted[sample(which(actual == 0), 3)] <- 1 # Calculate performance metrics metrics <- performance.metrics(actual, predicted) print(metrics) ``` ## Computational Considerations ### Vectorized Operations scGate uses vectorized operations for efficiency: ```{r vectorization} # Example: rowSums is much faster than apply n <- 1000 m <- 100 mat <- matrix(runif(n * m), nrow = n) # Timing comparison (conceptual) # apply(mat, 1, sum) # Slower # rowSums(mat) # Much faster - used by scGate ``` ### Parallel Processing For multi-model classification, scGate supports parallel processing: ```{r parallel, eval=FALSE} library(BiocParallel) # Use multiple cores result <- scGate( data = seurat_obj, model = model_list, ncores = 4 # Use 4 cores ) # Or specify custom BPPARAM result <- scGate( data = seurat_obj, model = model_list, BPPARAM = MulticoreParam(workers = 4) ) ``` ## Summary The scGate algorithm combines three key components: | Component | Purpose | Key Parameter | |-----------|---------|---------------| | UCell scoring | Robust signature quantification | `maxRank` | | kNN smoothing | Noise reduction | `k.param`, `smooth.decay` | | Hierarchical gating | Multi-level classification | `param_decay` | This combination enables accurate, interpretable cell type purification without requiring reference datasets. ## References 1. Andreatta M, Berenstein AJ, Carmona SJ. scGate: marker-based purification of cell types from heterogeneous single-cell RNA-seq datasets. *Bioinformatics*. 2022;38(9):2642-2644. 2. Andreatta M, Carmona SJ. UCell: Robust and scalable single-cell gene signature scoring. *Computational and Structural Biotechnology Journal*. 2021;19:3796-3798. ## Session Info ```{r session-info} sessionInfo() ```