Algorithm and Mathematical Framework

Overview

scGate implements a marker-based cell type purification algorithm that combines:

  1. UCell scoring for robust signature quantification
  2. k-Nearest Neighbor (kNN) smoothing for noise reduction
  3. Hierarchical decision trees for multi-level gating

This document provides a detailed explanation of the mathematical framework behind scGate.

library(scGate)
library(Seurat)
library(ggplot2)
library(patchwork)

Algorithm Pipeline

Step 1: Signature Scoring with UCell

scGate uses the UCell algorithm for computing signature scores. UCell is a rank-based method that is robust to technical variation and batch effects.

Mathematical Formulation

For a cell \(c\) and a gene signature \(S = \{g_1, g_2, ..., g_n\}\):

  1. Rank genes by expression in cell \(c\): \(R_c(g)\) is the rank of gene \(g\)
  2. Compute UCell score:

\[U_c(S) = 1 - \frac{\sum_{g \in S} \min(R_c(g), R_{max})}{|S| \cdot R_{max}}\]

Where \(R_{max}\) is the maximum rank considered (default: 1500).

Key Properties

  • Range: UCell scores range from 0 to 1
  • Robustness: Rank-based scoring is robust to outliers
  • Interpretability: Higher scores indicate stronger signature expression
# Load example data
data(query.seurat)

# Create a simple model
model <- gating_model(name = "Tcell", signature = c("CD3D", "CD3E", "CD2"))

# Apply scGate (this computes UCell scores internally)
query.seurat <- scGate(query.seurat, model = model, reduction = "pca")

# Visualize UCell scores
p1 <- FeaturePlot(query.seurat, features = "Tcell_UCell", 
                  cols = c("gray90", "darkblue")) +
  ggtitle("T cell Signature Score (UCell)")

p2 <- DimPlot(query.seurat, group.by = "is.pure",
              cols = c("Pure" = "#00ae60", "Impure" = "gray80")) +
  ggtitle("scGate Classification")

p1 + p2

Step 2: kNN Smoothing

Single-cell data is inherently sparse. scGate applies kNN smoothing to reduce noise and improve classification accuracy.

Mathematical Formulation

For each cell \(c\), let \(N_k(c)\) be its k-nearest neighbors in the reduced dimensional space. The smoothed score is:

\[\tilde{U}_c(S) = \sum_{c' \in N_k(c)} w_{cc'} \cdot U_{c'}(S)\]

Where the weights \(w_{cc'}\) are computed using exponential decay:

\[w_{cc'} = (1 - \alpha)^{d(c, c')}\]

  • \(\alpha\) is the decay parameter (default: 0.1)
  • \(d(c, c')\) is the rank of neighbor \(c'\) (1 for nearest, k for furthest)

Effect of Smoothing

# Compare raw vs smoothed scores
# The smoothed scores are already computed by scGate

# Create histogram of scores
score_data <- data.frame(
  score = query.seurat$Tcell_UCell,
  classification = query.seurat$is.pure
)

p1 <- ggplot(score_data, aes(x = score, fill = classification)) +
  geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
  scale_fill_manual(values = c("Pure" = "#00ae60", "Impure" = "gray60")) +
  labs(title = "Distribution of T cell Scores",
       x = "UCell Score (kNN-smoothed)",
       y = "Count") +
  theme_minimal() +
  theme(legend.position = "bottom")

p2 <- ggplot(score_data, aes(x = classification, y = score, fill = classification)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, fill = "white", alpha = 0.8) +
  scale_fill_manual(values = c("Pure" = "#00ae60", "Impure" = "gray60")) +
  labs(title = "Score Distribution by Class",
       x = "Classification",
       y = "UCell Score") +
  theme_minimal() +
  theme(legend.position = "none")

p1 + p2

Step 3: Hierarchical Decision Trees

scGate models can have multiple levels, forming a hierarchical decision tree similar to flow cytometry gating strategies.

Gating Logic

At each level \(l\), cells are classified based on:

  1. Positive signatures \(S^+\): Must exceed threshold \(\theta^+\)
  2. Negative signatures \(S^-\): Must be below threshold \(\theta^-\)

A cell passes level \(l\) if:

\[\max_{s \in S^+} \tilde{U}_c(s) > \theta^+ \quad \text{AND} \quad \max_{s \in S^-} \tilde{U}_c(s) < \theta^-\]

Parameter Decay

For multi-level models, parameters decay at each level to handle progressively smaller cell populations:

\[k_l = k_0 \cdot (1 - \delta)^{l-1}\] \[n_l = n_0 \cdot (1 - \delta)^{l-1}\]

Where: - \(k_0\) is the initial number of neighbors - \(n_0\) is the initial number of features - \(\delta\) is the decay parameter (default: 0.25)

# Create a hierarchical model
hierarchical_model <- gating_model(
  level = 1, 
  name = "Immune", 
  signature = c("PTPRC")  # CD45
)
hierarchical_model <- gating_model(
  model = hierarchical_model,
  level = 2,
  name = "Tcell",
  signature = c("CD3D", "CD3E")
)

# View model structure
print(hierarchical_model)
#>   levels   use_as   name signature
#> 1 level1 positive Immune     PTPRC
#> 2 level2 positive  Tcell CD3D;CD3E

Performance Metrics

scGate provides functions to evaluate classification performance.

Matthews Correlation Coefficient (MCC)

MCC is a balanced measure that works well even with imbalanced classes:

\[MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\]

Where: - TP = True Positives - TN = True Negatives
- FP = False Positives - FN = False Negatives

Numerical Stability

scGate implements numerically stable calculations:

# Simulate classification results
set.seed(42)
actual <- sample(c(0, 1), 100, replace = TRUE, prob = c(0.7, 0.3))
predicted <- actual
# Add some noise
predicted[sample(which(actual == 1), 5)] <- 0
predicted[sample(which(actual == 0), 3)] <- 1

# Calculate performance metrics
metrics <- performance.metrics(actual, predicted)
print(metrics)
#>      PREC       REC       MCC 
#> 0.9062500 0.8529412 0.8200066

Computational Considerations

Vectorized Operations

scGate uses vectorized operations for efficiency:

# Example: rowSums is much faster than apply
n <- 1000
m <- 100
mat <- matrix(runif(n * m), nrow = n)

# Timing comparison (conceptual)
# apply(mat, 1, sum)  # Slower
# rowSums(mat)        # Much faster - used by scGate

Parallel Processing

For multi-model classification, scGate supports parallel processing:

library(BiocParallel)

# Use multiple cores
result <- scGate(
  data = seurat_obj,
  model = model_list,
  ncores = 4  # Use 4 cores
)

# Or specify custom BPPARAM
result <- scGate(
  data = seurat_obj,
  model = model_list,
  BPPARAM = MulticoreParam(workers = 4)
)

Summary

The scGate algorithm combines three key components:

Component Purpose Key Parameter
UCell scoring Robust signature quantification maxRank
kNN smoothing Noise reduction k.param, smooth.decay
Hierarchical gating Multi-level classification param_decay

This combination enables accurate, interpretable cell type purification without requiring reference datasets.

References

  1. Andreatta M, Berenstein AJ, Carmona SJ. scGate: marker-based purification of cell types from heterogeneous single-cell RNA-seq datasets. Bioinformatics. 2022;38(9):2642-2644.

  2. Andreatta M, Carmona SJ. UCell: Robust and scalable single-cell gene signature scoring. Computational and Structural Biotechnology Journal. 2021;19:3796-3798.

Session Info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] patchwork_1.3.2    ggplot2_4.0.3      Seurat_5.5.0       SeuratObject_5.4.0
#> [5] sp_2.2-1           scGate_1.7.2       rmarkdown_2.31    
#> 
#> loaded via a namespace (and not attached):
#>   [1] deldir_2.0-4           pbapply_1.7-4          gridExtra_2.3         
#>   [4] rlang_1.2.0            magrittr_2.0.5         RcppAnnoy_0.0.23      
#>   [7] otel_0.2.0             spatstat.geom_3.8-1    matrixStats_1.5.0     
#>  [10] ggridges_0.5.7         compiler_4.6.0         png_0.1-9             
#>  [13] vctrs_0.7.3            reshape2_1.4.5         stringr_1.6.0         
#>  [16] pkgconfig_2.0.3        fastmap_1.2.0          promises_1.5.0        
#>  [19] purrr_1.2.2            xfun_0.57              cachem_1.1.0          
#>  [22] jsonlite_2.0.0         goftest_1.2-3          later_1.4.8           
#>  [25] BiocParallel_1.47.0    spatstat.utils_3.2-3   irlba_2.3.7           
#>  [28] parallel_4.6.0         cluster_2.1.8.2        R6_2.6.1              
#>  [31] ica_1.0-3              spatstat.data_3.1-9    bslib_0.11.0          
#>  [34] stringi_1.8.7          RColorBrewer_1.1-3     reticulate_1.46.0     
#>  [37] spatstat.univar_3.2-0  parallelly_1.47.0      lmtest_0.9-40         
#>  [40] jquerylib_0.1.4        scattermore_1.2        Rcpp_1.1.1-1.1        
#>  [43] knitr_1.51             tensor_1.5.1           future.apply_1.20.2   
#>  [46] zoo_1.8-15             sctransform_0.4.3      httpuv_1.6.17         
#>  [49] Matrix_1.7-5           splines_4.6.0          igraph_2.3.1          
#>  [52] tidyselect_1.2.1       abind_1.4-8            yaml_2.3.12           
#>  [55] spatstat.random_3.5-0  codetools_0.2-20       miniUI_0.1.2          
#>  [58] spatstat.explore_3.8-1 listenv_0.10.1         lattice_0.22-9        
#>  [61] tibble_3.3.1           plyr_1.8.9             withr_3.0.2           
#>  [64] shiny_1.13.0           S7_0.2.2               ROCR_1.0-12           
#>  [67] evaluate_1.0.5         Rtsne_0.17             future_1.70.0         
#>  [70] fastDummies_1.7.6      survival_3.8-6         polyclip_1.10-7       
#>  [73] fitdistrplus_1.2-6     pillar_1.11.1          KernSmooth_2.23-26    
#>  [76] plotly_4.12.0          generics_0.1.4         RcppHNSW_0.6.0        
#>  [79] scales_1.4.0           globals_0.19.1         xtable_1.8-8          
#>  [82] glue_1.8.1             lazyeval_0.2.3         maketools_1.3.2       
#>  [85] tools_4.6.0            BiocNeighbors_2.7.2    sys_3.4.3             
#>  [88] data.table_1.18.4      RSpectra_0.16-2        RANN_2.6.2            
#>  [91] buildtools_1.0.0       dotCall64_1.2          cowplot_1.2.0         
#>  [94] grid_4.6.0             tidyr_1.3.2            colorspace_2.1-2      
#>  [97] nlme_3.1-169           cli_3.6.6              spatstat.sparse_3.2-0 
#> [100] spam_2.11-3            viridisLite_0.4.3      dplyr_1.2.1           
#> [103] uwot_0.2.4             gtable_0.3.6           sass_0.4.10           
#> [106] digest_0.6.39          progressr_0.19.0       ggrepel_0.9.8         
#> [109] htmlwidgets_1.6.4      farver_2.1.2           htmltools_0.5.9       
#> [112] lifecycle_1.0.5        httr_1.4.8             mime_0.13             
#> [115] MASS_7.3-65