scGate implements a marker-based cell type purification algorithm that combines:
This document provides a detailed explanation of the mathematical framework behind scGate.
scGate uses the UCell algorithm for computing signature scores. UCell is a rank-based method that is robust to technical variation and batch effects.
For a cell \(c\) and a gene signature \(S = \{g_1, g_2, ..., g_n\}\):
\[U_c(S) = 1 - \frac{\sum_{g \in S} \min(R_c(g), R_{max})}{|S| \cdot R_{max}}\]
Where \(R_{max}\) is the maximum rank considered (default: 1500).
# Load example data
data(query.seurat)
# Create a simple model
model <- gating_model(name = "Tcell", signature = c("CD3D", "CD3E", "CD2"))
# Apply scGate (this computes UCell scores internally)
query.seurat <- scGate(query.seurat, model = model, reduction = "pca")
# Visualize UCell scores
p1 <- FeaturePlot(query.seurat, features = "Tcell_UCell",
cols = c("gray90", "darkblue")) +
ggtitle("T cell Signature Score (UCell)")
p2 <- DimPlot(query.seurat, group.by = "is.pure",
cols = c("Pure" = "#00ae60", "Impure" = "gray80")) +
ggtitle("scGate Classification")
p1 + p2Single-cell data is inherently sparse. scGate applies kNN smoothing to reduce noise and improve classification accuracy.
For each cell \(c\), let \(N_k(c)\) be its k-nearest neighbors in the reduced dimensional space. The smoothed score is:
\[\tilde{U}_c(S) = \sum_{c' \in N_k(c)} w_{cc'} \cdot U_{c'}(S)\]
Where the weights \(w_{cc'}\) are computed using exponential decay:
\[w_{cc'} = (1 - \alpha)^{d(c, c')}\]
# Compare raw vs smoothed scores
# The smoothed scores are already computed by scGate
# Create histogram of scores
score_data <- data.frame(
score = query.seurat$Tcell_UCell,
classification = query.seurat$is.pure
)
p1 <- ggplot(score_data, aes(x = score, fill = classification)) +
geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
scale_fill_manual(values = c("Pure" = "#00ae60", "Impure" = "gray60")) +
labs(title = "Distribution of T cell Scores",
x = "UCell Score (kNN-smoothed)",
y = "Count") +
theme_minimal() +
theme(legend.position = "bottom")
p2 <- ggplot(score_data, aes(x = classification, y = score, fill = classification)) +
geom_violin(alpha = 0.7) +
geom_boxplot(width = 0.2, fill = "white", alpha = 0.8) +
scale_fill_manual(values = c("Pure" = "#00ae60", "Impure" = "gray60")) +
labs(title = "Score Distribution by Class",
x = "Classification",
y = "UCell Score") +
theme_minimal() +
theme(legend.position = "none")
p1 + p2scGate models can have multiple levels, forming a hierarchical decision tree similar to flow cytometry gating strategies.
At each level \(l\), cells are classified based on:
A cell passes level \(l\) if:
\[\max_{s \in S^+} \tilde{U}_c(s) > \theta^+ \quad \text{AND} \quad \max_{s \in S^-} \tilde{U}_c(s) < \theta^-\]
For multi-level models, parameters decay at each level to handle progressively smaller cell populations:
\[k_l = k_0 \cdot (1 - \delta)^{l-1}\] \[n_l = n_0 \cdot (1 - \delta)^{l-1}\]
Where: - \(k_0\) is the initial number of neighbors - \(n_0\) is the initial number of features - \(\delta\) is the decay parameter (default: 0.25)
# Create a hierarchical model
hierarchical_model <- gating_model(
level = 1,
name = "Immune",
signature = c("PTPRC") # CD45
)
hierarchical_model <- gating_model(
model = hierarchical_model,
level = 2,
name = "Tcell",
signature = c("CD3D", "CD3E")
)
# View model structure
print(hierarchical_model)
#> levels use_as name signature
#> 1 level1 positive Immune PTPRC
#> 2 level2 positive Tcell CD3D;CD3EscGate provides functions to evaluate classification performance.
MCC is a balanced measure that works well even with imbalanced classes:
\[MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\]
Where: - TP = True Positives - TN = True Negatives
- FP = False Positives - FN = False Negatives
scGate implements numerically stable calculations:
# Simulate classification results
set.seed(42)
actual <- sample(c(0, 1), 100, replace = TRUE, prob = c(0.7, 0.3))
predicted <- actual
# Add some noise
predicted[sample(which(actual == 1), 5)] <- 0
predicted[sample(which(actual == 0), 3)] <- 1
# Calculate performance metrics
metrics <- performance.metrics(actual, predicted)
print(metrics)
#> PREC REC MCC
#> 0.9062500 0.8529412 0.8200066scGate uses vectorized operations for efficiency:
For multi-model classification, scGate supports parallel processing:
The scGate algorithm combines three key components:
| Component | Purpose | Key Parameter |
|---|---|---|
| UCell scoring | Robust signature quantification | maxRank |
| kNN smoothing | Noise reduction | k.param, smooth.decay |
| Hierarchical gating | Multi-level classification | param_decay |
This combination enables accurate, interpretable cell type purification without requiring reference datasets.
Andreatta M, Berenstein AJ, Carmona SJ. scGate: marker-based purification of cell types from heterogeneous single-cell RNA-seq datasets. Bioinformatics. 2022;38(9):2642-2644.
Andreatta M, Carmona SJ. UCell: Robust and scalable single-cell gene signature scoring. Computational and Structural Biotechnology Journal. 2021;19:3796-3798.
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] patchwork_1.3.2 ggplot2_4.0.3 Seurat_5.5.0 SeuratObject_5.4.0
#> [5] sp_2.2-1 scGate_1.7.2 rmarkdown_2.31
#>
#> loaded via a namespace (and not attached):
#> [1] deldir_2.0-4 pbapply_1.7-4 gridExtra_2.3
#> [4] rlang_1.2.0 magrittr_2.0.5 RcppAnnoy_0.0.23
#> [7] otel_0.2.0 spatstat.geom_3.8-1 matrixStats_1.5.0
#> [10] ggridges_0.5.7 compiler_4.6.0 png_0.1-9
#> [13] vctrs_0.7.3 reshape2_1.4.5 stringr_1.6.0
#> [16] pkgconfig_2.0.3 fastmap_1.2.0 promises_1.5.0
#> [19] purrr_1.2.2 xfun_0.57 cachem_1.1.0
#> [22] jsonlite_2.0.0 goftest_1.2-3 later_1.4.8
#> [25] BiocParallel_1.47.0 spatstat.utils_3.2-3 irlba_2.3.7
#> [28] parallel_4.6.0 cluster_2.1.8.2 R6_2.6.1
#> [31] ica_1.0-3 spatstat.data_3.1-9 bslib_0.11.0
#> [34] stringi_1.8.7 RColorBrewer_1.1-3 reticulate_1.46.0
#> [37] spatstat.univar_3.2-0 parallelly_1.47.0 lmtest_0.9-40
#> [40] jquerylib_0.1.4 scattermore_1.2 Rcpp_1.1.1-1.1
#> [43] knitr_1.51 tensor_1.5.1 future.apply_1.20.2
#> [46] zoo_1.8-15 sctransform_0.4.3 httpuv_1.6.17
#> [49] Matrix_1.7-5 splines_4.6.0 igraph_2.3.1
#> [52] tidyselect_1.2.1 abind_1.4-8 yaml_2.3.12
#> [55] spatstat.random_3.5-0 codetools_0.2-20 miniUI_0.1.2
#> [58] spatstat.explore_3.8-1 listenv_0.10.1 lattice_0.22-9
#> [61] tibble_3.3.1 plyr_1.8.9 withr_3.0.2
#> [64] shiny_1.13.0 S7_0.2.2 ROCR_1.0-12
#> [67] evaluate_1.0.5 Rtsne_0.17 future_1.70.0
#> [70] fastDummies_1.7.6 survival_3.8-6 polyclip_1.10-7
#> [73] fitdistrplus_1.2-6 pillar_1.11.1 KernSmooth_2.23-26
#> [76] plotly_4.12.0 generics_0.1.4 RcppHNSW_0.6.0
#> [79] scales_1.4.0 globals_0.19.1 xtable_1.8-8
#> [82] glue_1.8.1 lazyeval_0.2.3 maketools_1.3.2
#> [85] tools_4.6.0 BiocNeighbors_2.7.2 sys_3.4.3
#> [88] data.table_1.18.4 RSpectra_0.16-2 RANN_2.6.2
#> [91] buildtools_1.0.0 dotCall64_1.2 cowplot_1.2.0
#> [94] grid_4.6.0 tidyr_1.3.2 colorspace_2.1-2
#> [97] nlme_3.1-169 cli_3.6.6 spatstat.sparse_3.2-0
#> [100] spam_2.11-3 viridisLite_0.4.3 dplyr_1.2.1
#> [103] uwot_0.2.4 gtable_0.3.6 sass_0.4.10
#> [106] digest_0.6.39 progressr_0.19.0 ggrepel_0.9.8
#> [109] htmlwidgets_1.6.4 farver_2.1.2 htmltools_0.5.9
#> [112] lifecycle_1.0.5 httr_1.4.8 mime_0.13
#> [115] MASS_7.3-65