--- title: "SCEVAN Algorithm and Methodology" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 fig_caption: true vignette: > %\VignetteIndexEntry{SCEVAN Algorithm and Methodology} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, collapse = TRUE, comment = "#>", fig.width = 10, fig.height = 6, out.width = "100%" ) ``` ## Overview SCEVAN implements a **variational segmentation algorithm** combined with a **hierarchical hidden Markov model** to infer copy number alterations from single-cell RNA sequencing data. This document provides a comprehensive description of the methodology. ## Algorithmic Framework ### 1. Data Preprocessing Pipeline The preprocessing pipeline consists of several key steps: **Pipeline Steps:** 1. Raw Count Matrix → Quality Filtering (cells >200 genes) 2. Gene Annotation (Genomic coordinates) 3. Cell Cycle Removal 4. Freeman-Tukey Transformation 5. Confident Normal Cell Detection 6. Baseline Subtraction 7. VegaMC Segmentation 8. EM-based CN Calling 9. Subclone Detection #### 1.1 Quality Control Cells are filtered based on: - Minimum 200 expressed genes per cell - Genes expressed in at least 1% of cells - Minimum genes per chromosome (default: 5) ```{r qc-formula, echo=FALSE, results='asis'} cat("**Filtering criteria:**\n\n") cat("$$\\text{Keep cell } c \\text{ if: } \\sum_{g=1}^{G} \\mathbb{1}(x_{gc} > 0) \\geq 200$$\n\n") cat("$$\\text{Keep gene } g \\text{ if: } \\frac{\\sum_{c=1}^{C} \\mathbb{1}(x_{gc} > 0)}{C} \\geq 0.01$$\n") ``` #### 1.2 Variance Stabilization The **Freeman-Tukey transformation** is applied to stabilize variance: $$y_{gc} = \\log\\left(\\sqrt{x_{gc}} + \\sqrt{x_{gc} + 1}\\right)$$ where $x_{gc}$ is the raw count for gene $g$ in cell $c$. This transformation: - Approximates a variance-stabilizing transformation for Poisson data - Handles zero counts gracefully - Produces approximately homoscedastic data ### 2. Confident Normal Cell Detection SCEVAN automatically identifies normal (non-malignant) cells using gene set enrichment analysis. #### 2.1 Single-Sample Gene Set Testing The **Mann-Whitney-Wilcoxon Gene Set Test** (MWW-GST) evaluates enrichment of cell-type specific signatures: $$\\text{NES}_s = \\frac{\\bar{R}_s - \\mu_0}{\\sigma_0}$$ where: - $\\bar{R}_s$ is the mean rank of signature genes - $\\mu_0 = (n+1)/2$ is the expected mean under null - $\\sigma_0$ is the standard deviation under null #### 2.2 Cell Type Signatures Built-in signatures include: | Cell Type | Description | |-----------|-------------| | T cells | CD3D, CD3E, CD4, CD8A markers | | B cells | CD19, CD79A, MS4A1 markers | | Myeloid | CD14, CD68, ITGAM markers | | Endothelial | PECAM1, VWF, CDH5 markers | | Fibroblasts | COL1A1, DCN, FAP markers | ### 3. VegaMC Segmentation Algorithm The core segmentation uses **VegaMC** (Variational Estimator of Genomic Alterations using Multichannel), a greedy algorithm optimizing the objective: $$\\mathcal{L}(\\mathbf{s}) = \\sum_{i=1}^{K} \\sum_{c=1}^{C} \\left( y_{ic} - \\mu_{s_i,c} \\right)^2 + \\beta K$$ where: - $\\mathbf{s}$ is the segmentation - $K$ is the number of segments - $\\beta$ controls segmentation granularity - $\\mu_{s_i,c}$ is the mean expression in segment $s_i$ for cell $c$ #### 3.1 Greedy Optimization The algorithm uses a **priority queue** to efficiently merge adjacent segments: **Algorithm: VegaMC Segmentation** ``` Input: Expression matrix Y, parameter β Output: Segmentation S 1. Initialize: Each probe is a segment 2. Build priority queue Q with merge costs 3. While min(Q) < β: a. Pop minimum cost merge (i, j) b. Merge segments i and j c. Update affected merge costs in Q 4. Return final segmentation S ``` #### 3.2 Statistical Testing For each segment, hypothesis tests determine significance: - **Loss test**: $H_0: \\mu = 0$ vs $H_1: \\mu < 0$ - **Gain test**: $H_0: \\mu = 0$ vs $H_1: \\mu > 0$ P-values are computed using t-tests across cells. ### 4. Copy Number State Calling An **Expectation-Maximization (EM)** algorithm assigns copy number states. #### 4.1 Gaussian Mixture Model Segment means are modeled as a mixture of 5 Gaussian components: $$p(x) = \\sum_{k=0}^{4} \\pi_k \\mathcal{N}(x | \\mu_k, \\sigma_k^2)$$ | State | CN | Description | |-------|---|-------------| | 0 | Homozygous deletion | Deep loss | | 1 | Hemizygous deletion | Single copy loss | | 2 | Diploid | Normal | | 3 | Single copy gain | Amplification | | 4 | High amplification | Multi-copy gain | #### 4.2 EM Algorithm **E-step**: Compute posterior probabilities $$\\gamma_{ik} = \\frac{\\pi_k \\mathcal{N}(x_i | \\mu_k, \\sigma_k^2)}{\\sum_{j=0}^{4} \\pi_j \\mathcal{N}(x_i | \\mu_j, \\sigma_j^2)}$$ **M-step**: Update parameters $$\\mu_k^{new} = \\frac{\\sum_i \\gamma_{ik} x_i}{\\sum_i \\gamma_{ik}}$$ $$\\sigma_k^{2,new} = \\frac{\\sum_i \\gamma_{ik} (x_i - \\mu_k^{new})^2}{\\sum_i \\gamma_{ik}}$$ ### 5. Subclone Detection #### 5.1 Graph-Based Clustering Subclones are identified using **Louvain community detection** on a shared nearest neighbor (SNN) graph: 1. Construct k-NN graph on CNA profiles 2. Build SNN graph 3. Apply Louvain algorithm 4. Filter small clusters (< 2% of cells) #### 5.2 Modularity Optimization The Louvain algorithm maximizes modularity: $$Q = \\frac{1}{2m} \\sum_{ij} \\left[ A_{ij} - \\frac{k_i k_j}{2m} \\right] \\delta(c_i, c_j)$$ where: - $A_{ij}$ is the adjacency matrix - $k_i$ is the degree of node $i$ - $m$ is total edges - $\\delta(c_i, c_j) = 1$ if nodes are in same community ### 6. Differential Analysis #### 6.1 Subclone-Specific Alterations Alterations are classified as: - **Clonal**: Present in all subclones - **Shared**: Present in multiple (not all) subclones - **Subclone-specific**: Present in single subclone #### 6.2 Pathway Enrichment **fGSEA** (fast Gene Set Enrichment Analysis) identifies enriched pathways: $$ES = \\max_{j=1}^{n} \\left| \\sum_{i=1}^{j} \\frac{|r_i|^p}{N_R} - \\frac{j - \\text{hit}_j}{N - N_H} \\right|$$ ## Performance Characteristics ### Benchmarking Results Based on analysis of 106 samples (93,332 cells): | Metric | SCEVAN | InferCNV | CopyKAT | |--------|--------|----------|---------| | F1 Score | **0.90** | 0.63 | 0.65 | | Precision | 0.89 | 0.60 | 0.62 | | Recall | 0.91 | 0.66 | 0.68 | ### Computational Complexity - **Time**: O(n × g × log(g)) where n = cells, g = genes - **Space**: O(n × g) for count matrix storage - **Parallelization**: Linear speedup with cores ## Parameter Guidelines | Parameter | Default | Description | Tuning Advice | |-----------|---------|-------------|---------------| | `beta_vega` | 0.5 | Segmentation granularity | Increase for noisy data | | `par_cores` | 20 | Parallel cores | Set to available CPUs | | `ngenes_chr` | 5 | Min genes per chromosome | Lower for sparse data | | `perc_genes` | 0.1 | Gene expression threshold | Lower for low-coverage | ## References 1. De Falco, A., et al. (2023). A variational algorithm to detect the clonal copy number substructure of tumors from scRNA-seq data. *Nature Communications*, 14, 1074. 2. Mardis, E.R., & Wilson, R.K. (2009). Cancer genome sequencing: a review. *Human Molecular Genetics*, 18(R2), R163-R168. 3. Blondel, V.D., et al. (2008). Fast unfolding of communities in large networks. *Journal of Statistical Mechanics*, P10008. ## Session Info ```{r session-info} sessionInfo() ```