---
title: "SCEVAN Algorithm and Methodology"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_caption: true
vignette: >
  %\VignetteIndexEntry{SCEVAN Algorithm and Methodology}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "#>",
  fig.width = 10,
  fig.height = 6,
  out.width = "100%"
)
```

## Overview

SCEVAN implements a **variational segmentation algorithm** combined with a **hierarchical hidden Markov model** to infer copy number alterations from single-cell RNA sequencing data. This document provides a comprehensive description of the methodology.

## Algorithmic Framework

### 1. Data Preprocessing Pipeline

The preprocessing pipeline consists of several key steps:

**Pipeline Steps:**

1. Raw Count Matrix → Quality Filtering (cells >200 genes)
2. Gene Annotation (Genomic coordinates)
3. Cell Cycle Removal
4. Freeman-Tukey Transformation
5. Confident Normal Cell Detection
6. Baseline Subtraction
7. VegaMC Segmentation
8. EM-based CN Calling
9. Subclone Detection

#### 1.1 Quality Control

Cells are filtered based on:

- Minimum 200 expressed genes per cell
- Genes expressed in at least 1% of cells
- Minimum genes per chromosome (default: 5)

```{r qc-formula, echo=FALSE, results='asis'}
cat("**Filtering criteria:**\n\n")
cat("$$\\text{Keep cell } c \\text{ if: } \\sum_{g=1}^{G} \\mathbb{1}(x_{gc} > 0) \\geq 200$$\n\n")
cat("$$\\text{Keep gene } g \\text{ if: } \\frac{\\sum_{c=1}^{C} \\mathbb{1}(x_{gc} > 0)}{C} \\geq 0.01$$\n")
```

#### 1.2 Variance Stabilization

The **Freeman-Tukey transformation** is applied to stabilize variance:

$$y_{gc} = \\log\\left(\\sqrt{x_{gc}} + \\sqrt{x_{gc} + 1}\\right)$$

where $x_{gc}$ is the raw count for gene $g$ in cell $c$.

This transformation:

- Approximates a variance-stabilizing transformation for Poisson data
- Handles zero counts gracefully
- Produces approximately homoscedastic data

### 2. Confident Normal Cell Detection

SCEVAN automatically identifies normal (non-malignant) cells using gene set enrichment analysis.

#### 2.1 Single-Sample Gene Set Testing

The **Mann-Whitney-Wilcoxon Gene Set Test** (MWW-GST) evaluates enrichment of cell-type specific signatures:

$$\\text{NES}_s = \\frac{\\bar{R}_s - \\mu_0}{\\sigma_0}$$

where:

- $\\bar{R}_s$ is the mean rank of signature genes
- $\\mu_0 = (n+1)/2$ is the expected mean under null
- $\\sigma_0$ is the standard deviation under null

#### 2.2 Cell Type Signatures

Built-in signatures include:

| Cell Type | Description |
|-----------|-------------|
| T cells | CD3D, CD3E, CD4, CD8A markers |
| B cells | CD19, CD79A, MS4A1 markers |
| Myeloid | CD14, CD68, ITGAM markers |
| Endothelial | PECAM1, VWF, CDH5 markers |
| Fibroblasts | COL1A1, DCN, FAP markers |

### 3. VegaMC Segmentation Algorithm

The core segmentation uses **VegaMC** (Variational Estimator of Genomic Alterations using Multichannel), a greedy algorithm optimizing the objective:

$$\\mathcal{L}(\\mathbf{s}) = \\sum_{i=1}^{K} \\sum_{c=1}^{C} \\left( y_{ic} - \\mu_{s_i,c} \\right)^2 + \\beta K$$

where:

- $\\mathbf{s}$ is the segmentation
- $K$ is the number of segments
- $\\beta$ controls segmentation granularity
- $\\mu_{s_i,c}$ is the mean expression in segment $s_i$ for cell $c$

#### 3.1 Greedy Optimization

The algorithm uses a **priority queue** to efficiently merge adjacent segments:

**Algorithm: VegaMC Segmentation**

```
Input: Expression matrix Y, parameter β
Output: Segmentation S

1. Initialize: Each probe is a segment
2. Build priority queue Q with merge costs
3. While min(Q) < β:
   a. Pop minimum cost merge (i, j)
   b. Merge segments i and j
   c. Update affected merge costs in Q
4. Return final segmentation S
```

#### 3.2 Statistical Testing

For each segment, hypothesis tests determine significance:

- **Loss test**: $H_0: \\mu = 0$ vs $H_1: \\mu < 0$
- **Gain test**: $H_0: \\mu = 0$ vs $H_1: \\mu > 0$

P-values are computed using t-tests across cells.

### 4. Copy Number State Calling

An **Expectation-Maximization (EM)** algorithm assigns copy number states.

#### 4.1 Gaussian Mixture Model

Segment means are modeled as a mixture of 5 Gaussian components:

$$p(x) = \\sum_{k=0}^{4} \\pi_k \\mathcal{N}(x | \\mu_k, \\sigma_k^2)$$

| State | CN | Description |
|-------|---|-------------|
| 0 | Homozygous deletion | Deep loss |
| 1 | Hemizygous deletion | Single copy loss |
| 2 | Diploid | Normal |
| 3 | Single copy gain | Amplification |
| 4 | High amplification | Multi-copy gain |

#### 4.2 EM Algorithm

**E-step**: Compute posterior probabilities

$$\\gamma_{ik} = \\frac{\\pi_k \\mathcal{N}(x_i | \\mu_k, \\sigma_k^2)}{\\sum_{j=0}^{4} \\pi_j \\mathcal{N}(x_i | \\mu_j, \\sigma_j^2)}$$

**M-step**: Update parameters

$$\\mu_k^{new} = \\frac{\\sum_i \\gamma_{ik} x_i}{\\sum_i \\gamma_{ik}}$$

$$\\sigma_k^{2,new} = \\frac{\\sum_i \\gamma_{ik} (x_i - \\mu_k^{new})^2}{\\sum_i \\gamma_{ik}}$$

### 5. Subclone Detection

#### 5.1 Graph-Based Clustering

Subclones are identified using **Louvain community detection** on a shared nearest neighbor (SNN) graph:

1. Construct k-NN graph on CNA profiles
2. Build SNN graph
3. Apply Louvain algorithm
4. Filter small clusters (< 2% of cells)

#### 5.2 Modularity Optimization

The Louvain algorithm maximizes modularity:

$$Q = \\frac{1}{2m} \\sum_{ij} \\left[ A_{ij} - \\frac{k_i k_j}{2m} \\right] \\delta(c_i, c_j)$$

where:

- $A_{ij}$ is the adjacency matrix
- $k_i$ is the degree of node $i$
- $m$ is total edges
- $\\delta(c_i, c_j) = 1$ if nodes are in same community

### 6. Differential Analysis

#### 6.1 Subclone-Specific Alterations

Alterations are classified as:

- **Clonal**: Present in all subclones
- **Shared**: Present in multiple (not all) subclones  
- **Subclone-specific**: Present in single subclone

#### 6.2 Pathway Enrichment

**fGSEA** (fast Gene Set Enrichment Analysis) identifies enriched pathways:

$$ES = \\max_{j=1}^{n} \\left| \\sum_{i=1}^{j} \\frac{|r_i|^p}{N_R} - \\frac{j - \\text{hit}_j}{N - N_H} \\right|$$

## Performance Characteristics

### Benchmarking Results

Based on analysis of 106 samples (93,332 cells):

| Metric | SCEVAN | InferCNV | CopyKAT |
|--------|--------|----------|---------|
| F1 Score | **0.90** | 0.63 | 0.65 |
| Precision | 0.89 | 0.60 | 0.62 |
| Recall | 0.91 | 0.66 | 0.68 |

### Computational Complexity

- **Time**: O(n × g × log(g)) where n = cells, g = genes
- **Space**: O(n × g) for count matrix storage
- **Parallelization**: Linear speedup with cores

## Parameter Guidelines

| Parameter | Default | Description | Tuning Advice |
|-----------|---------|-------------|---------------|
| `beta_vega` | 0.5 | Segmentation granularity | Increase for noisy data |
| `par_cores` | 20 | Parallel cores | Set to available CPUs |
| `ngenes_chr` | 5 | Min genes per chromosome | Lower for sparse data |
| `perc_genes` | 0.1 | Gene expression threshold | Lower for low-coverage |

## References

1. De Falco, A., et al. (2023). A variational algorithm to detect the clonal copy number substructure of tumors from scRNA-seq data. *Nature Communications*, 14, 1074.

2. Mardis, E.R., & Wilson, R.K. (2009). Cancer genome sequencing: a review. *Human Molecular Genetics*, 18(R2), R163-R168.

3. Blondel, V.D., et al. (2008). Fast unfolding of communities in large networks. *Journal of Statistical Mechanics*, P10008.

## Session Info

```{r session-info}
sessionInfo()
```