--- title: "Algorithm and Methodology" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Algorithm and Methodology} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 6, eval = FALSE ) ``` ## Theoretical Background ### Copy Number Variations in Cancer Copy Number Variations (CNVs) are structural alterations in the genome where segments of DNA are duplicated (amplifications) or deleted (losses). In cancer: - **Amplifications** often harbor oncogenes (e.g., *MYC*, *ERBB2*) - **Deletions** frequently affect tumor suppressor genes (e.g., *TP53*, *CDKN2A*) - CNV patterns define **tumor subclones** with distinct evolutionary trajectories ### Expression-Based CNV Inference The central premise of expression-based CNV detection is: $$\text{Gene Expression} \propto \text{Gene Copy Number}$$ Genes in amplified regions tend to show elevated expression, while genes in deleted regions show reduced expression. By analyzing expression patterns across genomically ordered genes, we can infer underlying CNVs. ## Algorithm Pipeline The fastCNV algorithm consists of six main steps: ```{r pipeline-diagram, echo=FALSE, eval=TRUE, out.width="100%", fig.cap="Overview of the fastCNV analysis pipeline."} knitr::include_graphics("figures/pipeline_diagram.png") ``` **Pipeline Steps:** 1. **Gene Ordering**: Sort genes by chromosome and genomic position 2. **Expression Smoothing**: Apply sliding window across ordered genes 3. **Reference Normalization**: Center scores using reference (normal) cells 4. **Score Computation**: Calculate CNV scores per window 5. **Thresholding**: Apply quantile-based noise filtering 6. **Clustering**: Identify CNV-based subpopulations ### Step 1: Gene Ordering Genes are ordered by their genomic coordinates: 1. **Chromosome ordering**: 1, 2, ..., 22, X 2. **Position ordering**: By transcription start site (TSS) ```{r gene-ordering} # Gene metadata contains genomic coordinates data("geneMetadata", package = "fastCNV") head(geneMetadata) ``` This ordering ensures that adjacent genes in our analysis are also adjacent in the genome. ### Step 2: Sliding Window Smoothing Raw gene expression is noisy. We apply a sliding window approach: $$S_w = \frac{1}{|W|} \sum_{g \in W} E_g$$ Where: - $S_w$ = smoothed score for window $w$ - $W$ = set of genes in the window - $E_g$ = normalized expression of gene $g$ - $|W|$ = window size (default: 150 genes) **Why smoothing?** - Reduces single-gene noise - Captures regional expression patterns - Mimics the resolution of array-based CNV detection ```{r window-params} # Window size affects resolution vs. noise trade-off # Smaller windows → higher resolution, more noise # Larger windows → smoother profiles, lower resolution result <- fastCNV( seuratObj = seurat_obj, sampleName = "Sample1", referenceVar = "cell_type", referenceLabel = c("Normal"), windowSize = 150 # Default: 150 genes per window ) ``` ### Step 3: Reference Normalization The key innovation in expression-based CNV detection is **reference normalization**: $$CNV_{cell} = S_{cell} - \bar{S}_{reference}$$ Reference cells (typically non-malignant cells like fibroblasts, immune cells) provide the baseline expression expected without CNVs. **Important considerations:** - Reference cells should be **diploid** (normal copy number) - Multiple reference cell types improve robustness - For tumors, use adjacent normal tissue or known normal populations ```{r reference-selection} # Good reference cell types: # - Fibroblasts # - Endothelial cells # - Immune cells (T cells, B cells, macrophages) # - Normal epithelial cells (if available) result <- fastCNV( seuratObj = seurat_obj, sampleName = "Sample1", referenceVar = "cell_type", referenceLabel = c("Fibroblast", "T_cell", "Endothelial") ) ``` ### Step 4: Score Computation For each cell and each genomic window, we compute: $$CNV\_score_{c,w} = \bar{E}_{c,w} - \bar{E}_{ref,w}$$ Where: - $\bar{E}_{c,w}$ = mean expression of cell $c$ in window $w$ - $\bar{E}_{ref,w}$ = mean expression of reference cells in window $w$ **Interpretation:** - Positive scores → potential amplification - Negative scores → potential deletion - Near-zero scores → normal copy number ### Step 5: Quantile-Based Thresholding To reduce false positives from biological noise: 1. Compute quantiles (default: 1st and 99th percentiles) 2. Scores within quantile range → set to 0 (no CNV) 3. Extreme scores → retained as putative CNVs ```{r thresholding} # thresholdPercentile controls stringency # Higher values → more stringent (fewer CNV calls) # Lower values → more sensitive (more CNV calls) result <- fastCNV( seuratObj = seurat_obj, sampleName = "Sample1", referenceVar = "cell_type", referenceLabel = c("Normal"), thresholdPercentile = 0.01 # Default: 1st/99th percentile ) ``` ### Step 6: Hierarchical Clustering Cells are clustered based on their CNV profiles: 1. Compute pairwise distances between CNV profiles 2. Build hierarchical clustering tree 3. Cut tree to define subclones ```{r clustering} # Automatic clustering result <- CNVCluster( seuratObj = result, referenceVar = "cell_type", tumorLabel = "Tumor", k = NULL, # Auto-determine number of clusters plotDendrogram = TRUE, plotElbowPlot = TRUE ) # Manual specification result <- CNVCluster( seuratObj = result, referenceVar = "cell_type", tumorLabel = "Tumor", k = 4 # Force 4 clusters ) ``` ## Mathematical Details ### Distance Metric The default distance metric is correlation distance: $$d(x, y) = 1 - \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}$$ This metric is robust to: - Differences in overall expression level - Technical batch effects ### CNV Classification Final CNV calls are classified as: | Classification | Criteria | |----------------|----------| | **Amplification** | Score > threshold | | **Deletion** | Score < -threshold | | **Neutral** | -threshold ≤ Score ≤ threshold | ```{r classification} # Classify CNV calls result <- CNVClassification( seuratObj = result, referenceVar = "cell_type", referenceLabel = c("Normal") ) ``` ## Performance Optimization ### Vectorized Operations fastCNV 2.0 uses vectorized operations for efficiency: ```{r vectorized} # Old approach (slow) # for (i in 1:n) { result[i] <- compute(data[i]) } # New approach (fast) # result <- vectorized_compute(data) ``` ### Memory Management For large datasets, fastCNV implements strategic garbage collection: ```{r memory} # Automatic memory cleanup result <- fastCNV( seuratObj = large_seurat, sampleName = "Large_sample", referenceVar = "cell_type", referenceLabel = c("Normal") ) # gc() is called at key checkpoints ``` ## Algorithm Parameters Summary | Parameter | Description | Default | Impact | |-----------|-------------|---------|--------| | `windowSize` | Genes per window | 150 | Resolution vs. noise | | `topNGenes` | Genes to analyze | 7000 | Coverage vs. speed | | `thresholdPercentile` | Noise filter | 0.01 | Sensitivity vs. specificity | | `aggregFactor` | Spatial binning | 10 | Coverage vs. resolution | ## Validation and Quality Control ### Checking Results ```{r qc} # 1. Verify reference cells are diploid # Reference cells should show minimal CNV signal ref_cells <- which(result$cell_type %in% c("Normal")) mean_ref_signal <- mean(abs(result@assays$CNVScores[, ref_cells])) message("Mean reference CNV signal: ", round(mean_ref_signal, 4)) # 2. Check known CNV regions # If you know expected CNVs, verify they are detected # e.g., chromosome 7 gain in glioblastoma # 3. Compare with DNA-based CNV calls (if available) ``` ## References 1. Patel AP, et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. *Science*. 2. Tirosh I, et al. (2016). Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. *Science*. 3. Fan J, et al. (2018). Linking transcriptional and genetic tumor heterogeneity through allele analysis of single-cell RNA-seq data. *Genome Research*. ## Session Info ```{r session-info, eval=TRUE} sessionInfo() ```