Algorithm and Methodology

Theoretical Background

Copy Number Variations in Cancer

Copy Number Variations (CNVs) are structural alterations in the genome where segments of DNA are duplicated (amplifications) or deleted (losses). In cancer:

Amplifications often harbor oncogenes (e.g., MYC, ERBB2)
Deletions frequently affect tumor suppressor genes (e.g., TP53, CDKN2A)
CNV patterns define tumor subclones with distinct evolutionary trajectories

Expression-Based CNV Inference

The central premise of expression-based CNV detection is that, averaged over many neighboring genes, expression scales with copy number:

\[\text{Gene Expression} \propto \text{Gene Copy Number}\]

Genes in amplified regions tend to show elevated expression, while genes in deleted regions show reduced expression. By analyzing expression patterns across genomically ordered genes, we can infer underlying CNVs.
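As a quick worked example of this premise, the sketch below assumes an idealized dosage model in which expression is exactly proportional to copy number relative to a diploid baseline. Real data deviate from this because of gene regulation and technical noise, and the copy-number labels are illustrative only.

```r
# Idealized dosage model: expression is proportional to copy number.
copy_number <- c(loss = 1, neutral = 2, gain = 3, amplification = 4)
diploid <- 2

# Expected log2 expression ratio versus a diploid reference
expected_log2_ratio <- log2(copy_number / diploid)
round(expected_log2_ratio, 3)
#>          loss       neutral          gain amplification
#>        -1.000         0.000         0.585         1.000
```

Under this model a single-copy gain shifts expression by only about 0.58 log2 units, which is why the signal has to be averaged over many genes before it stands out from noise.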

Algorithm Pipeline

The fastCNV algorithm consists of six main steps:

Figure: Overview of the fastCNV analysis pipeline.

Pipeline Steps:

Gene Ordering: Sort genes by chromosome and genomic position
Expression Smoothing: Apply sliding window across ordered genes
Reference Normalization: Center scores using reference (normal) cells
Score Computation: Calculate CNV scores per window
Thresholding: Apply quantile-based noise filtering
Clustering: Identify CNV-based subpopulations

Step 1: Gene Ordering

Genes are ordered by their genomic coordinates:

Chromosome ordering: 1, 2, …, 22, X
Position ordering: By transcription start site (TSS)

# Gene metadata contains genomic coordinates
data("geneMetadata", package = "fastCNV")
head(geneMetadata)

This ordering ensures that adjacent genes in our analysis are also adjacent in the genome.
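The ordering itself can be sketched in base R. The toy data frame and its column names (`gene`, `chr`, `start`) are assumptions made for illustration and are not necessarily the columns of `geneMetadata`.

```r
# Toy gene annotation; column names are illustrative assumptions,
# not the actual columns of fastCNV's geneMetadata.
genes <- data.frame(
  gene  = c("G1", "G2", "G3", "G4", "G5"),
  chr   = c("X", "2", "10", "2", "1"),
  start = c(5000, 300, 100, 120, 900),
  stringsAsFactors = FALSE
)

# Encode chromosomes as a factor with explicit levels so that "10" sorts
# after "2" (plain character sorting would put "10" first).
chr_levels <- c(as.character(1:22), "X")
genes$chr <- factor(genes$chr, levels = chr_levels)

# Sort by chromosome, then by start position within each chromosome
ordered_genes <- genes[order(genes$chr, genes$start), ]
ordered_genes$gene
#> [1] "G5" "G4" "G2" "G3" "G1"
```

Using a factor with explicit levels is the design choice that matters here: it keeps the chromosome order numeric rather than alphabetical.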

Step 2: Sliding Window Smoothing

Raw gene expression is noisy. We apply a sliding window approach:

\[S_w = \frac{1}{|W|} \sum_{g \in W} E_g\]

Where:

\(S_w\) = smoothed score for window \(w\)
\(W\) = set of genes in the window
\(E_g\) = normalized expression of gene \(g\)
\(|W|\) = window size (default: 150 genes)

Why smoothing?

Reduces single-gene noise
Captures regional expression patterns
Mimics the resolution of array-based CNV detection

# Window size affects resolution vs. noise trade-off
# Smaller windows → higher resolution, more noise
# Larger windows → smoother profiles, lower resolution

result <- fastCNV(
  seuratObj = seurat_obj,
  sampleName = "Sample1",
  referenceVar = "cell_type",
  referenceLabel = c("Normal"),
  windowSize = 150  # Default: 150 genes per window
)
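To make the window average concrete, here is a minimal base-R sketch for a single cell. It assumes overlapping windows that advance one gene at a time; the helper name `sliding_mean` is hypothetical, and the package may define windows and their step size differently.

```r
# Sliding-window mean over genomically ordered genes for one cell.
# `expr` is a hypothetical vector of normalized expression values.
sliding_mean <- function(expr, window_size) {
  stopifnot(window_size >= 1, window_size <= length(expr))
  # Prefix sums let each window sum be obtained by one subtraction
  cs <- c(0, cumsum(expr))
  n_windows <- length(expr) - window_size + 1
  starts <- seq_len(n_windows)
  (cs[starts + window_size] - cs[starts]) / window_size
}

expr <- c(1, 1, 1, 4, 4, 4, 1, 1)
sliding_mean(expr, window_size = 3)
#> [1] 1 2 3 4 3 2
```

The step from 1 to 4 in the raw values becomes a gradual ramp in the smoothed profile, which illustrates the loss of resolution that comes with larger windows.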

Step 3: Reference Normalization

The key step in expression-based CNV detection is reference normalization, which removes expression differences that are unrelated to copy number:

\[CNV_{cell} = S_{cell} - \bar{S}_{reference}\]

Reference cells (typically non-malignant cells like fibroblasts, immune cells) provide the baseline expression expected without CNVs.

Important considerations:

Reference cells should be diploid (normal copy number)
Multiple reference cell types improve robustness
For tumors, use adjacent normal tissue or known normal populations

# Good reference cell types:
# - Fibroblasts
# - Endothelial cells
# - Immune cells (T cells, B cells, macrophages)
# - Normal epithelial cells (if available)

result <- fastCNV(
  seuratObj = seurat_obj,
  sampleName = "Sample1",
  referenceVar = "cell_type",
  referenceLabel = c("Fibroblast", "T_cell", "Endothelial")
)
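The subtraction can be sketched on a toy matrix of smoothed scores. The matrix layout (windows in rows, cells in columns) and all object names are assumptions for illustration, not fastCNV internals.

```r
# Toy smoothed scores: rows are genomic windows, columns are cells.
scores <- matrix(
  c(1.0, 1.2, 2.0, 2.1,
    0.5, 0.7, 0.6, 0.1,
    2.0, 2.0, 2.0, 2.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("w", 1:3), c("ref1", "ref2", "tum1", "tum2"))
)
is_reference <- c(TRUE, TRUE, FALSE, FALSE)

# Per-window baseline: mean smoothed score across reference cells
ref_baseline <- rowMeans(scores[, is_reference, drop = FALSE])

# Subtract each window's baseline from every cell in that window
cnv <- sweep(scores, 1, ref_baseline, FUN = "-")
round(cnv, 2)
```

After centering, the reference cells average to zero in every window, so any remaining signal in the other cells reflects a deviation from the reference baseline.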

Step 4: Score Computation

For each cell and each genomic window, we compute:

\[CNV\_score_{c,w} = \bar{E}_{c,w} - \bar{E}_{ref,w}\]

Where:

\(\bar{E}_{c,w}\) = mean expression of cell \(c\) in window \(w\)
\(\bar{E}_{ref,w}\) = mean expression of reference cells in window \(w\)

Interpretation:

Positive scores → potential amplification
Negative scores → potential deletion
Near-zero scores → normal copy number

Step 5: Quantile-Based Thresholding

To reduce false positives from biological noise:

Compute the lower and upper quantiles of the CNV scores (default: 1st and 99th percentiles)
Scores within quantile range → set to 0 (no CNV)
Extreme scores → retained as putative CNVs

# thresholdPercentile sets the quantile band that is treated as noise
# Lower values → wider zeroed band, more stringent (fewer CNV calls)
# Higher values → narrower zeroed band, more sensitive (more CNV calls, more noise)

result <- fastCNV(
  seuratObj = seurat_obj,
  sampleName = "Sample1",
  referenceVar = "cell_type",
  referenceLabel = c("Normal"),
  thresholdPercentile = 0.01  # Default: 1st/99th percentile
)
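The filtering rule can be sketched as follows. The helper `threshold_scores` is hypothetical, and computing the quantiles over all scores at once (rather than per cell or per sample) is an assumption of this sketch.

```r
# Zero out scores that lie between the p-th and (1 - p)-th quantiles.
# `p` plays the role described for thresholdPercentile; computing the
# quantiles globally is an assumption of this sketch.
threshold_scores <- function(scores, p = 0.01) {
  bounds <- stats::quantile(scores, probs = c(p, 1 - p), names = FALSE)
  scores[scores >= bounds[1] & scores <= bounds[2]] <- 0
  scores
}

set.seed(1)
scores <- c(rnorm(1000, sd = 0.05), 0.8, -0.9)  # noise plus two strong signals
filtered <- threshold_scores(scores, p = 0.01)

sum(filtered != 0)       # number of scores retained as putative CNVs
filtered[c(1001, 1002)]  # the two strong signals are retained unchanged
```

Note that a quantile rule always retains a fixed fraction of the scores, so retained values are candidates rather than confirmed CNVs.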

Step 6: Hierarchical Clustering

Cells are clustered based on their CNV profiles:

Compute pairwise distances between CNV profiles
Build hierarchical clustering tree
Cut tree to define subclones

# Automatic clustering
result <- CNVCluster(
  seuratObj = result,
  referenceVar = "cell_type",
  tumorLabel = "Tumor",
  k = NULL,  # Auto-determine number of clusters
  plotDendrogram = TRUE,
  plotElbowPlot = TRUE
)

# Manual specification
result <- CNVCluster(
  seuratObj = result,
  referenceVar = "cell_type",
  tumorLabel = "Tumor",
  k = 4  # Force 4 clusters
)
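The three steps listed above can be sketched with base R on simulated profiles. This is not the internals of `CNVCluster`: the simulated clones, the average linkage, and the fixed `k = 2` are all illustrative assumptions.

```r
# Simulate CNV profiles (rows = cells, columns = genomic windows)
set.seed(42)
clone_a <- c(rep(0.5, 10), rep(0, 10))  # gain in windows 1-10
clone_b <- c(rep(0, 10), rep(0.5, 10))  # gain in windows 11-20
simulate_cells <- function(profile, n_cells) {
  t(replicate(n_cells, profile + rnorm(length(profile), sd = 0.05)))
}
profiles <- rbind(simulate_cells(clone_a, 5), simulate_cells(clone_b, 5))
rownames(profiles) <- paste0("cell", 1:10)

# 1. Pairwise correlation distances between cells
d <- as.dist(1 - cor(t(profiles)))
# 2. Build the hierarchical clustering tree (average linkage, an assumption)
tree <- hclust(d, method = "average")
# 3. Cut the tree into k groups
clusters <- cutree(tree, k = 2)
table(clusters)
```

In this toy setup the first five cells and the last five cells fall into separate clusters, matching the two simulated clones.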

Mathematical Details

Distance Metric

The default distance metric is correlation distance:

\[d(x, y) = 1 - \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}\]

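Because Pearson correlation is unchanged when a constant is added to a profile or the profile is multiplied by a positive factor, this metric is robust to:

Differences in overall expression level
Technical batch effects that shift or rescale a cell's whole profile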
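The formula above corresponds to one minus the Pearson correlation, which the following sketch evaluates on a hypothetical five-window profile. The helper name `cor_dist` is an assumption for illustration.

```r
# Correlation distance between two CNV profiles: 1 - Pearson correlation
cor_dist <- function(x, y) 1 - stats::cor(x, y, method = "pearson")

x <- c(0.0, 0.1, 0.6, 0.5, -0.4)

cor_dist(x, x)            # identical profiles: distance 0
cor_dist(x, 2 * x + 0.3)  # shifted and rescaled copy: 0 up to rounding error
cor_dist(x, -x)           # mirrored profile: distance 2, the maximum
```

One consequence worth keeping in mind is that the metric compares the shape of two profiles and not their magnitude, so a weak and a strong version of the same pattern are treated as identical.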

CNV Classification

Final CNV calls are classified as:

Classification	Criteria
Amplification	Score > threshold
Deletion	Score < -threshold
Neutral	-threshold ≤ Score ≤ threshold

# Classify CNV calls
result <- CNVClassification(
  seuratObj = result,
  referenceVar = "cell_type",
  referenceLabel = c("Normal")
)
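The criteria in the table can be written as a small function. The helper `classify_cnv` and the threshold value of 0.1 are hypothetical; the sketch only mirrors the three rules in the table and says nothing about how `CNVClassification` derives its cut-offs.

```r
# Three-way classification of scores against a symmetric threshold.
# The threshold value is hypothetical and chosen only for illustration.
classify_cnv <- function(score, threshold) {
  ifelse(score > threshold, "Amplification",
         ifelse(score < -threshold, "Deletion", "Neutral"))
}

scores <- c(0.45, -0.02, -0.30, 0.10, -0.10)
classify_cnv(scores, threshold = 0.1)
#> [1] "Amplification" "Neutral"       "Deletion"      "Neutral"
#> [5] "Neutral"
```

Scores that sit exactly on the threshold are classified as Neutral, consistent with the inclusive bounds in the table.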

Performance Optimization

Vectorized Operations

fastCNV 2.0 uses vectorized operations for efficiency:

# Old approach (slow)
# for (i in 1:n) { result[i] <- compute(data[i]) }

# New approach (fast)
# result <- vectorized_compute(data)
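As a runnable illustration of the same idea, the sketch below computes per-window means across cells with an explicit loop and with a single vectorized call. It is a generic base-R example on simulated data, not code taken from fastCNV.

```r
set.seed(7)
mat <- matrix(rnorm(2000 * 50), nrow = 2000)  # 2000 windows x 50 cells

# Loop version: one window at a time
loop_means <- numeric(nrow(mat))
for (i in seq_len(nrow(mat))) {
  loop_means[i] <- mean(mat[i, ])
}

# Vectorized version: one call over the whole matrix
vec_means <- rowMeans(mat)

# Both approaches agree up to floating-point tolerance
stopifnot(isTRUE(all.equal(loop_means, vec_means)))
```

Wrapping either version in `system.time()` is a simple way to compare their run times on your own data.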

Memory Management

For large datasets, fastCNV implements strategic garbage collection:

# Automatic memory cleanup
result <- fastCNV(
  seuratObj = large_seurat,
  sampleName = "Large_sample",
  referenceVar = "cell_type",
  referenceLabel = c("Normal")
)
# gc() is called at key checkpoints

Algorithm Parameters Summary

Parameter	Description	Default	Impact
`windowSize`	Genes per window	150	Resolution vs. noise
`topNGenes`	Genes to analyze	7000	Coverage vs. speed
`thresholdPercentile`	Noise filter	0.01	Sensitivity vs. specificity
`aggregFactor`	Spatial binning	10	Coverage vs. resolution

Validation and Quality Control

Checking Results

# 1. Verify reference cells are diploid
# Reference cells should show minimal CNV signal
ref_cells <- which(result$cell_type %in% c("Normal"))
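# Extract the score matrix first; an Assay object cannot be subset like a matrix
cnv_scores <- as.matrix(SeuratObject::GetAssayData(result, assay = "CNVScores"))
mean_ref_signal <- mean(abs(cnv_scores[, ref_cells]))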
message("Mean reference CNV signal: ", round(mean_ref_signal, 4))

# 2. Check known CNV regions
# If you know expected CNVs, verify they are detected
# e.g., chromosome 7 gain in glioblastoma

# 3. Compare with DNA-based CNV calls (if available)
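For the third check, one possible way to quantify agreement with DNA-based calls is sketched below. All values are made up for illustration, and the variable names, the tolerance of 0.1, and the choice of Spearman correlation are assumptions, not part of fastCNV.

```r
# Hypothetical per-window values for the same genomic windows
inferred_score <- c(0.02, 0.31, 0.28, -0.01, -0.25, -0.22, 0.03)
dna_log2_ratio <- c(0.00, 0.58, 0.58,  0.00, -1.00, -1.00, 0.00)

# Rank correlation is insensitive to the different scales of the two assays
concordance <- cor(inferred_score, dna_log2_ratio, method = "spearman")

# Direction of change per window: -1 (loss), 0 (neutral), 1 (gain)
direction <- function(x, tol) sign(x) * (abs(x) > tol)
agreement <- mean(direction(inferred_score, 0.1) == direction(dna_log2_ratio, 0.1))

message("Spearman correlation: ", round(concordance, 2))
message("Fraction of windows with matching direction: ", agreement)
```

A rank-based measure is used because expression-derived scores and DNA log2 ratios are on different scales, so only their ordering is directly comparable.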

Session Info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] rmarkdown_2.31
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.39    R6_2.6.1         fastmap_1.2.0    xfun_0.59       
#>  [5] maketools_1.3.2  cachem_1.1.0     knitr_1.51       htmltools_0.5.9 
#>  [9] buildtools_1.0.0 lifecycle_1.0.5  cli_3.6.6        sass_0.4.10     
#> [13] jquerylib_0.1.4  compiler_4.6.0   sys_3.4.3        tools_4.6.0     
#> [17] evaluate_1.0.5   bslib_0.11.0     yaml_2.3.12      otel_0.2.0      
#> [21] jsonlite_2.0.0   rlang_1.2.0