Copy Number Variations (CNVs) are structural alterations in the genome where segments of DNA are duplicated (amplifications) or deleted (losses). In cancer:
The central premise of expression-based CNV detection is:
\[\text{Gene Expression} \propto \text{Gene Copy Number}\]
Genes in amplified regions tend to show elevated expression, while genes in deleted regions show reduced expression. By analyzing expression patterns across genomically ordered genes, we can infer underlying CNVs.
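As a worked illustration of this premise, assume expression is compared on a log2 scale against a diploid baseline (an assumption made here for illustration; the scale is not stated above). A region with \(n\) copies would then be expected to shift by:

\[\Delta_n = \log_2\left(\frac{n}{2}\right), \qquad \Delta_3 = \log_2(1.5) \approx 0.585, \qquad \Delta_1 = \log_2(0.5) = -1\]

Real data rarely follow this idealized relationship exactly, which is why the steps below smooth across many genes and compare against reference cells.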
The fastCNV algorithm consists of six main steps:
Overview of the fastCNV analysis pipeline.
Pipeline Steps:
Genes are ordered by their genomic coordinates:
# Gene metadata contains genomic coordinates
data("geneMetadata", package = "fastCNV")
head(geneMetadata)

This ordering ensures that adjacent genes in our analysis are also adjacent in the genome.
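The ordering step can be sketched in base R as follows. The data frame and its column names (`gene`, `chr`, `start`) are hypothetical toy values and may not match the columns of `geneMetadata`.

```r
# Hypothetical gene annotation; names and coordinates are toy values.
genes <- data.frame(
  gene  = c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF"),
  chr   = c("10", "2", "X", "1", "2", "1"),
  start = c(500, 300, 100, 900, 50, 200),
  stringsAsFactors = FALSE
)

# Order chromosomes naturally (1..22, X, Y) rather than alphabetically,
# so that "10" does not sort before "2".
chr_levels <- c(as.character(1:22), "X", "Y")
genes$chr <- factor(genes$chr, levels = chr_levels)

# Sort by chromosome, then by start position within each chromosome.
ordered_genes <- genes[order(genes$chr, genes$start), ]
ordered_genes$gene
```

Using a factor with explicit levels keeps the chromosome order independent of how the labels happen to sort as strings.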
Raw gene expression is noisy. We apply a sliding window approach:
\[S_w = \frac{1}{|W|} \sum_{g \in W} E_g\]
Where:

- \(S_w\) = smoothed score for window \(w\)
- \(W\) = set of genes in the window
- \(E_g\) = normalized expression of gene \(g\)
- \(|W|\) = window size (default: 150 genes)
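The formula can be illustrated with a minimal base R sketch. The helper `smooth_window` is hypothetical and is not the fastCNV implementation; it uses a centered window that is truncated at the ends of the vector, which is one of several reasonable edge-handling choices.

```r
# Centered moving average over genomically ordered genes.
# `expr` is a numeric vector of normalized expression for one cell,
# assumed to be already sorted by genomic position.
smooth_window <- function(expr, window_size) {
  n <- length(expr)
  half <- window_size %/% 2   # the full window spans 2 * half + 1 genes
  vapply(seq_len(n), function(i) {
    lo <- max(1, i - half)    # truncate the window at the left edge
    hi <- min(n, i + half)    # truncate the window at the right edge
    mean(expr[lo:hi])         # S_w = (1 / |W|) * sum of E_g over the window
  }, numeric(1))
}

set.seed(1)
# Toy profile: 300 genes, the middle 100 shifted upward to mimic a gain.
expr <- rnorm(300, mean = 0, sd = 1)
expr[101:200] <- expr[101:200] + 1

smoothed <- smooth_window(expr, window_size = 31)

# Compare gene-to-gene variability before and after smoothing.
c(raw_sd = sd(expr[1:80]), smoothed_sd = sd(smoothed[1:80]))
```

In this toy profile the smoothed values vary less from gene to gene, while the shifted block in the middle remains visible.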
Why smoothing?
# Window size affects resolution vs. noise trade-off
# Smaller windows → higher resolution, more noise
# Larger windows → smoother profiles, lower resolution
result <- fastCNV(
  seuratObj = seurat_obj,
  sampleName = "Sample1",
  referenceVar = "cell_type",
  referenceLabel = c("Normal"),
  windowSize = 150 # Default: 150 genes per window
)

The key innovation in expression-based CNV detection is reference normalization:
\[CNV_{cell} = S_{cell} - \bar{S}_{reference}\]
Reference cells (typically non-malignant cells like fibroblasts, immune cells) provide the baseline expression expected without CNVs.
Important considerations:
# Good reference cell types:
# - Fibroblasts
# - Endothelial cells
# - Immune cells (T cells, B cells, macrophages)
# - Normal epithelial cells (if available)
result <- fastCNV(
  seuratObj = seurat_obj,
  sampleName = "Sample1",
  referenceVar = "cell_type",
  referenceLabel = c("Fibroblast", "T_cell", "Endothelial")
)

For each cell and each genomic window, we compute:
\[CNV\_score_{c,w} = \bar{E}_{c,w} - \bar{E}_{ref,w}\]
Where:

- \(\bar{E}_{c,w}\) = mean expression of cell \(c\) in window \(w\)
- \(\bar{E}_{ref,w}\) = mean expression of reference cells in window \(w\)

Interpretation:

- Positive scores → potential amplification
- Negative scores → potential deletion
- Near-zero scores → normal copy number
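A minimal sketch of this computation on a toy matrix, using base R only. The matrix layout (windows in rows, cells in columns), the object names, and the values are assumptions made for illustration and do not reflect fastCNV internals.

```r
# Toy smoothed expression matrix: rows are genomic windows, columns are cells.
smoothed <- matrix(
  c(1.0, 1.1, 0.9, 1.6,   # window 1: tumor cell elevated
    2.0, 2.1, 1.9, 2.0,   # window 2: no difference
    1.5, 1.4, 1.6, 0.8),  # window 3: tumor cell reduced
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("w", 1:3), c("ref1", "ref2", "ref3", "tumor1"))
)
is_reference <- c(TRUE, TRUE, TRUE, FALSE)

# Baseline per window: mean over the reference cells.
ref_mean <- rowMeans(smoothed[, is_reference, drop = FALSE])

# CNV score per cell and window: cell value minus the reference baseline.
cnv_score <- sweep(smoothed, 1, ref_mean, FUN = "-")
round(cnv_score[, "tumor1"], 2)
```

For the toy tumor cell the scores are positive in window 1, zero in window 2, and negative in window 3, matching the interpretation above. By construction, the reference cells average to zero in every window.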
To reduce false positives from biological noise:
# thresholdPercentile controls stringency
# Higher values → more stringent (fewer CNV calls)
# Lower values → more sensitive (more CNV calls)
result <- fastCNV(
  seuratObj = seurat_obj,
  sampleName = "Sample1",
  referenceVar = "cell_type",
  referenceLabel = c("Normal"),
  thresholdPercentile = 0.01 # Default: 1st/99th percentile
)

Cells are clustered based on their CNV profiles:
# Automatic clustering
result <- CNVCluster(
  seuratObj = result,
  referenceVar = "cell_type",
  tumorLabel = "Tumor",
  k = NULL, # Auto-determine number of clusters
  plotDendrogram = TRUE,
  plotElbowPlot = TRUE
)
# Manual specification
result <- CNVCluster(
  seuratObj = result,
  referenceVar = "cell_type",
  tumorLabel = "Tumor",
  k = 4 # Force 4 clusters
)

The default distance metric is correlation distance:
\[d(x, y) = 1 - \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}\]
This metric is robust to:

- Differences in overall expression level
- Technical batch effects
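The correlation distance above can be computed and used for hierarchical clustering in base R, as in the sketch below. The simulated clones, the average linkage, and the choice of `k = 2` are assumptions for illustration and are not taken from `CNVCluster`.

```r
set.seed(42)
# Toy CNV score matrix: rows are cells, columns are genomic windows.
# Two simulated clones with different profiles (illustrative values only).
clone_a <- c(rep(0.5, 10), rep(0, 10))               # gain in windows 1-10
clone_b <- c(rep(0, 5), rep(-0.5, 10), rep(0, 5))    # loss in windows 6-15
scores <- rbind(
  t(replicate(5, clone_a + rnorm(20, sd = 0.05))),
  t(replicate(5, clone_b + rnorm(20, sd = 0.05)))
)
rownames(scores) <- paste0("cell", 1:10)

# cor() correlates columns, so transpose to correlate cells with each other.
# d(x, y) = 1 - Pearson correlation of the two CNV profiles.
d <- as.dist(1 - cor(t(scores)))

# Hierarchical clustering; the linkage method is an assumption of this sketch.
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 2)
table(clusters, clone = rep(c("A", "B"), each = 5))
```

Because Pearson correlation centers and scales each profile, adding a constant offset to a cell's scores leaves its distances unchanged. This is the sense in which the metric is insensitive to overall expression level. The flip side is that two profiles differing only by a constant shift are treated as identical.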
Final CNV calls are classified as:
| Classification | Criteria |
|---|---|
| Amplification | Score > threshold |
| Deletion | Score < -threshold |
| Neutral | -threshold ≤ Score ≤ threshold |
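The classification rule in the table can be sketched as follows. The helper `classify_cnv` is hypothetical, and deriving the threshold from the 99th percentile of absolute reference scores is an assumption of this sketch. It is not a description of how fastCNV applies `thresholdPercentile` internally.

```r
# Classify CNV scores against a symmetric threshold, following the table above.
classify_cnv <- function(score, threshold) {
  ifelse(score > threshold, "Amplification",
         ifelse(score < -threshold, "Deletion", "Neutral"))
}

set.seed(7)
# Illustrative reference scores: noise centered on zero.
ref_scores <- rnorm(1000, mean = 0, sd = 0.05)

# Assumption: use an upper quantile of the absolute reference scores
# as the noise threshold.
noise_quantile <- 0.99
threshold <- unname(quantile(abs(ref_scores), probs = noise_quantile))

# Toy tumor scores for three windows.
tumor_scores <- c(w1 = 0.60, w2 = 0.01, w3 = -0.70)
classify_cnv(tumor_scores, threshold)
```

Scores exactly equal to the threshold fall into the neutral class, consistent with the inclusive bounds in the table.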
fastCNV 2.0 uses vectorized operations for efficiency:
| Parameter | Description | Default | Impact |
|---|---|---|---|
| `windowSize` | Genes per window | 150 | Resolution vs. noise |
| `topNGenes` | Genes to analyze | 7000 | Coverage vs. speed |
| `thresholdPercentile` | Noise filter | 0.01 | Sensitivity vs. specificity |
| `aggregFactor` | Spatial binning | 10 | Coverage vs. resolution |
# 1. Verify reference cells are diploid
# Reference cells should show minimal CNV signal
ref_cells <- which(result$cell_type %in% c("Normal"))
mean_ref_signal <- mean(abs(result@assays$CNVScores[, ref_cells]))
message("Mean reference CNV signal: ", round(mean_ref_signal, 4))
# 2. Check known CNV regions
# If you know expected CNVs, verify they are detected
# e.g., chromosome 7 gain in glioblastoma
# 3. Compare with DNA-based CNV calls (if available)