Single-Sample CNA Analysis

Introduction

This vignette demonstrates a complete single-sample CNA analysis workflow using SCEVAN. We analyze a glioblastoma sample (MGH106) from the publicly available dataset GSE131928.

Setup

Load Required Packages

library(SCEVAN)
library(ggplot2)
library(dplyr)

Download Example Data

# Load pre-processed MGH106 glioblastoma data
load(url("https://www.dropbox.com/s/b9udpvhnc2ez9pc/MGH106_data.RData?raw=1"))

# Examine data structure
cat("Dimensions:", dim(count_mtx), "\n")
cat("Genes:", nrow(count_mtx), "\n")
cat("Cells:", ncol(count_mtx), "\n")

# Preview
count_mtx[1:5, 1:5]

Running the Pipeline

Basic Analysis

The pipelineCNA() function performs the complete analysis:

results <- pipelineCNA(
  count_mtx,
  sample = "MGH106",
  par_cores = 4,           # Adjust based on your system
  organism = "human",
  SUBCLONES = TRUE,        # Enable subclone detection
  beta_vega = 0.5,         # Segmentation parameter
  ClonalCN = TRUE,         # Infer clonal profile
  plotTree = TRUE          # Generate phylogenetic tree
)

Understanding Parameters

Parameter	Value	Rationale
`sample`	“MGH106”	Prefix for output files
`par_cores`	4	Parallel processing
`SUBCLONES`	TRUE	Detect clonal subpopulations
`beta_vega`	0.5	Standard segmentation granularity
`ClonalCN`	TRUE	Generate clonal CN profile

Examining Results

Cell Classification Summary

# View classification results
head(results)

# Summary statistics
table(results$class)

# Subclone distribution (if detected)
if("subclone" %in% colnames(results)) {
  table(results$subclone, useNA = "ifany")
}

Output Files

The pipeline generates several output files:

# List all output files
list.files("./output", pattern = "MGH106")

File Pattern	Description
`*heatmap.png`	CNA heatmap with classifications
`*_CNAmtx.RData`	CNA matrix for downstream analysis
`*_CN.seg`	Segmentation file
`*OncoHeat.png`	OncoPrint-style visualization
`*CloneTree.png`	Phylogenetic tree

Output Visualizations

SCEVAN generates several visualization files in the output directory:

1. Classification Heatmap (`*heatmap.png`)

The main heatmap shows copy number profiles across all cells:

Rows: Genes ordered by genomic position
Columns: Cells clustered by CNA profile
Color: Blue = deletion, Red = amplification
Top bar: Cell classification (tumor/normal)

2. Subclone Heatmap (`*heatmap_subclones.png`)

When subclones are detected, this heatmap colors cells by their subclone assignment.

3. Clonal Tree (`*CloneTree.png`)

The phylogenetic tree shows evolutionary relationships between subclones.

4. Consensus Plot (`*consensus.png`)

Compact visualization showing the consensus copy number profile for each subclone.

5. OncoPrint Visualization (`*OncoHeat.png`)

OncoPrint-style plot highlighting:

Subclone-specific alterations: Present in only one subclone
Shared alterations: Present in multiple (not all) subclones
Clonal alterations: Present in all subclones

Downstream Analysis

Load CNA Matrix

# Load the CNA matrix for custom analyses
load("./output/MGH106_CNAmtx.RData")
load("./output/MGH106_count_mtx_annot.RData")

dim(CNA_mtx_relat)

Extract Region-Specific CN Values

# Define a genomic region of interest (e.g., chromosome 7)
chr7_cn <- apply(
  CNA_mtx_relat[count_mtx_annot$seqnames == 7, ], 
  2, 
  mean
)

# Create data frame for plotting
cn_df <- data.frame(
  cell = names(chr7_cn),
  chr7_cn = chr7_cn,
  class = results[names(chr7_cn), "class"]
)

# Visualize
ggplot(cn_df, aes(x = class, y = chr7_cn, fill = class)) +
  geom_boxplot() +
  theme_minimal() +
  labs(
    title = "Chromosome 7 Copy Number by Cell Type",
    x = "Cell Classification",
    y = "Mean CN Ratio"
  ) +
  scale_fill_manual(values = c("tumor" = "#E74C3C", "normal" = "#3498DB"))

Differential Expression in CNA Regions

# Identify genes in amplified regions
chr7_genes <- rownames(count_mtx_annot)[count_mtx_annot$seqnames == 7]

# Compare expression between tumor subclones
if("subclone" %in% colnames(results)) {
  tumor_cells <- rownames(results)[results$class == "tumor"]
  
  # Get subclone assignments
  subclone_1 <- rownames(results)[results$subclone == 1 & !is.na(results$subclone)]
  subclone_2 <- rownames(results)[results$subclone == 2 & !is.na(results$subclone)]
  
  # Calculate fold changes for chr7 genes
  if(length(subclone_1) > 0 & length(subclone_2) > 0) {
    expr_s1 <- rowMeans(count_mtx[chr7_genes, subclone_1])
    expr_s2 <- rowMeans(count_mtx[chr7_genes, subclone_2])
    
    fc <- log2((expr_s1 + 1) / (expr_s2 + 1))
    
    # Top differentially expressed genes
    head(sort(fc, decreasing = TRUE), 10)
  }
}

Advanced Options

Using Known Normal Cells

If you have prior knowledge of normal cells:

# Specify known normal cell barcodes
known_normals <- c("cell_1", "cell_2", "cell_3")

results <- pipelineCNA(
  count_mtx,
  sample = "MGH106_custom",
  norm_cell = known_normals,
  FIXED_NORMAL_CELLS = FALSE  # Still allow automatic detection
)

Adjusting Segmentation

For noisy data, increase beta_vega:

# Coarser segmentation for noisy data
results_coarse <- pipelineCNA(
  count_mtx,
  sample = "MGH106_coarse",
  beta_vega = 1.0  # More conservative segmentation
)

Adding Custom Gene Sets

# Define custom normal cell signatures
custom_signatures <- list(
  Astrocytes = c("GFAP", "AQP4", "SLC1A2", "SLC1A3"),
  Oligodendrocytes = c("MBP", "MOG", "PLP1", "OLIG2"),
  Microglia = c("CX3CR1", "P2RY12", "TMEM119", "AIF1")
)

results <- pipelineCNA(
  count_mtx,
  sample = "MGH106_custom_sig",
  AdditionalGeneSets = custom_signatures,
  SCEVANsignatures = TRUE  # Also use built-in signatures
)

Best Practices

Quality Control Checklist

Check cell coverage: Ensure sufficient genes per cell
Verify normal detection: Review confident normal cells
Validate subclones: Check modularity scores
Examine segmentation: Adjust beta if needed

Recommended Workflow

# 1. Initial run with defaults
results_default <- pipelineCNA(count_mtx, sample = "sample_v1")

# 2. Review outputs and classifications
table(results_default$class)

# 3. Adjust parameters if needed
results_tuned <- pipelineCNA(
  count_mtx, 
  sample = "sample_v2",
  beta_vega = 0.7  # Adjusted based on review
)

# 4. Final analysis
results_final <- pipelineCNA(
  count_mtx,
  sample = "sample_final",
  SUBCLONES = TRUE,
  plotTree = TRUE
)

Troubleshooting

Common Issues

Issue	Solution
No normal cells detected	Add custom signatures or provide known normals
Too many/few segments	Adjust `beta_vega` parameter
Memory errors	Reduce `par_cores` or subsample data
Missing genes	Verify gene name format (Symbol vs Ensembl)

Session Info

sessionInfo()
#> R version 4.6.1 (2026-06-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 26.04 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] rmarkdown_2.31
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.39    R6_2.6.1         fastmap_1.2.0    xfun_0.59       
#>  [5] maketools_1.3.2  cachem_1.1.0     knitr_1.51       htmltools_0.5.9 
#>  [9] buildtools_1.0.0 lifecycle_1.0.5  cli_3.6.6        sass_0.4.10     
#> [13] jquerylib_0.1.4  compiler_4.6.1   sys_3.4.3        tools_4.6.1     
#> [17] evaluate_1.0.5   bslib_0.11.0     yaml_2.3.12      otel_0.2.0      
#> [21] jsonlite_2.0.0   rlang_1.2.0