---
title: "Algorithm and Methodology"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Algorithm and Methodology}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  eval = FALSE
)
```

## Theoretical Background

### Copy Number Variations in Cancer

Copy Number Variations (CNVs) are structural alterations in the genome where segments of DNA are duplicated (amplifications) or deleted (losses). In cancer:

- **Amplifications** often harbor oncogenes (e.g., *MYC*, *ERBB2*)
- **Deletions** frequently affect tumor suppressor genes (e.g., *TP53*, *CDKN2A*)
- CNV patterns define **tumor subclones** with distinct evolutionary trajectories

### Expression-Based CNV Inference

The central premise of expression-based CNV detection is:

$$\text{Gene Expression} \propto \text{Gene Copy Number}$$

Genes in amplified regions tend to show elevated expression, while genes in deleted regions show reduced expression. By analyzing expression patterns across genomically ordered genes, we can infer underlying CNVs.

## Algorithm Pipeline

The fastCNV algorithm consists of six main steps:

```{r pipeline-diagram, echo=FALSE, eval=TRUE, out.width="100%", fig.cap="Overview of the fastCNV analysis pipeline."}
knitr::include_graphics("figures/pipeline_diagram.png")
```

**Pipeline Steps:**

1. **Gene Ordering**: Sort genes by chromosome and genomic position
2. **Expression Smoothing**: Apply sliding window across ordered genes
3. **Reference Normalization**: Center scores using reference (normal) cells
4. **Score Computation**: Calculate CNV scores per window
5. **Thresholding**: Apply quantile-based noise filtering
6. **Clustering**: Identify CNV-based subpopulations

### Step 1: Gene Ordering

Genes are ordered by their genomic coordinates:

1. **Chromosome ordering**: 1, 2, ..., 22, X
2. **Position ordering**: By transcription start site (TSS)

```{r gene-ordering}
# Gene metadata contains genomic coordinates
data("geneMetadata", package = "fastCNV")
head(geneMetadata)
```

This ordering ensures that adjacent genes in our analysis are also adjacent in the genome.

### Step 2: Sliding Window Smoothing

Raw gene expression is noisy. We apply a sliding window approach:

$$S_w = \frac{1}{|W|} \sum_{g \in W} E_g$$

Where:
- $S_w$ = smoothed score for window $w$
- $W$ = set of genes in the window
- $E_g$ = normalized expression of gene $g$
- $|W|$ = window size (default: 150 genes)

**Why smoothing?**

- Reduces single-gene noise
- Captures regional expression patterns
- Mimics the resolution of array-based CNV detection

```{r window-params}
# Window size affects resolution vs. noise trade-off
# Smaller windows → higher resolution, more noise
# Larger windows → smoother profiles, lower resolution

result <- fastCNV(
  seuratObj = seurat_obj,
  sampleName = "Sample1",
  referenceVar = "cell_type",
  referenceLabel = c("Normal"),
  windowSize = 150  # Default: 150 genes per window
)
```

### Step 3: Reference Normalization

The key innovation in expression-based CNV detection is **reference normalization**:

$$CNV_{cell} = S_{cell} - \bar{S}_{reference}$$

Reference cells (typically non-malignant cells like fibroblasts, immune cells) provide the baseline expression expected without CNVs.

**Important considerations:**

- Reference cells should be **diploid** (normal copy number)
- Multiple reference cell types improve robustness
- For tumors, use adjacent normal tissue or known normal populations

```{r reference-selection}
# Good reference cell types:
# - Fibroblasts
# - Endothelial cells
# - Immune cells (T cells, B cells, macrophages)
# - Normal epithelial cells (if available)

result <- fastCNV(
  seuratObj = seurat_obj,
  sampleName = "Sample1",
  referenceVar = "cell_type",
  referenceLabel = c("Fibroblast", "T_cell", "Endothelial")
)
```

### Step 4: Score Computation

For each cell and each genomic window, we compute:

$$CNV\_score_{c,w} = \bar{E}_{c,w} - \bar{E}_{ref,w}$$

Where:
- $\bar{E}_{c,w}$ = mean expression of cell $c$ in window $w$
- $\bar{E}_{ref,w}$ = mean expression of reference cells in window $w$

**Interpretation:**
- Positive scores → potential amplification
- Negative scores → potential deletion
- Near-zero scores → normal copy number

### Step 5: Quantile-Based Thresholding

To reduce false positives from biological noise:

1. Compute quantiles (default: 1st and 99th percentiles)
2. Scores within quantile range → set to 0 (no CNV)
3. Extreme scores → retained as putative CNVs

```{r thresholding}
# thresholdPercentile controls stringency
# Higher values → more stringent (fewer CNV calls)
# Lower values → more sensitive (more CNV calls)

result <- fastCNV(
  seuratObj = seurat_obj,
  sampleName = "Sample1",
  referenceVar = "cell_type",
  referenceLabel = c("Normal"),
  thresholdPercentile = 0.01  # Default: 1st/99th percentile
)
```

### Step 6: Hierarchical Clustering

Cells are clustered based on their CNV profiles:

1. Compute pairwise distances between CNV profiles
2. Build hierarchical clustering tree
3. Cut tree to define subclones

```{r clustering}
# Automatic clustering
result <- CNVCluster(
  seuratObj = result,
  referenceVar = "cell_type",
  tumorLabel = "Tumor",
  k = NULL,  # Auto-determine number of clusters
  plotDendrogram = TRUE,
  plotElbowPlot = TRUE
)

# Manual specification
result <- CNVCluster(
  seuratObj = result,
  referenceVar = "cell_type",
  tumorLabel = "Tumor",
  k = 4  # Force 4 clusters
)
```

## Mathematical Details

### Distance Metric

The default distance metric is correlation distance:

$$d(x, y) = 1 - \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}$$

This metric is robust to:
- Differences in overall expression level
- Technical batch effects

### CNV Classification

Final CNV calls are classified as:

| Classification | Criteria |
|----------------|----------|
| **Amplification** | Score > threshold |
| **Deletion** | Score < -threshold |
| **Neutral** | -threshold ≤ Score ≤ threshold |

```{r classification}
# Classify CNV calls
result <- CNVClassification(
  seuratObj = result,
  referenceVar = "cell_type",
  referenceLabel = c("Normal")
)
```

## Performance Optimization

### Vectorized Operations

fastCNV 2.0 uses vectorized operations for efficiency:

```{r vectorized}
# Old approach (slow)
# for (i in 1:n) { result[i] <- compute(data[i]) }

# New approach (fast)
# result <- vectorized_compute(data)
```

### Memory Management

For large datasets, fastCNV implements strategic garbage collection:

```{r memory}
# Automatic memory cleanup
result <- fastCNV(
  seuratObj = large_seurat,
  sampleName = "Large_sample",
  referenceVar = "cell_type",
  referenceLabel = c("Normal")
)
# gc() is called at key checkpoints
```

## Algorithm Parameters Summary

| Parameter | Description | Default | Impact |
|-----------|-------------|---------|--------|
| `windowSize` | Genes per window | 150 | Resolution vs. noise |
| `topNGenes` | Genes to analyze | 7000 | Coverage vs. speed |
| `thresholdPercentile` | Noise filter | 0.01 | Sensitivity vs. specificity |
| `aggregFactor` | Spatial binning | 10 | Coverage vs. resolution |

## Validation and Quality Control

### Checking Results

```{r qc}
# 1. Verify reference cells are diploid
# Reference cells should show minimal CNV signal
ref_cells <- which(result$cell_type %in% c("Normal"))
mean_ref_signal <- mean(abs(result@assays$CNVScores[, ref_cells]))
message("Mean reference CNV signal: ", round(mean_ref_signal, 4))

# 2. Check known CNV regions
# If you know expected CNVs, verify they are detected
# e.g., chromosome 7 gain in glioblastoma

# 3. Compare with DNA-based CNV calls (if available)
```

## References

1. Patel AP, et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. *Science*.

2. Tirosh I, et al. (2016). Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. *Science*.

3. Fan J, et al. (2018). Linking transcriptional and genetic tumor heterogeneity through allele analysis of single-cell RNA-seq data. *Genome Research*.

## Session Info

```{r session-info, eval=TRUE}
sessionInfo()
```