---
title: "Single-Sample CNA Analysis"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Single-Sample CNA Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "#>",
  fig.width = 10,
  fig.height = 6,
  out.width = "100%",
  eval = FALSE
)
```

## Introduction

This vignette demonstrates a complete single-sample CNA analysis workflow using SCEVAN. We analyze a glioblastoma sample (MGH106) from the publicly available dataset GSE131928.

## Setup

### Load Required Packages

```{r load-packages}
library(SCEVAN)
library(ggplot2)
library(dplyr)
```

### Download Example Data

```{r load-data}
# Load pre-processed MGH106 glioblastoma data
load(url("https://www.dropbox.com/s/b9udpvhnc2ez9pc/MGH106_data.RData?raw=1"))

# Examine data structure
cat("Dimensions:", dim(count_mtx), "\n")
cat("Genes:", nrow(count_mtx), "\n")
cat("Cells:", ncol(count_mtx), "\n")

# Preview
count_mtx[1:5, 1:5]
```

## Running the Pipeline

### Basic Analysis

The `pipelineCNA()` function performs the complete analysis:

```{r run-basic}
results <- pipelineCNA(
  count_mtx,
  sample = "MGH106",
  par_cores = 4,           # Adjust based on your system
  organism = "human",
  SUBCLONES = TRUE,        # Enable subclone detection
  beta_vega = 0.5,         # Segmentation parameter
  ClonalCN = TRUE,         # Infer clonal profile
  plotTree = TRUE          # Generate phylogenetic tree
)
```

### Understanding Parameters

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| `sample` | "MGH106" | Prefix for output files |
| `par_cores` | 4 | Parallel processing |
| `SUBCLONES` | TRUE | Detect clonal subpopulations |
| `beta_vega` | 0.5 | Standard segmentation granularity |
| `ClonalCN` | TRUE | Generate clonal CN profile |

## Examining Results

### Cell Classification Summary

```{r examine-results}
# View classification results
head(results)

# Summary statistics
table(results$class)

# Subclone distribution (if detected)
if("subclone" %in% colnames(results)) {
  table(results$subclone, useNA = "ifany")
}
```

### Output Files

The pipeline generates several output files:

```{r list-outputs}
# List all output files
list.files("./output", pattern = "MGH106")
```

| File Pattern | Description |
|--------------|-------------|
| `*heatmap.png` | CNA heatmap with classifications |
| `*_CNAmtx.RData` | CNA matrix for downstream analysis |
| `*_CN.seg` | Segmentation file |
| `*OncoHeat.png` | OncoPrint-style visualization |
| `*CloneTree.png` | Phylogenetic tree |

## Output Visualizations

SCEVAN generates several visualization files in the output directory:

### 1. Classification Heatmap (`*heatmap.png`)

The main heatmap shows copy number profiles across all cells:

- **Rows**: Genes ordered by genomic position
- **Columns**: Cells clustered by CNA profile
- **Color**: Blue = deletion, Red = amplification
- **Top bar**: Cell classification (tumor/normal)

### 2. Subclone Heatmap (`*heatmap_subclones.png`)

When subclones are detected, this heatmap colors cells by their subclone assignment.

### 3. Clonal Tree (`*CloneTree.png`)

The phylogenetic tree shows evolutionary relationships between subclones.

### 4. Consensus Plot (`*consensus.png`)

Compact visualization showing the consensus copy number profile for each subclone.

### 5. OncoPrint Visualization (`*OncoHeat.png`)

OncoPrint-style plot highlighting:

- **Subclone-specific alterations**: Present in only one subclone
- **Shared alterations**: Present in multiple (not all) subclones
- **Clonal alterations**: Present in all subclones

## Downstream Analysis

### Load CNA Matrix

```{r load-cna}
# Load the CNA matrix for custom analyses
load("./output/MGH106_CNAmtx.RData")
load("./output/MGH106_count_mtx_annot.RData")

dim(CNA_mtx_relat)
```

### Extract Region-Specific CN Values

```{r extract-region}
# Define a genomic region of interest (e.g., chromosome 7)
chr7_cn <- apply(
  CNA_mtx_relat[count_mtx_annot$seqnames == 7, ], 
  2, 
  mean
)

# Create data frame for plotting
cn_df <- data.frame(
  cell = names(chr7_cn),
  chr7_cn = chr7_cn,
  class = results[names(chr7_cn), "class"]
)

# Visualize
ggplot(cn_df, aes(x = class, y = chr7_cn, fill = class)) +
  geom_boxplot() +
  theme_minimal() +
  labs(
    title = "Chromosome 7 Copy Number by Cell Type",
    x = "Cell Classification",
    y = "Mean CN Ratio"
  ) +
  scale_fill_manual(values = c("tumor" = "#E74C3C", "normal" = "#3498DB"))
```

### Differential Expression in CNA Regions

```{r de-analysis}
# Identify genes in amplified regions
chr7_genes <- rownames(count_mtx_annot)[count_mtx_annot$seqnames == 7]

# Compare expression between tumor subclones
if("subclone" %in% colnames(results)) {
  tumor_cells <- rownames(results)[results$class == "tumor"]
  
  # Get subclone assignments
  subclone_1 <- rownames(results)[results$subclone == 1 & !is.na(results$subclone)]
  subclone_2 <- rownames(results)[results$subclone == 2 & !is.na(results$subclone)]
  
  # Calculate fold changes for chr7 genes
  if(length(subclone_1) > 0 & length(subclone_2) > 0) {
    expr_s1 <- rowMeans(count_mtx[chr7_genes, subclone_1])
    expr_s2 <- rowMeans(count_mtx[chr7_genes, subclone_2])
    
    fc <- log2((expr_s1 + 1) / (expr_s2 + 1))
    
    # Top differentially expressed genes
    head(sort(fc, decreasing = TRUE), 10)
  }
}
```

## Advanced Options

### Using Known Normal Cells

If you have prior knowledge of normal cells:

```{r known-normals}
# Specify known normal cell barcodes
known_normals <- c("cell_1", "cell_2", "cell_3")

results <- pipelineCNA(
  count_mtx,
  sample = "MGH106_custom",
  norm_cell = known_normals,
  FIXED_NORMAL_CELLS = FALSE  # Still allow automatic detection
)
```

### Adjusting Segmentation

For noisy data, increase `beta_vega`:

```{r adjust-beta}
# Coarser segmentation for noisy data
results_coarse <- pipelineCNA(
  count_mtx,
  sample = "MGH106_coarse",
  beta_vega = 1.0  # More conservative segmentation
)
```

### Adding Custom Gene Sets

```{r custom-genesets}
# Define custom normal cell signatures
custom_signatures <- list(
  Astrocytes = c("GFAP", "AQP4", "SLC1A2", "SLC1A3"),
  Oligodendrocytes = c("MBP", "MOG", "PLP1", "OLIG2"),
  Microglia = c("CX3CR1", "P2RY12", "TMEM119", "AIF1")
)

results <- pipelineCNA(
  count_mtx,
  sample = "MGH106_custom_sig",
  AdditionalGeneSets = custom_signatures,
  SCEVANsignatures = TRUE  # Also use built-in signatures
)
```

## Best Practices

### Quality Control Checklist

1. **Check cell coverage**: Ensure sufficient genes per cell
2. **Verify normal detection**: Review confident normal cells
3. **Validate subclones**: Check modularity scores
4. **Examine segmentation**: Adjust beta if needed

### Recommended Workflow

```{r workflow, eval=FALSE}
# 1. Initial run with defaults
results_default <- pipelineCNA(count_mtx, sample = "sample_v1")

# 2. Review outputs and classifications
table(results_default$class)

# 3. Adjust parameters if needed
results_tuned <- pipelineCNA(
  count_mtx, 
  sample = "sample_v2",
  beta_vega = 0.7  # Adjusted based on review
)

# 4. Final analysis
results_final <- pipelineCNA(
  count_mtx,
  sample = "sample_final",
  SUBCLONES = TRUE,
  plotTree = TRUE
)
```

## Troubleshooting

### Common Issues

| Issue | Solution |
|-------|----------|
| No normal cells detected | Add custom signatures or provide known normals |
| Too many/few segments | Adjust `beta_vega` parameter |
| Memory errors | Reduce `par_cores` or subsample data |
| Missing genes | Verify gene name format (Symbol vs Ensembl) |

## Session Info

```{r session-info, eval=TRUE}
sessionInfo()
```