---
title: "Algorithm and Methodology"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_caption: true
vignette: >
  %\VignetteIndexEntry{Algorithm and Methodology}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  warning = FALSE,
  message = FALSE
)
```

## Overview

scPharm employs a multi-step computational pipeline to identify pharmacological subpopulations at single-cell resolution. This vignette describes the underlying algorithms and statistical methods.

```{r load-packages, echo=FALSE}
library(ggplot2)
library(dplyr)
```

## Workflow Architecture

```{r workflow-diagram, echo=FALSE, fig.cap="scPharm analytical workflow", fig.width=10, fig.height=6}
# Create workflow diagram
workflow_data <- data.frame(
  step = factor(c("Input", "CNV Detection", "MCA Embedding", 
                  "Cell Signatures", "GSEA", "Classification"),
                levels = c("Input", "CNV Detection", "MCA Embedding", 
                          "Cell Signatures", "GSEA", "Classification")),
  x = c(1, 2, 3, 4, 5, 6),
  y = rep(1, 6),
  description = c(
    "scRNA-seq\nSeurat Object",
    "CopyKAT\nTumor Detection",
    "Multiple\nCorrespondence\nAnalysis",
    "Gene Set\nExtraction",
    "Drug Sensitivity\nEnrichment",
    "Sensitive/Resistant\nLabeling"
  )
)

ggplot(workflow_data, aes(x = x, y = y)) +
  geom_segment(aes(x = x, xend = x + 0.7, y = y, yend = y),
               arrow = arrow(length = unit(0.3, "cm")),
               data = workflow_data[1:5, ],
               color = "#3498db", size = 1.5) +
  geom_point(size = 20, color = "#2c3e50", shape = 21, fill = "#ecf0f1", stroke = 2) +
  geom_text(aes(label = step), vjust = -2.5, fontface = "bold", size = 4) +
  geom_text(aes(label = description), vjust = 1.5, size = 3, lineheight = 0.9) +
  theme_void() +
  theme(plot.margin = margin(40, 20, 40, 20)) +
  coord_cartesian(ylim = c(0.5, 1.5), xlim = c(0.5, 6.5))
```

## Step 1: Copy Number Variation Detection

For tissue samples containing mixed tumor and normal cells, scPharm employs an integrated CopyKAT-based algorithm to distinguish malignant cells.

### Algorithm Overview

The CNV detection follows these steps:

1. **Gene Annotation**: Map genes to chromosomal positions using hg20 annotations
2. **Expression Binning**: Aggregate expression into genomic bins (220 bins per chromosome)
3. **Baseline Normalization**: Use normal cells or synthetic baseline as reference
4. **Smoothing**: Apply Kalman filtering to reduce noise
5. **Clustering**: Hierarchical clustering to identify aneuploid populations

### Mathematical Framework

For a gene $g$ at genomic position $p$, the relative copy number is estimated as:

$$\text{CNV}_{g} = \log_2\left(\frac{E_g^{tumor}}{E_g^{baseline}} + 1\right)$$

where $E_g$ represents the normalized expression level.

```{r cnv-illustration, echo=FALSE, fig.cap="CNV profile across chromosomes", fig.width=10, fig.height=4}
# Simulate CNV profile
set.seed(42)
n_bins <- 220 * 22
chrom_lengths <- rep(220, 22)
chrom_starts <- c(0, cumsum(chrom_lengths)[-22])

cnv_profile <- data.frame(
  bin = 1:n_bins,
  chromosome = rep(1:22, each = 220),
  cnv = rnorm(n_bins, 0, 0.3)
)

# Add amplifications and deletions
cnv_profile$cnv[500:600] <- cnv_profile$cnv[500:600] + 0.8  # Amp on chr3
cnv_profile$cnv[1500:1650] <- cnv_profile$cnv[1500:1650] - 0.6  # Del on chr8
cnv_profile$cnv[3500:3700] <- cnv_profile$cnv[3500:3700] + 1.0  # Amp on chr17

ggplot(cnv_profile, aes(x = bin, y = cnv)) +
  geom_point(aes(color = factor(chromosome %% 2)), size = 0.5, alpha = 0.6) +
  geom_hline(yintercept = c(-0.5, 0.5), linetype = "dashed", color = "red", alpha = 0.5) +
  geom_hline(yintercept = 0, color = "black") +
  scale_color_manual(values = c("#3498db", "#2ecc71"), guide = "none") +
  scale_x_continuous(breaks = chrom_starts + 110, labels = 1:22) +
  labs(x = "Chromosome", y = "Log2 CNV Ratio", 
       title = "Copy Number Variation Profile",
       subtitle = "Showing amplifications (chr3, chr17) and deletion (chr8)") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold"))
```

## Step 2: Multiple Correspondence Analysis (MCA)

MCA is used for dimensionality reduction, preserving the correspondence between genes and cells.

### Why MCA?

Unlike PCA which assumes continuous data, MCA is designed for categorical/count data and provides:

- Joint embedding of cells and genes
- Natural handling of sparse scRNA-seq data
- Interpretable gene-cell associations

### Mathematical Formulation

Given a normalized expression matrix $\mathbf{X}$ with $n$ cells and $p$ genes:

1. **Construct indicator matrix** $\mathbf{Z}$ by discretizing expression levels
2. **Compute correspondence matrix**: 
   $$\mathbf{S} = \mathbf{D}_r^{-1/2}(\mathbf{Z} - \mathbf{rc}^T)\mathbf{D}_c^{-1/2}$$
3. **Perform SVD**: $\mathbf{S} = \mathbf{U\Sigma V}^T$
4. **Extract coordinates**:
   - Cell coordinates: $\mathbf{F} = \mathbf{D}_r^{-1/2}\mathbf{U\Sigma}$
   - Gene coordinates: $\mathbf{G} = \mathbf{D}_c^{-1/2}\mathbf{V\Sigma}$

```{r mca-visualization, echo=FALSE, fig.cap="MCA embedding showing cell-gene correspondence", fig.width=8, fig.height=6}
# Simulate MCA embedding
set.seed(123)
n_points <- 300
cell_data <- data.frame(
  MCA1 = c(rnorm(100, -2, 0.8), rnorm(100, 2, 0.8), rnorm(100, 0, 1)),
  MCA2 = c(rnorm(100, 1, 0.8), rnorm(100, 1, 0.8), rnorm(100, -2, 0.8)),
  type = rep(c("Sensitive", "Resistant", "Other"), each = 100)
)

gene_data <- data.frame(
  MCA1 = c(rnorm(20, -2.5, 0.3), rnorm(20, 2.5, 0.3), rnorm(10, 0, 0.5)),
  MCA2 = c(rnorm(20, 1.5, 0.3), rnorm(20, 1.5, 0.3), rnorm(10, -2.5, 0.5)),
  gene_type = c(rep("Sensitivity genes", 20), rep("Resistance genes", 20), 
                rep("Other genes", 10))
)

ggplot() +
  geom_point(data = cell_data, aes(x = MCA1, y = MCA2, color = type), 
             size = 2, alpha = 0.6) +
  geom_point(data = gene_data, aes(x = MCA1, y = MCA2, shape = gene_type),
             size = 3, color = "black", alpha = 0.8) +
  scale_color_manual(values = c("#e74c3c", "#27ae60", "#95a5a6"),
                     name = "Cell Type") +
  scale_shape_manual(values = c(17, 15, 8), name = "Gene Type") +
  labs(x = "MCA Component 1", y = "MCA Component 2",
       title = "MCA Joint Embedding",
       subtitle = "Cells and genes in shared coordinate space") +
  theme_minimal() +
  theme(legend.position = "right",
        plot.title = element_text(face = "bold"))
```

### C++ Implementation

scPharm implements MCA using RcppArmadillo for computational efficiency:

```cpp
// Simplified MCA computation (actual implementation in src/mca.cpp)
arma::mat SparseMCAStep1(arma::sp_mat& X) {
    // Column sums and total
    arma::vec col_sum = arma::vec(arma::sum(X, 0).t());
    double total = arma::accu(X);
    
    // Compute standardized residuals
    arma::mat Z = compute_residuals(X, col_sum, total);
    
    return Z;
}
```

## Step 3: Cell Identity Signatures

For each cell, we extract a gene signature based on its position in MCA space.

### Distance Metric

The association between cell $i$ and gene $j$ is computed as:

$$d_{ij} = \sqrt{\sum_{k=1}^{K} (F_{ik} - G_{jk})^2}$$

where $K$ is the number of MCA components.

### Signature Extraction

For each cell, the top $n$ genes (by distance) form its identity signature:

$$\text{Signature}_i = \{g_1, g_2, ..., g_n\} \text{ where } d_{i,g_1} < d_{i,g_2} < ... < d_{i,g_n}$$

## Step 4: Gene Set Enrichment Analysis

GSEA quantifies the enrichment of drug sensitivity genes within each cell's signature.

### GSEA Algorithm

1. **Rank genes** by their association with the cell
2. **Calculate running enrichment score** (ES):
   $$ES = \max_{j}\left(\sum_{g_i \in S, i \leq j} \sqrt{\frac{N-|S|}{|S|}} - \sum_{g_i \notin S, i \leq j} \sqrt{\frac{|S|}{N-|S|}}\right)$$
3. **Normalize** to obtain NES (Normalized Enrichment Score)

```{r gsea-illustration, echo=FALSE, fig.cap="GSEA running enrichment score", fig.width=8, fig.height=5}
# Simulate GSEA running score
set.seed(456)
n_genes <- 500
gene_rank <- 1:n_genes

# Drug sensitivity genes concentrated at top
sensitivity_genes <- sort(sample(1:100, 30))

# Calculate running score
es <- numeric(n_genes)
hit_score <- sqrt((n_genes - 30) / 30)
miss_score <- sqrt(30 / (n_genes - 30))

running_score <- 0
for (i in 1:n_genes) {
  if (i %in% sensitivity_genes) {
    running_score <- running_score + hit_score
  } else {
    running_score <- running_score - miss_score
  }
  es[i] <- running_score
}

gsea_data <- data.frame(
  rank = gene_rank,
  enrichment_score = es,
  is_hit = gene_rank %in% sensitivity_genes
)

ggplot(gsea_data) +
  geom_line(aes(x = rank, y = enrichment_score), color = "#2980b9", size = 1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_segment(data = gsea_data[gsea_data$is_hit, ],
               aes(x = rank, xend = rank, y = -2, yend = -1.5),
               color = "#e74c3c", alpha = 0.7) +
  geom_hline(yintercept = max(es), color = "#27ae60", linetype = "dashed") +
  annotate("text", x = 400, y = max(es) + 0.5, 
           label = paste("ES =", round(max(es), 2)), 
           color = "#27ae60", fontface = "bold") +
  labs(x = "Gene Rank", y = "Enrichment Score",
       title = "GSEA Running Enrichment Score",
       subtitle = "Red bars indicate drug sensitivity genes") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))
```

### Drug Sensitivity Gene Sets

Drug sensitivity gene sets are derived from GDSC2 database:

- For each drug, genes are ranked by correlation with IC50
- Top correlated genes → **Sensitivity signature**
- Bottom correlated genes → **Resistance signature**

## Step 5: Cell Classification

### Gaussian Mixture Model

Cell NES values are modeled as a mixture of Gaussian distributions:

$$P(\text{NES}) = \pi_S \cdot \mathcal{N}(\mu_S, \sigma_S^2) + \pi_R \cdot \mathcal{N}(\mu_R, \sigma_R^2) + \pi_O \cdot \mathcal{N}(\mu_O, \sigma_O^2)$$

```{r gmm-illustration, echo=FALSE, fig.cap="Gaussian Mixture Model for cell classification", fig.width=8, fig.height=5}
# Simulate NES distribution
set.seed(789)
sensitive_nes <- rnorm(200, 1.5, 0.4)
resistant_nes <- rnorm(150, -1.2, 0.5)
other_nes <- rnorm(150, 0, 0.6)

all_nes <- c(sensitive_nes, resistant_nes, other_nes)
cell_type <- c(rep("Sensitive", 200), rep("Resistant", 150), rep("Other", 150))

nes_data <- data.frame(NES = all_nes, Type = cell_type)

# Density curves
x_seq <- seq(-3, 3.5, length.out = 200)
density_data <- data.frame(
  x = rep(x_seq, 3),
  y = c(dnorm(x_seq, 1.5, 0.4) * 200/500,
        dnorm(x_seq, -1.2, 0.5) * 150/500,
        dnorm(x_seq, 0, 0.6) * 150/500),
  Type = rep(c("Sensitive", "Resistant", "Other"), each = 200)
)

ggplot() +
  geom_histogram(data = nes_data, aes(x = NES, y = after_stat(density), fill = Type),
                 alpha = 0.4, bins = 40, position = "identity") +
  geom_line(data = density_data, aes(x = x, y = y, color = Type), size = 1.2) +
  geom_vline(xintercept = c(-0.5, 0.5), linetype = "dashed", alpha = 0.5) +
  scale_fill_manual(values = c("#95a5a6", "#e74c3c", "#27ae60")) +
  scale_color_manual(values = c("#95a5a6", "#e74c3c", "#27ae60")) +
  labs(x = "Normalized Enrichment Score (NES)", y = "Density",
       title = "Cell Classification via Gaussian Mixture Model",
       subtitle = "Dashed lines indicate classification thresholds") +
  theme_minimal() +
  theme(legend.position = "right",
        plot.title = element_text(face = "bold"))
```

### Classification Criteria

Cells are classified based on NES thresholds:

| Classification | Criterion |
|----------------|-----------|
| Sensitive | NES > threshold_s (default: 1.0) |
| Resistant | NES < threshold_r (default: -1.0) |
| Other | threshold_r ≤ NES ≤ threshold_s |

## Step 6: Drug Scoring Metrics

### Drug Prioritization Score (Dr)

$$Dr = 1 - \left(\alpha \cdot \frac{n_S}{n_{tumor}} + \beta \cdot \overline{\text{NES}}_S\right)$$

where:
- $n_S$ = number of sensitive cells
- $n_{tumor}$ = total tumor cells
- $\overline{\text{NES}}_S$ = mean NES of sensitive cells
- $\alpha, \beta$ = weighting parameters

**Interpretation**: Lower Dr indicates better drug candidate.

### Drug Side Effect Score (Dse)

$$Dse = \frac{n_{S,adj}}{n_{adj}} \cdot \overline{\text{NES}}_{S,adj}$$

where subscript $adj$ denotes adjacent (non-tumor) cells.

**Interpretation**: Higher Dse indicates more potential side effects.

## Computational Complexity

| Step | Complexity | Notes |
|------|------------|-------|
| CNV Detection | O(n × g) | n = cells, g = genes |
| MCA | O(min(n,g)³) | SVD computation |
| Cell Signatures | O(n × k) | k = MCA components |
| GSEA | O(n × d × g) | d = drugs |
| Classification | O(n × d) | GMM fitting |

## References

1. Gao R, et al. (2021). Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. *Nature Biotechnology*.

2. Subramanian A, et al. (2005). Gene set enrichment analysis: A knowledge-based approach. *PNAS*.

3. Yang W, et al. (2013). Genomics of Drug Sensitivity in Cancer (GDSC). *Nucleic Acids Research*.

4. Cortal A, et al. (2021). Gene signature extraction and cell identity recognition at the single-cell level with CelliD. *Nature Biotechnology*.

## Session Info

```{r session-info}
sessionInfo()
```