---
title: "Algorithm Principles"
author: "Zaoqu Liu, Robert K. Suter, Nagi G. Ayad"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Algorithm Principles}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 8,
  fig.height = 6,
  eval = TRUE
)
library(ggplot2)
library(dplyr)
```

## Overview

scFOCAL implements a multi-step computational framework that integrates single-cell transcriptomics with pharmacological perturbation data. This vignette details the mathematical principles underlying each algorithm.

## 1. Drug-Cell Connectivity Score

### Concept

The core innovation of scFOCAL is the **Drug-Cell Connectivity Score**, which quantifies the transcriptional relationship between individual cells and drug perturbation signatures.

### Mathematical Formulation

For each cell $c$ and drug $d$, we compute the Spearman rank correlation:

$$\rho_{c,d} = 1 - \frac{6 \sum_{i=1}^{n} (R_{x_i} - R_{y_i})^2}{n(n^2 - 1)}$$

Where:

- $R_{x_i}$ = rank of gene $i$ expression in cell $c$
- $R_{y_i}$ = rank of gene $i$ in drug signature $d$
- $n$ = number of overlapping genes

### Interpretation

```{r connectivity-concept, echo=FALSE, fig.height=4, fig.width=7}
# Visualization of connectivity concept
set.seed(42)
df <- data.frame(
  Connectivity = seq(-1, 1, length.out = 100),
  Density = dnorm(seq(-1, 1, length.out = 100), mean = 0, sd = 0.3)
)

ggplot(df, aes(x = Connectivity, y = Density)) +
  geom_area(fill = "steelblue", alpha = 0.6) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray40") +
  annotate("text", x = -0.6, y = 1.2, label = "Sensitive\n(Discordant)", 
           color = "#2166AC", fontface = "bold", size = 4) +
  annotate("text", x = 0.6, y = 1.2, label = "Resistant\n(Concordant)", 
           color = "#B2182B", fontface = "bold", size = 4) +
  annotate("segment", x = -0.8, xend = -0.3, y = 0.8, yend = 0.8,
           arrow = arrow(length = unit(0.2, "cm")), color = "#2166AC") +
  annotate("segment", x = 0.3, xend = 0.8, y = 0.8, yend = 0.8,
           arrow = arrow(length = unit(0.2, "cm")), color = "#B2182B") +
  labs(
    title = "Drug-Cell Connectivity Score Distribution",
    subtitle = "Negative values indicate drug sensitivity, positive values indicate resistance",
    x = "Connectivity Score (ρ)",
    y = "Density"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 10, color = "gray40")
  )
```

| Score Range | Interpretation | Biological Meaning |
|-------------|----------------|-------------------|
| ρ < -0.3 | Strong discordance | High drug sensitivity |
| -0.3 ≤ ρ < 0 | Moderate discordance | Moderate sensitivity |
| 0 ≤ ρ < 0.3 | Moderate concordance | Moderate resistance |
| ρ ≥ 0.3 | Strong concordance | High drug resistance |

## 2. Fisher's Z-Transformation

### Purpose

To enable statistical comparison of correlation coefficients across conditions, we apply Fisher's Z-transformation:

$$Z = \frac{1}{2} \ln\left(\frac{1+\rho}{1-\rho}\right) = \text{arctanh}(\rho)$$

### Properties

```{r fisher-z-demo, fig.height=4, fig.width=7}
# Demonstrate Fisher Z transformation
rho <- seq(-0.95, 0.95, by = 0.01)
z <- 0.5 * log((1 + rho) / (1 - rho))

df <- data.frame(rho = rho, z = z)

ggplot(df, aes(x = rho, y = z)) +
  geom_line(color = "#0072B2", linewidth = 1.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(
    title = "Fisher's Z-Transformation",
    subtitle = expression(Z == frac(1,2) * ln * bgroup("(", frac(1+rho, 1-rho), ")")),
    x = expression("Correlation coefficient (" * rho * ")"),
    y = "Fisher's Z"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12)
  )
```

**Key advantages:**

1. **Normalizes the sampling distribution** - Z-values follow an approximately normal distribution
2. **Stabilizes variance** - Variance becomes approximately constant across different correlation values
3. **Enables parametric testing** - Allows use of standard statistical tests (t-tests, ANOVA)

### Variance Property

The variance of Z is approximately:

$$\text{Var}(Z) \approx \frac{1}{n-3}$$

Where $n$ is the number of genes used in correlation.

## 3. Differential Connectivity Analysis

### Linear Model Framework

scFOCAL uses limma (Linear Models for Microarray Data) for differential connectivity analysis:

$$Z_{ij} = \mu + \beta_1 \cdot \text{Subject}_j + \beta_2 \cdot \text{Sensitivity}_i + \epsilon_{ij}$$

Where:

- $Z_{ij}$ = Z-transformed connectivity for cell $i$ in subject $j$
- $\text{Subject}_j$ = blocking factor for subject/replicate
- $\text{Sensitivity}_i$ = binary indicator (sensitive vs resistant)
- $\epsilon_{ij}$ = residual error

### Empirical Bayes Moderation

limma applies empirical Bayes moderation to improve variance estimates:

```{r ebayes-concept, echo=FALSE, fig.height=4, fig.width=7}
# Conceptual visualization of empirical Bayes
set.seed(123)
n_genes <- 1000
true_var <- rgamma(n_genes, shape = 4, rate = 2)
observed_var <- rchisq(n_genes, df = 5) * true_var / 5

df <- data.frame(
  Observed = observed_var,
  Moderated = (true_var + mean(observed_var)) / 2
)

ggplot() +
  geom_density(data = df, aes(x = Observed, fill = "Observed"), alpha = 0.5) +
  geom_density(data = df, aes(x = Moderated, fill = "Moderated"), alpha = 0.5) +
  scale_fill_manual(values = c("Observed" = "#E69F00", "Moderated" = "#56B4E9"),
                    name = "Variance") +
  labs(
    title = "Empirical Bayes Variance Moderation",
    subtitle = "Borrowing information across compounds stabilizes variance estimates",
    x = "Variance Estimate",
    y = "Density"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "top"
  )
```

The moderated t-statistic provides:

- **Increased statistical power** for compounds with few cells
- **Robust inference** even with small sample sizes
- **Proper multiple testing correction** via FDR adjustment

## 4. Disease Signature Reversal Score

### Concept

The reversal score quantifies how effectively a drug can reverse disease-associated gene expression changes.

### Algorithm

For disease signature $D$ and drug signature $d$:

$$\text{Reversal Score} = \frac{N_{\text{discordant}}}{N_{\text{concordant}}}$$

Where:

- $N_{\text{discordant}}$ = genes where $\text{sign}(D_g) \neq \text{sign}(d_g)$
- $N_{\text{concordant}}$ = genes where $\text{sign}(D_g) = \text{sign}(d_g)$

```{r reversal-concept, echo=FALSE, fig.height=4, fig.width=7}
# Illustration of reversal scoring
df <- data.frame(
  Gene = paste0("Gene", 1:10),
  Disease = c(2.1, -1.5, 1.8, -0.9, 2.5, -1.2, 0.7, -2.0, 1.3, -0.6),
  Drug = c(-1.8, 1.2, -1.5, 0.8, -2.2, 1.0, -0.5, 1.7, -1.1, 0.4)
)

df$Interaction <- ifelse(df$Disease * df$Drug < 0, "Discordant\n(Reversal)", "Concordant")

ggplot(df, aes(x = Disease, y = Drug, color = Interaction)) +
  geom_point(size = 4) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  geom_text(aes(label = Gene), hjust = -0.2, vjust = -0.5, size = 3) +
  scale_color_manual(values = c("Discordant\n(Reversal)" = "#2166AC", 
                                 "Concordant" = "#B2182B")) +
  labs(
    title = "Disease Signature Reversal Concept",
    subtitle = "Discordant genes indicate therapeutic potential",
    x = "Disease Signature (log2FC)",
    y = "Drug Signature (log2FC)"
  ) +
  theme_minimal() +
  xlim(-3, 3) + ylim(-3, 3) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  )
```

### Interpretation

| Reversal Score | Interpretation |
|----------------|----------------|
| > 2.0 | Strong reversal potential |
| 1.0 - 2.0 | Moderate reversal |
| < 1.0 | Limited reversal |

## 5. MAST Differential Expression

### Model Structure

For disease signature computation, scFOCAL employs MAST (Model-based Analysis of Single-cell Transcriptomics):

$$\text{logit}(P(Z_g > 0)) = X\beta_D + W\alpha_D$$

$$E[Y_g | Z_g = 1] = X\beta_C + W\alpha_C$$

Where:

- $Z_g$ = indicator for gene detection
- $Y_g$ = continuous expression level
- $X$ = design matrix (cell type, condition)
- $W$ = cellular detection rate (technical covariate)

### Advantages for Single-Cell Data

1. **Handles zero-inflation** - Explicitly models dropout events
2. **Controls for technical variation** - Includes cellular detection rate
3. **Subject-level effects** - Can incorporate random effects for replicates

## 6. Computational Complexity

### Time Complexity Analysis

| Operation | Complexity | Typical Runtime |
|-----------|------------|-----------------|
| Drug-Cell Connectivity | O(n × m × g) | ~20 min for 10K cells |
| MAST Differential Expression | O(n × g) | ~5 min per comparison |
| Reversal Scoring | O(d × g) | < 1 min |
| Differential Connectivity | O(n × d) | ~2 min |

Where: n = cells, m = compounds (1679), g = genes (~1000), d = drug signatures

### Memory Requirements

- **Minimum**: 8 GB RAM
- **Recommended**: 16 GB RAM for datasets > 50K cells
- **Large datasets**: Consider chunked processing

## Summary

```{r summary-flowchart, echo=FALSE, fig.height=5, fig.width=8}
# Create a conceptual flowchart
library(ggplot2)

boxes <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = rep(2, 5),
  label = c("scRNA-seq\nData", "Disease\nSignatures", "Drug-Cell\nConnectivity", 
            "Z-Transform\n& limma", "Combination\nDiscovery"),
  color = c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#CC79A7")
)

arrows <- data.frame(
  x = c(1.3, 2.3, 3.3, 4.3),
  xend = c(1.7, 2.7, 3.7, 4.7),
  y = rep(2, 4),
  yend = rep(2, 4)
)

ggplot() +
  geom_rect(data = boxes, aes(xmin = x - 0.4, xmax = x + 0.4,
                               ymin = y - 0.5, ymax = y + 0.5, fill = color),
            color = "black", alpha = 0.8) +
  geom_text(data = boxes, aes(x = x, y = y, label = label), 
            size = 3.5, fontface = "bold") +
  geom_segment(data = arrows, aes(x = x, xend = xend, y = y, yend = yend),
               arrow = arrow(length = unit(0.3, "cm")), linewidth = 1) +
  scale_fill_identity() +
  theme_void() +
  labs(title = "scFOCAL Algorithm Pipeline") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16))
```

scFOCAL's algorithmic framework provides:

1. **Robust statistical inference** through Fisher Z-transformation and empirical Bayes
2. **Biological interpretability** via connectivity and reversal scores
3. **Scalability** to large single-cell datasets
4. **Integration** of orthogonal pharmacological and transcriptomic data

## References

1. Subramanian A, *et al*. A Next Generation Connectivity Map. *Cell* (2017)
2. Finak G, *et al*. MAST: a flexible statistical framework. *Genome Biology* (2015)
3. Ritchie ME, *et al*. limma powers differential expression. *Nucleic Acids Res* (2015)
4. Fisher RA. On the probable error of a coefficient of correlation. *Metron* (1921)

## Session Info

```{r session-info}
sessionInfo()
```