---
title: "Statistical Framework"
author: "Zaoqu Liu, Robert K. Suter, Nagi G. Ayad"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Statistical Framework}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 8,
  fig.height = 6,
  eval = TRUE
)
library(ggplot2)
library(dplyr)
library(tidyr)
```

## Introduction

This vignette provides a comprehensive overview of the statistical methods implemented in scFOCAL. Understanding these methods is essential for proper interpretation of results and experimental design.

## 1. Spearman Rank Correlation

### Theoretical Foundation

scFOCAL uses Spearman's rank correlation coefficient to measure the monotonic relationship between gene expression and drug signatures. Unlike Pearson correlation, Spearman is:

- **Robust to outliers**
- **Non-parametric** (no distributional assumptions)
- **Captures monotonic relationships** (not just linear)

### Formula

$$\rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2-1)}$$

Where $d_i = R(X_i) - R(Y_i)$ is the difference between ranks.

### Demonstration

```{r spearman-demo, fig.height=4, fig.width=8}
set.seed(42)

# Generate example data
n <- 50
x <- rnorm(n)
y_linear <- x + rnorm(n, sd = 0.5)
y_monotonic <- sign(x) * abs(x)^0.5 + rnorm(n, sd = 0.3)
y_outlier <- y_linear
y_outlier[c(1, 2)] <- c(10, -10)  # Add outliers

# Calculate correlations
results <- data.frame(
  Scenario = c("Linear + Noise", "Monotonic (Non-linear)", "With Outliers"),
  Pearson = c(cor(x, y_linear), cor(x, y_monotonic), cor(x, y_outlier)),
  Spearman = c(cor(x, y_linear, method = "spearman"),
               cor(x, y_monotonic, method = "spearman"),
               cor(x, y_outlier, method = "spearman"))
)

# Visualization
df_plot <- data.frame(
  x = rep(x, 3),
  y = c(y_linear, y_monotonic, y_outlier),
  Scenario = factor(rep(c("Linear + Noise", "Monotonic (Non-linear)", "With Outliers"), 
                        each = n))
)

ggplot(df_plot, aes(x = x, y = y)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
  facet_wrap(~Scenario, scales = "free_y") +
  labs(
    title = "Pearson vs Spearman Correlation",
    subtitle = "Spearman is more robust to outliers and non-linear relationships",
    x = "Variable X",
    y = "Variable Y"
  ) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"))
```

```{r correlation-table}
knitr::kable(results, digits = 3,
             caption = "Comparison of Pearson and Spearman correlations")
```

## 2. Fisher's Z-Transformation

### Why Transform?

The sampling distribution of correlation coefficients is:

1. **Skewed** near ±1
2. **Variance depends on the true correlation**
3. **Not suitable for standard parametric tests**

Fisher's Z-transformation addresses these issues.

### Statistical Properties

```{r fisher-properties, fig.height=5, fig.width=8}
# Simulate sampling distributions
set.seed(123)
n_samples <- 1000
sample_size <- 30

simulate_cors <- function(true_rho, n, n_samples) {
  cors <- sapply(1:n_samples, function(i) {
    sigma <- matrix(c(1, true_rho, true_rho, 1), 2, 2)
    data <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = sigma)
    cor(data[,1], data[,2])
  })
  return(cors)
}

# Different true correlations
rhos <- c(0, 0.5, 0.8)
results_list <- lapply(rhos, function(r) {
  cors <- simulate_cors(r, sample_size, n_samples)
  z <- 0.5 * log((1 + cors) / (1 - cors))
  data.frame(
    Correlation = cors,
    FisherZ = z,
    TrueRho = paste("ρ =", r)
  )
})

df_all <- do.call(rbind, results_list)

# Plot correlation distributions
p1 <- ggplot(df_all, aes(x = Correlation, fill = TrueRho)) +
  geom_density(alpha = 0.5) +
  labs(title = "A) Raw Correlation Distributions",
       x = "Sample Correlation", y = "Density") +
  theme_minimal() +
  theme(legend.position = "top")

# Plot Fisher Z distributions
p2 <- ggplot(df_all, aes(x = FisherZ, fill = TrueRho)) +
  geom_density(alpha = 0.5) +
  labs(title = "B) Fisher Z Distributions",
       x = "Fisher's Z", y = "Density") +
  theme_minimal() +
  theme(legend.position = "top")

gridExtra::grid.arrange(p1, p2, ncol = 2)
```

**Key observation**: After transformation, distributions are:

- More symmetric
- Have similar variance
- Suitable for combining across studies

### Standard Error

The standard error of Fisher's Z is approximately:

$$SE(Z) = \frac{1}{\sqrt{n-3}}$$

This enables construction of confidence intervals:

$$Z \pm 1.96 \times SE(Z)$$

## 3. limma Linear Models

### Design Matrix Construction

For differential connectivity analysis, scFOCAL constructs a design matrix incorporating:

```{r design-matrix}
# Example design matrix
n_cells <- 20
cell_data <- data.frame(
  Cell = paste0("Cell_", 1:n_cells),
  Subject = rep(c("S1", "S2", "S3", "S4"), each = 5),
  Sensitivity = c(rep("Sensitive", 10), rep("Resistant", 10))
)

# Create design matrix
design <- model.matrix(~ Subject + Sensitivity, data = cell_data)
colnames(design) <- c("Intercept", "Subject_S2", "Subject_S3", "Subject_S4", "Sensitive")

knitr::kable(head(design, 10), caption = "Example Design Matrix (first 10 cells)")
```

### Model Fitting

The linear model for each compound $c$:

$$Z_{ic} = \beta_0 + \sum_j \beta_j \text{Subject}_{ij} + \beta_S \text{Sensitivity}_i + \epsilon_{ic}$$

### Empirical Bayes

```{r ebayes-simulation, fig.height=4, fig.width=7}
# Simulate the benefit of empirical Bayes
set.seed(42)
n_compounds <- 500
true_effects <- rnorm(n_compounds, mean = 0, sd = 0.5)
n_cells_per_group <- 10

# Simulate with different sample sizes
simulate_effect <- function(true_effect, n) {
  sensitive <- rnorm(n, mean = true_effect, sd = 1)
  resistant <- rnorm(n, mean = 0, sd = 1)
  t.test(sensitive, resistant)$statistic
}

t_stats <- sapply(true_effects, simulate_effect, n = n_cells_per_group)

# Simple posterior (shrinkage toward zero)
shrinkage <- 0.7  # Simplified representation
moderated_t <- shrinkage * t_stats

df_ebayes <- data.frame(
  True = true_effects,
  Ordinary = t_stats,
  Moderated = moderated_t
)

ggplot(df_ebayes) +
  geom_point(aes(x = True, y = Ordinary), alpha = 0.3, color = "red") +
  geom_point(aes(x = True, y = Moderated), alpha = 0.3, color = "blue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(
    title = "Empirical Bayes Moderation Effect",
    subtitle = "Blue (moderated) estimates are shrunk toward zero, reducing false positives",
    x = "True Effect Size",
    y = "Estimated t-statistic"
  ) +
  theme_minimal() +
  annotate("text", x = -1.5, y = 3, label = "Ordinary t-test", color = "red") +
  annotate("text", x = -1.5, y = 2.5, label = "Moderated t-test", color = "blue")
```

## 4. Multiple Testing Correction

### The Problem

With 1,679 compounds tested simultaneously, controlling false positives is critical.

```{r multiple-testing, fig.height=4, fig.width=7}
# Demonstrate multiple testing
set.seed(42)
n_tests <- 1679
n_true_effects <- 50

# Generate p-values
true_p <- c(runif(n_true_effects, 0, 0.01),  # True effects
            runif(n_tests - n_true_effects, 0, 1))  # Null

# Apply corrections
p_bonferroni <- p.adjust(true_p, method = "bonferroni")
p_fdr <- p.adjust(true_p, method = "fdr")

results_mult <- data.frame(
  Raw = true_p,
  Bonferroni = p_bonferroni,
  FDR = p_fdr,
  TrueEffect = c(rep("Yes", n_true_effects), 
                  rep("No", n_tests - n_true_effects))
)

# Count discoveries at alpha = 0.05
discoveries <- data.frame(
  Method = c("Raw p-value", "Bonferroni", "FDR (BH)"),
  `Total Discoveries` = c(sum(true_p < 0.05),
                           sum(p_bonferroni < 0.05),
                           sum(p_fdr < 0.05)),
  `True Positives` = c(sum(true_p[1:n_true_effects] < 0.05),
                        sum(p_bonferroni[1:n_true_effects] < 0.05),
                        sum(p_fdr[1:n_true_effects] < 0.05)),
  check.names = FALSE
)
discoveries$`False Positives` <- discoveries$`Total Discoveries` - discoveries$`True Positives`

knitr::kable(discoveries, 
             caption = "Multiple Testing Correction Comparison (α = 0.05)")
```

### FDR Control

scFOCAL uses the **Benjamini-Hochberg procedure** for FDR control:

1. Order p-values: $p_{(1)} \leq p_{(2)} \leq ... \leq p_{(m)}$
2. Find largest $k$ where $p_{(k)} \leq \frac{k}{m} \cdot \alpha$
3. Reject hypotheses $H_{(1)}, ..., H_{(k)}$

This controls the **expected proportion of false discoveries** at level $\alpha$.

## 5. Effect Size Interpretation

### Log Fold Change in Z-space

```{r effect-sizes, fig.height=4, fig.width=7}
# Effect size interpretation guide
effect_sizes <- data.frame(
  logFC = c(-0.5, -0.3, -0.1, 0, 0.1, 0.3, 0.5),
  Interpretation = c("Strong negative", "Moderate negative", "Weak negative",
                     "No effect", "Weak positive", "Moderate positive", "Strong positive"),
  Meaning = c("Much lower connectivity in resistant",
              "Lower connectivity in resistant",
              "Slightly lower in resistant",
              "No difference",
              "Slightly higher in resistant",
              "Higher connectivity in resistant",
              "Much higher connectivity in resistant")
)

knitr::kable(effect_sizes, 
             caption = "Interpretation of logFC values in differential connectivity")
```

### Cohen's d Equivalent

For practical significance assessment:

$$d = \frac{\text{logFC}}{\text{pooled SD}} \approx \frac{\text{logFC}}{0.3}$$

| Cohen's d | Interpretation |
|-----------|----------------|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |

## 6. Power Analysis

### Sample Size Considerations

```{r power-analysis, fig.height=4, fig.width=7}
# Power calculation
power_calc <- function(n, effect, alpha = 0.05) {
  se <- sqrt(2/n)
  z_crit <- qnorm(1 - alpha/2)
  power <- pnorm(effect/se - z_crit) + pnorm(-effect/se - z_crit)
  return(power)
}

n_range <- seq(10, 500, by = 10)
effects <- c(0.1, 0.2, 0.3, 0.5)

power_df <- expand.grid(n = n_range, effect = effects)
power_df$power <- mapply(power_calc, power_df$n, power_df$effect)
power_df$effect <- factor(power_df$effect)

ggplot(power_df, aes(x = n, y = power, color = effect)) +
  geom_line(linewidth = 1.2) +
  geom_hline(yintercept = 0.8, linetype = "dashed", color = "gray50") +
  scale_color_brewer(palette = "Set1", name = "Effect Size\n(logFC)") +
  labs(
    title = "Statistical Power vs Sample Size",
    subtitle = "Dashed line indicates 80% power threshold",
    x = "Number of Cells per Group",
    y = "Statistical Power"
  ) +
  theme_minimal() +
  annotate("text", x = 400, y = 0.85, label = "80% power", color = "gray40")
```

### Recommended Sample Sizes

| Effect Size (logFC) | Min Cells per Group | Recommended |
|---------------------|---------------------|-------------|
| 0.5 (large) | 30 | 50 |
| 0.3 (medium) | 80 | 100 |
| 0.2 (small) | 200 | 250 |
| 0.1 (very small) | 800 | 1000 |

## 7. Quality Control Metrics

### Gene Coverage

scFOCAL reports the overlap between your data and L1000 genes:

```{r gene-coverage, fig.height=3, fig.width=6}
# Example gene coverage visualization
coverage_data <- data.frame(
  Category = c("L1000 Genes (978)", "In Your Data", "Used for Analysis"),
  Count = c(978, 850, 850),
  Percentage = c(100, 86.9, 86.9)
)

ggplot(coverage_data, aes(x = Category, y = Count, fill = Category)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = paste0(Count, " (", round(Percentage, 1), "%)")),
            vjust = -0.5) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Gene Coverage Quality Control",
    x = "",
    y = "Number of Genes"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  ylim(0, 1100)
```

### Connectivity Distribution QC

Healthy connectivity distributions should be:

- Centered near zero
- Roughly symmetric
- Not strongly bimodal

## Summary Statistics Table

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| Connectivity | $\rho_s(cell, drug)$ | Transcriptional similarity |
| Fisher's Z | $\frac{1}{2}\ln\frac{1+\rho}{1-\rho}$ | Normalized correlation |
| logFC | $\bar{Z}_{resistant} - \bar{Z}_{sensitive}$ | Differential connectivity |
| adj.P.Val | BH-corrected p-value | Statistical significance |
| Reversal Score | $\frac{N_{disc}}{N_{conc}}$ | Therapeutic potential |

## Best Practices

1. **Ensure adequate cell numbers** per condition (>50 cells recommended)
2. **Check gene coverage** before analysis (>70% L1000 overlap)
3. **Use FDR-adjusted p-values** for discovery
4. **Consider effect sizes** alongside p-values
5. **Validate computationally identified candidates** experimentally

## Session Info

```{r session-info}
sessionInfo()
```