---
title: "Statistical Framework and Algorithm"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    theme: flatly
    highlight: tango
    mathjax: default
vignette: >
  %\VignetteIndexEntry{Statistical Framework and Algorithm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 8,
  fig.height = 6,
  fig.align = "center"
)
```

# Overview

**tradeSeq** implements a flexible statistical framework based on **Generalized Additive Models (GAMs)** for trajectory-based differential expression analysis. This vignette provides a detailed explanation of the statistical methodology underlying the package.

# The Statistical Model

## Negative Binomial GAM

For each gene $g$, tradeSeq fits a negative binomial generalized additive model of the form:

$$\log(\mu_{gc}) = \eta_{gc} = \sum_{l=1}^{L} \mathbf{1}_{cl} \cdot s_l(t_{cl}) + \mathbf{U}_c \boldsymbol{\beta} + O_c$$

Where:

- $\mu_{gc}$: Expected count for gene $g$ in cell $c$
- $L$: Number of lineages in the trajectory
- $\mathbf{1}_{cl}$: Indicator for cell $c$ belonging to lineage $l$
- $s_l(t_{cl})$: Smooth function of pseudotime for lineage $l$
- $t_{cl}$: Pseudotime of cell $c$ in lineage $l$
- $\mathbf{U}_c$: Design matrix for additional covariates
- $O_c$: Offset term (typically log library size)

## Cubic Regression Splines

The smooth functions $s_l(t)$ are modeled using **cubic regression splines** with a shared set of knots across lineages. This approach:

- Provides flexibility to capture non-linear expression patterns
- Ensures smoothness through penalization
- Allows for identifiability by sharing basis functions

```{r spline-basis, echo=FALSE, fig.height=4}
library(ggplot2)
library(mgcv)

# Generate example spline basis
x <- seq(0, 1, length.out = 100)
nknots <- 6
knots <- seq(0, 1, length.out = nknots)

# Create design matrix
sm <- smoothCon(s(x, bs = "cr", k = nknots), data = data.frame(x = x))[[1]]
X <- sm$X

# Plot basis functions
df <- data.frame(
  x = rep(x, ncol(X)),
  y = c(X),
  basis = factor(rep(1:ncol(X), each = length(x)))
)

ggplot(df, aes(x = x, y = y, color = basis)) +
  geom_line(linewidth = 1) +
  labs(
    title = "Cubic Regression Spline Basis Functions",
    x = "Pseudotime",
    y = "Basis Function Value",
    color = "Basis"
  ) +
  theme_minimal() +
  theme(legend.position = "right")
```

## Knot Selection

The number and placement of knots determine the flexibility of the fitted curves. tradeSeq provides:

- **Default**: 6 knots, placed at quantiles of the pseudotime distribution
- **`evaluateK()` function**: Helps select optimal number of knots using AIC

```{r eval-k-concept, echo=FALSE, fig.height=4}
# Conceptual plot of knot evaluation
k_values <- 3:10
aic_values <- c(1200, 1050, 950, 920, 910, 905, 908, 915)

df_k <- data.frame(k = k_values, AIC = aic_values)

ggplot(df_k, aes(x = k, y = AIC)) +
  geom_line(linewidth = 1, color = "#2C3E50") +
  geom_point(size = 3, color = "#E74C3C") +
  geom_vline(xintercept = 7, linetype = "dashed", color = "#27AE60") +
  annotate("text", x = 7.3, y = 1100, label = "Optimal k", color = "#27AE60") +
  labs(
    title = "Knot Selection via AIC",
    x = "Number of Knots (k)",
    y = "Akaike Information Criterion (AIC)"
  ) +
  theme_minimal()
```

# Statistical Tests

tradeSeq implements several statistical tests based on **Wald statistics** computed from the fitted GAM coefficients.

## Association Test

Tests whether gene expression changes significantly along pseudotime within each lineage.

**Null Hypothesis**: Expression is constant along pseudotime
$$H_0: s_l(t) = c \text{ for all } t$$

**Test Statistic**: Wald statistic comparing consecutive knot coefficients

```{r association-diagram, echo=FALSE, fig.height=3}
library(ggplot2)

# Simulated expression curves
t <- seq(0, 1, length.out = 100)
expr_assoc <- 2 + 3 * sin(2 * pi * t) + rnorm(100, 0, 0.3)
expr_null <- rep(2, 100) + rnorm(100, 0, 0.3)

df <- data.frame(
  pseudotime = rep(t, 2),
  expression = c(expr_assoc, expr_null),
  type = rep(c("Associated (p < 0.001)", "Not Associated (p = 0.8)"), each = 100)
)

ggplot(df, aes(x = pseudotime, y = expression, color = type)) +
  geom_point(alpha = 0.5, size = 1) +
  geom_smooth(method = "loess", se = FALSE, linewidth = 1.5) +
  facet_wrap(~type) +
  labs(
    title = "Association Test: Expression vs Pseudotime",
    x = "Pseudotime",
    y = "Expression"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_color_manual(values = c("#E74C3C", "#3498DB"))
```

## Differential End Test

Tests whether genes are differentially expressed at the endpoints of different lineages.

**Null Hypothesis**: Expression at endpoints is equal across lineages
$$H_0: s_1(t_{max,1}) = s_2(t_{max,2})$$

```{r diffend-diagram, echo=FALSE, fig.height=4}
t1 <- seq(0, 1, length.out = 50)
t2 <- seq(0, 0.8, length.out = 50)

df <- data.frame(
  pseudotime = c(t1, t2),
  expression = c(2 + 2*t1 + rnorm(50, 0, 0.2), 
                 2 + 0.5*t2 + rnorm(50, 0, 0.2)),
  lineage = rep(c("Lineage 1", "Lineage 2"), c(50, 50))
)

ggplot(df, aes(x = pseudotime, y = expression, color = lineage)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, linewidth = 1.5) +
  geom_point(data = data.frame(
    pseudotime = c(1, 0.8),
    expression = c(4, 2.4),
    lineage = c("Lineage 1", "Lineage 2")
  ), size = 5, shape = 18) +
  annotate("segment", x = 0.85, xend = 0.95, y = 4, yend = 4,
           arrow = arrow(length = unit(0.2, "cm")), color = "#E74C3C") +
  annotate("segment", x = 0.85, xend = 0.95, y = 2.4, yend = 2.4,
           arrow = arrow(length = unit(0.2, "cm")), color = "#E74C3C") +
  annotate("text", x = 0.9, y = 3.2, label = "Compare\nEndpoints", 
           color = "#E74C3C", size = 3) +
  labs(
    title = "Differential End Test",
    x = "Pseudotime",
    y = "Expression",
    color = "Lineage"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("#3498DB", "#2ECC71"))
```

## Pattern Test

Tests whether genes have different expression patterns (shapes) between lineages.

**Null Hypothesis**: Expression patterns are identical across lineages
$$H_0: s_1(t) = s_2(t) \text{ for all } t$$

```{r pattern-diagram, echo=FALSE, fig.height=4}
t <- seq(0, 1, length.out = 100)

df <- data.frame(
  pseudotime = rep(t, 2),
  expression = c(2 + 2*sin(2*pi*t), 2 + 2*cos(2*pi*t)),
  lineage = rep(c("Lineage 1", "Lineage 2"), each = 100)
)

ggplot(df, aes(x = pseudotime, y = expression, color = lineage)) +
  geom_line(linewidth = 1.5) +
  labs(
    title = "Pattern Test: Different Expression Patterns",
    x = "Pseudotime",
    y = "Expression",
    color = "Lineage"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("#3498DB", "#E74C3C"))
```

## Early DE Test

Tests for differential expression in a specific region of the trajectory (e.g., near the branching point).

```{r earlyde-diagram, echo=FALSE, fig.height=4}
t <- seq(0, 1, length.out = 100)

df <- data.frame(
  pseudotime = rep(t, 2),
  expression = c(2 + t + 2*t^2, 2 + t - t^2),
  lineage = rep(c("Lineage 1", "Lineage 2"), each = 100)
)

ggplot(df, aes(x = pseudotime, y = expression, color = lineage)) +
  geom_line(linewidth = 1.5) +
  annotate("rect", xmin = 0.2, xmax = 0.5, ymin = -Inf, ymax = Inf,
           alpha = 0.2, fill = "#F39C12") +
  annotate("text", x = 0.35, y = 4.5, label = "Test Region", 
           color = "#F39C12", fontface = "bold") +
  labs(
    title = "Early DE Test: Region-Specific Testing",
    x = "Pseudotime",
    y = "Expression",
    color = "Lineage"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("#3498DB", "#E74C3C"))
```

# Wald Test Framework

All tests in tradeSeq are based on the **Wald test** framework:

$$W = (\mathbf{L}\hat{\boldsymbol{\beta}})^T [\mathbf{L}\hat{\mathbf{V}}\mathbf{L}^T]^{-1} (\mathbf{L}\hat{\boldsymbol{\beta}})$$

Where:

- $\hat{\boldsymbol{\beta}}$: Estimated coefficients from the GAM
- $\hat{\mathbf{V}}$: Estimated variance-covariance matrix
- $\mathbf{L}$: Contrast matrix defining the hypothesis

Under $H_0$, $W \sim \chi^2_r$ where $r$ is the rank of $\mathbf{L}$.

# Fold Change Thresholds

tradeSeq supports testing against a log2 fold change threshold:

$$H_0: |\mathbf{L}\boldsymbol{\beta}| \leq \log(2^{\text{l2fc}})$$

This is useful when biological significance requires not just statistical significance but also a meaningful effect size.

```{r fc-threshold, echo=FALSE, fig.height=4}
fc <- seq(-3, 3, length.out = 100)
pval_no_thresh <- 2 * pnorm(-abs(fc))
pval_thresh <- 2 * pnorm(-pmax(0, abs(fc) - 1))

df <- data.frame(
  fc = rep(fc, 2),
  pval = c(pval_no_thresh, pval_thresh),
  type = rep(c("No Threshold", "l2fc = 1"), each = 100)
)

ggplot(df, aes(x = fc, y = -log10(pval), color = type)) +
  geom_line(linewidth = 1.5) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray50") +
  annotate("text", x = 2, y = -log10(0.05) + 0.5, label = "p = 0.05") +
  labs(
    title = "Effect of Fold Change Threshold on P-values",
    x = "Log2 Fold Change",
    y = "-log10(p-value)",
    color = "Threshold"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("#3498DB", "#E74C3C"))
```

# Multiple Conditions

When analyzing data with multiple experimental conditions, tradeSeq fits condition-specific smoothers for each lineage:

$$\log(\mu_{gc}) = \sum_{l=1}^{L} \sum_{k=1}^{K} \mathbf{1}_{clk} \cdot s_{lk}(t_{cl}) + O_c$$

Where $K$ is the number of conditions. This allows testing for:

- Condition-specific trajectory effects
- Differential expression between conditions within lineages

# Computational Considerations

## Parallelization

tradeSeq supports parallel computation through the **BiocParallel** framework:

```{r parallel-example, eval=FALSE}
library(BiocParallel)

# Use multiple cores
sce <- fitGAM(counts, sds = crv, 
              parallel = TRUE, 
              BPPARAM = MulticoreParam(workers = 4))
```

## Memory Efficiency

For large datasets, consider:

1. Filtering lowly expressed genes before fitting
2. Using sparse matrix input
3. Fitting models in batches

# References

Van den Berge K, Roux de Bézieux H, Street K, Saelens W, Cannoodt R, Saeys Y, Dudoit S, Clement L. (2020). **Trajectory-based differential expression analysis for single-cell sequencing data.** *Nature Communications*, 11:1201. [DOI: 10.1038/s41467-020-14766-3](https://doi.org/10.1038/s41467-020-14766-3)

# Session Info

```{r session-info}
sessionInfo()
```