--- title: "Statistical Framework and Algorithm" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_document: toc: true toc_depth: 3 toc_float: true theme: flatly highlight: tango mathjax: default vignette: > %\VignetteIndexEntry{Statistical Framework and Algorithm} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, fig.width = 8, fig.height = 6, fig.align = "center" ) ``` # Overview **tradeSeq** implements a flexible statistical framework based on **Generalized Additive Models (GAMs)** for trajectory-based differential expression analysis. This vignette provides a detailed explanation of the statistical methodology underlying the package. # The Statistical Model ## Negative Binomial GAM For each gene $g$, tradeSeq fits a negative binomial generalized additive model of the form: $$\log(\mu_{gc}) = \eta_{gc} = \sum_{l=1}^{L} \mathbf{1}_{cl} \cdot s_l(t_{cl}) + \mathbf{U}_c \boldsymbol{\beta} + O_c$$ Where: - $\mu_{gc}$: Expected count for gene $g$ in cell $c$ - $L$: Number of lineages in the trajectory - $\mathbf{1}_{cl}$: Indicator for cell $c$ belonging to lineage $l$ - $s_l(t_{cl})$: Smooth function of pseudotime for lineage $l$ - $t_{cl}$: Pseudotime of cell $c$ in lineage $l$ - $\mathbf{U}_c$: Design matrix for additional covariates - $O_c$: Offset term (typically log library size) ## Cubic Regression Splines The smooth functions $s_l(t)$ are modeled using **cubic regression splines** with a shared set of knots across lineages. This approach: - Provides flexibility to capture non-linear expression patterns - Ensures smoothness through penalization - Allows for identifiability by sharing basis functions ```{r spline-basis, echo=FALSE, fig.height=4} library(ggplot2) library(mgcv) # Generate example spline basis x <- seq(0, 1, length.out = 100) nknots <- 6 knots <- seq(0, 1, length.out = nknots) # Create design matrix sm <- smoothCon(s(x, bs = "cr", k = nknots), data = data.frame(x = x))[[1]] X <- sm$X # Plot basis functions df <- data.frame( x = rep(x, ncol(X)), y = c(X), basis = factor(rep(1:ncol(X), each = length(x))) ) ggplot(df, aes(x = x, y = y, color = basis)) + geom_line(linewidth = 1) + labs( title = "Cubic Regression Spline Basis Functions", x = "Pseudotime", y = "Basis Function Value", color = "Basis" ) + theme_minimal() + theme(legend.position = "right") ``` ## Knot Selection The number and placement of knots determine the flexibility of the fitted curves. tradeSeq provides: - **Default**: 6 knots, placed at quantiles of the pseudotime distribution - **`evaluateK()` function**: Helps select optimal number of knots using AIC ```{r eval-k-concept, echo=FALSE, fig.height=4} # Conceptual plot of knot evaluation k_values <- 3:10 aic_values <- c(1200, 1050, 950, 920, 910, 905, 908, 915) df_k <- data.frame(k = k_values, AIC = aic_values) ggplot(df_k, aes(x = k, y = AIC)) + geom_line(linewidth = 1, color = "#2C3E50") + geom_point(size = 3, color = "#E74C3C") + geom_vline(xintercept = 7, linetype = "dashed", color = "#27AE60") + annotate("text", x = 7.3, y = 1100, label = "Optimal k", color = "#27AE60") + labs( title = "Knot Selection via AIC", x = "Number of Knots (k)", y = "Akaike Information Criterion (AIC)" ) + theme_minimal() ``` # Statistical Tests tradeSeq implements several statistical tests based on **Wald statistics** computed from the fitted GAM coefficients. ## Association Test Tests whether gene expression changes significantly along pseudotime within each lineage. **Null Hypothesis**: Expression is constant along pseudotime $$H_0: s_l(t) = c \text{ for all } t$$ **Test Statistic**: Wald statistic comparing consecutive knot coefficients ```{r association-diagram, echo=FALSE, fig.height=3} library(ggplot2) # Simulated expression curves t <- seq(0, 1, length.out = 100) expr_assoc <- 2 + 3 * sin(2 * pi * t) + rnorm(100, 0, 0.3) expr_null <- rep(2, 100) + rnorm(100, 0, 0.3) df <- data.frame( pseudotime = rep(t, 2), expression = c(expr_assoc, expr_null), type = rep(c("Associated (p < 0.001)", "Not Associated (p = 0.8)"), each = 100) ) ggplot(df, aes(x = pseudotime, y = expression, color = type)) + geom_point(alpha = 0.5, size = 1) + geom_smooth(method = "loess", se = FALSE, linewidth = 1.5) + facet_wrap(~type) + labs( title = "Association Test: Expression vs Pseudotime", x = "Pseudotime", y = "Expression" ) + theme_minimal() + theme(legend.position = "none") + scale_color_manual(values = c("#E74C3C", "#3498DB")) ``` ## Differential End Test Tests whether genes are differentially expressed at the endpoints of different lineages. **Null Hypothesis**: Expression at endpoints is equal across lineages $$H_0: s_1(t_{max,1}) = s_2(t_{max,2})$$ ```{r diffend-diagram, echo=FALSE, fig.height=4} t1 <- seq(0, 1, length.out = 50) t2 <- seq(0, 0.8, length.out = 50) df <- data.frame( pseudotime = c(t1, t2), expression = c(2 + 2*t1 + rnorm(50, 0, 0.2), 2 + 0.5*t2 + rnorm(50, 0, 0.2)), lineage = rep(c("Lineage 1", "Lineage 2"), c(50, 50)) ) ggplot(df, aes(x = pseudotime, y = expression, color = lineage)) + geom_point(alpha = 0.6) + geom_smooth(method = "loess", se = FALSE, linewidth = 1.5) + geom_point(data = data.frame( pseudotime = c(1, 0.8), expression = c(4, 2.4), lineage = c("Lineage 1", "Lineage 2") ), size = 5, shape = 18) + annotate("segment", x = 0.85, xend = 0.95, y = 4, yend = 4, arrow = arrow(length = unit(0.2, "cm")), color = "#E74C3C") + annotate("segment", x = 0.85, xend = 0.95, y = 2.4, yend = 2.4, arrow = arrow(length = unit(0.2, "cm")), color = "#E74C3C") + annotate("text", x = 0.9, y = 3.2, label = "Compare\nEndpoints", color = "#E74C3C", size = 3) + labs( title = "Differential End Test", x = "Pseudotime", y = "Expression", color = "Lineage" ) + theme_minimal() + scale_color_manual(values = c("#3498DB", "#2ECC71")) ``` ## Pattern Test Tests whether genes have different expression patterns (shapes) between lineages. **Null Hypothesis**: Expression patterns are identical across lineages $$H_0: s_1(t) = s_2(t) \text{ for all } t$$ ```{r pattern-diagram, echo=FALSE, fig.height=4} t <- seq(0, 1, length.out = 100) df <- data.frame( pseudotime = rep(t, 2), expression = c(2 + 2*sin(2*pi*t), 2 + 2*cos(2*pi*t)), lineage = rep(c("Lineage 1", "Lineage 2"), each = 100) ) ggplot(df, aes(x = pseudotime, y = expression, color = lineage)) + geom_line(linewidth = 1.5) + labs( title = "Pattern Test: Different Expression Patterns", x = "Pseudotime", y = "Expression", color = "Lineage" ) + theme_minimal() + scale_color_manual(values = c("#3498DB", "#E74C3C")) ``` ## Early DE Test Tests for differential expression in a specific region of the trajectory (e.g., near the branching point). ```{r earlyde-diagram, echo=FALSE, fig.height=4} t <- seq(0, 1, length.out = 100) df <- data.frame( pseudotime = rep(t, 2), expression = c(2 + t + 2*t^2, 2 + t - t^2), lineage = rep(c("Lineage 1", "Lineage 2"), each = 100) ) ggplot(df, aes(x = pseudotime, y = expression, color = lineage)) + geom_line(linewidth = 1.5) + annotate("rect", xmin = 0.2, xmax = 0.5, ymin = -Inf, ymax = Inf, alpha = 0.2, fill = "#F39C12") + annotate("text", x = 0.35, y = 4.5, label = "Test Region", color = "#F39C12", fontface = "bold") + labs( title = "Early DE Test: Region-Specific Testing", x = "Pseudotime", y = "Expression", color = "Lineage" ) + theme_minimal() + scale_color_manual(values = c("#3498DB", "#E74C3C")) ``` # Wald Test Framework All tests in tradeSeq are based on the **Wald test** framework: $$W = (\mathbf{L}\hat{\boldsymbol{\beta}})^T [\mathbf{L}\hat{\mathbf{V}}\mathbf{L}^T]^{-1} (\mathbf{L}\hat{\boldsymbol{\beta}})$$ Where: - $\hat{\boldsymbol{\beta}}$: Estimated coefficients from the GAM - $\hat{\mathbf{V}}$: Estimated variance-covariance matrix - $\mathbf{L}$: Contrast matrix defining the hypothesis Under $H_0$, $W \sim \chi^2_r$ where $r$ is the rank of $\mathbf{L}$. # Fold Change Thresholds tradeSeq supports testing against a log2 fold change threshold: $$H_0: |\mathbf{L}\boldsymbol{\beta}| \leq \log(2^{\text{l2fc}})$$ This is useful when biological significance requires not just statistical significance but also a meaningful effect size. ```{r fc-threshold, echo=FALSE, fig.height=4} fc <- seq(-3, 3, length.out = 100) pval_no_thresh <- 2 * pnorm(-abs(fc)) pval_thresh <- 2 * pnorm(-pmax(0, abs(fc) - 1)) df <- data.frame( fc = rep(fc, 2), pval = c(pval_no_thresh, pval_thresh), type = rep(c("No Threshold", "l2fc = 1"), each = 100) ) ggplot(df, aes(x = fc, y = -log10(pval), color = type)) + geom_line(linewidth = 1.5) + geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray50") + annotate("text", x = 2, y = -log10(0.05) + 0.5, label = "p = 0.05") + labs( title = "Effect of Fold Change Threshold on P-values", x = "Log2 Fold Change", y = "-log10(p-value)", color = "Threshold" ) + theme_minimal() + scale_color_manual(values = c("#3498DB", "#E74C3C")) ``` # Multiple Conditions When analyzing data with multiple experimental conditions, tradeSeq fits condition-specific smoothers for each lineage: $$\log(\mu_{gc}) = \sum_{l=1}^{L} \sum_{k=1}^{K} \mathbf{1}_{clk} \cdot s_{lk}(t_{cl}) + O_c$$ Where $K$ is the number of conditions. This allows testing for: - Condition-specific trajectory effects - Differential expression between conditions within lineages # Computational Considerations ## Parallelization tradeSeq supports parallel computation through the **BiocParallel** framework: ```{r parallel-example, eval=FALSE} library(BiocParallel) # Use multiple cores sce <- fitGAM(counts, sds = crv, parallel = TRUE, BPPARAM = MulticoreParam(workers = 4)) ``` ## Memory Efficiency For large datasets, consider: 1. Filtering lowly expressed genes before fitting 2. Using sparse matrix input 3. Fitting models in batches # References Van den Berge K, Roux de Bézieux H, Street K, Saelens W, Cannoodt R, Saeys Y, Dudoit S, Clement L. (2020). **Trajectory-based differential expression analysis for single-cell sequencing data.** *Nature Communications*, 11:1201. [DOI: 10.1038/s41467-020-14766-3](https://doi.org/10.1038/s41467-020-14766-3) # Session Info ```{r session-info} sessionInfo() ```