Statistical Framework and Algorithm

Overview

tradeSeq implements a flexible statistical framework based on Generalized Additive Models (GAMs) for trajectory-based differential expression analysis. This vignette provides a detailed explanation of the statistical methodology underlying the package.

The Statistical Model

Negative Binomial GAM

For each gene \(g\), tradeSeq fits a negative binomial generalized additive model of the form:

\[\log(\mu_{gc}) = \eta_{gc} = \sum_{l=1}^{L} \mathbf{1}_{cl} \cdot s_l(t_{cl}) + \mathbf{U}_c \boldsymbol{\beta} + O_c\]

Where:

  • \(\mu_{gc}\): Expected count for gene \(g\) in cell \(c\)
  • \(L\): Number of lineages in the trajectory
  • \(\mathbf{1}_{cl}\): Indicator for cell \(c\) belonging to lineage \(l\)
  • \(s_l(t_{cl})\): Smooth function of pseudotime for lineage \(l\)
  • \(t_{cl}\): Pseudotime of cell \(c\) in lineage \(l\)
  • \(\mathbf{U}_c\): Design matrix for additional covariates
  • \(O_c\): Offset term (typically log library size)

Cubic Regression Splines

The smooth functions \(s_l(t)\) are modeled using cubic regression splines with a shared set of knots across lineages. This approach:

  • Provides flexibility to capture non-linear expression patterns
  • Ensures smoothness through penalization
  • Allows for identifiability by sharing basis functions

Knot Selection

The number and placement of knots determine the flexibility of the fitted curves. tradeSeq provides:

  • Default: 6 knots, placed at quantiles of the pseudotime distribution
  • evaluateK() function: Helps select optimal number of knots using AIC

Statistical Tests

tradeSeq implements several statistical tests based on Wald statistics computed from the fitted GAM coefficients.

Association Test

Tests whether gene expression changes significantly along pseudotime within each lineage.

Null Hypothesis: Expression is constant along pseudotime \[H_0: s_l(t) = c \text{ for all } t\]

Test Statistic: Wald statistic comparing consecutive knot coefficients

Differential End Test

Tests whether genes are differentially expressed at the endpoints of different lineages.

Null Hypothesis: Expression at endpoints is equal across lineages \[H_0: s_1(t_{max,1}) = s_2(t_{max,2})\]

Pattern Test

Tests whether genes have different expression patterns (shapes) between lineages.

Null Hypothesis: Expression patterns are identical across lineages \[H_0: s_1(t) = s_2(t) \text{ for all } t\]

Early DE Test

Tests for differential expression in a specific region of the trajectory (e.g., near the branching point).

Wald Test Framework

All tests in tradeSeq are based on the Wald test framework:

\[W = (\mathbf{L}\hat{\boldsymbol{\beta}})^T [\mathbf{L}\hat{\mathbf{V}}\mathbf{L}^T]^{-1} (\mathbf{L}\hat{\boldsymbol{\beta}})\]

Where:

  • \(\hat{\boldsymbol{\beta}}\): Estimated coefficients from the GAM
  • \(\hat{\mathbf{V}}\): Estimated variance-covariance matrix
  • \(\mathbf{L}\): Contrast matrix defining the hypothesis

Under \(H_0\), \(W \sim \chi^2_r\) where \(r\) is the rank of \(\mathbf{L}\).

Fold Change Thresholds

tradeSeq supports testing against a log2 fold change threshold:

\[H_0: |\mathbf{L}\boldsymbol{\beta}| \leq \log(2^{\text{l2fc}})\]

This is useful when biological significance requires not just statistical significance but also a meaningful effect size.

Multiple Conditions

When analyzing data with multiple experimental conditions, tradeSeq fits condition-specific smoothers for each lineage:

\[\log(\mu_{gc}) = \sum_{l=1}^{L} \sum_{k=1}^{K} \mathbf{1}_{clk} \cdot s_{lk}(t_{cl}) + O_c\]

Where \(K\) is the number of conditions. This allows testing for:

  • Condition-specific trajectory effects
  • Differential expression between conditions within lineages

Computational Considerations

Parallelization

tradeSeq supports parallel computation through the BiocParallel framework:

library(BiocParallel)

# Use multiple cores
sce <- fitGAM(counts, sds = crv, 
              parallel = TRUE, 
              BPPARAM = MulticoreParam(workers = 4))

Memory Efficiency

For large datasets, consider:

  1. Filtering lowly expressed genes before fitting
  2. Using sparse matrix input
  3. Fitting models in batches

References

Van den Berge K, Roux de Bézieux H, Street K, Saelens W, Cannoodt R, Saeys Y, Dudoit S, Clement L. (2020). Trajectory-based differential expression analysis for single-cell sequencing data. Nature Communications, 11:1201. DOI: 10.1038/s41467-020-14766-3

Session Info

sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## Random number generation:
##  RNG:     Mersenne-Twister 
##  Normal:  Inversion 
##  Sample:  Rounding 
##  
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggplot2_4.0.3               mgcv_1.9-4                 
##  [3] nlme_3.1-169                slingshot_2.21.0           
##  [5] TrajectoryUtils_1.21.0      princurve_2.1.6            
##  [7] SingleCellExperiment_1.35.1 SummarizedExperiment_1.43.0
##  [9] Biobase_2.73.1              GenomicRanges_1.65.0       
## [11] Seqinfo_1.3.0               IRanges_2.47.1             
## [13] S4Vectors_0.51.2            BiocGenerics_0.59.3        
## [15] generics_0.1.4              MatrixGenerics_1.25.0      
## [17] matrixStats_1.5.0           RColorBrewer_1.1-3         
## [19] tradeSeq_1.13.13            knitr_1.51                 
## [21] rmarkdown_2.31             
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6        xfun_0.57           bslib_0.11.0       
##  [4] lattice_0.22-9      vctrs_0.7.3         tools_4.6.0        
##  [7] parallel_4.6.0      tibble_3.3.1        pkgconfig_2.0.3    
## [10] Matrix_1.7-5        S7_0.2.2            lifecycle_1.0.5    
## [13] compiler_4.6.0      farver_2.1.2        statmod_1.5.2      
## [16] codetools_0.2-20    htmltools_0.5.9     sys_3.4.3          
## [19] buildtools_1.0.0    sass_0.4.10         yaml_2.3.12        
## [22] pillar_1.11.1       jquerylib_0.1.4     BiocParallel_1.47.0
## [25] limma_3.69.1        DelayedArray_0.39.2 cachem_1.1.0       
## [28] viridis_0.6.5       abind_1.4-8         locfit_1.5-9.12    
## [31] tidyselect_1.2.1    digest_0.6.39       dplyr_1.2.1        
## [34] labeling_0.4.3      maketools_1.3.2     splines_4.6.0      
## [37] fastmap_1.2.0       grid_4.6.0          cli_3.6.6          
## [40] SparseArray_1.13.2  magrittr_2.0.5      S4Arrays_1.13.0    
## [43] withr_3.0.2         edgeR_4.11.1        scales_1.4.0       
## [46] XVector_0.53.0      igraph_2.3.1        gridExtra_2.3      
## [49] pbapply_1.7-4       evaluate_1.0.5      viridisLite_0.4.3  
## [52] rlang_1.2.0         Rcpp_1.1.1-1.1      glue_1.8.1         
## [55] jsonlite_2.0.0      R6_2.6.1