Algorithm and Statistical Methods

Overview

This vignette describes the statistical methods and algorithms implemented in iTALK for intercellular communication analysis.

Ligand-Receptor Database

Database Structure

iTALK includes a curated database of 2,649 ligand-receptor pairs categorized into four functional groups:

library(iTALK)

# Access the internal database
db <- iTALK:::database

cat("Total L-R pairs:", nrow(db), "\n\n")
#> Total L-R pairs: 2649
cat("Categories:\n")
#> Categories:
print(table(db$Classification))
#> 
#>    checkpoint      cytokine growth factor         other 
#>            32           327           227          2063

Database Sources

The database integrates information from:

  1. Cytokines - Zhang et al., 2007; Cameron et al., 2000-2013
  2. Checkpoint molecules - Auslander et al., 2018
  3. Ligand-receptor pairs - Ramilowski et al., Nature Communications, 2015
# Database columns
cat("Database columns:", paste(colnames(db), collapse = ", "), "\n\n")
#> Database columns: Pair.Name, Ligand.ApprovedSymbol, Ligand.Name, Receptor.ApprovedSymbol, Receptor.Name, Classification

# Example entries
head(db[, c("Ligand.ApprovedSymbol", "Receptor.ApprovedSymbol", "Classification")])
#>   Ligand.ApprovedSymbol Receptor.ApprovedSymbol Classification
#> 1                   A2M                    LRP1          other
#> 2                 AANAT                  MTNR1A          other
#> 3                 AANAT                  MTNR1B          other
#> 4                   ACE                   AGTR2          other
#> 5                   ACE                  BDKRB2          other
#> 6                ADAM10                     AXL          other

Statistical Methods for Differential Expression

Method Comparison

iTALK supports seven statistical methods for identifying differentially expressed genes:

Method Model Best For Reference
Wilcox Wilcoxon rank-sum General purpose -
DESeq2 Negative binomial Bulk RNA-seq Love et al., 2014
edgeR Negative binomial + EB Bulk RNA-seq McCarthy et al., 2012
MAST Two-part hurdle scRNA-seq Finak et al., 2015
monocle Census-based Trajectory analysis Qiu et al., 2017
SCDE Bayesian scRNA-seq Kharchenko et al., 2014
DEsingle Zero-inflated NB Dropout modeling Miao et al., 2018

Wilcoxon Rank-Sum Test

The default method uses the Wilcoxon rank-sum test (Mann-Whitney U test):

\[W = \sum_{i=1}^{n_1} R_i - \frac{n_1(n_1+1)}{2}\]

Where \(R_i\) is the rank of observation \(i\) in the combined sample.

Log Fold Change Calculation:

For raw counts: \[\log_2 FC = \log_2 \frac{\bar{x}_1 + \epsilon}{\bar{x}_2 + \epsilon}\]

Where \(\epsilon = 10^{-9}\) prevents division by zero.

For log-transformed data: \[\log_2 FC = \bar{x}_1 - \bar{x}_2\]

# Example: Wilcoxon test
deg_result <- DEG(
  data = expression_data,
  method = "Wilcox",
  q_cut = 0.05,
  contrast = c("Treatment", "Control")
)

DESeq2 Model

DESeq2 uses a negative binomial generalized linear model:

\[K_{ij} \sim NB(\mu_{ij}, \alpha_i)\]

\[\mu_{ij} = s_j q_{ij}\]

\[\log_2 q_{ij} = \sum_r x_{jr} \beta_{ir}\]

Where: - \(K_{ij}\) = read count for gene \(i\) in sample \(j\) - \(s_j\) = size factor for sample \(j\) - \(\alpha_i\) = dispersion for gene \(i\) - \(\beta_{ir}\) = log2 fold change

MAST Hurdle Model

MAST uses a two-part hurdle model for scRNA-seq data:

Part 1 (Discrete): Logistic regression for detection \[\log \frac{P(Z_g = 1)}{P(Z_g = 0)} = X\beta_D\]

Part 2 (Continuous): Linear model for expression given detection \[E[Y_g | Z_g = 1] = X\beta_C\]

The combined test statistic follows a \(\chi^2\) distribution with 2 degrees of freedom.

L-R Pair Detection Algorithm

Workflow

┌─────────────────┐
│  Gene List(s)   │
│  (expr/DEG)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Species Check   │──► Auto-conversion if needed
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Database Match  │
│ Ligand ∩ Recep  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Annotate Pairs  │
│ + Cell Types    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Filter & Output │
└─────────────────┘

Matching Algorithm

For “mean count” datatype:

# Pseudocode
ligand_genes <- intersect(input_genes, database$Ligand)
receptor_genes <- intersect(input_genes, database$Receptor)
pairs <- database[database$Ligand %in% ligand_genes & 
                  database$Receptor %in% receptor_genes, ]

For “DEG” datatype, additional filtering removes pairs where both ligand and receptor are not differentially expressed (logFC = 0.0001 placeholder).

Multiple Testing Correction

All p-values are adjusted using the Benjamini-Hochberg procedure:

\[q_i = \min\left(1, \frac{p_i \cdot m}{i}\right)\]

Where: - \(p_i\) = i-th smallest p-value - \(m\) = total number of tests - \(q_i\) = adjusted p-value (FDR)

# Demonstration of BH correction
set.seed(42)
p_values <- c(runif(50, 0, 0.01), runif(50, 0.01, 1))  # Mix of significant and non-significant

q_values <- p.adjust(p_values, method = "BH")

# Visualize
par(mfrow = c(1, 2))
hist(p_values, breaks = 20, main = "Raw P-values", xlab = "P-value", col = "steelblue")
hist(q_values, breaks = 20, main = "BH-adjusted Q-values", xlab = "Q-value", col = "coral")

Communication Strength Metrics

Expression-based Scoring

For mean count analysis, communication strength is proportional to:

\[S_{LR} = \bar{E}_L \times \bar{E}_R\]

Where \(\bar{E}_L\) and \(\bar{E}_R\) are mean expression levels of ligand and receptor.

Fold Change-based Scoring

For DEG analysis:

\[S_{LR} = |FC_L| \times |FC_R| \times \text{sign}(FC_L \times FC_R)\]

This captures both magnitude and directionality of changes.

Network Construction

Cell-cell communication networks are constructed using:

  1. Nodes = Cell types
  2. Edges = Number of L-R pairs between cell types
  3. Edge weights = Communication frequency or strength
library(igraph)

# Example network construction
edges <- data.frame(
  from = c("T_cell", "T_cell", "Macrophage", "Dendritic"),
  to = c("Macrophage", "B_cell", "T_cell", "T_cell"),
  weight = c(15, 8, 12, 6)
)

g <- graph_from_data_frame(edges, directed = TRUE)
E(g)$width <- E(g)$weight / 3

plot(g, 
     vertex.size = 30,
     vertex.color = "lightblue",
     edge.arrow.size = 0.5,
     main = "Cell Communication Network")

Computational Complexity

Operation Time Complexity Space Complexity
rawParse O(n × g) O(c × g)
FindLR O(g × d) O(p)
DEG (Wilcox) O(n × g) O(g)
LRPlot O(p) O(p)
NetView O(c²) O(c²)

Where: - n = number of cells - g = number of genes - c = number of cell types - d = database size - p = number of pairs

References

  1. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014;15:550.

  2. Finak G, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biology. 2015;16:278.

  3. Wang Y, et al. iTALK: an R Package to Characterize and Illustrate Intercellular Communication. bioRxiv. 2019;507871.

Session Info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] igraph_2.3.2   iTALK_0.1.1    rmarkdown_2.31
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.10         generics_0.1.4      tidyr_1.3.2        
#>  [4] shape_1.4.6.1       stringi_1.8.7       hms_1.1.4          
#>  [7] digest_0.6.39       magrittr_2.0.5      evaluate_1.0.5     
#> [10] grid_4.6.0          RColorBrewer_1.1-3  circlize_0.4.18    
#> [13] fastmap_1.2.0       jsonlite_2.0.0      progress_1.2.3     
#> [16] GlobalOptions_0.1.4 purrr_1.2.2         scales_1.4.0       
#> [19] pbapply_1.7-4       randomcoloR_1.1.0.1 jquerylib_0.1.4    
#> [22] cli_3.6.6           crayon_1.5.3        rlang_1.2.0        
#> [25] cachem_1.1.0        yaml_2.3.12         otel_0.2.0         
#> [28] Rtsne_0.17          parallel_4.6.0      tools_4.6.0        
#> [31] dplyr_1.2.1         colorspace_2.1-2    ggplot2_4.0.3      
#> [34] curl_7.1.0          buildtools_1.0.0    vctrs_0.7.3        
#> [37] R6_2.6.1            lifecycle_1.0.5     stringr_1.6.0      
#> [40] V8_8.2.0            cluster_2.1.8.2     pkgconfig_2.0.3    
#> [43] pillar_1.11.1       bslib_0.11.0        gtable_0.3.6       
#> [46] glue_1.8.1          Rcpp_1.1.1-1.1      xfun_0.59          
#> [49] tibble_3.3.1        tidyselect_1.2.1    sys_3.4.3          
#> [52] knitr_1.51          farver_2.1.2        htmltools_0.5.9    
#> [55] maketools_1.3.2     compiler_4.6.0      prettyunits_1.2.0  
#> [58] S7_0.2.2