--- title: "Algorithm and Statistical Methods" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Algorithm and Statistical Methods} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 6, warning = FALSE, message = FALSE ) ``` ## Overview This vignette describes the statistical methods and algorithms implemented in iTALK for intercellular communication analysis. ## Ligand-Receptor Database ### Database Structure iTALK includes a curated database of **2,649 ligand-receptor pairs** categorized into four functional groups: ```{r database} library(iTALK) # Access the internal database db <- iTALK:::database cat("Total L-R pairs:", nrow(db), "\n\n") cat("Categories:\n") print(table(db$Classification)) ``` ### Database Sources The database integrates information from: 1. **Cytokines** - Zhang et al., 2007; Cameron et al., 2000-2013 2. **Checkpoint molecules** - Auslander et al., 2018 3. **Ligand-receptor pairs** - Ramilowski et al., Nature Communications, 2015 ```{r db_structure} # Database columns cat("Database columns:", paste(colnames(db), collapse = ", "), "\n\n") # Example entries head(db[, c("Ligand.ApprovedSymbol", "Receptor.ApprovedSymbol", "Classification")]) ``` ## Statistical Methods for Differential Expression ### Method Comparison iTALK supports seven statistical methods for identifying differentially expressed genes: | Method | Model | Best For | Reference | |--------|-------|----------|-----------| | **Wilcox** | Wilcoxon rank-sum | General purpose | - | | **DESeq2** | Negative binomial | Bulk RNA-seq | Love et al., 2014 | | **edgeR** | Negative binomial + EB | Bulk RNA-seq | McCarthy et al., 2012 | | **MAST** | Two-part hurdle | scRNA-seq | Finak et al., 2015 | | **monocle** | Census-based | Trajectory analysis | Qiu et al., 2017 | | **SCDE** | Bayesian | scRNA-seq | Kharchenko et al., 2014 | | **DEsingle** | Zero-inflated NB | Dropout modeling | Miao et al., 2018 | ### Wilcoxon Rank-Sum Test The default method uses the Wilcoxon rank-sum test (Mann-Whitney U test): $$W = \sum_{i=1}^{n_1} R_i - \frac{n_1(n_1+1)}{2}$$ Where $R_i$ is the rank of observation $i$ in the combined sample. **Log Fold Change Calculation:** For raw counts: $$\log_2 FC = \log_2 \frac{\bar{x}_1 + \epsilon}{\bar{x}_2 + \epsilon}$$ Where $\epsilon = 10^{-9}$ prevents division by zero. For log-transformed data: $$\log_2 FC = \bar{x}_1 - \bar{x}_2$$ ```{r wilcox_demo, eval=FALSE} # Example: Wilcoxon test deg_result <- DEG( data = expression_data, method = "Wilcox", q_cut = 0.05, contrast = c("Treatment", "Control") ) ``` ### DESeq2 Model DESeq2 uses a negative binomial generalized linear model: $$K_{ij} \sim NB(\mu_{ij}, \alpha_i)$$ $$\mu_{ij} = s_j q_{ij}$$ $$\log_2 q_{ij} = \sum_r x_{jr} \beta_{ir}$$ Where: - $K_{ij}$ = read count for gene $i$ in sample $j$ - $s_j$ = size factor for sample $j$ - $\alpha_i$ = dispersion for gene $i$ - $\beta_{ir}$ = log2 fold change ### MAST Hurdle Model MAST uses a two-part hurdle model for scRNA-seq data: **Part 1 (Discrete):** Logistic regression for detection $$\log \frac{P(Z_g = 1)}{P(Z_g = 0)} = X\beta_D$$ **Part 2 (Continuous):** Linear model for expression given detection $$E[Y_g | Z_g = 1] = X\beta_C$$ The combined test statistic follows a $\chi^2$ distribution with 2 degrees of freedom. ## L-R Pair Detection Algorithm ### Workflow ``` ┌─────────────────┐ │ Gene List(s) │ │ (expr/DEG) │ └────────┬────────┘ │ ▼ ┌─────────────────┐ │ Species Check │──► Auto-conversion if needed └────────┬────────┘ │ ▼ ┌─────────────────┐ │ Database Match │ │ Ligand ∩ Recep │ └────────┬────────┘ │ ▼ ┌─────────────────┐ │ Annotate Pairs │ │ + Cell Types │ └────────┬────────┘ │ ▼ ┌─────────────────┐ │ Filter & Output │ └─────────────────┘ ``` ### Matching Algorithm For "mean count" datatype: ```r # Pseudocode ligand_genes <- intersect(input_genes, database$Ligand) receptor_genes <- intersect(input_genes, database$Receptor) pairs <- database[database$Ligand %in% ligand_genes & database$Receptor %in% receptor_genes, ] ``` For "DEG" datatype, additional filtering removes pairs where both ligand and receptor are not differentially expressed (logFC = 0.0001 placeholder). ## Multiple Testing Correction All p-values are adjusted using the Benjamini-Hochberg procedure: $$q_i = \min\left(1, \frac{p_i \cdot m}{i}\right)$$ Where: - $p_i$ = i-th smallest p-value - $m$ = total number of tests - $q_i$ = adjusted p-value (FDR) ```{r bh_demo} # Demonstration of BH correction set.seed(42) p_values <- c(runif(50, 0, 0.01), runif(50, 0.01, 1)) # Mix of significant and non-significant q_values <- p.adjust(p_values, method = "BH") # Visualize par(mfrow = c(1, 2)) hist(p_values, breaks = 20, main = "Raw P-values", xlab = "P-value", col = "steelblue") hist(q_values, breaks = 20, main = "BH-adjusted Q-values", xlab = "Q-value", col = "coral") ``` ## Communication Strength Metrics ### Expression-based Scoring For mean count analysis, communication strength is proportional to: $$S_{LR} = \bar{E}_L \times \bar{E}_R$$ Where $\bar{E}_L$ and $\bar{E}_R$ are mean expression levels of ligand and receptor. ### Fold Change-based Scoring For DEG analysis: $$S_{LR} = |FC_L| \times |FC_R| \times \text{sign}(FC_L \times FC_R)$$ This captures both magnitude and directionality of changes. ## Network Construction Cell-cell communication networks are constructed using: 1. **Nodes** = Cell types 2. **Edges** = Number of L-R pairs between cell types 3. **Edge weights** = Communication frequency or strength ```{r network_demo} library(igraph) # Example network construction edges <- data.frame( from = c("T_cell", "T_cell", "Macrophage", "Dendritic"), to = c("Macrophage", "B_cell", "T_cell", "T_cell"), weight = c(15, 8, 12, 6) ) g <- graph_from_data_frame(edges, directed = TRUE) E(g)$width <- E(g)$weight / 3 plot(g, vertex.size = 30, vertex.color = "lightblue", edge.arrow.size = 0.5, main = "Cell Communication Network") ``` ## Computational Complexity | Operation | Time Complexity | Space Complexity | |-----------|-----------------|------------------| | rawParse | O(n × g) | O(c × g) | | FindLR | O(g × d) | O(p) | | DEG (Wilcox) | O(n × g) | O(g) | | LRPlot | O(p) | O(p) | | NetView | O(c²) | O(c²) | Where: - n = number of cells - g = number of genes - c = number of cell types - d = database size - p = number of pairs ## References 1. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. *Genome Biology*. 2014;15:550. 2. Finak G, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. *Genome Biology*. 2015;16:278. 3. Wang Y, et al. iTALK: an R Package to Characterize and Illustrate Intercellular Communication. *bioRxiv*. 2019;507871. ## Session Info ```{r session} sessionInfo() ```