---
title: "Algorithm and Statistical Methods"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Algorithm and Statistical Methods}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  warning = FALSE,
  message = FALSE
)
```

## Overview

This vignette describes the statistical methods and algorithms implemented in iTALK for intercellular communication analysis.

## Ligand-Receptor Database

### Database Structure

iTALK includes a curated database of **2,649 ligand-receptor pairs** categorized into four functional groups:

```{r database}
library(iTALK)

# Access the internal database
db <- iTALK:::database

cat("Total L-R pairs:", nrow(db), "\n\n")
cat("Categories:\n")
print(table(db$Classification))
```

### Database Sources

The database integrates information from:

1. **Cytokines** - Zhang et al., 2007; Cameron et al., 2000-2013
2. **Checkpoint molecules** - Auslander et al., 2018
3. **Ligand-receptor pairs** - Ramilowski et al., Nature Communications, 2015

```{r db_structure}
# Database columns
cat("Database columns:", paste(colnames(db), collapse = ", "), "\n\n")

# Example entries
head(db[, c("Ligand.ApprovedSymbol", "Receptor.ApprovedSymbol", "Classification")])
```

## Statistical Methods for Differential Expression

### Method Comparison

iTALK supports seven statistical methods for identifying differentially expressed genes:

| Method | Model | Best For | Reference |
|--------|-------|----------|-----------|
| **Wilcox** | Wilcoxon rank-sum | General purpose | - |
| **DESeq2** | Negative binomial | Bulk RNA-seq | Love et al., 2014 |
| **edgeR** | Negative binomial + EB | Bulk RNA-seq | McCarthy et al., 2012 |
| **MAST** | Two-part hurdle | scRNA-seq | Finak et al., 2015 |
| **monocle** | Census-based | Trajectory analysis | Qiu et al., 2017 |
| **SCDE** | Bayesian | scRNA-seq | Kharchenko et al., 2014 |
| **DEsingle** | Zero-inflated NB | Dropout modeling | Miao et al., 2018 |

### Wilcoxon Rank-Sum Test

The default method uses the Wilcoxon rank-sum test (Mann-Whitney U test):

$$W = \sum_{i=1}^{n_1} R_i - \frac{n_1(n_1+1)}{2}$$

Where $R_i$ is the rank of observation $i$ in the combined sample.

**Log Fold Change Calculation:**

For raw counts:
$$\log_2 FC = \log_2 \frac{\bar{x}_1 + \epsilon}{\bar{x}_2 + \epsilon}$$

Where $\epsilon = 10^{-9}$ prevents division by zero.

For log-transformed data:
$$\log_2 FC = \bar{x}_1 - \bar{x}_2$$

```{r wilcox_demo, eval=FALSE}
# Example: Wilcoxon test
deg_result <- DEG(
  data = expression_data,
  method = "Wilcox",
  q_cut = 0.05,
  contrast = c("Treatment", "Control")
)
```

### DESeq2 Model

DESeq2 uses a negative binomial generalized linear model:

$$K_{ij} \sim NB(\mu_{ij}, \alpha_i)$$

$$\mu_{ij} = s_j q_{ij}$$

$$\log_2 q_{ij} = \sum_r x_{jr} \beta_{ir}$$

Where:
- $K_{ij}$ = read count for gene $i$ in sample $j$
- $s_j$ = size factor for sample $j$
- $\alpha_i$ = dispersion for gene $i$
- $\beta_{ir}$ = log2 fold change

### MAST Hurdle Model

MAST uses a two-part hurdle model for scRNA-seq data:

**Part 1 (Discrete):** Logistic regression for detection
$$\log \frac{P(Z_g = 1)}{P(Z_g = 0)} = X\beta_D$$

**Part 2 (Continuous):** Linear model for expression given detection
$$E[Y_g | Z_g = 1] = X\beta_C$$

The combined test statistic follows a $\chi^2$ distribution with 2 degrees of freedom.

## L-R Pair Detection Algorithm

### Workflow

```
┌─────────────────┐
│  Gene List(s)   │
│  (expr/DEG)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Species Check   │──► Auto-conversion if needed
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Database Match  │
│ Ligand ∩ Recep  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Annotate Pairs  │
│ + Cell Types    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Filter & Output │
└─────────────────┘
```

### Matching Algorithm

For "mean count" datatype:

```r
# Pseudocode
ligand_genes <- intersect(input_genes, database$Ligand)
receptor_genes <- intersect(input_genes, database$Receptor)
pairs <- database[database$Ligand %in% ligand_genes & 
                  database$Receptor %in% receptor_genes, ]
```

For "DEG" datatype, additional filtering removes pairs where both ligand and receptor are not differentially expressed (logFC = 0.0001 placeholder).

## Multiple Testing Correction

All p-values are adjusted using the Benjamini-Hochberg procedure:

$$q_i = \min\left(1, \frac{p_i \cdot m}{i}\right)$$

Where:
- $p_i$ = i-th smallest p-value
- $m$ = total number of tests
- $q_i$ = adjusted p-value (FDR)

```{r bh_demo}
# Demonstration of BH correction
set.seed(42)
p_values <- c(runif(50, 0, 0.01), runif(50, 0.01, 1))  # Mix of significant and non-significant

q_values <- p.adjust(p_values, method = "BH")

# Visualize
par(mfrow = c(1, 2))
hist(p_values, breaks = 20, main = "Raw P-values", xlab = "P-value", col = "steelblue")
hist(q_values, breaks = 20, main = "BH-adjusted Q-values", xlab = "Q-value", col = "coral")
```

## Communication Strength Metrics

### Expression-based Scoring

For mean count analysis, communication strength is proportional to:

$$S_{LR} = \bar{E}_L \times \bar{E}_R$$

Where $\bar{E}_L$ and $\bar{E}_R$ are mean expression levels of ligand and receptor.

### Fold Change-based Scoring

For DEG analysis:

$$S_{LR} = |FC_L| \times |FC_R| \times \text{sign}(FC_L \times FC_R)$$

This captures both magnitude and directionality of changes.

## Network Construction

Cell-cell communication networks are constructed using:

1. **Nodes** = Cell types
2. **Edges** = Number of L-R pairs between cell types
3. **Edge weights** = Communication frequency or strength

```{r network_demo}
library(igraph)

# Example network construction
edges <- data.frame(
  from = c("T_cell", "T_cell", "Macrophage", "Dendritic"),
  to = c("Macrophage", "B_cell", "T_cell", "T_cell"),
  weight = c(15, 8, 12, 6)
)

g <- graph_from_data_frame(edges, directed = TRUE)
E(g)$width <- E(g)$weight / 3

plot(g, 
     vertex.size = 30,
     vertex.color = "lightblue",
     edge.arrow.size = 0.5,
     main = "Cell Communication Network")
```

## Computational Complexity

| Operation | Time Complexity | Space Complexity |
|-----------|-----------------|------------------|
| rawParse | O(n × g) | O(c × g) |
| FindLR | O(g × d) | O(p) |
| DEG (Wilcox) | O(n × g) | O(g) |
| LRPlot | O(p) | O(p) |
| NetView | O(c²) | O(c²) |

Where:
- n = number of cells
- g = number of genes
- c = number of cell types
- d = database size
- p = number of pairs

## References

1. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. *Genome Biology*. 2014;15:550.

2. Finak G, et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. *Genome Biology*. 2015;16:278.

3. Wang Y, et al. iTALK: an R Package to Characterize and Illustrate Intercellular Communication. *bioRxiv*. 2019;507871.

## Session Info

```{r session}
sessionInfo()
```