---
title: "Algorithm & Methodology"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_width: 10
    fig_height: 8
vignette: >
  %\VignetteIndexEntry{Algorithm & Methodology}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 10,
  fig.height = 8,
  warning = FALSE,
  message = FALSE,
  echo = FALSE
)
library(ggplot2)
library(dplyr)
```

## Overview

MultiNicheNet is a computational framework designed for differential cell-cell communication (CCC) analysis in multi-sample, multi-condition single-cell RNA sequencing experiments. This document provides a comprehensive overview of the algorithmic foundations and methodological principles underlying MultiNicheNet.

## Theoretical Framework

### The Cell-Cell Communication Inference Problem

Cell-cell communication (CCC) involves the transmission of biological signals between cells through ligand-receptor (L-R) interactions. In single-cell transcriptomics, we infer potential CCC events by:

1. Identifying expressed ligands in "sender" cell types
2. Identifying expressed receptors in "receiver" cell types
3. Predicting downstream signaling effects

```{r workflow-diagram, fig.height=6, fig.width=10}
# Create workflow diagram
library(ggplot2)

workflow_data <- data.frame(
  step = c("Input\nscRNA-seq", "Pseudobulk\nAggregation", "Differential\nExpression", 
           "Ligand\nActivity", "Multi-Criteria\nPrioritization", "Output\nCCC Events"),
  x = 1:6,
  y = rep(1, 6)
)

ggplot(workflow_data, aes(x = x, y = y)) +
  geom_point(size = 20, color = "#0d6efd", alpha = 0.8) +
  geom_text(aes(label = step), size = 3, fontface = "bold", color = "white") +
  geom_segment(aes(x = x + 0.3, xend = x + 0.7, y = y, yend = y),
               data = workflow_data[1:5,],
               arrow = arrow(length = unit(0.3, "cm")),
               color = "#6c757d", linewidth = 1.5) +
  theme_void() +
  labs(title = "MultiNicheNet Analysis Workflow") +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"))
```

### Why Multi-Sample Analysis?

Traditional cell-level differential expression analysis suffers from several limitations:

```{r comparison-table}
comparison_df <- data.frame(
  Aspect = c("Statistical Unit", "Sample Variability", "False Positive Rate", 
             "Complex Designs", "Batch Effects"),
  `Cell-Level` = c("Individual cells", "Ignored", "Inflated", "Limited", "Problematic"),
  `Sample-Level (MultiNicheNet)` = c("Samples/patients", "Properly modeled", "Controlled", 
                                      "Fully supported", "Can be corrected"),
  check.names = FALSE
)

knitr::kable(comparison_df, caption = "Comparison of Cell-Level vs Sample-Level Analysis")
```

## Core Algorithms

### 1. Pseudobulk Aggregation

For each cell type $c$ and sample $s$, we aggregate single-cell expression profiles:

$$\bar{X}_{g,c,s} = \frac{1}{n_{c,s}} \sum_{i \in \text{cells}(c,s)} X_{g,i}$$

where:
- $X_{g,i}$ is the expression of gene $g$ in cell $i$
- $n_{c,s}$ is the number of cells of type $c$ in sample $s$

**Benefits:**
- Reduces technical noise through averaging
- Enables proper statistical inference at sample level
- Respects experimental design structure

```{r pseudobulk-concept, fig.height=5, fig.width=9}
set.seed(42)

# Simulate cell-level data
n_cells <- 100
cell_data <- data.frame(
  x = rnorm(n_cells, 0, 1),
  y = rnorm(n_cells, 0, 1),
  sample = rep(c("Sample 1", "Sample 2", "Sample 3", "Sample 4"), each = 25),
  expression = c(rnorm(25, 5, 2), rnorm(25, 7, 2), rnorm(25, 4, 2), rnorm(25, 8, 2))
)

# Create pseudobulk
pb_data <- cell_data %>%
  group_by(sample) %>%
  summarize(mean_expr = mean(expression), .groups = "drop")

p1 <- ggplot(cell_data, aes(x = sample, y = expression, fill = sample)) +
  geom_jitter(width = 0.2, alpha = 0.5, size = 2) +
  geom_boxplot(alpha = 0.3, outlier.shape = NA) +
  theme_minimal() +
  labs(title = "Cell-Level Expression", y = "Expression", x = "") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))

p2 <- ggplot(pb_data, aes(x = sample, y = mean_expr, fill = sample)) +
  geom_col(alpha = 0.8, width = 0.6) +
  geom_point(size = 4, color = "black") +
  theme_minimal() +
  labs(title = "Pseudobulk Expression", y = "Mean Expression", x = "") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))

gridExtra::grid.arrange(p1, p2, ncol = 2)
```

### 2. Differential Expression Analysis

MultiNicheNet employs the **muscat** framework for differential state analysis. For each gene $g$ in cell type $c$:

$$\log_2(\text{CPM}_{g,c,s} + 1) = \beta_0 + \beta_1 \cdot \text{condition}_s + \epsilon_{g,c,s}$$

**Statistical Testing:**
- Uses negative binomial generalized linear models (edgeR)
- Accounts for library size differences
- Supports complex designs with covariates

**Empirical P-value Correction:**
MultiNicheNet implements an empirical null distribution approach to control for multiple testing across many cell types:

$$p_{\text{empirical}} = \frac{\#\{|t_{\text{null}}| \geq |t_{\text{observed}}|\}}{N_{\text{null}}}$$

```{r pvalue-distribution, fig.height=4, fig.width=8}
set.seed(123)

# Simulate p-value distributions
null_dist <- rbeta(10000, 1, 1)  # Uniform under null
observed_dist <- c(rbeta(7000, 1, 1), rbeta(3000, 0.3, 2))  # Mix of null and signal

df_pvals <- data.frame(
  pvalue = c(null_dist, observed_dist),
  type = rep(c("Expected (Null)", "Observed"), each = 10000)
)

ggplot(df_pvals, aes(x = pvalue, fill = type)) +
  geom_histogram(bins = 50, alpha = 0.7, position = "identity") +
  facet_wrap(~type) +
  theme_minimal() +
  labs(title = "P-value Distribution: Null vs Observed",
       x = "P-value", y = "Frequency") +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold")) +
  scale_fill_manual(values = c("#6c757d", "#0d6efd"))
```

### 3. NicheNet Ligand Activity Inference

MultiNicheNet integrates the **NicheNet** ligand-target prior knowledge model to infer ligand activities based on downstream gene expression changes.

**Ligand-Target Matrix:**
The ligand-target matrix $\mathbf{W}$ contains regulatory potential scores:

$$W_{l,t} = P(\text{gene } t \text{ regulated by ligand } l)$$

**Activity Score Calculation:**
For a set of differentially expressed target genes $G_{\text{DE}}$:

$$\text{Activity}_l = \text{AUROC}(W_{l,\cdot}, G_{\text{DE}})$$

This measures how well ligand $l$'s predicted targets are enriched in the observed DE genes.

```{r ligand-activity-concept, fig.height=5, fig.width=9}
set.seed(42)

# Simulate ligand activity data
ligands <- paste0("Ligand_", 1:10)
activity_scores <- c(0.85, 0.78, 0.72, 0.68, 0.65, 0.55, 0.52, 0.48, 0.45, 0.42)
n_targets <- c(150, 120, 100, 90, 85, 60, 55, 50, 45, 40)

ligand_df <- data.frame(
  ligand = factor(ligands, levels = rev(ligands)),
  activity = activity_scores,
  targets = n_targets
)

ggplot(ligand_df, aes(x = activity, y = ligand, fill = activity)) +
  geom_col(width = 0.7) +
  geom_vline(xintercept = 0.5, linetype = "dashed", color = "red", linewidth = 1) +
  scale_fill_gradient2(low = "#6c757d", mid = "#ffc107", high = "#0d6efd", 
                       midpoint = 0.6, name = "Activity\nScore") +
  theme_minimal() +
  labs(title = "Ligand Activity Ranking",
       subtitle = "Based on target gene enrichment (AUROC)",
       x = "Activity Score (AUROC)", y = "") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5)) +
  annotate("text", x = 0.52, y = 1, label = "Random\n(0.5)", color = "red", size = 3)
```

### 4. Multi-Criteria Prioritization

The key innovation of MultiNicheNet is integrating multiple biological criteria into a unified prioritization score.

**Prioritization Components:**

| Criterion | Symbol | Description |
|-----------|--------|-------------|
| Ligand DE | $S_{\text{DE}}^L$ | Differential expression of ligand |
| Receptor DE | $S_{\text{DE}}^R$ | Differential expression of receptor |
| Ligand specificity | $S_{\text{spec}}^L$ | Cell-type specificity of ligand |
| Receptor specificity | $S_{\text{spec}}^R$ | Cell-type specificity of receptor |
| Expression fraction | $S_{\text{frac}}$ | Fraction of samples expressing L-R |
| Ligand activity | $S_{\text{act}}$ | NicheNet activity score |

**Unified Score:**

$$S_{\text{priority}} = \prod_{i} S_i^{w_i}$$

where $w_i$ are scenario-specific weights.

```{r prioritization-heatmap, fig.height=6, fig.width=10}
set.seed(123)

# Simulate prioritization scores
n_interactions <- 15
interactions <- paste0("L", 1:n_interactions, "_R", 1:n_interactions)

criteria_matrix <- data.frame(
  interaction = interactions,
  `Ligand DE` = runif(n_interactions, 0.3, 1),
  `Receptor DE` = runif(n_interactions, 0.3, 1),
  `Ligand Specificity` = runif(n_interactions, 0.2, 1),
  `Receptor Specificity` = runif(n_interactions, 0.2, 1),
  `Expression Fraction` = runif(n_interactions, 0.4, 1),
  `Ligand Activity` = runif(n_interactions, 0.3, 0.9),
  check.names = FALSE
)

# Calculate priority score
criteria_matrix$Priority <- apply(criteria_matrix[,-1], 1, function(x) prod(x^(1/6)))
criteria_matrix <- criteria_matrix %>% arrange(desc(Priority))
criteria_matrix$interaction <- factor(criteria_matrix$interaction, 
                                       levels = rev(criteria_matrix$interaction))

# Reshape for plotting
library(tidyr)
criteria_long <- criteria_matrix %>%
  pivot_longer(-interaction, names_to = "criterion", values_to = "score") %>%
  mutate(criterion = factor(criterion, levels = c("Ligand DE", "Receptor DE", 
                                                    "Ligand Specificity", "Receptor Specificity",
                                                    "Expression Fraction", "Ligand Activity", "Priority")))

ggplot(criteria_long, aes(x = criterion, y = interaction, fill = score)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = round(score, 2)), size = 2.5, color = "black") +
  scale_fill_gradient2(low = "#f8f9fa", mid = "#ffc107", high = "#0d6efd",
                       midpoint = 0.5, name = "Score") +
  theme_minimal() +
  labs(title = "Multi-Criteria Prioritization Matrix",
       x = "", y = "Ligand-Receptor Interaction") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5, face = "bold"))
```

## Biological Scenarios

MultiNicheNet supports different biological scenarios with pre-defined weight configurations:

```{r scenarios, fig.height=4, fig.width=8}
scenarios <- data.frame(
  Scenario = c("Full (Default)", "DE-focused", "Specificity-focused", "Activity-focused"),
  `Ligand DE` = c(1, 2, 0.5, 0.5),
  `Receptor DE` = c(1, 2, 0.5, 0.5),
  `Specificity` = c(1, 0.5, 2, 0.5),
  `Activity` = c(1, 0.5, 0.5, 2),
  check.names = FALSE
)

scenarios_long <- scenarios %>%
  pivot_longer(-Scenario, names_to = "criterion", values_to = "weight")

ggplot(scenarios_long, aes(x = criterion, y = Scenario, fill = weight)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = weight), size = 4, fontface = "bold") +
  scale_fill_gradient(low = "#e9ecef", high = "#0d6efd", name = "Weight") +
  theme_minimal() +
  labs(title = "Prioritization Weight Scenarios",
       x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5, face = "bold"))
```

## Mathematical Details

### Expression Fraction Calculation

For a ligand-receptor pair in sender cell type $s$ and receiver cell type $r$:

$$F_{L,s} = \frac{\#\{\text{samples where } \bar{X}_{L,s} > \theta\}}{N_{\text{samples}}}$$

where $\theta$ is the expression threshold.

### Cell-Type Specificity Score

Specificity is calculated using the Gini coefficient or entropy-based measures:

$$\text{Specificity}_g = 1 - H(p_{g,1}, p_{g,2}, ..., p_{g,C}) / \log(C)$$

where $p_{g,c}$ is the proportion of gene $g$ expression in cell type $c$.

### Prioritization Score Aggregation

The final prioritization score uses quantile normalization followed by geometric mean:

$$S_{\text{final}} = \left(\prod_{i=1}^{K} Q_i^{w_i}\right)^{1/\sum w_i}$$

where $Q_i$ is the quantile-normalized score for criterion $i$.

## Comparison with Other Methods

```{r method-comparison, fig.height=5, fig.width=9}
methods <- data.frame(
  Method = c("CellPhoneDB", "CellChat", "NicheNet", "LIANA", "MultiNicheNet"),
  `Multi-Sample` = c(0, 0, 0, 1, 1),
  `DE Integration` = c(0, 1, 0, 0, 1),
  `Activity Inference` = c(0, 0, 1, 0, 1),
  `Complex Design` = c(0, 0, 0, 0, 1),
  `Prioritization` = c(0, 1, 1, 1, 1),
  check.names = FALSE
)

methods_long <- methods %>%
  pivot_longer(-Method, names_to = "Feature", values_to = "Support") %>%
  mutate(Support = factor(Support, levels = c(0, 1), labels = c("No", "Yes")))

ggplot(methods_long, aes(x = Feature, y = Method, fill = Support)) +
  geom_tile(color = "white", linewidth = 1) +
  scale_fill_manual(values = c("No" = "#dee2e6", "Yes" = "#0d6efd")) +
  theme_minimal() +
  labs(title = "Feature Comparison: CCC Analysis Methods",
       x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5, face = "bold"))
```

## Performance Considerations

### Computational Complexity

| Operation | Complexity | Typical Time |
|-----------|------------|--------------|
| Pseudobulk aggregation | O(n cells × m genes) | Seconds |
| DE analysis | O(k cell types × m genes) | Minutes |
| Ligand activity | O(l ligands × t targets) | Minutes |
| Prioritization | O(p pairs × c criteria) | Seconds |

### Parallelization

MultiNicheNet supports parallel processing for:
- Ligand activity inference across receivers
- Permutation-based empirical p-values
- Multiple contrast calculations

## References

1. **MultiNicheNet**: Browaeys, R. et al. bioRxiv (2023). https://doi.org/10.1101/2023.06.13.544751

2. **NicheNet**: Browaeys, R. et al. Nat Methods 17, 159–162 (2020). https://doi.org/10.1038/s41592-019-0667-5

3. **muscat**: Crowell, H.L. et al. Nat Commun 11, 6077 (2020). https://doi.org/10.1038/s41467-020-19894-4

4. **edgeR**: Robinson, M.D. et al. Bioinformatics 26, 139–140 (2010).

---

*Maintained by [Zaoqu Liu](https://github.com/Zaoqu-Liu)*