---
title: "Theory & Methods"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Theory & Methods}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  fig.align = "center",
  out.width = "100%",
  dpi = 150,
  warning = FALSE,
  message = FALSE
)
```

## Overview

scMetaLink models metabolite-mediated cell communication as a two-step process:

```{r flowchart, echo=FALSE, out.width="100%", fig.cap="**Figure 1: scMetaLink Workflow.** Schematic overview of the analysis pipeline for inferring metabolite-mediated cell-cell communication."}
knitr::include_graphics("flowchart.png")
```

## The MetalinksDB Knowledge Base

scMetaLink uses **MetalinksDB**, a comprehensive database of metabolite-protein interactions.

```{r db_overview}
library(scMetaLink)

# Load and explore the database
db <- scMetaLink:::.load_metalinksdb()

cat("=== MetalinksDB Statistics ===\n")
cat("Metabolites:", nrow(db$metabolites), "\n")
cat("Proteins:", nrow(db$proteins), "\n")
cat("Interactions:", nrow(db$edges), "\n")
cat("Pathways:", length(unique(db$pathway$pathway)), "\n")
```

### Interaction Types

```{r interaction_types}
# Two main interaction types
cat("=== Interaction Types ===\n")
print(table(db$edges$type))

cat("\n- lr (ligand-receptor): Direct metabolite-receptor binding\n")
cat("- pd (produce-degrade): Enzymatic production/degradation\n")
```

### Mode of Regulation (MOR)

For `pd` type interactions, MOR indicates the direction:

```{r mor}
pd_edges <- db$edges[db$edges$type == "pd", ]
cat("=== Mode of Regulation for pd interactions ===\n")
print(table(pd_edges$mor))

cat("\n- mor = 1: Enzyme produces/secretes the metabolite\n")
cat("- mor = -1: Enzyme degrades/consumes the metabolite\n")
cat("- mor = 0: Enzyme binds without direction change\n")
```

### Protein Types

```{r protein_types}
cat("=== Protein Classifications ===\n")
print(table(db$proteins$protein_type, useNA = "ifany"))

cat("\n")
cat("Receptors: gpcr, lgic, nhr, vgic, catalytic_receptor, other_ic\n")
cat("Enzymes: enzyme (or NA in protein_type)\n")
cat("Transporters: transporter\n")
```

## Mathematical Framework

```{r calflow, echo=FALSE, out.width="100%", fig.cap="**Figure 2: Computational Algorithm.** Schematic illustration of the mathematical framework for computing metabolite production, sensing, and communication scores."}
knitr::include_graphics("calflow.png")
```

### 1. Metabolite Production Potential (MPP)

The production potential for metabolite $m$ in cell type $c$ is:

$$\text{MPP}_{m,c} = \frac{1}{|E_m|} \sum_{e \in E_m} \text{Score}(e, c) - \alpha \cdot \frac{1}{|D_m|} \sum_{d \in D_m} \text{Score}(d, c)$$

Where:
- $E_m$: Set of enzymes that produce metabolite $m$
- $D_m$: Set of enzymes that degrade metabolite $m$
- $\alpha = 0.5$: Degradation weight factor
- $\text{Score}(g, c)$: Gene expression score for gene $g$ in cell type $c$

#### Gene Expression Scoring

Three methods are available:

| Method | Formula | Use Case |
|--------|---------|----------|
| `mean` | $\bar{x}_g$ | Simple average expression |
| `proportion` | $P(x_g > 0)$ | Proportion of expressing cells |
| `combined` | $\bar{x}_g \times P(x_g > 0)$ | Balanced (recommended) |

```{r scoring_methods, echo=FALSE, fig.height=4}
# Illustrate scoring methods
set.seed(42)
expr <- c(rep(0, 70), rnorm(30, 2, 0.5)) # 70% dropout

methods <- data.frame(
  Method = c("mean", "proportion", "combined"),
  Formula = c("mean(expr)", "mean(expr > 0)", "mean x proportion"),
  Value = c(
    mean(expr),
    mean(expr > 0),
    mean(expr) * mean(expr > 0)
  )
)
knitr::kable(methods, digits = 3)
```

#### Trimean Option

For single-cell data with high dropout, **trimean** provides a robust alternative:

$$\text{Trimean} = \frac{Q_1 + 2Q_2 + Q_3}{4}$$

```{r trimean_example}
# Example: Trimean vs Arithmetic Mean
expr_with_outlier <- c(0, 0, 0, 1, 2, 3, 100) # Outlier = 100

cat("Data:", paste(expr_with_outlier, collapse = ", "), "\n")
cat("Arithmetic mean:", round(mean(expr_with_outlier), 2), "\n")
cat("Median:", median(expr_with_outlier), "\n")

# Trimean calculation
q <- quantile(expr_with_outlier, c(0.25, 0.5, 0.75))
trimean <- (q[1] + 2 * q[2] + q[3]) / 4
cat("Trimean:", round(trimean, 2), "\n")
cat("\nTrimean is robust to the outlier!\n")
```

### 2. Metabolite Sensing Capability (MSC)

The sensing capability for metabolite $m$ in cell type $c$ is:

$$\text{MSC}_{m,c} = \frac{1}{|R_m|} \sum_{r \in R_m} w_r \cdot \text{Score}(r, c)$$

Where:
- $R_m$: Set of receptors/transporters for metabolite $m$
- $w_r$: Affinity weight based on interaction confidence score

#### Affinity Weighting

Receptor-metabolite binding affinities from MetalinksDB are used as weights:

```{r affinity}
lr_edges <- db$edges[db$edges$type == "lr", ]
cat("Affinity score distribution (0-1000):\n")
summary(lr_edges$combined_score)
```

#### Hill Function (Optional)

To model receptor saturation kinetics, an optional Hill transformation can be applied:

$$P = \frac{E^n}{K_h^n + E^n}$$

Where:
- $E$: Expression level (normalized)
- $n$: Hill coefficient (cooperativity)
- $K_h$: Half-maximal threshold

```{r hill_demo, fig.height=4, fig.width=7, fig.cap="**Figure 1: Hill Function Transformation.** Different Hill coefficients (n) produce different response curves. n=1 gives standard Michaelis-Menten kinetics. n>1 indicates positive cooperativity (steeper curve). The half-maximal response occurs at x=Kh."}
# Hill function visualization
x <- seq(0, 1, 0.01)
hill <- function(x, n, Kh) x^n / (Kh^n + x^n)

plot(x, x,
  type = "l", lty = 2, col = "gray",
  xlab = "Expression (normalized)", ylab = "Response",
  main = "Hill Function Transformation"
)
lines(x, hill(x, n = 1, Kh = 0.5), col = "blue", lwd = 2)
lines(x, hill(x, n = 2, Kh = 0.5), col = "red", lwd = 2)
lines(x, hill(x, n = 0.5, Kh = 0.5), col = "green", lwd = 2)
legend("bottomright",
  legend = c("Linear", "n=1 (Michaelis-Menten)", "n=2 (Cooperative)", "n=0.5"),
  col = c("gray", "blue", "red", "green"),
  lty = c(2, 1, 1, 1), lwd = 2
)
abline(h = 0.5, v = 0.5, lty = 3, col = "gray")
```

### 3. Communication Score

The communication score from sender $s$ to receiver $r$ via metabolite $m$:

$$\text{Comm}_{s \to r}^m = f(\text{MPP}_{m,s}, \text{MSC}_{m,r})$$

Three aggregation methods are available:

| Method | Formula | Properties |
|--------|---------|------------|
| `geometric` | $\sqrt{\text{MPP} \times \text{MSC}}$ | Balanced, penalizes imbalance |
| `product` | $\text{MPP} \times \text{MSC}$ | Emphasizes strong bilateral signals |
| `harmonic` | $\frac{2 \times \text{MPP} \times \text{MSC}}{\text{MPP} + \text{MSC}}$ | Strongly penalizes imbalance |

```{r comm_methods, fig.height=5, fig.width=10, echo=FALSE, fig.cap="**Figure 2: Communication Score Methods.** Heat maps showing how different methods combine production (x-axis) and sensing (y-axis) scores. Geometric mean (left) provides balanced weighting. Product (center) emphasizes strong bilateral signals. Harmonic mean (right) penalizes imbalanced communication."}
# Visualize communication methods
par(mfrow = c(1, 3))

# Create grid
p <- s <- seq(0, 1, 0.05)
grid <- expand.grid(p = p, s = s)

# Geometric
z_geom <- matrix(sqrt(grid$p * grid$s), nrow = length(p))
image(p, s, z_geom,
  main = "Geometric: sqrt(P x S)",
  xlab = "Production", ylab = "Sensing", col = hcl.colors(50, "YlOrRd")
)

# Product
z_prod <- matrix(grid$p * grid$s, nrow = length(p))
image(p, s, z_prod,
  main = "Product: P x S",
  xlab = "Production", ylab = "Sensing", col = hcl.colors(50, "YlOrRd")
)

# Harmonic
z_harm <- matrix(2 * grid$p * grid$s / (grid$p + grid$s + 1e-10), nrow = length(p))
image(p, s, z_harm,
  main = "Harmonic: 2PS/(P+S)",
  xlab = "Production", ylab = "Sensing", col = hcl.colors(50, "YlOrRd")
)

par(mfrow = c(1, 1))
```

### 4. Population Size Correction (Optional)

When `population.size = TRUE`, communication is weighted by cell type abundance:

$$\text{Comm}_{s \to r}^{\text{weighted}} = \text{Comm}_{s \to r} \times \sqrt{\frac{n_s}{N} \times \frac{n_r}{N}}$$

Where $n_s, n_r$ are cell counts and $N$ is total cells.

**Rationale**: Larger cell populations contribute more to tissue-level signaling.

## Statistical Framework

### Permutation Test

To assess significance, cell type labels are randomly shuffled:

1. Shuffle cell type labels $B$ times (default: 100-1000)
2. Recalculate production, sensing, and communication for each permutation
3. Compute empirical p-value:

$$p = \frac{1 + \sum_{b=1}^{B} \mathbb{1}[\text{Comm}_b \geq \text{Comm}_{\text{obs}}]}{B + 1}$$

### Multiple Testing Correction

Three strategies for handling multiple comparisons:

| Method | Description |
|--------|-------------|
| `metabolite_stratified` | BH correction within each metabolite (recommended) |
| `BH` | Global Benjamini-Hochberg |
| `bonferroni` | Global Bonferroni (very conservative) |

**Why `metabolite_stratified`?**

- Metabolites are independent biological signals
- Global correction is overly conservative
- Per-metabolite correction preserves biological interpretability

## Data Processing Pipeline

```{r pipeline, echo=FALSE, fig.height=6, fig.width=10}
# Create flowchart-style diagram using base R
par(mar = c(1, 1, 1, 1))
plot(0, 0,
  type = "n", xlim = c(0, 10), ylim = c(0, 8),
  xlab = "", ylab = "", axes = FALSE
)

# Boxes
rect(0.5, 6.5, 2.5, 7.5, col = "#E3F2FD", border = "#1976D2", lwd = 2)
text(1.5, 7, "Expression\nMatrix", cex = 0.8)

rect(0.5, 4.5, 2.5, 5.5, col = "#E3F2FD", border = "#1976D2", lwd = 2)
text(1.5, 5, "Cell Type\nAnnotation", cex = 0.8)

rect(3.5, 5, 5.5, 6.5, col = "#FFF3E0", border = "#E65100", lwd = 2)
text(4.5, 5.75, "Cell Type\nExpression", cex = 0.8)

rect(6.5, 6.5, 8.5, 7.5, col = "#E8F5E9", border = "#388E3C", lwd = 2)
text(7.5, 7, "Production\nScores", cex = 0.8)

rect(6.5, 4, 8.5, 5, col = "#E8F5E9", border = "#388E3C", lwd = 2)
text(7.5, 4.5, "Sensing\nScores", cex = 0.8)

rect(6.5, 1.5, 8.5, 2.5, col = "#FCE4EC", border = "#C2185B", lwd = 2)
text(7.5, 2, "Communication\nScores", cex = 0.8)

rect(3.5, 0, 5.5, 1, col = "#F3E5F5", border = "#7B1FA2", lwd = 2)
text(4.5, 0.5, "Significant\nInteractions", cex = 0.8)

# Arrows
arrows(2.5, 7, 3.5, 5.75, length = 0.1, lwd = 2)
arrows(2.5, 5, 3.5, 5.75, length = 0.1, lwd = 2)
arrows(5.5, 6, 6.5, 7, length = 0.1, lwd = 2)
arrows(5.5, 5.5, 6.5, 4.5, length = 0.1, lwd = 2)
arrows(7.5, 6.5, 7.5, 5, length = 0.1, lwd = 2)
arrows(7.5, 4, 7.5, 2.5, length = 0.1, lwd = 2)
arrows(6.5, 2, 5.5, 0.5, length = 0.1, lwd = 2)

# Labels
text(3, 6.5, "Aggregate", cex = 0.7, col = "gray40")
text(6, 6.5, "MPP", cex = 0.7, col = "gray40")
text(6, 5, "MSC", cex = 0.7, col = "gray40")
text(8.2, 5.75, "Combine", cex = 0.7, col = "gray40")
text(6, 1.5, "Filter &\nCorrect", cex = 0.7, col = "gray40")

# MetalinksDB
rect(3.5, 7, 5.5, 8, col = "#FFFDE7", border = "#F57F17", lwd = 2)
text(4.5, 7.5, "MetalinksDB", cex = 0.8, font = 2)
arrows(4.5, 7, 4.5, 6.5, length = 0.1, lwd = 2, lty = 2)
```

## Key Assumptions

1. **Gene expression correlates with protein activity**: Enzyme/receptor expression levels reflect functional activity
2. **Metabolites can diffuse between cells**: Produced metabolites are available for sensing by neighboring cells
3. **Cell type labels are accurate**: Incorrect annotations will affect results
4. **Adequate cell sampling**: Each cell type needs sufficient cells for robust estimation

## Comparison with Ligand-Receptor Methods

| Aspect | Ligand-Receptor (e.g., CellChat) | Metabolite-Mediated (scMetaLink) |
|--------|----------------------------------|----------------------------------|
| **Signal type** | Proteins | Small molecules |
| **Diffusion range** | Short (juxtacrine/paracrine) | Variable (paracrine/endocrine) |
| **Database** | CellChatDB, CellPhoneDB | MetalinksDB |
| **Key genes** | Ligands, receptors | Enzymes, transporters, receptors |
| **Signal complexity** | Direct binding | Synthesis -> Secretion -> Sensing |

## References

1. Schafer, S., et al. (2023). MetalinksDB: a knowledgebase of metabolite-centric signaling. *Nature Communications*.

2. Xiao, Z., et al. (2022). Metabolite-mediated intercellular communication in tumor microenvironment. *Frontiers in Cell and Developmental Biology*.

## Next

- **[Production & Sensing Analysis](production-sensing.html)**: Detailed parameter tuning
- **[Communication Analysis](communication.html)**: Advanced communication analysis
- **[Spatial Analysis](spatial-analysis.html)**: Spatial transcriptomics support