--- title: "COMMOTR: Algorithm Theory and Mathematical Foundation" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 fig_width: 8 fig_height: 6 vignette: > %\VignetteIndexEntry{Algorithm Theory and Mathematical Foundation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 6, message = FALSE, warning = FALSE ) ``` ## Introduction **COMMOTR** implements collective optimal transport (COT) for inferring spatially-resolved cell-cell communication (CCC) in spatial transcriptomics data. This vignette provides a comprehensive overview of the mathematical foundations underlying the algorithm. ## The Cell-Cell Communication Problem ### Traditional Approaches Traditional CCC inference methods typically compute a score for each cell pair based on: $$\text{Score}_{ij} = f(\text{Ligand}_i) \times g(\text{Receptor}_j)$$ However, this approach ignores **spatial constraints** — cells that are far apart are unlikely to communicate directly. ### The Optimal Transport Formulation COMMOTR formulates CCC as an **optimal transport (OT)** problem. Given: - **Source distribution** $\mathbf{a} \in \mathbb{R}^n_+$: Ligand expression across $n$ cells - **Target distribution** $\mathbf{b} \in \mathbb{R}^m_+$: Receptor expression across $m$ cells - **Cost matrix** $\mathbf{C} \in \mathbb{R}^{n \times m}_+$: Spatial distances between cells The goal is to find a **transport plan** $\mathbf{P} \in \mathbb{R}^{n \times m}_+$ that optimally transports ligand signals to receptor locations. ## Mathematical Formulation ### Classical Optimal Transport (Kantorovich Problem) The classical OT problem seeks: $$\min_{\mathbf{P} \geq 0} \langle \mathbf{C}, \mathbf{P} \rangle \quad \text{s.t.} \quad \mathbf{P}\mathbf{1} = \mathbf{a}, \quad \mathbf{P}^\top\mathbf{1} = \mathbf{b}$$ where $\langle \mathbf{C}, \mathbf{P} \rangle = \sum_{i,j} C_{ij} P_{ij}$ is the total transport cost. ### Entropy-Regularized OT Direct solution is computationally expensive ($O(n^3)$). Adding entropy regularization: $$\min_{\mathbf{P} \geq 0} \langle \mathbf{C}, \mathbf{P} \rangle + \varepsilon \cdot \text{KL}(\mathbf{P} \| \mathbf{a} \otimes \mathbf{b})$$ where: $$\text{KL}(\mathbf{P} \| \mathbf{Q}) = \sum_{i,j} P_{ij} \log\frac{P_{ij}}{Q_{ij}} - P_{ij} + Q_{ij}$$ This enables efficient $O(n^2)$ solutions via the **Sinkhorn-Knopp algorithm**. ### Unbalanced Optimal Transport In CCC, total ligand production may not equal total receptor capacity. COMMOTR uses **unbalanced OT** with KL divergence penalties: $$\min_{\mathbf{P} \geq 0} \langle \mathbf{C}, \mathbf{P} \rangle + \varepsilon \cdot \text{KL}(\mathbf{P} \| \mathbf{a} \otimes \mathbf{b}) + \rho \cdot \left( \text{KL}(\mathbf{P}\mathbf{1} \| \mathbf{a}) + \text{KL}(\mathbf{P}^\top\mathbf{1} \| \mathbf{b}) \right)$$ **Key parameters:** | Parameter | Symbol | Role | |-----------|--------|------| | Entropy regularization | $\varepsilon$ | Controls transport smoothness | | Unbalanced penalty | $\rho$ | Controls mass conservation strictness | ## The Sinkhorn Algorithm ### Dual Formulation The optimization has a dual form with variables $\mathbf{f} \in \mathbb{R}^n$ and $\mathbf{g} \in \mathbb{R}^m$. The optimal transport plan is: $$P_{ij}^* = \exp\left(\frac{f_i + g_j - C_{ij}}{\varepsilon}\right)$$ ### Iterative Updates The dual variables are updated iteratively: $$f_i \leftarrow \varepsilon \log(a_i) - \varepsilon \log\left(\sum_j e^{(f_i + g_j - C_{ij})/\varepsilon} + e^{(f_i - \rho)/\varepsilon}\right) + f_i$$ $$g_j \leftarrow \varepsilon \log(b_j) - \varepsilon \log\left(\sum_i e^{(f_i + g_j - C_{ij})/\varepsilon} + e^{(g_j - \rho)/\varepsilon}\right) + g_j$$ ### Numerical Stability COMMOTR implements the **log-sum-exp trick** to prevent overflow: $$\log\left(\sum_k e^{x_k}\right) = m + \log\left(\sum_k e^{x_k - m}\right), \quad m = \max_k x_k$$ ## Collective Optimal Transport ### The Collective Framework For multiple ligand-receptor pairs sharing the same spatial structure, COMMOTR computes a **collective** optimal transport: $$\min_{\mathbf{P}^{(1)}, \ldots, \mathbf{P}^{(K)}} \sum_{k=1}^K \langle \mathbf{C}, \mathbf{P}^{(k)} \rangle + \varepsilon \cdot H(\mathbf{P}^{(k)}) + \rho \cdot D(\mathbf{P}^{(k)})$$ This encourages consistent spatial patterns across related interactions. ### COT Strategies | Strategy | Description | Use Case | |----------|-------------|----------| | `collective` | Joint optimization of all pairs | Most consistent patterns | | `sender` | Group by sender signal | Sender-centric analysis | | `receiver` | Group by receiver signal | Receiver-centric analysis | | `pairwise` | Independent pair optimization | Fastest, most flexible | ## Spatial Distance Cost ### Distance Threshold Cells beyond distance $d_{\text{thr}}$ have infinite cost (no transport): $$C_{ij} = \begin{cases} \|\mathbf{x}_i - \mathbf{x}_j\| / d_{\text{thr}} & \text{if } \|\mathbf{x}_i - \mathbf{x}_j\| \leq d_{\text{thr}} \\ +\infty & \text{otherwise} \end{cases}$$ ### Biological Interpretation - **Small $d_{\text{thr}}$**: Only direct neighbors can communicate (contact-dependent) - **Large $d_{\text{thr}}$**: Long-range paracrine signaling allowed ## Communication Direction ### Vector Field Computation For each cell $i$, the sender direction vector is: $$\mathbf{v}_i^{\text{sender}} = \frac{\sum_j P_{ij} K_{ij} (\mathbf{x}_j - \mathbf{x}_i)}{\sum_j P_{ij} K_{ij}}$$ where $K_{ij}$ is a spatial smoothing kernel (e.g., exponential). ### Spatial Coherence (Moran's I) Moran's I measures spatial autocorrelation of directions: $$I = \frac{n}{\sum_{i,j} w_{ij}} \frac{\sum_{i,j} w_{ij} (v_i - \bar{v})(v_j - \bar{v})}{\sum_i (v_i - \bar{v})^2}$$ - $I > 0$: Spatially coherent signaling - $I \approx 0$: Random directions - $I < 0$: Spatially dispersed ## Visualization Example ```{r example_setup, eval=TRUE} library(ggplot2) # Simulate transport problem set.seed(42) n <- 20 x <- runif(n, 0, 100) y <- runif(n, 0, 100) coords <- data.frame(x = x, y = y, type = rep(c("Sender", "Receiver"), each = n/2)) # Create distance-based cost dist_mat <- as.matrix(dist(coords[, c("x", "y")])) # Visualize cost matrix cost_df <- expand.grid(i = 1:n, j = 1:n) cost_df$distance <- as.vector(dist_mat) cost_df$cost <- ifelse(cost_df$distance < 50, cost_df$distance / 50, NA) p1 <- ggplot(cost_df, aes(x = i, y = j, fill = cost)) + geom_tile() + scale_fill_viridis_c(na.value = "white", name = "Cost") + labs(title = "Distance-Based Cost Matrix", subtitle = "White = beyond threshold (infinite cost)", x = "Source cell", y = "Target cell") + coord_fixed() + theme_minimal() print(p1) ``` ```{r transport_viz, eval=TRUE} # Visualize simulated transport # (Simplified for demonstration) senders <- coords[1:10, ] receivers <- coords[11:20, ] # Create transport connections (top pairs) set.seed(123) connections <- data.frame( x_start = senders$x[sample(10, 15, replace = TRUE)], y_start = senders$y[sample(10, 15, replace = TRUE)], x_end = receivers$x[sample(10, 15, replace = TRUE)], y_end = receivers$y[sample(10, 15, replace = TRUE)], weight = runif(15, 0.1, 1) ) p2 <- ggplot() + geom_segment(data = connections, aes(x = x_start, y = y_start, xend = x_end, yend = y_end, alpha = weight, linewidth = weight), color = "#E63946", arrow = arrow(length = unit(0.15, "cm"))) + geom_point(data = senders, aes(x = x, y = y), color = "#1D3557", size = 4, shape = 16) + geom_point(data = receivers, aes(x = x, y = y), color = "#457B9D", size = 4, shape = 17) + scale_alpha_continuous(range = c(0.3, 0.9), guide = "none") + scale_linewidth_continuous(range = c(0.3, 1.5), guide = "none") + labs(title = "Optimal Transport Plan Visualization", subtitle = "Arrows show ligand→receptor transport (width = intensity)", x = "Spatial X", y = "Spatial Y") + annotate("text", x = 90, y = 95, label = "● Sender", hjust = 0, size = 3) + annotate("text", x = 90, y = 88, label = "▲ Receiver", hjust = 0, size = 3) + theme_minimal() + coord_fixed() print(p2) ``` ## Parameter Selection Guide ```{r param_viz, eval=TRUE, fig.height=5} # Illustrate effect of epsilon and rho params <- expand.grid( eps = c(0.01, 0.1, 1.0), rho = c(1, 10, 100) ) params$smoothness <- 1 / params$eps params$balance <- params$rho / (params$rho + 1) ggplot(params, aes(x = eps, y = rho)) + geom_tile(aes(fill = balance), color = "white", linewidth = 1) + geom_text(aes(label = sprintf("ε=%.2f\nρ=%d", eps, rho)), size = 3, color = "white") + scale_fill_gradient(low = "#457B9D", high = "#E63946", name = "Mass\nConservation") + scale_x_log10() + scale_y_log10() + labs(title = "Parameter Space Overview", subtitle = "ε: entropy regularization | ρ: unbalanced penalty", x = "Entropy (ε) - log scale", y = "Unbalanced penalty (ρ) - log scale") + theme_minimal() ``` ### Recommended Settings | Scenario | $\varepsilon$ | $\rho$ | |----------|---------------|--------| | Default (balanced) | 0.1 | 10 | | Sparse transport | 0.01 | 10 | | Strict mass balance | 0.1 | 100 | | Highly unbalanced | 0.1 | 1 | ## References 1. Cang, Z., Zhao, Y., et al. **Screening cell–cell communication in spatial transcriptomics via collective optimal transport.** *Nature Methods* 20, 218–228 (2023). 2. Cuturi, M. **Sinkhorn Distances: Lightspeed Computation of Optimal Transport.** *NeurIPS* (2013). 3. Chizat, L., et al. **Scaling algorithms for unbalanced optimal transport problems.** *Mathematics of Computation* 87, 2563–2609 (2018). --- *Developed by Zaoqu Liu | [GitHub](https://github.com/Zaoqu-Liu) | liuzaoqu@163.com*