--- title: "SpaTalk Algorithm: Methodological Framework" author: - name: "Zaoqu Liu" email: "liuzaoqu@163.com" affiliation: "Maintainer" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 fig_caption: true vignette: > %\VignetteIndexEntry{SpaTalk Algorithm: Methodological Framework} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 6, warning = FALSE, message = FALSE ) ``` ## Overview SpaTalk employs a multi-step computational framework to infer spatially resolved cell-cell communications from spatial transcriptomics data. This vignette provides detailed explanations of the underlying algorithms. ## Algorithmic Pipeline ``` ┌─────────────────────────────────────────────────────────────────┐ │ SpaTalk Pipeline │ ├─────────────────────────────────────────────────────────────────┤ │ 1. Cell-type Deconvolution (for spot-based ST) │ │ └── Non-negative Linear Model (NNLM) │ │ │ │ 2. Spatial Reconstruction │ │ └── Monte Carlo sampling + k-NN spatial mapping │ │ │ │ 3. LR-Pathway Filtering │ │ └── Knowledge graph: LR pairs → Downstream TFs │ │ │ │ 4. CCI Inference │ │ └── Co-expression analysis + Permutation test │ │ │ │ 5. Pathway Scoring │ │ └── Random walk on gene-gene interaction graph │ └─────────────────────────────────────────────────────────────────┘ ``` ## 1. Cell-Type Deconvolution ### Mathematical Formulation For spot-based ST data where each spot contains multiple cells, SpaTalk decomposes the observed expression profile into cell-type-specific contributions using **Non-negative Matrix Factorization (NMF)**. Given: - $X \in \mathbb{R}^{G \times S}$: ST expression matrix (genes × spots) - $R \in \mathbb{R}^{G \times C}$: Reference expression profile (genes × cell types) The deconvolution problem is formulated as: $$X \approx R \cdot W$$ where $W \in \mathbb{R}^{C \times S}$ represents cell-type proportions per spot, subject to: - $W_{ij} \geq 0$ (non-negativity) - $\sum_i W_{ij} = 1$ (proportions sum to 1) ### NNLM Algorithm SpaTalk uses the NNLM (Non-Negative Linear Model) package with the following optimization: $$\min_{W \geq 0} \|X - RW\|_F^2 + \lambda \|W\|_1$$ The L1 regularization term promotes sparsity in cell-type assignments. ```{r nnlm_demo, eval=FALSE} # Deconvolution workflow obj <- dec_celltype( object = obj, sc_data = sc_counts, # scRNA-seq reference sc_celltype = cell_labels, # Cell type annotations if_doParallel = TRUE ) ``` ## 2. Spatial Reconstruction ### Monte Carlo Spatial Sampling After deconvolution, SpaTalk reconstructs single-cell resolution expression by sampling cells from the reference data according to: 1. **Cell proportion sampling**: For each spot $s$ with predicted proportions $W_s$, sample $N_s$ cells where $N_s$ is the expected number of cells per spot. 2. **Spatial coordinate assignment**: Assign coordinates to sampled cells using a weighted distance model: $$P(x_i | s) \propto \exp\left(-\frac{d(x_i, c_s)^2}{2\sigma^2}\right)$$ where $c_s$ is the spot centroid and $\sigma$ controls the spatial spread. ### k-Nearest Neighbor Mapping For each reconstructed cell, find the k-nearest neighbors in the original ST data to refine expression estimates: $$\hat{E}_i = \frac{1}{|N_k(i)|} \sum_{j \in N_k(i)} E_j \cdot w_{ij}$$ where $w_{ij}$ is the distance-based weight. ## 3. Knowledge Graph Integration ### LR-Pathway Database SpaTalk integrates curated biological knowledge from multiple databases: | Database | Content | Size | |----------|---------|------| | CellTalkDB | Ligand-receptor pairs | 3,398 (Human), 2,033 (Mouse) | | KEGG | Signaling pathways | ~300 pathways | | Reactome | Pathway interactions | ~2,000 pathways | | AnimalTFDB | Transcription factors | ~1,600 TFs | ### Pathway Filtering Algorithm ```{r pathway_filter, eval=FALSE} # The find_lr_path function filters LR pairs with downstream targets obj <- find_lr_path(obj, lrpairs, pathways) # Filter criteria: # 1. Both ligand and receptor expressed in ST data # 2. Downstream pathway genes present # 3. Target TFs expressed ``` ## 4. Cell-Cell Communication Inference ### Co-expression Score For a ligand-receptor pair (L, R) between sender cell type A and receiver cell type B: $$\text{CoExp}(L_A, R_B) = \frac{|\{(a,b) : E_L^a > 0 \land E_R^b > 0 \land d(a,b) < \delta\}|}{|\{(a,b) : d(a,b) < \delta\}|}$$ where $\delta$ is the spatial distance threshold. ### Permutation Test Statistical significance is assessed via permutation: 1. Permute cell type labels $n$ times (default: 1000) 2. Recalculate co-expression for each permutation 3. Compute p-value: $$p = \frac{|\{\pi : \text{CoExp}^\pi \geq \text{CoExp}^{obs}\}| + 1}{n + 1}$$ ```{r permutation, eval=FALSE} # CCI inference with permutation testing obj <- dec_cci( object = obj, celltype_sender = "Macrophage", celltype_receiver = "Fibroblast", n_permutation = 1000 ) ``` ## 5. Downstream Pathway Scoring ### Random Walk Algorithm To score downstream transcription factors activated by a receptor, SpaTalk performs random walks on the gene-gene interaction (GGI) graph: **Algorithm: Random Walk TF Scoring** ``` Input: GGI graph G, receptor R, TF set T, n_walks, max_hop Output: TF scores Initialize: score[tf] = 0 for all tf in T for i = 1 to n_walks: current = R for hop = 1 to max_hop: neighbors = G.neighbors(current) if neighbors is empty: break next = random_choice(neighbors) if next in T: score[next] += 1 current = next Return: score / n_walks ``` ### C++ Implementation SpaTalk uses optimized C++ code for performance-critical computations: ```cpp // Simplified random walk implementation NumericVector cpp_random_walk( CharacterVector ggi_src, CharacterVector ggi_dest, String receptor_name, CharacterVector tf_names, int n_walks = 10000, int max_hop = 10 ) { // Build adjacency list // Perform random walks // Count TF visits // Return normalized scores } ``` ## Computational Complexity | Step | Time Complexity | Space Complexity | |------|-----------------|------------------| | Deconvolution | $O(G \cdot S \cdot C)$ | $O(G \cdot S)$ | | Spatial reconstruction | $O(S \cdot N \cdot k)$ | $O(N)$ | | Co-expression | $O(L \cdot N^2)$ | $O(N^2)$ | | Permutation test | $O(P \cdot L \cdot N^2)$ | $O(N^2)$ | | Random walk | $O(W \cdot H)$ | $O(|E|)$ | Where: G = genes, S = spots, C = cell types, N = cells, L = LR pairs, P = permutations, W = walks, H = max hops, E = edges. ## References 1. Shao, X., et al. (2022). Knowledge-graph-based cell-cell communication inference for spatially resolved transcriptomic data with SpaTalk. *Nature Communications*, 13, 4429. 2. Lin, X., & Boutros, P. C. (2020). Optimization and expansion of non-negative matrix factorization. *BMC Bioinformatics*, 21, 7. 3. Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. *Nucleic Acids Research*, 28(1), 27-30. ## Session Info ```{r session} sessionInfo() ```