SpaTalk Algorithm: Methodological Framework

Overview

SpaTalk employs a multi-step computational framework to infer spatially resolved cell-cell communications from spatial transcriptomics data. This vignette provides detailed explanations of the underlying algorithms. ## Algorithmic Pipeline

┌─────────────────────────────────────────────────────────────────┐
│                     SpaTalk Pipeline                            │
├─────────────────────────────────────────────────────────────────┤
│  1. Cell-type Deconvolution (for spot-based ST)                 │
│     └── Non-negative Linear Model (NNLM)                        │
│                                                                 │
│  2. Spatial Reconstruction                                      │
│     └── Monte Carlo sampling + k-NN spatial mapping             │
│                                                                 │
│  3. LR-Pathway Filtering                                        │
│     └── Knowledge graph: LR pairs → Downstream TFs              │
│                                                                 │
│  4. CCI Inference                                               │
│     └── Co-expression analysis + Permutation test               │
│                                                                 │
│  5. Pathway Scoring                                             │
│     └── Random walk on gene-gene interaction graph              │
└─────────────────────────────────────────────────────────────────┘

1. Cell-Type Deconvolution

Mathematical Formulation

For spot-based ST data where each spot contains multiple cells, SpaTalk decomposes the observed expression profile into cell-type-specific contributions using Non-negative Matrix Factorization (NMF).

Given: - \(X \in \mathbb{R}^{G \times S}\): ST expression matrix (genes × spots) - \(R \in \mathbb{R}^{G \times C}\): Reference expression profile (genes × cell types)

The deconvolution problem is formulated as:

\[X \approx R \cdot W\]

where \(W \in \mathbb{R}^{C \times S}\) represents cell-type proportions per spot, subject to: - \(W_{ij} \geq 0\) (non-negativity) - \(\sum_i W_{ij} = 1\) (proportions sum to 1)

NNLM Algorithm

SpaTalk uses the NNLM (Non-Negative Linear Model) package with the following optimization:

\[\min_{W \geq 0} \|X - RW\|_F^2 + \lambda \|W\|_1\]

The L1 regularization term promotes sparsity in cell-type assignments.

# Deconvolution workflow
obj <- dec_celltype(
  object = obj,
  sc_data = sc_counts,      # scRNA-seq reference
  sc_celltype = cell_labels, # Cell type annotations
  if_doParallel = TRUE
)

2. Spatial Reconstruction

Monte Carlo Spatial Sampling

After deconvolution, SpaTalk reconstructs single-cell resolution expression by sampling cells from the reference data according to:

  1. Cell proportion sampling: For each spot \(s\) with predicted proportions \(W_s\), sample \(N_s\) cells where \(N_s\) is the expected number of cells per spot.

  2. Spatial coordinate assignment: Assign coordinates to sampled cells using a weighted distance model:

\[P(x_i | s) \propto \exp\left(-\frac{d(x_i, c_s)^2}{2\sigma^2}\right)\]

where \(c_s\) is the spot centroid and \(\sigma\) controls the spatial spread.

k-Nearest Neighbor Mapping

For each reconstructed cell, find the k-nearest neighbors in the original ST data to refine expression estimates:

\[\hat{E}_i = \frac{1}{|N_k(i)|} \sum_{j \in N_k(i)} E_j \cdot w_{ij}\]

where \(w_{ij}\) is the distance-based weight.

3. Knowledge Graph Integration

LR-Pathway Database

SpaTalk integrates curated biological knowledge from multiple databases:

Database Content Size
CellTalkDB Ligand-receptor pairs 3,398 (Human), 2,033 (Mouse)
KEGG Signaling pathways ~300 pathways
Reactome Pathway interactions ~2,000 pathways
AnimalTFDB Transcription factors ~1,600 TFs

Pathway Filtering Algorithm

# The find_lr_path function filters LR pairs with downstream targets
obj <- find_lr_path(obj, lrpairs, pathways)

# Filter criteria:
# 1. Both ligand and receptor expressed in ST data
# 2. Downstream pathway genes present
# 3. Target TFs expressed

4. Cell-Cell Communication Inference

Co-expression Score

For a ligand-receptor pair (L, R) between sender cell type A and receiver cell type B:

\[\text{CoExp}(L_A, R_B) = \frac{|\{(a,b) : E_L^a > 0 \land E_R^b > 0 \land d(a,b) < \delta\}|}{|\{(a,b) : d(a,b) < \delta\}|}\]

where \(\delta\) is the spatial distance threshold.

Permutation Test

Statistical significance is assessed via permutation:

  1. Permute cell type labels \(n\) times (default: 1000)
  2. Recalculate co-expression for each permutation
  3. Compute p-value:

\[p = \frac{|\{\pi : \text{CoExp}^\pi \geq \text{CoExp}^{obs}\}| + 1}{n + 1}\]

# CCI inference with permutation testing
obj <- dec_cci(
  object = obj,
  celltype_sender = "Macrophage",
  celltype_receiver = "Fibroblast",
  n_permutation = 1000
)

5. Downstream Pathway Scoring

Random Walk Algorithm

To score downstream transcription factors activated by a receptor, SpaTalk performs random walks on the gene-gene interaction (GGI) graph:

Algorithm: Random Walk TF Scoring

Input: GGI graph G, receptor R, TF set T, n_walks, max_hop
Output: TF scores

Initialize: score[tf] = 0 for all tf in T

for i = 1 to n_walks:
    current = R
    for hop = 1 to max_hop:
        neighbors = G.neighbors(current)
        if neighbors is empty: break
        next = random_choice(neighbors)
        if next in T:
            score[next] += 1
        current = next

Return: score / n_walks

C++ Implementation

SpaTalk uses optimized C++ code for performance-critical computations:

// Simplified random walk implementation
NumericVector cpp_random_walk(
    CharacterVector ggi_src,
    CharacterVector ggi_dest,
    String receptor_name,
    CharacterVector tf_names,
    int n_walks = 10000,
    int max_hop = 10
) {
    // Build adjacency list
    // Perform random walks
    // Count TF visits
    // Return normalized scores
}

Computational Complexity

Step Time Complexity Space Complexity
Deconvolution \(O(G \cdot S \cdot C)\) \(O(G \cdot S)\)
Spatial reconstruction \(O(S \cdot N \cdot k)\) \(O(N)\)
Co-expression \(O(L \cdot N^2)\) \(O(N^2)\)
Permutation test \(O(P \cdot L \cdot N^2)\) \(O(N^2)\)
Random walk \(O(W \cdot H)\) \(O(|E|)\)

Where: G = genes, S = spots, C = cell types, N = cells, L = LR pairs, P = permutations, W = walks, H = max hops, E = edges.

References

  1. Shao, X., et al. (2022). Knowledge-graph-based cell-cell communication inference for spatially resolved transcriptomic data with SpaTalk. Nature Communications, 13, 4429.

  2. Lin, X., & Boutros, P. C. (2020). Optimization and expansion of non-negative matrix factorization. BMC Bioinformatics, 21, 7.

  3. Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1), 27-30.

Session Info

sessionInfo()
#> R version 4.6.1 (2026-06-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 26.04 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] rmarkdown_2.31
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.39    R6_2.6.1         fastmap_1.2.0    xfun_0.59       
#>  [5] maketools_1.3.2  cachem_1.1.0     knitr_1.51       htmltools_0.5.9 
#>  [9] buildtools_1.0.0 lifecycle_1.0.5  cli_3.6.6        sass_0.4.10     
#> [13] jquerylib_0.1.4  compiler_4.6.1   sys_3.4.3        tools_4.6.1     
#> [17] evaluate_1.0.5   bslib_0.11.0     yaml_2.3.12      otel_0.2.0      
#> [21] jsonlite_2.0.0   rlang_1.2.0