Algorithm and Mathematical Framework

Overview

SCORPION implements the PANDA (Passing Attributes between Networks for Data Assimilation) algorithm, adapted for single-cell data through metacell aggregation. This vignette explains the mathematical framework underlying the algorithm.

The PANDA Algorithm

Core Concept

PANDA is a message-passing algorithm that integrates multiple data sources to infer gene regulatory networks. It iteratively updates three networks:

Regulatory Network (W): TF → Gene relationships
Co-regulatory Network (C): Gene-Gene co-expression patterns
Cooperative Network (P): TF-TF cooperation patterns

Mathematical Formulation

1. Tanimoto Similarity

The core of PANDA is the Tanimoto similarity function, which measures the overlap between network neighborhoods:

\[T(X, Y) = \frac{XY^T}{\sqrt{\|X\|^2 + \|Y\|^2 - |XY^T|}}\]

where: - \(X\) and \(Y\) are network matrices - \(\|\cdot\|\) denotes the Frobenius norm

Illustration of Tanimoto similarity between networks

2. Network Update Rules

At each iteration \(t\), the networks are updated as follows:

Responsibility (how well TFs explain gene expression): \[R_t = T(P_t, W_t)\]

Availability (how well genes are co-regulated): \[A_t = T(W_t, C_t)\]

Regulatory network update: \[W_{t+1} = (1-\alpha)W_t + \alpha \cdot \frac{R_t + A_t}{2}\]

Cooperative network update: \[P_{t+1} = (1-\alpha)P_t + \alpha \cdot T(W_t, W_t^T)\]

Co-regulatory network update: \[C_{t+1} = (1-\alpha)C_t + \alpha \cdot T(W_t^T, W_t)\]

where \(\alpha\) is the learning rate.

Convergence Criterion

The algorithm converges when the Hamming distance between consecutive regulatory networks falls below a threshold:

\[H_t = \frac{1}{n_{TF} \times n_{gene}} \sum_{i,j} |W_t^{(i,j)} - W_{t-1}^{(i,j)}|\]

# Run SCORPION and track convergence
data(scorpionTest)
set.seed(123)

# Run with high hamming to see more iterations
result <- scorpion(
  tfMotifs = scorpionTest$tf,
  gexMatrix = scorpionTest$gex,
  ppiNet = scorpionTest$ppi,
  gammaValue = 10,
  alphaValue = 0.1,
  hammingValue = 0.001,
  showProgress = FALSE
)

# Simulated convergence curve for illustration
iterations <- 0:10
hamming <- 1 * exp(-0.8 * iterations)

plot(iterations, hamming, type = "b", pch = 19, col = "steelblue",
     xlab = "Iteration", ylab = "Hamming Distance",
     main = "PANDA Convergence", ylim = c(0, 1))
abline(h = 0.001, col = "red", lty = 2)
text(8, 0.05, "Threshold = 0.001", col = "red")

Convergence of PANDA algorithm

Metacell Aggregation

The Sparsity Problem

Single-cell RNA-seq data is inherently sparse due to: - Technical dropout events - Low capture efficiency - Biological variability

Solution: Metacells

SCORPION addresses sparsity by aggregating similar cells into “metacells”:

PCA dimensionality reduction on variable genes
k-NN graph construction in PCA space
Walktrap clustering to identify cell communities
Expression averaging within clusters

# Illustration of metacell concept
set.seed(42)
n_cells <- 100
n_genes <- 50

# Simulated single-cell data (sparse)
sc_data <- matrix(rpois(n_cells * n_genes, lambda = 0.5), n_genes, n_cells)
sc_data[sc_data > 3] <- 3

# Simulated metacell data (denser)
n_metacells <- 10
mc_data <- matrix(rpois(n_metacells * n_genes, lambda = 5), n_genes, n_metacells)

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# Single-cell heatmap
image(t(sc_data), main = paste0("Single-cell (", n_cells, " cells)"),
      col = colorRampPalette(c("white", "navy"))(100),
      xlab = "Cells", ylab = "Genes", axes = FALSE)
mtext(paste0("Sparsity: ", round(100*mean(sc_data == 0), 1), "%"), side = 1, line = 2.5)

# Metacell heatmap  
image(t(mc_data), main = paste0("Metacells (", n_metacells, " metacells)"),
      col = colorRampPalette(c("white", "navy"))(100),
      xlab = "Metacells", ylab = "Genes", axes = FALSE)
mtext(paste0("Sparsity: ", round(100*mean(mc_data == 0), 1), "%"), side = 1, line = 2.5)

Metacell aggregation concept

Gamma Parameter

The gammaValue parameter controls the aggregation level:

\[\gamma = \frac{n_{cells}}{n_{metacells}}\]

Higher γ = More aggregation, less sparsity, lower resolution
Lower γ = Less aggregation, more sparsity, higher resolution

Network Normalization

Double Z-score Normalization

Before message passing, all networks are normalized using a double Z-score approach:

\[Z = \frac{Z_{row} + Z_{col}}{\sqrt{2}}\]

where: \[Z_{row} = \frac{X - \mu_{row}}{\sigma_{row}}\] \[Z_{col} = \frac{X - \mu_{col}}{\sigma_{col}}\]

This ensures that both row-wise (TF) and column-wise (gene) statistics are considered.

# Illustration of normalization effect
set.seed(42)
raw <- matrix(rnorm(100, mean = 5, sd = 2), 10, 10)
raw[1:3, ] <- raw[1:3, ] + 10  # Add row bias

# Row-wise z-score
z_row <- t(scale(t(raw)))

# Column-wise z-score  
z_col <- scale(raw)

# Double z-score
z_double <- (z_row + z_col) / sqrt(2)

par(mfrow = c(2, 2), mar = c(2, 2, 3, 1))
image(raw, main = "Raw Matrix", col = colorRampPalette(c("blue", "white", "red"))(100), axes = FALSE)
image(z_row, main = "Row Z-score", col = colorRampPalette(c("blue", "white", "red"))(100), axes = FALSE)
image(z_col, main = "Column Z-score", col = colorRampPalette(c("blue", "white", "red"))(100), axes = FALSE)
image(z_double, main = "Double Z-score", col = colorRampPalette(c("blue", "white", "red"))(100), axes = FALSE)

Effect of double Z-score normalization

Computational Complexity

Operation	Complexity
Tanimoto similarity	O(n²m)
Network update	O(n²m)
Total per iteration	O(n²m)
Metacell aggregation	O(c × g)

Where: - n = number of TFs - m = number of genes - c = number of cells - g = number of genes used for PCA

References

Glass, K., et al. (2013). Passing Messages between Biological Networks to Refine Predicted Interactions. PLoS ONE.
Osorio, D., et al. (2023). Population-level comparisons of gene regulatory networks modeled on high-throughput single-cell transcriptomics data. Nature Computational Science.

Session Information

sessionInfo()
#> R version 4.6.1 (2026-06-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 26.04 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] Matrix_1.7-5   SCORPION_1.2.2 rmarkdown_2.31
#> 
#> loaded via a namespace (and not attached):
#>  [1] cli_3.6.6        knitr_1.51       rlang_1.3.0      xfun_0.60       
#>  [5] otel_0.2.0       jsonlite_2.0.0   buildtools_1.0.0 htmltools_0.5.9 
#>  [9] maketools_1.3.2  sys_3.4.3        sass_0.4.10      grid_4.6.1      
#> [13] evaluate_1.0.5   jquerylib_0.1.4  fastmap_1.2.0    yaml_2.3.12     
#> [17] lifecycle_1.0.5  compiler_4.6.1   codetools_0.2-20 igraph_2.3.3    
#> [21] irlba_2.3.7      pkgconfig_2.0.3  Rcpp_1.1.2       pbapply_1.7-4   
#> [25] lattice_0.22-9   digest_0.6.39    R6_2.6.1         RANN_2.6.2      
#> [29] parallel_4.6.1   magrittr_2.0.5   bslib_0.11.0     tools_4.6.1     
#> [33] cachem_1.1.0