--- title: "Algorithm and Mathematical Framework" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Algorithm and Mathematical Framework} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, warning = FALSE, message = FALSE ) ``` ## Overview SCORPION implements the **PANDA** (Passing Attributes between Networks for Data Assimilation) algorithm, adapted for single-cell data through metacell aggregation. This vignette explains the mathematical framework underlying the algorithm. ## The PANDA Algorithm ### Core Concept PANDA is a message-passing algorithm that integrates multiple data sources to infer gene regulatory networks. It iteratively updates three networks: 1. **Regulatory Network (W)**: TF → Gene relationships 2. **Co-regulatory Network (C)**: Gene-Gene co-expression patterns 3. **Cooperative Network (P)**: TF-TF cooperation patterns ### Mathematical Formulation #### 1. Tanimoto Similarity The core of PANDA is the Tanimoto similarity function, which measures the overlap between network neighborhoods: $$T(X, Y) = \frac{XY^T}{\sqrt{\|X\|^2 + \|Y\|^2 - |XY^T|}}$$ where: - $X$ and $Y$ are network matrices - $\|\cdot\|$ denotes the Frobenius norm ```{r tanimoto, echo=FALSE, fig.cap="Illustration of Tanimoto similarity between networks"} library(SCORPION) # Simple visualization of Tanimoto concept set.seed(42) n <- 5 X <- matrix(runif(n*n, -1, 1), n, n) Y <- matrix(runif(n*n, -1, 1), n, n) par(mfrow = c(1, 3), mar = c(2, 2, 3, 1)) # Network X image(X, main = "Network X", col = colorRampPalette(c("blue", "white", "red"))(100), axes = FALSE) box() # Network Y image(Y, main = "Network Y", col = colorRampPalette(c("blue", "white", "red"))(100), axes = FALSE) box() # Similarity sim <- X %*% t(Y) image(sim, main = "Tanimoto Similarity", col = colorRampPalette(c("blue", "white", "red"))(100), axes = FALSE) box() ``` #### 2. Network Update Rules At each iteration $t$, the networks are updated as follows: **Responsibility** (how well TFs explain gene expression): $$R_t = T(P_t, W_t)$$ **Availability** (how well genes are co-regulated): $$A_t = T(W_t, C_t)$$ **Regulatory network update**: $$W_{t+1} = (1-\alpha)W_t + \alpha \cdot \frac{R_t + A_t}{2}$$ **Cooperative network update**: $$P_{t+1} = (1-\alpha)P_t + \alpha \cdot T(W_t, W_t^T)$$ **Co-regulatory network update**: $$C_{t+1} = (1-\alpha)C_t + \alpha \cdot T(W_t^T, W_t)$$ where $\alpha$ is the learning rate. ### Convergence Criterion The algorithm converges when the Hamming distance between consecutive regulatory networks falls below a threshold: $$H_t = \frac{1}{n_{TF} \times n_{gene}} \sum_{i,j} |W_t^{(i,j)} - W_{t-1}^{(i,j)}|$$ ```{r convergence, fig.cap="Convergence of PANDA algorithm"} # Run SCORPION and track convergence data(scorpionTest) set.seed(123) # Run with high hamming to see more iterations result <- scorpion( tfMotifs = scorpionTest$tf, gexMatrix = scorpionTest$gex, ppiNet = scorpionTest$ppi, gammaValue = 10, alphaValue = 0.1, hammingValue = 0.001, showProgress = FALSE ) # Simulated convergence curve for illustration iterations <- 0:10 hamming <- 1 * exp(-0.8 * iterations) plot(iterations, hamming, type = "b", pch = 19, col = "steelblue", xlab = "Iteration", ylab = "Hamming Distance", main = "PANDA Convergence", ylim = c(0, 1)) abline(h = 0.001, col = "red", lty = 2) text(8, 0.05, "Threshold = 0.001", col = "red") ``` ## Metacell Aggregation ### The Sparsity Problem Single-cell RNA-seq data is inherently sparse due to: - Technical dropout events - Low capture efficiency - Biological variability ### Solution: Metacells SCORPION addresses sparsity by aggregating similar cells into "metacells": 1. **PCA dimensionality reduction** on variable genes 2. **k-NN graph construction** in PCA space 3. **Walktrap clustering** to identify cell communities 4. **Expression averaging** within clusters ```{r metacell, fig.cap="Metacell aggregation concept"} # Illustration of metacell concept set.seed(42) n_cells <- 100 n_genes <- 50 # Simulated single-cell data (sparse) sc_data <- matrix(rpois(n_cells * n_genes, lambda = 0.5), n_genes, n_cells) sc_data[sc_data > 3] <- 3 # Simulated metacell data (denser) n_metacells <- 10 mc_data <- matrix(rpois(n_metacells * n_genes, lambda = 5), n_genes, n_metacells) par(mfrow = c(1, 2), mar = c(4, 4, 3, 1)) # Single-cell heatmap image(t(sc_data), main = paste0("Single-cell (", n_cells, " cells)"), col = colorRampPalette(c("white", "navy"))(100), xlab = "Cells", ylab = "Genes", axes = FALSE) mtext(paste0("Sparsity: ", round(100*mean(sc_data == 0), 1), "%"), side = 1, line = 2.5) # Metacell heatmap image(t(mc_data), main = paste0("Metacells (", n_metacells, " metacells)"), col = colorRampPalette(c("white", "navy"))(100), xlab = "Metacells", ylab = "Genes", axes = FALSE) mtext(paste0("Sparsity: ", round(100*mean(mc_data == 0), 1), "%"), side = 1, line = 2.5) ``` ### Gamma Parameter The `gammaValue` parameter controls the aggregation level: $$\gamma = \frac{n_{cells}}{n_{metacells}}$$ - Higher γ = More aggregation, less sparsity, lower resolution - Lower γ = Less aggregation, more sparsity, higher resolution ## Network Normalization ### Double Z-score Normalization Before message passing, all networks are normalized using a double Z-score approach: $$Z = \frac{Z_{row} + Z_{col}}{\sqrt{2}}$$ where: $$Z_{row} = \frac{X - \mu_{row}}{\sigma_{row}}$$ $$Z_{col} = \frac{X - \mu_{col}}{\sigma_{col}}$$ This ensures that both row-wise (TF) and column-wise (gene) statistics are considered. ```{r normalize, fig.cap="Effect of double Z-score normalization"} # Illustration of normalization effect set.seed(42) raw <- matrix(rnorm(100, mean = 5, sd = 2), 10, 10) raw[1:3, ] <- raw[1:3, ] + 10 # Add row bias # Row-wise z-score z_row <- t(scale(t(raw))) # Column-wise z-score z_col <- scale(raw) # Double z-score z_double <- (z_row + z_col) / sqrt(2) par(mfrow = c(2, 2), mar = c(2, 2, 3, 1)) image(raw, main = "Raw Matrix", col = colorRampPalette(c("blue", "white", "red"))(100), axes = FALSE) image(z_row, main = "Row Z-score", col = colorRampPalette(c("blue", "white", "red"))(100), axes = FALSE) image(z_col, main = "Column Z-score", col = colorRampPalette(c("blue", "white", "red"))(100), axes = FALSE) image(z_double, main = "Double Z-score", col = colorRampPalette(c("blue", "white", "red"))(100), axes = FALSE) ``` ## Computational Complexity | Operation | Complexity | |-----------|------------| | Tanimoto similarity | O(n²m) | | Network update | O(n²m) | | Total per iteration | O(n²m) | | Metacell aggregation | O(c × g) | Where: - n = number of TFs - m = number of genes - c = number of cells - g = number of genes used for PCA ## References 1. Glass, K., et al. (2013). Passing Messages between Biological Networks to Refine Predicted Interactions. *PLoS ONE*. 2. Osorio, D., et al. (2023). Population-level comparisons of gene regulatory networks modeled on high-throughput single-cell transcriptomics data. *Nature Computational Science*. ## Session Information ```{r session} sessionInfo() ```