--- title: "SpaGER: Algorithm and Mathematical Foundation" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{SpaGER: Algorithm and Mathematical Foundation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, message = FALSE, warning = FALSE ) ``` ## Overview SpaGER implements the **SpaGE** (Spatial Gene Enhancement) algorithm, which uses domain adaptation to predict unmeasured gene expression in spatial transcriptomics data by leveraging scRNA-seq reference data. ## The Challenge Spatial transcriptomics technologies face a fundamental trade-off: | Technology | Genes Measured | Spatial Resolution | |------------|----------------|-------------------| | MERFISH | ~100-1000 | Subcellular | | seqFISH | ~100-10000 | Subcellular | | Visium | ~20000 | ~55μm spots | | Slide-seq | ~20000 | ~10μm beads | High-resolution methods (MERFISH, seqFISH) measure fewer genes, while whole-transcriptome methods (Visium) have lower spatial resolution. **SpaGER bridges this gap** by using scRNA-seq to predict unmeasured genes. ## Algorithm Steps ### Step 1: Data Standardization Both datasets are z-score normalized using **population standard deviation**: $$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$ where: - $x_{ij}$ is the expression of gene $j$ in cell $i$ - $\mu_j = \frac{1}{n}\sum_i x_{ij}$ is the mean expression - $\sigma_j = \sqrt{\frac{1}{n}\sum_i (x_{ij} - \mu_j)^2}$ is the population standard deviation ```{r zscore_demo} library(SpaGER) # Demonstrate z-score standardization set.seed(42) X <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3) X_std <- zscore(X) cat("Original data:\n") print(X) cat("\nStandardized data:\n") print(round(X_std, 4)) cat("\nColumn means (should be ~0):", round(colMeans(X_std), 10), "\n") ``` ### Step 2: Dimensionality Reduction Both datasets are independently reduced to a lower-dimensional space using PCA (or other methods): $$\mathbf{X} \approx \mathbf{T}\mathbf{P}^T$$ where $\mathbf{P}$ contains the principal component loadings. ```{r pca_demo, fig.width=6, fig.height=4} set.seed(42) # Create sample data with structure n <- 100 X_source <- cbind( rnorm(n, 0, 3), # High variance rnorm(n, 0, 1), # Medium variance rnorm(n, 0, 0.3) # Low variance ) # Perform PCA reducer <- process_dim_reduction("pca", n_dim = 2) reducer <- reducer$fit(X_source) cat("PCA loadings (components):\n") print(round(reducer$components_, 4)) cat("\nFirst PC captures high-variance direction\n") ``` ### Step 3: Matrix Orthogonalization The factor matrices are orthonormalized using SVD: $$\mathbf{P} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T \rightarrow \text{orth}(\mathbf{P}) = \mathbf{U}$$ This ensures the basis vectors are orthogonal and unit-length. ### Step 4: Principal Vectors Computation The key innovation of SpaGE is finding **Principal Vectors (PVs)** that represent shared variation between domains. Given orthonormalized factor matrices $\mathbf{P}_s$ (source/scRNA-seq) and $\mathbf{P}_t$ (target/spatial), we compute: $$\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T = \mathbf{P}_s \mathbf{P}_t^T$$ The singular values $\sigma_i$ represent the **cosine similarity** between the $i$-th pair of principal vectors. Higher values indicate better alignment. ```{r pv_demo, fig.width=6, fig.height=4} set.seed(42) # Create two datasets with shared structure n_samples <- 100 n_features <- 50 # Shared signal shared <- matrix(rnorm(n_samples * 10), nrow = n_samples) # Source and target with shared structure source_data <- cbind( shared %*% matrix(rnorm(10 * n_features), nrow = 10), matrix(rnorm(n_samples * n_features, sd = 0.5), nrow = n_samples) )[, 1:n_features] target_data <- cbind( shared %*% matrix(rnorm(10 * n_features), nrow = 10) + matrix(rnorm(n_samples * n_features, sd = 0.3), nrow = n_samples), matrix(rnorm(n_samples * n_features, sd = 0.5), nrow = n_samples) )[, 1:n_features] # Compute principal vectors pv <- PVComputation(n_factors = 20, n_pv = 20, dim_reduction = "pca") pv <- pv$fit(source_data, target_data) # Plot cosine similarities similarities <- diag(pv$cosine_similarity_matrix_) barplot(similarities, names.arg = 1:length(similarities), main = "Principal Vector Cosine Similarities", xlab = "Principal Vector Index", ylab = "Cosine Similarity", col = ifelse(similarities > 0.3, "#2ecc71", "#e74c3c"), border = "white") abline(h = 0.3, lty = 2, col = "blue", lwd = 2) legend("topright", legend = c("Selected (>0.3)", "Rejected (≤0.3)", "Threshold"), fill = c("#2ecc71", "#e74c3c", NA), border = c("white", "white", NA), lty = c(NA, NA, 2), col = c(NA, NA, "blue")) ``` ### Step 5: Projection Both datasets are projected onto the aligned subspace defined by the selected principal vectors: $$\mathbf{Z}_s = \mathbf{X}_s \mathbf{S}$$ $$\mathbf{Z}_t = \mathbf{X}_t \mathbf{S}$$ where $\mathbf{S}$ contains the source principal vector loadings. ### Step 6: Weighted k-NN Imputation For each spatial cell $i$, find $k$ nearest scRNA-seq cells using **cosine distance**: $$d(a, b) = 1 - \frac{a \cdot b}{\|a\| \|b\|}$$ Expression is predicted as a weighted average: $$\hat{y}_i = \sum_{j \in \mathcal{N}_k(i)} w_{ij} \cdot y_j$$ where weights are computed as: $$w_{ij} = \frac{1 - d_{ij}/\sum_k d_{ik}}{|\mathcal{N}_k| - 1}$$ ```{r knn_demo, fig.width=6, fig.height=5} set.seed(42) # Demonstrate KNN with cosine distance query <- matrix(c(1, 0), nrow = 1) ref <- matrix(c( 0.9, 0.1, # Close to query 0.7, 0.3, # Medium distance 0.5, 0.5, # Farther 0.1, 0.9 # Far ), nrow = 4, byrow = TRUE) # Compute KNN knn_result <- cosine_knn(query, ref, k = 3) # Visualize plot(ref[,1], ref[,2], pch = 19, cex = 2, xlim = c(-0.2, 1.2), ylim = c(-0.2, 1.2), xlab = "Dimension 1", ylab = "Dimension 2", main = "Cosine KNN Illustration", col = c("#2ecc71", "#f39c12", "#e74c3c", "#95a5a6")) points(query[1,1], query[1,2], pch = 18, cex = 3, col = "#3498db") # Draw connections to neighbors for (i in 1:3) { idx <- knn_result$indices[1, i] lines(c(query[1,1], ref[idx,1]), c(query[1,2], ref[idx,2]), lty = 2, col = "#3498db", lwd = 1.5) } legend("topright", legend = c("Query", "NN 1", "NN 2", "NN 3", "Not selected"), pch = c(18, 19, 19, 19, 19), col = c("#3498db", "#2ecc71", "#f39c12", "#e74c3c", "#95a5a6"), pt.cex = c(2, 1.5, 1.5, 1.5, 1.5)) cat("Cosine distances to query:\n") for (i in 1:4) { d <- 1 - sum(query * ref[i,]) / (sqrt(sum(query^2)) * sqrt(sum(ref[i,]^2))) cat(sprintf(" Reference %d: %.4f\n", i, d)) } ``` ## Comparison with Python Implementation SpaGER produces numerically equivalent results to the original Python SpaGE: ```{r comparison} set.seed(42) # Z-score comparison X <- matrix(rnorm(30), nrow = 10, ncol = 3) X_r <- zscore(X) # Manual calculation (matching scipy.stats.zscore) X_manual <- scale(X, center = TRUE, scale = TRUE) * sqrt(nrow(X) / (nrow(X) - 1)) cat("Maximum difference from Python-equivalent calculation:", max(abs(X_r - X_manual)), "\n") ``` ## Summary The SpaGE algorithm workflow: ``` ┌─────────────────┐ ┌─────────────────┐ │ Spatial Data │ │ scRNA-seq Data │ │ (limited genes)│ │ (full genome) │ └────────┬────────┘ └────────┬────────┘ │ │ ▼ ▼ ┌────────────────────────────────┐ │ Z-score Standardization │ └────────────────────────────────┘ │ │ ▼ ▼ ┌──────────┐ ┌──────────┐ │ PCA │ │ PCA │ └────┬─────┘ └────┬─────┘ │ │ └───────────┬───────────┘ ▼ ┌────────────────────────────────┐ │ Principal Vectors (SVD) │ │ Select PVs with sim > 0.3 │ └────────────────────────────────┘ │ ▼ ┌────────────────────────────────┐ │ Project onto aligned space │ └────────────────────────────────┘ │ ▼ ┌────────────────────────────────┐ │ Weighted k-NN Imputation │ │ (cosine distance) │ └────────────────────────────────┘ │ ▼ ┌────────────────────────────────┐ │ Predicted Gene Expression │ └────────────────────────────────┘ ``` ## References 1. Abdelaal T, et al. (2020). SpaGE: Spatial Gene Enhancement using scRNA-seq. *Nucleic Acids Research*, 48(18):e107. 2. Mourragui S, et al. (2019). PRECISE: A domain adaptation approach to transfer predictors of drug response from pre-clinical models to tumors. *Bioinformatics*, 35(14):i510-i519. ## Session Information ```{r session} sessionInfo() ```