---
title: "SpaGER: Algorithm and Mathematical Foundation"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{SpaGER: Algorithm and Mathematical Foundation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  message = FALSE,
  warning = FALSE
)
```

## Overview

SpaGER implements the **SpaGE** (Spatial Gene Enhancement) algorithm, which uses domain adaptation to predict unmeasured gene expression in spatial transcriptomics data by leveraging scRNA-seq reference data.

## The Challenge

Spatial transcriptomics technologies face a fundamental trade-off:

| Technology | Genes Measured | Spatial Resolution |
|------------|----------------|-------------------|
| MERFISH | ~100-1000 | Subcellular |
| seqFISH | ~100-10000 | Subcellular |
| Visium | ~20000 | ~55μm spots |
| Slide-seq | ~20000 | ~10μm beads |

High-resolution methods (MERFISH, seqFISH) measure fewer genes, while whole-transcriptome methods (Visium) have lower spatial resolution. **SpaGER bridges this gap** by using scRNA-seq to predict unmeasured genes.

## Algorithm Steps

### Step 1: Data Standardization

Both datasets are z-score normalized using **population standard deviation**:

$$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$

where:
- $x_{ij}$ is the expression of gene $j$ in cell $i$
- $\mu_j = \frac{1}{n}\sum_i x_{ij}$ is the mean expression
- $\sigma_j = \sqrt{\frac{1}{n}\sum_i (x_{ij} - \mu_j)^2}$ is the population standard deviation

```{r zscore_demo}
library(SpaGER)

# Demonstrate z-score standardization
set.seed(42)
X <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3)
X_std <- zscore(X)

cat("Original data:\n")
print(X)
cat("\nStandardized data:\n")
print(round(X_std, 4))
cat("\nColumn means (should be ~0):", round(colMeans(X_std), 10), "\n")
```

### Step 2: Dimensionality Reduction

Both datasets are independently reduced to a lower-dimensional space using PCA (or other methods):

$$\mathbf{X} \approx \mathbf{T}\mathbf{P}^T$$

where $\mathbf{P}$ contains the principal component loadings.

```{r pca_demo, fig.width=6, fig.height=4}
set.seed(42)
# Create sample data with structure
n <- 100
X_source <- cbind(
  rnorm(n, 0, 3),      # High variance
  rnorm(n, 0, 1),      # Medium variance  
  rnorm(n, 0, 0.3)     # Low variance
)

# Perform PCA
reducer <- process_dim_reduction("pca", n_dim = 2)
reducer <- reducer$fit(X_source)

cat("PCA loadings (components):\n")
print(round(reducer$components_, 4))
cat("\nFirst PC captures high-variance direction\n")
```

### Step 3: Matrix Orthogonalization

The factor matrices are orthonormalized using SVD:

$$\mathbf{P} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T \rightarrow \text{orth}(\mathbf{P}) = \mathbf{U}$$

This ensures the basis vectors are orthogonal and unit-length.

### Step 4: Principal Vectors Computation

The key innovation of SpaGE is finding **Principal Vectors (PVs)** that represent shared variation between domains.

Given orthonormalized factor matrices $\mathbf{P}_s$ (source/scRNA-seq) and $\mathbf{P}_t$ (target/spatial), we compute:

$$\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T = \mathbf{P}_s \mathbf{P}_t^T$$

The singular values $\sigma_i$ represent the **cosine similarity** between the $i$-th pair of principal vectors. Higher values indicate better alignment.

```{r pv_demo, fig.width=6, fig.height=4}
set.seed(42)

# Create two datasets with shared structure
n_samples <- 100
n_features <- 50

# Shared signal
shared <- matrix(rnorm(n_samples * 10), nrow = n_samples)

# Source and target with shared structure
source_data <- cbind(
  shared %*% matrix(rnorm(10 * n_features), nrow = 10),
  matrix(rnorm(n_samples * n_features, sd = 0.5), nrow = n_samples)
)[, 1:n_features]

target_data <- cbind(
  shared %*% matrix(rnorm(10 * n_features), nrow = 10) + matrix(rnorm(n_samples * n_features, sd = 0.3), nrow = n_samples),
  matrix(rnorm(n_samples * n_features, sd = 0.5), nrow = n_samples)
)[, 1:n_features]

# Compute principal vectors
pv <- PVComputation(n_factors = 20, n_pv = 20, dim_reduction = "pca")
pv <- pv$fit(source_data, target_data)

# Plot cosine similarities
similarities <- diag(pv$cosine_similarity_matrix_)
barplot(similarities, 
        names.arg = 1:length(similarities),
        main = "Principal Vector Cosine Similarities",
        xlab = "Principal Vector Index",
        ylab = "Cosine Similarity",
        col = ifelse(similarities > 0.3, "#2ecc71", "#e74c3c"),
        border = "white")
abline(h = 0.3, lty = 2, col = "blue", lwd = 2)
legend("topright", 
       legend = c("Selected (>0.3)", "Rejected (≤0.3)", "Threshold"),
       fill = c("#2ecc71", "#e74c3c", NA),
       border = c("white", "white", NA),
       lty = c(NA, NA, 2),
       col = c(NA, NA, "blue"))
```

### Step 5: Projection

Both datasets are projected onto the aligned subspace defined by the selected principal vectors:

$$\mathbf{Z}_s = \mathbf{X}_s \mathbf{S}$$
$$\mathbf{Z}_t = \mathbf{X}_t \mathbf{S}$$

where $\mathbf{S}$ contains the source principal vector loadings.

### Step 6: Weighted k-NN Imputation

For each spatial cell $i$, find $k$ nearest scRNA-seq cells using **cosine distance**:

$$d(a, b) = 1 - \frac{a \cdot b}{\|a\| \|b\|}$$

Expression is predicted as a weighted average:

$$\hat{y}_i = \sum_{j \in \mathcal{N}_k(i)} w_{ij} \cdot y_j$$

where weights are computed as:

$$w_{ij} = \frac{1 - d_{ij}/\sum_k d_{ik}}{|\mathcal{N}_k| - 1}$$

```{r knn_demo, fig.width=6, fig.height=5}
set.seed(42)

# Demonstrate KNN with cosine distance
query <- matrix(c(1, 0), nrow = 1)
ref <- matrix(c(
  0.9, 0.1,   # Close to query
  0.7, 0.3,   # Medium distance
  0.5, 0.5,   # Farther
  0.1, 0.9    # Far
), nrow = 4, byrow = TRUE)

# Compute KNN
knn_result <- cosine_knn(query, ref, k = 3)

# Visualize
plot(ref[,1], ref[,2], pch = 19, cex = 2,
     xlim = c(-0.2, 1.2), ylim = c(-0.2, 1.2),
     xlab = "Dimension 1", ylab = "Dimension 2",
     main = "Cosine KNN Illustration",
     col = c("#2ecc71", "#f39c12", "#e74c3c", "#95a5a6"))
points(query[1,1], query[1,2], pch = 18, cex = 3, col = "#3498db")

# Draw connections to neighbors
for (i in 1:3) {
  idx <- knn_result$indices[1, i]
  lines(c(query[1,1], ref[idx,1]), c(query[1,2], ref[idx,2]), 
        lty = 2, col = "#3498db", lwd = 1.5)
}

legend("topright", 
       legend = c("Query", "NN 1", "NN 2", "NN 3", "Not selected"),
       pch = c(18, 19, 19, 19, 19),
       col = c("#3498db", "#2ecc71", "#f39c12", "#e74c3c", "#95a5a6"),
       pt.cex = c(2, 1.5, 1.5, 1.5, 1.5))

cat("Cosine distances to query:\n")
for (i in 1:4) {
  d <- 1 - sum(query * ref[i,]) / (sqrt(sum(query^2)) * sqrt(sum(ref[i,]^2)))
  cat(sprintf("  Reference %d: %.4f\n", i, d))
}
```

## Comparison with Python Implementation

SpaGER produces numerically equivalent results to the original Python SpaGE:

```{r comparison}
set.seed(42)

# Z-score comparison
X <- matrix(rnorm(30), nrow = 10, ncol = 3)
X_r <- zscore(X)

# Manual calculation (matching scipy.stats.zscore)
X_manual <- scale(X, center = TRUE, scale = TRUE) * sqrt(nrow(X) / (nrow(X) - 1))

cat("Maximum difference from Python-equivalent calculation:", 
    max(abs(X_r - X_manual)), "\n")
```

## Summary

The SpaGE algorithm workflow:

```
┌─────────────────┐     ┌─────────────────┐
│  Spatial Data   │     │  scRNA-seq Data │
│  (limited genes)│     │ (full genome)   │
└────────┬────────┘     └────────┬────────┘
         │                       │
         ▼                       ▼
    ┌────────────────────────────────┐
    │     Z-score Standardization     │
    └────────────────────────────────┘
         │                       │
         ▼                       ▼
    ┌──────────┐           ┌──────────┐
    │   PCA    │           │   PCA    │
    └────┬─────┘           └────┬─────┘
         │                       │
         └───────────┬───────────┘
                     ▼
    ┌────────────────────────────────┐
    │    Principal Vectors (SVD)     │
    │    Select PVs with sim > 0.3   │
    └────────────────────────────────┘
                     │
                     ▼
    ┌────────────────────────────────┐
    │    Project onto aligned space   │
    └────────────────────────────────┘
                     │
                     ▼
    ┌────────────────────────────────┐
    │   Weighted k-NN Imputation      │
    │   (cosine distance)             │
    └────────────────────────────────┘
                     │
                     ▼
    ┌────────────────────────────────┐
    │    Predicted Gene Expression    │
    └────────────────────────────────┘
```

## References

1. Abdelaal T, et al. (2020). SpaGE: Spatial Gene Enhancement using scRNA-seq. *Nucleic Acids Research*, 48(18):e107.

2. Mourragui S, et al. (2019). PRECISE: A domain adaptation approach to transfer predictors of drug response from pre-clinical models to tumors. *Bioinformatics*, 35(14):i510-i519.

## Session Information

```{r session}
sessionInfo()
```