--- title: "Algorithm Principles" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Algorithm Principles} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, fig.align = "center", message = FALSE, warning = FALSE ) ``` ## Overview CytoSPACER implements the CytoSPACE algorithm (Vahid et al., *Nature Biotechnology*, 2023), which formulates the problem of mapping single cells to spatial locations as a **Linear Assignment Problem (LAP)**. This vignette provides a detailed explanation of the underlying mathematical principles. ## The Spatial Mapping Problem ### Problem Statement Given: - **scRNA-seq data**: A reference single-cell dataset with $n$ cells and known cell type annotations - **Spatial transcriptomics data**: $m$ spatial spots with gene expression profiles - **Estimated cell counts**: The number of cells expected at each spot **Goal**: Find the optimal assignment of single cells to spatial spots that minimizes the total expression dissimilarity. ### Mathematical Formulation Let: - $C \in \mathbb{R}^{n \times m}$ be the cost matrix where $C_{ij}$ represents the dissimilarity between cell $i$ and spot $j$ - $X \in \{0, 1\}^{n \times m}$ be the assignment matrix where $X_{ij} = 1$ if cell $i$ is assigned to spot $j$ The optimization problem is: $$\min_{X} \sum_{i=1}^{n} \sum_{j=1}^{m} C_{ij} X_{ij}$$ Subject to: $$\sum_{j=1}^{m} X_{ij} = 1 \quad \forall i \quad \text{(each cell assigned to exactly one spot)}$$ $$\sum_{i=1}^{n} X_{ij} = k_j \quad \forall j \quad \text{(each spot receives } k_j \text{ cells)}$$ ## Algorithm Pipeline CytoSPACER proceeds through five main steps: ```{r load-library} library(CytoSPACER) ``` ### Step 1: Cell Type Fraction Estimation First, we estimate the cell type composition at each spatial spot using reference-based deconvolution: ```{r step1-demo, eval=FALSE} # Estimate cell type fractions using Seurat's TransferData fractions <- estimate_fractions( sc_data = sc_expression, cell_types = cell_labels, st_data = st_expression ) ``` This step uses anchor-based integration to transfer cell type labels from the scRNA-seq reference to the spatial data. ### Step 2: Cell Count Estimation The number of cells per spot is estimated based on total RNA content: ```{r step2-demo} # Simulate ST data for demonstration set.seed(42) st_demo <- matrix(rpois(100 * 20, lambda = 100), nrow = 100, ncol = 20) rownames(st_demo) <- paste0("Gene", 1:100) colnames(st_demo) <- paste0("Spot", 1:20) # Estimate cells per spot cells_per_spot <- estimate_cells_per_spot( st_data = st_demo, mean_cells = 5, min_cells = 1 ) # View distribution summary(cells_per_spot) ``` The estimation assumes a linear relationship between total normalized expression and cell count, anchored at the minimum and mean values. ### Step 3: Reference Cell Sampling Single cells are sampled from the scRNA-seq reference to match the estimated spatial composition: ```{r step3-demo} # Create demo scRNA-seq data sc_demo <- matrix(rpois(100 * 200, lambda = 10), nrow = 100, ncol = 200) rownames(sc_demo) <- paste0("Gene", 1:100) colnames(sc_demo) <- paste0("Cell", 1:200) cell_types_demo <- rep(c("TypeA", "TypeB", "TypeC", "TypeD"), each = 50) names(cell_types_demo) <- colnames(sc_demo) # Define target counts target_counts <- c(TypeA = 30, TypeB = 40, TypeC = 20, TypeD = 10) # Sample cells sampled <- sample_cells( sc_data = sc_demo, cell_types = cell_types_demo, target_counts = target_counts, method = "duplicates", seed = 42 ) cat("Sampled", ncol(sampled$expression), "cells\n") table(sampled$cell_types) ``` Two sampling methods are available: - **duplicates**: Allows reusing cells when the reference is insufficient - **synthetic**: Generates synthetic cells by sampling gene expression values ### Step 4: Cost Matrix Construction The cost matrix quantifies dissimilarity between each cell-spot pair: ```{r step4-demo} # Normalize data sc_norm <- normalize_expression(sc_demo[, 1:50]) st_norm <- normalize_expression(st_demo) # Compute cost matrix using Pearson correlation cost <- compute_cost_matrix( sc_data = sc_norm, st_data = st_norm, method = "pearson" ) cat("Cost matrix dimensions:", nrow(cost), "x", ncol(cost), "\n") cat("Cost range:", round(min(cost), 3), "to", round(max(cost), 3), "\n") ``` For correlation-based methods, the cost is computed as: $$C_{ij} = 1 - \rho(x_i, y_j)$$ where $\rho$ is the Pearson or Spearman correlation coefficient. ### Step 5: Linear Assignment Problem Solving The core optimization uses the **Jonker-Volgenant algorithm**, an efficient $O(n^3)$ method for solving the LAP: ```{r step5-demo} # Create a simple cost matrix cost_simple <- matrix(c( 1, 2, 3, 2, 4, 6, 3, 6, 9 ), nrow = 3, byrow = TRUE) # Solve LAP assignment <- solve_lap(cost_simple) cat("Optimal assignment:", assignment, "\n") # Compute total cost total_cost <- compute_assignment_cost(cost_simple, assignment) cat("Total cost:", total_cost, "\n") ``` ## The Jonker-Volgenant Algorithm The Jonker-Volgenant algorithm (1987) is a shortest augmenting path algorithm that efficiently solves the LAP. Key features: 1. **Column Reduction**: Initialize dual variables by column minima 2. **Reduction Transfer**: Update dual variables for assigned rows 3. **Augmenting Row Reduction**: Find initial feasible assignments 4. **Augmentation**: Use Dijkstra-like shortest path to find augmenting paths ### Implementation Details CytoSPACER provides a high-performance C++ implementation: ```{r cpp-implementation} # The C++ implementation handles both square and rectangular matrices # Square matrix cost_square <- matrix(runif(100), nrow = 10, ncol = 10) assignment_square <- solve_lap(cost_square) # Rectangular matrix (rows <= cols) cost_rect <- matrix(runif(50), nrow = 5, ncol = 10) assignment_rect <- solve_lap(cost_rect) cat("Square assignment:", assignment_square, "\n") cat("Rectangular assignment:", assignment_rect, "\n") ``` ## Distance Metrics CytoSPACER supports three distance metrics: ### Pearson Correlation $$\rho_{Pearson}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}$$ ```{r pearson-demo} cost_pearson <- compute_cost_matrix(sc_norm[,1:10], st_norm[,1:5], method = "pearson") cat("Pearson cost range:", round(range(cost_pearson), 3), "\n") ``` ### Spearman Correlation Uses rank-transformed data: $$\rho_{Spearman}(x, y) = \rho_{Pearson}(rank(x), rank(y))$$ ```{r spearman-demo} cost_spearman <- compute_cost_matrix(sc_norm[,1:10], st_norm[,1:5], method = "spearman") cat("Spearman cost range:", round(range(cost_spearman), 3), "\n") ``` ### Euclidean Distance $$d_{Euclidean}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$$ ```{r euclidean-demo} cost_euclidean <- compute_cost_matrix(sc_norm[,1:10], st_norm[,1:5], method = "euclidean") cat("Euclidean cost range:", round(range(cost_euclidean), 3), "\n") ``` ## Normalization Expression data is normalized using TPM + log2 transformation: $$x'_{ij} = \log_2\left(\frac{x_{ij}}{\sum_i x_{ij}} \times 10^6 + 1\right)$$ ```{r normalization-demo} # Raw counts raw_counts <- matrix(rpois(100, lambda = 50), nrow = 10, ncol = 10) # Normalize normalized <- normalize_expression(raw_counts) cat("Raw column sums:", head(colSums(raw_counts)), "\n") cat("Normalized column sums:", round(head(colSums(normalized)), 2), "\n") ``` ## Handling Large Datasets For large datasets, CytoSPACER uses chunked processing: 1. **Partitioning**: Divide cells into chunks based on `chunk_size` 2. **Parallel Solving**: Solve subproblems independently 3. **Aggregation**: Combine results from all chunks ```{r parallel-demo, eval=FALSE} # Parallel processing with multiple workers results <- run_cytospace( sc_data = sc_data, cell_types = cell_types, st_data = st_data, coordinates = coordinates, chunk_size = 5000, n_workers = 4 ) ``` ## Tie-Breaking To ensure deterministic results when multiple optimal solutions exist, small random noise is added to the cost matrix: ```{r noise-demo} cost_with_noise <- add_noise_to_cost(cost_simple, scale = 1e-10, seed = 42) cat("Original cost:\n") print(cost_simple) cat("\nWith noise (difference):\n") print(round(cost_with_noise - cost_simple, 12)) ``` ## References 1. Vahid MR, et al. (2023). High-resolution alignment of single-cell and spatial transcriptomes with CytoSPACE. *Nature Biotechnology* 41, 1543–1548. 2. Jonker R, Volgenant A. (1987). A shortest augmenting path algorithm for dense and sparse linear assignment problems. *Computing* 38(4):325-340. 3. Kuhn HW. (1955). The Hungarian method for the assignment problem. *Naval Research Logistics Quarterly* 2(1-2):83-97. ## Session Info ```{r session-info} sessionInfo() ```