---
title: "Algorithm Principles"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Algorithm Principles}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  fig.align = "center",
  message = FALSE,
  warning = FALSE
)
```

## Overview

CytoSPACER implements the CytoSPACE algorithm (Vahid et al., *Nature Biotechnology*, 2023), which formulates the problem of mapping single cells to spatial locations as a **Linear Assignment Problem (LAP)**. This vignette provides a detailed explanation of the underlying mathematical principles.

## The Spatial Mapping Problem

### Problem Statement

Given:

- **scRNA-seq data**: A reference single-cell dataset with $n$ cells and known cell type annotations
- **Spatial transcriptomics data**: $m$ spatial spots with gene expression profiles
- **Estimated cell counts**: The number of cells expected at each spot

**Goal**: Find the optimal assignment of single cells to spatial spots that minimizes the total expression dissimilarity.

### Mathematical Formulation

Let:

- $C \in \mathbb{R}^{n \times m}$ be the cost matrix where $C_{ij}$ represents the dissimilarity between cell $i$ and spot $j$
- $X \in \{0, 1\}^{n \times m}$ be the assignment matrix where $X_{ij} = 1$ if cell $i$ is assigned to spot $j$

The optimization problem is:

$$\min_{X} \sum_{i=1}^{n} \sum_{j=1}^{m} C_{ij} X_{ij}$$

Subject to:

$$\sum_{j=1}^{m} X_{ij} = 1 \quad \forall i \quad \text{(each cell assigned to exactly one spot)}$$

$$\sum_{i=1}^{n} X_{ij} = k_j \quad \forall j \quad \text{(each spot receives } k_j \text{ cells)}$$

## Algorithm Pipeline

CytoSPACER proceeds through five main steps:

```{r load-library}
library(CytoSPACER)
```

### Step 1: Cell Type Fraction Estimation

First, we estimate the cell type composition at each spatial spot using reference-based deconvolution:

```{r step1-demo, eval=FALSE}
# Estimate cell type fractions using Seurat's TransferData
fractions <- estimate_fractions(
  sc_data = sc_expression,
  cell_types = cell_labels,
  st_data = st_expression
)
```

This step uses anchor-based integration to transfer cell type labels from the scRNA-seq reference to the spatial data.

### Step 2: Cell Count Estimation

The number of cells per spot is estimated based on total RNA content:

```{r step2-demo}
# Simulate ST data for demonstration
set.seed(42)
st_demo <- matrix(rpois(100 * 20, lambda = 100), nrow = 100, ncol = 20)
rownames(st_demo) <- paste0("Gene", 1:100)
colnames(st_demo) <- paste0("Spot", 1:20)

# Estimate cells per spot
cells_per_spot <- estimate_cells_per_spot(
  st_data = st_demo,
  mean_cells = 5,
  min_cells = 1
)

# View distribution
summary(cells_per_spot)
```

The estimation assumes a linear relationship between total normalized expression and cell count, anchored at the minimum and mean values.

### Step 3: Reference Cell Sampling

Single cells are sampled from the scRNA-seq reference to match the estimated spatial composition:

```{r step3-demo}
# Create demo scRNA-seq data
sc_demo <- matrix(rpois(100 * 200, lambda = 10), nrow = 100, ncol = 200)
rownames(sc_demo) <- paste0("Gene", 1:100)
colnames(sc_demo) <- paste0("Cell", 1:200)

cell_types_demo <- rep(c("TypeA", "TypeB", "TypeC", "TypeD"), each = 50)
names(cell_types_demo) <- colnames(sc_demo)

# Define target counts
target_counts <- c(TypeA = 30, TypeB = 40, TypeC = 20, TypeD = 10)

# Sample cells
sampled <- sample_cells(
  sc_data = sc_demo,
  cell_types = cell_types_demo,
  target_counts = target_counts,
  method = "duplicates",
  seed = 42
)

cat("Sampled", ncol(sampled$expression), "cells\n")
table(sampled$cell_types)
```

Two sampling methods are available:

- **duplicates**: Allows reusing cells when the reference is insufficient
- **synthetic**: Generates synthetic cells by sampling gene expression values

### Step 4: Cost Matrix Construction

The cost matrix quantifies dissimilarity between each cell-spot pair:

```{r step4-demo}
# Normalize data
sc_norm <- normalize_expression(sc_demo[, 1:50])
st_norm <- normalize_expression(st_demo)

# Compute cost matrix using Pearson correlation
cost <- compute_cost_matrix(
  sc_data = sc_norm,
  st_data = st_norm,
  method = "pearson"
)

cat("Cost matrix dimensions:", nrow(cost), "x", ncol(cost), "\n")
cat("Cost range:", round(min(cost), 3), "to", round(max(cost), 3), "\n")
```

For correlation-based methods, the cost is computed as:

$$C_{ij} = 1 - \rho(x_i, y_j)$$

where $\rho$ is the Pearson or Spearman correlation coefficient.

### Step 5: Linear Assignment Problem Solving

The core optimization uses the **Jonker-Volgenant algorithm**, an efficient $O(n^3)$ method for solving the LAP:

```{r step5-demo}
# Create a simple cost matrix
cost_simple <- matrix(c(
  1, 2, 3,
  2, 4, 6,
  3, 6, 9
), nrow = 3, byrow = TRUE)

# Solve LAP
assignment <- solve_lap(cost_simple)
cat("Optimal assignment:", assignment, "\n")

# Compute total cost
total_cost <- compute_assignment_cost(cost_simple, assignment)
cat("Total cost:", total_cost, "\n")
```

## The Jonker-Volgenant Algorithm

The Jonker-Volgenant algorithm (1987) is a shortest augmenting path algorithm that efficiently solves the LAP. Key features:

1. **Column Reduction**: Initialize dual variables by column minima
2. **Reduction Transfer**: Update dual variables for assigned rows
3. **Augmenting Row Reduction**: Find initial feasible assignments
4. **Augmentation**: Use Dijkstra-like shortest path to find augmenting paths

### Implementation Details

CytoSPACER provides a high-performance C++ implementation:

```{r cpp-implementation}
# The C++ implementation handles both square and rectangular matrices
# Square matrix
cost_square <- matrix(runif(100), nrow = 10, ncol = 10)
assignment_square <- solve_lap(cost_square)

# Rectangular matrix (rows <= cols)
cost_rect <- matrix(runif(50), nrow = 5, ncol = 10)
assignment_rect <- solve_lap(cost_rect)

cat("Square assignment:", assignment_square, "\n")
cat("Rectangular assignment:", assignment_rect, "\n")
```

## Distance Metrics

CytoSPACER supports three distance metrics:

### Pearson Correlation

$$\rho_{Pearson}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}$$

```{r pearson-demo}
cost_pearson <- compute_cost_matrix(sc_norm[,1:10], st_norm[,1:5], method = "pearson")
cat("Pearson cost range:", round(range(cost_pearson), 3), "\n")
```

### Spearman Correlation

Uses rank-transformed data:

$$\rho_{Spearman}(x, y) = \rho_{Pearson}(rank(x), rank(y))$$

```{r spearman-demo}
cost_spearman <- compute_cost_matrix(sc_norm[,1:10], st_norm[,1:5], method = "spearman")
cat("Spearman cost range:", round(range(cost_spearman), 3), "\n")
```

### Euclidean Distance

$$d_{Euclidean}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$$

```{r euclidean-demo}
cost_euclidean <- compute_cost_matrix(sc_norm[,1:10], st_norm[,1:5], method = "euclidean")
cat("Euclidean cost range:", round(range(cost_euclidean), 3), "\n")
```

## Normalization

Expression data is normalized using TPM + log2 transformation:

$$x'_{ij} = \log_2\left(\frac{x_{ij}}{\sum_i x_{ij}} \times 10^6 + 1\right)$$

```{r normalization-demo}
# Raw counts
raw_counts <- matrix(rpois(100, lambda = 50), nrow = 10, ncol = 10)

# Normalize
normalized <- normalize_expression(raw_counts)

cat("Raw column sums:", head(colSums(raw_counts)), "\n")
cat("Normalized column sums:", round(head(colSums(normalized)), 2), "\n")
```

## Handling Large Datasets

For large datasets, CytoSPACER uses chunked processing:

1. **Partitioning**: Divide cells into chunks based on `chunk_size`
2. **Parallel Solving**: Solve subproblems independently
3. **Aggregation**: Combine results from all chunks

```{r parallel-demo, eval=FALSE}
# Parallel processing with multiple workers
results <- run_cytospace(
  sc_data = sc_data,
  cell_types = cell_types,
  st_data = st_data,
  coordinates = coordinates,
  chunk_size = 5000,
  n_workers = 4
)
```

## Tie-Breaking

To ensure deterministic results when multiple optimal solutions exist, small random noise is added to the cost matrix:

```{r noise-demo}
cost_with_noise <- add_noise_to_cost(cost_simple, scale = 1e-10, seed = 42)
cat("Original cost:\n")
print(cost_simple)
cat("\nWith noise (difference):\n")
print(round(cost_with_noise - cost_simple, 12))
```

## References

1. Vahid MR, et al. (2023). High-resolution alignment of single-cell and spatial transcriptomes with CytoSPACE. *Nature Biotechnology* 41, 1543–1548.

2. Jonker R, Volgenant A. (1987). A shortest augmenting path algorithm for dense and sparse linear assignment problems. *Computing* 38(4):325-340.

3. Kuhn HW. (1955). The Hungarian method for the assignment problem. *Naval Research Logistics Quarterly* 2(1-2):83-97.

## Session Info

```{r session-info}
sessionInfo()
```