---
title: "Mathematical Framework and Algorithm"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Mathematical Framework and Algorithm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  fig.align = "center"
)
```

## Introduction

**CellProgramMapper** implements a projection-based approach for mapping single-cell transcriptomic data onto reference gene expression programs (GEPs). This vignette provides a detailed explanation of the mathematical framework underlying the algorithm.

## Problem Formulation

### Non-Negative Matrix Factorization (NMF)

Gene expression programs are typically discovered using Non-Negative Matrix Factorization (NMF). Given an expression matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ (n cells x p genes), NMF seeks to find:

$$\mathbf{X} \approx \mathbf{W} \mathbf{H}$$

where:

- $\mathbf{W} \in \mathbb{R}^{n \times k}$: Usage matrix (n cells x k programs)
- $\mathbf{H} \in \mathbb{R}^{k \times p}$: Spectra matrix (k programs x p genes)

The non-negativity constraints $\mathbf{W} \geq 0$ and $\mathbf{H} \geq 0$ ensure biological interpretability.

### Projection Problem

CellProgramMapper addresses a related but distinct problem: given a **fixed** reference spectra matrix $\mathbf{H}$, estimate the usage matrix $\mathbf{W}$ for new query data $\mathbf{X}$.

This is formulated as:

$$\min_{\mathbf{W} \geq 0} \|\mathbf{X} - \mathbf{W}\mathbf{H}\|_F^2$$

where $\|\cdot\|_F$ denotes the Frobenius norm.

## Cell-wise Decomposition

The objective function decomposes over cells:

$$\|\mathbf{X} - \mathbf{W}\mathbf{H}\|_F^2 = \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{H}^\top \mathbf{w}_i\|_2^2$$

where $\mathbf{x}_i \in \mathbb{R}^{p}$ is the expression vector for cell $i$, and $\mathbf{w}_i \in \mathbb{R}^{k}$ is its usage vector.

This allows us to solve $n$ independent **Non-Negative Least Squares (NNLS)** problems:

$$\min_{\mathbf{w}_i \geq 0} \|\mathbf{x}_i - \mathbf{H}^\top \mathbf{w}_i\|_2^2$$

## NNLS Problem

### Standard Form

The general NNLS problem is:

$$\min_{\mathbf{x} \geq 0} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2$$

In our context:

- $\mathbf{A} = \mathbf{H}^\top$ (p genes x k programs)
- $\mathbf{x} = \mathbf{w}_i$ (k programs x 1)
- $\mathbf{b} = \mathbf{x}_i$ (p genes x 1)

### KKT Conditions

The optimal solution satisfies the Karush-Kuhn-Tucker (KKT) conditions:

1. **Primal feasibility**: $\mathbf{x}^* \geq 0$
2. **Stationarity**: $\nabla f(\mathbf{x}^*) = \mathbf{A}^\top(\mathbf{A}\mathbf{x}^* - \mathbf{b}) \geq 0$ for components where $x_j^* = 0$
3. **Complementary slackness**: $x_j^* \cdot [\mathbf{A}^\top(\mathbf{A}\mathbf{x}^* - \mathbf{b})]_j = 0$

## Data Preprocessing

### Scaling

Before NNLS fitting, the query data is scaled by dividing each gene by its **population standard deviation** (without centering):

$$x'_{ij} = \frac{x_{ij}}{\sigma_j}$$

where:

$$\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}$$

This matches the preprocessing used in Python's `sklearn.preprocessing.scale(X, with_mean=False)` and ensures consistency with the Python implementation.

### Why Not Center?

Centering (subtracting the mean) is avoided because:

1. Single-cell count data is inherently non-negative
2. Centering would violate the non-negativity assumption
3. The reference spectra are trained on non-centered data

## Usage Normalization

After obtaining the raw usage matrix $\mathbf{W}$, rows are normalized to sum to 1:

$$\tilde{w}_{ij} = \frac{w_{ij}}{\sum_{l=1}^{k} w_{il}}$$

This normalized usage represents the **relative contribution** of each program to a cell's expression profile.

## Computational Complexity

For a single cell with $p$ genes and $k$ programs:

- **Active Set Method**: $O(k^3)$ per iteration, $O(I \cdot k^3)$ total
- **Coordinate Descent**: $O(k^2)$ per iteration, $O(I \cdot k^2)$ total

where $I$ is the number of iterations.

For $n$ cells, the total complexity is $O(n \cdot I \cdot k^2)$ or $O(n \cdot I \cdot k^3)$.

## Visualization of the Algorithm

```{r algorithm-diagram, echo=FALSE, fig.cap="Schematic of the CellProgramMapper algorithm", eval=TRUE}
# Create a simple illustration
par(mar = c(1, 1, 2, 1))
plot(NULL, xlim = c(0, 10), ylim = c(0, 7), 
     xlab = "", ylab = "", axes = FALSE,
     main = "CellProgramMapper Algorithm Flow")

# Input data box
rect(0.5, 5, 2.5, 6.5, col = "#e3f2fd", border = "#1976d2", lwd = 2)
text(1.5, 5.75, "Query Data\n(cells x genes)", cex = 0.8)

# Reference box
rect(0.5, 2.5, 2.5, 4, col = "#fff3e0", border = "#f57c00", lwd = 2)
text(1.5, 3.25, "Reference H\n(programs x genes)", cex = 0.8)

# Preprocessing
rect(3.5, 5, 5.5, 6.5, col = "#f3e5f5", border = "#7b1fa2", lwd = 2)
text(4.5, 5.75, "Scale\n(/ std dev)", cex = 0.8)

# NNLS
rect(3.5, 2.5, 5.5, 4, col = "#e8f5e9", border = "#388e3c", lwd = 2)
text(4.5, 3.25, "NNLS\nSolver", cex = 0.8)

# Output
rect(6.5, 3.5, 9, 5, col = "#ffebee", border = "#d32f2f", lwd = 2)
text(7.75, 4.25, "Usage Matrix W\n(cells x programs)", cex = 0.8)

# Arrows
arrows(2.5, 5.75, 3.5, 5.75, lwd = 2, col = "#666")
arrows(4.5, 5, 4.5, 4, lwd = 2, col = "#666")
arrows(2.5, 3.25, 3.5, 3.25, lwd = 2, col = "#666")
arrows(5.5, 3.5, 6.5, 4, lwd = 2, col = "#666")
```

## Example: Geometric Interpretation

The NNLS solution can be interpreted geometrically. Each cell's expression profile is projected onto the **non-negative cone** spanned by the reference program spectra.

```{r geometric, echo=FALSE, eval=TRUE, fig.cap="Geometric interpretation of NNLS projection"}
set.seed(42)

# Create 2D example for visualization
par(mar = c(4, 4, 2, 1))
plot(NULL, xlim = c(0, 3), ylim = c(0, 3),
     xlab = "Gene 1", ylab = "Gene 2",
     main = "NNLS as Projection onto Non-negative Cone")

# Reference vectors (columns of H^T)
arrows(0, 0, 2, 0.5, col = "#1976d2", lwd = 3, length = 0.15)
arrows(0, 0, 0.8, 2.2, col = "#388e3c", lwd = 3, length = 0.15)

text(2.1, 0.3, "H1'", col = "#1976d2", cex = 1.2)
text(0.6, 2.4, "H2'", col = "#388e3c", cex = 1.2)

# Shade the cone
polygon(c(0, 4, 1.6), c(0, 1, 4.4), 
        col = rgb(0.5, 0.5, 0.5, 0.1), border = NA)

# Query point
points(1.8, 1.5, pch = 19, col = "#d32f2f", cex = 1.5)
text(1.95, 1.65, "x (query)", col = "#d32f2f", cex = 1)

# Projected point (w1*H1' + w2*H2')
proj_x <- 1.2
proj_y <- 1.1
points(proj_x, proj_y, pch = 17, col = "#7b1fa2", cex = 1.5)
text(proj_x + 0.15, proj_y - 0.15, "Hw (projection)", col = "#7b1fa2", cex = 1)

# Dashed line from query to projection
segments(1.8, 1.5, proj_x, proj_y, lty = 2, col = "#666", lwd = 1.5)

legend("topright", 
       legend = c("Reference spectra", "Query point", "NNLS projection"),
       col = c("#1976d2", "#d32f2f", "#7b1fa2"),
       pch = c(NA, 19, 17),
       lwd = c(3, NA, NA),
       bty = "n")
```

## Summary

| Component | Description |
|-----------|-------------|
| **Input** | Query matrix X (cells x genes), Reference H (programs x genes) |
| **Output** | Usage matrix W (cells x programs) |
| **Preprocessing** | Scale by population standard deviation |
| **Solver** | NNLS (Coordinate Descent or Active Set) |
| **Post-processing** | Row normalization (optional) |

## References

1. Lee DD, Seung HS (1999). Learning the parts of objects by non-negative matrix factorization. *Nature* 401:788-791.

2. Lawson CL, Hanson RJ (1974). *Solving Least Squares Problems*. Prentice-Hall.

3. Franc V, Hlavac V, Navara M (2005). Sequential Coordinate-Wise Algorithm for the Non-negative Least Squares Problem. *CAIP 2005*.

## Session Info

```{r session}
sessionInfo()
```