--- title: "Mathematical Framework and Algorithm" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Mathematical Framework and Algorithm} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, fig.align = "center" ) ``` ## Introduction **CellProgramMapper** implements a projection-based approach for mapping single-cell transcriptomic data onto reference gene expression programs (GEPs). This vignette provides a detailed explanation of the mathematical framework underlying the algorithm. ## Problem Formulation ### Non-Negative Matrix Factorization (NMF) Gene expression programs are typically discovered using Non-Negative Matrix Factorization (NMF). Given an expression matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ (n cells x p genes), NMF seeks to find: $$\mathbf{X} \approx \mathbf{W} \mathbf{H}$$ where: - $\mathbf{W} \in \mathbb{R}^{n \times k}$: Usage matrix (n cells x k programs) - $\mathbf{H} \in \mathbb{R}^{k \times p}$: Spectra matrix (k programs x p genes) The non-negativity constraints $\mathbf{W} \geq 0$ and $\mathbf{H} \geq 0$ ensure biological interpretability. ### Projection Problem CellProgramMapper addresses a related but distinct problem: given a **fixed** reference spectra matrix $\mathbf{H}$, estimate the usage matrix $\mathbf{W}$ for new query data $\mathbf{X}$. This is formulated as: $$\min_{\mathbf{W} \geq 0} \|\mathbf{X} - \mathbf{W}\mathbf{H}\|_F^2$$ where $\|\cdot\|_F$ denotes the Frobenius norm. ## Cell-wise Decomposition The objective function decomposes over cells: $$\|\mathbf{X} - \mathbf{W}\mathbf{H}\|_F^2 = \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{H}^\top \mathbf{w}_i\|_2^2$$ where $\mathbf{x}_i \in \mathbb{R}^{p}$ is the expression vector for cell $i$, and $\mathbf{w}_i \in \mathbb{R}^{k}$ is its usage vector. This allows us to solve $n$ independent **Non-Negative Least Squares (NNLS)** problems: $$\min_{\mathbf{w}_i \geq 0} \|\mathbf{x}_i - \mathbf{H}^\top \mathbf{w}_i\|_2^2$$ ## NNLS Problem ### Standard Form The general NNLS problem is: $$\min_{\mathbf{x} \geq 0} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2$$ In our context: - $\mathbf{A} = \mathbf{H}^\top$ (p genes x k programs) - $\mathbf{x} = \mathbf{w}_i$ (k programs x 1) - $\mathbf{b} = \mathbf{x}_i$ (p genes x 1) ### KKT Conditions The optimal solution satisfies the Karush-Kuhn-Tucker (KKT) conditions: 1. **Primal feasibility**: $\mathbf{x}^* \geq 0$ 2. **Stationarity**: $\nabla f(\mathbf{x}^*) = \mathbf{A}^\top(\mathbf{A}\mathbf{x}^* - \mathbf{b}) \geq 0$ for components where $x_j^* = 0$ 3. **Complementary slackness**: $x_j^* \cdot [\mathbf{A}^\top(\mathbf{A}\mathbf{x}^* - \mathbf{b})]_j = 0$ ## Data Preprocessing ### Scaling Before NNLS fitting, the query data is scaled by dividing each gene by its **population standard deviation** (without centering): $$x'_{ij} = \frac{x_{ij}}{\sigma_j}$$ where: $$\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}$$ This matches the preprocessing used in Python's `sklearn.preprocessing.scale(X, with_mean=False)` and ensures consistency with the Python implementation. ### Why Not Center? Centering (subtracting the mean) is avoided because: 1. Single-cell count data is inherently non-negative 2. Centering would violate the non-negativity assumption 3. The reference spectra are trained on non-centered data ## Usage Normalization After obtaining the raw usage matrix $\mathbf{W}$, rows are normalized to sum to 1: $$\tilde{w}_{ij} = \frac{w_{ij}}{\sum_{l=1}^{k} w_{il}}$$ This normalized usage represents the **relative contribution** of each program to a cell's expression profile. ## Computational Complexity For a single cell with $p$ genes and $k$ programs: - **Active Set Method**: $O(k^3)$ per iteration, $O(I \cdot k^3)$ total - **Coordinate Descent**: $O(k^2)$ per iteration, $O(I \cdot k^2)$ total where $I$ is the number of iterations. For $n$ cells, the total complexity is $O(n \cdot I \cdot k^2)$ or $O(n \cdot I \cdot k^3)$. ## Visualization of the Algorithm ```{r algorithm-diagram, echo=FALSE, fig.cap="Schematic of the CellProgramMapper algorithm", eval=TRUE} # Create a simple illustration par(mar = c(1, 1, 2, 1)) plot(NULL, xlim = c(0, 10), ylim = c(0, 7), xlab = "", ylab = "", axes = FALSE, main = "CellProgramMapper Algorithm Flow") # Input data box rect(0.5, 5, 2.5, 6.5, col = "#e3f2fd", border = "#1976d2", lwd = 2) text(1.5, 5.75, "Query Data\n(cells x genes)", cex = 0.8) # Reference box rect(0.5, 2.5, 2.5, 4, col = "#fff3e0", border = "#f57c00", lwd = 2) text(1.5, 3.25, "Reference H\n(programs x genes)", cex = 0.8) # Preprocessing rect(3.5, 5, 5.5, 6.5, col = "#f3e5f5", border = "#7b1fa2", lwd = 2) text(4.5, 5.75, "Scale\n(/ std dev)", cex = 0.8) # NNLS rect(3.5, 2.5, 5.5, 4, col = "#e8f5e9", border = "#388e3c", lwd = 2) text(4.5, 3.25, "NNLS\nSolver", cex = 0.8) # Output rect(6.5, 3.5, 9, 5, col = "#ffebee", border = "#d32f2f", lwd = 2) text(7.75, 4.25, "Usage Matrix W\n(cells x programs)", cex = 0.8) # Arrows arrows(2.5, 5.75, 3.5, 5.75, lwd = 2, col = "#666") arrows(4.5, 5, 4.5, 4, lwd = 2, col = "#666") arrows(2.5, 3.25, 3.5, 3.25, lwd = 2, col = "#666") arrows(5.5, 3.5, 6.5, 4, lwd = 2, col = "#666") ``` ## Example: Geometric Interpretation The NNLS solution can be interpreted geometrically. Each cell's expression profile is projected onto the **non-negative cone** spanned by the reference program spectra. ```{r geometric, echo=FALSE, eval=TRUE, fig.cap="Geometric interpretation of NNLS projection"} set.seed(42) # Create 2D example for visualization par(mar = c(4, 4, 2, 1)) plot(NULL, xlim = c(0, 3), ylim = c(0, 3), xlab = "Gene 1", ylab = "Gene 2", main = "NNLS as Projection onto Non-negative Cone") # Reference vectors (columns of H^T) arrows(0, 0, 2, 0.5, col = "#1976d2", lwd = 3, length = 0.15) arrows(0, 0, 0.8, 2.2, col = "#388e3c", lwd = 3, length = 0.15) text(2.1, 0.3, "H1'", col = "#1976d2", cex = 1.2) text(0.6, 2.4, "H2'", col = "#388e3c", cex = 1.2) # Shade the cone polygon(c(0, 4, 1.6), c(0, 1, 4.4), col = rgb(0.5, 0.5, 0.5, 0.1), border = NA) # Query point points(1.8, 1.5, pch = 19, col = "#d32f2f", cex = 1.5) text(1.95, 1.65, "x (query)", col = "#d32f2f", cex = 1) # Projected point (w1*H1' + w2*H2') proj_x <- 1.2 proj_y <- 1.1 points(proj_x, proj_y, pch = 17, col = "#7b1fa2", cex = 1.5) text(proj_x + 0.15, proj_y - 0.15, "Hw (projection)", col = "#7b1fa2", cex = 1) # Dashed line from query to projection segments(1.8, 1.5, proj_x, proj_y, lty = 2, col = "#666", lwd = 1.5) legend("topright", legend = c("Reference spectra", "Query point", "NNLS projection"), col = c("#1976d2", "#d32f2f", "#7b1fa2"), pch = c(NA, 19, 17), lwd = c(3, NA, NA), bty = "n") ``` ## Summary | Component | Description | |-----------|-------------| | **Input** | Query matrix X (cells x genes), Reference H (programs x genes) | | **Output** | Usage matrix W (cells x programs) | | **Preprocessing** | Scale by population standard deviation | | **Solver** | NNLS (Coordinate Descent or Active Set) | | **Post-processing** | Row normalization (optional) | ## References 1. Lee DD, Seung HS (1999). Learning the parts of objects by non-negative matrix factorization. *Nature* 401:788-791. 2. Lawson CL, Hanson RJ (1974). *Solving Least Squares Problems*. Prentice-Hall. 3. Franc V, Hlavac V, Navara M (2005). Sequential Coordinate-Wise Algorithm for the Non-negative Least Squares Problem. *CAIP 2005*. ## Session Info ```{r session} sessionInfo() ```