--- title: "Algorithm Principles" subtitle: "Mathematical Foundations of CellOracleR" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 fig_width: 8 fig_height: 6 vignette: > %\VignetteIndexEntry{Algorithm Principles} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", message = FALSE, warning = FALSE ) ``` ## Introduction CellOracleR implements a mathematical framework for predicting cell state transitions following transcription factor (TF) perturbations. This vignette describes the core algorithms underlying the package. ## 1. Gene Regulatory Network (GRN) Inference ### 1.1 Mathematical Formulation For each target gene $g$, we model its expression as a linear function of its regulators: $$ X_g = \sum_{r \in R_g} \beta_{r,g} \cdot X_r + \epsilon_g $$ where: - $X_g$ is the expression of target gene $g$ - $R_g$ is the set of regulators for gene $g$ - $\beta_{r,g}$ is the regulatory coefficient from regulator $r$ to target $g$ - $\epsilon_g$ is the residual error ### 1.2 Ridge Regression We use Ridge regression (L2 regularization) to estimate $\beta$ coefficients: $$ \hat{\beta} = \arg\min_\beta \left\{ \|X_g - X_R \beta\|_2^2 + \alpha \|\beta\|_2^2 \right\} $$ The closed-form solution is: $$ \hat{\beta} = (X_R^T X_R + \alpha I)^{-1} X_R^T X_g $$ **Why Ridge Regression?** - **Prevents overfitting**: Regularization shrinks coefficients toward zero - **Handles multicollinearity**: Correlated regulators don't cause instability - **Numerical stability**: Always has a unique solution ```{r ridge-demo, eval=FALSE} # Ridge regression in CellOracleR coef_matrix <- fit_grn_coef_matrix( gem = expression_matrix, TFdict = tf_target_dict, alpha = 10 # regularization strength ) ``` ### 1.3 Bootstrap Aggregation (Bagging) To improve robustness, we employ bagging: 1. Sample 80% of cells (without replacement) 2. Fit Ridge regression on the subsample 3. Repeat $B$ times (typically 200) 4. Use median of coefficients as final estimate $$ \hat{\beta}_{final} = \text{median}(\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_B) $$ **Advantages:** - Reduces variance of coefficient estimates - More robust to outlier cells - Provides coefficient stability measures ## 2. Signal Propagation Simulation ### 2.1 Perturbation Model Given a perturbation condition (e.g., TF knockout), we simulate the downstream effects: $$ \Delta X^{(0)} = X_{perturbed} - X_{original} $$ ### 2.2 Iterative Propagation The signal propagates through the GRN iteratively: $$ \Delta X^{(t+1)} = \Delta X^{(t)} \cdot W $$ where $W$ is the coefficient matrix (genes × genes). **Key constraint**: Perturbed gene values are maintained at each step: $$ \Delta X^{(t+1)}_p = \Delta X^{(0)}_p \quad \text{for all perturbed genes } p $$ ### 2.3 Non-negativity Constraint Gene expression cannot be negative: $$ X_{simulated} = \max(0, X_{original} + \Delta X) $$ ```{r propagation-diagram, echo=FALSE, fig.cap="Signal propagation through GRN"} # Create a simple diagram showing signal propagation if (requireNamespace("ggplot2", quietly = TRUE)) { library(ggplot2) # Create nodes nodes <- data.frame( x = c(0, 1, 2, 1, 2), y = c(1, 1.5, 1.5, 0.5, 0.5), label = c("TF (KO)", "Gene A", "Gene B", "Gene C", "Gene D"), color = c("red", "orange", "orange", "yellow", "yellow") ) # Create edges edges <- data.frame( x = c(0, 0, 1, 1, 1.5, 1.5), xend = c(1, 1, 2, 2, 2, 2), y = c(1, 1, 1.5, 0.5, 1.5, 0.5), yend = c(1.5, 0.5, 1.5, 0.5, 1.5, 0.5) ) p <- ggplot() + geom_segment(data = edges, aes(x = x, y = y, xend = xend, yend = yend), arrow = arrow(length = unit(0.3, "cm")), color = "gray50", size = 1) + geom_point(data = nodes, aes(x = x, y = y, fill = color), shape = 21, size = 15) + geom_text(data = nodes, aes(x = x, y = y, label = label), size = 3, fontface = "bold") + scale_fill_identity() + theme_void() + labs(title = "Signal Propagation in GRN", subtitle = "Red: Perturbed TF → Orange: Direct targets → Yellow: Indirect targets") + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"), plot.subtitle = element_text(hjust = 0.5, size = 10)) print(p) } ``` ## 3. Transition Probability Estimation ### 3.1 Correlation-Based Approach We estimate the probability of cell $i$ transitioning to cell $j$ based on: $$ \rho_{ij} = \text{cor}(\Delta X_i, X_j - X_i) $$ This measures how well the predicted expression change aligns with the direction toward cell $j$. ### 3.2 Probability Conversion Correlations are converted to probabilities using an exponential kernel: $$ P_{ij} = \frac{\exp(\rho_{ij} / \sigma)}{\sum_{k \in N_i} \exp(\rho_{ik} / \sigma)} $$ where: - $N_i$ is the set of k-nearest neighbors of cell $i$ - $\sigma$ is a bandwidth parameter (default: 0.05) ### 3.3 Embedding Shift Calculation The expected movement in embedding space: $$ \Delta E_i = \sum_{j \in N_i} P_{ij} \cdot \hat{d}_{ij} $$ where $\hat{d}_{ij}$ is the unit vector from cell $i$ to cell $j$ in embedding space. ## 4. Markov Chain Simulation ### 4.1 State Transition Model Cell fate is modeled as a discrete-time Markov chain: $$ P(S_{t+1} = j | S_t = i) = P_{ij} $$ ### 4.2 Trajectory Simulation For each starting cell, we simulate $N$ steps: ``` Algorithm: Markov Walk Input: transition_prob P, start_cell s, n_steps T Output: trajectory [s_0, s_1, ..., s_T] s_0 ← s for t = 1 to T: sample s_t from Categorical(P[s_{t-1}, :]) return [s_0, s_1, ..., s_T] ``` ### 4.3 Fate Probability For absorbing Markov chains, we can compute the probability of reaching each terminal state: $$ F = (I - Q)^{-1} R $$ where $Q$ is the transient-to-transient transition matrix and $R$ is the transient-to-absorbing matrix. ## 5. Network Analysis Metrics ### 5.1 Centrality Measures **Degree centrality:** $$ C_D(v) = \text{deg}(v) = |N(v)| $$ **Betweenness centrality:** $$ C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}} $$ **Eigenvector centrality:** $$ C_E(v) = \frac{1}{\lambda} \sum_{u \in N(v)} C_E(u) $$ ### 5.2 Network Entropy Degree distribution entropy: $$ H = -\sum_{k} p(k) \log_2 p(k) $$ Higher entropy indicates more heterogeneous connectivity. ## Summary CellOracleR combines: 1. **Ridge regression** for robust GRN inference 2. **Signal propagation** for perturbation simulation 3. **Correlation-based transition probabilities** for fate prediction 4. **Markov chains** for trajectory modeling 5. **Graph theory** for network analysis These mathematical foundations enable accurate prediction of cellular responses to genetic perturbations. ## References 1. Kamimoto, K., et al. (2023). CellOracle: Dissecting cell identity via network inference and in silico gene perturbation. *Molecular Systems Biology*. 2. Hoerl, A.E. & Kennard, R.W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. *Technometrics*. 3. Breiman, L. (1996). Bagging Predictors. *Machine Learning*. ## Session Info ```{r session-info} sessionInfo() ```