---
title: "Algorithm Principles"
subtitle: "Mathematical Foundations of CellOracleR"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_width: 8
    fig_height: 6
vignette: >
  %\VignetteIndexEntry{Algorithm Principles}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center",
  message = FALSE,
  warning = FALSE
)
```

## Introduction

CellOracleR implements a mathematical framework for predicting cell state transitions following transcription factor (TF) perturbations. This vignette describes the core algorithms underlying the package.

## 1. Gene Regulatory Network (GRN) Inference

### 1.1 Mathematical Formulation

For each target gene $g$, we model its expression as a linear function of its regulators:

$$
X_g = \sum_{r \in R_g} \beta_{r,g} \cdot X_r + \epsilon_g
$$

where:

- $X_g$ is the expression of target gene $g$
- $R_g$ is the set of regulators for gene $g$
- $\beta_{r,g}$ is the regulatory coefficient from regulator $r$ to target $g$
- $\epsilon_g$ is the residual error

### 1.2 Ridge Regression

We use Ridge regression (L2 regularization) to estimate $\beta$ coefficients:

$$
\hat{\beta} = \arg\min_\beta \left\{ \|X_g - X_R \beta\|_2^2 + \alpha \|\beta\|_2^2 \right\}
$$

The closed-form solution is:

$$
\hat{\beta} = (X_R^T X_R + \alpha I)^{-1} X_R^T X_g
$$

**Why Ridge Regression?**

- **Prevents overfitting**: Regularization shrinks coefficients toward zero
- **Handles multicollinearity**: Correlated regulators don't cause instability
- **Numerical stability**: Always has a unique solution

```{r ridge-demo, eval=FALSE}
# Ridge regression in CellOracleR
coef_matrix <- fit_grn_coef_matrix(
  gem = expression_matrix,
  TFdict = tf_target_dict,
  alpha = 10  # regularization strength
)
```

### 1.3 Bootstrap Aggregation (Bagging)

To improve robustness, we employ bagging:

1. Sample 80% of cells (without replacement)
2. Fit Ridge regression on the subsample
3. Repeat $B$ times (typically 200)
4. Use median of coefficients as final estimate

$$
\hat{\beta}_{final} = \text{median}(\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_B)
$$

**Advantages:**

- Reduces variance of coefficient estimates
- More robust to outlier cells
- Provides coefficient stability measures

## 2. Signal Propagation Simulation

### 2.1 Perturbation Model

Given a perturbation condition (e.g., TF knockout), we simulate the downstream effects:

$$
\Delta X^{(0)} = X_{perturbed} - X_{original}
$$

### 2.2 Iterative Propagation

The signal propagates through the GRN iteratively:

$$
\Delta X^{(t+1)} = \Delta X^{(t)} \cdot W
$$

where $W$ is the coefficient matrix (genes × genes).

**Key constraint**: Perturbed gene values are maintained at each step:

$$
\Delta X^{(t+1)}_p = \Delta X^{(0)}_p \quad \text{for all perturbed genes } p
$$

### 2.3 Non-negativity Constraint

Gene expression cannot be negative:

$$
X_{simulated} = \max(0, X_{original} + \Delta X)
$$

```{r propagation-diagram, echo=FALSE, fig.cap="Signal propagation through GRN"}
# Create a simple diagram showing signal propagation
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  
  # Create nodes
  nodes <- data.frame(
    x = c(0, 1, 2, 1, 2),
    y = c(1, 1.5, 1.5, 0.5, 0.5),
    label = c("TF (KO)", "Gene A", "Gene B", "Gene C", "Gene D"),
    color = c("red", "orange", "orange", "yellow", "yellow")
  )
  
  # Create edges
  edges <- data.frame(
    x = c(0, 0, 1, 1, 1.5, 1.5),
    xend = c(1, 1, 2, 2, 2, 2),
    y = c(1, 1, 1.5, 0.5, 1.5, 0.5),
    yend = c(1.5, 0.5, 1.5, 0.5, 1.5, 0.5)
  )
  
  p <- ggplot() +
    geom_segment(data = edges, aes(x = x, y = y, xend = xend, yend = yend),
                 arrow = arrow(length = unit(0.3, "cm")), 
                 color = "gray50", size = 1) +
    geom_point(data = nodes, aes(x = x, y = y, fill = color), 
               shape = 21, size = 15) +
    geom_text(data = nodes, aes(x = x, y = y, label = label), 
              size = 3, fontface = "bold") +
    scale_fill_identity() +
    theme_void() +
    labs(title = "Signal Propagation in GRN",
         subtitle = "Red: Perturbed TF → Orange: Direct targets → Yellow: Indirect targets") +
    theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
          plot.subtitle = element_text(hjust = 0.5, size = 10))
  
  print(p)
}
```

## 3. Transition Probability Estimation

### 3.1 Correlation-Based Approach

We estimate the probability of cell $i$ transitioning to cell $j$ based on:

$$
\rho_{ij} = \text{cor}(\Delta X_i, X_j - X_i)
$$

This measures how well the predicted expression change aligns with the direction toward cell $j$.

### 3.2 Probability Conversion

Correlations are converted to probabilities using an exponential kernel:

$$
P_{ij} = \frac{\exp(\rho_{ij} / \sigma)}{\sum_{k \in N_i} \exp(\rho_{ik} / \sigma)}
$$

where:

- $N_i$ is the set of k-nearest neighbors of cell $i$
- $\sigma$ is a bandwidth parameter (default: 0.05)

### 3.3 Embedding Shift Calculation

The expected movement in embedding space:

$$
\Delta E_i = \sum_{j \in N_i} P_{ij} \cdot \hat{d}_{ij}
$$

where $\hat{d}_{ij}$ is the unit vector from cell $i$ to cell $j$ in embedding space.

## 4. Markov Chain Simulation

### 4.1 State Transition Model

Cell fate is modeled as a discrete-time Markov chain:

$$
P(S_{t+1} = j | S_t = i) = P_{ij}
$$

### 4.2 Trajectory Simulation

For each starting cell, we simulate $N$ steps:

```
Algorithm: Markov Walk
Input: transition_prob P, start_cell s, n_steps T
Output: trajectory [s_0, s_1, ..., s_T]

s_0 ← s
for t = 1 to T:
    sample s_t from Categorical(P[s_{t-1}, :])
return [s_0, s_1, ..., s_T]
```

### 4.3 Fate Probability

For absorbing Markov chains, we can compute the probability of reaching each terminal state:

$$
F = (I - Q)^{-1} R
$$

where $Q$ is the transient-to-transient transition matrix and $R$ is the transient-to-absorbing matrix.

## 5. Network Analysis Metrics

### 5.1 Centrality Measures

**Degree centrality:**
$$
C_D(v) = \text{deg}(v) = |N(v)|
$$

**Betweenness centrality:**
$$
C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
$$

**Eigenvector centrality:**
$$
C_E(v) = \frac{1}{\lambda} \sum_{u \in N(v)} C_E(u)
$$

### 5.2 Network Entropy

Degree distribution entropy:

$$
H = -\sum_{k} p(k) \log_2 p(k)
$$

Higher entropy indicates more heterogeneous connectivity.

## Summary

CellOracleR combines:

1. **Ridge regression** for robust GRN inference
2. **Signal propagation** for perturbation simulation  
3. **Correlation-based transition probabilities** for fate prediction
4. **Markov chains** for trajectory modeling
5. **Graph theory** for network analysis

These mathematical foundations enable accurate prediction of cellular responses to genetic perturbations.

## References

1. Kamimoto, K., et al. (2023). CellOracle: Dissecting cell identity via network inference and in silico gene perturbation. *Molecular Systems Biology*.

2. Hoerl, A.E. & Kennard, R.W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. *Technometrics*.

3. Breiman, L. (1996). Bagging Predictors. *Machine Learning*.

## Session Info

```{r session-info}
sessionInfo()
```