---
title: "GRN Inference Details"
subtitle: "Building Gene Regulatory Networks from Single-Cell Data"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_width: 8
    fig_height: 6
vignette: >
  %\VignetteIndexEntry{GRN Inference Details}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center",
  message = FALSE,
  warning = FALSE
)
library(ggplot2)
library(Matrix)
```

## Overview

Gene Regulatory Network (GRN) inference is the foundation of CellOracleR analysis. This vignette provides detailed documentation of the GRN inference pipeline.

## Pipeline Architecture

```{r pipeline-diagram, echo=FALSE, fig.height=4, fig.width=10}
# Create pipeline diagram
pipeline_data <- data.frame(
  x = c(1, 3, 5, 7, 9),
  label = c("scRNA-seq\nData", "TF-Target\nPriors", "Ridge\nRegression", "Bagging\nAggregation", "Links\nObject"),
  shape = c("rect", "rect", "diamond", "diamond", "rect"),
  fill = c("#e1f5fe", "#e8f5e9", "#fff3e0", "#fff3e0", "#fce4ec")
)

arrows <- data.frame(
  x = c(1.5, 3.5, 5.5, 7.5),
  xend = c(2.5, 4.5, 6.5, 8.5)
)

ggplot() +
  geom_rect(data = pipeline_data, 
            aes(xmin = x - 0.8, xmax = x + 0.8, ymin = -0.5, ymax = 0.5, fill = fill),
            color = "black", size = 0.5) +
  geom_segment(data = arrows, 
               aes(x = x, xend = xend, y = 0, yend = 0),
               arrow = arrow(length = unit(0.3, "cm")), size = 1.2) +
  geom_text(data = pipeline_data, aes(x = x, y = 0, label = label), 
            size = 3.5, fontface = "bold") +
  scale_fill_identity() +
  theme_void() +
  coord_fixed(ratio = 0.8) +
  labs(title = "CellOracleR GRN Inference Pipeline") +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
```

## Step 1: Input Data Preparation

### Expression Matrix

CellOracleR requires a normalized expression matrix (genes × cells):

```{r expression-format, eval=FALSE}
# From Seurat object
library(Seurat)
library(CellOracleR)

# Create Oracle object
oracle <- create_oracle(
  seurat_obj,
  cluster_col = "seurat_clusters",
  embedding_name = "umap"
)

# Extract expression data
expr_matrix <- oracle$get_expression()
dim(expr_matrix)  # genes × cells
```

**Recommendations:**

- Use normalized data (log-transformed)
- Filter low-quality cells and genes
- 2,000-5,000 highly variable genes typically sufficient

### TF-Target Prior Dictionary

The TF-target dictionary specifies which TFs can potentially regulate each gene:

```{r tfdict-format, eval=FALSE}
# Structure: list with gene names as keys
# Each element contains a character vector of potential TF regulators

TFdict <- list(
  Gene1 = c("TF_A", "TF_B", "TF_C"),
  Gene2 = c("TF_A", "TF_D"),
  Gene3 = c("TF_B", "TF_C", "TF_D", "TF_E")
)
```

**Sources for TF-target priors:**

1. **Motif analysis** (recommended): Scan promoter sequences for TF binding motifs
2. **ChIP-seq data**: Experimentally determined TF binding sites
3. **Literature curation**: Known regulatory relationships
4. **ATAC-seq + Cicero**: Accessibility-based predictions

## Step 2: Cluster-Specific GRN Fitting

GRNs are fitted separately for each cell cluster to capture cell-type-specific regulatory logic.

### Per-Cluster Fitting

```{r per-cluster, eval=FALSE}
# Fit GRN for specific cluster
oracle$fit_grn_for_simulation(
  GRN_unit = "cluster",          # Group by cluster
  alpha = 10,                     # Regularization strength
  bagging_number = 200,           # Number of bootstrap iterations
  use_cache = TRUE                # Cache intermediate results
)
```

**Why cluster-specific?**

```{r cluster-diagram, echo=FALSE, fig.height=4, fig.width=8}
# Illustrate cluster-specific GRNs
set.seed(42)
n <- 100

# Create mock data showing different regulatory relationships
cluster_data <- data.frame(
  TF_A = c(rnorm(n, 5, 1), rnorm(n, 3, 1)),
  Gene_X = c(rnorm(n, 5, 1) + 0.8 * rnorm(n, 5, 1), 
             rnorm(n, 3, 1) - 0.5 * rnorm(n, 3, 1)),
  cluster = rep(c("Cluster 1", "Cluster 2"), each = n)
)

ggplot(cluster_data, aes(x = TF_A, y = Gene_X, color = cluster)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, size = 1.5) +
  facet_wrap(~cluster) +
  scale_color_manual(values = c("#2196F3", "#F44336")) +
  labs(
    title = "TF-Target Relationships Differ Across Cell Types",
    subtitle = "Same TF can activate in one context, repress in another",
    x = "TF_A Expression",
    y = "Gene_X Expression"
  ) +
  theme_bw() +
  theme(
    legend.position = "none",
    strip.background = element_rect(fill = "gray90"),
    strip.text = element_text(face = "bold", size = 11)
  )
```

## Step 3: Ridge Regression Details

### Regularization Parameter (α)

The regularization parameter controls the bias-variance trade-off:

```{r alpha-effect, echo=FALSE, fig.height=4, fig.width=10}
# Demonstrate effect of alpha
set.seed(42)
alpha_values <- c(0.1, 1, 10, 100)
n_points <- 50

alpha_demo <- do.call(rbind, lapply(alpha_values, function(a) {
  x <- seq(-2, 2, length.out = n_points)
  # Simulate coefficient shrinkage
  coef_raw <- c(2, -1.5, 0.5, -0.3, 0.1)
  coef_shrunk <- coef_raw / (1 + a/10)
  
  data.frame(
    alpha = paste("α =", a),
    original = coef_raw,
    shrunk = coef_shrunk,
    coef_id = paste("β", 1:5)
  )
}))

ggplot(alpha_demo, aes(x = coef_id)) +
  geom_segment(aes(xend = coef_id, y = 0, yend = original), 
               color = "gray70", size = 1) +
  geom_point(aes(y = original), color = "gray70", size = 3) +
  geom_point(aes(y = shrunk), color = "#2196F3", size = 4) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  facet_wrap(~alpha, nrow = 1) +
  labs(
    title = "Effect of Regularization Parameter on Coefficients",
    subtitle = "Gray: Original OLS estimates | Blue: Ridge estimates",
    x = "Coefficient",
    y = "Value"
  ) +
  theme_bw() +
  theme(
    strip.background = element_rect(fill = "#e3f2fd"),
    strip.text = element_text(face = "bold")
  )
```

**Guidelines for choosing α:**

| α value | Effect | Use case |
|---------|--------|----------|
| 0.1-1 | Minimal shrinkage | High-quality data, few predictors |
| 1-10 | Moderate shrinkage | Standard scRNA-seq (recommended) |
| 10-100 | Strong shrinkage | Noisy data, many predictors |
| >100 | Heavy shrinkage | Highly correlated predictors |

**Default: α = 10** - provides good balance for typical scRNA-seq data.

### Handling Zero Variance

When a gene has zero variance (constant expression), special handling is required:

```{r zero-var, eval=FALSE}
# CellOracleR automatically handles this:
# - Zero-variance genes get coefficient = 0
# - A small constant is added to avoid numerical issues
```

## Step 4: Bootstrap Aggregation

### Algorithm

```{r bagging-algo, eval=FALSE}
# Pseudocode for bagging
bagging_coefficients <- function(X, y, n_bootstrap = 200, sample_ratio = 0.8) {
  n_cells <- ncol(X)
  n_sample <- floor(n_cells * sample_ratio)
  
  coef_list <- list()
  
  for (b in 1:n_bootstrap) {
    # Random subsample without replacement
    idx <- sample(n_cells, n_sample, replace = FALSE)
    
    # Fit Ridge regression
    coef_list[[b]] <- ridge_fit(X[, idx], y[idx], alpha = 10)
  }
  
  # Aggregate: use median (robust to outliers)
  final_coef <- apply(do.call(cbind, coef_list), 1, median)
  
  return(final_coef)
}
```

### Coefficient Stability Assessment

```{r stability-plot, echo=FALSE, fig.height=4, fig.width=8}
# Simulate coefficient distributions from bagging
set.seed(42)
n_boot <- 200
coef_names <- c("TF_A", "TF_B", "TF_C", "TF_D", "TF_E")

# Simulate: TF_A and TF_B are stable, TF_C-E are unstable
boot_results <- data.frame(
  TF = rep(coef_names, each = n_boot),
  coefficient = c(
    rnorm(n_boot, 0.8, 0.05),   # TF_A: stable positive
    rnorm(n_boot, -0.6, 0.04),  # TF_B: stable negative
    rnorm(n_boot, 0.1, 0.15),   # TF_C: unstable
    rnorm(n_boot, 0.05, 0.2),   # TF_D: very unstable
    rnorm(n_boot, -0.02, 0.25)  # TF_E: noise
  )
)

# Calculate stability (inverse of CV)
stability_stats <- boot_results |>
  transform(TF = factor(TF, levels = coef_names)) |>
  aggregate(coefficient ~ TF, FUN = function(x) c(mean = mean(x), sd = sd(x)))

stability_df <- data.frame(
  TF = stability_stats$TF,
  mean = stability_stats$coefficient[, "mean"],
  sd = stability_stats$coefficient[, "sd"]
)

ggplot(boot_results, aes(x = TF, y = coefficient, fill = TF)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +
  scale_fill_manual(values = c("#4CAF50", "#F44336", "#FFC107", "#9C27B0", "#607D8B")) +
  labs(
    title = "Coefficient Distributions Across Bootstrap Iterations",
    subtitle = "Narrow distributions indicate stable, reliable coefficients",
    x = "Transcription Factor",
    y = "Coefficient Value"
  ) +
  theme_bw() +
  theme(legend.position = "none")
```

## Step 5: Links Object

The `Links` class stores and analyzes the fitted GRN:

```{r links-overview, eval=FALSE}
# Get Links object from Oracle
links <- oracle$get_links(
  cluster_name_for_GRN_unit = "Cluster1"
)

# Explore the network
links$filter(
  threshold_p = 0.001,      # p-value threshold
  threshold_coef = 0.1      # coefficient magnitude threshold
)

# Calculate network metrics
links$get_network_score()

# Export for visualization
links$get_igraph()
```

### Links Data Structure

```{r links-structure, echo=FALSE}
# Show example Links structure
example_links <- data.frame(
  source = c("TF_A", "TF_A", "TF_B", "TF_B", "TF_C"),
  target = c("Gene_1", "Gene_2", "Gene_1", "Gene_3", "Gene_2"),
  coef = c(0.85, 0.42, -0.61, 0.33, 0.71),
  p_value = c(1e-15, 1e-8, 1e-12, 1e-5, 1e-10)
)

knitr::kable(
  example_links,
  caption = "Example Links Data Frame Structure",
  col.names = c("Source (TF)", "Target (Gene)", "Coefficient", "P-value"),
  digits = c(NA, NA, 3, 2)
)
```

## Performance Considerations

### Computational Complexity

| Component | Time Complexity | Memory |
|-----------|-----------------|--------|
| Ridge regression | O(p³) per gene | O(p²) |
| Bagging (B iterations) | O(B × n × p³) | O(p²) |
| Full GRN | O(G × B × n × p³) | O(G × p) |

Where: G = genes, n = cells, p = max TFs per gene, B = bootstrap iterations

### Parallelization

CellOracleR uses the `future` package for parallel processing:

```{r parallel, eval=FALSE}
library(future)

# Setup parallel processing
plan(multisession, workers = 4)

# Now GRN fitting will use 4 cores
oracle$fit_grn_for_simulation(
  bagging_number = 200
)

# Reset to sequential
plan(sequential)
```

### C++ Acceleration

Performance-critical operations are implemented in C++ via Rcpp:

- Matrix multiplication in signal propagation
- K-nearest neighbor weight calculation  
- Correlation computation
- Markov chain simulation

## Best Practices

### 1. Data Quality

```{r best-practice, eval=FALSE}
# Filter before GRN inference
seurat_obj <- subset(seurat_obj, 
                     nFeature_RNA > 500 & 
                     nCount_RNA > 1000 &
                     percent.mt < 20)
```

### 2. Appropriate Clustering

- Use biological meaningful clusters
- Aim for 500-5000 cells per cluster
- Very small clusters may produce unstable GRNs

### 3. TF-Target Prior Quality

- High-quality priors improve inference accuracy
- Consider using multiple evidence sources
- Validate key relationships when possible

## Summary

The CellOracleR GRN inference pipeline:

1. **Prepares** expression data and TF-target priors
2. **Fits** cluster-specific Ridge regression models
3. **Stabilizes** estimates through bootstrap aggregation
4. **Stores** results in the Links class for downstream analysis

This produces robust, interpretable gene regulatory networks suitable for perturbation simulation.

## Session Info

```{r session}
sessionInfo()
```