--- title: "GRN Inference Details" subtitle: "Building Gene Regulatory Networks from Single-Cell Data" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 fig_width: 8 fig_height: 6 vignette: > %\VignetteIndexEntry{GRN Inference Details} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align = "center", message = FALSE, warning = FALSE ) library(ggplot2) library(Matrix) ``` ## Overview Gene Regulatory Network (GRN) inference is the foundation of CellOracleR analysis. This vignette provides detailed documentation of the GRN inference pipeline. ## Pipeline Architecture ```{r pipeline-diagram, echo=FALSE, fig.height=4, fig.width=10} # Create pipeline diagram pipeline_data <- data.frame( x = c(1, 3, 5, 7, 9), label = c("scRNA-seq\nData", "TF-Target\nPriors", "Ridge\nRegression", "Bagging\nAggregation", "Links\nObject"), shape = c("rect", "rect", "diamond", "diamond", "rect"), fill = c("#e1f5fe", "#e8f5e9", "#fff3e0", "#fff3e0", "#fce4ec") ) arrows <- data.frame( x = c(1.5, 3.5, 5.5, 7.5), xend = c(2.5, 4.5, 6.5, 8.5) ) ggplot() + geom_rect(data = pipeline_data, aes(xmin = x - 0.8, xmax = x + 0.8, ymin = -0.5, ymax = 0.5, fill = fill), color = "black", size = 0.5) + geom_segment(data = arrows, aes(x = x, xend = xend, y = 0, yend = 0), arrow = arrow(length = unit(0.3, "cm")), size = 1.2) + geom_text(data = pipeline_data, aes(x = x, y = 0, label = label), size = 3.5, fontface = "bold") + scale_fill_identity() + theme_void() + coord_fixed(ratio = 0.8) + labs(title = "CellOracleR GRN Inference Pipeline") + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold")) ``` ## Step 1: Input Data Preparation ### Expression Matrix CellOracleR requires a normalized expression matrix (genes × cells): ```{r expression-format, eval=FALSE} # From Seurat object library(Seurat) library(CellOracleR) # Create Oracle object oracle <- create_oracle( seurat_obj, cluster_col = "seurat_clusters", embedding_name = "umap" ) # Extract expression data expr_matrix <- oracle$get_expression() dim(expr_matrix) # genes × cells ``` **Recommendations:** - Use normalized data (log-transformed) - Filter low-quality cells and genes - 2,000-5,000 highly variable genes typically sufficient ### TF-Target Prior Dictionary The TF-target dictionary specifies which TFs can potentially regulate each gene: ```{r tfdict-format, eval=FALSE} # Structure: list with gene names as keys # Each element contains a character vector of potential TF regulators TFdict <- list( Gene1 = c("TF_A", "TF_B", "TF_C"), Gene2 = c("TF_A", "TF_D"), Gene3 = c("TF_B", "TF_C", "TF_D", "TF_E") ) ``` **Sources for TF-target priors:** 1. **Motif analysis** (recommended): Scan promoter sequences for TF binding motifs 2. **ChIP-seq data**: Experimentally determined TF binding sites 3. **Literature curation**: Known regulatory relationships 4. **ATAC-seq + Cicero**: Accessibility-based predictions ## Step 2: Cluster-Specific GRN Fitting GRNs are fitted separately for each cell cluster to capture cell-type-specific regulatory logic. ### Per-Cluster Fitting ```{r per-cluster, eval=FALSE} # Fit GRN for specific cluster oracle$fit_grn_for_simulation( GRN_unit = "cluster", # Group by cluster alpha = 10, # Regularization strength bagging_number = 200, # Number of bootstrap iterations use_cache = TRUE # Cache intermediate results ) ``` **Why cluster-specific?** ```{r cluster-diagram, echo=FALSE, fig.height=4, fig.width=8} # Illustrate cluster-specific GRNs set.seed(42) n <- 100 # Create mock data showing different regulatory relationships cluster_data <- data.frame( TF_A = c(rnorm(n, 5, 1), rnorm(n, 3, 1)), Gene_X = c(rnorm(n, 5, 1) + 0.8 * rnorm(n, 5, 1), rnorm(n, 3, 1) - 0.5 * rnorm(n, 3, 1)), cluster = rep(c("Cluster 1", "Cluster 2"), each = n) ) ggplot(cluster_data, aes(x = TF_A, y = Gene_X, color = cluster)) + geom_point(alpha = 0.5) + geom_smooth(method = "lm", se = FALSE, size = 1.5) + facet_wrap(~cluster) + scale_color_manual(values = c("#2196F3", "#F44336")) + labs( title = "TF-Target Relationships Differ Across Cell Types", subtitle = "Same TF can activate in one context, repress in another", x = "TF_A Expression", y = "Gene_X Expression" ) + theme_bw() + theme( legend.position = "none", strip.background = element_rect(fill = "gray90"), strip.text = element_text(face = "bold", size = 11) ) ``` ## Step 3: Ridge Regression Details ### Regularization Parameter (α) The regularization parameter controls the bias-variance trade-off: ```{r alpha-effect, echo=FALSE, fig.height=4, fig.width=10} # Demonstrate effect of alpha set.seed(42) alpha_values <- c(0.1, 1, 10, 100) n_points <- 50 alpha_demo <- do.call(rbind, lapply(alpha_values, function(a) { x <- seq(-2, 2, length.out = n_points) # Simulate coefficient shrinkage coef_raw <- c(2, -1.5, 0.5, -0.3, 0.1) coef_shrunk <- coef_raw / (1 + a/10) data.frame( alpha = paste("α =", a), original = coef_raw, shrunk = coef_shrunk, coef_id = paste("β", 1:5) ) })) ggplot(alpha_demo, aes(x = coef_id)) + geom_segment(aes(xend = coef_id, y = 0, yend = original), color = "gray70", size = 1) + geom_point(aes(y = original), color = "gray70", size = 3) + geom_point(aes(y = shrunk), color = "#2196F3", size = 4) + geom_hline(yintercept = 0, linetype = "dashed") + facet_wrap(~alpha, nrow = 1) + labs( title = "Effect of Regularization Parameter on Coefficients", subtitle = "Gray: Original OLS estimates | Blue: Ridge estimates", x = "Coefficient", y = "Value" ) + theme_bw() + theme( strip.background = element_rect(fill = "#e3f2fd"), strip.text = element_text(face = "bold") ) ``` **Guidelines for choosing α:** | α value | Effect | Use case | |---------|--------|----------| | 0.1-1 | Minimal shrinkage | High-quality data, few predictors | | 1-10 | Moderate shrinkage | Standard scRNA-seq (recommended) | | 10-100 | Strong shrinkage | Noisy data, many predictors | | >100 | Heavy shrinkage | Highly correlated predictors | **Default: α = 10** - provides good balance for typical scRNA-seq data. ### Handling Zero Variance When a gene has zero variance (constant expression), special handling is required: ```{r zero-var, eval=FALSE} # CellOracleR automatically handles this: # - Zero-variance genes get coefficient = 0 # - A small constant is added to avoid numerical issues ``` ## Step 4: Bootstrap Aggregation ### Algorithm ```{r bagging-algo, eval=FALSE} # Pseudocode for bagging bagging_coefficients <- function(X, y, n_bootstrap = 200, sample_ratio = 0.8) { n_cells <- ncol(X) n_sample <- floor(n_cells * sample_ratio) coef_list <- list() for (b in 1:n_bootstrap) { # Random subsample without replacement idx <- sample(n_cells, n_sample, replace = FALSE) # Fit Ridge regression coef_list[[b]] <- ridge_fit(X[, idx], y[idx], alpha = 10) } # Aggregate: use median (robust to outliers) final_coef <- apply(do.call(cbind, coef_list), 1, median) return(final_coef) } ``` ### Coefficient Stability Assessment ```{r stability-plot, echo=FALSE, fig.height=4, fig.width=8} # Simulate coefficient distributions from bagging set.seed(42) n_boot <- 200 coef_names <- c("TF_A", "TF_B", "TF_C", "TF_D", "TF_E") # Simulate: TF_A and TF_B are stable, TF_C-E are unstable boot_results <- data.frame( TF = rep(coef_names, each = n_boot), coefficient = c( rnorm(n_boot, 0.8, 0.05), # TF_A: stable positive rnorm(n_boot, -0.6, 0.04), # TF_B: stable negative rnorm(n_boot, 0.1, 0.15), # TF_C: unstable rnorm(n_boot, 0.05, 0.2), # TF_D: very unstable rnorm(n_boot, -0.02, 0.25) # TF_E: noise ) ) # Calculate stability (inverse of CV) stability_stats <- boot_results |> transform(TF = factor(TF, levels = coef_names)) |> aggregate(coefficient ~ TF, FUN = function(x) c(mean = mean(x), sd = sd(x))) stability_df <- data.frame( TF = stability_stats$TF, mean = stability_stats$coefficient[, "mean"], sd = stability_stats$coefficient[, "sd"] ) ggplot(boot_results, aes(x = TF, y = coefficient, fill = TF)) + geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") + geom_violin(alpha = 0.7) + geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) + scale_fill_manual(values = c("#4CAF50", "#F44336", "#FFC107", "#9C27B0", "#607D8B")) + labs( title = "Coefficient Distributions Across Bootstrap Iterations", subtitle = "Narrow distributions indicate stable, reliable coefficients", x = "Transcription Factor", y = "Coefficient Value" ) + theme_bw() + theme(legend.position = "none") ``` ## Step 5: Links Object The `Links` class stores and analyzes the fitted GRN: ```{r links-overview, eval=FALSE} # Get Links object from Oracle links <- oracle$get_links( cluster_name_for_GRN_unit = "Cluster1" ) # Explore the network links$filter( threshold_p = 0.001, # p-value threshold threshold_coef = 0.1 # coefficient magnitude threshold ) # Calculate network metrics links$get_network_score() # Export for visualization links$get_igraph() ``` ### Links Data Structure ```{r links-structure, echo=FALSE} # Show example Links structure example_links <- data.frame( source = c("TF_A", "TF_A", "TF_B", "TF_B", "TF_C"), target = c("Gene_1", "Gene_2", "Gene_1", "Gene_3", "Gene_2"), coef = c(0.85, 0.42, -0.61, 0.33, 0.71), p_value = c(1e-15, 1e-8, 1e-12, 1e-5, 1e-10) ) knitr::kable( example_links, caption = "Example Links Data Frame Structure", col.names = c("Source (TF)", "Target (Gene)", "Coefficient", "P-value"), digits = c(NA, NA, 3, 2) ) ``` ## Performance Considerations ### Computational Complexity | Component | Time Complexity | Memory | |-----------|-----------------|--------| | Ridge regression | O(p³) per gene | O(p²) | | Bagging (B iterations) | O(B × n × p³) | O(p²) | | Full GRN | O(G × B × n × p³) | O(G × p) | Where: G = genes, n = cells, p = max TFs per gene, B = bootstrap iterations ### Parallelization CellOracleR uses the `future` package for parallel processing: ```{r parallel, eval=FALSE} library(future) # Setup parallel processing plan(multisession, workers = 4) # Now GRN fitting will use 4 cores oracle$fit_grn_for_simulation( bagging_number = 200 ) # Reset to sequential plan(sequential) ``` ### C++ Acceleration Performance-critical operations are implemented in C++ via Rcpp: - Matrix multiplication in signal propagation - K-nearest neighbor weight calculation - Correlation computation - Markov chain simulation ## Best Practices ### 1. Data Quality ```{r best-practice, eval=FALSE} # Filter before GRN inference seurat_obj <- subset(seurat_obj, nFeature_RNA > 500 & nCount_RNA > 1000 & percent.mt < 20) ``` ### 2. Appropriate Clustering - Use biological meaningful clusters - Aim for 500-5000 cells per cluster - Very small clusters may produce unstable GRNs ### 3. TF-Target Prior Quality - High-quality priors improve inference accuracy - Consider using multiple evidence sources - Validate key relationships when possible ## Summary The CellOracleR GRN inference pipeline: 1. **Prepares** expression data and TF-target priors 2. **Fits** cluster-specific Ridge regression models 3. **Stabilizes** estimates through bootstrap aggregation 4. **Stores** results in the Links class for downstream analysis This produces robust, interpretable gene regulatory networks suitable for perturbation simulation. ## Session Info ```{r session} sessionInfo() ```