| Title: | In Silico Gene Perturbation Analysis with Single-Cell Data |
|---|---|
| Description: | An R implementation of the CellOracle framework for in silico gene perturbation analysis and gene regulatory network (GRN) inference from single-cell RNA-seq data. Predicts cell state transitions in response to transcription factor perturbations by combining GRN models with single-cell expression data. Key features include motif analysis for base GRN construction, cluster-specific GRN inference using regularized regression, perturbation simulation with signal propagation, and comprehensive visualization of predicted cell fate changes. Based on the methodology described in Kamimoto et al. (2023) <doi:10.15252/msb.202211547>. |
| Authors: | Zaoqu Liu [aut, cre], Kenji Kamimoto [ctb] (Original CellOracle Python package author) |
| Maintainer: | Zaoqu Liu <[email protected]> |
| License: | Apache License (>= 2) | file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-25 10:51:17 UTC |
| Source: | https://github.com/Zaoqu-Liu/CellOracleR |
Associates peaks with nearby genes based on TSS proximity.
annotate_peaks(peaks_df, ref_genome, upstream = 2000, downstream = 2000)annotate_peaks(peaks_df, ref_genome, upstream = 2000, downstream = 2000)
peaks_df |
Data frame with chr, start, end columns |
ref_genome |
Reference genome |
upstream |
Distance upstream of TSS to consider |
downstream |
Distance downstream of TSS to consider |
Data frame with peak-gene associations
Efficiently assembles the full coefficient matrix from individual gene regression results stored in a list.
build_coef_matrix_cpp(coef_list, gene_names, target_genes)build_coef_matrix_cpp(coef_list, gene_names, target_genes)
coef_list |
List of coefficient vectors (one per target gene) |
gene_names |
Names of all genes |
target_genes |
Names of genes with fitted models |
Full coefficient matrix (genes x genes)
Creates a sparse adjacency matrix from KNN indices where entry (i,j) = 1 if j is a neighbor of i.
build_knn_graph_cpp(knn_idx, n_cells, mutual = FALSE)build_knn_graph_cpp(knn_idx, n_cells, mutual = FALSE)
knn_idx |
KNN index matrix (cells x k), 0-indexed |
n_cells |
Total number of cells |
mutual |
Whether to require mutual neighbors |
Sparse connectivity matrix (cells x cells)
Computes the expected embedding shift (velocity) based on transition probabilities and neighbor positions.
calculate_embedding_shift_cpp(embedding, trans_prob, knn_adj)calculate_embedding_shift_cpp(embedding, trans_prob, knn_adj)
embedding |
Embedding coordinates (cells x 2) |
trans_prob |
Transition probability matrix (cells x cells) |
knn_adj |
KNN adjacency matrix |
Delta embedding matrix (cells x 2)
Computes flow vectors on a regular grid using weighted averaging from nearby cells with Gaussian kernel.
calculate_grid_arrows_cpp( embedding, delta_embedding, grid_points, knn_idx, knn_dist, smooth = 0.5 )calculate_grid_arrows_cpp( embedding, delta_embedding, grid_points, knn_idx, knn_dist, smooth = 0.5 )
embedding |
Cell embedding coordinates (cells x 2) |
delta_embedding |
Cell velocity vectors (cells x 2) |
grid_points |
Grid coordinates (grid_points x 2) |
knn_idx |
KNN indices for each grid point (grid_points x k) |
knn_dist |
KNN distances (grid_points x k) |
smooth |
Smoothing parameter (multiplied by grid step) |
Grid flow vectors (grid_points x 2)
Calculates the relative position of simulated values within the original expression range for out-of-distribution detection.
calculate_relative_ratio_cpp(simulated, reference)calculate_relative_ratio_cpp(simulated, reference)
simulated |
Simulated expression matrix |
reference |
Reference expression matrix |
Matrix of relative ratios (0-1 = within range)
Subtracts column means from each column.
center_cols_cpp(X)center_cols_cpp(X)
X |
Input matrix |
Centered matrix
Cleans transition probability matrix by replacing NaN values with 0 and assigning self-transition probability of 1 for rows that sum to 0.
clean_transition_prob_cpp(trans_prob)clean_transition_prob_cpp(trans_prob)
trans_prob |
Transition probability matrix |
Cleaned transition matrix
Clips simulated gene expression values to stay within the original expression distribution range to avoid out-of-distribution predictions.
clip_to_range_cpp(simulated, original)clip_to_range_cpp(simulated, original)
simulated |
Simulated expression matrix (cells x genes) |
original |
Original expression matrix (cells x genes) |
Clipped simulated matrix
Computes correlation between expression shifts for all cell pairs. More expensive than partial version but useful for validation.
col_delta_cor_full_cpp(X, delta_X)col_delta_cor_full_cpp(X, delta_X)
X |
Expression matrix (cells x genes) |
delta_X |
Simulated expression shift (cells x genes) |
Full correlation matrix (cells x cells)
Computes correlation between expression (X) and simulated shift (delta_X) for each cell with its neighbors. This is the core computation for transition probability estimation.
col_delta_cor_partial_cpp(X, delta_X, neigh_idx)col_delta_cor_partial_cpp(X, delta_X, neigh_idx)
X |
Expression matrix (cells x genes) |
delta_X |
Simulated expression shift (cells x genes) |
neigh_idx |
Neighbor index matrix (cells x k), 0-indexed |
Correlation coefficient matrix (cells x cells)
Computes standard deviation for each column of a matrix.
colSds_cpp(X)colSds_cpp(X)
X |
Input matrix |
Vector of column standard deviations
Computes overlap and differences between cluster GRNs.
compare_networks(links, clusters = NULL)compare_networks(links, clusters = NULL)
links |
Links object |
clusters |
Clusters to compare (default: all) |
List with comparison statistics
Computes KNN in embedding space, optionally restricted to a subset of cells. This is useful for Markov simulation on subsampled cells.
compute_embedding_knn_subset_cpp(embedding, cell_idx, k)compute_embedding_knn_subset_cpp(embedding, cell_idx, k)
embedding |
Full embedding matrix (all cells x dims) |
cell_idx |
Indices of cells to use (0-indexed) |
k |
Number of neighbors |
List with knn_idx and distances
Computes the probability of reaching each terminal state from each starting state.
compute_fate_probability(trans_prob, terminal_states, n_steps = 500)compute_fate_probability(trans_prob, terminal_states, n_steps = 500)
trans_prob |
Transition probability matrix |
terminal_states |
Indices of terminal states |
n_steps |
Number of simulation steps |
Matrix of fate probabilities (cells x terminal states)
Estimates pseudotime ordering based on the perturbation-induced transition probability structure.
compute_pseudotime( oracle, start_cells, method = c("markov", "diffusion"), n_steps = 500, n_duplicates = 100 )compute_pseudotime( oracle, start_cells, method = c("markov", "diffusion"), n_steps = 500, n_duplicates = 100 )
oracle |
Oracle object with transition probabilities |
start_cells |
Starting cell indices (1-indexed) |
method |
Method: "markov" (simulation-based) or "diffusion" |
n_steps |
Number of steps for Markov simulation |
n_duplicates |
Duplicates per cell for Markov simulation |
Named vector of pseudotime values
Converts correlation coefficients to transition probabilities using exponential kernel, restricted to KNN neighbors.
correlation_to_transition_prob_cpp(corrcoef, knn_adj, sigma_corr = 0.05)correlation_to_transition_prob_cpp(corrcoef, knn_adj, sigma_corr = 0.05)
corrcoef |
Correlation coefficient matrix |
knn_adj |
KNN adjacency matrix (sparse, 1 for neighbors) |
sigma_corr |
Correlation kernel bandwidth |
Transition probability matrix (rows sum to 1)
Complete workflow to create base GRN from scATAC-seq peaks.
create_base_grn( peaks_df, ref_genome, motifs = NULL, upstream = 2000, downstream = 2000, fpr = 0.02, min_peaks = 10, n_cores = 1 )create_base_grn( peaks_df, ref_genome, motifs = NULL, upstream = 2000, downstream = 2000, fpr = 0.02, min_peaks = 10, n_cores = 1 )
peaks_df |
Data frame with chr, start, end columns |
ref_genome |
Reference genome |
motifs |
Motif database (NULL for JASPAR) |
upstream |
TSS upstream distance |
downstream |
TSS downstream distance |
fpr |
Motif scanning FPR |
min_peaks |
Minimum peaks for TF filtering |
n_cores |
Number of parallel cores |
TFdict suitable for Oracle$import_TF_data()
Create Oracle object from Seurat
create_oracle(seurat, cluster_column, embedding_name, verbose = TRUE)create_oracle(seurat, cluster_column, embedding_name, verbose = TRUE)
seurat |
Seurat object |
cluster_column |
Cluster column name |
embedding_name |
Embedding name |
verbose |
Print messages |
Oracle object
Helper function to create perturbation condition dictionary.
create_perturb_condition(seurat, genes, values = "knockout")create_perturb_condition(seurat, genes, values = "knockout")
seurat |
Seurat object |
genes |
Gene(s) to perturb |
values |
Value(s) for perturbation. Can be numeric or character (method name like "knockout", "max") |
Named list suitable for simulate_shift
Creates a TFinfo object from peak data frame for motif analysis.
create_tfinfo(peak_df, ref_genome)create_tfinfo(peak_df, ref_genome)
peak_df |
Data frame with columns: chr, start, end, gene_short_name |
ref_genome |
Reference genome (e.g., "hg38", "mm10") |
TFinfo object
Performs iterative signal propagation through the gene regulatory network. For each propagation step, expression changes are computed based on the GRN coefficient matrix, while perturbed gene values are maintained.
do_simulation_cpp(coef_matrix, simulation_input, gem, n_propagation)do_simulation_cpp(coef_matrix, simulation_input, gem, n_propagation)
coef_matrix |
Coefficient matrix (genes x genes) representing GRN weights |
simulation_input |
Initial expression state with perturbation applied (cells x genes) |
gem |
Original gene expression matrix (cells x genes) |
n_propagation |
Number of propagation steps |
Simulated gene expression matrix (cells x genes)
Export network to various formats
export_network( links, format = c("graphml", "gml", "edge_list", "pajek"), file_path, cluster = NULL )export_network( links, format = c("graphml", "gml", "edge_list", "pajek"), file_path, cluster = NULL )
links |
Links object |
format |
Output format: "graphml", "gml", "edge_list", "pajek" |
file_path |
Output file path |
cluster |
Specific cluster (NULL = all) |
Extract expression data from a Seurat object, compatible with both V4 and V5.
extract_expression( seurat, layer = c("data", "counts", "scale.data"), assay = "RNA", as_dense = FALSE )extract_expression( seurat, layer = c("data", "counts", "scale.data"), assay = "RNA", as_dense = FALSE )
seurat |
Seurat object |
layer |
Layer/slot to extract: "counts", "data" (normalized), or "scale.data" |
assay |
Assay name (default: "RNA") |
as_dense |
Convert to dense matrix (default: FALSE) |
Expression matrix (genes x cells)
## Not run: # Get normalized expression expr <- extract_expression(seurat, layer = "data") # Get raw counts as dense matrix counts <- extract_expression(seurat, layer = "counts", as_dense = TRUE) ## End(Not run)## Not run: # Get normalized expression expr <- extract_expression(seurat, layer = "data") # Get raw counts as dense matrix counts <- extract_expression(seurat, layer = "counts", as_dense = TRUE) ## End(Not run)
Extracts peak information from a Signac/Seurat object.
extract_peaks_from_signac(seurat, assay = "peaks", min_cells = 10)extract_peaks_from_signac(seurat, assay = "peaks", min_cells = 10)
seurat |
Seurat object with ATAC assay |
assay |
Name of ATAC assay |
min_cells |
Minimum cells for peak filtering |
Data frame with peak coordinates
Filter GRN links based on coefficient thresholds.
filter_links(links, p = 0.001, threshold_number = 10000, min_coef = 0)filter_links(links, p = 0.001, threshold_number = 10000, min_coef = 0)
links |
Links object |
p |
Percentile threshold |
threshold_number |
Maximum number of links |
min_coef |
Minimum absolute coefficient |
Modified Links object
Identifies common regulatory motifs (feedforward loops, feedback loops, etc.)
find_network_motifs(links, cluster = NULL, motif_size = 3)find_network_motifs(links, cluster = NULL, motif_size = 3)
links |
Links object |
cluster |
Specific cluster |
motif_size |
Size of motifs to find (3 or 4) |
Data frame with motif counts
Higher-level function that fits GRN and stores in Oracle object.
fit_grn(oracle, GRN_unit = c("cluster", "whole"), alpha = 1, verbose = TRUE)fit_grn(oracle, GRN_unit = c("cluster", "whole"), alpha = 1, verbose = TRUE)
oracle |
Oracle object |
GRN_unit |
"cluster" or "whole" |
alpha |
Regularization strength |
verbose |
Print progress |
Modified Oracle object
Fit GRN using Ridge regression with bootstrap aggregation (bagging) for all target genes. This matches Python's behavior exactly.
fit_grn_bagging( gem, TFdict, alpha = 10, bagging_number = 20, verbose = TRUE, n_jobs = -1 )fit_grn_bagging( gem, TFdict, alpha = 10, bagging_number = 20, verbose = TRUE, n_jobs = -1 )
gem |
Gene expression matrix (cells x genes) |
TFdict |
TF-target dictionary |
alpha |
Regularization strength |
bagging_number |
Number of bootstrap iterations |
verbose |
Print progress |
n_jobs |
Number of parallel jobs (-1 for all cores) |
List with median coefficients and all bootstrap coefficients
Fit a gene regulatory network using Ridge regression. Returns a coefficient matrix where entry (i,j) represents the regulatory effect of gene i on gene j. This matches Python's _getCoefMatrix function.
fit_grn_coef_matrix(gem, TFdict, alpha = 1, verbose = TRUE)fit_grn_coef_matrix(gem, TFdict, alpha = 1, verbose = TRUE)
gem |
Gene expression matrix (cells x genes) |
TFdict |
Named list mapping target genes to regulator TFs |
alpha |
Regularization strength for Ridge regression |
verbose |
Print progress |
Coefficient matrix (genes x genes)
Applies Gaussian kernel to convert distances to similarities.
gaussian_kernel_cpp(dist, sigma)gaussian_kernel_cpp(dist, sigma)
dist |
Distance matrix |
sigma |
Kernel bandwidth |
Similarity matrix
Fit Ridge regression with bootstrap aggregation (bagging) for a single target gene. EXACT port from Python's get_bagging_ridge_coefs - uses:
bootstrap=TRUE: sample rows with replacement
max_features=0.8: randomly select 80% of features for each estimator
get_bagging_ridge_coefs( target_gene, gem, gem_scaled, TFdict, cellstate = NULL, bagging_number = 1000, scaling = TRUE, alpha = 1, n_jobs = -1 )get_bagging_ridge_coefs( target_gene, gem, gem_scaled, TFdict, cellstate = NULL, bagging_number = 1000, scaling = TRUE, alpha = 1, n_jobs = -1 )
target_gene |
Target gene name |
gem |
Gene expression matrix (cells x genes data.frame) |
gem_scaled |
Scaled gene expression matrix |
TFdict |
TF-target dictionary |
cellstate |
Optional cell state data frame |
bagging_number |
Number of bootstrap iterations (default 1000 like Python) |
scaling |
Whether to use scaled data |
alpha |
Ridge regularization strength |
n_jobs |
Number of parallel jobs (not used in single-gene function) |
Data frame with coefficient values for each bootstrap iteration, or 0 if not applicable
Processes Cicero coaccessibility results to create a base GRN. This connects peaks to genes based on proximity and coaccessibility.
get_base_grn_from_cicero( cicero_connections, peak_annotation, coaccess_threshold = 0.05, max_distance = 1e+05 )get_base_grn_from_cicero( cicero_connections, peak_annotation, coaccess_threshold = 0.05, max_distance = 1e+05 )
cicero_connections |
Cicero connection data frame |
peak_annotation |
Peak-to-gene annotation data frame |
coaccess_threshold |
Coaccessibility score threshold |
max_distance |
Maximum distance for peak-gene association |
Data frame suitable for TFinfo import
Fit Bayesian Ridge regression for a target gene and return coefficients with uncertainty estimates. Exact port from Python's get_bayesian_ridge_coefs.
get_bayesian_ridge_coefs( target_gene, gem, gem_scaled, TFdict, cellstate = NULL, scaling = TRUE )get_bayesian_ridge_coefs( target_gene, gem, gem_scaled, TFdict, cellstate = NULL, scaling = TRUE )
target_gene |
Target gene name |
gem |
Gene expression matrix (cells x genes data.frame) |
gem_scaled |
Scaled gene expression matrix |
TFdict |
TF-target dictionary |
cellstate |
Optional cell state data frame |
scaling |
Whether to use scaled data |
List with coef_mean, coef_variance, coef_names (or 0,0,0 if not applicable)
Extract dimensional reduction embedding (e.g., UMAP, tSNE, PCA).
get_embedding(seurat, embedding_name = "umap", dims = 1:2)get_embedding(seurat, embedding_name = "umap", dims = 1:2)
seurat |
Seurat object |
embedding_name |
Name of the reduction (e.g., "umap", "tsne", "pca") |
dims |
Dimensions to extract (default: first 2) |
Matrix (cells x dimensions)
## Not run: # Get UMAP coordinates umap <- get_embedding(seurat, "umap") # Get first 50 PCs pcs <- get_embedding(seurat, "pca", dims = 1:50) ## End(Not run)## Not run: # Get UMAP coordinates umap <- get_embedding(seurat, "umap") # Get first 50 PCs pcs <- get_embedding(seurat, "pca", dims = 1:50) ## End(Not run)
Fits GRNs per cluster and returns a Links object containing regulatory networks for each cluster.
get_links( oracle, cluster_name_for_GRN_unit = NULL, alpha = 10, bagging_number = 20, verbose_level = 1, n_jobs = -1 )get_links( oracle, cluster_name_for_GRN_unit = NULL, alpha = 10, bagging_number = 20, verbose_level = 1, n_jobs = -1 )
oracle |
Oracle object |
cluster_name_for_GRN_unit |
Column name for clustering |
alpha |
Regularization strength |
bagging_number |
Number of bagging iterations |
verbose_level |
Verbosity (0=silent, 1=progress, 2=detailed) |
n_jobs |
Number of parallel jobs |
Links object
Computes the entropy of degree distribution as a measure of network complexity.
get_network_entropy(links, cluster = NULL)get_network_entropy(links, cluster = NULL)
links |
Links object |
cluster |
Specific cluster (NULL for all) |
Named vector of entropy values
Compute network centrality metrics for genes.
get_network_score(links)get_network_score(links)
links |
Links object |
Modified Links object with scores
Helper function to get perturbation values based on expression statistics.
get_perturb_value(seurat, gene, method = "knockout")get_perturb_value(seurat, gene, method = "knockout")
seurat |
Seurat object |
gene |
Gene name |
method |
Method for value: "knockout" (0), "min", "max", "median", "mean", "percentile_X" (e.g., "percentile_90") |
Perturbation value
Finds highly connected genes (hubs) in the network.
identify_hubs( links, cluster = NULL, top_n = 20, method = c("degree", "betweenness", "eigenvector") )identify_hubs( links, cluster = NULL, top_n = 20, method = c("degree", "betweenness", "eigenvector") )
links |
Links object |
cluster |
Specific cluster |
top_n |
Number of top hubs |
method |
Hub definition: "degree", "betweenness", "eigenvector" |
Data frame with hub genes
Converts KNN distance matrix to connectivity weights using Gaussian kernel.
knn_distances_to_weights_cpp(distances, sigma = 0)knn_distances_to_weights_cpp(distances, sigma = 0)
distances |
Distance matrix (cells x k) |
sigma |
Gaussian kernel bandwidth (if 0, auto-estimate) |
Weight matrix (cells x k)
Performs KNN-based smoothing/imputation of expression data using precomputed nearest neighbor indices and weights.
knn_impute_cpp(X, knn_idx, weights, diag_weight = 1)knn_impute_cpp(X, knn_idx, weights, diag_weight = 1)
X |
Expression matrix (cells x genes) |
knn_idx |
KNN index matrix (cells x k), 0-indexed |
weights |
Weight vector for neighbors (length k) |
diag_weight |
Weight for self (diagonal) |
Smoothed expression matrix
Links class stores and analyzes gene regulatory networks. It contains GRN links (edges) for each cluster and provides methods for filtering, merging, and computing network metrics.
links_dictNamed list of link data frames (one per cluster)
filtered_linksFiltered links after applying thresholds
merged_scoreMerged network scores across clusters
TFdictTF-target dictionary
cluster_columnCluster column name
raw_countRaw link counts per cluster
new()
Create a new Links object
Links$new(links_dict = NULL, TFdict = NULL, cluster_column = NULL)
links_dictNamed list of link data frames
TFdictTF dictionary
cluster_columnCluster column name
filter_links()
Filter links by statistical significance and coefficient threshold
Links$filter_links(p = 0.001, threshold_number = 10000, min_coef = 0)
pPercentile threshold for filtering (0-1)
threshold_numberMaximum number of links to keep per cluster
min_coefMinimum absolute coefficient
get_network_score()
Get network scores for each gene
Computes degree, betweenness, eigenvector centrality for genes.
Links$get_network_score()
get_links_df()
Get links as combined data frame
Links$get_links_df(filtered = TRUE)
filteredUse filtered links
Data frame with all links
get_degree_distribution()
Get degree distribution statistics
Links$get_degree_distribution(cluster = NULL, mode = "all")
clusterSpecific cluster (NULL for all)
modeDegree mode: "all", "in", "out"
Data frame with degree statistics
compare_gene()
Compare network metrics between clusters
Links$compare_gene(gene)
geneGene to compare
Data frame with per-cluster metrics
get_top_genes()
Get top genes by network metric
Links$get_top_genes(metric = "degree_all", cluster = NULL, n = 20)
metricMetric name (degree_all, betweenness, eigenvector)
clusterSpecific cluster (NULL for all)
nNumber of top genes
Data frame with top genes
get_regulators()
Get regulators of a target gene
Links$get_regulators(target, cluster = NULL)
targetTarget gene name
clusterSpecific cluster (NULL for all)
Data frame with regulator information
get_targets()
Get targets of a regulator
Links$get_targets(source, cluster = NULL)
sourceRegulator gene name
clusterSpecific cluster (NULL for all)
Data frame with target information
to_igraph()
Export to igraph object
Links$to_igraph(cluster = NULL)
clusterSpecific cluster (NULL = merge all)
igraph object
save()
Save Links object
Links$save(file_path)
file_pathPath to save file
print()
Print Links summary
Links$print()
clone()
The objects of this class are cloneable with this method.
Links$clone(deep = FALSE)
deepWhether to make a deep clone.
Loads pre-built base GRN data for common reference genomes.
load_base_grn(ref_genome, lineage = NULL)load_base_grn(ref_genome, lineage = NULL)
ref_genome |
Reference genome name |
lineage |
Tissue/lineage type (if available) |
TFdict (named list)
Load Oracle object from file
load_oracle(path)load_oracle(path)
path |
File path to saved Oracle object |
Oracle object
Load TFinfo object from file
load_tfinfo(file_path)load_tfinfo(file_path)
file_path |
Path to saved TFinfo object |
TFinfo object
Simulates cell state transitions using Markov chain based on transition probability matrix.
markov_simulate( trans_prob, start_cells, n_steps = 100, n_duplicates = 10, seed = 123 )markov_simulate( trans_prob, start_cells, n_steps = 100, n_duplicates = 10, seed = 123 )
trans_prob |
Transition probability matrix |
start_cells |
Starting cell indices (1-indexed) |
n_steps |
Number of simulation steps |
n_duplicates |
Number of trajectories per start cell |
seed |
Random seed |
Matrix of trajectories (n_trajectories x n_steps+1)
Runs multiple Markov simulations per starting cell for robust estimation.
markov_walk_batch_cpp( start_cells, trans_prob, n_steps, n_duplicates, seed = 123L )markov_walk_batch_cpp( start_cells, trans_prob, n_steps, n_duplicates, seed = 123L )
start_cells |
Starting cell indices (0-indexed) |
trans_prob |
Transition probability matrix |
n_steps |
Number of steps |
n_duplicates |
Number of simulations per start cell |
seed |
Random seed |
Matrix of trajectories (n_start * n_duplicates x n_steps+1)
Performs Markov chain simulation based on transition probability matrix. This simulates cell state transitions over multiple time steps.
markov_walk_cpp(start_cells, trans_prob, n_steps, seed = 123L)markov_walk_cpp(start_cells, trans_prob, n_steps, seed = 123L)
start_cells |
Starting cell indices (0-indexed) |
trans_prob |
Transition probability matrix (cells x cells) |
n_steps |
Number of simulation steps |
seed |
Random seed for reproducibility |
Matrix of trajectory indices (n_start x n_steps+1)
Net class represents a gene regulatory network model for a specific gene subset. It handles the regression fitting and coefficient estimation.
target_geneTarget gene for this model
regulatorsVector of regulator gene names
all_genesAll genes in the expression matrix
coef_matrixFull coefficient matrix
fittedWhether model has been fitted
new()
Create a new Net object
Net$new(target_gene = NULL, regulators = NULL, all_genes = NULL)
target_geneTarget gene
regulatorsRegulator genes
all_genesAll genes
fit()
Fit Ridge regression model
Net$fit(gem, alpha = 10, bagging_number = 20, sample_frac = 0.8)
gemGene expression matrix (cells x genes)
alphaRegularization strength
bagging_numberNumber of bagging iterations
sample_fracSample fraction for bagging
Self (modified)
get_coef()
Get coefficient for a specific regulator
Net$get_coef(regulator)
regulatorRegulator gene name
Coefficient value
get_active_regulators()
Get all non-zero regulators
Net$get_active_regulators()
Character vector of regulator names
print()
Print Net summary
Net$print()
clone()
The objects of this class are cloneable with this method.
Net$clone(deep = FALSE)
deepWhether to make a deep clone.
Calculate network statistics summary
network_summary(links, cluster = NULL)network_summary(links, cluster = NULL)
links |
Links object |
cluster |
Specific cluster (NULL for each cluster separately) |
Data frame with network statistics
Normalizes flow vectors to a reference percentile magnitude.
normalize_flow_cpp(flow, percentile = 99.5)normalize_flow_cpp(flow, percentile = 99.5)
flow |
Flow matrix (grid_points x 2) |
percentile |
Percentile for normalization (default 99.5) |
Normalized flow and magnitude
Oracle is the main class in CellOracleR. It imports scRNA-seq data (Seurat object) and TF information to infer cluster-specific GRNs. It can predict future gene expression patterns and cell state transitions in response to TF perturbations.
The Oracle class stores:
Seurat object with scRNA-seq data
TF-target gene regulatory information (TFdict)
GRN coefficients for simulation
Perturbation simulation results
seuratSeurat object containing scRNA-seq data
cluster_columnColumn name in metadata containing cluster information
embedding_nameName of dimensional reduction to use (e.g., "umap")
TFdictNamed list: key = target gene, value = vector of regulator TFs
all_target_genesAll target genes in TFdict
all_regulatory_genesAll regulatory genes in TFdict
active_regulatory_genesRegulatory genes with active connections in GRN
high_var_genesHigh variability genes
pcsPCA results
pcaPCA model
k_knn_imputationK used for KNN imputation
knnKNN graph
knn_smoothing_wKNN smoothing weights
GRN_unitGRN calculation unit: "cluster" or "whole"
alpha_for_simulationRegularization strength used for GRN
coef_matrixCoefficient matrix (for GRN_unit="whole")
coef_matrix_per_clusterList of coefficient matrices per cluster
perturb_conditionCurrent perturbation condition
embeddingEmbedding coordinates
delta_embeddingSimulated embedding shifts
delta_embedding_randomRandomized embedding shifts (control)
transition_probTransition probability matrix
transition_prob_randomRandom transition probability (control)
corrcoefCorrelation coefficients
corrcoef_randomRandom correlation coefficients (control)
flow_gridGrid coordinates for flow visualization
flowFlow vectors on grid
flow_normNormalized flow vectors
total_p_massProbability mass at each grid point
mass_filterMass filter for visualization
colorandumCell colors based on cluster
new()
Create a new Oracle object
Oracle$new( seurat = NULL, cluster_column = NULL, embedding_name = NULL, verbose = TRUE )
seuratSeurat object with scRNA-seq data (normalized, not scaled)
cluster_columnColumn name containing cluster assignments
embedding_nameName of dimensional reduction (e.g., "umap", "tsne")
verbosePrint messages
A new Oracle object
import_TF_data()
Import TF-target regulatory data
Oracle$import_TF_data( TFinfo_df = NULL, TFinfo_path = NULL, TFdict = NULL, verbose = TRUE )
TFinfo_dfData frame with columns: peak_id, gene_short_name, and TF columns
TFinfo_pathPath to parquet file with TF info
TFdictNamed list mapping target genes to regulator TFs
verbosePrint messages
perform_PCA()
Perform PCA on expression data
Oracle$perform_PCA(n_components = 50, use_seurat_pca = TRUE)
n_componentsNumber of PCs to compute
use_seurat_pcaUse existing PCA from Seurat object
knn_imputation()
Perform KNN imputation of expression data
Oracle$knn_imputation( k = NULL, n_pca_dims = NULL, balanced = FALSE, diag_weight = 1 )
kNumber of neighbors (default: 2.5% of cells)
n_pca_dimsNumber of PCA dimensions to use for KNN
balancedUse balanced KNN
diag_weightWeight for self in smoothing
fit_GRN_for_simulation()
Fit GRN for perturbation simulation
Oracle$fit_GRN_for_simulation(
GRN_unit = c("cluster", "whole"),
alpha = 1,
verbose_level = 1
)GRN_unit"cluster" for cluster-specific GRNs or "whole" for one GRN
alphaRegularization strength for Ridge regression
verbose_levelVerbosity: 0=silent, 1=progress, 2=detailed
simulate_shift()
Simulate gene perturbation effects
Oracle$simulate_shift( perturb_condition, n_propagation = 3, GRN_unit = NULL, clip_delta_X = FALSE, ignore_warning = FALSE )
perturb_conditionNamed list: gene name -> expression value
n_propagationNumber of signal propagation steps (1-5)
GRN_unitOverride GRN unit for simulation
clip_delta_XClip to original expression range
ignore_warningIgnore validation warnings
estimate_transition_prob()
Estimate transition probability
Oracle$estimate_transition_prob( n_neighbors = NULL, sigma_corr = 0.05, sampled_fraction = 0.3, calculate_randomized = TRUE, n_jobs = 1, random_seed = 123 )
n_neighborsNumber of neighbors for KNN
sigma_corrCorrelation kernel bandwidth
sampled_fractionFraction of neighbors to sample
calculate_randomizedCalculate randomized control
n_jobsNumber of parallel jobs
random_seedRandom seed
calculate_embedding_shift()
Calculate embedding shifts from transition probability
Oracle$calculate_embedding_shift(sigma_corr = 0.05)
sigma_corrKernel bandwidth (not used, kept for API compatibility)
calculate_grid_arrows()
Calculate grid arrows for visualization
Oracle$calculate_grid_arrows(n_grid = 40, n_neighbors = 100, smooth = 0.5)
n_gridNumber of grid points per dimension
n_neighborsNumber of neighbors for smoothing
smoothSmoothing parameter
calculate_mass_filter()
Calculate mass filter for visualization
Oracle$calculate_mass_filter(min_mass = 0.01)
min_massMinimum mass threshold
get_links()
Get Links object for network analysis
Oracle$get_links( cluster_name_for_GRN_unit = NULL, alpha = 10, bagging_number = 20, verbose_level = 1, n_jobs = -1 )
cluster_name_for_GRN_unitCluster column for GRN unit
alphaRegularization strength
bagging_numberNumber of bagging iterations
verbose_levelVerbosity level
n_jobsNumber of parallel jobs
Links object
save()
Save Oracle object to file
Oracle$save(file_path)
file_pathPath to save file (should end with .celloracle.oracle)
copy()
Deep copy the Oracle object
Oracle$copy()
A new Oracle object
change_cluster_unit()
Change the cluster unit used for analysis
Oracle$change_cluster_unit(new_cluster_column)
new_cluster_columnNew cluster column name
update_cluster_colors()
Update cluster colors
Oracle$update_cluster_colors(palette)
paletteNamed vector of colors for each cluster
get_cluster_colors()
Get cluster colors as a named vector
Oracle$get_cluster_colors()
Named vector of colors
print()
Print Oracle object summary Process TFdict metadata Get imputed expression as data frame Get delta_X from simulation results Store simulation results in Seurat Validate perturbation condition Extract active regulatory genes from coefficient matrix
Oracle$print()
clone()
The objects of this class are cloneable with this method.
Oracle$clone(deep = FALSE)
deepWhether to make a deep clone.
Computes Euclidean distance between all pairs of rows.
pairwise_dist_cpp(X)pairwise_dist_cpp(X)
X |
Input matrix (observations x features) |
Distance matrix (symmetric)
Randomly permutes elements within each row and flips signs. Used for generating randomized control in transition probability.
permute_rows_nsign_cpp(X, seed = 123L)permute_rows_nsign_cpp(X, seed = 123L)
X |
Input matrix (will be modified in place) |
seed |
Random seed |
Permuted matrix (also modifies X in place)
Creates a scatter plot of cells colored by cluster on dimensional reduction.
plot_cluster( oracle, cluster_column = NULL, embedding_name = NULL, point_size = 0.5, alpha = 0.8, show_legend = TRUE, title = NULL )plot_cluster( oracle, cluster_column = NULL, embedding_name = NULL, point_size = 0.5, alpha = 0.8, show_legend = TRUE, title = NULL )
oracle |
Oracle object |
cluster_column |
Column for coloring (default: oracle$cluster_column) |
embedding_name |
Embedding to use (default: oracle$embedding_name) |
point_size |
Size of points |
alpha |
Point transparency |
show_legend |
Show legend |
title |
Plot title |
ggplot object
Plot degree distribution
plot_degree_distribution( links, cluster = NULL, mode = "all", log_scale = TRUE, title = NULL )plot_degree_distribution( links, cluster = NULL, mode = "all", log_scale = TRUE, title = NULL )
links |
Links object |
cluster |
Specific cluster (NULL for all) |
mode |
Degree mode: "all", "in", "out" |
log_scale |
Use log scale |
title |
Plot title |
ggplot object
Plot gene expression on embedding
plot_gene_expression( oracle, gene, layer = "data", point_size = 0.5, title = NULL )plot_gene_expression( oracle, gene, layer = "data", point_size = 0.5, title = NULL )
oracle |
Oracle object |
gene |
Gene name |
layer |
Expression layer: "data", "simulated", "delta_X" |
point_size |
Point size |
title |
Plot title |
ggplot object
Plot GRN as network graph
plot_network_graph( links, cluster, top_n = 50, layout = "fr", node_size_by = "degree", title = NULL )plot_network_graph( links, cluster, top_n = 50, layout = "fr", node_size_by = "degree", title = NULL )
links |
Links object |
cluster |
Cluster to plot |
top_n |
Number of top edges to show |
layout |
Layout algorithm: "fr", "kk", "circle", "star" |
node_size_by |
Size nodes by: "degree", "betweenness", "fixed" |
title |
Plot title |
ggplot object (requires ggraph)
Plot pseudotime on embedding
plot_pseudotime(oracle, pseudotime, point_size = 0.5, title = NULL)plot_pseudotime(oracle, pseudotime, point_size = 0.5, title = NULL)
oracle |
Oracle object |
pseudotime |
Named vector of pseudotime values |
point_size |
Point size |
title |
Plot title |
ggplot object
Plot quiver (cell-level velocity arrows)
plot_quiver( oracle, scale = 1, sample_frac = 0.3, arrow_color = "black", title = NULL )plot_quiver( oracle, scale = 1, sample_frac = 0.3, arrow_color = "black", title = NULL )
oracle |
Oracle object |
scale |
Arrow scale |
sample_frac |
Fraction of cells to show |
arrow_color |
Arrow color |
title |
Plot title |
ggplot object
Compare network scores between clusters
plot_score_comparison(links, gene, metric = "degree_all", title = NULL)plot_score_comparison(links, gene, metric = "degree_all", title = NULL)
links |
Links object |
gene |
Gene to compare |
metric |
Metric to plot |
title |
Plot title |
ggplot object
Plot network scores as ranked bar plot
plot_scores_as_rank( links, cluster, metric = "degree_all", top_n = 20, fill_color = "#3288BD", title = NULL )plot_scores_as_rank( links, cluster, metric = "degree_all", top_n = 20, fill_color = "#3288BD", title = NULL )
links |
Links object |
cluster |
Cluster to plot |
metric |
Metric: "degree_all", "degree_in", "degree_out", "betweenness", "eigenvector" |
top_n |
Number of top genes |
fill_color |
Bar fill color |
title |
Plot title |
ggplot object
Create combined simulation plot
plot_simulation_combined(oracle, ncol = 2, genes = NULL)plot_simulation_combined(oracle, ncol = 2, genes = NULL)
oracle |
Oracle object with simulation results |
ncol |
Number of columns |
genes |
Genes to show expression for |
Combined plot (requires patchwork)
Visualizes the predicted cell state changes as a quiver/arrow plot.
plot_simulation_flow( oracle, scale = 1, min_mass = 0.01, arrow_color = "black", arrow_alpha = 0.8, point_size = 0.3, point_alpha = 0.3, title = NULL )plot_simulation_flow( oracle, scale = 1, min_mass = 0.01, arrow_color = "black", arrow_alpha = 0.8, point_size = 0.3, point_alpha = 0.3, title = NULL )
oracle |
Oracle object with simulation results |
scale |
Arrow scaling factor |
min_mass |
Minimum probability mass threshold |
arrow_color |
Arrow color (NULL for cluster colors) |
arrow_alpha |
Arrow transparency |
point_size |
Background point size |
point_alpha |
Background point transparency |
title |
Plot title |
ggplot object
Creates a Sankey diagram visualizing cell state transitions from perturbation simulation results.
plot_transition_sankey( oracle, cluster_column = NULL, before_column = "cluster_original", after_column = "cluster_predicted", color_dict = NULL, ... )plot_transition_sankey( oracle, cluster_column = NULL, before_column = "cluster_original", after_column = "cluster_predicted", color_dict = NULL, ... )
oracle |
Oracle object with simulation results |
cluster_column |
Column containing cluster assignments |
before_column |
Column containing original cluster assignments |
after_column |
Column containing predicted cluster assignments |
color_dict |
Optional named vector of colors |
... |
Additional arguments passed to sankey() |
Sankey diagram
Print method for Links
## S3 method for class 'Links' print(x, ...)## S3 method for class 'Links' print(x, ...)
x |
Links object |
... |
Additional arguments (unused) |
Invisibly returns x
Print method for Net
## S3 method for class 'Net' print(x, ...)## S3 method for class 'Net' print(x, ...)
x |
Net object |
... |
Additional arguments (unused) |
Invisibly returns x
Print method for Oracle
## S3 method for class 'Oracle' print(x, ...)## S3 method for class 'Oracle' print(x, ...)
x |
Oracle object |
... |
Additional arguments (unused) |
Invisibly returns x
Print method for TFinfo
## S3 method for class 'TFinfo' print(x, ...)## S3 method for class 'TFinfo' print(x, ...)
x |
TFinfo object |
... |
Additional arguments (unused) |
Invisibly returns x
Computes L2 norm (Euclidean length) for each row.
row_norms_cpp(X)row_norms_cpp(X)
X |
Input matrix |
Vector of row norms
Computes standard deviation for each row of a matrix.
rowSds_cpp(X)rowSds_cpp(X)
X |
Input matrix |
Vector of row standard deviations
Save Oracle object
save_oracle(oracle, path)save_oracle(oracle, path)
oracle |
Oracle object |
path |
File path to save |
Save TFinfo object
save_tfinfo(tfinfo, file_path)save_tfinfo(tfinfo, file_path)
tfinfo |
TFinfo object |
file_path |
Path to save |
Divides each column by its standard deviation.
scale_cols_cpp(X)scale_cols_cpp(X)
X |
Input matrix |
Scaled matrix
Convenience function for motif scanning.
scan_motifs(tfinfo, motifs = NULL, fpr = 0.02, n_cores = 1)scan_motifs(tfinfo, motifs = NULL, fpr = 0.02, n_cores = 1)
tfinfo |
TFinfo object |
motifs |
PWMatrixList or path to motif database (NULL for JASPAR) |
fpr |
False positive rate threshold |
n_cores |
Number of parallel cores |
Modified TFinfo object
Uses Support Vector Regression (SVR) to fit a nonparametric relationship between CV and mean expression, exactly like Python CellOracle.
score_cv_vs_mean( expr_matrix, n_top = 1000, min_expr_cells = 2, max_expr_avg = 20, min_expr_avg = 0, svr_gamma = NULL, winsorize = FALSE, winsor_perc = c(1, 99.5), sort_inverse = FALSE, plot = FALSE )score_cv_vs_mean( expr_matrix, n_top = 1000, min_expr_cells = 2, max_expr_avg = 20, min_expr_avg = 0, svr_gamma = NULL, winsorize = FALSE, winsor_perc = c(1, 99.5), sort_inverse = FALSE, plot = FALSE )
expr_matrix |
Expression matrix (genes x cells) |
n_top |
Number of top variable genes to select |
min_expr_cells |
Minimum cells expressing gene |
max_expr_avg |
Maximum average expression |
min_expr_avg |
Minimum average expression |
svr_gamma |
SVR gamma parameter (default: 150/n_genes) |
winsorize |
Whether to winsorize data |
winsor_perc |
Winsorization percentiles |
sort_inverse |
If TRUE, sort from less to more noisy |
plot |
Whether to plot results |
List with scores and selected genes
Configure the future parallel backend for CellOracleR computations.
setup_parallel( workers = NULL, plan = c("multisession", "multicore", "sequential"), verbose = TRUE )setup_parallel( workers = NULL, plan = c("multisession", "multicore", "sequential"), verbose = TRUE )
workers |
Number of workers (cores) to use. Default uses all available. |
plan |
Parallel plan: "multisession" (default, cross-platform), "multicore" (Unix only, faster), or "sequential" (no parallelization). |
verbose |
Whether to print information |
This function sets up the parallel backend using the future framework.
"multisession": Works on all platforms (Windows, Mac, Linux)
"multicore": Faster but Unix/Mac only (not Windows)
"sequential": No parallelization, useful for debugging
Invisibly returns the previous plan
## Not run: # Use 4 cores setup_parallel(workers = 4) # Use all available cores setup_parallel() # Disable parallelization setup_parallel(plan = "sequential") ## End(Not run)## Not run: # Use 4 cores setup_parallel(workers = 4) # Use all available cores setup_parallel() # Disable parallelization setup_parallel(plan = "sequential") ## End(Not run)
Calculate shortest paths between genes
shortest_paths(links, from, to, cluster = NULL)shortest_paths(links, from, to, cluster = NULL)
links |
Links object |
from |
Source gene |
to |
Target gene(s) |
cluster |
Specific cluster |
List with path information
Creates a randomized version of the coefficient matrix by shuffling target gene assignments while preserving the overall structure.
shuffle_coef_matrix_cpp(coef_matrix, seed = 123L)shuffle_coef_matrix_cpp(coef_matrix, seed = 123L)
coef_matrix |
Original coefficient matrix |
seed |
Random seed for reproducibility |
Shuffled coefficient matrix
Simulate the effect of gene perturbation on the gene regulatory network. This function propagates the perturbation signal through the GRN to predict changes in gene expression.
simulate_shift( oracle, perturb_condition, n_propagation = 3, GRN_unit = NULL, clip_delta_X = FALSE, ignore_warning = FALSE )simulate_shift( oracle, perturb_condition, n_propagation = 3, GRN_unit = NULL, clip_delta_X = FALSE, ignore_warning = FALSE )
oracle |
Oracle object with fitted GRN |
perturb_condition |
Named list: gene -> expression value |
n_propagation |
Number of signal propagation steps (1-5) |
GRN_unit |
Which GRN to use: "cluster" or "whole" |
clip_delta_X |
Clip simulated values to original range |
ignore_warning |
Ignore validation warnings |
Modified Oracle object with simulation results
Print summary statistics of simulation results.
simulation_summary(oracle)simulation_summary(oracle)
oracle |
Oracle object after simulation |
Invisible data frame with summary stats
Summarize Markov simulation trajectories
summarize_trajectories(trajectory, cluster_labels, n_steps_for_summary = NULL)summarize_trajectories(trajectory, cluster_labels, n_steps_for_summary = NULL)
trajectory |
Trajectory matrix from markov_simulate |
cluster_labels |
Cluster labels for cells |
n_steps_for_summary |
Steps to consider for summary |
Summary statistics
TFinfo class handles transcription factor binding site (TFBS) analysis. It processes peak data, scans for motif matches, and generates TF-target gene dictionaries for GRN construction.
peak_dfData frame with peak information
ref_genomeReference genome name
bsgenomeBSgenome object name
peak_rangesGRanges object for peaks
scanned_dfData frame with motif scanning results
TF_onehotOne-hot encoded TF binding matrix
all_TF_listAll TF names
new()
Create a new TFinfo object
TFinfo$new(peak_df = NULL, ref_genome = NULL, genomes_dir = NULL)
peak_dfData frame with columns: chr, start, end, peak_id, gene_short_name
ref_genomeReference genome (e.g., "hg38", "mm10")
genomes_dirDirectory for genome data (optional)
scan()
Scan peaks for TF motif matches
TFinfo$scan(motifs = NULL, fpr = 0.02, n_cores = 1, verbose = TRUE)
motifsPWMatrixList or path to motif database
fprFalse positive rate threshold
n_coresNumber of cores for parallel processing
verbosePrint progress
filter()
Filter TF results by various criteria
TFinfo$filter(min_peaks = 10, tfs_to_keep = NULL, tfs_to_remove = NULL)
min_peaksMinimum peaks with TF binding
tfs_to_keepTFs to include (NULL for all)
tfs_to_removeTFs to exclude
to_dataframe()
Convert to data frame format
TFinfo$to_dataframe()
Data frame with peak_id, gene_short_name, and TF columns
to_dictionary()
Convert to TF dictionary format
TFinfo$to_dictionary()
Named list mapping target genes to regulator TFs
save()
Save TFinfo object
TFinfo$save(file_path)
file_pathPath to save file
print()
Print TFinfo summary Get BSgenome package name from reference genome Load genome from BSgenome Get default motifs from JASPAR Load motifs from file
TFinfo$print()
clone()
The objects of this class are cloneable with this method.
TFinfo$clone(deep = FALSE)
deepWhether to make a deep clone.