| Title: | Map Single Cells to Reference Gene Expression Programs |
|---|---|
| Description: | Maps single-cell RNA sequencing data to reference gene expression programs (GEPs) using non-negative matrix factorization. Enables cell type annotation and state characterization by projecting query cells onto pre-built or custom reference programs. Includes tools for building consensus references from multiple datasets. Features C++ accelerated NNLS solvers and built-in machine learning models for cell type prediction. |
| Authors: | Zaoqu Liu [aut, cre] (ORCID: <https://orcid.org/0000-0002-0452-742X>) |
| Maintainer: | Zaoqu Liu <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.0 |
| Built: | 2026-05-26 08:24:34 UTC |
| Source: | https://github.com/Zaoqu-Liu/CellProgramMapper |
Add CellProgramMapper results to Seurat object metadata
    add_results_to_seurat(object, result, prefix = "")
object: A Seurat object
result: A CellProgramMapperResult object
prefix: Prefix for column names (default: "")
Seurat object with added metadata
Returns a data frame of available pre-built references in the CellProgramMapper database
    available_references()
A data frame with reference information including Name, Version, Cell_Type, Tissue, Species, etc.
    ## Not run:
    refs <- available_references()
    print(refs)
    ## End(Not run)
Build consensus gene expression programs (cGEPs) by clustering GEPs from multiple cNMF results based on their pairwise correlations.
    BuildConsensusReference(
      cnmf_paths,
      ks = NULL,
      density_thresholds = NULL,
      tpm_fns = NULL,
      score_fns = NULL,
      output_dir = ".",
      prefix = "",
      order_thresh = NULL,
      corr_thresh = 0.5,
      pct_thresh = 0.666,
      verbose = TRUE
    )
cnmf_paths: Character vector of paths to cNMF project directories. Each path should end with the cNMF project name, e.g., "cnmf_output_dir/cnmf_name"
ks: Integer vector of K values used for each cNMF result
density_thresholds: Numeric vector of density thresholds used for each cNMF result
tpm_fns: Optional: direct paths to TPM spectra files (alternative to ks/density_thresholds)
score_fns: Optional: direct paths to score spectra files
output_dir: Output directory for results (default: ".")
prefix: Prefix for output filenames (default: "")
order_thresh: Maximum rank for programs to be considered for clustering (default: number of datasets)
corr_thresh: Minimum correlation for programs to cluster (default: 0.5)
pct_thresh: Minimum fraction of connected programs to merge clusters (default: 0.666)
verbose: Print progress messages (default: TRUE)
A list containing:
cluster_df: Data frame showing which GEPs clustered together
spectra_tpm: Consensus spectra in TPM units
spectra_score: Consensus spectra scores
hvgs_union: Union of highly variable genes
top_genes: Top genes for each cGEP
    ## Not run:
    result <- BuildConsensusReference(
      cnmf_paths = c("path/to/cnmf1", "path/to/cnmf2"),
      ks = c(10, 15),
      density_thresholds = c(0.1, 0.1),
      output_dir = "./consensus_output"
    )
    ## End(Not run)
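The correlation-based grouping described for BuildConsensusReference can be illustrated in base R. The sketch below is not the package's implementation: the matrices `spectra_a` and `spectra_b` are hypothetical toy spectra (programs x genes) standing in for two cNMF results, and only the first step is shown, namely finding cross-dataset program pairs whose Pearson correlation meets `corr_thresh`.

```r
# Illustrative only: toy spectra (programs x genes) from two hypothetical cNMF runs.
set.seed(1)
genes <- paste0("gene", 1:50)
base_programs <- matrix(rexp(3 * 50), nrow = 3, dimnames = list(NULL, genes))

# Each run recovers a noisy copy of the same three underlying programs.
spectra_a <- base_programs + matrix(rexp(3 * 50, rate = 10), nrow = 3)
spectra_b <- base_programs + matrix(rexp(3 * 50, rate = 10), nrow = 3)
rownames(spectra_a) <- paste0("A_GEP", 1:3)
rownames(spectra_b) <- paste0("B_GEP", 1:3)

# Pairwise Pearson correlation between programs across the two runs.
# cor() correlates columns, so transpose to put programs in columns.
corr <- cor(t(spectra_a), t(spectra_b))

# Candidate links: program pairs whose correlation meets the threshold.
corr_thresh <- 0.5
links <- which(corr >= corr_thresh, arr.ind = TRUE)
pairs <- data.frame(
  program_a = rownames(corr)[links[, "row"]],
  program_b = colnames(corr)[links[, "col"]],
  correlation = corr[links]
)
print(pairs)
```

In this toy setup the matched programs (A_GEP1 with B_GEP1, and so on) correlate strongly because they share a common base, so they pass the threshold. How the package then merges linked programs into clusters (for example via `pct_thresh`) is not shown here.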
Projects single-cell gene expression data onto a reference set of gene expression programs (GEPs) using non-negative matrix factorization. This enables cell type annotation and state characterization based on established program definitions.
    CellProgramMapper(
      query,
      reference = "TCAT.V1",
      assay = NULL,
      layer = "counts",
      return_unnormalized = FALSE,
      method = c("cd", "active_set"),
      max_iter = 1000L,
      tol = 1e-08,
      n_workers = 1L,
      cache_dir = NULL,
      verbose = TRUE
    )
query: Query data. Accepts multiple input formats, including a Seurat object or a counts matrix (see the examples).
reference: Reference spectra. Can be the name of a pre-built reference (e.g., "TCAT.V1") or a path to a reference file (e.g., "path/to/ref.tsv").
assay: For Seurat objects, which assay to use (default: active assay)
layer: For Seurat objects, which layer/slot to extract (default: "counts")
return_unnormalized: Logical. If TRUE, also return raw usage values. Default is FALSE, returning only normalized usage.
method: NNLS solver algorithm: "cd" or "active_set"
max_iter: Maximum iterations for NNLS solver (default: 1000)
tol: Convergence tolerance for NNLS solver (default: 1e-8)
n_workers: Number of parallel workers for large datasets (default: 1)
cache_dir: Directory for caching downloaded references
verbose: Logical. Print progress messages (default: TRUE)
The algorithm projects each cell's expression profile onto the reference gene expression programs by solving a non-negative least squares problem:

$$\min_{h_i \ge 0} \lVert x_i - h_i W \rVert_2^2$$

where $x_i$ is the scaled expression vector for cell $i$, $W$ is the reference spectra matrix, and $h_i$ is the usage vector to be estimated.
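To make the non-negative least squares step concrete, here is a minimal base R sketch of a cyclic coordinate descent solver for a single cell. It is an illustration under stated assumptions, not the package's C++ solver: the function name `nnls_cd` is hypothetical, `W` is taken to be a programs x genes matrix, and `x` is one cell's scaled expression vector.

```r
# Illustrative NNLS via cyclic coordinate descent (not the package's C++ solver).
# Solves min_{h >= 0} || x - h %*% W ||^2 for one cell.
# W: programs x genes reference spectra; x: numeric vector of length ncol(W).
nnls_cd <- function(W, x, max_iter = 1000L, tol = 1e-8) {
  G <- tcrossprod(W)            # programs x programs Gram matrix, W %*% t(W)
  b <- as.numeric(W %*% x)      # projection of x onto each program
  h <- numeric(nrow(W))         # start from all-zero usages
  for (iter in seq_len(max_iter)) {
    h_old <- h
    for (k in seq_along(h)) {
      if (G[k, k] <= 0) next    # skip all-zero programs
      # Gradient of the objective (up to a factor of 2) with respect to h[k].
      grad_k <- sum(G[k, ] * h) - b[k]
      # Exact minimizer along coordinate k, clipped at zero.
      h[k] <- max(0, h[k] - grad_k / G[k, k])
    }
    # Stop once no coordinate moved by more than the tolerance.
    if (max(abs(h - h_old)) < tol) break
  }
  h
}

# Toy check: recover known non-negative usages from a noiseless mixture.
set.seed(42)
W <- matrix(rexp(3 * 20), nrow = 3)   # 3 programs, 20 genes
h_true <- c(2, 0, 0.5)
x <- as.numeric(h_true %*% W)
h_hat <- nnls_cd(W, x)
print(round(h_hat, 4))
```

Because the toy mixture is noiseless and the true usages are non-negative, the solver should return values close to `h_true`, including the zero usage for the second program.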
Input data is preprocessed by:
Subsetting to genes present in both query and reference
Scaling each gene by its standard deviation (without centering)
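The two preprocessing steps listed above can be sketched in base R as follows. The objects `query` and `ref_genes` are hypothetical, the use of the sample standard deviation (`sd()`) is an assumption, and so is the handling of constant genes; the package may differ on both points.

```r
# Illustrative preprocessing sketch (hypothetical objects, not the package's code).
# query: cells x genes expression matrix; ref_genes: genes in the reference spectra.
set.seed(7)
query <- matrix(rpois(5 * 6, lambda = 3), nrow = 5,
                dimnames = list(paste0("cell", 1:5), paste0("gene", 1:6)))
ref_genes <- c("gene2", "gene3", "gene5", "gene9")

# 1. Keep only genes present in both query and reference.
common <- intersect(ref_genes, colnames(query))
q <- query[, common, drop = FALSE]

# 2. Divide each gene by its standard deviation, without centering.
#    Note: scale(q, center = FALSE) would divide by the root mean square instead.
gene_sd <- apply(q, 2, sd)
gene_sd[gene_sd == 0] <- 1          # assumption: leave constant genes unscaled
q_scaled <- sweep(q, 2, gene_sd, "/")

print(common)
print(apply(q_scaled, 2, sd))
```

Skipping the centering step keeps every scaled value non-negative, which is what a non-negative least squares fit requires.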
A CellProgramMapperResult object containing:
Raw usage matrix (cells x programs)
Normalized usage matrix (rows sum to 1)
Computed add-on scores (if defined in reference)
Genes used for mapping
Reference name
Number of cells processed
Number of programs in reference
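The relationship between the raw and normalized usage matrices can be shown with a small base R example. The matrix below is made up, and the treatment of all-zero rows is an assumption rather than documented package behavior.

```r
# Illustrative: normalize a raw usage matrix (cells x programs) so rows sum to 1.
usage <- matrix(c(2, 1, 1,
                  0, 3, 1,
                  0, 0, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("cell", 1:3), paste0("GEP", 1:3)))

row_totals <- rowSums(usage)
row_totals[row_totals == 0] <- 1   # assumption: keep all-zero rows at zero
# Dividing a matrix by a vector of length nrow() divides each row by its total.
usage_norm <- usage / row_totals

print(usage_norm)
print(rowSums(usage_norm))
```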
available_references for listing pre-built references
add_results_to_seurat for Seurat integration
BuildConsensusReference for building custom references
    ## Not run:
    # With Seurat object
    result <- CellProgramMapper(seurat_obj, reference = "TCAT.V1")

    # With matrix
    result <- CellProgramMapper(counts_matrix, reference = "path/to/ref.tsv")

    # Access results
    head(result$usage_norm)
    head(result$scores)

    # Add to Seurat object
    seurat_obj <- add_results_to_seurat(seurat_obj, result)
    ## End(Not run)
Compute pre-defined scores from usage matrix based on score definitions
    compute_scores(
      usage,
      usage_norm,
      score_data,
      score_path = NULL,
      ref_name = NULL,
      verbose = TRUE
    )
usage: Usage matrix (cells x programs)
usage_norm: Normalized usage matrix (rows sum to 1)
score_data: Score definitions loaded from YAML file
score_path: Path to score definitions directory
ref_name: Reference name (for built-in model lookup)
verbose: Print progress messages
Data frame of computed scores
    ## Not run:
    scores <- compute_scores(usage, usage_norm, score_data)
    ## End(Not run)
Functions for building consensus gene expression programs from multiple datasets
Create a score_data structure from individual score definitions
    create_score_data(...)
...: Score definitions created by create_score_definition
A score_data list compatible with compute_scores
    ## Not run:
    score_data <- create_score_data(
      create_score_definition("Score1", c("GEP1", "GEP2")),
      create_score_definition("Score2", c("GEP3"), type = "discrete", threshold = 0.5)
    )
    ## End(Not run)
Create a custom score definition for use with compute_scores
    create_score_definition(
      name,
      columns,
      weights = NULL,
      type = c("continuous", "discrete"),
      threshold = NULL,
      normalization = c("normalized", "raw")
    )
name: Score name
columns: Vector of program names to include in score
weights: Optional vector of weights (default: all 1s)
type: Score type: "continuous" or "discrete"
threshold: Threshold for discrete scores
normalization: Use "normalized" or "raw" usage
A score definition list
    ## Not run:
    my_score <- create_score_definition(
      name = "MyScore",
      columns = c("GEP1", "GEP2", "GEP3"),
      weights = c(1, 2, 1),
      type = "continuous"
    )
    ## End(Not run)
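As a rough intuition for what a score definition might encode, the base R sketch below evaluates one continuous and one discrete score on a toy normalized usage matrix. The weighted-sum rule, the thresholding rule, and the helper `weighted_score` are assumptions made for illustration; the documentation above does not specify how compute_scores actually combines columns, weights, and thresholds.

```r
# Illustrative only: one plausible way to evaluate score definitions.
# The weighted-sum and thresholding rules are assumptions, not the package's code.
usage_norm <- matrix(c(0.6, 0.3, 0.1,
                       0.1, 0.2, 0.7),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("cell1", "cell2"), c("GEP1", "GEP2", "GEP3")))

# Hypothetical helper: weighted sum of the selected program columns.
weighted_score <- function(usage, columns, weights = rep(1, length(columns))) {
  as.numeric(usage[, columns, drop = FALSE] %*% weights)
}

# Continuous score: weighted combination of GEP1 and GEP2.
continuous <- weighted_score(usage_norm, c("GEP1", "GEP2"), weights = c(1, 2))

# Discrete score: TRUE where the GEP3 usage exceeds a threshold of 0.5.
discrete <- weighted_score(usage_norm, "GEP3") > 0.5

scores <- data.frame(Score1 = continuous, Score2 = discrete,
                     row.names = rownames(usage_norm))
print(scores)
```

Under these assumed rules, cell1 gets a continuous score of 1.2 and a discrete score of FALSE, while cell2 gets 0.5 and TRUE.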
Download a pre-built reference from the database to local cache
    download_reference(reference, cache_dir = NULL, verbose = TRUE)
reference: Name of the reference to download
cache_dir: Directory to cache the reference (default: package cache)
verbose: Print progress messages
Path to the downloaded reference directory
    ## Not run:
    ref_path <- download_reference("TCAT.V1")
    ## End(Not run)
Get probability for each lineage class instead of just the predicted label.
    get_lineage_probabilities(usage_norm, model_name = "TCAT.V1")
usage_norm: Normalized usage matrix
model_name: Model name (default: "TCAT.V1")
Matrix of probabilities (cells x classes)
    ## Not run:
    probs <- get_lineage_probabilities(usage_norm, "TCAT.V1")
    ## End(Not run)
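A cells x classes probability matrix carries more information than a single predicted label, for example how confident each call is. The base R sketch below shows one way to derive both from such a matrix; the class names and probabilities are made up and do not come from any built-in model.

```r
# Illustrative: derive hard labels and a confidence value from a probability matrix.
# The class names and values below are made up for demonstration.
probs <- matrix(c(0.80, 0.15, 0.05,
                  0.30, 0.45, 0.25),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("cell1", "cell2"), c("ClassA", "ClassB", "ClassC")))

# Predicted label: the class with the highest probability in each row.
predicted <- colnames(probs)[max.col(probs, ties.method = "first")]

# Confidence: the probability assigned to that class.
confidence <- apply(probs, 1, max)

print(data.frame(predicted = predicted, confidence = confidence))
```

Here cell2 is labeled ClassB with a probability of only 0.45, the kind of low-confidence call that a label alone would hide.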
Convenience function to extract computed scores from results.
    get_scores(result)
result: A CellProgramMapperResult object
Scores data frame (or NULL if no scores computed)
    ## Not run:
    result <- CellProgramMapper(seurat_obj, reference = "TCAT.V1")
    scores <- get_scores(result)
    ## End(Not run)
Convenience function to extract usage matrix from results.
    get_usage(result, normalized = TRUE)
result: A CellProgramMapperResult object
normalized: Logical. Return normalized usage (TRUE, default) or raw values (FALSE)
Usage matrix as data frame
    ## Not run:
    result <- CellProgramMapper(seurat_obj, reference = "TCAT.V1")
    usage <- get_usage(result)
    ## End(Not run)
List Available Built-in Models
    list_builtin_models()
Character vector of available model names
Load a counts matrix from various file formats including 10X outputs, tab-delimited text files, or h5ad files.
    load_counts(counts_file)
counts_file: Path to counts file. Supported formats include 10X outputs (e.g., matrix.mtx.gz), tab-delimited text files, and h5ad files.
A list containing:
counts: Sparse matrix (cells x genes)
cell_names: Character vector of cell barcodes
gene_names: Character vector of gene names
    ## Not run:
    data <- load_counts("path/to/matrix.mtx.gz")
    dim(data$counts)
    ## End(Not run)
Load reference spectra from file or cache
    load_reference(reference, cache_dir = NULL, verbose = TRUE)
reference: Either a reference name (e.g., "TCAT.V1") or path to a reference file (.tsv/.txt)
cache_dir: Directory to cache references
verbose: Print progress messages
A list containing:
spectra: Reference spectra matrix (programs x genes)
score_data: Score definitions (if available)
score_path: Path to score file (if available)
ref_name: Reference name
    ## Not run:
    ref <- load_reference("TCAT.V1")
    dim(ref$spectra)
    ## End(Not run)
Map single cells to reference gene expression programs
Print Method for CellProgramMapperResult
    ## S3 method for class 'CellProgramMapperResult'
    print(x, ...)
x: A CellProgramMapperResult object
...: Additional arguments (ignored)
Invisible x
Read a pandas DataFrame saved as NPZ file (format used by cNMF)
    read_df_from_npz(file)
file: Path to .npz file saved with np.savez() from pandas DataFrame
data.frame with proper row and column names
    ## Not run:
    df <- read_df_from_npz("path/to/data.df.npz")
    head(df)
    ## End(Not run)
Read a NumPy .npz archive file (compressed collection of .npy files). For files containing object arrays (strings), uses reticulate/numpy if available.
    read_npz(file)
file: Path to .npz file
Named list of arrays
    ## Not run:
    data <- read_npz("path/to/file.npz")
    names(data)
    ## End(Not run)
Functions for loading, downloading, and managing references
Save CellProgramMapper results to tab-delimited files
    save_results(
      result,
      output_dir = ".",
      prefix = "CellProgramMapper",
      verbose = TRUE
    )
result: A CellProgramMapperResult object
output_dir: Output directory
prefix: Prefix for output files
verbose: Print progress messages
Invisible NULL
    ## Not run:
    save_results(result, output_dir = "./output", prefix = "my_analysis")
    ## End(Not run)
Summary Method for CellProgramMapperResult
    ## S3 method for class 'CellProgramMapperResult'
    summary(object, ...)
object: A CellProgramMapperResult object
...: Additional arguments (ignored)
Invisible summary data frame