| Title: | Multi-Omics Fusion for Subtype Recognition |
|---|---|
| Description: | A comprehensive toolkit for integrating multi-modal biological data to discover disease subtypes and biological mechanisms. MOFSR provides 15 state-of-the-art multi-omics clustering algorithms (SNF, wSNF, CPCA, iClusterBayes, IntNMF, LRAcluster, MCIA, MOFA, NEMO, PINSPlus, RGCCA, SGCCA, CIMLR, BCC, LateFusion), 17 classification methods, parallel computing support, comprehensive visualization (UMAP, heatmaps, survival curves), data preprocessing (normalization, filtering, batch correction, QC), feature selection with bootstrap validation, and cluster quality assessment. All core algorithms are implemented internally for maximum compatibility, cross-platform support, and optimized performance. Works on Windows, macOS, and Linux without external dependencies for core functionality. |
| Authors: | Zaoqu Liu [aut, cre] |
| Maintainer: | Zaoqu Liu <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 2.3.0 |
| Built: | 2026-06-06 06:29:09 UTC |
| Source: | https://github.com/Zaoqu-Liu/MOFSR |
Factor analysis methods for multi-omics data integration.
Performs multi-view factor analysis for multi-omics integration. This is a simplified Bayesian factor model with ARD priors for sparsity. For the full MOFA implementation, use the MOFA2 Bioconductor package.
multi_view_factor_analysis(data_list, n_factors = 10, max_iter = 1000, tol = 1e-05, sparsity_prior = TRUE, verbose = FALSE)

| Argument | Description |
|---|---|
| data_list | List of data matrices (features x samples). |
| n_factors | Number of latent factors (default: 10). |
| max_iter | Maximum iterations (default: 1000). |
| tol | Convergence tolerance (default: 1e-5). |
| sparsity_prior | Use ARD prior for sparsity (default: TRUE). |
| verbose | Print progress (default: FALSE). |

This function implements a simplified multi-view factor model: Y_m = W_m * Z^T + E_m, where Y_m is the data for view m, W_m are the loadings, Z are the latent factors shared across views, and E_m is Gaussian noise. ARD (Automatic Relevance Determination) priors are used for automatic factor pruning.
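To make the model equation concrete, the following minimal sketch simulates two views from Y_m = W_m * Z^T + E_m with shared factors Z. The view names (`rna`, `meth`), the dimensions, and the noise level are illustrative assumptions; the sketch only generates data in the features x samples layout that `data_list` expects and does not reproduce the package's inference.

```r
# Simulate data from the model Y_m = W_m %*% t(Z) + E_m for two views.
set.seed(1)
n_samples <- 50                          # columns of each view
n_factors <- 3                           # number of shared latent factors
n_features <- c(rna = 200, meth = 120)   # hypothetical view names and sizes

# Shared latent factors: samples x factors
Z <- matrix(rnorm(n_samples * n_factors), nrow = n_samples)

# One loading matrix per view (features x factors) plus Gaussian noise
data_list <- lapply(n_features, function(p) {
  W <- matrix(rnorm(p * n_factors), nrow = p)
  E <- matrix(rnorm(p * n_samples, sd = 0.5), nrow = p)
  W %*% t(Z) + E
})

# Each view is features x samples
sapply(data_list, dim)
```

A list built this way could be passed as `data_list`; because every view shares the same Z, a factor model with `n_factors` of at least 3 has the capacity to capture the simulated signal.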
List with factors (Z), weights (W), variance explained, and convergence info.
For production use of MOFA, we recommend the MOFA2 Bioconductor package, which provides the full implementation with a Python backend.
Zaoqu Liu; Email: [email protected]
Pure R implementation of iClusterBayes for multi-omics clustering.
Performs Bayesian integrative clustering using Gibbs sampling.
icluster_bayes(data_list, k, n_burnin = 1000, n_sample = 2000, n_thin = 1, prior_alpha = 1, prior_sigma = c(1, 1), verbose = FALSE)

| Argument | Description |
|---|---|
| data_list | List of data matrices (features x samples). |
| k | Number of clusters. |
| n_burnin | Number of burn-in iterations (default: 1000). |
| n_sample | Number of sampling iterations (default: 2000). |
| n_thin | Thinning interval (default: 1). |
| prior_alpha | Dirichlet prior parameter (default: 1). |
| prior_sigma | Prior for residual variance (default: c(1, 1)). |
| verbose | Print progress. |

List with cluster assignments, posterior matrices, and parameters.
Zaoqu Liu
Mo Q, et al. A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics. 2018.
Internal implementation of NEMO algorithm for multi-omics clustering.
.NEMO_NEIGHBORS_RATIO
An object of class numeric of length 1.
Zaoqu Liu
Rappoport N, Shamir R. NEMO: cancer subtyping by integration of partial multi-omic data. Bioinformatics. 2019.
Internal implementation of SNF algorithm for multi-omics data integration.
Computes an affinity matrix using local Gaussian kernel.
snf_affinity_matrix(diff, K = 20, sigma = 0.5)

| Argument | Description |
|---|---|
| diff | Distance matrix (squared Euclidean distances). |
| K | Number of nearest neighbors for local scaling (default: 20). |
| sigma | Variance for local model (default: 0.5). |

Affinity matrix with exponential similarity.
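As an illustration of a local Gaussian kernel, the sketch below implements the scaled exponential similarity described by Wang et al. (2014): each pairwise distance is scaled by a local bandwidth that averages the two samples' mean distance to their K nearest neighbours and the pairwise distance itself. `affinity_sketch` is a hypothetical helper, not the package's code; it assumes a plain (unsquared) Euclidean distance matrix as input and squares it internally, so it may differ in detail from `snf_affinity_matrix`.

```r
# Sketch of a locally scaled exponential kernel (not the package's code).
affinity_sketch <- function(dist_mat, K = 20, sigma = 0.5) {
  n <- nrow(dist_mat)
  K <- min(K, n - 1)
  # Mean distance from each sample to its K nearest neighbours (self excluded)
  knn_mean <- apply(dist_mat, 1, function(d) mean(sort(d)[2:(K + 1)]))
  # Local scale eps_ij = (mean_i + mean_j + d_ij) / 3
  eps <- (outer(knn_mean, knn_mean, "+") + dist_mat) / 3
  eps <- pmax(eps, .Machine$double.eps)   # guard against division by zero
  # Exponential similarity: close pairs approach 1, distant pairs approach 0
  W <- exp(-dist_mat^2 / (sigma * eps))
  (W + t(W)) / 2                          # enforce symmetry
}

set.seed(1)
x <- matrix(rnorm(30 * 5), nrow = 30)     # 30 samples x 5 features
d <- as.matrix(dist(x))                   # Euclidean distances
W <- affinity_sketch(d, K = 10, sigma = 0.5)
dim(W)
```

Smaller values of `sigma` sharpen the kernel so that only near neighbours keep a sizeable affinity; larger `K` smooths the local bandwidth.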
Zaoqu Liu
Wang B, et al. Similarity Network Fusion for Aggregating Data Types on a Genomic Scale. Nat Methods. 2014;11(3):333-337.
Wang B, et al. Nat Methods. 2014 (see the affinity equation in the Methods section).
Aligns samples across multiple datasets to common samples.
align_samples(data_list)

| Argument | Description |
|---|---|
| data_list | List of data matrices. |

List of aligned matrices with common samples.
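A minimal sketch of this kind of alignment, assuming samples are stored in columns (the features x samples layout used elsewhere in this manual) and identified by column names. `align_sketch` is a hypothetical helper for illustration, not the package function.

```r
# Sketch: keep only samples (columns) present in every matrix, in a common order.
align_sketch <- function(data_list) {
  common <- Reduce(intersect, lapply(data_list, colnames))
  lapply(data_list, function(m) m[, common, drop = FALSE])
}

rna <- matrix(1:12, nrow = 3,
  dimnames = list(paste0("g", 1:3), c("S1", "S2", "S3", "S4")))
meth <- matrix(1:6, nrow = 2,
  dimnames = list(paste0("cg", 1:2), c("S4", "S2", "S5")))

aligned <- align_sketch(list(rna = rna, meth = meth))
lapply(aligned, colnames)   # both matrices now have columns "S2" "S4"
```

The common samples follow the order of the first matrix, so every aligned matrix ends up with identical column names in identical order.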
Performs Bayesian Consensus Clustering.
bcc_cluster(data_list, k, n_iter = 1000, n_burnin = 500, alpha = 1, verbose = FALSE)

| Argument | Description |
|---|---|
| data_list | List of data matrices (features x samples). |
| k | Number of clusters. |
| n_iter | Number of MCMC iterations (default: 1000). |
| n_burnin | Number of burn-in iterations (default: 500). |
| alpha | Dirichlet prior concentration (default: 1). |
| verbose | Print progress. |

List with cluster assignments and consensus matrix.
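One common way to summarise MCMC output as a consensus matrix is to record, for each pair of samples, the share of post-burn-in iterations in which they receive the same label. The sketch below illustrates that idea only; `draws` and `consensus_from_draws` are hypothetical names, and whether `bcc_cluster` builds its consensus matrix this way is an assumption.

```r
# Sketch: consensus (co-clustering) matrix from sampled label vectors.
# `draws` is a hypothetical iterations x samples matrix of cluster labels.
consensus_from_draws <- function(draws, n_burnin) {
  kept <- draws[seq_len(nrow(draws)) > n_burnin, , drop = FALSE]  # drop burn-in
  n <- ncol(kept)
  cons <- matrix(0, n, n)
  for (i in seq_len(nrow(kept))) {
    cons <- cons + outer(kept[i, ], kept[i, ], "==")  # 1 where labels agree
  }
  cons / nrow(kept)
}

set.seed(1)
# Toy draws: 4 samples; 5 noisy iterations, then labels settle at (1, 1, 2, 2)
draws <- rbind(
  matrix(sample(1:2, 5 * 4, replace = TRUE), nrow = 5),
  matrix(rep(c(1, 1, 2, 2), each = 10), nrow = 10)
)
consensus <- consensus_from_draws(draws, n_burnin = 5)
consensus
```

With the noisy iterations discarded as burn-in, the matrix is 1 for pairs inside the same cluster and 0 otherwise.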
Faster version of BCC using EM-like approach.
bcc_cluster_fast(data_list, k, max_iter = 50)

| Argument | Description |
|---|---|
| data_list | List of data matrices. |
| k | Number of clusters. |
| max_iter | Maximum iterations (default: 50). |

List with cluster assignments.
Computes CH index to evaluate clustering quality.
calc_chi(hclust_result, dist_matrix = NULL, max_clusters = round(1 + 3.3 * log10(length(hclust_result$order))))

| Argument | Description |
|---|---|
| hclust_result | Hierarchical clustering result. |
| dist_matrix | Optional distance matrix. |
| max_clusters | Maximum clusters to evaluate. |

Vector of CH index values.
Computes PAC to evaluate clustering stability.
calc_pac(consensus_result, range_clusters = 2:6, x1 = 0.1, x2 = 0.9)

| Argument | Description |
|---|---|
| consensus_result | Result from consensus_cluster(). |
| range_clusters | Vector of K values to evaluate (default: 2:6). |
| x1 | Lower bound threshold (default: 0.1). |
| x2 | Upper bound threshold (default: 0.9). |

Data frame with PAC values for each K.
Calculates the Calinski-Harabasz index for evaluating clustering quality.
CalCHI(hclust_result, dist_matrix = NULL, max_clusters = round(1 + 3.3 * log10(length(hclust_result$order))))

| Argument | Description |
|---|---|
| hclust_result | A hierarchical clustering object (result of hclust). |
| dist_matrix | Optional distance matrix. If not provided, cophenetic is used. |
| max_clusters | Integer. Maximum number of clusters to evaluate. |

A vector containing CH index values for each number of clusters.
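The Calinski-Harabasz index compares between-cluster to within-cluster dispersion: CH(k) = [B / (k - 1)] / [W / (n - k)], with larger values indicating better-separated clusters. The sketch below computes this textbook form from the raw data for each cut of an hclust tree. `ch_sketch` is a hypothetical helper; CalCHI itself works from a distance matrix (or cophenetic distances), so its values may differ from those of this data-based version.

```r
# Sketch: Calinski-Harabasz index for k = 2..max_k from an hclust tree.
ch_sketch <- function(x, hc, max_k) {
  n <- nrow(x)
  total_ss <- sum(scale(x, scale = FALSE)^2)   # total sum of squares
  sapply(2:max_k, function(k) {
    cl <- cutree(hc, k = k)
    # Within-cluster sum of squares, summed over clusters
    within_ss <- sum(sapply(split(as.data.frame(x), cl), function(g) {
      sum(scale(as.matrix(g), scale = FALSE)^2)
    }))
    between_ss <- total_ss - within_ss
    (between_ss / (k - 1)) / (within_ss / (n - k))
  })
}

x <- scale(as.matrix(mtcars))
hc <- hclust(dist(x), method = "average")
max_k <- round(1 + 3.3 * log10(nrow(x)))   # same rule as the max_clusters default
ch <- ch_sketch(x, hc, max_k)
names(ch) <- 2:max_k
ch
```

For the 32 rows of mtcars the default rule gives round(1 + 3.3 * log10(32)) = 6, so the index is evaluated for k = 2 to 6.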
Zaoqu Liu; Email: [email protected]
This function calculates the Proportion of Ambiguous Clustering (PAC) to help evaluate the optimal number of clusters in a consensus clustering analysis.
CalPAC(consensus_result, range_clusters = 2:6, x1 = 0.1, x2 = 0.9)

| Argument | Description |
|---|---|
| consensus_result | A list containing consensus clustering results from ConsensusClusterPlus. |
| range_clusters | Integer vector. The range of cluster numbers evaluated during consensus clustering. |
| x1 | Numeric. Lower bound for defining the PAC (default: 0.1). |
| x2 | Numeric. Upper bound for defining the PAC (default: 0.9). |

The PAC is calculated as the difference between the cumulative distribution function values at two thresholds, x2 and x1, on the consensus matrix values. A lower PAC indicates a more stable clustering solution.
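The calculation described above can be sketched for a single consensus matrix as follows. `pac_sketch` is a hypothetical helper and the toy matrices are made-up values; the sketch assumes the CDF is taken over the off-diagonal (pairwise) consensus values.

```r
# Sketch: PAC = F(x2) - F(x1), where F is the empirical CDF of the
# off-diagonal consensus values.
pac_sketch <- function(consensus_matrix, x1 = 0.1, x2 = 0.9) {
  vals <- consensus_matrix[lower.tri(consensus_matrix)]
  cdf <- ecdf(vals)
  cdf(x2) - cdf(x1)
}

# Toy consensus matrices for 6 samples (hypothetical values)
stable <- matrix(0, 6, 6)
stable[1:3, 1:3] <- 1
stable[4:6, 4:6] <- 1             # two perfectly separated clusters
ambiguous <- matrix(0.5, 6, 6)    # every pair co-clusters half the time
diag(ambiguous) <- 1

pac_sketch(stable)      # 0: no pair falls between the thresholds
pac_sketch(ambiguous)   # 1: every pair is ambiguous
```

Pairs with consensus near 0 or 1 are consistently separated or consistently grouped, so only the values between x1 and x2 count as ambiguous.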
A data frame containing the PAC values for each number of clusters.
Zaoqu Liu; Email: [email protected]
data <- mtcars
cc_res <- RunCC(data)
pac_values <- CalPAC(cc_res)
pac_values
Checks if samples are aligned across datasets.
check_sample_alignment(data_list, strict = FALSE)

| Argument | Description |
|---|---|
| data_list | List of data matrices. |
| strict | If TRUE, requires identical sample names. |

TRUE if aligned, otherwise prints discrepancies.
Performs multi-kernel learning based clustering.
cimlr_cluster(data_list, k, n_kernels = 10, max_iter = 30)

| Argument | Description |
|---|---|
| data_list | List of data matrices (features x samples). |
| k | Number of clusters. |
| n_kernels | Number of kernels per data type (default: 10). |
| max_iter | Maximum iterations (default: 30). |

List with clusters, kernel, weights.
Ranks features based on their contribution to clustering.
cimlr_feature_ranking(data_list, cluster)

| Argument | Description |
|---|---|
| data_list | List of data matrices. |
| cluster | Cluster assignments. |

List of feature importance for each data type.
This function performs classification using AdaBoost to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.Adaboost(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)

| Argument | Description |
|---|---|
| data.test | A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
| data.train | A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
| cluster.data | A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
| cluster.markers | A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
| scale | A logical value indicating whether to scale the test data. Default is TRUE. |

The function operates as follows:
1. Ensures that the 'cluster.data' has the correct column names.
2. Adds a one-hot encoded matrix for cluster assignments.
3. Scales the test data for prediction.
4. Selects genes that are common between the test and training datasets.
5. Uses glmnet to identify the important markers for each cluster and trains an AdaBoost model for classification.
6. Predicts the cluster for test samples and provides probabilities for each cluster.

A data frame with:
- ID: The sample identifier.
- Probabilities: The probabilities for each cluster assignment.
- Predict: The predicted cluster label for each sample.
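The input preparation in steps 1 to 4 is shared by the Classifier.* functions and can be sketched in base R as below. This is an illustration under assumptions, not the package's code: the column names `ID` and `Cluster`, the per-gene scaling across samples, and the object names `onehot`, `train.x`, and `test.x` are all chosen for the example.

```r
set.seed(1)
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:6),
  Cluster = rep(c("C1", "C2", "C3"), each = 2)
)
data.train <- matrix(rnorm(30), nrow = 5,
  dimnames = list(paste0("Gene", 1:5), cluster.data$Sample))
data.test <- matrix(rnorm(16), nrow = 4,
  dimnames = list(paste0("Gene", 2:5), paste0("P", 1:4)))

# Step 1: standardise the column names of cluster.data (assumed names)
colnames(cluster.data)[1:2] <- c("ID", "Cluster")

# Step 2: one-hot encode the cluster assignments (one 0/1 column per cluster)
onehot <- model.matrix(~ 0 + factor(Cluster), data = cluster.data)
colnames(onehot) <- levels(factor(cluster.data$Cluster))

# Step 3: scale each gene (row) of the test data to mean 0 and sd 1
data.test.scaled <- t(scale(t(data.test)))

# Step 4: restrict both matrices to the genes they share
common.genes <- intersect(rownames(data.train), rownames(data.test.scaled))
train.x <- data.train[common.genes, , drop = FALSE]
test.x <- data.test.scaled[common.genes, , drop = FALSE]
```

Each one-hot column can then serve as a binary response when selecting markers for that cluster against all others.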
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample))
data.test <- matrix(rnorm(5000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50)))
cluster.markers <- setNames(
  lapply(unique(cluster.data$Cluster), function(cluster) {
    data.frame(Gene = sample(rownames(data.train), 10))
  }),
  unique(cluster.data$Cluster)
)
result <- Classifier.Adaboost(
  data.test = data.test, data.train = data.train,
  cluster.data, cluster.markers
)
head(result)
This function performs classification using Decision Tree to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.DT(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)

| Argument | Description |
|---|---|
| data.test | A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
| data.train | A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
| cluster.data | A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
| cluster.markers | A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
| scale | A logical value indicating whether to scale the test data. Default is TRUE. |

The function operates as follows:
1. Ensures that the 'cluster.data' has the correct column names.
2. Adds a one-hot encoded matrix for cluster assignments.
3. Scales the test data for prediction.
4. Selects genes that are common between the test and training datasets.
5. Uses glmnet to identify the important markers for each cluster and trains a Decision Tree model for classification.
6. Predicts the cluster for test samples and provides probabilities for each cluster.

A data frame with:
- ID: The sample identifier.
- Probabilities: The probabilities for each cluster assignment.
- Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample))
data.test <- matrix(rnorm(5000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50)))
cluster.markers <- setNames(
  lapply(unique(cluster.data$Cluster), function(cluster) {
    data.frame(Gene = sample(rownames(data.train), 10))
  }),
  unique(cluster.data$Cluster)
)
result <- Classifier.DT(
  data.test = data.test, data.train = data.train,
  cluster.data, cluster.markers
)
head(result)
This function performs classification using Elastic Net to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.Enet(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)

| Argument | Description |
|---|---|
| data.test | A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
| data.train | A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
| cluster.data | A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
| cluster.markers | A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
| scale | A logical value indicating whether to scale the test data. Default is TRUE. |

The function operates as follows:
1. Ensures that the 'cluster.data' has the correct column names.
2. Adds a one-hot encoded matrix for cluster assignments.
3. Scales the test data for prediction.
4. Selects genes that are common between the test and training datasets.
5. Uses glmnet to identify the important markers for each cluster and trains an Elastic Net model for classification.
6. Predicts the cluster for test samples and provides probabilities for each cluster.

A data frame with:
- ID: The sample identifier.
- Probabilities: The probabilities for each cluster assignment.
- Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample))
data.test <- matrix(rnorm(5000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50)))
cluster.markers <- setNames(
  lapply(unique(cluster.data$Cluster), function(cluster) {
    data.frame(Gene = sample(rownames(data.train), 10))
  }),
  unique(cluster.data$Cluster)
)
result <- Classifier.Enet(
  data.test = data.test, data.train = data.train,
  cluster.data, cluster.markers
)
head(result)
This function performs classification using pathway enrichment analysis and a neural network to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.Enrichment(data.test, data.train, cluster.data, cluster.markers, scale = TRUE, nCores = 5)

| Argument | Description |
|---|---|
| data.test | A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
| data.train | A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
| cluster.data | A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
| cluster.markers | A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names, 'OR' as odds ratio, and 'AUC' as area under curve. |
| scale | A logical value indicating whether to scale the test data. Default is TRUE. |
| nCores | An integer indicating the number of cores to use for pathway enrichment analysis. Default is 5. |

The function operates as follows:
1. Ensures that the 'cluster.data' has the correct column names.
2. Adds a one-hot encoded matrix for cluster assignments.
3. Scales the test data for prediction.
4. Selects genes that are common between the test and training datasets.
5. Filters out markers with low odds ratio.
6. Performs pathway enrichment analysis using the ssMwwGST method.

A data frame with:
- ID: The sample identifier.
- Cluster: The predicted cluster label for each sample.
- NES: Normalized Enrichment Score (NES) for each cluster assignment.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample))
data.test <- matrix(rnorm(5000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50)))
cluster.markers <- setNames(
  lapply(unique(cluster.data$Cluster), function(cluster) {
    data.frame(Gene = sample(rownames(data.train), 10))
  }),
  unique(cluster.data$Cluster)
)
result <- Classifier.Enrichment(
  data.test = data.test, data.train = data.train,
  cluster.data, cluster.markers
)
head(result)
This function performs classification using Gradient Boosted Decision Trees (GBDT) to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.GBDT(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)

| Argument | Description |
|---|---|
| data.test | A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
| data.train | A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
| cluster.data | A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
| cluster.markers | A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
| scale | A logical value indicating whether to scale the test data. Default is TRUE. |

The function operates as follows:
1. Ensures that the 'cluster.data' has the correct column names.
2. Adds a one-hot encoded matrix for cluster assignments.
3. Scales the test data for prediction.
4. Selects genes that are common between the test and training datasets.
5. Uses glmnet to identify the important markers for each cluster and trains a Gradient Boosted Decision Trees model for classification.
6. Predicts the cluster for test samples and provides probabilities for each cluster.

A data frame with:
- ID: The sample identifier.
- Probabilities: The probabilities for each cluster assignment.
- Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample))
data.test <- matrix(rnorm(5000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50)))
cluster.markers <- setNames(
  lapply(unique(cluster.data$Cluster), function(cluster) {
    data.frame(Gene = sample(rownames(data.train), 10))
  }),
  unique(cluster.data$Cluster)
)
result <- Classifier.GBDT(
  data.test = data.test, data.train = data.train,
  cluster.data, cluster.markers
)
head(result)
This function performs classification using k-Nearest Neighbors to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.kNN(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)

| Argument | Description |
|---|---|
| data.test | A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
| data.train | A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
| cluster.data | A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
| cluster.markers | A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
| scale | A logical value indicating whether to scale the test data. Default is TRUE. |

The function operates as follows:
1. Ensures that the 'cluster.data' has the correct column names and matches training data.
2. Scales the test data for prediction if 'scale' is TRUE.
3. Selects genes that are common between the test and training datasets.
4. Uses glmnet to identify the important markers for each cluster and trains a k-Nearest Neighbors (kNN) model for classification.
5. Predicts the cluster for test samples and provides probabilities for each cluster.

A data frame with:
- ID: The sample identifier.
- Cluster: The predicted cluster label for each sample.
- Probabilities: The probabilities for each cluster assignment.
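To show how a kNN classifier can yield both a label and per-cluster probabilities, here is a minimal base R sketch in which each probability is the share of the k nearest training samples belonging to that cluster. `knn_sketch` is a hypothetical helper that takes samples in rows (the transpose of the genes x samples inputs above) and skips the glmnet marker selection; it is not the package's implementation.

```r
# Sketch: k-nearest-neighbour prediction with vote-share probabilities.
knn_sketch <- function(train, labels, test, k = 5) {
  # train/test: samples x genes; labels: cluster of each training sample
  classes <- sort(unique(labels))
  probs <- t(apply(test, 1, function(x) {
    d <- sqrt(colSums((t(train) - x)^2))    # distance to every training sample
    nn <- labels[order(d)[seq_len(k)]]      # labels of the k nearest samples
    as.numeric(table(factor(nn, levels = classes))) / k   # vote share per cluster
  }))
  colnames(probs) <- classes
  data.frame(ID = rownames(test),
             Cluster = classes[max.col(probs, ties.method = "first")],
             probs, check.names = FALSE)
}

set.seed(1)
train <- rbind(matrix(rnorm(40, mean = 0), 10), matrix(rnorm(40, mean = 4), 10))
labels <- rep(c("C1", "C2"), each = 10)
test <- rbind(P1 = rep(0, 4), P2 = rep(4, 4))
res <- knn_sketch(train, labels, test, k = 5)
res
```

Because the probabilities are vote shares, they take only the values 0, 1/k, ..., 1; a larger k gives a finer but smoother estimate.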
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample))
data.test <- matrix(rnorm(5000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50)))
cluster.markers <- setNames(
  lapply(unique(cluster.data$Cluster), function(cluster) {
    data.frame(Gene = sample(rownames(data.train), 10))
  }),
  unique(cluster.data$Cluster)
)
result <- Classifier.kNN(
  data.test = data.test, data.train = data.train,
  cluster.data, cluster.markers
)
head(result)
This function performs classification using LASSO (Least Absolute Shrinkage and Selection Operator) to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.LASSO(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)

| Argument | Description |
|---|---|
| data.test | A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
| data.train | A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
| cluster.data | A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
| cluster.markers | A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
| scale | A logical value indicating whether to scale the test data. Default is TRUE. |

The function operates as follows:
1. Ensures that the 'cluster.data' has the correct column names.
2. Adds a one-hot encoded matrix for cluster assignments.
3. Scales the test data for prediction.
4. Selects genes that are common between the test and training datasets.
5. Uses glmnet to identify the important markers for each cluster and trains a LASSO model for classification.
6. Predicts the cluster for test samples and provides probabilities for each cluster.

A data frame with:
- ID: The sample identifier.
- Probabilities: The probabilities for each cluster assignment.
- Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample))
data.test <- matrix(rnorm(5000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50)))
cluster.markers <- setNames(
  lapply(unique(cluster.data$Cluster), function(cluster) {
    data.frame(Gene = sample(rownames(data.train), 10))
  }),
  unique(cluster.data$Cluster)
)
result <- Classifier.LASSO(
  data.test = data.test, data.train = data.train,
  cluster.data, cluster.markers
)
head(result)
This function performs classification using Linear Discriminant Analysis (LDA) to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.LDA(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)

| Argument | Description |
|---|---|
| data.test | A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
| data.train | A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
| cluster.data | A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
| cluster.markers | A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
| scale | A logical value indicating whether to scale the test data. Default is TRUE. |

The function operates as follows:
1. Ensures that the 'cluster.data' has the correct column names.
2. Adds a one-hot encoded matrix for cluster assignments.
3. Scales the test data for prediction.
4. Selects genes that are common between the test and training datasets.
5. Uses glmnet to identify the important markers for each cluster and trains an LDA model for classification.
6. Predicts the cluster for test samples and provides probabilities for each cluster.

A data frame with:
- ID: The sample identifier.
- Probabilities: The probabilities for each cluster assignment.
- Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample))
data.test <- matrix(rnorm(5000), nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50)))
cluster.markers <- setNames(
  lapply(unique(cluster.data$Cluster), function(cluster) {
    data.frame(Gene = sample(rownames(data.train), 10))
  }),
  unique(cluster.data$Cluster)
)
result <- Classifier.LDA(
  data.test = data.test, data.train = data.train,
  cluster.data, cluster.markers
)
head(result)
This function performs classification using Naive Bayes to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.NBayes(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)
data.test |
A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
data.train |
A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
cluster.data |
A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
cluster.markers |
A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
scale |
A logical value indicating whether to scale the test data. Default is TRUE. |
The function operates as follows: 1. Ensures that the 'cluster.data' has the correct column names. 2. Adds a one-hot encoded matrix for cluster assignments. 3. Scales the test data for prediction. 4. Selects genes that are common between the test and training datasets. 5. Uses glmnet to identify the important markers for each cluster and trains a Naive Bayes model for classification. 6. Predicts the cluster for test samples and provides probabilities for each cluster.
A data frame with: - ID: The sample identifier. - Probabilities: The probabilities for each cluster assignment. - Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample)
)
data.test <- matrix(rnorm(5000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50))
)
cluster.markers <- setNames(
  lapply(
    unique(cluster.data$Cluster),
    function(cluster) {
      data.frame(Gene = sample(rownames(data.train), 10))
    }
  ),
  unique(cluster.data$Cluster)
)
result <- Classifier.NBayes(
  data.test = data.test,
  data.train = data.train,
  cluster.data,
  cluster.markers
)
head(result)
This function performs classification using a neural network to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.NNet(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)
data.test |
A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
data.train |
A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
cluster.data |
A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
cluster.markers |
A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
scale |
A logical value indicating whether to scale the test data. Default is TRUE. |
The function operates as follows: 1. Ensures that the 'cluster.data' has the correct column names. 2. Adds a one-hot encoded matrix for cluster assignments. 3. Scales the test data for prediction. 4. Selects genes that are common between the test and training datasets. 5. Uses glmnet to identify the important markers for each cluster and trains a neural network model for classification. 6. Predicts the cluster for test samples and provides probabilities for each cluster.
A data frame with: - ID: The sample identifier. - Probabilities: The probabilities for each cluster assignment. - Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample)
)
data.test <- matrix(rnorm(5000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50))
)
cluster.markers <- setNames(
  lapply(
    unique(cluster.data$Cluster),
    function(cluster) {
      data.frame(Gene = sample(rownames(data.train), 10))
    }
  ),
  unique(cluster.data$Cluster)
)
result <- Classifier.NNet(
  data.test = data.test,
  data.train = data.train,
  cluster.data,
  cluster.markers
)
head(result)
This function performs classification using PCA and a neural network to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.PCA(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)
data.test |
A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
data.train |
A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
cluster.data |
A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
cluster.markers |
A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
scale |
A logical value indicating whether to scale the test data. Default is TRUE. |
The function operates as follows: 1. Ensures that the 'cluster.data' has the correct column names. 2. Adds a one-hot encoded matrix for cluster assignments. 3. Scales the test data for prediction. 4. Selects genes that are common between the test and training datasets. 5. Uses glmnet to identify the important markers for each cluster and performs PCA to reduce dimensionality. 6. Trains a neural network model for classification using the top PCs. 7. Predicts the cluster for test samples and provides probabilities for each cluster.
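The dimensionality-reduction step can be illustrated with base R. This is a minimal sketch, assuming the PCA is fitted on the training samples and the test samples are projected with the same centering, scaling, and rotation; the toy data and the choice of three components are hypothetical, and the package may select its components differently.

```r
set.seed(1)
# Hypothetical toy data with samples in rows, because prcomp() expects that.
train <- matrix(rnorm(60 * 10), nrow = 60,
  dimnames = list(paste0("Sample", 1:60), paste0("Gene", 1:10)))
test <- matrix(rnorm(50 * 10), nrow = 50,
  dimnames = list(paste0("P", 1:50), paste0("Gene", 1:10)))

# Fit PCA on the training samples only.
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Keep the top PCs (3 is an arbitrary illustrative choice).
n_pcs <- 3
train_pcs <- pca$x[, 1:n_pcs]

# Project the test samples using the training centering, scaling, and rotation.
test_pcs <- predict(pca, newdata = test)[, 1:n_pcs]
```

Fitting on the training data alone keeps the test samples out of the model, so `train_pcs` and `test_pcs` live in the same coordinate system and a classifier trained on one can be applied to the other.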
A data frame with: - ID: The sample identifier. - Probabilities: The probabilities for each cluster assignment. - Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample)
)
data.test <- matrix(rnorm(5000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50))
)
cluster.markers <- setNames(
  lapply(
    unique(cluster.data$Cluster),
    function(cluster) {
      data.frame(Gene = sample(rownames(data.train), 10))
    }
  ),
  unique(cluster.data$Cluster)
)
result <- Classifier.PCA(
  data.test = data.test,
  data.train = data.train,
  cluster.data,
  cluster.markers
)
head(result)
This function performs classification using Random Forest to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.RF(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)
data.test |
A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
data.train |
A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
cluster.data |
A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
cluster.markers |
A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
scale |
A logical value indicating whether to scale the test data. Default is TRUE. |
The function operates as follows: 1. Ensures that the 'cluster.data' has the correct column names. 2. Adds a one-hot encoded matrix for cluster assignments. 3. Scales the test data for prediction if 'scale' is TRUE. 4. Selects genes that are common between the test and training datasets. 5. Uses glmnet to identify the important markers for each cluster and trains a Random Forest model for classification. 6. Predicts the cluster for test samples and provides probabilities for each cluster.
A data frame with: - ID: The sample identifier. - Cluster: The predicted cluster label for each sample. - Probabilities: The probabilities for each cluster assignment.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample)
)
data.test <- matrix(rnorm(5000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50))
)
cluster.markers <- setNames(
  lapply(
    unique(cluster.data$Cluster),
    function(cluster) {
      data.frame(Gene = sample(rownames(data.train), 10))
    }
  ),
  unique(cluster.data$Cluster)
)
result <- Classifier.RF(
  data.test = data.test,
  data.train = data.train,
  cluster.data,
  cluster.markers
)
head(result)
This function performs classification using Ridge Regression to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.Ridge(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)
data.test |
A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
data.train |
A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
cluster.data |
A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
cluster.markers |
A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
scale |
A logical value indicating whether to scale the test data. Default is TRUE. |
The function operates as follows: 1. Ensures that the 'cluster.data' has the correct column names. 2. Adds a one-hot encoded matrix for cluster assignments. 3. Scales the test data for prediction. 4. Selects genes that are common between the test and training datasets. 5. Uses glmnet to identify the important markers for each cluster and trains a Ridge model for classification. 6. Predicts the cluster for test samples and provides probabilities for each cluster.
A data frame with: - ID: The sample identifier. - Probabilities: The probabilities for each cluster assignment. - Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample)
)
data.test <- matrix(rnorm(5000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50))
)
cluster.markers <- setNames(
  lapply(
    unique(cluster.data$Cluster),
    function(cluster) {
      data.frame(Gene = sample(rownames(data.train), 10))
    }
  ),
  unique(cluster.data$Cluster)
)
result <- Classifier.Ridge(
  data.test = data.test,
  data.train = data.train,
  cluster.data,
  cluster.markers
)
head(result)
This function performs single-sample Gene Set Enrichment Analysis (ssGSEA) to assign subtypes to samples based on marker gene sets. It calculates enrichment scores with permutation testing and predicts sample subtypes based on the most significant enrichment results.
Classifier.ssGSEA(
  data.test,
  marker.list,
  dir.file = ".",
  gct.filename = "data.gct",
  number.perms = 100,
  tolerate.mixed = FALSE,
  method = c("internal", "GSVA", "external"),
  seed = 12345
)
data.test |
A matrix or data frame representing the input expression data, where rows are genes and columns are samples. |
marker.list |
A named list of marker gene sets, where each list element corresponds to a specific subtype or category of interest. |
dir.file |
Character. Directory for saving the output files (default: '.'). Set to NULL to skip file output. |
gct.filename |
Character. The filename for the generated GCT file (default: 'data.gct'). |
number.perms |
Integer. Number of permutations for ssGSEA analysis (default: 100). |
tolerate.mixed |
Logical. Whether to allow "Mixed" predictions when multiple gene sets have the same minimum p-value (default: FALSE). |
method |
Character. The ssGSEA implementation to use: "internal" (built-in), "GSVA" (requires GSVA package), or "external" (requires ssgsea.GBM.classification package). Default: "internal". |
seed |
Integer. Random seed for reproducibility (default: 12345). |
The function:
Calculates ssGSEA enrichment scores for each marker gene set in every sample.
Performs permutation testing to estimate statistical significance.
Predicts subtypes by identifying the marker gene set with the most significant enrichment (smallest p-value).
If tolerate.mixed is TRUE and multiple gene sets share the same minimum p-value, the sample is labeled as "Mixed".
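The prediction rule described above can be sketched in a few lines of base R. This is an illustration only: the p-value matrix and the helper `predict_subtype` are hypothetical, and the tie-breaking used when tolerate.mixed is FALSE (take the first gene set) is an assumption, since the documentation does not state it.

```r
# Hypothetical p-value matrix: one row per sample, one column per gene set.
pvals <- matrix(
  c(0.01, 0.20,
    0.30, 0.02,
    0.05, 0.05),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("Sample", 1:3), c("Subtype1", "Subtype2"))
)

# Hypothetical helper: pick the gene set with the smallest p-value per sample.
predict_subtype <- function(pvals, tolerate.mixed = FALSE) {
  apply(pvals, 1, function(p) {
    best <- names(p)[p == min(p)]
    # Assumption: without tolerate.mixed, ties fall back to the first gene set.
    if (length(best) > 1 && tolerate.mixed) "Mixed" else best[1]
  })
}

predict_subtype(pvals, tolerate.mixed = TRUE)
```

Here Sample3 has identical p-values for both gene sets, so it is labeled "Mixed" only when tolerate.mixed is TRUE.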
A data frame with the following columns:
ID: Sample identifiers.
Predict: Predicted subtype for each sample.
Columns with _pval: P-values for each marker gene set or subtype.
Zaoqu Liu; Email: [email protected]
Wang Q, Hu B, Hu X, Kim H, Squatrito M, Scarpace L, et al. Tumor Evolution of Glioma-Intrinsic Gene Expression Subtypes Associates with Immunological Changes in the Microenvironment. Cancer Cell. July 2017;32(1):42-56.e6.
# Simulated expression data
data.test <- matrix(rnorm(10000), nrow = 100, ncol = 100)
rownames(data.test) <- paste0("Gene", 1:100)
colnames(data.test) <- paste0("Sample", 1:100)

# Example marker list
marker.list <- list(
  Subtype1 = c("Gene1", "Gene2", "Gene3"),
  Subtype2 = c("Gene4", "Gene5", "Gene6")
)

# Run ssGSEA-based subtyping
result <- Classifier.ssGSEA(
  data.test = data.test,
  marker.list = marker.list,
  number.perms = 50,
  tolerate.mixed = TRUE
)
print(result)
This function performs classification using Stepwise Logistic Regression to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.StepLR(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)
data.test |
A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
data.train |
A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
cluster.data |
A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
cluster.markers |
A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
scale |
A logical value indicating whether to scale the test data. Default is TRUE. |
The function operates as follows: 1. Ensures that the 'cluster.data' has the correct column names. 2. Scales the test data for prediction. 3. Selects genes that are common between the test and training datasets. 4. Uses glmnet to identify the important markers for each cluster and trains a multinomial logistic regression model for classification. 5. Predicts the cluster for test samples and provides probabilities for each cluster.
A data frame with: - ID: The sample identifier. - Probabilities: The probabilities for each cluster assignment. - Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample)
)
data.test <- matrix(rnorm(5000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50))
)
cluster.markers <- setNames(
  lapply(
    unique(cluster.data$Cluster),
    function(cluster) {
      data.frame(Gene = sample(rownames(data.train), 10))
    }
  ),
  unique(cluster.data$Cluster)
)
result <- Classifier.StepLR(
  data.test = data.test,
  data.train = data.train,
  cluster.data,
  cluster.markers
)
head(result)
This function performs classification using SVD and a neural network to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.SVD(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)
data.test |
A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
data.train |
A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
cluster.data |
A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
cluster.markers |
A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
scale |
A logical value indicating whether to scale the test data. Default is TRUE. |
The function operates as follows: 1. Ensures that the 'cluster.data' has the correct column names. 2. Adds a one-hot encoded matrix for cluster assignments. 3. Scales the test data for prediction. 4. Selects genes that are common between the test and training datasets. 5. Uses glmnet to identify the important markers for each cluster and performs SVD to reduce dimensionality. 6. Trains a neural network model for classification using the top singular vectors. 7. Predicts the cluster for test samples and provides probabilities for each cluster.
A data frame with: - ID: The sample identifier. - Probabilities: The probabilities for each cluster assignment. - Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample)
)
data.test <- matrix(rnorm(5000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50))
)
cluster.markers <- setNames(
  lapply(
    unique(cluster.data$Cluster),
    function(cluster) {
      data.frame(Gene = sample(rownames(data.train), 10))
    }
  ),
  unique(cluster.data$Cluster)
)
result <- Classifier.SVD(
  data.test = data.test,
  data.train = data.train,
  cluster.data,
  cluster.markers
)
head(result)
This function performs classification using Support Vector Machine (SVM) to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.SVM(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)
data.test |
A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
data.train |
A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
cluster.data |
A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
cluster.markers |
A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
scale |
A logical value indicating whether to scale the test data. Default is TRUE. |
The function operates as follows: 1. Ensures that the 'cluster.data' has the correct column names. 2. Adds a one-hot encoded matrix for cluster assignments. 3. Scales the test data for prediction. 4. Selects genes that are common between the test and training datasets. 5. Uses glmnet to identify the important markers for each cluster and trains an SVM model for classification. 6. Predicts the cluster for test samples and provides probabilities for each cluster.
A data frame with: - ID: The sample identifier. - Probabilities: The probabilities for each cluster assignment. - Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample)
)
data.test <- matrix(rnorm(5000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50))
)
cluster.markers <- setNames(
  lapply(
    unique(cluster.data$Cluster),
    function(cluster) {
      data.frame(Gene = sample(rownames(data.train), 10))
    }
  ),
  unique(cluster.data$Cluster)
)
result <- Classifier.SVM(
  data.test = data.test,
  data.train = data.train,
  cluster.data,
  cluster.markers
)
head(result)
This function performs classification using XGBoost to predict cluster assignments for test data based on trained models from training data and cluster markers.
Classifier.XGBoost(data.test, data.train, cluster.data, cluster.markers, scale = TRUE)
data.test |
A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
data.train |
A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
cluster.data |
A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
cluster.markers |
A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
scale |
A logical value indicating whether to scale the test data. Default is TRUE. |
The function operates as follows: 1. Ensures that the 'cluster.data' has the correct column names. 2. Adds a one-hot encoded matrix for cluster assignments. 3. Scales the test data for prediction. 4. Selects genes that are common between the test and training datasets. 5. Uses glmnet to identify the important markers for each cluster and trains an XGBoost model for classification. 6. Predicts the cluster for test samples and provides probabilities for each cluster.
A data frame with: - ID: The sample identifier. - Probabilities: The probabilities for each cluster assignment. - Predict: The predicted cluster label for each sample.
Zaoqu Liu; Email: [email protected]
cluster.data <- data.frame(
  Sample = paste0("Sample", 1:60),
  Cluster = rep(paste0("C", 1:3), each = 20)
)
data.train <- matrix(rnorm(6000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), cluster.data$Sample)
)
data.test <- matrix(rnorm(5000),
  nrow = 100,
  dimnames = list(paste0("Gene", 1:100), paste0("P", 1:50))
)
cluster.markers <- setNames(
  lapply(
    unique(cluster.data$Cluster),
    function(cluster) {
      data.frame(Gene = sample(rownames(data.train), 10))
    }
  ),
  unique(cluster.data$Cluster)
)
result <- Classifier.XGBoost(
  data.test = data.test,
  data.train = data.train,
  cluster.data,
  cluster.markers
)
head(result)
Creates a comparison matrix showing agreement between algorithms.
compare_clusterings(results)
results |
List of clustering results from run_multiple_algorithms(). |
Matrix of Adjusted Rand Index values.
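To make the returned matrix concrete, the sketch below computes the Adjusted Rand Index for every pair of label vectors in base R. It is not the package's implementation: the helper `adjusted_rand_index` and the plain list of label vectors are hypothetical stand-ins, since the exact structure returned by run_multiple_algorithms() is not shown here.

```r
# Hypothetical helper: Adjusted Rand Index from the contingency table of two partitions.
adjusted_rand_index <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))
  sum_rows <- sum(choose(rowSums(tab), 2))
  sum_cols <- sum(choose(colSums(tab), 2))
  expected <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  # Degenerate case, e.g. both partitions put every sample in one cluster.
  if (max_index == expected) return(1)
  (sum_cells - expected) / (max_index - expected)
}

# Hypothetical cluster labels for six samples from three algorithms.
labels <- list(
  SNF = c(1, 1, 2, 2, 3, 3),
  NEMO = c(2, 2, 1, 1, 3, 3),
  CPCA = c(1, 1, 1, 2, 2, 2)
)

# Pairwise ARI matrix.
ari <- outer(
  seq_along(labels), seq_along(labels),
  Vectorize(function(i, j) adjusted_rand_index(labels[[i]], labels[[j]]))
)
dimnames(ari) <- list(names(labels), names(labels))
ari
```

SNF and NEMO describe the same partition with different label names, so their ARI is 1; the index depends only on which samples are grouped together, not on the label values.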
Performs UMAP for visualization of multi-omics data.
compute_umap(data, n_neighbors = 15, min_dist = 0.1, n_epochs = 200, seed = 42)
data |
Data matrix (samples x features) or list of matrices. |
n_neighbors |
Number of neighbors (default: 15). |
min_dist |
Minimum distance (default: 0.1). |
n_epochs |
Number of epochs (default: 200). |
seed |
Random seed. |
Matrix with UMAP coordinates (samples x 2).
Performs consensus clustering to identify stable clusters.
consensus_cluster(
  d,
  maxK = 6,
  reps = 1000,
  pItem = 0.8,
  pFeature = 1,
  clusterAlg = "hc",
  innerLinkage = "ward.D2",
  finalLinkage = "ward.D2",
  distance = "euclidean",
  seed = NULL,
  verbose = FALSE
)
d |
Data matrix (features x samples) or distance object. |
maxK |
Maximum number of clusters to evaluate (default: 6). |
reps |
Number of resampling iterations (default: 1000). |
pItem |
Proportion of items to sample in each iteration (default: 0.8). |
pFeature |
Proportion of features to sample (default: 1). |
clusterAlg |
Clustering algorithm: "hc", "km", or "pam" (default: "hc"). |
innerLinkage |
Linkage method for hierarchical clustering (default: "ward.D2"). |
finalLinkage |
Linkage for final clustering (default: "ward.D2"). |
distance |
Distance metric (default: "euclidean"). |
seed |
Random seed for reproducibility. |
verbose |
Print progress messages. |
List containing consensus matrices and clustering results.
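The idea behind a consensus matrix can be sketched in base R: repeatedly subsample the samples, cluster each subsample, and record how often each pair of samples lands in the same cluster. The code below is a simplified illustration under stated assumptions (hierarchical clustering only, a single k, no feature subsampling, a hypothetical helper `consensus_matrix`); it is not the package's implementation.

```r
set.seed(1)
# Toy data: 20 features x 30 samples, with two shifted groups of samples.
d <- matrix(rnorm(20 * 30), nrow = 20)
d[, 16:30] <- d[, 16:30] + 3
colnames(d) <- paste0("Sample", 1:30)

# Hypothetical helper: consensus matrix for one value of k.
consensus_matrix <- function(d, k, reps = 50, pItem = 0.8) {
  n <- ncol(d)
  together <- matrix(0, n, n)  # times a pair was assigned to the same cluster
  sampled <- matrix(0, n, n)   # times a pair was drawn in the same subsample
  for (r in seq_len(reps)) {
    idx <- sample(n, floor(pItem * n))
    hc <- hclust(dist(t(d[, idx])), method = "ward.D2")
    cl <- cutree(hc, k = k)
    sampled[idx, idx] <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  cm <- together / pmax(sampled, 1)  # avoid division by zero for unsampled pairs
  dimnames(cm) <- list(colnames(d), colnames(d))
  cm
}

cm <- consensus_matrix(d, k = 2)

# Final clustering on 1 - consensus, treated as a distance.
final <- cutree(hclust(as.dist(1 - cm), method = "ward.D2"), k = 2)
table(final)
```

Entries of `cm` close to 1 mark pairs of samples that are co-clustered in almost every resampling run, which is the notion of stability the function is built around.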
Performs simple batch correction using mean centering.
correct_batch(data_list, batch, method = "center")
data_list |
List of data matrices. |
batch |
Vector of batch labels for each sample. |
method |
Method: "center" (mean centering) or "combat_simple". |
List of batch-corrected matrices.
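Mean centering per batch can be sketched as follows. This is a minimal illustration, assuming features x samples matrices and a hypothetical helper `center_by_batch`; whether the package restores the overall feature mean afterwards is not documented here, so this sketch does not.

```r
# Hypothetical helper: subtract each feature's within-batch mean.
center_by_batch <- function(mat, batch) {
  # mat: features x samples; batch: one label per column.
  stopifnot(length(batch) == ncol(mat))
  out <- mat
  for (b in unique(batch)) {
    cols <- batch == b
    block <- mat[, cols, drop = FALSE]
    # A vector of row means recycles down each column, so every sample in the
    # batch has the batch-specific feature means removed.
    out[, cols] <- block - rowMeans(block)
  }
  out
}

set.seed(1)
batch <- rep(c("A", "B"), each = 5)
# Toy data with an additive shift in batch B.
data_list <- list(
  rna = matrix(rnorm(40), nrow = 4) + rep(c(0, 2), each = 20),
  protein = matrix(rnorm(30), nrow = 3) + rep(c(0, -1), each = 15)
)
corrected <- lapply(data_list, center_by_batch, batch = batch)
```

After correction, every feature has mean zero within each batch, so additive batch shifts are removed; differences in variance between batches are left untouched.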
Performs Consensus PCA on multiple data types.
cpca(data_list, ncomp = 2)
data_list |
List of data matrices (features x samples). |
ncomp |
Number of components (default: 2). |
List with scores, loadings, eigenvalues.
This function calculates the coefficient of variation (CV) for a given numeric vector.
CV(V)
V |
A numeric vector. |
The coefficient of variation (CV) for the input vector.
Zaoqu Liu; E-mail: [email protected]
This function calculates the coefficient of variation (CV) for each column in a data frame.
CV.df(df)
df |
A data frame with row samples and column features. |
A sorted vector of CV values for each column in the data frame, in decreasing order.
Zaoqu Liu; E-mail: [email protected]
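As a rough illustration, the coefficient of variation is usually defined as the standard deviation divided by the mean. The sketch below assumes that definition (the package may scale or handle missing values differently); `cv_sketch` and the example data are hypothetical.

```r
# Hypothetical helper: CV as sd / mean, ignoring missing values
cv_sketch <- function(v) sd(v, na.rm = TRUE) / mean(v, na.rm = TRUE)

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
cv_x <- cv_sketch(x)

# Column-wise CV for a samples x features data frame, sorted decreasingly
df <- data.frame(geneA = c(1, 2, 3, 4), geneB = c(10, 10, 11, 10))
cv_cols <- sort(sapply(df, cv_sketch), decreasing = TRUE)
```

Features with a high CV vary strongly relative to their mean, which is why a sorted vector is convenient for picking the most variable ones.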
This function performs feature selection using a logistic regression model and bootstrapping for each cluster in the dataset. For each cluster, the function evaluates which features are significantly associated with the cluster based on a logistic regression model with bootstrap sampling. The output includes a count of significant features for each cluster based on the p-value threshold.
FeatureSelectionWithBootstrap(data, p.no_bootstrap = 0.01, p.bootstrap = 0.05, num.iteration = 1000, nCores = parallel::detectCores() - 3)
data |
A data frame where the first column is the cluster variable (categorical), and the other columns are feature values. |
p.no_bootstrap |
Numeric. The p-value threshold for the features to be selected based on the original data (default: 0.01). |
p.bootstrap |
Numeric. The p-value threshold for the features to be selected based on bootstrapped samples (default: 0.05). |
num.iteration |
Integer. The number of bootstrap iterations (default: 1000). |
nCores |
Integer. The number of CPU cores to use for parallel processing (default: automatically set to all cores minus 3). |
A list of data frames, each containing the features selected for each cluster and the count of significant features based on bootstrapping.
Zaoqu Liu; Email: [email protected]
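The bootstrap idea can be sketched for a single cluster and a single feature: refit a one-vs-rest logistic regression on resampled data and record how often the feature is significant. This is a simplified, sequential illustration with hypothetical names (`boot_fraction`, the simulated `dat`); the package's actual selection rule and parallelization are not reproduced here.

```r
set.seed(123)
n <- 60
cluster <- rep(c("C1", "C2"), each = n / 2)
# First column is the cluster label, remaining columns are features
dat <- data.frame(
  Cluster = cluster,
  signal = rnorm(n) + 1.5 * (cluster == "C1"),  # associated with C1
  noise = rnorm(n)                              # unrelated to the clusters
)

# Hypothetical helper: fraction of bootstrap fits in which `feature`
# is significant for membership in `target` (one-vs-rest)
boot_fraction <- function(dat, target, feature, n_boot = 50, p_cut = 0.05) {
  hits <- 0
  for (b in seq_len(n_boot)) {
    d <- dat[sample(nrow(dat), replace = TRUE), ]
    y <- as.integer(d$Cluster == target)
    fit <- glm(y ~ d[[feature]], family = binomial())
    p <- coef(summary(fit))[2, "Pr(>|z|)"]
    if (!is.na(p) && p < p_cut) hits <- hits + 1
  }
  hits / n_boot
}

frac <- sapply(c("signal", "noise"),
               function(f) boot_fraction(dat, "C1", f))
```

A feature that is significant in most resamples is a more robust marker than one that passes the threshold only on the original data.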
Keeps features with highest MAD values.
filter_by_mad(data_list, top_n = 5000)
data_list |
List of data matrices. |
top_n |
Number of top features to keep. |
List of filtered matrices.
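For a single matrix, MAD-based filtering amounts to ranking features by their median absolute deviation and keeping the top ones. A minimal sketch, assuming features in rows; `keep_top_mad` is a hypothetical helper.

```r
# Hypothetical helper: keep the top_n rows with the largest MAD
keep_top_mad <- function(mat, top_n) {
  mads <- apply(mat, 1, mad, na.rm = TRUE)
  keep <- order(mads, decreasing = TRUE)[seq_len(min(top_n, nrow(mat)))]
  mat[keep, , drop = FALSE]
}

set.seed(2)
# 20 features x 6 samples; row i is scaled by i, so later rows vary more
mat <- matrix(rnorm(120), nrow = 20) * (1:20)
rownames(mat) <- paste0("f", 1:20)
top <- keep_top_mad(mat, top_n = 5)
```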
Removes features with low variance across samples.
filter_low_variance(data_list, min_var = 0.01, top_n = NULL, top_pct = NULL)
data_list |
List of data matrices (features x samples). |
min_var |
Minimum variance threshold (default: 0.01). |
top_n |
Keep top N features by variance (optional). |
top_pct |
Keep top percentage of features (optional, 0-1). |
List of filtered matrices.
Selects optimal feature combinations for multi-modality clustering analysis.
Find.OptClusterFeatures(data_layers, feature_subset_sizes, try_num_clusters = 2:6, n_runs = 5, n_fold = 5)
data_layers |
A named list of matrices (features x samples). |
feature_subset_sizes |
A list of sequences for feature subset sizes. |
try_num_clusters |
Integer vector. Cluster numbers to test (default: 2:6). |
n_runs |
Integer. NMF iterations (default: 5). |
n_fold |
Integer. Cross-validation folds (default: 5). |
A list with optimal_combination and all_results.
Zaoqu Liu; Email: [email protected]
The gene_sets data object is a unified collection of gene sets from three sources: curated pathways (c2.cp.v2022.1.Hs), Gene Ontology terms (c5.go.v2022.1.Hs), and hallmark gene sets (h.all.v2022.1.Hs). All sets are based on version 2022.1 and represent biological processes, pathways, and well-defined biological states. The combined object can be used for enrichment analysis and functional exploration.
gene_sets
data.frame
Extracts cluster assignments for a specific K.
get_consensus_class(consensus_fit, k)
consensus_fit |
Result from consensus_cluster(). |
k |
Number of clusters. |
Data frame with sample IDs and cluster assignments.
This function extracts binary cluster assignments from multiple clustering results.
get.binary.clusters(res)
res |
A list containing clustering results |
A data frame where each row represents a binary encoding of cluster assignments across different methods.
Zaoqu Liu; Email: [email protected]
Extract cluster assignments from Consensus Clustering results for a specific number of clusters.
get.class(cc.res, k)
cc.res |
Consensus clustering results from ConsensusClusterPlus. |
k |
Integer. The number of clusters to extract. |
A data frame with the following columns:
- ID: The sample identifier.
- Cluster: The assigned cluster label, prefixed by 'C'.
Zaoqu Liu; Email: [email protected]
data <- mtcars
cc_res <- RunCC(data)
clu <- get.class(cc_res, 2)
clu
This function calculates the Jaccard distance or similarity for a binary matrix. It is typically used to evaluate the similarity or dissimilarity between columns of a binary matrix.
get.Jaccard.Distance(data, dissimilarity = TRUE)
data |
A binary matrix where rows represent features and columns represent samples. |
dissimilarity |
Logical. If TRUE, returns the Jaccard distance; if FALSE, returns the Jaccard similarity (default: TRUE). |
The function computes the Jaccard distance (or similarity) between each pair of columns in the input binary matrix. The Jaccard distance is calculated as 1 minus the Jaccard similarity.
A matrix containing the Jaccard distance or similarity between each pair of columns in the input matrix.
Zaoqu Liu; Email: [email protected]
data <- matrix(sample(0:1, 1500, replace = TRUE), nrow = 30, ncol = 50)
jaccard_dist <- get.Jaccard.Distance(as.data.frame(data), dissimilarity = TRUE)
jaccard_dist
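For reference, the Jaccard similarity of two binary columns is the size of their intersection divided by the size of their union, and the distance is 1 minus that value. The following base-R sketch computes it for all column pairs; `jaccard_sketch` is a hypothetical helper written for illustration, not the package's implementation.

```r
# Hypothetical helper: pairwise Jaccard distance/similarity between columns
jaccard_sketch <- function(m, dissimilarity = TRUE) {
  m <- (as.matrix(m) != 0) * 1             # force a 0/1 numeric matrix
  inter <- crossprod(m)                     # |A and B| for each column pair
  sizes <- colSums(m)
  union <- outer(sizes, sizes, "+") - inter # |A or B|
  sim <- inter / union                      # NaN if both columns are all zero
  if (dissimilarity) 1 - sim else sim
}

m <- cbind(a = c(1, 1, 0, 0), b = c(1, 0, 1, 0), c = c(1, 1, 0, 0))
d <- jaccard_sketch(m)
```

Columns `a` and `c` are identical, so their distance is 0; `a` and `b` share one of three occupied rows, giving a distance of 2/3.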
Imputes or removes missing values in multi-omics data.
handle_missing(data_list, method = "median", threshold = 0.5, k = 5)
data_list |
List of data matrices. |
method |
Method: "remove_features", "remove_samples", "mean", "median", "knn". |
threshold |
For remove methods: maximum proportion of NA allowed. |
k |
For KNN imputation: number of neighbors. |
List of processed matrices.
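Median imputation can be sketched for one matrix as follows. The sketch assumes features in rows and replaces each missing value with the median of the observed values for that feature; `impute_median` is a hypothetical helper and the package may impute along a different axis.

```r
# Hypothetical helper: replace NAs with the row (feature) median
impute_median <- function(mat) {
  for (i in seq_len(nrow(mat))) {
    miss <- is.na(mat[i, ])
    if (any(miss)) mat[i, miss] <- median(mat[i, ], na.rm = TRUE)
  }
  mat
}

mat <- matrix(c(1, NA, 3,
                4, 5, NA), nrow = 2, byrow = TRUE)
filled <- impute_median(mat)
```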
Faster version of iClusterBayes that uses a variational approximation.
icluster_bayes_fast(data_list, k, max_iter = 100, tol = 1e-04, verbose = FALSE)
data_list |
List of data matrices. |
k |
Number of clusters. |
max_iter |
Maximum iterations (default: 100). |
tol |
Convergence tolerance (default: 1e-4). |
verbose |
Print progress. |
List with cluster assignments and parameters.
Installs optional packages for extended functionality.
init(classifier = TRUE, survival = FALSE, gsva = FALSE)
classifier |
Logical. Install classifier dependencies (default: TRUE). |
survival |
Logical. Install survival analysis dependencies (default: FALSE). |
gsva |
Logical. Install GSVA dependencies (default: FALSE). |
Zaoqu Liu; Email: [email protected]
Main IntNMF algorithm using non-negative alternating least squares.
intnmf_cluster(dat, k, maxiter = 200, st_count = 20, n_ini = 30, ini_nndsvd = TRUE, seed = TRUE, wt = NULL)
dat |
List of data matrices (samples x features). |
k |
Number of clusters. |
maxiter |
Maximum iterations (default: 200). |
st_count |
Stability count for convergence (default: 20). |
n_ini |
Number of initializations (default: 30). |
ini_nndsvd |
Use NNDSVD initialization (default: TRUE). |
seed |
Use random seed (default: TRUE). |
wt |
Weights for each data set. |
List with W, H matrices, clusters, and convergence info.
Selects optimal number of clusters using cross-validation.
intnmf_opt_k(dat, n_runs = 30, n_fold = 5, k_range = 2:8, maxiter = 100, st_count = 10, wt = NULL, verbose = TRUE)
dat |
List of data matrices. |
n_runs |
Number of runs (default: 30). |
n_fold |
Number of CV folds (default: 5). |
k_range |
Range of K values to test (default: 2:8). |
maxiter |
Maximum iterations (default: 100). |
st_count |
Stability count (default: 10). |
wt |
Weights for datasets. |
verbose |
Print progress. |
Matrix of CPI values for each K and run.
Performs low-rank approximation clustering.
lracluster(data, types, dimension = 2, names = NULL)
data |
List of data matrices. |
types |
Character vector of data types ("binary", "gaussian", "poisson"). |
dimension |
Target dimension/rank (default: 2). |
names |
Names for each dataset. |
List with coordinate matrix and potential (quality measure).
This function calculates the median absolute deviation (MAD) for each column in a data frame.
MAD.df(df)
df |
A data frame with row samples and column features. |
A sorted vector of MAD values for each column in the data frame, in decreasing order.
Zaoqu Liu; E-mail: [email protected]
Performs MCIA on multiple data matrices.
mcia(data_list, ncomp = 2)
data_list |
List of data matrices (features x samples). |
ncomp |
Number of components (default: 2). |
List with global scores, block scores, loadings, and eigenvalues.
This function calculates the mean value for each column in a data frame.
Mean.df(df)
df |
A data frame with row samples and column features. |
A vector of mean values for each column in the data frame.
Zaoqu Liu; E-mail: [email protected]
This function calculates the median value for each column in a data frame.
Median.df(df)
df |
A data frame with row samples and column features. |
A vector of median values for each column in the data frame.
Zaoqu Liu; E-mail: [email protected]
This function performs min-max normalization on a numeric vector.
minmax(x)
x |
A numeric vector. |
A normalized numeric vector with values between 0 and 1.
Zaoqu Liu; E-mail: [email protected]
This function performs min-max normalization on each column in a data frame.
minmax.df(data)
data |
A data frame with row samples and column features. |
A data frame with normalized values between 0 and 1 for each column.
Zaoqu Liu; E-mail: [email protected]
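Min-max normalization maps a vector to the interval [0, 1] via (x - min) / (max - min). A minimal sketch under that standard definition, with hypothetical names (`minmax_sketch`, `df`, `scaled`):

```r
# Hypothetical helper: rescale a numeric vector to [0, 1]
minmax_sketch <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])  # NaN when all values are equal
}

# Column-wise application to a samples x features data frame
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 40))
scaled <- as.data.frame(lapply(df, minmax_sketch))
```

Constant columns have a zero range, so a production version would need to decide how to treat them.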
Constructs a single affinity graph from multi-omics data.
nemo_affinity_graph(raw_data, k = NA)
raw_data |
List of data matrices (features x samples). |
k |
Number of neighbors. Can be a number, vector, or NA. |
Affinity matrix measuring similarity across all omics.
Performs multi-omic clustering using the NEMO algorithm.
nemo_clustering(omics_list, num_clusters = NULL, num_neighbors = NA)
omics_list |
List of data matrices (features x samples). |
num_clusters |
Number of clusters (NULL for automatic estimation). |
num_neighbors |
Number of neighbors (NA for automatic selection). |
Named vector of cluster assignments.
Uses eigengap heuristic to estimate optimal cluster number.
nemo_num_clusters(W, NUMC = 2:15)
W |
Affinity matrix. |
NUMC |
Possible cluster numbers to evaluate (default: 2:15). |
Estimated number of clusters.
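A generic form of the eigengap heuristic picks the k with the largest gap between consecutive eigenvalues of the normalized graph Laplacian. The sketch below shows that textbook variant only; NEMO's own scoring may differ, and `eigengap_k` is a hypothetical helper that assumes every sample has positive total affinity.

```r
# Hypothetical helper: textbook eigengap heuristic on an affinity matrix
eigengap_k <- function(W, candidates = 2:5) {
  W <- (W + t(W)) / 2                       # enforce symmetry
  d_inv_sqrt <- diag(1 / sqrt(rowSums(W)))  # assumes positive degrees
  L <- diag(nrow(W)) - d_inv_sqrt %*% W %*% d_inv_sqrt
  ev <- sort(eigen(L, symmetric = TRUE, only.values = TRUE)$values)
  gaps <- diff(ev)                          # gaps[k] = ev[k + 1] - ev[k]
  candidates[which.max(gaps[candidates])]
}

# Three blocks of four samples: strong affinity within, weak between
block <- rep(1:3, each = 4)
W <- outer(block, block, "==") * 0.99 + 0.01
k_hat <- eigengap_k(W)
```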
Functions for parallel execution of multi-omics analysis.
Sets up parallel processing using the future framework.
setup_parallel(workers = NULL, strategy = "multisession", verbose = TRUE)
workers |
Number of worker processes (default: detectCores() - 1). |
strategy |
Parallel strategy: "multisession", "multicore", or "sequential". |
verbose |
Print configuration info (default: TRUE). |
- "multisession": Works on all platforms (Windows, macOS, Linux)
- "multicore": Faster on Unix-like systems, not available on Windows
- "sequential": No parallelism (useful for debugging)
Invisibly returns the previous plan.
Zaoqu Liu; Email: [email protected]
Performs bootstrap-based feature selection in parallel.
parallel_bootstrap_features(data, n_bootstrap = 1000, n_workers = NULL, p_threshold = 0.05, progress = TRUE)
data |
Data frame with cluster variable in first column. |
n_bootstrap |
Number of bootstrap iterations (default: 1000). |
n_workers |
Number of workers (default: auto). |
p_threshold |
P-value threshold (default: 0.05). |
progress |
Show progress (default: TRUE). |
List of significant features per cluster.
Performs consensus clustering with parallel resampling.
parallel_consensus_cluster(data, max_k = 6, n_reps = 1000, n_workers = NULL, prop_sample = 0.8, cluster_method = "hclust", verbose = TRUE)
data |
Data matrix (features x samples). |
max_k |
Maximum number of clusters (default: 6). |
n_reps |
Number of resampling iterations (default: 1000). |
n_workers |
Number of workers (default: auto). |
prop_sample |
Proportion of samples to include (default: 0.8). |
cluster_method |
Base clustering method (default: "hclust"). |
verbose |
Print progress (default: TRUE). |
List with consensus matrices and cluster results.
This function performs pathway differential expression analysis based on clustering results and pathway activity scores derived from ssMwwGST.
PathDEA(Cluster_data, ssMwwGST_results, dea_FDR_threshold = 0.001, dea_gap_threshold = 1.5)
Cluster_data |
A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. |
ssMwwGST_results |
A list of results from ssMwwGST, including NES (Normalized Enrichment Scores). |
dea_FDR_threshold |
Numeric. The FDR threshold to use for filtering significant pathways. Default is 0.001. |
dea_gap_threshold |
Numeric. The median gap threshold to use for filtering significant pathways. Default is 1.5. |
The function operates as follows:
1. Extracts Normalized Enrichment Scores (NES) from the ssMwwGST results.
2. Performs Wilcoxon rank-sum tests to compare pathway activity between clusters for each pathway.
3. Calculates median and mean differences in pathway activity between clusters.
4. Adjusts p-values using the Benjamini-Hochberg method to control the false discovery rate (FDR).
A list containing:
- dea_path: A list of data frames, each containing the differential expression analysis results for each cluster.
- dea_path2: A list of data frames containing the filtered differential expression analysis results for each cluster based on FDR and median gap thresholds.
- NES: A data frame of Normalized Enrichment Scores for each gene set and each sample.
- Cluster: A data frame of sample IDs and their corresponding cluster assignments.
Zaoqu Liu; Email: [email protected]
# Example usage:
Cluster_data <- data.frame(Sample = paste0("Sample", 1:10), Cluster = rep(1:2, each = 5))
ssMwwGST_results <- list(NES = matrix(rnorm(100),
  nrow = 10, ncol = 10,
  dimnames = list(paste0("Pathway", 1:10), paste0("Sample", 1:10))
))
result <- PathDEA(Cluster_data, ssMwwGST_results)
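The per-pathway testing procedure described for PathDEA can be approximated in base R for one cluster against the rest. This is a sketch under assumptions: the object names (`nes`, `res`, `sig`) are hypothetical, and applying the gap threshold to the signed median difference is a guess rather than the package's documented rule.

```r
set.seed(7)
# Pathways x samples matrix of enrichment scores
nes <- matrix(rnorm(100), nrow = 10, ncol = 10,
              dimnames = list(paste0("Pathway", 1:10), paste0("Sample", 1:10)))
cluster <- rep(1:2, each = 5)
in_c1 <- cluster == 1

# Wilcoxon rank-sum test and median gap for cluster 1 versus the rest
res <- data.frame(
  Pathway = rownames(nes),
  p = apply(nes, 1, function(x) wilcox.test(x[in_c1], x[!in_c1])$p.value),
  median_gap = apply(nes, 1, function(x) median(x[in_c1]) - median(x[!in_c1]))
)

# Benjamini-Hochberg adjustment, then filtering (assumed signed gap)
res$FDR <- p.adjust(res$p, method = "BH")
sig <- res[res$FDR < 0.001 & res$median_gap > 1.5, ]
```

With only five samples per cluster, the smallest attainable exact p-value is 2/252, so no pathway can pass an FDR threshold of 0.001 in this toy example; realistic cohorts are larger.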
Performs perturbation-based clustering to find optimal k.
perturbation_clustering(data, kMin = 2, kMax = 5, k = NULL, n_iter = 50, clustering_method = "kmeans", verbose = FALSE)
data |
Data matrix (samples x features). |
kMin |
Minimum number of clusters (default: 2). |
kMax |
Maximum number of clusters (default: 5). |
k |
Fixed number of clusters (optional). |
n_iter |
Number of perturbation iterations (default: 50). |
clustering_method |
Clustering method: "kmeans", "hclust", "pam". |
verbose |
Print progress. |
List with k, cluster, origS, pertS.
Visualizes agreement between different clustering algorithms.
plot_algorithm_comparison(results, title = "Algorithm Agreement (ARI)")
results |
List of clustering results from run_multiple_algorithms(). |
title |
Plot title. |
A ggplot object or base R plot.
Visualizes PAC, CHI, and silhouette scores across different K values.
plot_cluster_quality(pac_values, chi_values = NULL, title = "Cluster Quality Assessment")
pac_values |
PAC values from CalPAC(). |
chi_values |
CHI values from CalCHI() (optional). |
title |
Plot title. |
A ggplot object or base R plot.
Visualizes the consensus matrix from consensus clustering.
plot_consensus_heatmap(consensus_matrix, clusters = NULL, title = "Consensus Matrix", colors = NULL)
consensus_matrix |
Consensus matrix (from RunCC or consensus_cluster). |
clusters |
Vector of cluster assignments for annotation. |
title |
Plot title. |
colors |
Color palette for heatmap. |
Invisibly returns the plot object.
Creates a silhouette plot for cluster quality evaluation.
plot_silhouette(clusters, dist_matrix, title = "Silhouette Analysis", colors = NULL)
clusters |
Vector of cluster assignments. |
dist_matrix |
Distance matrix. |
title |
Plot title. |
colors |
Cluster colors. |
A ggplot object or base R plot.
Visualizes survival differences between clusters.
plot_survival(time, event, clusters, title = "Kaplan-Meier Survival Curves", colors = NULL, conf_int = TRUE, risk_table = FALSE)
time |
Survival time vector. |
event |
Event indicator (1 = event, 0 = censored). |
clusters |
Vector of cluster assignments. |
title |
Plot title. |
colors |
Cluster colors. |
conf_int |
Show confidence intervals (default: TRUE). |
risk_table |
Show risk table (default: FALSE). |
A ggplot object (if survminer available) or base R plot.
Creates a UMAP plot colored by cluster assignments.
plot_umap(umap_coords, clusters, title = "UMAP Visualization", point_size = 2, colors = NULL, show_legend = TRUE)
umap_coords |
UMAP coordinates from compute_umap(). |
clusters |
Vector of cluster assignments or data frame with Cluster column. |
title |
Plot title (default: "UMAP Visualization"). |
point_size |
Point size (default: 2). |
colors |
Custom color palette (optional). |
show_legend |
Show legend (default: TRUE). |
A ggplot object if ggplot2 is available, otherwise base R plot.
Functions for preprocessing multi-omics data before integration.
Applies normalization to each data matrix in the list.
normalize_omics(data_list, method = "zscore", by_feature = TRUE)
data_list |
List of data matrices (features x samples). |
method |
Normalization method: "zscore", "minmax", "quantile", "log2", "vst". |
by_feature |
Normalize by feature (row) or sample (column). |
List of normalized matrices.
Zaoqu Liu
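Feature-wise z-score normalization can be sketched for one features x samples matrix by scaling each row to mean 0 and standard deviation 1. `zscore_rows` is a hypothetical helper shown for illustration; it is not the package's code and does not cover the other listed methods.

```r
# Hypothetical helper: z-score each feature (row) across samples.
# scale() works on columns, so transpose before and after.
zscore_rows <- function(mat) t(scale(t(mat)))

set.seed(3)
mat <- matrix(rnorm(30, mean = 5, sd = 2), nrow = 5, ncol = 6)
z <- zscore_rows(mat)
```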
Generates quality control summary for multi-omics data.
qc_summary(data_list)
data_list |
List of data matrices. |
Data frame with QC metrics for each dataset.
Performs RGCCA for multi-block data analysis.
rgcca(A, C = 1 - diag(length(A)), tau = rep(1, length(A)), ncomp = rep(1, length(A)), scheme = "centroid", scale = TRUE, init = "svd", bias = TRUE, tol = 1e-08, verbose = FALSE)
A |
List of data blocks (samples x variables). |
C |
Design matrix (default: complete design). |
tau |
Shrinkage parameters (default: 1 for each block). |
ncomp |
Number of components per block (default: 1). |
scheme |
Scheme: "horst", "factorial", or "centroid". |
scale |
Scale blocks (default: TRUE). |
init |
Initialization: "svd" or "random". |
bias |
Biased estimator (default: TRUE). |
tol |
Convergence tolerance. |
verbose |
Print progress. |
List with Y, a, astar, C, tau, scheme, ncomp, crit, AVE.
Performs BCC-based clustering on multi-omics data.
run_bcc(data, n_clusters, method = "fast", n_iter = 1000)
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
method |
Method: "full" for full Bayesian, "fast" for EM-like. |
n_iter |
MCMC iterations (for full method). |
Data frame with sample IDs and cluster assignments.
Performs CIMLR-based clustering on multi-omics data.
run_cimlr(data, n_clusters, n_kernels = 10)
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
n_kernels |
Number of kernels per data type (default: 10). |
Data frame with sample IDs and cluster assignments.
Performs CPCA-based clustering on multi-omics data.
run_cpca(data, n_clusters, ncomp = NULL, cluster_method = "ward.D2")
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
ncomp |
Number of components. |
cluster_method |
Clustering method. |
Data frame with cluster assignments.
Performs iClusterBayes-based clustering on multi-omics data.
run_iclusterbayes(data, n_clusters, method = "fast", n_burnin = 500, n_sample = 1000)
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
method |
Method: "full" for full Bayesian, "fast" for variational. |
n_burnin |
Burn-in iterations (for full method). |
n_sample |
Sampling iterations (for full method). |
Data frame with sample IDs and cluster assignments.
Unified function to run any multi-omics clustering algorithm.
run_integration(data, algorithm, n_clusters, ...)
data |
List of matrices (features x samples). All matrices must have the same samples (columns). |
algorithm |
Clustering algorithm name. Use list_clustering_algorithms() to see available options. |
n_clusters |
Number of clusters. |
... |
Additional arguments passed to specific algorithm. |
Data frame with sample IDs, cluster assignments, and cluster labels.
## Not run:
# Create example data
data1 <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)
data2 <- matrix(rnorm(80 * 50), nrow = 80, ncol = 50)
colnames(data1) <- colnames(data2) <- paste0("Sample", 1:50)
data_list <- list(GE = data1, ME = data2)

# Run SNF clustering
result <- run_integration(data_list, "SNF", n_clusters = 3)
## End(Not run)
High-level function for IntNMF clustering on multi-omics data.
run_intnmf(data, n_clusters, maxiter = 200, n_init = 30, weight = NULL)
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
maxiter |
Maximum iterations (default: 200). |
n_init |
Number of initializations (default: 30). |
weight |
Weights for each data type (default: equal weights). |
Data frame with sample IDs and cluster assignments.
Ensemble clustering by combining single-omics results.
run_late_fusion(data, n_clusters, single_method = "kmeans", consensus_method = "similarity")
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
single_method |
Method for single-omics clustering ("kmeans", "hclust", "pam"). |
consensus_method |
Method for combining results ("voting", "similarity"). |
Data frame with cluster assignments.
Performs LRAcluster-based clustering on multi-omics data.
run_lracluster(data, n_clusters, types = NULL, cluster_method = "ward.D2")
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
types |
Character vector of data types. |
cluster_method |
Hierarchical clustering method (default: "ward.D2"). |
Data frame with sample IDs and cluster assignments.
Performs MCIA-based clustering on multi-omics data.
run_mcia(data, n_clusters, ncomp = NULL, cluster_method = "ward.D2")
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
ncomp |
Number of MCIA components (default: auto). |
cluster_method |
Clustering method (default: "ward.D2"). |
Data frame with sample IDs and cluster assignments.
Performs factor analysis based clustering on multi-omics data. Uses multi-view factor analysis to extract latent factors, then clusters samples based on these factors. For full MOFA, use MOFA2 package.
run_mofa(data, n_clusters, n_factors = 10, cluster_method = "ward.D2")
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
n_factors |
Number of factors (default: 10). |
cluster_method |
Clustering method (default: "ward.D2"). |
Data frame with sample IDs and cluster assignments.
Runs multiple algorithms and returns combined results.
run_multiple_algorithms(data, algorithms = NULL, n_clusters, ...)
data |
List of data matrices. |
algorithms |
Character vector of algorithm names. |
n_clusters |
Number of clusters. |
... |
Additional arguments. |
List of results for each algorithm.
Performs NEMO-based clustering on multi-omics data.
run_nemo(data, n_clusters = NULL, n_neighbors = NA)
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters (NULL for automatic). |
n_neighbors |
Number of neighbors (NA for automatic). |
Data frame with sample IDs and cluster assignments.
Executes multiple clustering algorithms in parallel.
run_parallel_algorithms(data, algorithms = NULL, n_clusters, n_workers = NULL, ...)
data |
List of data matrices (features x samples). |
algorithms |
Character vector of algorithm names. |
n_clusters |
Number of clusters. |
n_workers |
Number of parallel workers (default: auto). |
... |
Additional arguments passed to algorithms. |
List of results for each algorithm.
Performs PINSPlus-based clustering on multi-omics data.
run_pinsplus(data, n_clusters = NULL, kMin = 2, kMax = 5)
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters (optional). |
kMin |
Minimum clusters (default: 2). |
kMax |
Maximum clusters (default: 5). |
Data frame with sample IDs and cluster assignments.
Performs RGCCA-based clustering on multi-omics data.
run_rgcca(data, n_clusters, ncomp = NULL, tau = NULL, scheme = "centroid", cluster_method = "ward.D2")
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
ncomp |
Number of components (default: 1 per block). |
tau |
Shrinkage parameters. |
scheme |
Scheme type (default: "centroid"). |
cluster_method |
Hierarchical clustering method (default: "ward.D2"). |
Data frame with sample IDs and cluster assignments.
Performs SGCCA-based clustering on multi-omics data.
run_sgcca(data, n_clusters, ncomp = NULL, c1 = NULL, scheme = "centroid", cluster_method = "ward.D2")
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
ncomp |
Number of components. |
c1 |
L1 penalty parameters. |
scheme |
Scheme type. |
cluster_method |
Clustering method. |
Data frame with cluster assignments.
Performs SNF-based clustering on multi-omics data.
run_snf(data, n_clusters, n_neighbors = 20, sigma = 0.5, n_iterations = 20)
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
n_neighbors |
Number of neighbors for affinity matrix (default: 20). |
sigma |
Variance for local model (default: 0.5). |
n_iterations |
Number of SNF iterations (default: 20). |
Data frame with sample IDs, cluster assignments, and cluster labels.
SNF with custom weights for each data type.
run_wsnf(data, n_clusters, weights = NULL, n_neighbors = 20, sigma = 0.5, n_iterations = 20)
data |
List of matrices (features x samples). |
n_clusters |
Number of clusters. |
weights |
Numeric vector of weights for each data type. |
n_neighbors |
Number of neighbors (default: 20). |
sigma |
Kernel bandwidth (default: 0.5). |
n_iterations |
SNF iterations (default: 20). |
Data frame with cluster assignments.
Provides a unified interface for all clustering algorithms in MOFSR.
Returns available multi-omics clustering algorithms.
list_clustering_algorithms()
Character vector of algorithm names.
Zaoqu Liu
Performs multi-omics clustering using BCC algorithm.
RunBCC(data, N.clust, method = "fast", max.iterations = 1000)
data |
A list of matrices (features x samples). |
N.clust |
Integer. Number of clusters. |
method |
Character. "fast" or "full" (default: "fast"). |
max.iterations |
Integer. Maximum iterations (default: 1000). |
A data frame with Sample, Cluster, and Cluster2 columns.
Zaoqu Liu; Email: [email protected]
Lock EF, Dunson DB. Bioinformatics. 2013.
Performs consensus clustering on the given data.
RunCC(data, maxK = 6, reps = 1000, pItem = 0.8, pFeature = 1, clusterAlg = "hc", distance = "euclidean", innerLinkage = "ward.D2", finalLinkage = "ward.D2", seed = 1234, verbose = FALSE)
data |
A numeric matrix (features x samples). |
maxK |
Maximum number of clusters (default: 6). |
reps |
Number of subsamples (default: 1000). |
pItem |
Proportion of items to sample (default: 0.8). |
pFeature |
Proportion of features to sample (default: 1). |
clusterAlg |
Clustering algorithm: "hc", "km", or "pam" (default: "hc"). |
distance |
Distance metric (default: "euclidean"). |
innerLinkage |
Linkage method for HC (default: "ward.D2"). |
finalLinkage |
Linkage for final clustering (default: "ward.D2"). |
seed |
Random seed (default: 1234). |
verbose |
Print progress (default: FALSE). |
A list containing consensus clustering results.
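The resampling idea behind consensus clustering can be sketched in base R as follows. This is a minimal illustration, not RunCC's actual implementation: `consensus_matrix` is a hypothetical helper, only hierarchical clustering with Euclidean distance and "ward.D2" linkage is shown, and feature subsampling (`pFeature`) is omitted.

```r
# Hypothetical helper (not part of MOFSR): fraction of resampling runs in which
# two samples fall into the same cluster, out of the runs in which both were drawn.
consensus_matrix <- function(x, k, reps = 50, p_item = 0.8, seed = 1234) {
  set.seed(seed)
  n <- ncol(x)                          # samples are columns (features x samples)
  together <- matrix(0, n, n)           # times a pair was co-clustered
  sampled  <- matrix(0, n, n)           # times a pair was drawn together
  for (r in seq_len(reps)) {
    idx <- sort(sample(n, floor(p_item * n)))
    d   <- dist(t(x[, idx, drop = FALSE]), method = "euclidean")
    cl  <- cutree(hclust(d, method = "ward.D2"), k = k)
    sampled[idx, idx]  <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  cons <- together / pmax(sampled, 1)   # consensus index in [0, 1]
  dimnames(cons) <- list(colnames(x), colnames(x))
  cons
}

# Toy data: two groups of 10 samples, shifted along every feature
set.seed(1)
x <- cbind(matrix(rnorm(200, mean = 0), 20, 10),
           matrix(rnorm(200, mean = 4), 20, 10))
colnames(x) <- paste0("Sample", 1:20)

cons  <- consensus_matrix(x, k = 2)
# Final partition: cluster the consensus matrix itself
final <- cutree(hclust(as.dist(1 - cons), method = "ward.D2"), k = 2)
```

In practice the consensus matrix is computed for every k from 2 to `maxK`, and the k with the cleanest block structure is chosen.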
Zaoqu Liu; Email: [email protected]
Performs multi-omics clustering using CIMLR algorithm.
RunCIMLR(data, N.clust, n.kernels = 10)
data |
A list of matrices (features x samples). |
N.clust |
Integer. Number of clusters. |
n.kernels |
Integer. Number of kernels per data type (default: 10). |
A data frame with Sample, Cluster, and Cluster2 columns.
Zaoqu Liu; Email: [email protected]
Ramazzotti D, et al. Nature Communications. 2018.
This function runs different classification models based on user input to predict cluster assignments for test data.
RunClassifier( algorithm, data.test, data.train, cluster.data, cluster.markers, scale = TRUE )
algorithm |
A character string indicating the classifier to use. Supported algorithms include: "Adaboost", "DT", "Enet", "Enrichment", "GBDT", "LASSO", "LDA", "NBayes", "NNet", "PCA", "Ridge", "StepLR", "SVD", "SVM", "XGBoost", "kNN", "RF". |
data.test |
A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
data.train |
A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
cluster.data |
A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
cluster.markers |
A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
scale |
A logical value indicating whether to scale the test data. Default is TRUE. |
The function dynamically selects and runs a classification model based on user input. The supported classifiers include a range of machine learning models such as Random Forest, kNN, PCA, SVM, LASSO, Ridge, and others.
A data frame containing the prediction results based on the selected algorithm.
Zaoqu Liu; Email: [email protected]
# Example usage:
data.test <- matrix(rnorm(1000), nrow = 100, ncol = 10)
data.train <- matrix(rnorm(1000), nrow = 100, ncol = 10)
# Gene and sample names, so that cluster.data and cluster.markers can be matched
rownames(data.test) <- rownames(data.train) <- paste0("Gene", 1:100)
colnames(data.test) <- colnames(data.train) <- paste0("Sample", 1:10)
cluster.data <- data.frame(Sample = paste0("Sample", 1:10), Cluster = rep(1:2, each = 5))
cluster.markers <- setNames(
  lapply(unique(cluster.data$Cluster), function(c) {
    data.frame(Gene = paste0("Gene", sample(1:30, 10)), OR = runif(10, 0.5, 2), AUC = runif(10))
  }),
  unique(cluster.data$Cluster)
)
result <- RunClassifier(algorithm = "RF", data.test, data.train, cluster.data, cluster.markers, scale = TRUE)
Performs Consensus Clustering Analysis to identify stable clusters.
RunCOCA( jaccard.matrix, max.clusters = 6, optimal.clusters = 3, linkage.method = "ward.D2", clustering.algorithm = "hc", distance.metric = "euclidean", resampling.iterations = 10000, resample.proportion = 0.7 )
jaccard.matrix |
A Jaccard distance matrix. |
max.clusters |
Integer. Maximum number of clusters (default: 6). |
optimal.clusters |
Integer. Optimal number of clusters (default: 3). |
linkage.method |
Character. Linkage method (default: "ward.D2"). |
clustering.algorithm |
Character. Clustering algorithm (default: "hc"). |
distance.metric |
Character. Distance metric (default: "euclidean"). |
resampling.iterations |
Integer. Resampling iterations (default: 10000). |
resample.proportion |
Numeric. Resample proportion (default: 0.7). |
A list with consensus clustering results.
Zaoqu Liu; Email: [email protected]
Performs multi-omics clustering using CPCA algorithm.
RunCPCA(data, N.clust, ncomp = NULL, clustering.algorithm = "ward.D2")
data |
A list of matrices (features x samples). |
N.clust |
Integer. Number of clusters. |
ncomp |
Integer. Number of components (default: auto). |
clustering.algorithm |
Character. Clustering method (default: "ward.D2"). |
A data frame with Sample, Cluster, and Cluster2 columns.
Zaoqu Liu; Email: [email protected]
This function runs an ensemble of different classification models to predict cluster assignments for test data, considering consensus among the models.
RunEnsemble( data.test, data.train, cluster.data, cluster.markers, surdata = NULL, time = "time", event = "event", methods = NULL, sur.trend.rank = NULL, cutoff.P = 0.05 )
data.test |
A numeric matrix or data frame of test data. Rows represent genes, and columns represent samples. |
data.train |
A numeric matrix or data frame of training data. Rows represent genes, and columns represent samples. |
cluster.data |
A data frame where the first column must be the sample IDs and the second column must be the cluster assignments. The sample IDs must match the column names of the training data. |
cluster.markers |
A list of data frames, each containing markers for a specific cluster, with columns 'Gene' indicating gene names. |
surdata |
A data frame containing survival information for the samples. The first column must be sample IDs. |
time |
A character string specifying the column name in 'surdata' representing survival time (default: "time"). |
event |
A character string specifying the column name in 'surdata' representing the event status (default: "event"). |
methods |
A character vector specifying which classifiers to use in the ensemble. If NULL, all available methods will be used (default: NULL). |
sur.trend.rank |
A character vector specifying the desired order of survival trends (e.g., c("C2", "C3", "C1")) to filter the models (default: NULL). Note: the order should be from risk to protective. |
cutoff.P |
Numeric value for the p-value cutoff for survival analysis (default: 0.05). |
This function runs an ensemble of classifiers, checks the consistency among classifiers, and optionally performs survival analysis to filter models based on trends in clinical outcomes.
A list containing the ensemble prediction results and optional survival analysis.
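How RunEnsemble measures agreement between classifiers is not specified here. One common way to combine per-classifier predictions is a majority vote, sketched below under that assumption; `majority_vote` and the column names are hypothetical and not part of MOFSR.

```r
# Hypothetical helper: majority vote across classifier predictions.
# pred: one row per sample, one column per classifier, entries are cluster labels.
majority_vote <- function(pred) {
  pred <- as.matrix(pred)
  # Most frequent label per sample; ties resolve to the first label in sorted order
  consensus <- apply(pred, 1, function(row) names(which.max(table(row))))
  # Fraction of classifiers that agree with the consensus label
  agreement <- rowMeans(pred == consensus)
  data.frame(
    Sample    = rownames(pred),
    Consensus = unname(consensus),
    Agreement = unname(agreement)
  )
}

pred <- data.frame(
  RF  = c("C1", "C2", "C2"),
  SVM = c("C1", "C2", "C1"),
  kNN = c("C1", "C1", "C1"),
  row.names = paste0("Sample", 1:3)
)
vote <- majority_vote(pred)
```

Samples with low agreement are the ones where the individual models disagree and may deserve closer inspection.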
Zaoqu Liu; Email: [email protected]
# Example usage:
data.test <- matrix(rnorm(1000), nrow = 100, ncol = 10)
data.train <- matrix(rnorm(1000), nrow = 100, ncol = 10)
# Gene and sample names, so that cluster.data, cluster.markers, and surdata can be matched
rownames(data.test) <- rownames(data.train) <- paste0("Gene", 1:100)
colnames(data.test) <- colnames(data.train) <- paste0("Sample", 1:10)
cluster.data <- data.frame(Sample = paste0("Sample", 1:10), Cluster = rep(1:2, each = 5))
cluster.markers <- setNames(
  lapply(unique(cluster.data$Cluster), function(c) {
    data.frame(Gene = paste0("Gene", sample(1:30, 10)), OR = runif(10, 0.5, 2), AUC = runif(10))
  }),
  unique(cluster.data$Cluster)
)
surdata <- data.frame(ID = paste0("Sample", 1:10), time = runif(10, 1, 1000), event = sample(0:1, 10, replace = TRUE))
result <- RunEnsemble(data.test, data.train, cluster.data, cluster.markers, surdata, time = "time", event = "event", methods = c("rf", "xgboost"), sur.trend.rank = c("C1", "C3", "C2"))
This function estimates gene-set enrichment scores across all samples using various methods.
RunGSVA( exp, gene.list, min.size = 3, max.size = 1000, method = "ssgsea", ssgsea.normalize = TRUE, ssgsea.alpha = 0.25, gsva.kcdf = "Gaussian", gsva.tau = 1, gsva.maxDiff = TRUE, gsva.absRanking = FALSE, verbose = TRUE, nCores = parallel::detectCores() - 3 )
exp |
Numeric matrix containing the expression data or gene expression signatures, with samples in columns and genes in rows. |
gene.list |
Gene sets provided either as a list object or as a GeneSetCollection object. |
min.size |
Minimum size of the gene sets to be considered in the analysis. Default is 3. |
max.size |
Maximum size of the gene sets to be considered in the analysis. Default is 1000. |
method |
Method to employ in the estimation of gene-set enrichment scores per sample. Options are "gsva", "ssgsea" (default), "zscore", or "plage". |
ssgsea.normalize |
Logical vector of length 1; if TRUE runs the ssGSEA method from Barbie et al. (2009) normalizing the scores by the absolute difference between the minimum and the maximum, as described in their paper. Otherwise this last normalization step is skipped. |
ssgsea.alpha |
Numeric vector of length 1. The exponent defining the weight of the tail in the random walk performed by the ssGSEA (Barbie et al., 2009) method. The default value is 0.25 as described in the paper. |
gsva.kcdf |
Character vector of length 1 denoting the kernel to use during the non-parametric estimation of the cumulative distribution function of expression levels across samples. By default, kcdf="Gaussian" which is suitable when input expression values are continuous, such as microarray fluorescent units in logarithmic scale, RNA-seq log-CPMs, log-RPKMs or log-TPMs. When input expression values are integer counts, such as those derived from RNA-seq experiments, then this argument should be set to kcdf="Poisson". |
gsva.tau |
Numeric vector of length 1. The exponent defining the weight of the tail in the random walk performed by the GSVA (Hänzelmann et al., 2013) method. The default value is 1 as described in the paper. |
gsva.maxDiff |
Logical vector of length 1 which offers two approaches to calculate the enrichment statistic (ES) from the KS random walk statistic. FALSE: ES is calculated as the maximum distance of the random walk from 0. TRUE (the default): ES is calculated as the magnitude difference between the largest positive and negative random walk deviations. |
gsva.absRanking |
Logical vector of length 1 used only when maxDiff=TRUE. When absRanking=FALSE (default), a modified Kuiper statistic is used to calculate enrichment scores, taking the magnitude difference between the largest positive and negative random walk deviations. When absRanking=TRUE, the original Kuiper statistic, which sums the largest positive and negative random walk deviations, is used. In this latter case, gene sets with genes enriched on either extreme (high or low) will be regarded as 'highly' activated. |
verbose |
Logical indicating whether to print progress messages. Default is TRUE. |
nCores |
The number of cores to use for parallel computation. Default is 'parallel::detectCores() - 3', which detects the number of cores available on the system and reserves 3 cores for other tasks. |
This function supports multiple methods for estimating gene-set enrichment scores: ssGSEA, GSVA, zscore, and plage. Scores are calculated for each gene set across all samples. The minimum and maximum size of the gene sets considered can be set with 'min.size' and 'max.size'. Each method has its own strengths and suitable applications:
"gsva": Gene Set Variation Analysis, suitable for detecting subtle changes in pathway activity.
"ssgsea": Single-Sample Gene Set Enrichment Analysis, useful for individual sample analysis.
"zscore": Z-score transformation, a simpler approach to standardize expression values.
"plage": Pathway Level Analysis of Gene Expression, which focuses on correlating pathway components.
A gene-set by sample matrix of gene-set enrichment scores.
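To make the shape of the output concrete, the simplest of the four methods, "zscore", can be sketched in base R. The sketch assumes the usual combined z-score definition (standardize each gene across samples, then sum the z-scores of a set's genes and divide by the square root of the set size). `zscore_gene_sets` is a hypothetical helper, not the code RunGSVA runs.

```r
# Hypothetical helper illustrating the combined z-score idea.
# expr: genes x samples matrix; gene_sets: named list of gene-name vectors.
zscore_gene_sets <- function(expr, gene_sets, min_size = 3, max_size = 1000) {
  z <- t(scale(t(expr)))                     # standardize each gene across samples
  scores <- lapply(gene_sets, function(genes) {
    genes <- intersect(genes, rownames(z))   # keep only genes present in the data
    if (length(genes) < min_size || length(genes) > max_size) return(NULL)
    colSums(z[genes, , drop = FALSE]) / sqrt(length(genes))
  })
  scores <- scores[!vapply(scores, is.null, logical(1))]
  do.call(rbind, scores)                     # gene sets x samples
}

set.seed(1)
expr <- matrix(rnorm(300), nrow = 50, ncol = 6,
               dimnames = list(paste0("Gene", 1:50), paste0("Sample", 1:6)))
gene_sets <- list(
  SetA = paste0("Gene", 1:10),
  SetB = paste0("Gene", 11:12),   # below min_size, so it is dropped
  SetC = paste0("Gene", 20:40)
)
gs <- zscore_gene_sets(expr, gene_sets)
```

The result has one row per retained gene set and one column per sample, matching the layout described above.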
Zaoqu Liu; Email: [email protected]
Performs multi-omics clustering using iClusterBayes algorithm.
RuniClusterBayes(data, N.clust, method = "fast", burnin = 500, nsample = 1000)
data |
A list of matrices (features x samples). |
N.clust |
Integer. Number of clusters. |
method |
Character. "fast" or "full" (default: "fast"). |
burnin |
Integer. Burn-in iterations for full method (default: 500). |
nsample |
Integer. Sampling iterations for full method (default: 1000). |
A data frame with Sample, Cluster, and Cluster2 columns.
Zaoqu Liu; Email: [email protected]
Mo Q, et al. Biostatistics. 2018.
This function runs multi-omics integration analysis using a specified clustering algorithm. Users can choose from a variety of algorithms to perform data integration on multiple modalities.
RunIntegration(data, algorithm, N.clust, ...)
RunIF(data, algorithm, N.clust, ...)
data |
A list of matrices where each element represents a different modality (e.g., RNA, protein, methylation). Each matrix should have rows as features and columns as samples. |
algorithm |
Character. The integration algorithm to use. Options include "cpca", "iclusterbayes", "intnmf", "lracluster", "mcia", "nemo", "pinsplus", "rgcca", "sgcca", "snf", "cimlr", "bcc". |
N.clust |
Integer. Number of clusters to create (recommended). |
... |
Additional algorithm-specific arguments passed to the underlying functions. |
This function provides a unified interface to multiple multi-omics integration algorithms. Each algorithm has its own characteristics:
SNF: Similarity Network Fusion
CPCA: Consensus PCA
iClusterBayes: Bayesian integrative clustering
IntNMF: Integrative Non-negative Matrix Factorization
LRAcluster: Low-Rank Approximation clustering
MCIA: Multiple Co-inertia Analysis
NEMO: Neighborhood based multi-omics clustering
PINSPlus: Perturbation clustering for data integration
RGCCA: Regularized Generalized CCA
SGCCA: Sparse Generalized CCA
CIMLR: Cancer Integration via Multi-kernel Learning
BCC: Bayesian Consensus Clustering
A data frame with clustering results based on the selected algorithm.
Zaoqu Liu; Email: [email protected]
## Not run:
# Create example data
data1 <- matrix(rnorm(5000), nrow = 50, ncol = 100)
data2 <- matrix(rnorm(5000), nrow = 50, ncol = 100)
colnames(data1) <- colnames(data2) <- paste0("Sample", 1:100)
data_list <- list(data1, data2)
# Run integration clustering using SNF
result <- RunIntegration(data = data_list, algorithm = "snf", N.clust = 3)
## End(Not run)
Performs multi-omics clustering using IntNMF algorithm.
RunIntNMF(data, N.clust, maxiter = 200, n.ini = 30, wt = NULL)
data |
A list of matrices (features x samples). |
N.clust |
Integer. Number of clusters. |
maxiter |
Integer. Maximum iterations (default: 200). |
n.ini |
Integer. Number of initializations (default: 30). |
wt |
Numeric vector. Weights for each data type. |
A data frame with Sample, Cluster, and Cluster2 columns.
Zaoqu Liu; Email: [email protected]
Chalise P, Fridley BL. PLoS One. 2017.
Performs multi-omics clustering using LRAcluster algorithm.
RunLRAcluster(data, N.clust, data.types = NULL, cluster.algorithm = "ward.D2")
data |
A list of matrices (features x samples). |
N.clust |
Integer. Number of clusters. |
data.types |
Character vector. Data types ("binary", "gaussian", "poisson"). |
cluster.algorithm |
Character. Clustering method (default: "ward.D2"). |
A data frame with Sample, Cluster, and Cluster2 columns.
Zaoqu Liu; Email: [email protected]
Wu D, et al. BMC Genomics. 2015.
Performs multi-omics clustering using MCIA algorithm.
RunMCIA(data, N.clust, ncomp = NULL, clustering.algorithm = "ward.D2")
data |
A list of matrices (features x samples). |
N.clust |
Integer. Number of clusters. |
ncomp |
Integer. Number of components (default: auto). |
clustering.algorithm |
Character. Clustering method (default: "ward.D2"). |
A data frame with Sample, Cluster, and Cluster2 columns.
Zaoqu Liu; Email: [email protected]
Meng C, et al. BMC Bioinformatics. 2014.
This function performs MultiModality Fusion Subtyping (MOFS) analysis by utilizing multiple clustering algorithms for multi-modality data integration. The user can flexibly select the desired clustering algorithms and adjust relevant parameters.
RunMOFS( data, methods, max.clusters = 6, optimal.clusters = 3, linkage.method = "ward.D2", clustering.algorithm = "hc", distance.metric = "euclidean", resampling.iterations = 10000, resample.proportion = 0.7, silhouette.cutoff = 0.4, ... )
data |
A list of matrices where each element represents a different modality (e.g., RNA, protein, methylation). Each matrix should have rows as features and columns as samples. |
methods |
Character vector. The clustering algorithms to use. Options are: "CPCA", "iClusterBayes", "IntNMF", "LRAcluster", "MCIA", "NEMO", "PINSPlus", "RGCCA", "SGCCA", "SNF", "CIMLR", "BCC". At least two methods must be specified. |
max.clusters |
Integer. The maximum number of clusters to evaluate during consensus clustering analysis (default: 6). |
optimal.clusters |
Integer. The optimal number of clusters to select from the consensus clustering analysis (default: 3). |
linkage.method |
Character. The linkage method to use for hierarchical clustering (default: "ward.D2"). |
clustering.algorithm |
Character. The clustering algorithm to use during consensus clustering (default: 'hc'). |
distance.metric |
Character. The distance metric to use for clustering (default: "euclidean"). |
resampling.iterations |
Integer. The number of resampling iterations for consensus clustering (default: 10000). |
resample.proportion |
Numeric. The proportion of items to resample in each iteration for consensus clustering (default: 0.7). |
silhouette.cutoff |
Numeric. Silhouette coefficient cutoff value for selecting core set samples (default: 0.4). |
... |
Additional parameters specific to the chosen clustering algorithms. |
The function performs MultiModality Fusion Subtyping (MOFS) by running multiple clustering algorithms on the input multi-modality data. The results of each clustering algorithm are stored for further analysis, including binary cluster assignments, Jaccard distance calculation, consensus clustering analysis, Calinski-Harabasz index calculation, silhouette analysis, and PCA.
The steps involved are:
1. Running the specified clustering algorithms on the input data.
2. Extracting binary cluster assignments from the clustering results.
3. Calculating Jaccard distance between clusters.
4. Performing consensus clustering analysis to identify stable clusters.
5. Calculating the Calinski-Harabasz index to assess clustering quality.
6. Performing silhouette analysis to evaluate cluster cohesion and separation.
7. Identifying a core set of samples based on the silhouette coefficient cutoff.
8. Performing PCA on the core set for dimensionality reduction and visualization.
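Steps 2 and 3 above can be illustrated with a small base R sketch. It assumes each algorithm returns a data frame with Sample and Cluster columns and that the Jaccard distance is taken between samples on their binary membership vectors. The exact orientation used inside RunMOFS is not stated here, and `binary_membership` is a hypothetical helper.

```r
# Toy per-algorithm results (hypothetical values)
results <- list(
  CPCA  = data.frame(Sample = paste0("Sample", 1:6), Cluster = c(1, 1, 1, 2, 2, 2)),
  CIMLR = data.frame(Sample = paste0("Sample", 1:6), Cluster = c(1, 1, 2, 2, 2, 2))
)

# Step 2: one 0/1 indicator column per (algorithm, cluster) pair
binary_membership <- function(results) {
  blocks <- lapply(names(results), function(alg) {
    res <- results[[alg]]
    ks  <- sort(unique(res$Cluster))
    m   <- sapply(ks, function(k) as.integer(res$Cluster == k))
    dimnames(m) <- list(res$Sample, paste0(alg, "_C", ks))
    m
  })
  do.call(cbind, blocks)   # samples x indicator columns
}
B <- binary_membership(results)

# Step 3: dist(method = "binary") is the Jaccard distance on 0/1 vectors
jaccard <- as.matrix(dist(B, method = "binary"))
```

Samples that every algorithm places together have distance 0, and samples that never share a cluster have distance 1. A matrix of this kind is what the consensus clustering step takes as input.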
A list containing the results for each specified clustering algorithm, as well as the results of further analysis including consensus clustering, silhouette scores, core set identification, and PCA.
Zaoqu Liu; Email: [email protected]
# Example usage:
data1 <- matrix(rnorm(10000), nrow = 100, ncol = 100)
data2 <- matrix(rnorm(10000), nrow = 100, ncol = 100)
colnames(data1) <- colnames(data2) <- paste0("Sample", 1:100)
data_list <- list(data1, data2)
# Run MultiModality Fusion Subtyping using CPCA and CIMLR
result <- RunMOFS(data = data_list, methods = c("CPCA", "CIMLR"), max.clusters = 6, optimal.clusters = 3, linkage.method = "ward.D2", clustering.algorithm = "hc", distance.metric = "euclidean", resampling.iterations = 10000, resample.proportion = 0.7, silhouette.cutoff = 0.4)
Performs multi-omics clustering using NEMO algorithm.
RunNEMO(data, N.clust = NULL, num.neighbors = NA)
data |
A list of matrices (features x samples). |
N.clust |
Integer. Number of clusters (NULL for automatic). |
num.neighbors |
Number of neighbors (NA for automatic). |
A data frame with Sample, Cluster, and Cluster2 columns.
Zaoqu Liu; Email: [email protected]
Rappoport N, Shamir R. Bioinformatics. 2019.
Performs PCA on the given data using base R.
RunPCA(data, scale = TRUE, center = TRUE, ncomp = NULL)
data |
A data matrix or distance matrix. |
scale |
Logical. Whether to scale variables (default: TRUE). |
center |
Logical. Whether to center variables (default: TRUE). |
ncomp |
Number of components to return (default: all). |
A list with class "prcomp" containing PCA results.
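Since the return value has class "prcomp", it can be handled like the output of base R's `stats::prcomp()`. The sketch below calls `prcomp()` directly on toy data with samples in rows, which is the orientation `prcomp()` expects. Whether RunPCA transposes its input, and how it applies `ncomp`, are not stated here, so the truncation step is an assumption.

```r
set.seed(1)
x <- matrix(rnorm(200), nrow = 20, ncol = 10,
            dimnames = list(paste0("Sample", 1:20), paste0("Feature", 1:10)))

# Centered and scaled PCA
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Keep the first few components (assumed to be what ncomp controls)
ncomp  <- 3
scores <- pca$x[, seq_len(ncomp), drop = FALSE]

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```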
Zaoqu Liu; Email: [email protected]
Performs multi-omics clustering using PINSPlus algorithm.
RunPINSPlus(data, N.clust = NULL, kMin = 2, kMax = 5)
data |
A list of matrices (features x samples). |
N.clust |
Integer. Number of clusters (NULL for automatic). |
kMin |
Integer. Minimum clusters (default: 2). |
kMax |
Integer. Maximum clusters (default: 5). |
A data frame with Sample, Cluster, and Cluster2 columns.
Zaoqu Liu; Email: [email protected]
Nguyen T, et al. Genome Research. 2017;27(12):2025-2039.
Performs multi-omics clustering using RGCCA algorithm.
RunRGCCA( data, N.clust, ncomp = NULL, tau = NULL, scheme = "centroid", clustering.algorithm = "ward.D2" )
data |
A list of matrices (features x samples). |
N.clust |
Integer. Number of clusters. |
ncomp |
Integer vector. Number of components per block. |
tau |
Numeric vector. Shrinkage parameters. |
scheme |
Character. Scheme type ("centroid", "factorial", "horst"). |
clustering.algorithm |
Character. Clustering method (default: "ward.D2"). |
A data frame with Sample, Cluster, and Cluster2 columns.
Zaoqu Liu; Email: [email protected]
Tenenhaus A, Tenenhaus M. Psychometrika. 2011;76(2):257-284.
Performs multi-omics clustering using SGCCA algorithm.
RunSGCCA( data, N.clust, ncomp = NULL, c1 = NULL, scheme = "centroid", clustering.algorithm = "ward.D2" )
data |
A list of matrices (features x samples). |
N.clust |
Integer. Number of clusters. |
ncomp |
Integer vector. Number of components per block. |
c1 |
Numeric vector. L1 penalty parameters (0-1). |
scheme |
Character. Scheme type (default: "centroid"). |
clustering.algorithm |
Character. Clustering method (default: "ward.D2"). |
A data frame with Sample, Cluster, and Cluster2 columns.
Zaoqu Liu; Email: [email protected]
Tenenhaus et al. Biostatistics. 2013.
This function performs clustering analysis using Similarity Network Fusion (SNF), which integrates multiple data types to provide a unified clustering result.
RunSNF( data, N.clust = NULL, num.neighbors = 20, variance = 0.5, num.iterations = 20 )
data |
A list of matrices where each element represents a different modality. Each matrix should have rows as features and columns as samples. |
N.clust |
Integer. Number of clusters for spectral clustering. |
num.neighbors |
Integer. Number of nearest neighbors (default: 20). |
variance |
Numeric. Variance for local model (default: 0.5). |
num.iterations |
Integer. Number of SNF iterations (default: 20). |
A data frame with Sample, Cluster, and Cluster2 columns.
Zaoqu Liu; Email: [email protected]
Wang B, et al. Nat Methods. 2014;11(3):333-337.
This function calculates the standard deviation (SD) for each column in a data frame.
SD.df(df)
df |
A data frame with row samples and column features. |
A sorted vector of SD values for each column in the data frame, in decreasing order.
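The described behaviour corresponds to the following base R one-liner, shown on toy data. It is an equivalent sketch, not the source of SD.df.

```r
# Samples in rows, features in columns
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c(5, 5, 5))

# Per-column standard deviation, sorted in decreasing order
sd_sorted <- sort(vapply(df, sd, numeric(1)), decreasing = TRUE)
```

Here `b` comes first (SD 10), then `a` (SD 1), then the constant column `c` (SD 0).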
Zaoqu Liu; E-mail: [email protected]
This function selects hypervariable features from a data frame based on specified variance calculation methods.
Select.Features( data, method = "mad", top.percent = NULL, top.number = 1000, custom.features = NULL )
data |
A data frame with row features and column samples. |
method |
Method of calculating variance ("sd", "mad", "cv"). |
top.percent |
The top percent of hypervariable features based on variance. |
top.number |
The top number of hypervariable features based on variance. |
custom.features |
A vector of custom features to select. |
A vector of selected hypervariable features.
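The selection rule can be sketched as: compute a variability statistic per feature (row), then keep the names of the highest-ranked features. The helper below is hypothetical and covers only the `top.number` path. How Select.Features combines `top.percent`, `top.number`, and `custom.features` is not specified here.

```r
# Hypothetical helper: rank features (rows) by variability and keep the top ones.
select_top_variable <- function(data, method = c("mad", "sd", "cv"), top_number = 1000) {
  method <- match.arg(method)
  stat <- switch(method,
    mad = apply(data, 1, mad),
    sd  = apply(data, 1, sd),
    cv  = apply(data, 1, sd) / abs(rowMeans(data))  # coefficient of variation
  )
  n <- min(top_number, length(stat))
  names(sort(stat, decreasing = TRUE))[seq_len(n)]
}

# Toy data: feature k is k times the same pattern, so variability grows with k
base <- c(-3, -2, -1, 0, 0, 1, 2, 3)
data <- as.data.frame(outer(1:5, base))
rownames(data) <- paste0("Feature", 1:5)
colnames(data) <- paste0("Sample", 1:8)

top <- select_top_variable(data, method = "mad", top_number = 2)
```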
Zaoqu Liu; E-mail: [email protected]
## Not run:
df <- data.frame(matrix(rnorm(1000), nrow = 100, ncol = 10))
selected_features <- Select.Features(df, method = "mad", top.percent = 10)
## End(Not run)
RGCCA with L1 sparsity for variable selection.
sgcca( A, C = 1 - diag(length(A)), c1 = rep(1, length(A)), ncomp = rep(1, length(A)), scheme = "centroid", scale = TRUE, init = "svd", bias = TRUE, tol = .Machine$double.eps, verbose = FALSE )
A |
List of data blocks. |
C |
Design matrix. |
c1 |
L1 penalty (between 0 and 1). |
ncomp |
Number of components per block. |
scheme |
Scheme type. |
scale |
Scale blocks. |
init |
Initialization method. |
bias |
Biased estimator. |
tol |
Tolerance. |
verbose |
Print progress. |
List with Y, a, astar, C, c1, scheme, ncomp, crit, AVE.
Fuses multiple affinity matrices into a unified similarity network.
snf_fuse(Wall, K = 20, t = 20)
Wall |
List of affinity matrices (one per data type). |
K |
Number of neighbors for KNN step (default: 20). |
t |
Number of iterations for diffusion process (default: 20). |
Unified similarity matrix.
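The fusion step can be illustrated with a deliberately simplified base R sketch of the cross-diffusion idea: each view's similarity matrix is repeatedly updated by diffusing the average of the other views through that view's own K-nearest-neighbour kernel. The helpers below are hypothetical. They omit normalization details of the published SNF algorithm and should not be read as what snf_fuse does internally.

```r
row_normalize <- function(W) W / rowSums(W)

# Sparse local kernel: keep each row's K largest affinities, then row-normalize
knn_kernel <- function(W, K) {
  S <- matrix(0, nrow(W), ncol(W), dimnames = dimnames(W))
  for (i in seq_len(nrow(W))) {
    nn <- order(W[i, ], decreasing = TRUE)[seq_len(K)]
    S[i, nn] <- W[i, nn]
  }
  row_normalize(S)
}

# Simplified cross-diffusion over a list of affinity matrices (needs >= 2 views)
fuse_sketch <- function(Wall, K = 20, n_iter = 20) {
  stopifnot(length(Wall) >= 2)
  K <- min(K, nrow(Wall[[1]]))
  P <- lapply(Wall, row_normalize)      # global similarity per view
  S <- lapply(Wall, knn_kernel, K = K)  # local similarity per view
  for (iter in seq_len(n_iter)) {
    P <- lapply(seq_along(P), function(v) {
      others <- Reduce(`+`, P[-v]) / (length(P) - 1)
      row_normalize(S[[v]] %*% others %*% t(S[[v]]))
    })
  }
  fused <- Reduce(`+`, P) / length(P)
  (fused + t(fused)) / 2                # symmetrize
}

# Toy data: 12 samples (rows) in two groups, measured in two views
set.seed(1)
view1 <- rbind(matrix(rnorm(30, 0), 6, 5), matrix(rnorm(30, 3), 6, 5))
view2 <- rbind(matrix(rnorm(18, 0), 6, 3), matrix(rnorm(18, 3), 6, 3))
affinity <- function(x) exp(-as.matrix(dist(x))^2 / (2 * ncol(x)))
Wall  <- list(affinity(view1), affinity(view2))
fused <- fuse_sketch(Wall, K = 4, n_iter = 10)
```

The fused matrix can then be passed to a clustering step such as spectral clustering.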
Performs spectral clustering on an affinity matrix.
spectral_clustering(affinity, K, type = 3)
affinity |
Affinity matrix (N x N). |
K |
Number of clusters. |
type |
Type of Laplacian normalization (1, 2, or 3; default: 3). |
Vector of cluster labels (1 to K).
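As an illustration of the technique, here is a minimal spectral clustering sketch using the symmetric normalized Laplacian, followed by k-means on the row-normalized eigenvectors. Which normalization each value of `type` selects is not specified here, so the sketch fixes one common variant. `spectral_sketch` is a hypothetical helper.

```r
# Hypothetical helper: spectral clustering with L_sym = I - D^{-1/2} W D^{-1/2}
spectral_sketch <- function(affinity, K, seed = 1) {
  n <- nrow(affinity)
  d <- rowSums(affinity)
  d_inv_sqrt <- 1 / sqrt(pmax(d, .Machine$double.eps))
  # Scale rows, then columns, by D^{-1/2}
  L_sym <- diag(n) - (d_inv_sqrt * affinity) %*% diag(d_inv_sqrt)
  eig <- eigen(L_sym, symmetric = TRUE)            # eigenvalues in decreasing order
  U <- eig$vectors[, (n - K + 1):n, drop = FALSE]  # K smallest eigenvalues
  U <- U / sqrt(rowSums(U^2))                      # row-normalize the embedding
  set.seed(seed)
  kmeans(U, centers = K, nstart = 20)$cluster
}

# Toy data: two well-separated groups of 10 points
set.seed(1)
pts <- rbind(matrix(rnorm(20, 0, 0.3), 10, 2),
             matrix(rnorm(20, 3, 0.3), 10, 2))
A <- exp(-as.matrix(dist(pts))^2 / 2)   # Gaussian affinity
labels <- spectral_sketch(A, K = 2)
```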
Performs single-sample pathway activity analysis using Mann-Whitney-Wilcoxon Gene Set Test (MWW-GST) method.
ssMwwGST(geData, geneSet, nCores = 1, minLenGeneSet = 15)
geData |
A numeric matrix of gene expression (genes x samples). |
geneSet |
A list of gene sets (vectors of gene names). |
nCores |
Integer. Number of cores for parallel processing (default: 1). |
minLenGeneSet |
Minimum gene set size (default: 15). |
A list containing NES, pValue, and FDR matrices.
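The core of a Mann-Whitney-Wilcoxon gene set test can be sketched for a single sample with base R's `wilcox.test()`: genes inside the set are compared with all remaining genes. The normalized score used below, U / (m * n), is an assumption chosen for illustration. How ssMwwGST defines NES is not specified in this section, and `mww_gene_set` is a hypothetical helper.

```r
# Hypothetical helper: one-sided MWW test of set genes vs. all other genes
# expr_vec: named numeric vector of expression values for one sample
mww_gene_set <- function(expr_vec, genes) {
  in_set <- names(expr_vec) %in% genes
  wt <- wilcox.test(expr_vec[in_set], expr_vec[!in_set], alternative = "greater")
  U  <- unname(wt$statistic)
  c(score   = U / (sum(in_set) * sum(!in_set)),  # estimate of P(set gene > other gene)
    p_value = wt$p.value)
}

# Toy sample: the gene set holds the 10 most highly expressed genes
expr_vec <- setNames(as.numeric(1:100), paste0("Gene", 1:100))
res <- mww_gene_set(expr_vec, paste0("Gene", 91:100))

# Across several gene sets, p-values are typically adjusted, e.g. with BH
fdr <- p.adjust(c(SetA = unname(res["p_value"]), SetB = 0.5), method = "BH")
```

Repeating this for every sample and gene set gives score, p-value, and FDR matrices of the kind returned above.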
Zaoqu Liu; Email: [email protected]
Frattini V, et al. Nature. 2018;553(7687):222-227.
Resets to sequential processing.
stop_parallel()
Performs multi-omic clustering using perturbation clustering.
subtyping_omics_data( data_list, kMin = 2, kMax = 5, k = NULL, agreement_cutoff = 0.5, verbose = FALSE )
data_list |
List of data matrices (samples x features). |
kMin |
Minimum clusters (default: 2). |
kMax |
Maximum clusters (default: 5). |
k |
Fixed number of clusters (optional). |
agreement_cutoff |
Agreement threshold (default: 0.5). |
verbose |
Print progress. |
List with cluster1, cluster2 and data type results.
This function performs single-sample Gene Set Enrichment Analysis (ssGSEA) for Glioblastoma Multiforme (GBM) data based on the Wang et al. 2017 classification system. It predicts sample subtypes (Classical, Mesenchymal, Proneural) based on enrichment scores using established marker gene sets.
WangGBM( data.test, dir.file = ".", gct.filename = "data.gct", number.perms = 100, tolerate.mixed = FALSE, method = c("internal", "GSVA", "external") )
data.test |
A matrix or data frame representing the input expression data, where rows are genes and columns are samples. |
dir.file |
Character. Directory for saving the output files (default: '.'). Set to NULL to use a temporary directory. |
gct.filename |
Character. The filename for the generated GCT file (default: 'data.gct'). |
number.perms |
Integer. Number of permutations for ssGSEA analysis (default: 100). |
tolerate.mixed |
Logical. Whether to allow "Mixed" predictions when multiple gene sets have the same minimum p-value (default: FALSE). |
method |
Character. The ssGSEA implementation to use: "internal" (built-in), "GSVA" (requires GSVA package), or "external" (requires ssgsea.GBM.classification package from GitHub). Default: "internal". |
The function uses the Wang et al. 2017 GBM subtyping system which classifies samples into three subtypes:
Classical (CL)
Mesenchymal (MES)
Proneural (PN)
For the "external" method, the ssgsea.GBM.classification package is required:
devtools::install_github("Zaoqu-Liu/ssgsea.GBM.classification")
A data frame with the following columns:
ID: Sample identifiers.
Predict: Predicted subtype for each sample.
Columns with _pval: P-values for each subtype.
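The relationship between the p-value columns and the Predict column can be sketched as follows: each sample is assigned the subtype with the smallest p-value, and a tie is reported as "Mixed" when mixed calls are tolerated. The column names, the helper `predict_from_pvals`, and the tie-breaking rule when `tolerate_mixed = FALSE` (first subtype wins) are assumptions made for this example.

```r
# Hypothetical helper: derive a subtype call from per-subtype p-values
predict_from_pvals <- function(pvals, tolerate_mixed = FALSE) {
  subtypes <- sub("_pval$", "", colnames(pvals))
  apply(pvals, 1, function(p) {
    best <- which(p == min(p))
    if (length(best) > 1 && tolerate_mixed) "Mixed" else subtypes[best[1]]
  })
}

# Toy p-values (hypothetical column names)
pvals <- data.frame(
  Classical_pval   = c(0.01, 0.20, 0.05),
  Mesenchymal_pval = c(0.30, 0.02, 0.05),
  Proneural_pval   = c(0.50, 0.40, 0.60),
  row.names = paste0("Sample", 1:3)
)

pred_mixed  <- predict_from_pvals(pvals, tolerate_mixed = TRUE)
pred_strict <- predict_from_pvals(pvals, tolerate_mixed = FALSE)
```

Sample3 has tied minimum p-values, so it is called "Mixed" in the first case and falls back to the first tied subtype in the second.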
Zaoqu Liu; Email: [email protected]
Wang Q, Hu B, Hu X, Kim H, Squatrito M, Scarpace L, et al. Tumor Evolution of Glioma-Intrinsic Gene Expression Subtypes Associates with Immunological Changes in the Microenvironment. Cancer Cell. July 2017;32(1):42-56.e6.
## Not run:
# Simulated expression data
data.test <- matrix(rnorm(10000), nrow = 100, ncol = 100)
rownames(data.test) <- paste0("Gene", 1:100)
colnames(data.test) <- paste0("Sample", 1:100)

# Run GBM ssGSEA-based subtyping
result <- WangGBM(
  data.test = data.test,
  number.perms = 50,
  tolerate.mixed = TRUE
)
print(result)
## End(Not run)
This function performs single-sample Gene Set Enrichment Analysis (ssGSEA) to classify GBM samples into subtypes based on Wu et al. 2024 molecular markers, including Proneural (PN), Mesenchymal (MES), and Oxidative Phosphorylation (OXPHOS).
WuGBM( data.test, dir.file = ".", gct.filename = "data.gct", number.perms = 100, tolerate.mixed = FALSE, method = c("internal", "GSVA", "external") )
data.test |
A matrix or data frame representing the input expression data, where rows are genes and columns are samples. |
dir.file |
Character. Directory for saving the output files (default: '.'). Set to NULL to use a temporary directory. |
gct.filename |
Character. The filename for the generated GCT file (default: 'data.gct'). |
number.perms |
Integer. Number of permutations for ssGSEA analysis (default: 100). |
tolerate.mixed |
Logical. Whether to allow "Mixed" predictions when multiple gene sets have the same minimum p-value (default: FALSE). |
method |
Character. The ssGSEA implementation to use: "internal" (built-in), "GSVA" (requires GSVA package), or "external". Default: "internal". |
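The `dir.file` and `gct.filename` arguments indicate that the expression matrix is written to a GCT file. The sketch below writes a matrix in the GCT 1.2 layout (a `#1.2` version line, a dimensions line, then a tab-separated table with `NAME` and `Description` columns). It assumes the input has row and column names; `write_gct` is a hypothetical helper, and whether MOFSR writes exactly this layout is not confirmed by the source.

```r
# Hypothetical helper: write a genes x samples matrix as a GCT 1.2 file.
write_gct <- function(expr, dir.file = NULL, gct.filename = "data.gct") {
  # Mirror the documented convention: NULL means a temporary directory.
  if (is.null(dir.file)) dir.file <- tempdir()
  expr <- as.matrix(expr)
  path <- file.path(dir.file, gct.filename)

  con <- file(path, open = "wt")
  on.exit(close(con))
  writeLines("#1.2", con)                                      # format version
  writeLines(paste(nrow(expr), ncol(expr), sep = "\t"), con)   # rows, columns
  body <- data.frame(
    NAME = rownames(expr), Description = "na", expr,
    check.names = FALSE, row.names = NULL
  )
  write.table(body, con, sep = "\t", quote = FALSE, row.names = FALSE)
  path
}

m <- matrix(1:6, nrow = 3, dimnames = list(paste0("Gene", 1:3), c("S1", "S2")))
gct_path <- write_gct(m)
readLines(gct_path)
```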
The function uses predefined marker gene sets for GBM subtypes from Wu et al. 2024:
PN: Proneural subtype markers.
OXPHOS: Oxidative phosphorylation subtype markers.
MES: Mesenchymal subtype markers.
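To show how a marker gene set yields a per-sample enrichment score, here is a simplified ssGSEA-style sketch: genes are ranked by expression, and the score sums the difference between a rank-weighted running distribution of in-set genes and the running distribution of out-of-set genes. It omits score normalization and permutation-based p-values, uses simulated gene names rather than the Wu et al. markers, and is not claimed to be MOFSR's internal implementation; `ssgsea_score` and the exponent `alpha = 0.25` are assumptions of the sketch.

```r
# Hypothetical helper: simplified ssGSEA-style score for one sample.
# `expr` is a named numeric vector of expression values for that sample.
ssgsea_score <- function(expr, gene_set, alpha = 0.25) {
  ord <- order(expr, decreasing = TRUE)
  genes <- names(expr)[ord]
  r <- rank(expr)[ord]                 # highest expression has the largest rank
  in_set <- genes %in% gene_set
  stopifnot(any(in_set), any(!in_set))

  hit <- cumsum(ifelse(in_set, r^alpha, 0))
  p_hit <- hit / hit[length(hit)]              # weighted running sum, in-set genes
  p_miss <- cumsum(!in_set) / sum(!in_set)     # running sum, out-of-set genes
  sum(p_hit - p_miss)
}

set.seed(1)
expr <- setNames(rnorm(200), paste0("Gene", 1:200))
top_set <- names(sort(expr, decreasing = TRUE))[1:10]   # most expressed genes
bottom_set <- names(sort(expr))[1:10]                   # least expressed genes

ssgsea_score(expr, top_set)     # positive: set is enriched at the top
ssgsea_score(expr, bottom_set)  # negative: set is enriched at the bottom
```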
A data frame with the following columns:
ID: Sample identifiers.
Predict: Predicted subtype for each sample.
Columns ending in _pval: P-values for each subtype (one column per subtype).
Zaoqu Liu
Wu M, Wang T, Ji N, Lu T, Yuan R, Wu L, et al. Multi-omics and pharmacological characterization of patient-derived glioma cell lines. Nat Commun. 2024;15:6740. doi:10.1038/s41467-024-51214-y.
## Not run:
# Simulated expression data
data.test <- matrix(rnorm(10000), nrow = 100, ncol = 100)
rownames(data.test) <- paste0("Gene", 1:100)
colnames(data.test) <- paste0("Sample", 1:100)

# Run WuGBM subtyping
result <- WuGBM(
  data.test = data.test,
  number.perms = 50,
  tolerate.mixed = TRUE
)
print(result)
## End(Not run)