| Title: | Single Cell Clustering Evaluation and Optimization Framework |
|---|---|
| Description: | A comprehensive framework for evaluating and optimizing single-cell RNA-seq clustering results using self-projection machine learning approaches. The package implements an iterative optimization strategy that merges poorly discriminated clusters based on confusion matrix analysis, achieving robust and reliable cell type identification. Features include multiple classifier support (logistic regression, random forest, SVM, etc.), ROC curve analysis, confusion matrix visualization, and seamless integration with Seurat objects. This is an R implementation inspired by the SCCAF Python package. |
| Authors: | Zaoqu Liu [aut, cre], Chichau Miao [ctb] (Original SCCAF Python implementation) |
| Maintainer: | Zaoqu Liu <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.0 |
| Built: | 2026-05-26 05:56:11 UTC |
| Source: | https://github.com/Zaoqu-Liu/scClustEval |
A comprehensive framework for evaluating and optimizing single-cell RNA-seq clustering results using self-projection machine learning approaches.
The scClustEval package provides tools for:
Clustering Assessment: Evaluate the quality of cell clustering using self-projection with various machine learning classifiers
Clustering Optimization: Iteratively merge poorly discriminated clusters to achieve robust cell type identification
Visualization: ROC curves, confusion matrices, Sankey diagrams, and comprehensive assessment plots
Seurat Integration: Seamless workflow with Seurat objects
The core algorithm works by:
Training a classifier to distinguish between clusters
Evaluating prediction accuracy via cross-validation and hold-out testing
Identifying cluster pairs that are difficult to discriminate
Merging confused clusters and iterating until target accuracy is reached
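The four steps above can be sketched in base R. This is only an illustration under stated assumptions: the nearest-centroid classifier, the toy data, and the helper names `nearest_centroid` and `self_project` are hypothetical stand-ins, not the package's classifiers or API, and a single most-confused pair is merged per round rather than using the R1/R2 cutoffs.

```r
set.seed(1)
n <- 60
# Toy data: cluster "a" is well separated, "b" and "c" overlap heavily
X <- rbind(
  matrix(rnorm(2 * n, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(2 * n, mean = 5, sd = 0.5), ncol = 2),
  matrix(rnorm(2 * n, mean = 5.2, sd = 0.5), ncol = 2)
)
labels <- rep(c("a", "b", "c"), each = n)

# Stand-in classifier (assumption): assign each test cell to the nearest training centroid
nearest_centroid <- function(X_train, y_train, X_test) {
  centers <- t(sapply(split(as.data.frame(X_train), y_train), colMeans))
  d <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(X_test, 2, centers[k, ])^2)
  })
  rownames(centers)[apply(d, 1, which.min)]
}

# Steps 1-2: train on one half, evaluate on the held-out half
self_project <- function(X, labels) {
  test <- sample(nrow(X), nrow(X) %/% 2)
  pred <- nearest_centroid(X[-test, , drop = FALSE], labels[-test],
                           X[test, , drop = FALSE])
  lv <- sort(unique(labels))
  cmat <- unclass(table(factor(labels[test], levels = lv),
                        factor(pred, levels = lv)))
  list(accuracy = mean(pred == labels[test]), cmat = cmat)
}

# Steps 3-4: merge the most confused pair until the target accuracy is reached
target <- 0.9
repeat {
  fit <- self_project(X, labels)
  if (fit$accuracy >= target || length(unique(labels)) <= 2) break
  conf <- fit$cmat + t(fit$cmat)
  diag(conf) <- 0
  pair <- rownames(conf)[as.vector(arrayInd(which.max(conf), dim(conf)))]
  labels[labels %in% pair] <- paste(sort(pair), collapse = "+")
}
unique(labels)
```

On this toy data the overlapping clusters "b" and "c" are merged, while the well-separated cluster "a" is left alone.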
sc_assessment: Core function for clustering assessment
sc_optimize: Single round of clustering optimization
sc_optimize_all: Full iterative optimization pipeline
RunAssessment: Seurat-style assessment function
RunOptimization: Seurat-style optimization function
The package supports multiple classifiers:
LR: Logistic Regression (L1/L2 regularization)
RF: Random Forest
SVM: Support Vector Machine
NB: Naive Bayes
DT: Decision Tree
XGB: XGBoost (if installed)
Zaoqu Liu [email protected]
This package is an R implementation inspired by the SCCAF Python package: https://github.com/SCCAF/sccaf
Miao, Z., et al. (2020). Putative cell type discovery from single-cell gene expression data. Nature Methods.
Useful links:
Report bugs at https://github.com/Zaoqu-Liu/scClustEval/issues
Calculate and add per-cluster reliability scores
AddClusterReliability(object, result, cluster_col = NULL, reliability_col = NULL)
| Argument | Description |
|---|---|
| object | Seurat object |
| result | scClustEval assessment result |
| cluster_col | Cluster column that was assessed |
| reliability_col | Name for reliability column |
Seurat object with reliability scores added
Core functions for clustering quality assessment using self-projection
Calculate Bhattacharyya distance between two probability distributions
bhattacharyya_distance(p, q)
| Argument | Description |
|---|---|
| p | First probability distribution |
| q | Second probability distribution |
Bhattacharyya distance (non-negative value)
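For reference, a minimal sketch of the standard discrete Bhattacharyya distance, D = -ln(sum(sqrt(p * q))). The function name `bhattacharyya_sketch` is hypothetical, and the renormalization of the inputs is an assumption; the package's own handling of zeros or unnormalized input may differ.

```r
# Standard discrete Bhattacharyya distance (illustrative, not the package code)
bhattacharyya_sketch <- function(p, q) {
  p <- p / sum(p)              # assumption: inputs are renormalized to sum to 1
  q <- q / sum(q)
  bc <- sum(sqrt(p * q))       # Bhattacharyya coefficient, in [0, 1]
  -log(bc)
}

d_same <- bhattacharyya_sketch(c(0.5, 0.5), c(0.5, 0.5))  # identical distributions
d_diff <- bhattacharyya_sketch(c(0.9, 0.1), c(0.1, 0.9))  # equals -log(0.6)
```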
Compute pairwise Bhattacharyya distances between distributions
bhattacharyya_matrix(prob_matrix, flags = NULL)
| Argument | Description |
|---|---|
| prob_matrix | Matrix where each column is a probability distribution |
| flags | Optional logical vector to filter rows |
Distance matrix
Compute confusion matrix from predictions and true labels
calc_confusion_matrix(y_true, y_pred, labels = NULL)
| Argument | Description |
|---|---|
| y_true | True class labels |
| y_pred | Predicted class labels |
| labels | Optional vector of labels to use (in order) |
A matrix with rows as true labels and columns as predicted labels
y_true <- c("A", "A", "B", "B", "C", "C")
y_pred <- c("A", "B", "B", "B", "C", "A")
calc_confusion_matrix(y_true, y_pred)
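A comparable matrix (rows as true labels, columns as predicted labels) can be built with base R's `table()`. This is a sketch of the idea, not the package's implementation; fixing the factor levels keeps the matrix square even when a label is never predicted.

```r
y_true <- c("A", "A", "B", "B", "C", "C")
y_pred <- c("A", "B", "B", "B", "C", "A")

# Use a shared level set so rows and columns line up
lv <- sort(unique(c(y_true, y_pred)))
cm <- table(true = factor(y_true, levels = lv),
            predicted = factor(y_pred, levels = lv))
cm
```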
Unified interface for multiple machine learning classifiers
Cluster groups based on an adjacency/confusion matrix using Louvain
cluster_adjacency_matrix(adj_matrix, cutoff = 0.1, resolution = 1, algorithm = "louvain")
| Argument | Description |
|---|---|
| adj_matrix | Adjacency matrix (e.g., normalized confusion matrix) |
| cutoff | Threshold for binarizing the matrix (default: 0.1) |
| resolution | Resolution parameter for Louvain clustering (default: 1.0) |
| algorithm | Clustering algorithm: "louvain" or "leiden" (if igraph supports it) |
The function:
Binarizes the adjacency matrix using the cutoff
Creates a graph from the binary matrix
Applies Louvain/Leiden clustering to identify groups of connected clusters
Integer vector of cluster assignments
# Create a sample adjacency matrix
adj <- matrix(c(0, 0.3, 0.05, 0.3, 0, 0.02, 0.05, 0.02, 0), nrow = 3)
rownames(adj) <- colnames(adj) <- c("A", "B", "C")
cluster_adjacency_matrix(adj, cutoff = 0.1)
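The binarize-then-group idea can be illustrated in base R. Note the substitution: this sketch groups clusters by connected components of the thresholded graph, which is a simpler stand-in for Louvain/Leiden and will not always give the same partition. The function name `connected_groups` is hypothetical.

```r
# Group clusters by connected components of the thresholded matrix
# (a stand-in for Louvain; not the package's algorithm)
connected_groups <- function(adj, cutoff = 0.1) {
  linked <- adj > cutoff
  linked <- linked | t(linked)          # treat links as undirected
  n <- nrow(adj)
  groups <- seq_len(n)                  # start with one group per cluster
  repeat {
    changed <- FALSE
    for (i in seq_len(n)) for (j in seq_len(n)) {
      if (linked[i, j] && groups[i] != groups[j]) {
        lo <- min(groups[i], groups[j])
        hi <- max(groups[i], groups[j])
        groups[groups == hi] <- lo      # merge the two groups
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  groups <- match(groups, unique(groups))  # relabel as 1, 2, ...
  names(groups) <- rownames(adj)
  groups
}

adj <- matrix(c(0, 0.3, 0.05, 0.3, 0, 0.02, 0.05, 0.02, 0), nrow = 3)
rownames(adj) <- colnames(adj) <- c("A", "B", "C")
g <- connected_groups(adj, cutoff = 0.1)
```

With the sample matrix, only the A-B entry (0.3) exceeds the cutoff, so A and B share a group and C stays alone.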
Functions for graph-based clustering and cluster manipulation
Functions for computing and normalizing confusion matrices
Create a unified classifier object that wraps various ML algorithms
create_classifier(
  type = "LR", penalty = "l1", alpha = NULL, lambda = NULL,
  n_trees = 500, max_depth = NULL, kernel = "radial", seed = NULL, ...
)
| Argument | Description |
|---|---|
| type | Classifier type: "LR", "RF", "SVM", "NB", "DT", "XGB", "RANGER" |
| penalty | For LR: "l1" (lasso), "l2" (ridge), or "elasticnet" |
| alpha | For LR: elasticnet mixing parameter (1=lasso, 0=ridge). Default: 1 for L1 |
| lambda | For LR: regularization strength (smaller = more regularization). If NULL, uses cross-validation to select |
| n_trees | For RF/RANGER/XGB: number of trees |
| max_depth | For DT/XGB: maximum tree depth |
| kernel | For SVM: kernel type ("linear", "radial", "polynomial") |
| seed | Random seed for reproducibility |
| ... | Additional arguments passed to the underlying classifier |
The returned classifier object has the following methods:
fit: Train the classifier on data
predict: Get class predictions
predict_prob: Get class probabilities
Get model coefficients (for LR)
Get feature importance (for tree-based)
A classifier object with fit, predict, and predict_prob methods
## Not run:
# Create a logistic regression classifier
clf <- create_classifier("LR", penalty = "l1")
# Train
clf$fit(X_train, y_train)
# Predict
predictions <- clf$predict(X_test)
probabilities <- clf$predict_prob(X_test)
## End(Not run)
List all available classifiers and their requirements
get_available_classifiers()
A data.frame with classifier information
get_available_classifiers()
Compute connection matrix showing overlap between two clusterings
get_connection_matrix(labels1, labels2, min_percent = 0.1)
| Argument | Description |
|---|---|
| labels1 | First clustering labels (e.g., low resolution) |
| labels2 | Second clustering labels (e.g., high resolution) |
| min_percent | Minimum percentage threshold to consider a connection (default: 0.1) |
Binary connection matrix indicating which clusters from labels2 are connected
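One plausible reading of this computation, sketched in base R: a high-resolution cluster "belongs" to a low-resolution cluster when more than `min_percent` of its cells fall there, and two high-resolution clusters are connected when they share a low-resolution parent. This interpretation, and the treatment of the diagonal, are assumptions; the package may differ in detail.

```r
labels1 <- c("x", "x", "x", "y", "y", "y")   # low resolution
labels2 <- c("1", "1", "2", "3", "3", "3")   # high resolution
min_percent <- 0.1

# Fraction of each labels2 cluster (columns) falling in each labels1 cluster (rows)
frac <- prop.table(table(labels1, labels2), margin = 2)
member <- unclass(frac > min_percent) * 1

# labels2 clusters sharing at least one labels1 parent are connected
conn <- crossprod(member) > 0
conn
```

Here clusters "1" and "2" both sit inside "x" and are connected, while "3" sits inside "y" and connects to neither.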
Compute pairwise distances between cluster centroids
get_distance_matrix(X, clusters, labels = NULL, method = "euclidean")
| Argument | Description |
|---|---|
| X | Expression matrix (cells x features) |
| clusters | Cluster assignments |
| labels | Optional: specific labels to include (in order) |
| method | Distance method: "euclidean", "manhattan", "cosine" |
Distance matrix between cluster centroids
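A base R sketch of centroid distances for the Euclidean case, assuming centroids are per-cluster feature means. Base `dist()` covers "euclidean" and "manhattan" but has no cosine option, so that method would need separate handling.

```r
# Four cells, two features, two clusters
X <- rbind(c(0, 0), c(0, 2), c(3, 4), c(3, 6))
clusters <- c("a", "a", "b", "b")

# Per-cluster means: column sums by cluster divided by cluster sizes
centroids <- rowsum(X, clusters) / as.vector(table(clusters))

# Pairwise Euclidean distances between centroids
D <- as.matrix(dist(centroids, method = "euclidean"))
D
```

The centroids are (0, 1) and (3, 5), so the distance between clusters "a" and "b" is 5.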
Extract top weighted features from a logistic regression classifier
get_top_markers(result, feature_names = NULL, top_n = 10)
| Argument | Description |
|---|---|
| result | scClustEval result object (from self_projection) |
| feature_names | Names of features (genes). If NULL, uses column indices |
| top_n | Number of top features per class (default: 10) |
Data frame with columns: class, feature, weight
Helper function to extract feature matrix with V4/V5 compatibility
GetExpressionMatrix(object, assay = NULL, slot = "data", features = NULL)
| Argument | Description |
|---|---|
| object | Seurat object |
| assay | Assay name |
| slot | Slot name: "data", "counts", "scale.data" |
| features | Features to extract |
Matrix (cells x features)
Make a name vector unique by adding suffix "_n"
make_unique_names(x)
| Argument | Description |
|---|---|
| x | Character vector with potential duplicates |
Character vector with unique names
make_unique_names(c("A", "B", "A", "C", "A"))
# Returns: c("A", "B", "A_1", "C", "A_2")
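The "_n" suffix scheme shown in the example can be reproduced in base R with `ave()`. This is a sketch of the behavior, with the hypothetical name `make_unique_sketch`; it differs from base `make.unique()`, which uses a "." separator.

```r
make_unique_sketch <- function(x) {
  # Running occurrence count of each value: 1 for the first "A", 2 for the second, ...
  occurrence <- ave(seq_along(x), x, FUN = seq_along)
  # First occurrences keep their name; later ones get "_1", "_2", ...
  ifelse(occurrence == 1, x, paste0(x, "_", occurrence - 1))
}

out <- make_unique_sketch(c("A", "B", "A", "C", "A"))
```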
Merge clusters based on group assignments
merge_clusters(labels, groups, label_map = NULL)
| Argument | Description |
|---|---|
| labels | Original cluster labels |
| groups | New group assignments (from cluster_adjacency_matrix) |
| label_map | Optional: named vector mapping old labels to new labels |
Factor with merged cluster labels
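Conceptually, merging is a lookup from each cell's old label to its group. A minimal sketch, assuming `groups` is a vector named by the original cluster labels (the exact input format the package expects is not stated here):

```r
labels <- c("0", "1", "2", "1", "0", "2")       # per-cell cluster labels
groups <- c("0" = 1L, "1" = 1L, "2" = 2L)       # clusters 0 and 1 fall in group 1

# Look up each cell's group by its old label, then drop the carried-over names
merged <- factor(unname(groups[as.character(labels)]))
merged
```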
Normalize confusion matrix relative to correctly classified cells
normalize_confmat_r1(cmat, mode = "1")
| Argument | Description |
|---|---|
| cmat | Confusion matrix (from calc_confusion_matrix). Rows represent true labels, columns represent predicted labels. |
| mode | Normalization mode: "1" (default, as in SCCAF) or "2" |
R1 normalization measures the confusion rate between cluster pairs relative to the number of correctly classified cells.
For each pair (i, j), compute:
$$R1_{ij} = \frac{C_{ij}}{C_{jj}}$$
where:
$C_{ij}$ = cells truly in cluster i but predicted as cluster j
$C_{jj}$ = cells truly in cluster j and correctly predicted (diagonal)
The ratio represents how many cells from cluster i
are misclassified as j, relative to the correctly classified cells in j.
A high R1 value indicates substantial confusion between the cluster pair.
Symmetric matrix of pairwise R1-normalized confusion values. Values typically range from 0 to >1 (can exceed 1 when misclassifications outnumber correct classifications).
Miao, Z., et al. (2020). Putative cell type discovery from single-cell gene expression data. Nature Methods.
cmat <- matrix(c(90, 5, 5, 3, 85, 12, 2, 10, 88), nrow = 3)
rownames(cmat) <- colnames(cmat) <- c("A", "B", "C")
normalize_confmat_r1(cmat)
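A base R sketch of the ratio described above. The symmetrization step is an assumption: taking the larger of the two directional ratios follows how the SCCAF reference implementation appears to handle its default mode, and may not match this package exactly.

```r
cmat <- matrix(c(90, 5, 5, 3, 85, 12, 2, 10, 88), nrow = 3)
rownames(cmat) <- colnames(cmat) <- c("A", "B", "C")

# Directional ratio: cells of i predicted as j, relative to correct cells of j
ratio <- sweep(cmat, 2, diag(cmat), "/")

# Assumed symmetrization: keep the larger of ratio[i, j] and ratio[j, i]
r1 <- pmax(ratio, t(ratio))
diag(r1) <- 0
r1
```

For the pair (A, B), the two directional ratios are 3/85 and 5/90, so the symmetric entry is 5/90.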
Normalize confusion matrix relative to total cell count
normalize_confmat_r2(cmat)
| Argument | Description |
|---|---|
| cmat | Confusion matrix (from calc_confusion_matrix). Rows represent true labels, columns represent predicted labels. |
R2 normalization measures the overall impact of confusion between cluster pairs on the entire dataset.
For each pair (i, j), compute:
$$R2_{ij} = \frac{C_{ij} + C_{ji}}{N}$$
where:
$C_{ij}$ = cells truly in cluster i but predicted as cluster j
$C_{ji}$ = cells truly in cluster j but predicted as cluster i
$N$ = total number of cells in the test set
This gives the fraction of total cells that are confused between clusters i and j. Unlike R1, R2 values are always between 0 and 1.
Symmetric matrix of pairwise R2-normalized confusion values. Values range from 0 to 1, representing the fraction of total cells misclassified between each cluster pair.
Miao, Z., et al. (2020). Putative cell type discovery from single-cell gene expression data. Nature Methods.
cmat <- matrix(c(90, 5, 5, 3, 85, 12, 2, 10, 88), nrow = 3)
rownames(cmat) <- colnames(cmat) <- c("A", "B", "C")
normalize_confmat_r2(cmat)
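The description above (cells confused in either direction, divided by the total cell count) can be sketched in base R as follows. Zeroing the diagonal is an assumption, made so that only off-diagonal confusion is reported.

```r
cmat <- matrix(c(90, 5, 5, 3, 85, 12, 2, 10, 88), nrow = 3)
rownames(cmat) <- colnames(cmat) <- c("A", "B", "C")

# Cells confused between i and j in either direction, as a fraction of all cells
r2 <- (cmat + t(cmat)) / sum(cmat)
diag(r2) <- 0   # assumption: correct classifications are not reported
r2
```

With 300 cells in total, the (A, B) entry is (3 + 5) / 300.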
Wrapper function for confusion matrix normalization
normalize_confusion_matrix(cmat, method = "R1", mode = "1")
| Argument | Description |
|---|---|
| cmat | Confusion matrix |
| method | Normalization method: "R1" or "R2" |
| mode | For R1: normalization mode "1" or "2" |
Normalized confusion matrix
Calculate the prediction confidence for each cell
per_cell_accuracy(X, y, clf)
| Argument | Description |
|---|---|
| X | Feature matrix (cells x features) |
| y | True cell labels |
| clf | Trained classifier object |
Numeric vector of per-cell accuracy scores
Calculate the classification accuracy for each cluster
per_cluster_accuracy(cmat)
| Argument | Description |
|---|---|
| cmat | Confusion matrix (rows = true, cols = predicted) |
Named numeric vector of per-cluster accuracies
cmat <- matrix(c(90, 5, 5, 3, 85, 12, 2, 10, 88), nrow = 3)
rownames(cmat) <- colnames(cmat) <- c("A", "B", "C")
per_cluster_accuracy(cmat)
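With rows as true labels, a natural per-cluster accuracy is the diagonal divided by the row sums (per-class recall). That definition is an assumption about what the function computes; the sketch below applies it to the same matrix.

```r
cmat <- matrix(c(90, 5, 5, 3, 85, 12, 2, 10, 88), nrow = 3)
rownames(cmat) <- colnames(cmat) <- c("A", "B", "C")

# Correctly classified cells of each true cluster, over all cells of that cluster
acc <- diag(cmat) / rowSums(cmat)
acc
```

Cluster B has 85 correct cells out of 5 + 85 + 10 = 100, giving 0.85.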
Create a comprehensive summary plot of assessment results
plot_assessment_summary(result, include = c("roc", "confusion", "accuracy"))
| Argument | Description |
|---|---|
| result | scClustEval result object |
| include | Which plots to include: combination of "roc", "confusion", "accuracy" |
A combined ggplot2 object
Mark cluster centroids on an embedding plot
plot_cluster_centers(
  embeddings, labels,
  point_color = "white", point_size = 8, point_alpha = 0.6,
  add_to_plot = NULL
)
| Argument | Description |
|---|---|
| embeddings | Matrix of 2D embeddings (cells x 2) |
| labels | Cluster labels for each cell |
| point_color | Color for centroid markers (default: "white") |
| point_size | Size of centroid markers (default: 8) |
| point_alpha | Transparency of markers (default: 0.6) |
| add_to_plot | If provided, add to existing ggplot object |
A ggplot2 object with cluster centroids marked
Draw lines between cluster centroids on embedding space based on confusion matrix values. This visualizes which clusters are frequently confused.
plot_cluster_links(
  embeddings, labels, confusion_matrix,
  threshold = 0, line_color = "#ffa500", line_scale = 10,
  show_labels = TRUE, label_size = 4,
  point_color = "white", point_size = 5, point_alpha = 0.7,
  add_to_plot = NULL
)
| Argument | Description |
|---|---|
| embeddings | Matrix of 2D embeddings (cells x 2), e.g., UMAP or t-SNE coordinates |
| labels | Cluster labels for each cell |
| confusion_matrix | Confusion/connection matrix between clusters (e.g., R1-normalized) |
| threshold | Only draw lines for values above this threshold (default: 0) |
| line_color | Color for connection lines (default: "#ffa500", orange) |
| line_scale | Scale factor for line width (default: 10) |
| show_labels | Show cluster labels at centroids (default: TRUE) |
| label_size | Size of cluster labels (default: 4) |
| point_color | Color for centroid points (default: "white") |
| point_size | Size of centroid points (default: 5) |
| point_alpha | Transparency of centroid points (default: 0.7) |
| add_to_plot | If provided, add to existing ggplot object |
This function computes the median position (centroid) of each cluster in the embedding space, then draws lines between centroids where the confusion matrix value exceeds the threshold. Line width is proportional to the confusion value.
A ggplot2 object showing cluster connections
## Not run:
# Get UMAP coordinates from Seurat object
embeddings <- Seurat::Embeddings(seurat_obj, "umap")
labels <- seurat_obj$seurat_clusters
# Run assessment and get R1 matrix
result <- sc_assessment(X, labels)
# Plot connections
plot_cluster_links(embeddings, labels, result$r1_normalized, threshold = 0.1)
## End(Not run)
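The median-position centroids that the lines are drawn between can be computed in base R with `aggregate()`. This is a sketch on made-up coordinates; the column names `UMAP_1` and `UMAP_2` are illustrative only.

```r
# Five cells in a 2-D embedding, two clusters
embeddings <- cbind(UMAP_1 = c(0, 1, 2, 10, 12),
                    UMAP_2 = c(0, 1, 5, 10, 14))
labels <- c("a", "a", "a", "b", "b")

# Per-cluster median of each embedding coordinate
centroids <- aggregate(embeddings, by = list(cluster = labels), FUN = median)
centroids
```

Using the median rather than the mean keeps a centroid from being pulled toward a few outlying cells, such as the (2, 5) point in cluster "a".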
Visualize cluster reassignment between rounds or annotations
plot_cluster_sankey(
  labels_from, labels_to,
  title = "Cluster Reassignment", colors = NULL, alpha = 0.6
)
| Argument | Description |
|---|---|
| labels_from | Original cluster labels |
| labels_to | New cluster labels |
| title | Plot title |
| colors | Color palette |
| alpha | Transparency of flows |
A ggplot2 object (using ggalluvial if available)
Visualize confusion matrix as a heatmap
plot_confusion_heatmap(
  result, normalized = "R1", title = NULL,
  colors = c("white", "gray20"), show_values = TRUE, text_size = 3
)
| Argument | Description |
|---|---|
| result | scClustEval result object, or a confusion matrix directly |
| normalized | Which matrix to plot: "raw", "R1", or "R2" |
| title | Plot title |
| colors | Color gradient (low to high) |
| show_values | Show values in cells |
| text_size | Size of text in cells |
A ggplot2 object
Combined plot showing cells colored by cluster with confusion-based connections
plot_embedding_with_links(
  embeddings, labels, confusion_matrix,
  threshold = 0.1, title = NULL, colors = NULL, point_size = 0.5,
  line_color = "#ffa500", line_scale = 10, show_legend = TRUE
)
| Argument | Description |
|---|---|
| embeddings | Matrix of 2D embeddings (cells x 2) |
| labels | Cluster labels for each cell |
| confusion_matrix | Confusion matrix between clusters |
| threshold | Threshold for drawing connection lines (default: 0.1) |
| title | Plot title |
| colors | Optional color palette for clusters |
| point_size | Size of cell points (default: 0.5) |
| line_color | Color for connection lines (default: "#ffa500") |
| line_scale | Scale factor for line widths (default: 10) |
| show_legend | Show cluster legend (default: TRUE) |
This function creates a scatter plot of cells colored by cluster assignment, with lines connecting cluster centroids based on confusion matrix values. Useful for visualizing which clusters are transcriptionally similar and frequently confused by the classifier.
A ggplot2 object
## Not run:
# With Seurat object
embeddings <- Seurat::Embeddings(seurat_obj, "umap")
labels <- seurat_obj$seurat_clusters
result <- sc_assessment(GetAssayData(seurat_obj), labels)
plot_embedding_with_links(
  embeddings, labels, result$r1_normalized,
  threshold = 0.1, title = "UMAP with Confusion Links"
)
## End(Not run)
Visualize the progression of optimization rounds
plot_optimization_history(result, metric = "both")
| Argument | Description |
|---|---|
| result | scClustEval_optim result object from sc_optimize_all |
| metric | Which metric to plot: "accuracy", "clusters", or "both" |
A ggplot2 object
Plot ROC and Precision-Recall curves for assessment results
plot_roc(
  result, plot_type = "both",
  show_auc = TRUE, show_cv = TRUE, show_acc = TRUE,
  colors = NULL, title = NULL, legend_position = "right"
)
| Argument | Description |
|---|---|
| result | scClustEval result object from self_projection/sc_assessment |
| plot_type | Type of plot: "both", "roc", or "prc" |
| show_auc | Show AUC values on plot |
| show_cv | Show cross-validation accuracy |
| show_acc | Show test accuracy |
| colors | Custom color palette |
| title | Plot title |
| legend_position | Legend position: "right", "bottom", or "none" |
A ggplot2 object
## Not run:
result <- sc_assessment(X, labels)
plot_roc(result)
plot_roc(result, plot_type = "roc")
## End(Not run)
Plot method for scClustEval objects
## S3 method for class 'scClustEval'
plot(x, type = "roc", ...)
| Argument | Description |
|---|---|
| x | scClustEval object |
| type | Plot type: "roc", "confusion", "accuracy", "summary" |
| ... | Additional arguments passed to specific plot functions |
Visualize clustering confusion on Seurat UMAP/t-SNE embedding
PlotConfusionLinks(
  object, result,
  reduction = "umap", cluster_col = NULL, matrix_type = "R1",
  threshold = 0.1, line_color = "#ffa500", line_scale = 10,
  point_size = 0.5, title = NULL, show_legend = TRUE
)
| Argument | Description |
|---|---|
| object | Seurat object |
| result | scClustEval assessment result |
| reduction | Name of reduction to plot (default: "umap") |
| cluster_col | Cluster column to use (default: uses result info or Idents) |
| matrix_type | Which confusion matrix to use: "R1" or "R2" |
| threshold | Threshold for drawing lines (default: 0.1) |
| line_color | Color for connection lines |
| line_scale | Scale factor for line widths |
| point_size | Size of cell points |
| title | Plot title |
| show_legend | Show cluster legend |
This function creates a UMAP/t-SNE plot showing cells colored by cluster, with lines connecting cluster centroids based on confusion matrix values. This helps identify which clusters are transcriptionally similar.
A ggplot2 object
## Not run:
# Run assessment
result <- RunAssessment(seurat_obj)
# Plot UMAP with confusion links
PlotConfusionLinks(seurat_obj, result, threshold = 0.1)
# Use t-SNE instead
PlotConfusionLinks(seurat_obj, result, reduction = "tsne")
## End(Not run)
Print method for scClustEval
## S3 method for class 'scClustEval'
print(x, ...)
| Argument | Description |
|---|---|
| x | scClustEval object |
| ... | Additional arguments (ignored) |
Print method for classifier
## S3 method for class 'scClustEval_classifier'
print(x, ...)
| Argument | Description |
|---|---|
| x | Classifier object |
| ... | Additional arguments |
Print method for optimization result
## S3 method for class 'scClustEval_optim'
print(x, ...)
| Argument | Description |
|---|---|
| x | scClustEval_optim object |
| ... | Additional arguments (ignored) |
One-liner for quick clustering quality check
QuickAssess(object, ...)
| Argument | Description |
|---|---|
| object | Seurat object |
| ... | Additional arguments passed to RunAssessment |
Prints summary and returns accuracy
## Not run:
# Quick check
accuracy <- QuickAssess(seurat_obj)
## End(Not run)
Assess clustering quality directly on a Seurat object
RunAssessment(
  object, cluster_col = NULL, assay = NULL,
  use = "pca", dims = 1:50, features = NULL,
  classifier = "LR", penalty = "l1",
  test_size = 0.5, n_per_class = 100, cv = 5,
  seed = 1, n_cores = NULL, verbose = TRUE, ...
)
| Argument | Description |
|---|---|
| object | Seurat object |
| cluster_col | Column name in meta.data containing cluster labels (default: uses Idents) |
| assay | Assay to use (default: DefaultAssay) |
| use | Feature space to use: "raw" (normalized data), "pca", or other reduction name |
| dims | Dimensions to use for PCA/reduction (default: 1:50) |
| features | Optional: specific features to use. If NULL, uses all (for raw) or VariableFeatures |
| classifier | Classifier type: "LR", "RF", etc. |
| penalty | For LR: "l1" or "l2" |
| test_size | Fraction for test set |
| n_per_class | Max samples per class |
| cv | Cross-validation folds |
| seed | Random seed |
| n_cores | Number of cores |
| verbose | Print progress |
| ... | Additional arguments passed to self_projection |
This function extracts the appropriate data from the Seurat object and runs self_projection assessment. By default, it uses PCA coordinates if available, otherwise normalized data.
For Seurat V4, uses GetAssayData with slots. For Seurat V5, automatically uses LayerData when appropriate.
An scClustEval result object
self_projection, RunOptimization
## Not run:
# Assess default clustering
result <- RunAssessment(seurat_obj)
# Assess specific clustering with PCA
result <- RunAssessment(
  seurat_obj,
  cluster_col = "seurat_clusters",
  use = "pca",
  dims = 1:30
)
# Assess with raw expression
result <- RunAssessment(
  seurat_obj,
  cluster_col = "manual_annotation",
  use = "raw"
)
## End(Not run)
Optimize clustering directly on a Seurat object
RunOptimization(
  object, cluster_col,
  result_col = "scClustEval_clusters", prefix = "scClustEval",
  store_rounds = TRUE, assay = NULL,
  use = "pca", dims = 1:50, features = NULL,
  min_accuracy = 0.9, max_rounds = 10,
  classifier = "LR", penalty = "l1",
  test_size = 0.5, n_per_class = 100, cv = 5, n_iter = 3,
  r1_cutoff = 0.5, r2_cutoff = 0.05, under_cluster_col = NULL,
  seed = 1, n_cores = NULL, verbose = TRUE, ...
)
| Argument | Description |
|---|---|
| object | Seurat object |
| cluster_col | Column name containing initial over-clustering |
| result_col | Column name to store final optimized clustering |
| prefix | Prefix for intermediate round columns (default: "scClustEval") |
| store_rounds | Whether to store intermediate round results |
| assay | Assay to use |
| use | Feature space: "raw", "pca", etc. |
| dims | Dimensions for PCA/reduction |
| features | Specific features to use |
| min_accuracy | Target accuracy |
| max_rounds | Maximum rounds |
| classifier | Classifier type |
| penalty | For LR: regularization type |
| test_size | Test fraction |
| n_per_class | Max samples per class |
| cv | CV folds |
| n_iter | Confusion matrix iterations |
| r1_cutoff | Initial R1 cutoff |
| r2_cutoff | Initial R2 cutoff |
| under_cluster_col | Optional: column with under-clustering constraint |
| seed | Random seed |
| n_cores | Number of cores |
| verbose | Print progress |
| ... | Additional arguments |
The function modifies the Seurat object by adding:
Final optimized clustering in result_col (default: "scClustEval_clusters")
Intermediate round results (if store_rounds = TRUE)
Cluster reliability scores (optional)
Modified Seurat object with optimized clustering added to meta.data
sc_optimize_all, RunAssessment
## Not run:
# Basic optimization
seurat_obj <- RunOptimization(
  seurat_obj,
  cluster_col = "seurat_clusters",
  min_accuracy = 0.9
)
# With under-clustering constraint
seurat_obj <- RunOptimization(
  seurat_obj,
  cluster_col = "high_res_clusters",
  under_cluster_col = "low_res_clusters",
  min_accuracy = 0.95
)
## End(Not run)
Main assessment function with user-friendly interface
sc_assessment(
  X, labels,
  classifier = "LR", penalty = "l1", lambda = NULL,
  test_size = 0.5, n_per_class = 100, cv = 5,
  seed = 1, n_cores = NULL, verbose = TRUE
)
| Argument | Description |
|---|---|
| X | Expression/feature matrix (cells x features). Can be sparse. |
| labels | Cluster labels for each cell |
| classifier | Classifier type: "LR", "RF", "SVM", "NB", "DT", "XGB", "RANGER" |
| penalty | For LR: regularization type "l1", "l2", or "elasticnet" |
| lambda | For LR: regularization strength. If NULL, uses CV to select |
| test_size | Fraction of data for testing (default: 0.5) |
| n_per_class | Maximum samples per class in training set. If NULL, uses test_size |
| cv | Number of cross-validation folds on training set (0 to skip CV) |
| seed | Random seed for reproducibility |
| n_cores | Number of cores for parallel processing (NULL = auto-detect) |
| verbose | Print progress messages |
An scClustEval result object
self_projection, RunAssessment
## Not run:
# Assess clustering quality
result <- sc_assessment(
  X = expression_matrix,
  labels = seurat_object$seurat_clusters
)
# Print summary
print(result)
# Plot ROC curves
plot_roc(result)
## End(Not run)
Perform one round of confusion-matrix-guided cluster merging
sc_optimize( X, labels, classifier = "LR", penalty = "l1", lambda = NULL, test_size = 0.5, n_per_class = 100, cv = 5, n_iter = 3, r1_cutoff = 0.1, r2_cutoff = 0.05, r1_mode = "1", use_r1_only = FALSE, use_r2_only = FALSE, use_distance = FALSE, dist_cutoff = 8, use_projection = FALSE, connection_matrix = NULL, resolution = 1, seed = 1, n_cores = NULL, verbose = TRUE )sc_optimize( X, labels, classifier = "LR", penalty = "l1", lambda = NULL, test_size = 0.5, n_per_class = 100, cv = 5, n_iter = 3, r1_cutoff = 0.1, r2_cutoff = 0.05, r1_mode = "1", use_r1_only = FALSE, use_r2_only = FALSE, use_distance = FALSE, dist_cutoff = 8, use_projection = FALSE, connection_matrix = NULL, resolution = 1, seed = 1, n_cores = NULL, verbose = TRUE )
X |
Expression/feature matrix (cells x features) |
labels |
Current cluster labels |
classifier |
Classifier type: "LR", "RF", "SVM", etc. |
penalty |
For LR: regularization type |
lambda |
For LR: regularization strength |
test_size |
Fraction for test set |
n_per_class |
Max samples per class in training |
cv |
Cross-validation folds (0 to skip) |
n_iter |
Number of sampling iterations for confusion matrix (default: 3) |
r1_cutoff |
Threshold for R1-normalized confusion (default: 0.1) |
r2_cutoff |
Threshold for R2-normalized confusion (default: 0.05) |
r1_mode |
R1 normalization mode: "1" or "2" (default: "1", as in SCCAF) |
use_r1_only |
Use only R1 normalization for merging decisions |
use_r2_only |
Use only R2 normalization for merging decisions |
use_distance |
Use distance matrix in merging decision (default: FALSE) |
dist_cutoff |
Distance cutoff for merging (default: 8.0) |
use_projection |
Use self-projection labels for subsequent iterations (default: FALSE) |
connection_matrix |
Optional connection matrix for constrained merging |
resolution |
Louvain resolution for merging (default: 1.0) |
seed |
Random seed |
n_cores |
Number of cores for parallel processing |
verbose |
Print progress |
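The r1_cutoff and r2_cutoff thresholds act on two normalized versions of the confusion matrix. The sketch below is one plausible reading of those normalizations, modelled on how the SCCAF Python package describes them; it is not taken from the scClustEval source. The helper names `normalize_r1` and `normalize_r2`, the orientation of the matrix (rows are true labels, columns are predicted labels), and the toy counts are all assumptions made for illustration.

```r
# Hypothetical helpers: names and formulas are assumptions, not package API.
# cmat: square confusion matrix, rows = true cluster, columns = predicted.

normalize_r1 <- function(cmat, mode = "1") {
  d <- diag(cmat)                      # correctly classified cells per cluster
  k <- nrow(cmat)
  out <- matrix(0, k, k, dimnames = dimnames(cmat))
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      val <- if (mode == "1") {
        max(cmat[i, j] / d[j], cmat[j, i] / d[i])
      } else {
        max(cmat[i, j] / d[i], cmat[j, i] / d[j])
      }
      out[i, j] <- out[j, i] <- val    # symmetric pairwise confusion
    }
  }
  out
}

normalize_r2 <- function(cmat) {
  frac <- cmat / rowSums(cmat)         # row i divided by total cells of cluster i
  diag(frac) <- 0
  pmax(frac, t(frac))                  # keep the larger of the two directions
}

# Toy confusion matrix for three clusters (made-up counts)
cmat <- matrix(c(40,  8,  2,
                  6, 44,  0,
                  1,  0, 49),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

r1 <- normalize_r1(cmat, mode = "1")
r2 <- normalize_r2(cmat)
max_r1 <- max(r1)   # compare against r1_cutoff
max_r2 <- max(r2)   # compare against r2_cutoff
```

Under these assumptions, clusters A and B are the most confused pair in both matrices, so they would be the first candidates for merging.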
A list containing:
Merged cluster labels
Original cluster labels
Mapping from old clusters to new groups
Test accuracy from assessment
CV accuracy
Maximum R1 confusion after this round
Maximum R2 confusion after this round
Aggregated R1-normalized confusion matrix
Aggregated R2-normalized confusion matrix
Number of clusters before merging
Number of clusters after merging
Whether optimization has converged (no merging possible)
List of assessment results from iterations
Self-projection predicted labels (if use_projection=TRUE)
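To make the mapping from old clusters to new groups concrete, the following sketch thresholds a normalized confusion matrix and merges every set of clusters linked by above-cutoff confusion. The resolution argument indicates that the package groups clusters with Louvain clustering; this sketch deliberately swaps in plain connected components, which needs only base R. How the R1 and R2 criteria are combined is not stated in this section, so the hypothetical helper `merge_confused` takes a single matrix and a single cutoff.

```r
# Hypothetical helper, not part of scClustEval.
# Connected components stand in for the Louvain step used by the package.
merge_confused <- function(conf, cutoff) {
  k <- nrow(conf)
  adj <- conf > cutoff                 # edge = pair confused above the cutoff
  diag(adj) <- FALSE
  group <- integer(k)                  # 0 means "not yet assigned"
  next_id <- 0L
  for (start in seq_len(k)) {
    if (group[start] != 0L) next
    next_id <- next_id + 1L
    queue <- start
    while (length(queue) > 0) {        # breadth-first walk over confused pairs
      node <- queue[1]
      queue <- queue[-1]
      if (group[node] != 0L) next
      group[node] <- next_id
      queue <- c(queue, which(adj[node, ] & group == 0L))
    }
  }
  setNames(group, rownames(conf))
}

# Toy symmetric confusion matrix for four clusters (made-up values)
conf <- matrix(0, 4, 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
conf["A", "B"] <- conf["B", "A"] <- 0.18
conf["C", "D"] <- conf["D", "C"] <- 0.02

mapping <- merge_confused(conf, cutoff = 0.1)

# Apply the mapping to per-cell labels
old_labels <- c("A", "B", "B", "C", "D", "A")
new_labels <- unname(mapping[old_labels])
```

With a cutoff of 0.1, only A and B are merged; C and D stay separate because their confusion of 0.02 is below the threshold.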
Iteratively optimize clustering until target accuracy is reached
sc_optimize_all(
  X, labels,
  min_accuracy = 0.9, max_rounds = 10,
  classifier = "LR", penalty = "l1", lambda = NULL,
  test_size = 0.5, n_per_class = 100, cv = 5, n_iter = 3,
  r1_cutoff = 0.5, r2_cutoff = 0.05,
  r1_step = 0.01, r2_step = 0.001, r1_mode = "1",
  use_r1_only = FALSE, use_r2_only = FALSE,
  use_distance = FALSE, dist_cutoff = 8,
  use_projection = FALSE, under_cluster_labels = NULL,
  min_outer_iter = 3, seed = 1, n_cores = NULL, verbose = TRUE
)
X |
Expression/feature matrix (cells x features) |
labels |
Initial cluster labels (should be over-clustered) |
min_accuracy |
Target minimum accuracy to achieve (default: 0.9) |
max_rounds |
Maximum optimization rounds (default: 10) |
classifier |
Classifier type |
penalty |
For LR: regularization type |
lambda |
For LR: regularization strength |
test_size |
Fraction for test set |
n_per_class |
Max samples per class |
cv |
CV folds |
n_iter |
Iterations per round for confusion matrix |
r1_cutoff |
Initial R1 cutoff (default: 0.5) |
r2_cutoff |
Initial R2 cutoff (default: 0.05) |
r1_step |
Step to reduce R1 cutoff each outer iteration (default: 0.01) |
r2_step |
Step to reduce R2 cutoff each outer iteration (default: 0.001) |
r1_mode |
R1 normalization mode: "1" or "2" (default: "1") |
use_r1_only |
Use only R1 for merging |
use_r2_only |
Use only R2 for merging |
use_distance |
Use distance matrix in merging decisions (default: FALSE) |
dist_cutoff |
Distance cutoff for merging (default: 8.0) |
use_projection |
Use self-projection labels for subsequent iterations (default: FALSE) |
under_cluster_labels |
Optional: under-clustering labels as constraint |
min_outer_iter |
Minimum outer iterations before allowing convergence |
seed |
Random seed |
n_cores |
Number of cores |
verbose |
Print progress |
The optimization proceeds in two levels:
Outer iterations: Progressively lower the R1/R2 cutoffs
Inner rounds: Merge clusters based on current cutoffs
The process continues until:
Target accuracy is reached
Maximum rounds are exceeded
No more clusters can be merged
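The two-level control flow described above can be sketched roughly as follows. This is a simplified illustration, not the package's implementation: `optimize_sketch` and `toy_round` are hypothetical names, the returned field names other than `final_labels` are invented, and the exact interplay between min_outer_iter and the stopping rules is an assumption. The toy round function simply pretends that the two highest-numbered clusters are confused at a level of 0.495.

```r
# Hypothetical driver: mirrors the documented stopping rules in simplified form.
optimize_sketch <- function(one_round, labels, min_accuracy = 0.9,
                            max_rounds = 10, r1_cutoff = 0.5, r2_cutoff = 0.05,
                            r1_step = 0.01, r2_step = 0.001,
                            min_outer_iter = 3) {
  rounds <- 0L
  outer <- 0L
  accuracy <- NA_real_
  reached <- FALSE
  while (rounds < max_rounds && !reached) {
    outer <- outer + 1L
    merged_any <- FALSE
    # Inner rounds: keep merging at the current cutoffs
    while (rounds < max_rounds) {
      res <- one_round(labels, r1_cutoff, r2_cutoff)
      rounds <- rounds + 1L
      labels <- res$labels
      accuracy <- res$accuracy
      reached <- accuracy >= min_accuracy
      if (reached || res$converged) break
      merged_any <- TRUE
    }
    # Nothing merged: stop, but only after min_outer_iter outer iterations
    if (!merged_any && outer >= min_outer_iter) break
    # Outer step: lower both cutoffs, never below zero
    r1_cutoff <- max(r1_cutoff - r1_step, 0)
    r2_cutoff <- max(r2_cutoff - r2_step, 0)
  }
  list(final_labels = labels, accuracy = accuracy,
       n_rounds = rounds, reached = reached)
}

# Toy stand-in for one merging round (made-up behaviour, for illustration only)
toy_round <- function(labels, r1_cutoff, r2_cutoff) {
  k <- length(unique(labels))
  can_merge <- k > 3 && r1_cutoff < 0.495
  if (can_merge) labels[labels == k] <- k - 1L
  k_new <- length(unique(labels))
  list(labels = labels,
       accuracy = 1 - 0.04 * (k_new - 1),
       converged = !can_merge)
}

res <- optimize_sketch(toy_round, labels = rep(1:6, each = 5))
```

In this toy run nothing merges at the initial R1 cutoff of 0.5, so the first outer iteration only lowers the cutoffs; merging starts in the second outer iteration and stops once the toy accuracy passes 0.9.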
A list containing:
Final optimized cluster labels
Initial cluster labels
List of results from each round
Vector of accuracies per round
Vector of cluster counts per round
Final achieved accuracy
Number of rounds performed
Whether target accuracy was reached
## Not run:
# Optimize over-clustered result
result <- sc_optimize_all(
  X = expression_matrix,
  labels = over_clustered_labels,
  min_accuracy = 0.9,
  classifier = "LR"
)

# Get final labels
final_clusters <- result$final_labels
## End(Not run)
Alias for self_projection for compatibility with SCCAF Python package
SCCAF_assessment(...)
... |
Arguments passed to |
A list (class "scClustEval") containing:
Overall accuracy on test set
Mean cross-validation accuracy (if cv > 0)
Accuracy on training set
Predicted labels for test set
True labels for test set
Prediction probabilities (matrix)
Confusion matrix
R1-normalized confusion matrix
R2-normalized confusion matrix
Per-cluster accuracy
Trained classifier object
Unique class labels
Maximum R1 confusion value
Maximum R2 confusion value
Alias for sc_optimize for compatibility
SCCAF_optimize(
  X, labels,
  classifier = "LR", penalty = "l1", lambda = NULL,
  test_size = 0.5, n_per_class = 100, cv = 5, n_iter = 3,
  r1_cutoff = 0.1, r2_cutoff = 0.05, r1_mode = "1",
  use_r1_only = FALSE, use_r2_only = FALSE,
  use_distance = FALSE, dist_cutoff = 8,
  use_projection = FALSE, connection_matrix = NULL,
  resolution = 1, seed = 1, n_cores = NULL, verbose = TRUE
)
X |
Expression/feature matrix (cells x features) |
labels |
Current cluster labels |
classifier |
Classifier type: "LR", "RF", "SVM", etc. |
penalty |
For LR: regularization type |
lambda |
For LR: regularization strength |
test_size |
Fraction for test set |
n_per_class |
Max samples per class in training |
cv |
Cross-validation folds (0 to skip) |
n_iter |
Number of sampling iterations for confusion matrix (default: 3) |
r1_cutoff |
Threshold for R1-normalized confusion (default: 0.1) |
r2_cutoff |
Threshold for R2-normalized confusion (default: 0.05) |
r1_mode |
R1 normalization mode: "1" or "2" (default: "1", as in SCCAF) |
use_r1_only |
Use only R1 normalization for merging decisions |
use_r2_only |
Use only R2 normalization for merging decisions |
use_distance |
Use distance matrix in merging decision (default: FALSE) |
dist_cutoff |
Distance cutoff for merging (default: 8.0) |
use_projection |
Use self-projection labels for subsequent iterations (default: FALSE) |
connection_matrix |
Optional connection matrix for constrained merging |
resolution |
Louvain resolution for merging (default: 1.0) |
seed |
Random seed |
n_cores |
Number of cores for parallel processing |
verbose |
Print progress |
Alias for sc_optimize_all for compatibility
SCCAF_optimize_all(
  X, labels,
  min_accuracy = 0.9, max_rounds = 10,
  classifier = "LR", penalty = "l1", lambda = NULL,
  test_size = 0.5, n_per_class = 100, cv = 5, n_iter = 3,
  r1_cutoff = 0.5, r2_cutoff = 0.05,
  r1_step = 0.01, r2_step = 0.001, r1_mode = "1",
  use_r1_only = FALSE, use_r2_only = FALSE,
  use_distance = FALSE, dist_cutoff = 8,
  use_projection = FALSE, under_cluster_labels = NULL,
  min_outer_iter = 3, seed = 1, n_cores = NULL, verbose = TRUE
)
X |
Expression/feature matrix (cells x features) |
labels |
Initial cluster labels (should be over-clustered) |
min_accuracy |
Target minimum accuracy to achieve (default: 0.9) |
max_rounds |
Maximum optimization rounds (default: 10) |
classifier |
Classifier type |
penalty |
For LR: regularization type |
lambda |
For LR: regularization strength |
test_size |
Fraction for test set |
n_per_class |
Max samples per class |
cv |
CV folds |
n_iter |
Iterations per round for confusion matrix |
r1_cutoff |
Initial R1 cutoff (default: 0.5) |
r2_cutoff |
Initial R2 cutoff (default: 0.05) |
r1_step |
Step to reduce R1 cutoff each outer iteration (default: 0.01) |
r2_step |
Step to reduce R2 cutoff each outer iteration (default: 0.001) |
r1_mode |
R1 normalization mode: "1" or "2" (default: "1") |
use_r1_only |
Use only R1 for merging |
use_r2_only |
Use only R2 for merging |
use_distance |
Use distance matrix in merging decisions (default: FALSE) |
dist_cutoff |
Distance cutoff for merging (default: 8.0) |
use_projection |
Use self-projection labels for subsequent iterations (default: FALSE) |
under_cluster_labels |
Optional: under-clustering labels as constraint |
min_outer_iter |
Minimum outer iterations before allowing convergence |
seed |
Random seed |
n_cores |
Number of cores |
verbose |
Print progress |
Core function for evaluating clustering quality using self-projection
self_projection(
  X, labels,
  classifier = "LR", penalty = "l1", lambda = NULL,
  test_size = 0.5, n_per_class = NULL, cv = 5,
  seed = 1, n_cores = NULL, verbose = TRUE
)
X |
Expression/feature matrix (cells x features). Can be sparse. |
labels |
Cluster labels for each cell |
classifier |
Classifier type: "LR", "RF", "SVM", "NB", "DT", "XGB", "RANGER" |
penalty |
For LR: regularization type "l1", "l2", or "elasticnet" |
lambda |
For LR: regularization strength. If NULL, uses CV to select |
test_size |
Fraction of data for testing (default: 0.5) |
n_per_class |
Maximum samples per class in training set. If NULL, uses test_size |
cv |
Number of cross-validation folds on training set (0 to skip CV) |
seed |
Random seed for reproducibility |
n_cores |
Number of cores for parallel processing (NULL = auto-detect) |
verbose |
Print progress messages |
The self-projection method works by:
Splitting data into training and test sets (stratified by cluster)
Training a classifier on the training set
Evaluating prediction accuracy on the held-out test set
Computing confusion matrices to identify poorly discriminated clusters
High accuracy indicates that clusters are well-separated. Pairs of clusters that are frequently confused may need to be merged.
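The four steps above can be illustrated end to end in base R. This is a minimal sketch, not the package's code: it uses the built-in iris data in place of an expression matrix, treats species as cluster labels, and swaps the package's classifiers for a simple nearest-centroid rule so that no extra dependencies are needed. The helper `predict_centroid` is hypothetical.

```r
set.seed(1)
X <- as.matrix(iris[, 1:4])            # stand-in for a cells x features matrix
labels <- as.character(iris$Species)   # stand-in for cluster labels

# 1. Stratified split: half of each cluster goes to the training set
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), floor(length(idx) / 2))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(labels), train_idx)

# 2. "Train" a nearest-centroid classifier: one mean profile per cluster
centroids <- t(sapply(
  split(as.data.frame(X[train_idx, ]), labels[train_idx]),
  colMeans
))

# Hypothetical helper: assign each row to the closest centroid
predict_centroid <- function(newX, centroids) {
  d <- sapply(rownames(centroids), function(cl) {
    rowSums(sweep(newX, 2, centroids[cl, ])^2)   # squared Euclidean distance
  })
  colnames(d)[max.col(-d, ties.method = "first")]
}

# 3. Evaluate on the held-out test set
pred <- predict_centroid(X[test_idx, ], centroids)
accuracy <- mean(pred == labels[test_idx])

# 4. Confusion matrix: off-diagonal counts flag poorly separated clusters
conf <- table(true = labels[test_idx], predicted = pred)
```

In this toy data the off-diagonal counts concentrate between versicolor and virginica, which is the kind of pair the optimization functions would consider for merging.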
A list (class "scClustEval") containing:
Overall accuracy on test set
Mean cross-validation accuracy (if cv > 0)
Accuracy on training set
Predicted labels for test set
True labels for test set
Prediction probabilities (matrix)
Confusion matrix
R1-normalized confusion matrix
R2-normalized confusion matrix
Per-cluster accuracy
Trained classifier object
Unique class labels
Maximum R1 confusion value
Maximum R2 confusion value
## Not run:
# Basic usage with expression matrix
result <- self_projection(
  X = expression_matrix,
  labels = cluster_assignments,
  classifier = "LR"
)
print(result$accuracy)

# With random forest
result <- self_projection(
  X = expression_matrix,
  labels = cluster_assignments,
  classifier = "RF",
  n_per_class = 100
)
## End(Not run)
Summary method for scClustEval
## S3 method for class 'scClustEval'
summary(object, ...)
object |
scClustEval object |
... |
Additional arguments (ignored) |
Summary method for optimization result
## S3 method for class 'scClustEval_optim'
summary(object, ...)
object |
scClustEval_optim object |
... |
Additional arguments (ignored) |
Wrapper around train_test_split_stratified for simple usage
train_test_split(X, y, test_size = 0.5, n_per_class = NULL, seed = NULL)
X |
Feature matrix (cells x features) |
y |
Class labels (factor or character vector) |
test_size |
Fraction of data for testing (default: 0.5) |
n_per_class |
Maximum number of samples per class in training set. If NULL, uses test_size fraction |
seed |
Random seed for reproducibility |
Split data into training and test sets while maintaining class proportions
train_test_split_stratified(
  X, y,
  test_size = 0.5, n_per_class = NULL, seed = NULL
)
X |
Feature matrix (cells x features) |
y |
Class labels (factor or character vector) |
test_size |
Fraction of data for testing (default: 0.5) |
n_per_class |
Maximum number of samples per class in training set. If NULL, uses test_size fraction |
seed |
Random seed for reproducibility |
A list with components:
Training feature matrix
Test feature matrix
Training labels
Test labels
Indices of training samples
Indices of test samples
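As a rough illustration of how a stratified split with a per-class cap can behave, the sketch below works on labels only and returns index vectors. It is not the package's implementation: the function name `stratified_split_sketch` and the component names `train_idx` and `test_idx` are hypothetical, and the rule that n_per_class caps the fraction-based training size is an assumption about how the two arguments interact.

```r
# Hypothetical sketch, not the scClustEval implementation.
stratified_split_sketch <- function(y, test_size = 0.5, n_per_class = NULL,
                                    seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    # Training size from the fraction, per class
    n_train <- floor(length(idx) * (1 - test_size))
    # Assumption: n_per_class acts as an upper bound on that size
    if (!is.null(n_per_class)) n_train <- min(n_train, n_per_class)
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  list(train_idx = sort(train_idx),
       test_idx = setdiff(seq_along(y), train_idx))
}

# Unbalanced toy labels: 200, 40 and 10 cells per class
y <- rep(c("a", "b", "c"), times = c(200, 40, 10))
s <- stratified_split_sketch(y, test_size = 0.5, n_per_class = 50, seed = 1)

train_counts <- table(y[s$train_idx])
test_counts <- table(y[s$test_idx])
```

Capping the training size keeps large clusters from dominating the classifier, while small clusters still contribute their test_size-based share.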
Internal utility functions for data processing and manipulation
Functions for plotting assessment and optimization results