---
title: "Introduction to scClustEval"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to scClustEval}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Overview

**scClustEval** (Single Cell Clustering Evaluation) is an R package for evaluating and optimizing single-cell RNA-seq clustering results using self-projection: a classifier is trained on a subset of the cells and then used to predict the cluster labels of the remaining cells.

The package implements an iterative optimization strategy that:

1. Trains a classifier to distinguish between clusters
2. Evaluates prediction accuracy via cross-validation
3. Identifies cluster pairs that are difficult to discriminate
4. Merges confused clusters iteratively until the target accuracy is reached

## Installation

```{r install, eval=FALSE}
# From GitHub (requires the devtools package)
devtools::install_github("Zaoqu-Liu/scClustEval")
```

## Quick Start

### Loading the package

```{r load, message=FALSE, warning=FALSE}
library(scClustEval)
```

### Basic Assessment with Matrix Input

```{r basic_assessment, eval=FALSE}
# Create example data
set.seed(42)
n_cells <- 500
n_features <- 100
n_clusters <- 5
cells_per_cluster <- n_cells / n_clusters

# Generate an expression matrix (cells x features) with cluster structure:
# the cells of cluster i are drawn from a normal distribution with mean i
X <- matrix(0, nrow = n_cells, ncol = n_features)
labels <- character(n_cells)

for (i in seq_len(n_clusters)) {
  idx <- ((i - 1) * cells_per_cluster + 1):(i * cells_per_cluster)
  X[idx, ] <- matrix(
    rnorm(cells_per_cluster * n_features, mean = i),
    nrow = cells_per_cluster
  )
  labels[idx] <- paste0("Cluster_", i)
}

# Run assessment
result <- sc_assessment(
  X = X,
  labels = labels,
  classifier = "LR",
  n_per_class = 50,
  cv = 5
)

# Print result
print(result)
```

### With Seurat Objects

```{r seurat_example, eval=FALSE}
library(Seurat)

# Load your Seurat object
seurat_obj <- readRDS("your_data.rds")

# Run assessment on existing clustering
result <- RunAssessment(
  seurat_obj,
  cluster_col = "seurat_clusters",
  use = "pca",
  dims = 1:30
)

# View results
print(result)
# Plot ROC curves
plot_roc(result)
```

## Clustering Optimization

### The Optimization Process

The optimization process works as follows:

1. Start with an over-clustered result (high resolution)
2. Assess the clustering using self-projection
3. Build a confusion matrix to identify confused cluster pairs
4. Merge clusters that cannot be well discriminated
5. Repeat until the target accuracy is reached

```{r optimization, eval=FALSE}
# Start with over-clustering
seurat_obj <- FindClusters(seurat_obj, resolution = 2.0)

# Run optimization
seurat_obj <- RunOptimization(
  seurat_obj,
  cluster_col = "seurat_clusters",
  min_accuracy = 0.9,
  result_col = "optimized_clusters"
)

# Compare before and after
DimPlot(seurat_obj, group.by = c("seurat_clusters", "optimized_clusters"))
```

## Visualization Functions

### ROC Curves

```{r roc, eval=FALSE}
# Plot ROC and Precision-Recall curves
plot_roc(result, plot_type = "both")

# ROC only
plot_roc(result, plot_type = "roc")
```

### Confusion Matrix Heatmap

```{r confusion, eval=FALSE}
# Raw confusion matrix
plot_confusion_heatmap(result, normalized = "raw")

# R1-normalized (used for merging decisions)
plot_confusion_heatmap(result, normalized = "R1")
```

### Optimization History

```{r history, eval=FALSE}
# Run optimization with matrix input.
# `initial_labels` must be defined beforehand: a vector of starting
# (over-clustered) labels, one per row of X.
optim_result <- sc_optimize_all(
  X = X,
  labels = initial_labels,
  min_accuracy = 0.9
)

# Plot optimization progress
plot_optimization_history(optim_result)
```

## Classifier Options

The package supports multiple classifiers:

| Classifier | Code | Description |
|------------|------|-------------|
| Logistic Regression | `"LR"` | L1/L2 regularized (default) |
| Random Forest | `"RF"` | Using randomForest package |
| Ranger | `"RANGER"` | Fast random forest |
| SVM | `"SVM"` | Support Vector Machine |
| Naive Bayes | `"NB"` | Gaussian Naive Bayes |
| Decision Tree | `"DT"` | Using rpart |
| XGBoost | `"XGB"` | Gradient boosting |

```{r classifiers, eval=FALSE}
# Using different classifiers
result_lr <-
  sc_assessment(X, labels, classifier = "LR")
result_rf <- sc_assessment(X, labels, classifier = "RF")
result_svm <- sc_assessment(X, labels, classifier = "SVM")
```

## Advanced Usage

### Using Constraints

You can constrain the optimization with an under-clustering (a low-resolution clustering) that serves as a boundary:

```{r constraints, eval=FALSE}
# Create low and high resolution clusterings.
# FindClusters() writes its result to "seurat_clusters", so copy each
# clustering into its own metadata column before running the next one.
seurat_obj <- FindClusters(seurat_obj, resolution = 0.2)
seurat_obj$low_res <- seurat_obj$seurat_clusters

seurat_obj <- FindClusters(seurat_obj, resolution = 2.0)
seurat_obj$high_res <- seurat_obj$seurat_clusters

# Optimize with constraint
seurat_obj <- RunOptimization(
  seurat_obj,
  cluster_col = "high_res",
  under_cluster_col = "low_res",  # Constraint
  min_accuracy = 0.95
)
```

### Parallel Processing

```{r parallel, eval=FALSE}
# Assessment uses parallel processing automatically
# Control with n_cores parameter
result <- sc_assessment(
  X, labels,
  n_cores = 4  # Use 4 cores
)
```

## Session Info

```{r session}
sessionInfo()
```

## References

This package is an R implementation inspired by the SCCAF Python package:

- Miao, Z., et al. (2020). Putative cell type discovery from single-cell gene expression data. *Nature Methods*.
- SCCAF GitHub: https://github.com/SCCAF/sccaf
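
## Appendix: Sketch of a Merge Step

The merge step listed under "The Optimization Process" (build a confusion matrix, find the most confused cluster pair, merge it) can be illustrated in base R. This is only a minimal sketch under stated assumptions, not the scClustEval implementation: `merge_most_confused()` is a hypothetical helper, and the row-wise normalization used here is an assumed stand-in that may differ from the package's R1 normalization.

```{r merge_sketch, eval=FALSE}
# Hypothetical helper (not part of scClustEval): merge the pair of clusters
# that a classifier confuses most often.
merge_most_confused <- function(true_labels, predicted_labels) {
  true_labels <- as.character(true_labels)
  predicted_labels <- as.character(predicted_labels)
  classes <- sort(unique(c(true_labels, predicted_labels)))

  # Confusion matrix: rows are true clusters, columns are predicted clusters
  conf <- table(
    factor(true_labels, levels = classes),
    factor(predicted_labels, levels = classes)
  )
  conf <- matrix(conf, nrow = length(classes), dimnames = list(classes, classes))

  # Assumed normalization: fraction of each true cluster assigned elsewhere
  rates <- conf / pmax(rowSums(conf), 1)
  diag(rates) <- 0

  # Symmetrize so that confusion in either direction counts
  rates <- pmax(rates, t(rates))

  # Nothing to merge if no cell was misclassified
  if (max(rates) == 0) {
    return(list(labels = true_labels, merged_pair = NULL, rate = 0))
  }

  # Pick the most confused pair and relabel the second cluster as the first
  pair <- sort(which(rates == max(rates), arr.ind = TRUE)[1, ])
  keep <- classes[pair[1]]
  drop <- classes[pair[2]]
  merged <- true_labels
  merged[merged == drop] <- keep

  list(labels = merged, merged_pair = c(keep, drop), rate = max(rates))
}

# Toy example: clusters B and C are frequently confused with each other
true_labels <- rep(c("A", "B", "C"), each = 4)
predicted   <- c("A", "A", "A", "A",
                 "B", "B", "C", "C",
                 "C", "C", "C", "B")

res <- merge_most_confused(true_labels, predicted)
res$merged_pair     # "B" "C"
unique(res$labels)  # "A" "B"
```

In an iterative scheme, a step like this would be followed by retraining the classifier on the merged labels and repeating until the cross-validated accuracy reaches the chosen threshold (the role `min_accuracy` plays in the examples above).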