---
title: "Introduction to MultiK"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Introduction to MultiK}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 10,
  fig.height = 5,
  warning = FALSE,
  message = FALSE,
  fig.align = "center",
  dpi = 100
)
```

## Overview

**MultiK** is a computational framework for objective determination of optimal cluster numbers in single-cell RNA sequencing (scRNA-seq) data. Selecting the appropriate number of clusters (K) is a fundamental challenge in unsupervised clustering analysis, and MultiK addresses this through a rigorous statistical approach.

### The Challenge

In scRNA-seq analysis, researchers often face the question: *"How many distinct cell populations exist in my data?"* Traditional approaches rely on:

- Arbitrary selection based on visualization
- Rule-of-thumb methods (e.g., elbow method)
- Biological prior knowledge

These approaches are subjective and may miss important biological structures or over-partition the data.

### The MultiK Solution

MultiK provides an objective, data-driven approach by:

1. **Subsampling-based consensus clustering** to assess clustering stability
2. **PAC (Proportion of Ambiguous Clustering)** metric to quantify clustering uncertainty
3. **SigClust statistical testing** to validate cluster separability

## Installation

```{r install, eval=FALSE}
# From R-universe (recommended)
install.packages("MultiK", repos = "https://zaoqu-liu.r-universe.dev")

# From GitHub
remotes::install_github("Zaoqu-Liu/MultiK")
```

## Quick Start

### Load Package and Data

```{r load}
library(MultiK)
library(Seurat)
library(ggplot2)

# Load example data
data(p3cl)
p3cl
```

The `p3cl` dataset contains approximately 2,600 cells from a three cell line mixture (H2228, H1975, HCC827), providing a benchmark with known ground truth (K=3).

### Step 1: Run MultiK Algorithm

```{r multik, cache=TRUE}
# Run with 50 subsampling iterations (use 100+ for production)
set.seed(42)
result <- MultiK(
  p3cl,
  reps = 50,
  resolution = seq(0.1, 1.5, 0.1),
  nPC = 20,
  cores = 1,
  seed = 42
)

# Check the distribution of K values
table(result$k)
```

### Step 2: Diagnostic Visualization

```{r diag, fig.width=12, fig.height=4}
# Generate diagnostic plots
DiagMultiKPlot(result$k, result$consensus)
```

**Interpretation of diagnostic plots:**

- **Left panel (Frequency)**: Shows how often each K value was observed across all clustering runs. K=3 appears most frequently, suggesting it's a stable solution.
- **Middle panel (rPAC)**: Relative PAC scores for each K. Lower values indicate more stable clustering. K=3 shows low rPAC.
- **Right panel (Trade-off)**: Combines frequency and stability. Points in the upper-right corner (high frequency, high stability) are optimal. The red point indicates the recommended K.

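
The stability scores above rest on two ideas: a consensus matrix built from repeated subsampled clusterings, and the PAC score computed from it. The chunk below is a minimal base-R sketch of both, not MultiK's internal implementation. The helper names `consensus_matrix()` and `pac_score()`, the toy label vectors, and the 0.1/0.9 thresholds are assumptions chosen for illustration.

```{r pac-sketch}
# Three hypothetical clustering runs over six cells. NA marks a cell that
# was left out of that run's subsample. These labels are invented.
runs <- list(
  c(1, 1, 1, 2, 2, 2),
  c(1, 1, NA, 2, 2, 2),
  c(1, 1, 2, 2, 2, NA)
)

# Consensus matrix: for each pair of cells, the fraction of runs in which
# both were sampled AND were assigned to the same cluster.
# (Assumes every pair is sampled together at least once.)
consensus_matrix <- function(runs) {
  n <- length(runs[[1]])
  together <- matrix(0, n, n)
  sampled <- matrix(0, n, n)
  for (labels in runs) {
    present <- !is.na(labels)
    both <- outer(present, present, "&")
    same <- outer(labels, labels, "==")
    same[is.na(same)] <- FALSE
    together <- together + (both & same)
    sampled <- sampled + both
  }
  together / sampled
}

# PAC: share of cell pairs whose consensus value is neither clearly 0 nor
# clearly 1. The thresholds 0.1 and 0.9 are an assumed, conventional choice.
pac_score <- function(consensus, lower = 0.1, upper = 0.9) {
  cdf <- ecdf(consensus[lower.tri(consensus)])
  cdf(upper) - cdf(lower)
}

cons <- consensus_matrix(runs)
round(cons, 2)
pac_score(cons)
```

In this toy input, cell 3 switches cluster in one run, so the four pairs involving it and its neighbours get a consensus of 0.5 and count as ambiguous. A K whose consensus matrix contains only values near 0 or 1 would score a PAC close to zero.
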
### Step 3: Extract Clusters

```{r clusters}
# Get cluster assignments at optimal K=3
clusters <- getClusters(p3cl, optK = 3, nPC = 20)

# View cluster distribution
table(clusters$clusters[, 1])

# Add clusters to Seurat object for visualization
p3cl$multik_clusters <- factor(clusters$clusters[, 1])
```

### Step 4: Visualize Clusters

```{r umap, fig.width=8, fig.height=6}
# Run UMAP for visualization
p3cl <- NormalizeData(p3cl, verbose = FALSE)
p3cl <- FindVariableFeatures(p3cl, verbose = FALSE)
p3cl <- ScaleData(p3cl, verbose = FALSE)
p3cl <- RunPCA(p3cl, npcs = 20, verbose = FALSE)
p3cl <- RunUMAP(p3cl, dims = 1:20, verbose = FALSE)

# Plot UMAP with MultiK clusters
DimPlot(p3cl, group.by = "multik_clusters",
        cols = c("#E41A1C", "#377EB8", "#4DAF4A"),
        pt.size = 0.5) +
  ggtitle("MultiK Clustering (K=3)") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
```

### Step 5: Statistical Validation

```{r sigclust, fig.width=10, fig.height=5}
# Pairwise SigClust tests
pval <- CalcSigClust(p3cl, clusters$clusters[, 1], nsim = 50, cores = 1)

# View p-value matrix
print(round(pval, 4))

# Visualize results
PlotSigClust(p3cl, clusters$clusters[, 1], pval)
```

**Interpretation:**

- **Dendrogram (left)**: Shows hierarchical relationships between clusters based on expression similarity
- **Filled circles (●)**: All pairwise comparisons at this node are significant (p < 0.05) - true distinct clusters
- **Open circles (○)**: At least one non-significant comparison - potential subclasses
- **Heatmap (right)**: Pairwise p-values. Red/orange = significant separation, white = similar

In this example, all three clusters show significant separation (p < 0.05), confirming they represent distinct cell populations.

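
With more clusters, reading the p-value matrix by eye gets harder. The chunk below is a small sketch for listing the pairs that fail the 0.05 cutoff. It assumes the matrix is square and symmetric with cluster labels as dimnames, which this vignette does not confirm for `CalcSigClust()`. The helper `nonsig_pairs()` is hypothetical and not part of MultiK, and the toy matrix values are invented.

```{r nonsig-pairs}
# Hypothetical 4-cluster p-value matrix (values invented for illustration).
pmat <- matrix(
  c(NA,    0.001, 0.002, 0.003,
    0.001, NA,    0.300, 0.004,
    0.002, 0.300, NA,    0.001,
    0.003, 0.004, 0.001, NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("C", 1:4), paste0("C", 1:4))
)

# List each cluster pair whose p-value is at or above alpha, using only the
# lower triangle so that every pair is reported once.
nonsig_pairs <- function(pmat, alpha = 0.05) {
  keep <- lower.tri(pmat) & !is.na(pmat) & pmat >= alpha
  idx <- which(keep, arr.ind = TRUE)
  data.frame(
    cluster_a = rownames(pmat)[idx[, "col"]],
    cluster_b = rownames(pmat)[idx[, "row"]],
    p_value = pmat[idx],
    stringsAsFactors = FALSE
  )
}

nonsig_pairs(pmat)
```

Pairs returned by such a helper correspond to the open-circle case described above: candidates for subclasses rather than clearly distinct populations. An empty result means every pair passed the cutoff.
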
## Key Functions

| Function | Purpose |
|----------|---------|
| `MultiK()` | Core consensus clustering algorithm |
| `DiagMultiKPlot()` | Diagnostic visualization for K selection |
| `getClusters()` | Extract cluster assignments |
| `CalcSigClust()` | Statistical significance testing |
| `PlotSigClust()` | Visualize cluster relationships |

## Summary

MultiK provides a rigorous, data-driven approach to determine optimal cluster numbers:

1. **Run MultiK** with multiple subsampling iterations
2. **Examine diagnostic plots** to identify optimal K
3. **Extract clusters** at the chosen K
4. **Validate** with SigClust statistical testing

This workflow ensures reproducible and statistically justified clustering results.

## Author

**Zaoqu Liu, PhD**

- Email: liuzaoqu@163.com
- ORCID: [0000-0002-0452-742X](https://orcid.org/0000-0002-0452-742X)
- GitHub: [Zaoqu-Liu](https://github.com/Zaoqu-Liu)

## Session Info

```{r session}
sessionInfo()
```