Introduction to MultiK

Overview

MultiK is a computational framework for objective determination of optimal cluster numbers in single-cell RNA sequencing (scRNA-seq) data. Selecting the appropriate number of clusters (K) is a fundamental challenge in unsupervised clustering analysis, and MultiK addresses this through a rigorous statistical approach.

The Challenge

In scRNA-seq analysis, researchers often face the question: “How many distinct cell populations exist in my data?” Traditional approaches rely on:

  • Arbitrary selection based on visualization
  • Rule-of-thumb methods (e.g., elbow method)
  • Biological prior knowledge

These approaches are subjective and may miss important biological structures or over-partition the data.

The MultiK Solution

MultiK provides an objective, data-driven approach by:

  1. Subsampling-based consensus clustering to assess clustering stability
  2. PAC (Proportion of Ambiguous Clustering) metric to quantify clustering uncertainty
  3. SigClust statistical testing to validate cluster separability

Installation

# From R-universe (recommended)
install.packages("MultiK", repos = "https://zaoqu-liu.r-universe.dev")

# From GitHub
remotes::install_github("Zaoqu-Liu/MultiK")

Quick Start

Load Package and Data

library(MultiK)
library(Seurat)
library(ggplot2)

# Load example data
data(p3cl)
p3cl
#> An object of class Seurat 
#> 21247 features across 2609 samples within 1 assay 
#> Active assay: RNA (21247 features, 0 variable features)
#>  2 layers present: counts, data

The p3cl dataset contains approximately 2,600 cells from a three cell line mixture (H2228, H1975, HCC827), providing a benchmark with known ground truth (K=3).

Step 1: Run MultiK Algorithm

# Run with 50 subsampling iterations (use 100+ for production)
set.seed(42)
result <- MultiK(p3cl, 
                 reps = 50, 
                 resolution = seq(0.1, 1.5, 0.1),
                 nPC = 20,
                 cores = 1,
                 seed = 42)

# Check the distribution of K values
table(result$k)
#> 
#>  3  4  5  6  7  8  9 10 11 12 13 14 15 16 
#> 51 66 60 96 75 65 74 78 73 69 25 14  3  1

Step 2: Diagnostic Visualization

# Generate diagnostic plots
DiagMultiKPlot(result$k, result$consensus)

Interpretation of diagnostic plots:

  • Left panel (Frequency): Shows how often each K value was observed across all clustering runs. K=3 appears most frequently, suggesting it’s a stable solution.
  • Middle panel (rPAC): Relative PAC scores for each K. Lower values indicate more stable clustering. K=3 shows low rPAC.
  • Right panel (Trade-off): Combines frequency and stability. Points in the upper-right corner (high frequency, high stability) are optimal. The red point indicates the recommended K.

Step 3: Extract Clusters

# Get cluster assignments at optimal K=3
clusters <- getClusters(p3cl, optK = 3, nPC = 20)

# View cluster distribution
table(clusters$clusters[, 1])
#> 
#>    0    1    2 
#> 1854  541  214

# Add clusters to Seurat object for visualization
p3cl$multik_clusters <- factor(clusters$clusters[, 1])

Step 4: Visualize Clusters

# Run UMAP for visualization
p3cl <- NormalizeData(p3cl, verbose = FALSE)
p3cl <- FindVariableFeatures(p3cl, verbose = FALSE)
p3cl <- ScaleData(p3cl, verbose = FALSE)
p3cl <- RunPCA(p3cl, npcs = 20, verbose = FALSE)
p3cl <- RunUMAP(p3cl, dims = 1:20, verbose = FALSE)

# Plot UMAP with MultiK clusters
DimPlot(p3cl, group.by = "multik_clusters", 
        cols = c("#E41A1C", "#377EB8", "#4DAF4A"),
        pt.size = 0.5) +
  ggtitle("MultiK Clustering (K=3)") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Step 5: Statistical Validation

# Pairwise SigClust tests
pval <- CalcSigClust(p3cl, clusters$clusters[, 1], nsim = 50, cores = 1)

# View p-value matrix
print(round(pval, 4))
#>          cluster0 cluster1 cluster2
#> cluster0       NA        0        0
#> cluster1        0       NA        0
#> cluster2        0        0       NA

# Visualize results
PlotSigClust(p3cl, clusters$clusters[, 1], pval)

Interpretation:

  • Dendrogram (left): Shows hierarchical relationships between clusters based on expression similarity
  • Filled circles (●): All pairwise comparisons at this node are significant (p < 0.05) - true distinct clusters
  • Open circles (○): At least one non-significant comparison - potential subclasses
  • Heatmap (right): Pairwise p-values. Red/orange = significant separation, white = similar

In this example, all three clusters show significant separation (p < 0.05), confirming they represent distinct cell populations.

Key Functions

Function Purpose
MultiK() Core consensus clustering algorithm
DiagMultiKPlot() Diagnostic visualization for K selection
getClusters() Extract cluster assignments
CalcSigClust() Statistical significance testing
PlotSigClust() Visualize cluster relationships

Summary

MultiK provides a rigorous, data-driven approach to determine optimal cluster numbers:

  1. Run MultiK with multiple subsampling iterations
  2. Examine diagnostic plots to identify optimal K
  3. Extract clusters at the chosen K
  4. Validate with SigClust statistical testing

This workflow ensures reproducible and statistically justified clustering results.

Author

Zaoqu Liu, PhD

Session Info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] future_1.70.0      Seurat_5.5.0       SeuratObject_5.4.0 sp_2.2-1          
#> [5] MultiK_1.0.0       ggplot2_4.0.3      rmarkdown_2.31    
#> 
#> loaded via a namespace (and not attached):
#>   [1] deldir_2.0-4           pbapply_1.7-4          gridExtra_2.3         
#>   [4] rlang_1.2.0            magrittr_2.0.5         RcppAnnoy_0.0.23      
#>   [7] otel_0.2.0             spatstat.geom_3.8-1    matrixStats_1.5.0     
#>  [10] ggridges_0.5.7         compiler_4.6.0         png_0.1-9             
#>  [13] vctrs_0.7.3            reshape2_1.4.5         stringr_1.6.0         
#>  [16] pkgconfig_2.0.3        fastmap_1.2.0          labeling_0.4.3        
#>  [19] promises_1.5.0         purrr_1.2.2            xfun_0.59             
#>  [22] cachem_1.1.0           sigclust_1.1.0.1       jsonlite_2.0.0        
#>  [25] goftest_1.2-3          later_1.4.8            spatstat.utils_3.2-3  
#>  [28] irlba_2.3.7            parallel_4.6.0         cluster_2.1.8.2       
#>  [31] R6_2.6.1               ica_1.0-3              spatstat.data_3.1-9   
#>  [34] bslib_0.11.0           stringi_1.8.7          RColorBrewer_1.1-3    
#>  [37] reticulate_1.46.0      spatstat.univar_3.2-0  parallelly_1.47.0     
#>  [40] lmtest_0.9-40          jquerylib_0.1.4        scattermore_1.2       
#>  [43] Rcpp_1.1.1-1.1         knitr_1.51             tensor_1.5.1          
#>  [46] future.apply_1.20.2    zoo_1.8-15             sctransform_0.4.3     
#>  [49] httpuv_1.6.17          Matrix_1.7-5           splines_4.6.0         
#>  [52] igraph_2.3.2           tidyselect_1.2.1       abind_1.4-8           
#>  [55] yaml_2.3.12            spatstat.random_3.5-0  spatstat.explore_3.8-1
#>  [58] codetools_0.2-20       miniUI_0.1.2           listenv_1.0.0         
#>  [61] lattice_0.22-9         tibble_3.3.1           plyr_1.8.9            
#>  [64] shiny_1.14.0           withr_3.0.3            S7_0.2.2              
#>  [67] ROCR_1.0-12            evaluate_1.0.5         Rtsne_0.17            
#>  [70] fastDummies_1.7.6      survival_3.8-6         polyclip_1.10-7       
#>  [73] fitdistrplus_1.2-6     pillar_1.11.1          KernSmooth_2.23-26    
#>  [76] plotly_4.12.0          generics_0.1.4         RcppHNSW_0.7.0        
#>  [79] scales_1.4.0           globals_0.19.1         xtable_1.8-8          
#>  [82] glue_1.8.1             lazyeval_0.2.3         maketools_1.3.2       
#>  [85] tools_4.6.0            sys_3.4.3              data.table_1.18.4     
#>  [88] RSpectra_0.16-2        RANN_2.6.2             buildtools_1.0.0      
#>  [91] dotCall64_1.2          cowplot_1.2.0          grid_4.6.0            
#>  [94] tidyr_1.3.2            nlme_3.1-169           patchwork_1.3.2       
#>  [97] cli_3.6.6              spatstat.sparse_3.2-0  spam_2.11-4           
#> [100] viridisLite_0.4.3      dplyr_1.2.1            uwot_0.2.4            
#> [103] gtable_0.3.6           sass_0.4.10            digest_0.6.39         
#> [106] progressr_0.19.0       ggrepel_0.9.8          htmlwidgets_1.6.4     
#> [109] farver_2.1.2           htmltools_0.5.9        lifecycle_1.0.5       
#> [112] httr_1.4.8             mime_0.13              MASS_7.3-65