MultiK is a computational framework for objective determination of optimal cluster numbers in single-cell RNA sequencing (scRNA-seq) data. Selecting the appropriate number of clusters (K) is a fundamental challenge in unsupervised clustering analysis, and MultiK addresses this through a rigorous statistical approach.
In scRNA-seq analysis, researchers often face the question: “How many distinct cell populations exist in my data?” Traditional approaches rely on:
These approaches are subjective and may miss important biological structures or over-partition the data.
MultiK provides an objective, data-driven approach by:
library(MultiK)
library(Seurat)
library(ggplot2)
# Load example data
data(p3cl)
p3cl
#> An object of class Seurat
#> 21247 features across 2609 samples within 1 assay
#> Active assay: RNA (21247 features, 0 variable features)
#> 2 layers present: counts, dataThe p3cl dataset contains approximately 2,600 cells from
a three cell line mixture (H2228, H1975, HCC827), providing a benchmark
with known ground truth (K=3).
# Run with 50 subsampling iterations (use 100+ for production)
set.seed(42)
result <- MultiK(p3cl,
reps = 50,
resolution = seq(0.1, 1.5, 0.1),
nPC = 20,
cores = 1,
seed = 42)
# Check the distribution of K values
table(result$k)
#>
#> 3 4 5 6 7 8 9 10 11 12 13 14 15 16
#> 51 66 60 96 75 65 74 78 73 69 25 14 3 1Interpretation of diagnostic plots:
# Run UMAP for visualization
p3cl <- NormalizeData(p3cl, verbose = FALSE)
p3cl <- FindVariableFeatures(p3cl, verbose = FALSE)
p3cl <- ScaleData(p3cl, verbose = FALSE)
p3cl <- RunPCA(p3cl, npcs = 20, verbose = FALSE)
p3cl <- RunUMAP(p3cl, dims = 1:20, verbose = FALSE)
# Plot UMAP with MultiK clusters
DimPlot(p3cl, group.by = "multik_clusters",
cols = c("#E41A1C", "#377EB8", "#4DAF4A"),
pt.size = 0.5) +
ggtitle("MultiK Clustering (K=3)") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))# Pairwise SigClust tests
pval <- CalcSigClust(p3cl, clusters$clusters[, 1], nsim = 50, cores = 1)
# View p-value matrix
print(round(pval, 4))
#> cluster0 cluster1 cluster2
#> cluster0 NA 0 0
#> cluster1 0 NA 0
#> cluster2 0 0 NA
# Visualize results
PlotSigClust(p3cl, clusters$clusters[, 1], pval)Interpretation:
In this example, all three clusters show significant separation (p < 0.05), confirming they represent distinct cell populations.
| Function | Purpose |
|---|---|
MultiK() |
Core consensus clustering algorithm |
DiagMultiKPlot() |
Diagnostic visualization for K selection |
getClusters() |
Extract cluster assignments |
CalcSigClust() |
Statistical significance testing |
PlotSigClust() |
Visualize cluster relationships |
MultiK provides a rigorous, data-driven approach to determine optimal cluster numbers:
This workflow ensures reproducible and statistically justified clustering results.
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] future_1.70.0 Seurat_5.5.0 SeuratObject_5.4.0 sp_2.2-1
#> [5] MultiK_1.0.0 ggplot2_4.0.3 rmarkdown_2.31
#>
#> loaded via a namespace (and not attached):
#> [1] deldir_2.0-4 pbapply_1.7-4 gridExtra_2.3
#> [4] rlang_1.2.0 magrittr_2.0.5 RcppAnnoy_0.0.23
#> [7] otel_0.2.0 spatstat.geom_3.8-1 matrixStats_1.5.0
#> [10] ggridges_0.5.7 compiler_4.6.0 png_0.1-9
#> [13] vctrs_0.7.3 reshape2_1.4.5 stringr_1.6.0
#> [16] pkgconfig_2.0.3 fastmap_1.2.0 labeling_0.4.3
#> [19] promises_1.5.0 purrr_1.2.2 xfun_0.59
#> [22] cachem_1.1.0 sigclust_1.1.0.1 jsonlite_2.0.0
#> [25] goftest_1.2-3 later_1.4.8 spatstat.utils_3.2-3
#> [28] irlba_2.3.7 parallel_4.6.0 cluster_2.1.8.2
#> [31] R6_2.6.1 ica_1.0-3 spatstat.data_3.1-9
#> [34] bslib_0.11.0 stringi_1.8.7 RColorBrewer_1.1-3
#> [37] reticulate_1.46.0 spatstat.univar_3.2-0 parallelly_1.47.0
#> [40] lmtest_0.9-40 jquerylib_0.1.4 scattermore_1.2
#> [43] Rcpp_1.1.1-1.1 knitr_1.51 tensor_1.5.1
#> [46] future.apply_1.20.2 zoo_1.8-15 sctransform_0.4.3
#> [49] httpuv_1.6.17 Matrix_1.7-5 splines_4.6.0
#> [52] igraph_2.3.2 tidyselect_1.2.1 abind_1.4-8
#> [55] yaml_2.3.12 spatstat.random_3.5-0 spatstat.explore_3.8-1
#> [58] codetools_0.2-20 miniUI_0.1.2 listenv_1.0.0
#> [61] lattice_0.22-9 tibble_3.3.1 plyr_1.8.9
#> [64] shiny_1.14.0 withr_3.0.3 S7_0.2.2
#> [67] ROCR_1.0-12 evaluate_1.0.5 Rtsne_0.17
#> [70] fastDummies_1.7.6 survival_3.8-6 polyclip_1.10-7
#> [73] fitdistrplus_1.2-6 pillar_1.11.1 KernSmooth_2.23-26
#> [76] plotly_4.12.0 generics_0.1.4 RcppHNSW_0.7.0
#> [79] scales_1.4.0 globals_0.19.1 xtable_1.8-8
#> [82] glue_1.8.1 lazyeval_0.2.3 maketools_1.3.2
#> [85] tools_4.6.0 sys_3.4.3 data.table_1.18.4
#> [88] RSpectra_0.16-2 RANN_2.6.2 buildtools_1.0.0
#> [91] dotCall64_1.2 cowplot_1.2.0 grid_4.6.0
#> [94] tidyr_1.3.2 nlme_3.1-169 patchwork_1.3.2
#> [97] cli_3.6.6 spatstat.sparse_3.2-0 spam_2.11-4
#> [100] viridisLite_0.4.3 dplyr_1.2.1 uwot_0.2.4
#> [103] gtable_0.3.6 sass_0.4.10 digest_0.6.39
#> [106] progressr_0.19.0 ggrepel_0.9.8 htmlwidgets_1.6.4
#> [109] farver_2.1.2 htmltools_0.5.9 lifecycle_1.0.5
#> [112] httr_1.4.8 mime_0.13 MASS_7.3-65