---
title: "Introduction to MultiK"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Introduction to MultiK}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 10,
  fig.height = 5,
  warning = FALSE,
  message = FALSE,
  fig.align = "center",
  dpi = 100
)
```

## Overview

**MultiK** is a computational framework for objectively determining the optimal number of clusters in single-cell RNA sequencing (scRNA-seq) data. Selecting the appropriate number of clusters (K) is a fundamental challenge in unsupervised clustering analysis. MultiK addresses it by combining subsampling-based consensus clustering with a stability metric and statistical significance testing.

### The Challenge

In scRNA-seq analysis, researchers often face the question: *"How many distinct cell populations exist in my data?"* Traditional approaches rely on:

- Arbitrary selection based on visualization
- Rule-of-thumb methods (e.g., elbow method)
- Biological prior knowledge

These approaches are subjective and may miss important biological structures or over-partition the data.

### The MultiK Solution

MultiK provides an objective, data-driven approach by:

1. **Subsampling-based consensus clustering** to assess clustering stability
2. **PAC (Proportion of Ambiguous Clustering)** metric to quantify clustering uncertainty
3. **SigClust statistical testing** to validate cluster separability
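
To make the first two ideas concrete, here is a minimal base-R sketch of a consensus matrix and a PAC score on synthetic data. It is an illustration under stated assumptions, not the package's implementation: `kmeans()` stands in for the clustering step, the helper names (`consensus_matrix`, `pac`, `rpac`) are hypothetical, and the 0.1/0.9 thresholds and the rescaling used for the relative PAC are choices made for this sketch.

```r
# Toy data: three well-separated 2-D groups of 30 points each (synthetic).
set.seed(1)
centers <- rbind(c(0, 0), c(6, 0), c(0, 6))
x <- centers[rep(1:3, each = 30), ] + matrix(rnorm(180, sd = 0.5), ncol = 2)

# Consensus matrix: for each pair of points, the fraction of subsamples
# (among those containing both) in which they fell into the same cluster.
# ASSUMPTION: kmeans() is a stand-in for the real clustering step.
consensus_matrix <- function(x, k, reps = 30, p_sample = 0.8) {
  n <- nrow(x)
  together <- matrix(0, n, n)  # times i and j were co-clustered
  sampled  <- matrix(0, n, n)  # times i and j were both drawn
  for (r in seq_len(reps)) {
    idx <- sample(n, size = floor(p_sample * n))
    cl  <- kmeans(x[idx, , drop = FALSE], centers = k, nstart = 20)$cluster
    sampled[idx, idx]  <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  together / pmax(sampled, 1)  # pmax avoids 0/0 for never co-sampled pairs
}

# PAC: share of pairwise consensus values strictly between lower and upper,
# i.e. pairs that are neither reliably together nor reliably apart.
pac <- function(m, lower = 0.1, upper = 0.9) {
  v <- m[lower.tri(m)]
  mean(v > lower & v < upper)
}

# Relative PAC (ASSUMPTION of this sketch): PAC rescaled by the share of
# non-zero consensus values.
rpac <- function(m, lower = 0.1, upper = 0.9) {
  v <- m[lower.tri(m)]
  pac(m, lower, upper) / (1 - mean(v == 0))
}

ks <- 2:4
pac_by_k <- sapply(ks, function(k) pac(consensus_matrix(x, k)))
names(pac_by_k) <- ks
pac_by_k
```

On data like this, the value of K that matches the true group structure should give consensus values that are almost all 0 or 1, and therefore a PAC near zero, while other values of K tend to produce more intermediate values.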

## Installation

```{r install, eval=FALSE}
# From R-universe (recommended)
install.packages("MultiK", repos = "https://zaoqu-liu.r-universe.dev")

# From GitHub
remotes::install_github("Zaoqu-Liu/MultiK")
```

## Quick Start

### Load Package and Data

```{r load}
library(MultiK)
library(Seurat)
library(ggplot2)

# Load example data
data(p3cl)
p3cl
```

The `p3cl` dataset contains approximately 2,600 cells from a mixture of three cell lines (H2228, H1975, HCC827), providing a benchmark with known ground truth (K=3).

### Step 1: Run MultiK Algorithm

```{r multik, cache=TRUE}
# Run with 50 subsampling iterations (use 100+ for production)
set.seed(42)
result <- MultiK(p3cl, 
                 reps = 50, 
                 resolution = seq(0.1, 1.5, 0.1),
                 nPC = 20,
                 cores = 1,
                 seed = 42)

# Check the distribution of K values
table(result$k)
```

### Step 2: Diagnostic Visualization

```{r diag, fig.width=12, fig.height=4}
# Generate diagnostic plots
DiagMultiKPlot(result$k, result$consensus)
```

**Interpretation of diagnostic plots:**

- **Left panel (Frequency)**: Shows how often each K value was observed across all clustering runs. K=3 appears most frequently, suggesting it's a stable solution.
- **Middle panel (rPAC)**: Relative PAC scores for each K. Lower values indicate more stable clustering. K=3 shows low rPAC.
- **Right panel (Trade-off)**: Combines frequency and stability. Points in the upper-right corner (high frequency, high stability) are optimal. The red point indicates the recommended K.
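
The trade-off idea can be sketched in a few lines of base R. The numbers below are made up for illustration and are not output from `MultiK()`, and the selection rule (keep every K that no other K beats on both frequency and stability) is an assumption of this sketch rather than a description of what `DiagMultiKPlot()` computes.

```r
# Hypothetical per-K summaries (made-up numbers, NOT results from MultiK()).
summary_k <- data.frame(
  k    = 2:6,
  freq = c(120, 310, 150, 90, 40),          # runs that produced each K
  rpac = c(0.20, 0.08, 0.15, 0.05, 0.40)    # lower = more stable
)
summary_k$stability <- 1 - summary_k$rpac

# A K is "dominated" if some other K is at least as good on both axes
# and strictly better on at least one of them.
is_dominated <- function(i, d) {
  any(d$freq >= d$freq[i] & d$stability >= d$stability[i] &
        (d$freq > d$freq[i] | d$stability > d$stability[i]))
}

dominated  <- vapply(seq_len(nrow(summary_k)), is_dominated,
                     logical(1), d = summary_k)
candidates <- summary_k$k[!dominated]
candidates
```

With these toy values, K=3 (most frequent) and K=5 (most stable) both survive, which mirrors the situation where the upper-right region of the plot contains more than one candidate and the final choice needs further validation.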

### Step 3: Extract Clusters

```{r clusters}
# Get cluster assignments at optimal K=3
clusters <- getClusters(p3cl, optK = 3, nPC = 20)

# View cluster distribution
table(clusters$clusters[, 1])

# Add clusters to Seurat object for visualization
p3cl$multik_clusters <- factor(clusters$clusters[, 1])
```

### Step 4: Visualize Clusters

```{r umap, fig.width=8, fig.height=6}
# Standard Seurat preprocessing (normalize, select features, scale, PCA),
# then UMAP on the first 20 PCs for visualization
p3cl <- NormalizeData(p3cl, verbose = FALSE)
p3cl <- FindVariableFeatures(p3cl, verbose = FALSE)
p3cl <- ScaleData(p3cl, verbose = FALSE)
p3cl <- RunPCA(p3cl, npcs = 20, verbose = FALSE)
p3cl <- RunUMAP(p3cl, dims = 1:20, verbose = FALSE)

# Plot UMAP with MultiK clusters
DimPlot(p3cl, group.by = "multik_clusters", 
        cols = c("#E41A1C", "#377EB8", "#4DAF4A"),
        pt.size = 0.5) +
  ggtitle("MultiK Clustering (K=3)") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
```

### Step 5: Statistical Validation

```{r sigclust, fig.width=10, fig.height=5}
# Pairwise SigClust tests
pval <- CalcSigClust(p3cl, clusters$clusters[, 1], nsim = 50, cores = 1)

# View p-value matrix
print(round(pval, 4))

# Visualize results
PlotSigClust(p3cl, clusters$clusters[, 1], pval)
```

**Interpretation:**

- **Dendrogram (left)**: Shows hierarchical relationships between clusters based on expression similarity.
- **Filled circles (●)**: All pairwise comparisons at this node are significant (p < 0.05), indicating distinct clusters.
- **Open circles (○)**: At least one comparison at this node is not significant, indicating potential subclasses.
- **Heatmap (right)**: Shows pairwise p-values. Red/orange indicates significant separation; white indicates similar clusters.

In this example, all three clusters show significant separation (p < 0.05), confirming they represent distinct cell populations.
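
The logic behind a SigClust-style test can be sketched with base R alone. The sketch asks whether a two-way split of the data is tighter than would be expected if all points came from a single Gaussian. It is a simplified illustration, not the procedure used by `CalcSigClust()`: the helper names are hypothetical, the null covariance is taken directly from the PCA standard deviations, and the p-value is a plain empirical proportion.

```r
# 2-means cluster index: within-cluster SS / total SS (smaller = tighter split).
cluster_index <- function(x) {
  km <- kmeans(x, centers = 2, nstart = 10)
  km$tot.withinss / km$totss
}

# Monte Carlo test against a single-Gaussian null.
# ASSUMPTION: null data are drawn with independent coordinates whose standard
# deviations match the principal-component standard deviations of x.
sigclust_sketch <- function(x, nsim = 100) {
  observed <- cluster_index(x)
  sds <- prcomp(x)$sdev
  null_ci <- replicate(nsim, {
    z <- sapply(sds, function(s) rnorm(nrow(x), sd = s))
    cluster_index(z)
  })
  # Empirical p-value: share of null indices at least as small as observed.
  mean(null_ci <= observed)
}

set.seed(1)
# Two clearly separated synthetic groups of 30 points each.
two_groups <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.5), ncol = 2)
)
# A single synthetic Gaussian blob of 60 points.
one_group <- matrix(rnorm(120), ncol = 2)

p_two <- sigclust_sketch(two_groups)
p_one <- sigclust_sketch(one_group)
c(two_groups = p_two, one_group = p_one)
```

A small p-value for the separated groups says the split is tighter than the single-Gaussian null can explain, which is the sense in which a pair of clusters is called significantly separated above.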

## Key Functions

| Function | Purpose |
|----------|---------|
| `MultiK()` | Core consensus clustering algorithm |
| `DiagMultiKPlot()` | Diagnostic visualization for K selection |
| `getClusters()` | Extract cluster assignments |
| `CalcSigClust()` | Statistical significance testing |
| `PlotSigClust()` | Visualize cluster relationships |

## Summary

MultiK provides a data-driven approach to determining the optimal number of clusters:

1. **Run MultiK** with multiple subsampling iterations
2. **Examine diagnostic plots** to identify optimal K
3. **Extract clusters** at the chosen K
4. **Validate** with SigClust statistical testing

With a fixed seed, this workflow yields clustering results that are reproducible and supported by statistical testing.

## Author

**Zaoqu Liu, PhD**

- Email: liuzaoqu@163.com
- ORCID: [0000-0002-0452-742X](https://orcid.org/0000-0002-0452-742X)
- GitHub: [Zaoqu-Liu](https://github.com/Zaoqu-Liu)

## Session Info

```{r session}
sessionInfo()
```