--- title: "Algorithm Details and Mathematical Framework" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 fig_caption: true vignette: > %\VignetteIndexEntry{Algorithm Details and Mathematical Framework} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 6, warning = FALSE, message = FALSE, fig.align = "center", dpi = 100 ) ``` ## Mathematical Framework This vignette provides a detailed description of the mathematical foundations underlying the MultiK algorithm. ## 1. Consensus Clustering ### 1.1 Subsampling Strategy Given a dataset with $N$ cells, MultiK performs $R$ subsampling iterations. In each iteration $r$: 1. Randomly sample $\lfloor p \cdot N \rfloor$ cells without replacement (default $p = 0.8$) 2. Apply the standard Seurat clustering pipeline across $M$ resolution parameters ### 1.2 Consensus Matrix Construction For each cluster number $K$, we construct a **consensus matrix** $\mathbf{C}^{(K)} \in [0,1]^{N \times N}$: $$C_{ij}^{(K)} = \frac{\sum_{r=1}^{R_K} \mathbf{1}[\text{cell } i \text{ and } j \text{ co-cluster in run } r]}{\sum_{r=1}^{R_K} \mathbf{1}[\text{cell } i \text{ and } j \text{ co-sampled in run } r]}$$ where $R_K$ is the number of runs that yielded exactly $K$ clusters. **Properties of the consensus matrix:** - $C_{ii} = 1$ (a cell always co-clusters with itself) - $C_{ij} = C_{ji}$ (symmetric) - $C_{ij} \in [0, 1]$ - $C_{ij} = 1$ implies cells $i$ and $j$ always cluster together - $C_{ij} = 0$ implies cells never cluster together ### 1.3 Visualizing the Consensus Matrix ```{r consensus-demo, fig.width=8, fig.height=6, fig.cap="Example consensus matrix showing clear block-diagonal structure indicating well-separated clusters."} library(ggplot2) # Simulate a consensus matrix with 3 clear clusters set.seed(42) n <- 60 blocks <- c(rep(1, 20), rep(2, 20), rep(3, 20)) mat <- matrix(0, n, n) for (i in 1:n) { for (j in 1:n) { if (blocks[i] == blocks[j]) { mat[i, j] <- runif(1, 0.85, 1.0) } else { mat[i, j] <- runif(1, 0, 0.15) } } } diag(mat) <- 1 mat[lower.tri(mat)] <- t(mat)[lower.tri(mat)] # Reorder by cluster ord <- order(blocks) mat_ord <- mat[ord, ord] df_cons <- expand.grid(Cell1 = 1:n, Cell2 = 1:n) df_cons$value <- as.vector(mat_ord) ggplot(df_cons, aes(x = Cell1, y = Cell2, fill = value)) + geom_tile() + scale_fill_gradient2(low = "#3498DB", mid = "white", high = "#E74C3C", midpoint = 0.5, limits = c(0, 1), name = "Co-clustering\nProbability") + labs(x = "Cell Index", y = "Cell Index", title = "Consensus Matrix Heatmap", subtitle = "Clear block structure indicates stable clusters") + coord_fixed() + theme_minimal(base_size = 12) + theme(plot.title = element_text(face = "bold"), panel.grid = element_blank()) ``` ## 2. Proportion of Ambiguous Clustering (PAC) ### 2.1 Definition The PAC metric quantifies clustering ambiguity by measuring the proportion of cell pairs with intermediate consensus values: $$\text{PAC}(K) = \frac{|\{(i,j): x_1 < C_{ij}^{(K)} < x_2, i < j\}|}{|\{(i,j): i < j\}|}$$ where $x_1 = 0.1$ and $x_2 = 0.9$ are default thresholds. ### 2.2 Interpretation | PAC Value | Interpretation | |-----------|----------------| | PAC ≈ 0 | Clear cluster structure, cells consistently assigned | | PAC ≈ 1 | Ambiguous clustering, unstable cell assignments | ### 2.3 PAC Visualization ```{r pac-demo, fig.width=10, fig.height=4, fig.cap="PAC values across different K. Lower PAC indicates more stable clustering."} # Simulate PAC values for different K df_pac <- data.frame( K = factor(2:8), PAC = c(0.12, 0.05, 0.08, 0.18, 0.35, 0.52, 0.68) ) df_pac$rPAC <- df_pac$PAC / max(df_pac$PAC) p1 <- ggplot(df_pac, aes(x = K, y = PAC)) + geom_col(fill = "#3498DB", color = "black", width = 0.7) + geom_hline(yintercept = 0.1, linetype = "dashed", color = "red") + labs(x = "Number of Clusters (K)", y = "PAC", title = "Absolute PAC") + theme_bw(base_size = 12) + theme(plot.title = element_text(face = "bold")) p2 <- ggplot(df_pac, aes(x = K, y = rPAC)) + geom_col(fill = "#E74C3C", color = "black", width = 0.7) + labs(x = "Number of Clusters (K)", y = "Relative PAC (rPAC)", title = "Relative PAC (normalized)") + theme_bw(base_size = 12) + theme(plot.title = element_text(face = "bold")) cowplot::plot_grid(p1, p2, ncol = 2) ``` ### 2.4 Relative PAC (rPAC) To enable comparison across different K values: $$\text{rPAC}(K) = \frac{\text{PAC}(K)}{\max_k \text{PAC}(k)}$$ Lower rPAC indicates more stable clustering. ## 3. Optimal K Selection ### 3.1 Multi-Objective Optimization Optimal K should satisfy two criteria: 1. **High frequency**: K appears consistently across resolution parameters 2. **Low rPAC**: Clustering is stable (low ambiguity) ### 3.2 Pareto Frontier We identify optimal K values using **Pareto optimality**. A point $(x_i, y_i)$ is Pareto-optimal if no other point $(x_j, y_j)$ satisfies: $$x_j \geq x_i \text{ and } y_j \geq y_i \text{ with at least one strict inequality}$$ where: - $x = 1 - \text{rPAC}$ (stability) - $y = \text{frequency}$ ```{r pareto-concept, fig.width=8, fig.height=6, fig.cap="Pareto frontier for K selection. Red points indicate Pareto-optimal K values."} set.seed(42) df <- data.frame( K = factor(2:8), stability = c(0.88, 0.95, 0.92, 0.82, 0.65, 0.48, 0.32), frequency = c(25, 120, 85, 95, 110, 75, 40) ) # Find Pareto optimal points pareto <- c() for (i in 1:nrow(df)) { dominated <- FALSE for (j in 1:nrow(df)) { if (i != j) { if (df$stability[j] >= df$stability[i] && df$frequency[j] >= df$frequency[i] && (df$stability[j] > df$stability[i] || df$frequency[j] > df$frequency[i])) { dominated <- TRUE break } } } if (!dominated) pareto <- c(pareto, i) } df$optimal <- 1:7 %in% pareto ggplot(df, aes(x = stability, y = frequency)) + geom_point(aes(color = optimal), size = 5) + geom_text(aes(label = paste0("K=", K)), vjust = -1.2, size = 4) + scale_color_manual(values = c("FALSE" = "gray50", "TRUE" = "#E74C3C"), guide = "none") + geom_vline(xintercept = 0.9, linetype = "dashed", alpha = 0.5) + geom_hline(yintercept = 100, linetype = "dashed", alpha = 0.5) + annotate("rect", xmin = 0.9, xmax = 1, ymin = 100, ymax = 130, fill = "green", alpha = 0.1) + annotate("text", x = 0.95, y = 125, label = "Ideal\nRegion", size = 3) + labs(x = "Stability (1 - rPAC)", y = "Frequency", title = "Pareto Frontier for Optimal K Selection", subtitle = "Red points = Pareto-optimal; Green region = ideal") + xlim(0.25, 1) + theme_bw(base_size = 12) + theme(plot.title = element_text(face = "bold")) ``` ## 4. SigClust Statistical Testing ### 4.1 Hypothesis Framework For each pair of clusters $C_a$ and $C_b$, SigClust tests: - $H_0$: Data come from a single Gaussian distribution - $H_1$: Data come from two distinct distributions ### 4.2 Cluster Index The test statistic is the **cluster index (CI)**: $$\text{CI} = \frac{2 \cdot n_a \cdot n_b}{(n_a + n_b)^2} \cdot \frac{\|\bar{x}_a - \bar{x}_b\|^2}{\text{tr}(\hat{\Sigma})}$$ where: - $n_a, n_b$ are cluster sizes - $\bar{x}_a, \bar{x}_b$ are cluster centroids - $\hat{\Sigma}$ is the pooled covariance estimate ### 4.3 P-value Heatmap Interpretation ```{r pval-heatmap, fig.width=7, fig.height=6, fig.cap="Pairwise SigClust p-value heatmap. Green indicates significant separation."} # Create example p-value matrix pval_mat <- matrix(c(NA, 0.001, 0.015, 0.042, 0.001, NA, 0.003, 0.089, 0.015, 0.003, NA, 0.008, 0.042, 0.089, 0.008, NA), 4, 4, byrow = TRUE) rownames(pval_mat) <- colnames(pval_mat) <- paste0("Cluster", 1:4) df_heat <- expand.grid(Cluster1 = rownames(pval_mat), Cluster2 = colnames(pval_mat)) df_heat$pvalue <- as.vector(pval_mat) ggplot(df_heat, aes(x = Cluster1, y = Cluster2, fill = pvalue)) + geom_tile(color = "white", linewidth = 1) + geom_text(aes(label = ifelse(is.na(pvalue), "", sprintf("%.3f", pvalue))), size = 5, fontface = "bold") + scale_fill_gradient2(low = "#27AE60", mid = "#F1C40F", high = "#E74C3C", midpoint = 0.05, limits = c(0, 0.1), na.value = "grey90", name = "P-value") + labs(x = "", y = "", title = "Pairwise SigClust P-values", subtitle = "Green (p < 0.05) = significant separation") + coord_fixed() + theme_minimal(base_size = 12) + theme(plot.title = element_text(face = "bold"), panel.grid = element_blank(), axis.text = element_text(size = 11, face = "bold")) ``` ### 4.4 Interpretation Guide | P-value | Interpretation | Action | |---------|----------------|--------| | $p < 0.01$ | Highly significant separation | Distinct clusters | | $0.01 \leq p < 0.05$ | Significant separation | Likely distinct | | $0.05 \leq p < 0.10$ | Borderline | Consider merging | | $p \geq 0.10$ | Not significant | May be subclasses | ## 5. Complete Workflow Visualization ```{r workflow, fig.width=10, fig.height=3, fig.cap="MultiK analysis workflow."} library(ggplot2) # Create workflow diagram steps <- data.frame( x = 1:5, y = rep(0, 5), label = c("1. Input\nSeurat Object", "2. Subsampling\n& Clustering", "3. Consensus\nMatrix", "4. PAC &\nOptimal K", "5. SigClust\nValidation"), color = c("#3498DB", "#9B59B6", "#E74C3C", "#F1C40F", "#27AE60") ) ggplot(steps, aes(x = x, y = y)) + geom_segment(aes(x = 1.3, xend = 4.7, y = 0, yend = 0), arrow = arrow(length = unit(0.3, "cm"), type = "closed"), color = "gray40", linewidth = 1.5) + geom_point(aes(fill = color), shape = 21, size = 20, color = "black", stroke = 2) + geom_text(aes(label = label), y = -0.15, size = 3.5, lineheight = 0.9) + scale_fill_identity() + coord_cartesian(ylim = c(-0.3, 0.15), xlim = c(0.5, 5.5)) + theme_void() + labs(title = "MultiK Analysis Workflow") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14)) ``` ## 6. Computational Complexity | Component | Complexity | |-----------|------------| | Single clustering run | $O(N \log N)$ | | All subsampling iterations | $O(R \cdot M \cdot N \log N)$ | | Consensus matrix | $O(R_K \cdot N^2)$ | | PAC calculation | $O(N^2)$ | | SigClust (per pair) | $O(B \cdot n_k^2)$ | where $N$ = cells, $R$ = reps, $M$ = resolutions, $B$ = simulations. ## 7. Parameter Recommendations | Parameter | Recommended | Rationale | |-----------|-------------|-----------| | `reps` | 100-200 | Sufficient for stable consensus | | `pSample` | 0.8 | Balance between variation and representation | | `resolution` | 0.05-2.0 | Covers typical cluster range | | `nsim` | 100-500 | Accurate p-value estimation | ## References 1. **Consensus Clustering**: Monti S, et al. (2003) Consensus Clustering. *Machine Learning* 52:91-118. 2. **PAC Metric**: Șenbabaoğlu Y, et al. (2014) Critical limitations of consensus clustering. *Scientific Reports* 4:6207. 3. **SigClust**: Liu Y, et al. (2008) Statistical significance of clustering. *JASA* 103:1281-1293. ## Author **Zaoqu Liu, PhD** - Email: liuzaoqu@163.com - ORCID: [0000-0002-0452-742X](https://orcid.org/0000-0002-0452-742X) --- *This vignette is part of the MultiK package. For questions or issues, please visit [GitHub](https://github.com/Zaoqu-Liu/MultiK).*