--- title: "Algorithm Principles and Mathematical Foundation" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Algorithm Principles and Mathematical Foundation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 6, warning = FALSE, message = FALSE, eval = FALSE ) ``` ## Overview **scClustEval** implements a self-projection framework for evaluating and optimizing single-cell RNA-seq clustering. This vignette provides a comprehensive explanation of the underlying algorithms and mathematical foundations. ## The Challenge of Cell Clustering Evaluation In single-cell RNA-seq analysis, clustering algorithms aim to group cells with similar transcriptomic profiles. However, several challenges arise: 1. **Parameter Sensitivity**: Clustering results are highly sensitive to resolution parameters 2. **Over-clustering**: High resolution may split biologically homogeneous populations 3. **Under-clustering**: Low resolution may merge distinct cell types 4. **Lack of Ground Truth**: True cell type labels are rarely available **scClustEval** addresses these challenges through a quantitative, data-driven approach. ## Self-Projection Framework ### Core Concept The self-projection method treats cluster evaluation as a classification problem: > *"If a clustering is biologically meaningful, a classifier should be able to accurately distinguish cells from different clusters based on their gene expression profiles."* ### Algorithm Steps ```{r algorithm_flow, echo=FALSE, fig.height=4} library(ggplot2) steps <- data.frame( step = factor(1:5, labels = c( "1. Data Split", "2. Train Classifier", "3. Cross-Validate", "4. Compute Confusion", "5. Merge Clusters" )), description = c( "Stratified train/test split", "Multi-class classification", "K-fold CV on training set", "Build confusion matrix", "Graph-based merging" ), x = 1:5, y = rep(1, 5) ) ggplot(steps, aes(x = x, y = y)) + geom_segment(aes(xend = x, yend = 0), color = "#3cb44b", size = 2) + geom_point(size = 8, color = "#3cb44b") + geom_text(aes(label = step), y = -0.3, size = 3.5, fontface = "bold") + geom_text(aes(label = description), y = 0.3, size = 3) + theme_void() + ylim(-0.8, 0.8) + labs(title = "Self-Projection Algorithm Pipeline") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14)) ``` ### Mathematical Formulation #### 1. Data Partitioning Given expression matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ with $n$ cells and $p$ features, and cluster labels $\mathbf{y} \in \{1, ..., K\}^n$: $$ \mathbf{X} = \mathbf{X}_{train} \cup \mathbf{X}_{test} $$ Stratified sampling ensures each cluster $k$ contributes proportionally: $$ \frac{n_k^{train}}{n^{train}} \approx \frac{n_k^{test}}{n^{test}} \approx \frac{n_k}{n} $$ #### 2. Classification Model We train a multi-class classifier $f: \mathbb{R}^p \rightarrow \{1, ..., K\}$ that predicts cluster membership. The default is L1-regularized logistic regression: $$ \min_{\boldsymbol{\beta}} -\sum_{i=1}^{n_{train}} \log P(y_i | \mathbf{x}_i, \boldsymbol{\beta}) + \lambda ||\boldsymbol{\beta}||_1 $$ #### 3. Confusion Matrix The confusion matrix $\mathbf{C} \in \mathbb{R}^{K \times K}$ is computed on the test set: $$ C_{ij} = |\{x : y_{true}(x) = i \land y_{pred}(x) = j\}| $$ Where: - $C_{ii}$ (diagonal): correctly classified cells from cluster $i$ - $C_{ij}$ (off-diagonal): cells from cluster $i$ misclassified as cluster $j$ ## Confusion Matrix Normalization ### R1 Normalization R1 normalization measures **pairwise confusion rate** relative to correct classifications: $$ R1(i, j) = \max\left(\frac{C_{ij}}{C_{jj}}, \frac{C_{ji}}{C_{ii}}\right) $$ **Interpretation:** - $R1 = 0$: Perfect discrimination - $R1 > 1$: More misclassifications than correct classifications (severe confusion) - High R1 indicates clusters should potentially be merged ```{r r1_example, echo=TRUE} library(scClustEval) # Example confusion matrix cmat <- matrix(c( 90, 5, 3, # Cluster A: 90 correct, 5 misclassified as B, 3 as C 8, 85, 7, # Cluster B: 8 misclassified as A, 85 correct, 7 as C 2, 12, 86 # Cluster C: 2 as A, 12 as B, 86 correct ), nrow = 3, byrow = TRUE) rownames(cmat) <- colnames(cmat) <- c("A", "B", "C") # Compute R1 normalization r1_mat <- normalize_confmat_r1(cmat) print(round(r1_mat, 3)) ``` ### R2 Normalization R2 normalization measures **overall confusion proportion** in the dataset: $$ R2(i, j) = \frac{C_{ij} + C_{ji}}{N_{total}} $$ Where $N_{total} = \sum_{i,j} C_{ij}$. **Interpretation:** - R2 represents the fraction of total cells confused between clusters $i$ and $j$ - Values range from 0 to 1 - Useful for identifying globally important confusions ```{r r2_example, echo=TRUE} # Compute R2 normalization r2_mat <- normalize_confmat_r2(cmat) print(round(r2_mat, 4)) ``` ## Visualization of Confusion Analysis ```{r confusion_viz, echo=TRUE, fig.height=5, fig.width=10} library(ggplot2) library(gridExtra) # Create data for visualization df_r1 <- as.data.frame(as.table(r1_mat)) colnames(df_r1) <- c("True", "Predicted", "R1") df_r2 <- as.data.frame(as.table(r2_mat)) colnames(df_r2) <- c("True", "Predicted", "R2") p1 <- ggplot(df_r1, aes(x = Predicted, y = True, fill = R1)) + geom_tile(color = "white") + geom_text(aes(label = sprintf("%.2f", R1)), color = "white", size = 5) + scale_fill_gradient(low = "gray90", high = "#d62728") + labs(title = "R1 Normalization", subtitle = "Pairwise confusion rate") + theme_minimal() + theme(plot.title = element_text(face = "bold")) p2 <- ggplot(df_r2, aes(x = Predicted, y = True, fill = R2)) + geom_tile(color = "white") + geom_text(aes(label = sprintf("%.3f", R2)), color = "white", size = 5) + scale_fill_gradient(low = "gray90", high = "#1f77b4") + labs(title = "R2 Normalization", subtitle = "Global confusion proportion") + theme_minimal() + theme(plot.title = element_text(face = "bold")) grid.arrange(p1, p2, ncol = 2) ``` ## Cluster Merging Strategy ### Graph-Based Approach Clusters exceeding confusion thresholds are connected in an adjacency graph: $$ A_{ij} = \begin{cases} 1 & \text{if } R1(i,j) > \theta_{R1} \text{ or } R2(i,j) > \theta_{R2} \\ 0 & \text{otherwise} \end{cases} $$ ### Community Detection The Louvain algorithm identifies communities (groups of confused clusters) by maximizing modularity: $$ Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j) $$ Where: - $m$ = total number of edges - $k_i$ = degree of node $i$ - $c_i$ = community assignment of node $i$ - $\delta$ = Kronecker delta ```{r merging_demo, echo=TRUE, fig.height=4} # Demonstrate adjacency matrix clustering adj <- matrix(c( 0, 0.3, 0.02, 0.01, 0.3, 0, 0.25, 0.02, 0.02, 0.25, 0, 0.01, 0.01, 0.02, 0.01, 0 ), nrow = 4, byrow = TRUE) rownames(adj) <- colnames(adj) <- c("C1", "C2", "C3", "C4") # Cluster at threshold 0.1 groups <- cluster_adjacency_matrix(adj, cutoff = 0.1, resolution = 1.0) names(groups) <- c("C1", "C2", "C3", "C4") print(groups) # Visualize adj_df <- as.data.frame(as.table(adj)) colnames(adj_df) <- c("From", "To", "Weight") ggplot(adj_df, aes(x = To, y = From, fill = Weight)) + geom_tile(color = "white") + geom_text(aes(label = sprintf("%.2f", Weight)), size = 4) + scale_fill_gradient2(low = "white", mid = "#ffbb78", high = "#d62728", midpoint = 0.15) + geom_hline(yintercept = 2.5, linetype = "dashed", color = "black", size = 1) + geom_vline(xintercept = 2.5, linetype = "dashed", color = "black", size = 1) + labs(title = "Confusion-Based Adjacency Matrix", subtitle = "Dashed lines show detected communities (C1,C2 | C3,C4)", fill = "R1 Value") + theme_minimal() + theme(plot.title = element_text(face = "bold")) ``` ## Iterative Optimization ### Convergence Criteria The optimization continues until: 1. **Target accuracy reached**: $Accuracy \geq \theta_{acc}$ 2. **Maximum rounds exceeded**: $round > max\_rounds$ 3. **Convergence**: No clusters merged in current round ### Cutoff Adaptation Cutoffs are progressively relaxed each outer iteration: $$ \theta_{R1}^{(t+1)} = \theta_{R1}^{(t)} - \Delta_{R1} $$ $$ \theta_{R2}^{(t+1)} = \theta_{R2}^{(t)} - \Delta_{R2} $$ ```{r optimization_flow, echo=FALSE, fig.height=5} # Simulated optimization trajectory rounds <- 1:8 accuracy <- c(0.72, 0.78, 0.83, 0.86, 0.88, 0.91, 0.93, 0.94) clusters <- c(15, 12, 10, 8, 7, 6, 5, 5) df <- data.frame( Round = rounds, Accuracy = accuracy, Clusters = clusters ) p1 <- ggplot(df, aes(x = Round, y = Accuracy)) + geom_line(color = "#3cb44b", size = 1.5) + geom_point(color = "#3cb44b", size = 3) + geom_hline(yintercept = 0.9, linetype = "dashed", color = "red") + annotate("text", x = 7, y = 0.92, label = "Target: 90%", color = "red") + labs(title = "Accuracy Progression", y = "Test Accuracy") + theme_minimal() + theme(plot.title = element_text(face = "bold")) p2 <- ggplot(df, aes(x = Round, y = Clusters)) + geom_line(color = "#e6194b", size = 1.5) + geom_point(color = "#e6194b", size = 3) + labs(title = "Cluster Count", y = "Number of Clusters") + theme_minimal() + theme(plot.title = element_text(face = "bold")) grid.arrange(p1, p2, ncol = 2) ``` ## Classifier Comparison ### Supported Algorithms | Algorithm | Strengths | Best For | |-----------|-----------|----------| | **Logistic Regression (L1)** | Sparse coefficients, interpretable | High-dimensional data, marker identification | | **Random Forest** | Non-linear, feature importance | Complex boundaries | | **SVM (RBF)** | Effective in high dimensions | Well-separated clusters | | **XGBoost** | High accuracy, handles imbalance | Large datasets | ### Mathematical Details **Logistic Regression** (multinomial with L1): $$ P(y=k|x) = \frac{\exp(\beta_k^T x)}{\sum_{j=1}^{K}\exp(\beta_j^T x)} $$ **Random Forest** prediction: $$ \hat{y} = \text{mode}\{h_1(x), h_2(x), ..., h_B(x)\} $$ Where $h_b$ are individual decision trees. ## Performance Metrics ### Per-Cluster Accuracy $$ Accuracy_k = \frac{C_{kk}}{\sum_{j=1}^{K} C_{kj}} $$ ### Overall Accuracy $$ Accuracy = \frac{\sum_{k=1}^{K} C_{kk}}{N_{total}} $$ ## Summary The scClustEval framework provides: 1. **Quantitative Assessment**: Objective metrics for clustering quality 2. **Principled Optimization**: Data-driven cluster merging 3. **Flexibility**: Multiple classifiers and customizable parameters 4. **Interpretability**: Confusion analysis reveals biological relationships ## References 1. Miao, Z., et al. (2020). Putative cell type discovery from single-cell gene expression data. *Nature Methods*, 17, 621-628. 2. Blondel, V. D., et al. (2008). Fast unfolding of communities in large networks. *Journal of Statistical Mechanics*, P10008. 3. Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. *Journal of Statistical Software*, 33(1), 1-22. --- **Author**: Zaoqu Liu (liuzaoqu@163.com) **Package**: scClustEval v`r packageVersion("scClustEval")`