Algorithm Principles and Mathematical Foundation

Overview

scClustEval implements a self-projection framework for evaluating and optimizing single-cell RNA-seq clustering. This vignette provides a comprehensive explanation of the underlying algorithms and mathematical foundations.

The Challenge of Cell Clustering Evaluation

In single-cell RNA-seq analysis, clustering algorithms aim to group cells with similar transcriptomic profiles. However, several challenges arise:

  1. Parameter Sensitivity: Clustering results are highly sensitive to resolution parameters
  2. Over-clustering: High resolution may split biologically homogeneous populations
  3. Under-clustering: Low resolution may merge distinct cell types
  4. Lack of Ground Truth: True cell type labels are rarely available

scClustEval addresses these challenges through a quantitative, data-driven approach.

Self-Projection Framework

Core Concept

The self-projection method treats cluster evaluation as a classification problem:

“If a clustering is biologically meaningful, a classifier should be able to accurately distinguish cells from different clusters based on their gene expression profiles.”

Algorithm Steps

Mathematical Formulation

1. Data Partitioning

Given expression matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) with \(n\) cells and \(p\) features, and cluster labels \(\mathbf{y} \in \{1, ..., K\}^n\):

\[ \mathbf{X} = \mathbf{X}_{train} \cup \mathbf{X}_{test} \]

Stratified sampling ensures each cluster \(k\) contributes proportionally:

\[ \frac{n_k^{train}}{n^{train}} \approx \frac{n_k^{test}}{n^{test}} \approx \frac{n_k}{n} \]

2. Classification Model

We train a multi-class classifier \(f: \mathbb{R}^p \rightarrow \{1, ..., K\}\) that predicts cluster membership. The default is L1-regularized logistic regression:

\[ \min_{\boldsymbol{\beta}} -\sum_{i=1}^{n_{train}} \log P(y_i | \mathbf{x}_i, \boldsymbol{\beta}) + \lambda ||\boldsymbol{\beta}||_1 \]

3. Confusion Matrix

The confusion matrix \(\mathbf{C} \in \mathbb{R}^{K \times K}\) is computed on the test set:

\[ C_{ij} = |\{x : y_{true}(x) = i \land y_{pred}(x) = j\}| \]

Where: - \(C_{ii}\) (diagonal): correctly classified cells from cluster \(i\) - \(C_{ij}\) (off-diagonal): cells from cluster \(i\) misclassified as cluster \(j\)

Confusion Matrix Normalization

R1 Normalization

R1 normalization measures pairwise confusion rate relative to correct classifications:

\[ R1(i, j) = \max\left(\frac{C_{ij}}{C_{jj}}, \frac{C_{ji}}{C_{ii}}\right) \]

Interpretation: - \(R1 = 0\): Perfect discrimination - \(R1 > 1\): More misclassifications than correct classifications (severe confusion) - High R1 indicates clusters should potentially be merged

library(scClustEval)

# Example confusion matrix
cmat <- matrix(c(
  90, 5, 3,    # Cluster A: 90 correct, 5 misclassified as B, 3 as C
  8, 85, 7,   # Cluster B: 8 misclassified as A, 85 correct, 7 as C  
  2, 12, 86   # Cluster C: 2 as A, 12 as B, 86 correct
), nrow = 3, byrow = TRUE)
rownames(cmat) <- colnames(cmat) <- c("A", "B", "C")

# Compute R1 normalization
r1_mat <- normalize_confmat_r1(cmat)
print(round(r1_mat, 3))

R2 Normalization

R2 normalization measures overall confusion proportion in the dataset:

\[ R2(i, j) = \frac{C_{ij} + C_{ji}}{N_{total}} \]

Where \(N_{total} = \sum_{i,j} C_{ij}\).

Interpretation: - R2 represents the fraction of total cells confused between clusters \(i\) and \(j\) - Values range from 0 to 1 - Useful for identifying globally important confusions

# Compute R2 normalization
r2_mat <- normalize_confmat_r2(cmat)
print(round(r2_mat, 4))

Visualization of Confusion Analysis

library(ggplot2)
library(gridExtra)

# Create data for visualization
df_r1 <- as.data.frame(as.table(r1_mat))
colnames(df_r1) <- c("True", "Predicted", "R1")

df_r2 <- as.data.frame(as.table(r2_mat))
colnames(df_r2) <- c("True", "Predicted", "R2")

p1 <- ggplot(df_r1, aes(x = Predicted, y = True, fill = R1)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.2f", R1)), color = "white", size = 5) +
  scale_fill_gradient(low = "gray90", high = "#d62728") +
  labs(title = "R1 Normalization", subtitle = "Pairwise confusion rate") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

p2 <- ggplot(df_r2, aes(x = Predicted, y = True, fill = R2)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.3f", R2)), color = "white", size = 5) +
  scale_fill_gradient(low = "gray90", high = "#1f77b4") +
  labs(title = "R2 Normalization", subtitle = "Global confusion proportion") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

grid.arrange(p1, p2, ncol = 2)

Cluster Merging Strategy

Graph-Based Approach

Clusters exceeding confusion thresholds are connected in an adjacency graph:

\[ A_{ij} = \begin{cases} 1 & \text{if } R1(i,j) > \theta_{R1} \text{ or } R2(i,j) > \theta_{R2} \\ 0 & \text{otherwise} \end{cases} \]

Community Detection

The Louvain algorithm identifies communities (groups of confused clusters) by maximizing modularity:

\[ Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j) \]

Where: - \(m\) = total number of edges - \(k_i\) = degree of node \(i\) - \(c_i\) = community assignment of node \(i\) - \(\delta\) = Kronecker delta

# Demonstrate adjacency matrix clustering
adj <- matrix(c(
  0, 0.3, 0.02, 0.01,
  0.3, 0, 0.25, 0.02,
  0.02, 0.25, 0, 0.01,
  0.01, 0.02, 0.01, 0
), nrow = 4, byrow = TRUE)
rownames(adj) <- colnames(adj) <- c("C1", "C2", "C3", "C4")

# Cluster at threshold 0.1
groups <- cluster_adjacency_matrix(adj, cutoff = 0.1, resolution = 1.0)
names(groups) <- c("C1", "C2", "C3", "C4")
print(groups)

# Visualize
adj_df <- as.data.frame(as.table(adj))
colnames(adj_df) <- c("From", "To", "Weight")

ggplot(adj_df, aes(x = To, y = From, fill = Weight)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.2f", Weight)), size = 4) +
  scale_fill_gradient2(low = "white", mid = "#ffbb78", high = "#d62728", midpoint = 0.15) +
  geom_hline(yintercept = 2.5, linetype = "dashed", color = "black", size = 1) +
  geom_vline(xintercept = 2.5, linetype = "dashed", color = "black", size = 1) +
  labs(title = "Confusion-Based Adjacency Matrix",
       subtitle = "Dashed lines show detected communities (C1,C2 | C3,C4)",
       fill = "R1 Value") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

Iterative Optimization

Convergence Criteria

The optimization continues until:

  1. Target accuracy reached: \(Accuracy \geq \theta_{acc}\)
  2. Maximum rounds exceeded: \(round > max\_rounds\)
  3. Convergence: No clusters merged in current round

Cutoff Adaptation

Cutoffs are progressively relaxed each outer iteration:

\[ \theta_{R1}^{(t+1)} = \theta_{R1}^{(t)} - \Delta_{R1} \] \[ \theta_{R2}^{(t+1)} = \theta_{R2}^{(t)} - \Delta_{R2} \]

Classifier Comparison

Supported Algorithms

Algorithm Strengths Best For
Logistic Regression (L1) Sparse coefficients, interpretable High-dimensional data, marker identification
Random Forest Non-linear, feature importance Complex boundaries
SVM (RBF) Effective in high dimensions Well-separated clusters
XGBoost High accuracy, handles imbalance Large datasets

Mathematical Details

Logistic Regression (multinomial with L1): \[ P(y=k|x) = \frac{\exp(\beta_k^T x)}{\sum_{j=1}^{K}\exp(\beta_j^T x)} \]

Random Forest prediction: \[ \hat{y} = \text{mode}\{h_1(x), h_2(x), ..., h_B(x)\} \]

Where \(h_b\) are individual decision trees.

Performance Metrics

Per-Cluster Accuracy

\[ Accuracy_k = \frac{C_{kk}}{\sum_{j=1}^{K} C_{kj}} \]

Overall Accuracy

\[ Accuracy = \frac{\sum_{k=1}^{K} C_{kk}}{N_{total}} \]

Summary

The scClustEval framework provides:

  1. Quantitative Assessment: Objective metrics for clustering quality
  2. Principled Optimization: Data-driven cluster merging
  3. Flexibility: Multiple classifiers and customizable parameters
  4. Interpretability: Confusion analysis reveals biological relationships

References

  1. Miao, Z., et al. (2020). Putative cell type discovery from single-cell gene expression data. Nature Methods, 17, 621-628.

  2. Blondel, V. D., et al. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics, P10008.

  3. Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.


Author: Zaoqu Liu ()
Package: scClustEval v1.0.0