scClustEval implements a self-projection framework for evaluating and optimizing single-cell RNA-seq clustering. This vignette provides a comprehensive explanation of the underlying algorithms and mathematical foundations.
In single-cell RNA-seq analysis, clustering algorithms aim to group cells with similar transcriptomic profiles. However, several challenges arise:
scClustEval addresses these challenges through a quantitative, data-driven approach.
The self-projection method treats cluster evaluation as a classification problem:
“If a clustering is biologically meaningful, a classifier should be able to accurately distinguish cells from different clusters based on their gene expression profiles.”
Given expression matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) with \(n\) cells and \(p\) features, and cluster labels \(\mathbf{y} \in \{1, ..., K\}^n\):
\[ \mathbf{X} = \mathbf{X}_{train} \cup \mathbf{X}_{test} \]
Stratified sampling ensures each cluster \(k\) contributes proportionally:
\[ \frac{n_k^{train}}{n^{train}} \approx \frac{n_k^{test}}{n^{test}} \approx \frac{n_k}{n} \]
We train a multi-class classifier \(f: \mathbb{R}^p \rightarrow \{1, ..., K\}\) that predicts cluster membership. The default is L1-regularized logistic regression:
\[ \min_{\boldsymbol{\beta}} -\sum_{i=1}^{n_{train}} \log P(y_i | \mathbf{x}_i, \boldsymbol{\beta}) + \lambda ||\boldsymbol{\beta}||_1 \]
The confusion matrix \(\mathbf{C} \in \mathbb{R}^{K \times K}\) is computed on the test set:
\[ C_{ij} = |\{x : y_{true}(x) = i \land y_{pred}(x) = j\}| \]
Where: - \(C_{ii}\) (diagonal): correctly classified cells from cluster \(i\) - \(C_{ij}\) (off-diagonal): cells from cluster \(i\) misclassified as cluster \(j\)
R1 normalization measures pairwise confusion rate relative to correct classifications:
\[ R1(i, j) = \max\left(\frac{C_{ij}}{C_{jj}}, \frac{C_{ji}}{C_{ii}}\right) \]
Interpretation: - \(R1 = 0\): Perfect discrimination - \(R1 > 1\): More misclassifications than correct classifications (severe confusion) - High R1 indicates clusters should potentially be merged
library(scClustEval)
# Example confusion matrix
cmat <- matrix(c(
90, 5, 3, # Cluster A: 90 correct, 5 misclassified as B, 3 as C
8, 85, 7, # Cluster B: 8 misclassified as A, 85 correct, 7 as C
2, 12, 86 # Cluster C: 2 as A, 12 as B, 86 correct
), nrow = 3, byrow = TRUE)
rownames(cmat) <- colnames(cmat) <- c("A", "B", "C")
# Compute R1 normalization
r1_mat <- normalize_confmat_r1(cmat)
print(round(r1_mat, 3))R2 normalization measures overall confusion proportion in the dataset:
\[ R2(i, j) = \frac{C_{ij} + C_{ji}}{N_{total}} \]
Where \(N_{total} = \sum_{i,j} C_{ij}\).
Interpretation: - R2 represents the fraction of total cells confused between clusters \(i\) and \(j\) - Values range from 0 to 1 - Useful for identifying globally important confusions
library(ggplot2)
library(gridExtra)
# Create data for visualization
df_r1 <- as.data.frame(as.table(r1_mat))
colnames(df_r1) <- c("True", "Predicted", "R1")
df_r2 <- as.data.frame(as.table(r2_mat))
colnames(df_r2) <- c("True", "Predicted", "R2")
p1 <- ggplot(df_r1, aes(x = Predicted, y = True, fill = R1)) +
geom_tile(color = "white") +
geom_text(aes(label = sprintf("%.2f", R1)), color = "white", size = 5) +
scale_fill_gradient(low = "gray90", high = "#d62728") +
labs(title = "R1 Normalization", subtitle = "Pairwise confusion rate") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
p2 <- ggplot(df_r2, aes(x = Predicted, y = True, fill = R2)) +
geom_tile(color = "white") +
geom_text(aes(label = sprintf("%.3f", R2)), color = "white", size = 5) +
scale_fill_gradient(low = "gray90", high = "#1f77b4") +
labs(title = "R2 Normalization", subtitle = "Global confusion proportion") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
grid.arrange(p1, p2, ncol = 2)Clusters exceeding confusion thresholds are connected in an adjacency graph:
\[ A_{ij} = \begin{cases} 1 & \text{if } R1(i,j) > \theta_{R1} \text{ or } R2(i,j) > \theta_{R2} \\ 0 & \text{otherwise} \end{cases} \]
The Louvain algorithm identifies communities (groups of confused clusters) by maximizing modularity:
\[ Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j) \]
Where: - \(m\) = total number of edges - \(k_i\) = degree of node \(i\) - \(c_i\) = community assignment of node \(i\) - \(\delta\) = Kronecker delta
# Demonstrate adjacency matrix clustering
adj <- matrix(c(
0, 0.3, 0.02, 0.01,
0.3, 0, 0.25, 0.02,
0.02, 0.25, 0, 0.01,
0.01, 0.02, 0.01, 0
), nrow = 4, byrow = TRUE)
rownames(adj) <- colnames(adj) <- c("C1", "C2", "C3", "C4")
# Cluster at threshold 0.1
groups <- cluster_adjacency_matrix(adj, cutoff = 0.1, resolution = 1.0)
names(groups) <- c("C1", "C2", "C3", "C4")
print(groups)
# Visualize
adj_df <- as.data.frame(as.table(adj))
colnames(adj_df) <- c("From", "To", "Weight")
ggplot(adj_df, aes(x = To, y = From, fill = Weight)) +
geom_tile(color = "white") +
geom_text(aes(label = sprintf("%.2f", Weight)), size = 4) +
scale_fill_gradient2(low = "white", mid = "#ffbb78", high = "#d62728", midpoint = 0.15) +
geom_hline(yintercept = 2.5, linetype = "dashed", color = "black", size = 1) +
geom_vline(xintercept = 2.5, linetype = "dashed", color = "black", size = 1) +
labs(title = "Confusion-Based Adjacency Matrix",
subtitle = "Dashed lines show detected communities (C1,C2 | C3,C4)",
fill = "R1 Value") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))The optimization continues until:
Cutoffs are progressively relaxed each outer iteration:
\[ \theta_{R1}^{(t+1)} = \theta_{R1}^{(t)} - \Delta_{R1} \] \[ \theta_{R2}^{(t+1)} = \theta_{R2}^{(t)} - \Delta_{R2} \]
| Algorithm | Strengths | Best For |
|---|---|---|
| Logistic Regression (L1) | Sparse coefficients, interpretable | High-dimensional data, marker identification |
| Random Forest | Non-linear, feature importance | Complex boundaries |
| SVM (RBF) | Effective in high dimensions | Well-separated clusters |
| XGBoost | High accuracy, handles imbalance | Large datasets |
Logistic Regression (multinomial with L1): \[ P(y=k|x) = \frac{\exp(\beta_k^T x)}{\sum_{j=1}^{K}\exp(\beta_j^T x)} \]
Random Forest prediction: \[ \hat{y} = \text{mode}\{h_1(x), h_2(x), ..., h_B(x)\} \]
Where \(h_b\) are individual decision trees.
\[ Accuracy_k = \frac{C_{kk}}{\sum_{j=1}^{K} C_{kj}} \]
\[ Accuracy = \frac{\sum_{k=1}^{K} C_{kk}}{N_{total}} \]
The scClustEval framework provides:
Miao, Z., et al. (2020). Putative cell type discovery from single-cell gene expression data. Nature Methods, 17, 621-628.
Blondel, V. D., et al. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics, P10008.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.
Author: Zaoqu Liu ([email protected])
Package: scClustEval v1.0.0