--- title: "Algorithm and Methodology" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Algorithm and Methodology} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, eval = FALSE ) ``` ## Introduction **recall** (Calibrated Clustering with Artificial Variables) is a statistical framework designed to address the fundamental problem of **over-clustering** in single-cell RNA-sequencing (scRNA-seq) data analysis. ### The Double-Dipping Problem In standard scRNA-seq analysis pipelines: 1. **Clustering**: Cells are grouped based on gene expression patterns 2. **Differential Expression Testing**: Genes are tested for differential expression between clusters This creates a statistical problem known as **double-dipping**: the same data is used both to define groups and to test for differences between them. This leads to: - Inflated test statistics - Overly optimistic p-values - Increased false discovery rates - **Over-clustering**: finding more clusters than biologically meaningful ## Mathematical Framework ### The Knockoff Filter The knockoff filter (Barber & Candès, 2015) is a powerful method for variable selection with FDR control. Given original features $X_1, \ldots, X_p$, we construct **knockoff features** $\tilde{X}_1, \ldots, \tilde{X}_p$ that: 1. Preserve the correlation structure of the original features 2. Are exchangeable with the originals conditional on the response 3. Are known *a priori* to be null (not associated with the response) ### The W Statistic For each gene $j$, we compute the knockoff W statistic: $$W_j = -\log_{10}(p_j^{\text{original}}) - (-\log_{10}(p_j^{\text{knockoff}}))$$ Where: - $p_j^{\text{original}}$ is the p-value for gene $j$ in differential expression testing - $p_j^{\text{knockoff}}$ is the p-value for the corresponding knockoff gene ### FDR-Controlled Selection The knockoff filter selects genes where $W_j \geq T$, where the threshold $T$ is chosen to control FDR at level $q$: $$T = \min\left\{t > 0: \frac{1 + \#\{j: W_j \leq -t\}}{\#\{j: W_j \geq t\}} \leq q\right\}$$ This provides rigorous FDR control without distributional assumptions. ## The recall Algorithm ### Stage 1: Synthetic Null Variable Generation Given the expression matrix $X \in \mathbb{R}^{n \times p}$ (cells × genes), we generate synthetic "knockoff" genes $\tilde{X} \in \mathbb{R}^{n \times p}$ by: 1. **Estimating marginal distributions** for each gene 2. **Sampling** from the estimated distribution to create knockoffs #### Supported Distributions | Method | Description | Use Case | |--------|-------------|----------| | `ZIP` | Zero-Inflated Poisson | Fast, default for most data | | `NB` | Negative Binomial | Overdispersed count data | | `ZIP-copula` | ZIP with Gaussian copula | Captures gene-gene correlations | | `NB-copula` | NB with Gaussian copula | Overdispersed with correlations | ### Stage 2: Joint Analysis The augmented matrix $[X, \tilde{X}] \in \mathbb{R}^{n \times 2p}$ undergoes: 1. **Normalization**: Log-normalization 2. **Feature Selection**: Selecting top variable features from both original and knockoff genes 3. **Dimensionality Reduction**: PCA on the augmented data 4. **Clustering**: Louvain or Leiden algorithm ### Stage 3: Iterative Calibration ``` Algorithm: FindClustersRecall ───────────────────────────────────────── Input: Seurat object, resolution_start, q (target FDR) Output: Calibrated cluster assignments 1. Generate knockoff features 2. resolution ← resolution_start 3. REPEAT: a. Cluster augmented data at current resolution b. FOR each cluster pair (i, j): - Compute differential expression (original vs knockoff) - Apply knockoff filter at FDR = q - IF no genes selected: Mark pair as "not significantly different" BREAK c. IF all pairs significantly different: RETURN cluster assignments d. resolution ← resolution × (1 - reduction_percentage) 4. RETURN cluster assignments ``` ## Zero-Inflated Poisson Estimation ### Maximum Likelihood Estimation For count data $x_1, \ldots, x_n$ following ZIP($\lambda$, $\pi$): $$P(X = k) = \begin{cases} \pi + (1-\pi)e^{-\lambda} & k = 0 \\ (1-\pi)\frac{\lambda^k e^{-\lambda}}{k!} & k > 0 \end{cases}$$ The MLE has a closed-form solution using the Lambert W function: $$\hat{\lambda} = W_0(-\gamma e^{-\gamma}) + \gamma$$ where $\gamma = \bar{x}/(1-r_0)$ and $r_0$ is the observed zero proportion. ### Implementation ```{r zip-estimation, eval=FALSE} library(recall) # Simulate ZIP data set.seed(42) true_lambda <- 2.5 true_pi <- 0.3 data <- rzipoisson(1000, lambda = true_lambda, prop.zero = true_pi) # Estimate parameters params <- estimate_zi_poisson(data) print(params) # $lambda.hat: ~2.5 # $pi.hat: ~0.3 ``` ## Negative Binomial Estimation For overdispersed count data, the negative binomial distribution is parameterized as: $$\text{Var}(X) = \mu + \frac{\mu^2}{\text{size}}$$ The `estimate_negative_binomial()` function uses a cascade of estimation methods: 1. **MASS::fitdistr** (Nelder-Mead optimization) 2. **fitdistrplus::fitdist** (MLE) 3. **Method of Moments** 4. **Quantile Matching** 5. **Maximum Goodness-of-fit** ## Copula Models When gene-gene correlations are important, copula models preserve the dependency structure: ### Gaussian Copula 1. **Fit marginal distributions** to each gene 2. **Transform** to uniform marginals via probability integral transform 3. **Fit Gaussian copula** to capture dependencies 4. **Simulate** by sampling from copula and transforming back ```{r copula-method, eval=FALSE} # Use copula-based knockoffs for better correlation modeling pbmc <- FindClustersRecall( pbmc, null_method = "NB-copula", # Negative Binomial with Gaussian copula cores = 4 ) ``` ## Theoretical Guarantees ### FDR Control Under the knockoff framework, **recall** provides finite-sample FDR control: $$\text{FDR} = \mathbb{E}\left[\frac{V}{R \vee 1}\right] \leq q$$ where $V$ is the number of false discoveries and $R$ is the total number of discoveries. ### Comparison with Count Splitting | Method | Mechanism | Advantages | Disadvantages | |--------|-----------|------------|---------------| | **recall** | Knockoff filter | FDR control, uses full data | Requires distribution assumptions | | **Count Splitting** | Data splitting | Distribution-free | Loses statistical power | ## References 1. Barber, R. F., & Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. *The Annals of Statistics*, 43(5), 2055-2085. 2. DenAdel, A., et al. (2025). A knockoff calibration method to avoid over-clustering in single-cell RNA-sequencing. *American Journal of Human Genetics*. 3. Neufeld, A., et al. (2022). Inference after latent variable estimation for single-cell RNA sequencing data. *Biostatistics*. 4. Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. *Technometrics*, 34(1), 1-14.