---
title: "Algorithm and Methodology"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Algorithm and Methodology}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  eval = FALSE
)
```

## Introduction

**recall** (Calibrated Clustering with Artificial Variables) is a statistical framework designed to address the fundamental problem of **over-clustering** in single-cell RNA-sequencing (scRNA-seq) data analysis.

### The Double-Dipping Problem

In standard scRNA-seq analysis pipelines:

1. **Clustering**: Cells are grouped based on gene expression patterns
2. **Differential Expression Testing**: Genes are tested for differential expression between clusters

This creates a statistical problem known as **double-dipping**: the same data is used both to define groups and to test for differences between them. This leads to:

- Inflated test statistics
- Overly optimistic p-values
- Increased false discovery rates
- **Over-clustering**: finding more clusters than biologically meaningful

## Mathematical Framework

### The Knockoff Filter

The knockoff filter (Barber & Candès, 2015) is a powerful method for variable selection with FDR control. Given original features $X_1, \ldots, X_p$, we construct **knockoff features** $\tilde{X}_1, \ldots, \tilde{X}_p$ that:

1. Preserve the correlation structure of the original features
2. Are exchangeable with the originals conditional on the response
3. Are known *a priori* to be null (not associated with the response)

### The W Statistic

For each gene $j$, we compute the knockoff W statistic:

$$W_j = -\log_{10}(p_j^{\text{original}}) - (-\log_{10}(p_j^{\text{knockoff}}))$$

Where:
- $p_j^{\text{original}}$ is the p-value for gene $j$ in differential expression testing
- $p_j^{\text{knockoff}}$ is the p-value for the corresponding knockoff gene

### FDR-Controlled Selection

The knockoff filter selects genes where $W_j \geq T$, where the threshold $T$ is chosen to control FDR at level $q$:

$$T = \min\left\{t > 0: \frac{1 + \#\{j: W_j \leq -t\}}{\#\{j: W_j \geq t\}} \leq q\right\}$$

This provides rigorous FDR control without distributional assumptions.

## The recall Algorithm

### Stage 1: Synthetic Null Variable Generation

Given the expression matrix $X \in \mathbb{R}^{n \times p}$ (cells × genes), we generate synthetic "knockoff" genes $\tilde{X} \in \mathbb{R}^{n \times p}$ by:

1. **Estimating marginal distributions** for each gene
2. **Sampling** from the estimated distribution to create knockoffs

#### Supported Distributions

| Method | Description | Use Case |
|--------|-------------|----------|
| `ZIP` | Zero-Inflated Poisson | Fast, default for most data |
| `NB` | Negative Binomial | Overdispersed count data |
| `ZIP-copula` | ZIP with Gaussian copula | Captures gene-gene correlations |
| `NB-copula` | NB with Gaussian copula | Overdispersed with correlations |

### Stage 2: Joint Analysis

The augmented matrix $[X, \tilde{X}] \in \mathbb{R}^{n \times 2p}$ undergoes:

1. **Normalization**: Log-normalization
2. **Feature Selection**: Selecting top variable features from both original and knockoff genes
3. **Dimensionality Reduction**: PCA on the augmented data
4. **Clustering**: Louvain or Leiden algorithm

### Stage 3: Iterative Calibration

```
Algorithm: FindClustersRecall
─────────────────────────────────────────
Input: Seurat object, resolution_start, q (target FDR)
Output: Calibrated cluster assignments

1. Generate knockoff features
2. resolution ← resolution_start
3. REPEAT:
   a. Cluster augmented data at current resolution
   b. FOR each cluster pair (i, j):
      - Compute differential expression (original vs knockoff)
      - Apply knockoff filter at FDR = q
      - IF no genes selected:
           Mark pair as "not significantly different"
           BREAK
   c. IF all pairs significantly different:
      RETURN cluster assignments
   d. resolution ← resolution × (1 - reduction_percentage)
4. RETURN cluster assignments
```

## Zero-Inflated Poisson Estimation

### Maximum Likelihood Estimation

For count data $x_1, \ldots, x_n$ following ZIP($\lambda$, $\pi$):

$$P(X = k) = \begin{cases}
\pi + (1-\pi)e^{-\lambda} & k = 0 \\
(1-\pi)\frac{\lambda^k e^{-\lambda}}{k!} & k > 0
\end{cases}$$

The MLE has a closed-form solution using the Lambert W function:

$$\hat{\lambda} = W_0(-\gamma e^{-\gamma}) + \gamma$$

where $\gamma = \bar{x}/(1-r_0)$ and $r_0$ is the observed zero proportion.

### Implementation

```{r zip-estimation, eval=FALSE}
library(recall)

# Simulate ZIP data
set.seed(42)
true_lambda <- 2.5
true_pi <- 0.3

data <- rzipoisson(1000, lambda = true_lambda, prop.zero = true_pi)

# Estimate parameters
params <- estimate_zi_poisson(data)
print(params)
# $lambda.hat: ~2.5
# $pi.hat: ~0.3
```

## Negative Binomial Estimation

For overdispersed count data, the negative binomial distribution is parameterized as:

$$\text{Var}(X) = \mu + \frac{\mu^2}{\text{size}}$$

The `estimate_negative_binomial()` function uses a cascade of estimation methods:

1. **MASS::fitdistr** (Nelder-Mead optimization)
2. **fitdistrplus::fitdist** (MLE)
3. **Method of Moments**
4. **Quantile Matching**
5. **Maximum Goodness-of-fit**

## Copula Models

When gene-gene correlations are important, copula models preserve the dependency structure:

### Gaussian Copula

1. **Fit marginal distributions** to each gene
2. **Transform** to uniform marginals via probability integral transform
3. **Fit Gaussian copula** to capture dependencies
4. **Simulate** by sampling from copula and transforming back

```{r copula-method, eval=FALSE}
# Use copula-based knockoffs for better correlation modeling
pbmc <- FindClustersRecall(
  pbmc,
  null_method = "NB-copula",  # Negative Binomial with Gaussian copula
  cores = 4
)
```

## Theoretical Guarantees

### FDR Control

Under the knockoff framework, **recall** provides finite-sample FDR control:

$$\text{FDR} = \mathbb{E}\left[\frac{V}{R \vee 1}\right] \leq q$$

where $V$ is the number of false discoveries and $R$ is the total number of discoveries.

### Comparison with Count Splitting

| Method | Mechanism | Advantages | Disadvantages |
|--------|-----------|------------|---------------|
| **recall** | Knockoff filter | FDR control, uses full data | Requires distribution assumptions |
| **Count Splitting** | Data splitting | Distribution-free | Loses statistical power |

## References

1. Barber, R. F., & Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. *The Annals of Statistics*, 43(5), 2055-2085.

2. DenAdel, A., et al. (2025). A knockoff calibration method to avoid over-clustering in single-cell RNA-sequencing. *American Journal of Human Genetics*.

3. Neufeld, A., et al. (2022). Inference after latent variable estimation for single-cell RNA sequencing data. *Biostatistics*.

4. Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. *Technometrics*, 34(1), 1-14.