---
title: "Algorithm and Methodology"
author: 
  - name: "Zaoqu Liu"
    email: "liuzaoqu@163.com"
    affiliation: "Department of Interventional Radiology, The First Affiliated Hospital of Zhengzhou University"
    orcid: "0000-0002-0452-742X"
  - name: "Aimin Xie"
    email: "aiminyy1993@gmail.com"
    affiliation: "Original Author"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Algorithm and Methodology}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  warning = FALSE,
  message = FALSE
)
```

# Overview

scPAS identifies phenotype-associated cell subpopulations through a multi-step computational pipeline that integrates bulk and single-cell RNA-seq data.

```{r workflow-diagram, echo=FALSE, fig.cap="scPAS Workflow Overview", fig.width=8, fig.height=6}
library(ggplot2)

# Create a simple workflow diagram
workflow_data <- data.frame(
  step = factor(c("1. Input Data", "2. Preprocessing", "3. Network Construction", 
                  "4. Model Training", "5. Risk Scoring", "6. Significance Testing"),
                levels = c("1. Input Data", "2. Preprocessing", "3. Network Construction", 
                          "4. Model Training", "5. Risk Scoring", "6. Significance Testing")),
  y = 6:1,
  description = c(
    "Bulk RNA-seq + Phenotype + scRNA-seq",
    "Quantile normalization + Variable genes",
    "Gene-gene similarity network (SNN)",
    "Network-regularized sparse regression",
    "Risk score calculation per cell",
    "Permutation test + FDR correction"
  )
)

ggplot(workflow_data, aes(x = 1, y = y)) +
  geom_tile(aes(fill = step), width = 0.8, height = 0.8, alpha = 0.8) +
  geom_text(aes(label = step), fontface = "bold", size = 4, hjust = 0.5, vjust = -0.5) +
  geom_text(aes(label = description), size = 3, hjust = 0.5, vjust = 1.5, color = "gray30") +
  scale_fill_brewer(palette = "Blues") +
  theme_void() +
  theme(legend.position = "none") +
  coord_cartesian(xlim = c(0.3, 1.7))
```

# Mathematical Framework

## Problem Formulation

Given:

- **Bulk expression matrix** $\mathbf{X} \in \mathbb{R}^{n \times p}$ (n samples × p genes)
- **Phenotype vector** $\mathbf{y}$ (continuous, binary, or survival)
- **Single-cell expression matrix** $\mathbf{S} \in \mathbb{R}^{m \times p}$ (m cells × p genes)

The goal is to find gene weights $\boldsymbol{\beta}$ that associate gene expression with phenotype, then apply these weights to single-cell data to compute per-cell risk scores.

## Network-Regularized Sparse Regression

scPAS uses the **APML0** (Augmented and Penalized Minimization L0) algorithm with network regularization:

$$
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ L(\boldsymbol{\beta}; \mathbf{X}, \mathbf{y}) + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \boldsymbol{\beta}^T \mathbf{L} \boldsymbol{\beta} \right\}
$$

Where:

- $L(\boldsymbol{\beta})$ is the loss function (depends on phenotype type)
- $\lambda_1 \|\boldsymbol{\beta}\|_1$ is the L1 penalty (LASSO) for sparsity
- $\lambda_2 \boldsymbol{\beta}^T \mathbf{L} \boldsymbol{\beta}$ is the Laplacian penalty for network regularization
- $\mathbf{L}$ is the Laplacian matrix of the gene-gene network

```{r penalty-illustration, echo=FALSE, fig.cap="Effect of Network Regularization", fig.width=8, fig.height=4}
set.seed(123)

# Illustrate different penalty effects
x_seq <- seq(-3, 3, length.out = 100)

penalties <- data.frame(
  x = rep(x_seq, 3),
  penalty = c(
    abs(x_seq),                    # L1
    x_seq^2,                       # L2
    abs(x_seq) + 0.3 * x_seq^2     # Elastic Net
  ),
  type = rep(c("L1 (LASSO)", "L2 (Ridge)", "Elastic Net"), each = 100)
)

ggplot(penalties, aes(x = x, y = penalty, color = type)) +
  geom_line(size = 1.2) +
  labs(
    x = expression(beta),
    y = "Penalty",
    title = "Penalty Functions in Regularized Regression",
    color = "Penalty Type"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(hjust = 0.5, face = "bold")
  ) +
  scale_color_brewer(palette = "Set1")
```

## Loss Functions by Phenotype Type

### Gaussian Family (Continuous)

For continuous phenotypes, we minimize the squared error:

$$
L(\boldsymbol{\beta}) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2
$$

### Binomial Family (Binary)

For binary outcomes, we use logistic regression:

$$
L(\boldsymbol{\beta}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]
$$

where $p_i = \frac{1}{1 + e^{-\mathbf{x}_i^T \boldsymbol{\beta}}}$

### Cox Family (Survival)

For time-to-event data, we minimize the negative partial log-likelihood:

$$
L(\boldsymbol{\beta}) = -\frac{1}{n} \sum_{i: \delta_i = 1} \left[ \mathbf{x}_i^T \boldsymbol{\beta} - \log \sum_{j \in R(t_i)} e^{\mathbf{x}_j^T \boldsymbol{\beta}} \right]
$$

where $R(t_i)$ is the risk set at time $t_i$ and $\delta_i$ is the event indicator.

# Gene Network Construction

## Shared Nearest Neighbor (SNN) Network

scPAS constructs a gene-gene similarity network from single-cell data using the SNN algorithm:

```{r snn-algorithm, echo=FALSE, fig.cap="SNN Network Construction", fig.width=7, fig.height=5}
set.seed(42)
library(Matrix)

# Simulate gene correlations
n_genes <- 50
cor_matrix <- matrix(runif(n_genes^2, -0.3, 0.8), n_genes, n_genes)
cor_matrix <- (cor_matrix + t(cor_matrix)) / 2
diag(cor_matrix) <- 1

# Threshold to create adjacency
adj_matrix <- cor_matrix
adj_matrix[adj_matrix < 0.5] <- 0
adj_matrix[adj_matrix >= 0.5] <- 1
diag(adj_matrix) <- 0

# Visualize
image(
  1:n_genes, 1:n_genes, adj_matrix,
  col = c("white", "steelblue"),
  xlab = "Gene Index", ylab = "Gene Index",
  main = "Gene-Gene Network Adjacency Matrix"
)
```

The network construction process:

1. **Calculate gene correlations** from single-cell expression
2. **Find k-nearest neighbors** for each gene
3. **Compute SNN similarity** based on shared neighbors
4. **Threshold** to create binary adjacency matrix

# Risk Score Calculation

## Per-Cell Risk Score

Once the model is trained, the risk score for each cell is computed as:

$$
RS_j = \sum_{g=1}^{p} \hat{\beta}_g \cdot \tilde{S}_{jg}
$$

where:

- $RS_j$ is the risk score for cell $j$
- $\hat{\beta}_g$ is the learned coefficient for gene $g$
- $\tilde{S}_{jg}$ is the standardized expression of gene $g$ in cell $j$

```{r risk-score-illustration, echo=FALSE, fig.cap="Risk Score Distribution", fig.width=7, fig.height=4}
set.seed(42)

# Simulate risk scores
n_cells <- 500
risk_scores <- c(
  rnorm(150, mean = -1.5, sd = 0.8),   # Low risk (scPAS-)
  rnorm(200, mean = 0, sd = 0.6),       # Neutral
  rnorm(150, mean = 1.5, sd = 0.8)      # High risk (scPAS+)
)

risk_df <- data.frame(
  score = risk_scores,
  category = factor(
    c(rep("scPAS-", 150), rep("Neutral", 200), rep("scPAS+", 150)),
    levels = c("scPAS-", "Neutral", "scPAS+")
  )
)

ggplot(risk_df, aes(x = score, fill = category)) +
  geom_histogram(bins = 40, alpha = 0.7, position = "identity") +
  scale_fill_manual(values = c("scPAS-" = "royalblue", "Neutral" = "gray70", "scPAS+" = "indianred")) +
  labs(
    x = "Normalized Risk Score",
    y = "Cell Count",
    title = "Distribution of Cell Risk Scores",
    fill = "Classification"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(hjust = 0.5, face = "bold")
  ) +
  geom_vline(xintercept = c(-1.96, 1.96), linetype = "dashed", color = "gray40")
```

## Normalized Risk Score

The raw risk score is converted to a Z-statistic:

$$
NRS_j = \frac{RS_j - \mu_{bg}}{\sigma_{bg}}
$$

where $\mu_{bg}$ and $\sigma_{bg}$ are the mean and standard deviation of the background distribution estimated from permutation.

# Statistical Significance Testing

## Permutation Test

To assess significance, scPAS performs a permutation test:

```{r permutation-illustration, echo=FALSE, fig.cap="Permutation Test Principle", fig.width=8, fig.height=4}
set.seed(42)

# Simulate observed vs permuted
observed <- 2.5
permuted <- rnorm(1000, mean = 0, sd = 1)

perm_df <- data.frame(value = permuted)

p_value <- mean(abs(permuted) >= abs(observed))

ggplot(perm_df, aes(x = value)) +
  geom_histogram(bins = 50, fill = "gray70", color = "white", alpha = 0.8) +
  geom_vline(xintercept = observed, color = "red", size = 1.5, linetype = "solid") +
  geom_vline(xintercept = -observed, color = "red", size = 1.5, linetype = "dashed") +
  annotate("text", x = observed + 0.3, y = 80, label = "Observed", color = "red", hjust = 0) +
  labs(
    x = "Permuted Risk Score",
    y = "Frequency",
    title = paste0("Permutation Distribution (Two-tailed P = ", round(p_value, 4), ")")
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
```

**Algorithm:**

1. For each permutation $b = 1, \ldots, B$:
   - Randomly shuffle the gene coefficients $\boldsymbol{\beta}$
   - Calculate permuted risk scores for all cells
2. Compute two-tailed P-value:

$$
p_j = \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}(|RS_j^{(b)}| \geq |RS_j|)
$$

## FDR Correction

Multiple testing correction using Benjamini-Hochberg procedure:

$$
FDR_j = \min\left(1, \frac{p_j \cdot m}{\text{rank}(p_j)}\right)
$$

where $m$ is the total number of cells.

# Cell Classification

Cells are classified based on:

1. **Statistical significance**: FDR < threshold (default 0.05)
2. **Direction of association**: Sign of normalized risk score

| Category | Criteria |
|----------|----------|
| scPAS+ | FDR < 0.05 AND NRS > 0 |
| scPAS- | FDR < 0.05 AND NRS < 0 |
| 0 | FDR ≥ 0.05 |

```{r classification-illustration, echo=FALSE, fig.cap="Cell Classification Scheme", fig.width=7, fig.height=5}
set.seed(42)

# Simulate NRS vs -log10(FDR)
n_points <- 500
nrs <- rnorm(n_points, 0, 1.5)
fdr <- 10^(-abs(nrs) - runif(n_points, 0, 2))
fdr <- pmin(fdr, 1)

class_df <- data.frame(
  NRS = nrs,
  neg_log_FDR = -log10(fdr),
  Classification = ifelse(
    fdr >= 0.05, "0",
    ifelse(nrs > 0, "scPAS+", "scPAS-")
  )
)
class_df$Classification <- factor(class_df$Classification, levels = c("scPAS-", "0", "scPAS+"))

ggplot(class_df, aes(x = NRS, y = neg_log_FDR, color = Classification)) +
  geom_point(alpha = 0.6, size = 1.5) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray40") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray40") +
  scale_color_manual(values = c("scPAS-" = "royalblue", "0" = "gray70", "scPAS+" = "indianred")) +
  annotate("text", x = 3, y = -log10(0.05) + 0.3, label = "FDR = 0.05", color = "gray40") +
  labs(
    x = "Normalized Risk Score (NRS)",
    y = "-log10(FDR)",
    title = "Cell Classification Based on NRS and FDR"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(hjust = 0.5, face = "bold")
  )
```

# Implementation Details

## Sparse Matrix Operations

scPAS uses efficient sparse matrix operations for large-scale single-cell data:

```{r sparse-ops, eval=FALSE}
# Efficient correlation calculation for sparse matrices
sparse.cor <- function(x) {
  # Uses optimized algorithm that avoids dense conversion
  # Handles numerical precision issues
  # Returns proper correlation matrix
}

# Efficient row scaling
sparse_row_scale <- function(x, center = TRUE, scale = TRUE) {
  # Row-wise standardization
  # Preserves sparsity when only scaling (not centering)
}
```

## Parallel Computing

For large permutation counts, scPAS supports parallel processing:

```{r parallel-example, eval=FALSE}
result <- scPAS(
  bulk_dataset = bulk_data,
  sc_dataset = sc_obj,
  phenotype = phenotype,
  permutation_times = 5000,
  n_cores = 4  # Use 4 CPU cores
)
```

# References

1. **Original scPAS Paper**: Xie A, et al. (2024). scPAS: single-cell phenotype-associated subpopulation identifier. *Briefings in Bioinformatics*, 26(1):bbae655.

2. **Network-Regularized Regression**: Zou H, Hastie T. (2005). Regularization and variable selection via the elastic net. *JRSS-B*, 67(2):301-320.

3. **Permutation Testing**: Westfall PH, Young SS. (1993). *Resampling-Based Multiple Testing*. Wiley.

# Session Information

```{r session-info}
sessionInfo()
```