---
title: "SecAct Algorithm: Ridge Regression with Permutation Testing"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{SecAct Algorithm: Ridge Regression with Permutation Testing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
library(SecAct)
```

## Overview

**SecAct** (Secreted protein Activity inference) is a computational framework for inferring the signaling activity of over 1,000 secreted proteins from gene expression profiles. This vignette provides a detailed explanation of the mathematical algorithm underlying SecAct.

## Mathematical Framework

### Problem Formulation

Given:
- **Y**: Gene expression matrix (genes × samples)
- **X**: Signature matrix (genes × secreted proteins)

We aim to infer the activity matrix **β** (secreted proteins × samples) that best explains the observed gene expression changes.

### Ridge Regression Model

SecAct employs **ridge regression** to solve this inverse problem:

$$\hat{\beta} = (X^TX + \lambda I)^{-1} X^T Y$$

Where:
- $\hat{\beta}$: Estimated activity coefficients
- $\lambda$: Regularization parameter (default: 5×10⁵)
- $I$: Identity matrix

The ridge penalty $\lambda I$ provides:
1. **Numerical stability** when $X^TX$ is ill-conditioned
2. **Regularization** to prevent overfitting
3. **Unique solution** even when features > samples

### Implementation via Cholesky Decomposition

For computational efficiency, SecAct uses **Cholesky decomposition**:

```
A = X'X + λI    (symmetric positive definite)
R = chol(A)     (A = R'R)
β = R⁻¹ (R')⁻¹ X' Y
```

This approach is:
- **Numerically stable**: Exploits the SPD structure
- **Computationally efficient**: O(n³/3) vs O(n³) for general inverse
- **Memory efficient**: Only stores upper triangular R

## Statistical Significance

### Permutation Testing

SecAct assesses statistical significance through **permutation testing**:

1. **Null hypothesis**: No association between gene expression and secreted protein activity
2. **Procedure**: 
   - Randomly permute sample labels in Y
   - Recompute β for each permutation
   - Build null distribution of coefficients

```{r permutation-concept, echo=FALSE, fig.cap="Permutation testing procedure"}
set.seed(42)
# Simulate permutation null distribution
null_dist <- rnorm(1000, 0, 1)
observed <- 2.5

par(mar = c(4, 4, 2, 1))
hist(null_dist, breaks = 50, col = "lightblue", border = "white",
     main = "Permutation Null Distribution", xlab = "β coefficient",
     xlim = c(-4, 4))
abline(v = observed, col = "red", lwd = 2, lty = 2)
text(observed + 0.3, 50, "Observed", col = "red", pos = 4)

# Calculate p-value
pval <- mean(abs(null_dist) >= abs(observed))
text(-3, 70, paste("p-value =", round(pval, 4)), col = "darkblue", cex = 1.2)
```

### Z-score Calculation

The **z-score** quantifies how many standard deviations the observed coefficient deviates from the null:

$$z = \frac{\hat{\beta} - \mu_{null}}{\sigma_{null}}$$

Where:
- $\mu_{null}$: Mean of permutation coefficients
- $\sigma_{null}$: Standard deviation of permutation coefficients

### P-value Computation

The **empirical p-value** is calculated as:

$$p = \frac{\sum_{i=1}^{n_{rand}} I(|\beta_i^{rand}| \geq |\hat{\beta}|) + 1}{n_{rand} + 1}$$

The "+1" correction ensures p-values are never exactly zero.

## Signature Matrix Construction

### SecAct Signature Database

The SecAct signature matrix contains:
- **1,170** secreted proteins
- **7,919** downstream target genes
- Curated from published literature and databases

```{r signature-stats, echo=TRUE, eval=FALSE}
library(SecAct)

# Download signature matrix (first time only)
SecAct.download.data()

# Load signature matrix
sig_path <- SecAct.get.sigmat.path()
sig_mat <- read.table(gzfile(sig_path), sep = "\t", header = TRUE, row.names = 1, nrows = 100)

cat("Signature matrix dimensions:\n")
cat("  Genes:", nrow(sig_mat), "(showing first 100)\n")
cat("  Secreted proteins:", ncol(sig_mat), "\n")
```

### Signature Grouping

To reduce multicollinearity, SecAct groups highly correlated signatures:

```{r grouping-demo, echo=TRUE, fig.cap="Hierarchical clustering of signature correlation"}
# Demonstrate signature grouping concept
set.seed(123)
# Simulate correlation matrix for 20 signatures
n_sig <- 20
cor_mat <- matrix(runif(n_sig^2, 0.2, 0.9), n_sig, n_sig)
cor_mat <- (cor_mat + t(cor_mat)) / 2
diag(cor_mat) <- 1
rownames(cor_mat) <- colnames(cor_mat) <- paste0("SP", 1:n_sig)

# Hierarchical clustering
hc <- hclust(as.dist(1 - cor_mat), method = "complete")
plot(hc, main = "Signature Grouping by Correlation", xlab = "", sub = "")
abline(h = 0.1, col = "red", lty = 2)
text(15, 0.15, "Correlation threshold = 0.9", col = "red")
```

## Algorithm Comparison: R vs GSL

SecAct provides two implementations:

| Feature | `SecAct.inference.r` | `SecAct.inference.gsl` |
|---------|----------------------|------------------------|
| Language | Pure R | C with GSL |
| Speed | Moderate | Fast |
| Platform | All | Unix/macOS |
| Precision | 64-bit | 64-bit |

```{r compare-implementations, echo=TRUE, message=FALSE, eval=FALSE}
# Compare R and GSL implementations
data_path <- system.file("extdata/GSE100093.IFNG.expr.gz", package = "SecAct")
Y <- read.table(gzfile(data_path), sep = "\t", header = TRUE, row.names = 1)

# Download signature matrix (first time only)
SecAct.download.data()

# Run both implementations
set.seed(42)
result_r <- SecAct.inference.r(Y[, 1:3], lambda = 5e5, nrand = 100)

set.seed(42)
result_gsl <- SecAct.inference.gsl(Y[, 1:3], lambda = 5e5, nrand = 100)

# Compare results
cor_beta <- cor(as.vector(result_r$beta), as.vector(result_gsl$beta))
cor_zscore <- cor(as.vector(result_r$zscore), as.vector(result_gsl$zscore))

cat("Implementation Consistency:\n")
cat("  Beta correlation:", round(cor_beta, 4), "\n")
cat("  Z-score correlation:", round(cor_zscore, 4), "\n")
```

## Practical Considerations

### Lambda Selection

The regularization parameter λ controls the bias-variance tradeoff:

```{r lambda-effect, echo=TRUE, fig.cap="Effect of lambda on coefficient estimates", eval=FALSE}
# Demonstrate lambda effect
lambdas <- c(1e3, 1e4, 1e5, 5e5, 1e6, 1e7)
Y_small <- Y[, 1:2]

results <- lapply(lambdas, function(l) {
  SecAct.inference.r(Y_small, lambda = l, nrand = 50)
})

# Plot coefficient ranges
coef_ranges <- sapply(results, function(r) range(r$beta))
par(mar = c(4, 4, 2, 1))
plot(log10(lambdas), coef_ranges[2,], type = "b", pch = 19, col = "blue",
     ylim = range(coef_ranges), xlab = "log10(lambda)", ylab = "Coefficient range",
     main = "Lambda Effect on Coefficients")
lines(log10(lambdas), coef_ranges[1,], type = "b", pch = 19, col = "red")
legend("topright", c("Max", "Min"), col = c("blue", "red"), pch = 19)
abline(v = log10(5e5), lty = 2, col = "gray")
text(log10(5e5), mean(coef_ranges), "Default", pos = 4)
```

### Number of Permutations

More permutations yield more precise p-values:

| n_rand | P-value precision | Computation time |
|--------|-------------------|------------------|
| 100 | 0.01 | Fast |
| 1000 | 0.001 | Moderate |
| 10000 | 0.0001 | Slow |

**Recommendation**: Use 1000 permutations for publication-quality results.

## Summary

SecAct combines:

1. **Ridge regression** for robust activity inference
2. **Permutation testing** for statistical significance
3. **Efficient implementation** in C/GSL for speed
4. **Comprehensive signature database** covering 1,170+ secreted proteins

For questions or issues, please contact:

- **Author**: Zaoqu Liu (liuzaoqu@163.com)
- **GitHub**: https://github.com/Zaoqu-Liu/SecAct

## References

1. Hoerl, A.E. and Kennard, R.W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. *Technometrics*.
2. Good, P. (2005). Permutation, Parametric and Bootstrap Tests of Hypotheses. *Springer*.

## Session Info

```{r session-info}
sessionInfo()
```