--- title: "SecAct Algorithm: Ridge Regression with Permutation Testing" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{SecAct Algorithm: Ridge Regression with Permutation Testing} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) library(SecAct) ``` ## Overview **SecAct** (Secreted protein Activity inference) is a computational framework for inferring the signaling activity of over 1,000 secreted proteins from gene expression profiles. This vignette provides a detailed explanation of the mathematical algorithm underlying SecAct. ## Mathematical Framework ### Problem Formulation Given: - **Y**: Gene expression matrix (genes × samples) - **X**: Signature matrix (genes × secreted proteins) We aim to infer the activity matrix **β** (secreted proteins × samples) that best explains the observed gene expression changes. ### Ridge Regression Model SecAct employs **ridge regression** to solve this inverse problem: $$\hat{\beta} = (X^TX + \lambda I)^{-1} X^T Y$$ Where: - $\hat{\beta}$: Estimated activity coefficients - $\lambda$: Regularization parameter (default: 5×10⁵) - $I$: Identity matrix The ridge penalty $\lambda I$ provides: 1. **Numerical stability** when $X^TX$ is ill-conditioned 2. **Regularization** to prevent overfitting 3. **Unique solution** even when features > samples ### Implementation via Cholesky Decomposition For computational efficiency, SecAct uses **Cholesky decomposition**: ``` A = X'X + λI (symmetric positive definite) R = chol(A) (A = R'R) β = R⁻¹ (R')⁻¹ X' Y ``` This approach is: - **Numerically stable**: Exploits the SPD structure - **Computationally efficient**: O(n³/3) vs O(n³) for general inverse - **Memory efficient**: Only stores upper triangular R ## Statistical Significance ### Permutation Testing SecAct assesses statistical significance through **permutation testing**: 1. **Null hypothesis**: No association between gene expression and secreted protein activity 2. **Procedure**: - Randomly permute sample labels in Y - Recompute β for each permutation - Build null distribution of coefficients ```{r permutation-concept, echo=FALSE, fig.cap="Permutation testing procedure"} set.seed(42) # Simulate permutation null distribution null_dist <- rnorm(1000, 0, 1) observed <- 2.5 par(mar = c(4, 4, 2, 1)) hist(null_dist, breaks = 50, col = "lightblue", border = "white", main = "Permutation Null Distribution", xlab = "β coefficient", xlim = c(-4, 4)) abline(v = observed, col = "red", lwd = 2, lty = 2) text(observed + 0.3, 50, "Observed", col = "red", pos = 4) # Calculate p-value pval <- mean(abs(null_dist) >= abs(observed)) text(-3, 70, paste("p-value =", round(pval, 4)), col = "darkblue", cex = 1.2) ``` ### Z-score Calculation The **z-score** quantifies how many standard deviations the observed coefficient deviates from the null: $$z = \frac{\hat{\beta} - \mu_{null}}{\sigma_{null}}$$ Where: - $\mu_{null}$: Mean of permutation coefficients - $\sigma_{null}$: Standard deviation of permutation coefficients ### P-value Computation The **empirical p-value** is calculated as: $$p = \frac{\sum_{i=1}^{n_{rand}} I(|\beta_i^{rand}| \geq |\hat{\beta}|) + 1}{n_{rand} + 1}$$ The "+1" correction ensures p-values are never exactly zero. ## Signature Matrix Construction ### SecAct Signature Database The SecAct signature matrix contains: - **1,170** secreted proteins - **7,919** downstream target genes - Curated from published literature and databases ```{r signature-stats, echo=TRUE, eval=FALSE} library(SecAct) # Download signature matrix (first time only) SecAct.download.data() # Load signature matrix sig_path <- SecAct.get.sigmat.path() sig_mat <- read.table(gzfile(sig_path), sep = "\t", header = TRUE, row.names = 1, nrows = 100) cat("Signature matrix dimensions:\n") cat(" Genes:", nrow(sig_mat), "(showing first 100)\n") cat(" Secreted proteins:", ncol(sig_mat), "\n") ``` ### Signature Grouping To reduce multicollinearity, SecAct groups highly correlated signatures: ```{r grouping-demo, echo=TRUE, fig.cap="Hierarchical clustering of signature correlation"} # Demonstrate signature grouping concept set.seed(123) # Simulate correlation matrix for 20 signatures n_sig <- 20 cor_mat <- matrix(runif(n_sig^2, 0.2, 0.9), n_sig, n_sig) cor_mat <- (cor_mat + t(cor_mat)) / 2 diag(cor_mat) <- 1 rownames(cor_mat) <- colnames(cor_mat) <- paste0("SP", 1:n_sig) # Hierarchical clustering hc <- hclust(as.dist(1 - cor_mat), method = "complete") plot(hc, main = "Signature Grouping by Correlation", xlab = "", sub = "") abline(h = 0.1, col = "red", lty = 2) text(15, 0.15, "Correlation threshold = 0.9", col = "red") ``` ## Algorithm Comparison: R vs GSL SecAct provides two implementations: | Feature | `SecAct.inference.r` | `SecAct.inference.gsl` | |---------|----------------------|------------------------| | Language | Pure R | C with GSL | | Speed | Moderate | Fast | | Platform | All | Unix/macOS | | Precision | 64-bit | 64-bit | ```{r compare-implementations, echo=TRUE, message=FALSE, eval=FALSE} # Compare R and GSL implementations data_path <- system.file("extdata/GSE100093.IFNG.expr.gz", package = "SecAct") Y <- read.table(gzfile(data_path), sep = "\t", header = TRUE, row.names = 1) # Download signature matrix (first time only) SecAct.download.data() # Run both implementations set.seed(42) result_r <- SecAct.inference.r(Y[, 1:3], lambda = 5e5, nrand = 100) set.seed(42) result_gsl <- SecAct.inference.gsl(Y[, 1:3], lambda = 5e5, nrand = 100) # Compare results cor_beta <- cor(as.vector(result_r$beta), as.vector(result_gsl$beta)) cor_zscore <- cor(as.vector(result_r$zscore), as.vector(result_gsl$zscore)) cat("Implementation Consistency:\n") cat(" Beta correlation:", round(cor_beta, 4), "\n") cat(" Z-score correlation:", round(cor_zscore, 4), "\n") ``` ## Practical Considerations ### Lambda Selection The regularization parameter λ controls the bias-variance tradeoff: ```{r lambda-effect, echo=TRUE, fig.cap="Effect of lambda on coefficient estimates", eval=FALSE} # Demonstrate lambda effect lambdas <- c(1e3, 1e4, 1e5, 5e5, 1e6, 1e7) Y_small <- Y[, 1:2] results <- lapply(lambdas, function(l) { SecAct.inference.r(Y_small, lambda = l, nrand = 50) }) # Plot coefficient ranges coef_ranges <- sapply(results, function(r) range(r$beta)) par(mar = c(4, 4, 2, 1)) plot(log10(lambdas), coef_ranges[2,], type = "b", pch = 19, col = "blue", ylim = range(coef_ranges), xlab = "log10(lambda)", ylab = "Coefficient range", main = "Lambda Effect on Coefficients") lines(log10(lambdas), coef_ranges[1,], type = "b", pch = 19, col = "red") legend("topright", c("Max", "Min"), col = c("blue", "red"), pch = 19) abline(v = log10(5e5), lty = 2, col = "gray") text(log10(5e5), mean(coef_ranges), "Default", pos = 4) ``` ### Number of Permutations More permutations yield more precise p-values: | n_rand | P-value precision | Computation time | |--------|-------------------|------------------| | 100 | 0.01 | Fast | | 1000 | 0.001 | Moderate | | 10000 | 0.0001 | Slow | **Recommendation**: Use 1000 permutations for publication-quality results. ## Summary SecAct combines: 1. **Ridge regression** for robust activity inference 2. **Permutation testing** for statistical significance 3. **Efficient implementation** in C/GSL for speed 4. **Comprehensive signature database** covering 1,170+ secreted proteins For questions or issues, please contact: - **Author**: Zaoqu Liu (liuzaoqu@163.com) - **GitHub**: https://github.com/Zaoqu-Liu/SecAct ## References 1. Hoerl, A.E. and Kennard, R.W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. *Technometrics*. 2. Good, P. (2005). Permutation, Parametric and Bootstrap Tests of Hypotheses. *Springer*. ## Session Info ```{r session-info} sessionInfo() ```