--- title: "Statistical Framework" author: "Zaoqu Liu, Robert K. Suter, Nagi G. Ayad" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Statistical Framework} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, fig.width = 8, fig.height = 6, eval = TRUE ) library(ggplot2) library(dplyr) library(tidyr) ``` ## Introduction This vignette provides a comprehensive overview of the statistical methods implemented in scFOCAL. Understanding these methods is essential for proper interpretation of results and experimental design. ## 1. Spearman Rank Correlation ### Theoretical Foundation scFOCAL uses Spearman's rank correlation coefficient to measure the monotonic relationship between gene expression and drug signatures. Unlike Pearson correlation, Spearman is: - **Robust to outliers** - **Non-parametric** (no distributional assumptions) - **Captures monotonic relationships** (not just linear) ### Formula $$\rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2-1)}$$ Where $d_i = R(X_i) - R(Y_i)$ is the difference between ranks. ### Demonstration ```{r spearman-demo, fig.height=4, fig.width=8} set.seed(42) # Generate example data n <- 50 x <- rnorm(n) y_linear <- x + rnorm(n, sd = 0.5) y_monotonic <- sign(x) * abs(x)^0.5 + rnorm(n, sd = 0.3) y_outlier <- y_linear y_outlier[c(1, 2)] <- c(10, -10) # Add outliers # Calculate correlations results <- data.frame( Scenario = c("Linear + Noise", "Monotonic (Non-linear)", "With Outliers"), Pearson = c(cor(x, y_linear), cor(x, y_monotonic), cor(x, y_outlier)), Spearman = c(cor(x, y_linear, method = "spearman"), cor(x, y_monotonic, method = "spearman"), cor(x, y_outlier, method = "spearman")) ) # Visualization df_plot <- data.frame( x = rep(x, 3), y = c(y_linear, y_monotonic, y_outlier), Scenario = factor(rep(c("Linear + Noise", "Monotonic (Non-linear)", "With Outliers"), each = n)) ) ggplot(df_plot, aes(x = x, y = y)) + geom_point(alpha = 0.6, color = "steelblue") + geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") + facet_wrap(~Scenario, scales = "free_y") + labs( title = "Pearson vs Spearman Correlation", subtitle = "Spearman is more robust to outliers and non-linear relationships", x = "Variable X", y = "Variable Y" ) + theme_minimal() + theme(strip.text = element_text(face = "bold")) ``` ```{r correlation-table} knitr::kable(results, digits = 3, caption = "Comparison of Pearson and Spearman correlations") ``` ## 2. Fisher's Z-Transformation ### Why Transform? The sampling distribution of correlation coefficients is: 1. **Skewed** near ±1 2. **Variance depends on the true correlation** 3. **Not suitable for standard parametric tests** Fisher's Z-transformation addresses these issues. ### Statistical Properties ```{r fisher-properties, fig.height=5, fig.width=8} # Simulate sampling distributions set.seed(123) n_samples <- 1000 sample_size <- 30 simulate_cors <- function(true_rho, n, n_samples) { cors <- sapply(1:n_samples, function(i) { sigma <- matrix(c(1, true_rho, true_rho, 1), 2, 2) data <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = sigma) cor(data[,1], data[,2]) }) return(cors) } # Different true correlations rhos <- c(0, 0.5, 0.8) results_list <- lapply(rhos, function(r) { cors <- simulate_cors(r, sample_size, n_samples) z <- 0.5 * log((1 + cors) / (1 - cors)) data.frame( Correlation = cors, FisherZ = z, TrueRho = paste("ρ =", r) ) }) df_all <- do.call(rbind, results_list) # Plot correlation distributions p1 <- ggplot(df_all, aes(x = Correlation, fill = TrueRho)) + geom_density(alpha = 0.5) + labs(title = "A) Raw Correlation Distributions", x = "Sample Correlation", y = "Density") + theme_minimal() + theme(legend.position = "top") # Plot Fisher Z distributions p2 <- ggplot(df_all, aes(x = FisherZ, fill = TrueRho)) + geom_density(alpha = 0.5) + labs(title = "B) Fisher Z Distributions", x = "Fisher's Z", y = "Density") + theme_minimal() + theme(legend.position = "top") gridExtra::grid.arrange(p1, p2, ncol = 2) ``` **Key observation**: After transformation, distributions are: - More symmetric - Have similar variance - Suitable for combining across studies ### Standard Error The standard error of Fisher's Z is approximately: $$SE(Z) = \frac{1}{\sqrt{n-3}}$$ This enables construction of confidence intervals: $$Z \pm 1.96 \times SE(Z)$$ ## 3. limma Linear Models ### Design Matrix Construction For differential connectivity analysis, scFOCAL constructs a design matrix incorporating: ```{r design-matrix} # Example design matrix n_cells <- 20 cell_data <- data.frame( Cell = paste0("Cell_", 1:n_cells), Subject = rep(c("S1", "S2", "S3", "S4"), each = 5), Sensitivity = c(rep("Sensitive", 10), rep("Resistant", 10)) ) # Create design matrix design <- model.matrix(~ Subject + Sensitivity, data = cell_data) colnames(design) <- c("Intercept", "Subject_S2", "Subject_S3", "Subject_S4", "Sensitive") knitr::kable(head(design, 10), caption = "Example Design Matrix (first 10 cells)") ``` ### Model Fitting The linear model for each compound $c$: $$Z_{ic} = \beta_0 + \sum_j \beta_j \text{Subject}_{ij} + \beta_S \text{Sensitivity}_i + \epsilon_{ic}$$ ### Empirical Bayes ```{r ebayes-simulation, fig.height=4, fig.width=7} # Simulate the benefit of empirical Bayes set.seed(42) n_compounds <- 500 true_effects <- rnorm(n_compounds, mean = 0, sd = 0.5) n_cells_per_group <- 10 # Simulate with different sample sizes simulate_effect <- function(true_effect, n) { sensitive <- rnorm(n, mean = true_effect, sd = 1) resistant <- rnorm(n, mean = 0, sd = 1) t.test(sensitive, resistant)$statistic } t_stats <- sapply(true_effects, simulate_effect, n = n_cells_per_group) # Simple posterior (shrinkage toward zero) shrinkage <- 0.7 # Simplified representation moderated_t <- shrinkage * t_stats df_ebayes <- data.frame( True = true_effects, Ordinary = t_stats, Moderated = moderated_t ) ggplot(df_ebayes) + geom_point(aes(x = True, y = Ordinary), alpha = 0.3, color = "red") + geom_point(aes(x = True, y = Moderated), alpha = 0.3, color = "blue") + geom_abline(intercept = 0, slope = 1, linetype = "dashed") + labs( title = "Empirical Bayes Moderation Effect", subtitle = "Blue (moderated) estimates are shrunk toward zero, reducing false positives", x = "True Effect Size", y = "Estimated t-statistic" ) + theme_minimal() + annotate("text", x = -1.5, y = 3, label = "Ordinary t-test", color = "red") + annotate("text", x = -1.5, y = 2.5, label = "Moderated t-test", color = "blue") ``` ## 4. Multiple Testing Correction ### The Problem With 1,679 compounds tested simultaneously, controlling false positives is critical. ```{r multiple-testing, fig.height=4, fig.width=7} # Demonstrate multiple testing set.seed(42) n_tests <- 1679 n_true_effects <- 50 # Generate p-values true_p <- c(runif(n_true_effects, 0, 0.01), # True effects runif(n_tests - n_true_effects, 0, 1)) # Null # Apply corrections p_bonferroni <- p.adjust(true_p, method = "bonferroni") p_fdr <- p.adjust(true_p, method = "fdr") results_mult <- data.frame( Raw = true_p, Bonferroni = p_bonferroni, FDR = p_fdr, TrueEffect = c(rep("Yes", n_true_effects), rep("No", n_tests - n_true_effects)) ) # Count discoveries at alpha = 0.05 discoveries <- data.frame( Method = c("Raw p-value", "Bonferroni", "FDR (BH)"), `Total Discoveries` = c(sum(true_p < 0.05), sum(p_bonferroni < 0.05), sum(p_fdr < 0.05)), `True Positives` = c(sum(true_p[1:n_true_effects] < 0.05), sum(p_bonferroni[1:n_true_effects] < 0.05), sum(p_fdr[1:n_true_effects] < 0.05)), check.names = FALSE ) discoveries$`False Positives` <- discoveries$`Total Discoveries` - discoveries$`True Positives` knitr::kable(discoveries, caption = "Multiple Testing Correction Comparison (α = 0.05)") ``` ### FDR Control scFOCAL uses the **Benjamini-Hochberg procedure** for FDR control: 1. Order p-values: $p_{(1)} \leq p_{(2)} \leq ... \leq p_{(m)}$ 2. Find largest $k$ where $p_{(k)} \leq \frac{k}{m} \cdot \alpha$ 3. Reject hypotheses $H_{(1)}, ..., H_{(k)}$ This controls the **expected proportion of false discoveries** at level $\alpha$. ## 5. Effect Size Interpretation ### Log Fold Change in Z-space ```{r effect-sizes, fig.height=4, fig.width=7} # Effect size interpretation guide effect_sizes <- data.frame( logFC = c(-0.5, -0.3, -0.1, 0, 0.1, 0.3, 0.5), Interpretation = c("Strong negative", "Moderate negative", "Weak negative", "No effect", "Weak positive", "Moderate positive", "Strong positive"), Meaning = c("Much lower connectivity in resistant", "Lower connectivity in resistant", "Slightly lower in resistant", "No difference", "Slightly higher in resistant", "Higher connectivity in resistant", "Much higher connectivity in resistant") ) knitr::kable(effect_sizes, caption = "Interpretation of logFC values in differential connectivity") ``` ### Cohen's d Equivalent For practical significance assessment: $$d = \frac{\text{logFC}}{\text{pooled SD}} \approx \frac{\text{logFC}}{0.3}$$ | Cohen's d | Interpretation | |-----------|----------------| | 0.2 | Small effect | | 0.5 | Medium effect | | 0.8 | Large effect | ## 6. Power Analysis ### Sample Size Considerations ```{r power-analysis, fig.height=4, fig.width=7} # Power calculation power_calc <- function(n, effect, alpha = 0.05) { se <- sqrt(2/n) z_crit <- qnorm(1 - alpha/2) power <- pnorm(effect/se - z_crit) + pnorm(-effect/se - z_crit) return(power) } n_range <- seq(10, 500, by = 10) effects <- c(0.1, 0.2, 0.3, 0.5) power_df <- expand.grid(n = n_range, effect = effects) power_df$power <- mapply(power_calc, power_df$n, power_df$effect) power_df$effect <- factor(power_df$effect) ggplot(power_df, aes(x = n, y = power, color = effect)) + geom_line(linewidth = 1.2) + geom_hline(yintercept = 0.8, linetype = "dashed", color = "gray50") + scale_color_brewer(palette = "Set1", name = "Effect Size\n(logFC)") + labs( title = "Statistical Power vs Sample Size", subtitle = "Dashed line indicates 80% power threshold", x = "Number of Cells per Group", y = "Statistical Power" ) + theme_minimal() + annotate("text", x = 400, y = 0.85, label = "80% power", color = "gray40") ``` ### Recommended Sample Sizes | Effect Size (logFC) | Min Cells per Group | Recommended | |---------------------|---------------------|-------------| | 0.5 (large) | 30 | 50 | | 0.3 (medium) | 80 | 100 | | 0.2 (small) | 200 | 250 | | 0.1 (very small) | 800 | 1000 | ## 7. Quality Control Metrics ### Gene Coverage scFOCAL reports the overlap between your data and L1000 genes: ```{r gene-coverage, fig.height=3, fig.width=6} # Example gene coverage visualization coverage_data <- data.frame( Category = c("L1000 Genes (978)", "In Your Data", "Used for Analysis"), Count = c(978, 850, 850), Percentage = c(100, 86.9, 86.9) ) ggplot(coverage_data, aes(x = Category, y = Count, fill = Category)) + geom_col(width = 0.6) + geom_text(aes(label = paste0(Count, " (", round(Percentage, 1), "%)")), vjust = -0.5) + scale_fill_brewer(palette = "Blues") + labs( title = "Gene Coverage Quality Control", x = "", y = "Number of Genes" ) + theme_minimal() + theme(legend.position = "none") + ylim(0, 1100) ``` ### Connectivity Distribution QC Healthy connectivity distributions should be: - Centered near zero - Roughly symmetric - Not strongly bimodal ## Summary Statistics Table | Metric | Formula | Interpretation | |--------|---------|----------------| | Connectivity | $\rho_s(cell, drug)$ | Transcriptional similarity | | Fisher's Z | $\frac{1}{2}\ln\frac{1+\rho}{1-\rho}$ | Normalized correlation | | logFC | $\bar{Z}_{resistant} - \bar{Z}_{sensitive}$ | Differential connectivity | | adj.P.Val | BH-corrected p-value | Statistical significance | | Reversal Score | $\frac{N_{disc}}{N_{conc}}$ | Therapeutic potential | ## Best Practices 1. **Ensure adequate cell numbers** per condition (>50 cells recommended) 2. **Check gene coverage** before analysis (>70% L1000 overlap) 3. **Use FDR-adjusted p-values** for discovery 4. **Consider effect sizes** alongside p-values 5. **Validate computationally identified candidates** experimentally ## Session Info ```{r session-info} sessionInfo() ```