Statistical Framework

Introduction

This vignette provides a comprehensive overview of the statistical methods implemented in scFOCAL. Understanding these methods is essential for proper interpretation of results and experimental design.

1. Spearman Rank Correlation

Theoretical Foundation

scFOCAL uses Spearman’s rank correlation coefficient to measure the monotonic relationship between gene expression and drug signatures. Unlike Pearson correlation, Spearman is:

  • Robust to outliers
  • Non-parametric (no distributional assumptions)
  • Captures monotonic relationships (not just linear)

Formula

\[\rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2-1)}\]

Where \(d_i = R(X_i) - R(Y_i)\) is the difference between ranks.

Demonstration

set.seed(42)

# Generate example data
n <- 50
x <- rnorm(n)
y_linear <- x + rnorm(n, sd = 0.5)
y_monotonic <- sign(x) * abs(x)^0.5 + rnorm(n, sd = 0.3)
y_outlier <- y_linear
y_outlier[c(1, 2)] <- c(10, -10)  # Add outliers

# Calculate correlations
results <- data.frame(
  Scenario = c("Linear + Noise", "Monotonic (Non-linear)", "With Outliers"),
  Pearson = c(cor(x, y_linear), cor(x, y_monotonic), cor(x, y_outlier)),
  Spearman = c(cor(x, y_linear, method = "spearman"),
               cor(x, y_monotonic, method = "spearman"),
               cor(x, y_outlier, method = "spearman"))
)

# Visualization
df_plot <- data.frame(
  x = rep(x, 3),
  y = c(y_linear, y_monotonic, y_outlier),
  Scenario = factor(rep(c("Linear + Noise", "Monotonic (Non-linear)", "With Outliers"), 
                        each = n))
)

ggplot(df_plot, aes(x = x, y = y)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
  facet_wrap(~Scenario, scales = "free_y") +
  labs(
    title = "Pearson vs Spearman Correlation",
    subtitle = "Spearman is more robust to outliers and non-linear relationships",
    x = "Variable X",
    y = "Variable Y"
  ) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"))

knitr::kable(results, digits = 3,
             caption = "Comparison of Pearson and Spearman correlations")
Comparison of Pearson and Spearman correlations
Scenario Pearson Spearman
Linear + Noise 0.918 0.918
Monotonic (Non-linear) 0.924 0.940
With Outliers 0.581 0.908

2. Fisher’s Z-Transformation

Why Transform?

The sampling distribution of correlation coefficients is:

  1. Skewed near ±1
  2. Variance depends on the true correlation
  3. Not suitable for standard parametric tests

Fisher’s Z-transformation addresses these issues.

Statistical Properties

# Simulate sampling distributions
set.seed(123)
n_samples <- 1000
sample_size <- 30

simulate_cors <- function(true_rho, n, n_samples) {
  cors <- sapply(1:n_samples, function(i) {
    sigma <- matrix(c(1, true_rho, true_rho, 1), 2, 2)
    data <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = sigma)
    cor(data[,1], data[,2])
  })
  return(cors)
}

# Different true correlations
rhos <- c(0, 0.5, 0.8)
results_list <- lapply(rhos, function(r) {
  cors <- simulate_cors(r, sample_size, n_samples)
  z <- 0.5 * log((1 + cors) / (1 - cors))
  data.frame(
    Correlation = cors,
    FisherZ = z,
    TrueRho = paste("ρ =", r)
  )
})

df_all <- do.call(rbind, results_list)

# Plot correlation distributions
p1 <- ggplot(df_all, aes(x = Correlation, fill = TrueRho)) +
  geom_density(alpha = 0.5) +
  labs(title = "A) Raw Correlation Distributions",
       x = "Sample Correlation", y = "Density") +
  theme_minimal() +
  theme(legend.position = "top")

# Plot Fisher Z distributions
p2 <- ggplot(df_all, aes(x = FisherZ, fill = TrueRho)) +
  geom_density(alpha = 0.5) +
  labs(title = "B) Fisher Z Distributions",
       x = "Fisher's Z", y = "Density") +
  theme_minimal() +
  theme(legend.position = "top")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Key observation: After transformation, distributions are:

  • More symmetric
  • Have similar variance
  • Suitable for combining across studies

Standard Error

The standard error of Fisher’s Z is approximately:

\[SE(Z) = \frac{1}{\sqrt{n-3}}\]

This enables construction of confidence intervals:

\[Z \pm 1.96 \times SE(Z)\]

3. limma Linear Models

Design Matrix Construction

For differential connectivity analysis, scFOCAL constructs a design matrix incorporating:

# Example design matrix
n_cells <- 20
cell_data <- data.frame(
  Cell = paste0("Cell_", 1:n_cells),
  Subject = rep(c("S1", "S2", "S3", "S4"), each = 5),
  Sensitivity = c(rep("Sensitive", 10), rep("Resistant", 10))
)

# Create design matrix
design <- model.matrix(~ Subject + Sensitivity, data = cell_data)
colnames(design) <- c("Intercept", "Subject_S2", "Subject_S3", "Subject_S4", "Sensitive")

knitr::kable(head(design, 10), caption = "Example Design Matrix (first 10 cells)")
Example Design Matrix (first 10 cells)
Intercept Subject_S2 Subject_S3 Subject_S4 Sensitive
1 0 0 0 1
1 0 0 0 1
1 0 0 0 1
1 0 0 0 1
1 0 0 0 1
1 1 0 0 1
1 1 0 0 1
1 1 0 0 1
1 1 0 0 1
1 1 0 0 1

Model Fitting

The linear model for each compound \(c\):

\[Z_{ic} = \beta_0 + \sum_j \beta_j \text{Subject}_{ij} + \beta_S \text{Sensitivity}_i + \epsilon_{ic}\]

Empirical Bayes

# Simulate the benefit of empirical Bayes
set.seed(42)
n_compounds <- 500
true_effects <- rnorm(n_compounds, mean = 0, sd = 0.5)
n_cells_per_group <- 10

# Simulate with different sample sizes
simulate_effect <- function(true_effect, n) {
  sensitive <- rnorm(n, mean = true_effect, sd = 1)
  resistant <- rnorm(n, mean = 0, sd = 1)
  t.test(sensitive, resistant)$statistic
}

t_stats <- sapply(true_effects, simulate_effect, n = n_cells_per_group)

# Simple posterior (shrinkage toward zero)
shrinkage <- 0.7  # Simplified representation
moderated_t <- shrinkage * t_stats

df_ebayes <- data.frame(
  True = true_effects,
  Ordinary = t_stats,
  Moderated = moderated_t
)

ggplot(df_ebayes) +
  geom_point(aes(x = True, y = Ordinary), alpha = 0.3, color = "red") +
  geom_point(aes(x = True, y = Moderated), alpha = 0.3, color = "blue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(
    title = "Empirical Bayes Moderation Effect",
    subtitle = "Blue (moderated) estimates are shrunk toward zero, reducing false positives",
    x = "True Effect Size",
    y = "Estimated t-statistic"
  ) +
  theme_minimal() +
  annotate("text", x = -1.5, y = 3, label = "Ordinary t-test", color = "red") +
  annotate("text", x = -1.5, y = 2.5, label = "Moderated t-test", color = "blue")

4. Multiple Testing Correction

The Problem

With 1,679 compounds tested simultaneously, controlling false positives is critical.

# Demonstrate multiple testing
set.seed(42)
n_tests <- 1679
n_true_effects <- 50

# Generate p-values
true_p <- c(runif(n_true_effects, 0, 0.01),  # True effects
            runif(n_tests - n_true_effects, 0, 1))  # Null

# Apply corrections
p_bonferroni <- p.adjust(true_p, method = "bonferroni")
p_fdr <- p.adjust(true_p, method = "fdr")

results_mult <- data.frame(
  Raw = true_p,
  Bonferroni = p_bonferroni,
  FDR = p_fdr,
  TrueEffect = c(rep("Yes", n_true_effects), 
                  rep("No", n_tests - n_true_effects))
)

# Count discoveries at alpha = 0.05
discoveries <- data.frame(
  Method = c("Raw p-value", "Bonferroni", "FDR (BH)"),
  `Total Discoveries` = c(sum(true_p < 0.05),
                           sum(p_bonferroni < 0.05),
                           sum(p_fdr < 0.05)),
  `True Positives` = c(sum(true_p[1:n_true_effects] < 0.05),
                        sum(p_bonferroni[1:n_true_effects] < 0.05),
                        sum(p_fdr[1:n_true_effects] < 0.05)),
  check.names = FALSE
)
discoveries$`False Positives` <- discoveries$`Total Discoveries` - discoveries$`True Positives`

knitr::kable(discoveries, 
             caption = "Multiple Testing Correction Comparison (α = 0.05)")
Multiple Testing Correction Comparison (α = 0.05)
Method Total Discoveries True Positives False Positives
Raw p-value 143 50 93
Bonferroni 0 0 0
FDR (BH) 0 0 0

FDR Control

scFOCAL uses the Benjamini-Hochberg procedure for FDR control:

  1. Order p-values: \(p_{(1)} \leq p_{(2)} \leq ... \leq p_{(m)}\)
  2. Find largest \(k\) where \(p_{(k)} \leq \frac{k}{m} \cdot \alpha\)
  3. Reject hypotheses \(H_{(1)}, ..., H_{(k)}\)

This controls the expected proportion of false discoveries at level \(\alpha\).

5. Effect Size Interpretation

Log Fold Change in Z-space

# Effect size interpretation guide
effect_sizes <- data.frame(
  logFC = c(-0.5, -0.3, -0.1, 0, 0.1, 0.3, 0.5),
  Interpretation = c("Strong negative", "Moderate negative", "Weak negative",
                     "No effect", "Weak positive", "Moderate positive", "Strong positive"),
  Meaning = c("Much lower connectivity in resistant",
              "Lower connectivity in resistant",
              "Slightly lower in resistant",
              "No difference",
              "Slightly higher in resistant",
              "Higher connectivity in resistant",
              "Much higher connectivity in resistant")
)

knitr::kable(effect_sizes, 
             caption = "Interpretation of logFC values in differential connectivity")
Interpretation of logFC values in differential connectivity
logFC Interpretation Meaning
-0.5 Strong negative Much lower connectivity in resistant
-0.3 Moderate negative Lower connectivity in resistant
-0.1 Weak negative Slightly lower in resistant
0.0 No effect No difference
0.1 Weak positive Slightly higher in resistant
0.3 Moderate positive Higher connectivity in resistant
0.5 Strong positive Much higher connectivity in resistant

Cohen’s d Equivalent

For practical significance assessment:

\[d = \frac{\text{logFC}}{\text{pooled SD}} \approx \frac{\text{logFC}}{0.3}\]

Cohen’s d Interpretation
0.2 Small effect
0.5 Medium effect
0.8 Large effect

6. Power Analysis

Sample Size Considerations

# Power calculation
power_calc <- function(n, effect, alpha = 0.05) {
  se <- sqrt(2/n)
  z_crit <- qnorm(1 - alpha/2)
  power <- pnorm(effect/se - z_crit) + pnorm(-effect/se - z_crit)
  return(power)
}

n_range <- seq(10, 500, by = 10)
effects <- c(0.1, 0.2, 0.3, 0.5)

power_df <- expand.grid(n = n_range, effect = effects)
power_df$power <- mapply(power_calc, power_df$n, power_df$effect)
power_df$effect <- factor(power_df$effect)

ggplot(power_df, aes(x = n, y = power, color = effect)) +
  geom_line(linewidth = 1.2) +
  geom_hline(yintercept = 0.8, linetype = "dashed", color = "gray50") +
  scale_color_brewer(palette = "Set1", name = "Effect Size\n(logFC)") +
  labs(
    title = "Statistical Power vs Sample Size",
    subtitle = "Dashed line indicates 80% power threshold",
    x = "Number of Cells per Group",
    y = "Statistical Power"
  ) +
  theme_minimal() +
  annotate("text", x = 400, y = 0.85, label = "80% power", color = "gray40")

7. Quality Control Metrics

Gene Coverage

scFOCAL reports the overlap between your data and L1000 genes:

# Example gene coverage visualization
coverage_data <- data.frame(
  Category = c("L1000 Genes (978)", "In Your Data", "Used for Analysis"),
  Count = c(978, 850, 850),
  Percentage = c(100, 86.9, 86.9)
)

ggplot(coverage_data, aes(x = Category, y = Count, fill = Category)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = paste0(Count, " (", round(Percentage, 1), "%)")),
            vjust = -0.5) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title = "Gene Coverage Quality Control",
    x = "",
    y = "Number of Genes"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  ylim(0, 1100)

Connectivity Distribution QC

Healthy connectivity distributions should be:

  • Centered near zero
  • Roughly symmetric
  • Not strongly bimodal

Summary Statistics Table

Metric Formula Interpretation
Connectivity \(\rho_s(cell, drug)\) Transcriptional similarity
Fisher’s Z \(\frac{1}{2}\ln\frac{1+\rho}{1-\rho}\) Normalized correlation
logFC \(\bar{Z}_{resistant} - \bar{Z}_{sensitive}\) Differential connectivity
adj.P.Val BH-corrected p-value Statistical significance
Reversal Score \(\frac{N_{disc}}{N_{conc}}\) Therapeutic potential

Best Practices

  1. Ensure adequate cell numbers per condition (>50 cells recommended)
  2. Check gene coverage before analysis (>70% L1000 overlap)
  3. Use FDR-adjusted p-values for discovery
  4. Consider effect sizes alongside p-values
  5. Validate computationally identified candidates experimentally

Session Info

sessionInfo()
## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_1.3.2    dplyr_1.2.1    ggplot2_4.0.3  rmarkdown_2.31
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-5       gtable_0.3.6       jsonlite_2.0.0     compiler_4.6.1    
##  [5] tidyselect_1.2.1   gridExtra_2.3.1    jquerylib_0.1.4    splines_4.6.1     
##  [9] scales_1.4.0       yaml_2.3.12        fastmap_1.2.0      lattice_0.22-9    
## [13] R6_2.6.1           labeling_0.4.3     generics_0.1.4     knitr_1.51        
## [17] MASS_7.3-65        tibble_3.3.1       maketools_1.3.2    bslib_0.11.0      
## [21] pillar_1.11.1      RColorBrewer_1.1-3 rlang_1.2.0        cachem_1.1.0      
## [25] xfun_0.59          sass_0.4.10        sys_3.4.3          S7_0.2.2          
## [29] otel_0.2.0         cli_3.6.6          mgcv_1.9-4         withr_3.0.3       
## [33] magrittr_2.0.5     digest_0.6.39      grid_4.6.1         nlme_3.1-169      
## [37] lifecycle_1.0.5    vctrs_0.7.3        evaluate_1.0.5     glue_1.8.1        
## [41] farver_2.1.2       buildtools_1.0.0   purrr_1.2.2        tools_4.6.1       
## [45] pkgconfig_2.0.3    htmltools_0.5.9