This vignette provides a comprehensive overview of the statistical methods implemented in scFOCAL. Understanding these methods is essential for proper interpretation of results and experimental design.
scFOCAL uses Spearman’s rank correlation coefficient to measure the monotonic relationship between gene expression and drug signatures. Unlike Pearson correlation, Spearman is:
\[\rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2-1)}\]
Where \(d_i = R(X_i) - R(Y_i)\) is the difference between ranks.
set.seed(42)
# Generate example data
n <- 50
x <- rnorm(n)
y_linear <- x + rnorm(n, sd = 0.5)
y_monotonic <- sign(x) * abs(x)^0.5 + rnorm(n, sd = 0.3)
y_outlier <- y_linear
y_outlier[c(1, 2)] <- c(10, -10) # Add outliers
# Calculate correlations
results <- data.frame(
Scenario = c("Linear + Noise", "Monotonic (Non-linear)", "With Outliers"),
Pearson = c(cor(x, y_linear), cor(x, y_monotonic), cor(x, y_outlier)),
Spearman = c(cor(x, y_linear, method = "spearman"),
cor(x, y_monotonic, method = "spearman"),
cor(x, y_outlier, method = "spearman"))
)
# Visualization
df_plot <- data.frame(
x = rep(x, 3),
y = c(y_linear, y_monotonic, y_outlier),
Scenario = factor(rep(c("Linear + Noise", "Monotonic (Non-linear)", "With Outliers"),
each = n))
)
ggplot(df_plot, aes(x = x, y = y)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") +
facet_wrap(~Scenario, scales = "free_y") +
labs(
title = "Pearson vs Spearman Correlation",
subtitle = "Spearman is more robust to outliers and non-linear relationships",
x = "Variable X",
y = "Variable Y"
) +
theme_minimal() +
theme(strip.text = element_text(face = "bold"))| Scenario | Pearson | Spearman |
|---|---|---|
| Linear + Noise | 0.918 | 0.918 |
| Monotonic (Non-linear) | 0.924 | 0.940 |
| With Outliers | 0.581 | 0.908 |
The sampling distribution of correlation coefficients is:
Fisher’s Z-transformation addresses these issues.
# Simulate sampling distributions
set.seed(123)
n_samples <- 1000
sample_size <- 30
simulate_cors <- function(true_rho, n, n_samples) {
cors <- sapply(1:n_samples, function(i) {
sigma <- matrix(c(1, true_rho, true_rho, 1), 2, 2)
data <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = sigma)
cor(data[,1], data[,2])
})
return(cors)
}
# Different true correlations
rhos <- c(0, 0.5, 0.8)
results_list <- lapply(rhos, function(r) {
cors <- simulate_cors(r, sample_size, n_samples)
z <- 0.5 * log((1 + cors) / (1 - cors))
data.frame(
Correlation = cors,
FisherZ = z,
TrueRho = paste("ρ =", r)
)
})
df_all <- do.call(rbind, results_list)
# Plot correlation distributions
p1 <- ggplot(df_all, aes(x = Correlation, fill = TrueRho)) +
geom_density(alpha = 0.5) +
labs(title = "A) Raw Correlation Distributions",
x = "Sample Correlation", y = "Density") +
theme_minimal() +
theme(legend.position = "top")
# Plot Fisher Z distributions
p2 <- ggplot(df_all, aes(x = FisherZ, fill = TrueRho)) +
geom_density(alpha = 0.5) +
labs(title = "B) Fisher Z Distributions",
x = "Fisher's Z", y = "Density") +
theme_minimal() +
theme(legend.position = "top")
gridExtra::grid.arrange(p1, p2, ncol = 2)Key observation: After transformation, distributions are:
The standard error of Fisher’s Z is approximately:
\[SE(Z) = \frac{1}{\sqrt{n-3}}\]
This enables construction of confidence intervals:
\[Z \pm 1.96 \times SE(Z)\]
For differential connectivity analysis, scFOCAL constructs a design matrix incorporating:
# Example design matrix
n_cells <- 20
cell_data <- data.frame(
Cell = paste0("Cell_", 1:n_cells),
Subject = rep(c("S1", "S2", "S3", "S4"), each = 5),
Sensitivity = c(rep("Sensitive", 10), rep("Resistant", 10))
)
# Create design matrix
design <- model.matrix(~ Subject + Sensitivity, data = cell_data)
colnames(design) <- c("Intercept", "Subject_S2", "Subject_S3", "Subject_S4", "Sensitive")
knitr::kable(head(design, 10), caption = "Example Design Matrix (first 10 cells)")| Intercept | Subject_S2 | Subject_S3 | Subject_S4 | Sensitive |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 |
The linear model for each compound \(c\):
\[Z_{ic} = \beta_0 + \sum_j \beta_j \text{Subject}_{ij} + \beta_S \text{Sensitivity}_i + \epsilon_{ic}\]
# Simulate the benefit of empirical Bayes
set.seed(42)
n_compounds <- 500
true_effects <- rnorm(n_compounds, mean = 0, sd = 0.5)
n_cells_per_group <- 10
# Simulate with different sample sizes
simulate_effect <- function(true_effect, n) {
sensitive <- rnorm(n, mean = true_effect, sd = 1)
resistant <- rnorm(n, mean = 0, sd = 1)
t.test(sensitive, resistant)$statistic
}
t_stats <- sapply(true_effects, simulate_effect, n = n_cells_per_group)
# Simple posterior (shrinkage toward zero)
shrinkage <- 0.7 # Simplified representation
moderated_t <- shrinkage * t_stats
df_ebayes <- data.frame(
True = true_effects,
Ordinary = t_stats,
Moderated = moderated_t
)
ggplot(df_ebayes) +
geom_point(aes(x = True, y = Ordinary), alpha = 0.3, color = "red") +
geom_point(aes(x = True, y = Moderated), alpha = 0.3, color = "blue") +
geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
labs(
title = "Empirical Bayes Moderation Effect",
subtitle = "Blue (moderated) estimates are shrunk toward zero, reducing false positives",
x = "True Effect Size",
y = "Estimated t-statistic"
) +
theme_minimal() +
annotate("text", x = -1.5, y = 3, label = "Ordinary t-test", color = "red") +
annotate("text", x = -1.5, y = 2.5, label = "Moderated t-test", color = "blue")With 1,679 compounds tested simultaneously, controlling false positives is critical.
# Demonstrate multiple testing
set.seed(42)
n_tests <- 1679
n_true_effects <- 50
# Generate p-values
true_p <- c(runif(n_true_effects, 0, 0.01), # True effects
runif(n_tests - n_true_effects, 0, 1)) # Null
# Apply corrections
p_bonferroni <- p.adjust(true_p, method = "bonferroni")
p_fdr <- p.adjust(true_p, method = "fdr")
results_mult <- data.frame(
Raw = true_p,
Bonferroni = p_bonferroni,
FDR = p_fdr,
TrueEffect = c(rep("Yes", n_true_effects),
rep("No", n_tests - n_true_effects))
)
# Count discoveries at alpha = 0.05
discoveries <- data.frame(
Method = c("Raw p-value", "Bonferroni", "FDR (BH)"),
`Total Discoveries` = c(sum(true_p < 0.05),
sum(p_bonferroni < 0.05),
sum(p_fdr < 0.05)),
`True Positives` = c(sum(true_p[1:n_true_effects] < 0.05),
sum(p_bonferroni[1:n_true_effects] < 0.05),
sum(p_fdr[1:n_true_effects] < 0.05)),
check.names = FALSE
)
discoveries$`False Positives` <- discoveries$`Total Discoveries` - discoveries$`True Positives`
knitr::kable(discoveries,
caption = "Multiple Testing Correction Comparison (α = 0.05)")| Method | Total Discoveries | True Positives | False Positives |
|---|---|---|---|
| Raw p-value | 143 | 50 | 93 |
| Bonferroni | 0 | 0 | 0 |
| FDR (BH) | 0 | 0 | 0 |
scFOCAL uses the Benjamini-Hochberg procedure for FDR control:
This controls the expected proportion of false discoveries at level \(\alpha\).
# Effect size interpretation guide
effect_sizes <- data.frame(
logFC = c(-0.5, -0.3, -0.1, 0, 0.1, 0.3, 0.5),
Interpretation = c("Strong negative", "Moderate negative", "Weak negative",
"No effect", "Weak positive", "Moderate positive", "Strong positive"),
Meaning = c("Much lower connectivity in resistant",
"Lower connectivity in resistant",
"Slightly lower in resistant",
"No difference",
"Slightly higher in resistant",
"Higher connectivity in resistant",
"Much higher connectivity in resistant")
)
knitr::kable(effect_sizes,
caption = "Interpretation of logFC values in differential connectivity")| logFC | Interpretation | Meaning |
|---|---|---|
| -0.5 | Strong negative | Much lower connectivity in resistant |
| -0.3 | Moderate negative | Lower connectivity in resistant |
| -0.1 | Weak negative | Slightly lower in resistant |
| 0.0 | No effect | No difference |
| 0.1 | Weak positive | Slightly higher in resistant |
| 0.3 | Moderate positive | Higher connectivity in resistant |
| 0.5 | Strong positive | Much higher connectivity in resistant |
For practical significance assessment:
\[d = \frac{\text{logFC}}{\text{pooled SD}} \approx \frac{\text{logFC}}{0.3}\]
| Cohen’s d | Interpretation |
|---|---|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |
# Power calculation
power_calc <- function(n, effect, alpha = 0.05) {
se <- sqrt(2/n)
z_crit <- qnorm(1 - alpha/2)
power <- pnorm(effect/se - z_crit) + pnorm(-effect/se - z_crit)
return(power)
}
n_range <- seq(10, 500, by = 10)
effects <- c(0.1, 0.2, 0.3, 0.5)
power_df <- expand.grid(n = n_range, effect = effects)
power_df$power <- mapply(power_calc, power_df$n, power_df$effect)
power_df$effect <- factor(power_df$effect)
ggplot(power_df, aes(x = n, y = power, color = effect)) +
geom_line(linewidth = 1.2) +
geom_hline(yintercept = 0.8, linetype = "dashed", color = "gray50") +
scale_color_brewer(palette = "Set1", name = "Effect Size\n(logFC)") +
labs(
title = "Statistical Power vs Sample Size",
subtitle = "Dashed line indicates 80% power threshold",
x = "Number of Cells per Group",
y = "Statistical Power"
) +
theme_minimal() +
annotate("text", x = 400, y = 0.85, label = "80% power", color = "gray40")| Effect Size (logFC) | Min Cells per Group | Recommended |
|---|---|---|
| 0.5 (large) | 30 | 50 |
| 0.3 (medium) | 80 | 100 |
| 0.2 (small) | 200 | 250 |
| 0.1 (very small) | 800 | 1000 |
scFOCAL reports the overlap between your data and L1000 genes:
# Example gene coverage visualization
coverage_data <- data.frame(
Category = c("L1000 Genes (978)", "In Your Data", "Used for Analysis"),
Count = c(978, 850, 850),
Percentage = c(100, 86.9, 86.9)
)
ggplot(coverage_data, aes(x = Category, y = Count, fill = Category)) +
geom_col(width = 0.6) +
geom_text(aes(label = paste0(Count, " (", round(Percentage, 1), "%)")),
vjust = -0.5) +
scale_fill_brewer(palette = "Blues") +
labs(
title = "Gene Coverage Quality Control",
x = "",
y = "Number of Genes"
) +
theme_minimal() +
theme(legend.position = "none") +
ylim(0, 1100)Healthy connectivity distributions should be:
| Metric | Formula | Interpretation |
|---|---|---|
| Connectivity | \(\rho_s(cell, drug)\) | Transcriptional similarity |
| Fisher’s Z | \(\frac{1}{2}\ln\frac{1+\rho}{1-\rho}\) | Normalized correlation |
| logFC | \(\bar{Z}_{resistant} - \bar{Z}_{sensitive}\) | Differential connectivity |
| adj.P.Val | BH-corrected p-value | Statistical significance |
| Reversal Score | \(\frac{N_{disc}}{N_{conc}}\) | Therapeutic potential |
## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_1.3.2 dplyr_1.2.1 ggplot2_4.0.3 rmarkdown_2.31
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-5 gtable_0.3.6 jsonlite_2.0.0 compiler_4.6.1
## [5] tidyselect_1.2.1 gridExtra_2.3.1 jquerylib_0.1.4 splines_4.6.1
## [9] scales_1.4.0 yaml_2.3.12 fastmap_1.2.0 lattice_0.22-9
## [13] R6_2.6.1 labeling_0.4.3 generics_0.1.4 knitr_1.51
## [17] MASS_7.3-65 tibble_3.3.1 maketools_1.3.2 bslib_0.11.0
## [21] pillar_1.11.1 RColorBrewer_1.1-3 rlang_1.2.0 cachem_1.1.0
## [25] xfun_0.59 sass_0.4.10 sys_3.4.3 S7_0.2.2
## [29] otel_0.2.0 cli_3.6.6 mgcv_1.9-4 withr_3.0.3
## [33] magrittr_2.0.5 digest_0.6.39 grid_4.6.1 nlme_3.1-169
## [37] lifecycle_1.0.5 vctrs_0.7.3 evaluate_1.0.5 glue_1.8.1
## [41] farver_2.1.2 buildtools_1.0.0 purrr_1.2.2 tools_4.6.1
## [45] pkgconfig_2.0.3 htmltools_0.5.9