Algorithm Principles

Overview

scFOCAL implements a multi-step computational framework that integrates single-cell transcriptomics with pharmacological perturbation data. This vignette details the mathematical principles underlying each algorithm.

1. Drug-Cell Connectivity Score

Concept

The core innovation of scFOCAL is the Drug-Cell Connectivity Score, which quantifies how closely each cell’s expression profile agrees with, or opposes, a drug perturbation signature.

Mathematical Formulation

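For each cell \(c\) and drug \(d\), we compute the Spearman rank correlation over the genes shared by the cell’s expression profile and the drug signature. In the absence of tied ranks, it takes the closed form: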

\[\rho_{c,d} = 1 - \frac{6 \sum_{i=1}^{n} (R_{x_i} - R_{y_i})^2}{n(n^2 - 1)}\]

Where:

  • \(R_{x_i}\) = rank of gene \(i\) expression in cell \(c\)
  • \(R_{y_i}\) = rank of gene \(i\) in drug signature \(d\)
  • \(n\) = number of overlapping genes
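
The computation above can be sketched in base R for a single cell and a single drug. The vectors `cell_expr` and `drug_sig` and their gene names are made-up illustration data, not scFOCAL objects. Because the closed-form expression is exact only when neither vector contains ties, the sketch uses tie-free values and compares the result with `cor(..., method = "spearman")`.

```r
# Hypothetical inputs: named values for one cell and one drug signature
cell_expr <- c(GeneA = 5.1, GeneB = 0.4, GeneC = 2.3, GeneD = 7.8,
               GeneE = 1.2, GeneF = 3.3)
drug_sig  <- c(GeneB = 1.9, GeneC = 0.8, GeneD = -2.4, GeneE = -0.6,
               GeneF = -1.1, GeneG = 0.5)

# Restrict both vectors to the overlapping genes, in the same order
shared <- intersect(names(cell_expr), names(drug_sig))
x <- cell_expr[shared]
y <- drug_sig[shared]
n <- length(shared)

# Closed-form Spearman correlation (valid without ties)
rho_formula <- 1 - 6 * sum((rank(x) - rank(y))^2) / (n * (n^2 - 1))

# Reference value from base R, which also handles ties
rho_builtin <- cor(x, y, method = "spearman")
```

With real single-cell data, where many genes share a value of zero, the built-in `cor()` call is the safer choice because it assigns average ranks to ties.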

Interpretation

| Score range | Interpretation | Biological meaning |
| --- | --- | --- |
| ρ < -0.3 | Strong discordance | High drug sensitivity |
| -0.3 ≤ ρ < 0 | Moderate discordance | Moderate sensitivity |
| 0 ≤ ρ < 0.3 | Moderate concordance | Moderate resistance |
| ρ ≥ 0.3 | Strong concordance | High drug resistance |
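
As a small illustration of how these thresholds could be applied, the sketch below bins a vector of scores with base R's `cut()`. The scores are invented, and the variable names are not part of scFOCAL.

```r
# Hypothetical connectivity scores for five cells against one drug
rho <- c(-0.45, -0.10, 0.00, 0.29, 0.30)

# Left-closed intervals reproduce the table:
# [-1, -0.3), [-0.3, 0), [0, 0.3), [0.3, 1]
category <- cut(
  rho,
  breaks = c(-1, -0.3, 0, 0.3, 1),
  labels = c("Strong discordance", "Moderate discordance",
             "Moderate concordance", "Strong concordance"),
  right = FALSE,          # intervals are closed on the left
  include.lowest = TRUE   # keeps rho = 1 in the last interval
)

data.frame(rho = rho, category = category)
```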

2. Fisher’s Z-Transformation

Purpose

To enable statistical comparison of correlation coefficients across conditions, we apply Fisher’s Z-transformation:

\[Z = \frac{1}{2} \ln\left(\frac{1+\rho}{1-\rho}\right) = \text{arctanh}(\rho)\]

Properties

# Demonstrate Fisher's Z-transformation
library(ggplot2)

rho <- seq(-0.95, 0.95, by = 0.01)
z <- atanh(rho)  # identical to 0.5 * log((1 + rho) / (1 - rho))

df <- data.frame(rho = rho, z = z)

ggplot(df, aes(x = rho, y = z)) +
  geom_line(color = "#0072B2", linewidth = 1.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(
    title = "Fisher's Z-Transformation",
    subtitle = expression(Z == frac(1,2) * ln * bgroup("(", frac(1+rho, 1-rho), ")")),
    x = expression("Correlation coefficient (" * rho * ")"),
    y = "Fisher's Z"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12)
  )

Key advantages:

  1. Normalizes the sampling distribution - Z-values follow an approximately normal distribution
  2. Stabilizes variance - Variance becomes approximately constant across different correlation values
  3. Enables parametric testing - Allows use of standard statistical tests (t-tests, ANOVA)
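
To make the third point concrete, here is a minimal sketch that compares one drug's connectivity scores between two groups of cells on the Z scale. The two groups and their scores are simulated, and a plain `t.test()` stands in for whatever test a real analysis would use.

```r
set.seed(1)

# Hypothetical connectivity scores (rho) for one drug in two groups of cells
rho_group_a <- tanh(rnorm(200, mean = -0.30, sd = 0.15))
rho_group_b <- tanh(rnorm(200, mean = -0.10, sd = 0.15))

# Transform to the Z scale before applying a parametric test
z_a <- atanh(rho_group_a)
z_b <- atanh(rho_group_b)
test <- t.test(z_a, z_b)

# Back-transform the group means to the correlation scale for reporting
mean_rho_a <- tanh(mean(z_a))
mean_rho_b <- tanh(mean(z_b))
```

Averaging on the Z scale and back-transforming with `tanh()` keeps the reported means inside the valid range of a correlation, (-1, 1).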

Variance Property

The variance of Z is approximately:

\[\text{Var}(Z) \approx \frac{1}{n-3}\]

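Where \(n\) is the number of genes used in the correlation. This is the classical result for the Pearson correlation under bivariate normality; for the Spearman correlation a commonly used approximation is \(1.06/(n-3)\), so the expression above is best read as a close approximation.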
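
The variance approximation can be checked with a short simulation. The sketch below draws correlated normal pairs, so it illustrates the Pearson case for which the \(1/(n-3)\) result was derived. The sample size, true correlation, and number of replicates are arbitrary choices.

```r
set.seed(42)

n_genes  <- 50     # observations per correlation (assumed)
true_rho <- 0.5    # population correlation (assumed)
n_reps   <- 2000   # number of simulated correlations

z_values <- replicate(n_reps, {
  x <- rnorm(n_genes)
  y <- true_rho * x + sqrt(1 - true_rho^2) * rnorm(n_genes)
  atanh(cor(x, y))   # Pearson correlation, then Fisher's Z
})

empirical_var   <- var(z_values)
theoretical_var <- 1 / (n_genes - 3)
c(empirical = empirical_var, theoretical = theoretical_var)
```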

3. Differential Connectivity Analysis

Linear Model Framework

scFOCAL uses limma (Linear Models for Microarray Data) for differential connectivity analysis:

\[Z_{ij} = \mu + \beta_1 \cdot \text{Subject}_j + \beta_2 \cdot \text{Sensitivity}_i + \epsilon_{ij}\]

Where:

  • \(Z_{ij}\) = Z-transformed connectivity for cell \(i\) in subject \(j\)
  • \(\text{Subject}_j\) = blocking factor for subject/replicate
  • \(\text{Sensitivity}_i\) = binary indicator (sensitive vs resistant)
  • \(\epsilon_{ij}\) = residual error
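
The structure of this model can be sketched for a single compound with base R. The data are simulated, the effect sizes are arbitrary, and `lm()` is used as a stand-in: limma fits the same kind of design across all compounds at once, which this sketch does not attempt to reproduce.

```r
set.seed(7)

# Simulated design: 3 subjects, 100 cells each, alternating sensitivity labels
cells <- data.frame(
  subject = factor(rep(c("S1", "S2", "S3"), each = 100)),
  sensitivity = factor(rep(c("resistant", "sensitive"), times = 150),
                       levels = c("resistant", "sensitive"))
)

# Simulated Z-transformed connectivity with subject offsets and a true
# sensitivity effect of -0.4 (chosen only for illustration)
subject_offset <- c(S1 = 0.0, S2 = 0.2, S3 = -0.1)[as.character(cells$subject)]
cells$z <- 0.1 + unname(subject_offset) +
  ifelse(cells$sensitivity == "sensitive", -0.4, 0) +
  rnorm(nrow(cells), sd = 0.2)

# Design matrix with subject as a blocking factor
design <- model.matrix(~ subject + sensitivity, data = cells)

# Single-compound fit; the sensitivity coefficient plays the role of beta_2
fit <- lm(z ~ subject + sensitivity, data = cells)
beta_sensitivity <- coef(fit)[["sensitivitysensitive"]]
```

Including `subject` in the design means the sensitivity effect is estimated within subjects, so differences in baseline connectivity between subjects do not masquerade as a sensitivity effect.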

Empirical Bayes Moderation

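limma applies empirical Bayes moderation: each compound’s variance estimate is shrunk toward a prior value estimated across all compounds, and the resulting moderated variance replaces the ordinary one in the t-statistic.

The moderated t-statistic provides: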

  • Increased statistical power for compounds with few cells
  • Robust inference even with small sample sizes
  • Proper multiple testing correction via FDR adjustment
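
The shrinkage behind the moderated t-statistic can be illustrated with the posterior variance formula from the limma literature, \(\tilde{s}^2 = (d_0 s_0^2 + d\, s^2)/(d_0 + d)\). In the sketch below the prior degrees of freedom `d0`, the prior variance `s0_sq`, and all per-compound values are fixed by hand purely for illustration; limma estimates the prior from the data.

```r
# Hypothetical per-compound sample variances and residual degrees of freedom
s_sq     <- c(drugA = 0.02, drugB = 0.30, drugC = 0.08)
df_resid <- 4

# Assumed prior values (limma would estimate these across all compounds)
d0    <- 6
s0_sq <- 0.10

# Moderated variance: weighted average of the prior and sample variances
s_tilde_sq <- (d0 * s0_sq + df_resid * s_sq) / (d0 + df_resid)

# Ordinary versus moderated t-statistics for the same hypothetical effect
beta_hat <- c(drugA = 0.25, drugB = 0.25, drugC = 0.25)
v <- 0.5   # assumed unscaled variance of the coefficient
t_ordinary  <- beta_hat / sqrt(s_sq * v)
t_moderated <- beta_hat / sqrt(s_tilde_sq * v)

rbind(s_sq, s_tilde_sq, t_ordinary, t_moderated)
```

Compounds with unusually small sample variances (here `drugA`) have their t-statistics pulled down, and those with unusually large variances (`drugB`) have them pulled up, which is what stabilizes inference when few cells are available.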

4. Disease Signature Reversal Score

Concept

The reversal score quantifies how effectively a drug can reverse disease-associated gene expression changes.

Algorithm

For disease signature \(D\) and drug signature \(d\):

\[\text{Reversal Score} = \frac{N_{\text{discordant}}}{N_{\text{concordant}}}\]

Where:

  • \(N_{\text{discordant}}\) = genes where \(\text{sign}(D_g) \neq \text{sign}(d_g)\)
  • \(N_{\text{concordant}}\) = genes where \(\text{sign}(D_g) = \text{sign}(d_g)\)
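
A minimal sketch of this ratio in base R follows. The function name `reversal_score()` and the example signatures are hypothetical. Two choices in it are assumptions rather than documented scFOCAL behavior: genes with a value of exactly zero in either signature are dropped, and the score is returned as `NA` when there are no concordant genes.

```r
# Reversal score: discordant-sign genes divided by concordant-sign genes
reversal_score <- function(disease, drug) {
  shared <- intersect(names(disease), names(drug))
  D <- disease[shared]
  d <- drug[shared]
  keep <- D != 0 & d != 0                  # assumption: ignore zero-valued genes
  n_discordant <- sum(sign(D[keep]) != sign(d[keep]))
  n_concordant <- sum(sign(D[keep]) == sign(d[keep]))
  if (n_concordant == 0) return(NA_real_)  # ratio undefined without concordant genes
  n_discordant / n_concordant
}

# Hypothetical log fold changes for five genes
disease_sig <- c(G1 = 1.2, G2 = -0.8, G3 = 0.5, G4 = -1.1, G5 = 0.3)
drug_sig    <- c(G1 = -0.9, G2 = 0.7, G3 = -0.4, G4 = -0.2, G5 = 0.6)

score <- reversal_score(disease_sig, drug_sig)  # 3 discordant / 2 concordant
```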

Interpretation

| Reversal score | Interpretation |
| --- | --- |
| > 2.0 | Strong reversal potential |
| 1.0 to 2.0 | Moderate reversal |
| < 1.0 | Limited reversal |

5. MAST Differential Expression

Model Structure

For disease signature computation, scFOCAL employs MAST (Model-based Analysis of Single-cell Transcriptomics):

\[\text{logit}(P(Z_g > 0)) = X\beta_D + W\alpha_D\]

\[E[Y_g | Z_g = 1] = X\beta_C + W\alpha_C\]

Where:

  • \(Z_g\) = indicator for detection of gene \(g\) (not to be confused with Fisher’s Z above)
  • \(Y_g\) = continuous expression level
  • \(X\) = design matrix (cell type, condition)
  • \(W\) = cellular detection rate (technical covariate)
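
The two-part structure can be illustrated for a single simulated gene with base R. This is only a sketch of the hurdle idea using `glm()` and `lm()`; it is not MAST's implementation, and the effect sizes, sample size, and variable names are arbitrary.

```r
set.seed(11)
n_cells <- 400

# Simulated covariates: condition (part of X) and cellular detection rate (W)
condition <- rep(c(0, 1), each = n_cells / 2)
cdr <- runif(n_cells, 0.2, 0.8)

# Simulate one gene: detection depends on condition and CDR; expression in
# detected cells depends on condition with a true effect of 0.8
detected <- rbinom(n_cells, 1,
                   plogis(-0.5 + 1.0 * condition + 1.5 * (cdr - 0.5)))
expr <- ifelse(detected == 1,
               2 + 0.8 * condition + 1.0 * (cdr - 0.5) +
                 rnorm(n_cells, sd = 0.5),
               0)

# Discrete component: logistic regression on the detection indicator
fit_discrete <- glm(detected ~ condition + cdr, family = binomial())

# Continuous component: Gaussian regression on detected cells only
fit_continuous <- lm(expr ~ condition + cdr, subset = detected == 1)

coef(fit_discrete)
coef(fit_continuous)
```

Fitting the two components separately lets a condition change how often a gene is detected, how strongly it is expressed when detected, or both.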

Advantages for Single-Cell Data

  1. Handles zero-inflation - Explicitly models dropout events
  2. Controls for technical variation - Includes cellular detection rate
  3. Subject-level effects - Can incorporate random effects for replicates

6. Computational Complexity

Time Complexity Analysis

| Operation | Complexity | Typical runtime |
| --- | --- | --- |
| Drug-Cell Connectivity | O(n × m × g) | ~20 min for 10K cells |
| MAST Differential Expression | O(n × g) | ~5 min per comparison |
| Reversal Scoring | O(d × g) | < 1 min |
| Differential Connectivity | O(n × d) | ~2 min |

Where: n = cells, m = compounds (1679), g = genes (~1000), d = drug signatures

Memory Requirements

  • Minimum: 8 GB RAM
  • Recommended: 16 GB RAM for datasets > 50K cells
  • Large datasets: Consider chunked processing
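
One way chunked processing could look is sketched below: cells are correlated against the drug signatures one block at a time, and the blocks are stacked at the end. The function `chunked_connectivity()`, the matrix layout (genes in rows), and the chunk size are assumptions made for this illustration, not scFOCAL's internal design. The helper `score_matrix_gb()` gives a rough size for a dense cells × compounds score matrix stored as doubles.

```r
set.seed(3)
n_genes <- 100; n_cells <- 250; n_drugs <- 8

# Simulated stand-ins: genes x cells expression, genes x drugs signatures
expr <- matrix(rnorm(n_genes * n_cells), nrow = n_genes)
sigs <- matrix(rnorm(n_genes * n_drugs), nrow = n_genes)

# Correlate one block of cells at a time to limit intermediate memory use
chunked_connectivity <- function(expr, sigs, chunk_size = 100) {
  starts <- seq(1, ncol(expr), by = chunk_size)
  blocks <- lapply(starts, function(s) {
    idx <- s:min(s + chunk_size - 1, ncol(expr))
    cor(expr[, idx, drop = FALSE], sigs, method = "spearman")
  })
  do.call(rbind, blocks)   # cells x drugs score matrix
}

scores <- chunked_connectivity(expr, sigs)

# Approximate size in GB of a dense score matrix (8 bytes per double)
score_matrix_gb <- function(n_cells, n_compounds) {
  n_cells * n_compounds * 8 / 1e9
}
score_matrix_gb(50000, 1679)
```

Because each cell's scores depend only on that cell's own expression profile, the chunked result matches a single call on the full matrix.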

Summary

scFOCAL’s algorithmic framework provides:

  1. Robust statistical inference through Fisher Z-transformation and empirical Bayes
  2. Biological interpretability via connectivity and reversal scores
  3. Scalability to large single-cell datasets
  4. Integration of orthogonal pharmacological and transcriptomic data

References

  1. Subramanian A, et al. A Next Generation Connectivity Map. Cell (2017)
  2. Finak G, et al. MAST: a flexible statistical framework. Genome Biology (2015)
  3. Ritchie ME, et al. limma powers differential expression. Nucleic Acids Res (2015)
  4. Fisher RA. On the probable error of a coefficient of correlation. Metron (1921)

Session Info

sessionInfo()
## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.2.1    ggplot2_4.0.3  rmarkdown_2.31
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.7.3        cli_3.6.6          knitr_1.51         rlang_1.2.0       
##  [5] xfun_0.59          otel_0.2.0         generics_0.1.4     S7_0.2.2          
##  [9] jsonlite_2.0.0     labeling_0.4.3     glue_1.8.1         buildtools_1.0.0  
## [13] htmltools_0.5.9    maketools_1.3.2    sys_3.4.3          sass_0.4.10       
## [17] scales_1.4.0       grid_4.6.1         tibble_3.3.1       evaluate_1.0.5    
## [21] jquerylib_0.1.4    fastmap_1.2.0      yaml_2.3.12        lifecycle_1.0.5   
## [25] compiler_4.6.1     RColorBrewer_1.1-3 pkgconfig_2.0.3    farver_2.1.2      
## [29] digest_0.6.39      R6_2.6.1           tidyselect_1.2.1   pillar_1.11.1     
## [33] magrittr_2.0.5     bslib_0.11.0       withr_3.0.3        tools_4.6.1       
## [37] gtable_0.3.6       cachem_1.1.0