Algorithm Principles

Overview

scFOCAL implements a multi-step computational framework that integrates single-cell transcriptomics with pharmacological perturbation data. This vignette details the mathematical principles underlying each algorithm.

1. Drug-Cell Connectivity Score

Concept

The core innovation of scFOCAL is the Drug-Cell Connectivity Score, which quantifies the transcriptional relationship between individual cells and drug perturbation signatures.

Mathematical Formulation

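For each cell \(c\) and drug \(d\), we compute the Spearman rank correlation across the genes shared by the cell profile and the drug signature. When no ranks are tied, it is given by: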

\[\rho_{c,d} = 1 - \frac{6 \sum_{i=1}^{n} (R_{x_i} - R_{y_i})^2}{n(n^2 - 1)}\]

Where:

\(R_{x_i}\) = rank of gene \(i\) expression in cell \(c\)
\(R_{y_i}\) = rank of gene \(i\) in drug signature \(d\)
\(n\) = number of overlapping genes
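
A minimal base R sketch of this calculation follows. The inputs `cell_expr` and `drug_sig` are hypothetical simulated vectors named by gene, not scFOCAL objects. The sketch restricts both vectors to the overlapping genes and checks that the closed form agrees with `cor(..., method = "spearman")` when there are no tied ranks.

```r
set.seed(1)

# Hypothetical inputs: named numeric vectors keyed by gene symbol
genes_cell <- paste0("G", 1:60)
genes_drug <- paste0("G", 11:80)
cell_expr <- setNames(rnorm(length(genes_cell)), genes_cell)  # expression in cell c
drug_sig  <- setNames(rnorm(length(genes_drug)), genes_drug)  # signature of drug d

# Restrict both vectors to the overlapping genes
shared <- intersect(names(cell_expr), names(drug_sig))
n <- length(shared)

# Rank each vector over the shared genes
r_x <- rank(cell_expr[shared])
r_y <- rank(drug_sig[shared])

# Closed-form Spearman coefficient (exact only when no ranks are tied)
rho_formula <- 1 - 6 * sum((r_x - r_y)^2) / (n * (n^2 - 1))

# Reference value from base R
rho_builtin <- cor(cell_expr[shared], drug_sig[shared], method = "spearman")

all.equal(rho_formula, rho_builtin)
```

With many tied values, as is common for zero counts in single-cell data, the closed form is only approximate. The Pearson correlation of the average ranks, which is what `cor()` computes, is the safer choice in that case.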

Interpretation

Score Range	Interpretation	Biological Meaning
ρ < -0.3	Strong discordance	High drug sensitivity
-0.3 ≤ ρ < 0	Moderate discordance	Moderate sensitivity
0 ≤ ρ < 0.3	Moderate concordance	Moderate resistance
ρ ≥ 0.3	Strong concordance	High drug resistance
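
As an illustration, the thresholds in the table can be applied with `cut()`. The scores below are made up for the example, and the category labels are copied from the table.

```r
# Hypothetical connectivity scores for five cells against one drug
rho <- c(-0.45, -0.10, 0.00, 0.22, 0.30)

# Left-closed intervals reproduce the table:
# [-Inf, -0.3), [-0.3, 0), [0, 0.3), [0.3, Inf)
category <- cut(
  rho,
  breaks = c(-Inf, -0.3, 0, 0.3, Inf),
  labels = c("Strong discordance", "Moderate discordance",
             "Moderate concordance", "Strong concordance"),
  right = FALSE
)

data.frame(rho = rho, category = category)
```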

2. Fisher’s Z-Transformation

Purpose

To enable statistical comparison of correlation coefficients across conditions, we apply Fisher’s Z-transformation:

\[Z = \frac{1}{2} \ln\left(\frac{1+\rho}{1-\rho}\right) = \text{arctanh}(\rho)\]

Properties

# Demonstrate Fisher Z transformation
library(ggplot2)

# Evaluate Z over a grid of correlations; equivalent to atanh(rho)
rho <- seq(-0.95, 0.95, by = 0.01)
z <- 0.5 * log((1 + rho) / (1 - rho))

df <- data.frame(rho = rho, z = z)

ggplot(df, aes(x = rho, y = z)) +
  geom_line(color = "#0072B2", linewidth = 1.2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(
    title = "Fisher's Z-Transformation",
    subtitle = expression(Z == frac(1,2) * ln * bgroup("(", frac(1+rho, 1-rho), ")")),
    x = expression("Correlation coefficient (" * rho * ")"),
    y = "Fisher's Z"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12)
  )

Key advantages:

Normalizes the sampling distribution - Z-values follow an approximately normal distribution
Stabilizes variance - Variance becomes approximately constant across different correlation values
Enables parametric testing - Allows use of standard statistical tests (t-tests, ANOVA)

Variance Property

The variance of Z is approximately:

\[\text{Var}(Z) \approx \frac{1}{n-3}\]

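Where \(n\) is the number of genes used in the correlation. This approximation is derived for Pearson correlations of bivariate normal data; for Spearman correlations a slightly larger value, roughly \(1.06/(n-3)\), is often used instead.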
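
The variance approximation makes it straightforward to attach a confidence interval to a single correlation. The sketch below uses hypothetical values for `rho` and `n`, applies the \(1/(n-3)\) approximation stated above, and assumes the Z-value is approximately normal.

```r
# Hypothetical values: one cell-drug correlation over n overlapping genes
rho <- 0.5
n   <- 103

z  <- atanh(rho)          # Fisher's Z, identical to 0.5 * log((1 + rho) / (1 - rho))
se <- sqrt(1 / (n - 3))   # approximate standard error of Z

# 95% interval on the Z scale, mapped back to the correlation scale with tanh()
z_ci   <- z + c(-1, 1) * qnorm(0.975) * se
rho_ci <- tanh(z_ci)

round(c(z = z, se = se, lower = rho_ci[1], upper = rho_ci[2]), 3)
```

The interval is symmetric on the Z scale but not on the correlation scale, which is one reason comparisons are carried out on Z rather than on \(\rho\) directly.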

3. Differential Connectivity Analysis

Linear Model Framework

scFOCAL uses limma (Linear Models for Microarray Data) for differential connectivity analysis:

\[Z_{ij} = \mu + \beta_1 \cdot \text{Subject}_j + \beta_2 \cdot \text{Sensitivity}_i + \epsilon_{ij}\]

Where:

\(Z_{ij}\) = Z-transformed connectivity for cell \(i\) in subject \(j\)
\(\text{Subject}_j\) = blocking factor for subject/replicate
\(\text{Sensitivity}_i\) = binary indicator (sensitive vs resistant)
\(\epsilon_{ij}\) = residual error
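
To make the model concrete, the sketch below fits it to simulated Z-values for a single compound using base R `lm()`. This is a stand-in for illustration, not the limma-based code path used by scFOCAL, and the subject labels, effect sizes, and noise level are all invented.

```r
set.seed(42)

# Simulated design: 3 subjects, 40 cells each, half labelled sensitive
n_per_subject <- 40
subject <- factor(rep(c("S1", "S2", "S3"), each = n_per_subject))
sensitivity <- factor(
  rep(rep(c("resistant", "sensitive"), each = n_per_subject / 2), times = 3),
  levels = c("resistant", "sensitive")
)

# Simulated Z-transformed connectivity for one compound:
# subject-specific offsets plus a shift of -0.3 in sensitive cells
subject_offset <- c(S1 = 0.00, S2 = 0.10, S3 = -0.05)
z <- unname(subject_offset[as.character(subject)]) +
  ifelse(sensitivity == "sensitive", -0.3, 0) +
  rnorm(length(subject), sd = 0.1)

# Z ~ mu + Subject + Sensitivity, matching the model above
fit <- lm(z ~ subject + sensitivity)

# Estimated sensitivity effect (beta_2) with its standard error, t-statistic and p-value
coef(summary(fit))["sensitivitysensitive", ]
```

In limma the analogous steps are `lmFit()` with the same design followed by `eBayes()`, applied to all compounds at once rather than one at a time.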

Empirical Bayes Moderation

limma applies empirical Bayes moderation to improve variance estimates, shrinking each compound's residual variance toward a value pooled across all compounds.

The moderated t-statistic provides:

Increased statistical power for compounds with few cells
Robust inference even with small sample sizes
Proper multiple testing correction via FDR adjustment
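
For reference, the moderated t-statistic in limma's empirical Bayes framework is usually written as shown below. The notation follows the limma literature rather than anything defined in this vignette: \(s_g^2\) is the residual variance for compound \(g\) with \(d_g\) degrees of freedom, \(s_0^2\) and \(d_0\) are the prior variance and prior degrees of freedom estimated from all compounds, \(\hat{\beta}_g\) is the estimated coefficient, and \(v_g\) is its unscaled variance factor.

\[\tilde{s}_g^2 = \frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g}, \qquad \tilde{t}_g = \frac{\hat{\beta}_g}{\tilde{s}_g \sqrt{v_g}}\]

Under the null hypothesis \(\tilde{t}_g\) follows a t-distribution with \(d_0 + d_g\) degrees of freedom. The extra \(d_0\) degrees of freedom are the source of the added power when few cells are available.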

4. Disease Signature Reversal Score

Concept

The reversal score quantifies how effectively a drug can reverse disease-associated gene expression changes.

Algorithm

For disease signature \(D\) and drug signature \(d\):

\[\text{Reversal Score} = \frac{N_{\text{discordant}}}{N_{\text{concordant}}}\]

Where:

\(N_{\text{discordant}}\) = genes where \(\text{sign}(D_g) \neq \text{sign}(d_g)\)
\(N_{\text{concordant}}\) = genes where \(\text{sign}(D_g) = \text{sign}(d_g)\)
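
A minimal sketch of the ratio in base R is shown below. The function name `reversal_score()`, the example signatures, and the handling of edge cases (zero-valued genes and an empty concordant set) are assumptions made for illustration, not scFOCAL's implementation.

```r
# Hypothetical log fold-changes keyed by gene symbol
disease_sig <- c(A = 1.2, B = -0.8, C = 0.5, D = -1.1, E = 0.9, F = 0.3)
drug_sig    <- c(A = -0.7, B = 0.4, C = -0.2, D = -0.6, E = -1.0, G = 0.5)

reversal_score <- function(disease, drug) {
  # Compare directions only on genes present in both signatures
  shared <- intersect(names(disease), names(drug))
  s_dis  <- sign(disease[shared])
  s_drug <- sign(drug[shared])

  # Genes with a zero in either signature have no direction; drop them (an assumption)
  keep <- s_dis != 0 & s_drug != 0
  n_discordant <- sum(s_dis[keep] != s_drug[keep])
  n_concordant <- sum(s_dis[keep] == s_drug[keep])

  # The ratio is undefined without concordant genes; return NA rather than Inf (an assumption)
  if (n_concordant == 0) return(NA_real_)
  n_discordant / n_concordant
}

reversal_score(disease_sig, drug_sig)
```

In this toy example genes A, B, C and E change in opposite directions and gene D changes in the same direction, so the score is 4 / 1 = 4.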

Interpretation

Reversal Score	Interpretation
> 2.0	Strong reversal potential
1.0 - 2.0	Moderate reversal
< 1.0	Limited reversal

5. MAST Differential Expression

Model Structure

For disease signature computation, scFOCAL employs MAST (Model-based Analysis of Single-cell Transcriptomics):

\[\text{logit}(P(Z_g > 0)) = X\beta_D + W\alpha_D\]

\[E[Y_g | Z_g = 1] = X\beta_C + W\alpha_C\]

Where:

\(Z_g\) = indicator for gene detection
\(Y_g\) = continuous expression level
\(X\) = design matrix (cell type, condition)
\(W\) = cellular detection rate (technical covariate)
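
The two-part structure can be illustrated for a single gene with base R: a logistic regression for detection and a Gaussian linear model for expression among detected cells. This is a sketch of the hurdle idea on simulated data, not a call into the MAST package; the variable names and effect sizes are invented.

```r
set.seed(7)
n_cells <- 400

# Simulated covariates: condition (part of X) and cellular detection rate (W)
condition <- factor(rep(c("control", "disease"), each = n_cells / 2))
cdr       <- runif(n_cells, 0.1, 0.6)

# Simulate one gene: detection probability and positive expression both rise in disease
p_detect <- plogis(-1 + 1.5 * (condition == "disease") + 2 * cdr)
detected <- rbinom(n_cells, 1, p_detect)                       # Z_g
expr <- ifelse(
  detected == 1,
  2 + 0.8 * (condition == "disease") + cdr + rnorm(n_cells, sd = 0.5),
  0
)                                                              # Y_g (log scale)

# Discrete component: logistic regression on detection
fit_d <- glm(detected ~ condition + cdr, family = binomial())

# Continuous component: linear model fitted to detected cells only
fit_c <- lm(expr ~ condition + cdr, subset = detected == 1)

c(discrete   = unname(coef(fit_d)["conditiondisease"]),
  continuous = unname(coef(fit_c)["conditiondisease"]))
```

MAST fits both components for every gene through `zlm()` and combines the evidence from the two parts into a single test per gene.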

Advantages for Single-Cell Data

Handles zero-inflation - Explicitly models dropout events
Controls for technical variation - Includes cellular detection rate
Subject-level effects - Can incorporate random effects for replicates

6. Computational Complexity

Time Complexity Analysis

Operation	Complexity	Typical Runtime
Drug-Cell Connectivity	O(n × m × g)	~20 min for 10K cells
MAST Differential Expression	O(n × g)	~5 min per comparison
Reversal Scoring	O(d × g)	< 1 min
Differential Connectivity	O(n × d)	~2 min

Where: n = cells, m = compounds (1679), g = genes (~1000), d = drug signatures

Memory Requirements

Minimum: 8 GB RAM
Recommended: 16 GB RAM for datasets > 50K cells
Large datasets: Consider chunked processing
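
One way to approach chunked processing is sketched below: rank the drug signatures once, then rank and correlate cells one block at a time. The matrix sizes and `chunk_size` are illustrative assumptions, and the code is not taken from scFOCAL.

```r
set.seed(3)

# Simulated inputs (genes x cells, genes x drugs); sizes are illustrative only
n_genes <- 200; n_cells <- 1000; n_drugs <- 50
expr <- matrix(rnorm(n_genes * n_cells), nrow = n_genes)
sigs <- matrix(rnorm(n_genes * n_drugs), nrow = n_genes)

# Rank drug signatures once; Spearman is the Pearson correlation of ranks
sig_ranks <- apply(sigs, 2, rank)

chunk_size <- 250   # hypothetical tuning parameter
chunks <- split(seq_len(n_cells), ceiling(seq_len(n_cells) / chunk_size))

# Process cells in chunks so only one block of cell ranks is held in memory at a time
conn <- do.call(rbind, lapply(chunks, function(idx) {
  cell_ranks <- apply(expr[, idx, drop = FALSE], 2, rank)
  cor(cell_ranks, sig_ranks)   # (cells in chunk) x drugs
}))

dim(conn)
```

Each chunk costs on the order of (cells in chunk) × drugs × genes operations, so the total work matches the O(n × m × g) entry in the table while peak memory is governed by `chunk_size`.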

Summary

scFOCAL’s algorithmic framework provides:

Robust statistical inference through Fisher Z-transformation and empirical Bayes
Biological interpretability via connectivity and reversal scores
Scalability to large single-cell datasets
Integration of orthogonal pharmacological and transcriptomic data

References

Subramanian A, et al. A Next Generation Connectivity Map. Cell (2017)
Finak G, et al. MAST: a flexible statistical framework. Genome Biology (2015)
Ritchie ME, et al. limma powers differential expression. Nucleic Acids Res (2015)
Fisher RA. On the probable error of a coefficient of correlation. Metron (1921)

Session Info

sessionInfo()

## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.2.1    ggplot2_4.0.3  rmarkdown_2.31
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.7.3        cli_3.6.6          knitr_1.51         rlang_1.2.0       
##  [5] xfun_0.59          otel_0.2.0         generics_0.1.4     S7_0.2.2          
##  [9] jsonlite_2.0.0     labeling_0.4.3     glue_1.8.1         buildtools_1.0.0  
## [13] htmltools_0.5.9    maketools_1.3.2    sys_3.4.3          sass_0.4.10       
## [17] scales_1.4.0       grid_4.6.1         tibble_3.3.1       evaluate_1.0.5    
## [21] jquerylib_0.1.4    fastmap_1.2.0      yaml_2.3.12        lifecycle_1.0.5   
## [25] compiler_4.6.1     RColorBrewer_1.1-3 pkgconfig_2.0.3    farver_2.1.2      
## [29] digest_0.6.39      R6_2.6.1           tidyselect_1.2.1   pillar_1.11.1     
## [33] magrittr_2.0.5     bslib_0.11.0       withr_3.0.3        tools_4.6.1       
## [37] gtable_0.3.6       cachem_1.1.0