--- title: "Algorithm Principles" author: "Zaoqu Liu, Robert K. Suter, Nagi G. Ayad" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Algorithm Principles} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, fig.width = 8, fig.height = 6, eval = TRUE ) library(ggplot2) library(dplyr) ``` ## Overview scFOCAL implements a multi-step computational framework that integrates single-cell transcriptomics with pharmacological perturbation data. This vignette details the mathematical principles underlying each algorithm. ## 1. Drug-Cell Connectivity Score ### Concept The core innovation of scFOCAL is the **Drug-Cell Connectivity Score**, which quantifies the transcriptional relationship between individual cells and drug perturbation signatures. ### Mathematical Formulation For each cell $c$ and drug $d$, we compute the Spearman rank correlation: $$\rho_{c,d} = 1 - \frac{6 \sum_{i=1}^{n} (R_{x_i} - R_{y_i})^2}{n(n^2 - 1)}$$ Where: - $R_{x_i}$ = rank of gene $i$ expression in cell $c$ - $R_{y_i}$ = rank of gene $i$ in drug signature $d$ - $n$ = number of overlapping genes ### Interpretation ```{r connectivity-concept, echo=FALSE, fig.height=4, fig.width=7} # Visualization of connectivity concept set.seed(42) df <- data.frame( Connectivity = seq(-1, 1, length.out = 100), Density = dnorm(seq(-1, 1, length.out = 100), mean = 0, sd = 0.3) ) ggplot(df, aes(x = Connectivity, y = Density)) + geom_area(fill = "steelblue", alpha = 0.6) + geom_vline(xintercept = 0, linetype = "dashed", color = "gray40") + annotate("text", x = -0.6, y = 1.2, label = "Sensitive\n(Discordant)", color = "#2166AC", fontface = "bold", size = 4) + annotate("text", x = 0.6, y = 1.2, label = "Resistant\n(Concordant)", color = "#B2182B", fontface = "bold", size = 4) + annotate("segment", x = -0.8, xend = -0.3, y = 0.8, yend = 0.8, arrow = arrow(length = unit(0.2, "cm")), color = "#2166AC") + annotate("segment", x = 0.3, xend = 0.8, y = 0.8, yend = 0.8, arrow = arrow(length = unit(0.2, "cm")), color = "#B2182B") + labs( title = "Drug-Cell Connectivity Score Distribution", subtitle = "Negative values indicate drug sensitivity, positive values indicate resistance", x = "Connectivity Score (ρ)", y = "Density" ) + theme_minimal() + theme( plot.title = element_text(face = "bold", size = 14), plot.subtitle = element_text(size = 10, color = "gray40") ) ``` | Score Range | Interpretation | Biological Meaning | |-------------|----------------|-------------------| | ρ < -0.3 | Strong discordance | High drug sensitivity | | -0.3 ≤ ρ < 0 | Moderate discordance | Moderate sensitivity | | 0 ≤ ρ < 0.3 | Moderate concordance | Moderate resistance | | ρ ≥ 0.3 | Strong concordance | High drug resistance | ## 2. Fisher's Z-Transformation ### Purpose To enable statistical comparison of correlation coefficients across conditions, we apply Fisher's Z-transformation: $$Z = \frac{1}{2} \ln\left(\frac{1+\rho}{1-\rho}\right) = \text{arctanh}(\rho)$$ ### Properties ```{r fisher-z-demo, fig.height=4, fig.width=7} # Demonstrate Fisher Z transformation rho <- seq(-0.95, 0.95, by = 0.01) z <- 0.5 * log((1 + rho) / (1 - rho)) df <- data.frame(rho = rho, z = z) ggplot(df, aes(x = rho, y = z)) + geom_line(color = "#0072B2", linewidth = 1.2) + geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") + geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") + labs( title = "Fisher's Z-Transformation", subtitle = expression(Z == frac(1,2) * ln * bgroup("(", frac(1+rho, 1-rho), ")")), x = expression("Correlation coefficient (" * rho * ")"), y = "Fisher's Z" ) + theme_minimal() + theme( plot.title = element_text(face = "bold", size = 14), plot.subtitle = element_text(size = 12) ) ``` **Key advantages:** 1. **Normalizes the sampling distribution** - Z-values follow an approximately normal distribution 2. **Stabilizes variance** - Variance becomes approximately constant across different correlation values 3. **Enables parametric testing** - Allows use of standard statistical tests (t-tests, ANOVA) ### Variance Property The variance of Z is approximately: $$\text{Var}(Z) \approx \frac{1}{n-3}$$ Where $n$ is the number of genes used in correlation. ## 3. Differential Connectivity Analysis ### Linear Model Framework scFOCAL uses limma (Linear Models for Microarray Data) for differential connectivity analysis: $$Z_{ij} = \mu + \beta_1 \cdot \text{Subject}_j + \beta_2 \cdot \text{Sensitivity}_i + \epsilon_{ij}$$ Where: - $Z_{ij}$ = Z-transformed connectivity for cell $i$ in subject $j$ - $\text{Subject}_j$ = blocking factor for subject/replicate - $\text{Sensitivity}_i$ = binary indicator (sensitive vs resistant) - $\epsilon_{ij}$ = residual error ### Empirical Bayes Moderation limma applies empirical Bayes moderation to improve variance estimates: ```{r ebayes-concept, echo=FALSE, fig.height=4, fig.width=7} # Conceptual visualization of empirical Bayes set.seed(123) n_genes <- 1000 true_var <- rgamma(n_genes, shape = 4, rate = 2) observed_var <- rchisq(n_genes, df = 5) * true_var / 5 df <- data.frame( Observed = observed_var, Moderated = (true_var + mean(observed_var)) / 2 ) ggplot() + geom_density(data = df, aes(x = Observed, fill = "Observed"), alpha = 0.5) + geom_density(data = df, aes(x = Moderated, fill = "Moderated"), alpha = 0.5) + scale_fill_manual(values = c("Observed" = "#E69F00", "Moderated" = "#56B4E9"), name = "Variance") + labs( title = "Empirical Bayes Variance Moderation", subtitle = "Borrowing information across compounds stabilizes variance estimates", x = "Variance Estimate", y = "Density" ) + theme_minimal() + theme( plot.title = element_text(face = "bold", size = 14), legend.position = "top" ) ``` The moderated t-statistic provides: - **Increased statistical power** for compounds with few cells - **Robust inference** even with small sample sizes - **Proper multiple testing correction** via FDR adjustment ## 4. Disease Signature Reversal Score ### Concept The reversal score quantifies how effectively a drug can reverse disease-associated gene expression changes. ### Algorithm For disease signature $D$ and drug signature $d$: $$\text{Reversal Score} = \frac{N_{\text{discordant}}}{N_{\text{concordant}}}$$ Where: - $N_{\text{discordant}}$ = genes where $\text{sign}(D_g) \neq \text{sign}(d_g)$ - $N_{\text{concordant}}$ = genes where $\text{sign}(D_g) = \text{sign}(d_g)$ ```{r reversal-concept, echo=FALSE, fig.height=4, fig.width=7} # Illustration of reversal scoring df <- data.frame( Gene = paste0("Gene", 1:10), Disease = c(2.1, -1.5, 1.8, -0.9, 2.5, -1.2, 0.7, -2.0, 1.3, -0.6), Drug = c(-1.8, 1.2, -1.5, 0.8, -2.2, 1.0, -0.5, 1.7, -1.1, 0.4) ) df$Interaction <- ifelse(df$Disease * df$Drug < 0, "Discordant\n(Reversal)", "Concordant") ggplot(df, aes(x = Disease, y = Drug, color = Interaction)) + geom_point(size = 4) + geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") + geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") + geom_text(aes(label = Gene), hjust = -0.2, vjust = -0.5, size = 3) + scale_color_manual(values = c("Discordant\n(Reversal)" = "#2166AC", "Concordant" = "#B2182B")) + labs( title = "Disease Signature Reversal Concept", subtitle = "Discordant genes indicate therapeutic potential", x = "Disease Signature (log2FC)", y = "Drug Signature (log2FC)" ) + theme_minimal() + xlim(-3, 3) + ylim(-3, 3) + theme( plot.title = element_text(face = "bold", size = 14), legend.position = "right" ) ``` ### Interpretation | Reversal Score | Interpretation | |----------------|----------------| | > 2.0 | Strong reversal potential | | 1.0 - 2.0 | Moderate reversal | | < 1.0 | Limited reversal | ## 5. MAST Differential Expression ### Model Structure For disease signature computation, scFOCAL employs MAST (Model-based Analysis of Single-cell Transcriptomics): $$\text{logit}(P(Z_g > 0)) = X\beta_D + W\alpha_D$$ $$E[Y_g | Z_g = 1] = X\beta_C + W\alpha_C$$ Where: - $Z_g$ = indicator for gene detection - $Y_g$ = continuous expression level - $X$ = design matrix (cell type, condition) - $W$ = cellular detection rate (technical covariate) ### Advantages for Single-Cell Data 1. **Handles zero-inflation** - Explicitly models dropout events 2. **Controls for technical variation** - Includes cellular detection rate 3. **Subject-level effects** - Can incorporate random effects for replicates ## 6. Computational Complexity ### Time Complexity Analysis | Operation | Complexity | Typical Runtime | |-----------|------------|-----------------| | Drug-Cell Connectivity | O(n × m × g) | ~20 min for 10K cells | | MAST Differential Expression | O(n × g) | ~5 min per comparison | | Reversal Scoring | O(d × g) | < 1 min | | Differential Connectivity | O(n × d) | ~2 min | Where: n = cells, m = compounds (1679), g = genes (~1000), d = drug signatures ### Memory Requirements - **Minimum**: 8 GB RAM - **Recommended**: 16 GB RAM for datasets > 50K cells - **Large datasets**: Consider chunked processing ## Summary ```{r summary-flowchart, echo=FALSE, fig.height=5, fig.width=8} # Create a conceptual flowchart library(ggplot2) boxes <- data.frame( x = c(1, 2, 3, 4, 5), y = rep(2, 5), label = c("scRNA-seq\nData", "Disease\nSignatures", "Drug-Cell\nConnectivity", "Z-Transform\n& limma", "Combination\nDiscovery"), color = c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#CC79A7") ) arrows <- data.frame( x = c(1.3, 2.3, 3.3, 4.3), xend = c(1.7, 2.7, 3.7, 4.7), y = rep(2, 4), yend = rep(2, 4) ) ggplot() + geom_rect(data = boxes, aes(xmin = x - 0.4, xmax = x + 0.4, ymin = y - 0.5, ymax = y + 0.5, fill = color), color = "black", alpha = 0.8) + geom_text(data = boxes, aes(x = x, y = y, label = label), size = 3.5, fontface = "bold") + geom_segment(data = arrows, aes(x = x, xend = xend, y = y, yend = yend), arrow = arrow(length = unit(0.3, "cm")), linewidth = 1) + scale_fill_identity() + theme_void() + labs(title = "scFOCAL Algorithm Pipeline") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16)) ``` scFOCAL's algorithmic framework provides: 1. **Robust statistical inference** through Fisher Z-transformation and empirical Bayes 2. **Biological interpretability** via connectivity and reversal scores 3. **Scalability** to large single-cell datasets 4. **Integration** of orthogonal pharmacological and transcriptomic data ## References 1. Subramanian A, *et al*. A Next Generation Connectivity Map. *Cell* (2017) 2. Finak G, *et al*. MAST: a flexible statistical framework. *Genome Biology* (2015) 3. Ritchie ME, *et al*. limma powers differential expression. *Nucleic Acids Res* (2015) 4. Fisher RA. On the probable error of a coefficient of correlation. *Metron* (1921) ## Session Info ```{r session-info} sessionInfo() ```