--- title: "Algorithm Principles and Mathematical Framework" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Algorithm Principles and Mathematical Framework} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 8, fig.height = 6, warning = FALSE, message = FALSE ) ``` ## Theoretical Foundation Connectome is built on the premise that cell-cell communication can be inferred from the co-expression patterns of ligand-receptor pairs across distinct cell populations. This document describes the mathematical framework underlying the analysis. ## 1. Ligand-Receptor Database ### FANTOM5 Database Connectome utilizes the FANTOM5 (Functional Annotation of the Mammalian Genome 5) ligand-receptor database, which provides curated pairs of interacting molecules: $$\mathcal{P} = \{(L_k, R_k, M_k)\}_{k=1}^{N}$$ Where: - $L_k$: Ligand gene symbol - $R_k$: Receptor gene symbol - $M_k$: Signaling mode/family classification - $N$: Total number of pairs (~2,557 for human) ### Evidence Levels Pairs are classified by evidence strength: | Level | Description | |-------|-------------| | Literature supported | Experimentally validated interactions | | Putative | Computationally predicted interactions | ## 2. Edge Weight Computation ### Expression Metrics For each cell population $i$ and gene $g$: **Normalized Expression:** $$\bar{E}_{i,g} = \frac{1}{|C_i|} \sum_{c \in C_i} E_{c,g}$$ **Scaled Expression (Z-score):** $$Z_{i,g} = \frac{\bar{E}_{i,g} - \mu_g}{\sigma_g}$$ **Percent Expression:** $$P_{i,g} = \frac{|\{c \in C_i : E_{c,g} > 0\}|}{|C_i|}$$ ### Edge Weight Functions Given source population $i$, target population $j$, and ligand-receptor pair $(L, R)$: **Product (Default):** $$w_{ij}^{LR} = E_{i,L} \times E_{j,R}$$ **Sum:** $$w_{ij}^{LR} = E_{i,L} + E_{j,R}$$ **Mean:** $$w_{ij}^{LR} = \frac{E_{i,L} + E_{j,R}}{2}$$ The product formulation captures the multiplicative nature of ligand-receptor binding kinetics. ## 3. Statistical Testing ### Wilcoxon Rank-Sum Test For each gene $g$ in cluster $i$, we test whether expression differs from background: $$H_0: \text{median}(E_{C_i,g}) = \text{median}(E_{C_{\backslash i},g})$$ The test statistic: $$W = \sum_{c \in C_i} R_c$$ Where $R_c$ is the rank of cell $c$ in the combined sample. ### Multiple Testing Correction Adjusted p-values using Bonferroni correction: $$p_{adj} = \min(p \times m, 1)$$ Where $m$ is the number of tests performed. ## 4. Diagnostic Odds Ratio (DOR) The DOR quantifies gene specificity for a cell cluster using a 2×2 contingency table: | | Expressing | Non-expressing | |--|-----------|----------------| | In cluster | TP | FN | | Out of cluster | FP | TN | ### Standard DOR $$\text{DOR} = \frac{TP \times TN}{FP \times FN}$$ ### Haldane-Anscombe Correction To handle zero cells, we apply the Haldane-Anscombe correction with pseudocount $\epsilon = 0.5$: $$\text{DOR}_{corrected} = \frac{(TP + \epsilon)(TN + \epsilon)}{(FP + \epsilon)(FN + \epsilon)}$$ Log-transformed for symmetry: $$\log(\text{DOR}) = \log(TP + \epsilon) + \log(TN + \epsilon) - \log(FP + \epsilon) - \log(FN + \epsilon)$$ **Interpretation:** - $\log(\text{DOR}) > 0$: Gene is enriched in cluster - $\log(\text{DOR}) < 0$: Gene is depleted in cluster - $\log(\text{DOR}) = 0$: No association ## 5. Network Centrality Analysis ### Graph Construction The connectome is represented as a directed weighted graph: $$G = (V, E, w)$$ Where: - $V$: Cell populations (nodes) - $E$: Signaling edges - $w$: Edge weights ### Kleinberg's Hub and Authority Scores **Authority score** (receiving importance): $$a_i = \sum_{j \rightarrow i} w_{ji} \cdot h_j$$ **Hub score** (sending importance): $$h_i = \sum_{i \rightarrow j} w_{ij} \cdot a_j$$ These are computed iteratively until convergence: ``` Initialize: h = a = 1/√n Repeat until convergence: a' = A^T · h h' = A · a Normalize a' and h' a = a', h = h' ``` ## 6. Differential Connectivity Analysis ### Fold Change Computation For two conditions (reference and test): $$\text{LFC}_{ij}^{L} = \log_2\left(\frac{E_{i,L}^{test}}{E_{i,L}^{ref}}\right)$$ $$\text{LFC}_{ij}^{R} = \log_2\left(\frac{E_{j,R}^{test}}{E_{j,R}^{ref}}\right)$$ ### Perturbation Score The overall perturbation score combines ligand and receptor changes: $$S_{ij}^{LR} = |\text{LFC}_{ij}^{L}| \times |\text{LFC}_{ij}^{R}|$$ This captures edges where both components are differentially expressed. ## 7. Implementation Details ### Computational Complexity | Operation | Complexity | |-----------|------------| | Expression averaging | $O(n \times g)$ | | Edge construction | $O(k^2 \times p)$ | | P-value calculation | $O(k \times g \times n)$ | | Centrality analysis | $O(k^2 \times p)$ | Where: - $n$: Number of cells - $g$: Number of genes - $k$: Number of cell populations - $p$: Number of L-R pairs ### Memory Efficiency Connectome uses: - `data.table` for efficient data manipulation - Sparse matrix operations via `Matrix` package - Pre-allocated vectors to avoid memory fragmentation ## References 1. **FANTOM5 Database**: Ramilowski, J.A. *et al.* A draft network of ligand–receptor-mediated multicellular signalling in human. *Nat Commun* 6, 7866 (2015). 2. **Kleinberg Algorithm**: Kleinberg, J.M. Authoritative sources in a hyperlinked environment. *Journal of the ACM* 46(5), 604–632 (1999). 3. **Haldane-Anscombe Correction**: Agresti, A. *Categorical Data Analysis*. Wiley, 3rd edition (2013). ## Session Info ```{r} sessionInfo() ```