---
title: "Algorithm & Methodology"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_caption: true
vignette: >
  %\VignetteIndexEntry{Algorithm & Methodology}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  fig.align = "center",
  message = FALSE,
  warning = FALSE
)
```

## Overview

NOVA infers cell-to-cell communication networks from ligand-receptor co-expression patterns. This document details the mathematical foundations and algorithmic implementation of NOVA's analysis pipeline.

## Theoretical Background

### Cell-Cell Communication

Intercellular communication is a fundamental biological process in which cells exchange information through signaling molecules. In transcriptomic analysis, we focus on **ligand-receptor (L-R) interactions**, where:

- **Ligands**: Secreted or membrane-bound signaling molecules produced by sending cells
- **Receptors**: Cell surface proteins on receiving cells that bind specific ligands

The strength of a potential communication event can be estimated from gene expression data by examining the co-expression of ligand-receptor pairs between cell populations.

## Mathematical Framework

### 1. Gene Expression Statistics

For a given gene $g$ in cluster $c$, NOVA computes:

#### Detection Rate (Fraction of Expressing Cells)

$$
\text{pct}_{g,c} = \frac{|\{i \in c : x_{g,i} > 0\}|}{|c|}
$$

where $x_{g,i}$ is the expression of gene $g$ in cell $i$, and $|c|$ is the number of cells in cluster $c$.

#### Mean Expression

$$
\bar{x}_{g,c} = \frac{1}{|c|} \sum_{i \in c} x_{g,i}
$$

### 2. Expression Specificity

The specificity score quantifies how preferentially a gene is expressed in a given cluster relative to all clusters:

$$
S_{g,c} = \frac{\bar{x}_{g,c}}{\sum_{c' \in C} \bar{x}_{g,c'}}
$$

where $C$ is the set of all clusters. For non-negative expression values, this metric ranges from 0 to 1, with higher values indicating more cluster-specific expression. If a gene has zero mean expression in every cluster, the denominator is zero and the specificity is left at 0.

**Properties:**

- $\sum_{c \in C} S_{g,c} = 1$ for each gene expressed in at least one cluster
- $S_{g,c} = 0$ if gene $g$ is not expressed in cluster $c$
- $S_{g,c} = 1$ if gene $g$ is expressed only in cluster $c$

### 3. Edge Weight Calculation

For a ligand-receptor pair $(L, R)$ between sending cluster $s$ and target cluster $t$:

#### Expression-based Weight

$$
W_{\text{expr}}(L,R,s,t) = \bar{x}_{L,s} \times \bar{x}_{R,t}
$$

#### Specificity-based Weight

$$
W_{\text{spec}}(L,R,s,t) = S_{L,s} \times S_{R,t}
$$

### 4. Filtering Criteria

NOVA applies the following filters to identify biologically meaningful interactions:

1. **Detection threshold**: Both ligand and receptor must be detected in more than a minimum fraction of cells:

   $$
   \text{pct}_{L,s} > \theta_{\text{pct}} \quad \text{and} \quad \text{pct}_{R,t} > \theta_{\text{pct}}
   $$

2. **Expression threshold**: The edge weight must exceed a minimum value:

   $$
   W_{\text{expr}}(L,R,s,t) > \theta_{\text{expr}}
   $$

## Algorithmic Implementation

### Cluster Statistics Computation

```r
# Illustrative R sketch of the cluster statistics step (not NOVA's internal code).
# expr: dense genes x cells matrix with gene row names; clusters: one label per cell.
compute_cluster_stats <- function(expr, clusters) {
  cluster_ids <- unique(as.character(clusters))
  pct <- matrix(0, nrow(expr), length(cluster_ids),
                dimnames = list(rownames(expr), cluster_ids))
  mean_expr <- pct
  for (cl in cluster_ids) {
    cells_in_cl <- expr[, clusters == cl, drop = FALSE]
    pct[, cl] <- rowMeans(cells_in_cl > 0)    # detection rate
    mean_expr[, cl] <- rowMeans(cells_in_cl)  # mean expression
  }
  # Specificity: normalize each gene (row) by its total across clusters;
  # genes with zero total keep a specificity of 0.
  row_sum <- rowSums(mean_expr)
  expressed <- row_sum > 0
  specificity <- mean_expr
  specificity[expressed, ] <- mean_expr[expressed, , drop = FALSE] / row_sum[expressed]
  list(pct = pct, mean = mean_expr, specificity = specificity)
}
```

### Edge Computation

```r
# Illustrative R sketch of edge computation (not NOVA's internal code).
# lr_pairs: data frame with columns `ligand` and `receptor` (assumed names);
# clusters: character vector of cluster names;
# min_pct, min_expr: the thresholds theta_pct and theta_expr.
compute_edges <- function(lr_pairs, lig_stats, rec_stats, clusters,
                          min_pct, min_expr = 0) {
  edges <- list()
  for (i in seq_len(nrow(lr_pairs))) {
    L <- as.character(lr_pairs$ligand[i])
    R <- as.character(lr_pairs$receptor[i])
    for (s in clusters) {
      for (t in clusters) {
        # Detection threshold on both partners
        if (lig_stats$pct[L, s] > min_pct && rec_stats$pct[R, t] > min_pct) {
          weight_expr <- lig_stats$mean[L, s] * rec_stats$mean[R, t]
          weight_spec <- lig_stats$specificity[L, s] * rec_stats$specificity[R, t]
          # Expression threshold
          if (weight_expr > min_expr) {
            edges[[length(edges) + 1]] <- data.frame(
              ligand = L, receptor = R, source = s, target = t,
              weight_expr = weight_expr, weight_spec = weight_spec
            )
          }
        }
      }
    }
  }
  do.call(rbind, edges)  # NULL when no edge passes the filters
}
```

## Differential Analysis

### Fold Change Calculation

For comparing communication between two conditions (reference vs. target):

$$
\log_2 \text{FC} = \log_2\left(\frac{W_{\text{target}} + \epsilon}{W_{\text{reference}} + \epsilon}\right)
$$

where $\epsilon$ is a small constant (default: 0.001) added to both weights to avoid division by zero.

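
As a quick numerical illustration of the fold-change formula, the following minimal sketch applies it to three made-up edge weights. The helper name `log2_fold_change` and the weight values are hypothetical and chosen only for this example; they are not part of NOVA's API or output.

```r
# Illustrative helper; `log2_fold_change` is NOT a NOVA function.
# eps plays the role of epsilon in the formula (0.001, as stated in the text).
log2_fold_change <- function(w_target, w_reference, eps = 0.001) {
  log2((w_target + eps) / (w_reference + eps))
}

# Hypothetical edge weights for three L-R edges in two conditions
w_reference <- c(0.50, 0.00, 2.00)
w_target    <- c(1.00, 0.25, 2.00)

log2fc <- log2_fold_change(w_target, w_reference)
print(log2fc)
```

The second edge has zero weight in the reference condition; thanks to the pseudocount its fold change is large but finite instead of infinite. The third edge is unchanged between conditions and gets a fold change of exactly 0.
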
### Delta Metrics

$$
\Delta_{\text{expr}} = W_{\text{expr,target}} - W_{\text{expr,reference}}
$$

$$
\Delta_{\text{spec}} = W_{\text{spec,target}} - W_{\text{spec,reference}}
$$

## Ligand-Receptor Database

### connectomeDB2020

NOVA utilizes the connectomeDB2020 database, which provides:

| Database  | Description                               | Pairs   |
|-----------|-------------------------------------------|---------|
| **lrc2p** | Literature-curated, high-confidence pairs | 2,293   |
| **lrc2a** | Extended set including predictions        | ~15,000 |

### Database Structure

```{r db_structure}
library(NOVA)

# Load database
lr_db <- GetLRDatabase("lrc2p")
str(lr_db)
head(lr_db)
```

## Multi-Species Support

### Homology Mapping

NOVA supports cross-species analysis through NCBI HomoloGene:

```{r homology}
# Get supported species
species_list <- supported_species()
print(species_list)

# Convert mouse genes to human
# mouse_genes <- c("Cd4", "Cd8a", "Ptprc")
# human_orthologs <- ConvertGeneSymbols(mouse_genes, from = "mouse", to = "human")
```

### Supported Species

NOVA supports 21 species, including:

- **Mammals**: Human, Mouse, Rat, Dog, Cattle, Chimpanzee, Monkey
- **Model organisms**: Zebrafish, Fruitfly, Nematode, Frog, Chicken
- **Microorganisms**: Yeast, Fission yeast

## Computational Efficiency

### Vectorization Strategy

NOVA employs several optimization strategies:

1. **Matrix operations**: Using `data.table` for efficient data manipulation
2. **Sparse matrix support**: Via the `Matrix` package for memory efficiency
3. **C++ acceleration**: Critical loops implemented in RcppArmadillo
4. **Parallel processing**: Optional parallelization via the `future` package

### Complexity Analysis

| Operation        | Time Complexity | Space Complexity |
|------------------|-----------------|------------------|
| Cluster stats    | O(G × N)        | O(G × C)         |
| Edge computation | O(P × C²)       | O(E)             |
| Filtering        | O(E)            | O(E)             |

Where:

- G = number of genes
- N = number of cells
- C = number of clusters
- P = number of LR pairs
- E = number of edges

## References

1. Hou, R., Denisenko, E., Ong, H.T. et al. Predicting cell-to-cell communication networks using NATMI. *Nat Commun* **11**, 5011 (2020).
2. Ramilowski, J.A. et al. A draft network of ligand-receptor-mediated multicellular signalling in human. *Nat Commun* **6**, 7866 (2015).

## Author

**Zaoqu Liu**

- Email: liuzaoqu@163.com
- GitHub: [Zaoqu-Liu](https://github.com/Zaoqu-Liu)