Algorithm & Methodology

Overview

NOVA implements a comprehensive computational framework for inferring cell-to-cell communication networks based on ligand-receptor co-expression patterns. This document details the mathematical foundations and algorithmic implementations underlying NOVA’s analysis pipeline.

Theoretical Background

Cell-Cell Communication

Intercellular communication is a fundamental biological process where cells exchange information through signaling molecules. In the context of transcriptomic analysis, we focus on ligand-receptor (L-R) interactions, where:

  • Ligands: Secreted or membrane-bound signaling molecules produced by sending cells
  • Receptors: Cell surface proteins on receiving cells that bind to specific ligands

The strength of a potential communication event can be estimated from gene expression data by examining the co-expression of ligand-receptor pairs between cell populations.

Mathematical Framework

1. Gene Expression Statistics

For a given gene \(g\) in cluster \(c\), NOVA computes:

Detection Rate (Percentage of Expressing Cells)

\[ \text{pct}_{g,c} = \frac{|\{i : x_{g,i} > 0, i \in c\}|}{|c|} \]

where \(x_{g,i}\) is the expression of gene \(g\) in cell \(i\), and \(|c|\) is the number of cells in cluster \(c\).

Mean Expression

\[ \bar{x}_{g,c} = \frac{1}{|c|} \sum_{i \in c} x_{g,i} \]

2. Expression Specificity

The specificity score quantifies how preferentially a gene is expressed in a given cluster relative to all clusters:

\[ S_{g,c} = \frac{\bar{x}_{g,c}}{\sum_{c' \in C} \bar{x}_{g,c'}} \]

where \(C\) is the set of all clusters. This metric ranges from 0 to 1, with higher values indicating more cluster-specific expression.

Properties: - \(\sum_{c \in C} S_{g,c} = 1\) for each gene - \(S_{g,c} = 0\) if gene \(g\) is not expressed in cluster \(c\) - \(S_{g,c} = 1\) if gene \(g\) is only expressed in cluster \(c\)

3. Edge Weight Calculation

For a ligand-receptor pair \((L, R)\) between sending cluster \(s\) and target cluster \(t\):

Expression-based Weight

\[ W_{\text{expr}}(L,R,s,t) = \bar{x}_{L,s} \times \bar{x}_{R,t} \]

Specificity-based Weight

\[ W_{\text{spec}}(L,R,s,t) = S_{L,s} \times S_{R,t} \]

4. Filtering Criteria

NOVA applies the following filters to identify biologically meaningful interactions:

  1. Detection threshold: Both ligand and receptor must be detected in a minimum percentage of cells: \[ \text{pct}_{L,s} > \theta_{\text{pct}} \quad \text{and} \quad \text{pct}_{R,t} > \theta_{\text{pct}} \]

  2. Expression threshold: The edge weight must exceed a minimum value: \[ W_{\text{expr}}(L,R,s,t) > \theta_{\text{expr}} \]

Algorithmic Implementation

Cluster Statistics Computation

# Pseudocode for cluster statistics
compute_cluster_stats <- function(expr, clusters) {
  for each cluster c:
    cells_in_c <- get_cells(clusters, c)
    for each gene g:
      n_expressing <- sum(expr[g, cells_in_c] > 0)
      pct[g,c] <- n_expressing / length(cells_in_c)
      mean[g,c] <- mean(expr[g, cells_in_c])
  
  # Compute specificity (row normalization)
  for each gene g:
    row_sum <- sum(mean[g, ])
    if row_sum > 0:
      specificity[g, ] <- mean[g, ] / row_sum
  
  return list(pct, mean, specificity)
}

Edge Computation

# Pseudocode for edge computation
compute_edges <- function(lr_pairs, lig_stats, rec_stats, clusters) {
  edges <- list()
  
  for each pair (L, R) in lr_pairs:
    for each sending cluster s:
      for each target cluster t:
        # Check detection threshold
        if lig_stats$pct[L,s] > min_pct AND 
           rec_stats$pct[R,t] > min_pct:
          
          # Compute weights
          weight_expr <- lig_stats$mean[L,s] * rec_stats$mean[R,t]
          weight_spec <- lig_stats$spec[L,s] * rec_stats$spec[R,t]
          
          if weight_expr > 0:
            edges.add(L, R, s, t, weight_expr, weight_spec)
  
  return edges
}

Differential Analysis

Fold Change Calculation

For comparing communication between two conditions (reference vs. target):

\[ \text{log}_2\text{FC} = \log_2\left(\frac{W_{\text{target}} + \epsilon}{W_{\text{reference}} + \epsilon}\right) \]

where \(\epsilon\) is a small constant (default: 0.001) to avoid division by zero.

Delta Metrics

\[ \Delta_{\text{expr}} = W_{\text{expr,target}} - W_{\text{expr,reference}} \]

\[ \Delta_{\text{spec}} = W_{\text{spec,target}} - W_{\text{spec,reference}} \]

Ligand-Receptor Database

connectomeDB2020

NOVA utilizes the connectomeDB2020 database, which provides:

Database Description Pairs
lrc2p Literature-curated, high-confidence pairs 2,293
lrc2a Extended set including predictions ~15,000

Database Structure

library(NOVA)

# Load database
lr_db <- GetLRDatabase("lrc2p")
str(lr_db)
#> Classes 'data.table' and 'data.frame':   2293 obs. of  2 variables:
#>  $ ligand  : chr  "A2M" "AANAT" "AANAT" "ACE" ...
#>  $ receptor: chr  "LRP1" "MTNR1A" "MTNR1B" "BDKRB2" ...
#>  - attr(*, ".internal.selfref")=<pointer: 0x55d386e54b60>
head(lr_db)
#>    ligand receptor
#>    <char>   <char>
#> 1:    A2M     LRP1
#> 2:  AANAT   MTNR1A
#> 3:  AANAT   MTNR1B
#> 4:    ACE   BDKRB2
#> 5: ADAM10    EPHA3
#> 6: ADAM11    ITGA4

Multi-Species Support

Homology Mapping

NOVA supports cross-species analysis through NCBI HomoloGene:

# Get supported species
species_list <- supported_species()
print(species_list)
#>                human                mouse           chimpanzee 
#>               "9606"              "10090"               "9598" 
#>                  dog               monkey               cattle 
#>               "9615"               "9544"               "9913" 
#>                  rat              chicken                 frog 
#>              "10116"               "9031"               "8364" 
#>            zebrafish             fruitfly             mosquito 
#>               "7955"               "7227"               "7165" 
#>             nematode           thalecress                 rice 
#>               "6239"               "3702"               "4530" 
#>      riceblastfungus           bakeryeast     neurosporacrassa 
#>             "318829"               "4932"               "5141" 
#>         fissionyeast eremotheciumgossypii  kluyveromyceslactis 
#>               "4896"              "33169"              "28985"

# Convert mouse genes to human
# mouse_genes <- c("Cd4", "Cd8a", "Ptprc")
# human_orthologs <- ConvertGeneSymbols(mouse_genes, from = "mouse", to = "human")

Supported Species

NOVA supports 21 species including:

  • Mammals: Human, Mouse, Rat, Dog, Cattle, Chimpanzee, Monkey
  • Model organisms: Zebrafish, Fruitfly, Nematode, Frog, Chicken
  • Microorganisms: Yeast, Fission yeast

Computational Efficiency

Vectorization Strategy

NOVA employs several optimization strategies:

  1. Matrix operations: Using data.table for efficient data manipulation
  2. Sparse matrix support: Via the Matrix package for memory efficiency
  3. C++ acceleration: Critical loops implemented in RcppArmadillo
  4. Parallel processing: Optional parallelization via future package

Complexity Analysis

Operation Time Complexity Space Complexity
Cluster stats O(G × N) O(G × C)
Edge computation O(P × C²) O(E)
Filtering O(E) O(E)

Where: - G = number of genes - N = number of cells - C = number of clusters - P = number of LR pairs - E = number of edges

References

  1. Hou, R., Denisenko, E., Ong, H.T. et al. Predicting cell-to-cell communication networks using NATMI. Nat Commun 11, 5011 (2020).

  2. Ramilowski, J.A., et al. A draft network of ligand-receptor-mediated multicellular signalling in human. Nat Commun 6, 7866 (2015).

Author

Zaoqu Liu