---
title: "Algorithm & Methodology"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_caption: true
vignette: >
  %\VignetteIndexEntry{Algorithm & Methodology}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  fig.align = "center",
  message = FALSE,
  warning = FALSE
)
```

## Overview

NOVA infers cell-to-cell communication networks from ligand-receptor co-expression patterns. This document describes the mathematical foundations and the algorithms underlying NOVA's analysis pipeline.

## Theoretical Background

### Cell-Cell Communication

Intercellular communication is a fundamental biological process in which cells exchange information through signaling molecules. In transcriptomic analysis, we focus on **ligand-receptor (L-R) interactions**, where:

- **Ligands**: Secreted or membrane-bound signaling molecules produced by sending cells
- **Receptors**: Cell surface proteins on receiving cells that bind to specific ligands

The strength of a potential communication event can be estimated from gene expression data by examining the co-expression of ligand-receptor pairs between cell populations.

## Mathematical Framework

### 1. Gene Expression Statistics

For a given gene $g$ in cluster $c$, NOVA computes:

#### Detection Rate (Percentage of Expressing Cells)

$$
\text{pct}_{g,c} = \frac{|\{i : x_{g,i} > 0, i \in c\}|}{|c|}
$$

where $x_{g,i}$ is the expression of gene $g$ in cell $i$, and $|c|$ is the number of cells in cluster $c$.

#### Mean Expression

$$
\bar{x}_{g,c} = \frac{1}{|c|} \sum_{i \in c} x_{g,i}
$$
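
As a minimal sketch of both statistics, assume a small dense genes × cells matrix and a vector of cluster labels. The toy values and variable names are illustrative only and are not taken from NOVA's internals.

```r
# Toy expression matrix: 2 genes x 4 cells (illustrative values)
expr <- matrix(c(0, 2, 4, 0,
                 1, 1, 0, 0),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4)))
clusters <- c("c1", "c1", "c2", "c2")

# Cells belonging to cluster c1
in_c1 <- clusters == "c1"

# Detection rate: fraction of c1 cells with expression > 0
pct_c1 <- rowMeans(expr[, in_c1, drop = FALSE] > 0)

# Mean expression across all c1 cells (zeros included)
mean_c1 <- rowMeans(expr[, in_c1, drop = FALSE])
```

Here `geneA` is detected in one of the two `c1` cells (detection rate 0.5) while `geneB` is detected in both (detection rate 1), yet both genes have the same mean expression of 1. This is why the two statistics are reported separately.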

### 2. Expression Specificity

The specificity score quantifies how preferentially a gene is expressed in a given cluster relative to all clusters:

$$
S_{g,c} = \frac{\bar{x}_{g,c}}{\sum_{c' \in C} \bar{x}_{g,c'}}
$$

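where $C$ is the set of all clusters. For non-negative expression values, this metric ranges from 0 to 1, with higher values indicating more cluster-specific expression.

**Properties:**

- $\sum_{c \in C} S_{g,c} = 1$ for each gene that is expressed in at least one cluster
- $S_{g,c} = 0$ if gene $g$ is not expressed in cluster $c$
- $S_{g,c} = 1$ if gene $g$ is only expressed in cluster $c$
- If gene $g$ is not expressed in any cluster, the denominator is zero, the ratio is undefined, and the normalization is skipped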
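
The row normalization can be sketched on a toy matrix of per-cluster means. The values are illustrative, and leaving all-zero rows at 0 is a choice made for this example rather than documented NOVA behavior.

```r
# Toy mean-expression matrix: 3 genes x 3 clusters (illustrative values)
mean_expr <- matrix(c(2, 2, 0,
                      0, 0, 5,
                      0, 0, 0),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("geneA", "geneB", "geneC"),
                                    c("c1", "c2", "c3")))

# Divide each row by its total; all-zero rows are left at 0 to avoid 0/0
row_sum <- rowSums(mean_expr)
spec <- mean_expr / ifelse(row_sum > 0, row_sum, 1)

# Row totals: 1 for expressed genes, 0 for the unexpressed geneC
spec_totals <- rowSums(spec)
```

`geneA` is split evenly between `c1` and `c2` (specificity 0.5 each), whereas `geneB` is expressed only in `c3` and therefore reaches the maximum specificity of 1.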

### 3. Edge Weight Calculation

For a ligand-receptor pair $(L, R)$ between sending cluster $s$ and target cluster $t$:

#### Expression-based Weight

$$
W_{\text{expr}}(L,R,s,t) = \bar{x}_{L,s} \times \bar{x}_{R,t}
$$

#### Specificity-based Weight

$$
W_{\text{spec}}(L,R,s,t) = S_{L,s} \times S_{R,t}
$$
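
For a single ligand-receptor pair, both weights can be computed for every sender/target combination at once with an outer product. The sketch below uses hypothetical per-cluster means; the variable names are not part of NOVA.

```r
# Hypothetical per-cluster mean expression for one ligand and one receptor
lig_mean <- c(c1 = 2.0, c2 = 0.5)
rec_mean <- c(c1 = 0.0, c2 = 3.0)

# Specificity of each gene across the two clusters
lig_spec <- lig_mean / sum(lig_mean)   # 0.8, 0.2
rec_spec <- rec_mean / sum(rec_mean)   # 0, 1

# Rows = sending cluster s, columns = target cluster t
w_expr <- outer(lig_mean, rec_mean)
w_spec <- outer(lig_spec, rec_spec)
```

The edge from `c1` to `c2` has an expression weight of 6 and a specificity weight of 0.8. Every edge targeting `c1` has weight 0 because the receptor is not expressed there.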

### 4. Filtering Criteria

NOVA applies the following filters to identify biologically meaningful interactions:

1. **Detection threshold**: Both ligand and receptor must be detected in a minimum percentage of cells:
   $$
   \text{pct}_{L,s} > \theta_{\text{pct}} \quad \text{and} \quad \text{pct}_{R,t} > \theta_{\text{pct}}
   $$

2. **Expression threshold**: The edge weight must exceed a minimum value:
   $$
   W_{\text{expr}}(L,R,s,t) > \theta_{\text{expr}}
   $$
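
The two filters can be sketched as a single logical mask over a table of candidate edges. The column names and the threshold values below are assumptions made for this example, not NOVA defaults.

```r
# Hypothetical candidate edges with precomputed statistics
edges <- data.frame(
  ligand      = c("L1", "L1", "L2"),
  receptor    = c("R1", "R1", "R2"),
  source      = c("c1", "c2", "c1"),
  target      = c("c2", "c2", "c1"),
  pct_lig     = c(0.60, 0.05, 0.40),
  pct_rec     = c(0.50, 0.50, 0.30),
  weight_expr = c(6.0, 1.5, 0.02)
)

# Assumed thresholds, chosen only for this example
theta_pct  <- 0.10
theta_expr <- 0.05

# An edge is kept only if it passes both the detection and expression filters
keep <- edges$pct_lig > theta_pct &
  edges$pct_rec > theta_pct &
  edges$weight_expr > theta_expr
filtered <- edges[keep, ]
```

Only the first edge survives: the second fails the detection threshold on the ligand side, and the third fails the expression threshold.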

## Algorithmic Implementation

### Cluster Statistics Computation

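```r
# Illustrative R sketch of the cluster statistics (not NOVA's internal code).
# expr: dense genes x cells numeric matrix; clusters: one label per cell.
compute_cluster_stats <- function(expr, clusters) {
  clusters <- droplevels(as.factor(clusters))
  stopifnot(length(clusters) == ncol(expr))

  # Cells x clusters indicator matrix and the number of cells per cluster
  ind <- outer(as.character(clusters), levels(clusters), "==") * 1
  colnames(ind) <- levels(clusters)
  n_cells <- colSums(ind)

  # Detection rate and mean expression: per-cluster sums divided by |c|
  pct <- sweep(((expr > 0) * 1) %*% ind, 2, n_cells, "/")
  mean_expr <- sweep(expr %*% ind, 2, n_cells, "/")

  # Specificity (row normalization); rows summing to zero are kept at 0 here
  row_sum <- rowSums(mean_expr)
  spec <- mean_expr / ifelse(row_sum > 0, row_sum, 1)

  list(pct = pct, mean = mean_expr, spec = spec)
}
```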

### Edge Computation

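```r
# Illustrative R sketch of the edge computation (not NOVA's internal code).
# lr_pairs: data frame with columns `ligand` and `receptor` (assumed names).
# lig_stats, rec_stats: lists of gene x cluster matrices `pct`, `mean`, `spec`
# with row and column names; every gene in lr_pairs must be present.
# min_expr = 0 reproduces the plain `weight_expr > 0` check.
compute_edges <- function(lr_pairs, lig_stats, rec_stats, clusters,
                          min_pct, min_expr = 0) {
  # All (pair, sending cluster, target cluster) combinations
  grid <- expand.grid(source = as.character(clusters),
                      target = as.character(clusters),
                      stringsAsFactors = FALSE)
  edges <- merge(lr_pairs[, c("ligand", "receptor")], grid, by = NULL)

  # Two-column (gene, cluster) name indices into the statistics matrices
  lig <- cbind(as.character(edges$ligand), edges$source)
  rec <- cbind(as.character(edges$receptor), edges$target)

  # Check detection threshold for ligand and receptor
  keep <- lig_stats$pct[lig] > min_pct & rec_stats$pct[rec] > min_pct

  # Compute weights
  edges$weight_expr <- lig_stats$mean[lig] * rec_stats$mean[rec]
  edges$weight_spec <- lig_stats$spec[lig] * rec_stats$spec[rec]

  # Keep edges passing both the detection and the expression threshold
  edges[keep & edges$weight_expr > min_expr, ]
}
```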

## Differential Analysis

### Fold Change Calculation

For comparing communication between two conditions (reference vs. target):

$$
\text{log}_2\text{FC} = \log_2\left(\frac{W_{\text{target}} + \epsilon}{W_{\text{reference}} + \epsilon}\right)
$$

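where $W$ is the weight of the same ligand-receptor edge in each condition and $\epsilon$ is a small constant (default: 0.001) that avoids division by zero and keeps the logarithm defined when a weight is zero.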
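
The effect of the pseudocount is easiest to see numerically. The helper below is a hypothetical stand-alone function written for this example, not a NOVA export; only the default of 0.001 comes from the text above.

```r
# Hypothetical helper implementing the fold-change formula
log2_fc <- function(w_target, w_reference, eps = 0.001) {
  log2((w_target + eps) / (w_reference + eps))
}

fc_large   <- log2_fc(4, 1)    # close to 2: eps barely matters for large weights
fc_gained  <- log2_fc(0.5, 0)  # finite, but its size is set mostly by eps
fc_neither <- log2_fc(0, 0)    # exactly 0: edge absent in both conditions
```

Fold changes for edges with a zero weight in one condition depend strongly on $\epsilon$, so they are best interpreted together with the delta metrics rather than on their own.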

### Delta Metrics

$$
\Delta_{\text{expr}} = W_{\text{expr,target}} - W_{\text{expr,reference}}
$$

$$
\Delta_{\text{spec}} = W_{\text{spec,target}} - W_{\text{spec,reference}}
$$
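
Computing the deltas requires matching edges between conditions. The sketch below joins two hypothetical edge tables on their identifying columns and treats an edge that is missing from one condition as having weight 0 there; both the column names and that convention are assumptions of this example.

```r
# Hypothetical edge tables for the reference and target conditions
ref <- data.frame(ligand = c("L1", "L2"), receptor = c("R1", "R2"),
                  source = c("c1", "c1"), target = c("c2", "c1"),
                  weight_expr = c(2.0, 1.0), weight_spec = c(0.50, 0.20))
tgt <- data.frame(ligand = c("L1", "L3"), receptor = c("R1", "R3"),
                  source = c("c1", "c2"), target = c("c2", "c1"),
                  weight_expr = c(3.0, 0.5), weight_spec = c(0.40, 0.10))

# Full outer join on the columns that identify an edge
keys <- c("ligand", "receptor", "source", "target")
both <- merge(ref, tgt, by = keys, all = TRUE, suffixes = c("_ref", "_tgt"))

# Assumption: an edge absent from one condition has weight 0 there
for (col in setdiff(names(both), keys)) {
  both[[col]][is.na(both[[col]])] <- 0
}

both$delta_expr <- both$weight_expr_tgt - both$weight_expr_ref
both$delta_spec <- both$weight_spec_tgt - both$weight_spec_ref
```

The shared `L1`-`R1` edge gains expression weight (+1) while losing specificity (-0.1), which illustrates why the two deltas are reported side by side.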

## Ligand-Receptor Database

### connectomeDB2020

NOVA uses the connectomeDB2020 database, introduced with NATMI (Hou et al., 2020; reference 1), which provides:

| Database | Description | Pairs |
|----------|-------------|-------|
| **lrc2p** | Literature-curated, high-confidence pairs | 2,293 |
| **lrc2a** | Extended set including predictions | ~15,000 |

### Database Structure

```{r db_structure}
library(NOVA)

# Load database
lr_db <- GetLRDatabase("lrc2p")
str(lr_db)
head(lr_db)
```

## Multi-Species Support

### Homology Mapping

NOVA supports cross-species analysis through NCBI HomoloGene:

```{r homology}
# Get supported species
species_list <- supported_species()
print(species_list)

# Convert mouse genes to human
# mouse_genes <- c("Cd4", "Cd8a", "Ptprc")
# human_orthologs <- ConvertGeneSymbols(mouse_genes, from = "mouse", to = "human")
```
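
Conceptually, homology mapping is a table lookup from one species' gene symbols to another's. The sketch below uses a tiny hand-written mouse-to-human table as a stand-in for a HomoloGene-derived mapping; `ortholog_map` and `map_to_human()` are hypothetical names and do not describe how NOVA performs the conversion.

```r
# Hypothetical lookup table (stand-in for a HomoloGene-derived mapping)
ortholog_map <- data.frame(
  mouse = c("Cd4", "Cd8a", "Ptprc"),
  human = c("CD4", "CD8A", "PTPRC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate mouse symbols to human symbols
map_to_human <- function(genes, map = ortholog_map) {
  # match() yields NA for genes without a listed ortholog
  map$human[match(genes, map$mouse)]
}

converted <- map_to_human(c("Cd4", "Ptprc", "NotAGene"))
```

Genes without an ortholog come back as `NA`, so ligand-receptor pairs involving them need to be dropped or handled explicitly before edges are computed.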

### Supported Species

NOVA supports 21 species including:

- **Mammals**: Human, Mouse, Rat, Dog, Cattle, Chimpanzee, Monkey
- **Model organisms**: Zebrafish, Fruitfly, Nematode, Frog, Chicken
- **Microorganisms**: Yeast, Fission yeast

## Computational Efficiency

### Vectorization Strategy

NOVA employs several optimization strategies:

1. **Vectorized table operations**: `data.table` for efficient data manipulation
2. **Sparse matrix support**: the `Matrix` package for memory efficiency
3. **C++ acceleration**: critical loops implemented with RcppArmadillo
4. **Parallel processing**: optional parallelization via the `future` package
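
To illustrate how sparse matrices and vectorization combine, the sketch below computes per-cluster means and detection rates with a single sparse matrix product per statistic. It uses only the `Matrix` package on toy data and is one possible approach, not a description of NOVA's internal code.

```r
library(Matrix)

# Toy sparse expression matrix: 3 genes x 4 cells, only non-zeros stored
expr <- sparseMatrix(i = c(1, 1, 2, 3), j = c(1, 2, 3, 4),
                     x = c(2, 4, 1, 5), dims = c(3, 4),
                     dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))
clusters <- factor(c("A", "A", "B", "B"))

# Sparse cells x clusters indicator matrix
ind <- sparseMatrix(i = seq_along(clusters), j = as.integer(clusters), x = 1,
                    dims = c(length(clusters), nlevels(clusters)),
                    dimnames = list(NULL, levels(clusters)))
n_cells <- as.vector(table(clusters))

# Per-cluster sums via a sparse product, then division by cluster size
mean_expr <- sweep(as.matrix(expr %*% ind), 2, n_cells, "/")
pct <- sweep(as.matrix(((expr > 0) * 1) %*% ind), 2, n_cells, "/")
```

Only the small genes × clusters results are converted to dense matrices; the genes × cells input stays sparse throughout, which is where the memory savings come from.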

### Complexity Analysis

| Operation | Time Complexity | Space Complexity |
|-----------|-----------------|------------------|
| Cluster stats | O(G × N) | O(G × C) |
| Edge computation | O(P × C²) | O(E) |
| Filtering | O(E) | O(E) |

Where:

- G = number of genes
- N = number of cells
- C = number of clusters
- P = number of LR pairs
- E = number of edges
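
As a rough sense of scale for the edge computation, the number of candidate edges before filtering is $P \times C^2$. The pair count below comes from the lrc2p row of the database table; the cluster count is an assumed example value.

```r
n_pairs    <- 2293  # lrc2p pairs, from the database table
n_clusters <- 10    # assumed number of clusters

# Each pair is evaluated for every ordered (sender, target) combination,
# including sender == target
n_candidates <- n_pairs * n_clusters^2
n_candidates  # 229300
```

Because the count grows quadratically in the number of clusters, doubling the clusters to 20 quadruples the candidates, while the number of edges E that survive filtering is typically much smaller.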

## References

1. Hou, R., Denisenko, E., Ong, H.T. et al. Predicting cell-to-cell communication networks using NATMI. *Nat Commun* **11**, 5011 (2020).

2. Ramilowski, J.A., et al. A draft network of ligand-receptor-mediated multicellular signalling in human. *Nat Commun* **6**, 7866 (2015).

## Author

**Zaoqu Liu**

- Email: liuzaoqu@163.com
- GitHub: [Zaoqu-Liu](https://github.com/Zaoqu-Liu)