---
title: "Algorithm Principles in MOFSR"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Algorithm Principles in MOFSR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  warning = FALSE,
  message = FALSE,
  eval = FALSE
)
```

## Overview

MOFSR implements 15 state-of-the-art multi-omics clustering algorithms. This vignette provides theoretical background and mathematical foundations for each method.

### Algorithm Classification

| Paradigm | Algorithms | Key Idea |
|:---------|:-----------|:---------|
| **Network Fusion** | SNF, wSNF, CIMLR, NEMO | Construct similarity networks and fuse at network level |
| **Matrix Factorization** | IntNMF, LRAcluster | Joint non-negative or low-rank decomposition |
| **Factor Analysis** | MOFA, MCIA | Latent factor extraction across views |
| **Canonical Correlation** | RGCCA, SGCCA, CPCA | Maximize correlations between views |
| **Bayesian** | iClusterBayes, BCC | Probabilistic joint modeling |
| **Ensemble** | PINSPlus, Late Fusion | Combine view-specific clusterings |

---

## Network-Based Methods

### Similarity Network Fusion (SNF)

**Reference**: Wang et al., *Nature Methods*, 2014

SNF constructs patient similarity networks for each omics type and iteratively fuses them into a unified network.

**Mathematical Framework**:

1. **Affinity Matrix Construction**: For data matrix $X^{(v)}$ of view $v$:

$$W^{(v)}_{ij} = \frac{\exp(-d_{ij}^2/\mu\epsilon_{ij})}{\sum_{k \in N_i} \exp(-d_{ik}^2/\mu\epsilon_{ik})}$$

where $d_{ij}$ is the Euclidean distance between samples $i$ and $j$, $N_i$ is the set of $K$ nearest neighbors, and $\epsilon_{ij} = \frac{\text{mean}(d(i,N_i)) + \text{mean}(d(j,N_j)) + d_{ij}}{3}$.

2. **Network Normalization**: Construct transition probability matrix:

$$P^{(v)} = D^{-1}W^{(v)}$$

3. **Iterative Fusion**: Update networks until convergence:

$$P^{(v)}_{t+1} = S^{(v)} \times \frac{\sum_{u \neq v} P^{(u)}_t}{|V|-1} \times (S^{(v)})^T$$

where $S^{(v)}$ is the normalized kernel matrix.

```{r snf-demo}
library(MOFSR)
set.seed(42)

# Generate sample data
n <- 50
data_list <- list(
  omics1 = matrix(rnorm(n * 100), 100, n),
  omics2 = matrix(rnorm(n * 80), 80, n)
)
colnames(data_list$omics1) <- colnames(data_list$omics2) <- paste0("S", 1:n)

# Step 1: Compute affinity matrices
W1 <- snf_affinity_matrix(data_list$omics1, K = 10)
W2 <- snf_affinity_matrix(data_list$omics2, K = 10)

# Step 2: Fuse networks
W_fused <- snf_fuse(list(W1, W2), K = 10, t = 20)

# Step 3: Spectral clustering
clusters <- spectral_clustering(W_fused, n_clusters = 3)
table(clusters)
```

### CIMLR (Cancer Integration via Multi-kernel Learning)

**Reference**: Ramazzotti et al., *Nature Communications*, 2018

CIMLR extends SNF with multiple kernel learning to automatically learn optimal kernel combinations.

**Key Innovation**: Instead of a single Gaussian kernel, CIMLR uses:

$$K = \sum_{l=1}^{L} \beta_l K_l$$

where $K_l$ are Gaussian kernels with different bandwidths and $\beta_l$ are learned weights.

```{r cimlr-demo}
# Run CIMLR
result_cimlr <- run_cimlr(data_list, n_clusters = 3, n_cores = 1)
head(result_cimlr)
```

---

## Matrix Factorization Methods

### Integrative Non-negative Matrix Factorization (IntNMF)

**Reference**: Chalise & Fridley, *PLoS ONE*, 2017

IntNMF simultaneously factorizes multiple non-negative data matrices while sharing a common factor matrix.

**Objective Function**:

$$\min_{W,H^{(v)}} \sum_{v=1}^{V} \lambda_v \|X^{(v)} - WH^{(v)}\|_F^2$$

subject to $W \geq 0$, $H^{(v)} \geq 0$

where:
- $X^{(v)} \in \mathbb{R}^{n \times p_v}$: Data matrix for view $v$
- $W \in \mathbb{R}^{n \times K}$: Shared sample factor matrix
- $H^{(v)} \in \mathbb{R}^{K \times p_v}$: View-specific feature factor matrix

**Update Rules** (multiplicative):

$$W_{ik} \leftarrow W_{ik} \frac{\sum_v \lambda_v (X^{(v)}H^{(v)T})_{ik}}{\sum_v \lambda_v (WH^{(v)}H^{(v)T})_{ik}}$$

$$H^{(v)}_{kj} \leftarrow H^{(v)}_{kj} \frac{(W^TX^{(v)})_{kj}}{(W^TWH^{(v)})_{kj}}$$

```{r intnmf-demo}
# IntNMF requires non-negative data
pos_data <- lapply(data_list, function(x) abs(x))

# Run IntNMF
result_intnmf <- intnmf_cluster(pos_data, n_clusters = 3, max_iter = 100)
table(result_intnmf$cluster)
```

### LRAcluster (Low-Rank Approximation)

**Reference**: Wu et al., *Cancer Informatics*, 2015

LRAcluster uses low-rank approximation with different noise models for different data types.

**Framework**: For count data (e.g., RNA-seq), assume Poisson distribution:

$$X_{ij} \sim \text{Poisson}(\mu_{ij})$$
$$\log(\mu) = UV^T$$

where $U$ and $V$ are low-rank factors.

---

## Factor Analysis Methods

### Multi-Omics Factor Analysis (MOFA)

MOFSR implements a simplified variational inference approach inspired by MOFA.

**Generative Model**:

$$X^{(v)} = W^{(v)}Z + \epsilon^{(v)}$$

where:
- $Z \in \mathbb{R}^{K \times n}$: Latent factors
- $W^{(v)} \in \mathbb{R}^{p_v \times K}$: View-specific loadings
- $\epsilon^{(v)}$: Gaussian noise

**Prior Structure** (ARD - Automatic Relevance Determination):

$$W^{(v)}_{jk} \sim \mathcal{N}(0, 1/\alpha^{(v)}_k)$$
$$\alpha^{(v)}_k \sim \text{Gamma}(a_0, b_0)$$

This allows automatic selection of relevant factors per view.

```{r mofa-demo}
# Run factor analysis
result_mofa <- multi_view_factor_analysis(data_list, n_factors = 5, max_iter = 100)

# Examine variance explained
cat("Variance explained per factor:\n")
print(round(result_mofa$variance_explained, 3))
```

### Multiple Co-Inertia Analysis (MCIA)

**Reference**: Meng et al., *BMC Bioinformatics*, 2014

MCIA extends co-inertia analysis to multiple tables, finding common structures across omics.

**Objective**: Maximize the sum of squared covariances between the global scores and individual table scores:

$$\max_{\mathbf{f}} \sum_{v=1}^{V} \text{Cov}^2(\mathbf{f}, X^{(v)}\mathbf{q}^{(v)})$$

---

## Canonical Correlation Methods

### RGCCA (Regularized Generalized CCA)

**Reference**: Tenenhaus & Tenenhaus, *Psychometrika*, 2011

RGCCA finds latent components that maximize the sum of correlations between connected blocks.

**Optimization Problem**:

$$\max_{a^{(1)}, \ldots, a^{(V)}} \sum_{j < k} c_{jk} g(\text{Cov}(X^{(j)}a^{(j)}, X^{(k)}a^{(k)}))$$

subject to $(1-\tau_j)\text{Var}(X^{(j)}a^{(j)}) + \tau_j\|a^{(j)}\|^2 = 1$

where $c_{jk}$ defines the design matrix (which blocks are connected) and $\tau_j$ is the regularization parameter.

```{r rgcca-demo}
# Run RGCCA
result_rgcca <- run_rgcca(data_list, n_components = 3, n_clusters = 3)
head(result_rgcca)
```

### SGCCA (Sparse GCCA)

SGCCA adds L1 penalty for feature selection:

$$\|a^{(j)}\|_1 \leq s_j$$

This induces sparsity in the loadings, selecting relevant features.

---

## Bayesian Methods

### iClusterBayes

**Reference**: Mo et al., *Biostatistics*, 2018

iClusterBayes uses a Bayesian framework with Gibbs sampling.

**Model**:

$$X^{(v)} = W^{(v)}Z + \epsilon^{(v)}$$

with priors:
- $W^{(v)}_{jk} \sim \mathcal{N}(0, \sigma^2_w)$
- $Z_{ki} \sim \mathcal{N}(0, 1)$
- $\sigma^{-2}_v \sim \text{Gamma}(a, b)$

### Bayesian Consensus Clustering (BCC)

**Reference**: Lock & Dunson, *Bioinformatics*, 2013

BCC models the relationship between local (view-specific) and global clusters using a Dirichlet-Multinomial framework.

**Generative Process**:

1. Global cluster: $c_i \sim \text{Multinomial}(\pi)$
2. Local cluster: $c^{(v)}_i | c_i \sim \text{Multinomial}(\theta_{c_i}^{(v)})$

---

## Ensemble Methods

### PINSPlus (Perturbation Clustering)

**Reference**: Nguyen et al., *Genome Research*, 2017

PINSPlus uses perturbation-based subsampling to identify stable clusters.

**Algorithm**:

1. Subsample data with noise perturbation
2. Cluster each subsample
3. Build connectivity matrix
4. Final clustering from connectivity matrix

```{r pinsplus-demo}
# Run PINSPlus
result_pins <- run_pinsplus(data_list, n_clusters = 3, n_reps = 20)
head(result_pins)
```

---

## Algorithm Selection Guide

| Scenario | Recommended Algorithms |
|:---------|:-----------------------|
| High-dimensional, sparse data | CIMLR, SGCCA |
| Missing data | MOFA, iClusterBayes |
| Large sample size | SNF, IntNMF |
| Need feature selection | CIMLR, SGCCA |
| Interpretable factors | MOFA, MCIA |
| Robust to noise | PINSPlus, BCC |

## References

1. Wang B, et al. (2014). Similarity network fusion for aggregating data types on a genomic scale. *Nature Methods*, 11(3):333-337.
2. Ramazzotti D, et al. (2018). Multi-omic tumor data reveal diversity of molecular mechanisms underlying survival. *Nature Communications*, 9(1):4453.
3. Mo Q, et al. (2018). A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. *Biostatistics*, 19(1):71-86.
4. Chalise P, Fridley BL (2017). Integrative clustering of multi-level 'omic data based on non-negative matrix factorization algorithm. *PLoS ONE*, 12(5):e0176278.
5. Tenenhaus A, Tenenhaus M (2011). Regularized generalized canonical correlation analysis. *Psychometrika*, 76(2):257-284.

## Session Info

```{r session}
sessionInfo()
```