--- title: "Algorithm Principles in MOFSR" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Algorithm Principles in MOFSR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, warning = FALSE, message = FALSE, eval = FALSE ) ``` ## Overview MOFSR implements 15 state-of-the-art multi-omics clustering algorithms. This vignette provides theoretical background and mathematical foundations for each method. ### Algorithm Classification | Paradigm | Algorithms | Key Idea | |:---------|:-----------|:---------| | **Network Fusion** | SNF, wSNF, CIMLR, NEMO | Construct similarity networks and fuse at network level | | **Matrix Factorization** | IntNMF, LRAcluster | Joint non-negative or low-rank decomposition | | **Factor Analysis** | MOFA, MCIA | Latent factor extraction across views | | **Canonical Correlation** | RGCCA, SGCCA, CPCA | Maximize correlations between views | | **Bayesian** | iClusterBayes, BCC | Probabilistic joint modeling | | **Ensemble** | PINSPlus, Late Fusion | Combine view-specific clusterings | --- ## Network-Based Methods ### Similarity Network Fusion (SNF) **Reference**: Wang et al., *Nature Methods*, 2014 SNF constructs patient similarity networks for each omics type and iteratively fuses them into a unified network. **Mathematical Framework**: 1. **Affinity Matrix Construction**: For data matrix $X^{(v)}$ of view $v$: $$W^{(v)}_{ij} = \frac{\exp(-d_{ij}^2/\mu\epsilon_{ij})}{\sum_{k \in N_i} \exp(-d_{ik}^2/\mu\epsilon_{ik})}$$ where $d_{ij}$ is the Euclidean distance between samples $i$ and $j$, $N_i$ is the set of $K$ nearest neighbors, and $\epsilon_{ij} = \frac{\text{mean}(d(i,N_i)) + \text{mean}(d(j,N_j)) + d_{ij}}{3}$. 2. **Network Normalization**: Construct transition probability matrix: $$P^{(v)} = D^{-1}W^{(v)}$$ 3. **Iterative Fusion**: Update networks until convergence: $$P^{(v)}_{t+1} = S^{(v)} \times \frac{\sum_{u \neq v} P^{(u)}_t}{|V|-1} \times (S^{(v)})^T$$ where $S^{(v)}$ is the normalized kernel matrix. ```{r snf-demo} library(MOFSR) set.seed(42) # Generate sample data n <- 50 data_list <- list( omics1 = matrix(rnorm(n * 100), 100, n), omics2 = matrix(rnorm(n * 80), 80, n) ) colnames(data_list$omics1) <- colnames(data_list$omics2) <- paste0("S", 1:n) # Step 1: Compute affinity matrices W1 <- snf_affinity_matrix(data_list$omics1, K = 10) W2 <- snf_affinity_matrix(data_list$omics2, K = 10) # Step 2: Fuse networks W_fused <- snf_fuse(list(W1, W2), K = 10, t = 20) # Step 3: Spectral clustering clusters <- spectral_clustering(W_fused, n_clusters = 3) table(clusters) ``` ### CIMLR (Cancer Integration via Multi-kernel Learning) **Reference**: Ramazzotti et al., *Nature Communications*, 2018 CIMLR extends SNF with multiple kernel learning to automatically learn optimal kernel combinations. **Key Innovation**: Instead of a single Gaussian kernel, CIMLR uses: $$K = \sum_{l=1}^{L} \beta_l K_l$$ where $K_l$ are Gaussian kernels with different bandwidths and $\beta_l$ are learned weights. ```{r cimlr-demo} # Run CIMLR result_cimlr <- run_cimlr(data_list, n_clusters = 3, n_cores = 1) head(result_cimlr) ``` --- ## Matrix Factorization Methods ### Integrative Non-negative Matrix Factorization (IntNMF) **Reference**: Chalise & Fridley, *PLoS ONE*, 2017 IntNMF simultaneously factorizes multiple non-negative data matrices while sharing a common factor matrix. **Objective Function**: $$\min_{W,H^{(v)}} \sum_{v=1}^{V} \lambda_v \|X^{(v)} - WH^{(v)}\|_F^2$$ subject to $W \geq 0$, $H^{(v)} \geq 0$ where: - $X^{(v)} \in \mathbb{R}^{n \times p_v}$: Data matrix for view $v$ - $W \in \mathbb{R}^{n \times K}$: Shared sample factor matrix - $H^{(v)} \in \mathbb{R}^{K \times p_v}$: View-specific feature factor matrix **Update Rules** (multiplicative): $$W_{ik} \leftarrow W_{ik} \frac{\sum_v \lambda_v (X^{(v)}H^{(v)T})_{ik}}{\sum_v \lambda_v (WH^{(v)}H^{(v)T})_{ik}}$$ $$H^{(v)}_{kj} \leftarrow H^{(v)}_{kj} \frac{(W^TX^{(v)})_{kj}}{(W^TWH^{(v)})_{kj}}$$ ```{r intnmf-demo} # IntNMF requires non-negative data pos_data <- lapply(data_list, function(x) abs(x)) # Run IntNMF result_intnmf <- intnmf_cluster(pos_data, n_clusters = 3, max_iter = 100) table(result_intnmf$cluster) ``` ### LRAcluster (Low-Rank Approximation) **Reference**: Wu et al., *Cancer Informatics*, 2015 LRAcluster uses low-rank approximation with different noise models for different data types. **Framework**: For count data (e.g., RNA-seq), assume Poisson distribution: $$X_{ij} \sim \text{Poisson}(\mu_{ij})$$ $$\log(\mu) = UV^T$$ where $U$ and $V$ are low-rank factors. --- ## Factor Analysis Methods ### Multi-Omics Factor Analysis (MOFA) MOFSR implements a simplified variational inference approach inspired by MOFA. **Generative Model**: $$X^{(v)} = W^{(v)}Z + \epsilon^{(v)}$$ where: - $Z \in \mathbb{R}^{K \times n}$: Latent factors - $W^{(v)} \in \mathbb{R}^{p_v \times K}$: View-specific loadings - $\epsilon^{(v)}$: Gaussian noise **Prior Structure** (ARD - Automatic Relevance Determination): $$W^{(v)}_{jk} \sim \mathcal{N}(0, 1/\alpha^{(v)}_k)$$ $$\alpha^{(v)}_k \sim \text{Gamma}(a_0, b_0)$$ This allows automatic selection of relevant factors per view. ```{r mofa-demo} # Run factor analysis result_mofa <- multi_view_factor_analysis(data_list, n_factors = 5, max_iter = 100) # Examine variance explained cat("Variance explained per factor:\n") print(round(result_mofa$variance_explained, 3)) ``` ### Multiple Co-Inertia Analysis (MCIA) **Reference**: Meng et al., *BMC Bioinformatics*, 2014 MCIA extends co-inertia analysis to multiple tables, finding common structures across omics. **Objective**: Maximize the sum of squared covariances between the global scores and individual table scores: $$\max_{\mathbf{f}} \sum_{v=1}^{V} \text{Cov}^2(\mathbf{f}, X^{(v)}\mathbf{q}^{(v)})$$ --- ## Canonical Correlation Methods ### RGCCA (Regularized Generalized CCA) **Reference**: Tenenhaus & Tenenhaus, *Psychometrika*, 2011 RGCCA finds latent components that maximize the sum of correlations between connected blocks. **Optimization Problem**: $$\max_{a^{(1)}, \ldots, a^{(V)}} \sum_{j < k} c_{jk} g(\text{Cov}(X^{(j)}a^{(j)}, X^{(k)}a^{(k)}))$$ subject to $(1-\tau_j)\text{Var}(X^{(j)}a^{(j)}) + \tau_j\|a^{(j)}\|^2 = 1$ where $c_{jk}$ defines the design matrix (which blocks are connected) and $\tau_j$ is the regularization parameter. ```{r rgcca-demo} # Run RGCCA result_rgcca <- run_rgcca(data_list, n_components = 3, n_clusters = 3) head(result_rgcca) ``` ### SGCCA (Sparse GCCA) SGCCA adds L1 penalty for feature selection: $$\|a^{(j)}\|_1 \leq s_j$$ This induces sparsity in the loadings, selecting relevant features. --- ## Bayesian Methods ### iClusterBayes **Reference**: Mo et al., *Biostatistics*, 2018 iClusterBayes uses a Bayesian framework with Gibbs sampling. **Model**: $$X^{(v)} = W^{(v)}Z + \epsilon^{(v)}$$ with priors: - $W^{(v)}_{jk} \sim \mathcal{N}(0, \sigma^2_w)$ - $Z_{ki} \sim \mathcal{N}(0, 1)$ - $\sigma^{-2}_v \sim \text{Gamma}(a, b)$ ### Bayesian Consensus Clustering (BCC) **Reference**: Lock & Dunson, *Bioinformatics*, 2013 BCC models the relationship between local (view-specific) and global clusters using a Dirichlet-Multinomial framework. **Generative Process**: 1. Global cluster: $c_i \sim \text{Multinomial}(\pi)$ 2. Local cluster: $c^{(v)}_i | c_i \sim \text{Multinomial}(\theta_{c_i}^{(v)})$ --- ## Ensemble Methods ### PINSPlus (Perturbation Clustering) **Reference**: Nguyen et al., *Genome Research*, 2017 PINSPlus uses perturbation-based subsampling to identify stable clusters. **Algorithm**: 1. Subsample data with noise perturbation 2. Cluster each subsample 3. Build connectivity matrix 4. Final clustering from connectivity matrix ```{r pinsplus-demo} # Run PINSPlus result_pins <- run_pinsplus(data_list, n_clusters = 3, n_reps = 20) head(result_pins) ``` --- ## Algorithm Selection Guide | Scenario | Recommended Algorithms | |:---------|:-----------------------| | High-dimensional, sparse data | CIMLR, SGCCA | | Missing data | MOFA, iClusterBayes | | Large sample size | SNF, IntNMF | | Need feature selection | CIMLR, SGCCA | | Interpretable factors | MOFA, MCIA | | Robust to noise | PINSPlus, BCC | ## References 1. Wang B, et al. (2014). Similarity network fusion for aggregating data types on a genomic scale. *Nature Methods*, 11(3):333-337. 2. Ramazzotti D, et al. (2018). Multi-omic tumor data reveal diversity of molecular mechanisms underlying survival. *Nature Communications*, 9(1):4453. 3. Mo Q, et al. (2018). A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. *Biostatistics*, 19(1):71-86. 4. Chalise P, Fridley BL (2017). Integrative clustering of multi-level 'omic data based on non-negative matrix factorization algorithm. *PLoS ONE*, 12(5):e0176278. 5. Tenenhaus A, Tenenhaus M (2011). Regularized generalized canonical correlation analysis. *Psychometrika*, 76(2):257-284. ## Session Info ```{r session} sessionInfo() ```