Algorithm Theory and Mathematical Framework

Introduction

METAFLUX employs constraint-based metabolic modeling to infer intracellular metabolic fluxes from gene expression data. This vignette provides a comprehensive overview of the mathematical framework underlying the algorithm.

Theoretical Background

Genome-Scale Metabolic Models

A genome-scale metabolic model (GEM) represents the complete set of metabolic reactions in an organism. Human-GEM, used by METAFLUX, contains:

  • 8,378 metabolites
  • 13,082 reactions
  • 3,625 metabolic genes

Stoichiometric Matrix

The stoichiometric matrix \(\mathbf{S}\) encodes the structure of the metabolic network:

\[\mathbf{S} \in \mathbb{R}^{m \times n}\]

where: - \(m\) = number of metabolites (8,378) - \(n\) = number of reactions (13,082) - \(S_{ij}\) = stoichiometric coefficient of metabolite \(i\) in reaction \(j\)

Flux Balance Analysis (FBA)

Steady-State Assumption

At metabolic steady state, the rate of production equals the rate of consumption for each metabolite:

\[\mathbf{S} \cdot \mathbf{v} = 0\]

where \(\mathbf{v}\) is the flux vector.

Optimization Problem

METAFLUX solves a quadratic programming problem:

\[\min_{\mathbf{v}} \quad \mathbf{v}^T \mathbf{P} \mathbf{v} + \mathbf{q}^T \mathbf{v}\]

subject to: \[\mathbf{S} \cdot \mathbf{v} = 0\] \[\mathbf{lb} \leq \mathbf{v} \leq \mathbf{ub}\]

where: - \(\mathbf{P}\) = identity matrix (regularization) - \(\mathbf{q}\) = objective coefficients (biomass maximization) - \(\mathbf{lb}, \mathbf{ub}\) = flux bounds from MRAS

Biomass Objective

The objective function maximizes biomass production:

\[\mathbf{q} = \begin{cases} -10000 & \text{if reaction is biomass} \\ 0 & \text{otherwise} \end{cases}\]

Metabolic Reaction Activity Scores (MRAS)

Gene-Protein-Reaction (GPR) Rules

GPR rules define the relationship between genes, proteins, and reactions:

  1. Isoenzymes (OR relationship): Multiple genes can catalyze the same reaction
  2. Enzyme complexes (AND relationship): Multiple genes required for one reaction

Scoring Algorithm

Isoenzyme Score (OR)

For reactions with multiple isoenzymes:

\[\text{Score}_{\text{iso}} = \sum_{i=1}^{n} \frac{e_i}{w_i}\]

where: - \(e_i\) = expression level of gene \(i\) - \(w_i\) = gene occurrence weight (frequency in GPR rules)

Complex Score (AND)

For enzyme complexes:

\[\text{Score}_{\text{complex}} = \min_{i=1}^{n} \frac{e_i}{w_i}\]

The rate-limiting subunit determines the complex activity.

Mixed Relationships

For complex GPR rules with both AND and OR:

\[\text{Score}_{\text{mixed}} = f(\text{hierarchical combination})\]

Normalization

Scores are normalized per sample:

\[\text{MRAS}_{ij} = \frac{\text{Score}_{ij}}{\max_j(\text{Score}_{ij})}\]

Flux Bounds Construction

Reversible Reactions

For reversible reactions (\(\text{rev} = 1\)):

\[lb_i = -\text{MRAS}_i, \quad ub_i = \text{MRAS}_i\]

Irreversible Reactions

For irreversible reactions (\(\text{rev} = 0\)):

\[lb_i = 0, \quad ub_i = \text{MRAS}_i\]

Exchange Reactions

Exchange reactions represent nutrient uptake/secretion:

\[lb_i = \begin{cases} -1 & \text{if nutrient in medium} \\ 0 & \text{otherwise} \end{cases}\]

Community Modeling (Single-Cell)

Community Stoichiometric Matrix

For \(k\) cell types, the community matrix is:

\[\mathbf{S}_{\text{community}} = \begin{bmatrix} \mathbf{A}_1 & \mathbf{A}_2 & \cdots & \mathbf{A}_k & \mathbf{E} \\ \mathbf{S}_1 & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{S}_2 & \cdots & \mathbf{0} & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{S}_k & \mathbf{0} \end{bmatrix}\]

where: - \(\mathbf{A}_i\) = exchange matrix for cell type \(i\) - \(\mathbf{S}_i\) = intracellular stoichiometry - \(\mathbf{E}\) = external medium exchange

Weighted Objective

Cell type fractions weight the objective:

\[\mathbf{q}_{\text{community}} = \sum_{i=1}^{k} f_i \cdot \mathbf{q}_i\]

where \(f_i\) is the fraction of cell type \(i\).

Solver: OSQP

METAFLUX uses the OSQP (Operator Splitting Quadratic Program) solver:

Algorithm Settings

settings <- osqp::osqpSettings(
  max_iter = 1000000L,    # Maximum iterations
  eps_abs = 1e-04,         # Absolute tolerance
  eps_rel = 1e-04,         # Relative tolerance
  adaptive_rho_interval = 50,
  verbose = FALSE
)

Convergence

The solver converges when:

\[\|\mathbf{S} \cdot \mathbf{v}\|_{\infty} < \epsilon\]

Typical precision: \(\epsilon \approx 10^{-5}\)

Computational Complexity

Operation Complexity
MRAS calculation \(O(n \cdot g)\)
FBA (per sample) \(O(m \cdot n^2)\)
Community model \(O(k^2 \cdot m \cdot n^2)\)

where: - \(n\) = number of reactions - \(m\) = number of metabolites - \(g\) = number of genes - \(k\) = number of cell types

References

  1. Huang Y, et al. (2023). Characterizing cancer metabolism from bulk and single-cell RNA-seq data using METAFlux. Nature Communications, 14, 4883.

  2. Robinson JL, et al. (2020). An atlas of human metabolism. Science Signaling, 13, eaaz1482.

  3. Orth JD, et al. (2010). What is flux balance analysis? Nature Biotechnology, 28, 245-248.

  4. Stellato B, et al. (2020). OSQP: An operator splitting solver for quadratic programs. Mathematical Programming Computation, 12, 637-672.

Author

Zaoqu Liu - Package maintainer and optimization implementation - GitHub: Zaoqu-Liu - ORCID: 0000-0002-0452-742X