--- title: "Algorithm Theory: NSGA-II for Gene Selection" author: "Zaoqu Liu" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 fig_caption: true vignette: > %\VignetteIndexEntry{Algorithm Theory: NSGA-II for Gene Selection} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, warning = FALSE, message = FALSE ) ``` ## Introduction This vignette provides a detailed explanation of the algorithms and mathematical foundations underlying **darwin**. Understanding these concepts will help users make informed decisions about parameter settings and interpret results correctly. ## The Multi-Objective Optimization Problem ### Problem Formulation Given a reference expression matrix $\mathbf{X} \in \mathbb{R}^{k \times n}$ where: - $k$ = number of cell types - $n$ = number of genes We seek a gene subset $S \subseteq \{1, \ldots, n\}$ that optimizes multiple conflicting objectives simultaneously: $$\min_{S} \mathbf{f}(S) = (f_1(\mathbf{X}_S), f_2(\mathbf{X}_S), \ldots, f_m(\mathbf{X}_S))$$ where $\mathbf{X}_S$ is the submatrix of $\mathbf{X}$ containing only the genes in $S$. ### Pareto Dominance In multi-objective optimization, we use the concept of **Pareto dominance** to compare solutions: **Definition**: Solution $\mathbf{a}$ **dominates** solution $\mathbf{b}$ (written $\mathbf{a} \prec \mathbf{b}$) if and only if: 1. $\mathbf{a}$ is no worse than $\mathbf{b}$ in all objectives: $\forall i: f_i(\mathbf{a}) \leq f_i(\mathbf{b})$ 2. $\mathbf{a}$ is strictly better than $\mathbf{b}$ in at least one objective: $\exists j: f_j(\mathbf{a}) < f_j(\mathbf{b})$ ### Pareto Front The **Pareto front** (or Pareto-optimal set) consists of all non-dominated solutions—solutions that cannot be improved in any objective without worsening another. ```{r pareto-demo, fig.cap="Illustration of Pareto dominance and the Pareto front. Points on the front (blue) are non-dominated."} library(darwin) library(ggplot2) # Create example fitness landscape set.seed(42) n_points <- 100 f1 <- runif(n_points, 0, 10) f2 <- 10 - f1 + rnorm(n_points, sd = 2) f2[f2 < 0] <- 0.5 # Find Pareto front is_pareto <- rep(FALSE, n_points) for (i in 1:n_points) { dominated <- FALSE for (j in 1:n_points) { if (i != j && f1[j] <= f1[i] && f2[j] >= f2[i] && (f1[j] < f1[i] || f2[j] > f2[i])) { dominated <- TRUE break } } is_pareto[i] <- !dominated } df <- data.frame(correlation = f1, distance = f2, pareto = is_pareto) ggplot(df, aes(x = correlation, y = distance, color = pareto)) + geom_point(size = 3, alpha = 0.7) + scale_color_manual(values = c("gray60", "#3498db"), labels = c("Dominated", "Pareto Front")) + geom_line(data = df[df$pareto, ][order(df[df$pareto, ]$correlation), ], color = "#3498db", linewidth = 1) + labs( title = "Pareto Front Visualization", subtitle = "Trade-off between minimizing correlation and maximizing distance", x = "Correlation (minimize)", y = "Distance (maximize)", color = "Solution Type" ) + theme_minimal(base_size = 12) + theme(legend.position = "bottom") ``` ## NSGA-II Algorithm darwin implements **NSGA-II** (Non-dominated Sorting Genetic Algorithm II), a widely-used evolutionary algorithm for multi-objective optimization. ### Algorithm Overview ``` Algorithm: NSGA-II Input: Population size N, generations G, objectives f₁...fₘ Output: Pareto-optimal gene subsets 1. Initialize population P₀ with N random gene selections 2. Evaluate objectives for all individuals 3. For generation t = 1 to G: a. Create offspring Q_t using selection, crossover, mutation b. Combine: R_t = P_t ∪ Q_t c. Non-dominated sorting of R_t into fronts F₁, F₂, ... d. Select N individuals for P_{t+1} using fronts and crowding distance 4. Return final Pareto front ``` ### Key Components #### 1. Non-dominated Sorting Individuals are sorted into **fronts** based on dominance: - **Front 1**: All non-dominated individuals - **Front 2**: Non-dominated among remaining individuals - **Front k**: Continue until all individuals are assigned ```{r nds-demo, fig.cap="Non-dominated sorting assigns individuals to Pareto fronts."} # Simulate fitness values set.seed(123) n <- 20 fitness <- matrix(runif(n * 2), ncol = 2) colnames(fitness) <- c("Obj1", "Obj2") # Non-dominated sorting (simplified) ranks <- darwin:::.non_dominated_sort(-fitness) # Negate for maximization df_fronts <- data.frame( Obj1 = fitness[, 1], Obj2 = fitness[, 2], Front = factor(ranks) ) ggplot(df_fronts, aes(x = Obj1, y = Obj2, color = Front, shape = Front)) + geom_point(size = 4) + scale_color_brewer(palette = "Set1") + labs( title = "Non-dominated Sorting", subtitle = "Individuals assigned to Pareto fronts", x = "Objective 1", y = "Objective 2" ) + theme_minimal(base_size = 12) ``` #### 2. Crowding Distance Within each front, **crowding distance** measures the density of solutions around each individual. Higher crowding distance indicates more isolated solutions, which are preferred to maintain diversity. For individual $i$ in front $F$: $$d_i = \sum_{m=1}^{M} \frac{f_m^{i+1} - f_m^{i-1}}{f_m^{max} - f_m^{min}}$$ where $f_m^{i+1}$ and $f_m^{i-1}$ are the objective values of neighboring individuals when sorted by objective $m$. ```{r crowding-demo, fig.cap="Crowding distance computation. Boundary solutions receive infinite distance."} set.seed(456) n <- 15 f1 <- sort(runif(n, 1, 10)) f2 <- 10 - f1 + rnorm(n, sd = 0.3) fitness_mat <- cbind(f1, f2) crowding <- crowding_distance(fitness_mat, ranks = rep(1, n)) crowding[is.infinite(crowding)] <- max(crowding[is.finite(crowding)]) * 1.5 df_crowd <- data.frame(f1 = f1, f2 = f2, crowding = crowding) ggplot(df_crowd, aes(x = f1, y = f2, size = crowding, color = crowding)) + geom_point(alpha = 0.8) + scale_size_continuous(range = c(3, 10)) + scale_color_gradient(low = "#3498db", high = "#e74c3c") + labs( title = "Crowding Distance Visualization", subtitle = "Larger points = higher crowding distance (more isolated)", x = "Objective 1", y = "Objective 2", size = "Crowding\nDistance", color = "Crowding\nDistance" ) + theme_minimal(base_size = 12) ``` #### 3. Selection Selection combines front rank and crowding distance: 1. Prefer individuals with better (lower) front rank 2. Within the same front, prefer higher crowding distance #### 4. Genetic Operators **Crossover**: darwin uses two crossover strategies: - **AND/OR crossover** (early generations): `child1 = parent1 AND parent2`, `child2 = parent1 OR parent2` - **Two-point crossover** (later generations): Standard segment exchange **Mutation**: Bit-flip mutation with probability $p_{mut}$ per gene: $$P(\text{flip gene } g) = p_{mut}$$ ## Objective Functions ### Correlation Objective (Minimize) Measures the total pairwise correlation between cell type expression profiles: $$f_{corr}(S) = \sum_{i < j} |\rho_{ij}|$$ where $\rho_{ij}$ is the Pearson correlation coefficient between cell types $i$ and $j$. **Intuition**: Low correlation ensures cell types are distinguishable, improving deconvolution accuracy. ```{r corr-demo, fig.cap="Correlation matrix heatmap for selected genes."} library(darwin) set.seed(42) # Create data with structure n_ct <- 5 n_genes <- 100 data <- matrix(rnorm(n_ct * n_genes), nrow = n_ct) rownames(data) <- paste0("CT", 1:n_ct) # Add markers for (i in 1:n_ct) { data[i, ((i-1)*10+1):(i*10)] <- data[i, ((i-1)*10+1):(i*10)] + 3 } # Compute correlation corr_mat <- cor(t(data)) # Plot df_corr <- expand.grid(CT1 = rownames(corr_mat), CT2 = colnames(corr_mat)) df_corr$correlation <- as.vector(corr_mat) ggplot(df_corr, aes(x = CT1, y = CT2, fill = correlation)) + geom_tile() + scale_fill_gradient2(low = "#3498db", mid = "white", high = "#e74c3c", midpoint = 0) + labs( title = "Cell Type Correlation Matrix", subtitle = paste("Total absolute correlation:", round(compute_correlation(data), 2)), x = "", y = "" ) + theme_minimal(base_size = 12) + coord_fixed() ``` ### Distance Objective (Maximize) Measures the total pairwise Euclidean distance between cell type profiles: $$f_{dist}(S) = \sum_{i < j} \|\mathbf{x}_i - \mathbf{x}_j\|_2$$ **Intuition**: Large distances ensure cell type profiles are well-separated in gene expression space. ### Condition Number (Minimize) For deconvolution stability, the condition number of the reference matrix is important: $$\kappa(\mathbf{X}) = \frac{\sigma_{max}}{\sigma_{min}}$$ A high condition number indicates potential numerical instability in deconvolution. ## Parameter Guidelines | Parameter | Typical Range | Effect | |-----------|---------------|--------| | `ngen` | 50-500 | More generations = better convergence, longer runtime | | `pop_size` | 50-200 | Larger = more diversity, slower | | `mutation_prob` | 0.001-0.01 | Higher = more exploration | | `crossover_prob` | 0.6-0.9 | Higher = more recombination | ## References 1. **NSGA-II**: Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. *IEEE Transactions on Evolutionary Computation*, 6(2), 182-197. 2. **AutoGeneS**: Aliee, H., & Theis, F. J. (2021). AutoGeneS: Automatic gene selection using multi-objective optimization for RNA-seq deconvolution. *Cell Systems*, 12(7), 706-715.e4. ## Session Info ```{r session} sessionInfo() ```