This vignette provides a detailed explanation of the algorithms and mathematical foundations underlying darwin. Understanding these concepts will help users make informed decisions about parameter settings and interpret results correctly.
Given a reference expression matrix \(\mathbf{X} \in \mathbb{R}^{k \times n}\) where:
We seek a gene subset \(S \subseteq \{1, \ldots, n\}\) that optimizes multiple conflicting objectives simultaneously:
\[\min_{S} \mathbf{f}(S) = (f_1(\mathbf{X}_S), f_2(\mathbf{X}_S), \ldots, f_m(\mathbf{X}_S))\]
where \(\mathbf{X}_S\) is the submatrix of \(\mathbf{X}\) containing only the genes in \(S\).
In multi-objective optimization, we use the concept of Pareto dominance to compare solutions:
Definition: Solution \(\mathbf{a}\) dominates solution \(\mathbf{b}\) (written \(\mathbf{a} \prec \mathbf{b}\)) if and only if:
The Pareto front (or Pareto-optimal set) consists of all non-dominated solutions—solutions that cannot be improved in any objective without worsening another.
library(darwin)
library(ggplot2)
# Create example fitness landscape
set.seed(42)
n_points <- 100
f1 <- runif(n_points, 0, 10)
f2 <- 10 - f1 + rnorm(n_points, sd = 2)
f2[f2 < 0] <- 0.5
# Find Pareto front
is_pareto <- rep(FALSE, n_points)
for (i in 1:n_points) {
dominated <- FALSE
for (j in 1:n_points) {
if (i != j && f1[j] <= f1[i] && f2[j] >= f2[i] && (f1[j] < f1[i] || f2[j] > f2[i])) {
dominated <- TRUE
break
}
}
is_pareto[i] <- !dominated
}
df <- data.frame(correlation = f1, distance = f2, pareto = is_pareto)
ggplot(df, aes(x = correlation, y = distance, color = pareto)) +
geom_point(size = 3, alpha = 0.7) +
scale_color_manual(values = c("gray60", "#3498db"), labels = c("Dominated", "Pareto Front")) +
geom_line(data = df[df$pareto, ][order(df[df$pareto, ]$correlation), ],
color = "#3498db", linewidth = 1) +
labs(
title = "Pareto Front Visualization",
subtitle = "Trade-off between minimizing correlation and maximizing distance",
x = "Correlation (minimize)",
y = "Distance (maximize)",
color = "Solution Type"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom")Illustration of Pareto dominance and the Pareto front. Points on the front (blue) are non-dominated.
darwin implements NSGA-II (Non-dominated Sorting Genetic Algorithm II), a widely-used evolutionary algorithm for multi-objective optimization.
Algorithm: NSGA-II
Input: Population size N, generations G, objectives f₁...fₘ
Output: Pareto-optimal gene subsets
1. Initialize population P₀ with N random gene selections
2. Evaluate objectives for all individuals
3. For generation t = 1 to G:
a. Create offspring Q_t using selection, crossover, mutation
b. Combine: R_t = P_t ∪ Q_t
c. Non-dominated sorting of R_t into fronts F₁, F₂, ...
d. Select N individuals for P_{t+1} using fronts and crowding distance
4. Return final Pareto front
Individuals are sorted into fronts based on dominance:
# Simulate fitness values
set.seed(123)
n <- 20
fitness <- matrix(runif(n * 2), ncol = 2)
colnames(fitness) <- c("Obj1", "Obj2")
# Non-dominated sorting (simplified)
ranks <- darwin:::.non_dominated_sort(-fitness) # Negate for maximization
df_fronts <- data.frame(
Obj1 = fitness[, 1],
Obj2 = fitness[, 2],
Front = factor(ranks)
)
ggplot(df_fronts, aes(x = Obj1, y = Obj2, color = Front, shape = Front)) +
geom_point(size = 4) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Non-dominated Sorting",
subtitle = "Individuals assigned to Pareto fronts",
x = "Objective 1",
y = "Objective 2"
) +
theme_minimal(base_size = 12)Non-dominated sorting assigns individuals to Pareto fronts.
Within each front, crowding distance measures the density of solutions around each individual. Higher crowding distance indicates more isolated solutions, which are preferred to maintain diversity.
For individual \(i\) in front \(F\):
\[d_i = \sum_{m=1}^{M} \frac{f_m^{i+1} - f_m^{i-1}}{f_m^{max} - f_m^{min}}\]
where \(f_m^{i+1}\) and \(f_m^{i-1}\) are the objective values of neighboring individuals when sorted by objective \(m\).
set.seed(456)
n <- 15
f1 <- sort(runif(n, 1, 10))
f2 <- 10 - f1 + rnorm(n, sd = 0.3)
fitness_mat <- cbind(f1, f2)
crowding <- crowding_distance(fitness_mat, ranks = rep(1, n))
crowding[is.infinite(crowding)] <- max(crowding[is.finite(crowding)]) * 1.5
df_crowd <- data.frame(f1 = f1, f2 = f2, crowding = crowding)
ggplot(df_crowd, aes(x = f1, y = f2, size = crowding, color = crowding)) +
geom_point(alpha = 0.8) +
scale_size_continuous(range = c(3, 10)) +
scale_color_gradient(low = "#3498db", high = "#e74c3c") +
labs(
title = "Crowding Distance Visualization",
subtitle = "Larger points = higher crowding distance (more isolated)",
x = "Objective 1",
y = "Objective 2",
size = "Crowding\nDistance",
color = "Crowding\nDistance"
) +
theme_minimal(base_size = 12)Crowding distance computation. Boundary solutions receive infinite distance.
Selection combines front rank and crowding distance:
Crossover: darwin uses two crossover strategies:
child1 = parent1 AND parent2,
child2 = parent1 OR parent2Mutation: Bit-flip mutation with probability \(p_{mut}\) per gene:
\[P(\text{flip gene } g) = p_{mut}\]
Measures the total pairwise correlation between cell type expression profiles:
\[f_{corr}(S) = \sum_{i < j} |\rho_{ij}|\]
where \(\rho_{ij}\) is the Pearson correlation coefficient between cell types \(i\) and \(j\).
Intuition: Low correlation ensures cell types are distinguishable, improving deconvolution accuracy.
library(darwin)
set.seed(42)
# Create data with structure
n_ct <- 5
n_genes <- 100
data <- matrix(rnorm(n_ct * n_genes), nrow = n_ct)
rownames(data) <- paste0("CT", 1:n_ct)
# Add markers
for (i in 1:n_ct) {
data[i, ((i-1)*10+1):(i*10)] <- data[i, ((i-1)*10+1):(i*10)] + 3
}
# Compute correlation
corr_mat <- cor(t(data))
# Plot
df_corr <- expand.grid(CT1 = rownames(corr_mat), CT2 = colnames(corr_mat))
df_corr$correlation <- as.vector(corr_mat)
ggplot(df_corr, aes(x = CT1, y = CT2, fill = correlation)) +
geom_tile() +
scale_fill_gradient2(low = "#3498db", mid = "white", high = "#e74c3c", midpoint = 0) +
labs(
title = "Cell Type Correlation Matrix",
subtitle = paste("Total absolute correlation:", round(compute_correlation(data), 2)),
x = "", y = ""
) +
theme_minimal(base_size = 12) +
coord_fixed()Correlation matrix heatmap for selected genes.
Measures the total pairwise Euclidean distance between cell type profiles:
\[f_{dist}(S) = \sum_{i < j} \|\mathbf{x}_i - \mathbf{x}_j\|_2\]
Intuition: Large distances ensure cell type profiles are well-separated in gene expression space.
For deconvolution stability, the condition number of the reference matrix is important:
\[\kappa(\mathbf{X}) = \frac{\sigma_{max}}{\sigma_{min}}\]
A high condition number indicates potential numerical instability in deconvolution.
| Parameter | Typical Range | Effect |
|---|---|---|
ngen |
50-500 | More generations = better convergence, longer runtime |
pop_size |
50-200 | Larger = more diversity, slower |
mutation_prob |
0.001-0.01 | Higher = more exploration |
crossover_prob |
0.6-0.9 | Higher = more recombination |
NSGA-II: Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182-197.
AutoGeneS: Aliee, H., & Theis, F. J. (2021). AutoGeneS: Automatic gene selection using multi-objective optimization for RNA-seq deconvolution. Cell Systems, 12(7), 706-715.e4.
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_4.0.3 darwin_1.0.0 rmarkdown_2.31
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.7-5 future.apply_1.20.2 gtable_0.3.6
#> [4] jsonlite_2.0.0 dplyr_1.2.1 compiler_4.6.0
#> [7] Rcpp_1.1.1-1.1 tidyselect_1.2.1 parallel_4.6.0
#> [10] jquerylib_0.1.4 globals_0.19.1 scales_1.4.0
#> [13] yaml_2.3.12 fastmap_1.2.0 lattice_0.22-9
#> [16] R6_2.6.1 labeling_0.4.3 generics_0.1.4
#> [19] knitr_1.51 future_1.70.0 tibble_3.3.1
#> [22] maketools_1.3.2 bslib_0.11.0 pillar_1.11.1
#> [25] RColorBrewer_1.1-3 rlang_1.2.0 cachem_1.1.0
#> [28] xfun_0.57 sass_0.4.10 sys_3.4.3
#> [31] S7_0.2.2 otel_0.2.0 cli_3.6.6
#> [34] withr_3.0.2 magrittr_2.0.5 digest_0.6.39
#> [37] grid_4.6.0 lifecycle_1.0.5 vctrs_0.7.3
#> [40] evaluate_1.0.5 glue_1.8.1 listenv_0.10.1
#> [43] farver_2.1.2 codetools_0.2-20 buildtools_1.0.0
#> [46] parallelly_1.47.0 tools_4.6.0 pkgconfig_2.0.3
#> [49] htmltools_0.5.9