Getting Started with darwin

Introduction

darwin is an R package for automatic marker gene selection using multi-objective evolutionary optimization. The package implements the NSGA-II algorithm to identify Pareto-optimal gene subsets for bulk RNA-seq deconvolution.

Why darwin?

Traditional marker gene selection often optimizes a single criterion, which can yield gene sets that score well on that criterion but poorly on others. darwin addresses this through:

  • Multi-objective optimization: Simultaneously balancing multiple criteria
  • Pareto optimality: Providing a diverse set of trade-off solutions
  • Automated selection: Reducing manual intervention in gene selection
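
To make the idea of Pareto optimality concrete, here is a minimal base-R sketch of a dominance check for two objectives. The `candidates` table and the `dominates()` helper are hypothetical and written for illustration only; they are not part of darwin's API. One solution dominates another when it is no worse on both objectives and strictly better on at least one.

```r
# Hypothetical candidate gene sets scored on two objectives:
# correlation (lower is better) and distance (higher is better)
candidates <- data.frame(
  correlation = c(0.20, 0.25, 0.30, 0.22, 0.35),
  distance    = c(10,   14,   13,   9,    18)
)

# TRUE if solution a is at least as good as b on both objectives
# and strictly better on at least one
dominates <- function(a, b) {
  no_worse <- a$correlation <= b$correlation && a$distance >= b$distance
  better   <- a$correlation <  b$correlation || a$distance >  b$distance
  no_worse && better
}

# A solution is Pareto-optimal if no other solution dominates it
n <- nrow(candidates)
is_pareto <- vapply(seq_len(n), function(i) {
  !any(vapply(seq_len(n), function(j) {
    j != i && dominates(candidates[j, ], candidates[i, ])
  }, logical(1)))
}, logical(1))

candidates[is_pareto, ]
```

The rows that survive form the trade-off set: none of them can be improved on one objective without getting worse on the other.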

Installation

# From R-universe (recommended)
install.packages("darwin", repos = "https://zaoqu-liu.r-universe.dev")

# From GitHub (requires the remotes package)
remotes::install_github("Zaoqu-Liu/darwin")

Quick Start

Load the Package

library(darwin)

Prepare Reference Data

darwin requires a reference expression matrix where rows are cell types and columns are genes.

set.seed(42)

# Simulate reference data: 5 cell types × 200 genes
n_celltypes <- 5
n_genes <- 200

reference <- matrix(
  abs(rnorm(n_celltypes * n_genes, mean = 2, sd = 1)),
  nrow = n_celltypes,
  ncol = n_genes
)
rownames(reference) <- paste0("CellType", 1:n_celltypes)
colnames(reference) <- paste0("Gene", 1:n_genes)

# Add cell-type specific marker genes
for (i in 1:n_celltypes) {
  marker_start <- (i - 1) * 10 + 1
  marker_end <- i * 10
  reference[i, marker_start:marker_end] <- reference[i, marker_start:marker_end] + 5
}

print(dim(reference))
#> [1]   5 200
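
When the reference comes from single-cell data rather than a simulation, one common way to obtain a cell types × genes matrix is to average expression within each cell type. The sketch below uses base R only. The names `expr`, `labels`, and `reference_sc` are hypothetical, and the input is simulated here so the example runs on its own.

```r
set.seed(1)

# Hypothetical single-cell input: 60 cells x 4 genes, with one label per cell
expr <- matrix(rpois(60 * 4, lambda = 5), nrow = 60, ncol = 4,
               dimnames = list(paste0("Cell", 1:60), paste0("Gene", 1:4)))
labels <- rep(c("B", "T", "NK"), each = 20)

# Sum expression per cell type, then divide by the number of cells in each type
sums <- rowsum(expr, group = labels)
n_cells <- table(labels)[rownames(sums)]
reference_sc <- sums / as.vector(n_cells)

dim(reference_sc)  # cell types in rows, genes in columns
```

Indexing the cell counts by `rownames(sums)` keeps the divisor aligned with the row order that `rowsum()` returns.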

Initialize darwin

dw <- darwin(reference)

Run Optimization

dw$optimize(
  ngen = 50,                                # Number of generations
  objectives = c("correlation", "distance"), # Objectives
  weights = c(-1, 1),                        # Minimize corr, maximize dist
  pop_size = 50,                             # Population size
  verbose = FALSE,
  parallel = FALSE
)
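
The call above minimizes a correlation objective and maximizes a distance objective. How darwin computes them is not shown in this vignette, so the sketch below uses assumed definitions: mean pairwise Pearson correlation and mean pairwise Euclidean distance between cell-type profiles. The helper `score_subset()` is hypothetical. It shows why planted marker genes should score better than background genes on the distance objective.

```r
set.seed(42)

# Rebuild a small reference like the one above: 5 cell types x 200 genes,
# with 10 marker genes per cell type in the first 50 columns
reference <- matrix(abs(rnorm(5 * 200, mean = 2, sd = 1)), nrow = 5)
for (i in 1:5) {
  cols <- ((i - 1) * 10 + 1):(i * 10)
  reference[i, cols] <- reference[i, cols] + 5
}

# ASSUMED definitions, not taken from darwin's source:
# mean pairwise Pearson correlation and mean pairwise Euclidean distance
# between cell-type profiles, restricted to a gene subset
score_subset <- function(ref, genes) {
  sub <- ref[, genes, drop = FALSE]
  cors <- cor(t(sub))
  c(correlation = mean(cors[upper.tri(cors)]),
    distance    = mean(as.numeric(dist(sub))))
}

markers    <- score_subset(reference, 1:50)    # planted marker genes
background <- score_subset(reference, 51:100)  # equally many non-marker genes
rbind(markers, background)
```

Both subsets contain 50 genes, so the distance values are comparable; Euclidean distance grows with the number of genes, which is one reason a second objective is useful as a counterweight.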

Visualize Pareto Front

dw$plot()

Pareto front showing the trade-off between correlation and distance objectives.

Select Optimal Solution

# Select using weighted criteria
dw$select(weights = c(-1, 1))

# Get selected genes
genes <- dw$get_genes()
cat("Number of selected genes:", length(genes), "\n")
#> Number of selected genes: 191
cat("First 10 genes:", paste(head(genes, 10), collapse = ", "), "\n")
#> First 10 genes: Gene1, Gene2, Gene3, Gene4, Gene5, Gene6, Gene7, Gene8, Gene9, Gene10

View Fitness Values

fitness <- dw$get_fitness()
head(fitness)
#>   correlation distance
#> 1   0.2371333 283.0623
#> 2   0.3596163 293.1472
#> 3   0.2375669 287.2533
#> 4   0.2515467 288.5683
#> 5   0.2597023 288.9972
#> 6   0.2436569 287.7592
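
To illustrate how a weighted choice among trade-off solutions could work, the sketch below re-enters the six rows printed above and scores them with the weights `c(-1, 1)`. The scoring rule (min-max rescaling followed by a weighted sum) is an assumption made for illustration; it is not necessarily what `dw$select()` does internally. `rescale01()` is a hypothetical helper.

```r
# Fitness values copied from the head(fitness) output above
fitness <- data.frame(
  correlation = c(0.2371333, 0.3596163, 0.2375669, 0.2515467, 0.2597023, 0.2436569),
  distance    = c(283.0623, 293.1472, 287.2533, 288.5683, 288.9972, 287.7592)
)
weights <- c(correlation = -1, distance = 1)

# Rescale each objective to [0, 1] so that the two are comparable
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- as.data.frame(lapply(fitness, rescale01))

# ASSUMED scoring rule: weighted sum of rescaled objectives, highest wins
score <- as.matrix(scaled) %*% weights[names(scaled)]
best <- which.max(score)
fitness[best, ]
```

Rescaling matters here because the raw distance values span about 10 units while correlation spans about 0.12; without it, the distance objective would decide the outcome almost alone.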

Working with Seurat Objects

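darwin can also be initialized directly from a Seurat object:

# From a Seurat object (seurat_obj must already exist; cell type labels
# are read from the metadata column named in celltype_key)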
dw <- darwin(
  seurat_obj,
  celltype_key = "cell_type",
  assay = "RNA",
  layer = "data"
)

# Use only highly variable genes
dw <- darwin(
  seurat_obj,
  celltype_key = "cell_type",
  use_highly_variable = TRUE
)
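
Restricting the search to highly variable genes shrinks the space the optimizer has to explore. How darwin obtains those genes from the Seurat object is not shown here. The sketch below only illustrates the general idea with plain variance ranking in base R, which is simpler than the variance-stabilizing method Seurat uses by default. The names `expr`, `n_top`, and `hvg` are hypothetical, and the data are simulated.

```r
set.seed(7)

# Hypothetical cells x genes matrix: 100 cells, 50 genes
expr <- matrix(rnorm(100 * 50), nrow = 100,
               dimnames = list(NULL, paste0("Gene", 1:50)))
# Make the first 5 genes much more variable than the rest
expr[, 1:5] <- expr[, 1:5] * 4

# Rank genes by variance across cells and keep the top n_top
n_top <- 5
gene_var <- apply(expr, 2, var)
hvg <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_top)]
expr_hvg <- expr[, hvg, drop = FALSE]
hvg
```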

Basic Deconvolution

# Create mock bulk data: 3 samples in rows, genes in columns (same gene order as the reference)
bulk <- matrix(abs(rnorm(3 * n_genes)), nrow = 3, ncol = n_genes)
colnames(bulk) <- colnames(reference)
rownames(bulk) <- paste0("Sample", 1:3)

# Perform deconvolution
result <- dw$deconvolve(bulk, method = "nnls")

# View estimated proportions
print(round(result$proportions, 3))
#>         CellType1 CellType2 CellType3 CellType4 CellType5
#> Sample1     0.210     0.171     0.275     0.148     0.197
#> Sample2     0.078     0.131     0.299     0.233     0.261
#> Sample3     0.189     0.178     0.206     0.167     0.260
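
For readers who want to see what a non-negative least squares (NNLS) fit looks like outside the darwin object, here is a minimal sketch using the nnls package directly. It assumes the bulk profile is a non-negative linear mix of the cell-type profiles and that proportions are obtained by rescaling the coefficients to sum to 1; whether `dw$deconvolve()` follows exactly these steps is an assumption. The names `ref`, `true_prop`, and `bulk_sample` are hypothetical.

```r
library(nnls)  # provides nnls(A, b)

set.seed(42)

# Hypothetical reference: 3 cell types x 30 genes
ref <- matrix(abs(rnorm(3 * 30, mean = 2, sd = 1)), nrow = 3,
              dimnames = list(paste0("CellType", 1:3), paste0("Gene", 1:30)))

# One noise-free bulk sample mixed from known proportions
true_prop <- c(0.5, 0.3, 0.2)
bulk_sample <- as.vector(true_prop %*% ref)

# Solve bulk ~ t(ref) %*% coef subject to coef >= 0, then rescale to sum to 1
fit <- nnls(t(ref), bulk_sample)
estimated <- fit$x / sum(fit$x)
names(estimated) <- rownames(ref)
round(estimated, 3)
```

Because the simulated sample contains no noise, the estimate should match `true_prop` up to numerical tolerance. The mock bulk data in the example above are random and unrelated to the reference, so the proportions printed there carry no biological meaning.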

Summary

The darwin object provides a summary of the current state:

print(dw)

Session Info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_4.0.3  darwin_1.0.0   rmarkdown_2.31
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.10         future_1.70.0       generics_0.1.4     
#>  [4] lattice_0.22-9      listenv_0.10.1      digest_0.6.39      
#>  [7] magrittr_2.0.5      evaluate_1.0.5      grid_4.6.0         
#> [10] RColorBrewer_1.1-3  fastmap_1.2.0       jsonlite_2.0.0     
#> [13] Matrix_1.7-5        mgcv_1.9-4          scales_1.4.0       
#> [16] codetools_0.2-20    jquerylib_0.1.4     cli_3.6.6          
#> [19] rlang_1.2.0         parallelly_1.47.0   future.apply_1.20.2
#> [22] splines_4.6.0       withr_3.0.2         cachem_1.1.0       
#> [25] yaml_2.3.12         otel_0.2.0          tools_4.6.0        
#> [28] parallel_4.6.0      dplyr_1.2.1         globals_0.19.1     
#> [31] buildtools_1.0.0    vctrs_0.7.3         R6_2.6.1           
#> [34] lifecycle_1.0.5     pkgconfig_2.0.3     pillar_1.11.1      
#> [37] bslib_0.11.0        gtable_0.3.6        glue_1.8.1         
#> [40] Rcpp_1.1.1-1.1      xfun_0.57           tibble_3.3.1       
#> [43] tidyselect_1.2.1    sys_3.4.3           knitr_1.51         
#> [46] farver_2.1.2        htmltools_0.5.9     nlme_3.1-169       
#> [49] maketools_1.3.2     labeling_0.4.3      compiler_4.6.0     
#> [52] S7_0.2.2            nnls_1.6