---
title: "Getting Started with darwin"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Getting Started with darwin}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  warning = FALSE,
  message = FALSE
)
```

## Introduction

**darwin** is an R package for automatic marker gene selection using multi-objective evolutionary optimization. The package implements the NSGA-II algorithm to identify Pareto-optimal gene subsets for bulk RNA-seq deconvolution.

### Why darwin?

Traditional marker gene selection often relies on single-objective criteria, which may lead to suboptimal solutions. darwin addresses this by:

- **Multi-objective optimization**: Simultaneously balancing multiple criteria
- **Pareto optimality**: Providing a diverse set of trade-off solutions
- **Automated selection**: Reducing manual intervention in gene selection

## Installation

```{r install, eval = FALSE}
# From R-universe (recommended)
install.packages("darwin", repos = "https://zaoqu-liu.r-universe.dev")

# From GitHub
remotes::install_github("Zaoqu-Liu/darwin")
```

## Quick Start

### Load the Package

```{r load}
library(darwin)
```

### Prepare Reference Data

darwin requires a reference expression matrix where rows are cell types and columns are genes.

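
As an illustration of this layout, the sketch below builds a small cell types × genes matrix by averaging a genes × cells matrix within each cell type. The objects `toy_expr`, `toy_labels`, and `toy_reference` are hypothetical and exist only for this example. Averaging is one common way to summarize single-cell profiles and is not necessarily what darwin does internally.

```{r reference-layout-sketch}
# Hypothetical toy input: 4 genes x 6 cells (names are illustrative only).
# matrix() fills column by column, so each row of numbers below is one cell.
toy_expr <- matrix(
  c( 1,  2, 3, 4,
     3,  4, 5, 6,
    10,  0, 0, 2,
    12,  2, 0, 4,
     0,  8, 1, 1,
     0, 10, 3, 1),
  nrow = 4,
  dimnames = list(paste0("Gene", 1:4), paste0("Cell", 1:6))
)

# One cell-type label per cell (per column of toy_expr)
toy_labels <- c("A", "A", "B", "B", "C", "C")

# Transpose so cells are rows, then sum expression within each cell type
toy_sums <- rowsum(t(toy_expr), group = toy_labels)

# Divide each cell-type row by its number of cells to get the mean profile
toy_counts <- as.vector(table(toy_labels)[rownames(toy_sums)])
toy_reference <- toy_sums / toy_counts

dim(toy_reference)  # cell types x genes
```

The result has one row per cell type and one column per gene, which is the orientation described above. The simulated matrix that follows uses the same layout.
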
```{r create-data}
set.seed(42)

# Simulate reference data: 5 cell types × 200 genes
n_celltypes <- 5
n_genes <- 200

reference <- matrix(
  abs(rnorm(n_celltypes * n_genes, mean = 2, sd = 1)),
  nrow = n_celltypes,
  ncol = n_genes
)
rownames(reference) <- paste0("CellType", seq_len(n_celltypes))
colnames(reference) <- paste0("Gene", seq_len(n_genes))

# Add 10 cell-type-specific marker genes per cell type
# by raising their expression by 5 in that cell type only
for (i in seq_len(n_celltypes)) {
  marker_start <- (i - 1) * 10 + 1
  marker_end <- i * 10
  reference[i, marker_start:marker_end] <-
    reference[i, marker_start:marker_end] + 5
}

print(dim(reference))
```

### Initialize darwin

```{r init}
dw <- darwin(reference)
```

### Run Optimization

```{r optimize}
dw$optimize(
  ngen = 50,                                  # Number of generations
  objectives = c("correlation", "distance"),  # Objectives
  weights = c(-1, 1),                         # Minimize correlation, maximize distance
  pop_size = 50,                              # Population size
  verbose = FALSE,
  parallel = FALSE
)
```

### Visualize Pareto Front

```{r plot-pareto, fig.cap="Pareto front showing the trade-off between correlation and distance objectives."}
dw$plot()
```

### Select Optimal Solution

```{r select}
# Select using weighted criteria
dw$select(weights = c(-1, 1))

# Get selected genes
genes <- dw$get_genes()
cat("Number of selected genes:", length(genes), "\n")
cat("First 10 genes:", paste(head(genes, 10), collapse = ", "), "\n")
```

### View Fitness Values

```{r fitness}
fitness <- dw$get_fitness()
head(fitness)
```

## Working with Seurat Objects

darwin can also be initialized directly from a Seurat object:

```{r seurat, eval = FALSE}
# From Seurat object
dw <- darwin(
  seurat_obj,
  celltype_key = "cell_type",
  assay = "RNA",
  layer = "data"
)

# Use only highly variable genes
dw <- darwin(
  seurat_obj,
  celltype_key = "cell_type",
  use_highly_variable = TRUE
)
```

## Basic Deconvolution

```{r deconv}
# Create mock bulk data: 3 samples × 200 genes, with the same gene names as the reference
bulk <- matrix(abs(rnorm(3 * n_genes)), nrow = 3, ncol = n_genes)
colnames(bulk) <- colnames(reference)
rownames(bulk) <- paste0("Sample", 1:3)

# Perform deconvolution
result <- dw$deconvolve(bulk, method = "nnls")

# View estimated proportions
print(round(result$proportions, 3))
```

## Summary

Printing the darwin object summarizes its current state:

```{r print}
print(dw)
```

## Session Info

```{r session}
sessionInfo()
```