---
title: "Getting Started with darwin"
author: "Zaoqu Liu"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Getting Started with darwin}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  warning = FALSE,
  message = FALSE
)
```

## Introduction

**darwin** is an R package for automatic marker gene selection using multi-objective evolutionary optimization. The package implements the NSGA-II algorithm to identify Pareto-optimal gene subsets for bulk RNA-seq deconvolution.

### Why darwin?

Traditional marker gene selection often ranks genes by a single criterion, which can favor one property of a good marker set at the expense of others. darwin addresses this through:

- **Multi-objective optimization**: Simultaneously balancing multiple criteria
- **Pareto optimality**: Providing a diverse set of trade-off solutions
- **Automated selection**: Reducing manual intervention in gene selection
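
To make the idea of Pareto optimality concrete, here is a minimal base R sketch that filters a handful of made-up candidates down to the non-dominated ones. The object names (`toy_scores`, `dominates`, `is_pareto`) and the scores are illustrative assumptions and are not part of darwin; the sketch only assumes that lower correlation and higher distance are preferred, as in the Quick Start below.

```{r sketch-pareto}
# Toy candidates scored on two objectives (values are made up for illustration)
toy_scores <- data.frame(
  correlation = c(0.20, 0.35, 0.50, 0.30, 0.60),  # lower is better
  distance    = c(1.0,  2.0,  3.0,  1.5,  2.5)    # higher is better
)

# Candidate a dominates b if a is at least as good on both objectives
# and strictly better on at least one
dominates <- function(a, b) {
  all(a$correlation <= b$correlation, a$distance >= b$distance) &&
    any(a$correlation < b$correlation, a$distance > b$distance)
}

# A candidate is Pareto-optimal if no other candidate dominates it
is_pareto <- vapply(seq_len(nrow(toy_scores)), function(j) {
  !any(vapply(seq_len(nrow(toy_scores)), function(k) {
    k != j && dominates(toy_scores[k, ], toy_scores[j, ])
  }, logical(1)))
}, logical(1))

toy_scores[is_pareto, ]
```

The last candidate is dropped because the third one has both lower correlation and higher distance; the remaining four each win on at least one objective, so none of them can be discarded without a trade-off.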

## Installation

```{r install, eval = FALSE}
# From R-universe (recommended)
install.packages("darwin", repos = "https://zaoqu-liu.r-universe.dev")

# From GitHub
remotes::install_github("Zaoqu-Liu/darwin")
```

## Quick Start

### Load the Package

```{r load}
library(darwin)
```

### Prepare Reference Data

darwin requires a reference expression matrix where rows are cell types and columns are genes.

```{r create-data}
set.seed(42)

# Simulate reference data: 5 cell types × 200 genes
n_celltypes <- 5
n_genes <- 200

reference <- matrix(
  abs(rnorm(n_celltypes * n_genes, mean = 2, sd = 1)),
  nrow = n_celltypes,
  ncol = n_genes
)
rownames(reference) <- paste0("CellType", 1:n_celltypes)
colnames(reference) <- paste0("Gene", 1:n_genes)

# Add 10 cell-type-specific marker genes per cell type
# (Gene1-Gene10 for CellType1, Gene11-Gene20 for CellType2, ...)
for (i in seq_len(n_celltypes)) {
  marker_idx <- ((i - 1) * 10 + 1):(i * 10)
  reference[i, marker_idx] <- reference[i, marker_idx] + 5
}

print(dim(reference))
```
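
If your expression data starts out as a genes × cells matrix with one label per cell, it has to be collapsed to the cell types × genes layout first. The sketch below shows one common way to do that in base R, by averaging cells within each cell type. The names `sc_expr`, `sc_labels`, and `sc_reference` and all values are made up for illustration; whether darwin performs a similar aggregation internally is not assumed here.

```{r sketch-reference}
# Toy single-cell matrix: 4 genes (rows) x 6 cells (columns); values are made up
sc_expr <- matrix(
  c(5, 6, 4, 0, 1, 0,
    0, 1, 0, 7, 8, 6,
    2, 2, 2, 2, 2, 2,
    9, 8, 7, 1, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("Gene", 1:4), paste0("Cell", 1:6))
)
sc_labels <- c("T", "T", "T", "B", "B", "B")

# rowsum() sums rows by group, so transpose to cells x genes first,
# then divide by the group sizes to get the per-cell-type mean
sc_sums <- rowsum(t(sc_expr), group = sc_labels)
sc_reference <- sc_sums / as.vector(table(sc_labels)[rownames(sc_sums)])

sc_reference  # cell types in rows, genes in columns
```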

### Initialize darwin

```{r init}
dw <- darwin(reference)
```

### Run Optimization

```{r optimize}
dw$optimize(
  ngen = 50,                                # Number of generations
  objectives = c("correlation", "distance"), # Objectives
  weights = c(-1, 1),                        # Minimize corr, maximize dist
  pop_size = 50,                             # Population size
  verbose = FALSE,
  parallel = FALSE
)
```
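
The following sketch illustrates what objectives of this kind could measure for a candidate gene subset: the mean pairwise correlation between cell-type profiles (to be minimized) and their mean pairwise Euclidean distance (to be maximized). This is an assumption made for illustration; `score_subset()` and `toy_ref` are hypothetical names, and darwin's own objective definitions may differ.

```{r sketch-objectives}
# Toy reference: 3 cell types x 6 genes (values are made up)
toy_ref <- rbind(
  A = c(8, 7, 1, 1, 2, 1),
  B = c(1, 2, 9, 8, 1, 2),
  C = c(2, 1, 1, 2, 7, 9)
)
colnames(toy_ref) <- paste0("Gene", 1:6)

# Hypothetical scoring function; darwin's objectives may be defined differently
score_subset <- function(ref, gene_idx) {
  sub <- ref[, gene_idx, drop = FALSE]
  cors <- cor(t(sub))  # cell type x cell type correlation matrix
  c(
    correlation = mean(cors[upper.tri(cors)]),  # mean pairwise correlation
    distance    = mean(dist(sub))               # mean pairwise Euclidean distance
  )
}

score_subset(toy_ref, 1:6)         # all six genes
score_subset(toy_ref, c(1, 3, 5))  # one marker-like gene per cell type
```

Because the two scores respond differently to the same change in the gene subset, a single best subset generally does not exist, which is why the optimizer returns a set of trade-off solutions.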

### Visualize Pareto Front

```{r plot-pareto, fig.cap="Pareto front showing the trade-off between correlation and distance objectives."}
dw$plot()
```

### Select Optimal Solution

```{r select}
# Select using weighted criteria
dw$select(weights = c(-1, 1))

# Get selected genes
genes <- dw$get_genes()
cat("Number of selected genes:", length(genes), "\n")
cat("First 10 genes:", paste(head(genes, 10), collapse = ", "), "\n")
```
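
One simple way to turn a set of trade-off solutions into a single choice is a weighted sum over rescaled objectives. The sketch below shows that idea on a made-up Pareto front; it is an assumption about how weights such as `c(-1, 1)` can be applied, not a description of what `dw$select()` does internally. All names (`toy_front`, `toy_weights`, `rescale01`) are hypothetical.

```{r sketch-select}
# Toy Pareto front (values are made up)
toy_front <- data.frame(
  correlation = c(0.20, 0.30, 0.35, 0.50),
  distance    = c(1.0,  1.8,  2.0,  3.0)
)
toy_weights <- c(correlation = -1, distance = 1)

# Rescale each objective to [0, 1] so the weights act on comparable scales
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
toy_scaled <- as.data.frame(lapply(toy_front, rescale01))

# Weighted sum: the negative weight penalizes correlation,
# the positive weight rewards distance
toy_score <- as.matrix(toy_scaled) %*% toy_weights
toy_front[which.max(toy_score), ]
```

In this toy front the second solution wins because it gains more in rescaled distance than it loses in rescaled correlation.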

### View Fitness Values

```{r fitness}
fitness <- dw$get_fitness()
head(fitness)
```

## Working with Seurat Objects

darwin can also be initialized directly from a Seurat object:

```{r seurat, eval = FALSE}
# From Seurat object
dw <- darwin(
  seurat_obj,
  celltype_key = "cell_type",
  assay = "RNA",
  layer = "data"
)

# Use only highly variable genes
dw <- darwin(
  seurat_obj,
  celltype_key = "cell_type",
  use_highly_variable = TRUE
)
```

## Basic Deconvolution

```{r deconv}
# Create mock bulk data: 3 samples × 200 genes, with gene names matching the reference
bulk <- matrix(abs(rnorm(3 * n_genes)), nrow = 3, ncol = n_genes)
colnames(bulk) <- colnames(reference)
rownames(bulk) <- paste0("Sample", 1:3)

# Perform deconvolution
result <- dw$deconvolve(bulk, method = "nnls")

# View estimated proportions
print(round(result$proportions, 3))
```
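
For intuition about non-negative least squares (NNLS), the sketch below recovers known mixing proportions from a noise-free toy sample using only base R. It swaps in `optim()` with a lower bound of zero instead of a dedicated NNLS solver, and it rescales the estimates to sum to 1; both choices are assumptions for illustration and may not match what `dw$deconvolve()` does. All `nn_*` names and values are hypothetical.

```{r sketch-nnls}
# Toy reference: 3 cell types x 6 genes (values are made up)
nn_ref <- rbind(
  A = c(8, 7, 1, 1, 2, 1),
  B = c(1, 2, 9, 8, 1, 2),
  C = c(2, 1, 1, 2, 7, 9)
)

# Noise-free bulk sample mixed from known proportions
nn_truth <- c(A = 0.5, B = 0.3, C = 0.2)
nn_bulk <- as.vector(nn_truth %*% nn_ref)

# Squared reconstruction error and its gradient with respect to the proportions
nn_loss <- function(p) sum((as.vector(p %*% nn_ref) - nn_bulk)^2)
nn_grad <- function(p) as.vector(2 * nn_ref %*% (as.vector(p %*% nn_ref) - nn_bulk))

# Box-constrained optimization (lower = 0) enforces non-negativity
nn_fit <- optim(
  par = rep(1 / 3, 3), fn = nn_loss, gr = nn_grad,
  method = "L-BFGS-B", lower = 0
)

# Rescale so the estimates sum to 1
nn_est <- setNames(nn_fit$par / sum(nn_fit$par), rownames(nn_ref))
round(nn_est, 3)
```

With noise-free input the estimates should land close to the true proportions; real bulk data adds noise and platform differences, so the recovered values are only approximate.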

## Summary

Printing the darwin object summarizes its current state:

```{r print}
print(dw)
```

## Session Info

```{r session}
sessionInfo()
```