---
title: "Identifier Translation and Mapping in OmnipathR"
author:
- name: Zaoqu Liu
  email: liuzaoqu@163.com
  affiliation: Department of Interventional Radiology, The First Affiliated Hospital of Zhengzhou University
- name: Dénes Türei
  email: turei.denes@gmail.com
- name: Julio Saez-Rodriguez
  affiliation: Institute for Computational Biomedicine, Heidelberg University
package: OmnipathR
output:
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    toc: true
    toc_depth: 3
    number_sections: true
    fig_caption: true
pkgdown:
  as_is: true
vignette: |
  %\VignetteIndexEntry{Identifier Translation and Mapping in OmnipathR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE,
    collapse = TRUE,
    comment = "#>"
)
```

# Introduction

Biological databases use different identifier systems to reference genes, proteins, and other molecular entities. **OmnipathR** provides a unified framework for translating between these systems, so that data keyed by different identifiers can be combined across resources.

## Supported Identifier Types

OmnipathR supports translation between numerous identifier systems:

| Category | Identifier Types |
|----------|------------------|
| **Protein** | UniProt AC, UniProt ID, RefSeq |
| **Gene** | HGNC Symbol, Entrez Gene ID, Ensembl Gene ID |
| **Transcript** | Ensembl Transcript ID, RefSeq mRNA |
| **Peptide** | Ensembl Peptide ID |
| **External** | KEGG, Reactome, PDB, InterPro |

# Theoretical Framework

## Identifier Mapping Architecture

Identifier translation in OmnipathR follows a hub-and-spoke model with UniProt at the center:

```
                    ┌─────────────┐
                    │   UniProt   │
                    │   (Hub)     │
                    └──────┬──────┘
           ┌───────────────┼───────────────┐
           │               │               │
    ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │   Ensembl   │ │    HGNC     │ │   Entrez    │
    │   Gene ID   │ │   Symbol    │ │   Gene ID   │
    └─────────────┘ └─────────────┘ └─────────────┘
```
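One way to read the diagram is that a translation between two spokes can be composed from two spoke-to-hub tables. The chunk below is a toy illustration of that idea with small hard-coded tables (`entrez_to_uniprot` and `uniprot_to_symbol` are hypothetical names introduced here); it is not a description of OmnipathR's internal implementation.

```{r hub-spoke-sketch}
library(dplyr)
library(tibble)

# Hypothetical spoke-to-hub tables, hard-coded for illustration only;
# a real translation would obtain these from the upstream databases.
entrez_to_uniprot <- tibble(
    entrez = c("1956", "7157", "207"),
    uniprot = c("P00533", "P04637", "P31749")
)
uniprot_to_symbol <- tibble(
    uniprot = c("P00533", "P04637", "P31749"),
    genesymbol = c("EGFR", "TP53", "AKT1")
)

# Spoke -> hub -> spoke: chain the two tables through the UniProt column
entrez_to_symbol <- entrez_to_uniprot %>%
    inner_join(uniprot_to_symbol, by = "uniprot") %>%
    select(entrez, genesymbol)

entrez_to_symbol
```

The appeal of this layout is that supporting N identifier types needs only N tables to the hub, rather than N(N-1)/2 pairwise tables.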

## Handling Ambiguous Mappings

Identifier mappings can be:

- **One-to-one**: A single source identifier maps to a single target
- **One-to-many**: A single source maps to multiple targets (e.g., a gene with multiple isoforms)
- **Many-to-one**: Multiple sources map to a single target (e.g., synonyms)
- **Many-to-many**: Several sources share several targets, combining the two previous cases

OmnipathR provides tools to quantify and handle these ambiguities.
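Independently of those tools, the four categories can be checked on any mapping table with plain dplyr. The following is a minimal sketch on an invented table (the identifiers `A`, `b1`, and so on are placeholders, not real IDs). It labels each source-target pair by counting how many targets the source has and how many sources the target has, which is a simplification: the label is assigned per pair, not per connected group of identifiers.

```{r ambiguity-sketch}
library(dplyr)
library(tibble)

# Hypothetical mapping table, invented to cover all four cases
mapping <- tibble(
    from = c("A", "B", "B", "C", "D", "E", "E", "F", "F"),
    to = c("a", "b1", "b2", "cd", "cd", "x", "y", "x", "y")
)

ambiguity <- mapping %>%
    # Number of distinct targets per source
    group_by(from) %>%
    mutate(n_to = n_distinct(to)) %>%
    # Number of distinct sources per target
    group_by(to) %>%
    mutate(n_from = n_distinct(from)) %>%
    ungroup() %>%
    mutate(
        type = case_when(
            n_to == 1 & n_from == 1 ~ "one-to-one",
            n_to > 1 & n_from == 1 ~ "one-to-many",
            n_to == 1 & n_from > 1 ~ "many-to-one",
            TRUE ~ "many-to-many"
        )
    )

# How many pairs fall into each category
count(ambiguity, type)
```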

# Basic Usage

```{r load-packages}
library(OmnipathR)
library(dplyr)
library(tibble)
library(ggplot2)
```

## Simple Vector Translation

```{r simple-translation}
# Translate UniProt IDs to gene symbols
uniprot_ids <- c("P00533", "P04637", "P31749", "P42345")

# Using translate_ids with vector input
result <- translate_ids(
    uniprot_ids,
    uniprot,
    genesymbol
)

print(result)
```

## Data Frame Translation

```{r df-translation}
# Create example data frame
expression_data <- tibble(
    uniprot = c("P00533", "P04637", "P31749", "P42345", "P06400"),
    log2fc = c(2.5, -1.8, 1.2, 0.8, -2.1),
    pvalue = c(0.001, 0.005, 0.02, 0.15, 0.003)
)

# Add gene symbol column
expression_annotated <- translate_ids(
    expression_data,
    uniprot,
    genesymbol
)

print(expression_annotated)
```
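Conceptually, adding a translated column to a data frame resembles a left join against a mapping table. The sketch below is an analogy built on invented data, not a description of how `translate_ids` is implemented; `id_map`, the alias `TP53_ALT`, and the accession `X99999` are made up for the demonstration. It shows two effects worth keeping in mind when reading translated tables: an identifier with several targets is repeated on several rows, and an identifier without a match is kept with `NA`.

```{r left-join-sketch}
library(dplyr)
library(tibble)

measurements <- tibble(
    uniprot = c("P00533", "P04637", "X99999"),  # "X99999" is a made-up ID
    log2fc = c(2.5, -1.8, 0.4)
)

# Hypothetical mapping table; "TP53_ALT" is an invented second target
# used only to demonstrate row expansion
id_map <- tibble(
    uniprot = c("P00533", "P04637", "P04637"),
    genesymbol = c("EGFR", "TP53", "TP53_ALT")
)

# Every measurement row is kept; matches add the genesymbol column
annotated <- left_join(measurements, id_map, by = "uniprot")
annotated

# Row counts and identifier counts differ after the join
id_counts <- annotated %>%
    summarise(
        n_rows = n(),
        n_ids = n_distinct(uniprot),
        n_ids_translated = n_distinct(uniprot[!is.na(genesymbol)])
    )
id_counts
```

Because rows can multiply, summary statistics are usually safer when computed per input identifier than per row.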

# Visualization of Translation Results

```{r translation-viz, fig.cap="Expression data with translated gene symbols showing log2 fold change values."}
# Visualize translated data
if(nrow(expression_annotated) > 0 && "genesymbol" %in% colnames(expression_annotated)) {
    plot_data <- expression_annotated %>%
        filter(!is.na(genesymbol))
    
    if(nrow(plot_data) > 0) {
        ggplot(plot_data, aes(x = reorder(genesymbol, log2fc), y = log2fc, fill = log2fc > 0)) +
            geom_col(alpha = 0.8) +
            geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
            scale_fill_manual(values = c("TRUE" = "#E74C3C", "FALSE" = "#3498DB"), guide = "none") +
            coord_flip() +
            labs(
                title = "Differential Expression",
                subtitle = "UniProt IDs translated to gene symbols",
                x = "Gene Symbol",
                y = "Log2 Fold Change"
            ) +
            theme_minimal() +
            theme(plot.title = element_text(face = "bold"))
    }
}
```

# Advanced Translation Features

## Organism-Specific Translation

```{r organism-specific}
# Human protein translation (default)
human_result <- translate_ids(
    c("P00533", "P04637"),
    uniprot,
    genesymbol,
    organism = 9606  # Human NCBI Taxonomy ID
)

print(human_result)
```

# Integration with OmniPath Data

## Annotating Interaction Data

```{r annotate-interactions}
# Retrieve the interactions curated by SIGNOR
interactions <- omnipath(resources = "SIGNOR")

# Check columns
cat("Interaction columns:", paste(head(colnames(interactions), 10), collapse = ", "), "...\n")
cat("Total interactions:", nrow(interactions), "\n")

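# The interaction table already carries gene symbols next to the
# UniProt accessions, so no separate translation step is needed here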
interactions %>%
    select(source, source_genesymbol, target, target_genesymbol) %>%
    head(10)
```

# Mapping Statistics

```{r mapping-stats, fig.cap="Number of sampled UniProt IDs from the SIGNOR network that were translated to a gene symbol versus not found."}
# Get all UniProts from interactions
all_uniprots <- unique(c(interactions$source, interactions$target))
cat("Unique proteins in network:", length(all_uniprots), "\n")

# Sample translation
sample_uniprots <- head(all_uniprots, 100)
sample_translated <- translate_ids(
    data.frame(uniprot = sample_uniprots),
    uniprot,
    genesymbol
)

# Calculate the success rate per input ID rather than per row,
# because a one-to-many mapping returns several rows for one ID
n_total <- length(sample_uniprots)
n_translated <- sample_translated %>%
    filter(!is.na(genesymbol)) %>%
    distinct(uniprot) %>%
    nrow()
success_rate <- n_translated / n_total * 100
cat("Translation success rate:", round(success_rate, 1), "%\n")

# Visualize
mapping_summary <- data.frame(
    Status = c("Translated", "Not Found"),
    Count = c(n_translated, n_total - n_translated)
)

ggplot(mapping_summary, aes(x = Status, y = Count, fill = Status)) +
    geom_col(alpha = 0.8) +
    geom_text(aes(label = Count), vjust = -0.5) +
    scale_fill_manual(values = c("Translated" = "#27AE60", "Not Found" = "#E74C3C")) +
    labs(
        title = "ID Translation Results",
        subtitle = paste("Sample of", length(sample_uniprots), "UniProt IDs"),
        y = "Number of IDs"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = "bold"),
        legend.position = "none"
    ) +
    expand_limits(y = max(mapping_summary$Count) * 1.1)
```

# Session Information

```{r session-info}
sessionInfo()
```

# References

1. Türei D, et al. Integrated intra- and intercellular signaling knowledge for multicellular omics analysis. *Molecular Systems Biology* 2021;17:e9923.

2. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. *Nucleic Acids Research* 2021;49:D480-D489.

3. Cunningham F, et al. Ensembl 2022. *Nucleic Acids Research* 2022;50:D988-D995.