Identifier Translation and Mapping in OmnipathR

Introduction

Biological databases use diverse identifier systems to reference genes, proteins, and other molecular entities. OmnipathR provides a unified framework for translating between these identifier systems, so that data from different resources can be combined on a common identifier.

Supported Identifier Types

OmnipathR supports translation between numerous identifier systems:

Category     Identifier Types
-----------  --------------------------------------------
Protein      UniProt AC, UniProt ID, RefSeq
Gene         HGNC Symbol, Entrez Gene ID, Ensembl Gene ID
Transcript   Ensembl Transcript ID, RefSeq mRNA
Peptide      Ensembl Peptide ID
External     KEGG, Reactome, PDB, InterPro

Theoretical Framework

Identifier Mapping Architecture

The identifier translation in OmnipathR follows a hub-and-spoke model:

                    ┌─────────────┐
                    │   UniProt   │
                    │   (Hub)     │
                    └──────┬──────┘
           ┌───────────────┼───────────────┐
           │               │               │
    ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │   Ensembl   │ │    HGNC     │ │   Entrez    │
    │   Gene ID   │ │   Symbol    │ │   Gene ID   │
    └─────────────┘ └─────────────┘ └─────────────┘
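
The hub-and-spoke idea can be sketched with two joins on toy tables. This is only an illustration of the model in the diagram, not OmnipathR's internal implementation: the tables `ensembl_to_uniprot` and `uniprot_to_symbol` and the helper `via_hub()` are hypothetical names introduced here, and the tables hold just two hand-entered human genes.

```r
library(dplyr)
library(tibble)

# Toy spoke tables, each keyed on the UniProt hub
# (hand-entered for illustration; not OmnipathR data structures)
ensembl_to_uniprot <- tibble(
    ensembl_gene = c("ENSG00000146648", "ENSG00000141510"),
    uniprot      = c("P00533", "P04637")
)
uniprot_to_symbol <- tibble(
    uniprot    = c("P00533", "P04637"),
    genesymbol = c("EGFR", "TP53")
)

# Hypothetical helper: translate spoke -> hub -> spoke with two left joins
via_hub <- function(ids, to_hub, from_hub) {
    tibble(ensembl_gene = ids) %>%
        left_join(to_hub, by = "ensembl_gene") %>%
        left_join(from_hub, by = "uniprot")
}

hub_result <- via_hub(
    c("ENSG00000146648", "ENSG00000141510"),
    ensembl_to_uniprot,
    uniprot_to_symbol
)
print(hub_result)
```

The appeal of this layout is that each identifier type needs only one table linking it to the hub, rather than one table for every pair of identifier types.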

Handling Ambiguous Mappings

Identifier mappings can be:

  • One-to-one: A single source maps to a single target
  • One-to-many: A single source maps to multiple targets (e.g., a gene with multiple isoforms)
  • Many-to-one: Multiple sources map to a single target (e.g., synonyms)
  • Many-to-many: Several sources each map to several targets

OmnipathR provides tools to quantify and handle these ambiguities.
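
To make the four categories concrete, the sketch below labels every pair in a toy mapping table using only dplyr. It is not OmnipathR's own ambiguity tooling: the identifiers are made up, the `mapping` table and the `n_to`, `n_from` and `ambiguity` columns are names chosen for this example, and the classification is per pair (how many targets the source has, and how many sources the target has).

```r
library(dplyr)
library(tibble)

# Toy mapping table with made-up identifiers
mapping <- tibble(
    from = c("A", "B",  "B",  "C",  "D",  "E", "E", "F", "F"),
    to   = c("a", "b1", "b2", "cd", "cd", "x", "y", "x", "y")
)

classified <- mapping %>%
    distinct(from, to) %>%
    # number of distinct targets per source
    group_by(from) %>%
    mutate(n_to = n_distinct(to)) %>%
    # number of distinct sources per target
    group_by(to) %>%
    mutate(n_from = n_distinct(from)) %>%
    ungroup() %>%
    mutate(
        ambiguity = case_when(
            n_to == 1 & n_from == 1 ~ "one-to-one",
            n_to >  1 & n_from == 1 ~ "one-to-many",
            n_to == 1 & n_from >  1 ~ "many-to-one",
            TRUE                    ~ "many-to-many"
        )
    )

# Number of pairs in each category
ambiguity_counts <- count(classified, ambiguity)
print(ambiguity_counts)
```

A summary like `ambiguity_counts` gives a quick sense of how much of a mapping is unambiguous before deciding whether to expand, collapse, or filter the ambiguous pairs.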

Basic Usage

library(OmnipathR)
library(dplyr)
library(tibble)
library(ggplot2)

Simple Vector Translation

# Translate UniProt IDs to gene symbols
uniprot_ids <- c("P00533", "P04637", "P31749", "P42345")

# Using translate_ids with vector input
result <- translate_ids(
    uniprot_ids,
    uniprot,
    genesymbol
)

print(result)
#> [1] "EGFR" "TP53" "AKT1" "MTOR"

Data Frame Translation

# Create example data frame
expression_data <- tibble(
    uniprot = c("P00533", "P04637", "P31749", "P42345", "P06400"),
    log2fc = c(2.5, -1.8, 1.2, 0.8, -2.1),
    pvalue = c(0.001, 0.005, 0.02, 0.15, 0.003)
)

# Add gene symbol column
expression_annotated <- translate_ids(
    expression_data,
    uniprot,
    genesymbol
)

print(expression_annotated)
#> # A tibble: 5 × 4
#>   uniprot log2fc pvalue genesymbol
#>   <chr>    <dbl>  <dbl> <chr>     
#> 1 P00533     2.5  0.001 EGFR      
#> 2 P04637    -1.8  0.005 TP53      
#> 3 P31749     1.2  0.02  AKT1      
#> 4 P42345     0.8  0.15  MTOR      
#> 5 P06400    -2.1  0.003 RB1
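
One way to picture data frame translation is as a left join against a mapping table, where identifiers without a match end up as NA in the new column. The sketch below shows that idea offline with dplyr. It is an assumption-laden illustration rather than a description of how `translate_ids` works internally: `id_map` is a hand-made table built from pairs shown in the examples above, and "X00000" is a made-up identifier used to force a miss.

```r
library(dplyr)
library(tibble)

# Hand-made mapping table (pairs taken from the examples above)
id_map <- tibble(
    uniprot    = c("P00533", "P04637", "P31749"),
    genesymbol = c("EGFR", "TP53", "AKT1")
)

measurements <- tibble(
    uniprot = c("P00533", "P04637", "X00000"),  # "X00000" is a made-up ID
    log2fc  = c(2.5, -1.8, 0.4)
)

# A left join keeps every input row; unmatched IDs get NA in genesymbol
annotated <- left_join(measurements, id_map, by = "uniprot")

# Rows that could not be translated, for inspection before downstream analysis
untranslated <- filter(annotated, is.na(genesymbol))

print(annotated)
print(untranslated)
```

Checking for NA values right after translation, as in `untranslated`, avoids silently dropping or mislabeling rows in later plots and joins.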

Visualization of Translation Results

# Visualize translated data
if(nrow(expression_annotated) > 0 && "genesymbol" %in% colnames(expression_annotated)) {
    plot_data <- expression_annotated %>%
        filter(!is.na(genesymbol))
    
    if(nrow(plot_data) > 0) {
        ggplot(plot_data, aes(x = reorder(genesymbol, log2fc), y = log2fc, fill = log2fc > 0)) +
            geom_col(alpha = 0.8) +
            geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
            scale_fill_manual(values = c("TRUE" = "#E74C3C", "FALSE" = "#3498DB"), guide = "none") +
            coord_flip() +
            labs(
                title = "Differential Expression",
                subtitle = "UniProt IDs translated to gene symbols",
                x = "Gene Symbol",
                y = "Log2 Fold Change"
            ) +
            theme_minimal() +
            theme(plot.title = element_text(face = "bold"))
    }
}
Expression data with translated gene symbols showing log2 fold change values.

Advanced Translation Features

Organism-Specific Translation

# Human protein translation (default)
human_result <- translate_ids(
    c("P00533", "P04637"),
    uniprot,
    genesymbol,
    organism = 9606  # Human NCBI Taxonomy ID
)

print(human_result)
#> [1] "EGFR" "TP53"

Integration with OmniPath Data

Annotating Interaction Data

# Get interactions
interactions <- omnipath(resources = "SIGNOR")

# Check columns
cat("Interaction columns:", paste(head(colnames(interactions), 10), collapse = ", "), "...\n")
#> Interaction columns: source, target, source_genesymbol, target_genesymbol, is_directed, is_stimulation, is_inhibition, consensus_direction, consensus_stimulation, consensus_inhibition ...
cat("Total interactions:", nrow(interactions), "\n")
#> Total interactions: 65532

# Show sample with gene symbols
interactions %>%
    select(source, source_genesymbol, target, target_genesymbol) %>%
    head(10)
#> # A tibble: 10 × 4
#>    source source_genesymbol target target_genesymbol
#>    <chr>  <chr>             <chr>  <chr>            
#>  1 Q13976 PRKG1             Q13507 TRPC3            
#>  2 P06241 FYN               Q9Y210 TRPC6            
#>  3 Q13976 PRKG1             Q9Y210 TRPC6            
#>  4 P12931 SRC               Q9Y210 TRPC6            
#>  5 Q13976 PRKG1             Q9HCX4 TRPC7            
#>  6 Q00535 CDK5              Q8NER1 TRPV1            
#>  7 Q13438 OS9               Q9HBA0 TRPV4            
#>  8 P18031 PTPN1             Q9H1D0 TRPV6            
#>  9 P63244 RACK1             Q9BX84 TRPM6            
#> 10 Q9BX84 TRPM6             Q96QT4 TRPM7

Mapping Statistics

# Get all UniProts from interactions
all_uniprots <- unique(c(interactions$source, interactions$target))
cat("Unique proteins in network:", length(all_uniprots), "\n")
#> Unique proteins in network: 6258

# Sample translation
sample_uniprots <- head(all_uniprots, 100)
sample_translated <- translate_ids(
    data.frame(uniprot = sample_uniprots),
    uniprot,
    genesymbol
)

# Calculate success rate
success_rate <- sum(!is.na(sample_translated$genesymbol)) / nrow(sample_translated) * 100
cat("Translation success rate:", round(success_rate, 1), "%\n")
#> Translation success rate: 100 %

# Visualize
mapping_summary <- data.frame(
    Status = c("Translated", "Not Found"),
    Count = c(sum(!is.na(sample_translated$genesymbol)), 
              sum(is.na(sample_translated$genesymbol)))
)

ggplot(mapping_summary, aes(x = Status, y = Count, fill = Status)) +
    geom_col(alpha = 0.8) +
    geom_text(aes(label = Count), vjust = -0.5) +
    scale_fill_manual(values = c("Translated" = "#27AE60", "Not Found" = "#E74C3C")) +
    labs(
        title = "ID Translation Results",
        subtitle = paste("Sample of", length(sample_uniprots), "UniProt IDs"),
        y = "Number of IDs"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(face = "bold"),
        legend.position = "none"
    ) +
    expand_limits(y = max(mapping_summary$Count) * 1.1)
Number of translated and untranslated identifiers in a sample of 100 UniProt IDs from the interaction network.
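
The success-rate calculation above can be wrapped in a small reusable helper, sketched here on toy data so that it runs without a network connection. The function name `translation_coverage()` and its output columns are hypothetical, and "X00000" is a made-up identifier standing in for an ID that failed to translate.

```r
library(tibble)

# Hypothetical helper: summarise how many rows have a non-missing
# value in the translated column
translation_coverage <- function(df, target_col) {
    translated <- !is.na(df[[target_col]])
    tibble(
        total       = nrow(df),
        translated  = sum(translated),
        not_found   = sum(!translated),
        success_pct = round(100 * mean(translated), 1)
    )
}

# Toy translated data; the NA marks an identifier that was not found
toy_translated <- tibble(
    uniprot    = c("P00533", "P04637", "X00000", "P31749"),
    genesymbol = c("EGFR", "TP53", NA, "AKT1")
)

coverage <- translation_coverage(toy_translated, "genesymbol")
print(coverage)
```

Note that for an empty data frame `mean()` returns NaN, so a caller may want to check `total` before reporting `success_pct`.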

Session Information

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
#>  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
#> [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] tibble_3.3.1     bookdown_0.46    magrittr_2.0.5   ggraph_2.2.2     igraph_2.3.1     ggplot2_4.0.3   
#>  [7] dplyr_1.2.1      Matrix_1.7-5     OmnipathR_3.19.1 BiocStyle_2.41.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1    viridisLite_0.4.3   farver_2.1.2        blob_1.3.0          viridis_0.6.5      
#>  [6] R.utils_2.13.0      S7_0.2.2            fastmap_1.2.0       tweenr_2.0.3        XML_3.99-0.23      
#> [11] digest_0.6.39       timechange_0.4.0    lifecycle_1.0.5     RSQLite_3.52.0      compiler_4.6.0     
#> [16] rlang_1.2.0         sass_0.4.10         progress_1.2.3      tools_4.6.0         utf8_1.2.6         
#> [21] yaml_2.3.12         knitr_1.51          labeling_0.4.3      graphlayouts_1.2.3  prettyunits_1.2.0  
#> [26] bit_4.6.0           curl_7.1.0          xml2_1.5.2          RColorBrewer_1.1-3  R.matlab_3.7.0     
#> [31] withr_3.0.2         purrr_1.2.2         sys_3.4.3           R.oo_1.27.1         polyclip_1.10-7    
#> [36] grid_4.6.0          scales_1.4.0        MASS_7.3-65         cli_3.6.6           rmarkdown_2.31     
#> [41] crayon_1.5.3        generics_0.1.4      otel_0.2.0          httr_1.4.8          tzdb_0.5.0         
#> [46] sessioninfo_1.2.3   readxl_1.5.0        DBI_1.3.0           cachem_1.1.0        ggforce_0.5.0      
#> [51] stringr_1.6.0       rvest_1.0.5         parallel_4.6.0      BiocManager_1.30.27 selectr_0.5-1      
#> [56] cellranger_1.1.0    vctrs_0.7.3         jsonlite_2.0.0      hms_1.1.4           ggrepel_0.9.8      
#> [61] bit64_4.8.2         maketools_1.3.2     tidyr_1.3.2         jquerylib_0.1.4     glue_1.8.1         
#> [66] lubridate_1.9.5     stringi_1.8.7       gtable_0.3.6        later_1.4.8         logger_0.4.2       
#> [71] pillar_1.11.1       rappdirs_0.3.4      htmltools_0.5.9     R6_2.6.1            httr2_1.2.2        
#> [76] tcltk_4.6.0         tidygraph_1.3.1     vroom_1.7.1         evaluate_1.0.5      lattice_0.22-9     
#> [81] readr_2.2.0         R.methodsS3_1.8.2   backports_1.5.1     memoise_2.0.1       bslib_0.11.0       
#> [86] Rcpp_1.1.1-1.1      zip_2.3.3           gridExtra_2.3       checkmate_2.3.4     xfun_0.57          
#> [91] fs_2.1.0            buildtools_1.0.0    pkgconfig_2.0.3
