---
title: "Identifier Translation and Mapping in OmnipathR"
author:
  - name: Zaoqu Liu
    email: liuzaoqu@163.com
    affiliation: Department of Interventional Radiology, The First Affiliated Hospital of Zhengzhou University
  - name: Dénes Türei
    email: turei.denes@gmail.com
  - name: Julio Saez-Rodriguez
    affiliation: Institute for Computational Biomedicine, Heidelberg University
package: OmnipathR
output:
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    toc: true
    toc_depth: 3
    number_sections: true
    fig_caption: true
pkgdown:
  as_is: true
vignette: |
  %\VignetteIndexEntry{Identifier Translation and Mapping in OmnipathR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

Biological databases use diverse identifier systems to reference genes, proteins, and other molecular entities. **OmnipathR** provides a unified framework for translating between these identifier systems, supporting seamless data integration across multiple resources.

## Supported Identifier Types

OmnipathR supports translation between numerous identifier systems:

| Category | Identifier Types |
|----------|------------------|
| **Protein** | UniProt AC, UniProt ID, RefSeq |
| **Gene** | HGNC Symbol, Entrez Gene ID, Ensembl Gene ID |
| **Transcript** | Ensembl Transcript ID, RefSeq mRNA |
| **Peptide** | Ensembl Peptide ID |
| **External** | KEGG, Reactome, PDB, InterPro |

# Theoretical Framework

## Identifier Mapping Architecture

The identifier translation in OmnipathR follows a hub-and-spoke model:

```
                    ┌─────────────┐
                    │   UniProt   │
                    │    (Hub)    │
                    └──────┬──────┘
           ┌───────────────┼───────────────┐
           │               │               │
    ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
    │   Ensembl   │ │    HGNC     │ │   Entrez    │
    │   Gene ID   │ │   Symbol    │ │   Gene ID   │
    └─────────────┘ └─────────────┘ └─────────────┘
```

## Handling Ambiguous Mappings

Identifier mappings can be:

- **One-to-one**: Single source maps to single target
- **One-to-many**: Single source maps to multiple targets (e.g., gene with multiple isoforms)
- **Many-to-one**: Multiple sources map to single target (e.g., synonyms)
- **Many-to-many**: Complex relationships

OmnipathR provides tools to quantify and handle these ambiguities.

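The four mapping categories listed above can be made concrete with a small, self-contained sketch. This is not OmnipathR's own implementation: the table `toy_map`, the helper `classify_mappings()` and all identifiers in it are hypothetical and serve only to illustrate the idea. Each source-target pair is labelled by counting how many distinct targets its source has and how many distinct sources its target has.

```r
library(dplyr)
library(tibble)

# Hypothetical toy mapping table; the identifiers are made up for illustration
toy_map <- tribble(
  ~from, ~to,
  "A1",  "X",   # one-to-one
  "B1",  "Y1",  # one-to-many
  "B1",  "Y2",
  "C1",  "Z",   # many-to-one
  "C2",  "Z",
  "D1",  "W1",  # many-to-many
  "D1",  "W2",
  "D2",  "W1",
  "D2",  "W2"
)

# Hypothetical helper: label every from-to pair with its mapping category
classify_mappings <- function(map) {
  map %>%
    distinct(from, to) %>%
    # Number of distinct targets per source identifier
    group_by(from) %>%
    mutate(n_to = n_distinct(to)) %>%
    # Number of distinct sources per target identifier
    group_by(to) %>%
    mutate(n_from = n_distinct(from)) %>%
    ungroup() %>%
    mutate(
      mapping = case_when(
        n_to == 1 & n_from == 1 ~ "one-to-one",
        n_to > 1 & n_from == 1 ~ "one-to-many",
        n_to == 1 & n_from > 1 ~ "many-to-one",
        TRUE ~ "many-to-many"
      )
    )
}

classified <- classify_mappings(toy_map)

# Tabulate how many pairs fall into each category
count(classified, mapping)
```

Under this pairwise definition a pair counts as many-to-many only when both its source and its target are ambiguous, so a partially connected cluster can contain pairs from several categories. Whether that is the right granularity depends on the downstream analysis.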
# Basic Usage

```{r load-packages}
library(OmnipathR)
library(dplyr)
library(tibble)
library(ggplot2)
```

## Simple Vector Translation

```{r simple-translation}
# Translate UniProt IDs to gene symbols
uniprot_ids <- c("P00533", "P04637", "P31749", "P42345")

# Using translate_ids with vector input
result <- translate_ids(
  uniprot_ids,
  uniprot,
  genesymbol
)

print(result)
```

## Data Frame Translation

```{r df-translation}
# Create example data frame
expression_data <- tibble(
  uniprot = c("P00533", "P04637", "P31749", "P42345", "P06400"),
  log2fc = c(2.5, -1.8, 1.2, 0.8, -2.1),
  pvalue = c(0.001, 0.005, 0.02, 0.15, 0.003)
)

# Add gene symbol column
expression_annotated <- translate_ids(
  expression_data,
  uniprot,
  genesymbol
)

print(expression_annotated)
```

# Visualization of Translation Results

```{r translation-viz, fig.cap="Expression data with translated gene symbols showing log2 fold change values."}
# Visualize translated data
if (nrow(expression_annotated) > 0 &&
    "genesymbol" %in% colnames(expression_annotated)) {

  # Keep only rows whose UniProt ID was translated
  plot_data <- expression_annotated %>%
    filter(!is.na(genesymbol))

  if (nrow(plot_data) > 0) {
    ggplot(plot_data, aes(x = reorder(genesymbol, log2fc), y = log2fc,
                          fill = log2fc > 0)) +
      geom_col(alpha = 0.8) +
      geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
      scale_fill_manual(values = c("TRUE" = "#E74C3C", "FALSE" = "#3498DB"),
                        guide = "none") +
      coord_flip() +
      labs(
        title = "Differential Expression",
        subtitle = "UniProt IDs translated to gene symbols",
        x = "Gene Symbol",
        y = "Log2 Fold Change"
      ) +
      theme_minimal() +
      theme(plot.title = element_text(face = "bold"))
  }
}
```

# Advanced Translation Features

## Organism-Specific Translation

```{r organism-specific}
# Human protein translation (default)
human_result <- translate_ids(
  c("P00533", "P04637"),
  uniprot,
  genesymbol,
  organism = 9606  # Human NCBI Taxonomy ID
)

print(human_result)
```

# Integration with OmniPath Data

## Annotating Interaction Data

```{r annotate-interactions}
# Get interactions
interactions <- omnipath(resources = "SIGNOR")

# Check columns
cat("Interaction columns:",
    paste(head(colnames(interactions), 10), collapse = ", "), "...\n")
cat("Total interactions:", nrow(interactions), "\n")

# Show sample with gene symbols
interactions %>%
  select(source, source_genesymbol, target, target_genesymbol) %>%
  head(10)
```

# Mapping Statistics

```{r mapping-stats, fig.cap="Number of translated and untranslated identifiers in a sample of UniProt IDs taken from the SIGNOR interactions."}
# Get all UniProts from interactions
all_uniprots <- unique(c(interactions$source, interactions$target))
cat("Unique proteins in network:", length(all_uniprots), "\n")

# Translate the first 100 IDs as a sample
sample_uniprots <- head(all_uniprots, 100)
sample_translated <- translate_ids(
  data.frame(uniprot = sample_uniprots),
  uniprot,
  genesymbol
)

# Calculate success rate
success_rate <- sum(!is.na(sample_translated$genesymbol)) /
  nrow(sample_translated) * 100
cat("Translation success rate:", round(success_rate, 1), "%\n")

# Visualize
mapping_summary <- data.frame(
  Status = c("Translated", "Not Found"),
  Count = c(sum(!is.na(sample_translated$genesymbol)),
            sum(is.na(sample_translated$genesymbol)))
)

ggplot(mapping_summary, aes(x = Status, y = Count, fill = Status)) +
  geom_col(alpha = 0.8) +
  geom_text(aes(label = Count), vjust = -0.5) +
  scale_fill_manual(values = c("Translated" = "#27AE60",
                               "Not Found" = "#E74C3C")) +
  labs(
    title = "ID Translation Results",
    subtitle = paste("Sample of", length(sample_uniprots), "UniProt IDs"),
    y = "Number of IDs"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "none"
  ) +
  expand_limits(y = max(mapping_summary$Count) * 1.1)
```

# Session Information

```{r session-info}
sessionInfo()
```

# References

1. Türei D, et al. Integrated intra- and intercellular signaling knowledge for multicellular omics analysis. *Molecular Systems Biology* 2021;17:e9923.

2. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. *Nucleic Acids Research* 2021;49:D480-D489.

3. Cunningham F, et al. Ensembl 2022. *Nucleic Acids Research* 2022;50:D988-D995.