Cross-Species Analysis

Introduction

The iTALK ligand-receptor database uses human gene symbols (e.g., TGFB1, VEGFA). This creates a challenge when analyzing data from other species like mouse, where gene symbols follow different conventions (e.g., Tgfb1, Vegfa).

This vignette describes iTALK’s automatic cross-species conversion system, which enables seamless analysis of non-human data through ortholog mapping via Ensembl BioMart.

Species Detection

Gene Naming Conventions

Different species follow distinct gene naming patterns. Mouse and rat share the same convention, so symbol case alone cannot tell them apart:

Species   Convention      Examples
Human     ALL UPPERCASE   TGFB1, VEGFA, CD8A
Mouse     Title Case      Tgfb1, Vegfa, Cd8a
Rat       Title Case      Tgfb1, Vegfa, Cd8a

Automatic Detection

library(iTALK)

# Human genes
human_result <- detect_species(c("TGFB1", "VEGFA", "IL6", "TNF", "CD8A"))
cat("Human detection:\n")
#> Human detection:
cat("  Species:", human_result$species, "\n")
#>   Species: Homo_sapiens
cat("  Confidence:", round(human_result$confidence * 100, 1), "%\n")
#>   Confidence: 100 %
cat("  Method:", human_result$method, "\n\n")
#>   Method: uppercase_pattern

# Mouse genes
mouse_result <- detect_species(c("Tgfb1", "Vegfa", "Il6", "Tnf", "Cd8a"))
cat("Mouse detection:\n")
#> Mouse detection:
cat("  Species:", mouse_result$species, "\n")
#>   Species: Mus_musculus
cat("  Confidence:", round(mouse_result$confidence * 100, 1), "%\n")
#>   Confidence: 100 %
cat("  Method:", mouse_result$method, "\n\n")
#>   Method: titlecase_pattern

# Mixed (ambiguous)
mixed_result <- detect_species(c("TGFB1", "Vegfa", "IL6", "Tnf"))
cat("Mixed detection:\n")
#> Mixed detection:
cat("  Species:", mixed_result$species, "\n")
#>   Species: unknown
cat("  Confidence:", round(mixed_result$confidence * 100, 1), "%\n")
#>   Confidence: 50 %

Detection Algorithm

┌─────────────────────────────────────┐
│         Input Gene List             │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│  Sample up to 100 unique genes      │
│  Filter: length ≥ 3, contains A-Za-z│
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│     Pattern Matching                │
│  Human: ^[A-Z0-9]+$                 │
│  Mouse: ^[A-Z][a-z0-9]+[A-Za-z0-9]*$│
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│     Calculate Proportions           │
│  prop_human = n_human / n_total     │
│  prop_mouse = n_mouse / n_total     │
└────────────────┬────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────┐
│     Threshold Check (default: 70%)  │
│  if prop_human ≥ 0.7 → Homo_sapiens │
│  if prop_mouse ≥ 0.7 → Mus_musculus │
│  else → unknown                     │
└─────────────────────────────────────┘
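
The flowchart can be sketched in a few lines of base R. This is a hypothetical re-implementation for illustration, not iTALK's actual source. The function name sketch_detect_species, its argument names, and the choice to report the larger proportion as the confidence of an "unknown" result are assumptions. The two patterns, the 100-gene sample, and the 70% threshold are taken from the diagram.

```r
# Illustrative sketch of the flowchart above; NOT the iTALK implementation.
sketch_detect_species <- function(genes, threshold = 0.7, max_genes = 100) {
  genes <- unique(as.character(genes))
  # Filter: keep symbols with at least 3 characters and at least one letter
  genes <- genes[nchar(genes) >= 3 & grepl("[A-Za-z]", genes, perl = TRUE)]
  if (length(genes) == 0) {
    return(list(species = "unknown", confidence = 0, method = NA_character_))
  }
  # Sample up to max_genes unique symbols
  if (length(genes) > max_genes) genes <- sample(genes, max_genes)

  # Pattern matching (perl = TRUE so ranges like [A-Z] are compared by code point)
  prop_human <- mean(grepl("^[A-Z0-9]+$", genes, perl = TRUE))
  prop_mouse <- mean(grepl("^[A-Z][a-z0-9]+[A-Za-z0-9]*$", genes, perl = TRUE))

  # Threshold check, in the order shown in the diagram
  if (prop_human >= threshold) {
    list(species = "Homo_sapiens", confidence = prop_human,
         method = "uppercase_pattern")
  } else if (prop_mouse >= threshold) {
    list(species = "Mus_musculus", confidence = prop_mouse,
         method = "titlecase_pattern")
  } else {
    # Assumption: report the larger proportion when neither pattern wins
    list(species = "unknown", confidence = max(prop_human, prop_mouse),
         method = NA_character_)
  }
}

sketch_human <- sketch_detect_species(c("TGFB1", "VEGFA", "IL6", "TNF", "CD8A"))
sketch_mouse <- sketch_detect_species(c("Tgfb1", "Vegfa", "Il6", "Tnf", "Cd8a"))
sketch_mixed <- sketch_detect_species(c("TGFB1", "Vegfa", "IL6", "Tnf"))
```

Note that the two patterns are not mutually exclusive: an uppercase symbol whose second character is a digit (for example A1BG) satisfies both, so the order of the threshold checks matters for gene lists dominated by such symbols.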

Ortholog Mapping via BioMart

How It Works

When mouse genes are detected, iTALK queries Ensembl BioMart to retrieve ortholog mappings:

# Manual conversion example
conversion <- convert_species_biomart(
  genes = c("Tgfb1", "Vegfa", "Ctnnb1", "Cd8a", "Ptprc"),
  from_species = "Mus_musculus",
  to_species = "Homo_sapiens",
  ensembl_version = 103,  # Fixed version for reproducibility
  cache = TRUE
)

# View mapping results
conversion$mapping
#>   from_gene to_gene
#> 1    Tgfb1   TGFB1
#> 2    Vegfa   VEGFA
#> 3   Ctnnb1  CTNNB1
#> 4     Cd8a    CD8A
#> 5    Ptprc   PTPRC

# Statistics
conversion$stats
#> $n_input: 5
#> $n_mapped: 5
#> $mapping_rate: 1.0

BioMart Query Details

The query filters the mouse dataset by gene symbol and retrieves the symbol of each human ortholog (the hsapiens_homolog_associated_gene_name attribute):

Dataset: mmusculus_gene_ensembl
Filter: external_gene_name (mouse symbols)
Attribute: hsapiens_homolog_associated_gene_name
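
For readers who want to reproduce the query by hand, the following sketch issues an equivalent request with the Bioconductor package biomaRt. It assumes that biomaRt is installed and that Ensembl is reachable; whether iTALK calls biomaRt in exactly this way is not stated in this vignette. The helper names sketch_query_orthologs and sketch_tidy_orthologs are hypothetical.

```r
# Sketch only: assumes biomaRt is installed and Ensembl is reachable.
# Not taken from iTALK's source.
sketch_query_orthologs <- function(mouse_symbols, ensembl_version = 103) {
  mart <- biomaRt::useEnsembl(
    biomart = "genes",
    dataset = "mmusculus_gene_ensembl",
    version = ensembl_version
  )
  biomaRt::getBM(
    attributes = c("external_gene_name",
                   "hsapiens_homolog_associated_gene_name"),
    filters = "external_gene_name",
    values = mouse_symbols,
    mart = mart
  )
}

# Rename the raw columns to from_gene / to_gene and drop genes for which
# no human symbol was returned (empty string or NA)
sketch_tidy_orthologs <- function(raw) {
  out <- data.frame(
    from_gene = raw$external_gene_name,
    to_gene = raw$hsapiens_homolog_associated_gene_name,
    stringsAsFactors = FALSE
  )
  out[!is.na(out$to_gene) & nzchar(out$to_gene), , drop = FALSE]
}

# Usage (requires network access):
# raw <- sketch_query_orthologs(c("Tgfb1", "Vegfa"))
# sketch_tidy_orthologs(raw)
```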

Caching System

To avoid repeated BioMart queries, results are cached locally:

Cache location: ~/.Rcache/
Cache key: hash(genes) + species + ensembl_version
Cache format: R.cache RDS files

First query: ~15 seconds (network dependent)
Cached query: < 1 second
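
The caching idea can be illustrated with a small base-R sketch. It deliberately swaps R.cache for plain saveRDS()/readRDS() so that it runs without extra packages, and it writes to a temporary directory rather than ~/.Rcache/. The function names, the way the key is hashed, and the stand-in fetch function are all assumptions made for the example.

```r
# Sketch of a file-based cache keyed on genes + species + Ensembl version.
# Uses base R only; iTALK itself is described as using R.cache.
sketch_cache_key <- function(genes, species, ensembl_version) {
  key_file <- tempfile()
  on.exit(unlink(key_file))
  # Write the key components to a file and hash the file contents
  writeLines(c(sort(unique(genes)), species, as.character(ensembl_version)),
             key_file)
  unname(tools::md5sum(key_file))
}

sketch_cached_lookup <- function(genes, species, ensembl_version,
                                 fetch, cache_dir) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  key <- sketch_cache_key(genes, species, ensembl_version)
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) return(readRDS(path))  # cache hit: no query
  result <- fetch(genes)                        # cache miss: run the query
  saveRDS(result, path)
  result
}

# Demo with a stand-in for the slow BioMart call (toupper() is NOT orthology)
n_calls <- 0
fake_fetch <- function(genes) {
  n_calls <<- n_calls + 1
  data.frame(from_gene = genes, to_gene = toupper(genes),
             stringsAsFactors = FALSE)
}
demo_dir <- tempfile("sketch_cache_")
first <- sketch_cached_lookup(c("Tgfb1", "Vegfa"), "Mus_musculus", 103,
                              fake_fetch, demo_dir)
second <- sketch_cached_lookup(c("Tgfb1", "Vegfa"), "Mus_musculus", 103,
                               fake_fetch, demo_dir)
```

Because the key includes the Ensembl version, changing ensembl_version triggers a fresh query instead of returning mappings from a different release.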

Automatic Conversion in FindLR

Seamless Workflow

When convert_species = TRUE (default), FindLR() automatically handles species conversion:

# Mouse data - automatic conversion
mouse_genes <- rawParse(mouse_data, top_genes = 50)

lr_pairs <- FindLR(
  data_1 = mouse_genes,
  datatype = "mean count",
  comm_type = "cytokine",
  convert_species = TRUE  # Default
)

# Console output:
# Detected species: Mus_musculus (95.2%)
# Converting mouse genes to human orthologs...
# Mapping complete: 847/1000 genes mapped (84.7%)

Disabling Auto-Conversion

For human data or when conversion is not desired:

lr_pairs <- FindLR(
  data_1 = human_genes,
  datatype = "mean count",
  comm_type = "cytokine",
  convert_species = FALSE
)

Mapping Rates and Considerations

Typical Mapping Rates

Conversion       Mapping Rate   Notes
Mouse → Human    85-95%         Most comprehensive
Rat → Human      80-90%         Good coverage
Other mammals    70-85%         Variable

One-to-Many Mappings

Some genes have multiple orthologs. iTALK handles these by:

  1. Keeping all mappings in the conversion result
  2. Using aggregation (mean/sum/max) for expression matrices

# Convert expression matrix with one-to-many handling
converted <- convert_expression_matrix(
  expr_matrix = mouse_expr,
  gene_mapping = conversion$mapping,
  handle_duplicates = "mean"  # Options: "mean", "sum", "max"
)
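
To make the aggregation step concrete, here is a minimal base-R sketch of what such a conversion could look like. It is an assumption about the behaviour, not the code behind convert_expression_matrix(): the function name sketch_convert_matrix and the toy gene symbols are made up, and only the argument names mirror the call above.

```r
# Sketch: map row names through a from_gene/to_gene table and aggregate
# rows that end up with the same target symbol. Not iTALK's implementation.
sketch_convert_matrix <- function(expr_matrix, gene_mapping,
                                  handle_duplicates = c("mean", "sum", "max")) {
  agg_fun <- switch(match.arg(handle_duplicates),
                    mean = mean, sum = sum, max = max)
  from <- as.character(gene_mapping$from_gene)
  to <- as.character(gene_mapping$to_gene)
  keep <- from %in% rownames(expr_matrix)
  # One row per mapping entry: a source gene with several orthologs is repeated
  expanded <- expr_matrix[from[keep], , drop = FALSE]
  # Rows sharing a target symbol are aggregated column by column
  groups <- split(seq_len(nrow(expanded)), to[keep])
  do.call(rbind, lapply(groups, function(idx) {
    apply(expanded[idx, , drop = FALSE], 2, agg_fun)
  }))
}

# Toy data with made-up symbols: Aaa1 and Aaa2 both map to AAA,
# and Bbb1 maps to both BBB1 and BBB2
toy_expr <- matrix(c(2, 4, 1, 4, 8, 3), nrow = 3,
                   dimnames = list(c("Aaa1", "Aaa2", "Bbb1"), c("s1", "s2")))
toy_map <- data.frame(
  from_gene = c("Aaa1", "Aaa2", "Bbb1", "Bbb1"),
  to_gene = c("AAA", "AAA", "BBB1", "BBB2"),
  stringsAsFactors = FALSE
)
toy_mean <- sketch_convert_matrix(toy_expr, toy_map, "mean")
toy_sum <- sketch_convert_matrix(toy_expr, toy_map, "sum")
```

In this toy case the two source rows for AAA are averaged (or summed), while the single Bbb1 row is copied to both of its targets.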

Unmapped Genes

Genes without orthologs are:

  • Listed in conversion$unmapped
  • Excluded from downstream analysis
  • Logged in console messages

# Check unmapped genes
length(conversion$unmapped)
head(conversion$unmapped)
# Typically includes: pseudogenes, species-specific genes, novel transcripts
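
The relationship between the mapping table, the unmapped list, and the summary statistics can be sketched as follows. The field names n_input, n_mapped, and mapping_rate come from the output shown earlier; how iTALK actually counts them (for example, whether duplicates are removed first) is an assumption, and the third input symbol is a made-up placeholder.

```r
# Sketch: derive unmapped genes and summary statistics from a mapping table.
toy_input <- c("Tgfb1", "Vegfa", "Gm00000")  # last symbol is a placeholder
toy_mapping <- data.frame(
  from_gene = c("Tgfb1", "Vegfa"),
  to_gene = c("TGFB1", "VEGFA"),
  stringsAsFactors = FALSE
)

# Unmapped genes are the inputs that never appear as a source in the mapping
toy_unmapped <- setdiff(toy_input, toy_mapping$from_gene)

# Assumption: statistics are computed over unique input symbols
toy_stats <- list(
  n_input = length(unique(toy_input)),
  n_mapped = length(unique(toy_mapping$from_gene)),
  mapping_rate = length(unique(toy_mapping$from_gene)) /
    length(unique(toy_input))
)
```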

Advanced Configuration

Using Different Ensembl Versions

# Use specific version for reproducibility
conversion <- convert_species_biomart(
  genes = mouse_genes,
  from_species = "Mus_musculus",
  ensembl_version = 103  # Or "current_release" for latest
)

Mirror Selection

For faster access from different regions:

conversion <- convert_species_biomart(
  genes = mouse_genes,
  from_species = "Mus_musculus",
  mirror = "uswest"  # Options: "www", "uswest", "useast", "asia"
)

SSL Configuration

For environments with SSL certificate issues:

# Disable SSL verification (use with caution)
Sys.setenv(BIOMART_SSL_VERIFY = "0")

# Then run conversion
conversion <- convert_species_biomart(genes = mouse_genes, ...)

Performance Benchmarks

Performance benchmarks (typical workstation)
Operation                     Genes   Time     Memory
Species detection             1000    < 0.1s   < 1 MB
BioMart query (first)         1000    ~15s     ~10 MB
BioMart query (cached)        1000    < 1s     < 1 MB
Full FindLR with conversion   1000    ~20s     ~15 MB

Troubleshooting

Common Issues

1. BioMart connection timeout

# Increase retry attempts
conversion <- convert_species_biomart(
  genes = mouse_genes,
  from_species = "Mus_musculus",
  max_tries = 10
)
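
How max_tries is implemented inside iTALK is not described here. As a rough illustration of the general technique, a retry loop with exponential backoff can be written in base R as below; the function name sketch_with_retries, the wait argument, and the doubling schedule are assumptions made for the sketch.

```r
# Sketch of a generic retry wrapper; not iTALK's internal retry logic.
sketch_with_retries <- function(fn, max_tries = 3, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message(sprintf("Attempt %d/%d failed: %s",
                    attempt, max_tries, conditionMessage(result)))
    # Wait longer after each failure: wait, 2 * wait, 4 * wait, ...
    if (attempt < max_tries) Sys.sleep(wait * 2^(attempt - 1))
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Demo with a stand-in that fails twice before succeeding
attempts <- 0
flaky_query <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("connection timeout")
  "ok"
}
retry_result <- sketch_with_retries(flaky_query, max_tries = 5, wait = 0)
```

The same wrapper could surround any network call, for example function() convert_species_biomart(genes = mouse_genes, from_species = "Mus_musculus").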

2. Low mapping rate

  • Check for non-standard gene symbols
  • Verify species detection is correct
  • Some genes may be species-specific

3. Cache issues

# Clear cache directory
unlink("~/.Rcache", recursive = TRUE)

Summary

Key points about cross-species analysis in iTALK:

  1. Automatic - Species detection and conversion happen transparently
  2. Accurate - Uses Ensembl BioMart for validated ortholog mappings
  3. Efficient - Intelligent caching minimizes redundant queries
  4. Flexible - Supports manual control when needed

Session Info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.2.1    igraph_2.3.2   iTALK_0.1.1    rmarkdown_2.31
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.10         generics_0.1.4      tidyr_1.3.2        
#>  [4] shape_1.4.6.1       stringi_1.8.7       hms_1.1.4          
#>  [7] digest_0.6.39       magrittr_2.0.5      evaluate_1.0.5     
#> [10] grid_4.6.0          RColorBrewer_1.1-3  circlize_0.4.18    
#> [13] fastmap_1.2.0       jsonlite_2.0.0      progress_1.2.3     
#> [16] GlobalOptions_0.1.4 purrr_1.2.2         scales_1.4.0       
#> [19] pbapply_1.7-4       randomcoloR_1.1.0.1 jquerylib_0.1.4    
#> [22] cli_3.6.6           crayon_1.5.3        rlang_1.2.0        
#> [25] withr_3.0.3         cachem_1.1.0        yaml_2.3.12        
#> [28] otel_0.2.0          Rtsne_0.17          parallel_4.6.0     
#> [31] tools_4.6.0         colorspace_2.1-2    ggplot2_4.0.3      
#> [34] curl_7.1.0          buildtools_1.0.0    vctrs_0.7.3        
#> [37] R6_2.6.1            lifecycle_1.0.5     stringr_1.6.0      
#> [40] V8_8.2.0            cluster_2.1.8.2     pkgconfig_2.0.3    
#> [43] pillar_1.11.1       bslib_0.11.0        gtable_0.3.6       
#> [46] glue_1.8.1          Rcpp_1.1.1-1.1      xfun_0.59          
#> [49] tibble_3.3.1        tidyselect_1.2.1    sys_3.4.3          
#> [52] knitr_1.51          farver_2.1.2        htmltools_0.5.9    
#> [55] maketools_1.3.2     compiler_4.6.0      prettyunits_1.2.0  
#> [58] S7_0.2.2