Question

Annotating gene names: why do I get a different number of Ensembl IDs using biomaRt?

3.8 years ago

Ridha ▴ 130

Hello everyone! I have a question regarding the biomaRt package. I have an RNA-seq dataset where the gene identifiers are gene names (I think), and I want to retrieve the Ensembl ID and the description of each gene. However, the annotation returns a different number of rows than the number of rows I want to annotate. Is something wrong with my filter? I am using hgnc_symbol as the filter. I am not sure whether what I have are actually gene IDs, as I am suspicious about the "genes" that start with RP1. Are these perhaps transcript IDs rather than gene names? Additionally, the annotation I got has redundant Ensembl IDs for the same gene ID. What should I do? Thank you very much in advance for your help!

Here is a sample of gene names of the dataset I want to annotate

#filtered_resdf$gene_id
[1] "WASH7P"         "RP11-34P13.15"  "RP11-34P13.16"  "FO538757.1"     "U6"             "RP5-857K21.4"   "MTND1P23"      
[8] "MTND2P28"       "MTCO1P12"       "MTCO2P12"       "MTATP6P1"       "MTCO3P12"       "RP11-206L10.2"     "RP11-206L10.9" 
[15] "RP11-206L10.8"  "FAM87B"         "LINC01128"      "LINC00115"      "RP11-54O7.3"    "SAMD11"         "NOC2L"         
[22] "KLHL17"         "HES4"           "ISG15"          "AGRN"           "C1orf159"       "SDF4"           "B3GALT6"       
[29] "FAM132A"        "UBE2J2"         "SCNN1D"         "ACAP3"          "PUSL1"          "CPSF3L"         "RP5-890O3.9"   
[36] "CPTP"           "TAS1R3"         "DVL1"           "MXRA8"          "AURKAIP1"       "CCNL2"          "RP4-758J18.2"  
[43] "MRPL20"         "RP4-758J18.13"  "VWA1"           "ATAD3B"         "ATAD3A"         "TMEM240"        "SSU72"         
[50] "RP5-832C2.5"    "FNDC10"         "RP11-345P4.9"   "MIB2"           "MMP23B"         "CDK11B"         "RP11-345P4.10"

The code I used for biomart annotation

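library(biomaRt)

# Connect to the Ensembl genes mart, then select the human dataset
ensembl <- useEnsembl(biomart = "ensembl", verbose = TRUE)
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)

# values are what you want to look up; filters says what kind of identifier they are
annotation <- getBM(
  attributes = c("ensembl_gene_id", "description", "external_gene_name"),
  filters = "hgnc_symbol",
  values = filtered_resdf$gene_id,
  mart = ensembl
)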

Counting rows

nrow(annotation)     # gives 15694
nrow(filtered_resdf) # gives 18748
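One way to see where a gap like this comes from is to compare the queried names with what came back. The following is only a minimal sketch on toy data: it assumes `hgnc_symbol` was also requested in `attributes` so each returned row can be traced back to its query, and the gene names and Ensembl IDs are made-up placeholders, not real identifiers.

```r
# Toy stand-ins for filtered_resdf$gene_id and the getBM() result.
# All names and IDs here are hypothetical placeholders.
query_symbols <- c("GENE1", "GENE2", "GENE3", "GENE4", "GENE5")
annotation <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  hgnc_symbol     = c("GENE1", "GENE2", "GENE4", "GENE4"),
  stringsAsFactors = FALSE
)

# Names that returned no row at all (e.g. not a current HGNC symbol)
unmapped <- setdiff(query_symbols, annotation$hgnc_symbol)

# Names that returned more than one Ensembl gene ID
hits_per_symbol <- table(annotation$hgnc_symbol)
multi_mapped <- names(hits_per_symbol)[hits_per_symbol > 1]

print(unmapped)
print(multi_mapped)
```

Unmapped names shrink the result and multi-mapped names grow it, so the two row counts have no reason to agree.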

rna-seq R gene • 1.3k views

updated 3.7 years ago by jared.andrews07 ★ 18k • written 3.8 years ago by Ridha ▴ 130

Have you read through similar discussions on the site? 1:1 correspondence cannot be expected between HGNC symbols and ENSG IDs, not without filters restricting results to those in canonical chromosomes. Even then, based on mapping criteria, multiple ENSG identifiers may be returned for the same HGNC symbol.
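A rough sketch of the chromosome restriction, assuming `chromosome_name` was added to the `attributes` requested from `getBM()`. The data frame below is a toy stand-in with placeholder IDs, and the non-canonical sequence name is only illustrative of how alternate haplotypes and patches tend to be labelled.

```r
# Toy getBM()-style result; IDs and symbols are hypothetical placeholders.
annotation <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C"),
  hgnc_symbol     = c("GENE1", "GENE1", "GENE2"),
  chromosome_name = c("6", "CHR_HSCHR6_MHC_COX_CTG1", "X"),
  stringsAsFactors = FALSE
)

# Keep only rows located on the primary human chromosomes
canonical <- c(as.character(1:22), "X", "Y", "MT")
annotation_canonical <- annotation[annotation$chromosome_name %in% canonical, ]

print(annotation_canonical)
```

This removes duplicates that come from alternate sequences, but a symbol can still map to several Ensembl IDs on the primary assembly.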

3.7 years ago by Ram 44k

Genes whose names start with "RP", as you mentioned, are usually pseudogenes or lncRNAs, e.g. RP11-34P13.15. For the RP11 story, check this post: What Are These Rp11 'Genes' In The Genome?

It is not unusual for a gene symbol to match multiple Ensembl IDs (see this post), or for an Ensembl ID to match multiple gene symbols (see this post). Ideally you would use gene IDs for downstream analysis, but if you only have gene symbols, I think it's still OK to use them for something like pathway enrichment analysis...
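If you need the annotation to line up row-for-row with the original table, one possible approach (a sketch only, on toy data with placeholder names and IDs) is to collapse multiple Ensembl IDs per symbol into one string and then look each input name up with `match()`, so unmapped names become `NA` instead of disappearing.

```r
# Toy inputs; symbols and IDs are hypothetical placeholders.
query_symbols <- c("GENE1", "GENE2", "GENE3")
annotation <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C"),
  hgnc_symbol     = c("GENE1", "GENE1", "GENE2"),
  stringsAsFactors = FALSE
)

# One row per symbol, with all of its Ensembl IDs joined by ";"
collapsed <- aggregate(
  ensembl_gene_id ~ hgnc_symbol,
  data = annotation,
  FUN = function(ids) paste(unique(ids), collapse = ";")
)

# One row per input name, in the input order; unmapped names get NA
result <- data.frame(
  gene_id = query_symbols,
  ensembl_gene_id = collapsed$ensembl_gene_id[match(query_symbols, collapsed$hgnc_symbol)],
  stringsAsFactors = FALSE
)

print(result)
```

Whether to keep all IDs, pick one, or drop ambiguous symbols depends on the downstream analysis; joining them just keeps the ambiguity visible.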

updated 3.7 years ago by Ram 44k • written 3.7 years ago by darklings ▴ 580

score 1 · Answer 1 · 2021-02-08

The genes that start with "RP11" are usually long non-coding RNAs that little/nothing is known about. They are indeed gene symbols, though they often have issues being mapped to the various identifiers given that their annotations tend to be more wishy washy than protein-coding genes.

Biomart is not perfect and gene symbols often change. They are notoriously difficult to map back to stable gene identifiers and there will be some instances where biomart (and other annotation tools) simply won't be able to do it. No conversion will be perfect, unfortunately. Going the opposite way is generally a lot easier, which is why it's recommended to use stable gene identifiers for analysis and grab symbols for viz only if needed.
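To illustrate the "IDs for analysis, symbols for viz" idea, here is a small sketch that labels Ensembl IDs with symbols and falls back to the ID itself when no symbol is available. The lookup table stands in for a `getBM()` result queried by `ensembl_gene_id`; all IDs and symbols are made-up placeholders.

```r
# Toy lookup table (ID -> symbol); an empty string means no symbol was returned.
id2symbol <- data.frame(
  ensembl_gene_id    = c("ENSG_A", "ENSG_B", "ENSG_C"),
  external_gene_name = c("GENE1", "", "GENE3"),
  stringsAsFactors = FALSE
)

# IDs as they appear in the analysis results (ENSG_D is absent from the lookup)
ids <- c("ENSG_C", "ENSG_A", "ENSG_B", "ENSG_D")

# Look up a display label per ID, keeping the order of ids
labels <- id2symbol$external_gene_name[match(ids, id2symbol$ensembl_gene_id)]

# Fall back to the stable ID wherever the symbol is missing or empty
no_symbol <- is.na(labels) | labels == ""
labels[no_symbol] <- ids[no_symbol]

print(labels)
```

The analysis table keeps one row per stable ID throughout, and the symbols are only attached at the end for plots and tables.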