Hello everyone! I have a question about the biomaRt package. I have an RNA-seq dataset where the gene identifiers are (I think) gene names, and I want to retrieve the Ensembl ID and the description of each gene. However, the annotation returns a different number of rows than the number of rows I want to annotate. Is something wrong with my filter? I am using hgnc_symbol as the filter. I am not sure whether what I have are actually gene names, as I am suspicious about the "genes" that start with RP (e.g. RP11-34P13.15). Are these perhaps transcript IDs rather than gene names? Additionally, the annotation I got has multiple Ensembl IDs for the same gene identifier. What should I do? Thank you very much in advance for your help!
Here is a sample of the gene names in the dataset I want to annotate:
#filtered_resdf$gene_id
[1] "WASH7P" "RP11-34P13.15" "RP11-34P13.16" "FO538757.1" "U6" "RP5-857K21.4" "MTND1P23"
[8] "MTND2P28" "MTCO1P12" "MTCO2P12" "MTATP6P1" "MTCO3P12" "RP11-206L10.2" "RP11-206L10.9"
[15] "RP11-206L10.8" "FAM87B" "LINC01128" "LINC00115" "RP11-54O7.3" "SAMD11" "NOC2L"
[22] "KLHL17" "HES4" "ISG15" "AGRN" "C1orf159" "SDF4" "B3GALT6"
[29] "FAM132A" "UBE2J2" "SCNN1D" "ACAP3" "PUSL1" "CPSF3L" "RP5-890O3.9"
[36] "CPTP" "TAS1R3" "DVL1" "MXRA8" "AURKAIP1" "CCNL2" "RP4-758J18.2"
[43] "MRPL20" "RP4-758J18.13" "VWA1" "ATAD3B" "ATAD3A" "TMEM240" "SSU72"
[50] "RP5-832C2.5" "FNDC10" "RP11-345P4.9" "MIB2" "MMP23B" "CDK11B" "RP11-345P4.10"
The code I used for the biomaRt annotation:
library(biomaRt)
# Connect to the Ensembl genes mart and select the human dataset
ensembl <- useEnsembl("ensembl", verbose = TRUE)
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# Query by HGNC symbol; values are the identifiers you want to look up
annotation <- getBM(attributes = c("ensembl_gene_id", "description", "external_gene_name"),
                    filters = "hgnc_symbol",
                    values = filtered_resdf$gene_id,
                    mart = ensembl)
Counting rows:
nrow(annotation)     # gives 15694
nrow(filtered_resdf) # gives 18748
Have you read through similar discussions on the site? A 1:1 correspondence between HGNC symbols and ENSG IDs cannot be expected unless the results are restricted to genes on canonical chromosomes. Even then, depending on the mapping criteria, multiple ENSG identifiers may be returned for the same HGNC symbol.
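To see where the row-count gap comes from, it may help to check which symbols got no hit, which got more than one, and how many hits sit outside the canonical chromosomes. Below is a minimal base-R sketch on toy data. The `annotation` table is a made-up stand-in for a `getBM()` result (the IDs and the patch name are placeholders, not real Ensembl values), and it assumes `hgnc_symbol` and `chromosome_name` were added to the requested attributes.

```r
# Toy stand-ins for the real objects; every value below is a placeholder.
query_symbols <- c("SAMD11", "NOC2L", "RP11-34P13.15", "AGRN")

annotation <- data.frame(
  ensembl_gene_id = c("ENSG_PLACEHOLDER_1", "ENSG_PLACEHOLDER_2",
                      "ENSG_PLACEHOLDER_3", "ENSG_PLACEHOLDER_4"),
  hgnc_symbol     = c("SAMD11", "NOC2L", "AGRN", "AGRN"),
  chromosome_name = c("1", "1", "1", "HYPOTHETICAL_PATCH_1"),
  stringsAsFactors = FALSE
)

# Symbols that were queried but returned no row at all
unmatched <- setdiff(query_symbols, annotation$hgnc_symbol)

# Symbols that returned more than one row
hits_per_symbol <- table(annotation$hgnc_symbol)
multi_hit <- names(hits_per_symbol)[hits_per_symbol > 1]

# Keep only rows located on canonical chromosomes
canonical <- c(as.character(1:22), "X", "Y", "MT")
annotation_canonical <- annotation[annotation$chromosome_name %in% canonical, ]

print(unmatched)
print(multi_hit)
print(nrow(annotation_canonical))
```

On real data, `unmatched` would list the identifiers that the `hgnc_symbol` filter could not resolve, and `multi_hit` the symbols responsible for the duplicated rows.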
Genes starting with "RP", as you mentioned, are usually pseudogenes or lncRNAs, e.g. RP11-34P13.15. For the RP11 story, you can check this post: What Are These Rp11 'Genes' In The Genome?
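If you want to know how many of your identifiers are of this "RP" kind, a regular expression can flag them. This is only a rough sketch: the pattern is my own guess based on the names shown in the question (RP, a number, a dash, a clone coordinate, a dot, a number) and may miss other naming schemes. Such names look like clone-based labels rather than HGNC-approved symbols, so they would probably not match an `hgnc_symbol` filter.

```r
# A few identifiers taken from the sample in the question
gene_ids <- c("WASH7P", "RP11-34P13.15", "RP5-857K21.4", "SAMD11", "RP11-54O7.3")

# Assumed pattern for clone-style names such as RP11-34P13.15
clone_pattern <- "^RP[0-9]+-[0-9]+[A-Z][0-9]+\\.[0-9]+$"

# TRUE for identifiers that look like clone-style names
is_clone_based <- grepl(clone_pattern, gene_ids)

# The flagged identifiers, and a count of flagged vs. not flagged
print(gene_ids[is_clone_based])
print(table(is_clone_based))
```

Running the same check on `filtered_resdf$gene_id` would show how much of the gap between the two row counts could come from these names.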
It is not unusual for a gene symbol to match multiple Ensembl IDs (see this post), or for an Ensembl ID to match multiple gene symbols (see this post). The ideal scenario is to use gene IDs for downstream analysis, but if you only have gene symbols, I think it's still OK to use them for something like pathway enrichment analysis...
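If you need the annotation to line up row for row with your results table, one option is to collapse multiple Ensembl IDs per symbol into a single string and then left-join, so unmatched symbols stay in the table with `NA`. Here is a minimal base-R sketch on toy data. The IDs are placeholders, the `hgnc_symbol` column is assumed to have been requested from `getBM()`, and collapsing with ";" is just one possible choice (you could instead keep one ID per symbol by some rule).

```r
# Toy stand-ins; every Ensembl ID below is a placeholder, not a real ID.
filtered_resdf <- data.frame(
  gene_id = c("SAMD11", "AGRN", "RP11-34P13.15"),
  stringsAsFactors = FALSE
)
annotation <- data.frame(
  hgnc_symbol     = c("SAMD11", "AGRN", "AGRN"),
  ensembl_gene_id = c("ENSG_PLACEHOLDER_1", "ENSG_PLACEHOLDER_2",
                      "ENSG_PLACEHOLDER_3"),
  stringsAsFactors = FALSE
)

# Collapse all Ensembl IDs of a symbol into one ";"-separated string
collapsed <- aggregate(
  ensembl_gene_id ~ hgnc_symbol,
  data = annotation,
  FUN  = function(x) paste(unique(x), collapse = ";")
)

# Left join: every row of filtered_resdf is kept, unmatched symbols get NA
annotated <- merge(filtered_resdf, collapsed,
                   by.x = "gene_id", by.y = "hgnc_symbol",
                   all.x = TRUE)

print(annotated)
```

Note that `merge()` sorts the result by the key column by default, so the row order may differ from the input even though the row count is the same.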