Question: BioMart in R retrieves multiple Entrez IDs for one Ensembl ID
8 weeks ago by ali.cham wrote:


I use the following code to retrieve some attributes, including 'entrezgene_id':

library(biomaRt)
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# "annot" is a placeholder: the result variable's name is missing from the post
annot <- getBM(attributes = c('ensembl_gene_id', 'entrezgene_id', 'entrezgene_accession', 'entrezgene_description'), filters = 'ensembl_gene_id', values = result.res$ID, mart = ensembl)

After some filtering, I found that biomaRt returns multiple 'entrezgene_id' values for a single 'ensembl_gene_id':

ENSG00000111215         5554         PRH1         proline rich protein HaeIII subfamily 1
ENSG00000111215        11272        PRR4         proline rich 4
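
To see how many Ensembl IDs are affected, the distinct Entrez IDs per Ensembl ID can be counted. The following is a minimal base R sketch, not part of the original post. It assumes the getBM() result is a data frame with the columns shown above; the name annot and the third row (ENSG_EXAMPLE) are hypothetical and exist only so the example runs without a BioMart connection.

```r
# Toy stand-in for the getBM() result. Rows 1-2 are the rows quoted above;
# row 3 is a made-up, unambiguous gene added only for contrast.
annot <- data.frame(
  ensembl_gene_id = c("ENSG00000111215", "ENSG00000111215", "ENSG_EXAMPLE"),
  entrezgene_id = c(5554, 11272, 1),
  entrezgene_accession = c("PRH1", "PRR4", "EXAMPLE"),
  stringsAsFactors = FALSE
)

# Count the distinct Entrez IDs reported for each Ensembl ID
n_entrez <- tapply(annot$entrezgene_id, annot$ensembl_gene_id,
                   function(x) length(unique(x)))

# Ensembl IDs that map to more than one Entrez ID
ambiguous_ids <- names(n_entrez)[n_entrez > 1]

# All rows that belong to those ambiguous Ensembl IDs
ambiguous_rows <- annot[annot$ensembl_gene_id %in% ambiguous_ids, ]
print(ambiguous_rows)
```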

When I search the Ensembl website for gene "ENSG00000111215", there is only one result, but the result report contains a line like "PRH1 (NCBI gene (formerly Entrezgene) record", which is what biomaRt returns as the second 'entrezgene_id'.

How can I get rid of those "formerly Entrezgene" entries?


8 weeks ago by Kevin Blighe (Republic of Ireland) wrote:

If you look at the region, you'll see the difficulty that any annotation database faces: UCSC Track.

There is no 'one size fits all' solution for these ambiguous mappings between the annotation databases and, sometimes, in order to avoid wasting weeks of time, one simply has to set a rule and then move forward.

Via biomaRt, one can also retrieve cds_start and cds_end, so my suggestion would be to always retain the longest CDS when there are ambiguities like this.
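
As a rough illustration of the "retain the longest CDS" rule, here is a minimal base R sketch. It assumes each candidate mapping already has a CDS length attached; how that length is derived from cds_start and cds_end is not specified in this answer, so the cds_length column, the data frame name candidates, and all length values below are hypothetical and invented for illustration only.

```r
# Hypothetical input: one row per candidate mapping, with a CDS length
# already attached. The cds_length values are invented, not real data.
candidates <- data.frame(
  ensembl_gene_id = c("ENSG00000111215", "ENSG00000111215", "ENSG_EXAMPLE"),
  entrezgene_id = c(5554, 11272, 1),
  cds_length = c(300, 500, 900),
  stringsAsFactors = FALSE
)

# Sort so that, within each Ensembl ID, the longest CDS comes first
ordered <- candidates[order(candidates$ensembl_gene_id,
                            -candidates$cds_length), ]

# Keep only the first (longest-CDS) row for each Ensembl ID
resolved <- ordered[!duplicated(ordered$ensembl_gene_id), ]
print(resolved)
```

If two candidates tie on length, this keeps whichever row appeared first in the input, so a second tie-breaking rule may be worth setting as well.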

