Question

BioMart in R retrieves multiple Entrez IDs for one Ensembl ID

3.4 years ago

ali.cham • 0

Hi,

I use the following code to retrieve some attributes, including 'entrezgene_id':

library(biomaRt)
# Connect to the Ensembl mart and select the human gene dataset
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# result.res$ID holds the Ensembl gene IDs to look up
entrez.data <- getBM(attributes = c('ensembl_gene_id', 'entrezgene_id', 'entrezgene_accession', 'entrezgene_description'), filters = 'ensembl_gene_id', values = result.res$ID, mart = ensembl)

But after some filtering, I found that BioMart returns multiple 'entrezgene_id' values for a single 'ensembl_gene_id':

ENSG00000111215         5554         PRH1         proline rich protein HaeIII subfamily 1
ENSG00000111215        11272        PRR4         proline rich 4

When I search the Ensembl website for the gene "ENSG00000111215", there is only one result, but the result report contains a line like "PRH1 (NCBI gene (formerly Entrezgene) record", which corresponds to one of the 'entrezgene_id' values that BioMart returns.

How can I get rid of these "formerly Entrezgene" entries?

Thanks

R • 840 views


score 1 · Answer 1 · 2020-11-25

If you look at the region, you'll see the difficulty that any annotation database faces: UCSC Track.

There is no 'one size fits all' solution for these ambiguous mappings between annotation databases. Sometimes, to avoid wasting weeks of time, one simply has to set a rule and move forward.

Via biomaRt, one can also retrieve cds_start and cds_end, so my suggestion would be to always retain the entry with the longest CDS when there are ambiguities like this.
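As a rough sketch of that rule, the base R snippet below keeps one row per Ensembl gene ID, choosing the row with the largest CDS length. It assumes the table already has a numeric per-row column, here called cds_length, which is an assumption; how you derive it (for example from cds_start and cds_end) depends on your data. The helper keep_longest_cds is hypothetical and not part of biomaRt, and the CDS lengths in the toy table are made up for illustration only.

```r
# Hypothetical helper (not part of biomaRt): keep one row per Ensembl gene ID,
# choosing the row with the largest CDS length.
keep_longest_cds <- function(df, id_col = "ensembl_gene_id", len_col = "cds_length") {
  # Sort by gene ID, then by decreasing CDS length (rows with NA length go last).
  ord <- order(df[[id_col]], -df[[len_col]], na.last = TRUE)
  sorted <- df[ord, , drop = FALSE]
  # After sorting, the first row for each gene ID has the longest CDS.
  dedup <- sorted[!duplicated(sorted[[id_col]]), , drop = FALSE]
  rownames(dedup) <- NULL
  dedup
}

# Toy input: the IDs and symbols come from the question above; the cds_length
# values are invented purely to demonstrate the selection rule.
toy <- data.frame(
  ensembl_gene_id      = c("ENSG00000111215", "ENSG00000111215"),
  entrezgene_id        = c(5554, 11272),
  entrezgene_accession = c("PRH1", "PRR4"),
  cds_length           = c(300, 500),
  stringsAsFactors     = FALSE
)

result <- keep_longest_cds(toy)
print(result)
```

Which entry survives depends entirely on the real CDS lengths, so the toy output says nothing about which gene is the correct match for ENSG00000111215. Ties are broken by input order, which may or may not be what you want.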

Kevin