BioMart in R retrieves multiple Entrez IDs for one Ensembl ID
3.4 years ago
ali.cham • 0

Hi,

I use the following code to retrieve several attributes, including 'entrezgene_id':

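library(biomaRt)
# Connect to Ensembl and select the human gene dataset
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# result.res$ID supplies the Ensembl gene IDs used as filter values
entrez.data <- getBM(attributes = c('ensembl_gene_id', 'entrezgene_id', 'entrezgene_accession', 'entrezgene_description'), filters = 'ensembl_gene_id', values = result.res$ID, mart = ensembl)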

But after some filtering, I found that BioMart returns multiple 'entrezgene_id' values for a single 'ensembl_gene_id':

ENSG00000111215         5554         PRH1         proline rich protein HaeIII subfamily 1
ENSG00000111215        11272        PRR4         proline rich 4
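For reference, a minimal sketch of how such one-to-many mappings can be listed with base R. The data frame below is rebuilt by hand from the two rows shown above so that the snippet runs without a BioMart connection; in practice `entrez.data` would be the `getBM()` result. The names `n.entrez`, `ambiguous.ids` and `ambiguous` are just illustrative.

```r
# Rows copied from the output above; normally this comes from getBM().
entrez.data <- data.frame(
  ensembl_gene_id      = c("ENSG00000111215", "ENSG00000111215"),
  entrezgene_id        = c(5554L, 11272L),
  entrezgene_accession = c("PRH1", "PRR4"),
  stringsAsFactors     = FALSE
)

# Count the distinct Entrez IDs attached to each Ensembl gene ID.
n.entrez <- tapply(entrez.data$entrezgene_id,
                   entrez.data$ensembl_gene_id,
                   function(x) length(unique(x)))

# Ensembl gene IDs that map to more than one Entrez ID.
ambiguous.ids <- names(n.entrez)[as.vector(n.entrez > 1)]

# All rows belonging to those ambiguous Ensembl gene IDs.
ambiguous <- entrez.data[entrez.data$ensembl_gene_id %in% ambiguous.ids, ]
print(ambiguous)
```

Inspecting `ambiguous` first shows how many genes are affected before deciding how to handle them.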

When I search the Ensembl website for gene "ENSG00000111215", there is only one result, but the result report contains a line like "PRH1 (NCBI gene (formerly Entrezgene) record", which is what BioMart returns as the second 'entrezgene_id'.

How can I get rid of those "formerly Entrezgene" entries?

Thanks

3.4 years ago

If you look at the region, you'll see the difficulty that any annotation database faces: UCSC Track.

There is no 'one size fits all' solution for these ambiguous mappings between the annotation databases and, sometimes, in order to avoid wasting weeks of time, one simply has to set a rule and then move forward.

Via biomaRt, one can also retrieve the cds_start and cds_end, so my suggestion would be to always retain the longest CDS when there are ambiguities like this.
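A rough sketch of that rule in base R, under some loud assumptions: the input already has one row per candidate mapping and a numeric column holding the CDS length (called `cds_length` here, assumed to have been retrieved or derived beforehand). The identifiers and lengths below are toy placeholders, not real annotation values, and `keep_longest()` is a hypothetical helper, not a biomaRt function.

```r
# Toy input with made-up IDs and made-up CDS lengths (illustration only).
hits <- data.frame(
  ensembl_gene_id = c("GENE_X", "GENE_X", "GENE_Y"),
  entrezgene_id   = c("A", "B", "C"),
  cds_length      = c(300, 500, 420),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep, for each ID, the row with the longest CDS.
keep_longest <- function(df, id.col = "ensembl_gene_id", len.col = "cds_length") {
  # Sort by ID, then by decreasing CDS length within each ID.
  ord <- order(df[[id.col]], -df[[len.col]])
  df <- df[ord, , drop = FALSE]
  # duplicated() flags every row after the first per ID, so negating it
  # keeps only the first (longest) row of each group.
  df[!duplicated(df[[id.col]]), , drop = FALSE]
}

resolved <- keep_longest(hits)
print(resolved)
```

Ties are broken by whichever row sorts first, so a secondary rule would be needed if equal lengths occur. Whatever rule is chosen, applying it through one function keeps the decision consistent and easy to report.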

Kevin
