Question: biomaRt duplicates some unique IDs during getSequence() query but not others
5.6 years ago by
United States
eromasko120 wrote:

Hello everyone. After reading through the biomaRt reference manual and user's guide, I have a question I'm hoping someone here can help with. I am using Bioconductor's biomaRt package to query and store 500 nt sequences from the Ensembl gene database (mouse dataset) based on lists of unique MGI IDs. These sequences are ultimately used on a local instance of the MEME Suite for motif analyses. Currently, I am working with lists containing from ~20 to ~450 MGI IDs. In some, but not all, lists, some IDs are being duplicated. For example, in a list of 425 MGI IDs, one ID is duplicated (MGI:1920713). In another list, two IDs are duplicated (MGI:102935 and MGI:1923008). From visually checking my "MGI_IDs_1.txt" file and my "geneIDs" and "seqs" variables, I can tell that the duplication occurs during the getSequence() step and not during any earlier step, including the readLines() step. Looking up those MGI IDs on MGI's site, I also can't find any reason why they would be singled out. Does anyone have any idea why? Thanks very much in advance.

Here's my R code:


Tags: mgi, bioconductor, biomart, R
5.6 years ago by
Baylor College of Medicine, Houston, TX
RamRS30k wrote:

The first question when you see duplicate IDs is: do your FASTA headers contain whitespace? If each header is read as an ID only up to the first whitespace character, will the IDs still be unique?

If not, you're better off replacing the blank spaces in the headers with a placeholder, such as a double underscore (__).
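
As a rough sketch of that check in R: the headers below are made up for illustration (they are not taken from the asker's output), and in practice they would be the lines of the FASTA file that start with ">".

```r
# Hypothetical FASTA headers, for illustration only
headers <- c(">MGI:1920713 copy A", ">MGI:1920713 copy B", ">MGI:102935")

# The ID as a whitespace-delimited parser would see it:
# everything after ">" up to the first whitespace character
ids <- sub("^>([^[:space:]]+).*$", "\\1", headers)

# IDs that collide once the header is cut at the first whitespace
dup_ids <- unique(ids[duplicated(ids)])

# Replace each run of whitespace with a double underscore so that
# the whole header is kept as the ID
safe_headers <- gsub("[[:space:]]+", "__", headers)
safe_ids <- sub("^>", "", safe_headers)
```

If `dup_ids` is empty, truncation at whitespace is not the cause of the duplicates.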


Hi Ram. I'm not sure I follow exactly what you mean. If you are asking about the FASTA file which is created at the end, there is no white space after the header. Here's an example of the first few lines:



ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by eromasko120

Can you give us the first few (~20) lines of the MGI_IDs file please?

ADD REPLYlink written 5.6 years ago by RamRS30k
ADD REPLYlink written 5.6 years ago by eromasko120

OK, this doesn't seem to be the problem I thought it was. Let's wait for someone with better or more specific inputs.

ADD REPLYlink written 5.6 years ago by RamRS30k
5.4 years ago by
Washington University, St Louis, USA
Obi Griffith18k wrote:

I believe the reason is that some of your MGI IDs actually correspond to multiple Ensembl gene IDs. See the results of the BioMart query below using one of your example problematic IDs (MGI:1920713). You can see that two different Ensembl genes have the same MGI ID and symbol (Als2cr11). This happens commonly when filtering Ensembl BioMart results on external identifiers, because there is not always a 1-to-1 relationship between identifiers. The fundamental key of Ensembl BioMart is the Ensembl gene. Even when using the 'unique results only' option you will sometimes see duplicate rows in a BioMart result, and this is usually the reason. I always carry forward Ensembl gene (and often transcript) identifiers to explain what would otherwise appear as unexplained duplicates.

I'm guessing that in this specific case you actually have two different FASTA sequences as well. You will need to figure out which gene you actually want. One of the Ensembl genes for MGI:1920713 (ENSMUSG00000047383) has a single transcript for a known unprocessed pseudogene. The other Ensembl gene (ENSMUSG00000072295) has two transcripts, a non-coding processed transcript and a protein-coding transcript. I'm guessing the latter is what you are actually after.

Incidentally, these two Ensembl genes are close together but are probably correctly marked as separate genes. The mistake was likely in assigning the MGI ID to both. But I am not a genome annotation expert. Depending on what your ultimate analysis goals are, you might be able to solve this issue by filtering on gene/transcript biotype.
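
A minimal sketch of how the one-to-many mapping could be spotted and filtered in R. The table is a toy stand-in for a biomaRt result: the two Ensembl gene IDs for MGI:1920713 are the ones named above, while the placeholder row, the column names, and the exact biotype strings are assumptions for illustration. The commented-out `getBM()` call likewise assumes that `mgi_id`, `ensembl_gene_id` and `gene_biotype` are available as attribute and filter names in the mouse dataset, which is worth confirming with `listAttributes()` and `listFilters()`.

```r
# With biomaRt the table might be fetched roughly like this (not run; needs
# network access, and the attribute/filter names are assumptions):
# library(biomaRt)
# mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
# mapping <- getBM(attributes = c("mgi_id", "ensembl_gene_id", "gene_biotype"),
#                  filters = "mgi_id", values = mgi_ids, mart = mart)

# Toy stand-in for such a result (placeholder row and biotypes are illustrative)
mapping <- data.frame(
  mgi_id          = c("MGI:1920713", "MGI:1920713", "MGI:EXAMPLE"),
  ensembl_gene_id = c("ENSMUSG00000047383", "ENSMUSG00000072295", "ENSMUSG_EXAMPLE"),
  gene_biotype    = c("unprocessed_pseudogene", "protein_coding", "protein_coding"),
  stringsAsFactors = FALSE
)

# Count distinct Ensembl genes per MGI ID; any count above 1 means a single
# input ID will come back as more than one row (and sequence)
genes_per_id <- tapply(mapping$ensembl_gene_id, mapping$mgi_id,
                       function(x) length(unique(x)))
ambiguous <- names(genes_per_id)[genes_per_id > 1]

# One possible resolution: keep only protein-coding genes
coding <- mapping[mapping$gene_biotype == "protein_coding", ]
```

Here `ambiguous` lists the MGI IDs that need a decision. Filtering on biotype only resolves cases where the competing genes differ in biotype, so it is worth re-checking for duplicates afterwards.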
