Question

biomaRt duplicates some unique IDs during getSequence() query but not others

9.5 years ago

eromasko ▴ 120

Hello to everyone. After reading through the biomaRt Reference and Users guides, I have a question I'm hoping someone here can help with. I am using Bioconductor's biomaRt package to query and store 500 nt sequences from the Ensembl gene database (mouse dataset) based on lists of unique MGI IDs. These are ultimately being used on a local instance of the MEME suite for motif analyses.

Currently, I am working with lists containing from ~20 to ~450 MGI IDs. In some, but not all, lists, some IDs are being duplicated. For example, in a list of 425 MGI IDs, one ID is being duplicated (MGI:1920713). In another list, two IDs are being duplicated (MGI:102935 and MGI:1923008).

From visually checking (i.e., looking at my "MGI_IDs_1.txt" file and my "geneIDs" and "seqs" variables), I can tell that the duplication is occurring during the getSequence() step and not during any earlier step, including the readLines() step. Looking up those MGI IDs on MGI's site, I also can't find any reason why they would be singled out. Does anyone have any ideas why? Thanks very much in advance.

Here's my R code:

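library("biomaRt")
# Connect to the Ensembl mouse gene dataset
ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
# Read the MGI IDs, one per line
geneIDs <- readLines("/home/ed/R/Projects/MGI_IDs_1.txt")
# Fetch the 500 nt upstream flanking sequence for each MGI ID
seqs <- getSequence(id = geneIDs, type = "mgi_id", seqType = "gene_flank", upstream = 500, mart = ensembl)
# Write the sequences to a FASTA file
exportFASTA(seqs, "/home/ed/R/Projects/list_1.fasta")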

Bioconductor biomaRt MGI R • 4.6k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.5 years ago by eromasko ▴ 120

score 1 · Answer 1 · 2015-03-06

9.5 years ago

Ram 44k

The first question if you see duplicate IDs is: Do your FASTA headers have white spaces? If you assume that each header is read as an ID only until the first white space, will the IDs still be unique?

If not, you're better off replacing blank spaces in headers with a placeholder, such as a double underscore (__).
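
The whitespace check described above could be done programmatically along these lines. This is only a minimal sketch in base R: the helper name `truncated_header_dups` and the example headers are made up for illustration and are not taken from the poster's actual file.

```r
# Hypothetical helper: return the IDs that collide once every FASTA header
# is cut at its first whitespace character.
truncated_header_dups <- function(fasta_lines) {
  # Keep only the header lines (those starting with ">")
  headers <- grep("^>", fasta_lines, value = TRUE)
  # Drop the ">" and everything from the first whitespace onwards
  ids <- sub("^>([^[:space:]]*).*$", "\\1", headers)
  # Report each colliding ID once
  unique(ids[duplicated(ids)])
}

# Made-up example: two headers differ only after the first space
example_lines <- c(
  ">MGI:102556 copy 1", "ACGT",
  ">MGI:102556 copy 2", "TTGA",
  ">MGI:95564", "GGCC"
)
dups <- truncated_header_dups(example_lines)
print(dups)
```

On a real file, the input would be something like `readLines("/home/ed/R/Projects/list_1.fasta")`. An empty result means the headers stay unique after truncation, so whitespace is not the cause.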

ADD COMMENT • link 9.5 years ago by Ram 44k

Hi Ram. I'm not sure I follow exactly what you mean. If you are asking about the FASTA file which is created at the end, there is no white space after the header. Here's an example of the first few lines:

>MGI:102556
GGGCCGTACAGGGCATATATGTCCTCGGTCACAGGGCTCCTAAGTCCACTGCGTGCAGGCGGGGCTGTGGGAGAGCCGGTGCAGGGAGGTGACCCGGCAAGTAGCCAGGCCCAGGCCTAGGAGTAGGCCGGGCTGTCCAGCCTCAGTGGGAACAGGGCACAGCGACCCAGCTGCCCGTCCCCTCTTCCCTCCTCTCTAATCTAAACCATGCGGGGCTCCAGGCCCCTGCCCAGCCCTAGGACCAAGGCGAGGACACCCTTTTGTACTGCTGAGAATCCCGCAGCCCAATTGAGGCTTGGGGTCTGGGAGGACCGAAGAGTCCCAGGACCCTAGAATTCCCCCTCTCCGGGGGGGGGGGGCGGCAGGCGAGAACTGATTGACAGCTGCGACGTCAGCACGGAGGGGGGGGCGGCCCCGGAGGGAGGGAGGACCAGGAGGAGGGAGCAGAGGCCCCAAAGTGCAGCCTCGGAGATCAAGGGCCCGCTCTACACCCGGGATGT

>MGI:95564
TTCCCCAAATCACTAGGTAGTAGAGCGATTTATTTTGCATCCTAGCCATAAGAGTCCCAGAAGTCAGGGTCAATGTAGTTATCACTGTTCCTAGATCAAAGTGTTCTCCAATGGAATTTATCACTTTTTTTTTACTTCCCTCCCGCCGCGACTTGCAGTTCAAACAGGTCCCACGCCAGTGCCGTGAAAAATCTACCCAATGCCCCTCCACTTCTGGTCGGGCCCCATTTTCGACAACTGGAGGCCTTTTCCCATGTCCTCGCTTCCTCCTGTACGGAAGAGGCCTCTAGTCCCCGCGGCGGGCACAGGCAGCCAGACTTAGTACTGCCCTTTACGCTTTCCCGCCTTTCACCAAGTGCGCGCGCCAAAGGGTGCGCCCTGATACCCTGAGGCAGGTACCCAGGAGGGCGCATGCGCGTGCTCCACTGCCCGAGAGGAGGCAAGGGCGGAGCGGAGAGGGCGGGGCCAAAGGGGAGGGGCTGCGCGGGTCACGTGACATC

>MGI:1333766
CCTTTTTATTGAAATGCACACACAGAAAAGTATACAAATCTGAAATATGCATAAGTGAATCTTCATGAAGTGAACGCACTTGTAGCAAAATAAATCAACCCTGTCTGGCACCTGGCCTCCTTCCTGTCTCTTCATCACAACATTTTTCACGTGCAAGAGGAAATACTAGGAGACCATGGTTCTGATTAGGCTAGGCGGGGTTCTGTGCTTGAGGATTTAAGGGAGGAGCATGGATTGATGGTTAAGGTAGTTATGTATGCCCTGGGGAATTGTTGAGGTAGGGGGGATGTGTTGGACTCTGAGTTAGGGCTTCATAAATTTGTTAGGTTTGTGTAGAAGCAGGAGCTATTTAGAGAAGGCCTTGGGGGCATATGTGGGAGGGGCGGACCGGTTTGGAGGCGGGGCCTAGGAGCGAGGCAGGGCGGAGCCAGTGTGTGTGGGAGGAGCCTCCGCGCTCGGAGGGCGGGGCTCGCGCGCGCGCGCTCGCTGCCGGCGCGCCC

ADD REPLY • link 9.5 years ago by eromasko ▴ 120

Can you give us the first few (~20) lines of the MGI_IDs file please?

ADD REPLY • link 9.5 years ago by Ram 44k

MGI:1915804
MGI:1916541
MGI:1922895
MGI:1922764
MGI:1917623
MGI:1922873
MGI:1920637
MGI:1920832
MGI:1923913
MGI:1913628
MGI:1920923
MGI:1919402
MGI:1913561
MGI:1913828
MGI:1919831
MGI:1914669
MGI:1917941
MGI:1921185
MGI:1922105
MGI:1918893
MGI:1921916

ADD REPLY • link 9.5 years ago by eromasko ▴ 120

OK, this doesn't seem to be the problem I thought it was. Let's wait for someone with better or more specific inputs.

ADD REPLY • link 9.5 years ago by Ram 44k

score 0 · Answer 2 · 2015-04-26

I believe the reason is that some of your MGI IDs correspond to multiple Ensembl gene IDs. See the results of the BioMart query below using one of your example problematic IDs (MGI:1920713). You can see that two different Ensembl genes have the same MGI ID and symbol (Als2cr11).

This happens commonly when filtering Ensembl BioMart results on external identifiers, because there is not always a 1-to-1 relationship between identifiers. The fundamental key of Ensembl BioMart is the Ensembl gene. Even when using the 'unique results only' option you will sometimes see duplicate rows in a BioMart result, and this is usually the reason. I always carry forward Ensembl gene (and often transcript) identifiers to explain what would otherwise appear as unexplained duplicates.

I'm guessing that in this specific case you actually have two different FASTA sequences as well. You will need to figure out which is the gene you actually want. One of the Ensembl genes (ENSMUSG00000047383) for MGI:1920713 has a single transcript for a known unprocessed pseudogene. The other Ensembl gene (ENSMUSG00000072295) has two transcripts, a non-coding processed transcript and a protein-coding transcript. I'm guessing the latter is what you are actually after.

Incidentally, these two Ensembl genes are close in proximity but are probably correctly marked as separate genes. The mistake was likely in assigning the MGI ID to both. But I am not a genome annotation expert. Depending on what your ultimate analysis goals are, you might be able to solve this issue by filtering on gene/transcript biotype.
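
To illustrate carrying the Ensembl gene ID forward, here is a rough sketch of how the multi-mapped MGI IDs could be listed. The helper `multi_mapped_ids` is hypothetical. The `getBM()` call is shown only as a comment because it needs network access, and it assumes that `mgi_id`, `ensembl_gene_id` and `gene_biotype` are valid attribute/filter names in the mouse dataset. The offline table reuses the two Ensembl IDs mentioned above, plus one placeholder row that is not a real mapping.

```r
# Hypothetical helper: list the external IDs that map to more than one
# Ensembl gene in a mapping table.
multi_mapped_ids <- function(mapping, id_col = "mgi_id",
                             gene_col = "ensembl_gene_id") {
  # Collapse repeated (ID, gene) pairs so only distinct genes are counted
  pairs <- unique(mapping[, c(id_col, gene_col)])
  ids <- pairs[[id_col]]
  # Any ID still appearing more than once maps to several genes
  unique(ids[duplicated(ids)])
}

# With network access, the mapping could be fetched roughly like this
# (attribute and filter names are assumptions, see above):
# library(biomaRt)
# ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
# mapping <- getBM(attributes = c("mgi_id", "ensembl_gene_id", "gene_biotype"),
#                  filters = "mgi_id", values = geneIDs, mart = ensembl)

# Offline stand-in; the last row is a placeholder, not a real Ensembl ID
mapping <- data.frame(
  mgi_id = c("MGI:1920713", "MGI:1920713", "MGI:102556"),
  ensembl_gene_id = c("ENSMUSG00000047383", "ENSMUSG00000072295",
                      "ENSMUSG_PLACEHOLDER"),
  stringsAsFactors = FALSE
)
print(multi_mapped_ids(mapping))
```

Any ID reported here would be expected to produce more than one row in the getSequence() output when queried by MGI ID.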

http://useast.ensembl.org/biomart/martview/35c7f452d93cd16d543c021bcf0174c8
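
The biotype filtering suggested in the answer could look something like the sketch below. The helper `keep_biotype` is hypothetical, and the `gene_biotype` values in the example table are assumptions inferred from the transcript descriptions in the answer, not actual query output.

```r
# Hypothetical helper: keep only the rows whose gene biotype is wanted.
keep_biotype <- function(mapping, biotype = "protein_coding") {
  # %in% treats a missing biotype as "no match" instead of returning NA rows
  mapping[mapping$gene_biotype %in% biotype, , drop = FALSE]
}

# Illustrative table; the biotype values are assumed, not queried
annotated <- data.frame(
  mgi_id = c("MGI:1920713", "MGI:1920713"),
  ensembl_gene_id = c("ENSMUSG00000047383", "ENSMUSG00000072295"),
  gene_biotype = c("unprocessed_pseudogene", "protein_coding"),
  stringsAsFactors = FALSE
)
kept <- keep_biotype(annotated)
print(kept$ensembl_gene_id)

# The flanks could then be requested by Ensembl gene ID rather than MGI ID,
# which keeps one sequence per gene (needs network access and a mart object):
# seqs <- getSequence(id = kept$ensembl_gene_id, type = "ensembl_gene_id",
#                     seqType = "gene_flank", upstream = 500, mart = ensembl)
```

Whether protein-coding is the right biotype to keep depends on the analysis goals, so treat the default value as a placeholder.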