Question: FASTA Headers Not Useful from Database Download
mollysil wrote:

Hi,

I am trying to BLAST many reads against the MaarjAM database (http://maarjam.botany.ut.ee/?action=sTax), a database strictly for arbuscular mycorrhizal fungi sequences. I was able to download the FASTA sequences from the database into a text file and then converted that into a blast-able database using "makeblastdb -in MaarjAM_18s_seq.txt -out MaarjAMdb -dbtype nucl". The main problem is that the headers in the original file are not useful. They look like this:

>gb|AB076274_2004_Saito,_M._GlAc2.1_VTX00166
GGGACATCATGTCGGTCGTGCCTCGGTACGTACTGGTATTGTTGGTTTCTCCCTTCTGACGAACCATGATGTCATTTATT
TGGTGTTGTGGGGAATCAGGACTGTTACTTTGAAAA
>gb|LN620567_2015_Davison,_J._sp._VTX00311
AGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTCGGGGTC
AGTAGATTGGTCGTGCCACTGGTACGTACTGGTCTTACTGATTCCTCCCTCCTGATGAACTGTAATGCCATTAAT

The headers list the publication information for each sequence instead of useful information like the genus and species names (which is what I need!). I use BLAST+ against the NCBI database and it works fine, but with this database I get alignments without any taxon assignments. Any ideas for fixing this easily, without many new software or package downloads? I am on Linux and have been using MEGAN to import the BLAST files. I need to either change the headers to carry taxonomic information or find another way to get proper taxon assignments!

Thanks! Molly

headers blast+ fungi maarjam amf
modified 3.9 years ago by wjidea • written 3.9 years ago by mollysil
wjidea wrote:

It seems like the last part of the sequence header could lead you to a taxid in NCBI GenBank.

My solution: parse the FASTA file -> take the last part of each header (e.g., VTX00166) -> search Entrez (e.g., via the Biopython API) -> get the taxon ID -> translate the taxon ID using taxdump -> get the taxonomy info -> modify the original FASTA file
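
The "translate taxon id using taxdump" step could look roughly like the sketch below. It assumes the `names.dmp` file from NCBI's taxdump archive, whose fields (to my knowledge) are separated by `\t|\t` with the name class in the fourth column; the function name `load_scientific_names` is made up for illustration.

```python
def load_scientific_names(handle):
    """Map taxid (int) -> scientific name from a names.dmp-style stream.

    Assumed line layout: tax_id | name_txt | unique name | name class |
    """
    names = {}
    for line in handle:
        # Drop the trailing "\t|" and split the remaining fields on "|".
        fields = [f.strip() for f in line.rstrip("\n").rstrip("|").split("|")]
        # Keep only the row that carries the official scientific name.
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names
```

Usage would be something like `with open("names.dmp") as fh: names = load_scientific_names(fh)`, after which `names[taxid]` gives the name to write into a header.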

Hope it helps.

Edit1:

If you have a large sequence file to query, consider downloading the GI-to-taxid mapping from ftp://ftp.ncbi.nih.gov/pub/taxonomy/ and then parsing and querying it on your local machine.

modified 3.9 years ago • written 3.9 years ago by wjidea
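
For the local-lookup route, a minimal sketch is shown here. It assumes a tab-separated mapping table with a header row that names `accession` and `taxid` columns (the layout of NCBI's accession2taxid files, as far as I know); a GI-keyed file would need a different key column. `load_taxids` is a hypothetical helper, not part of any library.

```python
import csv

def load_taxids(handle, wanted):
    """Return {accession: taxid} for the accessions listed in `wanted`.

    Assumes a tab-separated table whose header row contains
    'accession' and 'taxid' columns.
    """
    wanted = set(wanted)
    found = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # Only keep rows for accessions we care about, so memory stays small.
        if row["accession"] in wanted:
            found[row["accession"]] = int(row["taxid"])
    return found
```

Streaming the file and filtering against a set of wanted accessions avoids loading the whole mapping table into memory.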

It's actually the first part of the sequence ID that represents the NCBI sequence accession number. So how would I go about matching up the first part of the headers with the NCBI database taxon information? I am able to remove the latter part of the header so it looks like:

>gb|AB046938
TGAAACTGCTAATGGCTCATTAA
>gb|AB046939
TGAAACTGCTAGGGGCTCATTAA

Do you know of any scripts that link these sequence IDs to the NCBI taxon information and then change the headers to reflect that taxonomy?

modified 3.9 years ago • written 3.9 years ago by mollysil
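
Once an accession-to-name table exists, the header rewrite itself needs only plain Python. The sketch below assumes headers shaped like the MaarjAM examples in this thread (`>gb|ACCESSION_year_author_...`) and accessions that contain no underscore; `accession_from_header` and `relabel_fasta` are illustrative names.

```python
def accession_from_header(header):
    """'>gb|AB076274_2004_Saito,_M._GlAc2.1_VTX00166' -> 'AB076274'."""
    text = header.lstrip(">").strip()
    first = text.split()[0] if text else ""
    # Drop the 'gb|' prefix, then cut at the first underscore.
    # Assumption: the accession itself has no underscore.
    return first.split("|")[-1].split("_")[0]

def relabel_fasta(lines, names):
    """Yield FASTA lines, replacing headers whose accession is in `names`.

    `names` maps accession -> organism name; unknown headers pass through.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            acc = accession_from_header(line)
            label = names.get(acc)
            if label is not None:
                # Spaces become underscores so the whole label stays in the ID.
                line = ">{}|{}".format(acc, label.replace(" ", "_"))
        yield line
```

Writing the yielded lines to a new file and rerunning the `makeblastdb` command from the question on it would give a database whose hit IDs carry the organism name.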

Here is one Biopython solution:

from Bio import Entrez

Entrez.email = 'your@email.com'  # tell NCBI who you are

# fetch the GenBank flat-file record for one accession
handle = Entrez.efetch(db="nucleotide", id="AB046939", rettype="gb", retmode="text")
result = handle.read().split('\n')
handle.close()

for line in result:
    # to get the organism name
    if 'ORGANISM' in line:
        print(' '.join(line.split()[1:]))

    # if you want the taxid (the number after "taxon:")
    if 'taxon:' in line:
        print(line.split('taxon:')[1].rstrip('"'))
modified 3.9 years ago • written 3.9 years ago by wjidea

I cannot get Biopython. But thank you anyway!

written 3.9 years ago by mollysil
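
If Biopython is not an option, the same lookup can be approximated with only the Python standard library. This is a minimal sketch, assuming the NCBI E-utilities `efetch` endpoint and the same GenBank flat-file fields as the Biopython answer; the helper names are made up here, and the network call is kept in its own function so the parsing can be checked offline.

```python
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, email):
    """Build the efetch URL for one nucleotide record in GenBank text format."""
    params = {"db": "nucleotide", "id": accession,
              "rettype": "gb", "retmode": "text", "email": email}
    return EFETCH + "?" + urllib.parse.urlencode(params)

def parse_genbank(text):
    """Return (organism, taxid) found in a GenBank flat-file record."""
    organism, taxid = None, None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ORGANISM"):
            organism = " ".join(stripped.split()[1:])
        elif '/db_xref="taxon:' in stripped:
            taxid = stripped.split("taxon:")[1].rstrip('"')
    return organism, taxid

def fetch_taxonomy(accession, email):
    """Download one record from NCBI and parse it (needs network access)."""
    with urllib.request.urlopen(efetch_url(accession, email)) as response:
        return parse_genbank(response.read().decode("utf-8"))
```

Calling `fetch_taxonomy("AB046939", "your@email.com")` should return the organism name and taxid for that accession. For many accessions, pause between requests to stay within NCBI's usage limits, or use the local mapping files instead.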