1
0
4.8 years ago
mollysil • 0

Hi,

I am trying to BLAST many reads against the MaarjAM database (http://maarjam.botany.ut.ee/?action=sTax), a database strictly for arbuscular mycorrhizal fungi sequences. I was able to download the FASTA sequences from the database into a text file and then converted that into a blast-able database using "makeblastdb -in MaarjAM_18s_seq.txt -out MaarjAMdb -dbtype nucl". The main problem is that the headers in the original file are not useful. They look like this:

>gb|AB076274_2004_Saito,_M._GlAc2.1_VTX00166
GGGACATCATGTCGGTCGTGCCTCGGTACGTACTGGTATTGTTGGTTTCTCCCTTCTGACGAACCATGATGTCATTTATT
TGGTGTTGTGGGGAATCAGGACTGTTACTTTGAAAA
>gb|LN620567_2015_Davison,_J._sp._VTX00311
AGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTCGGGGTC
AGTAGATTGGTCGTGCCACTGGTACGTACTGGTCTTACTGATTCCTCCCTCCTGATGAACTGTAATGCCATTAAT


The headers list out the publication information for the sequence instead of having useful information like the Genus and species names (which is what I need it to say!). I use Blast+ for BLASTing against the NCBI database and it works fine. But using Blast+ for this database is not giving me taxon assignments. Instead, I get alignments with no assignments. Any ideas for fixing this problem easily, without much in the way of new software or package downloads? I am using Linux and I have been using MEGAN to import the blast files. I need to change the headers to provide taxonomic information OR figure out another way to get proper taxon assignments!

Thanks! Molly

fungi AMF MaarjAM headers Blast+ • 1.5k views
1
4.8 years ago
wjidea ▴ 50

It seems like the last part of the sequence header could lead you to a taxid in NCBI GenBank.

My solution: parse fasta -> last part of your header (VTX00166) -> search entrez (e.g., API in biopython) -> get taxon id -> translate taxon id using taxdump -> get taxonomy info -> modify original fasta file
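As a rough illustration of the first step in that pipeline, here is a minimal sketch that pulls the pieces out of a header using only the Python standard library. The function name `parse_maarjam_header` is made up for this example, and it assumes every header follows the pattern of the two shown in the question (a `gb|` prefix, underscore-separated fields, accession first, VTX code last, and no underscore inside the accession itself).

```python
def parse_maarjam_header(header):
    """Return (accession, vtx) from a MaarjAM-style FASTA header line.

    Assumes the layout seen in the question, e.g.
    >gb|AB076274_2004_Saito,_M._GlAc2.1_VTX00166
    """
    body = header.lstrip(">").strip()
    # Keep whatever follows the last "|" (drops the "gb" prefix if present).
    _, _, rest = body.rpartition("|")
    fields = rest.split("_")
    accession = fields[0]
    # Only report the last field when it really looks like a VTX code.
    vtx = fields[-1] if fields[-1].startswith("VTX") else None
    return accession, vtx


def read_headers(lines):
    """Parse every header line (those starting with '>') in a FASTA file."""
    return [parse_maarjam_header(line) for line in lines if line.startswith(">")]


example = [
    ">gb|AB076274_2004_Saito,_M._GlAc2.1_VTX00166",
    "GGGACATCATGTCGGTCGTGCCTCGGTACGTACTGG",
    ">gb|LN620567_2015_Davison,_J._sp._VTX00311",
    "AGCAGCCGCGGTAATTCCAGCTCCAATAGCG",
]
for accession, vtx in read_headers(example):
    print(accession, vtx)
```

For a real file, `read_headers(open("MaarjAM_18s_seq.txt"))` would give the list of identifiers to look up.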

Hope it helps.

Edit1:

If you have a large sequence file to query, consider downloading the GI-to-taxid mapping files from ftp://ftp.ncbi.nih.gov/pub/taxonomy/ instead. You will then need to parse and query them on your local machine.

0

It's actually the first part of the sequence ID that represents the NCBI sequence accession number. So how would I go about matching up the first part of the headers with the NCBI database taxon information? I am able to remove the latter part of the header so it looks like:

gb|AB046938
TGAAACTGCTAATGGCTCATTAA

gb|AB046939
TGAAACTGCTAGGGGCTCATTAA

Do you know any scripts to make the relationship between these sequence IDs and the NCBI taxon information? And then change the headers to reflect taxon information?
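One possible shape for the header-rewriting part, sketched with the standard library only. It assumes you already have a lookup table from accession to taxon name (however you obtained it) and that header lines start with `>`. The function name `relabel_fasta` and the taxon labels in the example are placeholders, not real assignments.

```python
def relabel_fasta(lines, names):
    """Rewrite FASTA headers as '>accession taxon_name'.

    `names` maps an accession (e.g. 'AB046938') to a taxon label.
    Headers whose accession is not in `names` are left unchanged.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # ">gb|AB046938_..." -> "AB046938"
            accession = line[1:].split("|")[-1].split("_")[0]
            label = names.get(accession)
            if label:
                line = ">" + accession + " " + label
        out.append(line)
    return out


# Placeholder labels for illustration only, not real taxonomy.
names = {"AB046938": "Genus_a species_a"}
records = [
    ">gb|AB046938",
    "TGAAACTGCTAATGGCTCATTAA",
    ">gb|AB046939",
    "TGAAACTGCTAGGGGCTCATTAA",
]
print("\n".join(relabel_fasta(records, names)))
```

The relabelled lines can then be written to a new file and passed to `makeblastdb` in place of the original.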

0

Here is one biopython solution:

from Bio import Entrez

Entrez.email = 'your@email.com'  # tell NCBI who you are

# fetch the GenBank flat-file record for one accession
handle = Entrez.efetch(db="nucleotide", id="AB046939", rettype="gb", retmode="text")

for line in handle:
    # to get taxonomy (the organism name)
    if 'ORGANISM' in line:
        print(' '.join(line.split()[1:]))

    # if you want the taxid (printed as "taxon:<id>")
    if 'taxon:' in line:
        print(line.split('"')[1])

handle.close()

0

I cannot get Biopython. But thank you anyway!
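If Biopython is not an option, the same lookup can probably be done with the Python standard library alone, since `Entrez.efetch` is a wrapper around NCBI's E-utilities web service. The sketch below is untested against a live server; the function names are made up for this example, and it assumes the GenBank text record contains an `ORGANISM` line and a `/db_xref="taxon:..."` qualifier, as the Biopython snippet above also assumes.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build the E-utilities URL for one GenBank text record."""
    params = {"db": "nucleotide", "id": accession,
              "rettype": "gb", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_genbank_text(text):
    """Return (organism, taxid) from a GenBank flat-file record."""
    organism = None
    taxid = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ORGANISM") and organism is None:
            organism = " ".join(stripped.split()[1:])
        elif 'db_xref="taxon:' in stripped and taxid is None:
            taxid = stripped.split("taxon:")[1].rstrip('"')
    return organism, taxid


def fetch_taxonomy(accession):
    """Download one record and extract organism and taxid (needs network)."""
    with urlopen(efetch_url(accession)) as response:
        return parse_genbank_text(response.read().decode("utf-8"))


# Example call (requires internet access):
# organism, taxid = fetch_taxonomy("AB046939")
```

When looping over many accessions, add a short `time.sleep` between requests, because NCBI asks users to throttle their queries.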