Question: How Can I get a list of sequences in fasta format from the Blast output hits?
Question: How Can I get a list of sequences in fasta format from the Blast output hits?
if you're using the blast+ version : use the blastdbcmd utility (comes with the blast installation) to retrieve sequences from the blastDBs , either use '-entry <id>' for a single sequence or '-entry_batch <file>' for a whole list of IDs . You'll still have to first get the list of IDs you want to retrieve as suggested by st.ph.n
Another method:
-outfmt 5 in BLAST+ suite command line. This will give you a xml file as output.Use this little python script to parse rapidly data.
 import re
 output = open('file.out','w')
 n = 0
 print >> output, 'protein'+'\t'+'hit_def'+'\t'+'hit_acc'+'\t'+'e-value'
 with open('blast_output.xml','r') as xml:
 for line in xml:
    if re.search('<Iteration_query-def>', line) != None:
        n = n + 1
        line = line.strip()
        line = line.rstrip()
        line = line.strip('<Iteration_query-def>')
        line = line.rstrip('</')
        query_def = line
    if re.search('No hits found', line) != None:
        line = line.strip()
        line = line.rstrip()
        line = line.strip('<Iteration_message>')
        line = line.rstrip('</')
        print n
        print >> output, query_def+'\t'+line
    if re.search('<Hit_def>', line) != None:
        line = line.strip()
        line = line.rstrip()
        line = line.strip('<Hit_def>')
        line = line.rstrip('</')
        hit_def = line
    if re.search('<Hit_accession>', line) != None:
        line = line.strip()
        line = line.rstrip()
        line = line.strip('<Hit_accession>')
        line = line.rstrip('</')
        hit_acc = line       
    if re.search('<Hsp_evalue>', line) != None:
        line = line.strip()
        line = line.rstrip()
        line = line.strip('<Hsp_evalue>')
        line = line.rstrip('</')
        e_val = line
        print n
        print >> output, query_def+'\t'+hit_def+'\t'+hit_acc+'\t'+e_val 
 output.close()
This step will give you a pretty file with proteins, hit name, acession number of hit and e-value as columns. This file can be also parsed easily.
Thanks!
I created something like that and it seems to work (for me:) ).
Best
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Use a format such as tab-delimited
-outfmt 6(makes it easier to parse IMO), and retrieve the query hit ids you want, an go back to your query FASTA and parse out the sequences that you want based on headers.Thank you for a suggestion!!!