How Can I get a list of sequences in fasta format from the Blast output hits?
3
1
Entering edit mode
6.7 years ago
yaadriana7 ▴ 10

Question: How Can I get a list of sequences in fasta format from the Blast output hits?

BLAST sequence blastn output fasta • 3.3k views
ADD COMMENT
1
Entering edit mode

Use a format such as tab-delimited -outfmt 6 (makes it easier to parse IMO), and retrieve the query hit ids you want, an go back to your query FASTA and parse out the sequences that you want based on headers.

ADD REPLY
0
Entering edit mode

Thank you for a suggestion!!!

ADD REPLY
1
Entering edit mode
6.7 years ago

if you're using the blast+ version : use the blastdbcmd utility (comes with the blast installation) to retrieve sequences from the blastDBs , either use '-entry <id>' for a single sequence or '-entry_batch <file>' for a whole list of IDs . You'll still have to first get the list of IDs you want to retrieve as suggested by st.ph.n

ADD COMMENT
1
Entering edit mode
6.7 years ago
anicet.ebou ▴ 170

Another method:

  1. Use option -outfmt 5 in BLAST+ suite command line. This will give you a xml file as output.
  2. Use this little python script to parse rapidly data.

     import re
    
     output = open('file.out','w')
     n = 0
     print >> output, 'protein'+'\t'+'hit_def'+'\t'+'hit_acc'+'\t'+'e-value'
    
     with open('blast_output.xml','r') as xml:
     for line in xml:
        if re.search('<Iteration_query-def>', line) != None:
            n = n + 1
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Iteration_query-def>')
            line = line.rstrip('</')
            query_def = line
        if re.search('No hits found', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Iteration_message>')
            line = line.rstrip('</')
            print n
            print >> output, query_def+'\t'+line
        if re.search('<Hit_def>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Hit_def>')
            line = line.rstrip('</')
            hit_def = line
        if re.search('<Hit_accession>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Hit_accession>')
            line = line.rstrip('</')
            hit_acc = line       
        if re.search('<Hsp_evalue>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Hsp_evalue>')
            line = line.rstrip('</')
            e_val = line
            print n
            print >> output, query_def+'\t'+hit_def+'\t'+hit_acc+'\t'+e_val 
    
     output.close()
    

This step will give you a pretty file with proteins, hit name, acession number of hit and e-value as columns. This file can be also parsed easily.

ADD COMMENT
0
Entering edit mode
6.6 years ago
yaadriana7 ▴ 10

Thanks!

I created something like that and it seems to work (for me:) ).

https://github.com/mmagnus/rna-pdb-tools/tree/master/rna_pdb_tools/utils/rna_seq_search_BLASTn_outfmt-6

Best

ADD COMMENT

Login before adding your answer.

Traffic: 2001 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6