Question: How Can I get a list of sequences in fasta format from the Blast output hits?
0
gravatar for yaadriana7
2.2 years ago by
yaadriana70
yaadriana70 wrote:

Question: How Can I get a list of sequences in fasta format from the Blast output hits?

output blastn sequence blast fasta • 1.2k views
ADD COMMENTlink modified 2.1 years ago • written 2.2 years ago by yaadriana70
1

Use a format such as tab-delimited -outfmt 6 (makes it easier to parse IMO), and retrieve the query hit ids you want, an go back to your query FASTA and parse out the sequences that you want based on headers.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by st.ph.n2.5k

Thank you for a suggestion!!!

ADD REPLYlink written 2.2 years ago by yaadriana70
1
gravatar for lieven.sterck
2.2 years ago by
lieven.sterck7.2k
VIB, Ghent, Belgium
lieven.sterck7.2k wrote:

if you're using the blast+ version : use the blastdbcmd utility (comes with the blast installation) to retrieve sequences from the blastDBs , either use '-entry <id>' for a single sequence or '-entry_batch <file>' for a whole list of IDs . You'll still have to first get the list of IDs you want to retrieve as suggested by st.ph.n

ADD COMMENTlink written 2.2 years ago by lieven.sterck7.2k
1
gravatar for anicet.ebou
2.2 years ago by
anicet.ebou160
anicet.ebou160 wrote:

Another method:

  1. Use option -outfmt 5 in BLAST+ suite command line. This will give you a xml file as output.
  2. Use this little python script to parse rapidly data.

     import re
    
     output = open('file.out','w')
     n = 0
     print >> output, 'protein'+'\t'+'hit_def'+'\t'+'hit_acc'+'\t'+'e-value'
    
     with open('blast_output.xml','r') as xml:
     for line in xml:
        if re.search('<Iteration_query-def>', line) != None:
            n = n + 1
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Iteration_query-def>')
            line = line.rstrip('</')
            query_def = line
        if re.search('No hits found', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Iteration_message>')
            line = line.rstrip('</')
            print n
            print >> output, query_def+'\t'+line
        if re.search('<Hit_def>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Hit_def>')
            line = line.rstrip('</')
            hit_def = line
        if re.search('<Hit_accession>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Hit_accession>')
            line = line.rstrip('</')
            hit_acc = line       
        if re.search('<Hsp_evalue>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Hsp_evalue>')
            line = line.rstrip('</')
            e_val = line
            print n
            print >> output, query_def+'\t'+hit_def+'\t'+hit_acc+'\t'+e_val 
    
     output.close()
    

This step will give you a pretty file with proteins, hit name, acession number of hit and e-value as columns. This file can be also parsed easily.

ADD COMMENTlink written 2.2 years ago by anicet.ebou160
0
gravatar for yaadriana7
2.1 years ago by
yaadriana70
yaadriana70 wrote:

Thanks!

I created something like that and it seems to work (for me:) ).

https://github.com/mmagnus/rna-pdb-tools/tree/master/rna_pdb_tools/utils/rna_seq_search_BLASTn_outfmt-6

Best

ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by yaadriana70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2223 users visited in the last hour