Question: How Can I get a list of sequences in fasta format from the Blast output hits?
0
gravatar for yaadriana7
15 months ago by
yaadriana70
yaadriana70 wrote:

Question: How Can I get a list of sequences in fasta format from the Blast output hits?

output blastn sequence blast fasta • 823 views
ADD COMMENTlink modified 14 months ago • written 15 months ago by yaadriana70
1

Use a format such as tab-delimited -outfmt 6 (makes it easier to parse IMO), and retrieve the query hit ids you want, an go back to your query FASTA and parse out the sequences that you want based on headers.

ADD REPLYlink modified 15 months ago • written 15 months ago by st.ph.n2.4k

Thank you for a suggestion!!!

ADD REPLYlink written 15 months ago by yaadriana70
1
gravatar for lieven.sterck
15 months ago by
lieven.sterck4.5k
VIB, Ghent, Belgium
lieven.sterck4.5k wrote:

if you're using the blast+ version : use the blastdbcmd utility (comes with the blast installation) to retrieve sequences from the blastDBs , either use '-entry <id>' for a single sequence or '-entry_batch <file>' for a whole list of IDs . You'll still have to first get the list of IDs you want to retrieve as suggested by st.ph.n

ADD COMMENTlink written 15 months ago by lieven.sterck4.5k
1
gravatar for anicet.ebou
15 months ago by
anicet.ebou140
anicet.ebou140 wrote:

Another method:

  1. Use option -outfmt 5 in BLAST+ suite command line. This will give you a xml file as output.
  2. Use this little python script to parse rapidly data.

     import re
    
     output = open('file.out','w')
     n = 0
     print >> output, 'protein'+'\t'+'hit_def'+'\t'+'hit_acc'+'\t'+'e-value'
    
     with open('blast_output.xml','r') as xml:
     for line in xml:
        if re.search('<Iteration_query-def>', line) != None:
            n = n + 1
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Iteration_query-def>')
            line = line.rstrip('</')
            query_def = line
        if re.search('No hits found', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Iteration_message>')
            line = line.rstrip('</')
            print n
            print >> output, query_def+'\t'+line
        if re.search('<Hit_def>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Hit_def>')
            line = line.rstrip('</')
            hit_def = line
        if re.search('<Hit_accession>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Hit_accession>')
            line = line.rstrip('</')
            hit_acc = line       
        if re.search('<Hsp_evalue>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('<Hsp_evalue>')
            line = line.rstrip('</')
            e_val = line
            print n
            print >> output, query_def+'\t'+hit_def+'\t'+hit_acc+'\t'+e_val 
    
     output.close()
    

This step will give you a pretty file with proteins, hit name, acession number of hit and e-value as columns. This file can be also parsed easily.

ADD COMMENTlink written 15 months ago by anicet.ebou140
0
gravatar for yaadriana7
14 months ago by
yaadriana70
yaadriana70 wrote:

Thanks!

I created something like that and it seems to work (for me:) ).

https://github.com/mmagnus/rna-pdb-tools/tree/master/rna_pdb_tools/utils/rna_seq_search_BLASTn_outfmt-6

Best

ADD COMMENTlink modified 14 months ago • written 14 months ago by yaadriana70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1678 users visited in the last hour