How Can I get a list of sequences in fasta format from the Blast output hits?

How Can I get a list of sequences in fasta format from the Blast output hits?

1

Entering edit mode

7.8 years ago

yaadriana7 ▴ 10

Question: How Can I get a list of sequences in fasta format from the Blast output hits?

BLAST sequence blastn output fasta • 4.2k views

ADD COMMENT • link 7.7 years ago by yaadriana7 ▴ 10

1

Entering edit mode

Use a format such as tab-delimited -outfmt 6 (makes it easier to parse IMO), and retrieve the query hit ids you want, an go back to your query FASTA and parse out the sequences that you want based on headers.

ADD REPLY • link 7.8 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Thank you for a suggestion!!!

ADD REPLY • link 7.8 years ago by yaadriana7 ▴ 10

1

Entering edit mode

7.8 years ago

lieven.sterck 16k

if you're using the blast+ version : use the blastdbcmd utility (comes with the blast installation) to retrieve sequences from the blastDBs , either use '-entry <id>' for a single sequence or '-entry_batch <file>' for a whole list of IDs . You'll still have to first get the list of IDs you want to retrieve as suggested by st.ph.n

ADD COMMENT • link 7.8 years ago by lieven.sterck 16k

1

Entering edit mode

7.8 years ago

anicet.ebou ▴ 170

Another method:

Use option -outfmt 5 in BLAST+ suite command line. This will give you a xml file as output.

Use this little python script to parse rapidly data.

 import re

 output = open('file.out','w')
 n = 0
 print >> output, 'protein'+'\t'+'hit_def'+'\t'+'hit_acc'+'\t'+'e-value'

 with open('blast_output.xml','r') as xml:
 for line in xml:
    if re.search('<Iteration_query-def>', line) != None:
        n = n + 1
        line = line.strip()
        line = line.rstrip()
        line = line.strip('<Iteration_query-def>')
        line = line.rstrip('</')
        query_def = line
    if re.search('No hits found', line) != None:
        line = line.strip()
        line = line.rstrip()
        line = line.strip('<Iteration_message>')
        line = line.rstrip('</')
        print n
        print >> output, query_def+'\t'+line
    if re.search('<Hit_def>', line) != None:
        line = line.strip()
        line = line.rstrip()
        line = line.strip('<Hit_def>')
        line = line.rstrip('</')
        hit_def = line
    if re.search('<Hit_accession>', line) != None:
        line = line.strip()
        line = line.rstrip()
        line = line.strip('<Hit_accession>')
        line = line.rstrip('</')
        hit_acc = line       
    if re.search('<Hsp_evalue>', line) != None:
        line = line.strip()
        line = line.rstrip()
        line = line.strip('<Hsp_evalue>')
        line = line.rstrip('</')
        e_val = line
        print n
        print >> output, query_def+'\t'+hit_def+'\t'+hit_acc+'\t'+e_val 

 output.close()

This step will give you a pretty file with proteins, hit name, acession number of hit and e-value as columns. This file can be also parsed easily.

ADD COMMENT • link 7.8 years ago by anicet.ebou ▴ 170

0

Entering edit mode

7.7 years ago

yaadriana7 ▴ 10

Thanks!

I created something like that and it seems to work (for me:) ).

https://github.com/mmagnus/rna-pdb-tools/tree/master/rna_pdb_tools/utils/rna_seq_search_BLASTn_outfmt-6

Best

ADD COMMENT • link 7.7 years ago by yaadriana7 ▴ 10

Login before adding your answer.

Similar Posts

Loading Similar Posts

Traffic: 6569 users visited in the last hour

Content Search
Users
Tags
Badges

Help About
FAQ

Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the

version 2.3.6