Question: How Can I Automate Blast Of >1 Sequence And Output The Top Hits For Each Input?
12
gravatar for bow
8.6 years ago by
bow780
Netherlands
bow780 wrote:

I'm looking for a way to do BLAST simultaneously with >1 sequence. The input is several fasta files or one file containing several sequence formatted in fasta. For an initial testing, I want to try obtaining the top 10 blast results for each input sequence and output them to a text file (one text file for each input sequence, or one big file containing all). But currently, I'm drawing a blank. Is there a way to do this with (preferably) biopython?

python biopython blast • 16k views
ADD COMMENTlink modified 3.3 years ago by mirza80 • written 8.6 years ago by bow780
1

There's an easy way to do this with bash, if all you need is the information from the top 1-10 hits.

ADD REPLYlink written 8.6 years ago by Paulo Nuin3.7k
9
gravatar for brentp
8.6 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

you can do this quite easily via the commandline. assuming your query and subject sequences are not protein. here's a shell script, just adjust your QUERY and SUBJECT

QUERY=some.fasta
SUBJECT=/path/to/other.fasta

formatdb -p F -i $SUBJECT
blastall -p blastn -i $QUERY -d $SUBJECT -K 10 -m 8 -a 8 > blast.out

the -K 10 keeps only the 10 best hits -m 8 makes the output in tab-delimited format and -a 8 uses 8 processors. so blast.out will contain (up to) 10 best hits for each query.

ADD COMMENTlink written 8.6 years ago by brentp23k
1

Ah, found the problem! It should be '-b' and '-K' :).

ADD REPLYlink written 8.6 years ago by bow780

I tried this one too for a local search. It worked fine, except for the '-K' switch. I tried various values (1,5,10), but it keeps giving me plenty (~ >50) of results. Any pointers on where I might be doing wrong?

ADD REPLYlink written 8.6 years ago by bow780

Hello bow and brentp

I used the above commands for local blast and it worked wonderfully, I just needed to replace -K with -b. Thank you. However, I have the following questions about this commands,

"blastall -p blastn -i $QUERY -d $SUBJECT -b 10 -m 8 -a 8 > blast.out"

1. The result does not show any headings for the columns, e-value & score columns r easily identifiable but I don't understand the other results. How can I get the headings?

2. It doesn't give the positives in percent, how to get it?

ADD REPLYlink written 3.2 years ago by mirza80
6
gravatar for Niek De Klein
8.6 years ago by
Niek De Klein2.5k
Netherlands
Niek De Klein2.5k wrote:

If you want to do it over the internet using blastn and multiple fasta files it would look something like this:

from Bio.Blast import  NCBIWWW
from Bio.Blast import NCBIXML
file_string = ""    
x = 1

fasta_files = [fasta1, fasta2, fasta3, fasta4]
for i in fasta_files:
    File = open("output"+x+".txt","w")
    fasta_string = open(i+".fasta").read() #or make the names fasta1.fasta and just do open(i).read
    result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, hitlist_size=10)

    blast_records = NCBIXML.parse(result_handle) 
    #or blast_record = NCBIXML.read(result_handle) if you only have one seq in file
    E_VALUE_THRESH = 0.001
    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < E_VALUE_THRESH:
                   file_string += "alignment:",alignment.title+"\n"
                   file_string += "e-value:",hsp.expect+"\n"
    x += 1
    File.write(file_string)

Not sure if this was what you meant and I haven't tested it so may be some little faults in it, but it should take several different files of fasta sequences and write the results as a file as output1.txt, output2.txt etc. For multiple sequences in one file you can use the blast_records part

Hope this helps.

ADD COMMENTlink modified 8.6 years ago • written 8.6 years ago by Niek De Klein2.5k

Thanks! I changed it a bit (File = open("output"+str(x)+".txt","w")) and it works! I'll try tweaking on my own to make it accept a single file with multiple fasta sequence in it.

ADD REPLYlink written 8.6 years ago by bow780

Hi, where you able to make it accept a single file with multiple fasta sequences in it? I really need help with this, thanks

ADD REPLYlink written 6.4 years ago by Powerpuffgirl0

I suggest using the 7.3 Parsing BLAST output chapter of biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc80

ADD REPLYlink written 8.6 years ago by Niek De Klein2.5k

I suggest using the "7.3 Parsing BLAST output" chapter of biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc80.

ADD REPLYlink written 8.6 years ago by Niek De Klein2.5k

Hi! I was wondering, will this work using protein sequences just by changing 'blastn' to 'blastp'?

ADD REPLYlink written 2.9 years ago by chups51910
2
gravatar for Pierre Lindenbaum
8.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

As far as I remember, the NCBI blastall command takes any number of sequences for its input. In http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/blastall_node26.html

The query must be in FASTA format. If multiple entries are in the input file, all queries will be searched.

Then, there are many engines/parsers for digesting the BLAST output . On my side, I would use the XML output, insert the results in a database via XSLT and select the 10 best hits.

ADD COMMENTlink modified 8.6 years ago • written 8.6 years ago by Pierre Lindenbaum119k

update: yes Paulo is right. There is an easier solution using bash and the tab-delimited output.

ADD REPLYlink written 8.6 years ago by Pierre Lindenbaum119k

I'm not yet familiar with XML or even XSLT, but thanks! It looks like I won't be running out of self-learning materials for the near future :).

ADD REPLYlink written 8.6 years ago by bow780
0
gravatar for mirza
3.3 years ago by
mirza80
India
mirza80 wrote:

Hello bow and brentp

I used the above commands for local blast and it worked wonderfully, I just needed to replace -K with -b. Thank you. However, I have the following questions about this commands,

"blastall -p blastn -i $QUERY -d $SUBJECT -b 10 -m 8 -a 8 > blast.out"

1. The result does not show any headings for the columns, e-value & score columns r easily identifiable but I don't understand the other results. How can I get the headings?

2. It doesn't give the positives in percent, how to get it?

 

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by mirza80

@mirzabiotech2000:

  1. Column names can be found here for the outfmt=6: http://www.drive5.com/usearch/manual/blast6out.html More info: http://www.ncbi.nlm.nih.gov/books/NBK279675/

  2. In outfmt 6 you can see the identity/accuracy as a percentage in field 3. Is this what you are after?

ADD REPLYlink written 3.1 years ago by amblina70

thank you amblina for your reply. I will go through the link.

ADD REPLYlink written 3.1 years ago by mirza80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 756 users visited in the last hour