Hello all! I have a list of pairs of proteins and I want to compare speed and accuracy of "BLAST Two Sequences" to a Smith-Waterman program for alignment. I know there is a "Blast Two Sequences" option on NCBI website, but I would like to run it from a python script. Perhaps Biopython has this capability? If I cannot use Blast Two Sequences, I will compare different versions of Smith-Waterman, but this would not be nearly as exciting :) OR, if anyone has another idea for a great senior year project in Bioinformatics involving comparing pairs of proteins, please don't hesitate to let me know！ Thank you in advance.
Biopython includes the Bio.Blast.Applications module which provides wrappers for the NCBI blast tools. The Biopython Tutorial provides some examples on how to use them, and the BLAST chapter of the Biopython manual has more information. If you don't specify an output file, the alignment output will come back to you as a string in the standard output of the program. You should specify
outfmt=5 as one of the arguments so that your command returns XML output, which Biopython can parse (see below) You can use Python's StringIO module to convert the string to a filehandle, which you can parse using Bio.Blast.NCBIXML (alternatively, you can just specify
out="blast_results.xml" in the command to write the output to a file and then use
open("blast_results.xml", "r") to get a filehandle for reading the output). Again, see the BLAST chapter of the Biopython manual for examples of this.
#!/usr/bin/env python from Bio.Blast.Applications import NcbiblastpCommandline from StringIO import StringIO from Bio.Blast import NCBIXML from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio import SeqIO # Create two sequence files seq1 = SeqRecord(Seq("FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF"), id="seq1") seq2 = SeqRecord(Seq("FQTWEEFSRAEKLYLADPMKVRVVLRYRHVDGNLCIKVTDDLICLVYRTDQAQDVKKIEKF"), id="seq2") SeqIO.write(seq1, "seq1.fasta", "fasta") SeqIO.write(seq2, "seq2.fasta", "fasta") # Run BLAST and parse the output as XML output = NcbiblastpCommandline(query="seq1.fasta", subject="seq2.fasta", outfmt=5)() blast_result_record = NCBIXML.read(StringIO(output)) # Print some information on the result for alignment in blast_result_record.alignments: for hsp in alignment.hsps: print '****Alignment****' print 'sequence:', alignment.title print 'length:', alignment.length print 'e value:', hsp.expect print hsp.query print hsp.match print hsp.sbjct
For doing a full Smith-Waterman alignment, you can use Bio.Emboss.Applications.WaterCommandline to run the alignment, and use Bio.AlignIO to read the result. Or, as mentioned in another answer, you can use the
use_sw_tback=True option for blastp.
EDIT: I forgot to mention that the code here uses the NCBI Blast+ suite, which has separate commands for blastn, blastp, and son on, and not the old all-in-one blastall program.
When you put your two sequences into two separate fasta files, you can use the
-subject parameters to blast these files against each other (BLAST+). Also, be advised that BLAST has the option to do a full Smith-Waterman alignment with the
-use_sw_tback parameter. (Here's BLAST's manual.)
However, you say you want to measure speed. Speed for comparing two sequences against each other is normally not important, but rather the speed of comparing many sequences against each other. For two sequences you probably mainly measure the time to initialize the program than the actual alignment. Lastly, there are many optimized SW implementations, be sure to search for them.