Question: Compare Two Protein Sequences Using Local Blast
gravatar for satsurae
9.3 years ago by
satsurae120 wrote:


I have been given a task to compare the all the protein sequences of a strain of campylobacter with a strain of E.coli. I would like to do this locally using Biopython and the inbuilt Blast tools. However, I'm stuck on how to program this and what tools I should be using. If anybody could point me in the right direction, I would be thankful!


biopython blast • 7.8k views
ADD COMMENTlink modified 5.5 years ago by Biostar ♦♦ 20 • written 9.3 years ago by satsurae120

@satsurae It seems that you have found your answer! :) For the benefice of the BioStar users, would you consider accepting one of the proposed answers? Cheers.

ADD REPLYlink written 9.3 years ago by Eric Normandeau10k
gravatar for brentp
9.3 years ago by
Salt Lake City, UT
brentp23k wrote:

hi, if you're having trouble, maybe it'd be wiser to do the simplest analysis--just do an all-vs-all blastp directly using the command-line-interface to blast. For example use E.coli as the subject and the other species as the query and a command like:

formatdb -p T -i ecoli.fa
blastall -d ecoli.fa -i other.fa -p blastn -m 8 -e 0.01 -o ecoli_other.blast

Using the -m 8 option, you can parse the tab-delimited blast file with few lines of code. This assumes you have the protein fasta files in hand--the are probably available from ncbi or your favorite bacteria genome repository.

ADD COMMENTlink modified 10 months ago by RamRS22k • written 9.3 years ago by brentp23k

i didn't know BLAST had a blastall commandline argument ...I'll have to remember that for later.

ADD REPLYlink written 9.3 years ago by Will4.5k

This is an excellent solution - simple and effective.

ADD REPLYlink written 9.3 years ago by Neilfws48k

Except for the yeast genome part, since these are bacteria :-)

ADD REPLYlink written 9.3 years ago by Neilfws48k

I agree, this seems like the best way to accomplish your goal - don't overcomplicate with BioPython where it simply isn't necessary. By the way, you can grab the sequences you need from UniProt: search for "organism:$tax_id AND keyword:"Complete proteome"" where $tax_id is the taxonomy identifier of the strain you're interested in (so E. coli K12 is 83333 and C jejuni is 197). Though this method does rely on the curation levels of UniProt, which may not be so good for organisms a little off the beaten track.

ADD REPLYlink written 9.3 years ago by Simon Cockell7.3k

Thanks very much for your help!

ADD REPLYlink written 9.3 years ago by satsurae120

neilfws, i changed "yeast" to "bacteria". thanks.

ADD REPLYlink written 9.3 years ago by brentp23k

Small stuff: if possible move tabulated blast output into an SQL database. You will save time needed to write the code iterating over text columns, say for finding best reciprocal hits (BRH).

ADD REPLYlink written 9.3 years ago by Darked894.2k

Hi guys, thanks for all the help. However, after using the above formatdb line, I get a fatal error saying that the database was not found when using the blastall code. I think there is a problem in the indexing of the fasta to a 'database'. I wouldn't have thought I would need to index both database and query files. Is there another way to index the fasta to a db? Cheers!

ADD REPLYlink written 9.3 years ago by satsurae120

Scratch that! Got it working using the pretty well documented program parameters on the ncbi website:

ADD REPLYlink written 9.3 years ago by satsurae120
gravatar for Will
9.3 years ago by
United States
Will4.5k wrote:

The BLAST section of the BioPython module is not terribly well documented. The relevant section is here.

Before starting you'll need to create a local BLAST database. To my knowledge there is no way to do this directly through BioPython (but you could use the Subprocess module to automate the commandline if you really wanted too). The documentation for the BLAST tool is on the NCBI website here.

You'll obviously need to download the whole genome sequences. I suggest using the NCBI FTP site, I can never seem to find the right link in the normal webportal.

Once you have all of the relevant downloads and databases created you'll simply need to run the BLAST query in a loop that processes all of the data. Something like this should work (I don't have the required data to test this but it should get you 95% of the way there.)

from Bio.Blast.Applications import NcbiblastxCommandline
from Bio.Blast import NCBIXML
from Bio import SeqIO
import subprocess

SOURCE_FASTA = '/path/to/source/seq.fasta'
DATABASE = 'databsename' #should be in the path but YMMV

with open(SOURCE_FASTA) as inhandle:
    for seq in SeqIO.parse(handle, 'fasta'):
        with open('scratch.fasta', 'w') as outhandle:
            #write a scratch file
            SeqIO.write(seq, outhandle, 'fasta')

        #create the commandline string
        cline = NcbiblastxCommandline(query='scratch.fasta', 
                    db=DATABASE, evalue=0.001, outfmt=5, out="scratch.xml")

        #actually run BLAST
        return_code =

        if return_code == 0:
            #if it was successful then process it
            with open('scratch.xml') as xmlhandle:
                blast_record =


Hope that helps.

ADD COMMENTlink modified 10 months ago by RamRS22k • written 9.3 years ago by Will4.5k
gravatar for Neilfws
9.3 years ago by
Sydney, Australia
Neilfws48k wrote:

If the genomes are public, someone may have done this analysis for you, or there may be a web-based tool to do the job.

For example, here is a genome comparison tool at the JCVI/CMR -

The JGI Integrated Microbial Genome system is also very good - It allows you to select multiple genomes for various comparative analyses. I believe they may even have all versus all data buried away in their FTP archive somewhere. However, take some time to get used to the site - it has quite a steep learning curve.

ADD COMMENTlink written 9.3 years ago by Neilfws48k
gravatar for Nicojo
9.3 years ago by
Kyoto, Japan
Nicojo1.1k wrote:

The others have answered your question!

I just want to make a suggestion, if you are interested in visually comparing the two genomes, I highly recommend the Artemis and ACT (Artemis Comparison Tool) from the Sanger institute.

If you have the full genome, then you can easily visualize all the reading frames. With ACT, you use the two genome sequences (or one genome and a fasta file of genes/proteins or even two gene/protein fasta files) as well as a "comparison file", which is a blast output file.

It is very useful for annotation and curation!

ADD COMMENTlink written 9.3 years ago by Nicojo1.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1886 users visited in the last hour