Question

Compare Two Protein Sequences Using Local Blast

4

Entering edit mode

14.1 years ago

satsurae ▴ 120

Hi,

I have been given a task to compare the all the protein sequences of a strain of campylobacter with a strain of E.coli. I would like to do this locally using Biopython and the inbuilt Blast tools. However, I'm stuck on how to program this and what tools I should be using. If anybody could point me in the right direction, I would be thankful!

Cheers

biopython blast • 11k views

ADD COMMENT • link updated 5 months ago by Ram 43k • written 14.1 years ago by satsurae ▴ 120

0

Entering edit mode

@satsurae It seems that you have found your answer! :) For the benefice of the BioStar users, would you consider accepting one of the proposed answers? Cheers.

ADD REPLY • link 14.1 years ago by Eric Normandeau 11k

Ram · Answer 1 · 2010-03-16

11

Entering edit mode

14.1 years ago

brentp 24k

Hi, if you're having trouble, maybe it'd be wiser to do the simplest analysis -- just do an all-vs-all blastp directly using the command-line-interface to blast. For example use E.coli as the subject and the other species as the query and a command like:

formatdb -p T -i ecoli.fa
blastall -d ecoli.fa -i other.fa -p blastn -m 8 -e 0.01 -o ecoli_other.blast

Using the -m 8 option, you can parse the tab-delimited blast file with few lines of code. This assumes you have the protein fasta files in hand -- the are probably available from ncbi or your favorite bacteria genome repository.

ADD COMMENT • link updated 5 months ago by Ram 43k • written 14.1 years ago by brentp 24k

0

Entering edit mode

i didn't know BLAST had a blastall commandline argument ...I'll have to remember that for later.

ADD REPLY • link 14.1 years ago by Will 4.5k

0

Entering edit mode

This is an excellent solution - simple and effective.

ADD REPLY • link 14.1 years ago by Neilfws 49k

0

Entering edit mode

Except for the yeast genome part, since these are bacteria :-)

ADD REPLY • link 14.1 years ago by Neilfws 49k

0

Entering edit mode

I agree, this seems like the best way to accomplish your goal - don't overcomplicate with BioPython where it simply isn't necessary. By the way, you can grab the sequences you need from UniProt: search for "organism:$tax_id AND keyword:"Complete proteome"" where $tax_id is the taxonomy identifier of the strain you're interested in (so E. coli K12 is 83333 and C jejuni is 197). Though this method does rely on the curation levels of UniProt, which may not be so good for organisms a little off the beaten track.

ADD REPLY • link 14.1 years ago by Simon Cockell 7.4k

0

Entering edit mode

Thanks very much for your help!

ADD REPLY • link 14.1 years ago by satsurae ▴ 120

0

Entering edit mode

neilfws, i changed "yeast" to "bacteria". thanks.

ADD REPLY • link 14.1 years ago by brentp 24k

0

Entering edit mode

Small stuff: if possible move tabulated blast output into an SQL database. You will save time needed to write the code iterating over text columns, say for finding best reciprocal hits (BRH).

ADD REPLY • link 14.1 years ago by Darked89 4.6k

0

Entering edit mode

Hi guys, thanks for all the help. However, after using the above formatdb line, I get a fatal error saying that the database was not found when using the blastall code. I think there is a problem in the indexing of the fasta to a 'database'. I wouldn't have thought I would need to index both database and query files. Is there another way to index the fasta to a db? Cheers!

ADD REPLY • link 14.1 years ago by satsurae ▴ 120

0

Entering edit mode

Scratch that! Got it working using the pretty well documented program parameters on the ncbi website: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/

ADD REPLY • link updated 5 months ago by Ram 43k • written 14.1 years ago by satsurae ▴ 120

Ram · Answer 2 · 2010-03-16

The BLAST section of the BioPython module is not terribly well documented. The relevant section is here.

Before starting you'll need to create a local BLAST database. To my knowledge there is no way to do this directly through BioPython (but you could use the Subprocess module to automate the command line if you really wanted too). The documentation for the BLAST tool is on the NCBI website here.

You'll obviously need to download the whole genome sequences. I suggest using the NCBI FTP site, I can never seem to find the right link in the normal web-portal.

Once you have all of the relevant downloads and databases created you'll simply need to run the BLAST query in a loop that processes all of the data. Something like this should work (I don't have the required data to test this but it should get you 95% of the way there.)

from Bio.Blast.Applications import NcbiblastxCommandline
from Bio.Blast import NCBIXML
from Bio import SeqIO
import subprocess

SOURCE_FASTA = '/path/to/source/seq.fasta'
DATABASE = 'databsename' #should be in the path but YMMV

with open(SOURCE_FASTA) as inhandle:
    for seq in SeqIO.parse(handle, 'fasta'):
        with open('scratch.fasta', 'w') as outhandle:
            #write a scratch file
            SeqIO.write(seq, outhandle, 'fasta')

        #create the commandline string
        cline = NcbiblastxCommandline(query='scratch.fasta', 
                    db=DATABASE, evalue=0.001, outfmt=5, out="scratch.xml")

        #actually run BLAST
        return_code = subprocess.call(str(cline))

        if return_code == 0:
            #if it was successful then process it
            with open('scratch.xml') as xmlhandle:
                blast_record = NCBIXML.read(xmlhandle)

                do_something_with_results(blast_record)

Hope that helps.

Ram · Answer 3 · 2010-03-16

If the genomes are public, someone may have done this analysis for you, or there may be a web-based tool to do the job.

For example, here is a genome comparison tool at the JCVI/CMR.

The JGI Integrated Microbial Genome system is also very good. It allows you to select multiple genomes for various comparative analyses. I believe they may even have all versus all data buried away in their FTP archive somewhere. However, take some time to get used to the site - it has quite a steep learning curve.

Ram · Answer 4 · 2010-03-16

The others have answered your question!

I just want to make a suggestion, if you are interested in visually comparing the two genomes, I highly recommend the Artemis and ACT (Artemis Comparison Tool) from the Sanger institute.

If you have the full genome, then you can easily visualize all the reading frames. With ACT, you use the two genome sequences (or one genome and a fasta file of genes/proteins or even two gene/protein fasta files) as well as a "comparison file", which is a blast output file. It is very useful for annotation and curation!