Question: Protein Sequence Alignment Using Blast
0
gravatar for p.s
5.4 years ago by
p.s0
p.s0 wrote:

Hi All,

I am trying to run Blast over protein sequences from two organisms. I downloaded the fasta from NCBI. I am trying to iterate over the list of sequences in the fasta file and do a sequence alignment of each sequence in one file with each sequence in the other. I want to run over local Blast but getting some error, I would greatly appreciate some suggestions.

'''from Bio.Blast.Applications import NcbiblastxCommandline
help(NcbiblastxCommandline)'''

from Bio.Blast.Applications import NcbiblastpCommandline
from StringIO import StringIO
from Bio.Blast import NCBIXML
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
from Bio.Blast import NCBIWWW
import cStringIO

def BlastSeq():
    SC_Fasta = open("sc.fsa","r")
    HS_Fasta = open("hsap.fsa","r")
    blastp = "C:\\Program Files\\NCBI\\blast-2.2.29+\\bin\\blastp"

    record1 = list(SeqIO.parse(SC_Fasta,"fasta"))
    for r1 in record1:
        r1.id
        r1.seq


    record2 = list(SeqIO.parse(HS_Fasta,"fasta"))
    for r2 in record2:
        r2.id
        r2.seq

    for r1 in record1:
        for r2 in record2:            
            output = NcbiblastpCommandline(blastp,query= r1.seq, subject=r2.seq, outfmt=5)()[0]
            blast_result_record = NCBIXML.read(StringIO(output))                   


def main():
    BlastSeq()

main()

Error: Bio.Application.ApplicationError: Command 'C:\Program Files\NCBI\blast-2.2.29+\bin\blastp -outfmt 5 -query    MVKLTSIAAGVAAIAATASATTTLAQSDERVNLVELGVYVSDIRAHLAQYYMFQAAHPTETYPVEVAEAVFNYGDF -subjectHGLQELKAELDAAVLKATGRQILTLRVRLAGAQLSWLYKEATVQEVDVIPEDGAADVRVIISNSAYGKFRKLFPG' returned non-zero exit status -1073741515

I understand that each seq should be passed as individual fasta file,but I don't understand how to proceed.

protein biopython blast • 3.1k views
ADD COMMENTlink modified 5.4 years ago by Zhaorong1.2k • written 5.4 years ago by p.s0
1
gravatar for Peter
5.4 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

The problem is essentially you asked BLAST to do this:

C:\Program Files\NCBI\blast-2.2.29+\bin\blastp -outfmt 5 -query MVKLTSIAAGVAAIAATASATTTLAQSDERVNLVELGVYVSDIRAHLAQYYMFQAAHPTETYPVEVAEAVFNYGDF -subject HGLQELKAELDAAVLKATGRQILTLRVRLAGAQLSWLYKEATVQEVDVIPEDGAADVRVIISNSAYGKFRKLFPG

And that fails (try it directly at the command line). Rather than raw sequence via the command line, you would normally have the sequences on disk in FASTA format, and tell BLAST the filenames.

The larger question is why are you doing lots of pair wise BLAST alignments? Might another tool be better matched? For example, EMBOSS needleall? http://emboss.sourceforge.net/apps/release/6.5/emboss/apps/needleall.html

ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Peter5.8k
0
gravatar for Zhaorong
5.4 years ago by
Zhaorong1.2k
State College, PA
Zhaorong1.2k wrote:

You are right in saying "I understand that each seq should be passed as individual fasta file". The reason is that you have to pass the filename of the fasta file to BLAST, rather than the sequences inside it.

I modified your code just to pass the filenames to BLAST.

from Bio.Blast.Applications import NcbiblastpCommandline
from Bio.Blast import NCBIXML

def BlastSeq():
    SC_Fasta = "sc.fsa"
    HS_Fasta = "hsap.fsa"
    blastp = "C:\\Program Files\\NCBI\\blast-2.2.29+\\bin\\blastp"

    # I would suggest saving the blast output to a file, though you do not have to do this
    blast_output = "blast_output.xml"
    NcbiblastpCommandline(blastp, query=SC_Fasta, subject=HS_Fasta, outfmt=5, out=blast_output)()
    # Now you can read the blast output from the file and process it however you want
    blast_result_records = NCBIXML.parse(open(blast_output))

BlastSeq()

Please note that now all the pairwise alignments between the sequences in "sc.fsa" and "hsap.fsa" are stored in a single blast output file.

In your original code, it seems that you intended to call blastp once for each pair of sequences. You don't need to do this, as BLAST can handle multiple sequences in -query and -subject at the same time (thanks, Peter). However, if you insist on doing this, here is the code just to show you how hard it is.

Finally, if you would like to obtain global alignment between the sequences, I would suggest using EMBOSS needle or needleall. You can also call needle in Biopython if you like. A relevant discussion on local and global alignment can be found Is It Possible To Get Near Global Alignment From Blast?.

ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Zhaorong1.2k
1

Why split the FASTA files? You could try giving BLAST multiple record FASTA files using -query and -subject and it will do pairwise searches (note the e-values will not consider the other sequences, like it would if you used a database search), but the parsing is a little more complicated (loop over the results instead).

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by Peter5.8k

Hi Peter, thank you, you are right. I had the idea in mind that the code could be adapted for using needle rather than BLAST as the alignment tool. I have modified my answers as you suggested.

ADD REPLYlink written 5.4 years ago by Zhaorong1.2k

Nice. For EMBOSS needle, try "needle all" if you want to do a comparison of many versus many.

ADD REPLYlink written 5.4 years ago by Peter5.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1245 users visited in the last hour