Question

Running Blastp From Biopython And Only Looking For Exact Matches

0

Entering edit mode

11.4 years ago

Hmm ▴ 500

I have protein sequences that i split at every 'R' and convert them to peptides. Currently it searches for perfect and non perfect alignments. What i wanted to accomplish is:

1) Rather than searching one by one, is there a way to batch all these peptides and send them?

2) Find perfect match only

3) If i don't find the peptide, i want to log to the output file that the peptide sequence was not found.

This is my code so far:

from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
mysplitseqs=[]
minPeptideLength=5
myEntrezQuery = "Homo sapiens[Organism]"

E_VALUE_THRESH = 0.04

with open("outfile.txt","w") as f:
        for seq_record in SeqIO.parse("my.fasta", "fasta"):
                mysplitseqs= seq_record.seq.split('R')
                f.writestrseq_record.id) + "\n")
                for peptide in mysplitseqs:
                    if len(peptide) >minPeptideLength:
                        result_handle = NCBIWWW.qblast("blastp", "nr", peptide,"-e 10000",entrez_query=myEntrezQuery)
                        blast_record = NCBIXML.read(result_handle)
                        for alignment in blast_record.alignments:
                            for hsp in alignment.hsps:
                                if hsp.expect < E_VALUE_THRESH:
                                    print '****Alignment****'
                                    print 'sequence:', alignment.title
                                    print 'length:', alignment.length
                                    print 'e value:', hsp.expect
                                    print hsp.query[0:75] + '...'
                                    print hsp.match[0:75] + '...'
                                    print hsp.sbjct[0:75] + '...'
                                    f.write('****Alignment****\n')
                                    f.write("sequence: "+ str(alignment.title) + "\n")
                                    f.write("length: "+ str(alignment.length) + "\n")
                                    f.write("e-value: "+ str(hsp.expect) + "\n")
                                    f.write("Query: "+ str(hsp.query[0:75]) + "\n")
                                    f.write("Match: "+ str(hsp.match[0:75]) + "\n")
                                    f.write("Subject: "+ str(hsp.sbjct[0:75]) + "\n")

Any help is really appreciated.

biopython fasta • 9.4k views

ADD COMMENT • link updated 11.4 years ago by Biojl ★ 1.7k • written 11.4 years ago by Hmm ▴ 500

score 3 · Answer 1 · 2013-06-09

I don't think the NCBI URL API (the one used by Biopython's NCBIWWW) provides any batch query options.

What you can do, is to download and install the BLAST executables locally (http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download, or through your package manager if you're using Linux). These executables allow you to submit batch queries to the NCBI server, e.g.

$ blastp -out my_output.xml -outfmt 5 -query my_input.fasta -db refseq_protein -remote

The command above submits all the sequences in my_input.fasta as your input and writes the XML output to my_output.xml. Check out blastp -help to see all the available options.

Note that this is also doable via Biopython (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc90), e.g.

from Bio.Blast.Applications import Ncbiblastpcommandline

blastp_cline = Ncbiblastpcommandline(query='my_input.fasta', db='refseq_protein', outfmt=5, out='my_output.xml', remote=True)
stdout, stderr = blastp_cline()

As for the filtering, I'm not aware of any option to output only perfect matches, but this is easy to do with Biopython (Bio.Blast.NCBIXML as you've shown in your question, or the new Bio.SearchIO which feels very similar to Bio.SeqIO).

Here's an example with Bio.SearchIO:

from Bio import SearchIO

for query_results in Bio.SearchIO('my_output.xml', 'blast-xml'):
    print 'Results for query', query_results.id
    if not query_results:  # no hits found
        print 'No results found for', query_results.id
        continue
    for hit in query_results:
        print '    Hit', hit.id
        for hsp in hit:
            if hsp.aln_span == hsp.ident_num:   # perfect match if alignment span is equal to identity number
            print '     Perfect match found!'

Check out the SearchIO manual / tutorial for more ways to filter through your results (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc100).

score 2 · Answer 2 · 2013-06-10

2

Entering edit mode

11.4 years ago

Biojl ★ 1.7k

Be careful because sometimes BlastP do not give a hit with perfect matches. This depends on the scoring matrix you're using, the default one is BLOSUM62. For very close sequences, which is your case you should use the BLOSUM80 or higher.

ADD COMMENT • link 11.4 years ago by Biojl ★ 1.7k