Question: Running Blastp From Biopython And Only Looking For Exact Matches
0
gravatar for Hmm
6.0 years ago by
Hmm500
Hmm500 wrote:

I have protein sequences that i split at every 'R' and convert them to peptides. Currently it searches for perfect and non perfect alignments. What i wanted to accomplish is:

1) Rather than searching one by one, is there a way to batch all these peptides and send them?

2) Find perfect match only

3) If i don't find the peptide, i want to log to the output file that the peptide sequence was not found.

This is my code so far:

from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
mysplitseqs=[]
minPeptideLength=5
myEntrezQuery = "Homo sapiens[Organism]"

E_VALUE_THRESH = 0.04

with open("outfile.txt","w") as f:
        for seq_record in SeqIO.parse("my.fasta", "fasta"):
                mysplitseqs= seq_record.seq.split('R')
                f.writestrseq_record.id) + "\n")
                for peptide in mysplitseqs:
                    if len(peptide) >minPeptideLength:
                        result_handle = NCBIWWW.qblast("blastp", "nr", peptide,"-e 10000",entrez_query=myEntrezQuery)
                        blast_record = NCBIXML.read(result_handle)
                        for alignment in blast_record.alignments:
                            for hsp in alignment.hsps:
                                if hsp.expect < E_VALUE_THRESH:
                                    print '****Alignment****'
                                    print 'sequence:', alignment.title
                                    print 'length:', alignment.length
                                    print 'e value:', hsp.expect
                                    print hsp.query[0:75] + '...'
                                    print hsp.match[0:75] + '...'
                                    print hsp.sbjct[0:75] + '...'
                                    f.write('****Alignment****\n')
                                    f.write("sequence: "+ str(alignment.title) + "\n")
                                    f.write("length: "+ str(alignment.length) + "\n")
                                    f.write("e-value: "+ str(hsp.expect) + "\n")
                                    f.write("Query: "+ str(hsp.query[0:75]) + "\n")
                                    f.write("Match: "+ str(hsp.match[0:75]) + "\n")
                                    f.write("Subject: "+ str(hsp.sbjct[0:75]) + "\n")

Any help is really appreciated.

fasta biopython • 5.1k views
ADD COMMENTlink modified 6.0 years ago by Biojl1.6k • written 6.0 years ago by Hmm500
3
gravatar for bow
6.0 years ago by
bow790
Netherlands
bow790 wrote:

I don't think the NCBI URL API (the one used by Biopython's NCBIWWW) provides any batch query options.

What you can do, is to download and install the BLAST executables locally (http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download, or through your package manager if you're using Linux). These executables allow you to submit batch queries to the NCBI server, e.g.

$ blastp -out my_output.xml -outfmt 5 -query my_input.fasta -db refseq_protein -remote

The command above submits all the sequences in my_input.fasta as your input and writes the XML output to my_output.xml. Check out blastp -help to see all the available options.

Note that this is also doable via Biopython (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc90), e.g.

from Bio.Blast.Applications import Ncbiblastpcommandline

blastp_cline = Ncbiblastpcommandline(query='my_input.fasta', db='refseq_protein', outfmt=5, out='my_output.xml', remote=True)
stdout, stderr = blastp_cline()

As for the filtering, I'm not aware of any option to output only perfect matches, but this is easy to do with Biopython (Bio.Blast.NCBIXML as you've shown in your question, or the new Bio.SearchIO which feels very similar to Bio.SeqIO).

Here's an example with Bio.SearchIO:

from Bio import SearchIO

for query_results in Bio.SearchIO('my_output.xml', 'blast-xml'):
    print 'Results for query', query_results.id
    if not query_results:  # no hits found
        print 'No results found for', query_results.id
        continue
    for hit in query_results:
        print '    Hit', hit.id
        for hsp in hit:
            if hsp.aln_span == hsp.ident_num:   # perfect match if alignment span is equal to identity number
            print '     Perfect match found!'

Check out the SearchIO manual / tutorial for more ways to filter through your results (http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc100).

ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by bow790
2
gravatar for Biojl
6.0 years ago by
Biojl1.6k
Barcelona
Biojl1.6k wrote:

Be careful because sometimes BlastP do not give a hit with perfect matches. This depends on the scoring matrix you're using, the default one is BLOSUM62. For very close sequences, which is your case you should use the BLOSUM80 or higher.

ADD COMMENTlink written 6.0 years ago by Biojl1.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1869 users visited in the last hour