Help With BioPython BLAST Output
8.3 years ago
niamshah • 0

Hi,

I am new to BioPython and am writing a program to BLAST human amino acid sequences. My search query is:

NCBIWWW.qblast(
  program="blastp",
  database="nr",
  sequence=BLASTSeq,
  entrez_query="txid9606[ORGN]",
  matrix_name='BLOSUM62',
  word_size='2',
  expect='10',
  gapcosts='11 1',
  composition_based_statistics='no adjustment')

but the alignment title of the BLAST result looks like:

('sequence:', u'gi|964750848|ref|NP_001304891.1| periodic tryptophan protein 1 homolog isoform 2 [Homo sapiens] >gi|332241720|ref|XP_003270028.1| PREDICTED: periodic tryptophan protein 1 homolog isoform X2 [Nomascus leucogenys] >gi|194382424|dbj|BAG58967.1| unnamed protein product [Homo sapiens]')

I was wondering why the titles include multiple protein names and organisms, and how I can change my code so that my program only returns one human protein.

My full BLAST method is:

def callBLAST(BLASTSeq):
    from Bio.Blast import NCBIWWW
    from Bio.Blast import NCBIXML
    result_handle = NCBIWWW.qblast(
        program="blastp",
        database="nr",
        sequence=BLASTSeq,
        entrez_query="txid9606[ORGN]",
        matrix_name='BLOSUM62',
        word_size='2',
        expect='10',
        gapcosts='11 1',
        composition_based_statistics='no adjustment')
    blast_record = NCBIXML.read(result_handle)
    E_VALUE_THRESH = 0.04
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < E_VALUE_THRESH:
                print('****Alignment****')
                print('sequence:', alignment.title)
        # Stop after the first alignment, so only the top hit is reported.
        break

Thanks!

blast • 2.9k views
7.8 years ago
BI0SH0CK3D • 0

There is an argument in the qblast function called "hitlist_size" described as "Number of hits to return. Default 50."

When I only want one human result, I pass the argument hitlist_size=1.

For more information, enter this into the interpreter:

from Bio.Blast import NCBIWWW

help(NCBIWWW.qblast)
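
As a rough sketch of how hitlist_size could be added to the query from the question: the helper names build_qblast_kwargs and run_top_hit_blast below are made up for illustration, and the remaining arguments are simply copied from the original post. Note that this limits the number of hits returned; it may not by itself shorten the title of each hit.

```python
def build_qblast_kwargs(sequence, max_hits=1):
    """Collect the qblast arguments, capping the number of returned hits.

    Hypothetical helper: only hitlist_size is new compared with the
    query shown in the question.
    """
    return {
        "program": "blastp",
        "database": "nr",
        "sequence": sequence,
        "entrez_query": "txid9606[ORGN]",  # restrict the search to human entries
        "hitlist_size": max_hits,          # number of hits to return (default 50)
    }


def run_top_hit_blast(sequence, max_hits=1):
    """Submit the search to NCBI (needs Biopython and network access)."""
    # Imported here so the argument helper above works without Biopython.
    from Bio.Blast import NCBIWWW
    return NCBIWWW.qblast(**build_qblast_kwargs(sequence, max_hits))
```

The returned handle can then be passed to NCBIXML.read as in the original function.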

I also had an issue with multiple names ending up in one Excel cell in the output, like this:

[RecName: Full=Neuromodulin; AltName: Full=Axonal membrane protein GAP-43; AltName: Full=Growth-associated protein 43; AltName: Full=Neural phosphoprotein B-50; AltName: Full=pp46 >gi|61213904|sp|Q5IS67.1|NEUM_PANTR RecName: Full=Neuromodulin; AltName: Full=Axonal membrane protein GAP-43; AltName: Full=Growth-associated protein 43]

If the extra names still appear with a hit list size of one, there is an Excel trick that can help. I highlight the column with the cluttered results and do a search and replace for >*, replacing it with nothing. Because * is a wildcard character, this deletes the > and everything after it, leaving only the first name in each cell. Make sure there is nothing after the > that you want to keep before doing this.
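
The same clean-up can probably be done in Python before the results ever reach a spreadsheet. The nr database merges identical sequences into a single record, so one hit title carries several descriptions joined by " >". The sketch below splits the title on that separator and keeps the first entry ending in "[Homo sapiens]". The function names split_title and first_entry_for_organism are made up for illustration, and the approach assumes the organism appears in square brackets at the end of each entry, which is not the case for Swiss-Prot style titles like the one above.

```python
def split_title(title):
    """Split a merged BLAST hit title into its individual entries."""
    return [part.strip() for part in title.split(" >")]


def first_entry_for_organism(title, organism="Homo sapiens"):
    """Return the first entry tagged with the given organism, or None.

    Assumes each entry ends with the organism name in square brackets,
    as in the title shown in the question.
    """
    suffix = "[" + organism + "]"
    for entry in split_title(title):
        if entry.endswith(suffix):
            return entry
    return None


title = (
    "gi|964750848|ref|NP_001304891.1| periodic tryptophan protein 1 "
    "homolog isoform 2 [Homo sapiens] "
    ">gi|332241720|ref|XP_003270028.1| PREDICTED: periodic tryptophan "
    "protein 1 homolog isoform X2 [Nomascus leucogenys] "
    ">gi|194382424|dbj|BAG58967.1| unnamed protein product [Homo sapiens]"
)
print(first_entry_for_organism(title))
```

In the original function, this would mean printing first_entry_for_organism(alignment.title) instead of alignment.title.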
