Question: Biopython BLAST or Blast parser returns "no"
2.2 years ago, yarmda wrote:

I'm trying to parse my BLAST results and keep only the records whose gi identifier is not in a given list. However, parsing the file is not working as expected, and the result is not informative.

Sorry if this type of question has been posted before. Search engines don't do well with the term "no".

from Bio import SeqIO, SearchIO
from Bio.Blast import NCBIWWW, NCBIXML

#Forming Blast file. "record" represents SeqIO.read("input.fasta", "fasta"), where "input.fasta" is the sequence of the
#Bacillus anthracis strain with gi = 1033843585

result_handle = NCBIWWW.qblast("blastn", "nt", record.seq, megablast=True)
result = open("new_tmp_blast.xml", "w")
result.write(result_handle.read())  # the interpreter echoed 919480409, the number of characters written
result.close()
result_handle.close()

#And the BLAST file is output in xml format, just like I wanted.

#Trying to parse the BLAST

>>> hits = []
>>> for entry in NCBIXML.parse(open("new_tmp_blast.xml")):
...     if entry.alignments:
...         hits.append(entry.query.split()[0])
...
>>> hits
['No']
>>> for entry in NCBIXML.parse(open("new_tmp_blast.xml")):
...     hits.append(entry.query.split()[0])
...
>>> hits
['No', 'No']
>>> for entry in NCBIXML.parse(open("new_tmp_blast.xml")):
...     entry.query.split()[0]
...
'No'
>>> NCBIXML.parse(open("new_tmp_blast.xml"))
<generator object parse at 0x7fe6772430a0>

I have also tried using SearchIO and gotten identical results. I don't understand where the issue is.

Example of the BLAST result:

$ more new_tmp_blast.xml 

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_version>BLASTN 2.7.0+</BlastOutput_version>
  <BlastOutput_reference>Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14.</BlastOutput_reference>
  <BlastOutput_db>nt</BlastOutput_db>
  <BlastOutput_query-ID>Query_123951</BlastOutput_query-ID>
  <BlastOutput_query-def>No definition line</BlastOutput_query-def>
  <BlastOutput_query-len>5227292</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_expect>10</Parameters_expect>
      <Parameters_sc-match>1</Parameters_sc-match>
      <Parameters_sc-mismatch>-2</Parameters_sc-mismatch>
      <Parameters_gap-open>0</Parameters_gap-open>
      <Parameters_gap-extend>0</Parameters_gap-extend>
      <Parameters_filter>L;m;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>Query_123951</Iteration_query-ID>
  <Iteration_query-def>No definition line</Iteration_query-def>
  <Iteration_query-len>5227292</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|1033843585|gb|CP015779.1|</Hit_id>
  <Hit_def>Bacillus anthracis strain Tangail-1, complete genome</Hit_def>
  <Hit_accession>CP015779</Hit_accession>
  <Hit_len>5227292</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>9652680</Hsp_bit-score>
      <Hsp_score>5227132</Hsp_score>
      <Hsp_evalue>0</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>5227292</Hsp_query-to>
      <Hsp_hit-from>1</Hsp_hit-from>
      <Hsp_hit-to>5227292</Hsp_hit-to>
      <Hsp_query-frame>1</Hsp_query-frame>
      <Hsp_hit-frame>1</Hsp_hit-frame>
      <Hsp_identity>5227292</Hsp_identity>
      <Hsp_positive>5227292</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>5227292</Hsp_align-len>
      <Hsp_qseq>ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTTCGTACTTTTACACAGTATATCGTGTTGTGGACAATTTTATTCCACAAGGTATTGATTTTGTGGATAACTTTCTTAATTTCATTGCTATAGCTACTTTTTTTTGATATTATAGTTGTGTTTTCACTTTGAATAAGTTTTCCACATCTTTATCTTATCCACAATTTGTGTATAACATGTGGACAGTTTTAATCACATGTGGGTAAATGATTATCCACAT
TTGCTTTTTTGTCGAAAACCCTATCTCATATACAAACGACGTTTTTAGGTTTTAAAATACGTTTCGTATAAATATACATTTTATATTTATTCAGGTTGTACATTTGTTGCACAACCTTATTCTTTTACCATCTTAGTAAAGGAGGGACACCTTTGGAAAACATCTCTGATTTATGGAACAGCGCCTTAAAAGAACTCGAAAAAAAGGTCAGTAAACCAAGTTATGAAACATGGTTAAAATCAACAACCGCACATAATTTAAAGAAAGATG
TATTAACAATTACGGCTCCAAATGAATTCGCCCGTGATTGGTTAGAATCTCATTATTCAGAGCTAATTTCGGAAACACTTTATGATTTAACGGGGGCAAAATTAGCTATTCGCTTTATTATTCCCCAAAGTCAAGCTGAAGAGGAGATTGATCTTCCTCCTGCTAAACCAAATGCAGCACAAGATGATTCTAATCATTTACCACAGAGTATGCTAAACCCAAAATATACGTTTGATACATTTGTTATTGGCTCTGGTAACCGTTTTGCTC
ACGCTGCTTCATTGGCCGTAGCCGAAGCGCCAGCTAAAGCATATAATCCCCTCTTTATTTATGGGGGAGTTGGACTTGGAAAAACCCATTTAATGCATGCAATTGGCCATTATGTAATTGAACATAACCCAAATGCCAAAGTTGTATATTTATCATCAGAAAAATTTACAAATGAATTCATTAATTCTATTCGTGATAATAAAGCGGTCGATTTTCGTAATAAATACCGCAATGTAGATGTTTTATTGATAGATGATATTCAATTTTTAG
CGGGAAAAGAACAAACTCAAGAAGAGTTTTTCCATACATTCAATGCATTACACGAAGAAAGTAAACAAATTGTAATTTCCAGTGATCGGCCACCAAAAGAAA

>>> for entry in NCBIXML.parse(open("new_tmp_blast.xml")):
...     entry.query
...
'No definition line'

I'm not sure what that means, either. Is something missing from my BLAST command that I should include?

2.2 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Not Biopython, but you can use a simple XSLT stylesheet, if your XML fits in memory:

usage:

xsltproc --novalid  transform.xsl blast.xml
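
On the 'No' itself: it looks like it comes straight from the XML rather than from a parser fault. The query was submitted as a bare sequence (record.seq), so the report carries the placeholder "No definition line" as the query definition, and entry.query.split()[0] returns its first word. For the filtering, here is a minimal standard-library sketch in Python; it is neither Biopython nor the stylesheet approach. The names gi_of and hits_not_in are hypothetical, SAMPLE_XML is a trimmed stand-in for the real file, and its second hit is invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the layout shown in the question. The second hit is a
# made-up placeholder for illustration only.
SAMPLE_XML = """<BlastOutput>
<BlastOutput_iterations>
<Iteration>
  <Iteration_query-def>No definition line</Iteration_query-def>
  <Iteration_hits>
    <Hit>
      <Hit_id>gi|1033843585|gb|CP015779.1|</Hit_id>
      <Hit_def>Bacillus anthracis strain Tangail-1, complete genome</Hit_def>
    </Hit>
    <Hit>
      <Hit_id>gi|111|gb|XX000000.1|</Hit_id>
      <Hit_def>placeholder hit</Hit_def>
    </Hit>
  </Iteration_hits>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>"""


def gi_of(hit_id):
    """Return the GI number from an id like 'gi|123|gb|X.1|', or None."""
    fields = hit_id.split("|")
    if len(fields) > 1 and fields[0] == "gi":
        return fields[1]
    return None


def hits_not_in(xml_text, excluded_gis):
    """Yield (query_def, hit_id, hit_def) for hits whose GI is not excluded."""
    root = ET.fromstring(xml_text)
    for iteration in root.iter("Iteration"):
        query_def = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            hit_id = hit.findtext("Hit_id", default="")
            if gi_of(hit_id) not in excluded_gis:
                yield query_def, hit_id, hit.findtext("Hit_def")


# Exclude the GI of the query genome itself and keep everything else.
kept = list(hits_not_in(SAMPLE_XML, {"1033843585"}))

# The first word of the placeholder definition is where 'No' comes from.
first_word = kept[0][0].split()[0]
```

For a file as large as the one in the question, xml.etree.ElementTree.iterparse would avoid loading everything at once. To get a meaningful query definition, the Biopython tutorial suggests submitting FASTA text, for example record.format("fasta") in place of record.seq, so that the header line travels with the query.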