6.5 years ago
yarmda
I'm trying to parse my BLAST results and keep only the records whose gi identifier is not in a given list. However, my parsing code is not working as expected, and its output is not informative.
Sorry if this type of question has been posted before. Search engines don't do well with the term "no".
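To make the goal concrete, here is a minimal sketch of the filtering step using only the standard library. The names `gi_from_hit_id`, `hits_not_in` and `excluded_gis` are hypothetical, and the second hit in the sample is a made-up placeholder; only the first `Hit_id` comes from the BLAST output shown in this post.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modelled on the BLAST XML in this post. The second hit is a
# made-up placeholder, included only so that the filter has something to keep.
sample_xml = """<Iteration_hits>
  <Hit>
    <Hit_id>gi|1033843585|gb|CP015779.1|</Hit_id>
    <Hit_def>Bacillus anthracis strain Tangail-1, complete genome</Hit_def>
  </Hit>
  <Hit>
    <Hit_id>gi|1|gb|PLACEHOLDER.1|</Hit_id>
    <Hit_def>hypothetical second hit</Hit_def>
  </Hit>
</Iteration_hits>"""


def gi_from_hit_id(hit_id):
    """Return the gi number from an id like 'gi|123|gb|X.1|', or None."""
    fields = hit_id.split("|")
    if len(fields) > 1 and fields[0] == "gi":
        return fields[1]
    return None


def hits_not_in(xml_text, excluded_gis):
    """Return the Hit_id of every hit whose gi is not in excluded_gis."""
    root = ET.fromstring(xml_text)
    kept = []
    for hit in root.iter("Hit"):
        hit_id = hit.findtext("Hit_id")
        if gi_from_hit_id(hit_id) not in excluded_gis:
            kept.append(hit_id)
    return kept


excluded_gis = {"1033843585"}  # hypothetical exclusion list
kept_hits = hits_not_in(sample_xml, excluded_gis)
print(kept_hits)
```

This parses the whole string in memory, which is fine for a small sample. For a result file of several hundred megabytes, a streaming parser would probably be a better fit.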
from Bio import SeqIO, SearchIO
from Bio.Blast import NCBIWWW, NCBIXML
#Running the BLAST search. "record" is SeqIO.read("input.fasta", "fasta"), where "input.fasta" holds the sequence of the
#Bacillus anthracis strain with gi = 1033843585
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq, megablast=True)
result = open("new_tmp_blast.xml","w")
result.write(result_handle.read())
919480409
result.close()
result_handle.close()
#And the BLAST file is output in xml format, just like I wanted.
#Trying to parse the BLAST
hits = []
for entry in NCBIXML.parse(open("new_tmp_blast.xml")):
...     if entry.alignments:
...         hits.append(entry.query.split()[0])
...
hits
['No']
for entry in NCBIXML.parse(open("new_tmp_blast.xml")):
...     hits.append(entry.query.split()[0])
...
hits
['No', 'No']
for entry in NCBIXML.parse(open("new_tmp_blast.xml")):
...     entry.query.split()[0]
...
'No'
NCBIXML.parse(open("new_tmp_blast.xml"))
<generator object parse at 0x7fe6772430a0>
I have also tried using SearchIO and gotten identical results. I don't understand where the issue is.
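For what it's worth, the 'No' seems to be just the first word of the query definition stored in the XML, "No definition line", rather than a sign that parsing failed. The standard-library sketch below reads that field from a fragment trimmed from the file and splits it the same way. The second 'No' in `hits` presumably appears because the list was not reset between the two loops.

```python
import xml.etree.ElementTree as ET

# Fragment trimmed from the Iteration block of the BLAST output in this post.
xml_text = """<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>Query_123951</Iteration_query-ID>
  <Iteration_query-def>No definition line</Iteration_query-def>
</Iteration>"""

iteration = ET.fromstring(xml_text)

# The query definition as recorded by BLAST for this search.
query_def = iteration.findtext("Iteration_query-def")

# Splitting on whitespace and taking the first token gives 'No'.
first_word = query_def.split()[0]

print(repr(query_def), repr(first_word))
```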
Example of the BLAST result:
$ more new_tmp_blast.xml
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastn</BlastOutput_program>
<BlastOutput_version>BLASTN 2.7.0+</BlastOutput_version>
<BlastOutput_reference>Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14.</BlastOutput_reference>
<BlastOutput_db>nt</BlastOutput_db>
<BlastOutput_query-ID>Query_123951</BlastOutput_query-ID>
<BlastOutput_query-def>No definition line</BlastOutput_query-def>
<BlastOutput_query-len>5227292</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_expect>10</Parameters_expect>
<Parameters_sc-match>1</Parameters_sc-match>
<Parameters_sc-mismatch>-2</Parameters_sc-mismatch>
<Parameters_gap-open>0</Parameters_gap-open>
<Parameters_gap-extend>0</Parameters_gap-extend>
<Parameters_filter>L;m;</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_123951</Iteration_query-ID>
<Iteration_query-def>No definition line</Iteration_query-def>
<Iteration_query-len>5227292</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|1033843585|gb|CP015779.1|</Hit_id>
<Hit_def>Bacillus anthracis strain Tangail-1, complete genome</Hit_def>
<Hit_accession>CP015779</Hit_accession>
<Hit_len>5227292</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>9652680</Hsp_bit-score>
<Hsp_score>5227132</Hsp_score>
<Hsp_evalue>0</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>5227292</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>5227292</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>1</Hsp_hit-frame>
<Hsp_identity>5227292</Hsp_identity>
<Hsp_positive>5227292</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>5227292</Hsp_align-len>
<Hsp_qseq>ATATTTTTTCTTGTTTTTTATATCCACAAACTCTTTTCGTACTTTTACACAGTATATCGTGTTGTGGACAATTTTATTCCACAAGGTATTGATTTTGTGGATAACTTTCTTAATTTCATTGCTATAGCTACTTTTTTTTGATATTATAGTTGTGTTTTCACTTTGAATAAGTTTTCCACATCTTTATCTTATCCACAATTTGTGTATAACATGTGGACAGTTTTAATCACATGTGGGTAAATGATTATCCACAT
TTGCTTTTTTGTCGAAAACCCTATCTCATATACAAACGACGTTTTTAGGTTTTAAAATACGTTTCGTATAAATATACATTTTATATTTATTCAGGTTGTACATTTGTTGCACAACCTTATTCTTTTACCATCTTAGTAAAGGAGGGACACCTTTGGAAAACATCTCTGATTTATGGAACAGCGCCTTAAAAGAACTCGAAAAAAAGGTCAGTAAACCAAGTTATGAAACATGGTTAAAATCAACAACCGCACATAATTTAAAGAAAGATG
TATTAACAATTACGGCTCCAAATGAATTCGCCCGTGATTGGTTAGAATCTCATTATTCAGAGCTAATTTCGGAAACACTTTATGATTTAACGGGGGCAAAATTAGCTATTCGCTTTATTATTCCCCAAAGTCAAGCTGAAGAGGAGATTGATCTTCCTCCTGCTAAACCAAATGCAGCACAAGATGATTCTAATCATTTACCACAGAGTATGCTAAACCCAAAATATACGTTTGATACATTTGTTATTGGCTCTGGTAACCGTTTTGCTC
ACGCTGCTTCATTGGCCGTAGCCGAAGCGCCAGCTAAAGCATATAATCCCCTCTTTATTTATGGGGGAGTTGGACTTGGAAAAACCCATTTAATGCATGCAATTGGCCATTATGTAATTGAACATAACCCAAATGCCAAAGTTGTATATTTATCATCAGAAAAATTTACAAATGAATTCATTAATTCTATTCGTGATAATAAAGCGGTCGATTTTCGTAATAAATACCGCAATGTAGATGTTTTATTGATAGATGATATTCAATTTTTAG
CGGGAAAAGAACAAACTCAAGAAGAGTTTTTCCATACATTCAATGCATTACACGAAGAAAGTAAACAAATTGTAATTTCCAGTGATCGGCCACCAAAAGAAA
for entry in NCBIXML.parse(open("new_tmp_blast.xml")):
...     entry.query
...
'No definition line'
I'm not sure what that means, either. Is there something missing from my BLAST command that I should include?
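One possible explanation, offered as an assumption rather than a confirmed diagnosis: `record.seq` is a bare sequence with no header, so BLAST has no definition line to report for the query. Submitting FASTA-formatted text, with a `>` header line, may give the query a name. The sketch below builds such text with a hypothetical helper `to_fasta`; the identifier and the short sequence prefix are taken from the hit shown above purely for illustration. In Biopython, `record.format("fasta")` produces the same kind of text from a SeqRecord.

```python
def to_fasta(identifier, sequence, width=60):
    """Return FASTA text: a '>' header line followed by wrapped sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + identifier + "\n" + "\n".join(lines) + "\n"


# Illustrative values only: the accession and the first bases of the hit
# shown in the XML excerpt, standing in for the real query record.
fasta_text = to_fasta("CP015779.1", "ATATTTTTTCTTGTTTTTTATATCCACAAACTC")
print(fasta_text)
```

Whether this changes the reported query definition would need to be checked against an actual search.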