I have been using GNU Parallel to search an organism's entire proteome against a sequence database. It works great for BLAST tabular output, but I would like to use the traditional XML output instead, because Biopython's SearchIO module handles it so well. Specifically, I want to filter the results and write the HSPs to a FASTA file, one per query sequence.
However, the output from GNU Parallel is not a valid BLAST XML file: the XML from each search is simply concatenated. If I try to read the file with SearchIO.read(), I get this error:
```
from Bio import SearchIO
all_results = SearchIO.read("results.xml", "blast-xml")
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/__init__.py", line 364, in read
    second = next(generator)
  File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/__init__.py", line 316, in parse
    for qresult in generator:
  File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/BlastIO/blast_xml.py", line 193, in __iter__
    for qresult in self._parse_qresult():
  File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/BlastIO/blast_xml.py", line 244, in _parse_qresult
    for event, qresult_elem in self.xml_iter:
  File "<string>", line 91, in next
cElementTree.ParseError: junk after document element: line 603, column 0
```
The same thing happens with SearchIO.parse() when I try to access the second element. Line 603 contains the comment line that marks the beginning of the next BLAST result, produced by a separate GNU Parallel job.
Alternatively, I can read the full file in as text and split it into a list. However, Biopython's SearchIO does not appear to accept the contents of a file as a string; it needs a file handle. So instead I have been iterating over the results, writing a temporary XML file for each one, and reading that back in with SearchIO. This seems very inefficient, as it involves about 30,000 write and read operations.
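The splitting step I describe can be sketched like this with only the standard library (the `concatenated` string below is a small stand-in for the real file contents, and the assumption is that every record starts with an XML declaration):

```python
# Split a concatenated multi-record BLAST XML string on the XML
# declaration that begins each individual result.
def split_blast_xml(text):
    decl = "<?xml"
    parts = text.split(decl)
    # Re-attach the declaration and drop the empty leading chunk.
    return [decl + p for p in parts if p.strip()]

# Stand-in for the real concatenated output from GNU Parallel.
concatenated = (
    '<?xml version="1.0"?>\n<BlastOutput>first result</BlastOutput>\n'
    '<?xml version="1.0"?>\n<BlastOutput>second result</BlastOutput>\n'
)

records = split_blast_xml(concatenated)
print(len(records))  # prints 2
```

Each element of `records` is then a complete single-search XML document, which is what I have been writing out to the temporary files.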
Here are my questions:
- Is this the most efficient use of these tools? I could instead have my GNU Parallel script output tabular BLAST, but since the desired end result is the target sequences, I would have to retrieve those anyway. Would that approach be more efficient?
- Is there a lower-level function within SearchIO that I could use to parse an XML result from a string, rather than from a file handle?
- In the absence of the above improvements, what would be the best way to process the file without having to create 30,000 temporary files and re-reading them?
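For context on the second question, the in-memory variant I have in mind would look something like this. An `io.StringIO` object exposes the same handle interface (read, readline, iteration) as an open file, so in principle it could stand in for the temporary file; I have not verified this against my Biopython version, so the SearchIO call is shown only as a comment:

```python
import io

# Hypothetical single record split out of the concatenated file.
record = '<?xml version="1.0"?>\n<BlastOutput></BlastOutput>\n'

# An in-memory handle instead of a temporary file on disk.
handle = io.StringIO(record)

# The call I would like to make (untested on my setup):
#   from Bio import SearchIO
#   result = SearchIO.read(handle, "blast-xml")

# Demonstrate that the handle behaves like an open file.
print(handle.readline())  # first line of the record
```

If that works, the 30,000 temporary files could be replaced by 30,000 in-memory handles with no disk I/O at all.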