I have been using GNU Parallel to search an organism's entire proteome against a sequence database. It works great for BLAST tabular output, but I would like to use the traditional XML output instead, because Biopython's SearchIO module handles it so well. Specifically, I want to filter the results and write the HSPs to a FASTA file, one per query sequence.
However, the output from GNU Parallel is not a valid BLAST XML file: the XML from each search is simply concatenated. If I try to read the file with SearchIO.read(), I get this error:
```
from Bio import SearchIO
all_results = SearchIO.read("results.xml", "blast-xml")
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/__init__.py", line 364, in read
    second = next(generator)
  File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/__init__.py", line 316, in parse
    for qresult in generator:
  File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/BlastIO/blast_xml.py", line 193, in __iter__
    for qresult in self._parse_qresult():
  File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/BlastIO/blast_xml.py", line 244, in _parse_qresult
    for event, qresult_elem in self.xml_iter:
  File "<string>", line 91, in next
cElementTree.ParseError: junk after document element: line 603, column 0
```
The same thing happens with SearchIO.parse() when I try to access the second element. Line 603 contains the comment line that marks the beginning of the next BLAST result, produced by a separate GNU Parallel job.
Alternatively, I can read the full file in as text and split it into a list. However, Biopython's SearchIO does not appear to accept the contents of a file as a string; it needs a file handle. So instead I have been iterating over the results, writing a temporary XML file for each one, and reading that back in with SearchIO. This seems very inefficient, as it involves about 30,000 write and read operations.
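The splitting step I describe can be sketched like this with only the standard library (the `concatenated` string below is a small stand-in for the real file contents, and the assumption is that every record starts with an XML declaration):

```python
# Split a concatenated multi-record BLAST XML string on the XML
# declaration that begins each individual result.
def split_blast_xml(text):
    decl = "<?xml"
    parts = text.split(decl)
    # Re-attach the declaration and drop the empty leading chunk.
    return [decl + p for p in parts if p.strip()]

# Stand-in for the real concatenated output from GNU Parallel.
concatenated = (
    '<?xml version="1.0"?>\n<BlastOutput>first result</BlastOutput>\n'
    '<?xml version="1.0"?>\n<BlastOutput>second result</BlastOutput>\n'
)

records = split_blast_xml(concatenated)
print(len(records))  # prints 2
```

Each element of `records` is then a complete single-search XML document, which is what I have been writing out to the temporary files.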
Here are my questions:
- Is this the most efficient use of these tools? I could instead have my GNU Parallel script output tabular BLAST, but since the desired end result is the target sequences, I would have to retrieve those anyway. Would that approach be more efficient?
- Is there a lower-level function within SearchIO that I could use to parse an XML result from a string, rather than from a file handle?
- In the absence of the above improvements, what would be the best way to process the file without having to create 30,000 temporary files and re-reading them?
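For context on the second question, the in-memory variant I have in mind would look something like this. An `io.StringIO` object exposes the same handle interface (read, readline, iteration) as an open file, so in principle it could stand in for the temporary file; I have not verified this against my Biopython version, so the SearchIO call is shown only as a comment:

```python
import io

# Hypothetical single record split out of the concatenated file.
record = '<?xml version="1.0"?>\n<BlastOutput></BlastOutput>\n'

# An in-memory handle instead of a temporary file on disk.
handle = io.StringIO(record)

# The call I would like to make (untested on my setup):
#   from Bio import SearchIO
#   result = SearchIO.read(handle, "blast-xml")

# Demonstrate that the handle behaves like an open file.
print(handle.readline())  # first line of the record
```

If that works, the 30,000 temporary files could be replaced by 30,000 in-memory handles with no disk I/O at all.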