Question: Processing BLAST XML output generated by GNU Parallel with Biopython SearchIO
mossmatters90 (Chicago Botanic Garden, Glencoe, IL) wrote, 4.5 years ago:

I have been using GNU Parallel to search an organism's entire proteome against a sequence database. It works great for BLAST tabular output, but I would like to use the traditional XML output, because of how well the SearchIO module of Biopython works. Specifically, I am interested in filtering the results and outputting the HSPs into a FASTA file, one for each query sequence.
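
For reference, the end goal (one FASTA file of filtered HSP hit sequences per query) might look roughly like this with SearchIO. This is only a sketch: `write_filtered_hsps` and `format_fasta` are hypothetical names, and the e-value cutoff is illustrative.

```python
def write_filtered_hsps(xml_path, out_dir, evalue_cutoff=1e-4):
    """For each query, write HSP hit sequences passing the e-value
    filter to <out_dir>/<query_id>.fasta (hypothetical layout)."""
    from Bio import SearchIO  # lazy import; requires Biopython
    for qresult in SearchIO.parse(xml_path, "blast-xml"):
        kept = [hsp for hsp in qresult.hsps if hsp.evalue < evalue_cutoff]
        if not kept:
            continue
        with open("%s/%s.fasta" % (out_dir, qresult.id), "w") as out:
            for hsp in kept:
                out.write(format_fasta(hsp.hit_id, str(hsp.hit.seq)))

def format_fasta(name, seq, width=60):
    """Plain FASTA formatting helper (no Biopython needed)."""
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"
```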

However, my output from GNU parallel is not a valid BLAST XML file. Instead, each search is concatenated. If I try to read the file with SearchIO.read(), I get this error:


```python
from Bio import SearchIO

all_results = SearchIO.read("results.xml", "blast-xml")
```


```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/__init__.py", line 364, in read
    second = next(generator)
  File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/__init__.py", line 316, in parse
    for qresult in generator:
  File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/BlastIO/blast_xml.py", line 193, in __iter__
    for qresult in self._parse_qresult():
  File "/opt/apps/lib/python2.7/site-packages/biopython-1.63-py2.7-linux-x86_64.egg/Bio/SearchIO/BlastIO/blast_xml.py", line 244, in _parse_qresult
    for event, qresult_elem in self.xml_iter:
  File "<string>", line 91, in next
cElementTree.ParseError: junk after document element: line 603, column 0
```


The same thing happens with SearchIO.parse() if I try to access the second element. Line 603 contains the XML declaration

```
<?xml version="1.0"?>
```

which marks the beginning of the next BLAST result, produced by a separate GNU Parallel job.


Alternatively, I can read the whole file as text and split it into a list of records. However, Biopython's SearchIO does not appear to accept the contents of a file as a string; it needs a file handle. Instead, I have been iterating over the results, writing a temporary XML file for each, and reading it back in with SearchIO. This seems very inefficient, as it involves about 30,000 write and read operations.
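
That round trip can be sketched roughly like this (`split_records` and `parse_via_tempfiles` are hypothetical names, and the split assumes every record starts with an `<?xml` declaration, as the error above suggests):

```python
import os
import tempfile

XML_DECL = '<?xml version="1.0"?>'

def split_records(text):
    """Split GNU Parallel's concatenated output into one XML string per search."""
    return [XML_DECL + chunk for chunk in text.split(XML_DECL) if chunk.strip()]

def parse_via_tempfiles(path):
    """One temporary write + read per record -- the inefficiency in question."""
    from Bio import SearchIO  # lazy import; requires Biopython
    with open(path) as fh:
        records = split_records(fh.read())
    for record in records:
        fd, tmp = tempfile.mkstemp(suffix=".xml")
        with os.fdopen(fd, "w") as out:
            out.write(record)
        try:
            yield SearchIO.read(tmp, "blast-xml")
        finally:
            os.remove(tmp)
```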

Here are my questions:

  1. Is this the most efficient use of these tools? I could alternatively have my GNU Parallel script output tabular BLAST, but since the desired end result is the target sequences themselves, I would have to retrieve them anyway. Would that approach be more efficient?
  2. Is there a lower-level function within SearchIO that I could use to process an XML result from a string, rather than from a file handle?
  3. Failing the above, what would be the best way to process the file without creating 30,000 temporary files and re-reading them?

modified 4.5 years ago by Peter5.7k • written 4.5 years ago by mossmatters90

What does your `parallel` command look like?

written 4.5 years ago by Pierre Lindenbaum114k

I have the Basidiomycota sequences from NR in a custom database, and I am BLASTing an entire proteome against them. I have access to four servers, each with 12 CPUs. Some testing suggested that limiting the number of jobs `parallel` runs at once, while also giving each BLAST process a few threads of its own, would be more efficient than running every BLAST search on a separate thread. This command completed in about 80 minutes (30,000 query sequences, 90,000 targets):

```
cat proteome.fa | time parallel -j12 -S n001,n002,n003,n004 --pipe -N1 --recstart ">" \
    "blastp -db nr_basidiomycota -query - -evalue 0.0001 -max_target_seqs 25 -outfmt 5 -num_threads 4" > basidiomycota.xml
```

modified 4.5 years ago • written 4.5 years ago by mossmatters90

It sounds like you have 30,000 proteins and you are running one BLAST job per protein. That is a lot of overhead. I would batch things, say 1000 queries in each FASTA file, giving just 33 chunks. This is what the BLAST+ wrappers for Galaxy do, with extra code to merge the XML output properly instead of simple concatenation which (as you have seen) results in an invalid XML file.
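
The batching idea can be sketched without any extra dependencies (`chunk_fasta` is a hypothetical helper; chunk boundaries fall on `>` header lines, so no record is ever split):

```python
def chunk_fasta(path, out_prefix, size=1000):
    """Split a FASTA file into chunks of `size` records (e.g. 30,000
    queries into ~30 files), so each BLAST job gets a bigger batch.
    Returns the number of chunk files written."""
    count, chunk_idx, out = 0, 0, None
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if count % size == 0:  # start a new chunk on this header
                    if out:
                        out.close()
                    chunk_idx += 1
                    out = open("%s.%03d.fasta" % (out_prefix, chunk_idx), "w")
                count += 1
            out.write(line)
    if out:
        out.close()
    return chunk_idx
```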

written 4.5 years ago by Peter5.7k
Peter5.7k (Scotland, UK) wrote, 4.5 years ago:

1. Most efficient? I suspect it would be faster to use bigger chunks, say 33 sub-jobs of 1000 query sequences each.

2. Python's StringIO might be one way round this. Or simply leave the files as separate files (a folder of 33 BLAST XML files is not a big burden), and loop over the input files in your Python script.
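
The StringIO route might look roughly like this (a sketch, not Biopython's own API for this; `iter_qresults` is a hypothetical name, and the split assumes each record starts with the XML declaration):

```python
from io import StringIO  # on Python 2, cStringIO.StringIO is the faster choice

XML_DECL = '<?xml version="1.0"?>'

def iter_qresults(path):
    """Parse each concatenated BLAST XML record in memory via StringIO,
    with no temporary files."""
    from Bio import SearchIO  # lazy import; requires Biopython
    with open(path) as fh:
        text = fh.read()
    for chunk in text.split(XML_DECL):
        if chunk.strip():
            yield SearchIO.read(StringIO(XML_DECL + chunk), "blast-xml")
```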

3. You could try merging your BLAST XML files to make a single valid XML file (with one header/footer); see How To Concatenate/Merge Blast Xml Outputs? (someone turned my Galaxy BLAST XML merge code into a standalone script).
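
A much-simplified sketch of that text-level merge (`merge_blast_xml` is a hypothetical name; the linked Galaxy-derived script handles more edge cases than this):

```python
HEAD_END = "<BlastOutput_iterations>"
TAIL = "</BlastOutput_iterations>"

def merge_blast_xml(texts):
    """Merge several complete BLAST XML documents (as strings) into one
    valid document: first file's header, every file's <Iteration>
    blocks, first file's footer."""
    header = texts[0][: texts[0].index(HEAD_END) + len(HEAD_END)]
    footer = texts[0][texts[0].index(TAIL):]
    body = []
    for text in texts:
        start = text.index(HEAD_END) + len(HEAD_END)
        body.append(text[start: text.index(TAIL)])
    return header + "".join(body) + footer
```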

written 4.5 years ago by Peter5.7k
Powered by Biostar version 2.3.0