I run my BLAST (BLASTN 2.2.26+) searches through a script that splits the FASTA input into N pieces, launches one BLAST instance per piece (in N processes), and, once they are all done, concatenates the outputs into one file. This works well with the tabular output format, but what about XML? I want to test a piece of software that only accepts XML input (a naive cat does not work), and I'm struggling with that case.
Do you have an idea of how to build one consistent XML output from several (possibly hundreds of) "sub-"outputs? I need the result to be parsable by Biopython's NCBIXML. ;-)
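For reference, all the merged file really needs to do is survive a pass through Biopython's parser, roughly like this (a minimal check; "merged.xml" is just a placeholder name):

from Bio.Blast import NCBIXML

with open("merged.xml") as handle:
    for record in NCBIXML.parse(handle):
        print("%s: %i hits" % (record.query, len(record.alignments)))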
I did the same task in the Galaxy wrappers for BLAST+ (although task splitting like this isn't on by default). If you understand Python, you can see the code here, in the merge method of the BlastXml class.
Hi Peter, it appears as though the merge method would work well for my situation, but I am unsure how to 'call' the merge function. I have copied the def merge block to a new Python file -- can I simply paste this definition before the start of my actual script? And do I call the merge function by writing merge(directory_containing_xml_files_to_be_merged, output_filename)?
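If I read the Galaxy code correctly, merge there is a static method that expects a list of file names (not a directory) plus the name of the output file, so after pasting the definition into your script the call should look roughly like this (just a sketch, with made-up file names):

import glob

# Gather the split outputs as an explicit list of paths (hypothetical pattern)
split_files = sorted(glob.glob("blast_piece_*.xml"))
# Write the merged result into a single file; merge() is the copied function
merge(split_files, "merged.xml")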
Hey, it's good to know I'm not the only one interested, as I had no votes on the question ;-)
You can find the modified code, which works as is, here:
http://code.google.com/p/bioman/source/browse/BlastXMLmerge.py
Enjoy!
(And thanks again to Galaxy folks)
If the XML is well formatted, the statistics and the Hit index are meaningless in a merged file anyway, so you can simply splice the <Hit> blocks together in the shell:
# get one header
while read L; do [[ "${L}" == "<Hit>" ]] && break; echo $L; done < blast1.xml > output.xml

# loop over each file. Take the fragments between "<Hit>" and "</Hit>".
for B in blast1.xml blast2.xml blast3.xml; do
    while read L; do
        if [[ "${L}" == *"<Hit>"* ]]; then
            echo $L >> output.xml
            while read L; do
                echo $L >> output.xml
                if [[ "${L}" == *"</Hit>"* ]]; then break; fi
            done
        fi
    done < $B
done

# get one footer
grep -A 10000 "</Iteration_hits>" blast1.xml >> output.xml
Here is the Python code from Galaxy, cleaned up, tested, and bundled into a script callable from the shell:
#!/usr/bin/env python
"""
A custom-made script to merge the XML BLAST outputs
when queries are run in parallel by input-chopping.

Copied and adapted from https://bitbucket.org/peterjc/galaxy-central/src/5cefd5d5536e/tools/ncbi_blast_plus/blast.py

Tested working with BLAST 2.2.28+

The fields:
    <Iteration_iter-num></Iteration_iter-num>
    <Iteration_query-ID></Iteration_query-ID>
should not be used from the merged file.

You can use this script from the shell like this:
$ blast_merge_xml reads*.xml > merged.xml
"""

# Modules #
import sys, os, sh
from contextlib import nested

###############################################################################
def merge_blast_xml(inputs, output):
    """Will merge the results from several BLAST searches
    when XML output is used."""
    for i, handle in enumerate(inputs):
        path = handle.name
        header = handle.readline()
        if not header:
            raise Exception("BLAST XML file '%s' was empty" % path)
        if header.strip() != '<?xml version="1.0"?>':
            raise Exception("BLAST file '%s' is not an XML file" % path)
        line = handle.readline()
        header += line
        if line.strip()[0:59] != '<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN"':
            raise Exception("BLAST file '%s' is not a BLAST XML file" % path)
        # Read the rest of the header, up to and including the first <Iteration>
        while True:
            line = handle.readline()
            if not line:
                raise Exception("BLAST XML file '%s' ended prematurely" % path)
            header += line
            if "<Iteration>" in line: break
            if len(header) > 10000:
                raise Exception("BLAST file '%s' has too long a header" % path)
        if "<BlastOutput>" not in header:
            raise Exception("BLAST XML file '%s' header seems bad" % path)
        if i == 0:
            output.write(header)
            old_header = header
        elif old_header[:300] != header[:300]:
            raise Exception("BLAST XML headers don't match for '%s'" % path)
        else:
            output.write(" <Iteration>\n")
        for line in handle:
            if "</BlastOutput_iterations>" in line: break
            output.write(line)
    output.write("</BlastOutput_iterations>\n")
    output.write("</BlastOutput>\n\n")
    output.flush()

###############################################################################
# Get arguments #
xml_paths = sys.argv[1:]

# Check for non-evaluated wild char #
if len(xml_paths) == 1 and '*' in xml_paths[0]: xml_paths = sh.glob(xml_paths[0])

# Check for existence #
for path in xml_paths:
    if not os.path.exists(path): raise Exception("No file at '%s'." % path)

# Check for user intelligence #
if len(xml_paths) == 1: raise Exception("No need to join just one file.")

# Do it #
with nested(*map(open, xml_paths)) as inputs: merge_blast_xml(inputs, sys.stdout)
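A side note: contextlib.nested only exists on Python 2. On Python 3 the last line would need something like this instead (an untested sketch using ExitStack):

from contextlib import ExitStack

with ExitStack() as stack:
    inputs = [stack.enter_context(open(p)) for p in xml_paths]
    merge_blast_xml(inputs, sys.stdout)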
That worked perfectly. Thanks!
I guess in some sense I'm one of the Galaxy folk, although not one of their core developers.
Dear @Manu Prestat,
Thanks for the python script.
If I'm not mistaken, in order to execute the script I should do:
python BlastXMLmerge.py merged.xml output1.xml output2.xml output3.xml
Please correct me if I was wrong. Thanks.
You're right KJ ;-)
Dear @Manu Prestat,
I tried to run the script on my XML files, but it failed with this error:
-bash: /usr/local/bin/python: Argument list too long
I'm trying to run this script with 7,749 files.
I tried on my Mac (OS X 10.8.4) and on Red Hat Linux; both failed.
Any hints?
Thank you very much
This is a command-line length problem. First of all, run the script in the same directory as the input XML files so you can use local paths (directory prefixes make the command longer).
If that is not enough, either do it in batches, or modify the script to accept a folder name as input (although then you will have less control over the file order...).
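For the folder-name route, the argument handling could be changed to something along these lines (a rough, untested sketch; note that sorted() gives lexicographic order, which may not match the order the queries were split in):

import glob, os, sys

# Hypothetical replacement for the argument handling: accept either a folder
# name or an explicit list of files on the command line.
args = sys.argv[1:]
if len(args) == 1 and os.path.isdir(args[0]):
    # Folder given: take every .xml file inside it, sorted by name
    xml_paths = sorted(glob.glob(os.path.join(args[0], "*.xml")))
else:
    xml_paths = args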