Question: How To Concatenate/Merge Blast Xml Outputs?
4
gravatar for Manu Prestat
6.5 years ago by
Manu Prestat3.9k
Marseille, France
Manu Prestat3.9k wrote:

I process my blast (BLASTN 2.2.26+) searches through a script that divides the fasta inputs into N pieces, distributes one blast instance for each piece (in N processes), and, once over, concatenates the outputs into one file. It works pretty well with the table output format, but what about the xml? Indeed, I want to test a software that takes as input only xml format (a dummy cat does not work), and I'm struggling with that case.

Have you an idea on how to make one consistant xml output from several (it can be hundreds) "sub-"outputs? (I need a biopython NCBIXML parsability) ;-) ).

xml blast • 6.6k views
ADD COMMENTlink modified 5.6 years ago by xapple230 • written 6.5 years ago by Manu Prestat3.9k
3
gravatar for Peter
6.5 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

I did the same task in the Galaxy wrappers for BLAST+ (although task splitting like this isn't on by default). If you understand Python, you can see the code here in the merge method of the BlastXml class

ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by Peter5.8k

That worked perfectly. Thanks!

ADD REPLYlink written 6.5 years ago by Manu Prestat3.9k

Hi Peter, it appears as though the merge method would work well for my situation, but I am unsure how to 'call' the merge function. I have copied the def merge: to a new python file -- can I simply paste this definition before the start of my actual script? And do I call the merge function by writing merge(directorycontainingxmlfilestobemerged, output_filename) ?

ADD REPLYlink written 6.4 years ago by charles.bridges70

Hey, it's good to know I'm not the only one interested as I had no vote for the question ;-) You can find the modified code to make it working as is here: http://code.google.com/p/bioman/source/browse/BlastXMLmerge.py Enjoy! (And thanks again to Galaxy folks)

ADD REPLYlink written 6.4 years ago by Manu Prestat3.9k

I guess in some sense I'm one of the Galaxy folk, although not one of their core developers.

ADD REPLYlink written 6.4 years ago by Peter5.8k

Dear @Manu Prestat,

Thanks for the python script.

If I'm not mistaken in order to execute the script, I should do

python BlastXMLmerge.py merged.xml output1.xml output2.xml output3.xml

Please correct me if I was wrong. Thanks.

ADD REPLYlink written 5.5 years ago by KJ Lim120

You're right KJ ;-)

ADD REPLYlink written 5.5 years ago by Manu Prestat3.9k

Dear @Manu Prestat,

I tried to run the script with my xml files, but python raised this error:

-bash: /usr/local/bin/python: Argument list too long

I trying to run this script with 7,749 items.

I tried in my mac (OSX 10.8.4) and in redhat linux, both failed.

Any hints?

Thank you very much

ADD REPLYlink written 5.4 years ago by pedrommeirelles0

This is a comman line length problem. First of all run this in the same directory as the input XML files to you can use local paths (avoid directory names naming the command longer).

If that is not enough, either do it in batches, or modify the script to accept a folder name as input (although then you will have less control over the file order...).

ADD REPLYlink written 5.4 years ago by Peter5.8k
1
gravatar for Pierre Lindenbaum
6.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

if the XML is well formatted, statistics and the Hit-index are meaningless :

#get one header
while read L; do [[ "${L}" == "<Hit>" ]] && break ; echo $L; done < blast1.xml > output.xml

#loop over each file. Take the fragments  between "<Hit>" and "</Hit>".
for B in blast1.xml blast2.xml blast3.xml ; do while read L; do if [[  "${L}" == *"<Hit>"* ]] ; then echo $L  >> output.xml ; while read L; do echo $L  >> output.xml ;  if [[  "${L}" == *"</Hit>"* ]] ; then break; fi; done; fi ; done < $B ; done 

#get one footer
 grep -A 10000 "</Iteration_hits>" blast1.xml  >> output.xml
ADD COMMENTlink written 6.5 years ago by Pierre Lindenbaum118k

I don't know if it worked properly. I think there's a problem in the loop. I tried but after 3 hours of run, I stopped it.

ADD REPLYlink written 6.5 years ago by Manu Prestat3.9k
0
gravatar for xapple
5.6 years ago by
xapple230
UU
xapple230 wrote:

Here is the python code from galaxy cleaned up, tested, and bundled in a bash callable:

#!/usr/bin/env python

"""
A custom made script to merge the XML blast outputs
when queries are run in parallel by input-choping.

Copied and adapted from https://bitbucket.org/peterjc/galaxy-central/src/5cefd5d5536e/tools/ncbi_blast_plus/blast.py

Tested working with BLAST 2.2.28+

The fields:
  <Iteration_iter-num></Iteration_iter-num>
  <Iteration_query-ID></Iteration_query-ID>
should not be used from the merged file.

You can use this script from the shell like this:
$ blast_merge_xml reads*.xml > merged.xml
"""

# Modules #
import sys, os, sh
from contextlib import nested

###############################################################################
def merge_blast_xml(inputs, output):
    """Will merge the results from several BLAST searches
    when XML output is used."""
    for i,handle in enumerate(inputs):
        path = handle.name
        header = handle.readline()
        if not header:
            raise Exception("BLAST XML file '%s' was empty" % path)
        if header.strip() != '':
            raise Exception("BLAST file '%s' is not an XML file" % path)
        line = handle.readline()
        header += line
        if line.strip()[0:59] != '" in line: break
            if len(header) > 10000:
                raise Exception("BLAST file '%s' has a too long a header" % path)
        if "<BlastOutput>" not in header:
            raise Exception("BLAST XML file '%s' header's seems bad" % path)
        if i == 0:
            output.write(header)
            old_header = header
        elif old_header[:300] != header[:300]:
            raise Exception("BLAST XML headers don't match" % path)
        else: output.write("    <Iteration>\n")
        for line in handle:
            if "</BlastOutput_iterations>" in line: break
            output.write(line)
    output.write("</BlastOutput_iterations>\n")
    output.write("</BlastOutput>\n\n")
    output.flush()

###############################################################################
# Get arguments #
xml_paths = sys.argv[1:]
# Check for non-evaluated wild char #
if len(xml_paths)==1 and '*' in xml_paths[0]: xml_paths = sh.glob(xml_paths[0])
# Check for existence #
for path in xml_paths:
    if not os.path.exists(path): raise Exception("No file at '%s'." % path)
# Check for user intelligence #
if len(xml_paths) == 1: raise Exception("No need to join just one file.")
# Do it #
with nested(*map(open,xml_paths)) as inputs: merge_blast_xml(inputs, sys.stdout)
ADD COMMENTlink modified 5.6 years ago • written 5.6 years ago by xapple230
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 860 users visited in the last hour