Question: Reduce BLAST XML Size?
1
7.8 years ago by
Prohan350
United States
Prohan350 wrote:

Hi, I have a really large BLAST XML file, something like 30 GB in size.

I'd like to reduce it so I can run through it quicker with Biopython.

Is there a way to reduce the file by keeping something like the top 25 hits, based on bitscore, for each query?

Preferably I'd like to do this with Python/Biopython.

Thanks

python biopython blast xml • 3.0k views
ADD COMMENTlink written 7.8 years ago by Prohan350
3

It would be more sensible to generate a smaller XML file in the first instance, using BLAST command-line parameters. But perhaps you did not generate the original file?
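
As a rough sketch of that suggestion, assuming the NCBI BLAST+ `blastn` program (the thread does not say which BLAST program or version was used), the hit list can be capped when the search is run. The file and database names are placeholders, and `build_blast_command` is a hypothetical helper, not part of any library.

```python
def build_blast_command(query_fasta, database, out_xml, max_hits=25):
    """Build a BLAST+ blastn command that writes XML with a capped hit list.

    Assumes BLAST+ style options; legacy blastall uses different flags.
    """
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-outfmt", "5",                      # 5 = XML output in BLAST+
        "-max_target_seqs", str(max_hits),   # cap on subject sequences kept per query
        "-out", out_xml,
    ]

# Placeholder names, for illustration only
cmd = build_blast_command("queries.fasta", "my_database", "reduced.xml")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed. Capping hits during the search may not give exactly the same set as sorting a full result by bitscore afterwards, so it is worth checking on a small sample.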

ADD REPLYlink written 7.8 years ago by Neilfws48k

Currently Biopython parses BLAST XML, but doesn't write it out again. If it could, that would be the most elegant Biopython-based solution to your needs.

Otherwise you'll need to write something specific, possibly using one of the built-in Python XML libraries, or hand-coded for this specific need.

ADD REPLYlink written 7.8 years ago by Peter5.8k

Yes, next time I'll reduce the number of reported HSPs. I was trying to be smart by keeping everything and didn't think parsing would be a problem later.

ADD REPLYlink written 7.8 years ago by Prohan350

If you are interested in parsing some information from the big XML file, you can use the event parsing strategy. lxml in Python has a great event parser. Please drop a line after this comment if you are interested in this approach and I will post a detailed methodology.
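
A minimal sketch of the event-parsing idea, using the standard library's `xml.etree.ElementTree.iterparse` in place of lxml (lxml provides an `iterparse` with a similar interface). It assumes the usual NCBI BLAST XML element names (`Iteration`, `Iteration_query-def`, `Hit`, `Hit_def`, `Hsp_bit-score`); `iter_best_hits` and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def iter_best_hits(source):
    """Yield (query_def, hit_def, bit_score) for the best-scoring hit of each query.

    `source` is a filename or a binary file object holding BLAST XML.
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Iteration":
            continue
        query_def = elem.findtext("Iteration_query-def")
        best = None
        for hit in elem.iter("Hit"):
            scores = [float(s.text) for s in hit.iter("Hsp_bit-score")]
            if scores and (best is None or max(scores) > best[1]):
                best = (hit.findtext("Hit_def"), max(scores))
        if best is not None:
            yield query_def, best[0], best[1]
        elem.clear()  # drop this query's hits once they have been processed

# Tiny made-up document standing in for a real BLAST XML file
sample = b"""<BlastOutput><BlastOutput_iterations>
<Iteration><Iteration_query-def>query1</Iteration_query-def><Iteration_hits>
<Hit><Hit_def>subject A</Hit_def><Hit_hsps><Hsp><Hsp_bit-score>50.0</Hsp_bit-score></Hsp></Hit_hsps></Hit>
<Hit><Hit_def>subject B</Hit_def><Hit_hsps><Hsp><Hsp_bit-score>80.5</Hsp_bit-score></Hsp></Hit_hsps></Hit>
</Iteration_hits></Iteration>
</BlastOutput_iterations></BlastOutput>"""

for row in iter_best_hits(io.BytesIO(sample)):
    print(row)
```

Because each `Iteration` is cleared after use, only one query's hits are held in memory at a time, which is the point of the event-driven approach for a file of this size.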

ADD REPLYlink written 7.5 years ago by Bio_Neo30

@Bio_Neo: Could you describe this in more detail? I have a BLAST result with the top 50 hits for a large multi-FASTA, in XML format. I need to reduce this to the top 5 hits for use in some downstream processes. Could you explain how to use the event parsing strategy for this?

ADD REPLYlink written 7.3 years ago by User 80800
3
7.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

To extract a subpart of a large XML file, the idea is to use an XML pull parser. Read and echo each node of your XML until you find a "Hit" element. Then parse it, but only echo the blocks of XML elements you need.
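
A Python sketch of the same echo-and-filter idea, offered as one possible approach rather than a tested tool. It uses the standard library's `iterparse`, assumes the usual NCBI BLAST XML layout (a `BlastOutput` root containing `BlastOutput_iterations`), ranks hits by their best `Hsp_bit-score`, and does not copy the DOCTYPE line or renumber `Hit_num`. The names `trim_blast_xml`, `best_bitscore` and `_hit` are invented for this example.

```python
import io
import xml.etree.ElementTree as ET

def best_bitscore(hit):
    """Highest HSP bit score of a <Hit>, or 0.0 if it has no HSPs."""
    return max((float(s.text) for s in hit.iter("Hsp_bit-score")), default=0.0)

def trim_blast_xml(source, out, keep=25):
    """Copy BLAST XML from `source` to the binary stream `out`, keeping at most
    `keep` hits per query. Assumes a <BlastOutput_iterations> element exists."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)                      # start event of <BlastOutput>
    out.write(b'<?xml version="1.0"?>\n<BlastOutput>\n')
    iterations = None
    for event, elem in context:
        if event == "start" and elem.tag == "BlastOutput_iterations":
            iterations = elem
            # Header elements that precede the iterations are complete: echo them.
            for child in list(root):
                if child is iterations:
                    break
                out.write(ET.tostring(child))
                root.remove(child)
            out.write(b"<BlastOutput_iterations>\n")
        elif event == "end" and elem.tag == "Iteration":
            hits = elem.find("Iteration_hits")
            if hits is not None:
                ranked = sorted(hits.findall("Hit"), key=best_bitscore, reverse=True)
                for extra in ranked[keep:]:      # everything below the top `keep`
                    hits.remove(extra)
            out.write(ET.tostring(elem))
            iterations.remove(elem)              # free memory used by this query
    out.write(b"</BlastOutput_iterations>\n")
    for child in root:                           # anything after the iterations
        if child is not iterations:
            out.write(ET.tostring(child))
    out.write(b"</BlastOutput>\n")

def _hit(name, score):  # helper for the demo data only
    return ("<Hit><Hit_def>%s</Hit_def><Hit_hsps><Hsp><Hsp_bit-score>%s"
            "</Hsp_bit-score></Hsp></Hit_hsps></Hit>" % (name, score))

sample = ("<BlastOutput><BlastOutput_program>blastn</BlastOutput_program>"
          "<BlastOutput_iterations><Iteration><Iteration_hits>"
          + _hit("A", 10) + _hit("B", 30) + _hit("C", 20)
          + "</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>")

buffer = io.BytesIO()
trim_blast_xml(io.BytesIO(sample.encode()), buffer, keep=2)
kept = [h.findtext("Hit_def") for h in ET.fromstring(buffer.getvalue()).iter("Hit")]
print(kept)
```

For a real file, `source` would be the path to the large XML and `out` a file opened with `open("reduced.xml", "wb")` (a placeholder name). It would be sensible to confirm that the reduced file still parses with `NCBIXML.parse` before relying on it.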

In Java, a pull parser specific to a given DTD can be generated with xjc.

For an example, see http://biostar.stackexchange.com/questions/5862/how-to-retrieve-human-proteins-sequence-containing-a-given-domain/5868#5868 .

If your document can fit in memory, you can use XSLT:

see http://biostar.stackexchange.com/questions/2869 or http://biostar.stackexchange.com/questions/8361

ADD COMMENTlink written 7.8 years ago by Pierre Lindenbaum118k

Ok thanks - this sounds like what I need to do.

ADD REPLYlink written 7.8 years ago by Prohan350
1
7.8 years ago by
Michael Barton1.8k
Akron, Ohio, United States
Michael Barton1.8k wrote:

You could compress the file using something like parallel bzip, then read from the file as a compressed IO pipe. Obviously this only reduces the physical size of the file, not its contents. I imagine you should also expect a longer running time.
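
A small sketch of reading a compressed file as a stream, assuming Python's standard `bz2` module and a bzip2-compressed BLAST XML file. The function `count_queries` and the demo file are invented for illustration; here the stream is only used to count `Iteration` elements.

```python
import bz2
import os
import tempfile
import xml.etree.ElementTree as ET

def count_queries(path):
    """Count <Iteration> elements in a bzip2-compressed BLAST XML file."""
    total = 0
    with bz2.open(path, "rb") as handle:       # decompresses on the fly
        for _, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == "Iteration":
                total += 1
                elem.clear()                    # discard content already counted
    return total

# Demo on a tiny stand-in file written to a temporary directory
xml_bytes = (b"<BlastOutput><BlastOutput_iterations>"
             b"<Iteration/><Iteration/><Iteration/>"
             b"</BlastOutput_iterations></BlastOutput>")
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.xml.bz2")
    with bz2.open(demo_path, "wb") as fh:
        fh.write(xml_bytes)
    n_queries = count_queries(demo_path)
print(n_queries)
```

The uncompressed XML never touches the disk this way, at the cost of decompression time on every pass through the file.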

ADD COMMENTlink written 7.8 years ago by Michael Barton1.8k
1
7.1 years ago by
Karthik20
India
Karthik20 wrote:

A solution for the future would perhaps be to bunch queries into multiple files, so that you will have multiple smaller XML outputs, which you could easily process using Biopython, and perhaps even in parallel, if you use a cluster.
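
One way this batching might look, as a sketch in plain Python with no Biopython dependency. `fasta_batches` is a hypothetical helper and the batch size is arbitrary.

```python
import io

def fasta_batches(handle, batch_size):
    """Yield lists of FASTA records (as text), `batch_size` records per list."""
    batch, record = [], []
    for line in handle:
        if line.startswith(">") and record:
            # A new header closes the previous record
            batch.append("".join(record))
            record = []
            if len(batch) == batch_size:
                yield batch
                batch = []
        record.append(line)
    if record:                      # last record in the file
        batch.append("".join(record))
    if batch:                       # final, possibly short, batch
        yield batch

# Five made-up sequences split into batches of two
demo = io.StringIO("".join(">seq%d\nACGT\n" % i for i in range(1, 6)))
sizes = [len(b) for b in fasta_batches(demo, 2)]
print(sizes)
```

Each batch could then be written to its own query file and searched separately, giving one smaller XML output per batch.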

ADD COMMENTlink written 7.1 years ago by Karthik20
0
7.1 years ago by
Lee Katz2.9k
Atlanta, GA
Lee Katz2.9k wrote:

Someone else had a similar question at about the same time as you. This looks like a good answer:

http://biostar.stackexchange.com/questions/17748/can-biopython-parse-gzipped-xml-from-blast/17752#17752

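import gzip
from Bio.Blast import NCBIXML

# Open the gzip-compressed BLAST XML and parse it lazily, one query at a time
with gzip.open('output.xml.gz', 'rb') as blast_file:
    for blast_record in NCBIXML.parse(blast_file):
        # Each record holds the hits (alignments) for a single query
        print(blast_record.query, len(blast_record.alignments))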
ADD COMMENTlink written 7.1 years ago by Lee Katz2.9k

The file size isn't so much the issue; it's the time it takes to iterate through the files that's annoying.

The whole reason I'm trying to do this is to get the full description of the hit. My solution for now has been to parse the default human-readable BLAST output directly into tab-delimited files.

These are much smaller files and parse quickly.
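
For comparison, a sketch of keeping the top hits per query from BLAST's own tabular output, rather than from the custom table described above. It assumes the default 12-column layout (query ID in column 1, bit score in column 12) and that all rows for a query are contiguous; `top_hits` and the demo rows are invented for illustration.

```python
import csv
import heapq
import io
from itertools import groupby

def top_hits(handle, keep=25):
    """Yield (query_id, rows) with at most `keep` rows per query, best bit score first.

    Assumes tab-separated rows grouped by query, with the bit score in column 12.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = (r for r in reader if r and not r[0].startswith("#"))  # skip comments
    for query, group in groupby(rows, key=lambda r: r[0]):
        yield query, heapq.nlargest(keep, group, key=lambda r: float(r[11]))

# Made-up rows in the assumed 12-column layout
demo = io.StringIO(
    "q1\ts1\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q1\ts2\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t120\n"
    "q1\ts3\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t150\n"
    "q2\ts9\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-5\t45.1\n"
)
result = {q: [r[1] for r in rows] for q, rows in top_hits(demo, keep=2)}
print(result)
```

Only one query's rows are held in memory at a time, so this should scale to large tables as long as the grouping assumption holds.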

Thanks for the thoughts though.

ADD REPLYlink written 7.1 years ago by Prohan350

Have you had any problems parsing the tabbed output? I am more 'comfortable' with XML, since I am confident that it will be correctly and reliably parsed (XML is for parsing anyways, right?).

ADD REPLYlink written 7.1 years ago by Karthik20

I've been parsing the human readable blast output just fine without any problems - although it supposedly can change whenever NCBI feels like it.

As for the tabbed output - never had a problem with that.

I'd prefer XML too but the files are sssoooo big

ADD REPLYlink written 6.9 years ago by Prohan350