Question

Reduce Blast Xml Size?

1

14.1 years ago

Prohan ▴ 350

Hi, I have a really large BLAST XML file, something like 30 GB in size.

I'd like to reduce it so I can run through it quicker with Biopython.

Is there a way to reduce the file by keeping something like the top 25 hits, based on bitscore, for each query?

Preferably I'd like to do this with Python/Biopython.

Thanks

python biopython blast xml • 5.6k views

ADD COMMENT • link updated 14.1 years ago by Lee Katz ★ 3.2k • written 14.1 years ago by Prohan ▴ 350

3

It would be more sensible to generate a smaller XML file in the first instance, using BLAST command-line parameters. But perhaps you did not generate the original file?
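
As a rough sketch of what that could look like, assuming the search is run with BLAST+ (not the older `blastall`), the helper below builds a command line that caps the hits and HSPs written to XML. `-outfmt 5`, `-max_target_seqs` and `-max_hsps` are BLAST+ options; the function name, the default program and the file names are hypothetical placeholders.

```python
def blast_xml_command(query, db, out, program="blastn", max_hits=25, max_hsps=1):
    """Build a BLAST+ command that writes XML with a capped number of hits.

    Assumes BLAST+ is installed; all paths here are placeholders.
    """
    return [
        program,
        "-query", query,
        "-db", db,
        "-outfmt", "5",                      # 5 = BLAST XML
        "-max_target_seqs", str(max_hits),   # hits kept per query
        "-max_hsps", str(max_hsps),          # HSPs kept per hit
        "-out", out,
    ]


cmd = blast_xml_command("queries.fasta", "my_db", "results.xml")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the placeholder names are replaced with real ones.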

ADD REPLY • link 14.1 years ago by Neilfws 49k

0

Currently Biopython parses BLAST XML, but doesn't write it out again. Being able to write it back out would be the most elegant Biopython-based solution to your needs.

Otherwise you'll need to write something specific, possibly using one of the built-in Python XML libraries, or hand-coded for this specific need.

ADD REPLY • link 14.1 years ago by Peter 6.0k

0

Yes, next time I'll reduce the number of reported HSPs. I was trying to be smart by keeping everything and didn't think parsing would be a problem later.

ADD REPLY • link 14.1 years ago by Prohan ▴ 350

0

If you are interested in parsing some information from the big XML file, you can use the event parsing strategy. lxml in Python has a great event parser. Please drop a line after this comment if you are interested in this approach and I will post a detailed methodology.

ADD REPLY • link 13.8 years ago by Bio_Neo ▴ 30

0

@Bio_Neo: Could you describe this in more detail? I have a BLAST result with the top 50 hits for a large multi-FASTA, in XML format. I need to reduce this to the top 5 hits for use in some downstream processes. Could you explain how to use the event parsing strategy for this?

ADD REPLY • link 13.6 years ago by User 8080 • 0

Ram · Answer 1 · 2011-06-16

3

14.1 years ago

Pierre Lindenbaum 166k

To extract a subset of a large XML file, the idea is to use an XML pull parser. Read and echo each node of your XML until you find a <Hit> element. Then keep parsing, but only echo the blocks of XML elements you need.

In Java, a pull parser specific to a given DTD can be generated with xjc.

For an example, see A: How To Retrieve Human Proteins Sequence Containing A Given Domain.
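
For the Python side of the question, here is a minimal sketch of the same streaming idea using the standard library's `xml.etree.ElementTree.iterparse` (not lxml and not a generated Java parser). It assumes the usual BLAST XML element names (`BlastOutput_iterations`, `Iteration`, `Iteration_hits`, `Hit`, `Hsp_bit-score`), ranks each hit by its best HSP bit score, and does not copy the DOCTYPE line to the output. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def best_bitscore(hit):
    """Highest Hsp_bit-score among a Hit's HSPs (0.0 if it has none)."""
    scores = [float(e.text) for e in hit.iter("Hsp_bit-score")]
    return max(scores, default=0.0)


def keep_top_hits(src, dst, top_n=25):
    """Stream BLAST XML from src to dst (binary handles), keeping top_n hits per query."""
    context = ET.iterparse(src, events=("start", "end"))
    _, root = next(context)  # first event is the start of <BlastOutput>
    iterations = None
    dst.write(b'<?xml version="1.0"?>\n<BlastOutput>\n')
    for event, elem in context:
        if event == "start" and elem.tag == "BlastOutput_iterations":
            # Header fields come before this element, so they are complete.
            for child in list(root):
                if child is elem:
                    break
                dst.write(ET.tostring(child))
                root.remove(child)
            dst.write(b"<BlastOutput_iterations>\n")
            iterations = elem
        elif event == "end" and elem.tag == "Iteration":
            hits = elem.find("Iteration_hits")
            if hits is not None:
                ranked = sorted(hits, key=best_bitscore, reverse=True)
                for hit in ranked[top_n:]:
                    hits.remove(hit)  # survivors keep their original order
            dst.write(ET.tostring(elem))
            dst.write(b"\n")
            iterations.clear()  # drop finished queries to keep memory flat
    dst.write(b"</BlastOutput_iterations>\n</BlastOutput>\n")
```

Usage would be along the lines of `keep_top_hits(open("big.xml", "rb"), open("small.xml", "wb"), top_n=25)`, with both file names being placeholders. Only one query's hits are held in memory at a time, which is the point of the pull-parsing approach.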

If your document can fit in memory, you can use XSLT:

see Standalone Blast Options or How To Filter A Blast Xml Output ?

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 14.1 years ago by Pierre Lindenbaum 166k

0

OK, thanks. This sounds like what I need to do.

ADD REPLY • link 14.1 years ago by Prohan ▴ 350

score 1 · Answer 2 · 2011-06-16

1

14.1 years ago

Michael Barton ★ 1.9k

You could compress the file using something like parallel bzip, then read from the file as a compressed IO pipe. Obviously this only reduces the physical size of the file, not its contents. I imagine you should also expect a longer running time.
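
A small sketch of reading a bzip2-compressed file as a stream in Python, assuming the XML was compressed to something like `blast.xml.bz2` (a placeholder name). It uses only the standard library's `bz2` and `xml.etree.ElementTree` modules and simply counts queries; `count_queries` is a made-up helper name.

```python
import bz2
import xml.etree.ElementTree as ET


def count_queries(path):
    """Count <Iteration> elements in a bzip2-compressed BLAST XML file."""
    n = 0
    # bz2.open decompresses on the fly, so the XML never hits the disk uncompressed.
    with bz2.open(path, "rb") as handle:
        for _, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == "Iteration":
                n += 1
                elem.clear()  # release this query's hits
    return n
```

The same handle could be passed to any parser that accepts a file-like object; the decompression cost is paid on every pass over the file.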

ADD COMMENT • link 14.1 years ago by Michael Barton ★ 1.9k

score 1 · Answer 3 · 2012-02-23

1

13.4 years ago

Karthik ▴ 20

A solution for the future would perhaps be to bunch queries into multiple files, so that you will have multiple smaller XML outputs, which you could easily process using Biopython, and perhaps even in parallel, if you use a cluster.
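
One way this could be done, sketched with plain Python and no Biopython dependency: split the query FASTA into chunks of a fixed number of records, then run BLAST on each chunk separately. The function names, the chunk size and the output naming scheme are all hypothetical.

```python
def split_fasta(lines, records_per_chunk):
    """Yield lists of FASTA lines, each holding at most records_per_chunk records."""
    chunk, count = [], 0
    for line in lines:
        if line.startswith(">"):
            if count == records_per_chunk:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk


def write_chunks(fasta_path, prefix, records_per_chunk=1000):
    """Write each chunk to '<prefix>.NNNN.fasta' and return the paths written."""
    paths = []
    with open(fasta_path) as src:
        for i, chunk in enumerate(split_fasta(src, records_per_chunk)):
            out_path = f"{prefix}.{i:04d}.fasta"
            with open(out_path, "w") as out:
                out.writelines(chunk)
            paths.append(out_path)
    return paths
```

Each resulting file can then be searched on its own, which keeps every XML output small enough to parse comfortably.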

ADD COMMENT • link 13.4 years ago by Karthik ▴ 20

Ram · Answer 4 · 2012-02-23

0

13.4 years ago

Lee Katz ★ 3.2k

Someone else had a similar question at about the same time as you. This looks like a good answer: A: Can Biopython Parse Gzipped Xml From Blast?

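import gzip
from Bio.Blast import NCBIXML

# Open the gzipped XML in binary mode; it is decompressed as it is read
blast_file = gzip.open('output.xml.gz', 'rb')
# NCBIXML.parse returns an iterator that yields one record per query
blast_records = NCBIXML.parse(blast_file)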

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 13.4 years ago by Lee Katz ★ 3.2k

0

The file size isn't so much the issue; it's the time it takes to iterate through the files that's annoying.

The whole reason I'm trying to do this is to get the full description of the hit. My solution now has been to directly parse the human-readable default BLAST output to tab-delimited.

These are much smaller files and parse quickly.
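
For comparison, here is a minimal sketch of picking the top hits per query from tab-delimited results. It assumes the standard 12-column BLAST+ tabular layout (`-outfmt 6`, bit score in the last column), which may differ from the hand-converted files described above, and it assumes all rows for a query are contiguous. `top_hits_tabular` is a made-up name.

```python
import csv
from itertools import groupby


def top_hits_tabular(handle, top_n=25):
    """Yield (query_id, rows) with the top_n rows per query by bit score."""
    rows = (
        r for r in csv.reader(handle, delimiter="\t")
        if r and not r[0].startswith("#")  # skip blank and comment lines
    )
    # groupby relies on each query's rows being next to each other in the file
    for query_id, group in groupby(rows, key=lambda r: r[0]):
        ranked = sorted(group, key=lambda r: float(r[11]), reverse=True)
        yield query_id, ranked[:top_n]
```

Only one query's rows are held in memory at a time, so this stays quick even on long result files.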

Thanks for the thoughts though.

ADD REPLY • link 13.4 years ago by Prohan ▴ 350

0

Have you had any problems parsing the tabbed output? I am more 'comfortable' with XML, since I am confident that it will be correctly and reliably parsed (XML is for parsing anyways, right?).

ADD REPLY • link 13.4 years ago by Karthik ▴ 20

0

I've been parsing the human-readable BLAST output just fine without any problems, although it supposedly can change whenever NCBI feels like it.

As for the tabbed output, I've never had a problem with that.

I'd prefer XML too but the files are sssoooo big

ADD REPLY • link 13.3 years ago by Prohan ▴ 350