Reduce Blast Xml Size?
12.9 years ago
Prohan ▴ 350

Hi, I have a really large BLAST XML file, something like 30 GB in size.

I'd like to reduce it so I can run through it quicker with Biopython.

Is there a way to reduce the file by keeping something like the top 25 hits, based on bitscore, for each query?

Preferably I'd like to do this with Python/Biopython.

Thanks

python biopython blast xml • 4.7k views

It would be more sensible to generate a smaller XML file in the first instance, using BLAST command-line parameters. But perhaps you did not generate the original file?
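As a rough sketch of what limiting the output at search time could look like, here is a small Python helper that only builds the command line. It assumes the BLAST+ suite (where `-outfmt 5` selects XML and `-max_target_seqs` caps the subjects reported per query); the function name, file names, and database name are made up for illustration.

```python
def build_blast_command(query, db, out, max_hits=25, program="blastp"):
    """Return an argument list for a BLAST+ search that writes trimmed XML.

    Hypothetical helper: query, db and out are placeholder names.
    """
    return [
        program,
        "-query", query,
        "-db", db,
        "-outfmt", "5",                     # 5 = XML output in BLAST+
        "-max_target_seqs", str(max_hits),  # cap the subjects reported per query
        "-out", out,
    ]


cmd = build_blast_command("queries.fasta", "nr", "hits.xml")
print(" ".join(cmd))
# To actually run the search: subprocess.run(cmd, check=True)
```

The legacy `blastall` program uses different option names, so check the help output of whichever BLAST you have installed.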


Currently Biopython parses BLAST XML but can't write it out again. If it could, that would be the most elegant Biopython-based solution to your needs.

Otherwise you'll need to write something specific, possibly using one of the built-in Python XML libraries, or hand-coded for this specific need.


Yes, next time I'll reduce the number of reported HSPs. I was trying to be smart by keeping everything and didn't think parsing would be a problem later.


If you are interested in parsing some information from the big XML file, you can use an event-based parsing strategy. lxml in Python has a great event parser. Please drop a line after this comment if you are interested in this approach and I will post a detailed methodology.


@Bio_Neo: Could you describe this in more detail? I have a BLAST result in XML format with the top 50 hits for each sequence of a large multi-FASTA. I need to reduce this to the top 5 hits for some downstream processing. Could you explain how to use the event parsing strategy for this?
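No detailed methodology was posted in this thread, so the following is only a sketch of how event parsing might be applied here. It uses `iterparse` from the standard library's `xml.etree.ElementTree` rather than lxml (lxml offers a similar `iterparse`), and it assumes the usual BLAST XML tag names (`Iteration`, `Iteration_query-def`, `Hit`, `Hit_def`, `Hsp_bit-score`). The function name `top_hits` is made up.

```python
import xml.etree.ElementTree as ET


def top_hits(handle, n=5):
    """Yield (query_def, [(hit_def, bitscore), ...]) with the n best hits per query.

    A hit is scored by the highest bit score among its HSPs.
    """
    for _, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag != "Iteration":
            continue  # wait until one whole query block has been read
        scored = []
        for hit in elem.iter("Hit"):
            bits = max(
                (float(b.text) for b in hit.iter("Hsp_bit-score")),
                default=0.0,
            )
            scored.append((hit.findtext("Hit_def"), bits))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        yield elem.findtext("Iteration_query-def"), scored[:n]
        elem.clear()  # drop this query's subtree so memory use stays flat
```

Only one `Iteration` is held in memory at a time, which is the point of the event-based approach for files too large to load whole.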

12.9 years ago

To extract a subpart of a large XML file, the idea is to use an XML pull parser. Read and echo each node of your XML until you find a <Hit> element. From there, keep parsing, but only echo the blocks of XML elements you need.

In Java, a pull parser specific to a given DTD can be generated with xjc.

For an example, see A: How To Retrieve Human Proteins Sequence Containing A Given Domain .

If your document can fit in memory, you can use XSLT:

See Standalone Blast Options or How To Filter A Blast Xml Output?
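Since the question asked for Python, here is a hedged sketch of the same read-and-echo idea using the standard library's `iterparse` instead of a Java pull parser. It assumes the usual BLAST XML layout (`BlastOutput` > `BlastOutput_iterations` > `Iteration` > `Iteration_hits` > `Hit`) and ranks hits by their best `Hsp_bit-score`. The names `trim_blast_xml` and `best_bitscore` are invented for this example.

```python
import xml.etree.ElementTree as ET

WRAPPERS = ("BlastOutput", "BlastOutput_iterations")


def best_bitscore(hit):
    """Highest Hsp_bit-score among the HSPs of one <Hit> element."""
    return max((float(s.text) for s in hit.iter("Hsp_bit-score")), default=0.0)


def trim_blast_xml(src, dst, max_hits=25):
    """Stream BLAST XML from src to dst, keeping max_hits <Hit> per <Iteration>."""
    depth = 0
    dst.write('<?xml version="1.0"?>\n')
    for event, elem in ET.iterparse(src, events=("start", "end")):
        if event == "start":
            depth += 1
            if elem.tag in WRAPPERS:
                dst.write(f"<{elem.tag}>\n")  # echo the opening wrapper tag
            continue
        depth -= 1
        if elem.tag in WRAPPERS:
            dst.write(f"</{elem.tag}>\n")
        elif elem.tag == "Iteration":
            hits = elem.find("Iteration_hits")
            if hits is not None:
                ranked = sorted(hits, key=best_bitscore, reverse=True)
                for hit in ranked[max_hits:]:
                    hits.remove(hit)  # kept hits stay in their original order
            dst.write(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # free this query's subtree
        elif depth == 1:
            # Header blocks directly under <BlastOutput> are echoed unchanged.
            dst.write(ET.tostring(elem, encoding="unicode"))
            elem.clear()
```

Typical use would be `trim_blast_xml(open("big.xml", "rb"), open("small.xml", "w"))` with placeholder file names. Note that the DOCTYPE line of the original file is not copied and `Hit_num` values are not renumbered, so check that whatever reads the result is happy with that.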


Ok thanks - this sounds like what I need to do.

12.9 years ago
Michael Barton ★ 1.9k

You could compress the file using something like parallel bzip, then read from the compressed file as an IO pipe. Obviously this only reduces the physical size of the file, not its contents. I imagine you should also expect a longer running time.
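To illustrate reading the compressed file as a stream, here is a minimal sketch using Python's standard `bz2` module together with `iterparse`. It just counts the queries (`Iteration` elements) without ever writing a decompressed copy to disk. The function name `count_queries` is made up, and `source` may be a file name or an open binary file object.

```python
import bz2
import xml.etree.ElementTree as ET


def count_queries(source):
    """Count <Iteration> elements in bzip2-compressed BLAST XML."""
    total = 0
    with bz2.open(source, "rb") as handle:  # decompresses on the fly
        for _, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == "Iteration":
                total += 1
                elem.clear()  # discard the parsed query block
    return total
```

The same handle could be passed to any other streaming parser in place of the counting loop.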

12.2 years ago
Karthik ▴ 20

A solution for the future would perhaps be to batch the queries into multiple files, giving you several smaller XML outputs that you could easily process with Biopython, perhaps even in parallel if you use a cluster.
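One possible way to do the batching, sketched with plain Python so it has no dependencies: split the query FASTA into chunks of a fixed number of records and write each chunk to its own file. The function names and the `chunk_000.fasta` naming scheme are just assumptions for the example.

```python
def split_fasta(handle, per_chunk):
    """Yield lists of FASTA records (as text), per_chunk records at a time."""
    chunk, record = [], []
    for line in handle:
        if line.startswith(">") and record:
            # A new header closes the record collected so far.
            chunk.append("".join(record))
            record = []
            if len(chunk) == per_chunk:
                yield chunk
                chunk = []
        record.append(line)
    if record:
        chunk.append("".join(record))
    if chunk:
        yield chunk


def write_chunks(fasta_path, per_chunk, prefix="chunk"):
    """Write each chunk to prefix_000.fasta, prefix_001.fasta, ..."""
    with open(fasta_path) as handle:
        for i, chunk in enumerate(split_fasta(handle, per_chunk)):
            with open(f"{prefix}_{i:03d}.fasta", "w") as out:
                out.writelines(chunk)
```

Each resulting file can then be searched separately, producing one smaller XML file per chunk.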

12.2 years ago
Lee Katz ★ 3.1k

Someone else had a similar question at about the same time as you. This looks like a good answer: A: Can Biopython Parse Gzipped Xml From Blast?

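import gzip
from Bio.Blast import NCBIXML

# Read the gzipped XML directly; records are parsed one at a time.
with gzip.open('output.xml.gz', 'rb') as blast_file:
    for blast_record in NCBIXML.parse(blast_file):
        print(blast_record.query)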

The file size isn't so much the issue; it's the time it takes to iterate through the files that's annoying.

The whole reason I'm trying to do this is to get the full description of the hit. My solution for now has been to parse the default human-readable BLAST output directly into tab-delimited format.

These are much smaller files and parse quickly.

Thanks for the thoughts though.
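For anyone going the tab-delimited route, here is a rough sketch of keeping the top hits per query from such a file. It assumes the standard 12-column BLAST+ tabular layout (query ID in column 1, bit score in column 12) and that all rows for a query are contiguous; a home-made tab-delimited format like the one described above may order its columns differently. The function name is invented.

```python
import csv
import heapq
from itertools import groupby


def top_hits_tabular(handle, n=25):
    """Yield (query_id, rows) with the n highest-bitscore rows per query."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip blank lines and '#' comment lines.
    rows = (r for r in reader if r and not r[0].startswith("#"))
    for query, group in groupby(rows, key=lambda r: r[0]):
        # Column 12 (index 11) holds the bit score in the assumed layout.
        yield query, heapq.nlargest(n, group, key=lambda r: float(r[11]))
```

Because rows are consumed one query at a time, memory use depends on the number of hits per query rather than on the size of the file.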


Have you had any problems parsing the tabbed output? I am more 'comfortable' with XML, since I am confident that it will be correctly and reliably parsed (XML is for parsing anyways, right?).


I've been parsing the human readable blast output just fine without any problems - although it supposedly can change whenever NCBI feels like it.

As for the tabbed output - never had a problem with that.

I'd prefer XML too but the files are sssoooo big
