BLAST+ Query Splitting/Chopping in BLAST XML Format
2
13.6 years ago
Lythimus ▴ 210

What's the easiest way to split a BLAST+ query into pieces, run blastn on all of the chunks against NT, and merge the results back together? I presume using -query_loc is better than literally splitting the sequence file. Afterwards, should I just write a script to strip out the headers, parse all of the outputs as XML files, and export only what I need (probably needlessly memory-intensive), or is there a tool that joins BLAST+ results automatically, or that automates the entire process?

It just seems like this should exist given how much memory is required by BLAST+.
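For the splitting side, a minimal sketch in plain Python (no Biopython dependency) might look like the following. The file names, chunk size, and blastn flags here are illustrative assumptions, not a fixed recipe:

```python
from itertools import islice

def read_fasta(path):
    """Yield (header, sequence) records from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def split_fasta(path, records_per_chunk=100):
    """Write query.0.fasta, query.1.fasta, ... and return the chunk file names."""
    records = read_fasta(path)
    chunks = []
    for i, batch in enumerate(
            iter(lambda: list(islice(records, records_per_chunk)), [])):
        name = f"query.{i}.fasta"  # illustrative naming scheme
        with open(name, "w") as out:
            for header, seq in batch:
                out.write(f"{header}\n{seq}\n")
        chunks.append(name)
    return chunks

def blastn_commands(chunks, db="nt", evalue="0.0001"):
    """Build one blastn command line per chunk; run them with
    subprocess or a job scheduler."""
    return [["blastn", "-query", c, "-db", db, "-evalue", evalue,
             "-outfmt", "5", "-out", c + ".xml"]
            for c in chunks]
```

Each command writes XML (-outfmt 5), so the per-chunk results can later be merged back into a single document.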

blast • xml • parsing • 3.8k views
1

Have you verified that BLAST is indeed taking (excessively) more memory when processing a larger subject file? I thought that the queries were handled sequentially and thus there is just a little more overhead.

0

What do you mean by "joining the chunks"? Do you want to merge the HSPs?

0

Yes, I want to join the HSPs, basically anything resulting from the alignment. Essentially, I believe I need to pull out everything between and including the [?] tags in the XML format, keeping the headers of one file to preserve the structure.

And I have verified that BLAST+ does indeed consume more swap the longer it runs; it runs fine on smaller, but still large, data sets. In case you're wondering, I am working from a precompiled 64-bit Linux version of BLAST+. I am using soft-filtering options, though I have also attempted to run without them, and an E-value of 0.0001.

0

BLAST+ doesn't require lots of memory in itself: it simply requires roughly as much RAM as the size of the database you are searching against. Splitting your input will not change this.

2
13.6 years ago

The results of aligning a query sequence to a genome will likely differ from the results obtained by splitting the query into shorter sequences, aligning those, and then merging the matches in the same way that the query was originally split.

You should try a different solution if the query size is indeed the problem.

0

Thanks. I was given the impression that it would not affect the results other than possibly returning duplicates. I guess I need to read up on the technology more.

1
10.7 years ago
xapple ▴ 230

If you have ten queries to BLAST against a single database such as NT, you can run five on one computer and five on another and then join the outputs. This will produce the same result as running all ten on one computer. You run into a slight difficulty if you have chosen XML output instead of tabular output, since a simple cat command will no longer suffice for joining the outputs. That's the input-chopping strategy, and it is embarrassingly parallel.

There is another strategy where you chop up the database. This can be desirable when the database is larger than the RAM of the nodes, since the database is loaded into memory while the queries run serially. This is not embarrassingly parallel and requires message passing among the computers running the job. It has been implemented in a rather user-unfriendly package named mpiBLAST.
