Question

Have You Ever Tried Megablast Indexing?

12.3 years ago

Manu Prestat 4.1k

Hi, I am surprised to see so few posts about megablast indexing... Is this because it does not work? If I believe this one, indexing should really help to get results faster. But after some trials, I cannot observe any such improvement. One potential problem is that the makembindex command creates one file fewer than its output announces:

creating GG.00.idx
creating GG.01.idx

But only GG.00.idx appeared on the file system. (I tried on 2 computers with different processors, with BLAST+ 2.2.25 compiled independently on each machine.)
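A quick way to confirm which announced volumes are actually missing is sketched below. It assumes the indexing output was captured to a log file (the name makembindex.log is hypothetical) and that it contains lines of the form "creating <name>.idx", as quoted above. The function name is made up for illustration.

```shell
# Sketch: print every index volume that the log announces but that is not on disk.
# Paths are resolved relative to the current directory, so run it where the
# index was written.
missing_index_volumes() {
    log_file=$1
    # Pull the file name out of each "creating <name>.idx" line.
    sed -n 's/^creating[[:space:]]*\([^[:space:]]*\.idx\).*$/\1/p' "$log_file" |
    while IFS= read -r name; do
        # Report the name only when no such file exists.
        [ -e "$name" ] || printf '%s\n' "$name"
    done
}

# Usage (capturing both streams, since it is an assumption which one is used):
#   makembindex -input GG -output GG -iformat blastdb > makembindex.log 2>&1
#   missing_index_volumes makembindex.log
```

An empty result would mean every announced volume exists; in the case described here it would be expected to list GG.01.idx.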

First, I tried to megablast a file against Greengenes. It took the same time to run, and the only difference was that the indexed megablast used 6 to 7 times more RAM than the non-indexed run. Despite the potentially missing index file, the blast result was exactly the same (checked with the UNIX diff command). I assumed that indexing improves the speed only for bigger DBs.

So I tried against a huge db, i.e. GenBank nt:

############ indexing db
makembindex -input nt -output nt -iformat blastdb

########################## megablast
### index
time blastn -task megablast -use_index true -db nt -query E1.454.fasta.1 -out megaBIGWithIndexNT.blast -evalue 1e-05 -num_descriptions 1 -num_alignments 1 -outfmt 6 > megaBIGWithIndexNT.out&

### without
time blastn -task megablast -use_index false -db nt -query E1.454.fasta.1 -out megaBIGNoIndexNT.blast -evalue 1e-05 -num_descriptions 1 -num_alignments 1 -outfmt 6 > megaBIGNoIndexNT.out&

The results are very bad:
- there are fewer results with the index
- it took 1 day without the index, and 3 days with it...
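To see which queries lost their hits, the two tabular outputs can be compared by query ID. This is a minimal sketch: it relies only on the fact that the first column of -outfmt 6 is the query ID, and the function name is invented for illustration.

```shell
# Sketch: print query IDs that have at least one hit in the first tabular
# file ($1) but none in the second ($2), each ID once, in order of appearance.
queries_only_in_first() {
    # awk reads $2 first and records its query IDs, then scans $1.
    # FILENAME == ARGV[1] is used instead of NR == FNR so that an empty
    # second file is handled correctly.
    awk -F'\t' '
        FILENAME == ARGV[1] { seen[$1] = 1; next }
        !($1 in seen) && !printed[$1]++ { print $1 }
    ' "$2" "$1"
}

# Usage with the output files from the commands above:
#   queries_only_in_first megaBIGNoIndexNT.blast megaBIGWithIndexNT.blast
```

Re-running a handful of the reported queries with and without -use_index would show whether the missing hits are reproducible or an artifact of the incomplete index.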

What do you think about that?

blast • 4.5k views

updated 22 months ago by Ram 43k • written 12.3 years ago by Manu Prestat 4.1k

score 2 · Answer 1 · 2011-12-20

An excerpt from the Megablast indexing paper's conclusion section:

We presented a new implementation of the seed search phase of MegaBLAST (Zhang et al., 2000) in which seeds are found by searching an index structure of k-mers derived from preprocessing the database. We showed that this ‘indexed MegaBLAST’ is faster than the ‘baseline MegaBLAST’, which preprocesses the query, in most cases and especially for masked databases. When indexed MegaBLAST is slower because there are too many seeds, performance degradation is limited enough that the code can be used in production.

The paper compared indexed and non-indexed searches on human genomes. Perhaps indexed MegaBLAST works best on non-redundant sequences (genomic sequences), where there are fewer seeds. For GenBank nt there would be a large number of seeds, using a lot more RAM and maybe slowing down the process?

Ram · Answer 2 · 2015-03-25

Upon reading the Megablast paper in 2008, I thought an index might reduce my search time by 50%: Figure 2 showed that the index was most helpful with short sequences, and shotgun reads range from log 1 to log 2.2 (10-150), so I thought this would work well for a sequencing pipeline.

However, I have only gotten performance on par with megablast without an index.

NCBI's help desk told me that the benefit is possible with careful tweaking of the -nmer, -word_size and -evalue parameters depending on the circumstance.

So yeah, any benefit from this does seem to be rare.

A possible reason for dramatically worse performance with an index is lack of sufficient RAM to load the index into memory.
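A rough check for this is sketched below. It assumes a Linux machine (it reads MemAvailable from /proc/meminfo) and uses the on-disk size of the .idx volumes as a stand-in for the memory the index needs, which is only an approximation. The function names are made up for illustration.

```shell
# Total on-disk size, in KiB, of the index volumes passed as arguments.
index_size_kib() {
    du -k -c "$@" | tail -n 1 | cut -f1
}

# Memory the kernel estimates is available without swapping, in KiB (Linux only).
mem_available_kib() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# Succeeds when the required size ($1) does not exceed the available memory ($2).
fits_in_ram() {
    [ "$1" -le "$2" ]
}

# Usage, with nt.*.idx standing in for whatever volumes makembindex produced:
#   if fits_in_ram "$(index_size_kib nt.*.idx)" "$(mem_available_kib)"; then
#       echo "index should fit in memory"
#   else
#       echo "index is larger than available memory; expect paging"
#   fi
```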

In my recent tests with sufficient memory, the indexed search is marginally slower (maybe 5%). However, it should be noted that I am generating a high number of homologous hits, which I believe is one of the bad circumstances mentioned in the paper, where indexing does not perform well.

For posterity's sake, anyone who finds this post via search engine may find this information helpful:

Kraken is a more recent piece of software that uses a lookup table similar to the megablast index, but delivers a large speed benefit.

Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).