Question

How Fundamental Is The 4 Gb Size Limit For Building Burrows-Wheeler Indices

11.6 years ago

Owen S. ▴ 370

This is a question about short read B-W aligners in general, using bowtie2 and bwa as examples.

Both bowtie2 and bwa (and novoalign, AFAIK) use 32-bit pointers internally. How much of a headache would it be to modify the code to use 64-bit pointers, and why has this never been done? Is it just because most users align against a single genome and so never need large-index support? Or is the choice of 32-bit versus 64-bit pointers so fundamental that changing it would essentially require a full rewrite of the code? I would think there would be considerable interest in this, as more people are doing metagenomic sampling and our databases of known genomes are getting quite large. Dividing the reference sequences up is a workaround, but not a satisfying one.

From the bowtie2 manual: "Because bowtie2-build uses 32-bit pointers internally, it can handle up to a theoretical maximum of 2^32-1 (somewhat more than 4 billion) characters in an index, though, with other constraints, the actual ceiling is somewhat less than that. If your reference exceeds 2^32-1 characters, bowtie2-build will print an error message and abort. To resolve this, divide your reference sequences into smaller batches and/or chunks and build a separate index for each."
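To make the arithmetic in the manual excerpt concrete, here is a minimal Python sketch that totals the sequence characters in FASTA text and compares the total with 2^32-1. The function names (`count_reference_chars`, `fits_32bit_index`) are hypothetical and not part of bowtie2, and the check covers only the theoretical ceiling; as the manual notes, the practical ceiling is somewhat lower.

```python
import io

MAX_32BIT_CHARS = 2**32 - 1  # theoretical ceiling quoted in the bowtie2 manual


def count_reference_chars(handle):
    """Count sequence characters in a FASTA stream, skipping header lines."""
    total = 0
    for line in handle:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def fits_32bit_index(n_chars):
    """Return True if n_chars is within the theoretical 32-bit limit."""
    return n_chars <= MAX_32BIT_CHARS


# Tiny in-memory FASTA used only for illustration.
demo = io.StringIO(">seq1\nACGT\nAC\n>seq2\nGGG\n")
n = count_reference_chars(demo)
print(n, fits_32bit_index(n))
```

For a real reference, the same function could be given an open file handle instead of the in-memory demo.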

Here are the error messages generated by these two programs (included mainly for the keywords):

$ bowtie2-build five_gigabyte.fa this_index_will_fail
Reading reference sizes
Error: Reference sequence has more than 2^32-1 characters!  Please divide the
reference into batches or chunks of about 3.6 billion characters or less each
and index each independently.
Error: Encountered internal Bowtie 2 exception (#1)

$ bwa index -a bwtsw five_gigabyte.fa
[bwa_index] Pack FASTA... 67.67 sec
[bwa_index] Reverse the packed sequence... 18.31 sec
[bwa_index] Construct BWT for the packed sequence...
TextLengthFromBytePacked(): text length > 2^32!
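The workaround suggested by the bowtie2 error message, splitting the reference into batches of about 3.6 billion characters or less, can be sketched as a greedy grouping of sequences by length. This is only an illustration: `batch_by_length` is a hypothetical helper, the genome sizes are invented, and a single sequence longer than the limit simply ends up alone in its own batch and would still need to be chunked separately.

```python
def batch_by_length(lengths, limit=3_600_000_000):
    """Greedily group (name, length) pairs so each batch totals <= limit.

    A single sequence longer than the limit is placed alone in its own
    batch; splitting such a sequence is not handled here.
    """
    batches, current, current_total = [], [], 0
    for name, length in lengths:
        # Start a new batch if adding this sequence would exceed the limit.
        if current and current_total + length > limit:
            batches.append(current)
            current, current_total = [], 0
        current.append(name)
        current_total += length
    if current:
        batches.append(current)
    return batches


# Hypothetical genome sizes, in characters.
genomes = [
    ("a", 2_000_000_000),
    ("b", 1_500_000_000),
    ("c", 1_000_000_000),
    ("d", 3_000_000_000),
]
print(batch_by_length(genomes))
```

Each resulting batch would then be indexed independently, which is exactly the part the question calls unsatisfying: the reads have to be aligned against every index and the results merged afterwards.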

EDIT FEBRUARY 2, 2014: Note that Bowtie2 version 2.2.0 has just been released, which no longer has the 4GB limit! Huge!

bowtie bowtie2 bwa • 8.2k views

ADD COMMENT • link 10.2 years ago by Owen S. ▴ 370

It really depends on how portably the code was written. I think the biggest difference is that the size of the long integer type differs between platforms. Any bit-shifting operations might have to be modified too.

ADD REPLY • link 11.6 years ago by Damian Kao 16k
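One way to see why the integer width matters: an offset stored in a 32-bit unsigned integer silently wraps once it passes 2^32-1. The following Python sketch uses the fixed-width types from `ctypes` to show this. It is a generic demonstration of fixed-width arithmetic, not code taken from bowtie2 or bwa, and the helper names are hypothetical.

```python
import ctypes


def as_uint32(value):
    """Store value in a 32-bit unsigned integer, as a 32-bit offset would be."""
    return ctypes.c_uint32(value).value


def as_uint64(value):
    """Store value in a 64-bit unsigned integer."""
    return ctypes.c_uint64(value).value


offset = 2**32 + 5  # a position just past the 4-billion-character mark

print(as_uint32(offset))  # the high bits are lost, so the offset wraps around
print(as_uint64(offset))  # the full value is preserved

# The width of C's `long` depends on the platform, which is one of the
# portability hazards when widening an existing code base.
print(ctypes.sizeof(ctypes.c_long))
```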

The bowtie2 advice to "divide your reference sequences into smaller batches" is fine, but then the aligner should handle alignment against multiple indices.

ADD REPLY • link 11.0 years ago by Federico Giorgi ▴ 730

Ram · Answer 1 · 2012-09-11

bwa versions 0.6 and above do support 64-bit indices. I assume your (failing) command above was run with an older version of bwa. From the bwa mailing list, Heng states:

"Since version 0.6, genomes longer than 4GB will be supported. At a cost, bwa will use about 30%-50% more memory, depending on commands. This is simply because we have to double the memory allocated to a 64-bit integer array in comparison to a 32-bit array. On the good side, the support of longer genomes enables optimizations that cannot be done with the old 32-bit index. The aln command of bwa-0.6 will be about 20% faster. The bwasw command will be twice as fast."
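The doubling mentioned in the quote follows directly from the element widths. As a rough sketch, assuming a hypothetical array of one billion entries (the real sizes of bwa's internal arrays are not given in this thread):

```python
import struct

BYTES_32 = struct.calcsize("<I")  # width of a 32-bit unsigned integer
BYTES_64 = struct.calcsize("<Q")  # width of a 64-bit unsigned integer


def int_array_bytes(n_entries, width):
    """Memory needed for n_entries integers of the given byte width."""
    return n_entries * width


n = 1_000_000_000  # hypothetical number of entries, for illustration only
small = int_array_bytes(n, BYTES_32)
large = int_array_bytes(n, BYTES_64)
print(small, large, large / small)
```

Only the integer arrays are modelled here. The quote puts the overall increase at about 30%-50%, which suggests that other parts of the index do not grow in the same way.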

His comments also provide insight into the reasons for and against this change (memory usage versus speed, plus the coding effort needed to overcome 32-bit inertia).

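However, I believe BAM restricts each reference sequence (chromosome) to at most 2^31-1 bases (about 2.1 Gbp), since the position is a signed 32-bit integer in the spec. This shouldn't be a huge problem in most cases, though, including your metagenomic case, because the limit applies to each sequence rather than to the index as a whole.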
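Assuming the position field is a signed 32-bit integer, as in the SAM/BAM specification, the per-sequence check is simple. The sketch below uses a hypothetical helper and invented sequence lengths to flag any reference that would not fit:

```python
BAM_MAX_POS = 2**31 - 1  # largest value of a signed 32-bit integer


def oversized_references(ref_lengths, max_len=BAM_MAX_POS):
    """Return names of reference sequences too long for a signed 32-bit position."""
    return [name for name, length in ref_lengths.items() if length > max_len]


# Invented lengths: a typical bacterial genome and an implausibly large sequence.
refs = {
    "bacterial_genome": 5_000_000,
    "hypothetical_giant_chr": 3_000_000_000,
}
print(oversized_references(refs))
```

In a metagenomic database the individual genomes are small, so the list would normally be empty even when the combined index is far larger than 4 billion characters.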

For those who try the newer bwa versions, I should add that the 0.6.x and 0.5.x indices are incompatible and will overwrite each other, because the index file names are the same by default. To avoid that, use bwa 0.6.2 and pass the -6 flag to bwa index. This gives the 64-bit index a different name and allows the two indices to coexist.