This is a question about short-read Burrows-Wheeler (BWT) aligners in general, using bowtie2 and bwa as examples.
Both bowtie2 and bwa (and novoalign AFAIK) use 32-bit pointers internally. My question is, how much of a headache would it be to modify the code to use 64-bit pointers? Why has this never been done? Is it just because most users are doing single genome alignments (thus obviating the need for large index support)? Or is the choice of 32-bit vs 64-bit pointers so fundamental that it would essentially require a full re-write of the code? I would think there would be pretty considerable interest in this, as more people are doing metagenomic sampling and our databases of known genomes are getting quite large. Dividing the reference sequences up is a work-around, but not a satisfying one.
From the bowtie2 manual: "Because bowtie2-build uses 32-bit pointers internally, it can handle up to a theoretical maximum of 2^32-1 (somewhat more than 4 billion) characters in an index, though, with other constraints, the actual ceiling is somewhat less than that. If your reference exceeds 2^32-1 characters, bowtie2-build will print an error message and abort. To resolve this, divide your reference sequences into smaller batches and/or chunks and build a separate index for each."
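To make the limit in that quote concrete, here is a minimal C sketch of why a 32-bit offset cannot address a reference of 5 billion characters. The function names are hypothetical and are not taken from the bowtie2 or bwa sources; the sketch only illustrates the arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from bowtie2 or bwa): true when every position
 * in a reference of n_chars characters fits in a 32-bit offset. */
bool fits_in_32bit_index(uint64_t n_chars)
{
    return n_chars <= UINT32_MAX; /* UINT32_MAX is 2^32 - 1 */
}

/* Hypothetical helper: the value that ends up stored if a 64-bit length is
 * squeezed into a 32-bit variable. Unsigned conversion wraps modulo 2^32,
 * so large positions silently alias small ones. */
uint32_t truncate_to_32bit(uint64_t n_chars)
{
    return (uint32_t)n_chars;
}
```

With these definitions, fits_in_32bit_index(5000000000) is false, and truncate_to_32bit(5000000000) gives 705032704, i.e. 5,000,000,000 minus 2^32. That silent wrap-around is presumably why both tools abort up front instead of building a corrupt index.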
And here are the error messages generated by these two programs (including this mainly for the key words):
$ bowtie2-build five_gigabyte.fa this_index_will_fail
Reading reference sizes
Error: Reference sequence has more than 2^32-1 characters! Please divide the
reference into batches or chunks of about 3.6 billion characters or less each
and index each independently.
Error: Encountered internal Bowtie 2 exception (#1)
$ bwa index -a bwtsw five_gigabyte.fa
[bwa_index] Pack FASTA... 67.67 sec
[bwa_index] Reverse the packed sequence... 18.31 sec
[bwa_index] Construct BWT for the packed sequence...
TextLengthFromBytePacked(): text length > 2^32!
EDIT FEBRUARY 2, 2014: Note that Bowtie2 version 2.2.0 has just been released, which no longer has the limit of roughly 4 billion characters per index! Huge!
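I guess it really depends on how portably the code was written. I think the biggest difference is that the size of the long type varies between platforms: it is 64 bits on 64-bit Linux and macOS but still 32 bits on 64-bit Windows, so offsets stored in int or long would probably have to move to fixed-width 64-bit types. Any bit-shifting operations might have to be modified too.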
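As a rough illustration of the bit-shifting point, here is a small C sketch. It is hypothetical and not code from bowtie2 or bwa; it assumes a 2-bit-per-base packing with the first base in the lowest two bits of each byte, which is only one of several possible layouts.

```c
#include <stdint.h>

/* Mask with the low n_bits bits set, for n_bits from 0 to 64.
 * Writing (1 << n_bits) would shift a plain int, which is undefined once
 * n_bits reaches the width of int (32 on most platforms). Starting from a
 * 64-bit constant avoids that. */
uint64_t low_bits_mask(unsigned n_bits)
{
    if (n_bits >= 64) {
        return UINT64_MAX; /* shifting a 64-bit value by 64 is also undefined */
    }
    return (UINT64_C(1) << n_bits) - 1;
}

/* Fetch base i (a value from 0 to 3) out of a 2-bit-packed sequence.
 * The position is a uint64_t, so references beyond 2^32 bases remain
 * addressable; with a 32-bit i the byte offset i >> 2 would wrap. */
unsigned get_base(const uint8_t *packed, uint64_t i)
{
    unsigned shift = (unsigned)(i & 3) * 2; /* 0, 2, 4 or 6 */
    return (packed[i >> 2] >> shift) & 3u;
}
```

For example, low_bits_mask(33) yields 0x1FFFFFFFF, which cannot even be represented in a 32-bit variable. In a real port, every shift, mask and loop counter that touches a text position would need this kind of audit, which is likely where most of the effort goes.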
OK for the bowtie2 advice to "divide your reference sequences into smaller batches", but then they should also handle aligning against multiple indices.