Burrow-Wheeler Alignment
1
12
Entering edit mode
11.6 years ago
Bioscientist ★ 1.7k

Can anyone, in a simple, explicit way, point out the trick which improve aligning speed/efficiency by burrow-wheeler transform (BWA/Bowtie compared to MAQ)?

I read the paper about MAQ,BWA,Bowtie, and also review, but only become more confused. What's the trick here?

thx

bwa alignment bowtie • 8.7k views
29
Entering edit mode
11.6 years ago
lh3 33k

Let's consider exact matching and inexact matching separately.

For the exact matching problem, it is true that BWT-based algorithms are faster. Suppose we have a read of length "l" (letter ell, not one) and a genome of length "L". If we use a naive k-mer hash table, each k-mer appears L/4^k times. Thus the time complexity is O(l*L/4^k). For human, L=3e9 and typically k=13. L/4^k is about ~50, which is not a small number already. In practice, a hash-table based method does even worse because the human genome is much more repetitive than a random string. A lot of k-mer have thousands of occurrences. Visiting each of them is quite slow. In contrast, with BWT (more exactly FM-Index), we can always tell if the read has a perfect match in O(l) time, no matter how long or how repetitive the genome is. Essentially, FM-index collapses all the copies of a substring. We simultaneously align a sequence to all copies, rather than align to each copy like what we do with a hash table. This is why BWT-based algorithms are faster.

On the other hand, I do not think BWT-based algorithms are faster for inexact matching problem. For inexact matching, we may still need to visit multiple copies similar to the read sequence. Although with BWT we still do better than with a hash table in theory, as each BWT operation is more expensive, the advantage of BWT is reduced. Moreover, for longer short reads, we can start to use multiple seeds, which may greatly boost the performance of hash table based algorithms. In fact, for 100bp reads, several hash table based mappers compete BWT-based ones on speed.

Most of us get the impression that BWT-based algorithms are faster because for 30bp reads they were by far faster than the first-generation hash-based short-read mappers such as SOAP1 and MAQ. This is not necessarily true nowadays. In addition, we may misbelieve BWT-based ones are faster because they are popular, but this may be more related to the fact they are relatively well developed (e.g. user friendliness, memory and even reputation), instead of speed. Another fact we frequently overlook is that the most accurate short-read mappers are still based on hash tables. For BWT-based methods, differentiating subtle hits dramatically increases the computing time. It is very difficult to get an extreme accuracy while maintaining a reasonable speed.

0
Entering edit mode

very informative response.