How do BWA and Bowtie2 decide where to place paired reads and what flags affect this?
1
2
Entering edit mode
3.3 years ago

Both BWA and Bowtie2 are used to map reads to genomes pretty extensively with 2x150 bp or 2x250 bp Illumina data with insert sizes in the range of 400-1000 bp these days.

The vast majority of datasets are also paired end 2x150 bp datasets, at least in microbiology. Yet, I have found that the default parameters for these programs are not necessarily appropriate for the standard Illumina data coming off of (almost everyone's?) HiSeqs and NovaSeqs for the past few years.

For example, the default -X (capital X) parameter for bowtie2 is expected insert size and is 500 bp. However, for most modern Illumina datasets the insert sizes can be substantially larger than this a large fraction of the time. Running with the default -X 500 will therefore flag read pairs outside of this insert size as improper (which will be completely ignored by samtools mpileup by default!), and presumably be less likely to accurately assign their location.

This led me down trials and tribulations of trying to figure out exactly how both bowtie2 and bwa decide where to place two paired reads, and ending up concluding that the information largely doesn't exist on the internet in a form understandable by all of the thousands of biologists who run these programs almost exclusively on paired read data.

So, can anyone explain how both programs work at assigning the location of a pair of reads? Some questions that would need to be answered by a proper explanation:

1. If read 1 maps equally well to N positions, is it randomly assigned a position?
2. If the aligner then finds that read 2 maps perfectly well to a position nearby only one of those N positions, will it go back and adjust the position of read 1 to the position that both reads are best aligned to?
3. If read 1 maps perfectly to a position where-as read 2 maps equally well to two positions, will it be placed at the position closer to read 1? How do parameters like the -X parameter affect this?
4. Will the aligner ever choose a (valid) alignment with suboptimal identity for read 1 in order to make it closer to read 2 (and vice versa?)
5. Do the mapq scores of reads in a pair influence each other?
0
Entering edit mode

Sometimes these sort of questions is best answered by preparing some test cases and see what happens. For example, write a dummy genome of 1000bp with some repeats, write a pair of fastq files with a handful of reads mapping to or near the repeats, index, align and see the results. (In theory, a conclusive answer would come from inspecting the source code, but good luck with it!)

1
Entering edit mode
3.3 years ago
Carambakaracho ★ 3.0k

Hi there,

I might not be the best algorithmic expert on this, but most of your questions are fundamental problems that, for example Heng Li (author of bwa, minimap, etc...) extensively thought about.

Note that IMHO you need to understand at least how to use a complex program in order to use it croorrectly, so I don't see a problem in bowtie2's default fragment sizes. Note also, bwa mem estimates the fragment size automatically

1. If read 1 maps equally well to N positions, is it randomly assigned a position?

This is the Multimapping problem, see for example here

1. If the aligner then finds that read 2 maps perfectly well to a position nearby only one of those N positions, will it go back and adjust the position of read 1 to the position that both reads are best aligned to?

In case you choose paired, the algorithms will prefer concordant alignments, depending on your settings.

1. If read 1 maps perfectly to a position where-as read 2 maps equally well to two positions, will it be placed at the position closer to read 1? How do parameters like the -X parameter affect this?

As above, the algorithms will prefer concordant alignments, depending on your settings. Wrong or too stringent -X will penalize otherwise correct alignments. bwa mem estimates insert size automatically, see above

1. Will the aligner ever choose a (valid) alignment with suboptimal identity for read 1 in order to make it closer to read 2 (and vice versa?)

If it's the best alignment for read 1, sure. In case not, certainly not.Your reference might not be identical to your sequenced organism.

5.Do the mapq scores of reads in a pair influence each other?

No, afaik, computation of MAPQ based on the read only and is not even consistent between aligners.

0
Entering edit mode

Thank you for your reply - very enlightening. But do you have some links that specify exactly how bowtie2 and bwa handle the multimapping problem? Is it random (equal) distribution at equally likely sites, or something more complicated?

1
Entering edit mode

You can read here in bowtie2's manual how it handles it. Note how the manual states it won't search eternally for another alignment, so the first location might be favored. See this thread for what that means in practice.

For bwa mem the read assignment is supposedly random, and I've not read othewise. More details by Istvan in this thread