bwa mem and multiply mapped reads
2
2
Entering edit mode
4.3 years ago
lfreeborn ▴ 20

I would like to know how BWA mem handles repetitive hits.

I know that bwa aln/samse- given a read that maps to multiple regions- will randomly select one region. Each query will be listed once in the sam file, and alternative mapping locations for that read will be listed in XA if the number of hits is less than INT hits.

But what about BWA mem? I keep reading that BWA mem differs from bwa aln in that it will split queries, but what about reads that aren't split but still map to multiple locations? Does mem also chose a random location? Are multiply mapped reads reported in the same fashion in bwa mem as they are for bwa aln/samse (that is, with one listing per query and XAs reported if less than INT)?

For simplicity's sake, I am not interested in using other aligners. I have pre-trimmed RAD data (80-139bp) from a frog species with a highly repetitive genome.

Thank you.

bwa repetitive genome alignment • 11k views
13
Entering edit mode
4.3 years ago

In general, short read aligners take some liberties in the way they report the alignments - as it turns out bwa is no different.

Alas in most cases there is surprisingly little information in the docs on the specifics on what gets reported in which way. The following is my best understanding of how bwa mem chooses to report alignments:

1. When a read matches in its entirety, with an equal score in multiple locations, one of the locations is picked at random, is labeled as primary, will be given a mapping quality of zero and will have an XA tag that contains the alternative locations (this is identical to how bwa aln worked)

2. When different, non-overlapping regions of a read align with high scores to different, non-linear locations in the genome, the higher score alignment will be labeled as primary, the others may be reported as secondary alignments. There is some threshold on how many of these secondary alignments will be reported (bwa aln did not produce secondary alignments)

3. When complementary regions of a read (the pieces add up to the full read) align to different, non-linear genomic locations, with no little to no overlap, one of the alignments will be labeled as primary, the others as supplementary alignments (bwa aln did not produce supplementary alignments)

0
Entering edit mode

Do you know if bwa mem reports this for all possible matches for a read or only a default number?

0
Entering edit mode

There are some options to filter for hits, though behind the scenes may be other tacit and undisclosed assumptions as to how short of an alignment it can find, regardless of the score. My guess that the mem method itself won't work under ~ 20bp.

  -T INT        minimum score to output [30]
-h INT[,INT]  if there are <INT hits with score >80% of the max score, output all in XA [5,200]
-a            output all alignments for SE or unpaired PE

0
Entering edit mode

Thanks.

You made a good point above. Even with well written programs (like bwa) it is not always possible to know the entire truth (for an end-user). Much as I like bbmap I discovered the other day that it does not seem to write all alignments for multi-mapping reads to output file (again a decision that may have been made to keep the file size sane).

0
Entering edit mode

Thank you for the helpful reply! I was definitely surprised by how little information I could find about multiply mapped reads in the official docs.

1
Entering edit mode

The essential thing to remember with all short read aligners that they were never designed to be "optimal" from any point of view.

This is why these algorithms are so mindbogglingly fast. They use various shortcuts and sacrifice optimality to be able to achieve their primary use case - finding one, the best single hit if one such hit exists. That's why they can align tens of thousands of reads per second against the human genome.

If you want to find near-exact matches, with few alternatives, the algorithms work very well.

For any other use case, the results are approximations with a varying range of errors.

0
Entering edit mode

When using a tool like reformat.sh from BBMap, would the secondary alignments (i.e. when it maps to another spot in the genome) be included in the unmappedonly=t flag?

1
Entering edit mode

Not if you set primaryonly=t. Default is primaryonly=f.

0
Entering edit mode

Oh ok cool, so that means both the reads would be considered mapped. Awesome, I was trying to extract the unmapped reads to figure out what was going on with my assembly.

0
Entering edit mode

Does anyone know where:

When a read matches in its entirety, with an equal score in multiple locations, one of the locations is picked at random, is labeled as primary, will be given a mapping quality of zero and will have an XA tag that contains the alternative locations (this is identical to how bwa aln worked)

happens in the bwa source code? https://github.com/lh3/bwa/blob/9f26bfcc7780753129b60717ecab0ebba6f04b7c/bwamem.c#L521 was showing promise, but I can't figure out where exactly the random selection occurs.

0
Entering edit mode
4.3 years ago

My suggestion to complement the other answer(s) is to prepare some test data and see what bwa does. For example, create a dummy genome of few hundred bases where two or more regions are exactly duplicated. Then prepare a fastq file with one or more reads matching (part of) the duplicated region. Map and see what happens. You can play with the number of mismatches, base qualities, read length etc.

Of course, you don't get a definitive answer since you can't cover all cases, but at least you can tell for sure how bwa behaves with known cases. Also, this approach forces you to think what it is you want to test. (Essentially, this kind of approach is similar to test driven development).

0
Entering edit mode

Someone had apparently done this exercise. While we can't see the actual data files Heng Li's answer is visible in the thread.

0
Entering edit mode

One challenge with this approach is that one needs to understand what they are trying to capture - possibly before knowing that.

The information encoded in sequences (genomes) is very rich and it is difficult to capture the characteristics of each problem. A method that works well for uniform base distribution may perform inadequately if the genome has internal patterns, repeating motifs etc.