Question: bwa mem and multiply mapped reads
gravatar for lfreeborn
2.7 years ago by
lfreeborn20 wrote:

I would like to know how BWA mem handles repetitive hits.

I know that bwa aln/samse- given a read that maps to multiple regions- will randomly select one region. Each query will be listed once in the sam file, and alternative mapping locations for that read will be listed in XA if the number of hits is less than INT hits.

But what about BWA mem? I keep reading that BWA mem differs from bwa aln in that it will split queries, but what about reads that aren't split but still map to multiple locations? Does mem also chose a random location? Are multiply mapped reads reported in the same fashion in bwa mem as they are for bwa aln/samse (that is, with one listing per query and XAs reported if less than INT)?

For simplicity's sake, I am not interested in using other aligners. I have pre-trimmed RAD data (80-139bp) from a frog species with a highly repetitive genome.

Thank you.

ADD COMMENTlink modified 20 days ago by Biostar ♦♦ 20 • written 2.7 years ago by lfreeborn20
gravatar for Istvan Albert
2.7 years ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

In general, short read aligners take some liberties in the way they report the alignments - as it turns out bwa is no different.

Alas in most cases there is surprisingly little information in the docs on the specifics on what gets reported in which way. The following is my best understanding of how bwa mem chooses to report alignments:

  1. When a read matches in its entirety, with an equal score in multiple locations, one of the locations is picked at random, is labeled as primary, will be given a mapping quality of zero and will have an XA tag that contains the alternative locations (this is identical to how bwa aln worked)

  2. When different, non-overlapping regions of a read align with high scores to different, non-linear locations in the genome, the higher score alignment will be labeled as primary, the others may be reported as secondary alignments. There is some threshold on how many of these secondary alignments will be reported (bwa aln did not produce secondary alignments)

  3. When two complementary regions of a read (the two pieces add up to the full read) align to two different, non-linear genomic locations one of the alignment will be labeled as primary, the other as supplementary alignment (bwa aln did not produce supplementary alignments)

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Istvan Albert ♦♦ 85k

Do you know if bwa mem reports this for all possible matches for a read or only a default number?

ADD REPLYlink written 2.7 years ago by genomax92k

There are some options to filter for hits, though behind the scenes may be other tacit and undisclosed assumptions as to how short of an alignment it can find, regardless of the score. My guess that the mem method itself won't work under ~ 20bp.

  -T INT        minimum score to output [30]
  -h INT[,INT]  if there are <INT hits with score >80% of the max score, output all in XA [5,200]
  -a            output all alignments for SE or unpaired PE
ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Istvan Albert ♦♦ 85k


You made a good point above. Even with well written programs (like bwa) it is not always possible to know the entire truth (for an end-user). Much as I like bbmap I discovered the other day that it does not seem to write all alignments for multi-mapping reads to output file (again a decision that may have been made to keep the file size sane).

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by genomax92k

Thank you for the helpful reply! I was definitely surprised by how little information I could find about multiply mapped reads in the official docs.

ADD REPLYlink written 2.7 years ago by lfreeborn20

The essential thing to remember with all short read aligners that they were never designed to be "optimal" from any point of view.

This is why these algorithms are so mindbogglingly fast. They use various shortcuts and sacrifice optimality to be able to achieve their primary use case - finding one, the best single hit if one such hit exists. That's why they can align tens of thousands of reads per second against the human genome.

If you want to find near-exact matches, with few alternatives, the algorithms work very well.

For any other use case, the results are approximations with a varying range of errors.

ADD REPLYlink written 2.7 years ago by Istvan Albert ♦♦ 85k

When using a tool like from BBMap, would the secondary alignments (i.e. when it maps to another spot in the genome) be included in the unmappedonly=t flag?

ADD REPLYlink written 16 months ago by O.rka220

Not if you set primaryonly=t. Default is primaryonly=f.

ADD REPLYlink modified 16 months ago • written 16 months ago by genomax92k

Oh ok cool, so that means both the reads would be considered mapped. Awesome, I was trying to extract the unmapped reads to figure out what was going on with my assembly.

ADD REPLYlink written 16 months ago by O.rka220
gravatar for dariober
2.7 years ago by
WCIP | Glasgow | UK
dariober11k wrote:

My suggestion to complement the other answer(s) is to prepare some test data and see what bwa does. For example, create a dummy genome of few hundred bases where two or more regions are exactly duplicated. Then prepare a fastq file with one or more reads matching (part of) the duplicated region. Map and see what happens. You can play with the number of mismatches, base qualities, read length etc.

Of course, you don't get a definitive answer since you can't cover all cases, but at least you can tell for sure how bwa behaves with known cases. Also, this approach forces you to think what it is you want to test. (Essentially, this kind of approach is similar to test driven development).

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by dariober11k

Someone had apparently done this exercise. While we can't see the actual data files Heng Li's answer is visible in the thread.

ADD REPLYlink written 2.7 years ago by genomax92k

One challenge with this approach is that one needs to understand what they are trying to capture - possibly before knowing that.

The information encoded in sequences (genomes) is very rich and it is difficult to capture the characteristics of each problem. A method that works well for uniform base distribution may perform inadequately if the genome has internal patterns, repeating motifs etc.

ADD REPLYlink written 2.7 years ago by Istvan Albert ♦♦ 85k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1239 users visited in the last hour