Mapq Scores Inside Repeats
1
6
Entering edit mode
9.9 years ago

I would like to know what formulas are being used in the most common genome aligners to determine the MAPQ of reads that fall inside repeats in their entirety? By falling inside repeats in their entirety, I mean that the whole read length is inside the repeat. Also that both read pairs of a PE fall inside the repeat. My definition of repeats here, just as a simplification, is "stretches of DNA sequences in a genome that are identically found in multiple locations".

Is there a consensus on which scoring formula to use for reads falling on repetitive regions of the genome?

EDITED: As an example, does bowtie score the same as bwa, or mosaik? Does bwa aln score the same as bwa mem?

Highlighted below is the relevant code snippet for mem_sam_pe in bwamem_pair.c that makes things different between bwa aln and bwa mem.

alignment • 4.8k views
5
Entering edit mode
9.9 years ago

There's absolutely no consensus on how MAPQ scores should be calculated in pretty much any circumstance. For bowtie2, the MAPQ is determined by the best (AS:i:X auxiliary tag) and second best (XS:i:X auxiliary tag) alignment scores as well as whatever --score-min is set to. If you really want to know the exact algorithm, a C implementation starts at line 109. For BWA, the MAPQ is dependent on

1. The number of equally best mapping reads (p->c1, below)
2. The number of second best mapping reads (p->c2, below)
3. The number of observed (p->n_mm) and maximum allowed (mm) mismatches.

int bwa_approx_mapQ(const bwa_seq_t *p, int mm) {
int n;
if (p->c1 == 0) return 23;
if (p->c1 > 1) return 0;
if (p->n_mm == mm) return 25;
if (p->c2 == 0) return 37;
n = (p->c2 >= 255)? 255 : p->c2;
return (23 < g_log_n[n])? 0 : 23 - g_log_n[n];
}


where g_log_n[] is defined as

for (i = 1; i != 256; ++i) g_log_n[i] = (int)(4.343 * log(i) + 0.5);


If this reminds anyone of how MAQ works/worked, there's a reason for that and you can find the reasoning for this method in the supplemental methods to the original MAQ paper.

Those are just the two most popular aligners, others may behave completely differently. Many RNAseq aligners, for example, will just give MAPQ scores dependent on the number of equally valid alignments.

0
Entering edit mode

I believe bwamem has changed the formula around pair-endness with respect to bwa aln. Is that so? Has anybody looked at it extensively?

1
Entering edit mode

That could well be, I've yet to look.

Edit: It doesn't look completely unrelated. See the mem_sam_pe function in bwamem_pair.c.

Edit2: In fact, it's very similar and can be vaguely defined as q_pe = 6*(o-subo) - (4.343*log(n_sub+1)+0.499) with some limits on its range and further tweaks depending on what the single-end scores would be.

0
Entering edit mode

I think I identified the bit that may be different between bwa-mem and bwa-aln: is in deciding if paired alignment is preferred or unpaired alignment is preferred based on o > score_un. Highlighted in the screenshot in the question.

1
Entering edit mode

Yeah, it really does look like that's the main difference, thanks for updating the original question. Keep in mind that the actual reported MAPQ is given by q_se.

0
Entering edit mode

Thanks for the great answer. Any thoughts on how bwa-mem assigns the QMAP score?

0
Entering edit mode

Hi Devon, your explanation helps me a lot. But I'm still confused about "best mapping" and "second best mapping" in bwa. Are they defined as the mapping with least difference with reference and second least?

0
Entering edit mode

I believe that they're defined according to the alignment score, which should be dependent on the edit distance (match/mismatch/indel).

0
Entering edit mode

If this is the case, why can we get some short reads with score "23"? I looked at the output of bwa. Actually there are some short reads have specified matching positions but the score is 23. If there is no "best mapping", I think it means that we cannot find mapping position satisfies the max edits. Then it cannot have a matching position in the output right? Or is it possible that there are "second best mapping" even though number of "best mapping" is 0?

0
Entering edit mode

You'd need to show exact examples.

0
Entering edit mode

Here are two examples: https://drive.google.com/file/d/0B0yotzh1GssPWTZvV2NHNzF3amc/view?usp=sharing https://drive.google.com/file/d/0B0yotzh1GssPNGdhem5kZEtlM0k/view?usp=sharing I use human_g1k_v37.fasta as reference and ERR013029_1.filt.fastq as short reads file. Thanks so much!

0
Entering edit mode

Do you have any ideas about these alignments? Thanks!

0
Entering edit mode

In each case, there's a single "best hit" (X0:i:1), which results in a MAPQ of 23 (p->c1 == 0).

0
Entering edit mode

For single "best hit", I think c1 is 1..? As you explained earlier, c1 is the number of best hit right? Here is another example with single "best hit" https://drive.google.com/open?id=0B0yotzh1GssPUDh0MGoyTU5wMUk, but the score is 37... I guess I have some misunderstanding about "best hit" x0 and "suboptimal hit" x1?

0
Entering edit mode

You'll have to go through the code and add some printf's in to completely determine this.

0
Entering edit mode

So what's the relation between c1 and x0? why is c1 0 when there is one best hit (x0:i:1)?

0
Entering edit mode

As I said, you'll have to go through the code and add some printf's in.