I would like to know what formulas are being used in the most common genome aligners to determine the MAPQ of reads that fall inside repeats in their entirety? By falling inside repeats in their entirety, I mean that the whole read length is inside the repeat. Also that both read pairs of a PE fall inside the repeat. My definition of repeats here, just as a simplification, is "stretches of DNA sequences in a genome that are identically found in multiple locations".
Is there a consensus on which scoring formula to use for reads falling on repetitive regions of the genome?
EDITED: As an example, does bowtie
score the same as bwa
, or mosaik
? Does bwa aln
score the same as bwa mem
?
Highlighted below is the relevant code snippet for mem_sam_pe
in bwamem_pair.c
that makes things different between bwa aln
and bwa mem
.
I believe bwamem has changed the formula around pair-endness with respect to bwa aln. Is that so? Has anybody looked at it extensively?
That could well be, I've yet to look.
Edit: It doesn't look completely unrelated. See the
mem_sam_pe
function in bwamem_pair.c.Edit2: In fact, it's very similar and can be vaguely defined as
q_pe = 6*(o-subo) - (4.343*log(n_sub+1)+0.499)
with some limits on its range and further tweaks depending on what the single-end scores would be.I think I identified the bit that may be different between bwa-mem and bwa-aln: is in deciding if paired alignment is preferred or unpaired alignment is preferred based on
o > score_un
. Highlighted in the screenshot in the question.Yeah, it really does look like that's the main difference, thanks for updating the original question. Keep in mind that the actual reported MAPQ is given by
q_se
.Thanks for the great answer. Any thoughts on how bwa-mem assigns the QMAP score?
Hi Devon, your explanation helps me a lot. But I'm still confused about "best mapping" and "second best mapping" in bwa. Are they defined as the mapping with least difference with reference and second least?
I believe that they're defined according to the alignment score, which should be dependent on the edit distance (match/mismatch/indel).
If this is the case, why can we get some short reads with score "23"? I looked at the output of bwa. Actually there are some short reads have specified matching positions but the score is 23. If there is no "best mapping", I think it means that we cannot find mapping position satisfies the max edits. Then it cannot have a matching position in the output right? Or is it possible that there are "second best mapping" even though number of "best mapping" is 0?
You'd need to show exact examples.
Here are two examples: https://drive.google.com/file/d/0B0yotzh1GssPWTZvV2NHNzF3amc/view?usp=sharing https://drive.google.com/file/d/0B0yotzh1GssPNGdhem5kZEtlM0k/view?usp=sharing I use human_g1k_v37.fasta as reference and ERR013029_1.filt.fastq as short reads file. Thanks so much!
Do you have any ideas about these alignments? Thanks!
In each case, there's a single "best hit" (
X0:i:1
), which results in a MAPQ of 23 (p->c1 == 0
).For single "best hit", I think c1 is 1..? As you explained earlier, c1 is the number of best hit right? Here is another example with single "best hit" https://drive.google.com/open?id=0B0yotzh1GssPUDh0MGoyTU5wMUk, but the score is 37... I guess I have some misunderstanding about "best hit" x0 and "suboptimal hit" x1?
You'll have to go through the code and add some printf's in to completely determine this.
So what's the relation between c1 and x0? why is c1 0 when there is one best hit (x0:i:1)?
As I said, you'll have to go through the code and add some printf's in.