Question: Bowtie2 MAPQ difference using local vs end-to-end alignment
0
gravatar for polys
3 days ago by
polys0
polys0 wrote:

Hello everyone, I am currently trying to learn about ngs data analysis. Specifically, I have data from targeted (panel) ngs from human liquid biopsy samples, in which we have to determine the presence of somatic variants that are at a very low allele fraction (ctDNA in peripheral blood). The company that sold us this service conveniently forgot to mention that they were not going to hand us the final results, so it's up to us to decide, based on the BAM files they provided us with, which mutations could be considered valid and which might be artifacts or errors.... This is all very new to me, so I have lots of doubts. They are offering support, but there are some things that I can't seem to fully understand.

The thing is, when I look at specific positions in the BAM file, take KRAS codon 12 as an example, I can see that there is a variant in that position, but MAPQ values for the amplicons that support the position are lower than MAPQ values for amplicons that support the reference (18 vs 33). The company's support division explained to me that those differences in MAPQ could indicate that the variant is not actually real and that it should be confirmed by another method (confirming variants is not an option for us). Biologically speaking, since we are dealing with a cancer patient sample, and giving that KRAS codon 12 is a very common and important mutation, we thought: what are the chances that this specific mutation was produced by an error?

So, I decided to make some processing of my own, since they provided the raw data files, always focusing on KRAS codon 12 to compare results:

  • First of all, I tried aligning with BWA. Results were great, KRAS mutated reads had MAPQ of 60, same as reference reads
  • I then aligned the data using Bowtie2 (this is what the company uses) in end-to-end mode. Again, results were great (98,9% overall rate alignment), and no difference in MAPQ between mutated/reference reads (MAPQ 42)

  • Finally, I got support people to tell me the exact command that they had used to get to the final BAM file, and realized that they were using local alignment mode. Run the alignment in this mode (rate alignment was raised a little bit, I think around 99,2%, cannot remember), and finally got similar BAM to the company's BAM: MAPQ 18 vs 33.

So, I wanted an expert opinion on this, since, I repeat, this is all very new to me. Is it advantageous to perform local alignment in targeted ngs?? I cannot understand how local aligment can hurt MAPQ so much for these amplicons given that there seems to be no soft-clipped bases in the alignment.

I'm sorry if it got too long, want to be as detailed as possible. Hope you can help me

Thanks in advance!

ADD COMMENTlink modified 3 days ago by benformatics640 • written 3 days ago by polys0
1
gravatar for benformatics
3 days ago by
benformatics640
benformatics640 wrote:

The major difference is that for a local alignment you try to match your read to substring of your reference. This is a less conservative approach and essentially assumes that your reads are derived directly from your reference sequence (which they aren't, you have a SNP there).

That is likely the very reason why the score is lower for a local alignment. The alignment is less deal because a SNP is present... so in turn the score is lower. End-to-end alignment are the way to go.

That being said, BWA is perfectly fine. I would trust those results.

Here is the relevant bowtie2 manual entry:

End-to-end alignment score example

A mismatched base at a high-quality position in the read receives a penalty of -6 by default. A length-2 read gap receives a penalty of -11 by default (-5 for the gap open, -3 for the first extension, -3 for the second extension). Thus, in end-to-end alignment mode, if the read is 50 bp long and it matches the reference exactly except for one mismatch at a high-quality position and one length-2 read gap, then the overall score is -(6 + 11) = -17.

The best possible alignment score in end-to-end mode is 0, which happens when there are no differences between the read and the reference.

Local alignment score example

A mismatched base at a high-quality position in the read receives a penalty of -6 by default. A length-2 read gap receives a penalty of -11 by default (-5 for the gap open, -3 for the first extension, -3 for the second extension). A base that matches receives a bonus of +2 be default. Thus, in local alignment mode, if the read is 50 bp long and it matches the reference exactly except for one mismatch at a high-quality position and one length-2 read gap, then the overall score equals the total bonus, 2 * 49, minus the total penalty, 6 + 11, = 81.

The best possible score in local mode equals the match bonus times the length of the read. This happens when there are no differences between the read and the reference.

In your case the difference sounds like your SNP in question might be a gap is considered a gap or something?

ADD COMMENTlink modified 3 days ago • written 3 days ago by benformatics640

Thanks for your answer, benformatics. I Will try to repeat the alignment step with Hisat2 just to compare results.

So, in your opinion, end to end alignment is the best choice? In which cases would you say it's better to perform local alignment?

ADD REPLYlink modified 3 days ago • written 3 days ago by polys0

You would use local to get higher alignment rates ... however you do this with a risk of false-positives. Essentially, if you soft-mask nt's on the edge of your read you could accidentally align your read to the wrong location. (in many basic research cases this isn't a big deal)

Times when you might want to do this are: 1) you're sloppy/rushing - you know there are sequencing adapters or barcodes but you don't care and just want to align the reads 2) you're aligning reads that you know might not match well to the actual reference genome. Example: a closely related Drosophilid to the Drosophila genome

In your case you have something that is clinically important... you shouldn't take risks... you should high-confidence end-to-end alignments. You are aligning human reads to the human reference.

Also probably not super necessary to use HISAT2 (I revised my comment)

ADD REPLYlink modified 2 days ago • written 2 days ago by benformatics640

Thanks benformatics! Cristal clear

ADD REPLYlink written 1 day ago by polys0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1543 users visited in the last hour