Hello everyone, I am currently trying to learn about ngs data analysis. Specifically, I have data from targeted (panel) ngs from human liquid biopsy samples, in which we have to determine the presence of somatic variants that are at a very low allele fraction (ctDNA in peripheral blood). The company that sold us this service conveniently forgot to mention that they were not going to hand us the final results, so it's up to us to decide, based on the BAM files they provided us with, which mutations could be considered valid and which might be artifacts or errors.... This is all very new to me, so I have lots of doubts. They are offering support, but there are some things that I can't seem to fully understand.
The thing is, when I look at specific positions in the BAM file, take KRAS codon 12 as an example, I can see that there is a variant in that position, but MAPQ values for the amplicons that support the position are lower than MAPQ values for amplicons that support the reference (18 vs 33). The company's support division explained to me that those differences in MAPQ could indicate that the variant is not actually real and that it should be confirmed by another method (confirming variants is not an option for us). Biologically speaking, since we are dealing with a cancer patient sample, and giving that KRAS codon 12 is a very common and important mutation, we thought: what are the chances that this specific mutation was produced by an error?
So, I decided to make some processing of my own, since they provided the raw data files, always focusing on KRAS codon 12 to compare results:
- First of all, I tried aligning with BWA. Results were great, KRAS mutated reads had MAPQ of 60, same as reference reads
I then aligned the data using Bowtie2 (this is what the company uses) in end-to-end mode. Again, results were great (98,9% overall rate alignment), and no difference in MAPQ between mutated/reference reads (MAPQ 42)
Finally, I got support people to tell me the exact command that they had used to get to the final BAM file, and realized that they were using local alignment mode. Run the alignment in this mode (rate alignment was raised a little bit, I think around 99,2%, cannot remember), and finally got similar BAM to the company's BAM: MAPQ 18 vs 33.
So, I wanted an expert opinion on this, since, I repeat, this is all very new to me. Is it advantageous to perform local alignment in targeted ngs?? I cannot understand how local aligment can hurt MAPQ so much for these amplicons given that there seems to be no soft-clipped bases in the alignment.
I'm sorry if it got too long, want to be as detailed as possible. Hope you can help me
Thanks in advance!
Thanks for your answer, benformatics. I Will try to repeat the alignment step with Hisat2 just to compare results.
So, in your opinion, end to end alignment is the best choice? In which cases would you say it's better to perform local alignment?
You would use local to get higher alignment rates ... however you do this with a risk of false-positives. Essentially, if you soft-mask nt's on the edge of your read you could accidentally align your read to the wrong location. (in many basic research cases this isn't a big deal)
Times when you might want to do this are: 1) you're sloppy/rushing - you know there are sequencing adapters or barcodes but you don't care and just want to align the reads 2) you're aligning reads that you know might not match well to the actual reference genome. Example: a closely related Drosophilid to the Drosophila genome
In your case you have something that is clinically important... you shouldn't take risks... you should high-confidence end-to-end alignments. You are aligning human reads to the human reference.
Also probably not super necessary to use HISAT2 (I revised my comment)
Thanks benformatics! Cristal clear