Background:
I am looking for high confidence somatic non-coding variants from whole genome sequencing data. A list of mutation hotspots have been identified. the problem is that many of them are false positive upon reviewing them in IGV, specifically due to alignment artifacts. This is a bit surprising to me in the first place given our experiences that Strelka in general report high quality somatic variants in coding regions. But for some reason, it does not work as well when the targeted region is non-coding. I understand there are numerous way to filter for potential real variants, for example intersect with a second variant caller (primary caller is strelka) etc. One thing i am particular interested to try out is to identify genomic regions/positions that is prone to misalignment for whatever reason, repeats, low complexity etc. If we could figure out these regions, i could just assign low confidence score to those variants.
Approach and questions:
- One thing i have tried is to pool all matching normals from this cohort and look for recurrent positions where we see mutations in multiple samples. This can be done by run single sample mpileup to report all potential snvs in the normals and then summarize to see what positions are recurrently mutated. Hopefully, this can account for the alignment artifact to some degree and it seems it did help.
- Secondly, I would like to try more explicitly to look for potentially misalignment regions in the genome. I thought if i look into a bam, the CIGAR string should tell me exactly if a base(s) is a match, substitution, indel etc. But it tends out that is not the case. Even if there are multiple scattered mismatch in a read, the CIGAR string is still showing 125M (read length is 125), which is weird to me. Does someone know why i am seeing this? maybe due to parameter settings while running Strelka. If CIGAR string contains info about each base is a match or not, i could figure out the probability of mismatchs for each base in the genome when alignming normals to reference. If I aggregate hundreds of normal genomes together, I should have a pretty good idea which positions tend to be misaligned.
- any other ideas/strategies to accomplish this goal. thanks
Thanks a lot Vitis. the info is very helpful.