Question

identify genomic positions susceptible to alignment artifacts with bwa-mem aligner

0

Entering edit mode

5.2 years ago

szong • 0

Background:

I am looking for high confidence somatic non-coding variants from whole genome sequencing data. A list of mutation hotspots have been identified. the problem is that many of them are false positive upon reviewing them in IGV, specifically due to alignment artifacts. This is a bit surprising to me in the first place given our experiences that Strelka in general report high quality somatic variants in coding regions. But for some reason, it does not work as well when the targeted region is non-coding. I understand there are numerous way to filter for potential real variants, for example intersect with a second variant caller (primary caller is strelka) etc. One thing i am particular interested to try out is to identify genomic regions/positions that is prone to misalignment for whatever reason, repeats, low complexity etc. If we could figure out these regions, i could just assign low confidence score to those variants.

Approach and questions:

One thing i have tried is to pool all matching normals from this cohort and look for recurrent positions where we see mutations in multiple samples. This can be done by run single sample mpileup to report all potential snvs in the normals and then summarize to see what positions are recurrently mutated. Hopefully, this can account for the alignment artifact to some degree and it seems it did help.
Secondly, I would like to try more explicitly to look for potentially misalignment regions in the genome. I thought if i look into a bam, the CIGAR string should tell me exactly if a base(s) is a match, substitution, indel etc. But it tends out that is not the case. Even if there are multiple scattered mismatch in a read, the CIGAR string is still showing 125M (read length is 125), which is weird to me. Does someone know why i am seeing this? maybe due to parameter settings while running Strelka. If CIGAR string contains info about each base is a match or not, i could figure out the probability of mismatchs for each base in the genome when alignming normals to reference. If I aggregate hundreds of normal genomes together, I should have a pretty good idea which positions tend to be misaligned.
any other ideas/strategies to accomplish this goal. thanks

genome sequencing alignment SNP • 937 views

ADD COMMENT • link updated 5.2 years ago by Vitis ★ 2.5k • written 5.2 years ago by szong • 0

score 1 · Answer 1 · 2019-01-29

One thing you may be able to look at is the "mappability" of the genome reference. It is basically a measure how reliable mapping would work in a region. There are multiple papers discussing this concept.

http://massgenomics.org/2010/12/uniqueness-mappability-human-genome.html

For the CIGAR string, it usually ignores SNPs. You're really looking for the MD tag, which includes the reference information for the region covered by a particular read. Combining information from both CIGAR and MD tag, you would get a better idea about the variants within the range of a read or a few reads.

http://lh3.github.io/2018/03/27/the-history-the-cigar-x-operator-and-the-md-tag

In newer aligners like minimap2, the CS tag does the same thing and could be easier to deal with. But looks like MD and CIGAR would stick around for a while.