I'm aligning short-read (2x150) data to fairly divergent references (up to ~10%). So far I have been working mostly with bwa-mem and NGM and have very generous read depths (200-300x). After each alignment, I call the consensus using
samtools pileup and
bcftools call with defaults (currently no quality filters on variant calling, but will add them soon). My goal is consensus with ambiguities for downstream work.
My problem is that the consensus seqs from the different aligners don’t match exactly (overall 97-99% identical), with most mismatches being transitions (C-T or A-G). The Ts/Tv ratios from
bcftools stats are around 1.7-2.0, but these alignments are all to protein coding regions so those ratios are really on the low end.
My questions are:
Why am I getting so many transitions between consensus seqs from different aligners?
Are these discrepancies likely heterozygous positions, but not being called as such because the aligners (based on their internal tuning) have some variation in which haplotype maps better in each instance? (The consensus seqs from both aligners contain some heterozygous/ambiguous positions, so I know it’s possible to see heterozygous positions if both haplotypes map sufficiently.)
By what metric(s) should I choose between alignments?
If the transitions do represent heterozygous positions, would it be fair to effectively choose both by melding the consensus seqs from different aligners and create ambiguities where they differ?
Thanks in advance.