Question

Choosing the best alignment or aligner

0

Entering edit mode

5.6 years ago

adhb • 0

I'm aligning short-read (2x150) data to fairly divergent references (up to ~10%). So far I have been working mostly with bwa-mem and NGM and have very generous read depths (200-300x). After each alignment, I call the consensus using samtools pileup and bcftools call with defaults (currently no quality filters on variant calling, but will add them soon). My goal is consensus with ambiguities for downstream work.

My problem is that the consensus seqs from the different aligners don’t match exactly (overall 97-99% identical), with most mismatches being transitions (C-T or A-G). The Ts/Tv ratios from bcftools stats are around 1.7-2.0, but these alignments are all to protein coding regions so those ratios are really on the low end.

My questions are:

Why am I getting so many transitions between consensus seqs from different aligners?
Are these discrepancies likely heterozygous positions, but not being called as such because the aligners (based on their internal tuning) have some variation in which haplotype maps better in each instance? (The consensus seqs from both aligners contain some heterozygous/ambiguous positions, so I know it’s possible to see heterozygous positions if both haplotypes map sufficiently.)
By what metric(s) should I choose between alignments?
If the transitions do represent heterozygous positions, would it be fair to effectively choose both by melding the consensus seqs from different aligners and create ambiguities where they differ?

Thanks in advance.

alignment bwa ngm heterozygosity • 1.3k views

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 5.6 years ago by adhb • 0

0

Entering edit mode

Have you tried a simulation study to see which mapper is better able to deal with the high divergence? You could use Simulome to generate a reference sequence in FASTA format then simulate Illumina read pairs with ART.

ADD REPLY • link 5.5 years ago by mark.ziemann ★ 1.9k