Choosing the best alignment or aligner
Entering edit mode
5.2 years ago
adhb • 0

I'm aligning short-read (2x150) data to fairly divergent references (up to ~10%). So far I have been working mostly with bwa-mem and NGM and have very generous read depths (200-300x). After each alignment, I call the consensus using samtools pileup and bcftools call with defaults (currently no quality filters on variant calling, but will add them soon). My goal is consensus with ambiguities for downstream work.

My problem is that the consensus seqs from the different aligners don’t match exactly (overall 97-99% identical), with most mismatches being transitions (C-T or A-G). The Ts/Tv ratios from bcftools stats are around 1.7-2.0, but these alignments are all to protein coding regions so those ratios are really on the low end.

My questions are:

  1. Why am I getting so many transitions between consensus seqs from different aligners?
    Are these discrepancies likely heterozygous positions, but not being called as such because the aligners (based on their internal tuning) have some variation in which haplotype maps better in each instance? (The consensus seqs from both aligners contain some heterozygous/ambiguous positions, so I know it’s possible to see heterozygous positions if both haplotypes map sufficiently.)

  2. By what metric(s) should I choose between alignments?
    If the transitions do represent heterozygous positions, would it be fair to effectively choose both by melding the consensus seqs from different aligners and create ambiguities where they differ?

Thanks in advance.

alignment bwa ngm heterozygosity • 1.2k views
Entering edit mode

Have you tried a simulation study to see which mapper is better able to deal with the high divergence? You could use Simulome to generate a reference sequence in FASTA format then simulate Illumina read pairs with ART.


Login before adding your answer.

Traffic: 2262 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6