Comparing bwa 0.6.2 to bwa 0.5.9 I'm seeing very different behavior around indels. I'm using default options in both cases and as far as I can tell they haven't changed. I'm aligning Illumina paired end 2x80bp reads. For dbSNP indels of just a few nucleotides bwa 0.5.9 aligns them great - perfect matches for entire read length (plus the indel, except when the indel is near the end of the read) but 0.6.2 soft clips a large fraction of the read and then has a very high number of mismatches plus multiple indels in the remaining portion. Any suggestions?
I noticed the same issue. It seems that BWA is more aggressively soft-clipping reads around indels in recent releases. This is fine, provided you can realign the problematic reads.
To resolve issues with indel alignment in a relatively uniform and aligner-agnostic way, I developed a repeat and entropy-aware local realignment system, ogap. In our detection pipeline we use it like this:
bamtools merge -in file1.bam -in file2.bam -in file3.bam \ | ogap -z -R 25 -C 20 -Q 20 -S 0 -f $reference \ | bamleftalign -f $reference \ | samtools calmd -EAru - $reference 2>/dev/null \ | freebayes -f $reference >results.vcf
In this setting, it picks up reads which have soft clipped or mismatched bases summing to more than Q20, and realigns using an extended Smith-Waterman-Gotoh pairwise alignment algorithm that adjusts gap opening and extension penalties on the basis of local sequence context. To do this, it uses Shannon entropy and perfect repeats up to 12bp. This may be too sensitive for most use, so adjusting -C and -Q higher may improve performance.