What happens is that about halfway through, the aligner seems to slip a position in the reference, so that almost every letter after that slip looks like a difference between the reference and the .bam file, making the vcf huge. I've had this happen on two different microbial species.
I used bwa for alignment and to make the paired-end bam, and samtools to sort and remove duplicates. I've also tried running that .bam through Picard's sorting and duplicate removal, but the result is the same. I also tried bams that had and had not gone through GATK indel realignment, but it made no difference. What did work was telling GATK not to look at the whole genome, but to start about 100 kb from the beginning. When I did that, it did not slip. This was on a Staph genome of about 2.2 Mb. The reference fastas look normal.
Has anyone seen this before? Is there some subtle formatting issue with my bam that is causing this?
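Since "looks normal" is hard to judge by eye on a 2.2 Mb file, one thing that could be ruled out mechanically is an irregular line width in the reference FASTA. This is only a guess at a possible cause, not a confirmed diagnosis: tools that rely on a .fai index compute byte offsets from a fixed line width per sequence, so a single short or long line in the middle of a record could in principle shift everything after it. The sketch below is a minimal check; the function name `irregular_fasta_lines` is made up for illustration.

```python
def irregular_fasta_lines(lines):
    """Return (line_number, record_name, length) for every sequence line whose
    width breaks the record's line width.

    Within a record, every line must be as wide as the first sequence line,
    except the last line, which may be shorter. Trailing blank lines at the
    end of the input are ignored; blank lines elsewhere count as width 0.
    """
    rows = [raw.rstrip("\r\n") for raw in lines]
    while rows and rows[-1] == "":
        rows.pop()  # ignore blank lines at the very end of the file

    problems = []
    name, width, short_line = None, None, None
    for line_no, line in enumerate(rows, start=1):
        if line.startswith(">"):
            # New record: reset the expected width and any pending short line.
            name = (line[1:].split() or [""])[0]
            width, short_line = None, None
            continue
        if name is None:
            continue  # skip anything before the first header
        if short_line is not None:
            # The earlier short line was not the last line of its record.
            problems.append((short_line[0], name, short_line[1]))
            short_line = None
        if width is None:
            width = len(line)
        elif len(line) > width:
            problems.append((line_no, name, len(line)))
        elif len(line) < width:
            # Only acceptable if nothing else follows in this record.
            short_line = (line_no, len(line))
    return problems
```

To try it on the reference from this thread, something like `irregular_fasta_lines(open("../../reference.fa"))` should work; an empty list means every record has a consistent line width. If the FASTA was edited after its index was built, regenerating the index would be the other thing to try.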
You are using the SAME reference for both GATK and your alignment step?
Yup, it's the right reference, both times. Two different species, same problem. I have no problems using samtools mpileup on the same .bams and references.
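For anyone who wants to double-check the "same reference" question rather than trust file names, here is a rough sketch that compares the sequence names and lengths in the reference FASTA against the @SQ lines of the BAM header. It assumes the header has been dumped to text first (for example with `samtools view -H pdedup.bam > header.sam`; the file name `header.sam` and all function names below are just placeholders). Matching lengths do not prove the sequences are identical, so treat this as a quick sanity check only.

```python
def fasta_lengths(lines):
    """Map each FASTA record name (first word after '>') to its sequence length."""
    lengths = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            name = (line[1:].split() or [""])[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def header_lengths(header_lines):
    """Map each @SQ sequence name (SN) to its declared length (LN) in a SAM header."""
    lengths = {}
    for raw in header_lines:
        fields = raw.rstrip("\r\n").split("\t")
        if fields[0] != "@SQ":
            continue  # skip @HD, @RG, @PG and anything else
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def reference_mismatches(fasta, header):
    """Return (name, fasta_length, header_length) for every disagreement.

    A length of None means the sequence is missing from that side.
    """
    names = sorted(set(fasta) | set(header))
    return [(n, fasta.get(n), header.get(n))
            for n in names if fasta.get(n) != header.get(n)]
```

Usage would be along the lines of `reference_mismatches(fasta_lengths(open("../../reference.fa")), header_lengths(open("header.sam")))`. An empty list means the names and lengths agree between the BAM and the FASTA.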
Here's the command line I'm using, maybe I'm missing something stupid that the program is silently tripping on:
java -jar ../../../../GATK/1.1.23/GenomeAnalysisTK.jar -T UnifiedGenotyper -I pdedup.bam -R ../../reference.fa -o pdedup.txt
I've got nothin'.