I am struggling to produce a good base recalibration in GATK (version 2.4-9-g532efad) on multi-sample, single-end illumina data sets aligned with Stampy. After mapping reads to the reference, the workflow I've been following is:
- Identify indels with RealignerTargetCreator
- Realign around indels with IndelRealigner
- Call SNPs with UnifiedGenotyper -out_mode EMIT_VARIANTS_ONLY
- Recalibrate with BaseRecalibrator and -knownSites
- Make new recalibrated bam files with PrintReads -BQSR
- Call SNPs in new bam files with UnifiedGenotyper -out_mode EMIT_VARIANTS_ONLY
- Compare vcf files with VariantEval
Edit: Alignments were generated with Stampy prior to the base recalibration attempts.
Many SNPs are being called at step 3 (>700K), but very few at step 6 (<200), so something seems to be going wrong with the recalibration. This pattern is consistent across population samples from four different species. Any ideas on what might be going wrong here?
Edit: The problem seems to stem from the Stampy alignments. When I generate alignments with bwa, base recalibration behaves normally. However, bwa is unable to map the majority of divergent reads I'm working with so I need an alignment method that can handle the divergence. When I validate the bam files output by Stampy with Picard, it produces many errors that look like this:
ERROR: Record 7, Read name HWI-ST0747:287:D22WTACXX:2:2215:9595:66866, NM tag (nucleotide differences) in file [6] does not match reality [7]
java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T RealignerTargetCreator -I Sample1_sorted.bam -I Sample2_sorted.bam -R mouse_genome_masked.fasta -o indels_forRealigner.intervals
java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T IndelRealigner -I Sample1_sorted.bam -I Sample2_sorted.bam -R mouse_genome_masked.fasta -targetIntervals indels_forRealigner.intervals --nWayOut .realigned.bam
java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T UnifiedGenotyper -R mouse_genome_masked.fasta -I Sample1_sorted.realigned.bam -I Sample2_sorted.realigned.bam --out_mode EMIT_VARIANTS_ONLY -o nonrecalibrated_variant_genotypes_unifiedgenotyper.vcf
java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T BaseRecalibrator -R mouse_genome_masked.fasta -I Sample1_sorted.realigned.bam -I Sample2_sorted.realigned.bam -knownSites nonrecalibrated_variant_genotypes_unifiedgenotyper.vcf -o pdf_recal_data.grp --plot_pdf_file first_recal.pdf
java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T PrintReads -R mouse_genome_masked.fasta -I Sample1_sorted.realigned.bam -I Sample2_sorted.realigned.bam -BQSR pdf_recal_data.grp -o recal_round1_all.bam
java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T UnifiedGenotyper -R mouse_genome_masked.fasta -I recal_round1_all.bam -out_mode EMIT_VARIANTS_ONLY -o round1_recalibrated_variant_genotypes_unifiedgenotyper.vcf
java -Xmx2g -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T VariantEval -R mouse_genome_masked.fasta -o round0_vs_round1.eval.gatkreport --eval:set1 nonrecalibrated_variant_genotypes_unifiedgenotyper.vcf --eval:set2 round1_recalibrated_variant_genotypes_unifiedgenotyper.vcf
after realigning , duplicates should be removed.
I thought that was only necessary for paired-end datasets. Am I mistaken?
Personally I will remove duplicates for both fragment and paired-end data sets.
sorry I didn't see it was single end dataset.