Question

Base Recalibration Of Stampy Alignments In Gatk

1

Entering edit mode

11.0 years ago

jake ▴ 20

I am struggling to produce a good base recalibration in GATK (version 2.4-9-g532efad) on multi-sample, single-end illumina data sets aligned with Stampy. After mapping reads to the reference, the workflow I've been following is:

Identify indels with RealignerTargetCreator
Realign around indels with IndelRealigner
Call SNPs with UnifiedGenotyper -out_mode EMIT_VARIANTS_ONLY
Recalibrate with BaseRecalibrator and -knownSites
Make new recalibrated bam files with PrintReads -BQSR
Call SNPs in new bam files with UnifiedGenotyper -out_mode EMIT_VARIANTS_ONLY
Compare vcf files with VariantEval

Edit: Alignments were generated with Stampy prior to the base recalibration attempts.

Many SNPs are being called at step 3 (>700K), but very few at step 6 (<200), so something seems to be going wrong with the recalibration. This pattern is consistent across population samples from four different species. Any ideas on what might be going wrong here?

Edit: The problem seems to stem from the Stampy alignments. When I generate alignments with bwa, base recalibration behaves normally. However, bwa is unable to map the majority of divergent reads I'm working with so I need an alignment method that can handle the divergence. When I validate the bam files output by Stampy with Picard, it produces many errors that look like this:

ERROR: Record 7, Read name HWI-ST0747:287:D22WTACXX:2:2215:9595:66866, NM tag (nucleotide differences) in file [6] does not match reality [7]

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T RealignerTargetCreator -I Sample1_sorted.bam -I Sample2_sorted.bam -R mouse_genome_masked.fasta -o indels_forRealigner.intervals

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T IndelRealigner -I Sample1_sorted.bam -I Sample2_sorted.bam -R mouse_genome_masked.fasta -targetIntervals indels_forRealigner.intervals --nWayOut .realigned.bam

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T UnifiedGenotyper -R mouse_genome_masked.fasta -I Sample1_sorted.realigned.bam -I Sample2_sorted.realigned.bam --out_mode EMIT_VARIANTS_ONLY -o nonrecalibrated_variant_genotypes_unifiedgenotyper.vcf

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T BaseRecalibrator -R mouse_genome_masked.fasta -I Sample1_sorted.realigned.bam -I Sample2_sorted.realigned.bam -knownSites nonrecalibrated_variant_genotypes_unifiedgenotyper.vcf -o pdf_recal_data.grp --plot_pdf_file first_recal.pdf

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T PrintReads -R mouse_genome_masked.fasta -I Sample1_sorted.realigned.bam -I Sample2_sorted.realigned.bam -BQSR pdf_recal_data.grp -o recal_round1_all.bam

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T UnifiedGenotyper -R mouse_genome_masked.fasta -I recal_round1_all.bam -out_mode EMIT_VARIANTS_ONLY -o round1_recalibrated_variant_genotypes_unifiedgenotyper.vcf

java -Xmx2g -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T VariantEval -R mouse_genome_masked.fasta -o round0_vs_round1.eval.gatkreport --eval:set1 nonrecalibrated_variant_genotypes_unifiedgenotyper.vcf --eval:set2 round1_recalibrated_variant_genotypes_unifiedgenotyper.vcf

gatk • 4.3k views

ADD COMMENT • link 11.0 years ago by jake ▴ 20

0

Entering edit mode

after realigning , duplicates should be removed.

ADD REPLY • link 11.0 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I thought that was only necessary for paired-end datasets. Am I mistaken?

ADD REPLY • link 11.0 years ago by jake ▴ 20

1

Entering edit mode

Personally I will remove duplicates for both fragment and paired-end data sets.

ADD REPLY • link 11.0 years ago by Ashutosh Pandey 12k

0

Entering edit mode

sorry I didn't see it was single end dataset.

ADD REPLY • link 11.0 years ago by Pierre Lindenbaum 161k

score 1 · Answer 1 · 2013-05-02

1

Entering edit mode

11.0 years ago

jake ▴ 20

The problem is the divergent nature of the sequence data. Many SNPs are present between the reference and the sequences. After generating a 'known sites' list, many SNPs supported by reasonably good phred scores remained, and these caused the recalibration step to dramatically lower the phred scores of all reads, which led to no SNPs being called in the recalibrated bam file.

ADD COMMENT • link 11.0 years ago by jake ▴ 20

0

Entering edit mode

Yes. Unless you have a good list of known SNPs/base-pair differences, you probably should not recalibrate base quality. In addition, you might consider to give "bwa mem" a try. It should be more sensitive than bwa, but I cannot predict if it can be better than stampy.

ADD REPLY • link 11.0 years ago by lh3 33k