Base Recalibration Of Stampy Alignments In Gatk
1
1
Entering edit mode
11.0 years ago
jake ▴ 20

I am struggling to produce a good base recalibration in GATK (version 2.4-9-g532efad) on multi-sample, single-end illumina data sets aligned with Stampy. After mapping reads to the reference, the workflow I've been following is:

  1. Identify indels with RealignerTargetCreator
  2. Realign around indels with IndelRealigner
  3. Call SNPs with UnifiedGenotyper -out_mode EMIT_VARIANTS_ONLY
  4. Recalibrate with BaseRecalibrator and -knownSites
  5. Make new recalibrated bam files with PrintReads -BQSR
  6. Call SNPs in new bam files with UnifiedGenotyper -out_mode EMIT_VARIANTS_ONLY
  7. Compare vcf files with VariantEval

Edit: Alignments were generated with Stampy prior to the base recalibration attempts.

Many SNPs are being called at step 3 (>700K), but very few at step 6 (<200), so something seems to be going wrong with the recalibration. This pattern is consistent across population samples from four different species. Any ideas on what might be going wrong here?

Edit: The problem seems to stem from the Stampy alignments. When I generate alignments with bwa, base recalibration behaves normally. However, bwa is unable to map the majority of divergent reads I'm working with so I need an alignment method that can handle the divergence. When I validate the bam files output by Stampy with Picard, it produces many errors that look like this:

ERROR: Record 7, Read name HWI-ST0747:287:D22WTACXX:2:2215:9595:66866, NM tag (nucleotide differences) in file [6] does not match reality [7]

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T RealignerTargetCreator -I Sample1_sorted.bam -I Sample2_sorted.bam -R mouse_genome_masked.fasta -o indels_forRealigner.intervals

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T IndelRealigner -I Sample1_sorted.bam -I Sample2_sorted.bam -R mouse_genome_masked.fasta -targetIntervals indels_forRealigner.intervals --nWayOut .realigned.bam

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T UnifiedGenotyper -R mouse_genome_masked.fasta -I Sample1_sorted.realigned.bam -I Sample2_sorted.realigned.bam --out_mode EMIT_VARIANTS_ONLY -o nonrecalibrated_variant_genotypes_unifiedgenotyper.vcf

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T BaseRecalibrator -R mouse_genome_masked.fasta -I Sample1_sorted.realigned.bam -I Sample2_sorted.realigned.bam -knownSites nonrecalibrated_variant_genotypes_unifiedgenotyper.vcf -o pdf_recal_data.grp --plot_pdf_file first_recal.pdf

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T PrintReads -R mouse_genome_masked.fasta -I Sample1_sorted.realigned.bam -I Sample2_sorted.realigned.bam -BQSR pdf_recal_data.grp -o recal_round1_all.bam

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T UnifiedGenotyper -R mouse_genome_masked.fasta -I recal_round1_all.bam -out_mode EMIT_VARIANTS_ONLY -o round1_recalibrated_variant_genotypes_unifiedgenotyper.vcf

java -Xmx2g -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T VariantEval -R mouse_genome_masked.fasta -o round0_vs_round1.eval.gatkreport --eval:set1 nonrecalibrated_variant_genotypes_unifiedgenotyper.vcf --eval:set2 round1_recalibrated_variant_genotypes_unifiedgenotyper.vcf
gatk • 4.3k views
ADD COMMENT
0
Entering edit mode

after realigning , duplicates should be removed.

ADD REPLY
0
Entering edit mode

I thought that was only necessary for paired-end datasets. Am I mistaken?

ADD REPLY
1
Entering edit mode

Personally I will remove duplicates for both fragment and paired-end data sets.

ADD REPLY
0
Entering edit mode

sorry I didn't see it was single end dataset.

ADD REPLY
1
Entering edit mode
11.0 years ago
jake ▴ 20

The problem is the divergent nature of the sequence data. Many SNPs are present between the reference and the sequences. After generating a 'known sites' list, many SNPs supported by reasonably good phred scores remained, and these caused the recalibration step to dramatically lower the phred scores of all reads, which led to no SNPs being called in the recalibrated bam file.

ADD COMMENT
0
Entering edit mode

Yes. Unless you have a good list of known SNPs/base-pair differences, you probably should not recalibrate base quality. In addition, you might consider to give "bwa mem" a try. It should be more sensitive than bwa, but I cannot predict if it can be better than stampy.

ADD REPLY

Login before adding your answer.

Traffic: 1651 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6