Question: Base Recalibration Of Stampy Alignments In Gatk
1
gravatar for jake
6.6 years ago by
jake20
jake20 wrote:

I am struggling to produce a good base recalibration in GATK (version 2.4-9-g532efad) on multi-sample, single-end illumina data sets aligned with Stampy. After mapping reads to the reference, the workflow I've been following is:

  1. Identify indels with RealignerTargetCreator
  2. Realign around indels with IndelRealigner
  3. Call SNPs with UnifiedGenotyper -out_mode EMIT_VARIANTS_ONLY
  4. Recalibrate with BaseRecalibrator and -knownSites
  5. Make new recalibrated bam files with PrintReads -BQSR
  6. Call SNPs in new bam files with UnifiedGenotyper -out_mode EMIT_VARIANTS_ONLY
  7. Compare vcf files with VariantEval

Edit: Alignments were generated with Stampy prior to the base recalibration attempts.

Many SNPs are being called at step 3 (>700K), but very few at step 6 (<200), so something seems to be going wrong with the recalibration. This pattern is consistent across population samples from four different species. Any ideas on what might be going wrong here?

Edit: The problem seems to stem from the Stampy alignments. When I generate alignments with bwa, base recalibration behaves normally. However, bwa is unable to map the majority of divergent reads I'm working with so I need an alignment method that can handle the divergence. When I validate the bam files output by Stampy with Picard, it produces many errors that look like this:

ERROR: Record 7, Read name HWI-ST0747:287:D22WTACXX:2:2215:9595:66866, NM tag (nucleotide differences) in file [6] does not match reality [7]

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T RealignerTargetCreator -I Sample1_sorted.bam -I Sample2_sorted.bam -R mouse_genome_masked.fasta -o indels_forRealigner.intervals

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T IndelRealigner -I Sample1_sorted.bam -I Sample2_sorted.bam -R mouse_genome_masked.fasta -targetIntervals indels_forRealigner.intervals --nWayOut .realigned.bam

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T UnifiedGenotyper -R mouse_genome_masked.fasta -I Sample1_sorted.realigned.bam -I Sample2_sorted.realigned.bam --out_mode EMIT_VARIANTS_ONLY -o nonrecalibrated_variant_genotypes_unifiedgenotyper.vcf

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T BaseRecalibrator -R mouse_genome_masked.fasta -I Sample1_sorted.realigned.bam -I Sample2_sorted.realigned.bam -knownSites nonrecalibrated_variant_genotypes_unifiedgenotyper.vcf -o pdf_recal_data.grp --plot_pdf_file first_recal.pdf

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T PrintReads -R mouse_genome_masked.fasta -I Sample1_sorted.realigned.bam -I Sample2_sorted.realigned.bam -BQSR pdf_recal_data.grp -o recal_round1_all.bam

java -Xmx2G -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T UnifiedGenotyper -R mouse_genome_masked.fasta -I recal_round1_all.bam -out_mode EMIT_VARIANTS_ONLY -o round1_recalibrated_variant_genotypes_unifiedgenotyper.vcf

java -Xmx2g -jar GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T VariantEval -R mouse_genome_masked.fasta -o round0_vs_round1.eval.gatkreport --eval:set1 nonrecalibrated_variant_genotypes_unifiedgenotyper.vcf --eval:set2 round1_recalibrated_variant_genotypes_unifiedgenotyper.vcf
gatk • 3.1k views
ADD COMMENTlink modified 6.5 years ago • written 6.6 years ago by jake20

after realigning , duplicates should be removed.

ADD REPLYlink written 6.6 years ago by Pierre Lindenbaum124k

I thought that was only necessary for paired-end datasets. Am I mistaken?

ADD REPLYlink written 6.6 years ago by jake20
1

Personally I will remove duplicates for both fragment and paired-end data sets.

ADD REPLYlink written 6.6 years ago by Ashutosh Pandey11k

sorry I didn't see it was single end dataset.

ADD REPLYlink written 6.6 years ago by Pierre Lindenbaum124k
1
gravatar for jake
6.5 years ago by
jake20
jake20 wrote:

The problem is the divergent nature of the sequence data. Many SNPs are present between the reference and the sequences. After generating a 'known sites' list, many SNPs supported by reasonably good phred scores remained, and these caused the recalibration step to dramatically lower the phred scores of all reads, which led to no SNPs being called in the recalibrated bam file.

ADD COMMENTlink written 6.5 years ago by jake20

Yes. Unless you have a good list of known SNPs/base-pair differences, you probably should not recalibrate base quality. In addition, you might consider to give "bwa mem" a try. It should be more sensitive than bwa, but I cannot predict if it can be better than stampy.

ADD REPLYlink modified 6.5 years ago • written 6.5 years ago by lh331k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1279 users visited in the last hour