Question

GATK VariantRecalibrator Error Message

12.4 years ago

Lds ▴ 450

Hi, I have an exome sequencing project. When I use VariantRecalibrator in GATK to run VQSR on the VCF file generated for chromosome 1 by UnifiedGenotyper, it works fine, but for the VCF file from chromosome 22 I get an ERROR message from GATK. Here are some details you may want to know.

$ java -jar GenomeAnalysisTK.jar \
-R human_g1k_v37.fasta \
-T VariantRecalibrator \
-input snps.raw.chr22.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
-resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=8.0 dbsnp_132.b37.vcf \
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an FS -an MQ \
-recalFile VQSRoutput.chr22.recal \
-tranchesFile VQSRoutput.chr22.tranches \
-rscriptFile VQSRoutput.chr22.plots.R

.......
INFO  16:24:50,760 VariantRecalibratorEngine - Finished iteration 0. 
INFO  16:24:50,939 VariantRecalibratorEngine - Finished iteration 5.     Current change in mixture coefficients = 0.74861 
INFO  16:24:51,120 VariantRecalibratorEngine - Finished iteration 10.     Current change in mixture coefficients = 0.24015 
INFO  16:24:51,301 VariantRecalibratorEngine - Finished iteration 15.     Current change in mixture coefficients = 0.10042 
INFO  16:24:51,479 VariantRecalibratorEngine - Finished iteration 20.     Current change in mixture coefficients = 0.04331 
INFO  16:24:51,658 VariantRecalibratorEngine - Finished iteration 25.     Current change in mixture coefficients = 0.02727 
INFO  16:24:51,765 VariantRecalibratorEngine - Convergence after 28 iterations! 
INFO  16:24:51,780 VariantRecalibratorEngine - Evaluating full set of 114431 variants... 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 1.2-2-g8143def): 
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)


$ grep -v "#" snps.raw.chr1.vcf | wc -l
229700
$ grep -v "#" snps.raw.chr22.vcf | wc -l
44434

How can I fix this problem? If the problem is what GATK suggests ("Clustering with this few variants and these annotations is unsafe") and I should raise the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lower the maximum number of Gaussians used in the model (via --maxGaussians 4, for example), how should I choose the values of --percentBadVariants and --maxGaussians for the best results? Thanks very much.

ADD COMMENT • link updated 12.4 years ago by User 59 13k • written 12.4 years ago by Lds ▴ 450

Two questions - 1) Why are you doing this a chromosome at a time? 2) How many samples are you processing?

ADD REPLY • link 12.4 years ago by User 59 13k

Because the chromosomes can then be processed in parallel to save time. The sample size in my study is 50.

ADD REPLY • link 12.4 years ago by Lds ▴ 450

score 6 · Answer 1 · 2011-11-24

I don't think doing this on a per-chromosome basis is the way to go. You've already indicated you've read the relevant page on the GATK website:

"This tool is expecting thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model. Whole exome call sets work well, but anything smaller than that scale might run into difficulties. One piece of advice is to turn down the number of Gaussians used during training and to turn up the number of variants that are used to train the negative model. This can be accomplished by adding --maxGaussians 4 --percentBad 0.05 to your command line."
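For reference, here is a minimal sketch of what that rerun could look like, reusing the file names and resource lines from the command in the question. `vqsr_small_callset` is a hypothetical wrapper name, not part of GATK. The flag spelling `--percentBadVariants` is taken from the error message for version 1.2 (the documentation quoted above writes it as `--percentBad`), and 4 and 0.05 are only the example values GATK suggests, not tuned settings. The duplicated `-an FS` is listed once. The function prints the command by default so it can be checked before anything runs.

```shell
# Hypothetical helper: build the VariantRecalibrator command with the
# reduced-model settings suggested by the GATK error message.
# Usage: vqsr_small_callset INPUT_VCF OUTPUT_PREFIX
# Set RUN=1 in the environment to execute instead of printing.
vqsr_small_callset() {
  input_vcf=$1
  prefix=$2
  # Replace the positional parameters with the full command line.
  set -- java -jar GenomeAnalysisTK.jar \
    -R human_g1k_v37.fasta \
    -T VariantRecalibrator \
    -input "$input_vcf" \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
    -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=8.0 dbsnp_132.b37.vcf \
    -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
    --maxGaussians 4 --percentBadVariants 0.05 \
    -recalFile "$prefix.recal" \
    -tranchesFile "$prefix.tranches" \
    -rscriptFile "$prefix.plots.R"
  if [ "${RUN:-0}" = 1 ]; then
    "$@"                    # actually run GATK
  else
    printf '%s\n' "$*"      # dry run: show the command only
  fi
}

# Example (dry run):
# vqsr_small_callset snps.raw.chr22.vcf VQSRoutput.chr22
```

Whether these two values are appropriate depends on the callset, so it is worth comparing the resulting tranches plots against a run on the full exome.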

I've had to set --maxGaussians 4 for smaller numbers of WEX samples, but the v3 recommendations for GATK say:

"In our testing we've found that in order to achieve the best exome results one needs to use an exome SNP callset with at least 30 samples"

That's a full exome callset with 30 samples, not a single chromosome that represents 2% (or less) of the total DNA. I would have thought VQSR needs as much data as possible to give the best results. By segmenting the data, I wonder whether you're violating the assumptions of VQSR.
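If the per-chromosome calling is kept for speed, one option is to glue the per-chromosome VCFs back into a single callset before running VQSR once on the whole exome. Below is a rough sketch in plain shell. It assumes every file was called on the same samples with identical headers and that the files are passed in the reference's contig order; `concat_vcfs` and `snps.raw.all.vcf` are made-up names for illustration. If the headers differ, GATK's CombineVariants walker is probably the safer route.

```shell
# Hypothetical helper: concatenate per-chromosome VCFs into one file.
# Usage: concat_vcfs OUTPUT_VCF INPUT_VCF [INPUT_VCF ...]
concat_vcfs() {
  out=$1
  shift
  # Header lines (starting with '#') are taken from the first input only.
  grep '^#' "$1" > "$out"
  for vcf in "$@"; do
    # Append the variant records; '|| true' tolerates a file with no records.
    grep -v '^#' "$vcf" >> "$out" || true
  done
}

# Example (list every chromosome, in reference order):
# concat_vcfs snps.raw.all.vcf snps.raw.chr1.vcf snps.raw.chr22.vcf
```

The merged file can then be given to VariantRecalibrator via `-input` in place of the single-chromosome VCF.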