Hi, I have an exome sequencing project. When I use VariantRecalibrator in GATK to run VQSR on the VCF file generated for chromosome 1 by UnifiedGenotyper, it works fine, but for the VCF file from chromosome 22 I get an ERROR message from GATK. Here are some details that you may want to know.
$ java -jar GenomeAnalysisTK.jar \
-R human_g1k_v37.fasta \
-T VariantRecalibrator \
-input snps.raw.chr22.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
-resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=8.0 dbsnp_132.b37.vcf \
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an FS -an MQ \
-recalFile VQSRoutput.chr22.recal \
-tranchesFile VQSRoutput.chr22.tranches \
-rscriptFile VQSRoutput.chr22.plots.R
.......
INFO 16:24:50,760 VariantRecalibratorEngine - Finished iteration 0.
INFO 16:24:50,939 VariantRecalibratorEngine - Finished iteration 5. Current change in mixture coefficients = 0.74861
INFO 16:24:51,120 VariantRecalibratorEngine - Finished iteration 10. Current change in mixture coefficients = 0.24015
INFO 16:24:51,301 VariantRecalibratorEngine - Finished iteration 15. Current change in mixture coefficients = 0.10042
INFO 16:24:51,479 VariantRecalibratorEngine - Finished iteration 20. Current change in mixture coefficients = 0.04331
INFO 16:24:51,658 VariantRecalibratorEngine - Finished iteration 25. Current change in mixture coefficients = 0.02727
INFO 16:24:51,765 VariantRecalibratorEngine - Convergence after 28 iterations!
INFO 16:24:51,780 VariantRecalibratorEngine - Evaluating full set of 114431 variants...
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 1.2-2-g8143def):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
##### ERROR
##### ERROR MESSAGE: NaN LOD value assigned. Clustering with this few variants and these annotations is unsafe. Please consider raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)
$ grep -v "#" snps.raw.chr1.vcf | wc -l
229700
$ grep -v "#" snps.raw.chr22.vcf | wc -l
44434
How can I fix this problem? If the problem is what GATK suggests ("Clustering with this few variants and these annotations is unsafe") and I should be "raising the number of variants used to train the negative model (via --percentBadVariants 0.05, for example) or lowering the maximum number of Gaussians to use in the model (via --maxGaussians 4, for example)", how should I choose the values of --percentBadVariants and --maxGaussians to get the best results? Thanks very much.
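For reference, here is a sketch of the rerun as I understand the error message to be suggesting it. The values 4 and 0.05 are only the examples quoted in the message, not tuned choices, and the shell variable names are my own. The other arguments are copied from my command above, with -an FS listed once. It only prints the command (a dry run), so nothing is executed:

```shell
# Dry-run sketch: prints the rerun command instead of executing it.
# The two values below are the examples from the GATK error message,
# not recommended or tuned settings.
MAX_GAUSSIANS=4
PERCENT_BAD=0.05

# Same arguments as the original chr22 run, plus the two suggested flags.
RERUN_CMD="java -jar GenomeAnalysisTK.jar \
-R human_g1k_v37.fasta \
-T VariantRecalibrator \
-input snps.raw.chr22.vcf \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
-resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
-resource:dbsnp,known=true,training=false,truth=false,prior=8.0 dbsnp_132.b37.vcf \
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ \
--maxGaussians ${MAX_GAUSSIANS} \
--percentBadVariants ${PERCENT_BAD} \
-recalFile VQSRoutput.chr22.recal \
-tranchesFile VQSRoutput.chr22.tranches \
-rscriptFile VQSRoutput.chr22.plots.R"

# Print the assembled command so it can be inspected before running it.
echo "${RERUN_CMD}"
```

Once the printed line looks right, it can be pasted back into the shell to run it.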
Two questions - 1) Why are you doing this a chromosome at a time? 2) How many samples are you processing?
Because the chromosomes can then be processed in parallel to save time. There are 50 samples in my study.