Question

ERROR MESSAGE: QD annotation at VariantRecalibrator

0

Entering edit mode

5.6 years ago

Satyajeet Khare ★ 1.6k

I am performing trio analysis with Exome sequencing data. I am using GATK 3.8. I am at VariantRecalibrator stage. Following is my command...

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R Homo_sapiens.GRCh38.dna.primary_assembly.fa -input /gatk_out/output.vcf --resource:hapmap,known=false,training=true,truth=true,prior=15.0 /gatk/resources_broad_hg38_v0_hapmap_3.3.hg38.vcf.gz --resource:omni,known=false,training=true,truth=false,prior=12.0 /gatk/resources_broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz --resource:1000G,known=false,training=true,truth=false,prior=10.0 /gatk/resources_broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf.gz --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /ncbi/00-All.vcf.gz -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode SNP -recalFile /gatk_out/output.recal -tranchesFile /gatk_out/output.tranches -rscriptFile /gatk_out/output.plots.R

...which is running into an error

##### ERROR MESSAGE: Bad input: Values for QD annotation not detected for ANY training variant in the input callset. VariantAnnotator may be used to add these annotations.

Here, output.vcf is the output of GenotypeGVCFs command run on g.vcf files from mother, father and the child. The hapmap, omni and 1000G files are downloaded from gatk resource. The dbsnp file is downloaded from ncbi (dbsnp file from gatk resource gave an error).

I tried running VariantAnnotator in this manner...

 java -jar GenomeAnalysisTK.jar -T VariantAnnotator -A QualByDepth -R Homo_sapiens.GRCh38.dna.primary_assembly.fa -V /gatk_out/output.vcf -o /gatk_out/output_annotated.vcf

...changing output.vcf to output_annotated.vcf in the VariantRecalibrator command.

But now, a new error appears

##### ERROR MESSAGE: Invalid command line: Argument annotation has a bad value: Annotation QD was not found; please check that you have specified the annotation name correctly

Any inputs?

GATK snp VariantRecalibrator • 4.0k views

ADD COMMENT • link updated 5.5 years ago by Biostar 20 • written 5.6 years ago by Satyajeet Khare ★ 1.6k

1

Entering edit mode

Have you validated the input VCFs with ValidateVariants?

ADD REPLY • link 5.6 years ago by Tm ★ 1.1k

0

Entering edit mode

Thanks @toralmanvar. I tested the input vcf file like this

java -jar GenomeAnalysisTK.jar -T ValidateVariants --reference_window_stop >= 300 -R Homo_sapiens.GRCh38.dna.primary_assembly.fa --variant:VCF gatk_out/output_annotated.vcf

The command finished the run without any errors. But the output is confusing. The log file reads following...

INFO  09:54:43,442 ProgressMeter -          | processed |    time |    per 1M |           |   total | remaining
INFO  09:54:43,443 ProgressMeter -           Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  09:55:01,039 ProgressMeter -              done   1389871.0    17.0 s      12.0 s      100.0%    17.0 s       0.0 s
INFO  09:55:01,039 ProgressMeter - Total runtime 17.60 secs, 0.29 min, 0.00 hours

But the result file generated in the home directory says

Successfully validated the input file.  Checked 0 records with no failures.

Not sure why 0 records!

ADD REPLY • link updated 5.5 years ago by Ram 44k • written 5.6 years ago by Satyajeet Khare ★ 1.6k

0

Entering edit mode

Mine is a small dataset (4 samples). Is that the problem? Should I just shift to hard filtration such as this?

QD<2.0||MQ<40.0||FS>60.0||MQRankSum<-12.5||ReadPosRankSum<-8.0

ADD REPLY • link updated 5.5 years ago by Ram 44k • written 5.6 years ago by Satyajeet Khare ★ 1.6k

0

Entering edit mode

That's unusual. You might have gone through this GATK forum, but if not, see whether it can be useful to you.

ADD REPLY • link 5.6 years ago by Tm ★ 1.1k

0

Entering edit mode

Thanks, I will try running without -an QD parameter, but here are a couple of other peculiar observations.

I do see QD annotations in my VCF file, even without running the VariantAnnotator also.
If I try and validate the resource VCF files from gatk resource bundle (1000G and hapmap), the Processed sites are shown as 0 (fasta file is from NCBI).

Could this be an incompatibility issue between gatk VCF files and the genome fasta file (Homo_sapiens.GRCh38.dna.primary_assembly.fa) downloaded from NCBI?

Note: Validate variants works perfectly fine with 00-All.vcf.gz file which is also downloaded from NCBI and I see millions of processed sites.

ADD REPLY • link 5.6 years ago by Satyajeet Khare ★ 1.6k

0

Entering edit mode

Usually, in case of incompatibility of reference genome you end up with this kind of problems. But this not the case here. But give a try using reference genome from this link for 1000G VCF files.

ADD REPLY • link 5.6 years ago by Tm ★ 1.1k

2

Entering edit mode

The issue is resolved. I made two changes. I don't know which one worked.

I had skipped the local realignment step earlier. That is, I had used the Picard output directly for BQSR. I corrected the pipeline by incorporating that step.
I used genome.fasta and all .vcf files from gatk resource bundle to avoid compatibility issues.

Don't know which one worked. Upvoted your VCF validation comment. That gave the clue!

ADD REPLY • link 5.6 years ago by Satyajeet Khare ★ 1.6k