Question: Problem with annovar in GATK4 output
Asked 20 months ago by ahmad mousavi (Royan Institute, Tehran, Iran)


Following the GATK4 best-practices pipeline, I made a VCF file from the WES data of 4 people. When I try to annotate it with ANNOVAR, it cannot annotate all of the variants: nearly 70% of them are discarded and written to invalid_input.

I thought it might be due to the VCF version (4.2), but it doesn't work with ANNOVAR's default input format (avinput) either.

What is your suggestion for annotating GATK4 output VCFs?

annovar gatk4 exome • 959 views

Please paste some of the variants that are stored in invalid_input

A better approach may be to first split your variants into 4 different avinput files, and then annotate these:

perl annovar/ -format vcf4 -withzyg --allsample -outfile WES.ann WES.vcf ;
written 20 months ago by Kevin Blighe

Thanks for the reply. The problem still exists. I have 5 samples (output from HaplotypeCaller, merged with bcftools), but after running your command each sample is written to a separate file, and it is hard to track a SNP across all 5 samples.

Can I annotate all 5 samples individually and then merge the annotated results into one file?

   perl in.vcf -format vcf4 > output

It gives the following information; I lost almost 50,000 SNPs and indels:

NOTICE: Finished reading 159791 lines from VCF file
NOTICE: A total of 156393 locus in VCF file passed QC threshold, representing 141154 SNPs (99256 transitions and 41898 transversions) and 17068 indels/substitutions
NOTICE: Finished writing 93786 SNP genotypes (66143 transitions and 27643 transversions) and 10272 indels/substitutions for 1 sample (but input contains 5 samples)

modified 18 months ago • written 18 months ago by ahmad mousavi

Without seeing your VCF, I cannot understand entirely what is going on. The only filter that could result in substantial loss of variants is the --snpqual <float> filter passed to, with an initial value set to 20.

Other possibilities to consider:

  • you have many variants called on contigs outside the main autosomes and sex chromosomes
  • you have many entries in your VCF that are 0/0 (i.e., ref calls) but that are still recorded
  • you have a very high proportion of multi-allelic sites
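
One way to check these possibilities is to count them directly in the merged VCF. The following is only a minimal sketch in plain Python, not part of ANNOVAR or GATK: the function name summarize_vcf and the counter labels are made up for illustration, and it assumes an uncompressed, tab-delimited multi-sample VCF with a GT field. Since the log above reports writing genotypes "for 1 sample", the per-sample count of records without an alternate allele may be the most informative number.

```python
import re
from collections import Counter

# Main autosomes and sex chromosomes; anything else is counted as "non-canonical".
CANONICAL = {str(i) for i in range(1, 23)} | {"X", "Y"}


def summarize_vcf(lines):
    """Count record types that could explain variants going missing.

    `lines` is any iterable of VCF text lines (e.g. an open file handle).
    Returns a Counter with hypothetical labels chosen for this sketch.
    """
    counts = Counter()
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, alt, fmt = fields[0], fields[4], fields[8]
        counts["records"] += 1
        # Strip an optional "chr" prefix before comparing contig names.
        name = chrom[3:] if chrom.startswith("chr") else chrom
        if name not in CANONICAL:
            counts["non_canonical_contig"] += 1
        if "," in alt:
            counts["multi_allelic"] += 1
        keys = fmt.split(":")
        if "GT" not in keys:
            continue  # no genotype information for this record
        gt_index = keys.index("GT")
        for sample, column in zip(samples, fields[9:]):
            genotype = column.split(":")[gt_index]
            alleles = re.split(r"[/|]", genotype)
            # Hom-ref (0/0) or missing (./.) means no alternate allele here.
            if all(a in ("0", ".") for a in alleles):
                counts["no_alt:" + sample] += 1
    return counts
```

For example, calling summarize_vcf on an open handle for the merged VCF and printing the result shows how many records each sample would contribute on its own.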

You could indeed process each sample independently with the --allsample flag, and then merge it all back together. You should keep track of which variants were called in which individual, though. When the number of samples is low, I believe that doing it this way is fine.
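
The merge step could look roughly like the sketch below. It is an illustration under assumptions, not an ANNOVAR feature: it assumes each per-sample annotation table is tab-delimited with a header containing the columns Chr, Start, End, Ref and Alt, and the function name merge_annotations and the added "Samples" field are hypothetical.

```python
import csv

# Columns assumed to identify a variant in each per-sample table.
KEY_COLUMNS = ("Chr", "Start", "End", "Ref", "Alt")


def merge_annotations(per_sample_tables):
    """Merge per-sample annotation tables into one record per variant.

    `per_sample_tables` maps a sample name to an iterable of tab-delimited
    lines (for example an open file handle) whose first line is a header.
    Each merged record keeps the annotation columns from the first table
    in which the variant was seen, plus a "Samples" list naming every
    sample that carries it.
    """
    merged = {}
    for sample, lines in per_sample_tables.items():
        for row in csv.DictReader(lines, delimiter="\t"):
            key = tuple(row[column] for column in KEY_COLUMNS)
            if key not in merged:
                merged[key] = dict(row)
                merged[key]["Samples"] = []
            merged[key]["Samples"].append(sample)
    return merged
```

Keying on position plus alleles keeps each variant on a single row, so a SNP can be followed across all samples through its "Samples" list. Zygosity is not carried over in this sketch and would need an extra column if it matters.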

written 18 months ago by Kevin Blighe