Problem with annovar in GATK4 output
3.8 years ago

Hello

Based on the GATK4 best-practices pipeline, I have made a VCF file from the WES data of 4 people. When I try to annotate it with ANNOVAR, it cannot annotate all of the variants: nearly 70% of them are discarded and written to invalid_input.

I thought it might be due to the VCF version (4.2), but it also does not work with ANNOVAR's default input format (avinput).

What is your suggestion for annotating GATK4 output VCFs?

exome gatk4 annovar • 1.6k views

Please paste some of the variants that are stored in invalid_input.

A better approach may be to first split your variants into 4 different avinput files, and then annotate these:

perl annovar/convert2annovar.pl -format vcf4 -withzyg --allsample -outfile WES.ann WES.vcf ;


Thanks for the reply. The problem still exists. I have 5 samples (HaplotypeCaller output merged with bcftools), but after running your command each sample is written to a separate file, and it is hard to track a SNP across all 5 samples.

Can I annotate all 5 samples individually and then merge the annotated results into one file?

   perl convert2annovar.pl in.vcf -format vcf4 > output


It gives the following information; I lost almost 50,000 SNPs and indels:

NOTICE: Finished reading 159791 lines from VCF file
NOTICE: A total of 156393 locus in VCF file passed QC threshold, representing 141154 SNPs (99256 transitions and 41898 transversions) and 17068 indels/substitutions
NOTICE: Finished writing 93786 SNP genotypes (66143 transitions and 27643 transversions) and 10272 indels/substitutions for 1 sample (but input contains 5 samples)
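
For reference, the size of the gap can be worked out directly from the counts in the log above. The short sketch below only does that arithmetic; the variable names are illustrative and nothing here is ANNOVAR code. Since the last NOTICE line says genotypes were written for 1 sample out of 5, one plausible reading (not confirmed in this thread) is that the missing variants are sites where that one sample carries no alternate allele.

```python
# Counts copied from the convert2annovar.pl NOTICE lines above.
passed_snps, passed_indels = 141154, 17068    # passed the QC threshold
written_snps, written_indels = 93786, 10272   # written for the 1 sample

# Variants that passed QC but were not written to the output.
lost_snps = passed_snps - written_snps
lost_indels = passed_indels - written_indels
lost_total = lost_snps + lost_indels
lost_fraction = lost_total / (passed_snps + passed_indels)

print(f"lost SNPs:   {lost_snps}")
print(f"lost indels: {lost_indels}")
print(f"lost total:  {lost_total} ({lost_fraction:.1%} of QC-passed variants)")
```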


Without seeing your VCF, I cannot fully understand what is going on. The only filter that could cause a substantial loss of variants is the --snpqual <float> filter passed to convert2annovar.pl, which has a default value of 20.

Other possibilities to consider:

• you have many variants called on contigs outside the main autosomes and sex chromosomes
• you have many entries in your VCF that are 0/0 (i.e., ref calls) but that are still recorded
• you have a very high proportion of multi-allelic sites
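
A quick way to check the three possibilities above is to count them in the VCF itself. The following is a minimal sketch in plain Python, assuming an uncompressed, tab-delimited VCF with GT as the first FORMAT field; the function name `summarize_vcf` and the choice of "main" contigs (1-22, X, Y, with or without a `chr` prefix) are assumptions made for illustration.

```python
import re

# Autosomes and sex chromosomes; anything else counts as "off main contig".
MAIN_CONTIGS = {str(i) for i in range(1, 23)} | {"X", "Y"}


def summarize_vcf(lines):
    """Count records matching each of the three suspected causes."""
    counts = {"records": 0, "off_main_contig": 0,
              "multi_allelic": 0, "no_alt_genotype": 0}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, alt = fields[0], fields[4]
        counts["records"] += 1

        # 1) contigs outside the main autosomes and sex chromosomes
        name = chrom[3:] if chrom.startswith("chr") else chrom
        if name not in MAIN_CONTIGS:
            counts["off_main_contig"] += 1

        # 2) multi-allelic sites list several ALT alleles separated by commas
        if "," in alt:
            counts["multi_allelic"] += 1

        # 3) records where no sample carries an ALT allele (0/0 or ./.)
        samples = fields[9:]
        if samples:
            genotypes = [s.split(":")[0] for s in samples]
            alleles = [a for gt in genotypes for a in re.split(r"[/|]", gt)]
            if all(a in ("0", ".") for a in alleles):
                counts["no_alt_genotype"] += 1
    return counts
```

For example, `with open("WES.vcf") as fh: print(summarize_vcf(fh))` would print the four counts for the file used earlier in this thread; a gzipped VCF would need `gzip.open(path, "rt")` instead.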

You could indeed process each sample independently with the --allsample flag, and then merge it all back together. You should keep track of which variants were called in which individual, though. When the number of samples is low, I believe that doing it this way is fine.
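
As a rough sketch of that bookkeeping, the Python below merges per-sample avinput content and records which samples carry each variant. It assumes the usual avinput layout, where the first five whitespace-separated columns are chromosome, start, end, ref and alt; the function name `merge_avinput` and the file names in the usage note are hypothetical.

```python
from collections import defaultdict


def merge_avinput(sample_to_lines):
    """Map each variant (chr, start, end, ref, alt) to the samples carrying it.

    sample_to_lines: dict of sample name -> iterable of avinput lines
    (for example, an open file handle).
    """
    carriers = defaultdict(set)
    for sample, lines in sample_to_lines.items():
        for line in lines:
            if line.startswith("#"):
                continue  # skip comment lines, if any
            fields = line.split()
            if len(fields) < 5:
                continue  # skip blank or truncated lines
            carriers[tuple(fields[:5])].add(sample)
    # Sorted lists make the result stable and easy to print.
    return {variant: sorted(names) for variant, names in carriers.items()}
```

Usage would look something like `merge_avinput({"S1": open("WES.ann.S1.avinput"), "S2": open("WES.ann.S2.avinput")})`, with the real per-sample file names substituted. Each key in the result can then be written once to a combined avinput file for annotation, with the sample list kept in an extra column.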