Using VCFTools (v0.1.13) with ExAC (release 0.3.1)
1
3
Entering edit mode
5.6 years ago

I'm trying to extract allele frequency information for all variants in the latest ExAC data release. I'm currently using vcftools version 0.1.13, and am coming into errors that relate, I think, to versions.

When I validate the ExAC data VCF file, I get an error message for every INFO line in the file, that says something like the below:

>> vcf-validator ExAC.r0.3.1.sites.vep.vcf
INFO field at 1:13404 .. INFO tag [AC_Het=0,1,0] expected different number of values (expected 2, found 3)
FILTER field at 1:13418 ..  The filter(s) [AC_Adj0_Filter] not listed in the header.


This suggests, I believe, that the ExAC data VCF file is not in the format vcftools wants it to be, which is v4.0, v4.1 or v4.2. If this is true, how do I make the ExAC data VCF file vcftools- compatible?

The real error I'm trying to overcome comes from when I try to extract the frequency information:

>> vcftools --gzvcf ExAC.r0.3.1.sites.vep.vcf.gz --freq --chr 1 --out chr1_freq

VCFtools - v0.1.13
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
--gzvcf ExAC.r0.3.1.sites.vep.vcf.gz
--chr 1
--freq
--out chr1_freq

Using zlib version: 1.2.8
After filtering, kept 0 out of 0 Individuals
Error: Require Genotypes in VCF file in order to output Frequency Statistics.


Can anyone help me out?

ExAC vcftools format • 3.3k views
3
Entering edit mode
5.6 years ago

Hum... I would say it's a bug in Exac:

##INFO=<ID=AC_Het,Number=A,Type=Integer,Description="Adjusted Heterozygous Counts">
(...)
1   13404   .   G   A,T 136.58  VQSRTrancheSNP99.60to99.80  (...)AC_Het=0,1,0;


as far as I know the VCF format, there should have only two numbers for AC_Het for each ALT allele.

3
Entering edit mode

Apologies - this is indeed a bug in ExAC. AC_Het should be Number=. and unfortunately this is the best way to do this at the moment, since its a semi-complicated field: in reality, it's something like Number="G minus R" but since something like that does not exist in the spec, we'll have to settle for '.' and one will have to parse it manually.

We will fix the VCF with this and other minor header issues shortly.

0
Entering edit mode

I just found the version number of ExAC, and it is actually v4.2! This cannot be a version issue, now. Can you explain how it's possible for ExAC to have a bug? For example, why did you paste the line above? Sorry if I'm missing something obvious!

0
Entering edit mode

how it's possible for ExAC to have a bug?

A: What is the reason for most software errors in Bioinformatics according to you?

For example, why did you paste the line above?

• because vcftools says there is an error for "INFO field at 1:13404 "
• because there are two ALT alleles and the definition of AC_HEt is 'number=A', that is to say, one value for each ALT allele. I've written an issue: https://github.com/konradjk/exac_browser/issues/256
0
Entering edit mode

Thanks for writing an issue! Hopefully we hear from them soon!

0
Entering edit mode

It may be counting the 3 possible heterozygous states: G/A, G/T, and A/T

0
Entering edit mode

@rbagnall yes, but then ID=AC_Het,Number=A shouldn't be used.

0
Entering edit mode