I wonder why the annotation format from snpEff or VEP doesn't follow the rules of the VCF specification. Currently, all annotations are combined into one field, which looks like this:
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID..... "
In a better world, it would look like this:
##INFO=<ID=Allele,Type=String,Description="Alternative Allele"
##INFO=<ID=Annotation,Type=String,Description="Annotation .."
##INFO=<ID=Annotation_Impact,Type=String,Description="The impact in gene"
##INFO=<ID=Gene_Name,Type=String,Description="the HUGO gene name "
I suppose the explanation is that the number of transcripts differs from gene to gene... But we would just have to duplicate the line for each transcript. Currently, annotations in a VCF are very hard to parse, because there are no good rules. The VCF annotation format specification (http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf) doesn't make it any better.
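To make the "one line per transcript" idea concrete, here is a minimal Python sketch of how the combined field could be expanded. It assumes the sub-field names are listed between single quotes in the header description and separated by "|", and that annotation values contain no raw "," or "|". The header is a shortened stand-in for the real one, and the ANN value and Gene_ID below are made up for illustration.

```python
import re

def ann_field_names(header_line):
    """Extract the sub-field names from an ANN-style ##INFO header line."""
    # Assumption: names sit between single quotes, separated by '|'.
    match = re.search(r"'([^']*)'", header_line)
    if match is None:
        raise ValueError("no quoted field list found in header")
    return [name.strip() for name in match.group(1).split("|")]

def expand_ann(ann_value, names):
    """Yield one dict per annotation entry (entries are comma-separated)."""
    for entry in ann_value.split(","):
        values = entry.split("|")
        yield dict(zip(names, values))

# Shortened, hypothetical header: the real one lists many more fields.
header = (
    '##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: '
    "'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID' \">"
)
names = ann_field_names(header)

# Hypothetical ANN value holding two annotations for the same variant.
ann = "A|intron_variant|MODIFIER|GJB2|GENE1,A|upstream_gene_variant|MODIFIER|GJB2|GENE1"
rows = list(expand_ann(ann, names))
for row in rows:
    print(row)
```

Each dict could then be written out as its own row, which is roughly what duplicating the line per transcript would give us directly.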
The author of snpEff? Congratulations on this great tool. So, I can ask you some questions. First, why are your annotation keys not really keys? :D For example, you have a key named "ERRORS / WARNINGS / INFO". It looks more like a description than a key. It contains forbidden characters such as spaces and "/", so I can't use it as a column name in an SQLite database, for example. I have to rename them. Secondly, the "Variant annotations in VCF format" document doesn't specify any naming rules. It only gives the order and meaning of the fields... This is really weird for a specification document. For example, snpEff and VEP, both following the guideline, can write the same SNP (A | Intron | GJB2) as follows:
For VEP (different keys, different order):
So I am really confused... How can I manage VEP and snpEff annotations in the same database? By the way, I agree that VCF is not suitable. A new specification based on HDF5 or bjson would be helpful.
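For now, the renaming I mentioned can be done roughly as in the sketch below. This is only a workaround under my own assumptions: the helper `to_column_name`, the table name `annotation`, and the inserted values are all hypothetical, and the key list is just a few of the names quoted above.

```python
import re
import sqlite3

def to_column_name(key):
    """Turn an annotation key into a plain identifier usable as a column name."""
    # Collapse every run of non-word characters (spaces, '/', ...) into '_'.
    name = re.sub(r"\W+", "_", key.strip()).strip("_").lower()
    # Avoid empty names and names starting with a digit.
    if not name or name[0].isdigit():
        name = "f_" + name
    return name

keys = ["Allele", "Annotation", "Gene_Name", "ERRORS / WARNINGS / INFO"]
columns = [to_column_name(k) for k in keys]
print(columns)

# In-memory database, just to check that the names are accepted.
conn = sqlite3.connect(":memory:")
column_defs = ", ".join('"{}" TEXT'.format(c) for c in columns)
conn.execute("CREATE TABLE annotation ({})".format(column_defs))
conn.execute(
    "INSERT INTO annotation VALUES (?, ?, ?, ?)",
    ("A", "intron_variant", "GJB2", ""),
)
result = conn.execute("SELECT errors_warnings_info FROM annotation").fetchall()
```

A per-tool rename table (one dict for snpEff, one for VEP) could be layered on top of this so that both tools end up in the same columns, but that mapping has to be maintained by hand as long as the specification does not fix the names.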