Why don't annotations respect the VCF field schema?
2
0
6.1 years ago
sacha ★ 2.4k

I wonder why the annotation format from snpEff or VEP doesn't follow the rules of the VCF specification. Currently, annotations are combined into one field that looks like this:

##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID..... "


In a better world, it should look like this:

##INFO=<ID=Allele,Type=String,Description="Alternative Allele"
##INFO=<ID=Annotation,Type=String,Description="Annotation .."
##INFO=<ID=Annotation_Impact,Type=String,Description="The impact in gene"
##INFO=<ID=Gene_Name,Type=String,Description="the HUGO gene name "


I suppose the explanation comes from the varying number of transcripts per gene... But we would just have to duplicate the line for each transcript. As it stands, annotations in a VCF are very hard to parse, because there are no good rules. The VCF annotation specification (http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf) doesn't make it better.
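To make the parsing problem concrete, here is a minimal Python sketch of what reading the combined field involves. It assumes the key names can be recovered from the single-quoted part of the header `Description`, that annotations for different transcripts are comma-separated, and that sub-fields are pipe-separated with no escaped separators inside values. The shortened header line, the helpers `parse_ann_keys` and `parse_ann_value`, and the sample values are made up for illustration and are not real snpEff or VEP output.

```python
import re

# Hypothetical, shortened header line; real ANN headers list many more keys.
HEADER = (
    "##INFO=<ID=ANN,Number=.,Type=String,Description=\"Functional annotations: "
    "'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID'\">"
)


def parse_ann_keys(header_line):
    """Extract the sub-field names from the single-quoted part of the Description."""
    match = re.search(r"'([^']*)'", header_line)
    if match is None:
        raise ValueError("no quoted key list found in header")
    return [key.strip() for key in match.group(1).split("|")]


def parse_ann_value(ann_value, keys):
    """Split one ANN value into a list of dicts, one per annotated transcript."""
    records = []
    for entry in ann_value.split(","):
        values = entry.split("|")
        records.append(dict(zip(keys, values)))
    return records


keys = parse_ann_keys(HEADER)

# Made-up ANN value carrying two transcript-level annotations for one allele.
sample = (
    "A|intron_variant|MODIFIER|GJB2|GENE_ID_1,"
    "A|missense_variant|MODERATE|GJB2|GENE_ID_1"
)
annotations = parse_ann_value(sample, keys)

for record in annotations:
    print(record["Gene_Name"], record["Annotation"], record["Annotation_Impact"])
```

The point of the sketch is that the field names only exist inside a free-text description, so any parser has to agree on a convention that the VCF header grammar itself does not enforce.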

vcf format variant annotation specification • 1.6k views
3
6.1 years ago
EnsemblWill ▴ 560

The authors of VEP (myself) and snpEff collaborated on this format to a large extent. As far as we can tell, it is the best way to cram all of the annotation data provided by the tools into VCF.

Your suggestion might be closer to the original intention of the VCF spec, but it has a couple of flaws:

• it is very verbose. Our ANN format requires the individual annotation keys only once, in the header. Yours would require each key at least once per line
• it won't deal well with multiple alleles, genes, transcripts or any other feature type we annotate. You say you could repeat the line, but how do you map which repeat is which between the different annotations?

Really the issue here is that VCF is not a suitable vehicle for carrying anything more than very basic functional variant annotation. JSON or some similar structured format is much more suited to this task, as is used by VEP's REST endpoints and the GA4GH data exchange spec.
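As a rough illustration of why a structured format suits this data, the sketch below nests transcript-level annotations under their variant and round-trips them through Python's `json` module. All field names and values are invented for the example; this is not the actual schema of VEP's REST output or of the GA4GH specification.

```python
import json

# Hypothetical variant record with nested per-transcript annotations.
variant = {
    "chrom": "13",
    "pos": 12345,
    "ref": "G",
    "alt": ["A"],
    "annotations": [
        {
            "allele": "A",
            "gene": "GJB2",
            "transcript": "TRANSCRIPT_1",
            "consequence": "intron_variant",
            "impact": "MODIFIER",
        },
        {
            "allele": "A",
            "gene": "GJB2",
            "transcript": "TRANSCRIPT_2",
            "consequence": "missense_variant",
            "impact": "MODERATE",
        },
    ],
}

text = json.dumps(variant, indent=2)
restored = json.loads(text)

# Each annotation carries its own allele, gene and transcript, so nothing has
# to be re-matched by position after the round trip.
impact_by_transcript = {
    ann["transcript"]: ann["impact"] for ann in restored["annotations"]
}
print(impact_by_transcript)
```

Because every annotation is a self-describing object, adding a feature type or a second allele only means adding another entry to the list.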

0

The author of snpEff? Congratulations on this great tool. So, may I ask you some questions? First, why are your annotation keys not really keys? :D For example, you have a key named "ERRORS / WARNINGS / INFO". It looks more like a description than a key. It contains forbidden characters such as spaces and "/", so I can't use it as a column name in a SQLite database, for example; I must rename it. Second, the "Variant annotations in VCF format" document doesn't specify any naming rules. It only defines field order and meaning... This is really weird for a specification document. For example, SnpEff and VEP, both following the guideline, can write the same SNP (A | Intron | GJB2) as follows:

For SnpEff:

    Allele | Consequence | SYMBOL ....


For VEP (different keys, different order):

   Allele | Gene_Name | Impact .....


So I am really confused... How can I manage VEP and snpEff annotations in the same database? By the way, I agree that VCF is not suitable. A new specification based on HDF5 or BJSON would be helpful.
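One possible workaround for the naming problem, sketched in Python with the standard `re` and `sqlite3` modules: sanitize each key into a safe column name and map known synonyms to one shared column. The alias table, the `annotation` table layout and the helper names are assumptions made up for this example; neither tool defines such a mapping, and a real alias table would have to be built from the headers you actually encounter.

```python
import re
import sqlite3

# Hypothetical alias table: shared column name -> key names seen in headers.
ALIASES = {
    "allele": {"Allele"},
    "consequence": {"Consequence", "Annotation"},
    "impact": {"IMPACT", "Impact", "Annotation_Impact"},
    "gene_name": {"SYMBOL", "Gene_Name"},
}


def to_column_name(key):
    """Turn an arbitrary annotation key into a safe SQL identifier."""
    name = re.sub(r"[^0-9A-Za-z]+", "_", key).strip("_").lower()
    if not name or name[0].isdigit():
        name = "f_" + name
    return name


def canonical_name(key):
    """Map a key to a shared column, falling back to a sanitized name."""
    for canonical, names in ALIASES.items():
        if key in names:
            return canonical
    return to_column_name(key)


def insert_annotation(conn, keys, values):
    """Insert one annotation, using whatever key order its header declared.

    Only keys that resolve to existing table columns are supported here.
    """
    row = {canonical_name(k): v for k, v in zip(keys, values)}
    columns = ", ".join(row)
    placeholders = ", ".join("?" for _ in row)
    conn.execute(
        f"INSERT INTO annotation ({columns}) VALUES ({placeholders})",
        list(row.values()),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotation "
    "(allele TEXT, consequence TEXT, impact TEXT, gene_name TEXT)"
)

# Two made-up key orders, as they might be read from two different headers.
insert_annotation(
    conn, ["Allele", "Consequence", "SYMBOL"], "A|intron_variant|GJB2".split("|")
)
insert_annotation(
    conn, ["Allele", "Gene_Name", "Annotation_Impact"], "A|GJB2|MODIFIER".split("|")
)

rows = conn.execute(
    "SELECT allele, consequence, impact, gene_name FROM annotation"
).fetchall()
print(rows)
print(to_column_name("ERRORS / WARNINGS / INFO"))
```

Reading the key order from each file's header, rather than hard-coding it, is what lets both layouts land in the same table.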

1
6.1 years ago

I think this format is the best they can do. You already mentioned that multiplexing may happen due to different transcripts. If SnpEff just duplicated the line, what would happen if someone later deduplicated the VCF file (bcftools norm -m '+')? With the format you proposed, you would probably not be able to match, for example, the impact to the correct gene/transcript.

0

Thanks for your reply. I don't think it's the best. I would prefer an HDF5 or BJSON format to store variants.

2

To clarify my comment: I think this format is the best they can do given the requirement of storing the annotation in the VCF without breaking the VCF specification. I assumed this requirement from your question...