I'm writing a Python parser for the serialised JSON output produced by VEP, but I'm finding that the alleles recorded in the JSON differ from those in the VCF, usually by one base (i.e. they lack the leading reference base). For "complex" variants this is more of a problem: sometimes the allele recorded in the JSON is the same as that in the VCF, and sometimes it is one base shorter. Has anyone come across this before, and is there an explanation for why it happens?
VEP converts unbalanced substitutions (e.g. insertions, deletions) to an internal standard Ensembl representation. This is explained in basic terms here; essentially the leading base is trimmed from the REF and ALT alleles.
However, this only explains what happens in the simple case, i.e. with only one ALT allele. For complex VCF entries (those with more than one ALT allele), VEP will only trim the leading base if the REF and all ALT alleles share the same leading base.
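To reproduce this rule in a parser, something like the following Python sketch may help. It is not VEP's own code: the function name is made up for illustration, and two details are assumptions rather than things stated above, namely that an allele left empty after trimming is written as "-" and that the start position moves on by one when the leading base is removed.

```python
def trim_shared_leading_base(pos, ref, alts):
    """Mimic the default trimming rule described above (illustrative only).

    The leading base is removed only when the entry is unbalanced (at least
    one ALT differs in length from REF) and REF plus every ALT start with
    the same base. Otherwise the alleles are returned unchanged.
    """
    alleles = [ref, *alts]

    # Leave odd input (empty alleles) alone rather than guess.
    if any(len(a) == 0 for a in alleles):
        return pos, ref, list(alts)

    # Balanced substitutions (all alleles the same length) are not trimmed.
    if all(len(a) == len(ref) for a in alts):
        return pos, ref, list(alts)

    # Trim only if every allele shares the same first base.
    if all(a[0] == ref[0] for a in alleles):
        # Assumption: an allele that becomes empty is written as "-".
        trimmed = [a[1:] or "-" for a in alleles]
        # Assumption: the start position shifts by one after trimming.
        return pos + 1, trimmed[0], trimmed[1:]

    return pos, ref, list(alts)


# Simple deletion: the shared leading C is trimmed.
print(trim_shared_leading_base(100, "CA", ["C"]))
# Complex entry: the second ALT starts with T, so nothing is trimmed.
print(trim_shared_leading_base(100, "CAG", ["C", "TAG"]))
```

The second call is the situation from the question: because one ALT does not share the leading base, the alleles in the output keep the same length as in the VCF.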
A useful way to keep track of which allele is which in the output is to use the --allele_number flag.
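As a rough sketch of how a parser might use this, the snippet below maps each consequence back to the allele string in the VCF line. It assumes the JSON reports the number in a field called "allele_num" (1 = first ALT, 2 = second ALT, and so on) inside each "transcript_consequences" entry; please check this against your own output. The record shown is hand-written for illustration, not real VEP output, and the helper name is hypothetical.

```python
def vcf_allele_for(consequence, ref, alts):
    """Return the VCF allele string that a consequence block refers to.

    Assumes an "allele_num" key where 0 means REF and 1..n index the ALT
    alleles in VCF order. Returns None if the key is absent.
    """
    num = consequence.get("allele_num")
    if num is None:
        return None
    alleles = [ref, *alts]
    return alleles[int(num)]


# Hand-written record mimicking the assumed JSON layout (not real output).
record = {
    "transcript_consequences": [
        {"variant_allele": "-", "allele_num": 1},
        {"variant_allele": "AT", "allele_num": 2},
    ]
}

# REF and ALT columns as they appear in the original VCF line.
ref, alts = "CA", ["C", "CAT"]

for csq in record["transcript_consequences"]:
    print(csq["variant_allele"], "->", vcf_allele_for(csq, ref, alts))
```

Keying on the allele number rather than the allele string means the parser does not need to know whether VEP trimmed the alleles or not.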
You may modify VEP's default behaviour to reduce all REF/ALT pairs to their minimal representation using --minimal.
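For comparison, a minimal representation of a single REF/ALT pair can be sketched in Python as below. This is only an illustration of the general idea (strip sequence shared by the two alleles), not VEP's implementation. Trimming the start before the end, writing an empty allele as "-", and the function name itself are assumptions.

```python
def minimal_pair(pos, ref, alt):
    """Reduce one REF/ALT pair by removing sequence the two alleles share.

    Shared bases are stripped from the start first (moving the position
    along), then from the end. The order is an assumption of this sketch.
    """
    # Strip the shared prefix, advancing the start position each time.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    # Strip the shared suffix; the start position is unaffected.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    # Assumption: an empty allele is written as "-".
    return pos, ref or "-", alt or "-"


# Each ALT of a complex entry is reduced against REF independently.
for alt in ["C", "TAG"]:
    print(alt, "->", minimal_pair(100, "CAG", alt))
```

Note that each pair is handled on its own, so two ALT alleles from the same VCF line can end up with different positions and different REF strings.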
Both the --allele_number and --minimal flags may also be used as parameters if you're using the VEP REST API (i.e. add "&allele_number=1&minimal=1" to your URL, or the equivalent parameters to the POST body).
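A small Python sketch of both forms follows. It only builds the URL and the POST body and makes no network request. The server address, the region endpoint path, the "variants" key and the example variant are assumptions for illustration, so check them against the REST documentation before relying on them.

```python
import json
from urllib.parse import urlencode

# Assumed public server address; adjust if you use a different one.
SERVER = "https://rest.ensembl.org"


def vep_region_url(species, region, allele, **options):
    """Build a GET URL for a region query with extra options appended."""
    return f"{SERVER}/vep/{species}/region/{region}/{allele}?{urlencode(options)}"


def vep_post_body(variants, **options):
    """Build a JSON POST body, passing the same options as top-level keys."""
    return json.dumps({"variants": list(variants), **options})


# Placeholder region and allele, used only to show the shape of the URL.
print(vep_region_url("human", "9:22125503-22125502:1", "C",
                     allele_number=1, minimal=1))

# Placeholder VCF-style variant line for the POST form.
print(vep_post_body(["1 100 . CAG C,TAG . . ."], allele_number=1, minimal=1))
```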