Quick reality check - I have been normalizing VCFs and annotation files according to this methodology:
Tan A, Abecasis GR, Kang HM. Unified representation of genetic variants.
An example implementation is here: https://github.com/ericminikel/minimal_representation/blob/master/normalize.py
A consequence of this is that all duplications are converted to insertions after normalization.
For example, ref: C, alt: CC would be normalized to ref: A, alt: AC (assuming A is the base preceding the ref position). Likewise, ref: CAC, alt: CACCAC would be normalized to ref: G, alt: GCAC (assuming G is the base preceding the ref position). In both cases the reported position moves one base to the left.
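To make the two examples concrete, here is a minimal sketch of the trim-and-left-align loop as I understand it from the paper. It is not the code from the linked normalize.py. The function name `normalize` and the toy reference strings are made up for illustration, and the reference is a plain in-memory string rather than a FASTA lookup.

```python
def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic variant.

    ref_seq: reference sequence as a plain string (illustrative stand-in
             for a real FASTA lookup)
    pos:     1-based position of the first base of `ref`, as in VCF
    """
    if ref == alt:
        raise ValueError("ref and alt are identical, nothing to normalize")

    # Step 1: drop shared trailing bases. Whenever an allele becomes empty,
    # extend both alleles one base to the left using the reference.
    changed = True
    while changed:
        changed = False
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the first base")
            pos -= 1
            base = ref_seq[pos - 1]  # convert 1-based pos to 0-based index
            ref, alt = base + ref, base + alt
            changed = True

    # Step 2: drop shared leading bases while both alleles keep >= 1 base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt


# Toy references chosen so the preceding base matches the examples above.
print(normalize("TTACGG", 4, "C", "CC"))          # preceding base is A
print(normalize("TTGCACTT", 4, "CAC", "CACCAC"))  # preceding base is G
```

If the preceding base were itself a C (a homopolymer run), the loop would keep shifting left until it reaches the start of the run, so the anchor base is not always the one immediately before the original ref position.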
Does this make sense? Other than the label "insertion" vs. "duplication", should any importance be given, from a biological/clinical point of view, to the fact that these variants were duplications before normalization?