4.1 years ago by
There are a few ways to skin this cat, and it is also an area with fairly active development. The central difficulty is that there often are multiple ways to represent the same variant in VCF, particularly in cases where block substitutions or indels are involved, and there is no "right" representation.
The decomposition/normalization approach has the downside that it tends to destroy a lot of the good information contained in the original call set (e.g. phasing information, INFO/FORMAT annotations, quality scores). In addition, even after decomposition the results can be arbitrary (and so may not match up with the coordinates you are getting your IDs from anyway, defeating the purpose).
An alternative approach is to have smarter comparison tools which are directly aware of representational ambiguity, by performing variant comparison at the haplotype level. AFAIK CGI calldiff and RTG vcfeval were independently the first to implement this strategy, and newer tools are finally catching on, in varying stages of development (SMaSH, vgraph, hap.py). These tools replay the variants from the VCF into the reference and determine whether variants match by whether the resulting haplotypes match. With vcfeval the full VCF annotation information is preserved during the comparison (not so with hap.py; vgraph doesn't currently output VCF, and I haven't used calldiff or SMaSH).
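To make the "replay into the reference" idea concrete, here is a toy sketch in Python. It is not how vcfeval or any of the other tools is implemented (those deal with diploid genotypes, partial matches, and searching over variant subsets). It only handles a single haplotype with non-overlapping variants, and the names `apply_variants` and `same_haplotype`, along with the tiny reference string, are made up for illustration.

```python
def apply_variants(reference, variants):
    """Replay (pos, ref, alt) variants into a reference string.

    pos is 1-based, as in VCF. Variants must not overlap.
    """
    pieces = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError(f"overlapping variant at position {pos}")
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(reference[cursor:start])  # unchanged bases before the variant
        pieces.append(alt)                      # the substituted allele
        cursor = start + len(ref)
    pieces.append(reference[cursor:])           # unchanged tail
    return "".join(pieces)


def same_haplotype(reference, calls_a, calls_b):
    """Two call sets match if replaying them gives the same sequence."""
    return apply_variants(reference, calls_a) == apply_variants(reference, calls_b)


reference = "GCACACAT"

# Deleting one CA unit from the repeat, written at two different positions.
deletion_left = [(1, "GCA", "G")]
deletion_right = [(5, "ACA", "A")]

# A block substitution versus the same change written as two SNVs.
block_sub = [(2, "CA", "TG")]
two_snvs = [(2, "C", "T"), (3, "A", "G")]

print(same_haplotype(reference, deletion_left, deletion_right))
print(same_haplotype(reference, block_sub, two_snvs))
print(same_haplotype(reference, block_sub, [(2, "C", "T")]))
```

The first two comparisons match even though the records have different POS/REF/ALT values, which is exactly the case where comparing records field by field would report a false difference. The third one differs because only half of the substitution is present.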
In particular, the haplotype comparison tools are the current state of the art for same-sample call-set comparison (either between callers, or comparing with a benchmark set) -- certainly in the case of vcfeval this was the motivating driver in the development. The decomposition/normalization approach is more useful if you want to establish a population-level database where variants are converted to a "canonical" form with limited annotation requirements. Of course there is nothing to say you cannot use both techniques, depending on what you are trying to achieve.