Should you decompose and normalize multi-allelic variants for comparison / ID assignment?
7.4 years ago
William ★ 5.1k

The GEMINI documentation says that you should:

  1. Decompose the original VCF such that variants with multiple alleles are expanded into distinct variant records; one record for each REF/ALT combination.
  2. Normalize the decomposed VCF so that variants are left aligned and represented using the most parsimonious alleles.
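As an illustration of step 2, here is a minimal Python sketch of making a single REF/ALT pair parsimonious. This is only the allele-trimming logic; full normalization (as done by e.g. `vt normalize`) additionally left-aligns indels against the reference sequence, which is omitted here.

```python
def make_parsimonious(pos, ref, alt):
    """Trim a REF/ALT pair to its most parsimonious representation.

    pos is the 1-based VCF position. Left-alignment against the
    reference sequence is not implemented in this sketch.
    """
    # Trim shared trailing bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases, advancing the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# A redundant deletion representation collapses to a minimal one:
print(make_parsimonious(100, "GCACA", "GCA"))  # (100, 'GCA', 'G')
# A SNP buried in padding bases is reduced to a single-base change:
print(make_parsimonious(5, "TAC", "TGC"))      # (6, 'A', 'G')
```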

This sounds like a good thing to do because it makes it easier to assign correct IDs and to compare variants.

The only problem I have seen mentioned is that for samples with an ALT1/ALT2 genotype, the genotype is now split over two VCF records: MISSING/ALT1 and MISSING/ALT2, or even REF/ALT1 and REF/ALT2. Neither correctly represents the genotype of the sample.
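To make the problem concrete, here is a small Python sketch (my own illustration, not GEMINI's or vt's actual code) of how a 1/2 genotype gets recoded when a multi-allelic record is split into one record per ALT, with alleles belonging to the other ALTs set to missing:

```python
def split_genotype(gt, alt_index):
    """Recode a genotype string for the split record carrying ALT number alt_index.

    Alleles matching alt_index become 1, REF stays 0, and alleles
    belonging to the other ALTs become '.' (missing).
    """
    recoded = []
    for allele in gt.split("/"):
        if allele == ".":
            recoded.append(".")
        elif int(allele) == 0:
            recoded.append("0")
        elif int(allele) == alt_index:
            recoded.append("1")
        else:
            recoded.append(".")
    return "/".join(recoded)

# A het-alt sample (ALT1/ALT2) ends up half-missing in both records:
print(split_genotype("1/2", 1))  # 1/.
print(split_genotype("1/2", 2))  # ./1
```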

multi-allelic decomposition normalization
7.4 years ago
Len Trigg ★ 1.6k

There are a few ways to skin this cat, and it is also an area with fairly active development. The central difficulty is that there often are multiple ways to represent the same variant in VCF, particularly in cases where block substitutions or indels are involved, and there is no "right" representation.

The decomposition/normalization approach has the downside that the process tends to destroy a lot of the good information contained in the original call set (e.g. phasing information, INFO/FORMAT annotations, quality scores). In addition, even after decomposition the results can be arbitrary (and so may not match up with the coordinates you are getting your IDs from anyway, defeating the purpose).

An alternative approach is to have smarter comparison tools that are directly aware of representational ambiguity, by performing variant comparison at the haplotype level. AFAIK CGI calldiff and RTG vcfeval were independently the first to implement this strategy, and new tools in varying stages of development (e.g. SMaSH, vgraph) are finally catching on. These tools replay the variants from the VCF into the reference and determine whether variants match by whether the resulting haplotypes match. With vcfeval the full VCF annotation information is preserved during the comparison (vgraph doesn't currently output VCF, and I haven't used calldiff or SMaSH).
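The haplotype-level idea can be sketched in a few lines of Python. This is a toy model, not what vcfeval actually does internally: apply each representation of the variants to the reference and compare the resulting sequences.

```python
def apply_variants(ref, variants):
    """Apply (0-based pos, REF allele, ALT allele) edits to a reference string.

    Variants are assumed to be non-overlapping; they are applied left to right.
    """
    pieces, cursor = [], 0
    for pos, r, a in sorted(variants):
        assert ref[pos:pos + len(r)] == r, "REF allele must match the reference"
        pieces.append(ref[cursor:pos])
        pieces.append(a)
        cursor = pos + len(r)
    pieces.append(ref[cursor:])
    return "".join(pieces)

# Two different VCF representations of the same CA deletion
# produce identical haplotypes, so a haplotype-aware tool calls them a match:
reference = "GCACA"
left_shifted  = [(0, "GCA", "G")]
right_shifted = [(2, "ACA", "A")]
print(apply_variants(reference, left_shifted))   # GCA
print(apply_variants(reference, right_shifted))  # GCA
```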

In particular, the haplotype comparison tools are the current state of the art for same-sample call-set comparison (either between callers, or comparing with a benchmark set) -- certainly in the case of vcfeval this was the motivating driver in the development. The decomposition/normalization approach is more useful if you want to establish a population-level database where variants are converted to a "canonical" form with limited annotation requirements. Of course there is nothing to say you cannot use both techniques, depending on what you are trying to achieve.


Great information. I think with GEMINI one of the reasons this process is the new norm is that it is generally being used for Mendelian disease studies, where losing the true genotype is OK for downstream analysis. If that site happened to be the home of the causative mutation, it would be a special case of a compound het, which should still pop out in the downstream analysis. It also means you capture appropriate annotation for both variants, especially population frequencies.

But for other uses losing that information, or at least having it be somewhat obscured, might not be something you want.

4.3 years ago
dvitsios ▴ 30

I recently struggled to figure out the order of elements within the Genotype Count (GC) field of VCF files (from gnomAD).

I managed to infer the correct order by checking several variant entries and calculating the GCs based on the AC and AN values. I wrote two blog posts about that which may be of interest if you are trying to decompose the GC field when there are multiple (>2) alleles.
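Assuming GC follows the same ordering the VCF specification defines for GL/PL fields (which is what I'd expect, though that is an assumption here), the order can be generated directly: genotype (j, k) with j ≤ k sits at index k*(k+1)/2 + j. A short Python sketch enumerating that order:

```python
def diploid_genotype_order(n_alleles):
    """Enumerate diploid genotypes in VCF GL/PL order.

    The VCF spec places genotype (j, k) with j <= k at index
    k*(k+1)/2 + j; this nested loop reproduces that sequence.
    """
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]

# Bi-allelic site (REF + 1 ALT): hom-ref, het, hom-alt.
print(diploid_genotype_order(2))  # [(0, 0), (0, 1), (1, 1)]
# Multi-allelic site with two ALTs gives six genotype counts:
print(diploid_genotype_order(3))
# [(0, 0), (0, 1), (1, 1), (0, 2), (1, 2), (2, 2)]
```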

Expanding multi-allelic variants in VCF:


