VCF - what are overlapping variants?
4.9 years ago
Kermit ▴ 90

As referenced in the pvcf documentation… what are overlapping variants? https://www.biorxiv.org/content/biorxiv/early/2018/06/11/343970.full.pdf

Hybrid allelic representation. To facilitate downstream summary statistics without doublecounting, ideal unified sites would be completely nonoverlapping, with mutually-exclusive alleles.

I understand snv, insertion, deletion, copy number… but what does an overlap look like? I can’t find any descriptive information aside from mentions that “variants can overlap.”

Is this when a single mutation is represented multiple ways by different ordering of alleles?

4.9 years ago

here are two variants from gnomad:

1   13459   rs1315857414    CAGA    C
1   13462   rs1441058751    AAGT    A

As you can see, the REF allele of the first variant (CAGA at 13459) spans positions 13459 to 13462, so it overlaps the second variant, which starts at position 13462.
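A rough way to check this is to compare the reference spans: a record at POS with a REF allele of length n covers POS through POS + n - 1. The sketch below uses only the two records above; the helper names `ref_span` and `spans_overlap` are made up for illustration and are not part of any VCF library.

```python
def ref_span(pos, ref):
    """Return the 1-based inclusive (start, end) range covered by the REF allele."""
    return pos, pos + len(ref) - 1


def spans_overlap(a, b):
    """True if two inclusive (start, end) ranges share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


first = ref_span(13459, "CAGA")   # first gnomAD record above
second = ref_span(13462, "AAGT")  # second gnomAD record above
print(first, second, spans_overlap(first, second))
```

Note that this only detects overlap of the REF fields as written; it says nothing about whether the same bases are actually changed.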


Thanks Pierre. Please allow me to clarify as I am not all the way there yet.

Guessing your 4th column is ALT and the 5th column is REF.

CAGA represents a 4-nucleotide insertion? And the last A in that insertion falls at position 13462... where it is being counted a second time as a SNP?

Looked at gnomad and dbsnp, but couldn't make much sense of it.


In VCF, column 4 is REF and column 5 contains the ALT alleles. Pierre's example doesn't include the other columns that would be needed to make it valid VCF, but I assume that is the layout he intended.

In which case the first variant (CAGA to C) would represent a three-base deletion (the AGA is not present in the ALT), and the second (AAGT to A) also represents a three-base deletion, of AGT. This is a good example that illustrates the difficulties when working with VCF. Notice:

  • The "overlapping" A base at 13462 is not actually altered by the second variant, so semantically there isn't really an overlap. The A base that is common to the REF and ALT alleles is only there because the VCF spec says that neither REF nor ALT can be empty. It's commonly called a padding or anchor base.
  • In the human reference, the base at 13466 is an A, so the second variant could equivalently be represented as 1 13463 rs1441058751 AGTA A, so there isn't even any syntactic overlap if that representation were used.
  • If these two variants occurred in the same haplotype, you could also represent the variation as (for example) 1 13459 rs1315857414 CAGAAGT C, since the two deletions together remove six bases. Conversely, you could represent each of your variants as three separate single-base deletions.
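The equivalence in the second bullet can be checked by applying each record to the reference and comparing the resulting sequences. This is only a sketch: the eight reference bases are an assumption pieced together from the REF alleles quoted in this thread plus the A at 13466, not read from a FASTA, and `apply_variant` is a made-up helper.

```python
# Assumed reference bases for chr1:13459-13466, reconstructed from the REF
# alleles in this thread and the A at 13466; not verified against a FASTA.
REF_START = 13459
REF_SEQ = "CAGAAGTA"


def apply_variant(pos, ref, alt, seq=REF_SEQ, start=REF_START):
    """Return the haplotype obtained by replacing REF with ALT at 1-based POS."""
    i = pos - start
    if seq[i:i + len(ref)] != ref:
        raise ValueError(f"REF {ref!r} does not match the reference at {pos}")
    return seq[:i] + alt + seq[i + len(ref):]


original = apply_variant(13462, "AAGT", "A")  # record as listed by gnomAD
shifted = apply_variant(13463, "AGTA", "A")   # right-shifted representation
print(original, shifted, original == shifted)
```

Both calls yield the same sequence even though the two records share no POS, REF, or ALT value, which is why a purely textual comparison of VCF lines can miss matching variants.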

When you are matching and comparing VCF variants you really need to be comparing at the level of the underlying haplotypes rather than at the syntactic level of what is in the REF and ALT fields of the VCF. For this I recommend a tool such as vcfeval from RTG Tools (which I help develop).


Thanks for the explanation, Len. Seems like computer scientists trying to save a few bits. I'm sure I will be running that tool of yours soon enough.
