Question: VCF - what are overlapping variants?
syntax (Boston) wrote 5 months ago:

As referenced in the pvcf documentation… what are overlapping variants? https://www.biorxiv.org/content/biorxiv/early/2018/06/11/343970.full.pdf

Hybrid allelic representation. To facilitate downstream summary statistics without doublecounting, ideal unified sites would be completely nonoverlapping, with mutually-exclusive alleles.

I understand SNVs, insertions, deletions, copy number… but what does an overlap look like? I can’t find any descriptive information aside from mentions that “variants can overlap.”

Is this when a single mutation is represented multiple ways by different ordering of alleles?

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 5 months ago:

Here are two variants from gnomAD:

1   13459   rs1315857414    CAGA    C
1   13462   rs1441058751    AAGT    A

As you can see, the first variant, CAGA at 13459, spans positions 13459-13462 and so overlaps with position 13462, where the second variant starts.
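
The overlap described above can be checked mechanically. This is only a minimal sketch: it assumes the fourth column is the REF allele (as in standard VCF) and that positions are 1-based; the helper names `ref_span` and `spans_overlap` are made up for illustration and do not come from any library.

```python
def ref_span(pos, ref):
    """Inclusive, 1-based reference interval covered by a REF allele."""
    return pos, pos + len(ref) - 1


def spans_overlap(a, b):
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


# The two gnomAD records quoted above (assuming column 4 is REF).
first = ref_span(13459, "CAGA")
second = ref_span(13462, "AAGT")

print(first, second, spans_overlap(first, second))
```

Note that this only detects overlap at the level of the REF strings; it says nothing about whether the same bases are actually changed by both records.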


Thanks Pierre. Please allow me to clarify as I am not all the way there yet.

Guessing your 4th column is ALT and the 5th column is REF.

CAGA represents a 4-nucleotide insertion? And the last A in that insertion falls at position 13462... where it is being counted a second time as a SNP?

Looked at gnomAD and dbSNP, but couldn't make much sense of it.

Reply written 5 months ago by syntax

In VCF, column 4 is REF and column 5 contains the ALT alleles. Pierre's example doesn't contain the other columns that would be needed to make it valid VCF, but I would assume that is what he is intending.

In that case the first variant (CAGA to C) would represent a three-base deletion (the AGA bases are not present in the ALT), and the second (AAGT to A) also represents a three-base deletion, of AGT. This is a good example of the difficulties that arise when working with VCF. Notice:

  • The "overlapping" A base at 13462 is not actually altered by the second variant, so semantically there isn't really an overlap. The A base shared by the REF and ALT alleles is only there because the VCF spec says that neither REF nor ALT can be empty; it's commonly called a padding or anchor base.
  • In the human reference, the base at 13466 is an A, so the second variant could equivalently be represented as 1 13463 rs1441058751 AGTA A, so there isn't even any syntactic overlap if that representation were used.
  • If these two variants occurred in the same haplotype, you could also represent the variation as (for example) 1 13459 rs1315857414 CAGAAGT C, since the two deletions together remove all six bases after the C. Conversely, you could represent each of your variants as separate single-base deletions.
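
One way to convince yourself of these equivalences is to replay each representation against the reference and compare the resulting sequences. The sketch below is illustrative only: the 8-base window is an assumption pieced together from the two REF alleles plus the A at 13466 mentioned above, and `trim_anchor` and `haplotype` are hypothetical helpers, not part of any tool. The merged record used here is the one that deletes all six bases (CAGAAGT to C).

```python
WINDOW_START = 13459
WINDOW = "CAGAAGTA"  # assumed reference bases for positions 13459-13466


def trim_anchor(pos, ref, alt):
    """Drop leading bases shared by REF and ALT (the padding/anchor bases)."""
    while ref and alt and ref[0] == alt[0]:
        pos, ref, alt = pos + 1, ref[1:], alt[1:]
    return pos, ref, alt


def haplotype(variants):
    """Apply non-conflicting variants to the window and return the new sequence."""
    seq = WINDOW
    # Apply right to left so earlier coordinates stay valid after each edit.
    for pos, ref, alt in sorted((trim_anchor(*v) for v in variants), reverse=True):
        i = pos - WINDOW_START
        assert seq[i:i + len(ref)] == ref, "REF does not match the window"
        seq = seq[:i] + alt + seq[i + len(ref):]
    return seq


# Second variant alone, in its two equivalent representations.
print(haplotype([(13462, "AAGT", "A")]), haplotype([(13463, "AGTA", "A")]))

# Both variants on one haplotype versus a single merged six-base deletion.
print(haplotype([(13459, "CAGA", "C"), (13462, "AAGT", "A")]),
      haplotype([(13459, "CAGAAGT", "C")]))
```

Each pair prints the same sequence, which is the sense in which the records are equivalent even though their POS, REF and ALT fields differ. Trimming the anchor base first is also what allows the two original records to be applied together despite their REF strings overlapping at 13462.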

When you are matching and comparing VCF variants you really need to be comparing at the level of the underlying haplotypes rather than at the syntactic level of what is in the REF and ALT fields of the VCF. For this I recommend a tool such as vcfeval from RTG Tools (which I help develop).

Reply written 5 months ago by Len Trigg

Thanks for the explanation, Len. Seems like computer scientists trying to save a few bits. I'm sure I will be running that tool of yours soon enough.

Reply written 5 months ago by syntax