Question

Why Does VCF Aggregate Indels Like This?

12.4 years ago

Jeremy Leipzig 22k

From:

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

SNPs and Small Indels

For example, suppose we are looking at a locus in the genome:

Ref: a t C g a // C is the reference base
   : a t G g a // C base is a G in some individuals
   : a t - g a // C base is deleted w.r.t. the reference
   : a t CAg a // A base is inserted w.r.t. the reference sequence

In the above cases, what are the alleles and how would they be represented as a VCF record?

First is a SNP polymorphism of C/G → { C , G } → C is the reference allele

20     3 .         C      G       .   PASS  DP=100

Second, 1 base deletion of C → { tC , t } → tC is the reference allele

20     2 .         TC      T      .   PASS  DP=100

Third, 1 base insertion of A → { tC ; tCA } → tC is the reference allele

20     2 .         TC      TCA    .   PASS  DP=100

OK, so if this insertion were observed on its own, tC->tCA would be written as C->CA at position 3. Right?

20     3 .         C       CA    .   PASS  DP=100
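The equivalence between the two insertion records can be checked with a small sketch that drops leading bases shared by REF and ALT while keeping one anchor base in each. The function name `trim_shared_prefix` is hypothetical, and this is only a prefix trim, not the full normalization that VCF tools perform:

```python
def trim_shared_prefix(pos, ref, alt):
    """Drop leading bases shared by REF and ALT, keeping at least one base in each.

    Hypothetical helper for illustration only; it does not left-align
    or right-trim alleles.
    """
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1  # the record now starts one base further right
    return pos, ref, alt


print(trim_shared_prefix(2, "TC", "TCA"))  # insertion: (3, 'C', 'CA')
print(trim_shared_prefix(2, "TC", "T"))    # deletion: (2, 'TC', 'T'), already minimal
```

The deletion cannot be trimmed because its ALT is already a single base, which is why it stays anchored at position 2.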

It seems that if position 3 (the C) were not also deleted in this file, it could serve as the reference base for the insertion; because it is deleted, the insertion has to be aggregated with the deletion and anchored at position 2.

This is not in the spec, but I assume this is the correct aggregation:

20     2 .         TC      TG,T,TCA    .   PASS  DP=100
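One way to arrive at that record is sketched below. It assumes the rule is "extend every allele to the span covered by all overlapping records", which is my reading of the examples and not something the spec states. The function name `merge_overlapping` and the tuple layout are hypothetical:

```python
def merge_overlapping(records):
    """Merge overlapping (pos, ref, alt) records into one multi-allelic record.

    Assumption: the records overlap or abut, so their REF strings cover the
    whole span without a gap. Otherwise a reference sequence would be needed.
    """
    start = min(pos for pos, ref, _ in records)
    end = max(pos + len(ref) for pos, ref, _ in records)  # exclusive end

    # Rebuild the reference over the span from the REF strings themselves.
    span = [None] * (end - start)
    for pos, ref, _ in records:
        for i, base in enumerate(ref):
            idx = pos - start + i
            if span[idx] not in (None, base):
                raise ValueError("conflicting reference bases")
            span[idx] = base
    if None in span:
        raise ValueError("records leave a gap in the reference span")
    merged_ref = "".join(span)

    # Pad each ALT with the reference bases it does not cover.
    alts = []
    for pos, ref, alt in records:
        left = merged_ref[: pos - start]
        right = merged_ref[pos - start + len(ref):]
        padded = left + alt + right
        if padded not in alts:
            alts.append(padded)
    return start, merged_ref, alts


records = [(3, "C", "G"), (2, "TC", "T"), (3, "C", "CA")]
print(merge_overlapping(records))  # (2, 'TC', ['TG', 'T', 'TCA'])
```

Here the insertion is given in its independent form (C->CA at position 3), and padding it with the preceding T yields the same TCA allele as in the aggregated record.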

What is the rule that is being applied here? Is there a spec that describes this more precisely? Is there a name for this strategy (other than VCF)?


I see what you mean. They could have aggregated the example as TC => TG,T,TCA. Maybe they meant to show three independent entries before aggregation and simply slipped on the last one, TC>TCA?

Reply by Damian Kao, 12.4 years ago

The spec is a bit incomplete here. I guess my issue is that there are good examples, but the rules governing the aggregation are vague.

Reply by Jeremy Leipzig, 12.4 years ago

You may want to search the archives of the vcftools-spec mailing list or, if that fails, ask the group. They field questions like this all the time: https://lists.sourceforge.net/lists/listinfo/vcftools-spec

Reply by Chris Miller, 12.4 years ago (updated 4.6 years ago by Ram)

score 3 · Answer 1 · 2012-02-22

This relates to how 1000G handles MNPs and complex variants in VCF. Essentially, if variants are contiguous or overlap, they are aggregated and the whole region is treated as one block, as in the last aggregation example in your question.

Also from the spec:

Note that in VCF records, the molecular equivalence explicitly listed above in the per-base alignment is discarded, so the actual placement of equivalent g isn't retained.

For completeness, VCF records are dynamically typed, so whether a VCF record is a SNP, Indel, Mixed, or Reference site depends on the properties of the alleles in the record.
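To make "dynamically typed" concrete, here is a rough sketch that derives a record's type from its alleles alone. The classification rules and the names `allele_type` and `record_type` are my own assumptions for illustration; real tools may draw the lines differently, and symbolic alleles such as `<DEL>` are not handled:

```python
def allele_type(ref, alt):
    """Classify one ALT allele relative to REF (illustrative rules only)."""
    if alt in (".", ref):
        return "REF"
    if len(ref) != len(alt):
        return "INDEL"
    mismatches = sum(r != a for r, a in zip(ref, alt))
    return "SNP" if mismatches == 1 else "MNP"


def record_type(ref, alts):
    """A record is MIXED when its ALT alleles fall into more than one class."""
    types = {allele_type(ref, a) for a in alts} - {"REF"}
    if not types:
        return "REF"
    return types.pop() if len(types) == 1 else "MIXED"


print(record_type("C", ["G"]))                 # SNP
print(record_type("TC", ["T"]))                # INDEL
print(record_type("TC", ["TG", "T", "TCA"]))   # MIXED
```

Under these rules the aggregated record from the question comes out as MIXED, because TG differs from TC by one substitution while T and TCA change the length.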

score 2 · Answer 2 · 2012-02-22

When you put multiple alleles on one line, you are allowed to have a heterozygous sample with two different alternate alleles. You cannot describe a multiallelic heterozygote if each allele is on its own line.
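A small sketch of why the single line matters: genotype indices in the GT field point into the allele list of that one record (0 is REF, 1 is the first ALT, and so on). The records in the question have no sample columns, so the genotype strings below are hypothetical, as is the function name `genotype_alleles`:

```python
def genotype_alleles(ref, alts, gt):
    """Translate a GT string such as '1/2' into the allele sequences it names."""
    alleles = [ref] + alts              # index 0 is REF, 1..n are the ALTs
    sep = "|" if "|" in gt else "/"     # phased or unphased separator
    return [None if i == "." else alleles[int(i)] for i in gt.split(sep)]


# Hypothetical sample that carries the SNP on one haplotype and the deletion on the other.
print(genotype_alleles("TC", ["TG", "T", "TCA"], "1/2"))  # ['TG', 'T']
print(genotype_alleles("TC", ["TG", "T", "TCA"], "0/3"))  # ['TC', 'TCA']
```

With three separate biallelic lines, each record only has indices 0 and 1, so no single genotype can name both the G allele and the deletion.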

As deniz said, you aggregate multiple alternate alleles when they overlap or interfere with each other. This rule covers most cases, but may still be vague in corner cases.