Why Does VCF Aggregate Indels Like This?
12.4 years ago

From:

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41


SNPs and Small Indels

For example, suppose we are looking at a locus in the genome:

Ref: a t C g a // C is the reference base
   : a t G g a // C base is a G in some individuals
   : a t - g a // C base is deleted w.r.t. the reference
   : a t CAg a // A base is inserted w.r.t. the reference sequence

In the above cases, what are the alleles and how would they be represented as a VCF record?

First is a SNP polymorphism of C/G → { C , G } → C is the reference allele

20     3 .         C      G       .   PASS  DP=100

Second, 1 base deletion of C → { tC , t } → tC is the reference allele

20     2 .         TC      T      .   PASS  DP=100

Third, 1 base insertion of A → { tC ; tCA } → tC is the reference allele

20     2 .         TC      TCA    .   PASS  DP=100


OK, if witnessed independently, this tC->tCA would be C->CA, right?

20     3 .         C       CA    .   PASS  DP=100

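It seems that if the C at position 3 had not also been observed as deleted in this file, it could serve as the reference (anchor) base for the insertion. Because it was deleted in some samples, the insertion has to be anchored at position 2 and aggregated with the other alleles.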
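The relationship between the two forms of the insertion can be sketched in Python. This is only an illustration, not something taken from the spec: `trim_shared_prefix` is a hypothetical helper that drops leading bases shared by every allele, as long as each allele keeps at least one base.

```python
def trim_shared_prefix(pos, ref, alts):
    """Drop leading bases common to all alleles, keeping >= 1 base in each.

    Hypothetical helper for illustration; positions are 1-based as in VCF.
    """
    alleles = [ref] + list(alts)
    while all(len(a) > 1 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles[0], alleles[1:]


# The insertion on its own can be anchored on the C at position 3.
print(trim_shared_prefix(2, "TC", ["TCA"]))             # (3, 'C', ['CA'])
# Once the deletion allele T shares the record, nothing can be trimmed.
print(trim_shared_prefix(2, "TC", ["TG", "T", "TCA"]))  # (2, 'TC', ['TG', 'T', 'TCA'])
```

The second call shows why the aggregated record stays at position 2: the deletion allele is already a single base, so the shared T cannot be removed.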

This is not in the spec, but I assume this is the correct aggregation:

20     2 .         TC      TG,T,TCA    .   PASS  DP=100

What is the rule that is being applied here? Is there a spec that describes this more precisely? Is there a name for this strategy (other than VCF)?


I see what you mean. They could have aggregated the example as TC => TG,T,TCA. Maybe they meant to show three independent entries before aggregation and just slipped up on that last TC => TCA?


The spec is a bit incomplete here. My issue is that it gives good examples, but the rules governing the aggregation are vague.


You may want to search the archives of the vcftools-spec mailing list or, if that fails, ask the group. They field questions like this all the time: https://lists.sourceforge.net/lists/listinfo/vcftools-spec

12.2 years ago
Deniz ▴ 140

This relates to how 1000G handles MNPs and complex variants in VCF: essentially, if variants are contiguous or overlap, they are aggregated and the whole thing is treated as one block, as in the last aggregation example in your question.
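As a rough sketch of that rule, the Python below merges overlapping records by extending every allele to the same reference span. The function name `aggregate` and the `(pos, ref, alt)` tuple layout are assumptions made for illustration; this is not code from the spec or from any VCF tool.

```python
def aggregate(records):
    """Merge overlapping (pos, ref, alt) records into one (pos, ref, alts).

    Hypothetical helper: assumes 1-based positions, a single chromosome, and
    that the REF alleles together cover the whole merged span.
    """
    start = min(pos for pos, _, _ in records)
    end = max(pos + len(ref) for pos, ref, _ in records)  # exclusive

    # Rebuild the reference span from the REF alleles of the inputs.
    span = [None] * (end - start)
    for pos, ref, _ in records:
        for offset, base in enumerate(ref):
            index = pos - start + offset
            if span[index] not in (None, base):
                raise ValueError("conflicting reference bases")
            span[index] = base
    if None in span:
        raise ValueError("records do not cover the merged span")
    merged_ref = "".join(span)

    # Pad each ALT with the reference bases it did not originally include.
    alts = []
    for pos, ref, alt in records:
        left = merged_ref[: pos - start]
        right = merged_ref[pos - start + len(ref):]
        padded = left + alt + right
        if padded not in alts:
            alts.append(padded)
    return start, merged_ref, alts


records = [(3, "C", "G"), (2, "TC", "T"), (2, "TC", "TCA")]
print(aggregate(records))  # (2, 'TC', ['TG', 'T', 'TCA'])
```

The SNP at position 3 picks up the preceding T and becomes TG, which reproduces the aggregated record from the question.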

Also from the spec:

Note that in VCF records, the molecular equivalence explicitly listed above in the per-base alignment is discarded, so the actual placement of equivalent g isn't retained.

For completeness, VCF records are dynamically typed, so whether a VCF record is a SNP, Indel, Mixed, or Reference site depends on the properties of the alleles in the record.
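To make "dynamically typed" concrete, here is a minimal Python sketch that derives a record's type from its alleles alone. The `classify` function and its length-based rule are simplifying assumptions for illustration: any ALT with the same length as REF is called a SNP, which glosses over multi-base substitutions.

```python
def classify(ref, alts):
    """Label a record from its alleles only (simplified, hypothetical rule)."""
    alts = [a for a in alts if a != "."]  # "." means no alternate allele
    if not alts:
        return "REFERENCE"
    # Same length as REF: substitution; different length: insertion/deletion.
    kinds = {"SNP" if len(a) == len(ref) else "INDEL" for a in alts}
    return kinds.pop() if len(kinds) == 1 else "MIXED"


print(classify("C", ["G"]))                # SNP
print(classify("TC", ["T"]))               # INDEL
print(classify("TC", ["TG", "T", "TCA"]))  # MIXED
print(classify("C", ["."]))                # REFERENCE
```

Under this rule the aggregated record from the question comes out as a mixed site, because it carries a substitution and two indel alleles.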

12.2 years ago
lh3 33k

When you put multiple alleles on one line, you can describe a heterozygous sample that carries two different alternate alleles. You cannot describe such a multiallelic heterozygote if each allele is on its own line.
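A small Python sketch of why this works: genotype indices in the GT field point into the allele list of a single record (0 is REF, 1..n are the ALT alleles in order). The helper name `decode_genotype` is made up for this example.

```python
import re


def decode_genotype(ref, alts, gt):
    """Translate a GT string such as '1/3' into allele sequences.

    Hypothetical helper; '.' (missing call) is returned as None.
    """
    alleles = [ref] + list(alts)
    return [None if i == "." else alleles[int(i)] for i in re.split(r"[/|]", gt)]


# One aggregated record: a sample can carry the SNP and the insertion together.
print(decode_genotype("TC", ["TG", "T", "TCA"], "1/3"))  # ['TG', 'TCA']
```

With one record per allele, each line only offers indices 0 and 1, so a single genotype that names both TG and TCA cannot be written.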

As Deniz said, you aggregate multiple alternate alleles when they overlap or interfere with each other. This rule covers most cases, but it can still be ambiguous in corner cases.
