Question: Why Does VCF Aggregate Indels Like This?
4
7.5 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

From:

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41


SNPs and Small Indels

For example, suppose we are looking at a locus in the genome:

Ref: a t C g a // C is the reference base
   : a t G g a // C base is a G in some individuals
   : a t - g a // C base is deleted w.r.t. the reference
   : a t CAg a // A base is inserted w.r.t. the reference sequence

In the above cases, what are the alleles and how would they be represented as a VCF record?

First is a SNP polymorphism of C/G → { C , G } → C is the reference allele

20     3 .         C      G       .   PASS  DP=100

Second, 1 base deletion of C → { tC , t } → tC is the reference allele

20     2 .         TC      T      .   PASS  DP=100

Third, 1 base insertion of A → { tC ; tCA } → tC is the reference allele

20     2 .         TC      TCA    .   PASS  DP=100


OK, so if it were witnessed independently, this TC->TCA would be written as C->CA, right?

20     3 .         C       CA    .   PASS  DP=100

It seems that if the C at position 3 had not also been observed as deleted in this file, it could have served as the reference base for the insertion; since it was, the insertion has to be aggregated with the deletion.

This is not in the spec, but I assume this is the correct aggregation:

20     2 .         TC      TG,T,TCA    .   PASS  DP=100
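The aggregation above can be sketched in Python. This is only an illustration of the "extend every overlapping record to a common REF span" idea, not something taken from the spec: the function name `merge_overlapping`, the tuple layout `(pos, ref, alt)` and the uppercase reference string `ATCGA` are all assumptions made for the example.

```python
def merge_overlapping(refseq, records):
    """Merge overlapping (pos, ref, alt) records into one multi-allelic record.

    refseq  -- reference sequence of the locus; its first base is position 1
    records -- list of (pos, ref, alt) tuples with 1-based positions
    Hypothetical helper for illustration only.
    """
    # The merged record spans from the leftmost start to the rightmost end.
    start = min(pos for pos, _, _ in records)
    end = max(pos + len(ref) - 1 for pos, ref, _ in records)
    merged_ref = refseq[start - 1:end]

    alts = []
    for pos, ref, alt in records:
        # Pad each ALT with the reference bases it lacks on either side.
        prefix = refseq[start - 1:pos - 1]
        suffix = refseq[pos - 1 + len(ref):end]
        alts.append(prefix + alt + suffix)
    return start, merged_ref, alts


locus = "ATCGA"  # a t C g a, with C at position 3
records = [(3, "C", "G"), (2, "TC", "T"), (2, "TC", "TCA")]
print(merge_overlapping(locus, records))
```

Under these assumptions, the SNP at position 3 is padded with the preceding T and becomes TG, so all three alleles share the REF allele TC at position 2. Writing the insertion independently as `(3, "C", "CA")` produces the same padded allele, TCA.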

What is the rule that is being applied here? Is there a spec that describes this more precisely? Is there a name for this strategy (other than VCF)?

vcf • 2.3k views
modified 5.8 years ago by Biostar ♦♦ 20 • written 7.5 years ago by Jeremy Leipzig18k

I see what you mean. They could have aggregated the example as TC=>TG,T,TCA. Maybe they meant to show three independent entries before aggregation and just slipped up on that last TC>TCA?

written 7.5 years ago by Damian Kao15k

The spec is a bit incomplete here. I guess my issue is that there are good examples, but the rules governing the aggregation are vague.

written 7.5 years ago by Jeremy Leipzig18k

You may want to search the archives of the vcftools-spec mailing list or, if that fails, ask the group. They field questions like this all the time: https://lists.sourceforge.net/lists/listinfo/vcftools-spec

written 7.5 years ago by Chris Miller20k
3
7.3 years ago by
Deniz140
Cambridge
Deniz140 wrote:

This relates to how 1000G handles MNPs and complex variants in VCF: essentially, if variants are contiguous or overlap, they are aggregated and the whole thing is treated as one block, as in the last aggregation example in your question.

Also from the spec:

Note that in VCF records, the molecular equivalence explicitly listed above in the per-base alignment is discarded, so the actual placement of equivalent g isn't retained.

For completeness, VCF records are dynamically typed, so whether a VCF record is a SNP, Indel, Mixed, or Reference site depends on the properties of the alleles in the record.
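A rough Python sketch of what "dynamically typed" could mean in practice: the record type is not stored anywhere, it is derived from the REF and ALT strings. The function name `classify` and the length-based rules below are simplifying assumptions for illustration (same-length substitutions, including MNPs, are all labelled "snp" here); they are not the spec's formal definition.

```python
def classify(ref, alts):
    """Guess a record type from allele lengths alone (hypothetical helper)."""
    # A lone "." in the ALT column means no alternate allele was called.
    alts = [a for a in alts if a != "."]
    if not alts:
        return "reference"
    has_substitution = any(len(a) == len(ref) for a in alts)
    has_indel = any(len(a) != len(ref) for a in alts)
    if has_substitution and has_indel:
        return "mixed"
    return "indel" if has_indel else "snp"


print(classify("C", ["G"]))                 # substitution only
print(classify("TC", ["T"]))                # deletion only
print(classify("TC", ["TG", "T", "TCA"]))   # aggregated record from the question
```

With these rules, the aggregated record TC => TG,T,TCA comes out as "mixed", because it carries a substitution and two indels on one line.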

2
7.3 years ago by
lh331k
United States
lh331k wrote:

When you put multiple alleles on one line, you are allowed to have a heterozygous sample with two different alternate alleles. You cannot describe a multiallelic heterozygote when each allele is put on its own line.
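To make this concrete, here is a small Python sketch of how a genotype field indexes into the alleles of a single line (0 is REF, 1 is the first ALT, and so on). The helper name `decode_genotype` is made up for this example, and it only handles fully called genotypes (no "." entries).

```python
import re


def decode_genotype(ref, alts, gt):
    """Translate a GT string such as "2/3" into allele sequences.

    Hypothetical helper; missing calls (".") are not handled.
    """
    alleles = [ref] + list(alts)  # index 0 = REF, 1..n = ALT alleles
    # GT indices are separated by "/" (unphased) or "|" (phased).
    return [alleles[int(i)] for i in re.split(r"[/|]", gt)]


# Aggregated record from the question: REF=TC, ALT=TG,T,TCA
print(decode_genotype("TC", ["TG", "T", "TCA"], "2/3"))
```

A sample carrying the deletion on one chromosome and the insertion on the other is simply 2/3 on the aggregated line. If the two alleles sat on separate lines, each line would only have indices 0 and 1 available, so neither line could name the other alternate allele.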

As Deniz said, you aggregate multiple alternate alleles when they overlap or interfere with each other. This rule covers most cases, but may still be vague in corner cases.
