VCF Record Question
8.8 years ago
Matt ▴ 70

I am parsing a VCF file (indel not SNP) and stumbled upon this type of entry.

CHROM: 1
POSITION: 3010401
REF: GT
ALT: GGTTTTT

If ALT were GTTTTT, I would say that there is a 5 sequence insertion starting after the first G. But with the leading G, I don't understand. Is this both an insertion and a deletion? What bases are affected?

Please help!

Matt

indel vcf • 4.1k views

A G at 3010401 followed by 'T' is replaced by a G at 3010401 followed by 'GTTTTT'. The VCF spec says you need to put the base 'before' the event when there's an indel.


Just a correction regarding the padding base. The VCF spec (from 4.1 onward) says you need to add a padding base only if either the REF or ALT would otherwise be empty (i.e. a pure insertion or pure deletion). Thus the leading G in this variant is not required for padding reasons. A literal interpretation is that the variant caller has asserted that the base at position 3010401 is a G (as you note).
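
To make the padding rule concrete, here is a minimal Python sketch. The helper name `pad_event`, the convention that an unpadded insertion is given at the position it precedes, and the placeholder base "N" are all assumptions for illustration, not anything taken from the spec or from a particular caller. Events at position 1 of a contig are not handled.

```python
def pad_event(pos, ref, alt, preceding_base):
    """Return (pos, ref, alt), adding a padding base only when one allele is empty.

    pos is the 1-based position of the first base of the event as written;
    preceding_base is the reference base at pos - 1, supplied by the caller.
    (Hypothetical helper, for illustration only.)
    """
    if ref and alt:
        # Neither allele is empty, so no padding base is needed.
        return pos, ref, alt
    if pos <= 1:
        raise ValueError("events at position 1 are not handled in this sketch")
    # Pure insertion or pure deletion: prepend the preceding reference base
    # and move POS back by one so that it points at that base.
    return pos - 1, preceding_base + ref, preceding_base + alt


# Pure insertion of GTTTT between the G at 3010401 and the T at 3010402.
print(pad_event(3010402, "", "GTTTT", "G"))        # (3010401, 'G', 'GGTTTT')

# The record from the question already has non-empty REF and ALT,
# so it is returned unchanged ("N" is an unused placeholder).
print(pad_event(3010401, "GT", "GGTTTTT", "N"))    # (3010401, 'GT', 'GGTTTTT')
```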

The notion of padding bases is actually a shortcoming of VCF, as it introduces uncertainty as to whether the caller really means what it is declaring regarding the bases actually present in the sample genome vs having added the base simply for syntactic reasons. This is becoming more important as in the clinical world people care about confidently calling locations as reference.


You are saying that it is as simple as the 2 REF bases are replaced by the 7 ALT bases?

Would it be the same if REF were AGT? The 3 REF bases would be replaced? Or is this an invalid indel record?


He's saying that in VCF's world, this represents a valid insertion between a G at position 3010401 and a T at 3010402.

The ref bases aren't replaced.

This insertion can be a bit more sanely called

POS: 3010401
REF: G
ALT: GGTTTT

because the T at position 3010402 is completely irrelevant to position 3010401.
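
The simpler record above can be derived mechanically by trimming bases that REF and ALT share. The following is a rough Python sketch; `trim_alleles` is a hypothetical name, and it only trims shared bases. It does not left-align the indel against the reference genome, which dedicated normalization tools also do.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each.

    Hypothetical helper for illustration; it does not left-align the variant.
    """
    # Drop shared trailing bases first; POS is unaffected by these.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop shared leading bases; each one moves POS right by one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


print(trim_alleles(3010401, "GT", "GGTTTTT"))  # (3010401, 'G', 'GGTTTT')
```

The trailing T is shared by both alleles, so it is removed, leaving the G at 3010401 as the single anchoring base.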

If the Ref were AGT, the Ref would be wrong, because 3010401 is a G, not an A (presumably the reference given in your original post is correct).
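
One way to convince yourself of both points is to apply the records to a toy reference. In the sketch below, `apply_variant` and the 6-base window are made up for illustration; only the G at 3010401 and the T at 3010402 come from this thread, and the flanking bases are arbitrary.

```python
def apply_variant(reference, start, pos, ref, alt):
    """Apply one record to `reference`, whose first base sits at 1-based position `start`.

    Raises ValueError if REF does not match the reference at POS.
    (Hypothetical helper, for illustration only.)
    """
    offset = pos - start
    observed = reference[offset:offset + len(ref)]
    if observed != ref:
        raise ValueError(f"REF {ref!r} does not match reference {observed!r} at {pos}")
    return reference[:offset] + alt + reference[offset + len(ref):]


# Toy window covering 3010399..3010404; flanking bases are invented.
window = "ACGTCA"
start = 3010399

original = apply_variant(window, start, 3010401, "GT", "GGTTTTT")
trimmed = apply_variant(window, start, 3010401, "G", "GGTTTT")
print(original == trimmed)  # True: both records yield the same sequence

try:
    apply_variant(window, start, 3010401, "AGT", "GGTTTTT")
except ValueError as err:
    print(err)  # REF does not match, because 3010401 is a G
```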

VCF makes variant calling confusing because of 2 fundamentally weird decisions. First, ALT always needs to have a base. Second, the ref can be any number of bases long, even for a single position.


Hello Matt!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=61540

This is typically not recommended as it runs the risk of annoying people in both communities.


My apologies


Please don't apologize. tl;dr: what you did is the best way to gain exposure. There is nothing wrong with posing the same question to multiple internet communities. This 1) increases your chance of having the question answered, 2) increases the chance that more than one valid perspective will be demonstrated, and 3) increases the chance that your question and the corresponding answers will appear in a Google search. This is especially important when using resources where valid answers cannot be upvoted, and therefore cannot increase in relevance except by word volume and links to / from answers (like seqanswers). It also seems anticompetitive for the creator of one website to ask his users not to use another.

To Pierre's credit, his response "annoyed" me enough to register and answer your remaining question.
