Creating VCFs with an IUPAC Reference Base
10.4 years ago

When trying to validate some VCFs that we're creating, we hit some snags at positions where the reference base is non-ACGT. (For example, it could be an "M", using the IUPAC code for A or C). The VCF spec states that only ACGT bases are allowed in the ref position, and that column doesn't allow for a comma-separated list.

What's the proper way to encode this position? At present, I'm leaning towards replacing the ref base with an N. Is this reasonable and/or correct?

vcf reference • 2.4k views

People on the VCF-spec mailing list seem to think that we're right - it should be an N, at least under the current spec. Pierre gets best-answer for being faster, but upvotes all around!

10.4 years ago

As far as I understand the specification, it says that it should be 'N':

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40

REF reference base(s): Each base must be one of A,C,G,T,N. Bases should be in uppercase. Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For InDels, the reference String must include the base before the event (which must be reflected in the POS field). (String, Required).
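To make the rule quoted above concrete, here is a minimal Python sketch of one way to apply it when writing records: uppercase the REF string and replace any IUPAC ambiguity code with N. The function names `normalize_ref` and `is_valid_ref` are hypothetical helpers for illustration, not part of any VCF library, and the sketch assumes that collapsing every ambiguity code to N is acceptable for your use case.

```python
import re

# IUPAC nucleotide ambiguity codes, i.e. everything other than A, C, G, T and N.
IUPAC_AMBIGUITY = set("RYSWKMBDHV")

# REF as described in the quoted spec text: one or more of A, C, G, T, N.
VALID_REF = re.compile(r"[ACGTN]+")


def is_valid_ref(ref: str) -> bool:
    """Return True if ref consists only of uppercase A, C, G, T or N."""
    return VALID_REF.fullmatch(ref) is not None


def normalize_ref(ref: str) -> str:
    """Uppercase ref and replace IUPAC ambiguity codes with N.

    Hypothetical helper: raises ValueError on characters that are
    neither plain bases nor IUPAC ambiguity codes.
    """
    out = []
    for base in ref.upper():
        if base in "ACGTN":
            out.append(base)
        elif base in IUPAC_AMBIGUITY:
            out.append("N")  # the ambiguity information is lost here
        else:
            raise ValueError(f"unexpected character in REF: {base!r}")
    return "".join(out)


if __name__ == "__main__":
    for ref in ["M", "acMt", "ACGT"]:
        print(ref, "->", normalize_ref(ref))
```

Note that this is lossy: once an "M" becomes "N", the fact that the reference was "A or C" is no longer recoverable from the REF column alone.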

10.4 years ago

Hi all,

I second Chris, having found the same issue with a proprietary VCF. Does the rule also apply to the allele call sequence?

It would be nice to keep IUPAC codes there, since they are very informative for, say, splice donor/acceptor sites, TFBSs, or stop codon identification.

Any feedback from VCF experts?

Stephane
