In VCF files, what character encodings are used / legal?
8.9 years ago

In VCF files, what character encodings are used in practice?

What character encodings are expected for the format?

The examples I've seen seem to contain only single-byte characters. It's not clear whether the intended encoding is 7-bit US-ASCII or 8-bit ISO-8859-1 (Latin-1).

The 4.2 format specification doesn't mention encodings for VCF files. (It does specify that Unicode characters are not supported in character and string values in BCF files.)

8.9 years ago

https://github.com/samtools/hts-specs/blob/VCFv4.3/VCFv4.3.tex

The character encoding of VCF files is UTF-8. UTF-8 is a multi-byte character encoding that is a strict superset of 7-bit ASCII and has the property that none of the bytes in any multi-byte characters are 7-bit ASCII bytes. As a result, most software that processes VCF files does not have to be aware of the possible presence of multi-byte UTF-8 characters.
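
To illustrate the property quoted above, here is a minimal Python sketch rather than anything taken from the spec or from an existing VCF library. The record and the helper names `split_fields` and `has_ascii_inside_multibyte` are made up for this example. It splits a raw line on tab bytes without decoding it, then checks that the result matches splitting the decoded text.

```python
def split_fields(line: bytes) -> list:
    """Split a raw record on ASCII tab bytes without decoding it first."""
    return line.rstrip(b"\r\n").split(b"\t")


def has_ascii_inside_multibyte(text: str) -> bool:
    """Return True if any non-ASCII character encodes to a byte below 0x80."""
    return any(
        byte < 0x80
        for ch in text
        if ord(ch) > 0x7F
        for byte in ch.encode("utf-8")
    )


# Hypothetical record: the INFO value contains non-ASCII characters.
record = "chr1\t12345\trs1\tA\tG\t50\tPASS\tNOTE=séquençage_日本語\n"
raw = record.encode("utf-8")

# Byte-level split (what an encoding-unaware parser effectively does).
byte_fields = split_fields(raw)
# Text-level split (what an encoding-aware parser does).
text_fields = record.rstrip("\n").split("\t")

# Both approaches find the same field boundaries.
assert [f.decode("utf-8") for f in byte_fields] == text_fields
# No byte of a multi-byte character can be mistaken for a tab, newline, etc.
assert not has_ascii_inside_multibyte(record)
```

Because every byte of a multi-byte UTF-8 sequence has its high bit set, delimiters such as tab, newline, `;` and `=` can be searched for at the byte level. Decoding is only needed when a field's text is actually displayed or compared as characters.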


7 minutes with a direct quote. Is Google spelled P-I-E-R-R-E in France?


@Ram: no surprise: I've been recently involved in some conversations about encoding things (xml...) in VCF https://github.com/samtools/hts-specs/issues/75


Nice discussion. I'm pro-JSON though, lightweight and easier to parse, no?
