Question

VCF format conventions for multi-sample vs. single-sample data

9.0 years ago

bberns ▴ 20

I'm new to VCF processing and am a bit confused about the different conventions used by VCF files "in the wild". The VCF spec supports variant calls for multiple samples in a single file via the FORMAT column, with a separate column for each sample containing the explicit variant call for that sample in a "GT" field.

However, it seems like some tools (e.g. SnpEff) assume that the input VCF file contains calls for only a single sample, and so omit the FORMAT column and GT field(s) entirely. Instead, the variant call is apparently inferred from the REF and ALT columns.

My questions are:

1. Which convention is actually used most often in practice, the full multi-sample format, or the simpler single-sample format?
2. What happens if a tool like SnpEff assumes the simpler format, but is given a file with multiple samples? For example, the current VCFs from 1000 Genomes contain 2504 samples per file. How does SnpEff cope with this?
3. It's not clear to me how the simpler convention handles heterozygosity vs. homozygosity. For example, a GT value of "0|1" is heterozygous, while "1|1" is homozygous. There is no way to convey this difference using the ALT and REF fields, is there?

Thanks for any insight you can provide.

Brian

1000genomes vcf snpeff • 5.6k views

updated 14 months ago by Ram • written 9.0 years ago by bberns

Ram · Accepted Answer · 2015-05-05

Which convention is actually used most often in practice, the full multi-sample format, or the simpler single-sample format?
- I've seen both around, although multi-sample files tend to be reserved for cases where samples belong together, such as trios, other types of pedigrees, or bulk data releases like the 1000 Genomes Project data.
What happens if a tool like SnpEff assumes the simpler format, but is given a file with multiple samples? For example, the current VCFs from 1000 Genomes contain 2504 samples per file. How does SnpEff cope with this?
- The annotation depends on the variant itself, not on the genotype, so even with genotypes for thousands of samples, each variant is annotated only once. The annotation is written to the INFO column, which is shared by all samples, because the annotation is variant-dependent rather than sample-dependent. I haven't worked with SnpEff in multi-sample mode, but apparently it's perfectly capable of doing it: http://snpeff.sourceforge.net/SnpEff_manual.html#cancer
It's not clear to me how the simpler convention handles heterozygosity vs. homozygosity. For example, a GT value of "0|1" is heterozygous, while "1|1" is homozygous. There is no way to convey this difference using the ALT and REF fields, is there?
- Correct. The only way to tell whether a call is homozygous or heterozygous in a VCF is to look at the genotype (GT). The REF and ALT columns only describe what the reference genome has at that position and which alternative alleles were found. A call is homozygous if both allele indices are the same (1|1, 2|2 or 3|3 and so on if the site has more than one alternative allele, or even 0|0, homozygous reference, which appears in multi-sample files because they have to report samples that do not carry the variant) and heterozygous if the indices differ (0|1, 1|2).
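
To make the last point concrete, here is a minimal Python sketch of how zygosity can be read from a GT string. The function name `zygosity` and the labels it returns are my own choices for illustration and do not come from SnpEff or any other tool. It only assumes that alleles are separated by `|` (phased) or `/` (unphased) and that `.` marks a missing allele.

```python
import re


def zygosity(gt: str) -> str:
    """Classify a GT string such as '0|1' or '1/1' (hypothetical helper)."""
    # Alleles are indices into [REF, ALT1, ALT2, ...]; 0 is the reference.
    alleles = re.split(r"[|/]", gt)
    if "." in alleles:
        return "missing"      # no call, or a partial call such as './1'
    if len(alleles) == 1:
        return "haploid"      # e.g. calls on chrY or chrM
    if len(set(alleles)) > 1:
        return "het"          # indices differ: 0|1, 1/2, ...
    # All indices are identical: homozygous reference or homozygous alternate.
    return "hom_ref" if alleles[0] == "0" else "hom_alt"


for gt in ("0|1", "1|1", "2/2", "0/0", "./."):
    print(gt, zygosity(gt))
```

Note that nothing in this function looks at REF or ALT, which is the point: those columns tell you what the alleles are, and GT tells you which of them a given sample carries.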

Extra note: if you're new to the VCF format, keep in mind that the fixed columns (CHROM through INFO) are there to describe the variant, not the genotypes. Genotypes live only in the sample column at the end (or the sample columns, if multi-sample), laid out according to the FORMAT column just before them. A variant is defined by its chromosomal position, the reference allele, and the alternative alleles that were found.
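
As a rough illustration of that split, the sketch below separates one tab-delimited VCF data line into the variant-level fields and the per-sample fields. The helper name `split_record` and the example record are made up for illustration. It assumes a well-formed data line (not a `#` header line) and does no validation.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def split_record(line: str):
    """Split a VCF data line into (site, genotypes); hypothetical helper."""
    fields = line.rstrip("\n").split("\t")
    # The first eight columns describe the variant and are shared by all samples.
    site = dict(zip(FIXED_COLUMNS, fields[:8]))
    if len(fields) <= 8:
        # No FORMAT or sample columns: the file carries no genotypes at all.
        return site, []
    # FORMAT names the colon-separated keys used in every sample column.
    keys = fields[8].split(":")
    genotypes = [dict(zip(keys, sample.split(":"))) for sample in fields[9:]]
    return site, genotypes


# Illustrative record with three samples (values are made up).
record = "20\t14370\trs1\tG\tA\t29\tPASS\tNS=3;DP=14\tGT:DP\t0|0:1\t1|0:8\t1/1:5"
site, genotypes = split_record(record)
print(site["REF"], site["ALT"], [g["GT"] for g in genotypes])
```

An annotation tool only needs the `site` part, which is why a record is annotated once no matter how many sample columns follow it.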