VCF format conventions for multi-sample vs. single-sample data
6.7 years ago
bberns ▴ 20

I'm new to VCF processing and am a bit confused about the different conventions used by VCF files "in the wild". The VCF spec supports variant calls for multiple samples in a single file via the FORMAT column, with a separate column for each sample containing the explicit variant call for that sample in a "GT" field.

However, it seems like some tools (e.g. SnpEff) assume that the input VCF file contains calls for only a single sample, and so omit the FORMAT column and GT field(s) entirely. Instead, the variant call is apparently inferred from the REF and ALT columns.

My questions are:

  • Which convention is actually used most often in practice, the full multi-sample format, or the simpler single-sample format?
  • What happens if a tool like SnpEff assumes the simpler format, but is given a file with multiple samples? For example, the current VCFs from 1000 Genomes contain 2504 samples per file. How does SnpEff cope with this?
  • It's not clear to me how the simpler convention handles heterozygosity vs. homozygosity. For example, a GT value of "0|1" is heterozygous, while "1|1" is homozygous. There is no way to convey this difference using the ALT and REF fields, is there?

Thanks for any insight you can provide.

-- Brian

vcf snpeff 1000 genomes • 4.3k views
6.7 years ago
  • Which convention is actually used most often in practice, the full multi-sample format, or the simpler single-sample format?
    • I've seen both. Multi-sample files tend to be reserved for specific purposes, such as trios, other types of pedigrees, or bulk data releases like the 1000 Genomes Project data.
  • What happens if a tool like SnpEff assumes the simpler format, but is given a file with multiple samples? For example, the current VCFs from 1000 Genomes contain 2504 samples per file. How does SnpEff cope with this?
    • The annotation depends on the variant itself, not on the genotype. Even if the file holds genotypes for thousands of samples, each variant is annotated only once, and the annotation is written to the INFO column, which is shared by all samples because the annotation is variant-dependent, not sample-dependent. I haven't worked with SnpEff in multi-sample mode, but apparently it's perfectly capable of doing it: http://snpeff.sourceforge.net/SnpEff_manual.html#cancer
  • It's not clear to me how the simpler convention handles heterozygosity vs. homozygosity. For example, a GT value of "0|1" is heterozygous, while "1|1" is homozygous. There is no way to convey this difference using the ALT and REF fields, is there?
    • Correct, there isn't. The only way to tell whether a call is homozygous or heterozygous for a particular variant in a VCF file is to look at the genotype. The REF and ALT columns only describe what the reference genome has at that position and which alternative allele(s) were found. A call is homozygous if both allele numbers are the same: 1|1, 2|2, 3|3 and so on if the variant has more than one alternative allele, or 0|0 (homozygous reference), which shows up in multi-sample files that also have to report samples that do not vary at a site.
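The homozygous/heterozygous rule from the last answer can be sketched in a few lines of Python. This is only an illustration: the function name `zygosity` and the labels it returns are made up for this example, and it assumes GT values use `/` or `|` as the allele separator and `.` for a missing allele, as the VCF spec describes.

```python
import re

def zygosity(gt):
    """Classify a VCF GT value such as '0|1', '1/1' or './.'.

    Hypothetical helper for illustration; the labels are not from any tool.
    """
    # Alleles are separated by '/' (unphased) or '|' (phased).
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    if len(alleles) == 1:
        return "haploid"
    # Different allele numbers means heterozygous, whatever the phasing.
    if len(set(alleles)) > 1:
        return "heterozygous"
    # Same number everywhere: 0 is the REF allele, anything else is an ALT.
    return "homozygous-ref" if alleles[0] == "0" else "homozygous-alt"

for gt in ("0|1", "1|1", "2/2", "0|0", "./."):
    print(gt, zygosity(gt))
```

Note that nothing in REF or ALT is consulted here; the classification comes entirely from the per-sample GT value.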

Extra note: if you're new to the VCF format, keep in mind that all the columns except the trailing sample column (or columns, in a multi-sample file) are there to describe the variant, not the genotypes; FORMAT only lists which fields the sample columns carry. A variant is defined by its chromosomal location and the alternative alleles that were found.
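To make the split between variant-level and sample-level columns concrete, here is a rough Python sketch that separates one tab-delimited VCF data line into the eight fixed columns and the per-sample genotype fields. The function name `split_record`, the example record, and the sample names are all invented for illustration; a real pipeline would more likely use an existing VCF parser.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def split_record(line, sample_names):
    """Split a VCF data line into (site fields, per-sample fields).

    Hypothetical helper: assumes a tab-delimited data line and that
    sample_names matches the sample columns of the #CHROM header line.
    """
    fields = line.rstrip("\n").split("\t")
    # The first eight columns describe the variant and are shared by all samples.
    site = dict(zip(FIXED_COLUMNS, fields[:8]))
    if len(fields) <= 8:
        # Sites-only file: no FORMAT column and no genotypes at all.
        return site, {}
    # FORMAT names the colon-separated fields found in every sample column.
    keys = fields[8].split(":")
    genotypes = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return site, genotypes

# Made-up record with two made-up samples.
example = "20\t14370\t.\tG\tA\t29\tPASS\tDP=14\tGT:DP\t0|1:8\t1|1:6"
site, genotypes = split_record(example, ["sampleA", "sampleB"])
print(site["REF"], site["ALT"], site["INFO"])
print(genotypes["sampleA"]["GT"], genotypes["sampleB"]["GT"])
```

An annotation written to INFO lands in `site`, once per variant, no matter how many entries `genotypes` has.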


Thanks, this is helpful. From the SnpEff link you provided, it looks like the tool works with an in-between convention for cancer that has two samples in one file (normal and tumor), but still assumes a single patient per file.


The Ensembl Variant Effect Predictor (VEP, www.ensembl.org/vep) can treat each individual genotype in a VCF separately, though be aware that the behaviour is somewhat primitive and introduces a large amount of duplication in the results file if you have many individuals in the VCF. See http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_individual for details.
