I'm new to VCF processing and am a bit confused about the different conventions used by VCF files "in the wild". The VCF spec supports variant calls for multiple samples in a single file: a FORMAT column declares the per-sample fields, and it is followed by one column per sample, each holding that sample's explicit genotype call in a "GT" field.
However, it seems like some tools (e.g. SnpEff) assume that the input VCF file contains calls for only a single sample, and so omit the FORMAT column and GT field(s) entirely. Instead, the variant call is apparently inferred from the REF and ALT columns.
My questions are:
- Which convention is actually used most often in practice, the full multi-sample format, or the simpler single-sample format?
- What happens if a tool like SnpEff assumes the simpler format, but is given a file with multiple samples? For example, the current VCFs from 1000 Genomes contain 2504 samples per file. How does SnpEff cope with this?
- It's not clear to me how the simpler convention handles heterozygosity vs. homozygosity. For example, a GT value of "0|1" is heterozygous, while "1|1" is homozygous. There is no way to convey this difference using the ALT and REF fields, is there?
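To make the last question concrete, here is a minimal Python sketch of the distinction I mean. The function name `zygosity` and its return labels are my own invention, not part of any VCF library, and it only looks at the allele indices in a GT string (phased `|` or unphased `/`).

```python
import re

def zygosity(gt: str) -> str:
    """Classify a VCF GT string such as "0|1" or "1/1" (hypothetical helper)."""
    # Split on either the phased ("|") or unphased ("/") separator.
    alleles = re.split(r"[|/]", gt)
    # "." marks a missing allele call.
    if "." in alleles:
        return "missing"
    # Different allele indices on the two haplotypes: heterozygous.
    if len(set(alleles)) > 1:
        return "heterozygous"
    # Same index everywhere: homozygous for REF (index 0) or for an ALT allele.
    return "homozygous_ref" if alleles[0] == "0" else "homozygous_alt"

print(zygosity("0|1"), zygosity("1|1"))
```

A record that carries only REF and ALT gives this function nothing to work with, which is exactly my concern.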
Thanks for any insight you can provide.
Thanks, this is helpful. From the SnpEff link you provided, it looks like the tool works with an in-between convention for cancer that has two samples in one file (normal and tumor), but still assumes a single patient per file.
The Ensembl Variant Effect Predictor (VEP, http://www.ensembl.org/vep) can treat each individual's genotype in a VCF separately. Be aware, though, that this behaviour is somewhat primitive and introduces a large amount of duplication in the results file if the VCF contains many individuals. See http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#opt_individual for details.
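To illustrate where the duplication comes from, here is a rough Python sketch of per-individual expansion. It is not VEP's code, only an approximation of the idea: one output row per individual and alternate allele that the individual actually carries. The function name `per_individual_rows` and the sample names `S1` to `S3` are made up for the example.

```python
def per_individual_rows(header_line: str, data_line: str):
    """Expand one multi-sample VCF record into one row per carrying sample."""
    # Sample names follow the nine fixed columns of the #CHROM header line.
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = data_line.rstrip("\n").split("\t")
    chrom, pos, ref = fields[0], fields[1], fields[3]
    alts = fields[4].split(",")
    # FORMAT (column 9) says where GT sits inside each sample column.
    gt_index = fields[8].split(":").index("GT")
    rows = []
    for sample, call in zip(samples, fields[9:]):
        gt = call.split(":")[gt_index]
        # Keep only ALT allele indices; skip REF ("0") and missing (".").
        carried = {a for a in gt.replace("|", "/").split("/") if a not in ("0", ".")}
        for a in sorted(carried, key=int):
            rows.append((sample, chrom, pos, ref, alts[int(a) - 1]))
    return rows

header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
record = "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\t0|0"
for row in per_individual_rows(header, record):
    print(row)
```

With thousands of samples, a common variant produces close to one row per sample, so the results file grows roughly with the number of carriers rather than the number of variants.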