I'm new to VCF processing and am a bit confused about the different conventions used by VCF files "in the wild". The VCF spec supports variant calls for multiple samples in a single file via the FORMAT column, with a separate column for each sample containing the explicit variant call for that sample in a "GT" field.
However, it seems like some tools (e.g. SnpEff) assume that the input VCF file contains calls for only a single sample, and so omit the FORMAT column and GT field(s) entirely. Instead, the variant call is apparently inferred from the REF and ALT columns.
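To make the two layouts concrete, here is a minimal Python sketch of how I understand the difference. It only looks at the `#CHROM` header line. The function name `sample_names` and the sample IDs are made up for illustration, and I am assuming a tab-delimited header with the eight fixed columns in their standard order.

```python
# The eight fixed VCF columns; FORMAT and the per-sample columns are optional.
FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def sample_names(header_line):
    """Return the sample names declared on a VCF '#CHROM' header line.

    An empty list means the file has no FORMAT/sample columns at all.
    (Hypothetical helper, not part of any library.)
    """
    fields = header_line.rstrip("\n").split("\t")
    if fields[:8] != FIXED_COLUMNS:
        raise ValueError("not a VCF column header line")
    # Column 9, if present, is FORMAT; every column after it is one sample.
    return fields[9:]


# Illustrative header lines (sample IDs are placeholders).
no_genotypes = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
two_samples = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\n"
)

print(sample_names(no_genotypes))  # []
print(sample_names(two_samples))   # ['NA00001', 'NA00002']
```

If that reading is right, a file without genotype columns simply returns an empty list here, and a 1000 Genomes file would return 2504 names.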
My questions are:
- Which convention is actually used most often in practice: the full multi-sample format or the simpler single-sample format?
- What happens if a tool like SnpEff assumes the simpler format, but is given a file with multiple samples? For example, the current VCFs from 1000 Genomes contain 2504 samples per file. How does SnpEff cope with this?
- It's not clear to me how the simpler convention handles heterozygosity vs. homozygosity. For example, a GT value of "0|1" is heterozygous, while "1|1" is homozygous for the alternate allele. There is no way to convey this difference using the REF and ALT columns alone, is there?
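For reference, this is roughly how I would read zygosity out of the GT field when it is present. It is only a sketch under my own assumptions: the record line is invented, the helper names `genotypes` and `zygosity` are hypothetical, and I assume tab-delimited records with GT listed in the FORMAT column.

```python
import re


def genotypes(record_line):
    """Return the GT string of every sample on one VCF data line."""
    fields = record_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")  # raises ValueError if GT is absent
    # Each sample column is colon-separated in the same order as FORMAT.
    return [sample.split(":")[gt_index] for sample in fields[9:]]


def zygosity(gt):
    """Classify a GT string such as '0|1' (phased) or '1/1' (unphased)."""
    alleles = re.split(r"[|/]", gt)
    if "." in alleles:
        return "missing"
    if len(alleles) == 1:
        return "haploid"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous-ref" if alleles[0] == "0" else "homozygous-alt"


# Invented record with three samples; only REF, ALT, FORMAT and GT matter here.
record = "20\t14370\t.\tG\tA\t29\tPASS\t.\tGT:DP\t0|1:8\t1|1:5\t0|0:3\n"

for gt in genotypes(record):
    print(gt, zygosity(gt))
# 0|1 heterozygous
# 1|1 homozygous-alt
# 0|0 homozygous-ref
```

All three samples share the same REF (`G`) and ALT (`A`), which is why I don't see how those two columns alone could distinguish them.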
Thanks for any insight you can provide.