Multi-sample VCF files produced with GATK4 contain genotype (GT), genotype phase (PGT) and phase block (PID) information:
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
The following bcftools query prints these three fields for each sample:
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT\t%PGT\t%PID]\n' my_samples.vcf.gz
Most combinations of GT, PGT and PID make sense to me. For a heterozygous genotype (0/1), the alternate allele can sit on either the paternal (e.g. 0|1) or the maternal haplotype (e.g. 1|0):
0/1 0|1 748_A_C
0/1 1|0 627_C_T
But how can a homozygous reference genotype (0/0) be phased onto one of the parental haplotypes (e.g. 0|1)?
A 0/0 genotype contains no alternate allele (1) at all, yet I see records like this:
0/0 0|1 627_C_T
The 0/0 genotype above combined with a 0|1 phase does not make sense to me. Does anyone know what this means and how to interpret it? To me it looks like nonsense. Can the 0|1 627_C_T information simply be discarded in this case?
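To see how often this happens, the query output can be scanned for sample genotypes where GT is homozygous reference but PGT still carries an alternate allele. Below is a minimal sketch, assuming the tab-separated layout produced by the query above (four site columns followed by GT, PGT, PID triples, one per sample). The function name `flag_mismatches` is my own and not part of bcftools or GATK.

```python
def flag_mismatches(lines, hom_ref=("0/0", "0|0")):
    """Yield (chrom, pos, sample_index, gt, pgt, pid) for every sample
    genotype whose GT is homozygous reference while PGT contains allele 1.

    Assumes each line is CHROM, POS, REF, ALT followed by repeated
    GT, PGT, PID columns, as printed by the bcftools query above.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not a complete record (e.g. an empty line)
        chrom, pos = fields[0], fields[1]
        per_sample = fields[4:]
        # Walk the per-sample columns three at a time: GT, PGT, PID.
        for i in range(0, len(per_sample) - 2, 3):
            gt, pgt, pid = per_sample[i:i + 3]
            # A missing PGT is printed as "." and is never flagged.
            if gt in hom_ref and "1" in pgt.split("|"):
                yield chrom, pos, i // 3, gt, pgt, pid
```

The bcftools output can be piped into a script that calls `flag_mismatches(sys.stdin)` and prints each hit. The sample index is zero-based and follows the sample order in the VCF header.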