How to convert a VCF with genotypes and phasing info to a list of haplotypes for a ROI/SOI?
VCF files created with GATK HaplotypeCaller/GenotypeGVCFs include genotypes and the phase between nearby heterozygous genotypes.
In theory this should make it possible to output haplotypes for a Region Of Interest (ROI) and Sample Of Interest (SOI).
For example, if there are 3 nearby heterozygous genotypes in a 100 bp region of interest, in theory there are 8 (= 2 x 2 x 2) possible haplotypes. By looking at the phasing it might become clear that there are only 2 haplotypes, i.e. the 3 heterozygous genotypes are in phase for all the samples of interest.
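To make the counting concrete, here is a minimal Python sketch of the idea. The positions, alleles and the `phased_haplotypes` helper are made up for illustration, and it assumes the phased genotypes have already been read as strings like `0|1` that all belong to the same phase set.

```python
from itertools import product

# Hypothetical heterozygous sites in the region: (position, ref, alt)
het_sites = [(110, "A", "G"), (135, "C", "T"), (160, "G", "A")]

# Without phase information every combination of alleles is possible: 2 x 2 x 2.
unphased = {"".join(combo) for combo in product(*[(ref, alt) for _, ref, alt in het_sites])}

# Assumed phased genotypes for one sample, all in one phase set.
phased_gts = ["0|1", "0|1", "0|1"]

def phased_haplotypes(sites, gts):
    """Return the two haplotypes implied by phased GT strings such as '0|1'."""
    hap1, hap2 = [], []
    for (_, ref, alt), gt in zip(sites, gts):
        left, right = gt.split("|")
        alleles = (ref, alt)
        hap1.append(alleles[int(left)])   # allele on the first chromosome copy
        hap2.append(alleles[int(right)])  # allele on the second chromosome copy
    return "".join(hap1), "".join(hap2)

print(len(unphased), "haplotypes possible without phase")
print(phased_haplotypes(het_sites, phased_gts), "with phase")
```

Only the variant alleles are joined here; a real tool would also have to fill in the reference bases between the sites.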
A few years back I tried to use VCFLib vcfgeno2haplo for this, but it did not work as I expected in my hands. https://github.com/vcflib/vcflib/blob/master/doc/vcfgeno2haplo.md
Does anyone know which tools are currently good for converting a VCF with genotype and phase info to haplotypes? And did you maybe also find out how trustworthy the GATK haplotype information is?
A command-line example would be the following:
genoAndPhaseToHaploTool -input my.vcf.gz -region Chr_01:100-200 -samples samples.txt
I am not sure how best to format the output, but I could imagine something like this:
H1 ATCGATCG
H2 ATCCATCG
H3 ATCAATCG
H4 ATCTATCG
H5 ACCTATCG
H6 CCCTATCG

Sample1 = H1, H2
Sample2 = H2, H2
Sample3 = H4, H1

H1 = Sample1, Sample3, frequency = 0.33
H2 = Sample2, Sample1, frequency = 0.5
H3 = None, frequency = 0
H4 = Sample3, frequency = 0.166
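A report of this shape could be produced roughly as follows once the two haplotype sequences per sample are known. This is only a sketch: `summarize_haplotypes` and its input dictionary are hypothetical, and it names haplotypes by decreasing frequency, so the H numbers differ from the hand-written example above.

```python
from collections import Counter

def summarize_haplotypes(sample_haps):
    """sample_haps maps sample name -> (haplotype_1, haplotype_2) sequences."""
    counts = Counter(seq for pair in sample_haps.values() for seq in pair)
    # Name haplotypes H1, H2, ... by decreasing count, ties broken by sequence.
    ordered = sorted(counts, key=lambda seq: (-counts[seq], seq))
    names = {seq: f"H{i}" for i, seq in enumerate(ordered, start=1)}
    total = sum(counts.values())  # two haplotypes per diploid sample

    lines = [f"{names[seq]} {seq}" for seq in ordered]
    for sample, (first, second) in sample_haps.items():
        lines.append(f"{sample} = {names[first]}, {names[second]}")
    for seq in ordered:
        carriers = [s for s, pair in sample_haps.items() if seq in pair]
        freq = counts[seq] / total
        lines.append(f"{names[seq]} = {', '.join(carriers)}, frequency = {freq:.3f}")
    return lines

# Assumed input, taken from the three samples in the example above.
example = {
    "Sample1": ("ATCGATCG", "ATCCATCG"),
    "Sample2": ("ATCCATCG", "ATCCATCG"),
    "Sample3": ("ATCTATCG", "ATCGATCG"),
}
lines = summarize_haplotypes(example)
print("\n".join(lines))
```

Haplotypes that no sample carries (H3, H5 and H6 in the example) do not appear, because the sketch only knows about sequences that were actually observed.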