Question: How to convert a VCF with genotypes and phasing info to list of haplotypes for ROI/SOI
12 days ago by
William4.7k
Europe
William4.7k wrote:

How can I convert a VCF with genotypes and phasing info to a list of haplotypes for a ROI/SOI?

VCF files created with GATK HaplotypeCaller/GenotypeGVCFs include genotypes and the phase between nearby heterozygous genotypes.

In theory, this should make it possible to output haplotypes for a Region Of Interest (ROI) and Sample Of Interest (SOI).

For example, if there are 3 nearby heterozygous genotypes in a 100bp region of interest, in theory there are 8 (= 2 x 2 x 2) possible haplotypes. Looking at the phasing might make it clear that there are only 2 haplotypes, i.e. the 3 heterozygous genotypes are in phase for all the samples of interest.
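To make the counting concrete, here is a minimal Python sketch. The alleles and genotypes are made up for illustration, and the sites are assumed to be biallelic with phased GT strings such as "0|1". It contrasts the 8 candidate haplotypes with the 2 that the phase actually supports for one sample.

```python
from itertools import product

# Three heterozygous biallelic sites as (REF, ALT) pairs (made-up alleles).
sites = [("A", "G"), ("C", "T"), ("G", "A")]

# Without phase information, every allele combination is a candidate haplotype.
candidates = ["".join(combo) for combo in product(*sites)]

# Phased genotypes for one sample, written as in the VCF GT field.
phased_gts = ["0|1", "0|1", "0|1"]


def phased_haplotypes(sites, gts):
    """Build the two haplotypes of one sample from phased genotypes."""
    left, right = [], []
    for alleles, gt in zip(sites, gts):
        a, b = (int(i) for i in gt.split("|"))
        left.append(alleles[a])
        right.append(alleles[b])
    return "".join(left), "".join(right)


observed = phased_haplotypes(sites, phased_gts)
print(len(candidates))  # 8
print(observed)         # ('ACG', 'GTA')
```

This only holds when all three sites sit in the same phase block; sites that are not phased relative to each other would still leave several combinations open.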

A few years back I tried to use VCFLib vcfgeno2haplo for this, but it did not work as I expected in my hands. https://github.com/vcflib/vcflib/blob/master/doc/vcfgeno2haplo.md

Does anyone know which tools are currently good for converting a VCF with genotype and phase info to haplotypes? And did you maybe also find out how trustworthy the GATK haplotype information is?

An example command line would be the following:

genoAndPhaseToHaploTool -input  my.vcf.gz -region Chr_01:100-200  -samples samples.txt

I am not sure how to best format the output, but I could imagine something like this

H1 ATCGATCG
H2 ATCCATCG
H3 ATCAATCG
H4 ATCTATCG
H5 ACCTATCG
H6 CCCTATCG

Sample1 = H1, H2
Sample2 = H2, H2
Sample3 = H4, H1

H1 = Sample1, Sample3, frequency = 0.33
H2 = Sample2, Sample1, frequency = 0.5
H3 = None, frequency = 0
H4 = Sample3, frequency = 0.166
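
As a rough illustration of how such a summary could be tabulated once each sample's two haplotype sequences are known, here is a small Python sketch. It reuses the example sequences and sample assignments above; the function name `summarize` and the data layout are assumptions for illustration, not part of any existing tool.

```python
from collections import Counter

# Two haplotype sequences per sample, taken from the example output above.
sample_haplotypes = {
    "Sample1": ("ATCGATCG", "ATCCATCG"),  # H1, H2
    "Sample2": ("ATCCATCG", "ATCCATCG"),  # H2, H2
    "Sample3": ("ATCTATCG", "ATCGATCG"),  # H4, H1
}


def summarize(sample_haplotypes):
    """Return {haplotype: (carrier samples, frequency among all chromosomes)}."""
    counts = Counter(h for pair in sample_haplotypes.values() for h in pair)
    total = sum(counts.values())
    summary = {}
    for seq, n in counts.items():
        carriers = [s for s, pair in sample_haplotypes.items() if seq in pair]
        summary[seq] = (carriers, n / total)
    return summary


summary = summarize(sample_haplotypes)
for seq, (carriers, freq) in summary.items():
    print(seq, ", ".join(carriers), f"frequency = {freq:.3f}")
```

Haplotypes that no sample carries (H3 in the example) never appear in this summary, because only observed sequences are counted.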
12 days ago by
4galaxy77100
United Kingdom
4galaxy77100 wrote:

You could try using plink to convert to the Oxford haps format (https://www.cog-genomics.org/plink/2.0/formats#haps); it more or less looks like what you need.

I'm unsure exactly how accurate GATK's haplotype calling from short reads is. If you need accurate haplotypes across the whole genome, it might be worth looking at statistical phasing using e.g. shapeit, although it depends on what kind of samples you have.


I have Illumina 150bp sequencing data for multiple samples. I am looking to leverage the phase information from the sequencing data to determine the haplotypes for small regions of interest. Regions can be as small as 100bp or 150bp, so even within the Illumina read length.
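
As a sketch of what using that read-based phase could look like, the Python below groups one sample's calls by phase block. It assumes the VCF carries GATK's physical phasing FORMAT fields PGT and PID; the records are made up for illustration and `group_by_phase_block` is a hypothetical helper, not an existing tool.

```python
from collections import defaultdict

# Made-up records for one sample: (position, FORMAT column, sample column).
records = [
    (120, "GT:PGT:PID", "0/1:0|1:120_A_G"),
    (135, "GT:PGT:PID", "0/1:0|1:120_A_G"),
    (160, "GT:PGT:PID", "0/1:1|0:120_A_G"),
    (190, "GT", "0/1"),  # no physical phasing available for this call
]


def group_by_phase_block(records):
    """Group calls by PID; calls without PID each get their own block."""
    blocks = defaultdict(list)
    for pos, fmt, sample in records:
        fields = dict(zip(fmt.split(":"), sample.split(":")))
        block = fields.get("PID", f"unphased_{pos}")
        # Prefer the phased genotype (PGT) when present, else the plain GT.
        blocks[block].append((pos, fields.get("PGT", fields["GT"])))
    return dict(blocks)


blocks = group_by_phase_block(records)
for block, calls in blocks.items():
    print(block, calls)
```

Only calls that share a block can be combined into a haplotype with confidence; a region that spans more than one block still leaves several allele combinations possible.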

Reply written 12 days ago by William4.7k