How to convert a VCF with genotypes and phasing info to a list of haplotypes for a ROI/SOI
3.3 years ago
William ★ 5.3k

VCF files created with GATK HaplotypeCaller/GenotypeGVCFs include genotypes and the phase between nearby heterozygous genotypes.

In theory, this should make it possible to output haplotypes for a Region Of Interest (ROI) and Sample Of Interest (SOI).

For example, if there are 3 nearby heterozygous genotypes in a 100 bp region of interest, in theory there are 8 (= 2 x 2 x 2) possible haplotypes. By looking at the phasing it might become clear that there are only 2 haplotypes, i.e. the 3 heterozygous genotypes are in phase for all the samples of interest.
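
To illustrate the counting, here is a minimal Python sketch. The alleles and the phased genotype strings are made up for illustration and are not taken from a real VCF. It enumerates every allele combination for three biallelic heterozygous sites, then shows the two haplotypes implied when the genotypes are phased.

```python
from itertools import product

# Hypothetical (REF, ALT) alleles at three nearby heterozygous sites.
sites = [("A", "G"), ("C", "T"), ("G", "A")]

# Without phase information, any combination of alleles is possible.
possible = ["".join(combo) for combo in product(*sites)]
print(len(possible))  # 8

# Phased genotypes fix which alleles travel together on each chromosome copy.
phased_gts = ["0|1", "0|1", "0|1"]
hap1 = "".join(alleles[int(gt.split("|")[0])] for alleles, gt in zip(sites, phased_gts))
hap2 = "".join(alleles[int(gt.split("|")[1])] for alleles, gt in zip(sites, phased_gts))
print(hap1, hap2)  # ACG GTA
```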

A few years back I tried to use VCFLib vcfgeno2haplo for this, but it did not work as I expected in my hands. https://github.com/vcflib/vcflib/blob/master/doc/vcfgeno2haplo.md

Does anyone know which tools are currently good for converting a VCF with genotype and phase info to haplotypes? And did you maybe also find out how trustworthy the GATK haplotype information is?

A command-line example would be the following:

genoAndPhaseToHaploTool -input my.vcf.gz -region Chr_01:100-200 -samples samples.txt

I am not sure how best to format the output, but I could imagine something like this:

H1 ATCGATCG
H2 ATCCATCG
H3 ATCAATCG
H4 ATCTATCG
H5 ACCTATCG
H6 CCCTATCG

Sample1 = H1, H2
Sample2 = H2, H2
Sample3 = H4, H1

H1 = Sample1, Sample3, frequency = 0.333
H2 = Sample1, Sample2, frequency = 0.5
H3 = None, frequency = 0
H4 = Sample3, frequency = 0.167
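
A minimal Python sketch of the logic such a tool would need is given below. It is not an existing tool: the function name `haplotypes_from_phased` and the toy records are hypothetical. It assumes that every genotype in the region is called, that heterozygous calls are phased with `|`, and that all phased calls of a sample belong to the same phase set. GATK, as far as I know, reports the phase set in FORMAT fields such as PID/PS, which a real tool would have to check. The haplotype strings here contain only the alleles at the variant sites, not the full reference sequence of the region.

```python
from collections import Counter

def haplotypes_from_phased(records, samples):
    """records: list of (pos, ref, alts, {sample: GT string}).
    Returns {sample: (hap_a, hap_b)} built from the variant sites only."""
    haps = {}
    for sample in samples:
        a, b = [], []
        for pos, ref, alts, gts in records:
            alleles = [ref] + list(alts)
            gt = gts[sample]
            if "|" in gt:
                i, j = (int(x) for x in gt.split("|"))
            else:
                i, j = (int(x) for x in gt.split("/"))
                if i != j:
                    # An unphased het call leaves the haplotype ambiguous.
                    raise ValueError(f"unphased het at {pos} in {sample}")
            a.append(alleles[i])
            b.append(alleles[j])
        haps[sample] = ("".join(a), "".join(b))
    return haps

# Toy data: two sites inside the region of interest (made-up values).
records = [
    (120, "A", ["G"], {"Sample1": "0|1", "Sample2": "1|1", "Sample3": "0|1"}),
    (150, "C", ["T"], {"Sample1": "0|1", "Sample2": "1|1", "Sample3": "1|0"}),
]
samples = ["Sample1", "Sample2", "Sample3"]

haps = haplotypes_from_phased(records, samples)
counts = Counter(h for pair in haps.values() for h in pair)
total = sum(counts.values())

for sample in samples:
    print(sample, "=", ", ".join(haps[sample]))
for hap, n in counts.most_common():
    print(hap, "count =", n, "frequency =", round(n / total, 3))
```

Raising an error on unphased heterozygous calls is one possible design choice; a tool could also report such samples as ambiguous instead of failing.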
vcf genotypes phasing haplotypes • 2.0k views
3.3 years ago
4galaxy77 2.8k

You could try using plink to convert to the Oxford haps format https://www.cog-genomics.org/plink/2.0/formats#haps - it looks more or less like what you need.

I am unsure exactly how accurate the haplotype calling from GATK is with short reads. If you need accurate haplotypes across the whole genome, it might be worth looking at statistical phasing using e.g. shapeit, although it depends on what kind of samples you have.


I have Illumina 150 bp sequencing data for multiple samples. I am looking to leverage the phase information from the sequencing data to determine the haplotypes for small regions of interest. Regions can be as small as 100 bp or 150 bp, so even within the Illumina read length.
