Question: Getting Haplotypes from a VCF file
gravatar for gkuffel22
12 months ago by
United States
gkuffel2270 wrote:

Hi everyone,

I am analying a data set for a user interested in looking at variants in the MHC gene of coyotes. I have used Freebayes to generate a VCF file. The user would now like haploytypes, does anyone know how to decipher that info from a file such as this vcf:


Hi again everyone,

Ram has asked that I provide more clarity on the purpose of of my exercise. So here goes:

The user is looking at SNPs that are linked on a single functional gene in the coyote genome and would like to decipher haploytpes of these SNPs because different haploytypes are indicative of different levels of resistance to things such as Mange.

If there are 4 SNPs across the amplicon my user needs to know how these SNPs are inherited in an allele form, meaning which SNPs were on the same strand of DNA.

freebayes haplotypes vcf • 1.1k views
ADD COMMENTlink modified 10 months ago by Biostar ♦♦ 20 • written 12 months ago by gkuffel2270

Ummm, why did you give us a spreadsheet of a VCF file?

ADD REPLYlink written 12 months ago by RamRS20k

Ummm, because any VCF file can be viewed as a csv or xls file. I provided this format to allow for easy viewing. If it would help anyone to have the actual VCF file I can easily provide that upon request. I wanted to know if I can get haplotypes from the data presented in the spreadsheet view of the VCF file.

I believe I found my own answer. I need to re-run FreeBayes and enable the parameter -E --max-complex-gap. Otherwise Freebayes only calculates haplotypes within a 3 bp distance. I am trying to calculate haploypes over a 250 bp amplicon.

ADD REPLYlink written 12 months ago by gkuffel2270

My question was, people here know what a VCF file looks like. You could have provided a code block or a gist if you wished to exhibit an excerpt. Why use a spreadsheet?

Also, "any VCF can be viewed as a spreadsheet": Nope, don't do that. Excel is not meant to handle that data volume or that kind of data.

ADD REPLYlink modified 12 months ago • written 12 months ago by RamRS20k

So I should paste like this for next time?

#CHROM  POS     ID  REF     ALT     QUAL    FILTER  INFO    FORMAT  10  110_S24_L001    111_S25_L001    115_S26_L001    118_S27_L001    11_S5_L001  120_S28_L001    121_S29_L001    123_S30_L001    124_S31_L001    12$
gi|1209890|gb|U47338.1|CFMHCDRB02   7   .   C   G   0   .   AB=0.223249;ABP=867.188;AC=63;AF=0.203226;AN=310;AO=386;CIGAR=1X;DP=15213;DPB=15213;DPRA=3.99182;EPP=815.344;EPPR=26084.5;GTI=16;LEN=1;MEA$
gi|1209890|gb|U47338.1|CFMHCDRB02   28  .   TTGGA   GTGAG,GTGAA     9.25207e+06     .   AB=0.415709,0.037858;ABP=75093.2,2.2572e+06;AC=51,4;AF=0.132812,0.0104167;AN=384;AO=530334,47952;CIGAR=1X2M2X,1X2M1X1M;DP=$
ADD REPLYlink modified 12 months ago by RamRS20k • written 12 months ago by gkuffel2270

Do you have any information on my actual question about haplotypes?

ADD REPLYlink written 12 months ago by gkuffel2270

Unfortunately, no. Maybe if you could explain the purpose behind this exercise, we could get some clarity.

ADD REPLYlink written 12 months ago by RamRS20k

Presumably the customer wants a FASTA file with two sequences per individual (i.e., the haplotypes)? Am I right? And then later a GENEPOP file input or something like that?

There are two options:

  1. You could phase the VCF file with a tool such as BEAGLE ( then use bcftools consensus -H 1 (for the haplotype 1's allele) and bcftools consensus -H 2 (for haplotype 2's allele) and output as FASTA for each allele. You could also use ReadBackedPhasing from GATK ( note that it does not support indels such as the ones you have in your example VCF output.

  2. One could also directly output the FASTA sequence with ambiguity codes with bcftools consensus -I option then use a program such as PHASE implemented in DnaSP ( to determine the alleles for each haplotype. If you have reference alleles, you can also give those to PHASE to possibly improve the phasing.

Note that I've had issues with strange haplotypes sequences when I tried option 1, so I'd recommend option 2.

If you use PGDSpider (, it can take the input as FASTA files from either method 1 or 2 from above and it will tell you the allele numbers for the individuals in GENEPOP format if you convert FASTA to GENEPOP. Granted it won't know that a sequence such as


belong to the same individual, so you will have to manually convert the GENEPOP output from

Individual1-1 001
Individual1-2 001


Individual1 001001

You could also write a regular expression to convert the default GENEPOP output of PGDSpider to one line per individual.

ADD REPLYlink modified 10 months ago • written 10 months ago by jean.elbers510

Almost. I've corrected it so it's better. You can use the formatting bar (especially the code option) to present your post better. Formatting bar

ADD REPLYlink written 12 months ago by RamRS20k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 897 users visited in the last hour