Question: Getting Haplotypes from a VCF file
1
gravatar for gkuffel22
9 months ago by
gkuffel2250
United States
gkuffel2250 wrote:

Hi everyone,

I am analying a data set for a user interested in looking at variants in the MHC gene of coyotes. I have used Freebayes to generate a VCF file. The user would now like haploytypes, does anyone know how to decipher that info from a file such as this vcf:

https://docs.google.com/spreadsheets/d/1rH-6Q8559C70QeINoOOwCj8SCRA2U-AYr9NSb-Qefs4/edit?usp=sharing

UPDATE:

Hi again everyone,

Ram has asked that I provide more clarity on the purpose of of my exercise. So here goes:

The user is looking at SNPs that are linked on a single functional gene in the coyote genome and would like to decipher haploytpes of these SNPs because different haploytypes are indicative of different levels of resistance to things such as Mange.

If there are 4 SNPs across the amplicon my user needs to know how these SNPs are inherited in an allele form, meaning which SNPs were on the same strand of DNA.

freebayes haplotypes vcf • 813 views
ADD COMMENTlink modified 7 months ago by Biostar ♦♦ 20 • written 9 months ago by gkuffel2250

Ummm, why did you give us a spreadsheet of a VCF file?

ADD REPLYlink written 9 months ago by RamRS19k

Ummm, because any VCF file can be viewed as a csv or xls file. I provided this format to allow for easy viewing. If it would help anyone to have the actual VCF file I can easily provide that upon request. I wanted to know if I can get haplotypes from the data presented in the spreadsheet view of the VCF file.

I believe I found my own answer. I need to re-run FreeBayes and enable the parameter -E --max-complex-gap. Otherwise Freebayes only calculates haplotypes within a 3 bp distance. I am trying to calculate haploypes over a 250 bp amplicon.

ADD REPLYlink written 9 months ago by gkuffel2250

My question was, people here know what a VCF file looks like. You could have provided a code block or a gist if you wished to exhibit an excerpt. Why use a spreadsheet?

Also, "any VCF can be viewed as a spreadsheet": Nope, don't do that. Excel is not meant to handle that data volume or that kind of data.

ADD REPLYlink modified 9 months ago • written 9 months ago by RamRS19k

So I should paste like this for next time?

#CHROM  POS     ID  REF     ALT     QUAL    FILTER  INFO    FORMAT  10  110_S24_L001    111_S25_L001    115_S26_L001    118_S27_L001    11_S5_L001  120_S28_L001    121_S29_L001    123_S30_L001    124_S31_L001    12$
gi|1209890|gb|U47338.1|CFMHCDRB02   7   .   C   G   0   .   AB=0.223249;ABP=867.188;AC=63;AF=0.203226;AN=310;AO=386;CIGAR=1X;DP=15213;DPB=15213;DPRA=3.99182;EPP=815.344;EPPR=26084.5;GTI=16;LEN=1;MEA$
gi|1209890|gb|U47338.1|CFMHCDRB02   28  .   TTGGA   GTGAG,GTGAA     9.25207e+06     .   AB=0.415709,0.037858;ABP=75093.2,2.2572e+06;AC=51,4;AF=0.132812,0.0104167;AN=384;AO=530334,47952;CIGAR=1X2M2X,1X2M1X1M;DP=$
ADD REPLYlink modified 9 months ago by RamRS19k • written 9 months ago by gkuffel2250
1

Do you have any information on my actual question about haplotypes?

ADD REPLYlink written 9 months ago by gkuffel2250

Unfortunately, no. Maybe if you could explain the purpose behind this exercise, we could get some clarity.

ADD REPLYlink written 9 months ago by RamRS19k

Presumably the customer wants a FASTA file with two sequences per individual (i.e., the haplotypes)? Am I right? And then later a GENEPOP file input or something like that?

There are two options:

  1. You could phase the VCF file with a tool such as BEAGLE (https://faculty.washington.edu/browning/beagle/beagle.html) then use bcftools consensus -H 1 (for the haplotype 1's allele) and bcftools consensus -H 2 (for haplotype 2's allele) and output as FASTA for each allele. You could also use ReadBackedPhasing from GATK (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_phasing_ReadBackedPhasing.php) note that it does not support indels such as the ones you have in your example VCF output.

  2. One could also directly output the FASTA sequence with ambiguity codes with bcftools consensus -I option then use a program such as PHASE implemented in DnaSP (http://www.ub.edu/dnasp/) to determine the alleles for each haplotype. If you have reference alleles, you can also give those to PHASE to possibly improve the phasing.

Note that I've had issues with strange haplotypes sequences when I tried option 1, so I'd recommend option 2.

If you use PGDSpider (http://www.cmpg.unibe.ch/software/PGDSpider/), it can take the input as FASTA files from either method 1 or 2 from above and it will tell you the allele numbers for the individuals in GENEPOP format if you convert FASTA to GENEPOP. Granted it won't know that a sequence such as

>Individual1-1
ACTG
>Individual1-2
ACTG

belong to the same individual, so you will have to manually convert the GENEPOP output from

Individual1-1 001
Individual1-2 001

to

Individual1 001001

You could also write a regular expression to convert the default GENEPOP output of PGDSpider to one line per individual.

ADD REPLYlink modified 7 months ago • written 7 months ago by jean.elbers450

Almost. I've corrected it so it's better. You can use the formatting bar (especially the code option) to present your post better. Formatting bar

ADD REPLYlink written 9 months ago by RamRS19k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2202 users visited in the last hour