How to Get Phasing Status from VCF Files
10.3 years ago
Rubal7 ▴ 820

Hello All,

We are designing a pipeline that will take phased data as input. We will ultimately be using a phased dataset provided by another group, but until it arrives we would like to practice with some phased data. We have data in VCF files that we would like to phase, and the output should also be in VCF format. Can anyone recommend a fast way to phase data that starts, and ultimately ends up, in VCF format? The emphasis here is on speed: we want phased data as quickly as possible to use as dummy data and are not concerned with error rate (just this once). Thank you in advance for your comments.

Best,

Rubal

vcf haplotype genome • 12k views

When your VCF is generated by GATK, phasing is encoded in the 1|0, 0|1 format.


In what format will they supply the phased data? Are you sure it is VCF? I was under the impression that VCF does not maintain phased data (alleles are swappable, with no assurance that their order is maintained).


No, VCF maintains the phase. If the two alleles of a genotype are separated by a pipe (e.g. 0|1), the genotype is phased; if they are separated by a slash (e.g. 0/1), it is unphased. http://www.1000genomes.org/node/101


I changed the title of your question because I understood that you are asking about how to get phasing data from vcf files. Please correct it if I am wrong.


I actually meant: how do I phase unphased data that is in VCF format? Sorry, I was away from this post for a while, but I am still interested in an answer.


I found this description to be the most helpful for understanding how phasing information is represented in a VCF file: http://gatkforums.broadinstitute.org/gatk/discussion/45/purpose-and-operation-of-read-backed-phasing

It has nice intuitive examples of what the file actually looks like for phased and unphased variants.

10.3 years ago

In VCF files, if the two alleles of a genotype are separated by a pipe (e.g. 0|1), the genotype is phased; if they are separated by a slash (e.g. 0/1), it is unphased. http://www.1000genomes.org/node/101

For example:

#CHROM POS ID  REF ALT QUAL FILTER INFO FORMAT      NA00001        NA00002
20     14  rs1 G   A   9    PASS   ...  GT:GQ:DP:HQ 0|0:48:1:51,51 1/0:48:8:51,51
20     17  rs2 T   A   3    q10    ...  GT:GQ:DP:HQ 0|0:49:3:58,50 0/1:3:5:65,3
20     20  rs3 A   G   67   PASS   ...  GT:GQ:DP:HQ 1|0:21:6:23,27 0/1:2:0:18,2


The first individual (column NA00001) has phased data, because the alleles in each genotype are separated by a "|"; the second (NA00002) is unphased, because its alleles are separated by a "/".
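To check this across a whole file rather than by eye, something like the following Python sketch could count phased and unphased genotypes per sample. It only inspects the separator in the GT subfield, as described above. The function names `gt_status` and `phasing_by_sample` are made up for this illustration, and the sketch assumes a tab-delimited, uncompressed VCF.

```python
from collections import Counter


def gt_status(gt):
    """Classify one GT string by its allele separator."""
    if "|" in gt:
        return "phased"
    if "/" in gt:
        return "unphased"
    return "other"  # haploid or missing call, e.g. "1" or "."


def phasing_by_sample(lines):
    """Count phased/unphased genotypes per sample (hypothetical helper).

    `lines` is any iterable of VCF lines, e.g. an open file handle.
    """
    samples = []
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            counts = {s: Counter() for s in samples}
            continue
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # no genotype field at this site
        gt_idx = fmt.index("GT")
        for sample, col in zip(samples, fields[9:]):
            gt = col.split(":")[gt_idx]
            counts[sample][gt_status(gt)] += 1
    return counts
```

With a plain-text file you could call it as `phasing_by_sample(open("input.vcf"))`, where `input.vcf` is a placeholder name; a sample whose "unphased" count is zero is phased at every site that has a genotype.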

You can also use the --phased option in vcftools to keep only fully phased data. Note that it filters by site rather than by individual: it excludes every site that contains an unphased genotype (see http://vcftools.sourceforge.net/options.html ).


I have mapped WGS data on hand and am looking for a method that turns unphased VCF files into phased ones. Could you please suggest a method for phasing an unphased VCF?


If the question is "How do I phase a VCF file?", there are many tools for this, such as SHAPEIT, Eagle, and Beagle.
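If all you need is dummy input to exercise a pipeline and, as in the original question, the error rate does not matter, a much cruder shortcut is to rewrite the genotype separators. The Python sketch below does only that: it does not infer haplotypes, so the result is phased in syntax only and must not be used for real analysis. The function name `fake_phase_line` is made up for this illustration, and the sketch assumes a tab-delimited, uncompressed VCF.

```python
def fake_phase_line(line):
    """Turn '/' into '|' in the GT subfield of every sample column.

    WARNING: this only changes the notation. The allele order is kept as-is,
    so the "phase" is arbitrary; it is meant for dummy test data only.
    """
    line = line.rstrip("\n")
    if line.startswith("#"):
        return line  # header and meta-information lines are left untouched
    fields = line.split("\t")
    if len(fields) < 10:
        return line  # no FORMAT/sample columns on this line
    fmt = fields[8].split(":")
    if "GT" not in fmt:
        return line  # no genotype field at this site
    gt_idx = fmt.index("GT")
    for i in range(9, len(fields)):
        parts = fields[i].split(":")
        if gt_idx < len(parts):
            parts[gt_idx] = parts[gt_idx].replace("/", "|")
            fields[i] = ":".join(parts)
    return "\t".join(fields)
```

Applied line by line, for example `print(fake_phase_line(line))` for each line of an input file, this yields a VCF in which every diploid genotype uses "|". For anything beyond pipeline testing, use one of the phasing tools named above.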