How To Get Phasing Status From VCF Files
11.8 years ago
Rubal7 ▴ 830

Hello All,

We are designing a pipeline that takes phased data as input. We will ultimately use a phased dataset provided by another group, but until it arrives we would like to practice with some phased data. We have unphased data in VCF files that we would like to phase, with the output also in VCF format. Can anyone recommend a fast way to do this? The emphasis is on speed: we want phased data as quickly as possible to use as dummy data and, just this once, are not concerned with error rate. Thank you in advance for your comments.

Best,

Rubal

vcf haplotype genome • 15k views
When your VCF is generated by GATK, phasing is encoded in the 1|0, 0|1 format.

See: http://www.broadinstitute.org/gsa/wiki/index.php/Read-backed_phasing_algorithm

In what format will they supply the phased data? Are you sure it is VCF? I was under the impression that VCF does not maintain phase (alleles are swappable, with no assurance that their order is preserved).

No, VCF maintains the phase. If the two alleles of a genotype are separated by a pipe (e.g. 0|1), the genotype is phased; if they are separated by a slash (e.g. 0/1), it is unphased. http://www.1000genomes.org/node/101

I changed the title of your question because I understood that you are asking about how to get phasing data from vcf files. Please correct it if I am wrong.

I actually meant: how do I phase unphased data that is in VCF format? Sorry, I was away from this post for a while, but I am still interested in an answer.

I found this description to be the most helpful for understanding how phasing information is represented in a VCF file: http://gatkforums.broadinstitute.org/gatk/discussion/45/purpose-and-operation-of-read-backed-phasing

It has nice intuitive examples of what the file actually looks like for phased and unphased variants.

11.8 years ago

In VCF files, if the two alleles of a genotype are separated by a pipe (e.g. 0|1), the genotype is phased; if they are separated by a slash (e.g. 0/1), it is unphased. http://www.1000genomes.org/node/101

For example:

#CHROM POS ID  REF ALT QUAL FILTER INFO FORMAT      NA00001        NA00002
20     14  rs1 G   A   9    PASS   ...  GT:GQ:DP:HQ 0|0:48:1:51,51 1/0:48:8:51,51
20     17  rs2 T   A   3    q10    ...  GT:GQ:DP:HQ 0|0:49:3:58,50 0/1:3:5:65,3
20     20  rs3 A   G   67   PASS   ...  GT:GQ:DP:HQ 1|0:21:6:23,27 0/1:2:0:18,2

The first individual (column NA00001) has phased data, because its alleles are separated by a "|"; the second (NA00002) is unphased.
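The same check can be done programmatically. Below is a minimal Python sketch, not taken from any particular library: the function names `gt_is_phased` and `phasing_by_sample` are made up for illustration, and the code assumes a FORMAT column that contains a GT key. It splits on any whitespace so that it works on the space-aligned example above; real VCF files are tab-separated.

```python
def gt_is_phased(sample_field, gt_index=0):
    """Return True for a phased GT, False for an unphased one,
    and None when there is no separator (haploid or bare missing call)."""
    gt = sample_field.split(":")[gt_index]
    if "/" in gt:
        return False
    if "|" in gt:
        return True
    return None


def phasing_by_sample(header_line, record_line):
    """Map each sample name to the phasing status of its genotype at one site."""
    names = header_line.split()[9:]          # sample columns start at column 10
    fields = record_line.split()
    gt_index = fields[8].split(":").index("GT")  # position of GT within FORMAT
    return {name: gt_is_phased(field, gt_index)
            for name, field in zip(names, fields[9:])}


header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002"
record = "20 14 rs1 G A 9 PASS . GT:GQ:DP:HQ 0|0:48:1:51,51 1/0:48:8:51,51"
print(phasing_by_sample(header, record))
# {'NA00001': True, 'NA00002': False}
```

Treating any genotype that contains a "/" as unphased is a deliberate simplification; it also classifies partially phased polyploid calls as unphased.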

You can also use the --phased option in vcftools to keep only the phased data (see http://vcftools.sourceforge.net/options.html for exactly what is excluded).
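If vcftools is not at hand, a similar filter can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not vcftools' implementation, and its behaviour may differ from the real option: it keeps header lines plus those sites where no sample has a "/" in its GT field. The names `site_is_fully_phased` and `keep_phased_sites` are hypothetical, and the input is assumed to be tab-separated VCF text.

```python
def site_is_fully_phased(record_line):
    """True if no sample at this site has an unphased ('/') genotype."""
    fields = record_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return False                      # no genotype columns to inspect
    fmt = fields[8].split(":")
    if "GT" not in fmt:
        return False                      # site carries no genotypes at all
    gt_index = fmt.index("GT")
    for sample in fields[9:]:
        parts = sample.split(":")
        if gt_index < len(parts) and "/" in parts[gt_index]:
            return False
    return True


def keep_phased_sites(lines):
    """Yield header lines and records whose genotypes are all phased."""
    for line in lines:
        if line.startswith("#") or site_is_fully_phased(line):
            yield line


example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002",
    "20\t14\trs1\tG\tA\t9\tPASS\t.\tGT\t0|0\t1/0",
    "20\t17\trs2\tT\tA\t3\tq10\t.\tGT\t0|0\t0|1",
]
kept = list(keep_phased_sites(example))
```

Here `kept` holds the two header lines and the rs2 record; the rs1 record is dropped because NA00002 is unphased at that site.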

I have mapped WGS data on hand and am looking for a method that turns unphased VCF files into phased ones. Could you please suggest a way to phase an unphased VCF?

Was this ever answered?

If the question is "How to phase a vcf file", there are lots of tools to do this - SHAPEIT, Eagle, Beagle, etc.
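For the narrower need in the original question (dummy phased data as quickly as possible, with no concern for error rate), a placeholder can be produced without any phasing tool by rewriting the genotype separator. The sketch below is NOT real phasing: it simply declares the existing allele order to be the phase, so the result is only suitable for exercising a pipeline. The function name `fake_phase_line` is made up for illustration, and the input is assumed to be tab-separated VCF text.

```python
def fake_phase_line(line):
    """Rewrite '/' as '|' in every GT field of one VCF line.

    WARNING: this only changes the notation. The allele order is arbitrary,
    so the output is dummy data, not statistically phased haplotypes.
    """
    if line.startswith("#"):
        return line                        # header lines pass through unchanged
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return line                        # no genotype columns
    fmt = fields[8].split(":")
    if "GT" not in fmt:
        return line
    gt_index = fmt.index("GT")
    for i in range(9, len(fields)):
        parts = fields[i].split(":")
        if gt_index < len(parts):          # trailing sample fields may be omitted
            parts[gt_index] = parts[gt_index].replace("/", "|")
            fields[i] = ":".join(parts)
    return "\t".join(fields) + newline


record = "20\t14\trs1\tG\tA\t9\tPASS\t.\tGT:GQ\t1/0:48\t0|0:48\n"
print(fake_phase_line(record), end="")
```

Only the GT subfield is touched, so a "/" that happens to appear in another column is left alone. For data that needs to be correct, use one of the tools named above instead.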