Question: VCF format for phased data
4.3 years ago by
nschaefer20
nschaefer20 wrote:

Despite the detailed explanation of VCF format on the 1000Genomes site, it is still not clear to me how the data should be interpreted with respect to sample results.

CHROM  POS      ID         REF  ALT  QUAL  FILTER  INFO                     FORMAT       NA00002
20     14370    rs6054257  G    A    29    PASS    NS=3;DP=14;AF=0.5;DB;H2  GT:GQ:DP:HQ  1|0:48:8:51,51
20     1230237  .          T    C    47    PASS    NS=3;DP=13;AA=T          GT:GQ:DP:HQ  0|1:3:5:65,3

For individual NA00002, the vertical bar in the genotype field indicates that the data is phased. But is there any significance as to which side of the bar each digit occurs on?

E.g. for position 14370, does the first digit "1" in "1|0" (>A) relate to a particular parent, mother or father, and does the second digit to the right of the bar, "0" (>G), indicate the base from the other parent? Similarly, at position 1230237 the first digit is "0" (>T) and the second digit, to the right of the bar, is "1" (>C).

If so, then the left chromosome will read AT and the right chromosome GC. Correct? Or is it impossible to tell from the order of the alleles with respect to the vertical bar?

Thank you in advance.

4.3 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

The "|" just indicates that the genotype call is part of a phased block. In general, it does not mean the allele is from the mother or the father (though it's possible to know that for trios). It just means that, for variants in the same block, you can infer which alleles sit together on the same chromosome copy.

So, in VCF, a block starts with a "/" genotype and continues for as long as the following lines are "|". So:

REF  ALT  GT1  GT2
A    T    0/1  0/1
C    G    1|0  0|1
G    A    1|0  1|0
G    T    1|0  0|1

This would give haplotypes AGAT/TCGG for sample 1 and ACAG/TGGT for sample 2. But we don't know which parent each haplotype came from.
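
For anyone who wants to check those haplotypes, here is a minimal Python sketch. It assumes diploid, biallelic sites that all belong to one block, and the `ROWS` list and `haplotypes` helper are names made up for this illustration, not part of any VCF library.

```python
# Each row: (REF, ALT, GT for sample 1, GT for sample 2), copied from the table above.
ROWS = [
    ("A", "T", "0/1", "0/1"),
    ("C", "G", "1|0", "0|1"),
    ("G", "A", "1|0", "1|0"),
    ("G", "T", "1|0", "0|1"),
]


def haplotypes(rows, sample_index):
    """Return (left, right) haplotype strings for one sample.

    Assumption: every site is diploid and biallelic and sits in a single
    block, so the allele order in each GT is used exactly as written.
    """
    left, right = [], []
    for ref, alt, *gts in rows:
        alleles = (ref, alt)  # index 0 = REF, index 1 = ALT
        first, second = gts[sample_index].replace("|", "/").split("/")
        left.append(alleles[int(first)])
        right.append(alleles[int(second)])
    return "".join(left), "".join(right)


print(haplotypes(ROWS, 0))  # ('AGAT', 'TCGG')
print(haplotypes(ROWS, 1))  # ('ACAG', 'TGGT')
```

Which of the two strings is maternal and which is paternal is still unknown; the sketch only shows which alleles travel together.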


From what you're saying, the position of the value with respect to "|" is significant. So in your example, for sample 1, the values to the left of "|" at the last three positions are ALT and on the same chromosome, whilst the values to the right of "|" are REF and on the other chromosome. However, as the first position is not phased, is it possible to associate either of its alleles with those below? I.e., doesn't the block (chromosome segment) actually start with line 2?

From 1000Genomes: "The meanings of the separators are as follows (see the PS field below for more details on incorporating phasing information into the genotypes):

  • / : genotype unphased
  • | : genotype phased"
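
As a rough illustration of how the separator can be read in practice, the sketch below pairs the FORMAT keys with one sample column and checks whether the GT entry uses "|" or "/". The helper names `parse_sample` and `bases` are hypothetical, and it assumes a call with no missing alleles; the example values are the NA00002 column at position 14370 from the question.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's values and decode the GT entry."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    gt = values["GT"]
    phased = "|" in gt  # "|" = phased, "/" = unphased
    alleles = gt.split("|" if phased else "/")
    return {"phased": phased, "alleles": alleles, "fields": values}


def bases(ref, alt, alleles):
    """Translate allele indices (0 = REF, 1 and up = ALT) into bases."""
    choices = [ref] + alt.split(",")
    # Assumption: no missing "." alleles in the genotype.
    return [choices[int(a)] for a in alleles]


call = parse_sample("GT:GQ:DP:HQ", "1|0:48:8:51,51")
print(call["phased"], bases("G", "A", call["alleles"]))  # True ['A', 'G']
```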

I did try to sort this out myself by looking at the gene NPC1 on chr18, but in all cases of supposed family trios the child had been redacted, so it was not possible to check the phased formatting.

Thanks brentp.

written 4.3 years ago by nschaefer

"/" indicates that the genotype is not phased with anything before it.

"|" indicates that it is phased with (at least) the line before it.

So a block starts with "/" and ends one line before the next "/".

So if all you have are unphased "/" genotypes, each line is the start and end of its own block.

So, to answer your first question: yes, you can tell that all 4 variants, even the first, are phased together.
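
The block rule described here can be sketched in a few lines of Python. This only encodes the convention as stated in this thread (a "/" call opens a block, a "|" call extends it); `phase_blocks` is a hypothetical helper, and for files that carry a PS (phase set) field the blocks should be read from that field instead.

```python
def phase_blocks(genotypes):
    """Group consecutive GT strings for one sample into blocks.

    Convention assumed (from this thread, not from a library):
    "/" opens a new block, "|" extends the block opened earlier.
    """
    blocks = []
    for gt in genotypes:
        if "/" in gt or not blocks:
            blocks.append([gt])    # unphased call (or very first line): new block
        else:
            blocks[-1].append(gt)  # phased call: same block as the line before
    return blocks


print(phase_blocks(["0/1", "1|0", "1|0", "1|0"]))
# [['0/1', '1|0', '1|0', '1|0']]
print(phase_blocks(["0/1", "1|0", "0/1", "0/1", "0|1"]))
# [['0/1', '1|0'], ['0/1'], ['0/1', '0|1']]
```

The first call is sample 1 from the example table, which comes out as a single block of four variants.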

written 4.3 years ago by brentp

OK, so a "|" genotype is phased with the line(s) before it rather than after it. I wish that had been made explicit on the 1000Genomes page.

Thanks again

written 4.3 years ago by nschaefer
4.3 years ago by
Chris Evelo9.9k
Maastricht, The Netherlands
Chris Evelo9.9k wrote:

This is not really my field of expertise, but you might want to read this: http://www.nature.com/nrg/journal/v12/n10/full/nrg3054.html


Thanks, Chris, for the link. Whilst it may not have answered the specific question, it was very interesting for my broader goals, viz. imputing/interpolating values for all alleles in my 23andMe phased results (1M positions) using larger databases such as 1000Genomes and beyond.

written 4.3 years ago by nschaefer