VCF format for phased data
7.8 years ago
nschaefer ▴ 20

Despite the detailed explanation of VCF format on the 1000Genomes site, it is still not clear to me how the data should be interpreted with respect to sample results.

CHROM POS     ID        REF ALT QUAL FILTER INFO                    FORMAT      NA00002
20    14370   rs6054257 G   A   29   PASS   NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 1|0:48:8:51,51
20    1230237 .         T   C   47   PASS   NS=3;DP=13;AA=T         GT:GQ:DP:HQ 0|1:3:5:65,3


For individual NA00002, the vertical bar between the two digits of the genotype (e.g. "1|0") indicates that the data are phased. But is there any significance as to which side of the bar the digits occur?

E.g., for position 14370, does the first digit "1" in "1|0" (= A) relate to a particular parent (mother or father), and does the second digit "0", to the right of the bar (= G), indicate the base from the other parent? Similarly, at position 1230237 the first digit is "0" (= T) and the second digit, to the right of the bar, is "1" (= C).

If so, then the left chromosome will read AT and the right chromosome GC. Correct? Or is it impossible to tell from the order of the alleles with respect to the vertical bar?

1000Genomes phased VCF • 3.0k views
7.8 years ago
brentp 24k

The "|" just indicates that the genotype call is part of a block. In general, it does not mean it is from the mother or father (though it's possible to know that for trios). It just means that the relative origin of the variants in the same block can be inferred.

So, in VCF, a block starts with "/" and continues as long as the following lines are "|" so:

REF ALT GT1 GT2
A T 0/1 0/1
C G 1|0 0|1
G A 1|0 1|0
G T 1|0 0|1


This would give haplotypes AGAT/TCGG for sample 1 and ACAG/TGGT for sample 2. But we don't know which parent those haplotypes came from.
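To make the reading of the table concrete, here is a minimal Python sketch that rebuilds both haplotypes per sample from the four rows above. The function name `haplotypes` and the in-memory `rows` list are just for illustration (not from any VCF library), and it assumes biallelic sites and that the left/right sides are consistent across the whole block, as described in this answer.

```python
# Rows from the example table: (REF, ALT, genotype of sample 1, genotype of sample 2)
rows = [
    ("A", "T", "0/1", "0/1"),
    ("C", "G", "1|0", "0|1"),
    ("G", "A", "1|0", "1|0"),
    ("G", "T", "1|0", "0|1"),
]

def haplotypes(rows, sample_index):
    """Return (left, right) haplotype strings for one sample column.

    Illustrative helper: allele index 0 means REF and 1 means ALT,
    so only biallelic sites are handled.
    """
    left, right = [], []
    for ref, alt, *gts in rows:
        alleles = [ref, alt]
        # Treat "/" and "|" alike here; only the side of the separator matters.
        a, b = gts[sample_index].replace("|", "/").split("/")
        left.append(alleles[int(a)])
        right.append(alleles[int(b)])
    return "".join(left), "".join(right)

print(haplotypes(rows, 0))  # ('AGAT', 'TCGG')
print(haplotypes(rows, 1))  # ('ACAG', 'TGGT')
```

The two strings are only "left" and "right" of the separator; nothing in them says which one is maternal or paternal.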


From what you're saying, the position of the value with respect to | is significant. So in your example, for the last three positions, the values to the left of | are ALT and on the same chromosome, whilst the values to the right of | are REF and on the other chromosome. However, as the first position is not phased, is it possible to associate either allele with those below? I.e., doesn't the block (chromosome segment) actually start with line 2?

From 1000Genomes:

The meanings of the separators are as follows (see the PS field below for more details on incorporating phasing information into the genotypes):

• / : genotype unphased
• | : genotype phased

I did try to sort this out myself looking at the gene NPC1 on chr18 but in all cases of supposed family trios the child had been redacted, so not possible to check phased formatting.

Thanks brentp.


"/" indicates that it is not phased with anything before it.

"|" indicates that it is phased with (at least) the line before it.

So a block starts with "/" and ends 1 line before the next "/".

So if all you have are unphased genotypes ("/"), each line is the start and end of its own block.

So, to answer your first question: yes, you can tell that all 4 variants, even the first, are phased together.
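The block rule above can be sketched in a few lines of Python. This is only an illustration of the convention as described in this thread ("/" opens a block, "|" extends the current one); the function name `phase_blocks` is made up for the example, and files that carry a PS field may define their blocks through that instead.

```python
def phase_blocks(genotypes):
    """Group consecutive genotype strings of one sample into blocks.

    Assumes the convention from this thread: a "/" genotype starts a
    new block, and a "|" genotype belongs to the block of the line
    before it.
    """
    blocks = []
    for gt in genotypes:
        if "|" in gt and blocks:
            blocks[-1].append(gt)   # phased with the previous line
        else:
            blocks.append([gt])     # "/" (or the very first line) opens a block
    return blocks

# Sample 1 from the earlier table: one block covering all four variants.
print(phase_blocks(["0/1", "1|0", "1|0", "1|0"]))
# [['0/1', '1|0', '1|0', '1|0']]

# Only unphased genotypes: every line is its own block.
print(phase_blocks(["0/1", "0/1", "1/1"]))
# [['0/1'], ['0/1'], ['1/1']]
```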


OK, so the phasing is with the line(s) before rather than after the |. I wish that had been made explicit in the 1000Genomes page.

Thanks again

7.8 years ago

This is not really my field of expertise, but you might want to read this


Thanks, Chris, for the link. Whilst it may not have answered the specific question, it was very interesting for my broader goals, namely imputing/interpolating values for all alleles in my 23andMe phased results (1M positions) using larger databases such as 1000Genomes and beyond.