Genotype Likelihood In 1000 Genomes VCF Data
11.0 years ago
jjc ▴ 80

I am trying to understand the per sample genotype data given in the following file from 1000 genomes:


I start by explaining what I see, and then I have two questions at the end.

What I see is a FORMAT column with the entry:


with GT and GL defined as in the VCF definition, with DS being a custom format defined as:

##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage from MaCH/Thunder">

One of the several things that I am not understanding is how the genotype call is being made in some cases.

For example, there are many entries in which the GL values are -0.48,-0.48,-0.48

My understanding is that the read data then cannot distinguish between the reference and the variant allele at such a biallelic site.
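As a quick check on that reading: GL values are log10-scaled likelihoods, and -0.48 is approximately log10(1/3), so a triple of -0.48 puts equal weight on all three genotypes. A minimal shell/awk sketch of the conversion follows; gl_to_probs is a hypothetical helper name, and normalizing the three likelihoods so that they sum to 1 is an assumption made for illustration.

```shell
# Convert a comma-separated GL triple (log10-scaled likelihoods) into
# normalized genotype probabilities.
# gl_to_probs is a hypothetical helper name, not part of any toolkit.
gl_to_probs() {
    printf '%s\n' "$1" | awk -F',' '{
        total = 0
        # Undo the log10 scaling and accumulate the normalizing constant.
        for (i = 1; i <= NF; i++) { p[i] = 10 ^ $i; total += p[i] }
        # Print each likelihood as a fraction of the total.
        for (i = 1; i <= NF; i++)
            printf "%.3f%s", p[i] / total, (i < NF ? "," : "\n")
    }'
}

gl_to_probs "-0.48,-0.48,-0.48"   # prints 0.333,0.333,0.333
```

With all three values equal, the reads carry no preference for 0/0, 0/1 or 1/1.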

However, the DS value does seem to be used. Specifically, to look across the whole file, I run the following awk script:

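BEGIN { FS = "\t" }
# Header line: record the sample ID for each genotype column.
$1 == "#CHROM" { for (i = 10; i <= NF; i++) subject[i] = $i }
# All lines: print the first occurrence of each distinct entry with flat GLs.
{ for (i = 10; i <= NF; i++) {
     if ($i ~ /:-0\.48,-0\.48,-0\.48/) {
         if ( done[$i] == 0 ) {
             done[$i] = 1
             print $i , NR, subject[i]
         }
     }
  }
}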

I run it as follows:

gunzip -c ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz | 
         awk -f gl.awk

The output shows DS values from 0.000 to 2.000 in steps of 0.050. As far as I can tell, this value, together with some other data, allows a genotype call to be made.

What I see is that if the DS value lies between 0.000 and 0.500, then the phased genotype is called as 0|0, e.g.

0|0:0.150:-0.48,-0.48,-0.48     39     HG00140

If the DS value lies between 0.550 and 1.000, then the genotype is called as one of 0|0, 1|0, or 0|1, e.g.

1|0:0.850:-0.48,-0.48,-0.48    34     HG01168
0|0:0.850:-0.48,-0.48,-0.48    34     HG00242
0|1:0.850:-0.48,-0.48,-0.48    34     HG00143

If the DS value is between 1.050 and 1.500 then the genotype is called as one of 1|1, 1|0, or 0|1 e.g.

0|1:1.200:-0.48,-0.48,-0.48 31 NA19108
1|1:1.200:-0.48,-0.48,-0.48 32 HG01069
1|0:1.200:-0.48,-0.48,-0.48 34 NA12383

If the DS value is between 1.550 and 2.000 then the genotype is called as 1|1.
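The pattern described above can be checked more systematically by counting how often each GT occurs at each DS value among the flat-GL entries. This is only a sketch: tabulate_gt_by_ds is a hypothetical helper name, and it assumes every sample entry has the layout GT:DS:GL, as in the lines quoted above.

```shell
# Count (DS, GT) combinations among entries whose GL is -0.48,-0.48,-0.48.
# Reads VCF text on stdin; assumes each sample entry is GT:DS:GL.
# tabulate_gt_by_ds is a hypothetical helper name.
tabulate_gt_by_ds() {
    awk -F'\t' '
        /^#/ { next }                      # skip meta-information and header lines
        {
            for (i = 10; i <= NF; i++) {
                split($i, f, ":")          # f[1]=GT, f[2]=DS, f[3]=GL
                if (f[3] == "-0.48,-0.48,-0.48")
                    count[f[2] "\t" f[1]]++
            }
        }
        # One output line per combination: DS, GT, number of occurrences.
        END { for (k in count) print k "\t" count[k] }
    ' | LC_ALL=C sort
}
```

It could be run as gunzip -c ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz | tabulate_gt_by_ds, which gives one line per DS/GT combination rather than one line per distinct entry.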

My two questions are:

  1. When the DS lies between 0.5 and 1.5, and so does not definitively indicate either 0|0 or 1|1, on what basis is the genotype call made, and by which piece of software?

  2. What is this DS value anyway, and how does it allow the 0|0 and 1|1 calls to be made? "Genotype dosage from MaCH/Thunder" is somewhat cryptic.

11.0 years ago
Dan ▴ 520

Looking here:

Seems like this is related to imputation, i.e. there may not be direct evidence for the call, but a call is imputed. Sorry I can't be more help... Perhaps you can try asking your question on the VCF mailing list?

VCF format discussions, such as clarifications or proposed changes to the spec:

Not sure if there is a 1000 genomes mailing list.

HTH, Dan.


This is correct. Because 1000 Genomes is low coverage, not all positions are covered equally in all individuals, so imputation and haplotype inference are needed to give every individual a genotype at every position.

Dosage isn't necessarily 0, 1 or 2 but can lie anywhere on a continuum, as it reflects how likely each genotype is on the basis of the imputation evidence.
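To make the dosage idea concrete: dosage is commonly defined as the expected number of alternate alleles, P(0/1) + 2 * P(1/1), computed from the genotype probabilities. The sketch below assumes that definition; dosage is a hypothetical helper name, the input probabilities are made up for illustration, and whether MaCH/Thunder derives DS in exactly this way is not confirmed here.

```shell
# Expected alternate-allele count from three genotype probabilities,
# given as "P(0/0),P(0/1),P(1/1)".
# dosage is a hypothetical helper name; the formula is the common
# definition of dosage, assumed here rather than taken from MaCH/Thunder.
dosage() {
    printf '%s\n' "$1" | awk -F',' '{ printf "%.3f\n", $2 + 2 * $3 }'
}

dosage "0.70,0.25,0.05"   # prints 0.350
dosage "0.25,0.50,0.25"   # prints 1.000
```

Under this definition, equal probabilities for all three genotypes would give a dosage of 1.000, so DS values away from 1.000 at sites where the GLs are all equal would have to come from the haplotype-based imputation rather than from the reads.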


Thanks, this does answer the first of my two questions. The calls appear to be made on the basis of haplotype information.

