Question: Genotype Likelihood In 1000 Genomes Vcf Data
gravatar for jjc
7.6 years ago by
Palo Alto
jjc80 wrote:

I am trying to understand the per sample genotype data given in the following file from 1000 genomes:


I start by explaining what I see, and then I have two questions at the end.

What I see is a FORMAT column with the entry:


with GT and GL defined as in the VCF definition, with DS being a custom format defined as:

##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage from MaCH/Thunder">

One of the several things that I am not understanding is how the genotype call is being made in some cases.

For example, there are many entries in which the GL values are -0.48,-0.48,-0.48

My understanding is that then the read data does not allow a call between the reference or the variant allele at such a bialletic.

However, the DS reading seems to be being used. Specifically, looking across the whole file, I run the following awk script:

BEGIN { FS="\t" }
$1=="#CHROM"{ for (i=10;i<=NF;i++) {
{ for (i=10;i<=NF;i++) {
     if ($i ~ /:-0.48,-0.48,-0.48/) {
         if ( done[$i] == 0 ) { 
             done[$i] = 1 
             print $i , NR, subject[i]

This is executed:

gunzip -c ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz | 
         awk -f gl.awk

Shows DS values from 0.000 to 2.000 with granularity of 0.050. As far as I can tell this value, and some other data, seems to allow a genotype call to be made.

What I see, is that if the DS value lies between 0.000 and 0.500 then the phased genotype is called as 0|0, e,g.

0|0:0.150:-0.48,-0.48,-0.48     39     HG00140

If the DS value lies between 0.550 and 1.000 then the genotype is called as one of 0|0, 1|0, or 0|1 e.g

1|0:0.850:-0.48,-0.48,-0.48    34     HG01168
0|0:0.850:-0.48,-0.48,-0.48    34     HG00242
0|1:0.850:-0.48,-0.48,-0.48    34     HG00143

If the DS value is between 1.050 and 1.500 then the genotype is called as one of 1|1, 1|0, or 0|1 e.g.

0|1:1.200:-0.48,-0.48,-0.48 31 NA19108
1|1:1.200:-0.48,-0.48,-0.48 32 HG01069
1|0:1.200:-0.48,-0.48,-0.48 34 NA12383

If the DS value is between 1.550 and 2.000 then the genotype is called as 1|1.

My two questions are:

  1. When the DS lies between 0.5 and 1.5 and it does not indicate definitively either 0|0 or 1|1 on what basis is the genotype call made, and by which piece of software?

  2. What is this DS value anyway, and how does it allow the 0|0 and 1|1 calls to be made? "Genotype dosage from MaCH/Thunder" is somewhat cryptic.

vcf 1000genomes • 3.6k views
ADD COMMENTlink modified 7.6 years ago by Dan520 • written 7.6 years ago by jjc80
gravatar for Dan
7.6 years ago by
Dan520 wrote:

Looking here:

Seems like this is related to imputation, i.e. there may not be direct evidence for the call, but a call is imputed. Sorry I can't be more help... Perhaps you can try asking your question on the VCF mailing list?

VCF format discussions, such as clarifications or proposed changes to the spec:

Not sure if there is a 1000 genomes mailing list.

HTH, Dan.

ADD COMMENTlink written 7.6 years ago by Dan520

This is correct, as the 1000genomes is low coverage all positions do not share equal coverage in all individuals so imputation and haplotype inference is needed to give every individual a genotype are every position

Dosage isn't necessarily 0, 1 or 2 but can be a continum as it shows how likely the genotype is on the basis of the imputation evidence

ADD REPLYlink written 7.6 years ago by Laura1.7k

Thanks, this does answer the first of my two questions. The calls appear to be made on the basis of haplotype information

ADD REPLYlink written 7.6 years ago by jjc80
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1747 users visited in the last hour