Question: Why So Many Missing Genotypes In 1000 Genomes Data?
9.4 years ago by
United States
Paul760 wrote:


I downloaded what I think is the latest set of SNP variant calls for the European samples from the 1000 Genomes Project here:

I used tabix to download the data and vcftools to convert it to a matrix of 0, 1 and 2, with -1 for a missing genotype.
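To make the encoding concrete, here is a minimal sketch of the 0/1/2/-1 coding described above. It is an illustration only, not what vcftools runs internally; the function name `gt_to_012` is hypothetical, and it assumes diploid GT strings such as `0|1` or `./.`.

```python
import re


def gt_to_012(sample_field: str) -> int:
    """Encode one VCF sample field as a count of non-reference alleles.

    Returns 0, 1 or 2 for a called diploid genotype and -1 when any
    allele is missing ('.'). Hypothetical helper for illustration.
    """
    # The GT value is the first colon-separated subfield, e.g. "0|1:35:..."
    gt = sample_field.split(":")[0]
    # Alleles are separated by "/" (unphased) or "|" (phased)
    alleles = re.split(r"[/|]", gt)
    if any(allele in (".", "") for allele in alleles):
        return -1
    # Any allele index other than "0" counts as non-reference
    return sum(1 for allele in alleles if allele != "0")


if __name__ == "__main__":
    for field in ["0|0", "0/1", "1|1", "./.", "0|1:35"]:
        print(field, gt_to_012(field))
```

A site where every sample prints -1 is one where no genotype was recorded at all, which is the pattern discussed in this thread.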

I'm interested in SNPs in the exons of a particular gene. The thing is, of the 13,000 SNPs for which there is data, a huge number have missing values: genotypes are missing for nearly 5,000 of them.

Does anyone have any idea why there are so many missing genotypes in this data?


genome snp • 3.2k views
written 9.4 years ago by Paul760

Are the genotypes missing for all samples at a given site, or are you saying that of the 13,000 SNPs, 5,000 have at least one missing genotype?

written 9.4 years ago by Aaronquinlan11k

Hi, in most cases the genotypes seem to be missing in all samples at a given site.
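The two cases in the question above (missing in every sample versus missing in at least one) can be told apart with a quick count. This is a rough sketch that assumes the matrix has already been loaded as a list of rows, one row per site and one column per sample, with -1 for missing; the function name `summarize_missing` is hypothetical.

```python
from typing import Dict, List


def summarize_missing(matrix: List[List[int]], missing: int = -1) -> Dict[str, int]:
    """Count sites by missingness pattern.

    Assumes one row per site and one column per sample (an assumption
    about the layout; transpose first if rows are samples).
    """
    counts = {"all_missing": 0, "partly_missing": 0, "complete": 0}
    for site in matrix:
        n_missing = sum(1 for genotype in site if genotype == missing)
        if n_missing == 0:
            counts["complete"] += 1
        elif n_missing == len(site):
            counts["all_missing"] += 1
        else:
            counts["partly_missing"] += 1
    return counts


if __name__ == "__main__":
    toy = [
        [0, 1, 2],      # complete
        [-1, -1, -1],   # missing in every sample
        [0, -1, 2],     # missing in one sample only
    ]
    print(summarize_missing(toy))
```

A large `all_missing` count points to sites that never had genotypes assigned, rather than to per-sample calling failures.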

written 9.4 years ago by Paul760
9.4 years ago by
Cambridge UK
Laura1.7k wrote:

The 20100804 data has missing genotypes because of the way it was created.

The set itself was a naive 2-of-4 intersection of 4 input call sets. Only 2 of these 4 sets had genotypes associated with them: Broad and UMich. The Broad genotype set was phased and used LD info, so it was felt to be better. Any SNP with a Broad genotype got that genotype info, a SNP with just a UMich genotype got that info, and any SNP called only by NCBI and Boston College didn't get any genotype.
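The merge rule described above can be sketched roughly as follows. This is only an illustration of the stated logic, not the project's actual pipeline: the function name, the call-set labels, and the reading of "2 of 4" as "called by at least two of the four sets" are all assumptions.

```python
from typing import Dict, Optional, Set, Tuple

# Only these two call sets carried genotypes; Broad is preferred (phased, LD-aware).
GENOTYPE_PRIORITY = ("Broad", "UMich")


def merge_site(
    called_by: Set[str], genotypes: Dict[str, str]
) -> Optional[Tuple[Optional[str], Optional[str]]]:
    """Return (genotype, source) for a kept site, or None if the site is dropped.

    called_by: names of the call sets that reported the site.
    genotypes: genotype string per call set, where one is available.
    """
    # Assumed meaning of the "2 of 4 intersection": at least two sets agree on the site
    if len(called_by) < 2:
        return None
    for source in GENOTYPE_PRIORITY:
        if source in called_by and source in genotypes:
            return genotypes[source], source
    # Site is kept, but no genotype-bearing set called it
    return None, None


if __name__ == "__main__":
    print(merge_site({"Broad", "UMich"}, {"Broad": "0|1", "UMich": "0/1"}))
    print(merge_site({"UMich", "NCBI"}, {"UMich": "1/1"}))
    print(merge_site({"NCBI", "BC"}, {}))
```

The last case is the one that yields a site with no genotype in any sample: it passes the intersection but neither Broad nor UMich called it.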

We have just released a new data set which has much more complete phased genotypes for a larger number of individuals, but we don't have population-level allele frequencies yet.


written 9.4 years ago by Laura1.7k

Hi Laura, any idea when the mitochondrial genotypes will be released for the interim phase1 release (or a newer one, of course)? I'm asking because I found that there's a lot of missing data in the first release, so let's say we only have "true" access to 3% of the whole genotype dataset for MT. Secondly, how should I interpret the biallelic genotype calls for the MT, given that it is haploid (or, if you want to see it otherwise, N-ploid)?

Thanks in advance for your time.

written 9.1 years ago by Fede10

We recently realised that the person who generated the MT genotypes for the pilot used . where they should have used 0, so that makes that data set much more useful. Heterozygous genotypes represent mitochondrial heteroplasmy. We hope to have new MT genotypes before the end of the year, but I can't give a better timeline than that.

written 9.1 years ago by Laura1.7k
9.3 years ago by
User 7433160 wrote:

Hi there, I kind of have the same question. I have downloaded data for a particular gene and there are huge amounts of missing data!

The missing data is not per site (i.e. it is not the case that everyone is missing data for a given SNP). It is totally haphazard. E.g. SNP 1 has data for African Americans and Europeans, whilst SNP 2 has data for Yoruba and Asians but lacks African Americans.

Can anyone explain why this is?

Thanks x

written 9.3 years ago by User 7433160

Don't post a new question in the answer section. Ask your question separately as a new question. Feel free to link to this question as an example. This post will be deleted; we will leave it here a bit so that you can see this comment.

written 9.3 years ago by Istvan Albert ♦♦ 85k