Question

Why So Many Missing Genotypes In 1000 Genomes Data?

3

13.1 years ago

Paul ▴ 760

Hi,

I downloaded what I think is the latest set of SNP variant calls for the European samples from the 1000 Genomes Project here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/EUR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz

I used tabix to download the data and vcftools to convert to a matrix of 0, 1, 2 and -1 for missing genotype.
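For readers unfamiliar with this encoding, here is a minimal Python sketch of how a VCF genotype field can be mapped to 0, 1, 2 or -1. It is not the vcftools implementation; the function names `encode_genotype` and `encode_vcf_line` are hypothetical, and it assumes diploid calls with GT as the first colon-separated subfield of each sample column.

```python
import re
from typing import List


def encode_genotype(sample_field: str) -> int:
    """Return the non-reference allele count (0, 1, 2), or -1 if the call is missing."""
    gt = sample_field.split(":", 1)[0]       # assumption: GT is the first subfield
    alleles = re.split(r"[/|]", gt)          # "/" is unphased, "|" is phased
    if any(a in (".", "") for a in alleles):
        return -1                            # any missing allele makes the call missing
    return sum(1 for a in alleles if a != "0")


def encode_vcf_line(line: str) -> List[int]:
    """Encode every sample column (column 10 onwards) of one VCF data line."""
    cols = line.rstrip("\n").split("\t")
    return [encode_genotype(field) for field in cols[9:]]
```

A site where every sample is `./.` would come out as a row of -1 values in such a matrix.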

I'm interested in SNPs in the exons of a particular gene. The problem is that, of the 13,000 SNPs for which there is data, a huge number have missing values: genotypes are missing for nearly 5,000.

Does anyone have any idea why there are so many missing genotypes in this data?

Thanks!

genome snp • 4.2k views

ADD COMMENT • link updated 13.1 years ago by User 7433 ▴ 170 • written 13.1 years ago by Paul ▴ 760

1

Are the genotypes missing for all samples at a given site, or are you saying that of the 13,000 SNPs, 5,000 have at least one missing genotype?
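One way to tell the two cases apart is to classify each site by how many samples are missing. The sketch below assumes the genotypes are already in a sites-by-samples matrix using -1 for missing, as described in the question; `summarize_missingness` is a hypothetical helper name.

```python
from typing import Dict, List

MISSING = -1  # assumption: missing genotypes are coded as -1, as in the question


def summarize_missingness(matrix: List[List[int]]) -> Dict[str, int]:
    """Count sites with no, some, or all genotypes missing (one row per site)."""
    counts = {"complete": 0, "partly_missing": 0, "all_missing": 0}
    for row in matrix:
        n_missing = sum(1 for g in row if g == MISSING)
        if n_missing == 0:
            counts["complete"] += 1
        elif n_missing == len(row):
            counts["all_missing"] += 1
        else:
            counts["partly_missing"] += 1
    return counts
```

A large `all_missing` count would point to sites that never received genotypes at all, rather than to scattered per-sample dropouts.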

ADD REPLY • link 13.1 years ago by Aaronquinlan 12k

0

Hi, in most cases the genotypes seem to be missing in all samples at a given site.

ADD REPLY • link 13.1 years ago by Paul ▴ 760

score 6 · Answer 1 · 2011-06-24

6

13.1 years ago

Laura ★ 1.8k

The 20100804 data has missing genotypes because of the way it was created.

The set itself was a naive 2-of-4 intersection of four input call sets. Only two of these four sets, Broad and UMich, had genotypes associated with them. The Broad genotype set was phased and used LD information, so it was felt to be better. Any SNP with a Broad genotype therefore got that genotype, a SNP with only a UMich genotype got the UMich genotype, and any SNP called only by NCBI and Boston College didn't get any genotype.
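The precedence rule described above can be sketched roughly as follows. This is an illustration only, not the code used to build the release; the function name `pick_genotype`, the dictionary keys, and the `./.` placeholder for a missing genotype are all assumptions.

```python
from typing import Dict, Optional

# Assumption: only these two call sets carried genotypes, in this order of preference.
GENOTYPE_SOURCES = ("Broad", "UMich")


def pick_genotype(calls: Dict[str, Optional[str]]) -> str:
    """Return the Broad genotype if present, else UMich, else a missing call."""
    for source in GENOTYPE_SOURCES:
        gt = calls.get(source)
        if gt is not None:
            return gt
    return "./."  # site called only by sets without genotypes (NCBI, Boston College)
```

Under this rule, a site supported only by the NCBI and Boston College call sets passes the 2-of-4 intersection but ends up with a missing genotype in every sample.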

We have just released a new data set which has much more complete phased genotypes for a larger number of individuals, but we don't have population-level allele frequencies yet:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20101123/interim_phase1_release/

thanks

ADD COMMENT • link 13.1 years ago by Laura ★ 1.8k

0

Hi Laura, any idea when the mitochondrial genotypes will be released for the interim phase 1 release (or a newer one, of course)? I'm asking because I found a lot of missing data in the first release, so, let's say, we only have "true" access to 3% of the whole genotype dataset for MT. Secondly, how should I interpret the biallelic genotype calls for the MT, given that it is haploid (or, if you prefer, N-ploid)?

Thanks in advance for your time

ADD REPLY • link 12.8 years ago by Fede ▴ 10

0

We recently realised that the person who generated the MT genotypes for the pilot used . where they should have used 0, so that makes that data set much more useful. Heterozygous genotypes represent mitochondrial heteroplasmy. We hope to have new MT genotypes before the end of the year, but I can't give a better timeline than that.
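If you want to apply that reinterpretation yourself, a rough sketch is below. It assumes that every `.` allele in the pilot MT genotypes really stands for the reference allele, which applies only to that data set; `recode_missing_as_ref` is a hypothetical helper name.

```python
import re


def recode_missing_as_ref(gt: str) -> str:
    """Replace '.' alleles with '0', keeping the '/' or '|' separators as they are."""
    parts = re.split(r"([/|])", gt)  # the capturing group keeps the separators
    return "".join("0" if part == "." else part for part in parts)
```

Do not apply this to other call sets, where `.` normally means the genotype is genuinely missing.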

ADD REPLY • link 12.8 years ago by Laura ★ 1.8k

score 0 · Answer 2 · 2011-08-22

0

12.9 years ago

User 7433 ▴ 170

Hi there, I have much the same question. I have downloaded data for a particular gene, and there are huge amounts of missing data!

The missing data is not confined to individual sites (i.e. it is not that everyone is missing data for a given SNP). It is totally haphazard. E.g. SNP 1 has data for African Americans and Europeans, whilst SNP 2 has data for Yoruba and Asians but lacks African Americans.

Can anyone explain why this is?

Thanks x

ADD COMMENT • link 12.9 years ago by User 7433 ▴ 170

0

Don't post a new question in the answer section. Ask your question separately as a new question. Feel free to link to this question as an example. This post will be deleted; we will leave it here a bit so that you can see this comment.

ADD REPLY • link 12.9 years ago by Istvan Albert 101k