The 20100804 data has missing genotypes because of the way it was created.
The set itself was a naive 2 of 4 intersection of 4 input call sets. Only 2 of these 4 sets, Broad and UMich, had genotypes associated with them. The Broad genotype set was phased and used LD info, so it was felt to be the better of the two. Any SNP with a Broad genotype therefore got that genotype info, a SNP with just a UMich genotype got the UMich info, and any SNP called only by NCBI and Boston College didn't get any genotype.
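To make the precedence rule concrete, here is a minimal Python sketch of it. This is not the code that built the release; the function name, the dictionary layout and the genotype strings are all made up for illustration.

```python
from typing import Dict, Optional

# Call sets that carried genotypes, in order of preference (Broad first).
# NCBI and Boston College are absent because they supplied no genotypes.
GENOTYPED_SETS = ("Broad", "UMich")


def assign_genotype(genotypes_by_set: Dict[str, Optional[str]]) -> Optional[str]:
    """Pick the genotype for one sample at one site.

    genotypes_by_set is a hypothetical mapping from call set name to the
    genotype that set reported, or None if it reported nothing.
    """
    for name in GENOTYPED_SETS:
        genotype = genotypes_by_set.get(name)
        if genotype is not None:
            return genotype
    # Site was only called by sets without genotypes: nothing to report.
    return None
```

Under this rule a site called only by NCBI and Boston College passes the 2 of 4 intersection but comes out with no genotype for anyone, which is the pattern of missing data described above.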
We have just released a new data set which has much more complete phased genotypes for a larger number of individuals, but we don't have population-level allele frequencies for it yet.
Hi there, I kind of have the same question. I have downloaded data for a particular gene, and there are huge amounts of missing data!
The missing data is not at the level of individual sites (i.e. everyone missing data for a given SNP). It is totally haphazard. E.g. SNP 1 has data for African Americans and Europeans, whilst SNP 2 has data for Yoruba and Asians but lacks African Americans.
Are the genotypes missing for all samples at a given site, or are you saying that of the 13000 SNPs, 5000 have at least one missing genotype?
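One quick way to tell these two cases apart is to count missing genotypes per site. The sketch below assumes the genotypes have already been read into a dictionary of site ID to a list of per-sample genotype strings, and that missing calls use the usual VCF conventions ("." or "./."). The function name and input layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Genotype strings treated as missing (assumed VCF-style conventions).
MISSING = {".", "./.", ".|."}


def summarize_missingness(
    sites: Dict[str, List[str]]
) -> Tuple[List[str], List[str], List[str]]:
    """Split sites into (all missing, partly missing, complete).

    sites maps a site ID to the genotype strings of every sample at
    that site. A site with an empty sample list counts as all missing.
    """
    all_missing, some_missing, complete = [], [], []
    for site_id, genotypes in sites.items():
        n_missing = sum(gt in MISSING for gt in genotypes)
        if n_missing == len(genotypes):
            all_missing.append(site_id)
        elif n_missing > 0:
            some_missing.append(site_id)
        else:
            complete.append(site_id)
    return all_missing, some_missing, complete
```

If nearly every affected site lands in the first list, the genotypes are missing site by site rather than sample by sample.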
Hi, in most cases the genotypes seem to be missing in all samples at a given site.