I am having a lot of difficulty with a set of case/control exome data (using Plink2 as my main analysis tool).
I am getting many warnings about heterozygous haploid genotypes and nonmissing Y chromosome genotypes in nonmale samples. PCA suggests that a large proportion of the samples are not Caucasian, even though the curators assure us they all are. A very large number of samples also appear to be closely related to each other (pi-hat estimates well above 0.125).
On top of this, a few samples failed the sex check (these were removed before the population stratification and relatedness checks).
Is it likely that all of these issues are caused by missing genotype data? Over two-thirds of the available SNPs had to be removed from the analysis because their missingness was above 15%.
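To make clear what I mean by the 15% missingness filter, here is a minimal Python sketch of the calculation as I understand it (I believe it is equivalent to a `--geno 0.15` style filter in PLINK, but that mapping is my assumption). The genotype matrix, the `None` coding for missing calls and the names `variant_missingness` and `MAX_MISSING` are made up for illustration and are not my real data.

```python
# Hypothetical toy genotype matrix: rows = samples, columns = variants.
# Genotypes are coded 0/1/2 (alternate allele count); None marks a missing call.
genotypes = [
    [0, 1, None, 2],
    [0, None, None, 1],
    [1, 1, None, 0],
    [0, 2, 1, None],
]

MAX_MISSING = 0.15  # the 15% cutoff mentioned above


def variant_missingness(matrix):
    """Return the fraction of missing calls for each variant (column)."""
    n_samples = len(matrix)
    n_variants = len(matrix[0])
    return [
        sum(row[j] is None for row in matrix) / n_samples
        for j in range(n_variants)
    ]


rates = variant_missingness(genotypes)

# Keep only variants whose missing-call rate does not exceed the cutoff.
kept = [j for j, rate in enumerate(rates) if rate <= MAX_MISSING]
removed_fraction = 1 - len(kept) / len(rates)

print("per-variant missingness:", rates)
print("variants kept:", kept)
print("fraction removed:", removed_fraction)
```

In my real data the removed fraction from this kind of filter is over two-thirds, which is what makes me suspicious.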
My personal feeling is that I can't really trust this data set, given everything that is going wrong. There has to be a fundamental reason why every step of this analysis produces so many spurious results!
Any help is immensely appreciated!