When a principal component analysis is done on genome-wide SNP data, how should missing genotypes be handled?
Naively, I can think of two approaches: i) Drop every marker with any missing data, but this loses too much data in a big cohort of samples with relatively random genotyping failures. ii) Set each missing genotype to the average of that marker over the samples where it was called (assuming each marker is coded as 0, 1, 2).
Is approach (ii) reasonable? What would be better approaches?
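
To make approach (ii) concrete, here is a minimal NumPy sketch of what I mean. It assumes a samples-by-markers matrix coded 0/1/2 with missing calls stored as NaN, and that every marker has at least one called genotype. The function names `mean_impute` and `pca_scores` are just illustrative, not taken from any particular package.

```python
import numpy as np

def mean_impute(genotypes):
    """Fill NaN entries with the per-marker mean of the called genotypes."""
    g = np.asarray(genotypes, dtype=float)
    # Column-wise mean that ignores NaN (assumes no marker is entirely missing).
    marker_means = np.nanmean(g, axis=0)
    # Broadcast the marker means into the missing positions only.
    return np.where(np.isnan(g), marker_means, g)

def pca_scores(genotypes, n_components=2):
    """Mean-impute, center each marker, and return sample scores from an SVD."""
    filled = mean_impute(genotypes)
    centered = filled - filled.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy example: 4 samples x 3 markers, NaN marks a failed genotype call.
example = np.array([
    [0.0, 1.0, 2.0],
    [2.0, np.nan, 0.0],
    [1.0, 1.0, np.nan],
    [np.nan, 2.0, 1.0],
])
filled_example = mean_impute(example)
scores_example = pca_scores(example)
```

Because filling with the marker mean leaves that marker's mean unchanged, each imputed entry becomes exactly zero after centering, so a missing genotype contributes nothing to that marker's covariance terms.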