Question: How To Deal With Missing Genotypes In Population PCA Analysis
7.1 years ago, Alex Stoddard (Wisconsin, USA) wrote:

When a principal component analysis is done on genome-wide SNP data, how should missing genotypes be handled?

Naively, I can think of two approaches: (i) drop the markers with any missing data, but this loses too much data with a big cohort of samples and relatively random genotyping failure; (ii) set each missing genotype to the average for that marker across the samples where it is present (assuming each marker is coded as 0, 1, 2).
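For concreteness, approach (ii) could look something like the sketch below. It assumes a samples-by-markers list of lists with genotypes coded 0/1/2 and `None` for a failed call; the function name `mean_impute` and that encoding are illustrative choices, not part of any particular package.

```python
# Rows are samples, columns are markers; genotypes are coded 0/1/2 and
# None marks a failed call (this encoding is an assumption for the sketch).
def mean_impute(genotypes):
    n_markers = len(genotypes[0])
    imputed = [list(row) for row in genotypes]
    for j in range(n_markers):
        observed = [row[j] for row in genotypes if row[j] is not None]
        # Mean of the observed calls at this marker. A marker with no calls
        # at all falls back to 0.0 here; in practice it should be dropped.
        marker_mean = sum(observed) / len(observed) if observed else 0.0
        for row in imputed:
            if row[j] is None:
                row[j] = marker_mean
    return imputed


genotypes = [
    [0, 1, None],
    [2, None, 1],
    [1, 1, 2],
]
imputed = mean_impute(genotypes)
print(imputed)
```

Note that after centering each marker on its mean, a mean-imputed entry becomes zero, so it adds nothing to that marker's variance.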

Is approach (ii) reasonable? What would be better approaches?

population pca genomics • 3.6k views
7.1 years ago, Eugen Buehler wrote:

The process of substituting a reasonable guess for missing data is called imputation and is fairly common practice for large data sets. Packages for performing imputation (using a k-nearest neighbors approach, for example) are available in R. I haven't used any of them recently so I can't comment on which one you should pick.
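To illustrate the k-nearest-neighbours idea, here is a minimal pure-Python sketch. It is not the algorithm of any specific R package. It assumes rows are samples, genotypes are coded 0/1/2 with `None` for missing, and neighbours are ranked by mean squared difference over the markers both samples have; `knn_impute` and `k=2` are hypothetical choices.

```python
def knn_impute(genotypes, k=2):
    """Fill each missing call with the mean call of the k most similar samples."""

    def distance(a, b):
        # Mean squared difference over markers observed in both samples.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sum((x - y) ** 2 for x, y in shared) / len(shared)

    imputed = [list(row) for row in genotypes]
    for i, row in enumerate(genotypes):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other samples with an observed call at marker j.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(genotypes)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [call for _, call in candidates[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


genotypes = [
    [0, 0, 0, None],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [2, 2, 2, 2],
    [2, 2, 1, 2],
]
imputed = knn_impute(genotypes, k=2)
print(imputed[0])
```

In this toy input the first sample resembles the second and third, so its missing call is filled from them rather than from the cohort-wide average.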

7.1 years ago, brentp (Salt Lake City, UT) wrote:

How many markers do you lose if you drop those with any missing data?
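A quick way to answer that is to count the markers with at least one missing call. The sketch below assumes a samples-by-markers list of lists with `None` for a missing genotype; `count_incomplete_markers` is a hypothetical helper name.

```python
def count_incomplete_markers(genotypes):
    """Return (markers with any missing call, total markers)."""
    n_markers = len(genotypes[0])
    incomplete = sum(
        1
        for j in range(n_markers)
        if any(row[j] is None for row in genotypes)
    )
    return incomplete, n_markers


genotypes = [
    [0, 1, None, 2],
    [1, None, 1, 2],
    [2, 1, 0, 2],
]
lost, total = count_incomplete_markers(genotypes)
print(f"{lost} of {total} markers would be dropped")
```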

You can set the missing markers to some value, but you may run into problems if there is bias in the missing data. As @Eugen says, inferring a value from KNN would be better than an average.

There's a very simple-to-use R package that will do the imputation for you using KNN:


Is KNN considered appropriate for genotype data and its typical structure? There is much research effort in doing genotype imputation. I am looking for the simplest thing that could possibly work to get my data into a PCA for a first pass. It sounds like the danger with using the average is that it will be biased when data isn't missing at random. Provided I'm using a lot of markers (1000s+) and each marker has only a small percentage of missing genotypes, do I risk much bias?
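One way to enforce the "small percent missingness per marker" condition before a first-pass PCA is to compute each marker's missing rate and keep only markers under a cutoff. The sketch below does not settle the bias question. It assumes `None` marks a missing call, and the function names and the 25% cutoff in the example are purely illustrative.

```python
def marker_missing_rates(genotypes):
    """Fraction of samples with a missing call, per marker (column)."""
    n_samples = len(genotypes)
    n_markers = len(genotypes[0])
    return [
        sum(row[j] is None for row in genotypes) / n_samples
        for j in range(n_markers)
    ]


def keep_markers(genotypes, max_missing=0.05):
    """Drop markers whose missing rate exceeds max_missing."""
    rates = marker_missing_rates(genotypes)
    kept = [j for j, rate in enumerate(rates) if rate <= max_missing]
    filtered = [[row[j] for j in kept] for row in genotypes]
    return filtered, kept


genotypes = [
    [0, None, 1],
    [1, None, 1],
    [2, 1, None],
    [0, 2, 1],
]
rates = marker_missing_rates(genotypes)
filtered, kept = keep_markers(genotypes, max_missing=0.25)
print(rates, kept)
```

The remaining missing calls in `filtered` would still need to be imputed (by mean or KNN) before running the PCA.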

Reply written 7.1 years ago by Alex Stoddard
7.1 years ago, zx8754 wrote:


Analyses of a set of 128 ancestry informative single-nucleotide polymorphisms in a global set of 119 population samples
