Removing ancestral outliers in GWAS
4.0 years ago
johnja • 0

I am new to GWAS. I am now at the step where I want to remove cases and controls of non-European ancestry. To do this, I performed principal component analysis (PCA) using plink on the cases and controls for a practice GWAS analysis, and then merged my data with the data on the 11 populations from HapMap3. I am unclear how to proceed from here, and I feel that the many articles I have read assume that the reader already has certain knowledge.

My thoughts on what to do next are to:

Use R to subset the CEU and TSI populations, as they are European.

Find the means and standard deviations of the first two principal component scores.
Choose a threshold value to determine outliers (a certain number of standard deviations away from the mean PC1 and PC2 scores).
Write an R script to produce a file listing the cases and controls to eliminate for non-European ancestry.
Use plink to eliminate those non-European samples.

My question: Is this method correct? I have no idea what value the outlier threshold should be set to.

4.0 years ago

Hello johnja,

Yes, it is quite standard to use PCA to remove samples that lie more than 2 or 3 standard deviations (SDs) from the group mean. You can code this manually by converting the values for a given eigenvector (i.e. principal component) to Z-scores, where Z=1 is 1 SD from the mean, Z=2 is 2 SDs, et cetera. For example, if you know that your suspected outlier is from the British Isles (Republic of Ireland and the United Kingdom of Great Britain and Northern Ireland), then check its Z-score in relation to the GBR (British in England and Scotland) 1000 Genomes EUR samples.
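
As a rough illustration of the manual route, here is a minimal sketch in Python (the same logic ports directly to R). It assumes the PCs are available as whitespace-separated FID, IID, PC1, PC2, ... columns, as in a PLINK .eigenvec file, and that you already know which samples belong to the European reference group. The function names, sample IDs, PC values and the 3-SD cutoff are all made up for illustration; they are not taken from your data.

```python
import statistics

def read_eigenvec(path):
    """Read a PLINK-style .eigenvec file (FID IID PC1 PC2 ...) into a dict."""
    pcs = {}
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            # Skip blank lines and an optional header row (FID or #FID).
            if not fields or fields[0].lstrip("#") == "FID":
                continue
            pcs[(fields[0], fields[1])] = [float(x) for x in fields[2:]]
    return pcs

def z_scores(pcs, reference_ids, n_pcs=2):
    """Z-score every sample on the first n_pcs PCs, using the mean and SD
    of the reference group only (e.g. the European reference samples)."""
    stats = []
    for k in range(n_pcs):
        ref_values = [pcs[s][k] for s in reference_ids]
        stats.append((statistics.mean(ref_values), statistics.stdev(ref_values)))
    return {
        s: [(v[k] - m) / sd for k, (m, sd) in enumerate(stats)]
        for s, v in pcs.items()
    }

def flag_outliers(pcs, reference_ids, study_ids, n_sd=3.0, n_pcs=2):
    """Return study samples more than n_sd SDs from the reference mean
    on any of the first n_pcs PCs."""
    z = z_scores(pcs, reference_ids, n_pcs)
    return [s for s in study_ids if any(abs(x) > n_sd for x in z[s])]

def write_remove_list(samples, path):
    """Write one 'FID IID' pair per line, the format PLINK's --remove expects."""
    with open(path, "w") as fh:
        for fid, iid in samples:
            fh.write(f"{fid}\t{iid}\n")

# Toy data (invented values). Here the FID doubles as the population label,
# which is an assumption; in practice the labels usually come from a
# separate population file for the reference panel.
pcs = {
    ("CEU", "NA1"): [0.010, 0.020],
    ("CEU", "NA2"): [0.012, 0.021],
    ("CEU", "NA3"): [0.011, 0.019],
    ("TSI", "NA4"): [0.009, 0.022],
    ("TSI", "NA5"): [0.013, 0.018],
    ("FAM1", "S1"): [0.011, 0.020],   # sits inside the reference cluster
    ("FAM2", "S2"): [0.050, -0.030],  # far from the reference cluster
}
reference = [s for s in pcs if s[0] in {"CEU", "TSI"}]
study = [s for s in pcs if s[0] not in {"CEU", "TSI"}]

outliers = flag_outliers(pcs, reference, study, n_sd=3.0)
print(outliers)  # [('FAM2', 'S2')]
```

Calling write_remove_list(outliers, "remove_non_european.txt") (a hypothetical file name) would then give you a file that can be passed to plink with --remove. Whether to use 2 or 3 SDs, and how many PCs to check, is a judgment call; plotting PC1 against PC2 with the reference populations coloured is a good sanity check on whatever cutoff you pick.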

PLINK already has an implementation of this through identity-by-state (IBS) clustering, where it also gauges outliers by Z-scores:



