[The next two MDS-based steps were aimed at:
1. removing population outliers (to keep only European-ancestry subjects);
2. obtaining population covariates only for those European-ancestry subjects, to use them later.
In short, the question is: should population outliers be removed based on information from step 2? More details below.]
I have merged some PLINK datasets with the 1000 Genomes reference in order to detect population stratification. After removing non-Europeans, the MDS plot looked like this:
Then, using only the individuals from my sample (labeled "HNPs" in the plot), I obtained a relevant subset from the whole SNP dataset (~300K SNPs), and got an MDS-plot, which looked like this:
The plan is to use the MDS components as covariates at a later stage. However, in the last plot, there are many outliers within the set of European-ancestry subjects. Should those outlying subjects be excluded?
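To make "outlier" concrete, here is a minimal sketch of one rule that could be applied: flag any subject lying more than a chosen number of standard deviations from the sample mean on any MDS component. The function name `flag_mds_outliers`, the threshold `n_sd`, and the toy coordinates are illustrative assumptions, not values from my data and not something PLINK computes for you.

```python
from statistics import mean, stdev


def flag_mds_outliers(components, n_sd=6.0):
    """Return sorted indices of samples that lie more than n_sd standard
    deviations from the mean on at least one MDS component.

    components: list of per-sample tuples, e.g. (C1, C2).
    n_sd: hypothetical cutoff; the right value is a judgment call.
    """
    n_dims = len(components[0])
    flagged = set()
    for d in range(n_dims):
        values = [row[d] for row in components]
        mu, sd = mean(values), stdev(values)
        if sd == 0:
            continue  # no spread on this component, nothing to flag
        for i, v in enumerate(values):
            if abs(v - mu) > n_sd * sd:
                flagged.add(i)
    return sorted(flagged)


# Toy coordinates (made up): 20 tightly clustered subjects plus one
# subject far away on the first component.
demo = [(0.01, 0.01), (-0.01, -0.01)] * 10 + [(1.0, 0.0)]
print(flag_mds_outliers(demo, n_sd=3.0))
```

Note that a single extreme subject inflates the standard deviation it is judged against, so a mean/SD rule like this can be conservative in small samples; whether subjects flagged this way should actually be dropped is exactly what I am asking.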