11 months ago by
Republic of Ireland
This is certainly an interesting area of research, and I have already produced my own ethnicity prediction model using 1000 Genomes data as the training data. I originally developed it to predict the ethnicity of patient samples whose ethnicity was not reported [although one could infer it from the samples' countries of origin]. When tested on these samples of unknown ethnicity, the model showed excellent predictive ability.
I guess that the process could be divided into 4 steps:
Identifying the panel of predictive markers on the training data
In my case, I broadly followed [though not exactly] the tutorial that I previously posted here: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)
That then gave me a panel of a few thousand SNPs. Even at this point, one can easily overlay study samples with the 1000 Genomes samples in order to infer ethnicity, as per this figure:
Build the model
To build the model, I used the PCA rotated component loadings from X eigenvectors / dimensions from the plot above. These were fed into a multinomial logistic regression model with ethnic group as the outcome. I then used stepwise regression to reduce the number of predictors to the minimum set that was most informative according to the Akaike information criterion (AIC).
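To make the AIC comparison concrete, here is a minimal Python sketch (not the R code I used). It only illustrates the criterion that a stepwise search minimises, not the search itself. The predicted probabilities, labels, and parameter counts are made-up values for illustration.

```python
import math

def log_likelihood(probs, labels):
    """Sum of the log predicted probability of each sample's true group."""
    return sum(math.log(p[y]) for p, y in zip(probs, labels))

def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2*logLik (lower is better)."""
    return 2 * n_params - 2 * log_lik

# Hypothetical true groups (0 or 1) for four samples
labels = [0, 0, 1, 1]

# Hypothetical predicted probabilities from two candidate models
probs_small = [[0.90, 0.10], [0.80, 0.20], [0.20, 0.80], [0.10, 0.90]]
probs_large = [[0.91, 0.09], [0.80, 0.20], [0.20, 0.80], [0.10, 0.90]]

# Assumed parameter counts: the larger model fits marginally better
# but pays a penalty for its three extra parameters
aic_small = aic(log_likelihood(probs_small, labels), n_params=2)
aic_large = aic(log_likelihood(probs_large, labels), n_params=5)

best = "small" if aic_small < aic_large else "large"
print(f"AIC small={aic_small:.3f}, large={aic_large:.3f}, keep: {best}")
```

A stepwise procedure repeats this kind of comparison while adding or dropping one predictor at a time, and stops when no single change lowers the AIC any further.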
Gauging the model's predictive potential
To test the model [on the 1000 Genomes training data], one can do X-fold cross-validation using cv.glm() (from the boot package) in R, and also generate ROC curves to gauge AUC, sensitivity, and specificity. My model had an AUC of 98.7%.
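As a rough illustration of what cross-validation and AUC measure, here is a minimal Python sketch; it is not cv.glm() and not my actual model. The data are made-up PC1 scores for a single 'one group versus the rest' comparison, and a simple midpoint-between-class-means rule stands in for the regression model.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean(values):
    return sum(values) / len(values)

# Hypothetical PC1 scores; label 1 = group of interest, 0 = all other groups
x = [-2.1, -1.8, -1.5, -1.2, -0.9, 0.8, 1.1, 1.4, 1.9, 2.2]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

out_of_fold = [None] * len(x)
for fold in k_fold_indices(len(x), k=5):
    held_out = set(fold)
    train = [i for i in range(len(x)) if i not in held_out]
    # "Fit" the stand-in rule on the training folds only
    mean_pos = mean([x[i] for i in train if y[i] == 1])
    mean_neg = mean([x[i] for i in train if y[i] == 0])
    midpoint = (mean_pos + mean_neg) / 2
    direction = 1 if mean_pos > mean_neg else -1
    # Score the held-out fold with the rule it never saw
    for i in fold:
        out_of_fold[i] = direction * (x[i] - midpoint)

cv_auc = auc(out_of_fold, y)
print(f"cross-validated AUC: {cv_auc:.3f}")
```

The key point is that every sample is scored by a rule fitted without it, so the resulting AUC is less optimistic than one computed on the data used for fitting.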
Testing the model on 'real-world' data
By doing something as simple as calling the predict() function in R, one can easily apply the model to samples of unknown ethnicity, and the model will return a value from 0 to 1 (depending on how you execute predict()). In my example, I then output these results by plotting the samples into ethnicity 'strata'. As far as I can recall, the model did not predict an ethnicity that was not expected.
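For anyone wondering where the 0 to 1 values come from, here is a minimal Python sketch of how a multinomial logistic model turns predictors into per-group probabilities. It is not R's predict(), and the coefficients and the sample's PC scores are invented for illustration; only the group labels (1000 Genomes super-population codes) are real.

```python
import math

def predict_probs(features, coefs):
    """Softmax over per-group linear scores (intercept + weighted features)."""
    scores = {
        group: b[0] + sum(w * f for w, f in zip(b[1:], features))
        for group, b in coefs.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {g: math.exp(s - top) for g, s in scores.items()}
    total = sum(exp_scores.values())
    return {g: e / total for g, e in exp_scores.items()}

# Hypothetical coefficients per group: [intercept, PC1 weight, PC2 weight]
coefs = {
    "AFR": [0.0, 2.0, 0.0],
    "EUR": [0.0, -1.0, 1.5],
    "EAS": [0.0, -1.0, -1.5],
}

# Hypothetical PC1 and PC2 scores for one sample of unknown ethnicity
sample = [1.8, 0.1]

probs = predict_probs(sample, coefs)
predicted_group = max(probs, key=probs.get)
print(predicted_group, {g: round(p, 4) for g, p in probs.items()})
```

Each sample gets one probability per group, summing to 1, and the group with the highest probability is the predicted ethnicity. Samples with no clearly dominant probability are the ones worth inspecting by eye.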