11 months ago by

Republic of Ireland

Hi Joe,

This is certainly an interesting area of research and I have actually already produced my own ethnicity predictive model using 1000 Genomes data as the training data. I originally developed it in order to predict the ethnicity of patient samples of unreported ethnicity [one could infer the ethnicity based on the samples' countries of origin, though]. Testing it on these samples of unknown ethnicity gave excellent predictive ability.

I guess that the process could be divided into 4 steps:

# Identifying the panel of predictive markers on the training data

In my case, I generally followed [but not exactly followed] the previous tutorial that I posted here: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)

That then gave me a panel of a few thousand SNPs. Even at this point, one can easily overlay study samples with the 1000 Genomes samples in order to infer ethnicity, as per this figure:

# Build the model

To then build the model, I actually used the PCA rotated component loadings from X eigenvectors / dimensions from the plot above - these were fed into a multinomial logistic regression model with ethnic group as outcome, obviously. I then used stepwise logistic regression in order to reduce the number of predictors to a minimum set that were most informative based on the Akaike information criterion (AIC).

# Gauging the model's predictive potential

To test the model [on the 1000 Genomes training data], one can do X-fold cross validation using `cv.glm()`

in R Programming Language and also generate ROC curves to gauge AUC, sensitivity, and specificity. My model had AUC 98.7%.

# Testing the model on 'real World' data

By doing something simple as using the `predict()`

function in R, one can easily apply the model to samples of unknown ethnicity and the model will return a value from 0-1 (depending on how you execute `predict()`

). In my example, I then output these results by plotting the samples into ethnicity 'strata'. As far as I can recall, the model did not incorrectly predict a ethnicity that was not expected.

Kevin