Hi Joe,

This is certainly an interesting area of research, and I have already produced my own ethnicity prediction model using 1000 Genomes data as the training data. I originally developed it to predict the ethnicity of patient samples whose ethnicity was not reported [one could infer it from the samples' countries of origin, though]. Testing it on these samples of unknown ethnicity gave excellent results.

I guess that the process could be divided into 4 steps:

# Identifying the panel of predictive markers on the training data

In my case, I broadly followed [though not exactly] the tutorial that I previously posted here: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)

That then gave me a panel of a few thousand SNPs. Even at this point, one can easily overlay study samples with the 1000 Genomes samples in order to infer ethnicity, as per this figure:
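The overlay idea can be sketched in base R as follows. This is only a minimal illustration on simulated genotypes, not the 1000 Genomes workflow from the tutorial: the object names (`ref_geno`, `study_geno`, `ref_pop`), the two made-up populations, and the simulated allele frequencies are all assumptions for the sake of the example.

```r
set.seed(1)

# Simulated 0/1/2 genotype matrices (samples x SNPs); hypothetical stand-ins
# for the reference panel and the study samples typed at the same SNPs.
n_snps <- 200
freq_a <- runif(n_snps, 0.05, 0.95)   # allele frequencies, population A
freq_b <- runif(n_snps, 0.05, 0.95)   # allele frequencies, population B

sim_geno <- function(n, freq) {
  t(replicate(n, rbinom(length(freq), size = 2, prob = freq)))
}

ref_geno   <- rbind(sim_geno(50, freq_a), sim_geno(50, freq_b))
ref_pop    <- rep(c("A", "B"), each = 50)
study_geno <- sim_geno(10, freq_a)    # study samples drawn from population A
colnames(ref_geno) <- colnames(study_geno) <- paste0("snp", seq_len(n_snps))

# PCA on the reference panel only
pca <- prcomp(ref_geno, center = TRUE, scale. = FALSE)

# Project the study samples onto the reference principal components
study_pcs <- predict(pca, newdata = study_geno)

# Overlay: reference samples coloured by population, study samples as crosses
plot(pca$x[, 1:2], col = ifelse(ref_pop == "A", "blue", "red"), pch = 16)
points(study_pcs[, 1:2], pch = 4, col = "black")
```

Fitting the PCA on the reference panel and then projecting the study samples keeps the axes defined by the reference populations alone, so the study samples cannot distort them.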

# Build the model

To build the model, I used the PCA rotated component loadings from X eigenvectors / dimensions from the plot above. These were fed into a multinomial logistic regression model with ethnic group as the outcome. I then used stepwise logistic regression to reduce the predictors to the minimal set that was most informative according to the Akaike information criterion (AIC).
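As a rough sketch of the selection step, the code below fits a logistic model on eigenvector scores and prunes it by AIC. Note the simplification: it uses a binary (one group versus the rest) `glm()` rather than the multinomial model described above, purely so that it runs in base R. The data are simulated, and the names `train` and `is_group`, as well as the choice of five PCs, are assumptions.

```r
set.seed(2)

# Hypothetical training table: 5 PC scores per sample plus a known group label
n <- 300
train <- data.frame(
  PC1 = rnorm(n), PC2 = rnorm(n), PC3 = rnorm(n),
  PC4 = rnorm(n), PC5 = rnorm(n)
)

# In this simulation only PC1 and PC2 carry signal; PC3 to PC5 are noise
lin <- 1.5 * train$PC1 - 1.0 * train$PC2
train$is_group <- rbinom(n, size = 1, prob = plogis(lin))

# Full model with every candidate eigenvector as a predictor
full <- glm(is_group ~ PC1 + PC2 + PC3 + PC4 + PC5,
            data = train, family = binomial)

# Backward stepwise selection; step() compares candidate models by AIC
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)   # predictors that survived selection
AIC(full)
AIC(reduced)
```

For more than two groups, the same idea would need a multinomial fit in place of the binomial `glm()`.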

# Gauging the model's predictive potential

To test the model [on the 1000 Genomes training data], one can do X-fold cross-validation using `cv.glm()` (from the boot package) in R, and also generate ROC curves to gauge AUC, sensitivity, and specificity. My model had an AUC of 98.7%.
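A minimal sketch of this evaluation step is shown below, again on simulated data with a binary outcome rather than my actual model. `cv.glm()` comes from the boot package; the AUC is computed from the rank-sum identity to avoid extra packages. The 0.5 cut-off, the 10 folds, and the names `dat` and `is_group` are assumptions.

```r
library(boot)  # provides cv.glm()

set.seed(3)
n <- 300
dat <- data.frame(PC1 = rnorm(n), PC2 = rnorm(n))
dat$is_group <- rbinom(n, 1, plogis(2 * dat$PC1 - dat$PC2))

fit <- glm(is_group ~ PC1 + PC2, data = dat, family = binomial)

# 10-fold cross-validation; the cost is the misclassification rate at 0.5
misclass <- function(y, p) mean(abs(y - p) > 0.5)
cv <- cv.glm(dat, fit, cost = misclass, K = 10)
cv_error <- cv$delta[1]   # raw cross-validated misclassification rate

# Apparent (training-set) AUC via the rank-sum (Mann-Whitney) identity
p   <- fitted(fit)
pos <- dat$is_group == 1
auc <- (sum(rank(p)[pos]) - sum(pos) * (sum(pos) + 1) / 2) /
  (sum(pos) * sum(!pos))

# Sensitivity and specificity at the same 0.5 cut-off
pred <- p > 0.5
sensitivity <- sum(pred & pos) / sum(pos)
specificity <- sum(!pred & !pos) / sum(!pos)

c(cv_error = cv_error, auc = auc,
  sensitivity = sensitivity, specificity = specificity)
```

The AUC here is computed on the same data used for fitting, so it will be optimistic; the cross-validated error is the fairer estimate of the two.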

# Testing the model on 'real-world' data

By doing something as simple as using the `predict()` function in R, one can easily apply the model to samples of unknown ethnicity, and the model will return a value from 0 to 1 (depending on how you execute `predict()`). In my example, I then output these results by plotting the samples into ethnicity 'strata'. As far as I can recall, the model never predicted an ethnicity that was not expected.
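To illustrate what 'depending on how you execute `predict()`' means for a binomial `glm()`, here is a small sketch on simulated data. The sample values, the strata cut-offs (0.25 and 0.75), and the stratum labels are arbitrary assumptions, not the ones I used.

```r
set.seed(4)
n <- 300
train <- data.frame(PC1 = rnorm(n), PC2 = rnorm(n))
train$is_group <- rbinom(n, 1, plogis(2 * train$PC1 - train$PC2))
fit <- glm(is_group ~ PC1 + PC2, data = train, family = binomial)

# Hypothetical study samples, already projected onto the same PCs
new_samples <- data.frame(PC1 = c(-2, 0, 2), PC2 = c(1, 0, -1))

# type = "link" (the default) returns log-odds;
# type = "response" returns probabilities between 0 and 1
prob <- predict(fit, newdata = new_samples, type = "response")

# Bin the probabilities into strata for plotting or reporting
strata <- cut(prob, breaks = c(0, 0.25, 0.75, 1),
              labels = c("unlikely", "ambiguous", "likely"),
              include.lowest = TRUE)

data.frame(new_samples, prob = round(prob, 3), strata)
```

The new samples must carry the same predictor columns as the training table, which in practice means projecting them onto the PCs of the training PCA first.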

Kevin

Newbie here. How is this better than a linear model? SNPs do not look to me like the right data set for ML.

To whom are you addressing this? Why have you posted this as an answer?

If you are addressing me, then I must inform you that I am not using genotypes in the linear model. I am using the eigenvectors from a PCA performed on the genotypes.
