Question: How to predict an individual's ethnicity using HapMap data
11 months ago by
Joe10 wrote:

Hi all,

I want to predict the ethnicity of individuals using a machine learning (ML) method. HapMap genotype data would serve as the training set, and the unknown individuals' genotype data as the test set. As this is my first time doing ethnicity prediction, could anyone give some detailed information (such as methods or typical workflows), or point me to articles that make it easy to follow? Many thanks!


snp ethnicity hapmap • 1.3k views
modified 8 days ago by marco0 • written 11 months ago by Joe10
11 months ago by
Kevin Blighe37k
Republic of Ireland
Kevin Blighe37k wrote:

Hi Joe,

This is certainly an interesting area of research and I have actually already produced my own ethnicity predictive model using 1000 Genomes data as the training data. I originally developed it in order to predict the ethnicity of patient samples of unreported ethnicity [one could infer the ethnicity based on the samples' countries of origin, though]. Testing it on these samples of unknown ethnicity gave excellent predictive ability.

I guess that the process could be divided into 4 steps:

Identifying the panel of predictive markers on the training data

In my case, I generally followed [but not exactly followed] the previous tutorial that I posted here: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)

That then gave me a panel of a few thousand SNPs. Even at this point, one can easily overlay study samples with the 1000 Genomes samples in order to infer ethnicity, as per this figure:

[Figure not preserved: PCA bi-plot of the 1000 Genomes samples with the study samples overlaid]

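The overlay idea can be sketched in base R. This is only an illustration on simulated genotypes, not the PLINK-based workflow from the tutorial: the population labels (POP1, POP2), allele frequencies and sample sizes are all made up, and projecting the study samples with predict() is just one possible approach (merging the samples before running the PCA, as discussed further down this thread, is another).

```r
# Simulated stand-in data: rows = samples, columns = SNPs coded 0/1/2.
set.seed(1)
n_snps <- 200
freqs  <- list(POP1 = runif(n_snps, 0.1, 0.9),   # hypothetical allele frequencies
               POP2 = runif(n_snps, 0.1, 0.9))

sim_geno <- function(n, p) {
  # One binomial draw per sample and SNP; byrow keeps SNP j paired with p[j].
  matrix(rbinom(n * length(p), size = 2, prob = p), nrow = n, byrow = TRUE)
}

ref   <- rbind(sim_geno(50, freqs$POP1), sim_geno(50, freqs$POP2))
pop   <- factor(rep(c("POP1", "POP2"), each = 50))
study <- sim_geno(5, freqs$POP2)                 # samples of "unknown" ethnicity
colnames(ref) <- colnames(study) <- paste0("snp", seq_len(n_snps))

# PCA on the reference panel only, then project the study samples onto it.
pca       <- prcomp(ref, center = TRUE, scale. = FALSE)
study_pcs <- predict(pca, newdata = study)

# Reference samples coloured by population, study samples overlaid in red.
cols <- c(POP1 = "blue", POP2 = "darkgreen")[as.character(pop)]
plot(pca$x[, 1:2], col = cols, pch = 16, xlab = "PC1", ylab = "PC2")
points(study_pcs[, 1:2], col = "red", pch = 17, cex = 1.5)
```

In this toy setup the red triangles should land on the POP2 cluster, which is the visual inference described above.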

Build the model

To build the model, I used the PCA rotated component loadings from X eigenvectors / dimensions from the plot above. These were fed into a multinomial logistic regression model with ethnic group as the outcome. I then used stepwise logistic regression to reduce the predictors to the minimum set that was most informative based on the Akaike information criterion (AIC).
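
A rough sketch of this kind of model, on simulated data. Assumptions to flag: multinom() from the nnet package is used here as one way to fit a multinomial logistic regression (the post does not say which function was used), the PC scores and the three group labels are simulated, and a simple comparison of nested models by AIC stands in for a full stepwise procedure.

```r
library(nnet)   # multinom(); nnet ships with standard R installations

set.seed(42)
# Hypothetical PCA scores for three groups (stand-in for real eigenvectors).
group  <- factor(rep(c("AFR", "EAS", "EUR"), each = 60))
centre <- c(AFR = -2, EAS = 0, EUR = 2)[as.character(group)]
pcs <- data.frame(
  group = group,
  PC1 = rnorm(180, mean = centre),         # separates all three groups
  PC2 = rnorm(180, mean = -abs(centre)),   # separates EAS from the others
  PC3 = rnorm(180),                        # pure noise
  PC4 = rnorm(180)                         # pure noise
)

# Fit nested models using the first k PCs and compare them by AIC.
fits <- lapply(1:4, function(k) {
  f <- reformulate(paste0("PC", 1:k), response = "group")
  multinom(f, data = pcs, trace = FALSE)
})
aic  <- sapply(fits, AIC)
best <- fits[[which.min(aic)]]

print(aic)        # one AIC value per candidate model
print(coef(best)) # coefficients of the lowest-AIC model
```

The lower the AIC, the better the trade-off between fit and number of predictors, so uninformative components tend to be dropped.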

Gauging the model's predictive potential

To test the model [on the 1000 Genomes training data], one can do X-fold cross-validation using cv.glm() from the boot package in R, and also generate ROC curves to gauge AUC, sensitivity, and specificity. My model had an AUC of 98.7%.
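
As an illustration of the cross-validation and AUC step, here is a small sketch on simulated data. Because cv.glm() takes a glm object, the example uses a binary logistic model (one group versus the rest), which is an assumption made for the example rather than the exact model from the post. The AUC is computed with the rank (Mann-Whitney) formula in base R instead of a dedicated ROC package, and it is calculated on the training data, so it is optimistic.

```r
library(boot)   # cv.glm(); boot ships with standard R installations

set.seed(7)
n <- 200
d <- data.frame(PC1 = rnorm(n), PC2 = rnorm(n))
# Simulated binary outcome: 1 = sample belongs to the group of interest.
d$is_group <- rbinom(n, 1, plogis(2 * d$PC1 - d$PC2))

fit <- glm(is_group ~ PC1 + PC2, data = d, family = binomial)

# 10-fold CV; the cost is the misclassification rate at a 0.5 cut-off.
cost     <- function(y, p) mean(abs(y - p) > 0.5)
cv_error <- cv.glm(d, fit, cost = cost, K = 10)$delta[1]

# AUC from ranks of the fitted probabilities (apparent, training-set AUC).
p   <- fitted(fit)
n1  <- sum(d$is_group == 1)
n0  <- sum(d$is_group == 0)
auc <- (sum(rank(p)[d$is_group == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

cat("CV misclassification rate:", cv_error, "\n")
cat("Training AUC:", auc, "\n")
```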


Testing the model on 'real-world' data

By doing something as simple as using the predict() function in R, one can easily apply the model to samples of unknown ethnicity, and the model will return a value from 0-1 (depending on how you execute predict()). In my example, I then output these results by plotting the samples into ethnicity 'strata'. As far as I can recall, the model did not predict any ethnicity that was not expected.
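
To show what "depending on how you execute predict()" can look like, here is a minimal sketch that again assumes a multinom() model from the nnet package and uses simulated PC scores with made-up groups A, B and C. With type = "probs" the result is one probability per group (values from 0 to 1), and with type = "class" it is the most likely group.

```r
library(nnet)   # multinom(); nnet ships with standard R installations

set.seed(3)
grp   <- rep(c("A", "B", "C"), each = 50)            # hypothetical groups
train <- data.frame(
  group = factor(grp),
  PC1   = rnorm(150, mean = c(A = -2, B = 0, C = 2)[grp]),
  PC2   = rnorm(150)
)
model <- multinom(group ~ PC1 + PC2, data = train, trace = FALSE)

# Samples of "unknown" ethnicity, described by the same predictors.
unknown <- data.frame(PC1 = c(-2.5, 0.1, 3), PC2 = c(0, 0, 0))

probs <- predict(model, newdata = unknown, type = "probs")  # matrix, rows sum to 1
calls <- predict(model, newdata = unknown, type = "class")  # factor of group labels

print(round(probs, 3))
print(calls)
```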



modified 3 months ago • written 11 months ago by Kevin Blighe37k

Thanks Kevin! This is exactly what I want!

written 11 months ago by Joe10

Respond back here if you need any clarifications

written 11 months ago by Kevin Blighe37k

Hello Kevin,

How can I add another color to indicate my samples in the PCA plots?


written 4 weeks ago by aksoyluinci10

What have you tried so far? Note that I have just had a new package, PCAtools, accepted to Bioconductor, but it is not yet officially released. If you use R v3.6, you could install it from GitHub.

Otherwise, if you are using base R functions, you just need to add the col parameter to plot() or points(), whichever you use to add the points to the plot. col is just a vector of colours whose order matches the samples in your data object.
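
A small base R sketch of the col idea, on made-up data: the data frame, its source column, and the "reference" / "study" labels are hypothetical names used only for this example.

```r
set.seed(1)
# Hypothetical PCA scores: 9 reference samples followed by 3 study samples.
scores <- data.frame(
  PC1    = rnorm(12),
  PC2    = rnorm(12),
  source = rep(c("reference", "study"), times = c(9, 3))
)

# One colour per sample, in the same order as the rows being plotted.
point_col <- ifelse(scores$source == "study", "red", "grey50")

plot(scores$PC1, scores$PC2, col = point_col, pch = 16,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = c("reference", "study"),
       col = c("grey50", "red"), pch = 16)
```

The key point is that point_col has exactly one entry per row and follows the row order, so each sample gets its own colour.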

written 4 weeks ago by Kevin Blighe37k

Hello Kevin,

Thanks for your super fast response. What I meant is that I want to add a color for my own samples, just as you indicated your samples in red in the PCA plot above. I applied your tutorial to my own data by merging my samples with the 1000 Genomes data, but I could not quite work out how to do that after obtaining the combined eigenvectors.


Your tutorial that I used: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

modified 4 weeks ago • written 4 weeks ago by aksoyluinci10

Sorry - slow response this time because I went out.

I see. And was the plot as you expected when you merged with 1000 Genomes?

The way that I assign colours in the tutorial is with this command:

col <- colorRampPalette(c(

Essentially this is assigning a single colour to a single factor in PED$Population. The order of the samples in PED$Population will / must match the order of samples in your pca results file from PLINK. You may have to study how this function works in order to utilise it for your own data. Here is a quick example:

samples <- c("A", "B", "A", "C", "B", "B", "C", "C", "A")
col <- colorRampPalette(c("red","green","blue"))(length(unique(samples)))[factor(samples, levels = c("A","B","C"))]

This will assign red to A, green to B, and blue to C. Check it:

plot(1:9, 1:9, type="n")
text(1:9, 1:9, samples, col=col, cex=3)


There are actually various ways of assigning colours but this is what I use. It is different if you are assigning colour to a continuous variable.
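
For the continuous case, one common approach (a sketch, not necessarily what the tutorial does) is to bin the variable and index into a colour ramp. The variable, the number of bins and the blue-to-red ramp below are arbitrary choices for illustration.

```r
set.seed(2)
value <- runif(20)                       # hypothetical continuous variable

ramp   <- colorRampPalette(c("blue", "red"))
n_bins <- 10
# cut() with labels = FALSE returns the bin index (1..n_bins) of each value.
bin      <- cut(value, breaks = n_bins, labels = FALSE)
col_cont <- ramp(n_bins)[bin]

# Low values are drawn in blue, high values in red.
plot(seq_along(value), value, col = col_cont, pch = 16,
     xlab = "sample", ylab = "value")
```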

written 4 weeks ago by Kevin Blighe37k

Thanks Kevin. This helped a lot!

written 4 weeks ago by aksoyluinci10
8 days ago by
marco0 wrote:

Newbie here. How is this better than a linear model? SNPs do not look to me like the right data set for ML.

written 8 days ago by marco0

Whom are you addressing? Why have you posted this as an answer?

If you are addressing me, then I must inform you that I am not using genotypes in the linear model. I am using the eigenvectors from a PCA performed on the genotypes.

modified 8 days ago • written 8 days ago by Kevin Blighe37k