Question: How to predict an individual's ethnicity using HapMap data
23 months ago by
Joe40 wrote:

Hi all,

I want to predict the ethnicity of individuals using a machine learning (ML) method. HapMap genotype data will serve as the training set, and the unknown individuals' genotype data as the test set. As this is my first time doing ethnicity prediction, could anyone give some detailed information (such as methods or typical workflows) or point to articles that make it easy to follow? Many thanks!


snp ethnicity hapmap • 2.6k views
modified 9 months ago by Biostar ♦♦ 20 • written 23 months ago by Joe40

Newbie here. How is this better than a linear model? SNPs do not look to me like the right data set for ML.

written 11 months ago by marco0

Whom are you addressing? Why have you posted this as an answer?

If you are addressing me, then I must inform you that I am not using genotypes in the linear model. I am using the eigenvectors from a PCA performed on the genotypes.

modified 11 months ago • written 11 months ago by Kevin Blighe53k
23 months ago by
Kevin Blighe53k wrote:

Hi Joe,

This is certainly an interesting area of research and I have actually already produced my own ethnicity predictive model using 1000 Genomes data as the training data. I originally developed it in order to predict the ethnicity of patient samples of unreported ethnicity [one could infer the ethnicity based on the samples' countries of origin, though]. Testing it on these samples of unknown ethnicity gave excellent predictive ability.

I guess that the process could be divided into 4 steps:

Identifying the panel of predictive markers on the training data

In my case, I generally followed [but not exactly followed] the previous tutorial that I posted here: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)

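That then gave me a panel of a few thousand SNPs. Even at this point, one can easily overlay study samples on the 1000 Genomes samples in order to infer ethnicity, as per this figure:

[Figure not preserved: PCA bi-plot of the 1000 Genomes samples with the study samples overlaid in red]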


Build the model

To then build the model, I actually used the PCA rotated component loadings from X eigenvectors / dimensions from the plot above - these were fed into a multinomial logistic regression model with ethnic group as outcome, obviously. I then used stepwise logistic regression in order to reduce the number of predictors to a minimum set that were most informative based on the Akaike information criterion (AIC).
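
A minimal sketch of this step, under loudly stated assumptions: the data are simulated, the population labels and PC columns are made up, and a one-vs-rest binomial glm() stands in for the multinomial model so that base R's step() can do the AIC-based stepwise selection. It illustrates the idea only and is not the exact model described above.

```r
# Simulated stand-in for PCA eigenvectors; all names and values are hypothetical
set.seed(1)
n <- 300
pop <- factor(sample(c("AFR", "EUR", "EAS"), n, replace = TRUE))
pcs <- data.frame(
  PC1 = rnorm(n, mean = ifelse(pop == "AFR", 1.5, 0)),  # shifted for AFR
  PC2 = rnorm(n, mean = ifelse(pop == "EAS", 1.5, 0)),  # shifted for EAS
  PC3 = rnorm(n), PC4 = rnorm(n), PC5 = rnorm(n)         # pure noise
)

# One-vs-rest outcome: is the sample AFR or not? (assumption, see lead-in)
pcs$is_afr <- as.integer(pop == "AFR")

# Full logistic model on all eigenvectors
full <- glm(is_afr ~ PC1 + PC2 + PC3 + PC4 + PC5, data = pcs, family = binomial)

# Stepwise selection by AIC; trace = 0 silences the step-by-step log
reduced <- step(full, direction = "both", trace = 0)

# Predictors retained in the reduced model
formula(reduced)
```

In a real analysis the same fit would be repeated per ethnic group, or replaced by a true multinomial model.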

Gauging the model's predictive potential

To test the model [on the 1000 Genomes training data], one can do X-fold cross-validation using cv.glm() (from the boot package) in R, and also generate ROC curves to gauge AUC, sensitivity, and specificity. My model had an AUC of 98.7%.
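
As a rough sketch of what this could look like, assuming a binomial glm() on simulated eigenvectors (the data, the 10 folds, and the misclassification cost function are all illustrative choices, not the original analysis). The AUC here is computed by hand from ranks, and on the training data only, rather than with a ROC package.

```r
library(boot)  # provides cv.glm()

# Simulated data; PC1 drives the outcome, PC2 is noise (hypothetical)
set.seed(2)
n <- 200
d <- data.frame(PC1 = rnorm(n), PC2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(2 * d$PC1))

fit <- glm(y ~ PC1 + PC2, data = d, family = binomial)

# Cost function: proportion misclassified at a 0.5 probability threshold
cost <- function(obs, pred) mean(abs(obs - pred) > 0.5)

# 10-fold cross-validation; delta[1] is the raw CV estimate of the cost
cv <- cv.glm(d, fit, cost = cost, K = 10)
cv_error <- cv$delta[1]

# In-sample AUC via the rank-sum (Mann-Whitney) identity
p <- fitted(fit)
r <- rank(p)
n1 <- sum(d$y == 1)
n0 <- sum(d$y == 0)
auc <- (sum(r[d$y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

cv_error
auc
```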


Testing the model on 'real world' data

By doing something as simple as using the predict() function in R, one can easily apply the model to samples of unknown ethnicity and the model will return a value from 0-1 (depending on how you execute predict()). In my example, I then output these results by plotting the samples into ethnicity 'strata'. As far as I can recall, the model did not incorrectly predict an ethnicity that was not expected.
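
To illustrate the "depending on how you execute predict()" part, here is a small sketch on simulated data (the model and the three 'unknown' samples are made up). For a binomial glm(), type = "response" returns probabilities between 0 and 1, whereas the default type = "link" returns log-odds.

```r
# Hypothetical training data and model
set.seed(3)
train <- data.frame(PC1 = rnorm(100), PC2 = rnorm(100))
train$is_target <- rbinom(100, 1, plogis(1.5 * train$PC1))
fit <- glm(is_target ~ PC1 + PC2, data = train, family = binomial)

# Eigenvector values for three samples of unknown ethnicity (made up)
unknown <- data.frame(PC1 = c(-2, 0, 2), PC2 = c(0.5, 0, -0.5))

# Probabilities on the 0-1 scale
scores <- predict(fit, newdata = unknown, type = "response")

# Default output is on the log-odds scale; plogis() maps it back to 0-1
log_odds <- predict(fit, newdata = unknown)

scores
plogis(log_odds)
```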



modified 7 months ago • written 23 months ago by Kevin Blighe53k

Thanks Kevin! This is exactly what I want!

written 23 months ago by Joe40

Respond back here if you need any clarifications.

written 23 months ago by Kevin Blighe53k

Hello Kevin,

How can I add another color to indicate my samples in the PCA plots?


written 12 months ago by aksoyluinci10

What have you tried so far? Note that I have just had a new package, PCAtools, accepted to Bioconductor, but it is not yet officially released. If you use R v3.6, you could install it from GitHub.

Otherwise, if you are using base R functions, you just need to add the col parameter to plot() or points(), whichever you use to add the points to the plot. col is just a vector of colours whose order matches the samples in your data object.
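
For instance, a minimal base R sketch (all data simulated; the population labels, colours, and object names are made up for illustration): reference samples are coloured through a col vector in plot(), and your own samples are then added on top with points().

```r
# Simulated eigenvectors for reference samples and for 'my' samples
set.seed(4)
ref <- data.frame(PC1 = rnorm(60), PC2 = rnorm(60),
                  pop = rep(c("AFR", "EUR", "EAS"), each = 20))
mine <- data.frame(PC1 = rnorm(5), PC2 = rnorm(5))

# One colour per population, looked up per sample so the order matches ref
palette_by_pop <- c(AFR = "orange", EUR = "forestgreen", EAS = "royalblue")
ref_col <- unname(palette_by_pop[ref$pop])

# Avoid writing a plot file when run non-interactively
if (!interactive()) pdf(NULL)

# Reference samples first, with axis limits wide enough for both data sets
plot(ref$PC1, ref$PC2, col = ref_col, pch = 19,
     xlim = range(c(ref$PC1, mine$PC1)),
     ylim = range(c(ref$PC2, mine$PC2)),
     xlab = "PC1", ylab = "PC2")

# Overlay own samples in red
points(mine$PC1, mine$PC2, col = "red", pch = 17, cex = 1.5)
legend("topright", legend = c(names(palette_by_pop), "my samples"),
       col = c(palette_by_pop, "red"), pch = c(19, 19, 19, 17))
```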

written 12 months ago by Kevin Blighe53k

Hello Kevin,

Thanks for your super fast response. What I meant is that I wanted to add a color just as you did to indicate your samples in red in the PCA plots above. I applied your tutorial to my own data by merging the samples with the 1000 Genomes data, but couldn't quite work out how to do that after obtaining the combined eigenvectors.


Your tutorial that I used: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

modified 12 months ago • written 12 months ago by aksoyluinci10

Sorry for the slow response this time - I went out.

I see. And was the plot as you expected when you merged with 1000 Genomes?

The way that I assign colours in the tutorial is with this command:

col <- colorRampPalette(c(

Essentially this is assigning a single colour to a single factor in PED$Population. The order of the samples in PED$Population will / must match the order of samples in your pca results file from PLINK. You may have to study how this function works in order to utilise it for your own data. Here is a quick example:

samples <- c("A", "B", "A", "C", "B", "B", "C", "C", "A")
col <- colorRampPalette(c("red","green","blue"))(length(unique(samples)))[factor(samples, levels = c("A","B","C"))]

This will assign red to A, green to B, and blue to C. Check it:

plot(1:9, 1:9, type="n")
text(1:9, 1:9, samples, col=col, cex=3)
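
On the point about sample order: if the two tables are not already in the same order, match() can align them before colours are assigned. This is a sketch with made-up sample IDs and population assignments; the IID column name and the eigenvec object are assumptions about how your data are laid out, not something taken from the tutorial.

```r
# Hypothetical pedigree table and PCA results in different sample orders
ped <- data.frame(IID = c("S3", "S1", "S4", "S2"),
                  Population = c("GBR", "YRI", "CHB", "YRI"))
eigenvec <- data.frame(IID = c("S1", "S2", "S3", "S4"),
                       PC1 = c(0.9, 0.8, -0.5, -0.4),
                       PC2 = c(0.1, 0.2, 0.7, -0.6))

# For each PCA sample, find its row in the pedigree table
idx <- match(eigenvec$IID, ped$IID)
stopifnot(!anyNA(idx))  # every PCA sample must have a pedigree entry

# Population labels, now in the same order as the PCA results
pop <- as.character(ped$Population[idx])
pop
```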


There are actually various ways of assigning colours but this is what I use. It is different if you are assigning colour to a continuous variable.
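
For the continuous case, one common approach (shown here as a sketch on simulated values, with an arbitrary choice of 100 bins and a blue-to-red gradient) is to bin the variable with cut() and use the bin number to index into a colour ramp.

```r
# Simulated continuous variable, e.g. a score per sample (hypothetical)
set.seed(5)
x <- runif(50)

# Gradient with 100 steps from blue (low) to red (high)
n_bins <- 100
ramp <- colorRampPalette(c("blue", "red"))(n_bins)

# Bin each value into 1..n_bins, then look up its colour
bin <- as.integer(cut(x, breaks = n_bins))
col <- ramp[bin]

# Avoid writing a plot file when run non-interactively
if (!interactive()) pdf(NULL)
plot(seq_along(x), x, col = col, pch = 19, xlab = "Sample", ylab = "Value")
```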

written 12 months ago by Kevin Blighe53k

Thanks Kevin. This helped a lot!

written 12 months ago by aksoyluinci10

This plot looks great, @Kevin Blighe. How did you plot the samples into ethnicity 'strata'? Did you use any R package? Thanks

written 11 months ago by robjohn70000120

Hey, that was actually done with base R functions, would you believe. What is being plotted on the y-axis are prediction scores (0 to 1) from the ethnicity model that I created (i.e. from predict() applied to my final glm()). The x-axis is just a single integer for each sample so that they are spaced out, going from 1 to 2504 for all 1000 Genomes samples.

I then manually set the cut-offs to segregate the different groups, and provided a gradient colour scheme. The red dots are my own samples overlaid via points().

So, I just used plot(). Surprising how good plot can be with a bit of work. Reminds me of MS Paint on Windows. Powerful tool.
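
A rough base R sketch of that kind of 'strata' plot, with everything simulated: the scores, the three cut-offs, the colour gradient, and the overlaid samples are all invented for illustration, and the scores are sorted here only so that the strata are easy to see.

```r
# Simulated prediction scores for reference samples (hypothetical)
set.seed(6)
n_ref <- 200
score <- sort(runif(n_ref))

# Manually chosen cut-offs separating four strata (made-up values)
cutoffs <- c(0.25, 0.5, 0.75)
stratum <- findInterval(score, cutoffs) + 1  # stratum number 1..4

# One gradient colour per stratum
grad <- colorRampPalette(c("gold", "darkblue"))(length(cutoffs) + 1)

# Avoid writing a plot file when run non-interactively
if (!interactive()) pdf(NULL)

# x-axis is just the sample index; y-axis is the prediction score
plot(seq_len(n_ref), score, col = grad[stratum], pch = 19, ylim = c(0, 1),
     xlab = "Sample index", ylab = "Prediction score")
abline(h = cutoffs, lty = 2)

# Overlay own samples in red via points()
my_idx <- c(20, 90, 160)
my_score <- c(0.1, 0.6, 0.9)
points(my_idx, my_score, col = "red", pch = 19, cex = 1.5)
```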

modified 11 months ago • written 11 months ago by Kevin Blighe53k

Thanks @Kevin Blighe

written 10 months ago by robjohn70000120

Hi Kevin,

This is super helpful, but I have a quick question. Is it necessary to find common markers and overlay samples with the reference HapMap data? Would it be possible to project the HapMap samples first, and then project any arbitrary sample onto that? I believe there might be issues with dimensionality and projection, since the markers in the sample might not be the same as those in the HapMap dataset. Is there a workaround for that?


written 7 months ago by milind_ag10

Hey, I am not sure what you mean (?). In this example, the regression model is constructed using the HapMap data, and then we use this model to apply a 'prediction' on other, non-HapMap samples. Generally, the HapMap and test datasets should be filtered to include shared markers / SNPs, though.
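
As a sketch of that filtering step in R, assuming genotype matrices with samples in rows and SNP IDs as column names (the matrices, sample names, and rsIDs below are made up): intersect() gives the shared markers, and both data sets are then subset to them in the same column order.

```r
# Hypothetical genotype matrices coded 0/1/2, SNP IDs as column names
ref_geno <- matrix(c(0, 1, 2, 1, 0,
                     2, 1, 0, 0, 1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("ref1", "ref2"),
                                   c("rs1", "rs2", "rs3", "rs4", "rs5")))
test_geno <- matrix(c(1, 0, 2,
                      0, 2, 1),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("test1", "test2"),
                                    c("rs4", "rs2", "rs9")))

# Markers present in both data sets
shared <- intersect(colnames(ref_geno), colnames(test_geno))

# Subset both to the shared markers, in identical column order
ref_shared <- ref_geno[, shared, drop = FALSE]
test_shared <- test_geno[, shared, drop = FALSE]

shared
```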

written 7 months ago by Kevin Blighe53k

Ah, I'm sorry if I wasn't clear before. I am trying to do PCA to infer ethnicity. Unfortunately, my sample data doesn't have a lot of shared markers with reference datasets like 1000G/HapMap.

If there were, I could just merge both the reference and my sample, and then do PCA on it, and see where my samples landed compared to the others. I was curious to know if your regression model would work even when there aren't a lot of shared markers in the non-HapMap samples?

written 7 months ago by milind_ag10

It depends on how informative (of ethnicity) the shared markers are. For the model above, I believe there were ~13000 markers, which is a lot. PCA was performed using the information from these markers across the 1000 Genomes samples and also the unknown samples, i.e., combined. I then constructed the model using the PCA results, and then performed model predictions on the unknown samples.

The initial work to derive the markers was performed by following this: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
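
The combined workflow described in this reply could be sketched roughly as follows. Everything here is an assumption made for illustration: genotypes are simulated for two invented populations over 20 markers, prcomp() stands in for the PLINK PCA, and a binomial glm() on the first three components stands in for the final model.

```r
# Simulate 0/1/2 genotypes for n samples at a given allele frequency
set.seed(7)
n_snps <- 20
sim_geno <- function(n, freq) matrix(rbinom(n * n_snps, 2, freq), nrow = n)

# Reference samples from two made-up populations, plus 10 unknown samples
ref <- rbind(sim_geno(50, 0.3), sim_geno(50, 0.4))
ref_pop <- rep(c("A", "B"), each = 50)
unknown <- rbind(sim_geno(5, 0.3), sim_geno(5, 0.4))

# PCA on reference and unknown samples combined
combined <- rbind(ref, unknown)
pca <- prcomp(combined, center = TRUE, scale. = FALSE)
pcs <- as.data.frame(pca$x[, 1:3])
is_ref <- seq_len(nrow(combined)) <= nrow(ref)

# Build the model on the reference rows only
train <- cbind(pcs[is_ref, ], is_B = as.integer(ref_pop == "B"))
fit <- glm(is_B ~ PC1 + PC2 + PC3, data = train, family = binomial)

# Predict on the unknown rows; values are probabilities of population B
scores <- predict(fit, newdata = pcs[!is_ref, ], type = "response")
round(scores, 2)
```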

written 7 months ago by Kevin Blighe53k

Thanks for the clarification Kevin! That's what I thought too. It really does depend on how many and how informative the markers I have are. I will try to incorporate as many common markers as possible.

And thanks for the link to the original work. It's very helpful! :)

written 7 months ago by milind_ag10