Question

Post PCA analysis

0

5.1 years ago

doodle ▴ 30

Hello,

I have data from two different populations, A and B. I merged the two data sets (unpruned A and LD-pruned B) and ran a PCA, which gave two distinct clusters. Next, I'm supposed to identify the principal components that separate A and B, then take the A-selective principal components and run a GWAS on them. How do I go about doing this?

Any suggestions please?

Thank you!

pca population stratification correlation matrix • 2.0k views

ADD COMMENT • link updated 5.1 years ago by Kevin Blighe 87k • written 5.1 years ago by doodle ▴ 30

score 2 · Answer 1 · 2019-03-12

2

5.1 years ago

shawn.w.foley ★ 1.3k

How did you generate the PCA? What do you mean by "unpruned A and LD pruned B"? If you're using different filtering on your A and B populations, that alone could be driving the separation you see.

From experience, I'd warn you that these are associations/correlations and can mislead you. I had a beautiful PCA that separated populations 1 and 2, only to discover that it was driven by X- and Y-linked genes because I had an overrepresentation of females in one population. So just be careful how much you infer from these data.

ADD COMMENT • link 5.1 years ago by shawn.w.foley ★ 1.3k

0

Data set B was pruned based on linkage disequilibrium using the --indep-pairwise option in PLINK. However, when I pruned A, most of the SNPs were removed. Hence, I merged A as a whole (without pruning) with the pruned B and ran a PCA. My aim was to merge A and B (which are very different populations) and do a PCA on them. I did the PCA using the --pca option in PLINK.

What I also want to know is: what does it mean to run a GWAS on the principal components? Does it mean that I use the PCs as covariates in the GWAS?

ADD REPLY • link 5.1 years ago by doodle ▴ 30

2

You probably mean that you want to adjust your test statistics for population stratification via the inclusion of PCs (principal components) as covariates in your design formula. You should first check if the populations are segregated on PCA bi-plots and, if so, which PCs are segregating them.
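Besides eyeballing bi-plots, one rough numeric check is to compute a standardized mean difference between the two groups along each PC. The sketch below is only an illustration on synthetic scores: the function name `pc_separation`, the cut-off of 2, and the assumption that the scores come from something like the numeric columns of a PLINK .eigenvec file are all mine, not part of this thread.

```python
import numpy as np

def pc_separation(scores, labels):
    """Standardized mean difference (Cohen's d) between two groups, per PC.

    scores: (n_samples, n_pcs) array of PC scores.
    labels: length-n array containing exactly two distinct group labels.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    if groups.size != 2:
        raise ValueError("expected exactly two groups")
    a = scores[labels == groups[0]]
    b = scores[labels == groups[1]]
    # Pooled within-group variance for each PC
    pooled_var = ((len(a) - 1) * a.var(axis=0, ddof=1)
                  + (len(b) - 1) * b.var(axis=0, ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled_var)

# Synthetic PC scores: only the first column differs between A and B.
rng = np.random.default_rng(0)
labels = np.array(["A"] * 50 + ["B"] * 50)
scores = rng.normal(size=(100, 4))
scores[labels == "B", 0] += 5.0

d = pc_separation(scores, labels)
separating = np.flatnonzero(np.abs(d) > 2.0)  # cut-off of 2 is arbitrary
print("PCs (0-based) that separate the groups:", separating)
```

PCs flagged this way are the candidates to include as covariates; the bi-plot is still worth checking, since a single number can hide outliers or sub-clusters.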

I actually used PCA to predict ethnicity previously, with very high sensitivity/specificity on 1000 Genomes populations: A: How to predict individual ethnicity information by using hapmap data

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k

0

Thanks Kevin! I have one more question, which may sound a bit silly. Does it make sense to use a principal component as a phenotype in a GWAS?

ADD REPLY • link 5.1 years ago by doodle ▴ 30

1

People do use principal components as 'phenotypes', sometimes. So, you can use it if you wish. PCs are uncorrelated, which can help in the context of a regression model. Here is the proof of this: C: PCA in a RNA seq analysis
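As a quick numeric check of the "PCs are uncorrelated" statement, here is a minimal sketch on synthetic data (the matrix below is random and purely illustrative): the covariance matrix of the PC scores comes out diagonal up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data with correlated columns (100 samples x 5 variables)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

Xc = X - X.mean(axis=0)                       # centre each variable
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U * S                                   # PC scores, one column per PC

cov = np.cov(pcs, rowvar=False)
off_diagonal = cov - np.diag(np.diag(cov))
print("largest off-diagonal covariance:", np.abs(off_diagonal).max())
```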

It may help you to first investigate which PCs are of interest by investigating bi-plots.

I had a package recently accepted to Bioconductor, too, but it is not yet officially released: https://github.com/kevinblighe/PCAtools

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k

1

Thank you once again Kevin!

ADD REPLY • link 5.1 years ago by doodle ▴ 30

0

If I run a GWAS on the two merged populations, A and B, using the first principal component as a phenotype, will that tell me which principal component belongs to which population?

ADD REPLY • link 5.1 years ago by doodle ▴ 30

2

No, an inspection of the bi-plot will tell you that. Look how I do it here at the end of the tutorial: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

Looking at the bi-plots for PC1 vs PC2 (left) and PC1 vs PC3 (right), I can say, for example, that PC1 segregates the African population from the other populations. PC2 segregates the East-Asians from all other populations. PC3 segregates South Asians from all other populations.

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k

1

Beautiful! Thank you so much Kevin! It has been of immense help.

ADD REPLY • link 5.1 years ago by doodle ▴ 30

0

So I did a GWAS using the first principal component as a phenotype and plotted a Manhattan plot. I got a few hundred hits (above the 5e-8 genome-wide significance line). Is that normal? My next step would be to use the top hits as covariates for the GWAS on my main phenotype.
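For reference, the 5e-8 line is the conventional genome-wide significance threshold, which corresponds to a Bonferroni correction of 0.05 for roughly one million independent tests. A minimal sketch of counting hits against it, using made-up p-values (the array below is hypothetical and not taken from this analysis):

```python
import numpy as np

alpha = 0.05
n_independent_tests = 1_000_000        # conventional assumption behind 5e-8
threshold = alpha / n_independent_tests

# Hypothetical p-values standing in for a GWAS results column
p_values = np.array([3e-12, 4.9e-8, 5.1e-8, 1e-3, 0.2])
hits = p_values < threshold

print("number of hits:", int(hits.sum()))
print("Manhattan plot line at -log10(p) =", -np.log10(threshold))
```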

ADD REPLY • link 5.1 years ago by doodle ▴ 30

1

One usually includes the PCs as covariates for the purpose of adjusting for population stratification in your sample cohort. I was not aware that you are simply using the PCs as the main phenotype of interest.

For example, if I had samples from Ireland, France, Italy, and the UK, and I were studying haemochromatosis, my model might be:

HaemochromatosisStatus ~ SNP + PC1 + PC2

I include the PCs for the purpose of adjusting for likely natural differences between my populations, but my main interest is haemochromatosis status.
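To make the effect of the covariates concrete, here is a minimal simulation, not a real analysis. Everything in it is an assumption for illustration: two simulated populations, a SNP whose allele frequency differs between them but has no causal effect, a simulated `pc1` that tracks population, and a small hand-written logistic regression (Newton-Raphson) in place of PLINK or a statistics package.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson; returns [intercept, coefs...]."""
    X = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        weights = p * (1.0 - p)
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * weights[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

rng = np.random.default_rng(2)
n = 1000
pop = np.repeat([0, 1], n)                            # two populations
snp = rng.binomial(2, np.where(pop == 0, 0.2, 0.8))   # frequency differs by population
pc1 = pop - 0.5 + rng.normal(scale=0.1, size=2 * n)   # PC1 tracks population
pc2 = rng.normal(size=2 * n)                          # PC2 is pure noise here
# Disease risk depends on population only, not on the SNP.
status = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * pop))))

unadjusted = fit_logistic(snp, status)                            # Status ~ SNP
adjusted = fit_logistic(np.column_stack([snp, pc1, pc2]), status)  # Status ~ SNP + PC1 + PC2
print("SNP log-odds, unadjusted:", unadjusted[1])
print("SNP log-odds, adjusted:  ", adjusted[1])
```

In this setup the unadjusted SNP effect is spurious (it only reflects population membership), and it shrinks towards zero once the PCs are in the model.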

What is the overall aim of your study?

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k

0

The overall aim is to identify age at diagnosis of diabetes in 2 populations.

1. I did a GWAS for age at diagnosis for the first population.
2. Next, I merged the data of the 2 populations and did a PCA on them. I got 2 different clusters.
3. Next, I was asked to run a GWAS using the first PC as a phenotype.
4. The top hits of the above GWAS should be run for age at diagnosis.

I do not quite understand the context from step 3 onwards.

ADD REPLY • link 5.1 years ago by doodle ▴ 30

1

If you do Step 1 on both populations, do the results differ?

Step 3 is likely related to the point that I was making, i.e., after you have merged the populations together, include PC1 (and/or any other PCs along which your populations are segregated on a bi-plot) as a covariate in order to adjust for the effects of population stratification.

So, you would have 3 sets of results:

GWAS hits for first population (Diabetes ~ SNP + age)
GWAS hits for second population (Diabetes ~ SNP + age)
GWAS hits for populations combined (population effect adjusted by including PCs as covariates) (Diabetes ~ SNP + age + PC1)

I would compare these sets and take the comparison back to your supervisor / collaborator.

Note that the formulae that I list above test for diabetes status while adjusting for age. Your own formulae may differ.

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k

0

Yes, the results are different.

Okay, will try out your step 3.

Thanks!

ADD REPLY • link 5.1 years ago by doodle ▴ 30

0

Hi Kevin, I'm still a little stuck with this work.

Can I please know what you meant when you wrote 'People do use principal components as 'phenotypes', sometimes. So, you can use it if you wish. PCs are uncorrelated, which can help in the context of a regression model' in the comments above?

Just to rewind a bit: I have 2 populations, A and B. I did a GWAS on age at diagnosis of diabetes in population A. Then I merged A and B, did a PCA on them, used the first principal component as a phenotype (without any covariates), and ran a GWAS on the merged populations. This gave me a list of SNPs that show the difference between the 2 populations (I got about 0.1 million hits here. Is that normal?). I then took the top-hit SNPs (the 0.1 million) and ran a GWAS for age at diagnosis in population A. I did not get any hits this time.

A few questions,

Does this experiment validate the results of my original gwas on population A for age at diagnosis before I did the PCA?

Does this show the genetic reasons for different age at diagnosis for diabetes in the 2 populations? (Literature shows that population A gets diabetes at a lower age than B).

If not the above, what does it all mean?

I'm quite confused now.

ADD REPLY • link 5.1 years ago by doodle ▴ 30

0

Hey, what is your actual model / design formula? I do not believe you ever stated it clearly in this thread. It does not make much sense to test your SNPs against just the PC, which I believe you may have done.
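To illustrate why testing SNPs against PC1 alone tends to return a very large number of hits: PC1 is itself a weighted sum of the genotypes, so any SNP whose allele frequency differs between the populations will correlate with it. A minimal simulation, with all sample sizes and allele frequencies chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n_per_pop, n_snps, n_diff = 200, 500, 100

# Allele frequencies: the first n_diff SNPs differ between A and B.
freq_a = np.full(n_snps, 0.5)
freq_b = np.full(n_snps, 0.5)
freq_a[:n_diff], freq_b[:n_diff] = 0.2, 0.8

geno = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_pop, n_snps)),
    rng.binomial(2, freq_b, size=(n_per_pop, n_snps)),
]).astype(float)

# PCA on standardized genotypes; PC1 is a weighted sum of the SNP columns.
Z = (geno - geno.mean(axis=0)) / geno.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
pc1 = U[:, 0] * S[0]

# Pearson correlation of every SNP with PC1 (the "phenotype").
pc1_std = (pc1 - pc1.mean()) / pc1.std()
r = (Z * pc1_std[:, None]).mean(axis=0)

print("mean |r|, differentiated SNPs:", np.abs(r[:n_diff]).mean())
print("mean |r|, other SNPs:         ", np.abs(r[n_diff:]).mean())
```

In other words, a GWAS with PC1 as the phenotype mostly recovers the SNPs that differ in frequency between the populations, which says nothing by itself about the trait of interest.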

We include PCs as 'phenotypes' (more correct to use the term 'covariates') in GWAS studies in order to control for population stratification.

It would help to show the exact code that you are running. Sometimes, between code and written text, much information can become lost or confusing.

By the way, if the literature shows that population A gets diabetes at a lower age than B, then do you even need to combine these together? It may be more intuitive to process them separately and just compare results, e.g., perform a meta-analysis.
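One common way to combine per-population results is an inverse-variance-weighted fixed-effect meta-analysis, computed SNP by SNP. The sketch below assumes that approach (other models, such as random-effects, exist) and uses made-up effect sizes; the function name `fixed_effect_meta` is hypothetical.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis for one SNP."""
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return beta, se, z, p

# Hypothetical per-population effect estimates for a single SNP
beta, se, z, p = fixed_effect_meta(betas=[0.30, 0.10], ses=[0.10, 0.10])
print(beta, se, z, p)
```

Each population is analysed on its own first, so population stratification between A and B never enters the individual models.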

ADD REPLY • link 5.1 years ago by Kevin Blighe 87k