Question: Post PCA analysis
0
10 days ago by
charvinangia30
charvinangia30 wrote:

Hello,

I have data from two different populations, A and B. I merged the two data sets (unpruned A and LD-pruned B) and did a PCA. I got two different clusters. Next, I'm supposed to identify the principal components that separate A and B, then take the A-selective principal components and run a GWAS on them. How do I go about doing it?

Any suggestions please?

Thank you!

ADD COMMENTlink modified 8 days ago by Kevin Blighe39k • written 10 days ago by charvinangia30
2
10 days ago by
shawn.w.foley260
USA
shawn.w.foley260 wrote:

How did you generate the PCA? What do you mean by "unpruned A and LD pruned B"? If you're using different filtering on your A and B populations that could be driving your association.

From experience, I'd warn you to remember that these are associations/correlations and can mislead you. I had a beautiful PCA that separated populations 1 and 2, only to discover that it was driven by X- and Y-linked genes because I had an overrepresentation of females in one population. So just be careful how much you infer from these data.

ADD COMMENTlink written 10 days ago by shawn.w.foley260

Data set B was pruned based on linkage disequilibrium using the --indep-pairwise option in PLINK. However, when I pruned A, it removed most of the SNPs. Hence, I merged A as a whole (without pruning) with the pruned B and ran a PCA. My aim was to merge A and B (which are very different populations) and do a PCA on them. I did the PCA using the --pca option in PLINK.

What I also want to know is: what does it mean to run a GWAS on the principal components? Does it mean that I use the PCs as covariates in the GWAS?

ADD REPLYlink written 10 days ago by charvinangia30
2

You probably mean that you want to adjust your test statistics for population stratification via the inclusion of PCs (principal components) as covariates in your design formula. You should first check if the populations are segregated on PCA bi-plots and, if so, which PCs are segregating them.
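To complement visual inspection of the bi-plots, here is a minimal sketch of one way to rank PCs by how strongly they separate two labelled populations. It is an illustration only: the function name `rank_pcs_by_separation`, the standardized mean difference used as the separation measure, and the simulated scores are all assumptions, not something from PLINK or from this thread. With real data, the scores would come from the .eigenvec file written by --pca, and the labels from your own sample sheet.

```python
import numpy as np

def rank_pcs_by_separation(scores, labels):
    """Rank PCs by the standardized mean difference between two groups.

    scores: (n_samples, n_pcs) array of PC scores.
    labels: length n_samples array holding exactly two distinct values.
    Returns (order, d): 0-based PC indices from most to least separating,
    and the separation value for each PC.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    if groups.size != 2:
        raise ValueError("expected exactly two populations")
    a = scores[labels == groups[0]]
    b = scores[labels == groups[1]]
    # Absolute difference in group means, scaled by the pooled standard deviation
    pooled_sd = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2.0)
    d = np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled_sd
    order = np.argsort(d)[::-1]
    return order, d

# Simulated stand-in for PC scores: population A is shifted along PC1 only
rng = np.random.default_rng(0)
n = 100
labels = np.array(["A"] * n + ["B"] * n)
scores = rng.normal(size=(2 * n, 4))
scores[:n, 0] += 5.0

order, d = rank_pcs_by_separation(scores, labels)
print("PCs ranked by separation (1-based):", order + 1)
print("Separation per PC:", np.round(d, 2))
```

A large value for a PC suggests that the two populations sit apart along it; it is still worth confirming this on the bi-plot, since a single summary number can hide outliers or sub-clusters.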

I actually used PCA to predict ethnicity previously, with very high sensitivity/specificity on 1000 Genomes populations: A: How to predict individual ethnicity information by using hapmap data

ADD REPLYlink modified 9 days ago • written 9 days ago by Kevin Blighe39k

Thanks Kevin! I have one more question, which may sound a bit silly. Does it make sense to use a principal component as a phenotype in a GWAS?

ADD REPLYlink written 9 days ago by charvinangia30
1

People do use principal components as 'phenotypes', sometimes. So, you can use it if you wish. PCs are uncorrelated, which can help in the context of a regression model. Here is the proof of this: C: PCA in a RNA seq analysis
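As a quick numerical illustration of the point that PCs are uncorrelated, here is a small sketch using NumPy on simulated data (the data matrix and the variable names are made up for the example). It computes PC scores via an SVD of the centered matrix and compares the correlations among the input variables with those among the PCs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: 200 samples x 5 variables that share two latent factors,
# so the input variables are correlated with one another
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0, 1.0]])
X = latent @ mixing + 0.5 * rng.normal(size=(200, 5))

# PCA via SVD of the column-centered matrix; PC scores are U * S
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U * S

def max_off_diagonal(corr):
    """Largest absolute correlation between two different columns."""
    return np.abs(corr - np.diag(np.diag(corr))).max()

max_corr_inputs = max_off_diagonal(np.corrcoef(X, rowvar=False))
max_corr_pcs = max_off_diagonal(np.corrcoef(pcs, rowvar=False))

print("Largest correlation among input variables:", round(max_corr_inputs, 3))
print("Largest correlation among PCs:", max_corr_pcs)
```

The correlations among the PCs are zero up to floating-point error, which is why including several PCs as covariates does not introduce collinearity among the PCs themselves.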

It may help you to first investigate which PCs are of interest by investigating bi-plots.

I had a package recently accepted to Bioconductor, too, but it is not yet officially released: https://github.com/kevinblighe/PCAtools

ADD REPLYlink modified 9 days ago • written 9 days ago by Kevin Blighe39k
1

Thank you once again Kevin!

ADD REPLYlink written 8 days ago by charvinangia30

If I run a GWAS on the 2 merged populations A and B using the first principal component as a phenotype, will that give me information about which principal component belongs to which population?

ADD REPLYlink written 8 days ago by charvinangia30
1

No, an inspection of the bi-plot will tell you that. Look how I do it here at the end of the tutorial: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

Looking at the bi-plots for PC1 vs PC2 (left) and PC1 vs PC3 (right), I can say, for example, that PC1 segregates the African population from the other populations. PC2 segregates the East-Asians from all other populations. PC3 segregates South Asians from all other populations.

[Figure: bi-plots of PC1 vs PC2 (left) and PC1 vs PC3 (right)]

ADD REPLYlink written 8 days ago by Kevin Blighe39k
1

Beautiful! Thank you so much Kevin! It has been of immense help.

ADD REPLYlink written 7 days ago by charvinangia30

So I did a GWAS using the first principal component as a phenotype and drew a Manhattan plot. I got a few hundred hits above the genome-wide significance line (p < 5e-8). Is that normal? My next step would be to use the top hits as covariates for the GWAS on my main phenotype.

ADD REPLYlink written 4 days ago by charvinangia30
1

One usually includes the PCs as covariates for the purpose of adjusting for population stratification in your sample cohort. I was not aware that you are simply using the PCs as the main phenotype of interest.

For example, if I had samples from Ireland, France, Italy, and the UK and I were studying haemochromatosis, my model might be:

HaemochromatosisStatus ~ SNP + PC1 + PC2

I include the PCs for the purpose of adjusting for likely natural differences between my populations, but my main interest is haemochromatosis status.
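To make the effect of that adjustment concrete, here is a small simulated sketch. Everything in it is an assumption for illustration: the two populations, the allele frequencies, a quantitative phenotype fitted by ordinary least squares (rather than a case/control status, which would normally call for logistic regression), and a noisy population indicator standing in for PC1. The SNP has no true effect, yet it looks associated until the PC is included.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000  # individuals per population (hypothetical)

# Two populations with different allele frequencies at a single SNP
pop = np.repeat([0.0, 1.0], n)
freq = np.where(pop == 0, 0.1, 0.6)
snp = rng.binomial(2, freq).astype(float)  # genotype coded 0/1/2

# Stand-in for PC1: a score that tracks population membership
pc1 = (pop - 0.5) + rng.normal(scale=0.05, size=2 * n)

# Phenotype differs between populations but does NOT depend on the SNP
pheno = 2.0 * pop + rng.normal(size=2 * n)

def snp_effect(y, snp, covariates=()):
    """Least-squares coefficient of the SNP in y ~ 1 + SNP + covariates."""
    design = np.column_stack([np.ones_like(snp), snp, *covariates])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]

unadjusted = snp_effect(pheno, snp)                   # Phenotype ~ SNP
adjusted = snp_effect(pheno, snp, covariates=[pc1])   # Phenotype ~ SNP + PC1

print("SNP effect without PC1:", round(unadjusted, 3))
print("SNP effect with PC1:   ", round(adjusted, 3))
```

Without PC1, the SNP picks up the population difference in the phenotype; with PC1 in the model, its estimated effect shrinks towards zero, which is the behaviour one wants from a stratification adjustment.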

What is the overall aim of your study?

ADD REPLYlink modified 4 days ago • written 4 days ago by Kevin Blighe39k

The overall aim is to identify age at diagnosis of diabetes in 2 populations.

  1. I did a GWAS for age at diagnosis for the first population.
  2. Next, I merged the data of the 2 populations and did a PCA on them. I got 2 different clusters.
  3. Next, I was asked to run a GWAS using the first PC as a phenotype.
  4. The top hits of the above GWAS should be run for age at diagnosis.

I do not quite understand the context from step 3 onwards.

ADD REPLYlink written 4 days ago by charvinangia30
1

If you do Step 1 on both populations, do the results differ?

Step 3 is likely related to the point that I was making, i.e., after you have merged the populations together, include PC1 (and/or any other PCs along which your populations are segregated on a bi-plot) as a covariate in order to adjust for the effects of population stratification.

So, you would have 3 sets of results:

  1. GWAS hits for first population (Diabetes ~ SNP + age)
  2. GWAS hits for second population (Diabetes ~ SNP + age)
  3. GWAS hits for populations combined (population effect adjusted by including PCs as covariates) (Diabetes ~ SNP + age + PC1)

I would compare these sets and take them back to your supervisor / collaborator.

Note that the formulae that I list above test for diabetes status while adjusting for age. Your own formulae may differ.

ADD REPLYlink modified 4 days ago • written 4 days ago by Kevin Blighe39k

Yes, the results are different.

Okay, will try out your step 3.

Thanks!

ADD REPLYlink written 4 days ago by charvinangia30