Question: Correcting for population structure in GWAS?
0
gravatar for maya123z
12 months ago by
maya123z30
maya123z30 wrote:

I'm running a GWAS for the first time and modeling my analysis based on this paper, since the nature of our data is similar. The analysis uses GEMMA to apply a linear mixed model and includes the top five principle components as covariates. However, when I ran the analysis and generated a QQ plot, it showed substantial evidence of population structure (early divergence from the expected line).

To correct the population structure issues, I tried using the top 20 principle components instead of the top 5. I also tried pruning for linkage disequilibrium prior to generating the principle components. Neither of these significantly altered the QQ plot.

The only other thing I can think of is to try correcting my phenotypes for additional covariates. I'm using a fly model and so I could correct for Wolbachia infection status and the presence of chromosomal inversions (thought none of these are individually associated with my phenotype). However, since my phenotype is binary (case/control), the residuals from fitting a logistic model look strange. The phenotypes start out as 1's and 0's but end up looking more like a continuous variable that includes some negative numbers and ranges from -2 to 5. This would cause GEMMA to treat it like a quantitative phenotype instead of binary and potentially disrupt the analysis.

Does anyone have advice for how I might proceed?

Edit: One option that just occurred to me is to just include the additional covariates (Wolbachia/inversions) in the covariates file along with the eigenvectors generated by PCA. However the fact that all the other papers I've seen with this model have used corrected phenotypes (i.e., covariates regressed out) rather than including them in the covariates file makes me think that perhaps there's some kind of problem with this approach?

gemma pca qq plot gwas • 1.3k views
ADD COMMENTlink modified 12 months ago by Kevin Blighe39k • written 12 months ago by maya123z30
2
gravatar for Kevin Blighe
12 months ago by
Kevin Blighe39k
Republic of Ireland
Kevin Blighe39k wrote:

Hi, the first thing that I will add is that using 20 PC eigenvectors is almost definitely not required. Attempting to adjust your statistics using that many eigenvectors is likely to introduce more bias than remove [bias]. The decision on how many PC eigenvectors to include is usually made by observing pairwise bi-plots of the first few eigenvectors.

For example, if you look at this figure ( taken from Produce PCA bi-plot for 1000 Genomes Phase III in VCF format ): Screen

Here, the first 3 eigenvectors are clearly capturing population structure, so, these would be adjusted for in the modeling.

--------------------------------

For other ideas on what to try:

  • Pruning SNPs based on LD (you've done)
  • Removing samples that break HWE

On the issue of phenotypes, it's not a good idea to just 'randomly' include phenotypes that you think may be confounding factors - instead, you should determine statistically if they are [confounders]. You can do this by testing each phenotype independently as a predictor of your outcome via a Chi-squared ANOVA, and include the statistically significant ones as covariates in the final model. For example, in R, I would do this as:

anova(glm(outcome ~ phnotype1, data=MyData, family=binomial()), test="Chisq")

If you are testing for those that confound population structure, then 'outcome' in the model would be 'population'.

--------------------------------

Finally, a good study design also helps to minimise biases. With a very bad study design, bias may be impossible to mitigate.

Kevin

ADD COMMENTlink written 12 months ago by Kevin Blighe39k

Hi Kevin, thanks so much for this helpful information. I'm still a bit confused about how I should decide how many eigenvectors to include based on the bi-plot. I don't have any categorical data (such as ethnicity, as in your example) that I can use to color-code my plots. I used R to plot the first component vs. components 2, 3, 4, and 5, but I'm not sure what I'm looking for in these plots to decide whether or not the component should be included.

ADD REPLYlink written 12 months ago by maya123z30
1

The standard is to include just 2 eigenvectors, i.e., to correct or population structure. If your sample cohort is entirely from the same region, however, then you may not even have to correct for population structure.

ADD REPLYlink written 12 months ago by Kevin Blighe39k

Ah I see. That's strange because for all other papers I've seen using this model set (205 inbred Drosophila lines derived from the same geographical area) always include 5-20 principle components as covariates. In my case, I fitted a logistic regression to the first 5 pc's and determined that none were significantly associated with my phenotype. However, in the paper that I'm basing my analysis on (linked in my original post) they also determined that none of the pc's were associated with their phenotype, yet they still included them as covariates. Perhaps I should try running the analysis without any pc covariates at all?

ADD REPLYlink written 12 months ago by maya123z30

Just dawned on me that you are studying Drosophila spp., here the population differences are undoubtedly different depending on how they are bred.

I don't know the specifics of your own study cohort but, if you found that none of the first 5 PCs were statistically significant, then none of the subsequent PCs are likely to be. You can try without PC covariate adjustment to see how it affects the QQ plot.

ADD REPLYlink written 12 months ago by Kevin Blighe39k

Since none of the pc's or other covariates were statistically significant, I ran the analysis again without any covariates. The QQ plot from that data had a strange step-shaped pattern: https://imgur.com/a/t9Zu7

ADD REPLYlink written 12 months ago by maya123z30

Out of interest, can you show the other QQ plots? Obviously, a QQ plot does not have to perfectly follow the expected distribution (red line) - the deviations from the expected distribution can be reflective of the differences for which you are looking, i.e., between cases and controls

ADD REPLYlink written 12 months ago by Kevin Blighe39k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1605 users visited in the last hour