GWAS: how do we know we have the most appropriate model?
5.7 years ago
Nick ▴ 70

(Note: crossposted on Cross Validated here)

My question is about how to know when the optimal statistical model has been selected for a GWAS.

I appreciate that statistical models always provide only an approximation to a true process, and we aim to make inferences about those processes by choosing an appropriate model pragmatically (e.g. based on fit, complexity etc - there are no hard and fast rules).

But for a GWAS where there are many thousands or millions of tests being performed in parallel, how should we determine that one model is more appropriate than another?

A simple example: if we want to control for age, is it more appropriate to include a covariate for age only, or for age + age^2, as I have seen done? A trickier example: I have a quantitative trait with excess zeroes (it is based on % response to treatment). Is it better to use a 2-part model, a truncated model, some non-parametric test, or just to press on regardless with a linear model?

To answer these questions using the data, is it sufficient simply to look at QQ plots and Genomic Control (or LDSC intercept)? I have mainly seen these methods used as a check for population structure in the past, but are they also appropriate for model selection? Or are there better approaches?

GWAS statistics genomic control GC inflation • 3.0k views
5.7 years ago

You will likely / hopefully receive a more in depth response on Cross Validated, which is more 'statistics'-aligned than we are.

The simplest type of association test just looks at allele tallies in cases versus controls and derives a Chi-square P value from this, as I go over step-by-step here: A: SNP dataset and Z Score
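As a minimal sketch of that allele-tally test in base R: the counts below are invented purely for illustration, and the object names (`allele_counts`, `allelic_test`, `odds_ratio`) are hypothetical.

```r
# Hypothetical allele counts (2 alleles per person) for one SNP.
allele_counts <- matrix(
  c(420, 580,   # cases:    minor allele, major allele
    350, 650),  # controls: minor allele, major allele
  nrow = 2, byrow = TRUE,
  dimnames = list(status = c("case", "control"),
                  allele = c("minor", "major"))
)

# Pearson chi-square test on the 2x2 table (1 degree of freedom).
# correct = FALSE turns off the Yates continuity correction.
allelic_test <- chisq.test(allele_counts, correct = FALSE)

allelic_test$statistic  # chi-square statistic
allelic_test$p.value    # P value for this single SNP

# Allelic odds ratio for the minor allele.
odds_ratio <- (allele_counts["case", "minor"] * allele_counts["control", "major"]) /
  (allele_counts["case", "major"] * allele_counts["control", "minor"])
odds_ratio
```

In a real GWAS the same test is repeated for every SNP, which is why the multiple-testing threshold matters so much.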

The SNPs are tested independently for association. Each test is therefore its own model, and the SNPs that pass genome-wide statistical significance (conventionally P < 5×10^-8) are chosen. Prior to running these tests, we can pre-filter the variants based on various metrics, including:

  • missingness
  • deviation from Hardy-Weinberg Equilibrium (HWE)
  • linkage disequilibrium
  • et cetera
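A rough sketch of the first two filters, on simulated genotypes. The thresholds (5% missingness, HWE P > 1e-6) are illustrative assumptions rather than recommendations, and linkage disequilibrium filtering is not covered here; in practice dedicated tools are normally used for all of these steps.

```r
set.seed(1)
n_samples <- 500
n_snps <- 20

# Simulated genotypes coded as minor-allele counts (0, 1, 2), drawn under HWE.
maf <- runif(n_snps, 0.1, 0.5)
geno <- sapply(maf, function(p) rbinom(n_samples, size = 2, prob = p))

# Knock out 2% of calls at random to mimic missing genotypes.
geno[sample(length(geno), size = round(0.02 * length(geno)))] <- NA

# Per-SNP missingness: fraction of samples with no call.
missingness <- colMeans(is.na(geno))

# Per-SNP HWE test: 1-df chi-square comparing observed genotype counts
# with those expected from the estimated allele frequency.
hwe_p <- apply(geno, 2, function(g) {
  g <- g[!is.na(g)]
  n <- length(g)
  p <- mean(g) / 2
  observed <- c(sum(g == 0), sum(g == 1), sum(g == 2))
  expected <- n * c((1 - p)^2, 2 * p * (1 - p), p^2)
  stat <- sum((observed - expected)^2 / expected)
  pchisq(stat, df = 1, lower.tail = FALSE)
})

# Illustrative thresholds only; appropriate cut-offs depend on the study.
keep <- missingness < 0.05 & hwe_p > 1e-6
table(keep)
```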

Power analysis prior to study commencement helps to determine ideal sample numbers for testing.

That's just simple association tests.

--------------------------------------------------

It is also possible to test each SNP and adjust for certain factors, like BMI, height, exposure to allergens, ethnicity, gender, age, etc. This type of test is performed through logistic regression.

In these situations, the endpoint (y variable) is usually a binary trait (1, case; 0, control), with the x predictors being the genotype being tested and then all of the covariates that I mentioned above, e.g.:

glm(CaseControl ~ SNP1 + PC1 + PC2 + age + BMI, family = binomial)

[NB - ethnicity / population stratification is usually controlled via PCs / eigenvectors, here PC1 and PC2]

From this, the P value for SNP1 will be 'adjusted' by the other factors in the model, i.e., PC1, PC2, age, and BMI.
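To make this concrete, here is a small runnable sketch on simulated data. The data frame `df`, the effect sizes, and the sample size are all assumptions made up for illustration; only the column names follow the formula above. Note that `family = binomial` is what makes `glm` fit a logistic regression.

```r
set.seed(42)
n <- 1000

# Simulated data; all column names mirror the formula above.
df <- data.frame(
  SNP1 = rbinom(n, size = 2, prob = 0.3),  # additive coding: 0, 1 or 2 minor alleles
  PC1  = rnorm(n),
  PC2  = rnorm(n),
  age  = runif(n, 20, 80),
  BMI  = rnorm(n, mean = 26, sd = 4)
)

# Simulate case/control status with a modest SNP effect (log-odds of 0.3 per allele).
linear_predictor <- -3 + 0.3 * df$SNP1 + 0.02 * df$age + 0.05 * df$BMI
df$CaseControl <- rbinom(n, size = 1, prob = plogis(linear_predictor))

# family = binomial requests logistic regression.
fit <- glm(CaseControl ~ SNP1 + PC1 + PC2 + age + BMI,
           family = binomial, data = df)

# One row per term: estimate (log-odds), standard error, z value, P value.
coefs <- summary(fit)$coefficients
coefs["SNP1", ]

# Odds ratio per copy of the minor allele, adjusted for the covariates.
exp(coefs["SNP1", "Estimate"])
```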

Note: you should not adjust for a variable without justification. You cannot just throw everything into the model and assume that the result will be any good. For example, prior to testing any SNPs, you should independently test each of your covariates against your endpoint to see whether it differs statistically significantly between, for example, your cases and controls. If it does, then you should include it; otherwise, your SNP associations could be confounded by that covariate.

Regarding age and age^2, think of the squared term as a way of letting the model capture a non-linear (curved) relationship between age and the trait, rather than assuming that the effect of age is strictly linear. Other options may include:

  • converting age into a categorical variable (e.g. <18, 18-29, 30-49, >=50)
  • stratifying your cases and controls by matching on age, as in conditional logistic regression
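One possible way to let the data speak on age versus age + age^2 is to compare the two nested, genotype-free models with a likelihood ratio test and information criteria. The sketch below uses simulated data in which the curvature is deliberately built in, so the variable names and effect sizes are assumptions and the result is not evidence about any real trait.

```r
set.seed(7)
n <- 2000
age <- runif(n, 20, 80)

# Simulated outcome whose log-odds depend on age in a curved (quadratic) way.
age_c <- (age - 50) / 10                 # centred and scaled age
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.2 * age_c + 0.3 * age_c^2))

# Genotype-free models: linear age versus linear + quadratic age.
fit_linear    <- glm(y ~ age, family = binomial)
fit_quadratic <- glm(y ~ age + I(age^2), family = binomial)

# Likelihood ratio test of the nested models (1 extra parameter).
lrt <- anova(fit_linear, fit_quadratic, test = "Chisq")
lrt

# Information criteria as a complementary view (lower is better).
AIC(fit_linear, fit_quadratic)
```

Because this comparison does not involve genotype, it can be done once, before the genome-wide scan, rather than once per SNP.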

Indeed, one should also check the distribution of each covariate and normalise (log, square, square root, etc), categorise, match (conditional logistic regression), or do something else if the distribution is likely an issue.

If you have a trait with excess zeros, then consider categorising it (for example, any response versus no response) rather than modelling it as a numerical variable.
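For illustration only, the sketch below shows the categorising idea next to the two-part model raised in the question, on a simulated zero-inflated trait. The simulation settings, the log transform of the non-zero values, and all object names are assumptions; neither option is being put forward as the correct model.

```r
set.seed(11)
n <- 1500
snp <- rbinom(n, size = 2, prob = 0.25)

# Simulated % response: many exact zeros, positive values otherwise.
responder <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.4 * snp))
response  <- ifelse(responder == 1, pmin(100, rexp(n, rate = 1 / 30)), 0)

# Option A: categorise the trait (any response vs none), logistic regression.
# This is also part 1 of a two-part model.
any_response <- as.integer(response > 0)
fit_binary <- glm(any_response ~ snp, family = binomial)

# Part 2 of a two-part model: among non-zero values only,
# model the size of the response (log scale chosen here as an assumption).
nonzero <- response > 0
fit_positive <- lm(log(response[nonzero]) ~ snp[nonzero])

summary(fit_binary)$coefficients
summary(fit_positive)$coefficients
```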

A QQ plot of your obtained P values is the typical and easy way to see how your sample cohort and model assumptions have held up: it compares the observed P values against the uniform distribution expected under the null hypothesis of no association.
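A minimal sketch of a QQ plot together with the genomic inflation factor (lambda) mentioned in the question. The P values here are simulated under the null as a stand-in for real GWAS output, so `p_values` is a placeholder for your own results.

```r
set.seed(3)
# Stand-in for GWAS output: P values simulated under the null (uniform on 0..1).
p_values <- runif(10000)

# Genomic inflation factor: median observed 1-df chi-square statistic
# divided by the median expected under the null (about 0.455).
chisq_obs <- qchisq(p_values, df = 1, lower.tail = FALSE)
lambda_gc <- median(chisq_obs) / qchisq(0.5, df = 1)
lambda_gc  # values near 1 suggest little overall inflation

# QQ plot on the -log10 scale: observed versus expected quantiles.
observed <- -log10(sort(p_values))
expected <- -log10(ppoints(length(p_values)))
plot(expected, observed,
     xlab = "Expected -log10(P)", ylab = "Observed -log10(P)")
abline(0, 1, col = "red")  # points on this line match the null expectation
```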

Please also read this former answer by Philipp: A: GWAS: when is it appropriate to add covariates?

Kevin


Thanks for a very comprehensive answer Kevin! I really appreciate you taking the time to engage.

While not wishing to gloss over a lot of the really helpful background/context, I just want to pick up on the key point that motivated my question. Let's stick with the age and age^2 question, i.e. when to include a transformed covariate in a logistic regression setting (and presumably your advice holds for linear regression for a quantitative trait). My interpretation of your answer is that there are two broad alternatives:

  • Select covariates/transformations using a model without genotype
  • Incorporate covariates/transformations into the GWAS and then check QQ plots (+ presumably genomic control etc)

Would you recommend both of these? If QQ plots show severe deviation from the expected distribution, would you go back to the genotype-free model and try to obtain a better fit?

In my trickier example, where the issue is more around selecting an appropriate modelling framework, I think option 1 (make selections without using genotype) is harder to do. Surely then you would need a variable (i.e. covariate) that you know is associated with your trait, in order to "know" when you've found the right framework. In this case, is it sufficient to choose a modelling framework based on a consideration of its assumptions etc, and then conclude that it is an appropriate choice if the QQ plot looks reasonable?

Thanks, Nick


Hey Nick, from what I have learned, each study is different and, moreover, any two epidemiologists will do the same study slightly differently based on their own prior experiences. As an example, I once worked on a study in which a particular covariate had been historically assumed to have an influence on the trait being tested. I showed that it had no influence in our particular data, but the PI nevertheless wanted to include the covariate for historical reasons / pressure. Others want to manipulate data in such a way as to satisfy their pre-conceived hypotheses about what should / should not be a covariate.

So, in practice, a good QQ plot is definitely not the deciding factor. I do agree that the simple model of just looking at allele tallies is not the best approach, though.

So, in relation to the more difficult question, there's never a point where you say that this or that model set-up is the best. Also, the association of some variants with the trait being tested may not even be confounded by the same covariates as other variants, depending on the underlying biology at play. For example, in a cancer study, a variant that results in higher metabolism of fatty acids may be confounded by BMI, whereas another variant that modulates melanin production is likely not confounded by BMI in its association with the trait being tested. Unfortunately, nobody has time to test each variant manually.

Perhaps it is these issues, along with poor study design, that have contributed to the poor reproducibility of GWAS. I touch on this early on in my recently published review.


Hi Kevin,

I know this is an old thread but I will give it a shot asking a question here:

What if you only wanted to extract the SNPs significant for age after controlling for other covariates in your model (PC1, PC2, BMI)?

As an example:

I am running GWAS in GAPIT in R.

For example, if I run GAPIT with covariates “family” (fam1, fam2, fam3) and “temperature” (27C, 32C) as fixed covariate factors, can I extract the significant SNPs associated with fam3 at 32C for example?

Thanks!


Hey, one could do it for a standard regression model, as a coefficient and p-value are returned for every parameter in the model. I am not sure how GAPIT functions, though - does it not return values for each parameter?
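As a sketch of what "a standard regression model" returns, here is a plain `lm` fit for a single SNP on simulated data. This is base R, not GAPIT, and the data frame `d`, the factor levels, and the effect size are invented to mirror the example in the question.

```r
set.seed(5)
n <- 600
d <- data.frame(
  snp         = rbinom(n, size = 2, prob = 0.3),
  family      = factor(sample(c("fam1", "fam2", "fam3"), n, replace = TRUE)),
  temperature = factor(sample(c("27C", "32C"), n, replace = TRUE))
)
d$trait <- 10 + 0.5 * d$snp + rnorm(n)

# Main-effects model: one coefficient and P value per parameter.
# Factor covariates are expanded relative to their first level
# (fam1 and 27C act as the reference levels here).
fit <- lm(trait ~ snp + family + temperature, data = d)
coefs <- summary(fit)$coefficients
coefs

# P value for the SNP, adjusted for family and temperature.
coefs["snp", "Pr(>|t|)"]
```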


Hi Kevin, Thanks for the speedy reply. After hunting around I don't think this is possible to do with GAPIT unfortunately. It was suggested by a GAPIT expert to do the following though:

"however, that if you go the significant SNPs and did a regression including the other variables that you could get what you needed with respect to SNP effect within each family/fixed effect. Also, you could do a two-stage analysis where you just put the id'ed SNPs as covariates into GAPIT and run the model - it should at least give you a pretty solid idea of what you are looking for."
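One way to read the first suggestion is as a follow-up regression, outside GAPIT, on a SNP that has already been flagged, with a separate SNP slope per family. The sketch below is an interpretation under that assumption, using simulated data and hypothetical names; it is not GAPIT code.

```r
set.seed(9)
n <- 900
d <- data.frame(
  snp         = rbinom(n, size = 2, prob = 0.3),
  family      = factor(sample(c("fam1", "fam2", "fam3"), n, replace = TRUE)),
  temperature = factor(sample(c("27C", "32C"), n, replace = TRUE))
)
# Simulated trait in which the SNP only matters in fam3.
d$trait <- 10 + 0.8 * d$snp * (d$family == "fam3") + rnorm(n)

# family:snp without a snp main effect gives one SNP slope per family.
fit <- lm(trait ~ family + temperature + family:snp, data = d)
coefs <- summary(fit)$coefficients

# Rows ending in ":snp" hold the within-family SNP effects and P values.
per_family <- coefs[grep(":snp$", rownames(coefs)), , drop = FALSE]
per_family
```

Restricting the data to one temperature before fitting (for example with `subset(d, temperature == "32C")`) would give the within-family SNP effects at that temperature only.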
