GWAS: when is it appropriate to add covariates?
6.3 years ago
nchuang ▴ 260

I'm struggling with biostatistics and association study design.

I initially wanted to do an association study comparing two populations to see which SNPs differ significantly. For example, I am looking at variation between centenarians (people who live >100 years) and a control group. Should I include age as a covariate? I am interested in detecting longevity variants, or anything that suggests a difference from the control group. I do not think age is necessary: it is not a confounder because it is not an independent covariate. I think I read that adding independent covariates can decrease the power of a study.

When do you start considering modeling with linear or logistic regression instead of GWAS? Is it when you have a dependent variable and a predictor variable you are interested in? Do you think I should have added covariates?

Or is it possible instead to change the case and control populations to groups that reflect covariate status? For example, if I were interested in centenarians with Alzheimer's compared to a control population with Alzheimer's, would a logistic regression with Alzheimer's status be more appropriate than running a GWAS on them?


Sorry for so many questions.

genome gwas • 8.5k views
6.3 years ago

I'm a bit self-taught in this but since no-one has answered in 2 days I'll give it a try, others can feel free to chip in.

>Should I be including age as a covariate?

You add covariates when you expect them to influence your phenotype. For example, gender often has an influence, but in humans the most common covariate is population structure (such as PC1, PC2 and sometimes PC3 from PCA software). In your case I wouldn't use age as a covariate: it is practically identical to your phenotype of interest, so it is not independent.

Generally, I'd add covariates if you have surprisingly tiny p-values, and especially if your QQ-plots look bad. For a good introduction to reading and interpreting QQ-plots, see here: Behavior of QQ-Plots and Genomic Control in Studies of Gene-Environment Interaction. How to plot QQ-plots depends on the format of your output (some software like GAPIT does it automatically).
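One rough way to put a number on "QQ-plots look bad" is the genomic inflation factor (lambda): the median observed 1-df chi-square statistic divided by its expected median under the null. The sketch below is only an illustration using the Python standard library, not what PLINK or GAPIT do internally. The function names and the simulated p-values are made up for the example.

```python
import math
import random
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(pvalues):
    """Lambda: median observed chi-square (1 df) over the null median."""
    nd = NormalDist()
    # A two-sided p-value p corresponds to z = inv_cdf(p / 2); chi-square = z^2.
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in pvalues]
    return median(chi2) / CHI2_1DF_MEDIAN

def qq_points(pvalues):
    """(expected, observed) -log10(p) pairs, ready to hand to any plotting tool."""
    n = len(pvalues)
    observed = sorted(-math.log10(p) for p in pvalues)
    expected = sorted(-math.log10((i + 0.5) / n) for i in range(n))
    return list(zip(expected, observed))

# Simulated p-values (hypothetical, not real GWAS output).
random.seed(0)
null_p = [1.0 - random.random() for _ in range(10000)]  # uniform on (0, 1]
inflated_p = [p ** 2 for p in null_p]                    # skewed towards small values

lambda_null = genomic_inflation(null_p)
lambda_inflated = genomic_inflation(inflated_p)
print(f"lambda (null): {lambda_null:.2f}")
print(f"lambda (inflated): {lambda_inflated:.2f}")
```

A lambda close to 1 means the bulk of the p-values behaves as expected under the null. A value clearly above 1 is the usual hint that something like population structure is unaccounted for.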

Have a look at these two tutorials for more on covariates, QQ-plots and population stratification: and (these also have code for plotting QQ-plots for PLINK results)

>When do you guys start considering modeling with linear or logistic regressions instead of GWAS?

I'm not sure I understand this question; regressions are a part of GWAS. In my experience, running a regression with PLINK gives results fairly similar to a mixed linear model such as the one implemented in GAPIT. The actual p-values differ, but the SNPs with the lowest p-values stay roughly the same.
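To make "regressions are a part of GWAS" concrete, here is a minimal sketch of what a single-SNP logistic regression with one covariate looks like. It is a toy Newton-Raphson fit in plain Python on simulated data, not PLINK's or GAPIT's implementation. The variable names (`genotype`, `pc1`), the effect sizes, and the sample size are all assumptions made up for the example.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_logistic(X, y, iterations=25):
    """Newton-Raphson fit. Each row of X starts with 1.0 for the intercept.

    Returns the coefficients and the information matrix (X^T W X)
    computed in the final iteration.
    """
    k = len(X[0])
    beta = [0.0] * k
    info = [[0.0] * k for _ in range(k)]
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    info[a][c] += w * xi[a] * xi[c]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta, info

def wald_test(beta, info, j):
    """Standard error and two-sided Wald p-value for coefficient j."""
    unit = [1.0 if i == j else 0.0 for i in range(len(beta))]
    variance = solve(info, unit)[j]  # j-th diagonal entry of the inverse
    se = math.sqrt(variance)
    z = beta[j] / se
    return se, math.erfc(abs(z) / math.sqrt(2))

# Simulated cohort (hypothetical): case/control status depends on the SNP and on PC1.
random.seed(1)
X, y = [], []
for _ in range(3000):
    genotype = (random.random() < 0.3) + (random.random() < 0.3)  # 0, 1 or 2 minor alleles
    pc1 = random.gauss(0, 1)
    eta = -0.5 + 0.8 * genotype + 1.0 * pc1
    status = 1 if random.random() < 1 / (1 + math.exp(-eta)) else 0
    X.append([1.0, float(genotype), pc1])
    y.append(status)

beta, info = fit_logistic(X, y)
se_snp, p_snp = wald_test(beta, info, 1)
print(f"SNP log-odds ratio: {beta[1]:.3f} (SE {se_snp:.3f}), p = {p_snp:.2e}")
```

A GWAS then amounts to repeating this fit once per SNP while keeping the same covariate columns. Adding or dropping a covariate just means adding or dropping a column of `X`.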


I think this is a correct answer. Age should not be a covariate because it depends on your response variable. An alternative could be to fit a Poisson regression on age instead of a logistic regression, which would estimate how each SNP genotype relates to age in years (instead of the categories centenarian/non-centenarian). Another approach is to model the logistic regression as the probability of reaching the maximum age (e.g. follow the example here: )

Other factors that can be included as covariates, apart from the principal components, are: 1) the sequencing center, 2) the technology used to sequence, if these differ between samples, and 3) the location where the samples were taken.


I really appreciate the links to further reading since I am currently teaching myself as well!

Regarding regression vs association study, I think what I meant was doing a simple chi-squared or Fisher's exact test on the allele frequencies versus a linear/logistic regression. Sorry for the confusion, but the correct approach would be to do an association study with Fisher's exact test between my centenarians and controls. However, if I wish to add disease status into the analysis, would I use logistic regression with, for example, Alzheimer's as my outcome variable, genotype as my predictor, and centenarian status as my covariate?
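For reference, this is roughly what I mean by the simple allelic test: a 2x2 table of allele counts (minor vs major allele, centenarians vs controls) tested with a chi-squared or Fisher's exact test. It is just a sketch with the Python standard library. The function names and the allele counts are made up, and neither test leaves room for covariates.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))

# Hypothetical allele counts at one SNP:
#                minor  major
# centenarians     10     20
# controls         20     10
chi2, p_chi2 = chi_square_2x2(10, 20, 20, 10)
p_fisher = fisher_exact_two_sided(10, 20, 20, 10)
print(f"chi-square = {chi2:.3f}, p = {p_chi2:.4f}; Fisher p = {p_fisher:.4f}")
```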


