Phenotype data normalization prior to GWAS
20 months ago
antmantras ▴ 80

Hi all.

Is it necessary, before conducting a GWAS (MLMM approach, multi-locus mixed model), to normalize quantitative phenotypes so that they follow a normal distribution?

After applying the Shapiro-Wilk test to each phenotype, I observed that none of the four traits studied follows a normal distribution. See, for example, a histogram of the quantity of a compound "A".

I have performed the GWAS with the un-normalized phenotypes, and the QQ-plots obtained for each of them are:

Except for compound B (if I had to choose one), the plots seem OK to me. If prior normalization is required, which method should I use? I have read about quantile normalization and the rank-based inverse normal transformation, which seems to be more popular. Thanks in advance.
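For readers unfamiliar with it, the rank-based inverse normal transformation mentioned above can be sketched roughly as follows. This is only an illustration, not the method of any particular GWAS package: the function name `rank_inverse_normal` is hypothetical, and the Blom-style offset `c = 3/8` and average ranks for ties are assumed choices that other implementations may handle differently.

```python
from statistics import NormalDist


def rank_inverse_normal(values, c=3.0 / 8.0):
    """Map values to normal scores via their ranks (ties share an average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Convert each rank to a probability in (0, 1), then to a normal quantile.
    return [std_normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]
```

Calling `rank_inverse_normal(phenotype_values)` returns one score per sample in the original order. Tied values (for example, repeated zeros) receive the same score, so a spike of identical values remains a spike after the transformation.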

normalization gwas phenotype • 1.7k views
20 months ago
LChart 4.1k

The core assumption behind linear-model test statistics is that the parameter estimates are normally distributed around their true values. This is guaranteed when the residuals are normal, but it also holds asymptotically (as N -> infinity) through independence and the Central Limit Theorem. Because GWAS have very large values of N, it should not generally matter whether the residuals are normal or not.
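The Central Limit Theorem argument can be checked with a small simulation. The sketch below is illustrative only: it uses ordinary least squares on a single simulated variant rather than a mixed model, the function names and default parameters (`n`, `reps`, `maf`, `seed`) are hypothetical, and the exponential residuals are just one assumed example of a skewed phenotype.

```python
import math
import random


def slope_z(x, y):
    """Return the OLS slope divided by its standard error for y ~ x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta / se


def null_false_positive_rate(n=300, reps=1000, maf=0.3, seed=1):
    """Fraction of null simulations with |z| > 1.96 under skewed residuals."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Genotype dosage 0/1/2: sum of two independent allele draws.
        x = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
        # Phenotype is independent of genotype and strongly right-skewed.
        y = [rng.expovariate(1.0) for _ in range(n)]
        if abs(slope_z(x, y)) > 1.96:
            hits += 1
    return hits / reps
```

If the argument holds, `null_false_positive_rate()` should come out close to the nominal 0.05 even though the residuals are far from normal. Very rare variants or much smaller samples can behave differently, so the parameters are worth varying.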

For other instances of regression, a larger issue is non-linearity between response and predictors; but since a GWAS genotype has only three states (AA=0, AB=1, BB=2), it's rare to observe a deviation from linearity (there are not that many dominance effects).

Finally, the distribution you are showing is not merely non-normal; it appears to be zero-inflated. There is no monotonic transformation that can convert a zero-inflated distribution into a normal distribution, so the approach here would have to be to use a GLMM in place of an LMM and explicitly model the relationship between variant dosage and (a) the probability of 0, and (b) the conditional distribution of the non-zero values, (y | y != 0).
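As a rough illustration of the two-part idea in (a) and (b), the sketch below fits a logistic regression for the probability of a zero and a separate straight line for the non-zero values. It is deliberately simplified and is not a GLMM: there are no random effects or kinship terms, the non-zero part is assumed linear and untransformed, and all function names are hypothetical.

```python
import math


def fit_logistic(x, z, iterations=25):
    """Newton-Raphson fit of P(z = 1) = logistic(b0 + b1 * x)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, zi in zip(x, z):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += zi - p            # score for the intercept
            g1 += (zi - p) * xi     # score for the slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 Newton system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def fit_line(x, y):
    """Ordinary least squares fit; returns (intercept, slope)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def two_part_fit(dosage, phenotype):
    """Fit (a) P(y = 0 | dosage) and (b) the mean of y given y != 0."""
    is_zero = [1.0 if y == 0 else 0.0 for y in phenotype]
    zero_part = fit_logistic(dosage, is_zero)
    nonzero = [(x, y) for x, y in zip(dosage, phenotype) if y != 0]
    nonzero_part = fit_line([x for x, _ in nonzero], [y for _, y in nonzero])
    return zero_part, nonzero_part
```

`two_part_fit(dosage, phenotype)` returns the logistic coefficients for the zero part and the linear coefficients for the non-zero part. The Newton iteration assumes the zeros are not perfectly separated by dosage; with only a handful of zeros the zero part carries little information.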


Although it looks as if there are many zero values, those are actually values below one (0.2, 0.5, 0.3, etc.). There are only 3 zero values in the whole set of phenotypes. Would your approach still be necessary? If so, could you recommend a tool to perform this analysis?