Question

How to estimate the percent phenotypic variation explained by the significant SNP from a GWAS study?

4.5 years ago

anikcropscience ▴ 270

Hello, I have conducted a large-scale GWAS and found a few significantly associated SNPs. I ran the GWAS in GEMMA with the -lmm 1 option to obtain the beta and standard-error estimates. I want to estimate the percentage of phenotypic variation explained by each significant SNP. I used the following procedure to estimate the variance explained in R:

# Regress the phenotype on the SNP genotype; R^2 is the share of variance explained
fit <- lm(Phenotypic_value ~ SNP_data, data = a)
summary(fit)$r.squared

Here, the data frame a contains three columns: sample_ID, Phenotypic_value for each sample, and the biallelic SNP_data. I got a value of 0.43, meaning that 43% of the phenotypic variation is explained by the SNP.

I also tried another formula: 2*f*(1-f)*b.alt^2. Here, f is the minor allele frequency and b.alt is the effect size, i.e. the beta estimate obtained from GEMMA. This gives a value of 0.03, meaning that 3% of the variation is explained, which seems reasonable to me.

My question is: which of these methods is correct? Or is there another way to estimate the percentage of variation explained?

Alternatively, from the GEMMA Google group, I got this formula: pve <- var(x) * (beta^2 + se^2)/var(y). But I do not understand how I can obtain the values of var(x) and var(y).

It will be great to receive some feedback on this. Thank you.

GWAS variance statistics SNP R-studio • 4.3k views

ADD COMMENT • link 4.5 years ago by anikcropscience ▴ 270

In your case:

x=a$SNP_data
y=a$Phenotypic_value

In linear regression involving no covariates (y=alpha+beta*x+e), the correlation coefficient between x and y can be expressed as

r=sqrt(var(x))*beta/sqrt(var(y))

and then you want to take the square of this. I am not sure where the se^2 term comes from, but I see the author of GEMMA won't back up his claim. Generate some fake data in R and you'll see that the formula is wrong and that se^2 does not belong there (for simple regression). There is no reason why an estimate with a higher se would explain a higher percentage of the variance. Maybe it has to do with the fact that GEMMA fits an LMM; I don't know, as I am not familiar enough with it.
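The fake-data check suggested above can be sketched as follows. This is a minimal simulation, not anyone's actual data: the sample size, allele frequency, true effect and seed are arbitrary assumptions, and genotypes are drawn as 0/1/2 allele counts under Hardy-Weinberg equilibrium.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
n <- 200                             # hypothetical sample size
f <- 0.3                             # hypothetical allele frequency
x <- rbinom(n, size = 2, prob = f)   # genotypes coded 0/1/2
y <- 0.5 * x + rnorm(n)              # phenotype with an assumed true effect of 0.5

# Simple linear regression with no covariates
fit  <- lm(y ~ x)
est  <- coef(summary(fit))
beta <- est["x", "Estimate"]
se   <- est["x", "Std. Error"]

r2_lm      <- summary(fit)$r.squared            # R^2 reported by lm
r2_formula <- var(x) * beta^2 / var(y)          # squared correlation from beta
r2_with_se <- var(x) * (beta^2 + se^2) / var(y) # version with the se^2 term

print(c(lm = r2_lm, formula = r2_formula, with_se = r2_with_se))
```

In simple regression the first two values agree to numerical precision, while the version with se^2 is larger by (1 - R^2)/(n - 2), so the gap shrinks as the sample grows. This check says nothing about how the term behaves in GEMMA's LMM.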

Since

var(x)=2*f*(1-f),

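your other formula, 2*f*(1-f)*b.alt^2, is equivalent only if your y has unit variance. Note that the var(x) identity itself assumes genotypes coded 0/1/2 under Hardy-Weinberg equilibrium.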
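A small sketch of that equivalence, under stated assumptions: pve_from_beta is a hypothetical helper (not part of GEMMA or any package), genotypes are simulated as 0/1/2 counts under Hardy-Weinberg equilibrium, and the numeric inputs are made up for illustration.

```r
# Hypothetical helper: squared correlation from an effect size, an allele
# frequency and the phenotype variance, using var(x) = 2 * f * (1 - f).
pve_from_beta <- function(beta, f, var_y = 1) {
  2 * f * (1 - f) * beta^2 / var_y
}

# Check var(x) against 2 * f * (1 - f) on simulated genotypes
set.seed(2)                          # arbitrary seed
n <- 100000                          # hypothetical sample size
f <- 0.2                             # hypothetical allele frequency
x <- rbinom(n, size = 2, prob = f)   # genotypes coded 0/1/2
print(c(sample = var(x), expected = 2 * f * (1 - f)))

# With unit phenotype variance the helper reduces to 2 * f * (1 - f) * beta^2
pve_unit   <- pve_from_beta(beta = 0.3, f = 0.2)             # 0.0288
# With var(y) = 4 the same beta explains a quarter as much
pve_scaled <- pve_from_beta(beta = 0.3, f = 0.2, var_y = 4)  # 0.0072
print(c(unit = pve_unit, scaled = pve_scaled))
```

If the phenotype was not standardized before the association test, dividing by var(y) matters. Whether a beta estimated by an LMM can be plugged in unchanged is not settled in this thread, so treat the result as an approximation.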

ADD REPLY • link 4.5 years ago by Lemire ▴ 940

Hi @Lemire, OK, so the correct formula is pve <- sqrt(var(x))*beta/sqrt(var(y)), followed by pve^2, where var(x) is 2*f*(1-f)?

Do you have any reference sources for that?

Thank you very much.

ADD REPLY • link 4.5 years ago by anikcropscience ▴ 270

What about the second formula? Is it correct this way?

ADD REPLY • link 4.5 years ago by anikcropscience ▴ 270