I am trying to find a formal reference and more details to this equation:
I saw it in this Nature Genetics paper (Online Methods section), by Park et al. (2010).
This equation gives the contribution of a SNP to the genetic variance of a trait, which the authors call the effect size (ES), in terms of its regression effect (Beta) and allele frequency (f).
I have searched in books, articles and throughout the internet, but found nothing.
Could someone provide a more detailed reference or explanation about it?
Here is my guess: ES is the variance that can be attributed to a particular SNP in the regression equation. Say you have a regression model Y = BX + epsilon, where Y is your phenotype, B is the regression coefficient, epsilon is random noise, and X is 0, 1, or 2 depending on how many copies of the allele an individual carries. Then ES = Var[BX] = B^2 * Var[X], where Var is the variance; remember that a constant can be pulled outside the variance, but it has to be squared. If the two alleles are drawn independently (i.e., under Hardy-Weinberg equilibrium), X follows a binomial distribution with 2 trials, so Var[X] = 2f(1-f), where f is the minor allele frequency. Altogether, ES = 2 * B^2 * f(1-f). Perhaps someone with more background in genome-wide association studies can give you a more definitive answer, though.
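As a quick sanity check of the algebra above, here is a minimal simulation sketch. It assumes Hardy-Weinberg equilibrium (the genotype is the sum of two independent Bernoulli(f) draws); the function names and the values of beta and f are made up for illustration and do not come from Park et al.

```python
import random
import statistics


def snp_effect_size(beta, f):
    """Closed form ES = 2 * beta^2 * f * (1 - f), assuming Hardy-Weinberg equilibrium."""
    return 2.0 * beta ** 2 * f * (1.0 - f)


def simulate_effect_size(beta, f, n=100_000, seed=0):
    """Empirical Var[beta * X] with X ~ Binomial(2, f)."""
    rng = random.Random(seed)
    # X = number of copies of the allele: two independent Bernoulli(f) draws
    genotypes = [(rng.random() < f) + (rng.random() < f) for _ in range(n)]
    return statistics.pvariance([beta * x for x in genotypes])


beta, f = 0.3, 0.2  # hypothetical effect and allele frequency
analytic = snp_effect_size(beta, f)
empirical = simulate_effect_size(beta, f)
print(f"analytic ES = {analytic:.4f}, simulated Var[BX] = {empirical:.4f}")
```

The two numbers should agree up to sampling noise, which is all the formula claims: it is the variance of the SNP's additive contribution BX.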
Hi Collin! I had already figured out more or less the same: it contains everything an effect size formula should contain. However, what I ideally need is a formal textbook reference or something similar. Thanks anyway for your time!
In my opinion, the quantity described in words as "The effect size, as defined above, corresponds to the contribution of the locus to the genetic variance of the trait under Hardy-Weinberg equilibrium and an additive polygenic model (Park et al., 2010)" should be ES/Var(G) rather than ES as they defined it, where Var(G) is the genetic variance of the trait. My proof follows:
Model: the phenotypic value P consists of a genetic value G and an environmental value E, so P = G + E. In total there are many susceptibility SNPs, i.e., causal SNPs. Assume these SNPs are independent. Suppose one susceptibility SNP has genetic value G1 = beta*X, where beta is the effect of this SNP on the phenotype and X is the number of copies of the allele. G1 is part of G, so G = G1 + GR, where GR is the remaining genetic value contributed by all other SNPs.
The genetic variance explained by that susceptibility SNP is the coefficient of determination for the regression of G on G1. It can be calculated as Cov(G,G1)^2/(Var(G)Var(G1)). As G1 is part of G and the SNPs are independent, Cov(GR,G1) = 0, so in the numerator Cov(G,G1) = Cov(G1+GR,G1) = Var(G1) + Cov(GR,G1) = Var(G1). Thus, the coefficient of determination is Var(G1)^2/(Var(G)Var(G1)) = Var(G1)/Var(G), which is the proportion of the genetic variance contributed by that susceptibility SNP.
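The identity in the derivation above can be checked numerically. This is only a sketch under the same assumptions (independent SNPs, additive effects, Hardy-Weinberg equilibrium); the SNP effects and frequencies are hypothetical values chosen for illustration.

```python
import random
import statistics

rng = random.Random(1)
n = 50_000


def genotype(f):
    """Allele count under Hardy-Weinberg equilibrium: two Bernoulli(f) draws."""
    return (rng.random() < f) + (rng.random() < f)


def cov(xs, ys):
    """Population covariance; cov(xs, xs) is the variance."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical (beta, f) pairs: one focal SNP and the remaining causal SNPs
focal = (0.5, 0.3)
others = [(0.4, 0.1), (-0.3, 0.45), (0.2, 0.25)]

g1 = [focal[0] * genotype(focal[1]) for _ in range(n)]            # G1
gr = [sum(b * genotype(f) for b, f in others) for _ in range(n)]  # GR
g = [a + b for a, b in zip(g1, gr)]                               # G = G1 + GR

# Coefficient of determination for the regression of G on G1
r_squared = cov(g, g1) ** 2 / (cov(g, g) * cov(g1, g1))
# Proportion of genetic variance contributed by the focal SNP
variance_ratio = cov(g1, g1) / cov(g, g)


# Same proportion from the closed form 2 * beta^2 * f * (1 - f) per SNP
def es(beta, f):
    return 2 * beta ** 2 * f * (1 - f)


expected = es(*focal) / (es(*focal) + sum(es(b, f) for b, f in others))

print(r_squared, variance_ratio, expected)
```

All three printed values should be close to each other, differing only by sampling noise, since the sample covariance between G1 and GR is not exactly zero in a finite sample.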
However, Var(G1) itself is the effect size defined by Park et al. In my understanding, the definitions of ES in formula and in words are different things. Can you explain where I misunderstand their definition? Thank you.
Besides, I think effect size is a general term. It may refer to a mean difference, a log odds ratio, or a correlation coefficient. In the paper, I think the authors simply define an effect size that reflects the contribution of a causal SNP to the genetic variance.
My colleague told me that I had misunderstood the meaning of "the contribution of the locus to the genetic variance of the trait". It is not the same as the proportion of genetic variance explained by that SNP; it is just the variance of that SNP's genetic value.
Hi Collin, I think BX here represents the genetic value of one causal SNP. The ES is not the same as Var(Y).
Yes, one would need to add the variance of the epsilon term to get the full variance of Y, Var[Y] = Var[BX] + Var[epsilon]. But the variance of the random noise term wouldn't be considered a genetic influence on the phenotype.
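To make the decomposition concrete, here is a small simulation sketch. It assumes the noise is Gaussian and independent of the genotype, which the thread does not state explicitly; beta, f, and sigma are hypothetical values.

```python
import random
import statistics

rng = random.Random(2)
n = 100_000
beta, f, sigma = 0.3, 0.2, 1.0  # hypothetical effect, allele frequency, noise SD

# X ~ Binomial(2, f) under Hardy-Weinberg equilibrium
x = [(rng.random() < f) + (rng.random() < f) for _ in range(n)]
# epsilon drawn independently of X (assumption)
eps = [rng.gauss(0.0, sigma) for _ in range(n)]
y = [beta * xi + ei for xi, ei in zip(x, eps)]

var_y = statistics.pvariance(y)
var_genetic = 2 * beta ** 2 * f * (1 - f)  # ES, the genetic part only
var_expected = var_genetic + sigma ** 2    # Var[BX] + Var[epsilon]

print(f"Var[Y] simulated = {var_y:.4f}, expected = {var_expected:.4f}")
print(f"share of Var[Y] due to the SNP = {var_genetic / var_expected:.4f}")
```

The noise term inflates Var[Y] but leaves ES unchanged, which is why ES is described as a contribution to the genetic variance rather than to the total phenotypic variance.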