proportion of variance
10 months ago

Hi

I'm reading a paper about predicting gene expression from SNPs. The authors use PVE (proportion of variance explained), but I can't understand what PVE is or how they use it in their model. This may be a simple question; I would be really thankful for a simple answer.

variance gene SNP • 543 views
10 months ago
Vincent Laufer ★ 2.1k

Khatami - I have addressed this at length, elsewhere.

Please see: GWAS: low explained heritability

There are a variety of phrases that are used; variance explained is one of them - don't be thrown off ... the concept is very simple.

For example, let's say I am trying to predict your height. Suppose I get a lot of data on the heights of mothers, fathers, and their children. Then I could build a model like:

1. Y  =  BX + E

2. child height = (some coefficient)(height of dad) + error


The total variance in Y is going to be equal to the variance explained by the predictor you have (X) plus the variance of the "error" (assuming the error is uncorrelated with X). Here, error is any part of the phenomenon Y that you cannot predict accurately, i.e. the variance in child height that height of dad cannot "explain" will go into the error term ... for now.
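To make that decomposition concrete for the one-predictor model, here is a short sketch. It assumes an ordinary least-squares fit in which the error E is uncorrelated with X; the symbol PVE is shorthand for "proportion of variance explained".

```latex
\begin{align*}
\mathrm{Var}(Y) &= \mathrm{Var}(BX + E) = B^2\,\mathrm{Var}(X) + \mathrm{Var}(E) \\
\mathrm{PVE} &= \frac{B^2\,\mathrm{Var}(X)}{\mathrm{Var}(Y)} = 1 - \frac{\mathrm{Var}(E)}{\mathrm{Var}(Y)}
\end{align*}
```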

Intuitively, you might expect this model is "better than nothing" but not very good. The "better than nothing" part is the amount of variation in Y (child height) that can be "explained" by variation in dad's height (X). What about the rest? That is left over in the error term, the part of child height we cannot predict yet; our model does not "explain" that variance.

But since we are using a model with multiple predictors, we can add more terms. Intuitively, imagine this model doesn't "cut it" because it only uses dad's height as a predictor (X). Suppose we have 100 children born to 100 dads who are all 170 cm tall, but the heights of the mothers vary. Clearly, we need to add another term to allow the variation in mom's height to improve our prediction of the child's height.

One of the goals of modeling is to find the best model terms that predict as much as possible about Y, leaving as little as possible "left over" in the error term....

3. child height = (some coefficient)(height of dad) + (another coefficient)(height of mom) + error

Now suppose my model is much better; maybe I can get an estimate with better accuracy. How much better? Well, that depends on the proportion of variance of Y (child height) explained by all the predictors in my model. Let's keep going:

4. child height = (some coefficient)(height of dad) + (another coefficient)(height of mom) + (third coefficient)(did you have enough food in childhood) + error


Suppose now the amount of variance left over is small; in other words, the mom, dad, and nutrition predictors (X1, X2, and X3) now explain more of the variation in child height than is left in the error term. For argument's sake, let's say they explain 90% of the variation in Y and the error accounts for 10% of the total variance of Y. Then we'd say:

our model predicts 90% of the variation in child height using the variation in parental height and nutrition status .. etc.
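To see how the proportion of variance explained changes as predictors are added, here is a minimal sketch in R. All of the data are simulated, and the coefficients, sample size, and variable names (`dad`, `mom`, `nutrition`) are made up purely for illustration. For an ordinary least-squares fit with an intercept, the `r.squared` value reported by `summary()` is the proportion of variance explained.

```r
# Simulated data: every number below is an arbitrary assumption, not real data.
set.seed(1)
n <- 500
dad <- rnorm(n, mean = 175, sd = 7)            # dad's height in cm
mom <- rnorm(n, mean = 162, sd = 6)            # mom's height in cm
nutrition <- rbinom(n, size = 1, prob = 0.7)   # 1 = enough food in childhood
child <- 20 + 0.4 * dad + 0.4 * mom + 5 * nutrition + rnorm(n, sd = 3)

heights <- data.frame(child, dad, mom, nutrition)

# Fit the three nested models from the text, adding one predictor at a time.
fit1 <- lm(child ~ dad, data = heights)
fit2 <- lm(child ~ dad + mom, data = heights)
fit3 <- lm(child ~ dad + mom + nutrition, data = heights)

# R-squared of each fit = proportion of variance in child height explained.
pve <- c(
  dad_only          = summary(fit1)$r.squared,
  dad_mom           = summary(fit2)$r.squared,
  dad_mom_nutrition = summary(fit3)$r.squared
)
print(round(pve, 3))
```

On the data used to fit the models, each added predictor can only keep the explained proportion the same or raise it; whatever is not explained stays in the error term.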

This is a lot like asking: if you subtract the variation explained by our predictors from the variation of height, how much is left over?

Y = B1X1 + B2X2 + B3X3 + E
Y = (coeff)(dads ht) + (coeff2)(moms ht) + (coeff3)(nutrition) + error


subtracting from both sides we have:

5. Y - (B1X1 + B2X2 + B3X3) = amount of variation left over to explain... (error)
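Written out as a formula, the leftover in step 5 leads directly to PVE. This is a sketch using the hypothetical 90%/10% split from the example above; the hat notation for the model's prediction is introduced here and is not from the paper being asked about.

```latex
\begin{align*}
\hat{Y} &= B_1 X_1 + B_2 X_2 + B_3 X_3 \\
E &= Y - \hat{Y} \\
\mathrm{PVE} &= 1 - \frac{\mathrm{Var}(E)}{\mathrm{Var}(Y)} = 1 - 0.10 = 0.90
\end{align*}
```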


The prior answer at the other link should help too. GWAS: low explained heritability


I understand completely, thank you very much. I wrote this code to calculate PVE in R. Is it correct?

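pve_calculator <- function(main, predicted) {
  # main: observed values; predicted: model predictions for the same samples
  total_var <- var(main)
  residual_var <- var(predicted - main)  # variance of the prediction errors
  explained_var <- total_var - residual_var
  pve <- explained_var / total_var       # same as 1 - residual_var / total_var
  return(pve)
}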


Hi Khatami,

So there are people who have written very, very sophisticated linear modeling (and mixed modeling) packages in R. I would not attempt to write my own code for that. I would instead learn how to use an existing package that is very good. For example, you could look at:

1. For (fixed effects) linear modeling - lmFit (from the limma package): https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/lmFit
2. For analysis of gene expression data - DESeq2: https://bioconductor.org/packages/release/bioc/html/DESeq2.html
3. For generalized linear mixed modeling (or random effects modeling) - glmm: https://cran.r-project.org/web/packages/glmm/glmm.pdf
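As a quick sanity check before reaching for any of these packages, a hand-written PVE can be compared against what base R's `lm()` already reports. The sketch below uses simulated data (the slope and sample size are arbitrary assumptions) and a one-line version of the `pve_calculator` function posted above; it is only meant as an illustration, not as the method used in the paper.

```r
# One-line equivalent of the pve_calculator function from the question.
pve_calculator <- function(main, predicted) {
  1 - var(predicted - main) / var(main)
}

# Simulated example: the slope of 2 and n = 200 are arbitrary choices.
set.seed(42)
x <- rnorm(200)
y <- 2 * x + rnorm(200)

fit <- lm(y ~ x)

pve_manual <- pve_calculator(y, fitted(fit))  # hand-computed PVE
pve_lm <- summary(fit)$r.squared              # R-squared reported by lm()

# For an in-sample least-squares fit with an intercept these should agree
# up to floating-point error.
print(all.equal(pve_manual, pve_lm))
```

The agreement holds for in-sample fitted values from a least-squares model with an intercept. If `predicted` comes from held-out data or a different kind of model, the two quantities need not match, and the hand-computed PVE can even be negative.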