Question: RNA-Seq DGE: Evaluate model quality
gravatar for rmf
10 months ago by
rmf460 wrote:

I am using DESeq2 to find differentially expressed genes between conditions control/treated. The PCA using vst counts shows family-based clustering.

enter image description here

I have conjured up a few different GLM models as below along with the number of differentially expressed genes detected:

Model Num of DEGs
~condition 80
~family+condition 623

(using principal components from pca above)
~pc1+pc2+pc3+condition 21

(sva was used to generate one surrogate variable)
~sva+condition 73
~sva+family+condition 568

A venn diagram of the DEGs from the 5 models looks like:

enter image description here

Is there some way to objectively evaluate (quantify) which model is better?

This question is related to this.

ADD COMMENTlink modified 10 months ago by Devon Ryan86k • written 10 months ago by rmf460
gravatar for Devon Ryan
10 months ago by
Devon Ryan86k
Freiburg, Germany
Devon Ryan86k wrote:

I'm not sure why you bothered expending the effort to use SVA and add principal components as covariates. Unsurprisingly, you have a family effect that you need to account for. So ~family + condition is the logical first (and probably only) model to use given that.

ADD COMMENTlink written 10 months ago by Devon Ryan86k

I agree with Devon! Devon, I was hoping that you would respond. There was a discussion in the previous thread: RNA-Seq DGE using PCA components as covariate

The ~ family + condition model is also what I would choose, but definitely not do any adjustment based on PC covariates

ADD REPLYlink modified 10 months ago • written 10 months ago by Kevin Blighe33k

@Kevin The general suggestion is that I use family+condition. I do get your explanation behind it. But, I would still like to know how one would evaluate these models. How do I know which model gives me the 'correct' set of DEGs. Is there some metric that I could use to decide model is better? For example something like AIC, but I am not sure how one would get to that using DESeq2.

ADD REPLYlink written 10 months ago by rmf460

The issue is mostly that each gene has its own AIC/BIC. The normal model comparison metrics aren't intended for cases where you're applying that model independently to 30,000 datasets.

In general, one uses the simplest possible model that reasonably describes the model (fun fact: if you add enough covariates to your model you can get almost everything to be DE...I accidentally did that once and was confused that I suddenly had thousands of DE genes associated with a small effect-size phenotype). Using PCs or SVA is reasonable only when you have some sort of confounding effect that you don't already understand. You know that families are a confounder already.

As an aside, principal components are basically an inferior form of what SVA is doing.

ADD REPLYlink written 10 months ago by Devon Ryan86k

@Devon Yep ~family + condition is my current choice. Sure, I could just do that and move on. But, I am just curious how the other models work. All I have at the moment to evaluate this is the number of DEGs found. I am trying to find out if there is a better metric that I could use to evaluate which model is best.

ADD REPLYlink written 10 months ago by rmf460

Devon's experience matches that of my own in that the simplest model should always be chosen. You do not want to over-adjust, because then, as Devon (and I) implies, 'everything' becomes statistically significant. I cannot see past the best model in your situation being ~ family + condition. It does not make sense to use PC covariates in this particular context. You even include PC3 but only show PC1 and PC2 in your bi-plot. PC3 may very well be meaningless.

Having a good study design generally makes the analyst's job a lot easier, too.

With regard to the problem of figuring out if the model is the best one or not, that will be played out as you perform downstream analyses. If the model is entirely inaccurate, then certain genes will have failed the independent filtering and Cook's distance tests, and have their P values set to NA - see Here for further details. Also, if you perform something like hierarchical clustering using your statistically significant differentially expressed genes, and they don't actually segregate your cases and controls, then something is wrong.

Generally, a RNA-seq experiment is an exploratory analysis, so, we should keep an open mind about which genes may be involved in our condition of interest. The ultimate test will then be in follow-up studies where we aim to build a predictive regression model using our genes of interest, which can usually be done with a targeted technology like NanoString, Fluidigm, or something else. In these situations, the AIC/BIC to which Devon refers become more important as we literally heavily scrutinise each and every gene in the final panel in order to gauge its predictive potential.

Apologies if my comments come cross as harsh but I also in the past struggled over which design model to choose.

ADD REPLYlink written 10 months ago by Kevin Blighe33k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1231 users visited in the last hour