Question: RNA-seq gene expression analysis using 0-counts
4.2 years ago by
United States
johntlovell10 wrote:

Hi Folks.

I am conducting a differential gene expression analysis using RNA-seq. My experimental design is blocked and repeated, so I need to fit mixed-effects models and cannot make use of standard DGE packages such as DESeq, edgeR, etc. This is not a problem when the count data can be modeled with a negative binomial (or Poisson, etc.) distribution; however, for many of the genes, I have highly zero-inflated or binary-distributed count data. For example, for many of the genes, there are 0 counts for one parent and >5 counts for the other parent. Please advise on the best way to analyze genes that behave this way.

Thanks, John

modified 4.1 years ago by Biostar • written 4.2 years ago by johntlovell10
  1. Are you sure you actually need to use a mixed-effect model? Given that DESeq2/edgeR/etc. use shrinkage, a mixed-effect model is unlikely to benefit you.
  2. Have a look at limma's duplicateCorrelation() function.
written 4.2 years ago by Devon Ryan • 88k

Thanks Devon. This comment has come up in many of the posts that I have read. 

For me, when an experiment is designed with blocking and replication within the individual, the individual and experimental blocking must be analyzed as random effects. This is a pretty standard quantitative genetics design. Furthermore, we have a ton of replication within the experimental factors we are testing, so I am not convinced that shrinkage is a particularly good method to estimate within-group variances.

Anyways, even if I did use fixed effects, I am still unsure about the best way to analyze these highly 0-inflated and binary gene expression phenotypes. Thanks again.

written 4.2 years ago by johntlovell10

Certainly if you were to compare a straight GLM and a GLMM on your dataset then the GLMM would work better...but of course a GLMM is just doing shrinkage in a different way than DESeq2 et al., which aren't straight GLMs.

Regarding the zeroes, it depends a bit on exactly what you mean by zero-inflated and where the problem is. If the case is that you have absolutely 0 expression in all but one sample, then that can be problematic. I suppose how to deal with that depends on whether you find those cases biologically interesting. For most people they wouldn't be, but I can think of counterexamples (e.g., single-cell sequencing).

written 4.2 years ago by Devon Ryan • 88k
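
The two situations discussed in this thread (0 counts for one parent and >5 for the other, versus 0 expression in all but one sample) can be told apart before any model is fitted. Below is a minimal Python sketch of such a pre-screen, not a method proposed by anyone in the thread. The function name `classify_zero_pattern`, the category labels, and the group names are all hypothetical; the threshold of 5 simply mirrors the ">5 counts" in the question.

```python
from typing import Dict, List


def classify_zero_pattern(counts_by_group: Dict[str, List[int]],
                          min_count: int = 5) -> str:
    """Label one gene by where its zero counts fall.

    counts_by_group maps a group label (e.g. a parent) to that gene's raw
    counts across the replicates of the group. The labels returned here are
    illustrative only (hypothetical names, not from any package).
    """
    all_counts = [c for counts in counts_by_group.values() for c in counts]
    nonzero_samples = sum(1 for c in all_counts if c > 0)

    if nonzero_samples == 0:
        return "all_zero"

    # Expression in exactly one sample: the case described as problematic.
    if nonzero_samples == 1 and len(all_counts) > 2:
        return "single_sample"

    # Groups in which every replicate is 0, and groups in which every
    # replicate exceeds the threshold.
    silent = [g for g, counts in counts_by_group.items()
              if all(c == 0 for c in counts)]
    expressed = [g for g, counts in counts_by_group.items()
                 if all(c > min_count for c in counts)]

    # Every group is either fully silent or fully expressed: the
    # "0 for one parent, >5 for the other" pattern from the question.
    if silent and expressed and len(silent) + len(expressed) == len(counts_by_group):
        return "presence_absence"

    return "mixed"


if __name__ == "__main__":
    # Made-up counts purely for illustration.
    genes = {
        "gene_1": {"parent_A": [0, 0, 0], "parent_B": [12, 8, 9]},
        "gene_2": {"parent_A": [0, 0, 0], "parent_B": [0, 0, 40]},
        "gene_3": {"parent_A": [3, 0, 7], "parent_B": [5, 9, 0]},
    }
    for name, counts in genes.items():
        print(name, classify_zero_pattern(counts))
```

Genes labelled `presence_absence` could then be set aside and treated as a binary phenotype, while `single_sample` genes are the ones whose usefulness depends on whether they are biologically interesting. How each category is modelled afterwards is a separate decision that this sketch does not make.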
