RNA-Seq: using GLMM to detect differentially expressed genes
9.3 years ago
alesssia ▴ 580

Hi All.

I have a set of raw count data and I am interested in using (G)LMMs to detect differentially expressed genes. However, I have a number of questions about how to set up the best (or at least a correct) pipeline for this task.

  1. I am aware that using linear models (instead of well-known tools, such as DESeq2) will give me less power -- unless I have a large set of samples. I know that this is a dumb question, but how many samples count as "large"?
  2. To get meaningful results I believe that a filtering step and a normalisation step are needed beforehand. Is this assumption correct? What is a reliable approach for filtering/normalising my data?
  3. Might it be useful to work with transformed versions of the count data?
  4. I usually use LMMs (lme4 R package) when looking for differentially expressed genes in microarray data -- I work with multiplex family data and I want to correct for the samples' relatedness. With RNA-Seq counts, however, is it better to use zero-inflated Poisson models? Or can I assume that there is only an overdispersion problem? Can the answer to this question be data-dependent?

Thanks in advance for your help,


GLMM differential-expression RNA-Seq
9.3 years ago
  1. I suspect that Gordon Smyth has given a recommendation on this somewhere, though I haven't ever come across it. My gut says you should have a hundred samples or so, but without empirical data that should be taken with a large grain of salt. I should note that you'll always have lower power without sharing information across genes; it's just a question of how much you've lost. Of course, the more complicated the model, the more samples you'd really need to have.
  2. Normalization yes, filtering no. Well, filtering beyond removing rows with only 0 counts (which would otherwise break the (G)LMM function) isn't necessary. You'll need to perform a library-size normalization. The most straightforward way to do this is to first run DESeq or DESeq2 and get the resulting sizeFactors(). These can then be supplied to your GLMM (for a log-link count model, usually as an offset of log(size factor) rather than as a prior weight). You can perform independent filtering after the fact, once you have raw p-values. The genefilter package is convenient for this.
  3. Possibly. If you run everything through limma::voom() first, then you'd have data in a nice format for a more traditional LMM.
  4. I've not seen much gain from zero-inflated models over "simple" negative binomial models. There are a couple of papers out there comparing negative binomial, zero-inflated negative binomial, and zero-inflated Poisson models if you want some hard numbers on this.
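
To make the library-size normalization in point 2 concrete, here is a minimal sketch of the median-of-ratios calculation that DESeq/DESeq2 size factors are based on. It is written in plain Python purely to show the arithmetic; in practice you would take the values from DESeq2 itself. The function name `size_factors` and the toy count matrix are made up for illustration, and turning the factors into a log-scale offset (last lines) is an assumption about how you would feed them to a log-link count model, not something the packages do for you.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative helper, not a library API).

    counts: list of genes, each a list of raw counts, one per sample.
    Genes with a zero in any sample are skipped, because their geometric
    mean across samples is zero and the ratio is undefined.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if any(c <= 0 for c in gene):
            continue
        # log of the gene's geometric mean across samples
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    # per-sample median of the log ratios, back on the original scale
    return [math.exp(median(r)) for r in log_ratios]


# Toy data: 4 genes x 3 samples; sample 2 is sequenced twice as deeply
# as sample 1, and sample 3 twice as deeply as sample 2.
counts = [
    [10, 20, 40],
    [100, 200, 400],
    [5, 10, 20],
    [0, 3, 7],  # contains a zero, so it is ignored by the estimator
]

sf = size_factors(counts)
# Assumption: the factors enter a log-link count model as log offsets.
offsets = [math.log(s) for s in sf]
print(sf)
print(offsets)
```

With this toy matrix the three samples differ only in depth, so the size factors come out close to 0.5, 1 and 2, and the corresponding offsets are their natural logs.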

Thank you very much, Devon: your answers are very helpful!

