I have a set of raw count data and I would like to use (generalised) linear mixed models, (G)LMMs, to detect differentially expressed genes. However, I have several questions about how to set up a good (correct?) pipeline for this task.
- I am aware that using linear models (instead of well-known tools such as DESeq2) will give me less power, unless I have a large set of samples. This may be a naive question, but how many samples count as "large"?
- To get meaningful results, I believe that filtering and normalisation steps are needed beforehand. Is this assumption correct? What is a reliable approach to filtering and normalising my data?
- Could it be useful to work with transformed versions of the count data?
- I usually use LMMs (lme4 R package) when looking for differentially expressed genes in microarray data, because I work with multiplex family data and want to correct for relatedness among samples. With RNA-Seq counts, however, is it better to use zero-inflated Poisson models? Or can I assume that overdispersion is the only problem? Does the answer depend on the data?
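
To make the filtering, normalisation and transformation questions concrete, this is roughly the preprocessing I have in mind, as a minimal sketch in plain Python. It is not a recommendation and is not tied to lme4 or DESeq2. The function names, the threshold of CPM >= 1 in at least 2 samples, the pseudo-count of 1 and the toy counts are all assumptions made up for illustration.

```python
import math


def library_sizes(counts):
    """Total count per sample (column sums over all genes)."""
    n_samples = len(next(iter(counts.values())))
    return [sum(row[j] for row in counts.values()) for j in range(n_samples)]


def cpm(counts, lib_sizes):
    """Counts per million: scale each sample by its library size."""
    return {
        gene: [1e6 * c / n for c, n in zip(row, lib_sizes)]
        for gene, row in counts.items()
    }


def filter_genes(counts, lib_sizes, min_cpm=1.0, min_samples=2):
    """Keep genes whose CPM reaches min_cpm in at least min_samples samples.

    Both thresholds are arbitrary illustrative defaults.
    """
    scaled = cpm(counts, lib_sizes)
    return {
        gene: row
        for gene, row in counts.items()
        if sum(v >= min_cpm for v in scaled[gene]) >= min_samples
    }


def log2_cpm(counts, lib_sizes, pseudo=1.0):
    """log2(CPM + pseudo); the pseudo-count avoids log2(0)."""
    return {
        gene: [math.log2(v + pseudo) for v in row]
        for gene, row in cpm(counts, lib_sizes).items()
    }


# Toy data (hypothetical): genes as keys, one raw count per sample.
toy_counts = {
    "geneA": [10, 20, 30, 40],
    "geneB": [0, 0, 1, 0],
    "geneC": [990, 980, 969, 960],
}

# Library sizes are computed on all genes, before any filtering.
sizes = library_sizes(toy_counts)
kept = filter_genes(toy_counts, sizes)
transformed = log2_cpm(kept, sizes)
```

The transformed values would then be the response in a per-gene LMM, whereas the (G)LMM route would keep the raw counts and use the log library sizes as an offset. Part of my question is which of the two is more sensible.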
Thanks in advance for your help.