Question: Resources to understand regression models for count data
3
3.0 years ago by
James Ashmore2.6k
UK/Edinburgh/MRC Centre for Regenerative Medicine
James Ashmore2.6k wrote:

I'd like to better understand the mathematical background behind the many R packages used for RNA-seq analysis, such as DESeq2 and edgeR, but my background in statistics is not extensive and I am struggling to understand all of the concepts introduced. So far I have read the main papers which introduce these packages, and also a handful of papers they reference. I've also went through the Statistics for Genomic Data Science resource for added information about generalised linear models. I understand that a regression model is fitted for each feature, and that usually a negative-binomial is used to account for dispersion. Also, because the number of replicates is usually low, the dispersion parameter is estimated from information shared between all features. Fold change is then measured using the predicted counts for each condition and significance inferred by a likelihood ratio test. Whilst in theory I believe I understand this (please correct me if I'm wrong), I'd really like to be able to implement generally the main steps in R to get a more hands-on appreciation of what's happening. Ideally, I'm looking for tutorials which aim to analyse RNA-seq or similar genomic count data in R using regression models which don't gloss over the details by simply calling the functions from these packages. Or alternatively, recommendations for biostatistics books which have a section on regression models for count data with examples in R.

written 3.0 years ago by James Ashmore2.6k

Your question is a great one, and I look forward to other answers from those who know of where people hide these kind of resources.

Unfortunately, you're going to need to put in quite a bit of work to get where you want to be. There aren't tutorials for implementing basics with respect to complete data analysis with these techniques, so you'll have to do some of the following:

Check out this other answer on DESeq

If you come from a programming background, check out the source code for DESeq2 and see how they do it.

For implementation of the equations behind statistical programming, take a look at people's code for Markov Chain Monte Carlo in R, since these are often implemented explicitly to interface with programs like Stan, or find examples of their implementations in blog posts: MCMC, maximum likelihood

Overall, the best descriptions of processes are going to be in the papers themselves, and going from literature to implementation is a matter of practicing the translation of formulas to code.

Other book resources for good examples of Bayesian R implementations include Doing Bayesian Data Analysis.