Question

Linear regression vs DESeq2 models for DEG analysis

0

7 weeks ago

FioG • 0

I am very new to this field and am looking for some feedback. I have bulk short-read RNA sequencing results from cell culture samples, WT and mutant, with N=5 per genotype. I want to call DEGs and am wondering which method is preferred in this scenario: LM (linear regression model) or DESeq2?

I fitted an LM after cqn normalization and RPKM filtering, controlling for batch effect and RIN, and got 0 DEGs. In contrast, I also ran DESeq2 on raw counts, again controlling for batch and RIN, and obtained hundreds of DEGs. Why are the results of the two methods so different, and how do you decide which method to use?

From my reading, I believe that DESeq2 would be the best based on my sample size. Any help or guidance greatly appreciated! Thank you in advance!

Bulk seq analysis RNA • 4.5k views

ADD COMMENT • link updated 7 weeks ago by ATpoint 90k • written 7 weeks ago by FioG • 0

2

7 weeks ago

swbarnes2 15k

Calling DE genes in bulk RNA-seq is what DESeq2 was made for, so why wouldn't you use it? RPKM is not an appropriate normalization method here.

Most people do not include RIN as a variable. You really have a batch in sample prep with only 10 samples?

ADD COMMENT • link 7 weeks ago by swbarnes2 15k

0

swbarnes2 Thank you for your response! I am new to RNA-seq analysis. My school originally recommended the LM method following cqn and RPKM normalization, but I grew suspicious once I started reading about the various methods.

In terms of the batch correction, I am using iPSC-derived astrocytes (2 genotypes) differentiated in 5 independent experiments.

ADD REPLY • link 7 weeks ago by FioG • 0

0

It is not clear to me that you can meaningfully correct for 5 different batches with two samples apiece.

ADD REPLY • link 7 weeks ago by swbarnes2 15k

score 12 · Accepted Answer · 2025-09-28

There are two main problems with using linear models for read-count based analyses.

Firstly, linear models assume a normal distribution. RNA-seq data is count based and not even vaguely normally distributed; even after normalization, the underlying data are still discrete counts. Various tools have been made to model the log counts from RNA-seq as normal, most successfully limma-voom, but these require specialist corrections. A Poisson generalised LM would be more appropriate, but it turns out that RNA-seq data is overdispersed and doesn't really fit a Poisson model that well either. DESeq2 and edgeR both model RNA-seq count data using a negative binomial generalised LM.
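To make the overdispersion point concrete, here is a minimal simulation sketch in plain Python (standard library only). It draws counts from a Poisson model and from a negative binomial model with the same mean, then compares the variance-to-mean ratio. The mean of 50 and dispersion of 0.2 are illustrative values I picked, not estimates from any real dataset, and the helper names are mine.

```python
import math
import random
import statistics


def poisson_sample(rng, lam):
    """Draw one Poisson(lam) count with Knuth's method (adequate for modest lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def negbin_sample(rng, mean, dispersion):
    """Draw one negative binomial count as a gamma-Poisson mixture.

    The resulting variance is mean + dispersion * mean**2.
    """
    shape = 1.0 / dispersion
    scale = mean * dispersion
    return poisson_sample(rng, rng.gammavariate(shape, scale))


def variance_to_mean(counts):
    """Ratio of sample variance to sample mean; close to 1 for Poisson data."""
    return statistics.variance(counts) / statistics.mean(counts)


rng = random.Random(1)
mean, dispersion, n = 50.0, 0.2, 5000  # illustrative values, not from real data

poisson_counts = [poisson_sample(rng, mean) for _ in range(n)]
negbin_counts = [negbin_sample(rng, mean, dispersion) for _ in range(n)]

print("Poisson variance/mean:", round(variance_to_mean(poisson_counts), 2))
print("Negative binomial variance/mean:", round(variance_to_mean(negbin_counts), 2))
```

The Poisson ratio should sit near 1, while the negative binomial ratio should be far larger (in theory 1 + dispersion * mean, so about 11 with these values). Biological replicates tend to look like the second case, which is why a Poisson model understates the variance.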

Secondly, estimation of variance. When you use a linear model, the variance for each gene is estimated from that gene's own replicates within each condition. As the number of replicates in an RNA-seq experiment is generally low (5 per genotype in this case), this leads to a poor estimate of the variance (a small number of degrees of freedom). The linear model must account for that uncertainty, which leaves it without much power. Differential expression analysis tools (those for microarrays as well as for RNA-seq) use empirical Bayes to "borrow" information between genes (similar genes are expected to have similar variance), giving much more stable estimates of variance and therefore more power.
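As a rough illustration of the "borrowing" idea, here is a toy sketch in plain Python. It simulates 5 replicates for each of 2000 genes, then shrinks each gene's sample variance toward the average variance across genes. The weighted-average formula has the same form as the moderated variance in limma, but everything else is simplified: the prior degrees of freedom are picked by hand and the prior variance is just the mean of the raw variances, whereas the real tools estimate both from the data (and DESeq2 shrinks dispersions toward a fitted trend instead). All names and numbers here are my own assumptions.

```python
import random
import statistics


def squeeze_variance(sample_var, df, prior_var, prior_df):
    """Weighted average of a gene's own variance and a prior variance shared across genes."""
    return (prior_df * prior_var + df * sample_var) / (prior_df + df)


def mean_squared_error(estimates, truth):
    """Average squared distance between estimated and true variances."""
    return statistics.mean([(e - t) ** 2 for e, t in zip(estimates, truth)])


rng = random.Random(2)
n_genes, n_reps = 2000, 5
df = n_reps - 1
prior_df = 10.0  # hand-picked for illustration; real tools estimate this

true_vars, raw_vars = [], []
for _ in range(n_genes):
    true_var = rng.lognormvariate(0.0, 0.3)  # similar, but not identical, variances
    values = [rng.gauss(0.0, true_var ** 0.5) for _ in range(n_reps)]
    true_vars.append(true_var)
    raw_vars.append(statistics.variance(values))

prior_var = statistics.mean(raw_vars)  # crude stand-in for a fitted prior
shrunk_vars = [squeeze_variance(v, df, prior_var, prior_df) for v in raw_vars]

raw_mse = mean_squared_error(raw_vars, true_vars)
shrunk_mse = mean_squared_error(shrunk_vars, true_vars)
print("per-gene variance MSE:", round(raw_mse, 3))
print("shrunken variance MSE:", round(shrunk_mse, 3))
```

With only 4 degrees of freedom per gene the raw variances are very noisy, so the shrunken estimates should land much closer to the truth on average. That extra stability is where the gain in power comes from.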

So DESeq2, edgeR and limma-voom each model the data, and estimate its variance, differently from an ordinary linear model.

Simply put, DESeq2/edgeR/limma are valid analyses of RNA-seq data, and an LM on cqn-normalised RPKMs is not.