Linear regression vs DESeq2 models for DEG analysis
1 day ago
FioG • 0

I am very new to this field and looking to get some feedback. I have bulk short-read RNA sequencing results from WT and mutant cell culture samples (N=5 per genotype). I want to calculate DEGs and am wondering which is the preferred method in this scenario: LM (linear regression model) or DESeq2?

I ran code for LM following cqn normalization and RPKM filtering and controlled for batch effect and RIN, and received 0 DEGs. In contrast, I also ran DESeq2 on raw counts and controlled for batch and RIN, and obtained hundreds of DEGs. Why are the results between the two methods so different and how do you decide on which method to use?

From my reading, I believe that DESeq2 would be the best based on my sample size. Any help or guidance greatly appreciated! Thank you in advance!

Bulk seq analysis RNA • 1.2k views
7 hours ago

There are two main problems with using linear models for read-count based analyses.

Firstly, linear models assume normally distributed errors. RNA-seq data, being count based, is not even vaguely normally distributed: even after normalization, the underlying data are still discrete counts. Various tools have been made to model the log counts from RNA-seq as normal, most successfully limma-voom, but these require specialist corrections. A Poisson generalised LM would be more appropriate, but it turns out that RNA-seq data is overdispersed (the variance is larger than the mean) and doesn't fit a Poisson model that well either. DESeq2 and edgeR both model RNA-seq count data using a negative binomial generalised LM.
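
To make the overdispersion point concrete, here is a minimal sketch comparing the variance a Poisson model predicts with the variance under a negative binomial model written in the mean-dispersion form, Var = mu + alpha * mu^2 (the parameterisation DESeq2 uses). The means and the dispersion value are made up for illustration and do not come from any real dataset.

```python
def poisson_variance(mu: float) -> float:
    """Under a Poisson model the variance equals the mean."""
    return mu


def nb_variance(mu: float, alpha: float) -> float:
    """Negative binomial variance in mean-dispersion form: mu + alpha * mu^2."""
    return mu + alpha * mu ** 2


# Hypothetical gene-level means and a hypothetical dispersion of 0.1
alpha = 0.1
for mu in (10.0, 100.0, 1000.0):
    ratio = nb_variance(mu, alpha) / poisson_variance(mu)
    print(f"mean={mu:7.1f}  Poisson var={poisson_variance(mu):9.1f}  "
          f"NB var={nb_variance(mu, alpha):10.1f}  ratio={ratio:6.1f}")
```

With a fixed dispersion the gap grows with expression level, so a Poisson model understates the noise most badly for highly expressed genes.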

Secondly, estimation of variance. When you use a linear model, the variance for each gene is estimated only from that gene's own samples. As the number of replicates in an RNA-seq experiment is generally low (5 per genotype in this case), this gives a poor estimate of the variance (few degrees of freedom). The linear model has to account for that uncertainty, which leaves it with little power. Differential expression analysis tools (those for microarrays as well as for RNA-seq) use empirical Bayes to "borrow" information between genes (similar genes are expected to have similar variance), giving much more stable variance estimates and therefore more power.
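
As a rough illustration of "borrowing" information, the sketch below shrinks one gene's noisy variance toward a prior value shared across genes, using the limma-style moderated variance (d0 * s0^2 + d * s^2) / (d0 + d). The expression values, the prior variance `s0_sq` and the prior degrees of freedom `d0` are all hypothetical; in a real analysis the prior is estimated from all genes. DESeq2 shrinks dispersions toward a fitted trend rather than using this exact formula, so this shows the idea and not DESeq2's implementation.

```python
import statistics


def moderated_variance(s_sq: float, d: float, s0_sq: float, d0: float) -> float:
    """Weighted average of the gene's own variance and a prior variance,
    weighted by their degrees of freedom (limma-style moderation)."""
    return (d0 * s0_sq + d * s_sq) / (d0 + d)


# Hypothetical log-expression values for one gene in one condition (n = 5)
values = [5.1, 5.3, 4.9, 5.2, 5.0]
s_sq = statistics.variance(values)   # sample variance of this gene alone
d = len(values) - 1                  # its degrees of freedom (n - 1 = 4)

# Hypothetical prior, standing in for what would be estimated across all genes
s0_sq, d0 = 0.10, 6.0

shrunk = moderated_variance(s_sq, d, s0_sq, d0)
print(f"raw variance    = {s_sq:.4f} on {d} df")
print(f"shrunk variance = {shrunk:.4f} on {d + d0:.0f} df")
```

A gene whose five replicates happen to sit close together gets its variance pulled up toward the prior, so it is less likely to be called significant by chance, and the test gains degrees of freedom from the prior.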

So DESeq2, edgeR and limma-voom all differ from a plain linear model in the distribution they assume for the data, in how they estimate the variance, or both.

Simply put, DESeq2/edgeR/limma are valid analyses of RNA seq data, and LM on cqn normalised RPKMs are not.

1 day ago

Finding DE genes in bulk RNA-seq is what DESeq2 was made for, so why wouldn't you use it? RPKM is not an appropriate normalization method here.

Most people do not include RIN as a variable. You really have a batch in sample prep with only 10 samples?

2 hours ago
FioG • 0

swbarnes2 Thank you for your response! I am new to RNA seq analysis and was originally recommended to use the LM method following cqn and RPKM normalization by my school, but began to grow suspicious after I began reading into various methods.

In terms of the batch correction, I am using iPSC-derived astrocytes (2 genotypes) differentiated in 5 independent experiments.
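
For a layout like this, the differentiation can be treated as a blocking factor, i.e. a design along the lines of `~ batch + genotype`. The sketch below assumes each of the 5 differentiations contributes exactly one WT and one mutant sample (an assumption about the layout, with hypothetical labels) and shows what that design matrix looks like and how many residual degrees of freedom it leaves.

```python
# Hypothetical layout: 5 differentiations (batches), each with one WT and one mutant
samples = [(batch, genotype)
           for batch in range(1, 6)
           for genotype in ("WT", "mutant")]


def design_row(batch: int, genotype: str) -> list[int]:
    """Treatment-coded row: intercept, indicators for batches 2-5, mutant indicator."""
    batch_cols = [1 if batch == b else 0 for b in range(2, 6)]
    return [1] + batch_cols + [1 if genotype == "mutant" else 0]


design = [design_row(b, g) for b, g in samples]
columns = ["Intercept", "batch2", "batch3", "batch4", "batch5", "mutant"]

print(columns)
for (b, g), row in zip(samples, design):
    print(f"batch{b} {g:6s} {row}")

# Residual degrees of freedom = number of samples - number of coefficients
residual_df = len(design) - len(columns)
print("residual df:", residual_df)
```

Ten samples and six coefficients leave only 4 residual degrees of freedom per gene, and adding RIN as a covariate would use one more.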
