Question

RNASeq: Differential gene expression and qqplot

3

Entering edit mode

6.2 years ago

firestar ★ 1.6k

I ran DGE on my dataset using DESeq2 and Limma. I used the same dataset, similar settings and same GLM model (~family+condition). The number of DEGs I get are as below (based on unadjusted p-values <0.05 and BH adjusted q-values <0.05 shown separately):

Tool   num_degs_pval  num_degs_qval
DESeq2 3155           623
Limma  2853           32

I plotted qqplots using unadjusted pvalues from both DESeq2 and Limma results.

library(snpStats)
qq.chisq(-2*log(deseq2_results$pvalue),df=2,pvals=T,overdisp=T)
qq.chisq(-2*log(limma_results$pvalue),df=2,pvals=T,overdisp=T)

This is what I get:

enter image description here

Now questions;

Is the qqplot useful in RNA-Seq to estimate DGE quality?
In this example, does the better fit of Limma mean that the fewer set of DEGs returned by Limma is more reliable?
Similarly, does the inflated curve for DESeq2 mean that the huge number of DEGs returned by DESeq2 contains more false positives?

Couple of caveats: The function I used qq.chisq is for chisq tests and I am not sure if that is an issue since my Deseq2 script uses Wald test and Limma uses bayesian eBayes test. The df argument in qq.chisq is set to 2 arbitrarily. I am not sure what value I should plug in here.

limma differential-gene-expression RNA-Seq deseq2 • 5.2k views

ADD COMMENT • link updated 10 days ago by Ram 43k • written 6.2 years ago by firestar ★ 1.6k

score 0 · Answer 1 · 2018-02-25

Interesting findings but one would need to know your sample size before making any further comments. Wait, is this that family study where we were previously debating over the inclusion/exclusion of PCA eigenvectors as covariates? In that case, I have a rough idea of the sample size.

While the limma QQ plot looks a lot more convincing, I am not confident that the same QQ Chi-squared measure of goodness of fit can be applied equally to both sets of P values. DESeq2 derives P values by applying the Wald test to its negative binomial model of raw/normalised counts (as you implied), while Limma goes by quantile normalisation, which follows a binomial.

You may wish to retry this by generating the QQ plot differently for each based on the underlying distribution. Apparently you can do this with this R function: https://www.rdocumentation.org/packages/qualityTools/versions/1.55/topics/qqPlot

y, character string specifying the distribution of x. The function qqPlot will support the following character strings for y:

 - “beta”
 - “cauchy”
 - “chi-squared”
 - “exponential”
 - “f”
 - “gamma”
 - “geometric”
 - “log-normal”
 - “lognormal”
 - “logistic”
 - “negative binomial”
 - “normal”
 - “Poisson”
 - “t”
 - “weibull”

You may also be interested in the work of these guys (see Figure 3): A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments