Question: edgeR artifacts on p-value and/or fold change?
0
gravatar for moxu
3.7 years ago by
moxu460
moxu460 wrote:

I use RSEM to align and quantify RNA-seqs and then use edgeR to do the differential gene expression analysis with or without ERCC normalization.

The problem is that I almost always see pairs of "wings" at the bottom of the -log(P-value) ~ log(FC) plot. For instance,

-log(P-value) vs log(FC)

At the bottom of the plot, you can see a set of genes on the left and another somewhat symmetric set of genes on the right are obviously separated from the other genes. I have seen such a pattern over and over again, with or without ERCC normalization before feeding the expression data to edgeR.

Any explanations?

edger rsem rna-seq • 1.9k views
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by moxu460

Have a look at the counts for those genes, presumably there's a high degree of variation.

ADD REPLYlink written 3.7 years ago by Devon Ryan95k

Maybe you can use this script to find out what those genes are.

And this volcano plot instance in this page have a so-so similar "wings".

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Farbod3.3k

I've used some other methods to find out the genes. Thanks for your input.

ADD REPLYlink written 3.7 years ago by moxu460

I checked the expression file, and you are right -- they are the genes with 0 counts in the control samples, and low counts in the treatment samples.

Since edgeR uses negative binomial distribution to calculate the p-values, the p-values should be OK for theses genes. The problem now is the fold change -- they should be infinite before adjustment. So what shall we do with the genes on the wings? Remove them by setting expression or fold change threshold, or just leave them there?

ADD REPLYlink written 3.7 years ago by moxu460

Did you filter out constantly lowly expressed genes in all samples by cpm or counts before downstream analysis ? Usually these genes are already filtered out because they do not provide enough statistical evidence for judgement.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by GZ1995360

No I didn't filter by counts for the reason stated above: negative binomial distribution should take care of whatever count values to give the correct p-values, so why bother? I am just not sure how the fold change is computed for a gene with 0 counts in one or more groups.

ADD REPLYlink written 3.7 years ago by moxu460

It matters for FDR ajustment and estimate mean-variance relationship (dispersion) for NB model as suggested by edgeR author here. edgeR computes logFC based on fitting NB generalized linear model with empirical bayes based shrinkage, not simply using raw counts, so it shouldn't be infinity.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by GZ1995360

OK, now I see. It sounds reasonable to filter out low count genes. What would be a good method to filter?

ADD REPLYlink written 3.7 years ago by moxu460
1

Read Section 2.6 of edgeR userguide.

ADD REPLYlink written 3.7 years ago by GZ1995360

Hi,

For genes with zero counts I have found similar situation in Trinity group, please search for:

>     It reflects some of the difficulties of trying to map reads back to partial transcript contigs.
>     
>     Here's some info on Kallisto and rna-seq Sleuth:
>     https://liorpachter.wordpress.com/2015/08/17/a-sleuth-for-rna-seq/
>     
>     If you're not getting read counts for some of your key targets, then that's clearly a problem.  You can try running your PE reads as
> single reads with the existing RSEM pipeline, and you might recoup the
> reads that correspond to the transcript that was reconstructed.
ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Farbod3.3k
1
gravatar for moxu
3.7 years ago by
moxu460
moxu460 wrote:

I think the "wings" are caused by heuristic fold change computation for genes with 0 count in either the control or the treatment group -- fold change should be infinity, right? But apparently edgeR used a heuristic method (e.g. use global adjusted count values or add some background noise to each 0 count) to avoid the divided by 0 problem. That made the fold changes of such genes to deviate from those of other genes, and thus the "wings" separation.

ADD COMMENTlink written 3.7 years ago by moxu460
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1395 users visited in the last hour