Question

Should "dispersion" in edgeR be calculated for entire dataset or just conditions being analyzed?

3.1 years ago

O.rka ▴ 710

I'm creating a wrapper around edgeR's exactTest. I noticed (and understand why) that the results differ depending on whether I calculate dispersion using the entire dataset or using only the two conditions I'm comparing. My question is whether one option is preferred. My intuition tells me it's better to calculate dispersion using the entire dataset, even if I'm only going to compare conditions one pair at a time.

For simplicity, I'm using the iris dataset as a (terrible) stand-in example. There are 150 samples and 4 "genes" (['sepal_length', 'sepal_width', 'petal_length', 'petal_width']) with 3 "conditions" (['setosa', 'versicolor', 'virginica']). I'm treating setosa as my "reference" condition, so every comparison is relative to it.

If I calculate dispersion for each pair individually (i.e., setosa vs. versicolor, setosa vs. virginica), I get the following output from exactTest:

[Image not preserved: exactTest output with dispersion calculated per pair]

If I calculate dispersion for the entire dataset first then I get this:

[Image not preserved: exactTest output with dispersion calculated on the entire dataset]
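The two outputs differ because different samples feed the dispersion estimate. Here is a minimal pure-Python sketch of that effect for a single gene. It is not edgeR's estimator (edgeR's are likelihood-based and more involved). It assumes a crude method-of-moments estimate based on the negative binomial relation var = mu + phi * mu^2, and the function name `mom_dispersion` and the toy counts are made up for illustration.

```python
from statistics import mean, variance


def mom_dispersion(groups):
    """Crude method-of-moments dispersion for one gene (illustration only).

    Under a negative binomial model, var = mu + phi * mu**2, so within each
    group phi is roughly (var - mu) / mu**2. Per-group estimates are combined
    weighted by degrees of freedom (n - 1), and the result is floored at 0.
    """
    weighted_sum = 0.0
    total_df = 0
    for counts in groups.values():
        mu = mean(counts)
        var = variance(counts)  # sample variance (n - 1 denominator)
        df = len(counts) - 1
        weighted_sum += df * (var - mu) / mu**2
        total_df += df
    return max(weighted_sum / total_df, 0.0)


# Made-up counts for a single "gene" in the three conditions.
counts = {
    "setosa": [8, 14, 5, 13],
    "versicolor": [20, 40, 10, 30],
    "virginica": [45, 55, 50, 50],
}

# Dispersion estimated from every condition in the dataset.
all_conditions = mom_dispersion(counts)

# Dispersion estimated from only the pair being compared.
pair_only = mom_dispersion({k: counts[k] for k in ("setosa", "versicolor")})

print(f"all conditions: {all_conditions:.4f}")
print(f"pair only:      {pair_only:.4f}")
```

With these toy numbers the full-dataset estimate is about 0.098 and the pair-only estimate about 0.153. The virginica replicates are tightly clustered, so they pull the pooled estimate down even though virginica is not part of the setosa vs. versicolor comparison.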

What is preferred by the bioinformatics community?

What are the pros and cons of using one way over another?

differential-gene-expression edgeR RNA-Seq • 631 views

updated 11 weeks ago by Ram 43k • written 3.1 years ago by O.rka ▴ 710