Question

Asymmetric Differential Expression

6.4 years ago

gaber898 • 0

Hey guys,

I'm using RNA-seq data to test for differential expression between two conditions. Between these two conditions there should be no up-regulation of genes, only (possibly) down-regulation, because transcription is turned off. It's not clear whether degradation is present, which is what I'm trying to answer.

The problem is that when I use packages such as DESeq and edgeR I get ~30 up-regulated genes and ~20 down-regulated genes, whereas when I use DEGES/TbT I get no differentially expressed genes (up or down). After reading a paper (http://pages.pomona.edu/~jsh04747/Student%20Theses/CiaranEvans16.pdf) I realized that the DESeq and edgeR normalization techniques may not be appropriate for datasets where differential expression is highly asymmetric. The paper suggests DEGES as an alternative, but I want to be sure.

Does anyone have any thoughts on this?

Thanks, Gabriel

RNA-Seq R DESeq edgeR DESeq2 • 1.5k views

updated 6.4 years ago by Kevin Blighe 87k • written 6.4 years ago by gaber898 • 0

Did you include spike-ins or other normalization controls?

6.4 years ago by Sean Davis 26k

Nope, there were no spike-ins.

6.4 years ago by gaber898 • 0

score 0 · Answer 1 · 2017-11-13

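The mixture of up- and down-regulated genes probably reflects the way that edgeR and DESeq2 normalise data. DESeq2 uses a median-of-ratios method, i.e., the median ratio of each sample's counts to a pseudo-reference built from the per-gene geometric mean across samples; edgeR's default, TMM, is a trimmed mean of log-ratios against a reference sample (broadly explained). Both assume that most genes are not differentially expressed. In the case of DESeq2, the Wald test is then performed on the coefficients of a negative binomial model fitted to the counts, with the size factors entering as normalisation offsets. Thus, when expression shifts mostly in one direction, the normalisation tends to re-centre the data, and one can end up with a more or less equal balance between up- and down-regulated genes.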
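To make the re-centring effect concrete, here is a minimal sketch of the median-of-ratios idea in plain Python. It is not DESeq2's actual code, and the toy counts are invented for illustration: three of five genes are halved in the treated sample and two are unchanged, with no true up-regulation.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of per-sample counts.
    Genes with a zero count in any sample are skipped, because their
    geometric mean (the pseudo-reference) would be zero.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the per-gene geometric mean across samples (pseudo-reference).
    log_ref = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - ref for row, ref in zip(usable, log_ref)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical counts: columns are [control, treated].
toy_counts = [[100, 50], [200, 100], [300, 150], [400, 400], [500, 500]]
sf = size_factors(toy_counts)

for gene, (control, treated) in enumerate(toy_counts, start=1):
    lfc = math.log2((treated / sf[1]) / (control / sf[0]))
    print(f"gene {gene}: log2 fold change (treated vs control) = {lfc:+.2f}")
```

Because the halved genes are the majority, the median ratio follows them: after normalisation they show a log2 fold change of about 0, while the two genuinely unchanged genes appear roughly two-fold up-regulated. Without spike-ins or another external reference, the counts alone cannot distinguish these two scenarios.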

I don't know much about DEGES, but my [radical] suggestion would be to nevertheless normalise the data using DESeq2 or edgeR and arrive at regularised-log or log-transformed counts, respectively, which are approximately Gaussian. Then, per gene:

factorise the levels of expression into tertiles or quartiles
set the upper level as the reference level when encoding the factors
test each gene independently in a binomial logistic regression model predicting for your outcome of interest.
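The three steps above can be sketched for a single gene as follows. This is an illustration under stated assumptions, not the exact procedure used in the answer: the function names and the `pseudo` parameter are hypothetical, the outcome is assumed to be coded 0/1, and no standard errors or p-values are computed. With a single three-level factor the logistic model is saturated, so with `pseudo=0` the coefficients for the low and mid levels equal the log odds ratios against the reference level, which can be read directly from the counts table.

```python
import math

LEVELS = ("low", "mid", "high")

def tertile_labels(values):
    """Label each value 'low', 'mid' or 'high' by rank, in near-equal groups.

    Ties are broken by order of appearance (Python's sort is stable).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [None] * n
    for rank, i in enumerate(order):
        labels[i] = LEVELS[rank * 3 // n]
    return labels

def log_odds_ratios_vs_high(values, outcome, pseudo=0.5):
    """Log odds ratio of outcome == 1 for 'low' and 'mid' versus 'high'.

    pseudo is added to every cell to avoid division by zero in sparse
    tables (an assumption of this sketch). With pseudo=0 the result
    matches the maximum-likelihood logistic coefficients, but the call
    fails if any cell of the table is empty.
    """
    table = {lvl: [pseudo, pseudo] for lvl in LEVELS}  # [n_outcome_0, n_outcome_1]
    for lvl, y in zip(tertile_labels(values), outcome):
        table[lvl][y] += 1

    def log_odds(lvl):
        n0, n1 = table[lvl]
        return math.log(n1 / n0)

    ref = log_odds("high")  # upper tertile is the reference level
    return {lvl: log_odds(lvl) - ref for lvl in ("low", "mid")}

# Hypothetical transformed expression values for one gene across nine
# samples, and a made-up binary outcome for the same samples.
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
outcome = [1, 1, 0, 1, 0, 0, 0, 0, 1]
print(log_odds_ratios_vs_high(expression, outcome, pseudo=0))
```

In R, the same comparison would typically be fitted with `glm(..., family = binomial)` after using `relevel()` to make the upper tertile the reference level, which also gives standard errors and p-values per gene.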

Thus, when the model runs, in each case you are comparing the lower and middle gene expression levels to the ceiling/upper level in the context of the outcome of interest.

I have some experience of doing this, from when I encountered an issue similar to yours. When we divided our expression values into tertiles, the data made sense in relation to the large body of published literature on the subject (it didn't make sense when going by the continuous gene expression values).

Kevin