Question

How to define a non-differentially expressed gene?

7.1 years ago

xieshaojun0621 ▴ 210

For example, I have C1, T1, C2 and T2 (each in triplicate). I ran DESeq2 to do pairwise comparisons.

Now I'd like to extract the list of genes that are differentially expressed between C1 and T1 but not differentially expressed between C2 and T2.

For the DEGs, I can filter on adjusted p-value and fold change.

How should I define the non-differentially expressed genes between C2 and T2? For example, adjusted p-value > 0.1? Please let me know if you have any suggestions. Thanks.

rna-seq DEG non-DEG DESeq2 edgeR • 4.0k views

ADD COMMENT • link updated 7.0 years ago by Michael Love ★ 2.6k • written 7.1 years ago by xieshaojun0621 ▴ 210

There are some similar posts on the Bioconductor mailing list. I'd probably do a different analysis though, and set up the following contrasts:

A = (T1 - C1)
B = (T2 - C2) - (T1 - C1)

Pull out all genes that are significant wrt A into set G_A

For just the genes in G_A, pull out any that are significant wrt B into set G_B.

Then filter the genes in G_B to keep only those whose log fold change has opposite sign for contrast A and contrast B: that is, those that are (A-positive, B-negative) or (A-negative, B-positive). Really I'd want to do a one-sided test, that contrast B is less than 0 whenever A is greater than 0 (and vice versa for A<0, B>0).
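
The selection steps above can be sketched in Python once the per-gene results for both contrasts have been exported. This is only an illustration: the function name, the dictionary layout (gene -> (log2 fold change, adjusted p-value)), the example values and the 0.05 cutoff are all assumptions, and the one-sided test mentioned above is not implemented here.

```python
def select_opposite_sign(res_a, res_b, alpha=0.05):
    """Return genes significant for A and B whose log fold changes differ in sign.

    res_a, res_b: dict mapping gene -> (log2_fold_change, adjusted_p_value).
    This layout is hypothetical, e.g. exported from each contrast's results table.
    """
    # G_A: genes significant for contrast A = (T1 - C1)
    g_a = {g for g, (lfc, padj) in res_a.items() if padj < alpha}
    # G_B: genes in G_A that are also significant for contrast B
    g_b = {g for g in g_a if g in res_b and res_b[g][1] < alpha}
    # Keep genes where A and B point in opposite directions
    return sorted(g for g in g_b if res_a[g][0] * res_b[g][0] < 0)


# Made-up example values
res_a = {"gene1": (2.0, 0.001), "gene2": (1.5, 0.01), "gene3": (0.1, 0.8)}
res_b = {"gene1": (-1.8, 0.002), "gene2": (0.2, 0.7), "gene3": (-0.1, 0.9)}
print(select_opposite_sign(res_a, res_b))  # ['gene1']
```

Genes with a missing (NaN) adjusted p-value fail the `padj < alpha` comparison and are dropped, which is probably the behaviour you want here.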

ADD REPLY • link 7.1 years ago by russhh 5.7k

Let's imagine that T1 is a drug treatment in one kind of vehicle (solvent), T2 a drug treatment in a different vehicle, and C1 and C2 the use of each vehicle only (no drug dissolved).

To rephrase your question: are you saying that you have a design matrix like the following (triplicates not indicated), and that you are looking for an effect that is seen specifically in treatment 1 after correcting for differences induced by the vehicle?

    Veh Drug 
C1   A   - 
T1   A   1 
C2   B   - 
T2   B   2

(In that case, I do not have the solution, but I would be interested to read it from somebody else!)

ADD REPLY • link 7.1 years ago by Charles Plessy ★ 2.9k

score 4 · Answer 1 · 2017-03-06

7.1 years ago

WouterDeCoster 47k

You cannot select non-differentially expressed genes, because absence of evidence is not evidence of absence of differential expression. You can only select those genes for which there is no statistical evidence of differential expression (which doesn't mean they aren't differentially expressed). Those genes would simply be all genes in the experiment (including those present before filtering) minus those that turned out to be significant.

ADD COMMENT • link 7.1 years ago by WouterDeCoster 47k

Thanks for your response. But do you have any suggestions for how to achieve my goal?

ADD REPLY • link 7.1 years ago by xieshaojun0621 ▴ 210

score 3 · Answer 2 · 2017-03-07

As WouterDeCoster said, you can't simply take the genes that fail to meet your threshold for differential expression. Those genes just happen to have insufficient evidence to call them differentially expressed. To define a set of genes that don't change between conditions you'll want to control the Type II error rate. You could carry out a power analysis to determine a p-value threshold at which you expect to be able to identify at least x% of genes that change enough to meet your fold change threshold. Conceptually that is still a bit problematic, I think, but it does at least make it more explicit what exactly you mean by "non-differentially expressed" and does give you a bit of a handle on the error rate.
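
As a rough illustration of the power-analysis idea, the sketch below finds the p-value cutoff at which a two-sided Wald test would detect a true log2 fold change equal to your threshold with a given probability. It assumes a normal approximation and a single, known standard error per gene, which is a strong simplification; the function names and the example numbers are made up, and multiple-testing adjustment is ignored.

```python
from statistics import NormalDist


def wald_power(effect, se, alpha):
    """Power of a two-sided Wald test when the true log2 fold change is `effect`."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(effect) / se
    # Probability that the test statistic lands in either rejection region
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


def alpha_for_power(effect, se, target_power, tol=1e-6):
    """Smallest p-value cutoff at which the test reaches `target_power` (bisection)."""
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if wald_power(effect, se, mid) >= target_power:
            hi = mid  # cutoff is loose enough, try a stricter one
        else:
            lo = mid  # cutoff too strict to reach the target power
    return hi


# Hypothetical gene: fold change threshold of 1 (log2), standard error 0.4
cutoff = alpha_for_power(effect=1.0, se=0.4, target_power=0.8)
print(round(cutoff, 3))
```

Genes with a p-value above such a cutoff are then "non-DE" in the explicit sense that a change as large as the threshold would have been detected about 80% of the time.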

Edit based on discussion below

An alternative would be to forgo the usual testing approach entirely and focus on the estimation of the fold change instead. Instead of testing whether the log fold change is 0 for each gene in each of the pairs, estimate the fold change for each pair, including a confidence interval. You could then determine sets of differentially expressed genes by comparing the estimate for each gene to your fold change threshold. Similarly, a set of genes that don't change can be derived by comparing the estimate to 1 (or 0 for log fold change). A reasonable scheme might be to choose all genes where the confidence interval contains 1 but not your chosen threshold.
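
A minimal sketch of this estimation-based scheme, assuming you have a log2 fold change estimate and its standard error for each gene (for example the log2FoldChange and lfcSE columns that DESeq2 reports) and that a normal-approximation confidence interval is acceptable. The function name, the 95% level, the threshold of 1 on the log2 scale, and the rule that "changed" requires the whole interval to lie beyond the threshold are illustrative choices.

```python
from statistics import NormalDist


def classify_by_interval(lfc, se, threshold=1.0, level=0.95):
    """Classify one gene from its log2 fold change estimate and standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    lower, upper = lfc - z * se, lfc + z * se
    if lower > threshold or upper < -threshold:
        return "changed"       # whole interval lies beyond the threshold
    if lower <= 0.0 <= upper and -threshold < lower and upper < threshold:
        return "unchanged"     # contains 0 and excludes +/- threshold
    return "undetermined"      # interval too wide or effect too small to call


print(classify_by_interval(2.5, 0.4))  # changed
print(classify_by_interval(0.1, 0.2))  # unchanged
print(classify_by_interval(0.6, 0.5))  # undetermined
```

The third category matters: genes with wide intervals are neither shown to change nor shown to be stable, and should not be lumped in with the "unchanged" set.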

score 2 · Answer 3 · 2017-03-08

7.1 years ago

unksci ▴ 180

Rather than using significance (or a given arbitrary threshold for it), you might want to use effect size (after ensuring that you only plot genes that are well within the detection range of your method). This might be particularly the case if you need to justify "biological relevance" or find genes for feasible follow-up experiments.

If C1 and C2 are biologically different (e.g. different cell lines), a 2D scatter plot might be a better argument than p-values.

If you plot log(T1/C1) on X and log(T2/C2) on Y, you will immediately see whether most genes share a T vs C trend, and whether there are subsets of genes that clearly stand out from the rest and populate the X or Y axis. If the general trend does not follow the diagonal, you might conclude that 1 and 2 affect the same genes, but that in 1 or 2 the response is stronger / faster. If you do not see evidence for either, differences in the p-values might result from chance, or from slight variation in the experimental settings.

A scatter plot will also tell you whether the changes in T1/C1 and T2/C2 affect induced and repressed genes similarly.

(Also note that, if you had an arbitrarily precise measurement method, all genes would appear differentially expressed.)

ADD COMMENT • link 7.1 years ago by unksci ▴ 180

Note that my suggestion could also be extended by only considering genes that pass some significance threshold within one of the comparisons.

For one example where this approach led to correct prioritization (which also passed the less selective statistical tests, as well as follow-up work / verification), see Figure 1 of: http://www.nature.com/nature/journal/v523/n7558/abs/nature14429.html

(Side note: if the experiment is performed well, genes that are not significantly differentially expressed should all occupy the same region at the center of the scatter plot, so they do not obstruct an analysis that focuses on genes that are clearly different from the bulk trend.)

(Side note 2: my comment on precise measurements served to say that the specific scope of non-differentially expressed genes primarily depends on experimental parameters (and some intrinsic properties of genes, such as closeness to the LE/HE boundary), rather than on them having identical expression.)

ADD REPLY • link 7.1 years ago by unksci ▴ 180

I don't believe it makes sense to create plots of 20000 genes to find those which don't appear differentially expressed. I'd argue that human judgement is not a good measure in this case and that statistical analysis is more appropriate.

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

Visually inspecting all genes doesn't seem like a good idea. A statistical analysis is certainly called for. However, focusing on estimation (of the fold change) rather than testing may indeed be helpful.

"Also note that, if you had arbitrarily precise measurement method, all genes would appear differentially expressed."

Note that the OP includes a fold change threshold to identify differentially expressed genes, so this assertion isn't true.

ADD REPLY • link 7.1 years ago by Peter Humburg ▴ 50

score 2 · Answer 4 · 2017-04-07

We discuss this in the DESeq2 paper and software vignette. You can use altHypothesis="lessAbs", in which the alternative hypothesis is that |beta| < lfcThreshold. We don't perform testing on the full null region, but substitute a test at the boundary and return the maximum p-value of the upper and lower one-sided tests. Also see ?results for details.
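
To make the boundary test concrete, here is a rough Python sketch of the idea as described above: two one-sided Wald tests, one at -lfcThreshold and one at +lfcThreshold, keeping the larger p-value. It assumes a normal approximation for the log2 fold change estimate and is an illustration rather than DESeq2's actual code; the function name, the clamping of the statistics at 0 and the example numbers are assumptions. In DESeq2 itself you would pass lfcThreshold and altHypothesis="lessAbs" to results().

```python
from statistics import NormalDist


def less_abs_pvalue(lfc, se, lfc_threshold):
    """P-value for the alternative |beta| < lfc_threshold (normal approximation)."""
    norm = NormalDist()
    # Upper test: null is beta >= +threshold; evidence is lfc well below +threshold
    z_upper = max((lfc_threshold - lfc) / se, 0.0)
    # Lower test: null is beta <= -threshold; evidence is lfc well above -threshold
    z_lower = max((lfc + lfc_threshold) / se, 0.0)
    p_upper = 1.0 - norm.cdf(z_upper)
    p_lower = 1.0 - norm.cdf(z_lower)
    # Both one-sided tests must reject, so report the larger p-value
    return max(p_upper, p_lower)


# Made-up estimates with lfc_threshold = 1 on the log2 scale
print(less_abs_pvalue(0.05, 0.1, 1.0))  # tight estimate near 0: very small p-value
print(less_abs_pvalue(0.9, 0.3, 1.0))   # close to the boundary: not significant
print(less_abs_pvalue(1.5, 0.3, 1.0))   # outside the region: p-value of 0.5
```

A small p-value here is positive evidence that the fold change is within the threshold, which is different from merely failing to reach significance in the usual test. The p-values would still need multiple-testing adjustment across genes.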