Question: Too many differentially expressed genes in RNA-seq?
gravatar for virginia.s.markham
15 months ago by
virginia.s.markham30 wrote:

Hello all,

I have some RNA-seq data from an unusual biological system, and as a result I need to look for genes that are differentially expressed genes between two samples, one of which was super stressed. My DESeq2 results said that more than 70% of genes had an adjusted p-value below 0.1! I was expecting a lot of differentially expressed genes, but this seems very high. Is there such a thing as 'too many' DEGs? Is there a way I can test whether my data fall into this category? Does anyone know of any solutions to this issue? Thanks in advance!

rna-seq deseq2 • 2.1k views
ADD COMMENTlink modified 15 months ago by h.mon24k • written 15 months ago by virginia.s.markham30

Note that one of the first steps of the DESeq2 process is library normalization, where DESeq normalizes based on assuming that the genes whose expression is the median is expressed the same between your samples. If this fundamental assumption is wrong, it's going to affect the results you get. You might need to think about some other kind of normalization.

ADD REPLYlink written 15 months ago by swbarnes25.2k

Thanks for this response! I liked what Tiago mentioned about cutoffs so I made a volcano plot. In this plot, the blue is everything with padj < 1e-10, green is everything with log2 fold change > 5, and yellow is both. Would be really interested to hear anyone's thoughts on this. Not sure if it's ideal to just choose cutoffs that 'look good'. I'll look into other normalization methods like you suggest, though admittedly I'm not sure where to start...

ADD REPLYlink modified 15 months ago • written 15 months ago by virginia.s.markham30

Here is a little tip for your volcano plot: Make a column with information about who is DEG, and who is not

DESeq.Result$super <- DESeq.Result$log2FoldChange >  1  & DESeq.Result$padj < 0.01
DESeq.Result$sub   <- DESeq.Result$log2FoldChange < -1  & DESeq.Result$padj < 0.01
DESeq.Result$threshold <- as.factor(abs(DESeq.Result$log2FoldChange) > 1 & DESeq.Result$padj < 0.01)

volcano_plot <- ggplot(data=DESeq.Result, aes(x=log2FoldChange, y=-log10(padj))) +
  geom_point(data=DESeq.Result, size=1, colour="gray") +
  geom_point(data=DESeq.Result[DESeq.Result$super==TRUE, ], size=2, colour="#CC0000") +
  geom_point(data=DESeq.Result[DESeq.Result$sub  ==TRUE, ], size=2, colour="#000099") +
  xlab("log2 fold change") +
  ylab("-log10 p-value adjusted") +
  ggtitle("My Title")+
  scale_x_continuous() +
  scale_y_continuous() +
  theme_bw() +
  theme(axis.title.y = element_text(face="bold", size=16),
        axis.title.x = element_text(face="bold", size=16, colour="black"),
        axis.text = element_text(size=12),
        legend.title =element_blank() ,
        legend.text = element_text(size = 12)) +
  theme(plot.title = element_text(hjust = 0.5))
ADD REPLYlink modified 15 months ago • written 15 months ago by tiago2112871.1k
gravatar for h.mon
15 months ago by
h.mon24k wrote:

This question has been asked before, see previous discussion at Question: Differential expression for two very different samples. As swbarnes2 said, one assumption of DESeq2 and edgeR is that not too many genes are differentially expressed. edgeR allows the user to mitigate this issue with the argument logratioTrim in the function calcNormFactors() - see here. I don't know DESeq2 has a similar option - I believe it doesn't.

Regarding cut-offs: the author of DESeq2 argues that instead of applying a cut-off after a regular test for differential expression (that is, a test against the null hypothesis of zero LFC), it is better to test if the LFC is above a certain threshold. He argues that the later approach keeps the nominal fdr, while the former doens't. See discussions on DESeq2 vignette and paper, and at the BioConductor forum here and here. Use the arguments lfcThreshold and altHypothesis from results() to set the desired LFC threshold and test.

ADD COMMENTlink written 15 months ago by h.mon24k

Whoops, I don't know how I missed that earlier post. I will give the edgeR docs a read. I also like what you suggested about the L2FC cutoff--it seems like an intuitive way to compare very different samples. Thank you!!!

ADD REPLYlink written 15 months ago by virginia.s.markham30
gravatar for tiago211287
15 months ago by
tiago2112871.1k wrote:

It depends on your target organism and tissue. Also, adjusted p-values(padj) of 0.1 aren't big. 0.1 = 10%. In general, a good cutoff for considering a gene as a DEG is less than 0.01 (1%). Better to visualize it in the minus (-log10(padj)) scale so that anything between 0 and 1 will be a non-DEG, and everything bigger than 2 may be considered as a DEG. Keep in mind that a better approach is to use a cutoff for the fold change as well. Considering only the padj may not be relevant.

ADD COMMENTlink modified 15 months ago • written 15 months ago by tiago2112871.1k
gravatar for JJ
15 months ago by
JJ430 wrote:

I agree it depends on the biology behind it and usually you know if you expect a lot or not. I also agree with Tiago that the p value cutoff should be set stricter. Standard cutoff is 0.05. Additionally you could set a cutoff for the fold change and/or for the FPKM/RPKM values. Standard cutoff for log2FC is usually < -1 and > 1.

ADD COMMENTlink written 15 months ago by JJ430
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1641 users visited in the last hour