Question: Too many differentially expressed genes in RNA-seq?
gravatar for virginia.s.markham
2.2 years ago by
virginia.s.markham50 wrote:

Hello all,

I have some RNA-seq data from an unusual biological system, and as a result I need to look for genes that are differentially expressed genes between two samples, one of which was super stressed. My DESeq2 results said that more than 70% of genes had an adjusted p-value below 0.1! I was expecting a lot of differentially expressed genes, but this seems very high. Is there such a thing as 'too many' DEGs? Is there a way I can test whether my data fall into this category? Does anyone know of any solutions to this issue? Thanks in advance!

rna-seq deseq2 • 3.3k views
ADD COMMENTlink modified 2.2 years ago by h.mon29k • written 2.2 years ago by virginia.s.markham50

Note that one of the first steps of the DESeq2 process is library normalization, where DESeq normalizes based on assuming that the genes whose expression is the median is expressed the same between your samples. If this fundamental assumption is wrong, it's going to affect the results you get. You might need to think about some other kind of normalization.

ADD REPLYlink written 2.2 years ago by swbarnes27.5k

Thanks for this response! I liked what Tiago mentioned about cutoffs so I made a volcano plot. In this plot, the blue is everything with padj < 1e-10, green is everything with log2 fold change > 5, and yellow is both. Would be really interested to hear anyone's thoughts on this. Not sure if it's ideal to just choose cutoffs that 'look good'. I'll look into other normalization methods like you suggest, though admittedly I'm not sure where to start...

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by virginia.s.markham50

Here is a little tip for your volcano plot: Make a column with information about who is DEG, and who is not

DESeq.Result$super <- DESeq.Result$log2FoldChange >  1  & DESeq.Result$padj < 0.01
DESeq.Result$sub   <- DESeq.Result$log2FoldChange < -1  & DESeq.Result$padj < 0.01
DESeq.Result$threshold <- as.factor(abs(DESeq.Result$log2FoldChange) > 1 & DESeq.Result$padj < 0.01)

volcano_plot <- ggplot(data=DESeq.Result, aes(x=log2FoldChange, y=-log10(padj))) +
  geom_point(data=DESeq.Result, size=1, colour="gray") +
  geom_point(data=DESeq.Result[DESeq.Result$super==TRUE, ], size=2, colour="#CC0000") +
  geom_point(data=DESeq.Result[DESeq.Result$sub  ==TRUE, ], size=2, colour="#000099") +
  xlab("log2 fold change") +
  ylab("-log10 p-value adjusted") +
  ggtitle("My Title")+
  scale_x_continuous() +
  scale_y_continuous() +
  theme_bw() +
  theme(axis.title.y = element_text(face="bold", size=16),
        axis.title.x = element_text(face="bold", size=16, colour="black"),
        axis.text = element_text(size=12),
        legend.title =element_blank() ,
        legend.text = element_text(size = 12)) +
  theme(plot.title = element_text(hjust = 0.5))
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by tiago2112871.2k
gravatar for h.mon
2.2 years ago by
h.mon29k wrote:

This question has been asked before, see previous discussion at Question: Differential expression for two very different samples. As swbarnes2 said, one assumption of DESeq2 and edgeR is that not too many genes are differentially expressed. edgeR allows the user to mitigate this issue with the argument logratioTrim in the function calcNormFactors() - see here. I don't know DESeq2 has a similar option - I believe it doesn't.

Regarding cut-offs: the author of DESeq2 argues that instead of applying a cut-off after a regular test for differential expression (that is, a test against the null hypothesis of zero LFC), it is better to test if the LFC is above a certain threshold. He argues that the later approach keeps the nominal fdr, while the former doens't. See discussions on DESeq2 vignette and paper, and at the BioConductor forum here and here. Use the arguments lfcThreshold and altHypothesis from results() to set the desired LFC threshold and test.

ADD COMMENTlink written 2.2 years ago by h.mon29k

Whoops, I don't know how I missed that earlier post. I will give the edgeR docs a read. I also like what you suggested about the L2FC cutoff--it seems like an intuitive way to compare very different samples. Thank you!!!

ADD REPLYlink written 2.2 years ago by virginia.s.markham50
gravatar for tiago211287
2.2 years ago by
tiago2112871.2k wrote:

It depends on your target organism and tissue. Also, adjusted p-values(padj) of 0.1 aren't big. 0.1 = 10%. In general, a good cutoff for considering a gene as a DEG is less than 0.01 (1%). Better to visualize it in the minus (-log10(padj)) scale so that anything between 0 and 1 will be a non-DEG, and everything bigger than 2 may be considered as a DEG. Keep in mind that a better approach is to use a cutoff for the fold change as well. Considering only the padj may not be relevant.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by tiago2112871.2k
gravatar for JJ
2.2 years ago by
JJ500 wrote:

I agree it depends on the biology behind it and usually you know if you expect a lot or not. I also agree with Tiago that the p value cutoff should be set stricter. Standard cutoff is 0.05. Additionally you could set a cutoff for the fold change and/or for the FPKM/RPKM values. Standard cutoff for log2FC is usually < -1 and > 1.

ADD COMMENTlink written 2.2 years ago by JJ500

Yes - I strongly agree with having a fold-change filter.

While it is a slightly different topic, I thought this was an issue when the number of differentially expressed genes was approaching the number of genes in the genome (which would lead to inappropriate power calculations):

ADD REPLYlink written 8 months ago by Charles Warden7.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1574 users visited in the last hour