Question: Too many differentially expressed genes in RNA-seq?
gravatar for virginia.s.markham
5 months ago by
virginia.s.markham0 wrote:

Hello all,

I have some RNA-seq data from an unusual biological system, and as a result I need to look for genes that are differentially expressed genes between two samples, one of which was super stressed. My DESeq2 results said that more than 70% of genes had an adjusted p-value below 0.1! I was expecting a lot of differentially expressed genes, but this seems very high. Is there such a thing as 'too many' DEGs? Is there a way I can test whether my data fall into this category? Does anyone know of any solutions to this issue? Thanks in advance!

rna-seq deseq2 • 654 views
ADD COMMENTlink modified 5 months ago by h.mon15k • written 5 months ago by virginia.s.markham0

Note that one of the first steps of the DESeq2 process is library normalization, where DESeq normalizes based on assuming that the genes whose expression is the median is expressed the same between your samples. If this fundamental assumption is wrong, it's going to affect the results you get. You might need to think about some other kind of normalization.

ADD REPLYlink written 5 months ago by swbarnes23.6k

Thanks for this response! I liked what Tiago mentioned about cutoffs so I made a volcano plot. In this plot, the blue is everything with padj < 1e-10, green is everything with log2 fold change > 5, and yellow is both. Would be really interested to hear anyone's thoughts on this. Not sure if it's ideal to just choose cutoffs that 'look good'. I'll look into other normalization methods like you suggest, though admittedly I'm not sure where to start...

ADD REPLYlink modified 5 months ago • written 5 months ago by virginia.s.markham0

Here is a little tip for your volcano plot: Make a column with information about who is DEG, and who is not

DESeq.Result$super <- DESeq.Result$log2FoldChange >  1  & DESeq.Result$padj < 0.01
DESeq.Result$sub   <- DESeq.Result$log2FoldChange < -1  & DESeq.Result$padj < 0.01
DESeq.Result$threshold <- as.factor(abs(DESeq.Result$log2FoldChange) > 1 & DESeq.Result$padj < 0.01)

volcano_plot <- ggplot(data=DESeq.Result, aes(x=log2FoldChange, y=-log10(padj))) +
  geom_point(data=DESeq.Result, size=1, colour="gray") +
  geom_point(data=DESeq.Result[DESeq.Result$super==TRUE, ], size=2, colour="#CC0000") +
  geom_point(data=DESeq.Result[DESeq.Result$sub  ==TRUE, ], size=2, colour="#000099") +
  xlab("log2 fold change") +
  ylab("-log10 p-value adjusted") +
  ggtitle("My Title")+
  scale_x_continuous() +
  scale_y_continuous() +
  theme_bw() +
  theme(axis.title.y = element_text(face="bold", size=16),
        axis.title.x = element_text(face="bold", size=16, colour="black"),
        axis.text = element_text(size=12),
        legend.title =element_blank() ,
        legend.text = element_text(size = 12)) +
  theme(plot.title = element_text(hjust = 0.5))
ADD REPLYlink modified 5 months ago • written 5 months ago by tiago211287990
gravatar for h.mon
5 months ago by
h.mon15k wrote:

This question has been asked before, see previous discussion at Question: Differential expression for two very different samples. As swbarnes2 said, one assumption of DESeq2 and edgeR is that not too many genes are differentially expressed. edgeR allows the user to mitigate this issue with the argument logratioTrim in the function calcNormFactors() - see here. I don't know DESeq2 has a similar option - I believe it doesn't.

Regarding cut-offs: the author of DESeq2 argues that instead of applying a cut-off after a regular test for differential expression (that is, a test against the null hypothesis of zero LFC), it is better to test if the LFC is above a certain threshold. He argues that the later approach keeps the nominal fdr, while the former doens't. See discussions on DESeq2 vignette and paper, and at the BioConductor forum here and here. Use the arguments lfcThreshold and altHypothesis from results() to set the desired LFC threshold and test.

ADD COMMENTlink written 5 months ago by h.mon15k

Whoops, I don't know how I missed that earlier post. I will give the edgeR docs a read. I also like what you suggested about the L2FC cutoff--it seems like an intuitive way to compare very different samples. Thank you!!!

ADD REPLYlink written 4 months ago by virginia.s.markham0
gravatar for tiago211287
5 months ago by
tiago211287990 wrote:

It depends on your target organism and tissue. Also, adjusted p-values(padj) of 0.1 aren't big. 0.1 = 10%. In general, a good cutoff for considering a gene as a DEG is less than 0.01 (1%). Better to visualize it in the minus (-log10(padj)) scale so that anything between 0 and 1 will be a non-DEG, and everything bigger than 2 may be considered as a DEG. Keep in mind that a better approach is to use a cutoff for the fold change as well. Considering only the padj may not be relevant.

ADD COMMENTlink modified 5 months ago • written 5 months ago by tiago211287990
gravatar for JJ
5 months ago by
JJ310 wrote:

I agree it depends on the biology behind it and usually you know if you expect a lot or not. I also agree with Tiago that the p value cutoff should be set stricter. Standard cutoff is 0.05. Additionally you could set a cutoff for the fold change and/or for the FPKM/RPKM values. Standard cutoff for log2FC is usually < -1 and > 1.

ADD COMMENTlink written 5 months ago by JJ310
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1567 users visited in the last hour