Question: Too many differentially expressed genes in RNA-seq?
1
gravatar for virginia.s.markham
10 months ago by
virginia.s.markham10 wrote:

Hello all,

I have some RNA-seq data from an unusual biological system, and as a result I need to look for genes that are differentially expressed genes between two samples, one of which was super stressed. My DESeq2 results said that more than 70% of genes had an adjusted p-value below 0.1! I was expecting a lot of differentially expressed genes, but this seems very high. Is there such a thing as 'too many' DEGs? Is there a way I can test whether my data fall into this category? Does anyone know of any solutions to this issue? Thanks in advance!

rna-seq deseq2 • 1.4k views
ADD COMMENTlink modified 10 months ago by h.mon21k • written 10 months ago by virginia.s.markham10
2

Note that one of the first steps of the DESeq2 process is library normalization, where DESeq normalizes based on assuming that the genes whose expression is the median is expressed the same between your samples. If this fundamental assumption is wrong, it's going to affect the results you get. You might need to think about some other kind of normalization.

ADD REPLYlink written 10 months ago by swbarnes24.5k

Thanks for this response! I liked what Tiago mentioned about cutoffs so I made a volcano plot. In this plot, the blue is everything with padj < 1e-10, green is everything with log2 fold change > 5, and yellow is both. Would be really interested to hear anyone's thoughts on this. Not sure if it's ideal to just choose cutoffs that 'look good'. I'll look into other normalization methods like you suggest, though admittedly I'm not sure where to start...

ADD REPLYlink modified 10 months ago • written 10 months ago by virginia.s.markham10
1

Here is a little tip for your volcano plot: Make a column with information about who is DEG, and who is not

DESeq.Result$super <- DESeq.Result$log2FoldChange >  1  & DESeq.Result$padj < 0.01
DESeq.Result$sub   <- DESeq.Result$log2FoldChange < -1  & DESeq.Result$padj < 0.01
DESeq.Result$threshold <- as.factor(abs(DESeq.Result$log2FoldChange) > 1 & DESeq.Result$padj < 0.01)

volcano_plot <- ggplot(data=DESeq.Result, aes(x=log2FoldChange, y=-log10(padj))) +
  geom_point(data=DESeq.Result, size=1, colour="gray") +
  geom_point(data=DESeq.Result[DESeq.Result$super==TRUE, ], size=2, colour="#CC0000") +
  geom_point(data=DESeq.Result[DESeq.Result$sub  ==TRUE, ], size=2, colour="#000099") +
  xlab("log2 fold change") +
  ylab("-log10 p-value adjusted") +
  ggtitle("My Title")+
  scale_x_continuous() +
  scale_y_continuous() +
  theme_bw() +
  theme(axis.title.y = element_text(face="bold", size=16),
        axis.title.x = element_text(face="bold", size=16, colour="black"),
        axis.text = element_text(size=12),
        legend.title =element_blank() ,
        legend.text = element_text(size = 12)) +
  theme(plot.title = element_text(hjust = 0.5))
volcano_plot
ADD REPLYlink modified 10 months ago • written 10 months ago by tiago2112871.0k
1
gravatar for h.mon
10 months ago by
h.mon21k
Brazil
h.mon21k wrote:

This question has been asked before, see previous discussion at Question: Differential expression for two very different samples. As swbarnes2 said, one assumption of DESeq2 and edgeR is that not too many genes are differentially expressed. edgeR allows the user to mitigate this issue with the argument logratioTrim in the function calcNormFactors() - see here. I don't know DESeq2 has a similar option - I believe it doesn't.

Regarding cut-offs: the author of DESeq2 argues that instead of applying a cut-off after a regular test for differential expression (that is, a test against the null hypothesis of zero LFC), it is better to test if the LFC is above a certain threshold. He argues that the later approach keeps the nominal fdr, while the former doens't. See discussions on DESeq2 vignette and paper, and at the BioConductor forum here and here. Use the arguments lfcThreshold and altHypothesis from results() to set the desired LFC threshold and test.

ADD COMMENTlink written 10 months ago by h.mon21k

Whoops, I don't know how I missed that earlier post. I will give the edgeR docs a read. I also like what you suggested about the L2FC cutoff--it seems like an intuitive way to compare very different samples. Thank you!!!

ADD REPLYlink written 10 months ago by virginia.s.markham10
2
gravatar for tiago211287
10 months ago by
tiago2112871.0k
USA
tiago2112871.0k wrote:

It depends on your target organism and tissue. Also, adjusted p-values(padj) of 0.1 aren't big. 0.1 = 10%. In general, a good cutoff for considering a gene as a DEG is less than 0.01 (1%). Better to visualize it in the minus (-log10(padj)) scale so that anything between 0 and 1 will be a non-DEG, and everything bigger than 2 may be considered as a DEG. Keep in mind that a better approach is to use a cutoff for the fold change as well. Considering only the padj may not be relevant.

ADD COMMENTlink modified 10 months ago • written 10 months ago by tiago2112871.0k
0
gravatar for JJ
10 months ago by
JJ390
JJ390 wrote:

I agree it depends on the biology behind it and usually you know if you expect a lot or not. I also agree with Tiago that the p value cutoff should be set stricter. Standard cutoff is 0.05. Additionally you could set a cutoff for the fold change and/or for the FPKM/RPKM values. Standard cutoff for log2FC is usually < -1 and > 1.

ADD COMMENTlink written 10 months ago by JJ390
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1160 users visited in the last hour