Question

Typical Volcano Plot for DGE

0

Entering edit mode

9 months ago

fakeeha • 0

Hello,

This is my first doing a DGE and created a volcano plot for the genes that were found to be significantly differentially expressed. I have been looking at gene expression volcano plots in the literature and mine doesn't look quite similar to those. I am concerned if I even did the analysis the correct way. Would really appreciate some feedback on my plot. enter image description here

RNAseq volcano DGE plots • 1.9k views

ADD COMMENT • link updated 9 months ago by ATpoint 82k • written 9 months ago by fakeeha • 0

score 1 · Answer 1 · 2023-07-30

This will only give names to the differential expressed with adj P Value < 0.05, deg is a table from limma you just need to adjust to DESeq2

x <- rownames(deg)
y <- deg$adj.P.Val

for (i in 1:length(rownames(deg))) {
  if (y[i] > 0.05) {
    x[i] <- ""
  }
}

EnhancedVolcano(
  deg,
  lab = x,
  x = 'logFC',
  y = 'P.Value',
  labSize = 3.0,
  pCutoff = 1e-02,
  title = 'Cluster Three No Death vs Death',
  legendPosition = 'right',
  drawConnectors = TRUE,
  max.overlaps = 100
)

enter image description here

GenoMax · Answer 2 · 2023-07-29

0

Entering edit mode

9 months ago

jared.andrews07 ★ 16k

Code for how you're generating it would be helpful. The most glaring things are that your horizontal red line seemingly doesn't correspond to whatever you're using as a significant cutoff and that your vertical foldchange lines don't seem to correspond to any aesthetic mapping (e.g. ignoring genes between the two and coloring them red).

As an aside, avoid using red/green in the same plot - they're a pain to distinguish for color blind folks.

ADD COMMENT • link 9 months ago by jared.andrews07 ★ 16k

0

Entering edit mode

Thankyou for your response. I used DESeq2 to get the differential genes.

resLFC <- lfcShrink(dds, coef = "Fertilizer_H_vs_L", type = "apeglm")

resLFC_df <- as.data.frame(resLFC)

#adding a new column to resLFC data frame for differentially expressed genes

resLFC_df$diffexpressed <- "NO" #labelling all genes as NO

resLFC_df$diffexpressed[resLFC_df$log2FoldChange>0.1 & resLFC_df$padj<0.05 ] <- "UP"

resLFC_df$diffexpressed[resLFC_df$log2FoldChange<0.1 & resLFC_df$padj<0.05 ] <- "DOWN"

#Volcano plot 

ggplot(data = resLFC_df , aes(x= log2FoldChange , y= -log10(pvalue), col = diffexpressed))+
  geom_point()+
  theme_minimal()+
  scale_color_manual(values = c('skyblue', 'salmon', 'seagreen'))+
  theme(text = element_text(size = 20))+
  geom_vline(xintercept=c(-0.1, 0.1), col="red") +
  geom_hline(yintercept=-log10(0.05), col="red")

ADD REPLY • link updated 9 months ago by GenoMax 142k • written 9 months ago by fakeeha • 0

0

Entering edit mode

An update:

Instead of using the LFC shrink results, I used the actual DESeq results dataframe. This time the plot seems much better to me. Please let me know what you think. Thanks.

enter image description here

ADD REPLY • link 9 months ago by fakeeha • 0

score 0 · Answer 3 · 2023-07-30

0

Entering edit mode

9 months ago

ATpoint 82k

The whole point of the post-hoc lfc shrinkage methods in DESeq2 is to squeeze logFCs that are noisy and unreliable towards zero while preserving reliable logFCs. Volcanos usually have this inflated area at the bottom because large unreliable logFCs (with large p-values, hence small -log10(p)) are present in abundance. The shrinkage removes this. Both plots are perfectly fine and expected. Arguably, people are more used to the first type of plot which doesn't make the second one wrong. Subjectively, the second one is better as it emphasizes reliable logFCs while the first one mostly shows noise, but that's a personal taste. For QB purposes the first one, or MA-plots on unshrunken logFCs make anlot of sense as one can really see the data 'as-is' rather than the logFCs after the shrinkage magic has happened.

Also, note that you filter in the table based on padj, so consequenctly you would also (correctly!) also show padj in the volcano, but you show uncorrected/nominal p-value. I know it's common but not helpful imo as we do not use the uncorrected pvalue for anything, it's always padj/FDR that we use, and rightfully so as multiple testing problem is a real thing.

Does that make sense to you?

ADD COMMENT • link 9 months ago by ATpoint 82k

0

Entering edit mode

Thank you for curating such an informative response. It makes more sense now. Also, it was a mistake on my part where I input the p.values instead of padj. I have corrected that. Lastly, I have another concern regarding the log2foldchange value. As in, which value would be an ideal one to distinguish between upregulated and downregulated genes. I came across a lot of tutorials opting for either 0.1 or 1 while some even set it at 0. Please let me know what you think would be most appropriate.

ADD REPLY • link 9 months ago by fakeeha • 0

1

Entering edit mode

That is empirical, I would plot the distribution of the log2foldchange values in a histogram plot and decide a suitable threshold given the distribution.

ADD REPLY • link 9 months ago by Manuel Sokolov Ravasqueira ▴ 100

0

Entering edit mode

Seconding this. There is no reliable cutoff. Sometimes people use 1.5, sometimes 2, sometimes just 0. It depends on how many DEGs you have at all. if many you might be more strict, if few you might want to be generous to have any DEGs at all. Statistically, filtering on logFC is not even appropriate since this is not what the p-value is testing. It tests by default whether the logFC is zero. Strictly speaking, for logFC filter you need to test against that logFC, like in lfcShrink of DESeq2 or glmTreat in edgeR, but I am guilty of this myself much more often than I am willing to admit -- but often enough data are not powerful enough for such a stringent test so we do filtering when testing against 0. Lets say if you do 1.5 (so log2(1.5)) then you probably won't get harsh criticism for it since it's very common.

ADD REPLY • link 9 months ago by ATpoint 82k