Question: How to show both edgeR and deseq2 results in a single volcano plot; highliting overlaps
2
gravatar for Farbod
11 months ago by
Farbod3.2k
Toronto
Farbod3.2k wrote:

Dear Biostars, Hi.

I there any way to show the edgeR and DESeq2 DEG results in one single volcano, AND showing their overlaps in a different colour (e.g. Non-significant= blue, significant = red, significant overlapped between both package= orange) ?

Note: This is the code I usually use for drawing my volcano plot in Rsudio but I only can feed one counts.matrix file to it, each time.

library(ggplot2)
data <- read.table("trans.counts.matrix.J_vs_M.edgeR.DE_results", header=TRUE)
##Identify the genes that have a FDR < 0.001
data$threshold = as.factor(data$FDR < 0.001)
##Construct the plot object
g <- ggplot(data=data, 
            aes(x=logFC, y =-log10(FDR), 
                colour=threshold)) +

  geom_point(alpha=0.4, size=3) +
  xlim(c(-14, 14)) +
  xlab("log2 fold change") + ylab("-log10 FDR") +
  theme_bw() +
  scale_color_manual(values = c("blue", "red")) +
  theme(legend.position="none")
g
deg volcano rna-seq R • 841 views
ADD COMMENTlink modified 11 months ago by russhh4.3k • written 11 months ago by Farbod3.2k

Please provide example data.

ADD REPLYlink written 11 months ago by zx87547.1k

Dear zx8754. Hi and thank you for your help.

I did not get the point clearly about "example data". If you mean counts.matrix structure, it could be any counts, but the head of my "data" in the code above is as below"

> head (data)
         external_gene_name     logFC   logCPM      P.Value          FDR threshold
1 TRINITY_DN179827_c0_g2_i4 -13.87327 4.618556 3.038112e-30 1.878978e-24      TRUE
2 TRINITY_DN194528_c5_g3_i4  13.26971 4.016180 1.219369e-27 3.770709e-22      TRUE
3 TRINITY_DN192883_c9_g2_i2  11.22692 3.821780 3.201530e-26 6.600158e-21      TRUE
4 TRINITY_DN185936_c0_g1_i1  12.85547 3.603356 2.980189e-24 4.607886e-19      TRUE
5 TRINITY_DN180711_c0_g1_i1 -11.87279 2.627359 1.814995e-20 2.245037e-15      TRUE
6 TRINITY_DN197264_c6_g1_i6  10.32434 2.894909 2.161066e-19 2.227587e-14      TRUE
ADD REPLYlink modified 11 months ago • written 11 months ago by Farbod3.2k

For a given gene, how will you illustrate the connection between its result in DESeq and it's (dependent) result in edgeR?

ADD REPLYlink written 11 months ago by russhh4.3k

A gene A in edgeR and DESeq will get twice a logFC and a p-value, so you cannot plot these at the same time using a volcano plot.

ADD REPLYlink written 11 months ago by WouterDeCoster38k
2

yes you can, you just plot two different points for the same gene. Whether it's meaningful to do so is up for discussion

ADD REPLYlink written 11 months ago by russhh4.3k

Dear russhh and Wouter, Hi. Here is the problem

I have done DEG analysis by both edgeR and DESeq2 and I checked the overlaps of both package in my both conditions (condition1 and condition2) using a venn diagram and I tried to annotate the common/overlap transcripts reported in both exact test and GLM method. I also checked the results using SARTools. So, now I have two volcano for edgeR and DESeq2 that have about 30% overlap of DEGs in condition1 and 50% overlap in condition2.

I was wondering if I can show all these in a single volcano plot using 3 colours.

ADD REPLYlink modified 11 months ago • written 11 months ago by Farbod3.2k

sorry, is it two different contrasts each of which has been tested using both edgeR and DESeq2?

ADD REPLYlink written 11 months ago by russhh4.3k

I have used the same data and same expressin counts resulted from Trinity for condition1 and condition2 for both edgeR and DESeq2 and focused on the overlaps base on this idea if multiple programs give you the same results, then you can be confident that those results do not depend on the particular assumptions that are made by the programs/statistic tests.

ADD REPLYlink written 11 months ago by Farbod3.2k

I think the two methods are a bit too similar for that to be appropriate; I'd agree if you were comparing voom and DESeq2. I'd agree with you if you'd subsampled either the counts or the exons and ran the two methods on the two partitions. Not sure if the overlayed volcano would be of value to help choose between the methods - I'd rather see two MA plots stacked next to each other

ADD REPLYlink written 11 months ago by russhh4.3k

Good idea! I will check and compare MA plots, too. Aren't MA plots similar to Volcano plots, in a whole perspective?

ADD REPLYlink written 11 months ago by Farbod3.2k

Not really, it will show how your diffexes vary across different levels of average expression for the two different methods

ADD REPLYlink written 11 months ago by russhh4.3k

Okay you can, but no, I don't consider it meaningful :)

When plotting, you need to think what the message is that you want to give. In this case, that's quite unclear.

ADD REPLYlink written 11 months ago by WouterDeCoster38k

If you are asking me,

by this approach you (I) can first change the way of visualising the results (converting two different volcano and a venn diagram to only one volcano)

and also make it clear for the readers that the core DEGs you have chosen in the most conservative approach are significantly differentially expressed in both two statistical packages.

ADD REPLYlink modified 11 months ago • written 11 months ago by Farbod3.2k
3
gravatar for cpad0112
11 months ago by
cpad011211k
India
cpad011211k wrote:

Try this in basic plotting: (note: I created two dataframes with 100 genes and each dataframe shares 10 common genes with identical FDR and logFC). Common genes are colored in red . and labelled, rest in green and light green.

output: Rplot01

input:

set.seed(100)
# Create a dataframe by name edge
edge = data.frame(
  gene = paste0("gene",sample(100)),
  logFC = c(rnorm(80,0,1),rnorm(20,0,12)),
  logFDR = rnorm(100,mean=0.05, sd=0.02)
)
# Create a dataframe by name dsq
dsq =data.frame(
  gene = paste0("gene",sample(100)),
  logFC = c(rnorm(70,0,1),rnorm(30,0,12)),
  logFDR = rnorm(100,mean=0.05, sd=0.01)
)

set.seed(200)
## Select random 10 genes from edge and their FC and FDR values

cf=edge[abs(edge$logFC)>2,][sample(nrow(edge[abs(edge$logFC)>2,]),10),]
#View(cf)

## Replace same genes in dsq with values above so that edge and dsq share 10 genes with identical FC and FDR.
dsq[dsq$gene %in% cf$gene,]=cf

## sort the dataframes
edge.sorted=edge[with(edge,order(gene)),]
dsq.sorted=dsq[with(dsq,order(gene)),]

## plot first dataframe
plot(
  x = edge.sorted$logFC,
  y = edge.sorted$logFDR,
  col = "darkgreen",
  pch = 16,
  cex=2,
  xlab="Log(2) Fold Change",
  ylab="FDR",
  abline(v=c(-2,2),h=c(0,0.05), col="red", lty=3,lwd=3)
)
## plot second data frame over first plot
points(
  x = dsq.sorted$logFC,
  y = dsq.sorted$logFDR,
  col = "green",
  pch = 16,
  cex=2
)
## Highlight points of interest
points(
    x = edge.sorted$logFC[edge.sorted$logFC == dsq.sorted$logFC],
    y = edge.sorted$logFDR[edge.sorted$logFDR == dsq.sorted$logFDR],
  col = "red",
  pch = 16,
  cex=2
)
## Add labels to points of interest
text(
  x = edge.sorted$logFC[edge.sorted$logFC == dsq.sorted$logFC],
  y = edge.sorted$logFDR[edge.sorted$logFDR == dsq.sorted$logFDR],
  edge.sorted$gene[edge.sorted$logFC == dsq.sorted$logFC] ,
  cex = 2,
  pos=1,
  col = "red"
)

in ggplot same (recycled code from @russh for merged dataframe creation, dataframe is code from above post):

output: Rplot01 input:

# Load libraries
library(dplyr)
library(ggplot2)

# Merge data frames by gene name
dfm = merge(edge.sorted, dsq.sorted, by = "gene")

# Create a data frame for common genes by logFC and logFDR
dfm1 = dfm[dfm$logFC.x == dfm$logFC.y & dfm$logFDR.x == dfm$logFDR.y, ]
dfm1
head(dfm)
head(dfm1)
## plot
ggplot(dfm) +
  geom_point(
    data = dfm,
    aes(x = logFC.x, y = logFDR.x),
    color = "green",
    cex = 3
  ) +
  geom_point(
    data = dfm,
    aes(x = logFC.y, y = logFDR.y),
    color = "lightgreen",
    cex = 3
  ) +
  geom_point(
    data = dfm1,
    aes(x = logFC.x, y = logFDR.x),
    color = "blue",
    cex = 3
  ) +
  geom_text(
    data = dfm1,
    aes(x = logFC.x, y = logFDR.x, label = gene),
    hjust = 1,
    vjust = 2
  ) +
  theme_bw() +
  xlab("Log(2) fold change") +
  ylab("FDR") +
  geom_vline(
    xintercept = 2,
    col = "red",
    linetype = "dotted",
    size = 1
  ) +
  geom_vline(
    xintercept = -2,
    col = "red",
    linetype = "dotted",
    size = 1
  ) +
  geom_hline(
    yintercept = 0.05,
    col = "red",
    linetype = "dotted",
    size = 1
  )
ADD COMMENTlink modified 11 months ago • written 11 months ago by cpad011211k

sure @ admin

ADD REPLYlink modified 11 months ago • written 11 months ago by cpad011211k
1

Great coding as usual cpad

ADD REPLYlink written 11 months ago by Kevin Blighe41k
1

I don't quite understand the filtering. Surely it's extremely unlikely that the fold-changes and the FDRs match between the DESeq and the edgeR runs isn't it?

ADD REPLYlink written 11 months ago by russhh4.3k
1

It is very unlikely that FDRs and FCs match between ER and DS2 exactly. However, above one is simulated data, not a true data set. However OP wants to over lay. Probably OP has a way to scale the results to comparable results (just my guess). (IMO)

ADD REPLYlink modified 11 months ago • written 11 months ago by cpad011211k

Dear rushhh You are right,

the FDRs and FC are not the same numbers in two approaches, but the order of IDs are somehow the same. first DEG in cond1 in deseq2 is same as first DEG in cond1 in edgeR and so on.

is it helpful?

ADD REPLYlink written 11 months ago by Farbod3.2k

Thanks @Kevin

ADD REPLYlink written 11 months ago by cpad011211k

I moved your comment to an answer, could you please merge your last comment to this answer?

ADD REPLYlink written 11 months ago by zx87547.1k

Dear cpad0112, Hi and thank you.

I will try your valuable code and ask some more question then, as R is always not so friendly to me.

ADD REPLYlink written 11 months ago by Farbod3.2k
2
gravatar for russhh
11 months ago by
russhh4.3k
UK, U. Glasgow
russhh4.3k wrote:

If it were me, I'd link the edgeR and DESeq results using coloured edges, rather than colour the points:

edg <- data.frame(
    gene_name = letters[1:3],
    logFC = rnorm(3),
    logFDR = c(-10, -3, -1),
    threshold = c(T, T, F)
    )

dsq <- data.frame(
    gene_name = letters[1:4],
    logFC = rnorm(4),
    logFDR = c(-8, -4, -2, -5),
    threshold = c(T, T, F, T)
    )

merge(edg, dsq, by = "gene_name") %>%
    filter(threshold.x & threshold.y) %>%
    ggplot() +
    geom_segment(aes(x = logFC.x, y = logFDR.x, xend = logFC.y, yend = logFDR.y, group = gene_name), col = "orange") +
    geom_point(data = edg, aes(x = logFC, y = logFDR), alpha = 0.1) +
    geom_point(data = dsq, aes(x = logFC, y = logFDR), alpha = 0.1)
ADD COMMENTlink written 11 months ago by russhh4.3k

Thank you for your valuable script. I will try it and share the results/problem with you then.

ADD REPLYlink written 11 months ago by Farbod3.2k

I received this error: Error in eval(expr, envir, enclos) : object 'logFC.x' not found

ADD REPLYlink written 11 months ago by Farbod3.2k

@Farbod: I guess you need to load libraries dplyr and ggplot2. On the top of the code, add

library(ggplot2)
library(dplyr)
ADD REPLYlink written 11 months ago by cpad011211k
1

@Farbod The issue you're having is that merging your edgeR and DESeq2 results did not result in a dataframe with the same colmn names as in my example. This could have been remedied if you'd posted some reproducible example data (this is good practise on any Q&A site). You can reproduce my code if you make a subdataframe from each of your DESeq2 and edgeR results that contains the columns gene_name, logFC, logFDR and threshold; use dplyr::select and dplyr::rename to achieve this

ADD REPLYlink written 11 months ago by russhh4.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1017 users visited in the last hour