Tutorial: Creating Interactive Volcano Plots with R and Plot.ly
31
gravatar for steve
2.2 years ago by
steve1.8k
United States
steve1.8k wrote:

Experienced Bioinformaticians are probably familiar with the standard technique for creating volcano plots in R. Now there is a fun and interactive alternative available using the Plot.ly library. This code sample will demonstrate how to use this library to create an interactive plot.

To preview the full code + interactive plot, see my post here:
https://stevekm.github.io/2016/09/24/Plot.ly-Volcano-Plot.html

I also have the same page listed in a Gist (minus the interactive plot output) here:

volcano plot plotly tutorial R • 18k views
ADD COMMENTlink modified 8 months ago by mail2steff50 • written 2.2 years ago by steve1.8k
2

I have the same problem the colnames(diff_df) looks like normal to me:

"external_gene_name" "Fold"               "FDR"                "group"

and head(diff_df) looks like this:

head(diff_df)
  external_gene_name     Fold      FDR                  group
1             CXCL12 3.471136 1.00e-20 Significant&FoldChange
2             ADORA1 3.175682 3.01e-32 Significant&FoldChange
3              KCNK5 2.953598 7.44e-14 Significant&FoldChange
4             MAMLD1 2.786176 6.65e-16 Significant&FoldChange
5              KCNQ5 2.777046 9.25e-14 Significant&FoldChange
6              RBBP8 2.758790 4.84e-45 Significant&FoldChange
ADD REPLYlink modified 10 months ago by WouterDeCoster35k • written 10 months ago by roberto.ferrari20

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 10 months ago by WouterDeCoster35k

The same problem here and the colnames(diff_df) looks the same as above in my case. I am still getting object not found error. Any suggestions?

ADD REPLYlink written 8 months ago by proteinlover0

I think you should ask a new question and add a link to this question from your new question. That would get you better results as the visibility would be better.

ADD REPLYlink written 8 months ago by RamRS19k

what exactly is the problem? Can you update the post to include the exact code you are running & the error message?

ADD REPLYlink written 8 months ago by steve1.8k
1

Hi and thank you for sharing this.

I usually use Trinity package that will draw Volcano plot for its DEG analysis (because I am new in R).

I will show you the related script and would you please kindly check them and see that is there any of the Trinity files that I can import them in R for drawing a volcano plot same as what you have provided?

Thank you in advance:

  $TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl \
   --matrix counts.matrix --method DESeq2 --samples_file samples_described.txt

The output from running the DE analysis will reside in the output directory you specified, and if not, a default directory name that includes the name of the method used (ie. DESeq2/ ).

In this output directory, you'll find the following files for each of the pairwise comparisons performed:

1- ${prefix}.sampleA_vs_sampleB.${method}.DE_results : the DE analysis results,\ including log fold change and statistical significance (see FDR column).

2- ${prefix}.sampleA_vs_sampleB.${method}.MA_n_Volcano.pdf : MA and Volcano plots \ features found DE at FDR <0.05 will be colored red. Plots are shown\ with large (top) or small (bottom) points only for choice of aesthetics.

3- ${prefix}.sampleA_vs_sampleB.${method}.Rscript : the R-script executed to perform the DE analysis.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Farbod3.2k
1

The files you are referencing are located on your computer and not accessible in this way through the Internet. However if you check in my post I have shown the data format needed here:

# preview the dataset; data required for the plot
head(diff_df)

##   external_gene_name  Fold      FDR
## 1      RP11-431K24.1 -4.13 1.04e-05
## 2              UBE4B  2.42 8.71e-06
## 3             UBIAD1  4.27 5.50e-06
## 4             UBIAD1  1.89 2.16e-04
## 5             UBIAD1  2.74 8.59e-05
## 6             UBIAD1  3.42 1.39e-08

What this means is that the code I posted should work with any data that is held in a text based format (.txt, .csv, .tsv file, for example) with the following 3 columns: gene, fold change, significance (p value, etc.). If you have data in a file like this, you can try to load it into R using a function such as read.delim() as I used here (my preferred function for this), or one of the other options listed here. Once you have the data loaded in R, all you need to do is select the columns with the relevant information and pass them to the different aspects of the code shown and it should work.

ADD REPLYlink written 2.2 years ago by steve1.8k
1

Dear steve, Hi and Thank you

The Trinity files has some more columns but I can make a new file only with the 3 columns you have kindly provided.

by the way, instead of "gene" there would be "Trinity IDs" in that file, but I think I can change some of the most biologically important annotated trinity IDs with the genes that blast has showed.

e.g. there are about 500k transcripts in the Trinity assembly and only 2000 of them are DEG and I just need to show 5 of them in each of my two condition in the volcano plot. What do you think about this approach?

NOTE: I knew that the files in my computer are not reachable by you and I just mentioned those names because I guessed that may be you are familiar with Trinity output files. sorry and Thank you again.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Farbod3.2k
1

You can use R to subset the data as needed (what I showed here), or you can even do it without subsetting the data at all by passing the references to the entries. The file you want is probably the DE_results file. I would try to open that file in something like Excel or TextEdit to verify its format so you can figure out how to load it into R. In particular, you will need to find out what the separator between columns is (the 'delimiter'); this might not be visible from Excel but should be visible from a raw editor such as TextEdit or Notepad or Sublime, and corresponds to the sep argument of read.delim() or read.table(). If it is comma separated, it would be sep=',', if it is tab separated it would be sep='\t'

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by steve1.8k
2

Dear Steve, With your step-by-step description I could draw the volcano plot for my data and it was very awesome and interesting. Thank you.

ADD REPLYlink written 2.2 years ago by Farbod3.2k
1

Dear steve, Hi.

Is there any way to show some genes other than the "five most DEG" on this volcano plot?

Because some times the top 5 genes are the genes related to some stress reactions or even disease and the

interesting biological DEGs situate at lover level of importance (e.g I have about 500 Upregulated DEGs for my

"male" condition and the important genes locate at the level of 2,, 52, 145, 302 positins on the DEG-result.matrix)

I hope I could describe the situation clearly.

ADD REPLYlink written 2.2 years ago by Farbod3.2k
1

yes, the way this example is set up, all you need to do is create a subset of the original dataframe that holds the genes you would like to label and use that in place of the top_peaks dataframe that I used here. The R code needed to obtain these entries will depend on your original data set. For example, if you don't mind hard-coding it and you know that you want entries 2, 52, 145, and 302, you could just do something like label_peaks <- diff_df[c(2, 52, 145, 302),] and pass this to the labeling list code.

ADD REPLYlink written 2.2 years ago by steve1.8k

Hi Steve,

Do you have the Rscript for this showing the up-and-down regulated transcripts using volcano? I already have pre-filtered data. My significant data is FDR<1e-5. My upregulated is log2FC>0 while my down is log2FC<0.

ADD REPLYlink written 5 months ago by bioinfool0

Not sure I understand the question, since the original post documents all the R code used in this example. The transcript regulation values were produced using DiffBind using the ChIP-Seq workflow from this pipeline

ADD REPLYlink written 5 months ago by steve1.8k

This looks nice. I understand that you can't render the interactive plot directly here but could you upload it somewhere and link it in your post ? So we can get a better idea of how interactive the plot truly is.

ADD REPLYlink written 2.2 years ago by Carlo Yague4.3k
1

I put the HTML output in my DropBox and added a link. I am looking into trying to put the plot up on their public site. EDIT: Link replaced by this page which includes the interactive version of the plot.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by steve1.8k

After logging on dropbox, I had access to the html file. Its cool ! Thanks.

ADD REPLYlink written 2.2 years ago by Carlo Yague4.3k
1

oh yes I think you can just click the 'X' on the login window that Dropbox gives and still access the file.

ADD REPLYlink written 2.2 years ago by steve1.8k

I have my dropbox open but don't know how to get the link from your dropbox account.

Just a simple quick question: does it mean that if I click on a dot on the plot, the gene name would show up and if I move the cursor elsewhere and the gene name disappears? That would be very cool.

ADD REPLYlink written 2.2 years ago by moushengxu360

Hi,

Yes that's how it is

ADD REPLYlink written 2.2 years ago by Farbod3.2k

I took down the Dropbox link and instead hosted the plot online at the URL listed in the post. The gene name will appear as you mouse over the points.

ADD REPLYlink written 2.2 years ago by steve1.8k

That looks incredibly cool!

I have something similar:

d <- read.table("mydata.txt", header = TRUE);
attach(d);
d$Significant <- ifelse(FDR < 0.25, "FDR < 0.25", "Not Sig")

ggplot(d, aes(x = logFC, y = -log10(PValue))) +
  geom_point(aes(color = Significant)) +
  scale_color_manual(values = c("blue", "grey")) +
  theme_bw(base_size = 16) +
  geom_text_repel(
    data = subset(d, (abs(logFC)>=4) & (FDR < 0.25)),
    aes(label = GENE_ID),
    size = 5,
    box.padding = unit(0.35, "lines"),
    point.padding = unit(0.3, "lines")
  )

The good part of it is that the gene names does not squeeze into each other and they look nice on the plot. The bad part is that it cannot do the interactive display. Suppose I just want to dynamically show the gene names when "mouse over", what should I do to change the code to do so?

Thanks much!

ADD REPLYlink written 2.2 years ago by moushengxu360

Dear Steve, Hi

I am recently receiving this error in my R-studio, would you please help

  • layout(annotations = a) Error in plot_ly(data = diff_df, x = Fold, y = -log10(FDR), text = external_gene_name, : object 'Fold' not found > p Error: ScalesList was built with an incompatible version of ggproto. Please reinstall the package that provides this extension.
ADD REPLYlink written 2.0 years ago by Farbod3.2k
1

Yes, that's an incompatibility error, we had the same with ggrepel and ggplot2, the best solution (so far, advice appreciated) is to crawl under your desk and cry.

ADD REPLYlink written 2.0 years ago by WouterDeCoster35k

Hi @WouterDeCoster,

But it was working well before !

Thanks for crawl and cry solution,

Is it any way to draw an non-interactive volcano plot with the same R script (maybe by removing some of its lines) ?

Do you have any experiences about THIS.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Farbod3.2k
1

But it was working well before

Unfortunately I find myself in that situation often as well. Usually, some change in other installed system packages are to blame. For R, its frequently a change in the loaded version of gcc or zlib, though this may not be the case here.

The code posted in the OP here was designed specifically for Plotly, so you will not be able to use the same code verbatim for a base R plot, but since its essentially just a scatter plot it shouldn't need too many tweaks; my fallback would be the standard volcano plot code described here. If you want your plot to be available both in interactive and non-interactive forms from the same code, I would go with ggplot2 instead (see here), but of course that will not help if you are having ggplot2 compatibility issues.

ADD REPLYlink modified 23 months ago • written 23 months ago by steve1.8k

Hi Stephen, this is amazingly helpful and is really helping with getting my head around R language for plotting.

I have a few ,what may be very simple questions. And feel free to point me towards other resources I should be reading.

1) How can I simply add a list of names that I specify from my gene name column, but maintain all the arrow information etc. I understand there must be a simple way of populating list a, with selected gene names. But I am too new to this to know how to do that.

2) is there a simple way of adding dashed lines to the graph where the cut-offs are.

3) is there a simple way to split the axis. I am thinking the y axis in my case, but am sure it would be similar for the x.

Thanks again,

It's been a great help already.

BW

ADD REPLYlink written 15 months ago by reubenmcgregor8840
  1. Are you referring to the hard-coded black arrows on the plot? In this example, that is coming from the top_peaks data frame; you can substitute this for any df that has the gene name & values, just tweak the for loop shown to put the Fold Change, FDR, and Gene Name in the x, y, and text fields respectively in each entry in the list being built and pass it to the plot layout.

2 & 3. I honestly do not know off the top of my head, since writing this example I have moved almost all of my plotting to use ggplot2 instead. I find it to be advantageous because it makes much nicer 2-D plots, and can be imported directly into Plotly without modification. It looks like the Plotly docs have moved here, which might show you how to accomplish something like this using Plotly directly as in this example, or you could try building a similar plot in ggplot2 and import it to Plotly that way.

ADD REPLYlink modified 15 months ago • written 15 months ago by steve1.8k

Great advice, just spent the morning honing my ggplot2 skills.

Works nicely, thanks

ADD REPLYlink written 15 months ago by reubenmcgregor8840

Hello Steve and thank you very much.

I am a very new user to R and I am following your script to create a volcano plot of some proteomic data that I have. All the commands are being accepted by the software but just at the very end, it pops up some errors. These are types of errors like the following one:

> # make the Plot.ly plot
> p <- plot_ly(data = diff_df, x = Fold, y = -log10(p.value), text = external_gene_name, mode = "markers", color = "group") %>% 
+   layout(title ="DiffBind Volcano Plot")
Error in plot_ly(data = diff_df, x = Fold, y = -log10(p.value), text = external_gene_name,  : 
  object 'Fold' not found

So what I did was to modify it a little bit by adding a few "" and I manage to get a plot but it doen't look like a volcano. After my modifications the code is like that:

p <- plot_ly(data = diff_df, x = "Fold", y = "FDR", text = "external_gene_name", mode = "markers", color = "group") %>% 
  layout(title ="Volcano Plot") %>%
  layout(annotations = a)
p

Could you please help me with that? Thank you very much in advance.

Sincerely, Yiannis

ADD REPLYlink modified 12 months ago by WouterDeCoster35k • written 12 months ago by yiannis.manolaras0

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 12 months ago by WouterDeCoster35k

x = Fold is referring to the name of a column in the data frame. object 'Fold' not found could mean that column is not present in diff_df. You might want to check head(diff_df) and colnames(diff_df) to make sure it resembles what I showed in the example, and that all the columns being used in the plotting function are present. If it still does not look the way you expected, posting a screenshot here may help.

ADD REPLYlink modified 12 months ago • written 12 months ago by steve1.8k

I dont have P Values, than how to calculate that.

ADD REPLYlink written 9 months ago by bikash251030

Don't do this. If you have a question, open a new post. If you wish to add a comment, use ADD COMMENT.

ADD REPLYlink written 9 months ago by RamRS19k

How to export(save as excel file) upregulated gene from volcano plot and downregulated gene separately.

ADD REPLYlink written 8 months ago by kalyanimeha30
2

In R, you would use the write.table function to save a copy of the dataframe (diff_df in this case) after filtering it for the up and down regulated genes (examples here).

ADD REPLYlink modified 8 months ago • written 8 months ago by steve1.8k

Dear Dr.Kelly,

I have executed the above script with the Gfold values (similar to logfold2 change). since we are working with non replicated data, we don't have a p value. Thus, I would like to have +fold change on x axis and -fold change on y axis. I am able to get the graph but a regular scatter plot. I get the error message as below: No trace type specified: Based on info supplied, a 'scatter' trace seems appropriate. Read more about this trace type -> https://plot.ly/r/reference/#scatter.

Kindly help me to trouble shoot.

-Dr. Megha H.S

ADD REPLYlink written 8 months ago by mail2steff50

Is there a straightforward way to change the colors in the grouping?

ADD REPLYlink written 3 months ago by stellaparker10

You can change colours easily with my EnhancedVolcano package, which is on Bioconductor

ADD REPLYlink written 11 weeks ago by Kevin Blighe33k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1281 users visited in the last hour