Question

Volcano Plot help. Label and color significant genes.

1

Entering edit mode

4.6 years ago

3kixintehead ▴ 10

I have a volcano plot with genes of log2FC > 1 and -log10(padj) > 0.5 highlighted. However, I would also like to label and color differently the top 25 genes in variability. I have a table of ENS ids and their hgnc symbols. However with my code below I get the error: Aesthetics must be either length 1 or the same as the data (7030): label

I think the reason for this is that my list is only 25 rows. I need to get it to the full length, but still only label the genes I have selected. I would also like to color these a different color to highlight them further. I am relatively new to R, but I have been stuck on this problem for a while. Is there a fairly simple fix or do I need to rebuild my gene table with all genes?

ggplot(filter_df, aes(x=log2FoldChange, y=-log10(padj))) +
geom_point(aes(color=test), size=1.5, alpha=0.4) +
scale_color_manual(values=c('gray', 'red')) +
geom_text_repel(aes(x = log2FoldChange, y = -log10(padj),label=Gene_list$hgnc_symbol)) +    
ggtitle('Volcano Plot') +
labs(y=expression('-Log'[10]*' P'[adj]), x=expression('Log'[2]*' fold change')) +
theme_minimal() +
theme(legend.position="none", plot.title = element_text(size = rel(1.5), hjust = 0.5))

RNA-Seq • 8.7k views

ADD COMMENT • link 4.6 years ago by 3kixintehead ▴ 10

1

Entering edit mode

Take a look at EnhancedVolcano - maybe it'll make your job easier. You shouuld be able to generate a list of gene identifiers that you can pass to selectLab argument.

ADD REPLY • link 4.6 years ago by Ram 45k

0

Entering edit mode

I have come across that, but I don't really want to add another package to my script. I will if I need to, but I also feel like there's something rather simple that I'm missing.

ADD REPLY • link 4.6 years ago by 3kixintehead ▴ 10

1

Entering edit mode

In that case, you'll need to change your test field so the colors for the top 25 genes are different. You seem to have multiple data frames. Try and get Gene_list integrated into filter_df so you can assign label, color, siize etc from the same data.frame.

Your filter_df should have the color, size, label, etc for each point in its own row. That way, ggplot2 can pick all required attributes by matching them to columns in the data frame.

ADD REPLY • link 4.6 years ago by Ram 45k

0

Entering edit mode

I am having trouble doing this. I am trying to put the hgnc symbols in the same table with the ENS ids, but I keep getting an error.

Cache found
Error in `$<-.data.frame`(`*tmp*`, anno, value = list(ensembl_gene_id = c("ENSG00000001461",  : 
  replacement has 7025 rows, data has 7030

My code for this is with the biomaRt package.

filter_df_anno <- filter_df
filter_df$anno <- getBM(filters= "ensembl_gene_id", 
    attributes=c("ensembl_gene_id","hgnc_symbol"),values=filter_df$Gene,mart= mart)

I don't know what happened to the 5 rows, but filtering NA doesn't seem to do anything.

Additionally, even if I could get all the info into one dataframe, then how would I filter to label and highlight a few (25 most differentially expressed genes) while still showing pval and log2FC cutoffs? Something like the volcanoplot here, but with the labeled genes, also colored differently. I really think I'm missing something simple here. The way you phrased it above, what I'm trying to do seems impossible, but I don't think that's the case with Rs graphing abilities.

ADD REPLY • link updated 4.5 years ago by Ram 45k • written 4.5 years ago by 3kixintehead ▴ 10

1

Entering edit mode

I was assuming that your ggplot2 already creates a volcano plot and you're just looking to change a few colors and labels. Was that assumption wrong?

As for the mismatching rows, you're going to have to drill down and figure out the ENSG IDs without HGNC symbols. For those, you may either have to assign NA or empty strings.

ADD REPLY • link 4.5 years ago by Ram 45k

0

Entering edit mode

Yes, it does. Currently it is red for pval adj > 0.5 and logFC2 > 1, gray otherwise. I'd like to highlight a few specific genes that are particularly highly differentially expressed. Those are the ENS ids in the original "Gene_list" dataframe I used. But, that is a df of only 25 rows and it is not working with the code I have above.

That is a good explanation of why the rows are not matching. I'm not sure how to remove them though. I imagine some use of if/else would work, but I haven't written any functions in R, only python.

ADD REPLY • link 4.5 years ago by 3kixintehead ▴ 10

1

Entering edit mode

R works better with vectorization than loops. Create a data.frame object (not a column like you're doing right now). Something like

filter_df$Gene <- as.character(filter_df$Gene)
available_data <- getBM(filters= "ensembl_gene_id", 
    attributes=c("ensembl_gene_id","hgnc_symbol"),values=filter_df$Gene,mart= mart)

And then compare available_data$ensembl_gene_id to filter_df$Gene. Also, ensure that both of them are strings, not factors - I've added a line of code that ensures that filter_df$Gene is character. If you're new to R, factors are R's way of storing string vectors by storing one instance of the string and replacing it with a (sort of) pointer to it in each subsequent occurrence. This helps in grouping, sorting, ordering etc.

ADD REPLY • link 4.5 years ago by Ram 45k

1

Entering edit mode

See if following example works for you:

## Load libraries
library(ggplot2)
library(ggrepel)

## Load example data
filter_df=read.csv("ggplot_test.txt", header = T, strip.white = T, stringsAsFactors = F)
filter_df=filter_df[complete.cases(filter_df),]
filter_df=filter_df[order(filter_df$padj),]

## Look at the data (first few lines)
> head(filter_df)
          Gene.names FoldChange_log2         padj
3354 ENSG00000106565       -3.311823 3.271649e-88
32   ENSG00000002933       -3.590700 1.459236e-80
8812 ENSG00000145681        3.103019 1.113140e-61
4387 ENSG00000114948        2.584960 6.046981e-41
4725 ENSG00000117226        2.364465 3.114875e-34
2130 ENSG00000099725       -3.826117 1.178056e-30

## Volcano plot code
ggplot(filter_df, aes(x=FoldChange_log2, y=-log10(padj))) +
    geom_point(aes(color="grey", size=1.5, alpha=0.4)) +
    ggtitle('Volcano Plot') +
    labs(y=expression('-Log'[10]*' P'[adj]), x=expression('Log'[2]*' fold change')) +
    theme_minimal() +
    theme(legend.position="none", plot.title = element_text(size = rel(1.5), hjust = 0.5))+
    geom_text_repel(data=filter_df[1:10,],aes(x = FoldChange_log2, y = -log10(padj),label=Gene.names))+
    geom_point(data=filter_df[1:10,],aes(x = FoldChange_log2, y = -log10(padj),color="red", size=1.5,alpha=0.4))+
    scale_color_manual(values=c('gray','red'))

Final figure

ADD REPLY • link 4.5 years ago by cpad0112 21k

0

Entering edit mode

Just a heads up: Don't use T and F in place of TRUE and FALSE. The former can be names of variables, the latter cannot. If there's some code that creates a T <- FALSE or T <- 0 at some point, it will mess up everything. That's not a problem here, but it's a coding best practice to be safe.

ADD REPLY • link 4.5 years ago by Ram 45k

0

Entering edit mode

Thanks! This is really helpful, and the comments here have helped me realize that looking for variability in the volcano plot is not necessarily very useful. I got a modified version of this code working to highlight the most significant genes by adjusted p-value.

ADD REPLY • link 4.5 years ago by 3kixintehead ▴ 10