I have a volcano plot with genes of log2FC > 1 and -log10(padj) > 0.5 highlighted. I would also like to label, and color differently, the top 25 genes by variability. I have a table of ENS IDs and their HGNC symbols. However, with my code below I get the error: `Aesthetics must be either length 1 or the same as the data (7030): label`
I think the reason for this is that my list has only 25 rows. I need to get it to the full length, but still label only the genes I have selected. I would also like to give these a different color to highlight them further. I am relatively new to R, and I have been stuck on this problem for a while. Is there a fairly simple fix, or do I need to rebuild my gene table with all genes?
ggplot(filter_df, aes(x=log2FoldChange, y=-log10(padj))) +
geom_point(aes(color=test), size=1.5, alpha=0.4) +
scale_color_manual(values=c('gray', 'red')) +
geom_text_repel(aes(x = log2FoldChange, y = -log10(padj),label=Gene_list$hgnc_symbol)) +
ggtitle('Volcano Plot') +
labs(y=expression('-Log'[10]*' P'[adj]), x=expression('Log'[2]*' fold change')) +
theme_minimal() +
theme(legend.position="none", plot.title = element_text(size = rel(1.5), hjust = 0.5))
Take a look at EnhancedVolcano; maybe it'll make your job easier. You should be able to generate a list of gene identifiers that you can pass to the `selectLab` argument.
I have come across that, but I don't really want to add another package to my script. I will if I need to, but I also feel like there's something rather simple that I'm missing.
In that case, you'll need to change your `test` field so the colors for the top 25 genes are different. You seem to have multiple data frames. Try to get `Gene_list` integrated into `filter_df` so you can assign label, color, size, etc. from the same data.frame.
Your `filter_df` should have the color, size, label, etc. for each point in its own row. That way, ggplot2 can pick all required attributes by matching them to columns in the data frame.
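The error in the question arises because `Gene_list$hgnc_symbol` has 25 entries while `filter_df` has 7030 rows. A minimal sketch of the one-row-per-point idea follows. The data are made up, the `label` and `group` columns and the blue highlight color are illustrative choices, and `Gene_list` is assumed to carry an `ensembl_gene_id` column next to `hgnc_symbol`.

```r
# Toy stand-ins for the thread's objects; every value here is invented.
set.seed(1)
filter_df <- data.frame(
  Gene = sprintf("ENSG%011d", 1:200),
  log2FoldChange = rnorm(200, sd = 2),
  padj = runif(200),
  stringsAsFactors = FALSE
)
# 'test' flags points that pass the cutoffs used in the question.
filter_df$test <- filter_df$log2FoldChange > 1 & -log10(filter_df$padj) > 0.5

# Hypothetical table of genes to highlight (5 rows here, 25 in the question).
Gene_list <- data.frame(
  ensembl_gene_id = filter_df$Gene[c(3, 17, 42, 99, 150)],
  hgnc_symbol = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"),
  stringsAsFactors = FALSE
)

# match() yields NA for genes absent from Gene_list, so 'label' gets
# exactly one entry per row of filter_df (NA means "do not label").
idx <- match(filter_df$Gene, Gene_list$ensembl_gene_id)
filter_df$label <- Gene_list$hgnc_symbol[idx]

# Three-level color group: selected genes, cutoff-passing genes, the rest.
filter_df$group <- ifelse(!is.na(idx), "top",
                          ifelse(filter_df$test, "sig", "ns"))

# Plot only if both plotting packages are installed.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("ggrepel", quietly = TRUE)) {
  library(ggplot2)
  library(ggrepel)
  p <- ggplot(filter_df, aes(x = log2FoldChange, y = -log10(padj))) +
    geom_point(aes(color = group), size = 1.5, alpha = 0.4) +
    scale_color_manual(values = c(ns = "gray", sig = "red", top = "blue")) +
    # Label layer uses only the rows that actually have a label.
    geom_text_repel(data = subset(filter_df, !is.na(label)),
                    aes(label = label)) +
    theme_minimal() +
    theme(legend.position = "none")
  print(p)
}
```

Because every aesthetic now comes from a column of `filter_df`, the lengths always agree, and the existing cutoff coloring is preserved for the unselected genes.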
I am having trouble doing this. I am trying to put the hgnc symbols in the same table with the ENS ids, but I keep getting an error.
My code for this uses the biomaRt package.
I don't know what happened to the 5 rows, but filtering NA doesn't seem to do anything.
Additionally, even if I could get all the info into one data frame, how would I filter it to label and highlight a few genes (the 25 most differentially expressed) while still showing the p-value and log2FC cutoffs? Something like the volcano plot here, but with the labeled genes also colored differently. I really think I'm missing something simple here. The way you phrased it above, what I'm trying to do seems impossible, but I don't think that's the case with R's graphing abilities.
I was assuming that your ggplot2 already creates a volcano plot and you're just looking to change a few colors and labels. Was that assumption wrong?
As for the mismatching rows, you're going to have to drill down and figure out the ENSG IDs without HGNC symbols. For those, you may either have to assign NA or empty strings.
Yes, it does. Currently points are red for -log10(padj) > 0.5 and log2FC > 1, gray otherwise. I'd like to highlight a few specific genes that are particularly highly differentially expressed. Those are the ENS IDs in the original `Gene_list` data frame I used. But that is a data frame of only 25 rows, and it does not work with the code I have above.
That is a good explanation of why the rows are not matching. I'm not sure how to remove them though. I imagine some use of if/else would work, but I haven't written any functions in R, only Python.
R works better with vectorization than loops. Create a `data.frame` object (not a column like you're doing right now). Something like:
And then compare `available_data$ensembl_gene_id` to `filter_df$Gene`. Also, ensure that both of them are strings, not factors; I've added a line of code that ensures that `filter_df$Gene` is character. If you're new to R, factors are R's way of storing string vectors: each distinct string is stored once as a level, and every occurrence is represented by an integer code that acts as a (sort of) pointer to it. This helps in grouping, sorting, ordering, etc.
See if the following example works for you:
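A minimal sketch of the approach described above, not the original code from this reply. It assumes the biomaRt lookup has already been stored in a data frame called `available_data` with `ensembl_gene_id` and `hgnc_symbol` columns; the IDs and symbols below are invented, and no biomaRt call is made.

```r
# Toy expression table; 'Gene' is deliberately a factor to show the conversion.
filter_df <- data.frame(
  Gene = factor(c("ENSG01", "ENSG02", "ENSG03", "ENSG04")),
  log2FoldChange = c(2.1, -0.3, 1.4, -1.8),
  padj = c(0.001, 0.8, 0.04, 0.2)
)

# Hypothetical lookup result: ENSG02 returned no row, ENSG03 has an empty symbol.
available_data <- data.frame(
  ensembl_gene_id = c("ENSG01", "ENSG03", "ENSG04"),
  hgnc_symbol = c("SYM1", "", "SYM4"),
  stringsAsFactors = FALSE
)

# Compare plain character vectors, not factors.
filter_df$Gene <- as.character(filter_df$Gene)

# IDs that came back without any row explain a row-count mismatch.
missing_ids <- setdiff(filter_df$Gene, available_data$ensembl_gene_id)

# Left join: all.x = TRUE keeps every row of filter_df and fills NA
# where no symbol was found. No loop or if/else is needed.
merged <- merge(filter_df, available_data,
                by.x = "Gene", by.y = "ensembl_gene_id", all.x = TRUE)

# Treat empty symbols as missing so they are handled like NA later on.
merged$hgnc_symbol[merged$hgnc_symbol %in% ""] <- NA
```

Here `missing_ids` lists the ENSG IDs that have no HGNC symbol at all, and `merged` has the same number of rows as `filter_df`, so its columns can be mapped directly to plot aesthetics.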
Final figure
Just a heads up: Don't use `T` and `F` in place of `TRUE` and `FALSE`. `T` and `F` are ordinary variables that can be reassigned, whereas `TRUE` and `FALSE` are reserved words. If some code runs `T <- FALSE` or `T <- 0` at some point, it will mess up everything. That's not a problem here, but it's a coding best practice to be safe.
Thanks! This is really helpful, and the comments here have helped me realize that looking for variability in the volcano plot is not necessarily very useful. I got a modified version of this code working to highlight the most significant genes by adjusted p-value.