Removing Outliers from volcano plot
1
1
Entering edit mode
6.6 years ago
tiago211287 ★ 1.3k

I used this R script of rnaseqdata to generate a Volcano plot:

but there are 3 outliers point deforming the graphic shape. How can I remove it ?

resultado_table=rbind(data.frame(results(dds, contrast=c("condition", "Controle" , "Infectado"))))
resultado_table$Variable[1:39179] <- "Controle_vs_Colombiana" ##Highlight genes that have an absolute fold change > 2 and a p-value < Bonferroni cut-off resultado_table$threshold = as.factor(abs(resultado_table$log2FoldChange) > 1 & resultado_table$padj < 0.05)

geom_point(alpha=0.4, size=1.75) +
xlab("\n log2 fold change") +
theme_bw() +
theme(axis.title.y = element_text(face="bold", size=16),
axis.title.x = element_text(face="bold", size=16),
axis.text = element_text(size=12),
legend.title =element_blank() ,
legend.text = element_text(size = 12)) +
facet_wrap(~ Variable, ncol=3) +
theme(strip.text = element_text(size=12, face="bold"))
volcano_plot

volcanoplot outliers • 5.5k views
0
Entering edit mode

You can just use the xlim() and ylim() options to just set the bounds and not have to literally remove the datapoints.

0
Entering edit mode

Thank you Devon, as always you helping me. But where exactly I use this to actually work?

0
Entering edit mode

No problem. I should note there's a third option that's used by DESeq2 and a few other programs. In short, you modify the values that you're plotting such that anything outside of the bounds you want is now exactly on the border. You then denote that by changing the symbol used for these points. An example of that with an MA plot is below, where triangles denote values outside of the bounds.

I should note that this image is originally from Deseq's plotMA color-coding at random?

0
Entering edit mode

I'm actually trying to create a volcano plot in ggplot2 that INCLUDES outliers and displays them as triangles at the axis border. How can I code this? Thanks!

1
Entering edit mode

Assuming you have everything in a dataframe and the values are the "values" column:

df$shape <- ifelse(abs(df$values)>5, "triangle", "circle")
df$values[df$values>5] <- 5
df$values[df$values< -5] <- -5

You then use shape=shape in the aesthetic.

0
Entering edit mode

Thanks, it works!

What if I wanted to display the outliers like this for both the x and y axes?

1
Entering edit mode

It's the same general idea. You take whatever holds your X axis values and

df$shape[df$something>50] <- "triangle"
df$something[df$something>50] <- 50

I  picked 50 as an arbitrary value.

0
Entering edit mode

Hmm, when I apply the code to both axes, all of my points just appear as a vertical line at x=1.5

Here's my code:

#y axis

df$shape <- ifelse(abs(df$pval) < 0.1, "triangle", "circle")
df$pval[df$pval<0.1] <- 0.1

#x axis

df$shape <- ifelse(abs(df$logFC) > 2.8, "triangle," circle")
df$logFC[df$logFC >  2.8] <- 1.5
df$logFC[df$logFC < -2.8] <- -1.5

g <- ggplot(df, aes(x=logFC, y=-log10(pval), shape=shape)) +
geom_point(alpha=0.5, size=4) +
theme(legend.position = "none") +
xlim(c(-1.5, 1.5)) + ylim(c(0, 1)) +
xlab("log2 fold change") + ylab("-log10 p-value")

1
Entering edit mode

A couple things:

1. Don't set shape multiple times like that. You need to set just a subset of shape the second time: df$shape[abs(df$logFC)>1.5] <- "triangle"
2. You're doing the trimming of logFC a bit weird. I assume you intend to do something like df$logFC[df$logFC>1.5] <- 1.5 and df$logFC[df$logFC < -1.5] <- -1.5.
3. You don't need the xlim() part.

If for some reason you still get a line at x=1.5, then just have a look at df$logFC after each of the two trimming steps. It should be apparent then when things are going wrong and why. ADD REPLY 2 Entering edit mode Got it. The line of points was due to me screwing up the cutoff for replacing values and then forgetting to reset them to original values before trying the code again. I end up having to set the xlim and ylim to the values that give the best zoom of the the majority of data points. I had 10 of so points that were very distant from the rest hence me finding this thread! Thanks for the help! Here's my code and the volcano plot it produced for anyone's reference: df$shape <- ifelse(df$pval < 0.03, "triangle", "circle") df$pval[df$pval<0.03] <- 0.03 df$shape[(abs(df$logFC) > 1.5)] <- "triangle" df$logFC[df$logFC > 1.5] <- 1.5 df$logFC[df$logFC < -1.5] <- -1.5 ggplot(data=df, aes(x=logFC, y=-log10(pval), shape=shape)) + geom_point(alpha=0.5, size=4) + theme(legend.position = "none") + xlim(c(-1.5, 1.5)) + ylim(c(0, -log10(0.03))) + xlab("log2 fold change") + ylab("-log10 p-value") ADD REPLY 4 Entering edit mode 6.6 years ago If you really want to remove the outliers you could remove the values for the x and/or y axis the data points outside a given quantile range. E.g. to remove values less than 0.01 and greater than 0.99 quantile: qqx<- quantile(resultado_table$log2FoldChange, p= c(0.01, 0.99))
qqy<- quantile(resultado_table$padj, p= c(0.01, 0.99)) resultado_table2<- resultado_table[ (resultado_table$log2FoldChange > qqx[1] & resultado_table$log2FoldChange < qqx[2]) & (resultado_table$padj > qqy[1] & resultado_table\$padj < qqy[2]), ]

Now use resultado_table2 for plotting.

0
Entering edit mode

Just 3 months later I figure out why I could not make this work. You solution works like a charm. Thank you. I was using write csv and cutting the values on excel then bringing back to R. But I came again on the post to read and figure out the problem.