Question: Removing Outliers from volcano plot
0
gravatar for tiago211287
5.2 years ago by
tiago2112871.1k
USA
tiago2112871.1k wrote:

I used this R script of rnaseqdata to generate a Volcano plot:

but there are 3 outliers point deforming the graphic shape. How can I remove it ?

 

 

resultado_table=rbind(data.frame(results(dds, contrast=c("condition", "Controle" , "Infectado"))))
resultado_table$Variable[1:39179] <- "Controle_vs_Colombiana"

##Highlight genes that have an absolute fold change > 2 and a p-value < Bonferroni cut-off
resultado_table$threshold = as.factor(abs(resultado_table$log2FoldChange) > 1 & resultado_table$padj < 0.05)
write.table(resultado_table, file="./GENES_DE")

volcano_plot <- ggplot(data=resultado_table, aes(x=log2FoldChange, y=-log10(padj) , colour=threshold)) +
  geom_point(alpha=0.4, size=1.75) +
  xlab("\n log2 fold change") +
  ylab("-log10 p-value adjusted \n") +
  theme_bw() +
  theme(axis.title.y = element_text(face="bold", size=16),
        axis.title.x = element_text(face="bold", size=16),
        axis.text = element_text(size=12),
        legend.title =element_blank() ,
        legend.text = element_text(size = 12)) +
  facet_wrap(~ Variable, ncol=3) +
  theme(strip.text = element_text(size=12, face="bold"))
volcano_plot

 

 

outliers volcanoplot • 4.6k views
ADD COMMENTlink modified 5.2 years ago by Devon Ryan93k • written 5.2 years ago by tiago2112871.1k

You can just use the xlim() and ylim() options to just set the bounds and not have to literally remove the datapoints.

ADD REPLYlink written 5.2 years ago by Devon Ryan93k

Thank you Devon, as always you helping me. But where exactly I use this to actually work?

ADD REPLYlink modified 5.2 years ago • written 5.2 years ago by tiago2112871.1k

No problem. I should note there's a third option that's used by DESeq2 and a few other programs. In short, you modify the values that you're plotting such that anything outside of the bounds you want is now exactly on the border. You then denote that by changing the symbol used for these points. An example of that with an MA plot is below, where triangles denote values outside of the bounds.

I should note that this image is originally from Deseq's plotMA color-coding at random?

ADD REPLYlink written 5.2 years ago by Devon Ryan93k

I'm actually trying to create a volcano plot in ggplot2 that INCLUDES outliers and displays them as triangles at the axis border. How can I code this? Thanks!

ADD REPLYlink written 4.7 years ago by lmolokin20
1

Assuming you have everything in a dataframe and the values are the "values" column:

df$shape <- ifelse(abs(df$values)>5, "triangle", "circle")
df$values[df$values>5] <- 5
df$values[df$values< -5] <- -5

You then use shape=shape in the aesthetic.

ADD REPLYlink written 4.7 years ago by Devon Ryan93k

Thanks, it works!

What if I wanted to display the outliers like this for both the x and y axes?

ADD REPLYlink written 4.7 years ago by lmolokin20
1

It's the same general idea. You take whatever holds your X axis values and

df$shape[df$something>50] <- "triangle"
df$something[df$something>50] <- 50

I  picked 50 as an arbitrary value.

ADD REPLYlink written 4.7 years ago by Devon Ryan93k

Hmm, when I apply the code to both axes, all of my points just appear as a vertical line at x=1.5

Here's my code:

#y axis

df$shape <- ifelse(abs(df$pval) < 0.1, "triangle", "circle")
df$pval[df$pval<0.1] <- 0.1

#x axis

df$shape <- ifelse(abs(df$logFC) > 2.8, "triangle," circle")
df$logFC[df$logFC >  2.8] <- 1.5
df$logFC[df$logFC < -2.8] <- -1.5

g <- ggplot(df, aes(x=logFC, y=-log10(pval), shape=shape)) +
  geom_point(alpha=0.5, size=4) +
  theme(legend.position = "none") +
  xlim(c(-1.5, 1.5)) + ylim(c(0, 1)) +
  xlab("log2 fold change") + ylab("-log10 p-value")

 

ADD REPLYlink written 4.7 years ago by lmolokin20
1

A couple things:

  1. Don't set shape multiple times like that. You need to set just a subset of shape the second time: df$shape[abs(df$logFC)>1.5] <- "triangle"
  2. You're doing the trimming of logFC a bit weird. I assume you intend to do something like df$logFC[df$logFC>1.5] <- 1.5 and df$logFC[df$logFC < -1.5] <- -1.5.
  3. You don't need the xlim() part.

If for some reason you still get a line at x=1.5, then just have a look at df$logFC after each of the two trimming steps. It should be apparent then when things are going wrong and why.

ADD REPLYlink written 4.7 years ago by Devon Ryan93k
2

Got it. The line of points was due to me screwing up the cutoff for replacing values and then forgetting to reset them to original values before trying the code again.

I end up having to set the xlim and ylim to the values that give the best zoom of the the majority of data points. I had 10 of so points that were very distant from the rest hence me finding this thread! Thanks for the help!

Here's my code and the volcano plot it produced for anyone's reference:

df$shape <- ifelse(df$pval < 0.03, "triangle", "circle")
df$pval[df$pval<0.03] <- 0.03

df$shape[(abs(df$logFC) > 1.5)] <- "triangle"
df$logFC[df$logFC >  1.5] <- 1.5
df$logFC[df$logFC < -1.5] <- -1.5

ggplot(data=df, aes(x=logFC, y=-log10(pval), shape=shape)) +
  geom_point(alpha=0.5, size=4) +
  theme(legend.position = "none") +
  xlim(c(-1.5, 1.5)) + ylim(c(0, -log10(0.03))) +
  xlab("log2 fold change") + ylab("-log10 p-value")

 

ADD REPLYlink written 4.7 years ago by lmolokin20
4
gravatar for dariober
5.2 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

If you really want to remove the outliers you could remove the values for the x and/or y axis the data points outside a given quantile range. E.g. to remove values less than 0.01 and greater than 0.99 quantile:

 

qqx<- quantile(resultado_table$log2FoldChange, p= c(0.01, 0.99))
qqy<- quantile(resultado_table$padj, p= c(0.01, 0.99))

resultado_table2<- resultado_table[
    (resultado_table$log2FoldChange > qqx[1] & resultado_table$log2FoldChange < qqx[2]) &
    (resultado_table$padj > qqy[1] & resultado_table$padj < qqy[2]), ]

Now use resultado_table2 for plotting.

 

 

ADD COMMENTlink written 5.2 years ago by dariober10k

Just 3 months later I figure out why I could not make this work. You solution works like a charm. Thank you. I was using write csv and cutting the values on excel then bringing back to R. But I came again on the post to read and figure out the problem. 

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by tiago2112871.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 843 users visited in the last hour