How can I remove the outliers from a boxplot and fill the groups?
3.7 years ago
dpc ▴ 240

I have generated these two boxplots (image: two boxplots).

But I am unable to remove the points and the outliers. I have already tried outliers.shape = NA, which didn't work.

I also want to fill the groups as in this image (image not available).

Can anyone please tell me how I can do that?

Here's my code:

library(phyloseq)  # plot_richness()
library(ggpubr)    # stat_compare_means(); also attaches ggplot2

p <- plot_richness(physeq_rarefied, x = "type", color = "type", measures = c("Shannon", "Observed")) +
    stat_compare_means(method = "wilcox.test") +
    geom_boxplot() +
    labs(x = "Sample types", y = "Alpha Diversity Measure",
         title = "Alpha diversity of control and test samples")

Thanks, dpc

R statistics • 6.4k views

Use outlier.shape = NA in geom_boxplot() to hide outliers in ggplot. Try changing the notchwidth value.

library(ggplot2)
ggplot(iris, aes(Species, Sepal.Length, fill=Species)) +
    geom_boxplot(
        outlier.shape = NA,
        notch = TRUE,
        notchwidth = 0.10)

Instead of box plots, try beeswarm or violin plots with jitter.
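
A minimal sketch of the violin-plus-jitter suggestion, using the built-in iris data as a stand-in. The jitter width, point size, and alpha values are arbitrary choices, not taken from this thread:

```r
library(ggplot2)

# The violin shows the shape of the distribution; the jittered points show
# every observation, so there is no separate outlier symbol to hide.
p_violin <- ggplot(iris, aes(Species, Sepal.Length, fill = Species)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  geom_jitter(width = 0.1, height = 0, size = 1)

p_violin
```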

3.7 years ago

Hi @dpc,

To get a notched boxplot (I believe this is the right term for the figure that you want to make: https://sites.google.com/site/davidsstatistics/home/notched-box-plots ), just add the following option inside the geom_boxplot() call in your code:

geom_boxplot(notch = TRUE)

Then, after running your code, you can do:

p$layers[1] <- NULL

This will remove the first ggplot layer, which corresponds to geom_point().
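
Removing a layer by position assumes the point layer is always first. As a hedged alternative, the sketch below drops layers by geom class instead. It uses iris as a stand-in for the plot_richness() output, and the is_point helper is a name made up for this illustration:

```r
library(ggplot2)

# Stand-in for the plot_richness() result: a point layer, then a boxplot layer.
p <- ggplot(iris, aes(Species, Sepal.Length, colour = Species)) +
  geom_point() +
  geom_boxplot(outlier.shape = NA)

# Flag every layer whose geom is a point geom, then keep only the others.
is_point <- vapply(p$layers, function(l) inherits(l$geom, "GeomPoint"), logical(1))
p$layers <- p$layers[!is_point]

p
```

Printing p$layers beforehand is a quick way to check which layers a plot actually contains.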

I hope this helps,

António


Thanks, @Antonio. Yes, I am aware of the notched boxplot :). My only concern was how to fill the types and remove the outliers. However, I will try what you have suggested.

Thanks a lot, dpc


If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work.


This way, all the dots disappeared except the outliers (image not available).

Also, how can I fill the boxes with colours?


Hi,

Use the same instructions that I gave you, but replace the line containing geom_boxplot() with the following:

geom_boxplot(aes(fill = type), notch = TRUE, outlier.shape = NA)

Let me know if it worked.

António


This is my code now:

p <- plot_richness(physeq_rarefied, x = "type", color = "type", measures = c("Shannon", "Observed")) +
    stat_compare_means(method = "wilcox.test") +
    geom_boxplot(aes(fill = type), notch = TRUE, outliers.shape = NA) +
    scale_fill_manual(values = c("hotpink", "skyblue")) +
    labs(x = "Sample types", y = "Alpha Diversity Measure",
         title = "Alpha diversity of control and test samples")
p
p$layers[1] <- NULL
p

And this is the output; the outliers still exist (image not available).

EDIT: However, replacing outliers.shape = NA with outlier.size = -1 removes all the outliers (image not available).

Thanks, dpc
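
A likely explanation for the outliers persisting: the geom_boxplot() argument is spelled outlier.shape (singular), so outliers.shape = NA is not a recognised parameter and the outliers are drawn as usual. The sketch below shows the corrected spelling together with filled boxes. It uses iris as a stand-in for the phyloseq data, and the third fill colour is an assumption added because iris has three groups:

```r
library(ggplot2)

p_fill <- ggplot(iris, aes(Species, Sepal.Length, fill = Species)) +
  # outlier.shape = NA (singular "outlier") hides the outlier points;
  # the box, whiskers, and notch are still computed from all the data.
  geom_boxplot(notch = TRUE, outlier.shape = NA) +
  scale_fill_manual(values = c("hotpink", "skyblue", "gold"))

p_fill
```

Note that this only hides the points; it does not remove them from the data or change the box statistics.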


Try the geom_boxplot2 function from Ipaper (https://github.com/kongdd/Ipaper/) for boxplots without outliers.

Example code:

library(Ipaper)
library(ggplot2)

ggplot(iris[,c(1,5)], aes(Species,Sepal.Length))+
    geom_boxplot2()
3.7 years ago

If you want clarity about and control over how you deal with outliers, without dealing with the "black box" that other people's code provides, there are a couple of common approaches you can use to do this yourself:

  1. Trimming
  2. Winsorization

In both approaches, you specify a percentage cutoff. You sort the data from low to high, and you take some percentage of values from the full set of data and deal with them, depending on the method.

For instance, if you have 1000 points and a 10% cutoff, the values you deal with are the top 50 and bottom 50 values. Each of these subsets makes up 5% of the total dataset, or 10% in total.

With the trimming method, any value from your dataset which falls in this cutoff is removed. If you start with 1000 values and have a 10% cutoff, you end up with a dataset containing 900 values.

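With the Winsorization approach, unlike trimming, any value which falls in this cutoff is not removed, but is instead replaced with the nearest value that lies inside the cutoff (the lowest retained value for the bottom tail, the highest retained value for the top tail). You still end up with 1000 values.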

Both approaches change your distribution, but they filter outliers.

How many outliers are removed depends entirely on your choice of cutoff.

In R, you could trim simply by excising a number of rows from a dataset that meet the criteria (e.g., using a 10% cutoff):

q <- quantile(x,  probs = c(5, 95)/100)
trimmed_x <- x[x>q[1] & x<q[2]]

In R, you could Winsorize with the winsorize function from statar (e.g., using a 10% cutoff):

winsorized_x <- winsorize(x, probs = c(0.05, 0.95))

Then plot trimmed_x or winsorized_x as if it were your original dataset.
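
To make the counts concrete, here is a small base-R sketch on simulated data (the seed and the normal distribution are arbitrary assumptions). The Winsorizing step clamps tail values to the quantile cutpoints with pmin/pmax, which is a base-R approximation and not necessarily identical to what statar's winsorize does internally:

```r
set.seed(1)        # arbitrary seed, only for reproducibility
x <- rnorm(1000)   # simulated data standing in for real measurements

# 10% cutoff split evenly: 5% from each tail
q <- quantile(x, probs = c(0.05, 0.95))

# Trimming: drop everything outside the two quantiles
trimmed_x <- x[x > q[1] & x < q[2]]
length(trimmed_x)      # 900

# Winsorizing: keep every value, but clamp the tails to the cutpoints
winsorized_x <- pmin(pmax(x, q[1]), q[2])
length(winsorized_x)   # 1000
```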


Another approach would be to calculate the box plot stats, use the mean (between the upper and lower values of the box) to trim equally on either side of the box, multiply the limits by an appropriate factor, and use Cartesian coordinates in ggplot.
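
One loose reading of this suggestion, sketched on iris: compute the whisker ends per group with boxplot.stats(), pad them slightly, and zoom with coord_cartesian() so the outliers fall outside the visible area without being dropped from the statistics. The 5% padding and the names lims and pad are assumptions made for this illustration:

```r
library(ggplot2)

# Whisker ends (elements 1 and 5 of boxplot.stats()$stats) for each group,
# collapsed to one overall lower and upper limit.
lims <- range(sapply(split(iris$Sepal.Length, iris$Species),
                     function(v) boxplot.stats(v)$stats[c(1, 5)]))
pad <- 0.05 * diff(lims)   # small margin so the whiskers are not clipped

p_zoom <- ggplot(iris, aes(Species, Sepal.Length, fill = Species)) +
  geom_boxplot(outlier.shape = NA) +
  # coord_cartesian() zooms the view; unlike ylim(), it does not discard data
  coord_cartesian(ylim = lims + c(-pad, pad))

p_zoom
```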
