Question: How to generate Beta diversity boxplot from phyloseq object?
gravatar for dpc
5 months ago by
dpc150 wrote:

Hi !!! I am working with profiled metagenomic taxonomy abundance data. I want to generate beta diversity (bray-curtis) boxplot from a phyloseq object where two groups (control and test) will be shown. Something like this figure. Can anybody help by sharing some code or some useful tutorial?

ADD COMMENTlink written 5 months ago by dpc150


Phyloseq does not have built-in functions that allow you to produce a figure like that. You need to calculate the Bray-Curtis distance with vegan or phyloseq, but then, you need to extract the information from the matrix to build a box plot like that with for instance, ggplot2.


ADD REPLYlink written 5 months ago by antonioggsousa1.9k

I have used the following code after looking at a tutorial and getting the following plot. I don't know why two plots are coming. Can you tell some modifications to get a single one?

physeq = merge_phyloseq(physeq, sampledata, random_tree)

wu = phyloseq::distance(physeq, "bray")
wu.m = melt(as.matrix(wu))
wu.m = wu.m %>%
  filter(as.character(Var1) != as.character(Var2)) %>%
  mutate_if(is.factor, as.character)

sd = data.frame(sample_data(physeq))

sd = sd %>%
  select("SampleID", "type") %>%

colnames(sd) = c("Var1", "Type1") = left_join(wu.m, sd, by = "Var1")

colnames(sd) = c("Var2", "Type2") =, sd, by = "Var2")

p =, aes(x = Type2, y = value)) +
  theme_bw() +
  geom_point() +
  geom_boxplot(aes(color = ifelse(Type1 == Type2, "red", "black" ))) +
  scale_color_identity() +
  facet_wrap(~ Type1, scales = "free_x") +
  theme(axis.text.x=element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  ggtitle(paste0("Distance Metric = ", "bray")) +
  ylab("bray") +

ADD REPLYlink modified 5 months ago • written 5 months ago by dpc150

Remove the facet_wrap(~ Type1, scales = "free_x") + line from your code and try again.

Let me know if this solved the issue,


ADD REPLYlink written 5 months ago by antonioggsousa1.9k

It's giving this image.

ADD REPLYlink written 5 months ago by dpc150

Remove also geom_point() +!

Let me know if this solved the issue,


ADD REPLYlink written 5 months ago by antonioggsousa1.9k

Not working. Image

ADD REPLYlink written 5 months ago by dpc150

Did you remove both: facet_wrap(~ Type1, scales = "free_x") + and geom_point() +?


ADD REPLYlink written 5 months ago by antonioggsousa1.9k

I just wrote the function below. You might want to test it. I did not test it extensively, so it might have errors. If you use it you need to be very careful and test it on a dataset that you know well.

It takes as input the phyloseq object, the method that you want to use to perform beta-diversity (using one of the methods that you can pass to phyloseq beta-diversity analysis) and the group variable (on the sample_data frame):

## Import packages and data


## Create function
beta_boxplot <- function(physeq, method = "bray", group) {

  # physeq: phyloseq-class object
  # method: beta-diversity metric. Default "bray", i.e., Bray-Curtis dissimilarity 
  # group: factorial variable to group

  ## Packages
  require("phyloseq") # v.1.30.0
  require("ggplot2") # v.3.3.2

  ## Identify the correspondence: group and samples
  group2samp <- list() # list to save the correspondence between group <--> samples
  group_list <- get_variable(sample_data(physeq), group) # list of group elements
  for (groups in levels(group_list)) { # loop over the no. of group levels
    target_group <- which(group_list == groups) # vct pos of the curr group variable 
    group2samp[[ groups ]] <- sample_names(physeq)[target_group] # matching samples: based on vct pos

  ## Calculate beta-diversity
  beta_div_dist <- distance(physeq = physeq, method = method)
  beta_div_dist <- as(beta_div_dist, "matrix")

  ## Coerce distance mtx into a tidy data frame by group
  dist_df <- data.frame() # save results in df 
  counter <- 1 
  for (groups in names(group2samp)) { # loop over group fct levels 
    sub_dist <- beta_div_dist[ group2samp[[groups]], group2samp[[groups]] ] # subset dist mtx to curr samples
    no_samp_col <- ncol(sub_dist) # n cols: curr sub dist
    no_samp_row <- nrow(sub_dist) # n rows: curr sub dist
    for ( cols in seq(no_samp_col) ) { # loop over cols: curr sub_dist
      if ( cols > 1 ) {
        for ( rows in seq((cols-1)) ) { # loop over rows: curr sub_dist 
          ## Save results
          dist_df[ counter, "sample_pair" ] <- paste0( colnames(sub_dist)[cols], "-",  
                                                       rownames(sub_dist)[rows] ) # sample pair
          dist_df[ counter, "group" ] <- groups # group  
          dist_df[ counter, "beta_div_method" ] <- method # method
          dist_df[ counter, "beta_div_value" ] <- sub_dist[rows, cols] # beta-diversity for the sample pair     
         counter = counter + 1

  ## Create a ggplot2 boxplot
  plot_boxplot <- ggplot(data = dist_df, aes(x = group, y = beta_div_value, color = group)) + 
    geom_boxplot(outlier.shape=NA) +
    geom_jitter() + 
    theme_bw() + 
    xlab(group) + ylab(method) +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

  ## Save df and boxplot into a list 
  list_Out <- list("data" = dist_df, "plot" = plot_boxplot) 


## Test function 
beta_boxplot_result <- beta_boxplot(physeq = GlobalPatterns, method = "bray", group = "SampleType")

## Data

## Plot

Again, you should be very careful and test well first. I'll perform some tests later and I'll come back in case I found errors. The result is a list with a data frame and the box plot (from ggplot class). So, you can compare this data frame with all pairwise comparisons belonging to each group with the table obtained from phyloseq and see if everything looks fine.

I hope this helps,


ADD REPLYlink modified 5 months ago • written 5 months ago by antonioggsousa1.9k

Hi... I could not reply yesterday as I reached highest number of posts a day (as a new member). I am seeing it works nicely with "GlobalPatterns", but not with my data set. I don't understand what have you done here:

for (groups in levels(group_list)) { # loop over the no. of group levels
    target_group <- which(group_list == groups) # vct pos of the curr group variable 
    group2samp[[ groups ]] <- sample_names(physeq)[target_group] # matching samples: based on vct pos

Why is there a loop?

ADD REPLYlink written 5 months ago by dpc150


Probably there are many ways of getting that information, even without a loop, but here the idea is to loop over the different levels of the group variable, e.g., let's say that you have a variable called sex, where the levels can be male or female. The loop will go through female and, then, male and for each will retrieve the list of samples belonging to male or female. So, basically, group2samp, in this example, will contain two elements male or female, in which each of them will hold a vector of samples belonging to themselves.

The only reason that I can think that is not working for you is because group parameter needs to be a factor class variable. If you give a character class variable will not work. So, you can check which class is by doing: class(get_variable(sample_data(your_physeq_object), your_group_variable)).

If your variable is different from factor, you need to convert it to factor class before trying again.


ADD REPLYlink modified 5 months ago • written 5 months ago by antonioggsousa1.9k

Yes sir... Finally It's running. I am getting an output like this after removing "jeom jitter" . Thanks a lot sir for your kind help. I am really indebted to you.

Also, I will like to know some basics about what you have done in this script. I think you have calculated "bray" index between the controls samples; and then, between the obese samples. After that, you have compared their median by box plot. Right?

(One little request: It would be great if you upvote my questions so that I can conversate for longer here. Yeasterday reached highest limit too early, only 5 messages )

ADD REPLYlink modified 5 months ago • written 5 months ago by dpc150

Sorry, I up voted your question now!

So, if you want to remove the points without changing the original code function, you can because you are working with a plot from ggplot class, by doing:

## Remove jittered points
beta_boxplot_result$plot$layers[[2]] <- NULL

You can even change the x- and y-axis labels by doing:

## Add y and x labels
beta_boxplot_result$plot + xlab("Group variable") + ylab("Bray-Curtis dissimilarity")

You can write what ever you want in xlab() and ylab() for your x- and y-axis labels. You can change the whole plot, background and so on, because the plot is from ggplot class. Since you also have the data frame that was use to create the plot, you can even create the plot from scratch.

Is this plot that you got equal to the plot that you did before following the tutorial?


ADD REPLYlink modified 5 months ago • written 5 months ago by antonioggsousa1.9k

No sir. They are different. This is yours and this is from tutorial.

ADD REPLYlink written 5 months ago by dpc150

So, I added the geom_boxplot(outlier.shape=NA) + to the code above. If you run again with the updated code, those outlier points should be removed.

Regarding the differences, you can not compare the 2 plots because with the tutorial you're plotting twice, the type variable is being used as x-axis label and as facet_wrap label. You should remove facet, and then you compare them.


ADD REPLYlink written 5 months ago by antonioggsousa1.9k

Thanks sir. I think the problem there is not the facet rather the distance matrix is problem. Because, they have calculated distances among control samples, among obese samples and between control and obese samples. When I check, this is the result (omitted some lines for space limitation):
                     Var1              Var2     value   Type1   Type2
    1   ERR260268_profile ERR275252_profile 0.6813452 control   obese
    2   ERR260265_profile ERR275252_profile 0.7162228 control   obese
    3   ERR260264_profile ERR275252_profile 0.6417904 control   obese
    4   ERR260263_profile ERR275252_profile 0.5717646 control   obese
    5   ERR260261_profile ERR275252_profile 0.5619948   obese   obese
    6   ERR260260_profile ERR275252_profile 0.5124622 control   obese
    45  ERR260152_profile ERR275252_profile 0.5769312   obese   obese
    46  ERR260150_profile ERR275252_profile 0.6799476   obese   obese
    47  ERR260148_profile ERR275252_profile 0.5879797   obese   obese
    48  ERR260147_profile ERR275252_profile 0.7578910 control   obese
    49  ERR260145_profile ERR275252_profile 0.7650143   obese   obese
    50  ERR260140_profile ERR275252_profile 0.5465588   obese   obese
    51  ERR260136_profile ERR275252_profile 0.5901467   obese   obese
    52  ERR275252_profile ERR260268_profile 0.6813452   obese control
    53  ERR260265_profile ERR260268_profile 0.6878742 control control
    54  ERR260264_profile ERR260268_profile 0.6256261 control control
    55  ERR260263_profile ERR260268_profile 0.6589539 control control
    66  ERR260239_profile ERR260268_profile 0.7411252   obese control
    67  ERR260238_profile ERR260268_profile 0.7181616   obese control
    68  ERR260235_profile ERR260268_profile 0.7699306   obese control
    69  ERR260227_profile ERR260268_profile 0.8265505 control control
    70  ERR260226_profile ERR260268_profile 0.6008277 control control
    71  ERR260224_profile ERR260268_profile 0.6035165   obese control
    72  ERR260222_profile ERR260268_profile 0.7815420   obese control
    81  ERR260182_profile ERR260268_profile 0.7567220   obese control
    82  ERR260179_profile ERR260268_profile 0.6433029   obese control
    83  ERR260178_profile ERR260268_profile 0.6629624   obese control
    84  ERR260177_profile ERR260268_profile 0.7537858   obese control
    85  ERR260176_profile ERR260268_profile 0.7984294   obese control
    86  ERR260175_profile ERR260268_profile 0.5524126   obese control
    87  ERR260173_profile ERR260268_profile 0.6311589   obese control
    88  ERR260171_profile ERR260268_profile 0.7350795 control control
    89  ERR260170_profile ERR260268_profile 0.7354421 control control
    115 ERR260242_profile ERR260265_profile 0.6311954 control control
    123 ERR260222_profile ERR260265_profile 0.7742520   obese control
    124 ERR260217_profile ERR260265_profile 0.6709858 control control
    125 ERR260216_profile ERR260265_profile 0.5212192 control control
    126 ERR260215_profile ERR260265_profile 0.7765179 control control
    127 ERR260209_profile ERR260265_profile 0.7275992 control control
    128 ERR260205_profile ERR260265_profile 0.7094909 control control
    136 ERR260176_profile ERR260265_profile 0.8724801   obese control
    137 ERR260175_profile ERR260265_profile 0.7654367   obese control
    138 ERR260173_profile ERR260265_profile 0.7298885   obese control
    139 ERR260171_profile ERR260265_profile 0.6762393 control control
    140 ERR260170_profile ERR260265_profile 0.7731602 control control
    141 ERR260169_profile ERR260265_profile 0.6677658   obese control
    142 ERR260167_profile ERR260265_profile 0.7804287   obese control
    143 ERR260163_profile ERR260265_profile 0.7105553   obese control
    150 ERR260147_profile ERR260265_profile 0.5144030 control control
    151 ERR260145_profile ERR260265_profile 0.9180573   obese control
    152 ERR260140_profile ERR260265_profile 0.7449615   obese control
    153 ERR260136_profile ERR260265_profile 0.7330777   obese control
    154 ERR275252_profile ERR260264_profile 0.6417904   obese control
    155 ERR260268_profile ERR260264_profile 0.6256261 control control
     [ reached 'max' / getOption("max.print") -- omitted 2652 rows ]

In your case you have calculated distances among the controls and among the obese samples. Right?

ADD REPLYlink modified 5 months ago • written 5 months ago by dpc150

If you look at line 1 - 1 ERR260268_profile ERR275252_profile 0.6813452 control obese - and line 52 - 52 ERR275252_profile ERR260268_profile 0.6813452 obese control, you see that they are the same. So, the thing is, when you do a beta-diversity analysis you get a distance matrix table, that is square, all the samples are compared against all the samples. This means that your upper and bottom part of the matrix are exactly the same, i.e., comparing, in terms of beta-diversity, sample A against B, or sample B against A, the value of beta-diversity is the same. What you have in your table is exactly this, all the comparisons. What you have in my data frame is just the upper part of the matrix, i.e., sample A against B, but you don't have sample B against A, because is the same comparison.

Where did you find this tutorial code?


ADD REPLYlink modified 5 months ago • written 5 months ago by antonioggsousa1.9k

From here I've collected it.

  • So, you are saying that you also measured distance among all possible combinations (and each distance taken once only )? Then why in your distance matrix there are only 699 comparisons while in that tutorial matrix it is around 2852. It should have been only twice. Right? Also, in the matrix there are either control or obese for each distance. Why?
  • One more thing. For wilcoxon ranksum test between the control and test samples I am using the following code. Is it ok?

p_values_types <- pairwise.wilcox.test(beta_boxplot_result$data$beta_div_value, beta_boxplot_result$data$group)

or I should use this one?

p_values_types <- pairwise.wilcox.test(beta_boxplot_result$data$beta_div_value, sample_data(physeq)$type)

ADD REPLYlink modified 5 months ago • written 5 months ago by dpc150

Yes, I've also measured the distance among all possible combinations. Actually, the code that I've used is the same as it appears in that online script/function: beta_div_dist <- distance(physeq = physeq, method = method) (my) versus wu = phyloseq::distance(p, m) (online).

The way that I retrieve the values and plot differs. The difference is that I retrieved only the unique values for each pairwise comparison, for each group (sorry I forgot to mention this before). This means that there were many comparisons made for which I did not recover the pairwise comparisons. So, for instance in your case, you are only interested in comparisons made within control samples or within obese samples, and not between control and obese samples, right? So, in the case of the online function, you'll get all: within control samples, within obese samples, and between control and obese samples pairwise comparisons. In addition, for the online function will get twice the exact same no., as I explained you before, from the comparison between sample A and B, and between sample B and A. That's why you've such a big difference between the result of my function and the online function. So, assuming that n is the number of samples that you've, in order to get the no. of values using the online function you need to apply the formula: (n * n) - n (i.e., all the samples against all, minus the no. of samples that correspond to the same sample being compare against itself: sample A against sample A is equal to 0 beta-diversity). To calculate the no. of pairwise comparisons retrieved with my function, you need to do the following, considering g as the number of samples for one of the groups being compared: ((g * g) - g)/2 (you need to sum this result to the next group). In your case it should be like: (((gc * gc) - gc)/2) + (((go * go) - go)/2) = 699 (where gc is the no. of samples for the control group, and go the no. of samples for the obese group).

If you run my and the online functions (filtering the result of the online function to show only the comparisons retrieved with my function), you'll see that the result is the same:

## Comparing built-in function 'beta_boxplot()' 
#with online function 'plotDistances()' from:
# built-in fun
beta_boxplot_result_test <- beta_boxplot(physeq = GlobalPatterns, method = "bray", group = "SampleType")

# online fun
online_fun_test <- plotDistances(p = GlobalPatterns, m = "bray", s = "X.SampleID", d = "SampleType", plot = FALSE)
online_fun_test[online_fun_test$Type1==online_fun_test$Type2,] # filter result to show the same comparisons

So, The plot made by the online function gives you all possible sample group pairwise comparisons. The red boxplots are for within sample group comparisons. So, If you compare the boxplots obtained with my function with the red boxplots obtained with the online function, they should be the same.

I'll answer below to your second part.


ADD REPLYlink modified 5 months ago • written 5 months ago by antonioggsousa1.9k

Yes, I think your first option is Ok, regarding the Wilcoxon test:

p_values_types <- pairwise.wilcox.test(beta_boxplot_result$data$beta_div_value, beta_boxplot_result$data$group)

I tested with the example data and it works fine:

## Wilcoxon rank test
fake_data <- beta_boxplot_result$data[beta_boxplot_result$data$group %in% c("Feces","Ocean"),]

pairwise.wilcox.test(fake_data$beta_div_value, fake_data$group)

I just subset to "Feces" and "Ocean" because otherwise I would not be able to do Wilcoxon with multiple sample groups.

You might consider using ggpubr to do a plot similar to the one that you want:

# plot

p <- ggboxplot(fake_data, x = "group", y = "beta_div_value",
               color = "group", palette =c("#00AFBB", "#E7B800"))
my_comparisons <- list( c("Feces", "Ocean") )
p + stat_compare_means(comparisons = my_comparisons, label = "p.signif")+ 
  stat_compare_means(label.y = 1)

Instead of fake_data use the data obtained with my function. The result of Wilcoxon should be the same obtained before with the function pairwise.wilcox.test().

Let me know if this worked,


ADD REPLYlink modified 5 months ago • written 5 months ago by antonioggsousa1.9k

Okay. I understand. But here I have 13 control and 39 obese samples. So, the calculation supposed to be: 78+741 = 819. But, it is 699 here. Doesn't match.

As far as the wilcoxon test is concerned, I am getting p value 8.7e-06 for p_values_types <- pairwise.wilcox.test(beta_boxplot_result$data$beta_div_value, beta_boxplot_result$data$group). While pvalue 0.67 for p_values_types <- pairwise.wilcox.test(beta_boxplot_result$data$beta_div_value, sample_data(physeq)$type)

So, the result from first option seems very unusual. Not? However, here in this tutorial they have used the group data from

ADD REPLYlink modified 5 months ago • written 5 months ago by dpc150

Regarding the lack of match about 78+741 = 819, I need to think about it.

The p-values are different because you can not use sample_data(physeq)$type, the order does not match the order at beta_boxplot_result$data$beta_div_value. Probably the first value is ok, because you have a lot of replicates and there seems to be a difference between control and obese.

You might open a new question, regarding the issue o p-value, since as @RamRS pointed out, this is too long.


ADD REPLYlink modified 5 months ago • written 5 months ago by antonioggsousa1.9k

Ok sir. Thanks... if you resolve the Issue regarding the sum difference, please write here. And, I will post a different question for the p value.... thanks....

ADD REPLYlink written 5 months ago by dpc150

I did the counts for the online version assuming that you've 52 samples (13 + 39), and based on (52 * 52) - 52 you get 2652 and not 2852. So, are you sure about the no. of lines. To know the exact no. you need to do nrow(data_frame), where data_frame is your data table. This is quite strange and I can't understand why is happening.


ADD REPLYlink written 5 months ago by antonioggsousa1.9k

Sorry. It's 2652. That's fine. But, did you think about nrows coming from your script which was 699 in spite of 819? Also, I will draw your attention to this query related to this question only. Sorry, Sir @RamRS. Please, allow us for last a few converasations.

ADD REPLYlink written 5 months ago by dpc150

Please do not ask for upvotes or hand out votes for every comment. That is bad etiquette. There is a reason these limits exist - this chain is an indication that either your premise or the level of effort you put in the initial question is not good enough. If there were sufficient detail in your question, you'd not need to have this prolonged back-and-forth.

ADD REPLYlink written 5 months ago by _r_am32k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1343 users visited in the last hour