How to generate Beta diversity boxplot from phyloseq object?
0
2
Entering edit mode
2.1 years ago
dpc ▴ 240

Hi !!! I am working with profiled metagenomic taxonomy abundance data. I want to generate beta diversity (bray-curtis) boxplot from a phyloseq object where two groups (control and test) will be shown. Something like this figure. Can anybody help by sharing some code or some useful tutorial?

phyloseq beta diversity metagenomics • 4.5k views
ADD COMMENT
1
Entering edit mode

Hi,

Phyloseq does not have built-in functions that allow you to produce a figure like that. You need to calculate the Bray-Curtis distance with vegan or phyloseq, but then, you need to extract the information from the matrix to build a box plot like that with for instance, ggplot2.

António

ADD REPLY
1
Entering edit mode

I have used the following code after looking at a tutorial and getting the following plot. I don't know why two plots are coming. Can you tell some modifications to get a single one?

physeq = merge_phyloseq(physeq, sampledata, random_tree)
physeq

wu = phyloseq::distance(physeq, "bray")
wu.m = melt(as.matrix(wu))
wu.m = wu.m %>%
filter(as.character(Var1) != as.character(Var2)) %>%
mutate_if(is.factor, as.character)

sd = data.frame(sample_data(physeq))

sd = sd %>%
select("SampleID", "type") %>%
mutate_if(is.factor,as.character)

colnames(sd) = c("Var1", "Type1")
wu.sd = left_join(wu.m, sd, by = "Var1")

colnames(sd) = c("Var2", "Type2")
wu.sd = left_joinwu.sd, sd, by = "Var2")

p = ggplotwu.sd, aes(x = Type2, y = value)) +
theme_bw() +
geom_point() +
geom_boxplot(aes(color = ifelse(Type1 == Type2, "red", "black" ))) +
scale_color_identity() +
facet_wrap(~ Type1, scales = "free_x") +
theme(axis.text.x=element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ggtitle(paste0("Distance Metric = ", "bray")) +
ylab("bray") +
xlab("type")

p

ADD REPLY
1
Entering edit mode

Remove the facet_wrap(~ Type1, scales = "free_x") + line from your code and try again.

Let me know if this solved the issue,

António

ADD REPLY
0
Entering edit mode

It's giving this image.

ADD REPLY
1
Entering edit mode

Remove also geom_point() +!

Let me know if this solved the issue,

António

ADD REPLY
0
Entering edit mode

Not working. Image

ADD REPLY
2
Entering edit mode

Did you remove both: facet_wrap(~ Type1, scales = "free_x") + and geom_point() +?

António

ADD REPLY
2
Entering edit mode

I just wrote the function below. You might want to test it. I did not test it extensively, so it might have errors. If you use it you need to be very careful and test it on a dataset that you know well.

It takes as input the phyloseq object, the method that you want to use to perform beta-diversity (using one of the methods that you can pass to phyloseq beta-diversity analysis) and the group variable (on the sample_data frame):

## Import packages and data
library("phyloseq")
library("ggplot2")

data("GlobalPatterns")

## Create function
beta_boxplot <- function(physeq, method = "bray", group) {

# physeq: phyloseq-class object
# method: beta-diversity metric. Default "bray", i.e., Bray-Curtis dissimilarity
# group: factorial variable to group

## Packages
require("phyloseq") # v.1.30.0
require("ggplot2") # v.3.3.2

## Identify the correspondence: group and samples
group2samp <- list() # list to save the correspondence between group <--> samples
group_list <- get_variable(sample_data(physeq), group) # list of group elements
for (groups in levels(group_list)) { # loop over the no. of group levels
target_group <- which(group_list == groups) # vct pos of the curr group variable
group2samp[[ groups ]] <- sample_names(physeq)[target_group] # matching samples: based on vct pos
}

## Calculate beta-diversity
beta_div_dist <- distance(physeq = physeq, method = method)
beta_div_dist <- as(beta_div_dist, "matrix")

## Coerce distance mtx into a tidy data frame by group
dist_df <- data.frame() # save results in df
counter <- 1
for (groups in names(group2samp)) { # loop over group fct levels
sub_dist <- beta_div_dist[ group2samp[[groups]], group2samp[[groups]] ] # subset dist mtx to curr samples
#print(sub_dist)
no_samp_col <- ncol(sub_dist) # n cols: curr sub dist
no_samp_row <- nrow(sub_dist) # n rows: curr sub dist
for ( cols in seq(no_samp_col) ) { # loop over cols: curr sub_dist
if ( cols > 1 ) {
for ( rows in seq((cols-1)) ) { # loop over rows: curr sub_dist
## Save results
dist_df[ counter, "sample_pair" ] <- paste0( colnames(sub_dist)[cols], "-",
rownames(sub_dist)[rows] ) # sample pair
dist_df[ counter, "group" ] <- groups # group
dist_df[ counter, "beta_div_method" ] <- method # method
dist_df[ counter, "beta_div_value" ] <- sub_dist[rows, cols] # beta-diversity for the sample pair
counter = counter + 1
}
}
}
}

## Create a ggplot2 boxplot
plot_boxplot <- ggplot(data = dist_df, aes(x = group, y = beta_div_value, color = group)) +
geom_boxplot(outlier.shape=NA) +
geom_jitter() +
theme_bw() +
xlab(group) + ylab(method) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

## Save df and boxplot into a list
list_Out <- list("data" = dist_df, "plot" = plot_boxplot)

return(list_Out)
}

## Test function
beta_boxplot_result <- beta_boxplot(physeq = GlobalPatterns, method = "bray", group = "SampleType")

## Data
beta_boxplot_result$data ## Plot beta_boxplot_result$plot


Again, you should be very careful and test well first. I'll perform some tests later and I'll come back in case I found errors. The result is a list with a data frame and the box plot (from ggplot class). So, you can compare this data frame with all pairwise comparisons belonging to each group with the table obtained from phyloseq and see if everything looks fine.

I hope this helps,

António

ADD REPLY
1
Entering edit mode

Hi... I could not reply yesterday as I reached highest number of posts a day (as a new member). I am seeing it works nicely with "GlobalPatterns", but not with my data set. I don't understand what have you done here:

for (groups in levels(group_list)) { # loop over the no. of group levels
target_group <- which(group_list == groups) # vct pos of the curr group variable
group2samp[[ groups ]] <- sample_names(physeq)[target_group] # matching samples: based on vct pos


Why is there a loop?

ADD REPLY
1
Entering edit mode

Hi,

Probably there are many ways of getting that information, even without a loop, but here the idea is to loop over the different levels of the group variable, e.g., let's say that you have a variable called sex, where the levels can be male or female. The loop will go through female and, then, male and for each will retrieve the list of samples belonging to male or female. So, basically, group2samp, in this example, will contain two elements male or female, in which each of them will hold a vector of samples belonging to themselves.

The only reason that I can think that is not working for you is because group parameter needs to be a factor class variable. If you give a character class variable will not work. So, you can check which class is by doing: class(get_variable(sample_data(your_physeq_object), your_group_variable)).

If your variable is different from factor, you need to convert it to factor class before trying again.

António

ADD REPLY
2
Entering edit mode

Yes sir... Finally It's running. I am getting an output like this after removing "jeom jitter" . Thanks a lot sir for your kind help. I am really indebted to you.

Also, I will like to know some basics about what you have done in this script. I think you have calculated "bray" index between the controls samples; and then, between the obese samples. After that, you have compared their median by box plot. Right?

(One little request: It would be great if you upvote my questions so that I can conversate for longer here. Yeasterday reached highest limit too early, only 5 messages )

ADD REPLY
1
Entering edit mode

Sorry, I up voted your question now!

So, if you want to remove the points without changing the original code function, you can because you are working with a plot from ggplot class, by doing:

## Remove jittered points
beta_boxplot_result$plot$layers[[2]] <- NULL


You can even change the x- and y-axis labels by doing:

## Add y and x labels
beta_boxplot_result$plot + xlab("Group variable") + ylab("Bray-Curtis dissimilarity")  You can write what ever you want in xlab() and ylab() for your x- and y-axis labels. You can change the whole plot, background and so on, because the plot is from ggplot class. Since you also have the data frame that was use to create the plot, you can even create the plot from scratch. Is this plot that you got equal to the plot that you did before following the tutorial? António ADD REPLY 0 Entering edit mode No sir. They are different. This is yours and this is from tutorial. ADD REPLY 0 Entering edit mode So, I added the geom_boxplot(outlier.shape=NA) + to the code above. If you run again with the updated code, those outlier points should be removed. Regarding the differences, you can not compare the 2 plots because with the tutorial you're plotting twice, the type variable is being used as x-axis label and as facet_wrap label. You should remove facet, and then you compare them. António ADD REPLY 1 Entering edit mode Thanks sir. I think the problem there is not the facet rather the distance matrix is problem. Because, they have calculated distances among control samples, among obese samples and between control and obese samples. When I check wu.sd, this is the result (omitted some lines for space limitation):  wu.sd Var1 Var2 value Type1 Type2 1 ERR260268_profile ERR275252_profile 0.6813452 control obese 2 ERR260265_profile ERR275252_profile 0.7162228 control obese 3 ERR260264_profile ERR275252_profile 0.6417904 control obese 4 ERR260263_profile ERR275252_profile 0.5717646 control obese 5 ERR260261_profile ERR275252_profile 0.5619948 obese obese 6 ERR260260_profile ERR275252_profile 0.5124622 control obese ... 45 ERR260152_profile ERR275252_profile 0.5769312 obese obese 46 ERR260150_profile ERR275252_profile 0.6799476 obese obese 47 ERR260148_profile ERR275252_profile 0.5879797 obese obese 48 ERR260147_profile ERR275252_profile 0.7578910 control obese 49 ERR260145_profile ERR275252_profile 0.7650143 obese obese 50 ERR260140_profile ERR275252_profile 0.5465588 obese obese 51 ERR260136_profile ERR275252_profile 0.5901467 obese obese 52 ERR275252_profile ERR260268_profile 0.6813452 obese control 53 ERR260265_profile ERR260268_profile 0.6878742 control control 54 ERR260264_profile ERR260268_profile 0.6256261 control control 55 ERR260263_profile ERR260268_profile 0.6589539 control control 66 ERR260239_profile ERR260268_profile 0.7411252 obese control 67 ERR260238_profile ERR260268_profile 0.7181616 obese control 68 ERR260235_profile ERR260268_profile 0.7699306 obese control 69 ERR260227_profile ERR260268_profile 0.8265505 control control 70 ERR260226_profile ERR260268_profile 0.6008277 control control 71 ERR260224_profile ERR260268_profile 0.6035165 obese control 72 ERR260222_profile ERR260268_profile 0.7815420 obese control ... 81 ERR260182_profile ERR260268_profile 0.7567220 obese control 82 ERR260179_profile ERR260268_profile 0.6433029 obese control 83 ERR260178_profile ERR260268_profile 0.6629624 obese control 84 ERR260177_profile ERR260268_profile 0.7537858 obese control 85 ERR260176_profile ERR260268_profile 0.7984294 obese control 86 ERR260175_profile ERR260268_profile 0.5524126 obese control 87 ERR260173_profile ERR260268_profile 0.6311589 obese control 88 ERR260171_profile ERR260268_profile 0.7350795 control control 89 ERR260170_profile ERR260268_profile 0.7354421 control control ... 115 ERR260242_profile ERR260265_profile 0.6311954 control control ... 123 ERR260222_profile ERR260265_profile 0.7742520 obese control 124 ERR260217_profile ERR260265_profile 0.6709858 control control 125 ERR260216_profile ERR260265_profile 0.5212192 control control 126 ERR260215_profile ERR260265_profile 0.7765179 control control 127 ERR260209_profile ERR260265_profile 0.7275992 control control 128 ERR260205_profile ERR260265_profile 0.7094909 control control ... 136 ERR260176_profile ERR260265_profile 0.8724801 obese control 137 ERR260175_profile ERR260265_profile 0.7654367 obese control 138 ERR260173_profile ERR260265_profile 0.7298885 obese control 139 ERR260171_profile ERR260265_profile 0.6762393 control control 140 ERR260170_profile ERR260265_profile 0.7731602 control control 141 ERR260169_profile ERR260265_profile 0.6677658 obese control 142 ERR260167_profile ERR260265_profile 0.7804287 obese control 143 ERR260163_profile ERR260265_profile 0.7105553 obese control ... 150 ERR260147_profile ERR260265_profile 0.5144030 control control 151 ERR260145_profile ERR260265_profile 0.9180573 obese control 152 ERR260140_profile ERR260265_profile 0.7449615 obese control 153 ERR260136_profile ERR260265_profile 0.7330777 obese control 154 ERR275252_profile ERR260264_profile 0.6417904 obese control 155 ERR260268_profile ERR260264_profile 0.6256261 control control ... [ reached 'max' / getOption("max.print") -- omitted 2652 rows ]  In your case you have calculated distances among the controls and among the obese samples. Right? ADD REPLY 1 Entering edit mode If you look at line 1 - 1 ERR260268_profile ERR275252_profile 0.6813452 control obese - and line 52 - 52 ERR275252_profile ERR260268_profile 0.6813452 obese control, you see that they are the same. So, the thing is, when you do a beta-diversity analysis you get a distance matrix table, that is square, all the samples are compared against all the samples. This means that your upper and bottom part of the matrix are exactly the same, i.e., comparing, in terms of beta-diversity, sample A against B, or sample B against A, the value of beta-diversity is the same. What you have in your table is exactly this, all the comparisons. What you have in my data frame is just the upper part of the matrix, i.e., sample A against B, but you don't have sample B against A, because is the same comparison. Where did you find this tutorial code? António ADD REPLY 2 Entering edit mode From here I've collected it. • So, you are saying that you also measured distance among all possible combinations (and each distance taken once only )? Then why in your distance matrix there are only 699 comparisons while in that tutorial matrix it is around 2852. It should have been only twice. Right? Also, in the matrix there are either control or obese for each distance. Why? • One more thing. For wilcoxon ranksum test between the control and test samples I am using the following code. Is it ok? p_values_types <- pairwise.wilcox.test(beta_boxplot_result$data$beta_div_value, beta_boxplot_result$data$group) or I should use this one? p_values_types <- pairwise.wilcox.test(beta_boxplot_result$data$beta_div_value, sample_data(physeq)$type)

ADD REPLY
2
Entering edit mode

Yes, I've also measured the distance among all possible combinations. Actually, the code that I've used is the same as it appears in that online script/function: beta_div_dist <- distance(physeq = physeq, method = method) (my) versus wu = phyloseq::distance(p, m) (online).

The way that I retrieve the values and plot differs. The difference is that I retrieved only the unique values for each pairwise comparison, for each group (sorry I forgot to mention this before). This means that there were many comparisons made for which I did not recover the pairwise comparisons. So, for instance in your case, you are only interested in comparisons made within control samples or within obese samples, and not between control and obese samples, right? So, in the case of the online function, you'll get all: within control samples, within obese samples, and between control and obese samples pairwise comparisons. In addition, for the online function will get twice the exact same no., as I explained you before, from the comparison between sample A and B, and between sample B and A. That's why you've such a big difference between the result of my function and the online function. So, assuming that n is the number of samples that you've, in order to get the no. of values using the online function you need to apply the formula: (n * n) - n (i.e., all the samples against all, minus the no. of samples that correspond to the same sample being compare against itself: sample A against sample A is equal to 0 beta-diversity). To calculate the no. of pairwise comparisons retrieved with my function, you need to do the following, considering g as the number of samples for one of the groups being compared: ((g * g) - g)/2 (you need to sum this result to the next group). In your case it should be like: (((gc * gc) - gc)/2) + (((go * go) - go)/2) = 699 (where gc is the no. of samples for the control group, and go the no. of samples for the obese group).

If you run my and the online functions (filtering the result of the online function to show only the comparisons retrieved with my function), you'll see that the result is the same:

## Comparing built-in function 'beta_boxplot()'
#with online function 'plotDistances()' from: https://rdrr.io/github/jeffkimbrel/jakR/src/R/plotDistances.R
# built-in fun
beta_boxplot_result_test <- beta_boxplot(physeq = GlobalPatterns, method = "bray", group = "SampleType")
beta_boxplot_result_test$data # online fun online_fun_test <- plotDistances(p = GlobalPatterns, m = "bray", s = "X.SampleID", d = "SampleType", plot = FALSE) online_fun_test[online_fun_test$Type1==online_fun_test$Type2,] # filter result to show the same comparisons  So, The plot made by the online function gives you all possible sample group pairwise comparisons. The red boxplots are for within sample group comparisons. So, If you compare the boxplots obtained with my function with the red boxplots obtained with the online function, they should be the same. I'll answer below to your second part. António ADD REPLY 2 Entering edit mode Yes, I think your first option is Ok, regarding the Wilcoxon test: p_values_types <- pairwise.wilcox.test(beta_boxplot_result$data$beta_div_value, beta_boxplot_result$data$group) I tested with the example data and it works fine: ## Wilcoxon rank test fake_data <- beta_boxplot_result$data[beta_boxplot_result$data$group %in% c("Feces","Ocean"),]

pairwise.wilcox.test(fake_data$beta_div_value, fake_data$group)


I just subset to "Feces" and "Ocean" because otherwise I would not be able to do Wilcoxon with multiple sample groups.

You might consider using ggpubr to do a plot similar to the one that you want:

# plot

library("ggpubr")
p <- ggboxplot(fake_data, x = "group", y = "beta_div_value",
color = "group", palette =c("#00AFBB", "#E7B800"))
my_comparisons <- list( c("Feces", "Ocean") )
p + stat_compare_means(comparisons = my_comparisons, label = "p.signif")+
stat_compare_means(label.y = 1)


Instead of fake_data use the data obtained with my function. The result of Wilcoxon should be the same obtained before with the function pairwise.wilcox.test().

Let me know if this worked,

António

ADD REPLY
1
Entering edit mode

Okay. I understand. But here I have 13 control and 39 obese samples. So, the calculation supposed to be: 78+741 = 819. But, it is 699 here. Doesn't match.

As far as the wilcoxon test is concerned, I am getting p value 8.7e-06 for p_values_types <- pairwise.wilcox.test(beta_boxplot_result$data$beta_div_value, beta_boxplot_result$data$group).  While pvalue 0.67 for p_values_types <- pairwise.wilcox.test(beta_boxplot_result$data$beta_div_value, sample_data(physeq)$type)  So, the result from first option seems very unusual. Not? However, here in this tutorial they have used the group data from sample_data(physeq)$group

ADD REPLY
0
Entering edit mode

Regarding the lack of match about 78+741 = 819, I need to think about it.

The p-values are different because you can not use sample_data(physeq)$type, the order does not match the order at beta_boxplot_result$data\$beta_div_value. Probably the first value is ok, because you have a lot of replicates and there seems to be a difference between control and obese.

You might open a new question, regarding the issue o p-value, since as @RamRS pointed out, this is too long.

António

ADD REPLY
0
Entering edit mode

Ok sir. Thanks... if you resolve the Issue regarding the sum difference, please write here. And, I will post a different question for the p value.... thanks....

ADD REPLY
0
Entering edit mode

I did the counts for the online version assuming that you've 52 samples (13 + 39), and based on (52 * 52) - 52 you get 2652 and not 2852. So, are you sure about the no. of lines. To know the exact no. you need to do nrow(data_frame), where data_frame is your data table. This is quite strange and I can't understand why is happening.

António

ADD REPLY
0
Entering edit mode

Sorry. It's 2652. That's fine. But, did you think about nrows coming from your script which was 699 in spite of 819? Also, I will draw your attention to this query related to this question only. Sorry, Sir @RamRS. Please, allow us for last a few converasations.

ADD REPLY
0
Entering edit mode

Please do not ask for upvotes or hand out votes for every comment. That is bad etiquette. There is a reason these limits exist - this chain is an indication that either your premise or the level of effort you put in the initial question is not good enough. If there were sufficient detail in your question, you'd not need to have this prolonged back-and-forth.

ADD REPLY

Login before adding your answer.

Traffic: 1788 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6