I am creating a relative abundance boxplot comparing two groups (pet, stray) across eight genera. However, the boxes in the resulting plot do not display well, because the variance of the Y-axis data is too wide.
I assume that 1) removing the Y-axis outliers and 2) log-transforming the relative abundance data for scaling would be a good solution.
What should I do if I have many 0 values when I apply a log10 transformation?
Is there a widely used R library that performs this kind of transformation automatically?
Original result:
My original R code:
library(ggplot2)
data <- read.csv("relative abundance raw data (putative pathogen).csv")
p <- ggplot(data, aes(x="Genus", y="Relative_abundance", fill="Group")) +
  geom_boxplot(position = position_dodge(width = 0.8), alpha = 0.8) +
  labs(title = "Relative Abundance Comparison", x = "Genus", y = "Relative Abundance", fill = "Group") +
  theme_minimal() +
  scale_fill_manual(values = c("stray" = "blue", "pet" = "red"))
p + geom_jitter(shape=16, position=position_jitter(0.2))
My raw data file can be downloaded here :
https://drive.google.com/file/d/1Dxy2EqqgC2BQK6b92gRHSI5t7DtA29YA/view?usp=sharing
Please help me make a prettier box plot by adjusting the y-scale!
You can try scale_y_sqrt() instead if you don't like the look of the log10 transformation.
As an aside, it doesn't look like your fill variable is working, as everything is the same grey.
Finally, I don't think you have enough data to show that the "outliers" are actually outliers that need to be removed. Yes, they are far outside the rest of the distribution, but you only have ~50 data points in that genus. Depending on what stats you use, you could check the Residuals vs Leverage diagnostic plots in R to see whether there is more support for removing them.
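A minimal sketch of the square-root scale suggested above, using synthetic data because the poster's file is not reproduced here. The data frame `toy` and its values are made up for illustration; only the column names (Genus, Group, Relative_abundance) come from the question's code. Note that the code as posted passes quoted strings to aes(), which ggplot2 treats as constants rather than column names. That is one possible reason for the grey fill, although the actual cause cannot be confirmed from the post.

```r
library(ggplot2)

# Synthetic stand-in for the poster's CSV (assumption: same column names).
set.seed(1)
toy <- data.frame(
  Genus = rep(c("GenusA", "GenusB"), each = 40),
  Group = rep(c("pet", "stray"), times = 40),
  Relative_abundance = rlnorm(80, meanlog = 0, sdlog = 1.5)
)

# Unquoted names inside aes() refer to columns of `toy`;
# quoted strings would be mapped as constant values instead.
p <- ggplot(toy, aes(x = Genus, y = Relative_abundance, fill = Group)) +
  # outlier.shape = NA only hides the boxplot's own outlier points,
  # since every observation is drawn by geom_point() below.
  geom_boxplot(position = position_dodge(width = 0.8), alpha = 0.8,
               outlier.shape = NA) +
  geom_point(position = position_jitterdodge(jitter.width = 0.2,
                                             dodge.width = 0.8),
             shape = 16) +
  scale_y_sqrt() +  # compresses large values less aggressively than log10
  scale_fill_manual(values = c("stray" = "blue", "pet" = "red")) +
  labs(title = "Relative Abundance Comparison",
       x = "Genus", y = "Relative Abundance", fill = "Group") +
  theme_minimal()

print(p)
```

Unlike a log scale, the square-root scale can display zeros directly, so no pseudocount is needed for this variant.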
Please provide data via dput() rather than a link to a file-sharing site; a download could be anything (even malware, theoretically). If the data contain zeros, which a log transformation cannot handle, one typically adds a pseudocount, such as 1 or 0.1, before transforming.
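To illustrate the pseudocount idea, here is a small base R sketch. The vector `abund` is hypothetical, and 0.1 is just one of the values mentioned above, not a recommendation for this particular data set.

```r
# Hypothetical relative abundances, including zeros.
abund <- c(0, 0, 0.02, 0.5, 3.1, 12.4)

# Without a pseudocount, zeros become -Inf and would be dropped from a plot.
log_plain <- log10(abund)

# Adding a pseudocount first keeps every value finite.
pseudocount <- 0.1
log_pseudo <- log10(abund + pseudocount)

print(log_plain)
print(log_pseudo)
```

In a ggplot2 call the same idea could be written as aes(y = Relative_abundance + 0.1) combined with scale_y_log10(). The choice of pseudocount changes how the low end of the distribution looks, so it is worth stating which value was used.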
I agree with ATpoint - you can replace 0s with .1.
For determining skewness, I like to use the skewness() function in the moments package in R. Since your data appears to be right-skewed, you're right that a log10 transformation might give you a more normal distribution. As dthorbur mentioned, there are other transformations that you could do. Less extreme transformations would include square root, cube root, log2, and natural log. A more extreme transformation would be to take the inverse, although I don't think it would make sense to use that with fold change data. log2 fold change is commonly used in biomedical research.
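As a rough sketch of how the transformations listed above could be compared, the following uses skewness() from the moments package on simulated right-skewed data. The vector `x` is synthetic and strictly positive; with real data containing zeros, a pseudocount would be needed before the log transformations.

```r
library(moments)  # provides skewness()

# Simulated right-skewed, strictly positive values (not the poster's data).
set.seed(42)
x <- rlnorm(200, meanlog = 0, sdlog = 1)

# Sample skewness after each candidate transformation;
# values near 0 indicate a more symmetric distribution.
skew <- c(
  raw       = skewness(x),
  sqrt      = skewness(sqrt(x)),
  cube_root = skewness(x^(1/3)),
  log2      = skewness(log2(x)),
  log10     = skewness(log10(x))
)

print(round(skew, 2))
```

Because log2, natural log, and log10 differ only by a constant factor, they give the same skewness; the choice among them affects the axis units, not the shape of the distribution.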