Implement p-values and significance levels in boxplots for more of two groups with ggplot2 in R concerning RNA-Seq gene expression data
1
0
Entering edit mode
3.0 years ago
svlachavas ▴ 730

I have created some "grouped" boxplots in R, regarding the expression of a subset of 12 genes, for 3 cluster groups of samples, based on a previous clustering methodology result. The gene expression, is VST transformed HTSEQ counts. The code used for the creation of the included figure:

head(dat) # the data frame of the genes/features in the columns and the 1 categorical variable
TCGA-3L-AA1B-01A-11R-A37K-07 10.329101 13.32549 11.29148  9.800935
TCGA-AU-6004-01A-11R-1723-07 10.793586 12.91526 11.15353  9.919037
TCGA-T9-A92H-01A-11R-A37K-07 10.198103 13.73892 12.14109 10.518959
TCGA-CK-5913-01A-11R-1653-07 10.704988 13.59675 11.73051 10.586667
TCGA-CM-5860-01A-01R-1653-07  9.863118 13.23791 11.66066 10.566241
GSTP1      CAT      KIT     CD44     BOP1
TCGA-3L-AA1B-01A-11R-A37K-07 13.47565 10.88197 9.476305 13.50063 11.60254
TCGA-AU-6004-01A-11R-1723-07 13.63630 11.90729 7.080705 13.95125 12.05972
TCGA-T9-A92H-01A-11R-A37K-07 14.59698 12.16112 9.445610 13.15624 12.31314
TCGA-CK-5913-01A-11R-1653-07 13.39063 11.65145 7.198912 14.24373 11.97289
TCGA-AD-6889-01A-11R-1928-07 14.21625 11.38295 6.053052 13.62892 11.68580
TCGA-CM-5860-01A-01R-1653-07 14.33711 11.63726 7.670905 13.61599 11.26274
TCGA-3L-AA1B-01A-11R-A37K-07 10.67613 10.48779 14.69374           EC1
TCGA-AU-6004-01A-11R-1723-07 11.06180 11.12601 14.27851           EC3
TCGA-T9-A92H-01A-11R-A37K-07 10.18600 10.87468 14.64535           EC1
TCGA-CK-5913-01A-11R-1653-07 11.42366 11.13081 14.82264           EC3
TCGA-CM-5860-01A-01R-1653-07 11.75133 10.47124 15.10752           EC3

df.m <- melt(dat, id.var = "Cluster_Group")
p <- ggplot(data=df.m,aes(x=variable,y=value))
p <- p + geom_boxplot(aes(fill=Cluster_Group))
p <- p + geom_point(aes(y=value,group=Cluster_Group),position=position_dodge(width=0.75))
p <- p + facet_wrap(~variable,scales="free")
p <- p + xlab("gene symbol") + ylab("vst transformed counts") + ggtitle("Gene Expression Differences in 3 Patient Clusters")
p <- p + guides(fill=guide_legend(title="Cluster_Membership"))


Here is the link to the created plot:

https://www.dropbox.com/s/nkh2qmth3szsrda/3clusters.ggplot2.12genes.survival.jpeg?dl=0

My main concern here, is whether i can also include significance levels between the 3 groups in each boxplot in each gene ? In order to illustrate any significant differences in mean expression, in any of the pairwise group comparisons ? My notion for this, is to further select and prioritize some genes, based on a parallel survival analysis of these groups of clusters, as all the 12 genes were used.

boxplot R RNA-Seq ggplot2 • 9.9k views
0
Entering edit mode

If I understand the question correctly, you may want to explore the ggsignif package, which I believe extends ggplot2's functionality for the addition to significance bars as single geoms.

0
Entering edit mode

3
Entering edit mode
3.0 years ago

Update 21st March 2019: scroll through the comments to see the code that I used to generate this figure

## ---------------------------------------

In your plots, I think that you should reduce the size of the points/dots, and also make them more 'spread'. You can do this with the geom_jitter() function. Take a look here: A: Boxplot in ggplot2

You can also, of course, add pairwise comparisons to the plots and make justification for a particular gene based on a single statistically significant P value. You mean something like this, right:

That is from my recent publication: Vitamin D prenatal programming of childhood metabolomics profiles at age 3 y

0
Entering edit mode

Dear Kevin,

thank you for your valuable comments and suggestions on this matter !! and also please excuse me for answering with delay, as this time in Greece is the Greek Easter-so, my basic questions concerning both the functionality as also the interpretation of the above boxplots:

1) Firstly, i also agree about the spread and the illustration of the relative points. But, in your opinion i should remove the function geom_point() and replace it with geom_jitter() ? as you have proposed ?

2) Yes, the plot you have posted from your relevant publication, is exactly what i had in mind. Actually, as i have mentioned about these 12 genes, a consensus clustering based on their expression in RNA-Seq data, revealed some biologically interesting clusters, with differences in survival. That's why, i would like to have a boxplot except the heatmap, in order to inspect in more detail, any significant differences in expression in any of these 12 genes.

Thus, to create a plot like your above, i should follow an older example of a customized boxplot in this link ? A: Possible methodologies for association of specific gene subsets from microarray And somehow modify it, as i would like also to add color like in my above plot ?

4) A) Moreover, one important question regarding the significance of the pairwise comparisons: in your plot in the link above, how are the p-values created ? and by which test ? that is the function stat_summary ?

B) As my RNA-Seq data are VST transformed HTSeq counts, it would be fine for the performance of pairwise statistical comparisons, correct ?

5) My final question is based on the crusial explanation and utilization of the boxplots. In detail, based on the initial boxplot i have created and uploaded in my question: in your opinion, do you agree that even with a similar spread of points/relative expression for specific genes for the compared clusters(perhaps the visualization is not the best in my case), the most important is any significance difference in their relative median values, correct ? such as the ACADM gene example ?

Efstathios-Iason

1
Entering edit mode

1) Firstly, i also agree about the spread and the illustration of the relative points. But, in your opinion i should remove the function geom_point() and replace it with geom_jitter() ? as you have proposed ?

Yes, you can use geom_jitter for that. geom_jitter just disperses the dots, making it better, visually.

2) Yes, the plot you have posted from your relevant publication, is exactly what i had in mind. Actually, as i have mentioned about these 12 genes, a consensus clustering based on their expression in RNA-Seq data, revealed some biologically interesting clusters, with differences in survival. That's why, i would like to have a boxplot except the heatmap, in order to inspect in more detail, any significant differences in expression in any of these 12 genes.

Thus, to create a plot like your above, i should follow an older example of a customized boxplot in this link ? A: Possible methodologies for association of specific gene subsets from microarray And somehow modify it, as i would like also to add color like in my above plot ?

Yes, you could use the code from my other post. It can be modified in various ways. If you need to do something specific, then you can usually find the answer on StackExchange, Biostars, or Bioconductor. If you need the specific code for my plot here in this thread, then let me know?

4) A) Moreover, one important question regarding the significance of the pairwise comparisons: in your plot in the link above, how are the p-values created ? and by which test ? that is the function stat_summary ?

B) As my RNA-Seq data are VST transformed HTSeq counts, it would be fine for the performance of pairwise statistical comparisons, correct ?

Those p-values are from an ANOVA applied to a linear regression model. For your data, with VST counts, I believe that these are not binomially-distributed. So, for p-values, you could just use the Wald Test p-value (from the DESeq2 model) or just calculate it with a non-parametric ANOVA (non-parametric ANOVA is the Kruskall-Wallis test) with a pairwise post-Dunnett's test. In this way, you would get a single p-value, and then 3 other p-values for each pairwise comparison. Or, you can do it my way, where you extract p-values from a regression fit.

5) My final question is based on the crusial explanation and utilization of the boxplots. In detail, based on the initial boxplot i have created and uploaded in my question: in your opinion, do you agree that even with a similar spread of points/relative expression for specific genes for the compared clusters(perhaps the visualization is not the best in my case), the most important is any significance difference in their relative median values, correct ? such as the ACADM gene example ?

In your data, most of the genes look interesting.

0
Entering edit mode

Kevin,

thank you for your detailed feedback, much appreciated !! Actually, based on my produced boxplots, i was a bit concerned about the similar spreading of the expression values in specific genes -for example the NPM1 gene, shows a clear difference in the median of the green and blue boxplots, however the spread of the values look a bit similar-but, as much the medians differ statistically, this should not concern me correct ?

Finally, as your methodology for the calculation of the p-values is interesting, could you provide the exact code used for your plot above, if it is possible ? concerning the vitamin D boxplots ? as from your explanation, i can use my VST transformed counts with a regression model, correct ?

Best,

Efstathios

1
Entering edit mode

Hi Efsathios, here is the exact code that was used:

#Define a custom plotting function
#Arguments:
#   ScatterMatrix, a data-frame of numbers, with one of more additional columns as factors
#   Module, the factor column whose values by which the plotting will segregate
#   i, the column index whose values will be plotted
#   pvalues, a matching vector of Kruskall-Wallace P values or symbols, whose indices will relate to the ScatterMatrix index being plotted
#   pvalues1, a matching vector of Mann-Whitney P values for group 1 Vs 2
#   pvalues2, a matching vector of Mann-Whitney P values for group 1 Vs 3
#   pvalues3, a matching vector of Mann-Whitney P values for group 2 Vs 3
ggCustom <- function(ScatterMatrix, Module, i, pvalues, pvalues1, pvalues2, pvalues3)
{
require(ggplot2)

g <- ggplot(ScatterMatrix, aes(x=Module, y=as.numeric(ScatterMatrix[,i]))) +

#Control how outliers are managed using extra parameters
geom_boxplot(position=position_dodge(width=0.5), outlier.shape=17, outlier.colour="red", outlier.size=0.1, aes(fill=Module)) +

#Choose which colours to use; otherwise, ggplot2 choose automatically
#scale_color_manual(values=c("red3", "white", "blue")) + #for scatter plot dots
#scale_fill_manual(values=c("royalblue", "pink", "red4")) + #for boxplot
scale_fill_grey(start=0.5, end=0.9) +

stat_summary(geom="crossbar", width=0.8, fatten=2, color="black", fun.data=function(x){return(c(y=median(x), ymin=median(x), ymax=median(x)))}) +

#Add the scatter points (treats outliers same as 'inliers')
geom_jitter(position=position_jitter(width=0.3), size=0.25, colour="black") +

#Set the size of the plotting window
theme_bw(base_size=24) +

#Modify various aspects of the plot text and legend
theme(
legend.position="none",
legend.background=element_rect(),
plot.title=element_text(angle=0, size=12, face="bold", vjust=1),

axis.text.x=element_text(angle=0, size=12, face="bold", vjust=0.5),
axis.text.y=element_text(angle=0, size=12, vjust=0.5),
axis.title=element_text(size=12),

#Legend
legend.key=element_blank(),     #removes the border
legend.key.size=unit(1, "cm"),  #Sets overall area/size of the legend
legend.text=element_text(size=8),   #Text size
title=element_text(size=8)) +       #Title text size

#Change the size of the icons/symbols in the legend
guides(colour=guide_legend(override.aes=list(size=2.5))) +

#Set x- and y-axes labels
xlab("Cluster group") +
ylab("Vitamin D (ng/ml)") +

ylim(0, 100.0) +

geom_segment(aes(x=1, y=83, xend=2, yend=83), size=0.7, data=ScatterMatrix) +
geom_segment(aes(x=1, y=83, xend=1, yend=79.0), size=0.7, data=ScatterMatrix) +
geom_segment(aes(x=2, y=83, xend=2, yend=79.0), size=0.7, data=ScatterMatrix) +
geom_text(x=1.5, y=87, size=3.0, family="mono", label=pvalues1[i]) +

geom_segment(aes(x=1, y=2.5, xend=3, yend=2.5), size=0.7, data=ScatterMatrix) +
geom_segment(aes(x=1, y=2.5, xend=1, yend=6.5), size=0.7, data=ScatterMatrix) +
geom_segment(aes(x=3, y=2.5, xend=3, yend=6.5), size=0.7, data=ScatterMatrix) +
geom_text(x=2, y=-1.5, size=3.0, family="mono", label=pvalues2[i]) +

geom_segment(aes(x=2, y=96, xend=3, yend=96), size=0.7, data=ScatterMatrix) +
geom_segment(aes(x=2, y=96, xend=2, yend=92.0), size=0.7, data=ScatterMatrix) +
geom_segment(aes(x=3, y=96, xend=3, yend=92.0), size=0.7, data=ScatterMatrix) +
geom_text(x=2.5, y=100, size=3.0, family="mono", label=pvalues3[i]) +

ggtitle(colnames(ScatterMatrix)[i])
return(g)
}


The p-values and lines are drawn with the code at the end of the function

I am not sure about VST counts in a regression model. I would prefer the normalised counts (prir to VST) with a negative binomial family in the model, or use the regularised log transormation (rlog) with binomial family.

0
Entering edit mode

Dear Kevin,

thank you for your updated answer and code provided-so i have to compute the p-values externally and then feed them to the function right ?

Concerning your second comment, what do you mean by "normalized counts" ? using for instance another transformation ? like the EDA normalization ?

because i have downloaded raw HTSEQ counts, and then with the VST transformation i moved on with clustering and survival analysis, as mentioned from the DESEQ2 workflow for downstream puproses except DE analysis.

Thus, because the Wald Test function from the DESeq2 R package works on raw counts, based on the VST counts, i could simply use the kruskal.test() and pairwise.wilcox.test() for the computation of a "universal" p-value and pairwise ones, respectively ? and as they are non-parametric they could be more robust in my case ? even if i have transformed the counts ?

0
Entering edit mode

Yes, with your VST counts, you would use functions like kruskal.test, etc. In DESeq2, Wald test is just taken from the coefficient from the fitted negative binomial model of the normalised counts.

0
Entering edit mode

Dear Kevin, I am in the same situation as Svlachavas, but with one extra detail, I have 2 dependent variables. Can you please tell me how could I add not one, but 2 factors to Module? My design has 3 groups and 3-time points of collection and I need to compare samples across time and groups.

Best regards Marcela

0
Entering edit mode

Hey Marcela, if you want, you can open a new question.

Which data do you have right now? - RNA-seq? Have you normalised it?