Adding p value to Vlnplot in Seurat
1
1
Entering edit mode
11 months ago
ww22runner ▴ 10

Hello everyone,

Could anyone help me understand if there is any way I can add p values to violin plots produced by Seurat? Currently, I am trying to plot several genes across all clusters, broken up by responders vs non-responders and this is my code:

vp_case1 <- function (gene_signature, file_name){
plot_case1 <- function(signature){
VlnPlot(pbmc, features = signature,
pt.size = 0.1,
group.by = "Response"
) + stat_compare_means(comparisons = pbmc$Response, label = "p.signif") } map(gene_signature, plot_case1) %>% cowplot::plot_grid(plotlist = .) file_name <- paste0(file_name, "_r.png") ggsave(file_name, width = 14, height = 8) } vp_case1(gene_sig, "gene_sig")  where gene_sig is a vector of a few genes of interest. However, I get the following error: Warning messages: 1: Computation failed in stat_signif(): not enough 'y' observations 2: Computation failed in stat_signif(): not enough 'y' observations . . .  How could I do this better? Thank you! seurat • 2.9k views ADD COMMENT 0 Entering edit mode Have you checked if VlnPlot and stat_compare_means can work together like that? ADD REPLY 0 Entering edit mode Hi igor, Please correct me if I am wrong but it says VlnPlot returns "a patchworked ggplot object if combine = TRUE; otherwise, a list of ggplot objects" while stat_compare_mean can be used to "add mean comparison p-values to a ggplot" so I thought it should work. ADD REPLY 5 Entering edit mode 11 months ago antonioggsousa ★ 2.1k Hi, The problem that you got is related with the fact that you need to provide a list of comparisons to be made to the function stat_compare_means(), and you are providing a character/factor variable. So, if you adjust your code to: vp_case1 <- function(gene_signature, file_name, test_sign, y_max){ plot_case1 <- function(signature){ VlnPlot(pbmc, features = signature, pt.size = 0.1, group.by = "Response", y.max = y_max, # add the y-axis maximum value - otherwise p-value hidden ) + stat_compare_means(comparisons = test_sign, label = "p.signif") } map(gene_signature, plot_case1) %>% cowplot::plot_grid(plotlist = .) file_name <- paste0(file_name, "_r.png") ggsave(file_name, width = 14, height = 8) }  In this adjusted code I added 2 extra parameters: • test_sign: that is a list of a vector of characters. So, in your case you have a Response variable that should assume 2 or more levels. Let's say that your Response variable can be CTRL or TRT (control or treatment). Thus test_sign = list(c("CTRL", "TRT")). • y_max: by default the function VlnPLot will adjust the y-axis to the maximum expression value per comparison made. What this means is that if you're comparing the gene A between CTRL and TRT and the max. value is 7, the maximum y-axis scale will be 7. When you apply the stat_compare_means() function, the p-value will be added above, but since the maximum y-axis value is 7, you'll not see it. Therefore, you can adjust the y_max to one more unit, let's say 8, and therefore you'll be available to see the p-value. Though as you can think this is not the best option since often you'll have distinct y maximum values, and therefore setting the same y_max will not be the best option. In that case what you can do is just a loop instead a map() giving a y_max a numeric vector. Let me know if you have problems regarding this issue. Here is an example: ## Follow the Seurat tutorial until UMAP at: https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html ## create fake 'Reponse' variable pbmc@meta.data$Response <- rep(c("CTRL", "TRT"), times = c(1500, 2638-1500))

vp_case1 <- function(gene_signature, file_name, test_sign, y_max){
plot_case1 <- function(signature){
VlnPlot(pbmc, features = signature,
pt.size = 0.1,
group.by = "Response",
y.max = y_max, # add the y-axis maximum value - otherwise p-value hidden
) + stat_compare_means(comparisons = test_sign, label = "p.signif")
}
purrr::map(gene_signature, plot_case1) %>% cowplot::plot_grid(plotlist = .)
file_name <- paste0(file_name, "_r.png")
ggsave(file_name, width = 14, height = 8)
}

gene_sig <- c("NKG7", "PF4")
comparisons <- list(c("CTRL", "TRT"))
vp_case1(gene_signature = gene_sig, file_name = "Desktop/gene_sig", test_sign = comparisons, y_max = 7)


Result:

António

0
Entering edit mode

Code updated below to give you an adjusted y-axis scale with p-value:

## Follow the Seurat tutorial until UMAP at: https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html

## create fake 'Reponse' variable
pbmc@meta.data$Response <- rep(c("CTRL", "TRT"), times = c(1500, 2638-1500)) vp_case1 <- function(gene_signature, file_name, test_sign){ plot_case1 <- function(signature, y_max = NULL){ VlnPlot(pbmc, features = signature, pt.size = 0.1, group.by = "Response", y.max = y_max # add the y-axis maximum value - otherwise p-value hidden ) + stat_compare_means(comparisons = test_sign, label = "p.signif") } plot_list <- list() y_max_list <- list() for (gene in gene_signature) { plot_list[[gene]] <- plot_case1(gene) y_max_list[[gene]] <- max(plot_list[[gene]]$data[[gene]]) # get the max no. for each gene
plot_list[[gene]] <- plot_case1(gene, y_max = (y_max_list[[gene]] + 1) )
}
cowplot::plot_grid(plotlist = plot_list)
file_name <- paste0(file_name, "_r.png")
ggsave(file_name, width = 14, height = 8)
}

gene_sig <- c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A")
comparisons <- list(c("CTRL", "TRT"))
vp_case1(gene_signature = gene_sig, file_name = "Desktop/gene_sig_2", test_sign = comparisons)


Result:

Therefore, you don't need to worry with the y_axis scale.

António

0
Entering edit mode

Hi Antonio,

Thanks so much, these graphs are exactly what I needed! May I ask if we might be able to show the same information on a split violin plot for each gene for responder vs non-responders for a particular cluster?

Thank you!

0
Entering edit mode

Hi ww22runner,

Not sure. Did you try?

Let me know if it worked.

António

0
Entering edit mode

Hi antonioggsousa I tried your code and may I ask about the basis of significance? Is the significance: (a) influenced by the number of cells in each group (CTRL vs TRT) or (b) the significance is based on the differential "averaged expression" of the gene between them?

Because in my own data, it seems like the significance is about the number of cells. In my generated plot, I noticed that some have very high significance (4 asterisk stars) even though the 2 groups have a visually/probably equal or not too different expression level, for example the "CD8A" in my plot. However, when you look at CD14, THERE'S ONLY SINGLE * (1 asterisk star).

0
Entering edit mode

Hi @jvbeltran0122,

As far as I remember by default the function stat_compare_means() (link to the page documentation) will use the Mann–Whitney U test or also called Wilcoxon rank-sum test (read more about it in the wiki). Therefore, the significance depends on the formula of this test.

I think if you want to test differential gene expression, you should apply one of the available methods in Seurat in order to get that. I don't think that is very meaningful comparing the expression level of one gene across samples with the Wilcoxon rank-sum test, because you have a lot of cells and, even if the difference is small, it'll have an high probability of being statistically significant (whatever that means).

0
Entering edit mode

Hi there, was wondering the exact same thing...

If you do DE analysis between the groups using FindAllMarkers(), you will get both a p-value and a Bonferroni adjusted p-value where you take into account the number of tests. That's also using for instance the default Wilcoxon rank-sum test. I think the p-value on these plots correspond to the unadjusted ones. I also have several genes that appear differently distributed on the plots but they are not statistically significant if you look at p adjusted in the DE analysis.

I believe it would most likely be wrong to report the unadjusted one, although Bonferroni may be more stringent than necessary- it is calculated as the unadjusted p-value multiplied by the number of features, for me it's 17729 (which is the number of tests) although a large number of them are irrelevant... False discovery rate could be better but I am not sure how to calculate it from seurat objects yet.

Please correct me if I'm wrong, this is probably a discussion you can't have too many times :)