Entering edit mode
3 months ago
yura.grabovska
▴
820
I am running a series of NMF-based clustering and marker selection on clusters from a large scRNASeq experiment across many comparisons. I want to compare the cluster output from those comparisons by seeing how similar the market sets are in order to assign a coherent clustering across the entire group.
Eg. Cluster A in Group 1 has 80% shared markers with Cluster D in Group 2 and 73% shared markers with Cluster C in Group 3 etc.
I'm currently using Jaccard similarity but I wonder if there are more elegant ways of doing this? Dummy example R code below:
library(ComplexHeatmap)
library(EnsDb.Hsapiens.v86)
library(cluster)
## get some dummy gene lists
ens.genes <- genes(EnsDb.Hsapiens.v86)
ens.genes <- ens.genes[ens.genes$gene_biotype=="protein_coding"]
ens.genes <- ens.genes[1:50]
set.seed(1)
set.1 <- sample(ens.genes$symbol, 25, replace = FALSE)
set.seed(2)
set.2 <- sample(ens.genes$symbol, 25, replace = FALSE)
set.seed(3)
set.3 <- sample(ens.genes$symbol, 25, replace = FALSE)
set.seed(4)
set.4 <- sample(ens.genes$symbol, 25, replace = FALSE)
set.seed(5)
set.5 <- sample(ens.genes$symbol, 25, replace = FALSE)
markers.list <- list(set.1, set.2, set.3, set.4, set.5)
names(markers.list) <- c("Set1", "Set2", "Set3", "Set4", "Set5")
## set up jaccard similarity
jaccard <- function(a, b) {
intersection <- length(intersect(a, b))
union <- length(a) + length(b) - intersection
return(intersection/union)
}
jaccard.mat <- matrix(data = NA,
nrow = length(markers.list),
ncol = length(markers.list),
dimnames = list(names(markers.list),
names(markers.list)))
for (i in rownames(jaccard.mat)){
for (j in colnames(jaccard.mat)){
jaccard.mat[i,j] <- jaccard(markers.list[[i]], markers.list[[j]])
}
}
## get the distance
jaccard.dist <- as.dist(1-jaccard.mat)
silhouette.res <- numeric()
## parameter search for PAM
for (k in 2:4) {
pam.fit <- pam(jaccard.dist, k, diss = TRUE)
silhouette.res[k] <- mean(silhouette(pam.fit)[,"sil_width"])
}
silhouette.res <- silhouette.res[-1]
names(silhouette.res) <- 2:4
silhouette.df <- as.data.frame(silhouette.res)
silhouette.df$k <- rownames(silhouette.df)
colnames(silhouette.df)[1] <- "silhouette"
silhouette.df$k <- factor(silhouette.df$k, levels = c(2:4))
ggplot(silhouette.df,
mapping = aes(x = k, y = silhouette, group = 1)) +
geom_line() +
geom_point() +
theme_bw() +
theme(panel.grid.major.x = element_line(linetype = "dotted"),
panel.grid.minor.y = element_line(linetype = "dotted"),
panel.grid.major.y = element_line(linetype = "dotted")) +
labs(x = "k", y = "Silhouette", title = "PAM clustering silhouette score")
## run PAM
k4.pam <- pam(jaccard.dist, k = 4, diss = TRUE, cluster.only = TRUE)
## set up heatmap
anno.df <- data.frame("PAM" = k4.pam, row.names = colnames(jaccard.mat))
anno.col <- list("PAM" = pals::kelly(4))
names(anno.col$PAM) <- 1:4
## using ComplexHeatmap::pheatmap
pheatmap(jaccard.mat,
border_color = NA,
cellwidth = 12, cellheight = 12,
fontsize_row = 10, fontsize_col = 10,
clustering_distance_rows = "euclidean",
clustering_distance_cols = "euclidean",
clustering_method = "ward.D2",
annotation_row = anno.df,
annotation_col = anno.df,
annotation_colors = anno.col)
You can try to take a look this: link_papaer
Thanks for this. It seems like this method requires a ranked gene list while I am comparing purely the presence or absence of genes. Maybe the suggestion can be to use ranked list comparions rather than simply comparing similarity of overlap.
Also it looks like ANDES is typically applied by comparing two lists of genesets:
I'm potentially wanting to compare up to 5 lists of genesets all within the same space.