Comparing sets of genes for similarity
0
1
Entering edit mode
3 months ago

I am running a series of NMF-based clustering and marker selection on clusters from a large scRNASeq experiment across many comparisons. I want to compare the cluster output from those comparisons by seeing how similar the market sets are in order to assign a coherent clustering across the entire group.

Eg. Cluster A in Group 1 has 80% shared markers with Cluster D in Group 2 and 73% shared markers with Cluster C in Group 3 etc.

I'm currently using Jaccard similarity but I wonder if there are more elegant ways of doing this? Dummy example R code below:

library(ComplexHeatmap)
library(EnsDb.Hsapiens.v86)
library(cluster)

## get some dummy gene lists
ens.genes <- genes(EnsDb.Hsapiens.v86)
ens.genes <- ens.genes[ens.genes$gene_biotype=="protein_coding"]
ens.genes <- ens.genes[1:50]

set.seed(1)
set.1 <- sample(ens.genes$symbol, 25, replace = FALSE)
set.seed(2)
set.2 <- sample(ens.genes$symbol, 25, replace = FALSE)
set.seed(3)
set.3 <- sample(ens.genes$symbol, 25, replace = FALSE)
set.seed(4)
set.4 <- sample(ens.genes$symbol, 25, replace = FALSE)
set.seed(5)
set.5 <- sample(ens.genes$symbol, 25, replace = FALSE)

markers.list <- list(set.1, set.2, set.3, set.4, set.5)
names(markers.list) <- c("Set1", "Set2", "Set3", "Set4", "Set5")

## set up jaccard similarity
jaccard <- function(a, b) {
  intersection <- length(intersect(a, b))
  union <- length(a) + length(b) - intersection
  return(intersection/union)
}

jaccard.mat <- matrix(data = NA, 
                           nrow = length(markers.list), 
                           ncol = length(markers.list),
                           dimnames = list(names(markers.list),
                                           names(markers.list)))

for (i in rownames(jaccard.mat)){
  for (j in colnames(jaccard.mat)){
    jaccard.mat[i,j] <- jaccard(markers.list[[i]], markers.list[[j]])
  }
}

## get the distance 
jaccard.dist <- as.dist(1-jaccard.mat)
silhouette.res <- numeric()

## parameter search for PAM
for (k in 2:4) {
  pam.fit <- pam(jaccard.dist, k, diss = TRUE)
  silhouette.res[k] <- mean(silhouette(pam.fit)[,"sil_width"])
}

silhouette.res <- silhouette.res[-1]
names(silhouette.res) <- 2:4

silhouette.df <- as.data.frame(silhouette.res)
silhouette.df$k <- rownames(silhouette.df)
colnames(silhouette.df)[1] <- "silhouette"
silhouette.df$k <- factor(silhouette.df$k, levels = c(2:4))

ggplot(silhouette.df,
       mapping = aes(x = k, y = silhouette, group = 1)) + 
  geom_line() +
  geom_point() +
  theme_bw() +
  theme(panel.grid.major.x = element_line(linetype = "dotted"),
        panel.grid.minor.y = element_line(linetype = "dotted"),
        panel.grid.major.y = element_line(linetype = "dotted")) +
  labs(x = "k", y = "Silhouette", title = "PAM clustering silhouette score")

## run PAM
k4.pam <- pam(jaccard.dist, k = 4, diss = TRUE, cluster.only = TRUE)

## set up heatmap 
anno.df <- data.frame("PAM" = k4.pam, row.names = colnames(jaccard.mat))
anno.col <- list("PAM" = pals::kelly(4))
names(anno.col$PAM) <- 1:4

## using ComplexHeatmap::pheatmap 
pheatmap(jaccard.mat, 
         border_color = NA,
         cellwidth = 12, cellheight = 12,
         fontsize_row = 10, fontsize_col = 10,
         clustering_distance_rows = "euclidean", 
         clustering_distance_cols = "euclidean",
         clustering_method = "ward.D2",
         annotation_row = anno.df, 
         annotation_col = anno.df,
         annotation_colors = anno.col)

Plot1 Plot2

R jaccard genes marker clustering • 612 views
ADD COMMENT
0
Entering edit mode

You can try to take a look this: link_papaer

ADD REPLY
0
Entering edit mode

Thanks for this. It seems like this method requires a ranked gene list while I am comparing purely the presence or absence of genes. Maybe the suggestion can be to use ranked list comparions rather than simply comparing similarity of overlap.

Also it looks like ANDES is typically applied by comparing two lists of genesets:

python src/andes.py --emb embedding_file.csv --genelist embedding_gene_ids.txt --geneset1 first_gene_set_database.gmt --geneset2 second_gene_set_database.gmt --out output_file.csv -n num_processor

I'm potentially wanting to compare up to 5 lists of genesets all within the same space.

ADD REPLY

Login before adding your answer.

Traffic: 5442 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6