Question

Comparing sets of genes for similarity

1

Entering edit mode

5 months ago

yura.grabovska ▴ 870

I am running a series of NMF-based clustering and marker selection on clusters from a large scRNASeq experiment across many comparisons. I want to compare the cluster output from those comparisons by seeing how similar the market sets are in order to assign a coherent clustering across the entire group.

Eg. Cluster A in Group 1 has 80% shared markers with Cluster D in Group 2 and 73% shared markers with Cluster C in Group 3 etc.

I'm currently using Jaccard similarity but I wonder if there are more elegant ways of doing this? Dummy example R code below:

library(ComplexHeatmap)
library(EnsDb.Hsapiens.v86)
library(cluster)

## get some dummy gene lists
ens.genes <- genes(EnsDb.Hsapiens.v86)
ens.genes <- ens.genes[ens.genes$gene_biotype=="protein_coding"]
ens.genes <- ens.genes[1:50]

set.seed(1)
set.1 <- sample(ens.genes$symbol, 25, replace = FALSE)
set.seed(2)
set.2 <- sample(ens.genes$symbol, 25, replace = FALSE)
set.seed(3)
set.3 <- sample(ens.genes$symbol, 25, replace = FALSE)
set.seed(4)
set.4 <- sample(ens.genes$symbol, 25, replace = FALSE)
set.seed(5)
set.5 <- sample(ens.genes$symbol, 25, replace = FALSE)

markers.list <- list(set.1, set.2, set.3, set.4, set.5)
names(markers.list) <- c("Set1", "Set2", "Set3", "Set4", "Set5")

## set up jaccard similarity
jaccard <- function(a, b) {
  intersection <- length(intersect(a, b))
  union <- length(a) + length(b) - intersection
  return(intersection/union)
}

jaccard.mat <- matrix(data = NA, 
                           nrow = length(markers.list), 
                           ncol = length(markers.list),
                           dimnames = list(names(markers.list),
                                           names(markers.list)))

for (i in rownames(jaccard.mat)){
  for (j in colnames(jaccard.mat)){
    jaccard.mat[i,j] <- jaccard(markers.list[[i]], markers.list[[j]])
  }
}

## get the distance 
jaccard.dist <- as.dist(1-jaccard.mat)
silhouette.res <- numeric()

## parameter search for PAM
for (k in 2:4) {
  pam.fit <- pam(jaccard.dist, k, diss = TRUE)
  silhouette.res[k] <- mean(silhouette(pam.fit)[,"sil_width"])
}

silhouette.res <- silhouette.res[-1]
names(silhouette.res) <- 2:4

silhouette.df <- as.data.frame(silhouette.res)
silhouette.df$k <- rownames(silhouette.df)
colnames(silhouette.df)[1] <- "silhouette"
silhouette.df$k <- factor(silhouette.df$k, levels = c(2:4))

ggplot(silhouette.df,
       mapping = aes(x = k, y = silhouette, group = 1)) + 
  geom_line() +
  geom_point() +
  theme_bw() +
  theme(panel.grid.major.x = element_line(linetype = "dotted"),
        panel.grid.minor.y = element_line(linetype = "dotted"),
        panel.grid.major.y = element_line(linetype = "dotted")) +
  labs(x = "k", y = "Silhouette", title = "PAM clustering silhouette score")

## run PAM
k4.pam <- pam(jaccard.dist, k = 4, diss = TRUE, cluster.only = TRUE)

## set up heatmap 
anno.df <- data.frame("PAM" = k4.pam, row.names = colnames(jaccard.mat))
anno.col <- list("PAM" = pals::kelly(4))
names(anno.col$PAM) <- 1:4

## using ComplexHeatmap::pheatmap 
pheatmap(jaccard.mat, 
         border_color = NA,
         cellwidth = 12, cellheight = 12,
         fontsize_row = 10, fontsize_col = 10,
         clustering_distance_rows = "euclidean", 
         clustering_distance_cols = "euclidean",
         clustering_method = "ward.D2",
         annotation_row = anno.df, 
         annotation_col = anno.df,
         annotation_colors = anno.col)

Plot1 Plot2

R jaccard genes marker clustering • 774 views

ADD COMMENT • link 5 months ago by yura.grabovska ▴ 870

0

Entering edit mode

You can try to take a look this: link_papaer

ADD REPLY • link 5 months ago by marco.barr ▴ 180

0

Entering edit mode

Thanks for this. It seems like this method requires a ranked gene list while I am comparing purely the presence or absence of genes. Maybe the suggestion can be to use ranked list comparions rather than simply comparing similarity of overlap.

Also it looks like ANDES is typically applied by comparing two lists of genesets:

python src/andes.py --emb embedding_file.csv --genelist embedding_gene_ids.txt --geneset1 first_gene_set_database.gmt --geneset2 second_gene_set_database.gmt --out output_file.csv -n num_processor

I'm potentially wanting to compare up to 5 lists of genesets all within the same space.

ADD REPLY • link 5 months ago by yura.grabovska ▴ 870