Dissimilarity matrix calculation

Question

Best Clustering Algorithms for Mutation Data?

0

7.6 years ago

blazer9131 ▴ 20

Hey y'all.

I have a project with about 50-60 different samples with exome sequencing data. I have genotyped these samples and there are ~150 genes which have different levels of mutation ranging from missense, nonsense, indels, amplification, deletion, etc. I tiered them in terms of biological significance such that a 3 is significant impact, 2 has an impact, and 1 would be little impact. A sample w/o mutations at that gene had a 0.

I imported this into R as a data frame and tried classic clustering using hclust, and made a few heatmaps/dendrograms. I used ward.D2 for my analysis, but I'm not very skilled in statistics. I'm not sure if there would be a better algorithm for this dataset. Would anyone know a better method/algorithm? I'm trying to classify/group these samples using the exonic data I have.

R • 3.7k views

ADD COMMENT • link updated 5.1 years ago by Hamid Ghaedi 3.3k • written 7.6 years ago by blazer9131 ▴ 20

2

Try Affinity Propagation, it's basically magic..

ADD REPLY • link 5.1 years ago by 5heikki 11k

1

Clustering is about grouping items by similarity/proximity. You need to define what similarity/proximity is relevant in your case, i.e. what should items in the same cluster share that would differentiate them from another cluster. This helps in selecting the similarity measure used for clustering. Then the selection of clustering algorithm can be dependent on some knowledge/assumption about the cluster structure.

ADD REPLY • link 7.6 years ago by Jean-Karim Heriche 27k

0

Please include sample data. What do you expect the result to be, and what result did you get when you ran your analysis? With those answers we can help improve your analysis.

ADD REPLY • link 7.6 years ago by anicet.ebou ▴ 170

score 3 · Answer 1 · 2020-06-16

I am dealing with a similar task. Here are my findings, which may be of help to somebody else.

Somatic mutations are said to be sparse and heterogeneous, so using them for clustering is not a straightforward task. Before jumping to clustering methods, there are suggestions on how to de-sparsify your data. For instance, knowledge of gene-gene networks is often used for data de-sparsification. A detailed discussion can be found here. I am not covering de-sparsification methods in this answer.
It is tricky to cluster categorical data, because it can lead to nonsense results and wrong conclusions!
In contrast to classic clustering, your matrix here is not numerical, so you should NOT use common algorithms such as k-means clustering. If you do apply k-means, it won't complain about your data type and will still return a result!
A mutation matrix is binary or categorical. These are the steps needed for clustering:
Dissimilarity matrix calculation

In the first step you should calculate a dissimilarity matrix for clustering. Again, the difficulty is that ordinary arithmetic does not apply directly to categorical/binary data. To handle this, you can use the Gower distance, which is available through the daisy function in the cluster R package. The vegan R package also offers methods appropriate for binary data: binomial, raup and jaccard. Which method to choose depends on your data and your own judgment.
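To make the idea of a dissimilarity on binary data concrete, here is a minimal sketch on a made-up 3 x 4 matrix (the sample and gene names are hypothetical, not from any real dataset). It uses only base R: dist(method = "binary") gives the Jaccard distance, and the Manhattan distance divided by the number of genes gives the simple matching distance.

```r
# Toy binary mutation matrix: 3 hypothetical samples (rows) x 4 hypothetical genes (columns)
toy <- matrix(c(1, 1, 0, 0,
                1, 0, 0, 0,
                0, 0, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("S1", "S2", "S3"), c("G1", "G2", "G3", "G4")))

# Jaccard distance: genes that are 0 in both samples are ignored
jac <- dist(toy, method = "binary")

# Simple matching distance: proportion of genes on which two samples disagree
# (for 0/1 data this is the Manhattan distance divided by the number of genes)
smd <- dist(toy, method = "manhattan") / ncol(toy)

print(round(as.matrix(jac), 2))
print(round(as.matrix(smd), 2))
```

S1 and S2 share one of the two genes mutated in either sample, so their Jaccard distance is 0.5. Simple matching gives 0.25, because the two genes that are unmutated in both samples count as agreement. With sparse mutation data that difference matters: measures that count shared absences tend to make all samples look similar.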

Choosing Clustering algorithms

Choosing the clustering algorithm is the next step. For categorical data you can go for hierarchical clustering (either the agglomerative or the divisive approach). The final step is assessing the clustering result. Below I briefly describe what I used in my case.

You did not provide details on your input data, so exact code is not possible to post here. But the following are the general steps you can follow to cluster your samples.

1- Making a mutation count/binary matrix: In my case I am dealing with TCGA data, so there are MAF files, which can be converted to a matrix with the mutCountMatrix function from the maftools package. This provides a count matrix; you may need to convert it to binary (0/1) coding.

library(maftools)
# maf is a MAF object holding the mutation information (e.g. created with read.maf())
mtx <- mutCountMatrix(maf, includeSyn = FALSE, countOnly = NULL, removeNonMutated = FALSE)
# Transpose mtx to have genes in columns and samples in rows
mtx <- t(mtx)
# Convert counts to binary: 0 = not mutated, 1 = mutated
mtx.b <- apply(mtx, 2, function(x) ifelse(x > 0, 1, x))

2- Making dissimilarity matrix:

# Gower distance with the cluster package
library(cluster)
gower <- daisy(mtx.b, metric = "gower")
# Binomial distance with the vegan package
library(vegan)
binomial <- vegdist(mtx.b, binary = TRUE, method = "binomial")
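If you keep impact tiers (0 to 3, as in the question) instead of collapsing them to 0/1, one option is to declare each gene as an ordered factor so that daisy treats the tiers as ordinal when computing the Gower distance. The sketch below uses a made-up data frame; the gene names, sample names and scores are all hypothetical, and whether ordinal treatment is right for your tiers is a judgment call.

```r
library(cluster)

# Hypothetical tiered matrix: 4 samples x 3 genes, scored 0 (no mutation) to 3 (high impact)
tiers <- data.frame(
  GENE_A = c(0, 3, 3, 0),
  GENE_B = c(1, 0, 2, 0),
  GENE_C = c(0, 2, 2, 1),
  row.names = c("S1", "S2", "S3", "S4")
)

# Turn every column into an ordered factor so daisy() handles the tiers as ordinal
tiers.ord <- tiers
tiers.ord[] <- lapply(tiers, factor, levels = 0:3, ordered = TRUE)

# Gower dissimilarity on the ordinal tiers
gower.ord <- daisy(tiers.ord, metric = "gower")
print(gower.ord)
```

The resulting object can be passed to hclust in the same way as the binary dissimilarities.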

3- Applying clustering (agglomerative hierarchical clustering, the most common approach) and plotting

# Gower
gower.aggl.clust <- hclust(gower, method = "complete")
plot(gower.aggl.clust, cex = 0.6, main = "Agglomerative, complete linkages")

# Binomial

binom.aggl.clust <- hclust(binomial) # agglomerative clustering; complete linkage is the hclust default
plot(binom.aggl.clust, cex = 0.6, main = "Agglomerative, complete linkages")
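For assessing the result, one common option (not necessarily the best one for every dataset) is to cut the tree into k groups with cutree and compare the average silhouette width across several k, using silhouette from the cluster package. This is a minimal sketch on a made-up 6 x 5 binary matrix with hypothetical sample and gene names; with real data you would pass your own dissimilarity object instead of d.

```r
library(cluster)

# Hypothetical binary matrix: 6 samples x 5 genes with two obvious groups
toy <- matrix(c(1, 1, 1, 0, 0,
                1, 1, 0, 0, 0,
                1, 0, 1, 0, 0,
                0, 0, 0, 1, 1,
                0, 0, 1, 1, 1,
                0, 0, 0, 1, 0),
              nrow = 6, byrow = TRUE,
              dimnames = list(paste0("S", 1:6), paste0("G", 1:5)))

d  <- dist(toy, method = "binary")      # Jaccard distance (base R)
hc <- hclust(d, method = "complete")    # agglomerative, complete linkage

# Average silhouette width for k = 2..4; higher means tighter, better separated clusters
ks <- 2:4
avg.sil <- sapply(ks, function(k) {
  groups <- cutree(hc, k = k)
  mean(silhouette(groups, d)[, "sil_width"])
})
names(avg.sil) <- ks
print(round(avg.sil, 2))

# Cluster membership for the two-group solution
groups2 <- cutree(hc, k = 2)
print(groups2)
```

On this toy matrix the two-group cut puts S1 to S3 in one cluster and S4 to S6 in the other. The silhouette width is only one internal criterion, so it is worth checking whether the groups also make biological sense.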

There are a lot of details one should be aware of when clustering, for instance how you would evaluate the clustering result, and so on. I tried to provide some practical hints on clustering mutational data. Any comments, modifications and elaborations of this answer are appreciated in advance.