Question

Similarity Measures Appropriate For Hierarchical Clustering On Gene Content/Binary Data

1

10.7 years ago

simonalpha ▴ 10

Hi,

I've got a table of presence/absence data for a number of genes (around 100) between different samples (< 10) derived from genome sequence data, and have been playing with doing hierarchical clustering to simply illustrate the similarity between each of these strains, borrowing some ideas from microarray analysis.

However, I'm unsure which distance measure I should use to construct the distance matrix for clustering. Currently, I'm using Hamming distance, given the binary nature of the data, but I'm concerned that it is not normalised and does not account for the joint presence of genes.

Any suggestions for alternatives for this type of data, or recommendations of papers etc. that I could read for a better understanding of distance metric choice, would be much appreciated.

Thanks,

Simon

statistics • 6.9k views

ADD COMMENT • link updated 6.6 years ago by Biostar 20 • written 10.7 years ago by simonalpha ▴ 10

score 2 · Answer 1 · 2013-07-27

2

10.7 years ago

Christian ★ 3.0k

I have used the Jaccard index before as a similarity measure between two gene groups.

ADD COMMENT • link 10.7 years ago by Christian ★ 3.0k

0

That is one of the metrics I'm trying to decide between. Any particular reason you chose that method?

ADD REPLY • link 10.7 years ago by simonalpha ▴ 10

0

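It factors in group size: the number of shared genes is divided by the size of the union of the two groups, so genes absent from both do not count towards the similarity.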
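
As a minimal sketch of that difference, the R snippet below compares Hamming and Jaccard distances on two made-up presence/absence profiles. The vectors `a` and `b` are hypothetical and only serve to show how joint absences affect each measure.

```r
# Two hypothetical presence/absence profiles over 10 genes (1 = present).
a <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
b <- c(1, 1, 0, 1, 0, 0, 0, 0, 0, 0)

# Hamming distance: number of genes whose state differs.
hamming <- sum(a != b)                  # 2

# Normalised Hamming distance: joint absences count as agreement.
hamming_norm <- hamming / length(a)     # 0.2

# Jaccard distance: mismatches divided by genes present in at least one profile.
jaccard <- sum(a != b) / sum(a | b)     # 0.5
```

The six genes missing from both profiles make the pair look similar under normalised Hamming distance, whereas the Jaccard distance ignores them and only considers the four genes present in at least one profile.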

ADD REPLY • link 10.7 years ago by Christian ★ 3.0k

score 1 · Answer 2 · 2013-07-22

1

10.7 years ago

Biojl ★ 1.7k

Hi,

I usually build this kind of clustering from binary (presence/absence) data. I just calculate the distance matrix using the binary method.

In R:

dist_matrix <- dist(matrix, method = "binary")  # base R dist(): distance matrix between the rows of the matrix

I know it's quite simple, but this kind of analysis is just meant to give a glimpse of the data. A better approach might be to use RPKM values instead of just presence/absence.
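
For illustration, here is a rough sketch of how the whole workflow could look in base R, from a presence/absence table to a dendrogram. The strain and gene names are invented, and average linkage is only one assumed choice of clustering method, not something stated in this answer.

```r
# Hypothetical presence/absence table: rows are strains, columns are genes.
pa <- rbind(
  strainA = c(1, 1, 1, 0, 0, 1),
  strainB = c(1, 1, 0, 0, 0, 1),
  strainC = c(0, 0, 1, 1, 1, 0),
  strainD = c(0, 1, 1, 1, 1, 0)
)
colnames(pa) <- paste0("gene", 1:6)

# Base R dist() with method = "binary" gives a Jaccard-type distance between rows.
d <- dist(pa, method = "binary")

# Average linkage is an assumption here; other linkage methods may suit the data better.
hc <- hclust(d, method = "average")

# Draw the dendrogram of strains.
plot(hc, main = "Strains clustered by gene content", xlab = "", sub = "")

# Cut the tree into two groups to inspect cluster membership.
groups <- cutree(hc, k = 2)
```

With fewer than 10 samples the dendrogram is easy to read directly, so `cutree` is mainly useful as a check that the groups match what the plot suggests.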

ADD COMMENT • link 10.7 years ago by Biojl ★ 1.7k

0

I probably should have been clearer: I'm using genomic sequence data, as opposed to transcriptomic data. Unless I've missed something, RPKM is for RNAseq-type data, right?

ADD REPLY • link 10.7 years ago by simonalpha ▴ 10

1

Yes, RPKM is for RNAseq. The code I posted creates a distance matrix from binary (presence/absence) gene data, so it can be used for your data. I assumed it was RNAseq because you're comparing presence/absence of genes in different strains... which species are you working with, where the strains differ in enough genes to create reliable clusters? I think it would be much more useful to construct the matrices from differences in multiple alignments of the orthologous genes.

ADD REPLY • link 10.7 years ago by Biojl ★ 1.7k

0

I wish I was able to do that! I'm looking at a region encoding surface antigens in bacteria that seems to undergo a fair bit of HGT, so where there are orthologous genes, they don't represent the entire region. Hence using the presence/absence approach. I'm trying to come up with alternatives, but not having much luck.

ADD REPLY • link 10.7 years ago by simonalpha ▴ 10