Question: How to cluster microarray samples based on euclidean distance and a complete linkage metric?
0
4.6 years ago by
JacobS900
Cleveland, Ohio
JacobS900 wrote:

I am trying to replicate some computational experiments I found in a paper. In this paper, the authors have ~30 genechip human genome u133 plus 2.0 arrays, 1 for each experimental sample and no references. They process the .CEL files into log2 RMA normalized signal intensity files. They then create a dendrogram that demonstrates that there are 2 main phenotypes within these 30 samples, based on gene expression.

I am trying to replicate their work, but I'm not sure how they went from the log2 RMA normalized signal intensity files to a clustered dendrogram. Their explanation is, "Hierarchical clustering was performed using Euclidean distance and a complete linkage metric."

I've reached out to the authors, but this paper is nearly 5 years old, so I may not get a response. Does anybody know how this can be done?

dge clustering microarray • 2.4k views
ADD COMMENTlink modified 4.6 years ago by Irsan7.1k • written 4.6 years ago by JacobS900
4
4.6 years ago by
Gian350
Canada
Gian350 wrote:

A simple example in R:

First, calculate the Euclidean distance with the function dist():

# 2 rows (observations) x 100 columns (features) of random data
eucl_dist = dist(matrix(c(rnorm(100), rnorm(100)), nrow = 2, ncol = 100), method = 'euclidean')

Then perform hierarchical clustering with the complete linkage method:

hie_clust = hclust(eucl_dist, method = 'complete')
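
For readers who want something closer to a sample-by-gene layout, here is a minimal sketch of the same two steps on simulated data. The matrix `expr`, its dimensions, and the sample names are assumptions made up for illustration; they are not the data from the paper.

```r
set.seed(1)  # reproducible simulated data (hypothetical, not the paper's)

# 6 samples (rows) x 100 genes (columns); the last 3 samples are shifted upward
expr <- rbind(
  matrix(rnorm(3 * 100, mean = 0), nrow = 3),
  matrix(rnorm(3 * 100, mean = 3), nrow = 3)
)
rownames(expr) <- paste0("sample", 1:6)

eucl_dist <- dist(expr, method = "euclidean")        # pairwise distances between rows
hie_clust <- hclust(eucl_dist, method = "complete")  # complete-linkage clustering

plot(hie_clust)  # dendrogram with one leaf per row of expr, i.e. per sample
```

Because dist() works on rows, the dendrogram has one leaf per row of whatever matrix is passed in.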

ADD COMMENTlink written 4.6 years ago by Gian350

Will try, thanks!

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by JacobS900

So I tried using these commands with my matrix. The full matrix tracks 56,000 genes, and R crashes, stating, `Error: cannot allocate vector of size 544.4 Gb.` I tried just using a subset of 100 genes, and the command executed, so I have a hie_clust object. However, when I plot this, I get a dendrogram that clusters the individual genes rather than the samples. How can I fix this? Also, is there a way to get a text list of the clustering rather than a plot? Thanks for your help, I'm not very good with R!

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by JacobS900

dist() computes the distances between the rows of your matrix, so if your samples are in the columns you can just transpose it with t(your_matrix).

hie_clust is an object that holds the clustering information. If you type hie_clust$ you can access its components, such as the ordering (hie_clust$order) and the heights (hie_clust$height).

You can perform different operations on the hclust object, like cutting it into k clusters. For example: cutree(hie_clust, k = 10)
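
Putting the transpose and cutree() together, a rough sketch for a genes-in-rows matrix might look like the following. The matrix `expr`, the names `sample_clust`, `groups`, and `membership`, and the choice of k = 2 are assumptions for illustration on simulated data.

```r
set.seed(2)  # reproducible simulated data (hypothetical)

# 200 genes (rows) x 8 samples (columns), the usual expression-matrix layout
expr <- matrix(rnorm(200 * 8), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("sample", 1:8)))
expr[, 5:8] <- expr[, 5:8] + 2  # shift the last four samples to mimic a second group

# t() puts samples in rows, so dist() compares samples rather than genes
sample_clust <- hclust(dist(t(expr), method = "euclidean"), method = "complete")

# Text output instead of a plot
groups <- cutree(sample_clust, k = 2)           # named vector: sample -> cluster id
print(groups)
print(sample_clust$labels[sample_clust$order])  # leaves in left-to-right dendrogram order

# Same information as a table, one row per sample
membership <- data.frame(sample = names(groups), cluster = groups, row.names = NULL)
print(membership)
```

The `membership` table can be saved with write.table() if a text file is needed.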

ADD REPLYlink written 4.6 years ago by Gian350

Thanks for the reply Gian. I've tried transposing my matrix, but for some reason the terminal dendrogram branches still do not represent samples (there are far more of them than input samples)

ADD REPLYlink written 4.6 years ago by JacobS900

Solved my problem: I was accidentally calling as.matrix on a matrix. It works great now, thanks!

ADD REPLYlink written 4.6 years ago by JacobS900
1
4.6 years ago by
Irsan7.1k
Amsterdam
Irsan7.1k wrote:
Source this file: https://github.com/Irsan88/SeqTools/blob/master/RNA/Expression/countMatrixTools.R

Then do:

plot(dendrogramOnSamples(yourData, clustComplete,distEucledian))
ADD COMMENTlink written 4.6 years ago by Irsan7.1k

Thanks for your reply! I see this tool is expecting a counts matrix. Can I just provide the RMA normalized signal intensity scores in place of traditional RNA-Seq counts? And should they be log2 transformed?

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by JacobS900
It expects a matrix, so it will work. It does the same as Gian's answer. Yes, they should be log2 transformed and (RMA) normalized.
ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Irsan7.1k

Perfect, thanks! I'll give it a try and report back.

ADD REPLYlink written 4.6 years ago by JacobS900

Ran into a snag... any thoughts?

> source("countMatrixTools.R", local=TRUE)
> myMatrix <- as.matrix(read.table("small_RMA_table.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
> plot(dendrogramSamples(myMatrix, clustComplete,distEucledian))
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
  missing value where TRUE/FALSE needed

 

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by JacobS900
Run summary stats on your matrix, I think there are strange values in there
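
A few checks along those lines, as a minimal sketch. The tiny matrix `m` below is made up, with two problem values planted on purpose; with real data it would be replaced by the matrix read from the file.

```r
# Hypothetical matrix with two deliberately planted problem values (NA and Inf)
m <- matrix(c(7.1, 8.3, NA, 6.9, 9.2, Inf), nrow = 2,
            dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3")))

str(m)                 # storage mode and dimensions
is.numeric(m)          # FALSE would mean text ended up in the matrix
summary(as.vector(m))  # range, quartiles, and a count of NA values
sum(is.na(m))          # number of missing values

# Row and column positions of every value that is NA, NaN, or infinite
bad <- which(!is.finite(m), arr.ind = TRUE)
print(bad)
```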
ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Irsan7.1k