Question: Clustering and extracting gene IDs with same expression profiles
16 months ago by
lessismore640 wrote:

Hey all,

I need to cluster more than 1,000 genes based on their expression profiles, and then extract only the genes from clusters whose members share the same profile. My first attempt was to choose the optimal K with the gap statistic, run kmeans, label my expression atlas with the cluster identifiers, and then extract the genes of a specific cluster by subsetting the data frame. This didn't work because kmeans assigns every gene to some cluster, so genes with different expression profiles still end up grouped together.

Do you have any suggestions for how to accomplish this?
To summarize:

  1. I want to cluster a big expression atlas
  2. I want to extract the gene IDs with similar expression profiles

Thanks in advance.


I saw this post about pheatmap, which would be ideal, but I cannot figure out how to get at the cluster identifiers so that I can use the cutree function.



In the example from Bioconductor, the object returned by cutree contains the genes together with their cluster identifiers (an integer from 1 to n, where n is the number of clusters you asked cutree for).

written 16 months ago by e.rempel770
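
The cutree approach from the comment above can be sketched as follows. This is only an illustration under stated assumptions: the matrix `expr`, the gene names, the correlation-based distance, average linkage, and `k = 2` are all hypothetical choices, not something taken from the Bioconductor example.

```r
# Hypothetical expression matrix: 6 genes (rows) x 4 samples (columns).
expr <- rbind(
  geneA = c(1.0, 2.0, 3.0, 4.0),
  geneB = c(1.1, 2.1, 2.9, 4.2),
  geneC = c(0.9, 1.8, 3.2, 3.9),
  geneD = c(4.0, 3.0, 2.0, 1.0),
  geneE = c(4.2, 2.9, 2.1, 0.8),
  geneF = c(3.8, 3.1, 1.9, 1.2)
)
colnames(expr) <- paste0("sample", 1:4)

# Cluster genes on correlation distance, so the shape of the profile
# (not the absolute expression level) drives the grouping.
gene_dist <- as.dist(1 - cor(t(expr)))
gene_tree <- hclust(gene_dist, method = "average")

# cutree() returns a named integer vector: names are gene IDs,
# values are cluster identifiers from 1 to k.
gene_clusters <- cutree(gene_tree, k = 2)

# Extract the gene IDs that share a cluster with geneA.
target <- gene_clusters["geneA"]
genes_in_cluster <- names(gene_clusters)[gene_clusters == target]
print(genes_in_cluster)
```

If the clustering comes from pheatmap, the object it returns carries the row dendrogram as `tree_row`, which can be handed to cutree in the same way as `gene_tree` here.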
16 months ago by
Jake Warner730 wrote:

Hi, you could keep your kmeans approach and then score the individual genes by comparing each one to its cluster centroid.
First, get the centroids:

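# Function to find the centroid (mean profile) of cluster i
clust.centroid <- function(i, dat, clusters) {
  ind <- (clusters == i)
  # drop = FALSE keeps a matrix even if the cluster holds a single gene
  colMeans(dat[ind, , drop = FALSE])
}
# Result: one column per cluster, one row per sample
kClustcentroids <- sapply(levels(factor(clusterdata$cluster)), clust.centroid, scaledata, clusterdata$cluster)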

Here, clusterdata is the result of kmeans and scaledata is your expression data frame.
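
For context, here is a minimal sketch of how `scaledata` and `clusterdata` might be produced. The simulated data, the row-wise scaling, `centers = 2`, and `nstart = 25` are assumptions made for illustration; in practice K would come from your gap statistic.

```r
# Hypothetical expression matrix: 40 genes x 6 samples,
# half following a rising profile and half a falling one.
set.seed(42)
up   <- t(replicate(20, 1:6 + rnorm(6, sd = 0.3)))
down <- t(replicate(20, 6:1 + rnorm(6, sd = 0.3)))
expr <- rbind(up, down)
rownames(expr) <- paste0("gene", seq_len(nrow(expr)))
colnames(expr) <- paste0("sample", 1:6)

# Scale each gene (row) to mean 0 and sd 1, so that clustering
# reflects the shape of the profile rather than its magnitude.
scaledata <- t(scale(t(expr)))

# K = 2 is assumed here for the simulated data.
clusterdata <- kmeans(scaledata, centers = 2, nstart = 25)

# clusterdata$cluster is a named integer vector: gene ID -> cluster ID.
print(table(clusterdata$cluster))
```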

Then compare the genes to the cluster cores:

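# Get just the genes in cluster 2
K2 <- scaledata[clusterdata$cluster == 2, ]
# Get the cluster 2 core (sapply stores the centroids one column per cluster)
core <- kClustcentroids[, "2"]

# Correlate each gene's profile with the core
corscore <- function(x) { cor(x, core) }
score <- apply(K2, 1, corscore)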

The scores tell you how closely each gene matches the cluster core. Pearson correlation ranges from -1 to 1, and genes that fit the cluster well score close to 1. Here's an example of plotting them:

[Image: example plot of the scores]

Then you could just take the genes with a score above a certain cutoff (like 0.75). Complete workflow here
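
The cutoff step could look something like this. It is a sketch on simulated data: the matrix `K2` is made up here (30 genes with a clear trend plus 5 noisy ones, all assumed to sit in the same kmeans cluster), and the core is recomputed as the mean profile of that cluster.

```r
# Hypothetical cluster: 30 genes with a rising profile plus 5 noisy genes.
set.seed(7)
tight <- t(replicate(30, 1:6 + rnorm(6, sd = 0.3)))
noisy <- t(replicate(5, rnorm(6)))
K2 <- t(scale(t(rbind(tight, noisy))))
rownames(K2) <- paste0("gene", seq_len(nrow(K2)))

# Cluster core: the mean profile across all genes in the cluster.
core <- colMeans(K2)

# Score each gene by its Pearson correlation with the core.
score <- apply(K2, 1, function(x) cor(x, core))

# Keep only the gene IDs whose score is above the cutoff.
cutoff <- 0.75
core_genes <- names(score)[score > cutoff]
print(length(core_genes))
```

`core_genes` is then a plain character vector of gene IDs, which you can use to subset the original expression atlas.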

Good luck!
