Question

Kmeans in R for clustering on significant gene list by sample

4.5 years ago

myyid68 ▴ 30

Hello,

I'm completely new to bioinformatics and machine learning. I have a data frame of pre-processed data where rows are genes and columns are samples (column 1 is the probe ID; the remaining columns are cancer and normal samples). I want to use kmeans in R to cluster my data by sample, with 2 as the initial number of clusters. After some research on k-means clustering I came up with the code below, which seems to work, but since I'm new to this I'm not sure whether it is correct. I also want to draw a line chart showing the profile of the 2 clusters using the center of each cluster, but I don't know how to do that. Could someone give me some guidance or examples of how to do this properly? Thank you!

clustering<- kmeans(df[ ,2:21], 2)
clustering$cluster
new <- cbind(df, cluster = clustering$cluster)
View(new)

R RNA-Seq sequencing K-means Clustering • 3.0k views

score 1 · Answer 1 · 2019-11-04

4.5 years ago

Kevin Blighe 87k

Hey,

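Yes, this is the correct way to run k-means, and, indeed, the cluster assignments will be stored in the cluster element of the returned object. Note that kmeans() clusters the rows of its input, so with genes as rows it assigns each gene to a cluster; to cluster samples instead, transpose the matrix with t().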

Reproducible example:

randomdata <- matrix(rexp(200, rate=.1), ncol=20)
rownames(randomdata) <- paste0('gene', 1:nrow(randomdata))
colnames(randomdata) <- paste0('sample', 1:ncol(randomdata))
randomdata[1:5, 1:5]
        sample1   sample2   sample3   sample4   sample5
gene1 21.787478  2.392901  9.403012 24.550271 12.001397
gene2  5.789860  3.443809 23.343598  6.380007  4.524190
gene3 20.624568  3.497594 18.989897 11.694150  4.254307
gene4  5.117409  6.021590  2.004472  5.939098  1.801671
gene5 10.875161 12.953176 20.717515  3.199971  8.246331

k <- kmeans(randomdata, centers = 2)
k$cluster
 gene1  gene2  gene3  gene4  gene5  gene6  gene7  gene8  gene9 gene10 
     2      2      2      2      2      2      2      1      2      2
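Because kmeans() starts from randomly chosen centers, the assignments (and even which cluster is numbered 1 or 2) can differ between runs. A minimal sketch of making the example repeatable, regenerating the same kind of random matrix so the block is self-contained; the seed value and nstart = 25 are arbitrary choices for illustration, not values from the original answer:

```r
# Fix the random number generator so that both the data and the
# clustering are repeatable (the seed value itself is arbitrary)
set.seed(1)
randomdata <- matrix(rexp(200, rate = 0.1), ncol = 20)
rownames(randomdata) <- paste0('gene', 1:nrow(randomdata))
colnames(randomdata) <- paste0('sample', 1:ncol(randomdata))

# nstart = 25 runs k-means from 25 random starts and keeps the solution
# with the lowest total within-cluster sum of squares
k <- kmeans(randomdata, centers = 2, nstart = 25)

table(k$cluster)  # number of genes in each cluster
k$tot.withinss    # total within-cluster sum of squares of the kept solution
```

Using several random starts usually makes the result less dependent on one unlucky initialisation, at the cost of a little extra run time.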

You can then access the centers like this, for example:

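# take ylim from both centers so that the second line is not clipped
plot(k$centers[1,], type = 'l', lwd = 2, col = 'red2', ylim = range(k$centers), xlab = 'sample', ylab = 'cluster center')
lines(k$centers[2,], lwd = 2, col = 'royalblue')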
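If you later try more than 2 clusters, drawing all center profiles in one call may be more convenient. The following is only a sketch using matplot() from base graphics, with the same kind of random data regenerated so that it runs on its own; the seed, colours, and axis labels are my own choices:

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
randomdata <- matrix(rexp(200, rate = 0.1), ncol = 20)
rownames(randomdata) <- paste0('gene', 1:nrow(randomdata))
colnames(randomdata) <- paste0('sample', 1:ncol(randomdata))
k <- kmeans(randomdata, centers = 2)

# k$centers has one row per cluster and one column per sample;
# matplot() draws one line per column, so transpose first
cols <- c('red2', 'royalblue')
matplot(t(k$centers), type = 'l', lty = 1, lwd = 2, col = cols,
  xaxt = 'n', xlab = '', ylab = 'cluster center')

# label the x axis with the sample names
axis(1, at = seq_len(ncol(k$centers)), labels = colnames(k$centers), las = 2)
legend('topright', legend = paste('cluster', seq_len(nrow(k$centers))),
  col = cols, lty = 1, lwd = 2)
```

For more clusters, extend cols so that it has one colour per cluster.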

Thank you so much, Kevin, that's very helpful!

Reply • 4.5 years ago by myyid68 ▴ 30

Hi Kevin,

I actually have 100 interesting genes and would like to use them for classifying the samples. In this case, as I want to classify samples the matrix should have samples as rows and genes as columns. Am I right?

Reply • 3.4 years ago by newbie ▴ 120

Yes, kmeans() clusters the rows of its input, so samples should be rows and genes columns. You just need to transpose the matrix via the t() function to get what you want.
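As a rough sketch of what that looks like (the matrix expr below is made-up toy data standing in for your 100 genes, and the seed and nstart value are arbitrary):

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# toy expression matrix: 100 genes (rows) x 20 samples (columns)
expr <- matrix(rexp(2000, rate = 0.1), ncol = 20)
rownames(expr) <- paste0('gene', 1:nrow(expr))
colnames(expr) <- paste0('sample', 1:ncol(expr))

# kmeans() clusters rows, so transpose to put the samples in rows
k <- kmeans(t(expr), centers = 2, nstart = 25)

# one cluster label per sample
sample_clusters <- data.frame(sample = names(k$cluster),
  cluster = k$cluster, row.names = NULL)
head(sample_clusters)

# the centers now have one row per cluster and one column per gene
dim(k$centers)
```

With a data frame laid out like the one in the original question, the equivalent call would presumably be kmeans(t(df[, 2:21]), centers = 2), assuming columns 2 to 21 are all numeric.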

Reply • 3.4 years ago by Kevin Blighe 87k

Would you actually use those counts (as in transformed CPM), or would you rather apply feature scaling using the z-score prior to clustering (as in base::scale)? Thank you!

Reply • 3.4 years ago by ponganta ▴ 590

For k-means, I would use the normalised + transformed expression levels, i.e., log2(CPM + pseudocount), or, indeed, the Z-scaled version of these, i.e., scale(log2(CPM + pseudocount)).
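A minimal sketch of both options on made-up counts, for clustering samples. Several things here are assumptions for illustration only: the simulated counts matrix, a pseudocount of 1, and a plain library-size CPM in place of a properly normalised one (e.g. from a dedicated RNA-seq package). One thing to watch is that scale() works on columns, so the matrix is transposed first to z-score each gene across samples:

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# toy raw counts: 100 genes (rows) x 20 samples (columns)
counts <- matrix(rnbinom(2000, mu = 100, size = 1), ncol = 20)
rownames(counts) <- paste0('gene', 1:nrow(counts))
colnames(counts) <- paste0('sample', 1:ncol(counts))

# simple counts per million: divide each sample (column) by its library size
cpm <- t(t(counts) / colSums(counts)) * 1e6
logcpm <- log2(cpm + 1)  # pseudocount of 1 is an assumption

# option 1: cluster samples on the log2 CPM values directly
k_log <- kmeans(t(logcpm), centers = 2, nstart = 25)

# option 2: z-score first; scale() centers and scales each column, and
# after t() the columns are genes, so each gene gets mean 0 and sd 1
z <- scale(t(logcpm))
k_z <- kmeans(z, centers = 2, nstart = 25)

# compare the two sets of sample assignments
table(log2cpm = k_log$cluster, zscore = k_z$cluster)
```

A gene that is constant across all samples has a standard deviation of 0 and would become NaN after scale(), so such genes would need to be removed before this step.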

Reply • 3.4 years ago by Kevin Blighe 87k