Question: How to make k-means clustering plot for relative expression?
3
gravatar for ahmad mousavi
2.3 years ago by
ahmad mousavi520
Royan Institute, Tehran, Iran
ahmad mousavi520 wrote:

Hi

I have a question about how to make cluster for relative expression (on read counts) using k-means clustering, I couldn't find any good manual for solving this problem.

I need to make a plot like below: to have different cluster on genes count:

Does anybody know how to make it?

enter image description here

rna-seq deseq2 R • 4.3k views
ADD COMMENTlink modified 2.3 years ago by Chirag Parsania1.9k • written 2.3 years ago by ahmad mousavi520
1

Please elaborate on what you have so far. Presumably:

  • a data-matrix of expression values (samples as columns; genes as rows)
  • a cluster solution from k-means, with a single value (cluster number) assigned to each gene or sample

?

Elaborate further.

ADD REPLYlink written 2.3 years ago by Kevin Blighe69k

Hi

1- Yes, I have a count matrix from 15 samples and 5 group, 3 replicates per group, and rows are gene name. The relative expression could be made from log2 counts.

2-Exactly right, I need to make 3-6 cluster, which each gene got a cluster number.

ADD REPLYlink written 2.3 years ago by ahmad mousavi520
1

Great, and from where have you taken that figure that you put in your initial question? The term 'standardized expression' could be interpreted in an innumerable number of ways in the context of k-means clustering.

There are also generally many different ways of presenting k-means cluster information:

signature-1

complexheatmap-1

ADD REPLYlink written 2.3 years ago by Kevin Blighe69k

Thanks Kevin.

Would you please share the R code for my attached image and also above two plots?

ADD REPLYlink written 2.3 years ago by ahmad mousavi520
1

The initial plot that you showed was generated with ggplot2, I believe, and using the facet_wrap() function. I give a quick example of this, but for histograms, here: A: Network/Pathway Analysis from Mass Spec data

The code for the plots that I showed is here: https://github.com/kevinblighe/cytofNet

You have not answered my other question: "from where have you taken that figure that you put in your initial question?"

It does not appear that you have invested much time in actually understanding what the plot (the one that you posted) is showing.

ADD REPLYlink written 2.3 years ago by Kevin Blighe69k

Dear Kevin,

Thanks for answer and good example. I have found the image from this paper. Yes you are right I've made the code by using ggplot2.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by ahmad mousavi520

Updated - there is now a tutorial here: https://github.com/kevinblighe/E-MTAB-6141

ADD REPLYlink written 4 months ago by Kevin Blighe69k

How exactly did you standardize your gene expession levels?

ADD REPLYlink written 3 months ago by ponganta50
6
gravatar for ahmad mousavi
2.3 years ago by
ahmad mousavi520
Royan Institute, Tehran, Iran
ahmad mousavi520 wrote:

I have solved the problem, the code should be something like below:

You need to run DESeq2 before processing rowMeans:

library(ggplot2)
library(DESeq2)
#suppose you have one column per each time/tissue condition

cds <- DESeq(cds)
vst <- getVarianceStabilizedData(cds)
df <- vst
df <- df - rowMeans(df)
df <- as.data.frame(df)
colnames(df)
res_km <- kmeans(df, 5, nstart = 10)
data_plot <- data.table(melt(data.table(class = as.factor(res_km$cluster), df)))
data_plot[, Time := rep(1:ncol(df), each = nrow(df))]
data_plot[, ID := rep(1:nrow(df), ncol(df))]
head(data_plot)
# prepare centroids
centers <- data.table(melt(res_km$centers))
setnames(centers, c("Var1", "Var2"), c("class", "Time"))
centers[, ID := class]
centers[, gr := as.numeric(as.factor(Time))]
head(centers)
head(data_plot)
# plot the results
ggplot(data_plot, aes(variable, value, group = ID)) +
  facet_wrap(~class, ncol = 2, scales = "free_y") +
  geom_line(color = "grey10", alpha = 0.65) +

  geom_line(data = centers, aes(gr, value),
            color = "firebrick1", alpha = 0.80, size = 1.2) +
  labs(x = "Time", y = "Load (normalised)") +
  theme_bw()
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by ahmad mousavi520

Thank you for your code~ I get Figure like this. https://ibb.co/wrZvYbh

ADD REPLYlink written 9 months ago by wchluo20110

~~ your welcome !!!!

ADD REPLYlink written 9 months ago by ahmad mousavi520

Hi Ahmad, Thanks for your code. I', relatively new to this. Can you please share a code for highlighting selected genes and their trend in the K-means cluster ! attached the example image link

Thanks a lot

ibb.co/bmkFJCk

ADD REPLYlink written 4 months ago by madhokayush0
5
gravatar for Chirag Parsania
2.3 years ago by
Chirag Parsania1.9k
University of Macau
Chirag Parsania1.9k wrote:

Hi, here is the solution using dplyr. In the script below, you need to set the path of tab delim .txt file at file_path variable. In the given files, columns should be variables and rows should be observations (E.g. genes). Each column and row must have unique identity given as first row and first column respectively.

Below is the sample plot generated from one of my time series data.

Rplot

    library(tidyverse)
    file_path  = "path_to_tab_delim_file"  ## file must have colnames, where first column is geneName and all other columns are relative expression in given sample 
    data <- read_delim(file_path ,delim = "\t") %>% 
rename_if(is.numeric ,  ~(paste("hr", seq(0,8,by = 2),sep = "_"))) ## rename column names if require otherwise you can comment this line 

    ## prepare data for cluster 
    for_clust  <- data %>% 
      select(-1) ## remove first column which is gene id 

    ### kmeans
    max_itr <-  50
    n_clust  <-  6  ## number of cluster 
    set.seed(123) ## reproduce the cluster 
    kmeans_out  <- kmeans(for_clust,n_clust,iter.max = max_itr)

    ## add cluster info to orig matrix 
    data_with_cust_info <- data %>% 
      mutate(clust = paste("clust_", kmeans_out$cluster,sep = ""))


    ## visualise  each cluster 
    data_with_cust_info %>% 
      gather(key = "variable" , value = "value", -c(1,7)) %>%  ### 1 is the index of column 'geneName' and 7 is the index of column 'clust'
      group_by(variable) %>%  
      mutate(row_num =  1:n()) %>% 
      ggplot(aes(x =  variable , y = value , group = row_num)) +   
      geom_point() +  
      geom_line(alpha = 1 , aes(col = as.character(clust))) + 
      theme_bw() +  
      theme(legend.position = "none" , axis.text.x = element_text(angle = 90 , vjust = 0.4)) +
      facet_wrap(~clust)
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Chirag Parsania1.9k
1

Good work, Chirag!

ADD REPLYlink written 2.3 years ago by Kevin Blighe69k

Hi Chirag and Kevin,

A simple and naive question on this thread. How do you choose a particular number of cluster (k) before performing kmeans clustering?. I know there are several methods such as pvclust or nbclust to find optimal number of k, but in my experience those doesn't always work better with gene expression datasets. can you please explain your approach

Thanks

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Prakash2.1k
1

Hi Prakash, as of now I don't have any systematic way to do that. But I use crude way to decide number of K. For example, I check visually (using heatmap or lineplot shown above) if observations in particular cluster behaves similarly. If not, I keep on changing K value till I get convincing clusters.

ADD REPLYlink written 2.3 years ago by Chirag Parsania1.9k
1

To find optimum k, you can try Gap Statistic, Elbow Method, and Silhouette Method. It can take a long time to run these algorithms, though, because they cycle through a range of values of k, testing each value, and also bootstrapping your data in each cycle.

ADD REPLYlink written 2.3 years ago by Kevin Blighe69k

This post has been deleted by the author

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by ecg1g1510
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2029 users visited in the last hour
_