Tutorial:Unsupervised clustering on gene expression data
Entering edit mode
2.6 years ago

Clustering is a data mining method to identify unknown possible groups of items solely based on intrinsic features and no external variables. Basically, clustering includes four steps:

1) Data preparation and Feature selection,

2) Dissimilarity matrix calculation,

3) applying clustering algorithms,

4) Assessing cluster assignment

I use an RNA-seq dataset on 476 patients with non-muscle invasive bladder cancer. To make this tutorial reproducible, clinical data can be downloaded from GitHub and normalized RNA-seq data can be obtained through this link. So if you like to use these materials, start your analysis from the 'feature selection' section onward. Also, the GitHub repo could provide more details on each step of the analysis.

Data preparation and Feature selection

In this step, we need to: filter out incomplete cases and low expressed genes, deal with missing values (removing or value imputation), and then transform/normalize gene expression values.

In the dataset, there is no missing value. For transformation/normalization, I use variance stabilizing transformation (VST) implemented as vst function from DESeq2 packages which at the same time will normalize the raw count also. Using other types of transformation like Z score, log transformation is also common. If the expression matrix contains estimated transcript counts (like RSEM) or counts normalized for sequencing depth (like FPKM) , log2 transformation of the values would prepare the data for clustering.

Feature selection: A feature selection could be helpful to limit the analysis to those genes that possibly explain variation between samples in the cohort. I select 2k, 4k, and 6k top genes based on median absolute deviation (MAD). A number of other approaches like feature selection based on the most variance, feature dimension reduction and extraction based on Principal Component Analysis (PCA), and feature selection based on the Cox regression model could be applied.

#_________________________________reading data and processing_________________________________#
# reading count data
rna <- read.table("Uromol1_CountData.v1.csv", header = T, sep = ",", row.names = 1)

# [1] 60483   476  this is a typical output from hts-seq count matrix with more than 60,000 genes

# dissecting dataset based on gene-type
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- getBM(attributes= c("ensembl_gene_id","hgnc_symbol", "gene_biotype"), 
               mart= mart)

# see gene types returned by biomaRt
data.frame(table(genes$gene_biotype))[order(- data.frame(table(genes$gene_biotype))[,2]), ]

# we will continue with protein coding here:
rna <- rna[substr(rownames(rna),1,15) %in% genes$ensembl_gene_id[genes$gene_biotype == "protein_coding"],]

#[1] 19581   476 These are protein coding genes

# reading associated clinical data
clinical.exp <- read.table("uromol_clinic.csv", sep = ",", header = T, row.names = 1)

# making sure about sample order in rna and clincal.exp dataset
all(rownames(clinical.exp) %in% colnames(rna))
#[1] TRUE
all(rownames(clinical.exp) == colnames(rna))
#[1] FALSE
# reordering rna dataset
rna <- rna[, rownames(clinical.exp)]
#______________________ Data tranformation & Normalization ______________________________#
dds <- DESeqDataSetFromMatrix(countData = rna,
                              colData = clinical.exp,
                              design = ~ 1) # 1 passed to the function because of no model
# pre-filteration, 
# Keeping genes with expression in 10% of samples with a count of 10 or higher
keep <- rowSums(counts(dds) >= 10) >= round(ncol(rna)*0.1)
dds <- dds[keep,]

# vst tranformation
vsd <- assay(vst(dds))
#_________________________________# Feature Selection _________________________________#
# check data distribution
#hist(mads, breaks=nrow(vsd)*0.1)
# selecting features

Dissimilarity matrix calculation and clustering algorithms:

Dissimilarity matrix There is a diverse list of dissimilarity matrix calculation methods (distance measures). To see a list of available distance measures please check ?stats::dist. For log-transformed gene expression, Euclidean-based measures can be applied. For RNA-seq normalized counts, correlation-based measures (Pearson, Spearman) or a Poisson-based distance can be used.

Clustering algorithms: The most widely used algorithms are partitional clustering and hierarchical clustering.

Partitional clustering is a clustering method used to classify samples into multiple clusters based on their similarity. The algorithms are required to specify the number of clusters to be generated. The commonly used partitional clustering includes K-means clustering (KM): each cluster is represented by the center or means of the data points belonging to the cluster. This method is sensitive to outliers.

In K-medoids clustering or PAM (Partitioning Around Medoids), each cluster is represented by one of the objects in the cluster. PAM is less sensitive to outliers.

In contrast to partitional clustering, hierarchical clustering (HC) does not require pre-specifying the number of clusters to be generated. To compute distances between clusters in HC, there are different ways to do cluster agglomeration (i.e, linkage ) like “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”. Generally, complete linkage and Ward’s method are preferred for gene expression analysis.

In practice, HC, KM, and PAM are commonly used for gene expression data.

It would be quite challenging to decide between several clusters and the membership of samples in each cluster. To quantitatively address these two issues, I use Consensus Clustering (CC) approach. Applying this method has proved to be effective in new cancer subclass discoveries. For more information on the methodology please refer to the seminal paper by Monti et al. (2003) and the ConsensusClusterPlus package manual.

#________________________# Clustering & Cluster assignment validation __________________________#
# finding optimal clusters by CC
results = ConsensusClusterPlus(mad2k, # normalized expression matrix
                               maxK=6, # maximum k to make clusters
                               reps=50, # resampling
                               pItem=0.8, # item resampling
                               pFeature=1, # gene resampling
                               title= "geneExp", # output title 
                               clusterAlg="pam", # clustering algorithm
                               distance="spearman", # distance measure
                               seed=1262118388.71279, # random number
                               plot="pdf") # output file format

The above command returns several plots which are helpful in deciding on cluster number:

1- Color-coded heatmap corresponding to the consensus matrix that represents consensus values from 0–1; white corresponds to 0 (never clustered together) and dark blue to 1 (always clustered together). Here I only posted a heatmap for k = 4.

enter image description here

2- Consensus Cumulative Distribution Function (CDF) plot, This graphic lets one determine at what number of clusters,k, the CDF reaches an approximate maximum: here 4 or 5.

enter image description here

3- Based on the CDF plot, a chart for the relative change in area under the CDF curve can be plotted. This plot allows a user to determine the relative increase in consensus and determine the k at which there is no appreciable increase:

enter image description here

Assessing cluster assignment

This is the procedure of assessing the goodness of clustering results. Here I will use the Silhouette method for cluster assessment and survival analysis to see how technically and biologically the clusters that we defined make sense.

The silhouette method can be used to investigate the separation distance between the obtained clusters. The Silhouette plot reflects a measure of how close each data point in one cluster is to a point in the neighboring clusters. Silhouette width has a range of -1 to +1. A value near +1 shows that the sample is far away from the closest data point from the neighboring cluster. A negative value may indicate a wrong cluster assignment and a value close to 0 means an arbitrary cluster assignment to that data point.

#__________________________#Assessing cluster assignment ______________________________#
#Silhouette width analysis
cc4 = results[[4]]

# calcultaing Silhouette width using the cluster package 
cc4Sil = silhouette(x = cc4[[3]], 
                    dist = as.matrix(1- cc4[[4]])) 

#For visualization:
fviz_silhouette(cc4Sil, palette = "jco",
                 ggtheme = theme_classic())

The above function returns a summary of the Silhouette width for each cluster:

  cluster size ave.sil.width
1       1  130          0.75
2       2  199          0.83
3       3   66          0.65
4       4   81          0.72

enter image description here

Since by clustering, one should expect to find clusters of samples that are significantly different in terms of survival probability, I perform survival analysis between clusters (subtypes) to see significant differences, if any. As the result suggest it seems that clusters 2 and 4 have a significant difference in terms of survival probability from cluster 1:

#________________________#Assessing cluster assignment _________________________________#
#Survival analysis
### preparing dataset for survival analysis
cc4Class = data.frame(cc4$consensusClass)
cc4Class$ID = rownames(cc4Class)
cc4Class = data.frame(cc4Class[match(rownames(clinical.exp),cc4Class$ID),])
all(cc4Class$ID == rownames(clinical.exp))

# new encoding for status, time, and cluster
clinical.exp$status = ifelse(clinical.exp$Progression.to.T2. == "NO", 0,1)
clinical.exp$time = as.numeric(clinical.exp$Progression.free.survival..months. * 30)
clinical.exp$cluster = as.factor(cc4Class$cc4.consensusClass)


res.cox <- coxph(Surv(time, status) ~ cluster, data = clinical.exp)

coxph(formula = Surv(time, status) ~ cluster, data = clinical.exp)

             coef exp(coef) se(coef)      z       p
cluster2 -1.47638   0.22846  0.48357 -3.053 0.00226
cluster3  0.22532   1.25272  0.42213  0.534 0.59351
cluster4 -2.39949   0.09076  1.03301 -2.323 0.02019

Likelihood ratio test=21.88  on 3 df, p=6.908e-05
n= 476, number of events= 31 


coxph(formula = Surv(time, status) ~ cluster, data = clinical.exp)

  n= 476, number of events= 31 

             coef exp(coef) se(coef)      z Pr(>|z|)   
cluster2 -1.47638   0.22846  0.48357 -3.053  0.00226 **
cluster3  0.22532   1.25272  0.42213  0.534  0.59351   
cluster4 -2.39949   0.09076  1.03301 -2.323  0.02019 * 
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
cluster2   0.22846     4.3771   0.08855    0.5894
cluster3   1.25272     0.7983   0.54769    2.8653
cluster4   0.09076    11.0175   0.01198    0.6874

Concordance= 0.714  (se = 0.039 )
Likelihood ratio test= 21.88  on 3 df,   p=7e-05
Wald test            = 16.52  on 3 df,   p=9e-04
Score (logrank) test = 22.14  on 3 df,   p=6e-05

Final notes: Unsupervised clustering and defining subtypes in expression data is a task for which both methodological aspect and biological sense should match. For example, the first publication on this dataset concluded that there are three subtypes in the cohort, while their CDF and CDF delta-area plots clearly showed there might be more than three. Later the same group, same disease, and modified bioinformatic pipeline proposed four distinct subtypes in the RNA-seq data from patients with non-invasive bladder cancer (ref).

other References

most of the references were cited in the tutorial body

1- Biostar posts: 281161, 321773, 225315

2- Partitional Clustering in R: The Essentials

3-Clustering RNAseq data using K-means: how many clusters?

gene-expression clustering RNA-seq • 5.8k views
Entering edit mode

Thanks for the tutorial. I would ask you how can I identify the samples (or genes) for each cluster. I did not find any information about that in the bioconductor manual. I have used the argument writeTable, but the matrix (e.g. geneExp.k=3.consensusMatrix.csv) I get is something like that:

V1  V2  V3  V4
1   1   0.285714286 0.176470588 0.21875
2   0.285714286 1   0.578947368 0.621621622
3   0.176470588 0.578947368 1   0.971428571

And the geneExp.k=3.consensusClass.csv is something like:

10A_int 1
10A-002 2
10A-012 2
10A-015 2
10A-17MTB   2
11A-006 2
11A-08R 2
11S-002 3

Is there a way to see exactly the order of the samples (X-axis) of the cluster plot?


Entering edit mode

I would ask you how can I identify the samples (or genes) for each cluster.

There may be another way around this, but following the code below you can find which sample belongs to which cluster and also their silhouette width :

tmp1 <- data.frame(unclass(cc4Sil))
tmp2 <- data.frame(cc4$consensusClass)
tmp2$sample <- rownames(tmp2)
tmp = cbind(tmp1, tmp2).

And the result is like:

   cc4.consensusClass sample cluster neighbor sil_width
U0603                  1  U0603       1        3 0.7618998
U0497                  2  U0497       2        3 0.8654204
U0839                  1  U0839       1        3 0.8304333
U1043                  2  U1043       2        4 0.8478005
U0566                  2  U0566       2        3 0.8920452
U0134                  1  U0134       1        3 0.7842918
Entering edit mode

Excuse me. I am studying how to do K-means clustering of gene expression from your Github platform. I have one problem at the early step. I can't access your RNA data as "Uromol1_CountData.v1.csv", because your link is outdated and can't access. Could you tell me about this link and how can I get the data I am going to try your platform and will be applied to my project.

Thank you Wathunya H.

Entering edit mode

Thank you for showing interest. I repaired the link and now you should be able to download the file.


Login before adding your answer.

Traffic: 1320 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6