Tutorial:Unsupervised clustering on gene expression data
Entering edit mode
9 months ago
Hamid Ghaedi ★ 2.6k

Clustering is a data mining method to identify unknown possible groups of items solely based on the intrinsic features and no external variables. Basically, clustering includes four steps:

1) Data preparation and Feature selection,

2) Dissimilarity matrix calculation,

3) applying clustering algorithms,

4) Assessing cluster assignment

I use an RNA-seq dataset on 476 patients with non-muscle invasive bladder cancer. To make this tutorial reproducible, clinical data can be downloaded from GitHub and normalized RNA-seq data could be obtained through this link . So if you like to use these materials, start your analysis from the 'feature selection' section onward. Also, the GitHub repo could provide more details on each step of the analysis.

Data preparation and Feature selection

In this step we need to: filter out incomplete cases and low expressed genes, deal with missing values (removing or value imputation), then transforming/normalizing gene expression values.

In the dataset, there is no missing value. For transformation/normalization, I use variance stabilizing transformation (VST) implemented as vst function from DESeq2 packages which at the same time will normalize the raw count also. Using other types of transformation like Z score, log transformation is also common. If the expression matrix contains estimated transcript counts (like RSEM) or counts normalized for sequencing depth (like FPKM) , log2 transformation of the values would prepare the data for clustering.

Feature selection: A feature selection could be helpful to limit the analysis to those genes that possibly explain variation between samples in the cohort. I select 2k, 4k, and 6k top genes based on median absolute deviation (MAD) . A number of other approaches like feature selection based on the most variance, feature dimension reduction and extraction based on Principal Component Analysis (PCA), and feature selection based on the Cox regression model could be applied.

#_________________________________reading data and processing_________________________________#
# reading count data
rna <- read.table("Uromol1_CountData.v1.csv", header = T, sep = ",", row.names = 1)

# [1] 60483   476  this is a typical output from hts-seq count matrix with more than 60,000 genes

# dissecting dataset based on gene-type
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- getBM(attributes= c("ensembl_gene_id","hgnc_symbol", "gene_biotype"), 
               mart= mart)

# see gene types returned by biomaRt
data.frame(table(genes$gene_biotype))[order(- data.frame(table(genes$gene_biotype))[,2]), ]

# we will continue with protein coding here:
rna <- rna[substr(rownames(rna),1,15) %in% genes$ensembl_gene_id[genes$gene_biotype == "protein_coding"],]

#[1] 19581   476 These are protein coding genes

# reading associated clinical data
clinical.exp <- read.table("uromol_clinic.csv", sep = ",", header = T, row.names = 1)

# making sure about sample order in rna and clincal.exp dataset
all(rownames(clinical.exp) %in% colnames(rna))
#[1] TRUE
all(rownames(clinical.exp) == colnames(rna))
#[1] FALSE
# reordering rna dataset
rna <- rna[, rownames(clinical.exp)]
#______________________ Data tranformation & Normalization ______________________________#
dds <- DESeqDataSetFromMatrix(countData = rna,
                              colData = clinical.exp,
                              design = ~ 1) # 1 passed to the function because of no model
# pre-filteration, 
# Keeping genes with expression in 10% of samples with a count of 10 or higher
keep <- rowSums(counts(dds) >= 10) >= round(ncol(rna)*0.1)
dds <- dds[keep,]

# vst tranformation
vsd <- assay(vst(dds))
#_________________________________# Feature Selection _________________________________#
# check data distribution
#hist(mads, breaks=nrow(vsd)*0.1)
# selecting features

Dissimilarity matrix calculation and clustering algorithms:

Dissimilarity matrix There is a diverse list of dissimilarity matrix calculation methods (distance measures). To see a list of available distance measures please check ?stats::dist. For log-transformed gene expression, Euclidean-based measures can be applied. For RNA-seq normalized counts, correlation-based measures (Pearson, Spearman) or a Poisson-based distance can be used.

Clustering algorithms: Most widely used algorithms are partitional clustering and hierarchical clustering.

Partitional clustering is a clustering method used to classify samples into multiple clusters based on their similarity. The algorithms are required to specify the number of clusters to be generated. The commonly used partitional clustering includes K-means clustering (KM): each cluster is represented by the center or means of the data points belonging to the cluster. This method is sensitive to outliers. K-medoids clustering or PAM (Partitioning Around Medoids), in which, each cluster is represented by one of the objects in the cluster. PAM is less sensitive to outliers. Hierarchical clustering (HC) in contrast to partitional clustering, this method does not require pre-specify the number of clusters to be generated. To compute distances between clusters in HC, there are different ways to do cluster agglomeration (i.e, linkage ) like “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”. Generally, complete linkage and Ward’s method are preferred for gene expression analysis.

In practice, HC, KM, and PAM are commonly used for gene expression data.

It would be pretty challenging to decide between several clusters and the membership of samples to each cluster. To quantitatively address these two issues, I use Consensus Clustering (CC) approach. Applying this method has proved to be effective in new cancer subclasses discoveries. For more information on the methodology please refer to the seminal paper by Monti et al. (2003) and ConsensusClusterPlus package manual.

#________________________# Clustering & Cluster assignmnet validation __________________________#
# finding optimal clusters by CC
results = ConsensusClusterPlus(mad2k, # normalized expression matrix
                               maxK=6, # maximum k to make clusters
                               reps=50, # resampling
                               pItem=0.8, # item resampling
                               pFeature=1, # gene resampling
                               title= "geneExp", # output title 
                               clusterAlg="pam", # clustering algorithm
                               distance="spearman", # distance measure
                               seed=1262118388.71279, # random number
                               plot="pdf") # output file format

The above command returns several plots which are helpful to decide on cluster number:

1- Color-coded heatmap corresponding to the consensus matrix that represents consensus values from 0–1; white corresponds to 0 (never clustered together) and dark blue to 1 (always clustered together). Here I only posted a heatmap for k = 4.

enter image description here

2- Consensus Cumulative Distribution Function (CDF) plot, This graphic lets one determine at what number of clusters,k, the CDF reaches an approximate maximum: here 4 or 5.

enter image description here

3- Based on the CDF plot, a chart for the relative change in area under the CDF curve can be plotted. This plot allows a user to determine the relative increase in consensus and determine k at which there is no appreciable increase:

enter image description here

Assessing cluster assignment

This is the procedure of assessing the goodness of clustering results. Here I will use the Silhouette method for cluster assessment and survival analysis to see how technically and biologically the clusters that we defined make sense.

The silhouette method can be used to investigate the separation distance between the obtained clusters. The Silhouette plot reflects a measure of how close each data point in one cluster is to a point in the neighboring clusters. Silhouette width has a range of -1 to +1. A value near +1 shows that the sample is far away from the closest data point from the neighboring cluster. A negative value may indicate wrong cluster assignment and a value close to 0 means an arbitrary cluster assignment to that data point.

#__________________________#Assessing cluster assignment ______________________________#
#Silhouette width analysis
cc4 = results[[4]]

# calcultaing Silhouette width using the cluster package 
cc4Sil = silhouette(x = cc4[[3]], 
                    dist = as.matrix(1- cc4[[4]])) 

#For visualization:
fviz_silhouette(cc4Sil, palette = "jco",
                 ggtheme = theme_classic())

The above function return a summary of Silhouette width for each cluster:

  cluster size ave.sil.width
1       1  130          0.75
2       2  199          0.83
3       3   66          0.65
4       4   81          0.72

enter image description here

Since by clustering one should expect to find cluster of samples who are significantly different in terms of survival probability,I perform survival analysis between clusters (subtypes) to see significant differences, if any. As the result suggest it seems the clusters 2 and 4 hvae significant survival probability from cluster 1:

#________________________#Assessing cluster assignment _________________________________#
#Survival analysis
### preparing dataset for survival analysis
cc4Class = data.frame(cc4$consensusClass)
cc4Class$ID = rownames(cc4Class)
cc4Class = data.frame(cc4Class[match(rownames(clinical.exp),cc4Class$ID),])
all(cc4Class$ID == rownames(clinical.exp))

# new encoding for status, time and cluster
clinical.exp$status = ifelse(clinical.exp$Progression.to.T2. == "NO", 0,1)
clinical.exp$time = as.numeric(clinical.exp$Progression.free.survival..months. * 30)
clinical.exp$cluster = as.factor(cc4Class$cc4.consensusClass)


res.cox <- coxph(Surv(time, status) ~ cluster, data = clinical.exp)

coxph(formula = Surv(time, status) ~ cluster, data = clinical.exp)

             coef exp(coef) se(coef)      z       p
cluster2 -1.47638   0.22846  0.48357 -3.053 0.00226
cluster3  0.22532   1.25272  0.42213  0.534 0.59351
cluster4 -2.39949   0.09076  1.03301 -2.323 0.02019

Likelihood ratio test=21.88  on 3 df, p=6.908e-05
n= 476, number of events= 31 


coxph(formula = Surv(time, status) ~ cluster, data = clinical.exp)

  n= 476, number of events= 31 

             coef exp(coef) se(coef)      z Pr(>|z|)   
cluster2 -1.47638   0.22846  0.48357 -3.053  0.00226 **
cluster3  0.22532   1.25272  0.42213  0.534  0.59351   
cluster4 -2.39949   0.09076  1.03301 -2.323  0.02019 * 
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
cluster2   0.22846     4.3771   0.08855    0.5894
cluster3   1.25272     0.7983   0.54769    2.8653
cluster4   0.09076    11.0175   0.01198    0.6874

Concordance= 0.714  (se = 0.039 )
Likelihood ratio test= 21.88  on 3 df,   p=7e-05
Wald test            = 16.52  on 3 df,   p=9e-04
Score (logrank) test = 22.14  on 3 df,   p=6e-05

Final notes: Unsupervised clustering and defining subtypes in expression data is a task for which both methodological aspect and biological sense should match. For example, the first publication on this dataset concluded that there are three subtypes in the cohort, while their CDF and CDF delta-area plots clearly showed there might be more than three. Later the same group, same disease, modified bioinformatic pipeline proposed four distinct subtypes in the RNA-seq data from patients with non-invasive bladder cancer (ref).

other References

most of the references were cited in the tutorial body

1- Biostar posts: 281161, 321773, 225315

2- Partitional Clustering in R: The Essentials

3-Clustering RNAseq data using K-means: how many clusters?

clustering RNA-seq expression gene • 1.2k views

Login before adding your answer.

Traffic: 1728 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6