Clustering is a data mining method to identify unknown possible groups of items solely based on intrinsic features and no external variables. Basically, clustering includes four steps:
1) Data preparation and Feature selection,
2) Dissimilarity matrix calculation,
3) applying clustering algorithms,
4) Assessing cluster assignment
I use an RNA-seq dataset on 476 patients with non-muscle invasive bladder cancer. To make this tutorial reproducible, clinical data can be downloaded from GitHub and normalized RNA-seq data could be obtained through this link . So if you like to use these materials, start your analysis from the 'feature selection' section onward. Also, the GitHub repo could provide more details on each step of the analysis.
Data preparation and Feature selection
In this step we need to: filter out incomplete cases and low expressed genes, deal with missing values (removing or value imputation), then transform/normalize gene expression values.
In the dataset, there is no missing value. For transformation/normalization, I use variance stabilizing transformation (VST) implemented as
vst function from
DESeq2 packages which at the same time will normalize the raw count also. Using other types of transformation like Z score, log transformation is also common. If the expression matrix contains estimated transcript counts (like RSEM) or counts normalized for sequencing depth (like FPKM) , log2 transformation of the values would prepare the data for clustering.
Feature selection: A feature selection could be helpful to limit the analysis to those genes that possibly explain variation between samples in the cohort. I select 2k, 4k, and 6k top genes based on median absolute deviation (MAD) . A number of other approaches like feature selection based on the most variance, feature dimension reduction and extraction based on Principal Component Analysis (PCA), and feature selection based on the Cox regression model could be applied.
#_________________________________reading data and processing_________________________________# # reading count data rna <- read.table("Uromol1_CountData.v1.csv", header = T, sep = ",", row.names = 1) dim(rna) #  60483 476 this is a typical output from hts-seq count matrix with more than 60,000 genes # dissecting dataset based on gene-type library(biomaRt) mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl")) genes <- getBM(attributes= c("ensembl_gene_id","hgnc_symbol", "gene_biotype"), mart= mart) # see gene types returned by biomaRt data.frame(table(genes$gene_biotype))[order(- data.frame(table(genes$gene_biotype))[,2]), ] # we will continue with protein coding here: rna <- rna[substr(rownames(rna),1,15) %in% genes$ensembl_gene_id[genes$gene_biotype == "protein_coding"],] dim(rna) # 19581 476 These are protein coding genes # reading associated clinical data clinical.exp <- read.table("uromol_clinic.csv", sep = ",", header = T, row.names = 1) # making sure about sample order in rna and clincal.exp dataset all(rownames(clinical.exp) %in% colnames(rna)) # TRUE all(rownames(clinical.exp) == colnames(rna)) # FALSE # reordering rna dataset rna <- rna[, rownames(clinical.exp)] #______________________ Data tranformation & Normalization ______________________________# library(DESeq2) dds <- DESeqDataSetFromMatrix(countData = rna, colData = clinical.exp, design = ~ 1) # 1 passed to the function because of no model # pre-filteration, # Keeping genes with expression in 10% of samples with a count of 10 or higher keep <- rowSums(counts(dds) >= 10) >= round(ncol(rna)*0.1) dds <- dds[keep,] # vst tranformation vsd <- assay(vst(dds)) #_________________________________# Feature Selection _________________________________# mads=apply(vsd,1,mad) # check data distribution #hist(mads, breaks=nrow(vsd)*0.1) # selecting features mad2k=vsd[rev(order(mads))[1:2000],] #mad4k=vsd[rev(order(mads))[1:4000],] #mad6k=vsd[rev(order(mads))[1:6000],]
Dissimilarity matrix calculation and clustering algorithms:
There is a diverse list of dissimilarity matrix calculation methods (distance measures). To see a list of available distance measures please check
?stats::dist. For log-transformed gene expression, Euclidean-based measures can be applied. For RNA-seq normalized counts, correlation-based measures (Pearson, Spearman) or a Poisson-based distance can be used.
Clustering algorithms: Most widely used algorithms are partitional clustering and hierarchical clustering.
Partitional clustering is a clustering method used to classify samples into multiple clusters based on their similarity. The algorithms are required to specify the number of clusters to be generated. The commonly used partitional clustering includes K-means clustering (KM): each cluster is represented by the center or means of the data points belonging to the cluster. This method is sensitive to outliers.
In K-medoids clustering or PAM (Partitioning Around Medoids), each cluster is represented by one of the objects in the cluster. PAM is less sensitive to outliers.
In contrast to partitional clustering, hierarchical clustering (HC) does not require pre-specify the number of clusters to be generated. To compute distances between clusters in HC, there are different ways to do cluster agglomeration (i.e, linkage ) like “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median” or “centroid”. Generally, complete linkage and Ward’s method are preferred for gene expression analysis.
In practice, HC, KM, and PAM are commonly used for gene expression data.
It would be quite challenging to decide between several clusters and the membership of samples in each cluster.
To quantitatively address these two issues, I use Consensus Clustering (CC) approach. Applying this method has proved to be effective in new cancer subclasses discoveries. For more information on the methodology please refer to the seminal paper by Monti et al. (2003) and the
ConsensusClusterPlus package manual.
#________________________# Clustering & Cluster assignment validation __________________________# # finding optimal clusters by CC library(ConsensusClusterPlus) results = ConsensusClusterPlus(mad2k, # normalized expression matrix maxK=6, # maximum k to make clusters reps=50, # resampling pItem=0.8, # item resampling pFeature=1, # gene resampling title= "geneExp", # output title clusterAlg="pam", # clustering algorithm distance="spearman", # distance measure seed=1262118388.71279, # random number plot="pdf") # output file format
The above command returns several plots which are helpful to decide on cluster number:
1- Color-coded heatmap corresponding to the consensus matrix that represents consensus values from 0–1; white corresponds to 0 (never clustered together) and dark blue to 1 (always clustered together). Here I only posted a heatmap for k = 4.
2- Consensus Cumulative Distribution Function (CDF) plot, This graphic lets one determine at what number of clusters,k, the CDF reaches an approximate maximum: here 4 or 5.
3- Based on the CDF plot, a chart for the relative change in area under the CDF curve can be plotted. This plot allows a user to determine the relative increase in consensus and determine the k at which there is no appreciable increase:
Assessing cluster assignment
This is the procedure of assessing the goodness of clustering results. Here I will use the Silhouette method for cluster assessment and survival analysis to see how technically and biologically the clusters that we defined make sense.
The silhouette method can be used to investigate the separation distance between the obtained clusters. The Silhouette plot reflects a measure of how close each data point in one cluster is to a point in the neighboring clusters. Silhouette width has a range of -1 to +1. A value near +1 shows that the sample is far away from the closest data point from the neighboring cluster. A negative value may indicate a wrong cluster assignment and a value close to 0 means an arbitrary cluster assignment to that data point.
#__________________________#Assessing cluster assignment ______________________________# #Silhouette width analysis cc4 = results[] # calcultaing Silhouette width using the cluster package library(cluster) cc4Sil = silhouette(x = cc4[], dist = as.matrix(1- cc4[])) #For visualization: library(factoextra) fviz_silhouette(cc4Sil, palette = "jco", ggtheme = theme_classic())
The above function returns a summary of the Silhouette width for each cluster:
cluster size ave.sil.width 1 1 130 0.75 2 2 199 0.83 3 3 66 0.65 4 4 81 0.72
Since by clustering one should expect to find clusters of samples that are significantly different in terms of survival probability, I perform survival analysis between clusters (subtypes) to see significant differences, if any. As the result suggest it seems that clusters 2 and 4 have a significant difference in terms of survival probability from cluster 1:
#________________________#Assessing cluster assignment _________________________________# #Survival analysis ### preparing dataset for survival analysis cc4Class = data.frame(cc4$consensusClass) cc4Class$ID = rownames(cc4Class) cc4Class = data.frame(cc4Class[match(rownames(clinical.exp),cc4Class$ID),]) all(cc4Class$ID == rownames(clinical.exp)) # new encoding for status, time, and cluster clinical.exp$status = ifelse(clinical.exp$Progression.to.T2. == "NO", 0,1) clinical.exp$time = as.numeric(clinical.exp$Progression.free.survival..months. * 30) clinical.exp$cluster = as.factor(cc4Class$cc4.consensusClass) library(survival) res.cox <- coxph(Surv(time, status) ~ cluster, data = clinical.exp) res.cox Call: coxph(formula = Surv(time, status) ~ cluster, data = clinical.exp) coef exp(coef) se(coef) z p cluster2 -1.47638 0.22846 0.48357 -3.053 0.00226 cluster3 0.22532 1.25272 0.42213 0.534 0.59351 cluster4 -2.39949 0.09076 1.03301 -2.323 0.02019 Likelihood ratio test=21.88 on 3 df, p=6.908e-05 n= 476, number of events= 31 summary(res.cox) Call: coxph(formula = Surv(time, status) ~ cluster, data = clinical.exp) n= 476, number of events= 31 coef exp(coef) se(coef) z Pr(>|z|) cluster2 -1.47638 0.22846 0.48357 -3.053 0.00226 ** cluster3 0.22532 1.25272 0.42213 0.534 0.59351 cluster4 -2.39949 0.09076 1.03301 -2.323 0.02019 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 exp(coef) exp(-coef) lower .95 upper .95 cluster2 0.22846 4.3771 0.08855 0.5894 cluster3 1.25272 0.7983 0.54769 2.8653 cluster4 0.09076 11.0175 0.01198 0.6874 Concordance= 0.714 (se = 0.039 ) Likelihood ratio test= 21.88 on 3 df, p=7e-05 Wald test = 16.52 on 3 df, p=9e-04 Score (logrank) test = 22.14 on 3 df, p=6e-05
Final notes: Unsupervised clustering and defining subtypes in expression data is a task for which both methodological aspect and biological sense should match. For example, the first publication on this dataset concluded that there are three subtypes in the cohort, while their CDF and CDF delta-area plots clearly showed there might be more than three. Later the same group, same disease, and modified bioinformatic pipeline proposed four distinct subtypes in the RNA-seq data from patients with non-invasive bladder cancer (ref).
most of the references were cited in the tutorial body