Unsupervised subtype discovery
5.0 years ago
tucanj ▴ 90

I am trying to discover subtypes of a disease from gene expression data (20,000 genes x 80 samples). I do not know the number of subtypes in advance.

I can find only one review comparing methods: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-497

Are there any accepted protocols or methods? I see:

  1. Different methods of filtering genes beforehand (e.g., keeping the top 5000 genes by median absolute deviation). Is there an optimal number of genes to include?
  2. Different algorithms (k-means, consensus clustering, Mclust)

Any input appreciated!

microarray R • 1.3k views
3.0 years ago
chris86 ▴ 370

To update this: for a single platform, I have found empirically, on TCGA RNA-seq data, that the best algorithms currently available in R for this are M3C and CLEST. These results are not published yet. M3C is an improved version of the Monti et al. consensus clustering algorithm, and CLEST has been around for a long time and works well. It is best to try a few different algorithms and see which works best on your data.

I would use the most variable genes only and try a few thresholds.
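
As a rough illustration of filtering on the most variable genes at a few thresholds, here is a minimal Python sketch that ranks genes by median absolute deviation (MAD). The function names, gene names, and toy values are made up for the example; on real data `expr` would hold your normalized expression matrix and the thresholds would be in the thousands.

```python
from statistics import median

def mad(values):
    """Median absolute deviation of one gene's expression values."""
    m = median(values)
    return median(abs(v - m) for v in values)

def top_genes_by_mad(expr, n_top):
    """Return the names of the n_top genes with the largest MAD.

    expr maps gene name -> list of expression values (one per sample).
    """
    ranked = sorted(expr, key=lambda g: mad(expr[g]), reverse=True)
    return ranked[:n_top]

# Toy data (hypothetical): 4 genes x 6 samples
expr = {
    "geneA": [1.0, 1.1, 0.9, 5.0, 5.2, 4.8],  # differs between sample groups
    "geneB": [2.0, 2.0, 2.1, 2.0, 1.9, 2.0],  # nearly constant
    "geneC": [0.5, 3.0, 0.4, 3.1, 0.6, 2.9],  # differs between sample groups
    "geneD": [7.0, 7.1, 7.0, 6.9, 7.0, 7.1],  # nearly constant
}

# Try several thresholds and compare the downstream clustering for each
for n_top in (1, 2, 3):
    print(n_top, top_genes_by_mad(expr, n_top))
```

If the clusters you get are stable across a range of thresholds, that is some reassurance that the result does not hinge on one arbitrary cutoff.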

5.0 years ago
Ahill ★ 1.9k

I'd say there are no hard and fast rules about how many genes to include or which algorithm to pick; it's data-dependent. If data quality is good and there truly are subtypes to be found, you could probably succeed using any one of the algorithms you list. If you are looking for a case study, this paper (https://www.ncbi.nlm.nih.gov/pubmed/20129251) is an example of a successful approach to unsupervised subtype discovery (in glioblastoma). They used consensus average linkage hierarchical clustering on 1740 genes.
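
To make the idea of consensus clustering with average linkage concrete, here is a simplified Python sketch. It is not the cited paper's pipeline, only the general pattern: repeatedly subsample the samples, cluster each subsample, and record how often each pair ends up together. The function names, the 80% resampling fraction, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
import random
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def average_linkage(points, k):
    """Agglomerative clustering with average linkage.

    Returns k clusters as lists of indices into points.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average pairwise distance between members of the two clusters
            d = sum(euclidean(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    return clusters

def consensus_matrix(samples, k, n_resamples=50, frac=0.8, seed=0):
    """For each pair of samples, the fraction of resamples containing
    both in which they were assigned to the same cluster."""
    rng = random.Random(seed)
    n = len(samples)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        idx = rng.sample(range(n), max(k, int(frac * n)))
        sub = [samples[i] for i in idx]
        for i, j in combinations(idx, 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
        for cluster in average_linkage(sub, k):
            for a, b in combinations(cluster, 2):
                i, j = idx[a], idx[b]
                together[i][j] += 1
                together[j][i] += 1
    return [[1.0 if i == j
             else (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
             for j in range(n)] for i in range(n)]

# Toy data (hypothetical): 6 samples x 2 genes, two well-separated groups
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
M = consensus_matrix(samples, k=2)
print(M[0][1], M[0][3])
```

Values close to 0 or 1 across the matrix suggest stable clusters at that k; comparing the matrices for several values of k is one way to approach the question of how many subtypes there are.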

3.0 years ago
Min Dai ▴ 160

I'd say you should filter out some genes, because genes that don't contribute to subtype identification may add noise to your data. The underlying assumption is that there are some features common to the different subtypes, but that measurement introduces variation in those common features. Therefore, I recommend applying feature selection or feature extraction before clustering. For example, you can try singular value decomposition or nonnegative matrix factorization.
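
As a small illustration of feature extraction, the Python sketch below uses power iteration to find the first right singular vector of a centered genes x samples matrix, which gives one score per sample. This is a stand-in for a full singular value decomposition, which you would normally get from a linear algebra library. The function names and toy values are assumptions made for the example.

```python
import math

def center_rows(matrix):
    """Subtract each gene's mean so the component reflects variation
    across samples rather than overall expression level."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered

def leading_sample_scores(matrix, n_iter=200):
    """Power iteration for the first right singular vector of a
    genes x samples matrix. Returns one score per sample."""
    X = center_rows(matrix)
    n_genes, n_samples = len(X), len(X[0])
    # Non-constant start vector: a constant one is annihilated by centering
    v = [1.0 / (s + 1) for s in range(n_samples)]
    for _ in range(n_iter):
        u = [sum(x * w for x, w in zip(row, v)) for row in X]   # u = X v
        v = [sum(X[g][s] * u[g] for g in range(n_genes))        # v = X^T u
             for s in range(n_samples)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v

# Toy data (hypothetical): 3 genes x 6 samples; the first two genes
# separate samples 0-2 from samples 3-5, the third is mostly noise
expr = [
    [1.0, 1.2, 0.9, 4.0, 4.1, 3.9],
    [2.0, 2.1, 1.9, 0.1, 0.2, 0.0],
    [3.0, 3.0, 3.1, 3.0, 2.9, 3.0],
]
scores = leading_sample_scores(expr)
print([round(s, 2) for s in scores])
```

In this toy case the two sample groups receive scores of opposite sign, so clustering on the score alone would already separate them. The overall sign of a singular vector is arbitrary, so only the relative signs are meaningful.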
