Question: Clustering RNA Seq - No Conditions, No Replicates
mbio.kyle (United States) wrote, 5.2 years ago:

I have 25 samples from TCGA, which contain RNA-seq expression data from 25 different clinical cancer tumor biopsies. I want to cluster them based on similar expression. The problem is that there are no conditions or replicates from which to build an experimental design to feed into DESeq or edgeR. I also tried things like Perl's KMeans library, R's built-in kmeans(), etc. The problem there is that for each sample I have 25K expression values (25K genes), so the feature vectors are very large and I don't think I am getting anything useful.

Does anyone have any advice on clustering data sets with very large feature vectors and/or clustering expression data without biological conditions?

Thanks,

Kyle

rna-seq rna R • 2.7k views
Ahill (United States) wrote, 5.2 years ago:

I'd suggest first having a hypothesis. Given what you know about the TCGA biopsies and where they came from, what's your expectation of what a clustering would look like? To get a sense of the data, first make sure the counts are suitably normalized. You could filter the genes to a smaller subset with the highest variance across samples on the log scale and start clustering with a small set of those genes. Perhaps a few hundred or a thousand, something that is easy to visualize. I would start with correlation as the similarity measure. Do you know if the samples came from different clinical sites? Are those sites reflected in the initial clusters? Or the biopsy tissue source? Once you have an initial picture based on a subset of the most variable genes, depending on what you find, you may wish to expand out to include more of the genes, to see what new clusters emerge, if any. Be aware that low-count mRNAs may contribute more noise than signal to your clustering.
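As a rough illustration of the steps above (normalize, log-transform, keep the most variable genes, cluster samples on correlation), here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption made for the example: the counts are synthetic toy data, the function names are hypothetical, counts-per-million is just one possible normalization, and average-linkage hierarchical clustering is one of several methods that work with a correlation-based distance. The same steps map onto R's cor(), as.dist(), and hclust().

```python
import math
import random

def log_cpm(counts):
    """counts: dict gene -> list of raw counts (one per sample).
    Returns log2(CPM + 1), using each sample's total count as its library size."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [math.log2(c * 1e6 / lib_sizes[j] + 1.0) for j, c in enumerate(row)]
        for gene, row in counts.items()
    }

def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def top_variable_genes(expr, n_top):
    """Names of the n_top genes with the highest variance across samples."""
    return sorted(expr, key=lambda g: variance(expr[g]), reverse=True)[:n_top]

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(expr, genes):
    """Sample-by-sample matrix of 1 - Pearson r over the selected genes."""
    n = len(expr[genes[0]])
    profiles = [[expr[g][j] for g in genes] for j in range(n)]
    return [[1.0 - pearson(profiles[a], profiles[b]) for b in range(n)]
            for a in range(n)]

def average_linkage(dist, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    with the smallest mean pairwise distance until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Synthetic toy data (NOT real TCGA values): 6 samples with different
# sequencing depths, 30 flat genes, and 20 genes that separate two groups.
random.seed(0)
depth = [1, 2, 1, 3, 1, 2]  # hypothetical depth multipliers per sample
toy_counts = {}
for g in range(30):
    toy_counts[f"flat{g}"] = [d * (100 + random.randint(-5, 5)) for d in depth]
for g in range(10):
    toy_counts[f"A{g}"] = [d * ((500 if j < 3 else 50) + random.randint(-5, 5))
                           for j, d in enumerate(depth)]
    toy_counts[f"B{g}"] = [d * ((50 if j < 3 else 500) + random.randint(-5, 5))
                           for j, d in enumerate(depth)]

expr = log_cpm(toy_counts)
top_genes = top_variable_genes(expr, n_top=20)
dist = correlation_distance(expr, top_genes)
clusters = average_linkage(dist, n_clusters=2)
print(clusters)
```

With real data, the number of genes kept and the number of clusters are choices to explore rather than fixed values; the point of starting small is that the resulting distance matrix and tree are easy to inspect against known sample annotations such as clinical site or tissue source.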

 


Thank you for your response. I have two follow up questions.

1. Should I be working with the RPKM values or with counts? (I am using counts since I initially built a count table for DESeq.)

2. Should I just use common sense for cutoffs (what is low-count, what is low-variance), or are there standards that people use?

Otherwise, I am paring down the list and moving forward with your suggestions. I feel a lot more sane working with a couple thousand genes.

Thanks again!

Reply written 5.2 years ago by mbio.kyle