Question: Complex clustering of RNA-seq data
5.3 years ago by Yrinky (United Kingdom):

Even though I'm working with RNA-seq data, my question is more about statistics and machine learning. I do all my work in R.

So, what I have is expression data from wild types (controls) and mutants over four different conditions, where a condition is a combination of two cell types and time points. Basically, my expression matrix looks like this:


Expression values are initially in counts-per-million, but I have tried using different transformations before clustering attempts.

What I am trying to achieve is to cluster the genes based on the direction of the change between mutant and control (upregulated or downregulated) and absolute values of expression.

So far I have been able to roughly group the genes based solely on the direction of the change, but I also need to retain the information about absolute expression values. Are there any methods that could help me with this? It would be something like a combination of quantitative and categorical data. I tried using daisy(), but it didn't seem to do what I'm trying to achieve.
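A minimal sketch of one way to keep both signals in a single clustering, using base R only. It treats the direction of change as a signed log2 fold change rather than a category, so no mixed-type distance is needed. The matrices `ctrl` and `mut`, the pseudocount of 1, the choice of Ward linkage and `k = 6` are all assumptions for illustration, and the data are simulated.

```r
set.seed(1)
n_genes <- 200
genes <- paste0("gene", seq_len(n_genes))
conds <- paste0("cond", 1:4)

# Simulated (hypothetical) CPM values: genes x conditions, controls and mutants
ctrl <- matrix(rexp(n_genes * 4, rate = 1 / 50), n_genes, 4,
               dimnames = list(genes, conds))
mut <- ctrl * 2^matrix(rnorm(n_genes * 4), n_genes, 4)

log_ctrl <- log2(ctrl + 1)
log_mut <- log2(mut + 1)

lfc <- log_mut - log_ctrl        # sign = direction of change, size = magnitude
avg <- (log_mut + log_ctrl) / 2  # absolute expression level

# Scale each block as a whole so neither dominates the distance.
# The fold changes are not centred, so zero still means "no change".
feat <- cbind(lfc / sd(lfc), (avg - mean(avg)) / sd(avg))
colnames(feat) <- c(paste0("lfc_", conds), paste0("avg_", conds))

hc <- hclust(dist(feat), method = "ward.D2")
clusters <- cutree(hc, k = 6)
table(clusters)
```

Multiplying one of the two blocks by a weight before `cbind()` shifts the balance between direction and expression level.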

One other idea was to split the dataset by condition (as the conditions are independent) and cluster the genes separately within each. This would mean each gene would be clustered four times. Is there any way to determine which genes are consistently clustered together across the four runs?
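A sketch of the per-condition idea, assuming simulated features and base R only: cluster the genes once per condition, then count for every gene pair how many of the four clusterings put them in the same cluster. The names `lfc`, `avg`, `labels` and `co`, and the choice of `k = 4`, are hypothetical.

```r
set.seed(2)
n_genes <- 100
genes <- paste0("gene", seq_len(n_genes))
conds <- paste0("cond", 1:4)

# Simulated per-condition features: log2 fold change and mean log2 expression
lfc <- matrix(rnorm(n_genes * 4), n_genes, 4, dimnames = list(genes, conds))
avg <- matrix(rnorm(n_genes * 4, mean = 5), n_genes, 4,
              dimnames = list(genes, conds))

k <- 4
# One clustering per condition; result is a genes x conditions label matrix
labels <- sapply(conds, function(cn) {
  feat <- scale(cbind(lfc[, cn], avg[, cn]))
  cutree(hclust(dist(feat), method = "ward.D2"), k = k)
})

# co[i, j] = number of conditions in which genes i and j share a cluster
co <- matrix(0, n_genes, n_genes, dimnames = list(genes, genes))
for (cn in conds) {
  co <- co + outer(labels[, cn], labels[, cn], "==")
}

# Gene pairs that are clustered together in every condition
idx <- which(co == length(conds) & upper.tri(co), arr.ind = TRUE)
pairs <- data.frame(gene_a = genes[idx[, "row"]], gene_b = genes[idx[, "col"]])
head(pairs)

# Optional consensus clustering, using 1 - co / 4 as a distance
consensus <- cutree(hclust(as.dist(1 - co / length(conds)),
                           method = "average"), k = k)
```

Cluster labels are arbitrary within each run, which is why the comparison is done on pairs of genes rather than on the label values themselves.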

Thank you for any input on this. I realize it's a bit vague, but I am not very experienced with this.



clustering rna-seq R • 1.6k views

It seems to me that you are looking for a (semi-)supervised biclustering approach. To perform biclustering in R you could use the Iterative Signature Algorithm (ISA) via the isa2 package (developed for microarray data), but I am not aware of any approach available in R that lets you add a priori knowledge. However, some such approaches have been described in the literature.

written 5.3 years ago by alesssia • modified 5.3 years ago

Thank you for the input, I will look into it!

written 5.3 years ago by Yrinky
5.3 years ago by Jean-Karim Heriche (EMBL Heidelberg, Germany):

You could try representing your data as a three-way array (e.g. genes x cell types x time points) and doing a PARAFAC/CANDECOMP tensor factorization. This is implemented in R in the PTAk package.
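To make the idea concrete, here is a minimal sketch of a PARAFAC/CANDECOMP fit by alternating least squares, written in base R. It is not the PTAk implementation and does not use its API. The helper names `khatri_rao` and `cp_als`, the simulated genes x cell types x time points array, and the rank of 2 are all assumptions for illustration.

```r
# Column-wise Kronecker product; rows of Q vary fastest
khatri_rao <- function(P, Q) {
  do.call(cbind, lapply(seq_len(ncol(P)),
                        function(r) as.vector(outer(Q[, r], P[, r]))))
}

cp_als <- function(X, rank, n_iter = 200, tol = 1e-8) {
  d <- dim(X)
  X1 <- matrix(X, d[1])                      # mode-1 unfolding
  X2 <- matrix(aperm(X, c(2, 1, 3)), d[2])   # mode-2 unfolding
  X3 <- matrix(aperm(X, c(3, 1, 2)), d[3])   # mode-3 unfolding
  A <- matrix(rnorm(d[1] * rank), d[1], rank)
  B <- matrix(rnorm(d[2] * rank), d[2], rank)
  C <- matrix(rnorm(d[3] * rank), d[3], rank)
  # Least-squares update of one factor with the other two held fixed
  update <- function(Xn, P, Q) {
    V <- crossprod(P) * crossprod(Q)
    t(solve(V, t(Xn %*% khatri_rao(P, Q))))
  }
  total <- sum(X1^2)
  err_old <- Inf
  for (it in seq_len(n_iter)) {
    A <- update(X1, C, B)
    B <- update(X2, C, A)
    C <- update(X3, B, A)
    err <- sum((X1 - A %*% t(khatri_rao(C, B)))^2)
    if (abs(err_old - err) < tol * total) break
    err_old <- err
  }
  list(genes = A, cells = B, times = C, fit = 1 - err / total)
}

# Simulated array: 60 genes x 2 cell types x 2 time points, rank 2 plus noise
set.seed(3)
A0 <- matrix(rnorm(60 * 2), 60, 2)
B0 <- matrix(c(1, 1, 1, -1), 2, 2)
C0 <- matrix(c(1, 2, 1, -0.5), 2, 2)
X <- array(A0 %*% t(khatri_rao(C0, B0)), dim = c(60, 2, 2))
X <- X + rnorm(length(X), sd = 0.05)

res <- cp_als(X, rank = 2)
res$fit                                   # share of the sum of squares explained
gene_clusters <- kmeans(res$genes, centers = 3, nstart = 10)$cluster
```

The rows of `res$genes` give each gene a loading on every component, so genes can then be grouped on those loadings (here with k-means, an arbitrary choice). Alternating least squares depends on the random start, so it is worth rerunning with several seeds.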
