Even though I'm working with RNA-seq data, my question is more about statistics and machine learning. I do all my work in R.
I have expression data from wild types (controls) and mutants across four different conditions, where each condition is a combination of cell type (two of them) and time point. Basically, my expression matrix looks like this:
         time1_loc1_control  time1_loc1_mutant  time1_loc2_control  time1_loc2_mutant
    gene1
    gene2
    ..
    ..
Expression values are initially in counts-per-million, but I have tried using different transformations before clustering attempts.
What I am trying to achieve is to cluster the genes based on both the direction of the change between mutant and control (upregulated or downregulated) and the absolute expression values.
So far I have been able to roughly group the genes based solely on the direction of the change, but I also need to retain the information on absolute expression values. Are there any methods that could help me with this? It would be something like a combination of quantitative and categorical data. I tried using daisy(), but it didn't seem to do what I'm trying to achieve.
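A minimal sketch of that mixed-data setup, on simulated values rather than real counts. It only illustrates one way daisy() can combine a categorical direction column with a quantitative expression column via Gower dissimilarity. The column names, the pseudocount of 1, the use of pam() and k = 4 are all assumptions for illustration.

```r
library(cluster)  # provides daisy() and pam()

set.seed(1)

# Simulated stand-in for a single condition: CPM for control and mutant.
# These values are made up and do not represent real data.
n_genes <- 60
cpm <- data.frame(
  control = rexp(n_genes, rate = 1 / 50),
  mutant  = rexp(n_genes, rate = 1 / 50),
  row.names = paste0("gene", seq_len(n_genes))
)

# Mixed feature table: one categorical column (direction of change)
# and one quantitative column (mean log2 expression level).
pseudo <- 1  # assumed pseudocount to avoid log2(0)
features <- data.frame(
  direction = factor(ifelse(cpm$mutant > cpm$control, "up", "down"),
                     levels = c("down", "up")),
  level = log2(rowMeans(cpm) + pseudo),
  row.names = rownames(cpm)
)

# Gower dissimilarity handles factor and numeric columns together;
# each column contributes a value between 0 and 1.
d <- daisy(features, metric = "gower")

# Partition around medoids on the dissimilarity; k = 4 is arbitrary here.
fit <- pam(d, k = 4, diss = TRUE)
clusters <- fit$clustering

# Cross-tabulate direction against cluster membership.
print(table(features$direction, clusters))
```

With equal weights, the binary direction column tends to dominate the Gower dissimilarity, because a mismatch there always contributes the maximum value. The `weights` argument of daisy() can be used to shift the balance towards the expression level.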
One other idea was to split the dataset by condition (as the conditions are independent) and cluster the genes separately within each one. This would mean each gene would be clustered four times. Is there any way to determine which genes are consistently clustered together across those four runs?
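The per-condition idea can be sketched as follows, again on simulated data. Each condition is clustered separately, and a co-clustering matrix then counts, for every pair of genes, in how many conditions they received the same label. The condition names cond1 to cond4, the hierarchical clustering settings and k = 3 are placeholders, not taken from the real dataset.

```r
set.seed(2)

n_genes <- 40
conditions <- paste0("cond", 1:4)  # hypothetical names for the four conditions

# Simulated log2 expression matrix with a control and a mutant column
# per condition (made-up values, for illustration only).
col_names <- paste(rep(conditions, each = 2), c("control", "mutant"), sep = "_")
expr <- sapply(col_names, function(nm) log2(rexp(n_genes, rate = 1 / 50) + 1))
rownames(expr) <- paste0("gene", seq_len(n_genes))

# Cluster the genes within one condition using its two columns only.
k <- 3  # arbitrary number of clusters per condition
cluster_one <- function(cond) {
  sub <- expr[, paste(cond, c("control", "mutant"), sep = "_")]
  cutree(hclust(dist(sub), method = "average"), k = k)
}

# Genes x conditions matrix of cluster labels.
labels <- sapply(conditions, cluster_one)

# Co-clustering matrix: entry [i, j] is the number of conditions in which
# gene i and gene j were assigned to the same cluster.
co <- Reduce(`+`, lapply(conditions, function(cond) {
  outer(labels[, cond], labels[, cond], "==") * 1
}))
dimnames(co) <- list(rownames(expr), rownames(expr))

# Gene pairs that end up together in all four runs.
always <- which(co == length(conditions) & upper.tri(co), arr.ind = TRUE)
pairs <- data.frame(gene_a = rownames(co)[always[, 1]],
                    gene_b = rownames(co)[always[, 2]])
print(head(pairs))
```

Cluster labels are not comparable between runs (cluster 1 in one condition is unrelated to cluster 1 in another), which is why the comparison is made on pairs of genes rather than on the labels themselves. If a single consensus grouping is wanted, `as.dist(1 - co / length(conditions))` could be passed to hclust() as one possible next step.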
Thank you for any input on this. I realize it's a bit vague, but I am not very experienced with this.