I'm a novice at analysing gene expression data and am having difficulty moving forward, especially because I have no profound knowledge of modelling. Following the identification of differentially expressed genes (DEGs) using DESeq2, I want to classify the DEGs according to their expression patterns across tissues. The goal is to sort genes into groups of "interesting" candidates, based on the strength of expression in the tissue of interest. I then want to run GO-term enrichment analyses on the interesting clusters.
Although I was able to extract some patterns, I'm unsure if my procedure has any merit.
My input: a dataframe of rlog-transformed counts (transformed using DESeq2::rlog()).
First approach:
- Per gene, compute tissue means from the biological triplicates
- Per gene, scale the tissue means via base::scale()
- Cluster these vectors of scaled mean expression levels using stats::kmeans()
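A minimal sketch of these three steps in base R, run on simulated data. The matrix `rlog_mat`, the tissue names, the planted expression blocks and the choice of k = 3 are all made-up placeholders, not values from the real data set. One detail is easy to miss: `scale()` works column-wise, so a genes-by-tissues matrix has to be transposed before and after scaling to get a per-gene z-score.

```r
set.seed(1)

# Simulated stand-in for the rlog matrix: 60 genes x 9 samples
# (3 hypothetical tissues x 3 biological replicates each).
tissue <- factor(rep(c("leaf", "root", "flower"), each = 3))
rlog_mat <- matrix(rnorm(60 * 9, mean = 8, sd = 0.3), nrow = 60,
                   dimnames = list(paste0("gene", 1:60),
                                   paste0(tissue, "_", 1:3)))

# Plant one block of 20 tissue-specific genes per tissue (illustration only).
rlog_mat[1:20,  tissue == "leaf"]   <- rlog_mat[1:20,  tissue == "leaf"]   + 3
rlog_mat[21:40, tissue == "root"]   <- rlog_mat[21:40, tissue == "root"]   + 3
rlog_mat[41:60, tissue == "flower"] <- rlog_mat[41:60, tissue == "flower"] + 3

# Step 1: per gene, average the replicates of each tissue (genes x tissues).
tissue_means <- sapply(levels(tissue), function(tt) {
  rowMeans(rlog_mat[, tissue == tt, drop = FALSE])
})

# Step 2: z-score each gene across tissues.
# scale() centres and scales columns, hence the double transpose.
scaled <- t(scale(t(tissue_means)))

# Step 3: k-means on the scaled profiles; k = 3 is an arbitrary choice here.
km <- kmeans(scaled, centers = 3, nstart = 25)
table(km$cluster)
```

A gene whose tissue means are all identical has zero variance and becomes `NaN` after scaling, so such rows would need to be dropped before calling `kmeans()`. This is unlikely to matter for a list of DEGs, but it is worth checking.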
Second approach:
- Per gene, build a linear model based on the biological triplicates
- Per gene, z-score-scale the models
- Cluster the scaled models using stats::kmeans()
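A minimal sketch of this variant, under an assumption the post does not state: that the "linear model" is expression ~ tissue, fitted per gene, and that the things being scaled and clustered are its coefficients. The data, tissue names and k = 2 are again simulated placeholders. With the cell-means coding `~ 0 + tissue`, the fitted coefficients are exactly the per-tissue means, so under this assumption the variant clusters the same numbers as averaging the triplicates directly.

```r
set.seed(1)

# Simulated stand-in for the rlog matrix: 30 genes x 9 samples
# (3 hypothetical tissues x 3 biological replicates each).
tissue <- factor(rep(c("leaf", "root", "flower"), each = 3))
rlog_mat <- matrix(rnorm(30 * 9, mean = 8, sd = 0.3), nrow = 30,
                   dimnames = list(paste0("gene", 1:30), NULL))
rlog_mat[1:15,  tissue == "leaf"] <- rlog_mat[1:15,  tissue == "leaf"] + 3
rlog_mat[16:30, tissue == "root"] <- rlog_mat[16:30, tissue == "root"] + 3

# Fit expression ~ tissue for every gene at once: lm() accepts a matrix
# response (samples x genes) and fits one model per column.
fit <- lm(t(rlog_mat) ~ 0 + tissue)
coefs <- t(coef(fit))  # genes x tissues, one coefficient per tissue

# For comparison: plain per-tissue means of the replicates.
tissue_means <- sapply(levels(tissue), function(tt) {
  rowMeans(rlog_mat[, tissue == tt, drop = FALSE])
})

# The coefficients of the cell-means model equal the tissue means.
same <- isTRUE(all.equal(unname(coefs), unname(tissue_means)))
print(same)

# Z-score each gene's coefficients across tissues, then cluster.
scaled_coefs <- t(scale(t(coefs)))
km <- kmeans(scaled_coefs, centers = 2, nstart = 25)
table(km$cluster)
```

If the model has more terms (batch, time point, interactions), the coefficients are no longer simple tissue means and the two routes diverge. In that case it matters which quantities are scaled and clustered: coefficients, fitted values, or something else.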
My questions:
- Would you use either of these techniques, or are more sophisticated methods necessary?
- If I should be using linear models, is it kosher to build a linear model of the z-score-scaled linear models per cluster (just for the purpose of visualisation)?
Thanks in advance for any advice,