I am trying to understand gene-based clustering and what strategy can be used to see which genes fall into the same cluster based on their expression profiles.
I have RNA-seq data that has been analyzed with the DESeq2 package for differentially expressed genes. I chose the rlog transformation, as advised in "the RNA-seq workflow" (version from 11 April 2018), because the dataset is small (3 KOs vs 3 WTs). I'd like to explore my data further and use gene-based clustering to see which genes fall into the same cluster. Section 6.3 of "the RNA-seq workflow" suggests visualizing the 20 genes with the highest variance across samples.
- I'd like to cluster a list of 4000 genes, so I am wondering whether it is acceptable to use the same strategy.
- Also, I am trying to understand how a dramatic difference in signal between the WT and KO samples (please see row two of the dummy example below) would affect the clustering process. Basically, can I use a matrix containing both KO and WT samples as input for clustering, to see which genes fall into the same cluster? KO and WT are different conditions, and comparing them tells us whether expression changes in the KO. Therefore, I am a bit confused about how this would affect the clustering.
- Or, alternatively, could the logFC values (from the DESeq2 results) be used as input for clustering, to see which genes have similar expression profiles?
Fig. dummy example:
KO1   KO2   KO3   WT1   WT2   WT3
2.2   3.2   3.8   4     4.5   4.6
1.7   1.8   1.4   10.3  11.2  10.9
2.2   3.2   3.8   4     4.5   4.6
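To make the question concrete, here is a minimal sketch in plain Python (not DESeq2 code) of how I understand the problem. It uses the three dummy rows above; the function names are my own, and subtracting each gene's mean before clustering is an assumption on my part about what a sensible preprocessing step would be. It compares row one and row two by Euclidean distance, before and after centering, and by Pearson correlation.

```python
from math import sqrt

# Dummy values from the example (columns: KO1 KO2 KO3 WT1 WT2 WT3).
rows = [
    [2.2, 3.2, 3.8, 4.0, 4.5, 4.6],
    [1.7, 1.8, 1.4, 10.3, 11.2, 10.9],
    [2.2, 3.2, 3.8, 4.0, 4.5, 4.6],
]

def center(row):
    """Subtract the row mean, keeping only each sample's deviation from the gene's average."""
    m = sum(row) / len(row)
    return [x - m for x in row]

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation between two expression profiles."""
    ca, cb = center(a), center(b)
    num = sum(x * y for x, y in zip(ca, cb))
    den = sqrt(sum(x * x for x in ca)) * sqrt(sum(y * y for y in cb))
    return num / den

centered = [center(r) for r in rows]

d_raw = euclidean(rows[0], rows[1])               # distance on the values as given
d_centered = euclidean(centered[0], centered[1])  # distance after removing each gene's mean
r12 = pearson(rows[0], rows[1])                   # similarity of the KO-to-WT pattern

print(f"raw distance: {d_raw:.2f}")
print(f"centered distance: {d_centered:.2f}")
print(f"correlation: {r12:.2f}")
```

Rows one and three are identical, so their distance is zero under any of these measures. Row two is far from them by Euclidean distance, yet it is still positively correlated with them because all three rows go up from KO to WT. So whether row two ends up in the same cluster seems to depend on the distance measure, which is exactly the part I am unsure about.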
I guess my confusion is about using both KO and WT samples as input for clustering, and how reliable the results from this kind of matrix are.
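For the logFC alternative, this is a rough sketch of what I have in mind, again in plain Python with the dummy values. The helper `ko_minus_wt` is hypothetical and only takes the difference of the group means, assuming the values are on a log2 scale as I understand rlog output to be; it is not the log2FoldChange that DESeq2 estimates.

```python
# Dummy values from the example (columns: KO1 KO2 KO3 WT1 WT2 WT3).
rows = [
    [2.2, 3.2, 3.8, 4.0, 4.5, 4.6],
    [1.7, 1.8, 1.4, 10.3, 11.2, 10.9],
    [2.2, 3.2, 3.8, 4.0, 4.5, 4.6],
]
N_KO = 3  # the first three columns are KO samples, the rest are WT

def mean(xs):
    return sum(xs) / len(xs)

def ko_minus_wt(row):
    """Rough log2-scale KO vs WT difference; NOT DESeq2's log2FoldChange."""
    return mean(row[:N_KO]) - mean(row[N_KO:])

# One number per gene: the six-sample profile collapses to a single value.
diffs = [ko_minus_wt(r) for r in rows]

# With a single value per gene, grouping genes amounts to ordering them
# and cutting the ordered list somewhere.
order = sorted(range(len(rows)), key=lambda i: diffs[i])

for i in order:
    print(f"row {i + 1}: KO - WT = {diffs[i]:.2f}")
```

With only two conditions this gives one value per gene, so any clustering on it would be one-dimensional and would mostly separate genes by the size and direction of the change, which is why I am not sure it tells me more than the DESeq2 results table already does.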
I would appreciate any help in clarifying these questions. Thank you!