My goal is to compare transcriptomes between different conditions. For example, I knock down (KD) gene A, gene B, and gene C, and I want to know whether the consequence of knocking down gene A is closer to that of gene B or of gene C. My first approach was to compare the CPM of KD gene A control, KD gene A, KD gene B control, KD gene B, and so on. But the result was that KD gene A control and KD gene A were closest to each other. So I think I should account for the effect of the background. Next I compared the log2 fold changes from the DESeq2 results, but then I lose the p-value information. So, what is the best way to compare RNA-seq transcriptomes?
Then you can either:
- Do hierarchical clustering on the log2FC values produced by DESeq2
- Batch-correct the entire expression matrix (using sva::ComBat (see section 7 here) or limma::removeBatchEffect (see page 190 here)) and do hierarchical clustering on the corrected matrix.
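To make the first option concrete, here is a minimal sketch of comparing log2FC profiles with a correlation distance (1 minus Pearson r), which is the kind of distance you could hand to a hierarchical clustering routine. The log2FC numbers and the names `KD_A`, `KD_B`, `KD_C` are made up for illustration; in practice each vector would hold the DESeq2 log2FC values for all genes, in the same gene order.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical log2FC values (each KD vs its own control), same genes in the same order.
log2fc = {
    "KD_A": [1.2, -0.8, 0.1, 2.0, -1.5],
    "KD_B": [1.0, -0.5, 0.3, 1.7, -1.2],
    "KD_C": [-0.9, 0.7, 0.2, -1.1, 1.4],
}

def correlation_distances(profiles):
    """Pairwise distance 1 - r between all profiles, keyed by sorted name pairs."""
    names = sorted(profiles)
    return {
        (a, b): 1 - pearson(profiles[a], profiles[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

dist = correlation_distances(log2fc)

# Which knockdown has the log2FC profile closest to KD_A?
closest_to_a = min(("KD_B", "KD_C"), key=lambda k: dist[("KD_A", k)])
print(closest_to_a, dist)
```

Because each log2FC is already relative to its own control, the background of each experiment is taken out before the knockdowns are compared with each other.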
Btw, for a global comparison of which conditions are more or less similar, I would not use p-values (or only the significant features) but rather the entire transcriptome.
Your question is basically: which change between KD and control is closest across the knocked-down genes, right? In that case you first need to measure the change within each group (KD vs its control), and then compare those changes across genes. Clustering log2FC is okay, I guess, but I think it will not show any direct relationship, because up- or down-regulation of two genes can be caused by many things.
I think calculating the correlation between the expression of the two genes is better. Calculate it on normalized expression, e.g. from the cpm function in edgeR or the VST from DESeq2.
Why do I think it is better? The correlation between the expression of two genes basically checks whether gene A is affected by gene B or vice versa. If one gene affects another, it will do so in both the control and the treatment condition. That means there would be an effect of gene A on gene B no matter the condition.
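As a rough sketch of that idea, the snippet below computes plain counts per million (count / library size × 10^6), log-transforms them, and correlates two genes across all samples. The sample names, gene names, and counts are invented for illustration, and this simple CPM does not apply the normalization factors that edgeR can use, so treat it as an approximation rather than a replacement for the real functions.

```python
import math

def cpm(counts_by_sample):
    """Plain counts per million: count / library size * 1e6, per sample."""
    result = {}
    for sample, counts in counts_by_sample.items():
        library_size = sum(counts.values())
        result[sample] = {g: c / library_size * 1e6 for g, c in counts.items()}
    return result

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical raw counts: controls and knockdowns pooled together.
raw_counts = {
    "ctrl_1": {"geneA": 10, "geneB": 20, "other": 970},
    "ctrl_2": {"geneA": 20, "geneB": 40, "other": 940},
    "kd_1": {"geneA": 40, "geneB": 80, "other": 880},
    "kd_2": {"geneA": 80, "geneB": 160, "other": 760},
}

cpm_values = cpm(raw_counts)
samples = sorted(cpm_values)

# log2(CPM + 1) keeps highly expressed samples from dominating the correlation.
gene_a = [math.log2(cpm_values[s]["geneA"] + 1) for s in samples]
gene_b = [math.log2(cpm_values[s]["geneB"] + 1) for s in samples]

r = pearson(gene_a, gene_b)
print(r)
```

With real data you would want many more samples than this, since a correlation over a handful of points is very unstable.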