Hi, I have two questions regarding the analysis of total RNA-seq data: (1) I know the classical goal of an RNA-seq experiment is to identify differentially expressed (DE) genes between two conditions (e.g., case vs. control). However, is there a way to predict the class (i.e., case vs. control) from the expression of the genes? For instance, given the expression of a set of genes, classify the sample as case or control.
(2) Is there a way to perform feature selection on DESeq2-normalized read counts, so that the selected genes can be used in the downstream analysis mentioned in (1)? In other words, instead of running a differential expression analysis, could I include a select set of genes based on some criterion such as an expression threshold?
Your responses are highly appreciated. Thanks, Bhumi
A common feature selection method independent of DE testing is to rank genes by their variance (rowVars), computed on normalized expression values on the log scale or after a transformation such as vst or rlog (from DESeq2). You could also model the variance and then select the genes that deviate significantly from that model. With these genes you could then perform clustering or whatever classification method you have in mind. There are a great many methods available for classifying RNA-seq data; I suggest diving into the literature.

Thank you for your response. I really appreciate it. I was thinking about using gene variance as the selection criterion. The reason I posted here is that most of the literature focuses on DE, and I could not find many studies that classify samples using clustering algorithms. Even if I use dimensionality reduction, I still won't have the sample labels. Hence, I was curious to ask for any insights. Thanks!
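To make the suggestion above concrete, here is a minimal sketch of variance-based gene selection followed by a simple supervised classifier that does use the case/control labels. Several things in it are assumptions for illustration only: it is written in Python with NumPy rather than R, log2(count + 1) is just a stand-in for vst or rlog (it is not equivalent to either), the count matrix is simulated, the function names are made up, and the nearest-centroid classifier with leave-one-out cross-validation is an arbitrary choice rather than a method recommended in this thread.

```python
import numpy as np


def top_variance_genes(log_expr, n_top):
    """Return row indices of the n_top genes with the highest variance across samples."""
    variances = log_expr.var(axis=1, ddof=1)  # per-gene (row-wise) sample variance
    return np.argsort(variances)[::-1][:n_top]


def nearest_centroid_predict(train_x, train_y, test_x):
    """Assign each test sample (column) to the class with the closest mean profile."""
    classes = np.unique(train_y)
    # One centroid per class: shape (n_genes, n_classes)
    centroids = np.stack(
        [train_x[:, train_y == c].mean(axis=1) for c in classes], axis=1
    )
    # Euclidean distances between every centroid and every test sample:
    # shape (n_classes, n_test)
    dists = np.linalg.norm(centroids[:, :, None] - test_x[:, None, :], axis=0)
    return classes[np.argmin(dists, axis=0)]


def leave_one_out_accuracy(log_expr, labels, n_top):
    """Leave-one-out accuracy; genes are re-selected inside each fold."""
    labels = np.asarray(labels)
    n_samples = log_expr.shape[1]
    correct = 0
    for i in range(n_samples):
        train = np.arange(n_samples) != i
        # Select genes using the training samples only
        genes = top_variance_genes(log_expr[:, train], n_top)
        pred = nearest_centroid_predict(
            log_expr[genes][:, train], labels[train], log_expr[genes][:, [i]]
        )
        correct += int(pred[0] == labels[i])
    return correct / n_samples


# Simulated data (assumption): 200 genes x 12 samples, 6 controls then 6 cases.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=50, size=(200, 12)).astype(float)
counts[:10, 6:] *= 8  # the first 10 genes are higher in cases
labels = np.array(["control"] * 6 + ["case"] * 6)

log_expr = np.log2(counts + 1)  # crude stand-in for vst/rlog output

selected = top_variance_genes(log_expr, n_top=10)
accuracy = leave_one_out_accuracy(log_expr, labels, n_top=10)
print("selected genes:", sorted(selected.tolist()))
print("leave-one-out accuracy:", accuracy)
```

The gene selection is repeated inside each cross-validation fold so that the held-out sample never influences which genes are kept. In R, the analogous selection step is usually written with rowVars on the matrix extracted from the vst or rlog result.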