I want to train a binary classifier on cancer RNA-seq data. After differential expression analysis, I selected some of the top DEGs as features for glmnet training, but the performance was quite poor. I think this may be due to tumor heterogeneity.
Therefore I plan to do some feature engineering, for example using the sum of the expression values of the genes in a pathway as a single feature.
I have seen some papers use raw values (but on microarray data) and others use normalized RNA-seq counts, so I am a bit confused. Which values should be summed here: raw counts, TPM, CPM, or log-CPM? Do I need a normalization such as TMM? Or does it depend on the algorithm?
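To make the options concrete, here is a minimal sketch of one way to build pathway-level features from raw counts. It rests on several assumptions that are not from the question: it uses a simplified log2(CPM + prior) on library sizes (not edgeR's exact `cpm(log=TRUE)`, and without TMM scaling), it z-scores each gene before averaging so that highly expressed genes do not dominate the pathway score, and the function names, gene names, and pathway definitions are hypothetical. In a real train/test setup the gene means and standard deviations would need to come from the training samples only.

```python
import numpy as np

def log_cpm(counts, prior_count=1.0):
    """Simplified log2(CPM + prior_count); counts is a genes x samples matrix."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)        # total reads per sample
    cpm = counts / lib_sizes * 1e6        # counts per million
    return np.log2(cpm + prior_count)

def pathway_scores(expr, gene_names, pathways):
    """Mean per-gene z-score for each pathway; returns (names, pathways x samples)."""
    mu = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0                     # constant genes get z = 0
    z = (expr - mu) / sd
    index = {g: i for i, g in enumerate(gene_names)}
    names, rows = [], []
    for name in sorted(pathways):
        members = [index[g] for g in pathways[name] if g in index]
        if not members:                   # skip pathways with no measured genes
            continue
        names.append(name)
        rows.append(z[members].mean(axis=0))
    return names, np.vstack(rows)

# Toy example: 4 genes x 3 samples, two made-up pathways.
toy_counts = np.array([[10, 20, 30],
                       [0, 5, 10],
                       [100, 200, 300],
                       [50, 40, 30]])
toy_genes = ["A", "B", "C", "D"]
toy_pathways = {"p1": ["A", "B"], "p2": ["C", "D", "X"]}  # "X" is not measured

toy_expr = log_cpm(toy_counts)
toy_names, toy_scores = pathway_scores(toy_expr, toy_genes, toy_pathways)
print(toy_names, toy_scores.shape)
```

Summing raw counts directly would let library size and a few highly expressed genes drive the feature, which is why this sketch works on a log scale after library-size normalization. Whether a mean of z-scores, a plain sum of log-CPM, or a dedicated method is best is something to check on your own data.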
Please note that selecting your genes with a DE analysis and then performing machine learning on them is a common mistake and a widely refuted approach in the ML community. Please use all of your genes in your classifier, without selection. For more detail, see (Ambroise and McLachlan 2002): https://doi.org/10.1073/pnas.102102699
Well, I can't draw that conclusion from this paper. It only says that it is important to correct for selection bias and that the test data should be independent of the gene selection procedure. I think the problem they describe arises because the sample sizes of datasets were often not large enough back in the early 2000s.
Actually, in my experience an ML model that uses all genes usually performs much worse than one that uses gene selection.
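To illustrate what keeping the test data independent of gene selection can look like in practice, here is a rough sketch in which the selection step is repeated inside every cross-validation fold. Everything in it is an assumption made for illustration: the function names are hypothetical, genes are ranked with a Welch t-statistic instead of a full DE analysis, and a nearest-centroid rule stands in for glmnet to keep the example NumPy-only.

```python
import numpy as np

def t_scores(X, y):
    """Absolute Welch t-statistic per gene; X is samples x genes, y is 0/1."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)

def cv_accuracy(X, y, k_genes=20, n_folds=5, seed=0):
    """K-fold CV accuracy with gene selection redone on each training fold."""
    rng = np.random.default_rng(seed)
    # Plain random folds (not stratified); assumes each training fold
    # still contains at least two samples of both classes.
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    correct = 0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        X_tr, y_tr = X[train], y[train]
        # Select genes using the training samples only.
        top = np.argsort(t_scores(X_tr, y_tr))[-k_genes:]
        # Nearest-centroid classifier on the selected genes.
        c0 = X_tr[y_tr == 0][:, top].mean(axis=0)
        c1 = X_tr[y_tr == 1][:, top].mean(axis=0)
        X_te = X[test_idx][:, top]
        d0 = np.linalg.norm(X_te - c0, axis=1)
        d1 = np.linalg.norm(X_te - c1, axis=1)
        pred = (d1 < d0).astype(int)
        correct += int((pred == y[test_idx]).sum())
    return correct / len(y)

# Pure-noise data: labels carry no signal, so an unbiased estimate
# is expected to hover around chance level.
demo_rng = np.random.default_rng(1)
X_noise = demo_rng.normal(size=(40, 1000))
y_demo = np.array([0, 1] * 20)
print(cv_accuracy(X_noise, y_demo))
```

If the top genes were instead picked once on all 40 samples before cross-validation, the same noise data would typically give a much more optimistic accuracy. That gap is the selection bias the paper describes, and it is a property of where the selection happens, not of gene selection as such.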