I want to train a binary classifier on cancer RNA-seq data. After differential expression analysis, I selected some of the top DEGs as features for glmnet training, but the performance was quite poor. I think this may be due to tumor heterogeneity.
Therefore, I plan to do some feature engineering, such as using the sum of the expression values of all genes in a pathway as a single feature.
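To make the idea concrete, here is a minimal sketch of what I mean by a pathway-sum feature. Everything in it is made up for illustration: the gene names, the pathway membership, the expression values, and the helper `pathway_sum_features` are all hypothetical, and the sketch deliberately does not say what unit the expression values are in.

```python
# Hypothetical toy data: for each sample, a dict of gene -> expression value.
# The unit of these values (raw counts, CPM, ...) is intentionally unspecified.
samples = {
    "sample_1": {"GENE_A": 10.0, "GENE_B": 0.0, "GENE_C": 5.0, "GENE_D": 1.0},
    "sample_2": {"GENE_A": 2.0, "GENE_B": 3.0, "GENE_C": 0.0, "GENE_D": 4.0},
}

# Hypothetical pathway membership (a gene may belong to several pathways).
pathways = {
    "PATHWAY_1": ["GENE_A", "GENE_B"],
    "PATHWAY_2": ["GENE_B", "GENE_C", "GENE_D"],
}


def pathway_sum_features(samples, pathways):
    """Return one feature per pathway per sample: the sum over member genes.

    Assumption: a pathway gene missing from a sample contributes 0.
    """
    features = {}
    for sample_id, expr in samples.items():
        features[sample_id] = {
            name: sum(expr.get(gene, 0.0) for gene in members)
            for name, members in pathways.items()
        }
    return features


features = pathway_sum_features(samples, pathways)
for sample_id, row in features.items():
    print(sample_id, row)
```

The resulting per-sample pathway features would then replace (or supplement) the individual DEGs as input to the classifier.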
I have seen some papers use raw counts (though on microarray data) and others use normalized RNA-seq counts. I am a bit confused: which kind of values should be summed here? Raw counts, TPM, CPM, or log-CPM? Is a normalization such as TMM needed? Or does it depend on the algorithm?
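To show why the choice matters to me, here is a small sketch comparing some of the candidates for a single sample. The counts and the pathway membership are hypothetical, CPM is computed as count / library size × 10^6, and the log-CPM here uses a plain pseudocount of 1, which is my simplification and not necessarily what edgeR or limma do. TMM and TPM are not covered by this sketch.

```python
import math


def cpm(counts):
    """Counts per million for one sample: count / library size * 1e6."""
    lib_size = sum(counts)
    if lib_size == 0:
        raise ValueError("library size is zero")
    return [c / lib_size * 1e6 for c in counts]


def log_cpm(counts, pseudocount=1.0):
    """log2(CPM + pseudocount); the pseudocount of 1 is an arbitrary choice."""
    return [math.log2(x + pseudocount) for x in cpm(counts)]


# Hypothetical raw counts for four genes in one sample.
raw = [100, 300, 600, 0]
# Hypothetical pathway made of the first two genes.
pathway_idx = [0, 1]

sum_raw = sum(raw[i] for i in pathway_idx)
sum_cpm = sum(cpm(raw)[i] for i in pathway_idx)
sum_log_cpm = sum(log_cpm(raw)[i] for i in pathway_idx)
log_of_sum_cpm = math.log2(sum_cpm + 1.0)

print("sum of raw counts:", sum_raw)
print("sum of CPM:", sum_cpm)
print("sum of log-CPM:", sum_log_cpm)
print("log of summed CPM:", log_of_sum_cpm)
```

Summing log-CPM values is not the same as taking the log of summed CPM (the former is the log of a product rather than of a sum), so these options would give the classifier differently scaled features.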