Which RNA-seq unit should be used in machine learning models?
4.2 years ago
darklings ▴ 480

I want to train a binary classifier using cancer RNA-seq data. After differential expression analysis, I selected some of the top DEGs as features for glmnet training. However, the performance was quite poor, which I think may be due to tumor heterogeneity.

Therefore I plan to do some feature engineering, such as using the sum of the expression values of the genes in one pathway as a single feature.

I saw that some papers used raw counts (though for microarray data) and some used normalized RNA-Seq counts. I am a bit confused: what kind of values should be added up here? Raw counts? TPM, CPM or log-CPM? Do I need a normalization such as TMM? Or does it depend on the algorithm?

RNA-Seq machine learning counts • 3.5k views

Please note that selecting genes by DE analysis and then performing machine learning on them is a common mistake, and an approach widely refuted in the ML community. Please use all your genes in your classifier, without selection. For more detail, see (Ambroise et al. 2002): https://doi.org/10.1073/pnas.102102699


Well, I can't draw your conclusion from this paper. It just says that it is important to correct for selection bias and that the test data should be independent of the gene selection procedure. I think the problem they describe arises because dataset sample sizes were often not large enough back in the early 2000s.

In my experience, the performance of an ML model using all genes is usually much worse than that of a model using gene selection.
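
To make the point about keeping test data independent of gene selection concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the gene names and values are made up, the helper names are hypothetical, and a simple difference in class means stands in for a real DE test. The key idea is that genes are re-selected inside every fold using the training samples only.

```python
def top_k_genes(expr, labels, sample_ids, k):
    """Rank genes by absolute difference in class means, using only sample_ids.

    A plain mean difference is an assumption standing in for a proper DE test.
    Both classes must be present among sample_ids.
    """
    scores = {}
    for gene, values in expr.items():
        pos = [values[i] for i in sample_ids if labels[i] == 1]
        neg = [values[i] for i in sample_ids if labels[i] == 0]
        scores[gene] = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(scores, key=scores.get, reverse=True)[:k]


def folds(n_samples, n_folds):
    """Yield (train_ids, test_ids), assigning samples to folds round-robin."""
    for f in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == f]
        train = [i for i in range(n_samples) if i % n_folds != f]
        yield train, test


# Hypothetical toy data: gene -> log-expression per sample, plus class labels.
expr = {
    "geneA": [5.0, 1.0, 6.0, 1.5, 5.5, 0.5],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "geneC": [1.0, 4.0, 0.5, 4.5, 1.5, 3.5],
}
labels = [1, 0, 1, 0, 1, 0]

selected_per_fold = []
for train_ids, test_ids in folds(len(labels), n_folds=3):
    # Selection sees the training samples only; test_ids stay untouched
    # until the model fitted on these genes is evaluated.
    genes = top_k_genes(expr, labels, train_ids, k=2)
    selected_per_fold.append((genes, test_ids))
```

Selecting genes once on the full dataset and then cross-validating would let the test samples influence the features, which is the selection bias discussed above.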

3.6 years ago
darklings ▴ 480

I also found this post: Difference between CPM and TPM and which one for downstream analysis?

The answers say TPM is better than CPM because TPM also accounts for differences in transcript length and so carries more information, but that better still are values normalized and transformed with vst or rlog.
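
For reference, the difference between the two units can be written out in a few lines. This is a minimal sketch with made-up gene names, counts and lengths; the function names are just for illustration. CPM scales counts by library size only, while TPM first divides each count by the transcript length and then scales to one million.

```python
def cpm(counts):
    """Counts per million: scale each gene's count by the library size."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total_rate = sum(rate.values())
    return {gene: r / total_rate * 1e6 for gene, r in rate.items()}


# Hypothetical toy sample (all values are assumptions for illustration).
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_kb = {"geneA": 1.0, "geneB": 3.0, "geneC": 2.0}

sample_cpm = cpm(counts)
sample_tpm = tpm(counts, lengths_kb)
```

Here geneB has three times the count of geneA but is also three times as long, so the two end up with the same TPM while their CPM values differ. Neither unit includes a between-sample normalization such as TMM.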


In a library prep that is biased towards one end of the transcript, transcript length won't affect the number of read counts you get, so correcting for it would be wrong.

4.2 years ago
darklings ▴ 480

I found the posts:

FPKM or rlog from DEseq2 for machine learning analysis

Log expression for machine learning input

Cross-sample normalized log counts were recommended for downstream machine learning techniques. However, I need to add up the counts of sets of genes as features. Maybe I can add up the TPM values within each sample and then do the between-sample normalization, rather than just summing the log values?
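
A minimal sketch of that idea, assuming TPM values per sample are already available: sum the linear-scale TPM of the genes in each pathway, then log-transform the sum. The gene names, pathway membership and the function name are made up for illustration, and the between-sample normalization step is not shown.

```python
import math


def pathway_features(sample_tpm, pathways):
    """Sum linear-scale TPM per pathway, then apply log2(x + 1) to the sum."""
    features = {}
    for name, genes in pathways.items():
        # Genes missing from the sample are treated as zero (an assumption).
        total = sum(sample_tpm.get(gene, 0.0) for gene in genes)
        features[name] = math.log2(total + 1.0)
    return features


# Hypothetical toy sample and pathway definitions.
sample_tpm = {"geneA": 3.0, "geneB": 4.0, "geneC": 15.0}
pathways = {
    "pathway1": ["geneA", "geneB"],
    "pathway2": ["geneC", "geneD"],  # geneD is absent from this sample
}

features = pathway_features(sample_tpm, pathways)
```

Summing the log values instead would correspond to the log of a product of expression values rather than the log of their sum, so the two choices give different features.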


Agreed: use something that is normalized for feature length and sequencing depth and log-transformed (to be approximately normally distributed). Furthermore, a lot of ML methods require scaled data (both centered and scaled to unit variance).

Please also note that most ML methods cannot handle highly correlated data (which RNA-Seq data is), so you need to either use a method that can handle it, remove the correlated features prior to analysis, or transform the data.
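
The centering and variance scaling mentioned in this reply could be sketched as follows for a single feature. The values and function names are hypothetical. Estimating the mean and standard deviation on the training samples only is an added assumption here, made so that the test samples do not leak into the preprocessing.

```python
import statistics


def fit_scaler(train_values):
    """Estimate mean and sample standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    sd = statistics.stdev(train_values)
    return mean, sd


def apply_scaler(values, mean, sd):
    """Center and scale; a constant feature (sd == 0) is mapped to zeros."""
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Hypothetical log-expression values of one feature.
train = [2.0, 4.0, 6.0]
test = [4.0, 8.0]

mean, sd = fit_scaler(train)
train_z = apply_scaler(train, mean, sd)
test_z = apply_scaler(test, mean, sd)
```

The same training-set parameters are reused for the test samples, so a test value can fall outside the range seen during training.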


Thanks for the reminder, but what should be removed beforehand for highly correlated data?


For highly correlated genes you should keep only one of them, but I would suggest transformation instead.
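
One way to read "keep only one of them" is a greedy filter on pairwise Pearson correlation. This is only a sketch: the 0.9 threshold, the gene names and the values are assumptions, and which gene of a correlated pair survives depends on the input order.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def drop_correlated(expr, threshold=0.9):
    """Keep a gene only if |r| with every already-kept gene is <= threshold."""
    kept = []
    for gene, values in expr.items():
        if all(abs(pearson(values, expr[k])) <= threshold for k in kept):
            kept.append(gene)
    return kept


# Hypothetical toy data: gene -> expression across four samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 4.0, 6.2, 7.9],  # nearly proportional to geneA
    "geneC": [4.0, 1.0, 3.0, 2.0],
}

kept = drop_correlated(expr)
```

With many genes the pairwise comparison becomes expensive, and as with gene selection the filter would have to be fitted on training samples only.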