Question: What kind of read counts of RNA-seq should be used in machine learning models?
5 months ago, magicpants (60) wrote:

I want to train a binary classifier using cancer RNA-seq data. After differential expression analysis, I selected some of the top DEGs as features for glmnet training, but the performance was quite poor. I think this may be due to tumor heterogeneity.

Therefore I plan to do some feature engineering, such as using the sum of the expression values of the genes in one pathway as a single feature.

I saw that some papers used raw counts (although on microarray data) and some used normalized RNA-seq counts. I am a bit confused: what kind of counts should be added up here? Raw counts, TPM, CPM, or log-CPM? Do I need a normalization like TMM? Or does it depend on the algorithm?

modified 5 months ago • written 5 months ago by magicpants (60)
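For reference, the units named in the question can be sketched as follows. This is a minimal illustration, not a replacement for edgeR or DESeq2: the gene names, counts, and lengths are made up, the `prior` of 1 in `log_cpm` is an assumed simplification, and TMM scaling factors are not computed here.

```python
import math

def cpm(counts):
    """Counts per million: corrects for sequencing depth only."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}

def tpm(counts, lengths_kb):
    """Transcripts per million: corrects for gene length first, then depth."""
    rate = {gene: c / lengths_kb[gene] for gene, c in counts.items()}
    total = sum(rate.values())
    return {gene: r / total * 1e6 for gene, r in rate.items()}

def log_cpm(counts, prior=1.0):
    """log2(CPM + prior); the prior (an assumed value) keeps zero counts finite."""
    return {gene: math.log2(v + prior) for gene, v in cpm(counts).items()}

# Toy sample: hypothetical gene names, raw counts, and gene lengths in kilobases.
sample_counts = {"geneA": 500, "geneB": 1500, "geneC": 0}
gene_lengths_kb = {"geneA": 1.0, "geneB": 3.0, "geneC": 2.0}

sample_cpm = cpm(sample_counts)
sample_tpm = tpm(sample_counts, gene_lengths_kb)
sample_log_cpm = log_cpm(sample_counts)
```

In this toy sample geneB has three times the raw count of geneA but is also three times as long, so the two genes get the same TPM while their CPM values differ.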
5 months ago, magicpants (60) wrote:

I found these related posts:

FPKM or rlog from DEseq2 for machine learning analysis

Log expression for machine learning input

Cross-sample normalized log counts were recommended there for downstream machine learning. However, I need to add up the counts of sets of genes to form my features. Maybe I can sum the TPM values within each sample and then do the between-sample normalization, rather than just summing the log values?

modified 5 months ago • written 5 months ago by magicpants (60)
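The difference between the two options in this post can be seen in a small sketch. Summing log values is the log of a product, not of a sum, so two samples with the same total pathway expression can get different scores. The pathway, gene names, TPM values, and the pseudocount of 1 below are all hypothetical.

```python
import math

def pathway_score(tpm_by_gene, pathway_genes, pseudocount=1.0):
    """Sum TPM over the pathway's genes on the linear scale, then take log2."""
    total = sum(tpm_by_gene.get(g, 0.0) for g in pathway_genes)
    return math.log2(total + pseudocount)

def sum_of_logs(tpm_by_gene, pathway_genes, pseudocount=1.0):
    """Sum of per-gene log2 values, shown only for comparison."""
    return sum(math.log2(tpm_by_gene.get(g, 0.0) + pseudocount)
               for g in pathway_genes)

# Hypothetical two-gene pathway; both samples have a total of 14 TPM.
pathway = ["geneA", "geneB"]
sample_1 = {"geneA": 7.0, "geneB": 7.0}   # expression split evenly
sample_2 = {"geneA": 14.0, "geneB": 0.0}  # expression in one gene only

linear_then_log = [pathway_score(s, pathway) for s in (sample_1, sample_2)]
logs_then_sum = [sum_of_logs(s, pathway) for s in (sample_1, sample_2)]
```

Here `linear_then_log` is identical for both samples, while `logs_then_sum` is not. Whether a further between-sample normalization is needed after summing depends on the data and is not settled by this sketch.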

Agreed - use something that is normalized for feature length and sequencing depth and then log-transformed (so that it is approximately normally distributed). Furthermore, many ML methods require scaled data (both centered and scaled to unit variance).

Please also note that most ML methods cannot handle highly correlated data (which RNA-seq data is), so you need to either use a method that can handle it, remove the correlated features prior to analysis, or transform the data.

modified 5 months ago • written 5 months ago by kristoffer.vittingseerup (1.7k)
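The scaling mentioned in this comment (centering each feature and dividing by its standard deviation) might look like the sketch below. Estimating the mean and standard deviation on the training samples only, and reusing them for held-out samples, is an assumption added here as common practice to avoid information leaking from the test set; the function names and data are illustrative.

```python
import statistics

def fit_scaler(train_rows):
    """Per-feature mean and standard deviation, estimated on training samples only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant feature has zero spread; divide by 1 instead to avoid errors.
    sds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, sds

def transform(rows, means, sds):
    """Center and scale each feature using previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

# Rows are samples, columns are features (for example log-expression values).
train = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
test = [[3.0, 12.0]]

means, sds = fit_scaler(train)
train_scaled = transform(train, means, sds)
test_scaled = transform(test, means, sds)
```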

Thanks for the reminder, but what exactly should be removed beforehand when the data are highly correlated?

written 5 months ago by magicpants (60)

For a group of highly correlated genes you should keep only one of them - but I would suggest transformation instead.

written 5 months ago by kristoffer.vittingseerup (1.7k)
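One way to "keep only one" gene from each highly correlated group is a greedy filter like the sketch below. The threshold of 0.9, the choice of Pearson correlation, and the gene names are assumptions for illustration; the thread does not specify them, nor which transformation was meant as the alternative.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant feature: treated as uncorrelated here
    return cov / math.sqrt(vx * vy)

def drop_correlated(features, threshold=0.9):
    """Greedy filter: keep a feature only if its |r| with every
    already-kept feature is at or below the threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical expression values across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],  # nearly proportional to geneA
    "geneC": [5.0, 1.0, 4.0, 2.0],
}
selected = drop_correlated(expression)
```

Because the filter is greedy, which gene survives from a correlated group depends on the order of the features, so the result is one valid choice rather than a unique answer.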