Question: What kind of unit of RNA-seq should be used in machine learning models?
magicpants60 wrote:

I want to train a binary classifier on cancer RNA-seq data. After differential expression analysis, I selected some of the top DEGs as features for glmnet training, but performance was quite poor. I suspect this is due to tumor heterogeneity.

Therefore I plan to do some feature engineering, for example using the sum of the expression values of all genes in one pathway as a single feature.

I have seen some papers use raw counts (though on microarray data) and others use normalized RNA-seq counts. I am a bit confused: which values should be summed here? Raw counts, TPM, CPM, or log-CPM? Is a normalization such as TMM needed, or does it depend on the algorithm?
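For reference, a minimal sketch of how the units named above relate to raw counts. It assumes a genes x samples count matrix and a vector of gene lengths in base pairs; the function names and the toy numbers are made up for illustration, and the log-CPM here uses a plain pseudocount rather than any particular package's formula. TMM is not shown.

```python
import numpy as np

def cpm(counts):
    """Counts per million: scale each sample (column) to a library size of 1e6."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

def tpm(counts, lengths_bp):
    """Transcripts per million: divide by gene length first, then scale each sample to 1e6."""
    counts = np.asarray(counts, dtype=float)
    rate = counts / np.asarray(lengths_bp, dtype=float)[:, None]  # reads per base
    return rate / rate.sum(axis=0, keepdims=True) * 1e6

def log_cpm(counts, prior_count=1.0):
    """log2(CPM + prior_count): a simple pseudocount variant (an assumption of this sketch)."""
    return np.log2(cpm(counts) + prior_count)

# Toy matrix: 3 genes (rows) x 2 samples (columns); sample 2 is sequenced twice as deep.
counts = np.array([[10, 20],
                   [30, 60],
                   [60, 120]])
lengths_bp = np.array([1000, 2000, 3000])

print(cpm(counts))             # identical columns: depth has been removed
print(tpm(counts, lengths_bp)) # additionally corrected for gene length
print(log_cpm(counts))
```

CPM and TPM are linear-scale values, so they can be added across genes; log-CPM values cannot be added in the same sense.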

modified 5 weeks ago • written 9 months ago by magicpants60

Please note that selecting your genes by DE analysis and then performing machine learning on them is a common mistake, and the approach is widely criticized in the ML community. Please use all your genes in your classifier, without selection. For more detail, see Ambroise et al. (2002): https://doi.org/10.1073/pnas.102102699

written 3 months ago by amin8w0

Well, I can't draw that conclusion from this paper. It only says that it is important to correct for selection bias and that the test data should be independent of the gene selection procedure. I think the problem they describe arose because dataset sample sizes were often not large enough back in the early 2000s.
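To make the point about independence concrete, here is a small sketch that repeats gene selection inside each cross-validation fold, so the held-out samples never influence which genes are picked. Everything in it is an assumption for illustration: the standardized mean difference stands in for a real DE analysis, a nearest-centroid rule stands in for glmnet, and the data are pure random noise.

```python
import numpy as np

def top_k_genes(X, y, k):
    """Rank genes by |mean difference| / pooled within-class std; return the top k indices."""
    m1, m0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    pooled = np.sqrt((X[y == 1].var(axis=0) + X[y == 0].var(axis=0)) / 2) + 1e-8
    return np.argsort(-np.abs(m1 - m0) / pooled)[:k]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the closer training centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def cv_accuracy(X, y, k=10, n_folds=5, select_inside=True, seed=0):
    """Cross-validated accuracy with selection done per fold (honest) or once on all data (leaky)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    if not select_inside:
        leaky_genes = top_k_genes(X, y, k)  # uses the labels of future test samples
    correct = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        if select_inside:
            genes = top_k_genes(X[train_idx], y[train_idx], k)  # training fold only
        else:
            genes = leaky_genes
        pred = nearest_centroid_predict(
            X[train_idx][:, genes], y[train_idx], X[test_idx][:, genes])
        correct += (pred == y[test_idx]).sum()
    return correct / len(y)

# Pure-noise data (samples x genes): the labels carry no signal at all.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2000))
y = np.array([0, 1] * 20)

biased = cv_accuracy(X, y, select_inside=False)
honest = cv_accuracy(X, y, select_inside=True)
print("selection on all samples:", biased)
print("selection inside each fold:", honest)
```

On noise like this, the estimate with selection done once on all samples tends to look much better than chance, while the per-fold version should stay near 0.5. That gap is the selection bias, and it says nothing against gene selection itself as long as it is done inside the resampling loop.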

Actually, in my experience an ML model that uses all genes usually performs much worse than one that uses gene selection.

modified 5 weeks ago • written 5 weeks ago by magicpants60
magicpants60 wrote:

I also found this post: Difference between CPM and TPM and which one for downstream analysis?

The answers say TPM is better than CPM because TPM accounts for transcript length differences and therefore carries more information, but that values normalized and transformed with vst or rlog are better still.

modified 5 weeks ago • written 5 weeks ago by magicpants60

In a library prep that is biased towards one end of the transcript, transcript length won't affect the number of reads you get, so correcting for it would be wrong.

written 5 weeks ago by swbarnes26.1k
magicpants60 wrote:

I found these posts:

FPKM or rlog from DEseq2 for machine learning analysis

Log expression for machine learning input

Cross-sample normalized log counts were recommended there for downstream machine learning. However, I need to add up the counts of gene sets to build my features. Maybe I can sum the TPM values within each sample and then do the between-sample normalization, rather than just summing the log values?
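A minimal sketch of that idea, summing on the linear scale and taking the log afterwards. Note that a sum of log values is the log of the product of the expression values, not the log of the pathway total, so the two are different features. The function name, the pseudocount, and the gene sets below are hypothetical, and any further between-sample scaling is left out.

```python
import numpy as np

def pathway_features(tpm, gene_names, pathways, pseudocount=1.0):
    """Sum linear-scale TPM over each pathway's genes, then log2-transform the sums.

    tpm:       genes x samples array on the linear (not log) scale
    pathways:  dict mapping pathway name -> list of gene names (hypothetical sets)
    Returns a samples x pathways array and the list of pathway names used.
    """
    tpm = np.asarray(tpm, dtype=float)
    row_of = {g: i for i, g in enumerate(gene_names)}
    names, columns = [], []
    for name, genes in pathways.items():
        rows = [row_of[g] for g in genes if g in row_of]  # skip genes absent from the matrix
        if not rows:
            continue  # no measured gene in this pathway
        names.append(name)
        columns.append(np.log2(tpm[rows].sum(axis=0) + pseudocount))
    return np.column_stack(columns), names

# Toy data: 3 genes x 2 samples; gene "X" is not in the matrix and is ignored.
tpm = np.array([[1.0, 3.0],
                [2.0, 4.0],
                [4.0, 8.0]])
gene_names = ["A", "B", "C"]
pathways = {"P1": ["A", "B"], "P2": ["C", "X"]}

features, names = pathway_features(tpm, gene_names, pathways)
print(names)
print(features)
```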

modified 9 months ago • written 9 months ago by magicpants60

Agreed: use something that is normalized for feature length and sequencing depth, and log transformed (to be approximately normally distributed). Furthermore, many ML methods require scaled data (both centered and scaled to unit variance).
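As a sketch of the last two steps (log transform, then centering and variance scaling), assuming the input is already depth-normalized and arranged as samples x genes. The helper names and numbers are made up; the one design choice worth noting is that the mean and standard deviation are estimated on the training samples only and then reused for new samples.

```python
import numpy as np

def fit_scaler(X_train):
    """Per-gene mean and standard deviation, estimated on training samples only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant genes unscaled instead of dividing by 0
    return mean, std

def apply_scaler(X, mean, std):
    """Center and scale with previously fitted statistics."""
    return (X - mean) / std

# Toy depth-normalized values (e.g. CPM), samples x genes; numbers are made up.
train = np.array([[0.0, 10.0],
                  [3.0, 30.0],
                  [15.0, 50.0]])
test = np.array([[7.0, 20.0]])

log_train = np.log2(train + 1)  # pseudocount of 1 avoids log2(0)
log_test = np.log2(test + 1)

mean, std = fit_scaler(log_train)
z_train = apply_scaler(log_train, mean, std)
z_test = apply_scaler(log_test, mean, std)  # test data reuses the training statistics
print(z_train)
print(z_test)
```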

Please also note that most ML methods cannot handle highly correlated data (which RNA-seq data is), so you need to either use a method that can handle it, remove the correlated features prior to analysis, or transform the data.

modified 9 months ago • written 9 months ago by kristoffer.vittingseerup2.2k

Thanks for the reminder, but what exactly should be removed beforehand when the data are highly correlated?

written 9 months ago by magicpants60

For a group of highly correlated genes you should keep only one of them, but I would suggest a transformation instead.
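Both options can be sketched in a few lines. The reply does not name a specific transformation, so PCA is used here purely as one common example; the 0.9 correlation threshold, the function names, and the toy data are likewise assumptions. Input is samples x genes, and constant genes should be dropped first because their correlation is undefined.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Greedy filter: keep a gene only if |Pearson r| with every kept gene is <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, i] <= threshold for i in kept):
            kept.append(j)
    return kept

def pca_scores(X, n_components):
    """Project centered data onto its top principal components (computed via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy data: gene 1 is almost a copy of gene 0; gene 2 is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
X = np.column_stack([a,
                     2 * a + 0.01 * rng.normal(size=50),
                     rng.normal(size=50)])

kept = drop_correlated(X)      # option 1: keep one gene per correlated group
scores = pca_scores(X, 2)      # option 2: replace genes with uncorrelated components
print(kept)
print(scores.shape)
```

The greedy filter depends on column order (the first gene of a correlated group is the one that survives), which is one reason a transformation may be preferable. As with scaling, either step should be fitted on training samples only.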

written 9 months ago by kristoffer.vittingseerup2.2k