Question

RNA-SEQ count data as input for Machine Learning

4.2 years ago

Alberto • 0

Hi! This may be a silly question, but I have been unable to find an answer that works for me. I am working with TCGA data for prostate cancer. I downloaded level 3 data, which consists of raw expression counts for 549 samples. 52 of them were paired (healthy/unhealthy tissue from the same patient). Using edgeR, I was able to construct a list of differentially expressed genes between healthy and unhealthy tissue, hopefully leaving out the "specific sample bias".

Using edgeR, I calculated the log2(counts) for the top 1000 differentially expressed genes and used them as features for the 549 samples (70% for training, 30% for testing). I got pretty good results (around 90% accuracy on the test set). I used k-means, k-NN and random forest approaches.

I wanted to see if these results were applicable to other datasets, so I downloaded expression data from ICGC (200 cancer samples) and GTEx (around another 200 healthy samples). To my surprise, both healthy and unhealthy samples are classified as healthy!

Does anybody have any clue about where my problem might be? I guess I am not normalizing the counts properly and that's why my ML methods don't work in different databases. Thanks everybody in advance!
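For context, here is a minimal sketch of what normalizing raw counts by library size can look like before a log transform. It is not the code used above, and it is a simplified log2(CPM + prior) rather than edgeR's exact `cpm(log=TRUE)` formula. The function name `log2_cpm`, the `prior_count` default, and the toy counts are all assumptions made for illustration.

```python
import math


def log2_cpm(counts, prior_count=1.0):
    """Convert raw counts to log2 counts-per-million, sample by sample.

    counts: list of samples, each a list of raw counts (one value per gene).
    prior_count: small offset added to the CPM so that zero counts stay finite
                 (an illustrative choice, not edgeR's exact prior handling).
    """
    result = []
    for sample in counts:
        library_size = sum(sample)  # total counts in this sample
        if library_size == 0:
            raise ValueError("sample has no counts")
        result.append(
            [math.log2(c / library_size * 1e6 + prior_count) for c in sample]
        )
    return result


# Two hypothetical samples with the same composition but a 10x depth difference.
shallow = [10, 30]
deep = [100, 300]
normalized = log2_cpm([shallow, deep])
```

In this toy case the two samples end up with identical values after normalization, whereas a plain log2 of the raw counts would differ by a constant depth offset. In practice the library size would presumably be computed over all genes in a sample, not only the selected ones.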

machine learning rnaseq normalize • 1.7k views

updated 4.2 years ago by Dunois ★ 2.9k • written 4.2 years ago by Alberto • 0

Have you tried looking at the ML model too? What processing did you do for ICGC and GTEx?

4.2 years ago by davidenoma ▴ 50

score 4 · Answer 1 · 2021-04-08

4.2 years ago

ponganta ▴ 590

Did you conduct feature scaling prior to ML? It's basically a must, because otherwise raw expression strength will be decisive for your classification. Just try z-score normalisation. In R, this can simply be done using base::scale(). Also, I'm not sure whether a log2 transformation will be enough. You could try DESeq2::rlog() instead; it performs a regularised log transform (a modified log2 transformation). Good luck!
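To make the z-score suggestion concrete, here is a minimal sketch in plain Python (standard library only). It computes each gene's mean and sample standard deviation on the training samples and reuses them for new samples, which is one common way to apply the scaling; whether that is the right choice across different databases is a separate question. The function names `fit_gene_scaler` and `apply_gene_scaler` and the toy values are hypothetical.

```python
from statistics import mean, stdev


def fit_gene_scaler(train):
    """Compute per-gene mean and sample standard deviation.

    train: list of samples, each a list of log-expression values (one per gene).
    Needs at least two samples, because stdev() uses the n - 1 denominator
    (the same convention as R's sd(), which scale() relies on by default).
    """
    genes = list(zip(*train))  # transpose: one tuple of values per gene
    means = [mean(g) for g in genes]
    sds = [stdev(g) for g in genes]
    return means, sds


def apply_gene_scaler(samples, means, sds):
    """Z-score each gene using previously fitted means and standard deviations.

    Genes with zero variance in the training data are set to 0.0 here,
    which is an illustrative choice rather than a standard rule.
    """
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in samples
    ]


# Hypothetical log-expression values: 3 training samples x 2 genes.
train = [[2.0, 10.0], [4.0, 14.0], [6.0, 12.0]]
means, sds = fit_gene_scaler(train)
train_scaled = apply_gene_scaler(train, means, sds)

# A new sample is scaled with the training parameters, not its own.
new_scaled = apply_gene_scaler([[8.0, 12.0]], means, sds)
```

After scaling, every gene has mean 0 and standard deviation 1 across the training samples, so a highly expressed gene no longer outweighs a weakly expressed one in distance-based methods such as k-NN or k-means.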