Question

Classification using publicly available RNA-seq dataset

1

5.2 years ago

bioinfraML ▴ 10

Does it make sense to use publicly available RNA-Seq data from GEO to train a machine learning model that classifies subjects into cases and controls, and then use that model to predict cases and controls in a completely different dataset? For instance, let's say I have a coronary artery disease RNA-Seq dataset whose subjects I want to classify as NASH or non-NASH. Can I use a NASH-related GEO dataset to train a Random Forest classifier and then test it on my coronary artery disease RNA-Seq data?

RNA-Seq machine learning classification • 1.7k views

updated 5.2 years ago by i.sudbery 19k • written 5.2 years ago by bioinfraML ▴ 10

score 1 · Answer 1 · 2019-01-21

1

5.2 years ago

i.sudbery 19k

We have successfully used RNA-seq data from one large consortium to train a classifier, which we then used to classify samples from another consortium. This worked pretty well: where we have a good idea of which class a sample should fall in, it generally does, and where a sample doesn't fall where we expected, we've used that to identify novel biology.

A word of warning, though: we had to reprocess the data from one of the sources to match the precise processing pipeline used for the other source. For a large dataset this is not a trivial undertaking.
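To illustrate why the processing has to match, here is a minimal sketch of a much lighter harmonisation step: keep only the genes present in both datasets and replace each sample's values with within-sample ranks, which do not depend on library size or on any monotone scaling a pipeline applied. This is an assumption for illustration, not what we did (we reprocessed from raw data), and it will not remove differences in alignment or annotation. The function names and the toy counts are hypothetical.

```python
def rank_scale(values):
    """Replace values by their within-sample rank scaled to [0, 1].

    Tied values share the average of the ranks they span.
    """
    n = len(values)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return [r / (n - 1) for r in ranks]


def harmonise(train_samples, test_samples):
    """Return (genes, X_train, X_test) on the genes shared by every sample.

    Each sample is a dict mapping gene identifier -> expression value.
    """
    all_samples = train_samples + test_samples
    shared = set.intersection(*(set(s) for s in all_samples))
    genes = sorted(shared)  # fixed column order for both matrices

    def to_row(sample):
        return rank_scale([sample[g] for g in genes])

    X_train = [to_row(s) for s in train_samples]
    X_test = [to_row(s) for s in test_samples]
    return genes, X_train, X_test


# Toy data (made up): the test sample is sequenced far deeper and has
# an extra gene "E" that the training data lacks.
train_samples = [
    {"A": 10, "B": 200, "C": 50, "D": 5},
    {"A": 20, "B": 100, "C": 400, "D": 1},
]
test_samples = [{"A": 1000, "B": 30000, "C": 5000, "E": 7}]

genes, X_train, X_test = harmonise(train_samples, test_samples)
print(genes)
print(X_train)
print(X_test)
```

The resulting `X_train` and `X_test` share the same columns and scale, so they could be passed to a classifier such as scikit-learn's `RandomForestClassifier`. Whether rank features are good enough for a given pair of datasets is something you would have to check, for example on samples whose class you already know.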

0

Instead of reprocessing, why not use one of the reprocessed data sources such as recount (others also exist)? https://jhubiostatistics.shinyapps.io/recount/