Question: Classification using publicly available RNA-seq dataset
bioinfraML wrote (4 weeks ago):

Does it make sense to use publicly available RNA-seq data from GEO to train a machine learning model that classifies subjects into cases and controls, and then use that model to predict cases and controls in a completely different dataset? For instance, say I have coronary artery disease RNA-seq data in which I want to classify subjects as NASH or non-NASH. Can I train a Random Forest classifier on a NASH-related dataset from GEO and test it on my coronary artery disease RNA-seq data?

i.sudbery (Sheffield, UK) wrote (27 days ago):

We have successfully used RNA-seq data from one large consortium to train a classifier, which we then used to classify samples from another consortium. This worked pretty well: where we have a good idea of which class a sample should fall in, it generally does, and where a sample doesn't fall where we expected, we've used that to identify novel biology.

A word of warning, though: we had to reprocess the data from one of the sources to match the precise processing pipeline used for the other source. For a large dataset this is not a trivial undertaking.
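As an illustration only (this is not the pipeline described above, and it does not replace reprocessing the raw reads), here is a minimal sketch of the downstream bookkeeping involved in training on one cohort and predicting on another: restrict both cohorts to their shared genes, put counts on a comparable scale, then fit on one and predict on the other. Everything here is an assumption made for the example: the gene names and counts are toy values, the helper functions are hypothetical, and a simple nearest-centroid classifier stands in for a Random Forest so the code needs only the Python standard library.

```python
import math
from statistics import mean, pstdev

def log_cpm(counts):
    """Convert one sample's raw counts {gene: count} to log2(CPM + 1)."""
    total = sum(counts.values())
    return {g: math.log2(c / total * 1e6 + 1) for g, c in counts.items()}

def standardise(samples, genes):
    """Z-score each gene within one cohort; returns one feature vector per sample.
    Caveat: this assumes the cohort contains a reasonable mix of both classes."""
    cols = []
    for g in genes:
        vals = [s[g] for s in samples]
        mu, sd = mean(vals), pstdev(vals)
        cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in vals])
    return [list(row) for row in zip(*cols)]

def fit_centroids(X, y):
    """Compute the mean feature vector of each class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, X):
    """Assign each sample to the class with the nearest centroid (Euclidean)."""
    return [min(centroids, key=lambda lab: math.dist(x, centroids[lab])) for x in X]

# Toy training cohort (hypothetical genes and counts), with known labels.
train_counts = [
    {"GENE_A": 900, "GENE_B": 50, "GENE_C": 50},
    {"GENE_A": 800, "GENE_B": 100, "GENE_C": 100},
    {"GENE_A": 100, "GENE_B": 450, "GENE_C": 450},
    {"GENE_A": 200, "GENE_B": 400, "GENE_C": 400},
]
train_labels = ["case", "case", "control", "control"]

# Toy second cohort, quantified against a different gene set (GENE_D, no GENE_C).
test_counts = [
    {"GENE_A": 1700, "GENE_B": 150, "GENE_D": 150},
    {"GENE_A": 300, "GENE_B": 900, "GENE_D": 800},
]

# Keep only genes measured in both cohorts, in a fixed order.
shared = sorted(set(train_counts[0]) & set(test_counts[0]))

# Normalise each cohort separately, then train on one and predict on the other.
X_train = standardise([log_cpm(s) for s in train_counts], shared)
X_test = standardise([log_cpm(s) for s in test_counts], shared)
predictions = predict(fit_centroids(X_train, train_labels), X_test)
print(predictions)  # ['case', 'control']
```

Per-cohort standardisation is a crude way to absorb cohort-level shifts and can hide real differences if the two cohorts have very different class balances, which is one reason matching the upstream processing matters.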


colindaven replied (27 days ago):

Instead of reprocessing, why not use one of the uniformly reprocessed data resources, such as recount (others also exist)? https://jhubiostatistics.shinyapps.io/recount/

i.sudbery replied (27 days ago):

Of course, these generally exist only for data in GEO or the public SRA.