Does it make sense to use publicly available RNA-Seq data from GEO to train a machine learning model to classify subjects into cases and controls, and then use that model to predict cases and controls in a completely different dataset? For instance, let's say I have coronary artery disease RNA-Seq data in which I want to classify subjects as NASH/noNASH. Can I use a NASH-related GEO dataset to train a Random Forest classifier and test it on my coronary artery disease RNA-Seq data?
We have successfully used RNA-seq data from one large consortium to train a classifier, which we then used to classify samples from another consortium. This worked pretty well: where we had a good idea of which class a sample should fall in, it generally did, and where a sample didn't fall where we expected, we leveraged that to identify novel biology.
A word of warning, though: we had to reprocess the data from one of the sources so that it matched the precise processing pipeline used for the other source. For a large dataset this is not a trivial undertaking.
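To make the train-on-one-dataset, predict-on-another idea concrete, here is a minimal sketch in Python with scikit-learn. It is not our actual pipeline. Everything in it is an assumption for illustration: the data are simulated, `simulate` and `harmonize` are hypothetical helpers, and the per-dataset log transform and z-scoring are only a crude stand-in for reprocessing both sources through the same pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(50)]


def simulate(n_samples, gene_names, depth):
    """Simulated counts (samples x genes); the first five genes are higher in cases."""
    labels = rng.integers(0, 2, size=n_samples)
    counts = rng.poisson(lam=20 * depth, size=(n_samples, len(gene_names))).astype(float)
    counts[:, :5] += labels[:, None] * 40 * depth
    return pd.DataFrame(counts, columns=gene_names), labels


def harmonize(train_expr, test_expr):
    """Keep shared genes only, then log-transform and z-score within each dataset."""
    shared = sorted(set(train_expr.columns) & set(test_expr.columns))

    def scale(df):
        logged = np.log2(df[shared] + 1.0)
        std = logged.std(axis=0).replace(0, 1.0)  # guard against constant genes
        return (logged - logged.mean(axis=0)) / std

    return scale(train_expr), scale(test_expr)


# "Public" training cohort and an independent cohort with a different
# sequencing depth and a partly different gene set (both simulated).
train_expr, train_y = simulate(80, genes[:45], depth=1.0)
test_expr, test_y = simulate(60, genes[:40] + genes[45:], depth=3.0)

train_X, test_X = harmonize(train_expr, test_expr)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(train_X, train_y)
predicted = model.predict(test_X)

# Labels for the second cohort exist here only because the data are simulated.
accuracy = float((predicted == test_y).mean())
print(f"Accuracy on the independent cohort: {accuracy:.2f}")
```

In a real second cohort you would usually have no labels to score against, so the predictions would need checking against whatever clinical or biological information is available. Note also that scaling within each dataset removes cohort-level shifts whether they are technical or real, so it behaves best when both cohorts contain a reasonably similar mix of cases and controls.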