Traditional machine learning workflows on gene expression data apply filtering and feature selection, then fit a classifier (RF, SVM, or an ensemble) on the selected features. Each dataset is handled from scratch.
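For concreteness, here is a minimal sketch of the kind of single-dataset pipeline I mean, assuming a generic `(X, y)` expression matrix; the toy data and the filter thresholds are just placeholders, not part of my actual setup:

```python
# Sketch of the "from scratch" workflow: variance filter, univariate
# feature selection, then a classifier, all fit on one dataset only.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # toy stand-in for a samples x genes matrix
y = rng.integers(0, 2, size=100)   # toy binary labels

pipe = Pipeline([
    ("var", VarianceThreshold(threshold=0.5)),   # drop near-constant genes
    ("kbest", SelectKBest(f_classif, k=100)),    # univariate filter
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```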
I am trying not to combine different datasets (to avoid batch effects) but rather to develop several models, i.e. to collect subsets of genes that predict well together, and to keep training on several datasets until I end up with a set of models (each with a different subset of genes, different parameters, and possibly a different algorithm) that can be reused in any classification problem on gene expression data. Is this approach new? Do you have any suggested ideas to try? This is not typical transfer learning, but any help on how to carry information from one gene expression dataset to another would be really helpful.
My question is not about integration of different genomic data types from the same person. It's about developing a model on one dataset and, to overcome small sample sizes, retraining it on another classification problem from another dataset. By retraining I mean taking the model name (e.g. RF) and the set of genes (the names of the ten genes) that worked well together and reusing them. We keep doing this until we have a set of mature heuristics. All the datasets I am referring to are gene expression data (e.g. from TCGA). For example, my algorithm looks like this:
1- I find the top 100 most important genes using RF on a breast cancer dataset, then run several RF classifiers, each using only ten of those genes.
2- I repeat the same step on another dataset, e.g. a colon cancer dataset, and on several other datasets.
3- I take the best five classifiers from each dataset (by which I mean the model name and the genes used, kept as heuristics, not the fitted model itself), run all of these classifiers on a new dataset, and keep iterating and improving. A sketch of these steps follows this list.
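To make steps 1-3 concrete, here is a rough sketch, assuming each dataset comes as an `(X, y, gene_names)` triple; `mine_heuristics` and the parameter values are my own placeholder names, not a fixed design:

```python
# A "heuristic" here is just the model name plus the gene names,
# not the fitted model itself.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def mine_heuristics(X, y, gene_names, n_top=100, subset_size=10, keep=5):
    # Step 1a: rank genes by RF importance on this dataset.
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    top = np.argsort(rf.feature_importances_)[::-1][:n_top]
    # Step 1b: score a small RF on each ten-gene subset of the top genes.
    scored = []
    for i in range(0, n_top, subset_size):
        idx = top[i:i + subset_size]
        score = cross_val_score(
            RandomForestClassifier(n_estimators=200, random_state=0),
            X[:, idx], y, cv=5).mean()
        scored.append((score, [gene_names[j] for j in idx]))
    # Step 3: keep the best subsets as (model name, gene list) heuristics.
    scored.sort(reverse=True)
    return [("RandomForest", genes) for score, genes in scored[:keep]]

# Step 2: repeat over several datasets and pool the heuristics, e.g.:
# heuristics = [h for X, y, names in datasets
#               for h in mine_heuristics(X, y, names)]
```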
The expected outcome would be a specific recipe for using gene expression data for classification. Instead of filtering, we would say, e.g.: take genes 40, 671, 899 and apply RF to them; take genes 55, 1000, 242 and apply logistic regression to them; take genes 44, ..., 555 and apply some other algorithm to them. On a new classification problem, we would run cross-validation over these models to find the accurate ones, as in the sketch below.
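That last step might look roughly like this; `MODELS` and `rank_heuristics` are hypothetical names, and I am assuming gene names let us align columns across datasets:

```python
# Cross-validate each (model name, gene list) heuristic on a new dataset
# and rank them; models are re-fit from scratch, only the recipe is reused.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

MODELS = {
    "RandomForest": lambda: RandomForestClassifier(n_estimators=200,
                                                   random_state=0),
    "LogisticRegression": lambda: LogisticRegression(max_iter=1000),
}

def rank_heuristics(heuristics, X_new, y_new, gene_names_new):
    col = {g: i for i, g in enumerate(gene_names_new)}
    results = []
    for model_name, genes in heuristics:
        if not all(g in col for g in genes):
            continue  # skip heuristics whose genes aren't measured here
        idx = [col[g] for g in genes]
        score = cross_val_score(MODELS[model_name](),
                                X_new[:, idx], y_new, cv=5).mean()
        results.append((score, model_name, genes))
    return sorted(results, reverse=True)
```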
Since these models (heuristics) are based on information from several previous training runs, they should outperform models based only on the dataset at hand, especially when that dataset is very small.
Is this approach valid? Is it already known in bioinformatics under another name?
My wording isn't great; please ask for clarification if any point is unclear.