Hi all, I am not an expert in machine learning (ML) and have a few specific questions about designing a binary classifier. I have bulk RNA-seq data for samples from 6 different cancer types. Each sample belongs to either class A or class B, and for each cancer type I have 10 samples from class A and 10 from class B. That gives a total of 120 samples (20 per cancer type, evenly split between classes A and B).
I would like to build a classifier that assigns samples to either class A or B. I could randomly divide the 120 samples into a training set and a test set, follow a regular ML workflow in scikit-learn, and try different models (logistic regression, SVM, and so on). One issue with that is how to do feature selection. I could run a differential expression (DE) analysis with DESeq2 to get the set of DE genes between classes A and B for each cancer type, and then use the DE genes common to all 6 cancer types as input features for the binary classifier. But that would cause leakage between the training and test sets, because features should be derived from the training set only. If I split the 120 samples randomly into training and test sets after selecting features this way, the test set will contain samples that were used to define the input features (the DE genes common to the 6 cancer types).
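To make the leakage concern concrete, here is a rough sketch of how I imagine the random-split workflow could keep feature selection inside the training data. It is only an illustration under several assumptions: the expression matrix is random placeholder data (not my real counts), `SelectKBest` with an ANOVA F-test is a simple stand-in for the DESeq2 step (it is not a DE analysis), and `k=50` and 5 folds are arbitrary choices. The point is that wrapping the selection step in a scikit-learn `Pipeline` means it is refit on the training portion of each fold only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data (assumption): 120 samples x 500 genes of random values
# standing in for normalized, log-transformed expression.
rng = np.random.default_rng(0)
n_samples, n_genes = 120, 500
X = rng.normal(size=(n_samples, n_genes))

# Labels: for each of the 6 cancer types, 10 class A (0) then 10 class B (1).
y = np.tile(np.repeat([0, 1], 10), 6)

# Feature selection lives inside the pipeline, so it is fit only on the
# training part of each fold. SelectKBest/f_classif is a stand-in for a
# DE analysis, not a replacement for DESeq2.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=50)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

With random placeholder data the accuracy should hover around chance, which is what I would expect if no information leaks from the held-out samples into the selected features.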
Alternatively, I could use the samples from 4 of the cancer types for training and the samples from the remaining 2 cancer types for testing. Then I could use the DE genes common to those 4 cancer types as input features for training, and afterwards evaluate the trained model's predictions on the test samples. But how can I make this unbiased? Which 4 cancer types should I use for training? Is there a better way to design this classifier, or a better way to select the features?
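One option I have been considering for the "which 4 cancer types" question is to not pick at all, and instead loop over every way of holding out 2 of the 6 cancer types (15 combinations). The sketch below shows what I mean using scikit-learn's `LeavePGroupsOut` with the cancer type as the group label. Again, this is only a sketch under assumptions: the data are random placeholders, `SelectKBest` with an F-test stands in for the per-cancer-type DESeq2 analysis and the intersection of DE genes, and the linear SVM and `k=50` are arbitrary.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeavePGroupsOut, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data (assumption): random values in place of real expression.
rng = np.random.default_rng(0)
n_types, per_class, n_genes = 6, 10, 500
X = rng.normal(size=(n_types * 2 * per_class, n_genes))

# Labels: 10 class A (0) and 10 class B (1) within each cancer type.
y = np.tile(np.repeat([0, 1], per_class), n_types)

# Group label = cancer type index (0..5), 20 consecutive samples per type.
groups = np.repeat(np.arange(n_types), 2 * per_class)

# Selection is refit on the 4 training cancer types in every split.
# SelectKBest/f_classif is a stand-in, not the DESeq2 intersection itself.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=50)),
    ("clf", SVC(kernel="linear")),
])

# Hold out every possible pair of cancer types: C(6, 2) = 15 splits.
cv = LeavePGroupsOut(n_groups=2)
n_splits = cv.get_n_splits(groups=groups)
scores = cross_val_score(pipeline, X, y, groups=groups, cv=cv)

print("Number of train/test splits:", n_splits)
print("Mean accuracy on held-out cancer types:", scores.mean())
```

Reporting the spread of the 15 scores, rather than the result for one hand-picked split, would at least avoid the arbitrary choice of training cancer types, though I am not sure whether this is the standard way to do it.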
So sorry for the long description. Thanks in advance for any suggestions or comments. I would really appreciate any help.