I have gene expression data from two cohorts, Case and Control. There are about four times as many Control samples as Case samples. I would like to run a random forest to select the genes (features) that most strongly separate Case from Control.
Because Control samples are so abundant, my plan is to randomly subsample the Control cohort n times (keeping the Case cohort unchanged), fit a random forest on each subsample, and obtain n lists of feature importances. The sum of each feature's ranks across those n lists would then serve as the final result.
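To make the plan concrete, here is a minimal sketch of what I have in mind, assuming NumPy and scikit-learn are available. The function name `rank_sum_importance`, its parameters, and the synthetic data are my own placeholders and not from any published method. It assumes rows are samples and columns are genes, and it draws as many Control samples as there are Case samples in each repeat, which is one possible choice among several.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rank_sum_importance(X_case, X_control, n_repeats=50, n_trees=500, seed=0):
    """Repeatedly undersample controls, fit a random forest, sum importance ranks.

    Returns one rank sum per feature; a lower sum means the feature was
    ranked as more important across the repeats.
    """
    rng = np.random.default_rng(seed)
    n_case, n_features = X_case.shape
    rank_sums = np.zeros(n_features)

    for _ in range(n_repeats):
        # Assumption: balance the classes by drawing n_case controls
        # without replacement; the Case cohort is used in full every time.
        idx = rng.choice(X_control.shape[0], size=n_case, replace=False)
        X = np.vstack([X_case, X_control[idx]])
        y = np.concatenate([np.ones(n_case), np.zeros(n_case)])

        forest = RandomForestClassifier(
            n_estimators=n_trees,
            random_state=int(rng.integers(2**31 - 1)),
        )
        forest.fit(X, y)

        # Rank 1 = most important feature in this run. Ties (for example
        # many genes with zero importance) are broken arbitrarily here.
        order = np.argsort(-forest.feature_importances_)
        ranks = np.empty(n_features)
        ranks[order] = np.arange(1, n_features + 1)
        rank_sums += ranks

    return rank_sums


# Synthetic stand-in data: 20 cases, 80 controls, 10 "genes";
# gene 0 is shifted in cases so it should carry the signal.
demo_rng = np.random.default_rng(1)
X_case = demo_rng.normal(size=(20, 10))
X_case[:, 0] += 3.0
X_control = demo_rng.normal(size=(80, 10))

rank_sums = rank_sum_importance(X_case, X_control, n_repeats=10, n_trees=100)
print("Genes ordered from most to least important:", np.argsort(rank_sums))
```

Summing ranks rather than raw importance values keeps a single run with unusually large importances from dominating the total, though averaging the raw importances would be an equally simple alternative to compare against.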
Is this approach feasible, and has any published study done the same? I am very new to machine learning, so detailed explanations or suggestions are very welcome.
Thank you very much.