Question: For a given CD_NN (cluster of differentiation) gene, are there pathways (or simply sets of genes) whose RNA transcripts might be good predictors of the protein level of that CD gene? Or, put differently: for which CD_NN genes do we know of other genes that are related to them or influence their protein level?
In other words: suppose you need to predict the protein levels of CD_NN proteins from the whole transcriptome (as in a Kaggle competition), but you want to restrict the whole transcriptome to a smaller set of, say, 100 genes that have a biologically meaningful relation to the chosen CD. So the question is: do we have biologically motivated choices of such "100" genes, at least for some CD_NN genes?
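To make the setup concrete, here is a minimal sketch of what "restrict the transcriptome to a candidate gene set, then predict the protein level" could look like. Everything in it is an assumption for illustration only: the data are synthetic, the gene names (`GENE_0`, ...) and the candidate set are hypothetical placeholders rather than real biology, and ridge regression is just one simple choice of model, not the method used in the competition or in the notebook.

```python
import numpy as np

def select_gene_columns(gene_names, gene_set):
    """Return column indices of the genes in gene_set that were actually measured."""
    wanted = set(gene_set)
    return [i for i, g in enumerate(gene_names) if g in wanted]

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def predict(X, w, b):
    return X @ w + b

# Synthetic stand-in for a CITE-seq dataset (assumption: cells x genes matrix).
rng = np.random.default_rng(0)
n_cells, n_genes = 500, 1000
gene_names = [f"GENE_{i}" for i in range(n_genes)]  # hypothetical names
rna = rng.normal(size=(n_cells, n_genes))

# Synthetic "protein level" that depends on a handful of genes plus noise.
true_w = np.zeros(n_genes)
true_w[:5] = 1.0
protein = rna @ true_w + 0.1 * rng.normal(size=n_cells)

# Hypothetical "biologically motivated" set of 100 genes; in practice this
# would come from a pathway database or prior knowledge about the chosen CD.
# One listed gene is absent from the data and is silently skipped.
candidate_set = [f"GENE_{i}" for i in range(100)] + ["NOT_MEASURED"]
cols = select_gene_columns(gene_names, candidate_set)

# Fit on the first 400 cells, evaluate on the remaining 100.
train, test = slice(0, 400), slice(400, None)
w, b = fit_ridge(rna[train][:, cols], protein[train])
pred = predict(rna[test][:, cols], w, b)
corr = np.corrcoef(pred, protein[test])[0, 1]
print(f"{len(cols)} genes used, held-out Pearson correlation: {corr:.3f}")
```

The same held-out correlation can then be compared between different candidate gene sets (or against the full transcriptome) to judge whether a biologically motivated set carries most of the predictive signal.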
Some exercises of this kind can be found in this notebook: https://www.kaggle.com/code/alexandervc/mmscel-bio-motivated-feature-selection-citeseq You are welcome to add any ideas!
PS
Of course, one can go in the opposite direction: first choose good predictors by data science methods, and then look for their biological interpretation. But that would 1) be biased toward one particular dataset, and 2) give an interpretation that might be questionable.