I have scRNA-seq samples containing cancer cells from the same patient at two stages (diagnosis and relapse). The goal is to determine a gene signature that would tell us whether the patient is in relapse or not.
I have already selected genes that are thought to be central to the pathways involved in the two states, but I now need to filter them down to the smallest set of genes that would still let us establish a relapse signature.
I had thought of building a machine learning model to see which genes carry the most weight, but I'm not sure this approach is sound, given that the cells in each scRNA-seq sample all come from the same patient, who was sampled in both states (so in my mind Random Forest is not possible for this reason). Moreover, the genes I previously identified come from correlation networks of their expression, so models that do not tolerate multicollinearity are ruled out (so logistic regression does not seem valid either).
Would you have any ideas for methods to recover a gene signature that characterizes the two states, ideally with a score like the one we could get from machine learning methods?
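To make concrete what I mean by a "score", here is a minimal sketch on synthetic data. It is only an illustration, not a method I have validated: the function name `signature_score`, the gene indices, and the simulated expression matrix are all hypothetical, and it does nothing to address the single-patient issue described above. Each cell gets the mean z-score of genes assumed to be higher at relapse minus the mean z-score of genes assumed to be lower.

```python
import numpy as np

def signature_score(expr, up_idx, down_idx):
    """Per-cell score: mean z-score of 'up' genes minus mean z-score of 'down' genes.

    expr is a (cells x genes) matrix of already normalized expression values.
    """
    mu = expr.mean(axis=0)
    sd = expr.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant genes
    z = (expr - mu) / sd
    return z[:, up_idx].mean(axis=1) - z[:, down_idx].mean(axis=1)

# Synthetic stand-in for real data (assumption: 200 cells per state, 10 genes).
rng = np.random.default_rng(0)
n_cells, n_genes = 200, 10
is_relapse = np.repeat([False, True], n_cells)
expr = rng.normal(size=(2 * n_cells, n_genes))

# Hypothetical signature: genes 0-2 go up at relapse, genes 3-4 go down.
up_idx, down_idx = [0, 1, 2], [3, 4]
shift = np.zeros(n_genes)
shift[up_idx] = 2.0
shift[down_idx] = -2.0
expr[is_relapse] += shift

scores = signature_score(expr, up_idx, down_idx)
print("mean score at diagnosis:", scores[~is_relapse].mean())
print("mean score at relapse:  ", scores[is_relapse].mean())
```

With real data, `expr` would be the normalized expression of the candidate genes, and the open question is how to choose `up_idx` and `down_idx` so that the set is as small as possible.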