I'm using the Python scikit-learn package, so any demonstration using scikit-learn functions would be really appreciated :)
I have several types of biomedical data: clinical data, DNA methylation data, and miRNA and RNA expression data. Each data type contains roughly 300 patient samples and about 50 normal (control) samples. I want to use several machine learning algorithms on these data together and train a model that can predict a patient's survival from the given data. I have some important questions:
1. Since the sizes of these data sets are very different, how can I combine them and feed them to an algorithm? For instance, if I do clustering, how can I align the data types?
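To make this concrete, here is the kind of combination I have in mind, sketched on made-up toy tables (random numbers standing in for my real data, with patient IDs as row indices) -- I'm not sure this is the right approach:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for my real tables: rows are patients (shared IDs),
# columns are features; each data type has a different width.
rng = np.random.default_rng(0)
patients = [f"P{i:03d}" for i in range(300)]
clinical = pd.DataFrame(rng.normal(size=(300, 10)), index=patients).add_prefix("clin_")
methyl = pd.DataFrame(rng.normal(size=(300, 1000)), index=patients).add_prefix("meth_")
mirna = pd.DataFrame(rng.normal(size=(300, 400)), index=patients).add_prefix("mir_")

# Align on patient ID and concatenate side by side into one feature matrix.
combined = pd.concat([clinical, methyl, mirna], axis=1, join="inner")

# Standardize so features measured on very different scales
# contribute comparably to downstream algorithms.
X = StandardScaler().fit_transform(combined)
print(X.shape)  # (300, 1410)
```

Is simple concatenation like this reasonable, or do the data types need to be handled separately?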
2. There are many probes for methylation, miRNA and RNA, over a thousand for each. Is there a way to filter out the important features (probes) and train the model only on those? Or, even better, after training the model on all the data, can the model tell me which features are important among this large number? Are scikit-learn's preprocessing methods enough for this step?
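For example, I found `SelectKBest` and `feature_importances_` in the scikit-learn docs; is something like the following (on synthetic data, since I can't share mine) what people normally do?

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: 350 samples, 2000 features, only 20 informative,
# roughly mimicking the shape of my probe-level data.
X, y = make_classification(n_samples=350, n_features=2000,
                           n_informative=20, random_state=0)

# Option 1: filter before training -- keep the top 100 features
# ranked by ANOVA F-score against the class label.
X_filtered = SelectKBest(f_classif, k=100).fit_transform(X, y)
print(X_filtered.shape)  # (350, 100)

# Option 2: train on everything, then read importances off the model.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top10 = np.argsort(forest.feature_importances_)[::-1][:10]
print(top10)  # indices of the 10 features the forest ranks highest
```

(I noticed these live in `sklearn.feature_selection` rather than `sklearn.preprocessing`, which is partly why I'm asking whether preprocessing alone is enough.)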
3. Is there a way to combine several algorithms? For instance, using clustering to group the features, and then feeding the results into a random forest or PCA to get the final model?
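By "combining" I mean chaining steps together, something like the sketch below with scikit-learn's `Pipeline` (again on synthetic data; PCA followed by a random forest is just an illustrative pairing, not necessarily what I should use):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data standing in for a combined multi-omics matrix.
X, y = make_classification(n_samples=350, n_features=500,
                           n_informative=15, random_state=0)

# Chain dimensionality reduction and a classifier into one estimator,
# so both steps are fit together inside cross-validation.
model = Pipeline([
    ("pca", PCA(n_components=50)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])

scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # mean cross-validated accuracy
```

Is `Pipeline` the intended tool for this, or is there a better way to stack unsupervised and supervised steps?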
I haven't learned machine learning systematically, so I get really confused when trying to apply these methods. I think I should use unsupervised algorithms. Is that correct?