Hi,
We are working on cancer mutation data and found that mutations are enriched in the binding regions of a transcription factor (TF). Since we couldn't find a strong explanation for this enrichment, we want to use feature selection methods.
Our idea is a model along these lines:
Y (number of mutations in these regions) ~ DHS sites + histone marks + other TF binding events (such as CTCF, EP300, etc.) + RNA-seq reads
So in this matrix we will have a single row for each binding event of our TF, and each variable will be either categorical (such as DHS site) or numerical (such as RNA-seq reads).
I have seen that people have applied random forest algorithms to predict mutation occurrence in specific regions. But our aim is not to predict anything; we simply want to ask, "What is the cause of the mutation occurrence?" Therefore, I want to separate my data into two subsets (train vs. test).
Please forgive my ignorance of the terminology and consider me a frustrated grad student.
Best regards,
Tunc.
There are many ways to do this. Random forests are a good choice: after training, you can look at the "variable importance", which ranks the variables of your model according to their contribution to the prediction. You can check the variable importance section of the caret package documentation.
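To make the variable importance idea concrete, here is a minimal sketch. It uses Python's scikit-learn rather than the caret package, and it runs on simulated data: the column names, effect sizes and number of binding sites are all made up for illustration and are not taken from the question. The `min_samples_leaf` setting is just one assumed way to limit overfitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated toy data: every name, effect size and count here is hypothetical.
rng = np.random.default_rng(0)
n_sites = 500                                  # one row per TF binding event
dhs = rng.integers(0, 2, n_sites)              # categorical: inside a DHS site or not
ctcf = rng.integers(0, 2, n_sites)             # categorical: CTCF co-binding or not
rna_reads = rng.normal(1000, 300, n_sites)     # numerical: RNA-seq reads

# In this toy setup only DHS status influences the mutation count.
mutations = rng.poisson(1.0 + 6.0 * dhs)

names = ["DHS", "CTCF", "RNAseq_reads"]
X = np.column_stack([dhs, ctcf, rna_reads])

# oob_score=True scores each site using only the trees that did not see it.
forest = RandomForestRegressor(
    n_estimators=200,
    min_samples_leaf=20,
    oob_score=True,
    random_state=0,
)
forest.fit(X, mutations)

# Impurity-based importances are normalized to sum to 1; sort from high to low.
ranking = sorted(
    zip(names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
print(f"out-of-bag R^2: {forest.oob_score_:.3f}")
```

The out-of-bag score gives a rough check of how well the forest fits without setting aside a separate test set. Keep in mind that the ranking reflects how useful a variable is for prediction, which is not by itself evidence that it causes the mutations.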
Another choice is lasso regression, which tries to shrink the coefficients of unimportant variables to exactly zero. One thing to consider is normalizing your variables if they are on different scales, so that the resulting coefficients are comparable. There are some good tutorials on lasso, for example here and here.
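As a rough illustration of how lasso zeroes out variables and why scaling matters, here is a small from-scratch sketch in plain Python. It uses coordinate descent with soft-thresholding, which is one common way to fit a lasso; in practice you would use an existing implementation. The data are simulated, and the variable names, effect sizes and the penalty `lam = 0.4` are assumptions chosen only for the example.

```python
import random

def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]

def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything within [-lam, lam] becomes 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso(columns, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1.

    Assumes every column is standardized and y is centered.
    """
    n = len(y)
    coefs = [0.0] * len(columns)
    residual = list(y)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Covariance of column j with the residual computed without column j.
            rho = sum(x * (r + x * coefs[j]) for x, r in zip(col, residual)) / n
            new = soft_threshold(rho, lam)
            delta = new - coefs[j]
            if delta != 0.0:
                residual = [r - x * delta for x, r in zip(col, residual)]
                coefs[j] = new
    return coefs

# Simulated toy data: names and effect sizes are hypothetical.
rng = random.Random(0)
n_sites = 500
dhs = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n_sites)]  # 0/1 scale
rna_reads = [rng.gauss(1000, 300) for _ in range(n_sites)]          # scale of thousands
noise = [rng.gauss(0, 1) for _ in range(n_sites)]                   # unrelated to y
y = [3.0 * d + 0.005 * r + rng.gauss(0, 1) for d, r in zip(dhs, rna_reads)]

# Put all variables on the same scale and center the response.
y_mean = sum(y) / n_sites
y_centered = [v - y_mean for v in y]
names = ["DHS", "RNAseq_reads", "noise"]
columns = [standardize(c) for c in (dhs, rna_reads, noise)]

coefs = dict(zip(names, lasso(columns, y_centered, lam=0.4)))
for name, value in coefs.items():
    print(f"{name}: {value:.3f}")
```

Because the columns are standardized first, the same penalty applies fairly to the 0/1 DHS indicator and to the read counts in the thousands, and the surviving coefficients can be compared directly. The unrelated `noise` column should end up with a coefficient of exactly zero.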
Hope it helps.
@Sirus thank you very much for your comment. The part about dividing the data into train and test sets confuses me a lot. Can I only train on my data and not do any prediction? Like I said in the question, I don't want to predict anything. Is this possible with random forest?