Question: Could you suggest a proper feature selection method for mixed-type variable data?
11 months ago by
morovatunc360 wrote:


We are working on cancer mutation data and found that mutations are enriched in the regions where our TF binds. Since we could not find a strong reason for this enrichment, we want to use feature selection methods.

We thought of a model along these lines:

Y (number of mutations in these regions) ~ DHS sites + histone marks + other TF binding events (such as CTCF, EP300, etc.) + RNA-seq reads

So in this matrix, we will have a single row for each binding event of our TF, and each variable will be either categorical (such as DHS site) or numerical (such as RNA-seq reads).
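A minimal sketch of how such a matrix could be laid out before any feature selection, assuming Python and made-up values. The column names (`dhs_site`, `ctcf_bound`, `rnaseq_reads`, `mutations`) are hypothetical and only mirror the variables listed above. Categorical columns become 0/1 indicators and numerical columns are z-scored.

```python
from statistics import mean, pstdev

# Hypothetical rows, one per TF binding event. All values are invented for illustration.
rows = [
    {"dhs_site": "yes", "ctcf_bound": "no",  "rnaseq_reads": 120.0, "mutations": 3},
    {"dhs_site": "no",  "ctcf_bound": "no",  "rnaseq_reads": 15.0,  "mutations": 0},
    {"dhs_site": "yes", "ctcf_bound": "yes", "rnaseq_reads": 300.0, "mutations": 5},
    {"dhs_site": "no",  "ctcf_bound": "yes", "rnaseq_reads": 45.0,  "mutations": 1},
]
categorical = ["dhs_site", "ctcf_bound"]
numerical = ["rnaseq_reads"]

def encode(rows, categorical, numerical):
    """Return (feature_names, X) with one-hot categoricals and z-scored numerics."""
    names, columns = [], []
    for col in categorical:
        levels = sorted({r[col] for r in rows})
        # Drop the first level so the indicator columns are not redundant.
        for level in levels[1:]:
            names.append(f"{col}={level}")
            columns.append([1.0 if r[col] == level else 0.0 for r in rows])
    for col in numerical:
        values = [r[col] for r in rows]
        mu, sd = mean(values), pstdev(values)
        names.append(col)
        columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in values])
    # Turn the column-wise storage into one feature vector per row.
    X = [list(row) for row in zip(*columns)]
    return names, X

names, X = encode(rows, categorical, numerical)
y = [r["mutations"] for r in rows]
```

Once everything is numeric, the same `X` and `y` can be handed to either a tree-based or a penalised regression method.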

I have seen that people have applied random forest algorithms to predict mutation occurrence in specific regions. But our aim is not to predict anything, but simply to ask, "What is the cause of the mutation occurrence?" Therefore, I want to separate my data into two subsets (train vs. test).

Please forgive my ignorance in the terminology and consider me as a frustrated grad student.

Best regards,


machine learning • 360 views
modified 10 months ago by cfay0 • written 11 months ago by morovatunc360

There are many ways to do this. Random forests are a good choice: after training, you can look at the "variable importance", which ranks the variables of your model according to their contribution to the prediction. You can check the variable importance section of the caret package documentation.
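To make the idea of variable importance concrete, here is a rough sketch of a permutation-based importance computed on out-of-bag rows. It is a toy forest of small regression trees written from scratch in Python, not caret's or any library's implementation. The data are simulated under the assumption that a binary feature and one numeric feature drive the response while a third feature is pure noise. All function names are made up for this example.

```python
import random

def avg(values):
    values = list(values)
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean (node impurity for regression)."""
    m = avg(values)
    return sum((v - m) ** 2 for v in values)

def build_tree(X, y, idx, depth, mtry, rng, max_depth=3, min_leaf=5):
    """Grow a small regression tree on rows idx; leaves are floats, splits are tuples."""
    if depth == max_depth or len(idx) < 2 * min_leaf:
        return avg(y[i] for i in idx)
    best = None
    for j in rng.sample(range(len(X[0])), mtry):       # random feature subset per split
        for t in sorted({X[i][j] for i in idx})[:-1]:  # candidate thresholds
            left = [i for i in idx if X[i][j] <= t]
            right = [i for i in idx if X[i][j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return avg(y[i] for i in idx)
    _, j, t, left, right = best
    return (j, t,
            build_tree(X, y, left, depth + 1, mtry, rng, max_depth, min_leaf),
            build_tree(X, y, right, depth + 1, mtry, rng, max_depth, min_leaf))

def predict(tree, row):
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

def forest_importance(X, y, n_trees=25, mtry=2, seed=0):
    """Mean increase in out-of-bag squared error when each feature is shuffled."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    increase = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        tree = build_tree(X, y, boot, 0, mtry, rng)
        base = avg((y[i] - predict(tree, X[i])) ** 2 for i in oob)
        for j in range(p):
            shuffled = [X[i][j] for i in oob]
            rng.shuffle(shuffled)
            permuted = avg((y[i] - predict(tree, X[i][:j] + [v] + X[i][j + 1:])) ** 2
                           for i, v in zip(oob, shuffled))
            increase[j] += (permuted - base) / n_trees
    return increase

# Simulated data: column 0 is binary (e.g. in a DHS site or not), columns 1-2 are numeric.
data_rng = random.Random(1)
X = [[float(data_rng.random() < 0.5), data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
     for _ in range(120)]
y = [3.0 * dhs + 1.0 * expr + data_rng.gauss(0, 0.3) for dhs, expr, _ in X]
importance = forest_importance(X, y)
```

A feature whose shuffling barely changes the out-of-bag error contributes little to the fit. Note that a high importance says the variable is useful for explaining the response, not that it is the cause.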

Another choice is lasso regression, which tries to shrink the coefficients of unimportant variables to exactly zero. One thing to consider is normalizing your variables if they are on different scales, so that the resulting coefficients are comparable. There are some good tutorials on lasso, for example here and here
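As an illustration of why the scaling matters, here is a small sketch of lasso fitted by cyclic coordinate descent on standardized columns. It is plain Python on simulated data, and the function names and the penalty value `alpha=0.3` are assumptions chosen for the example, not recommendations. In practice one would use an established implementation and tune the penalty.

```python
import random
from statistics import mean, pstdev

def standardize(X):
    """Z-score each column so the penalised coefficients are on a comparable scale."""
    stats = [(mean(c), pstdev(c)) for c in zip(*X)]
    return [[(v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(row, stats)]
            for row in X]

def soft_threshold(rho, alpha):
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso(X, y, alpha, n_iter=200):
    """Minimise (1/2n)*||y - Xb||^2 + alpha*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    y_mean = mean(y)
    resid = [yi - y_mean for yi in y]   # residual with all coefficients at zero
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0:
                continue
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, resid)) / n
            new = soft_threshold(rho, alpha) / z
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - v * delta for v, r in zip(col, resid)]
                beta[j] = new
    return beta

# Simulated data: the second column is on a much larger scale than the others,
# and the third column has no effect on the response.
random.seed(0)
X_raw = [[random.gauss(0, 1), random.gauss(0, 50), random.gauss(0, 1)]
         for _ in range(200)]
y = [2.0 * a + 0.04 * b + random.gauss(0, 0.5) for a, b, _ in X_raw]
beta = lasso(standardize(X_raw), y, alpha=0.3)
```

Features whose coefficient ends at exactly zero are the ones the penalty has dropped. Without the standardization step, the raw coefficient of the large-scale column (0.04 here) would look negligible even though its effect on the response is as large as that of the first column.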

Hope it helps.

modified 11 months ago • written 11 months ago by Sirus770

@Sirus, thank you very much for your comment. The part about dividing the data into train and test sets confuses me a lot. Can I only train on my data and not do any prediction? Like I said in the question, I don't want to predict anything. Is this possible with random forest?

written 11 months ago by morovatunc360
10 months ago by
Sirus770 wrote:

@morovatunc, to avoid over-fitting, you can use all your data with, for example, 10-fold cross-validation (the caret package can do that for you). Then you'll get your variable importance. Theoretically, a signal that is truly important should be important in any subset of the genome, so a 10-fold CV will help eliminate some of the noise.
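A sketch of what this could look like in code, assuming Python rather than caret: every row is used, each fold is held out once, and the importance of a feature is the average increase in held-out error when that feature is shuffled. To keep the example short, a k-nearest-neighbour average stands in for the random forest. That swap, the simulated data, and all function names are assumptions of this example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices once and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def knn_predict(train_X, train_y, row, k=5):
    """Stand-in model: average response of the k nearest training rows."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], row)),
    )[:k]
    return sum(train_y[i] for i in nearest) / k

def cv_permutation_importance(X, y, k=10, seed=0):
    """Average held-out increase in squared error when each feature is shuffled."""
    rng = random.Random(seed)
    p = len(X[0])
    totals = [0.0] * p
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        tX, ty = [X[i] for i in train], [y[i] for i in train]

        def error(rows):
            # Mean squared error on the held-out fold for the given feature rows.
            return sum((y[i] - knn_predict(tX, ty, r)) ** 2
                       for i, r in zip(held_out, rows)) / len(held_out)

        base = error([X[i] for i in held_out])
        for j in range(p):
            shuffled = [X[i][j] for i in held_out]
            rng.shuffle(shuffled)
            rows = [X[i][:j] + [v] + X[i][j + 1:] for i, v in zip(held_out, shuffled)]
            totals[j] += (error(rows) - base) / k
    return totals

# Simulated data: two informative features and one pure-noise feature.
data_rng = random.Random(2)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [3.0 * a + 1.5 * b + data_rng.gauss(0, 0.3) for a, b, _ in X]
folds = k_fold_indices(len(X), 10)
importance = cv_permutation_importance(X, y, k=10)
```

Because every fold contributes, a feature that only looks important in one subset of the rows gets averaged down, which is the noise reduction described above.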
