Question: Could you suggest a proper feature selection method for mixed-type variable data?
morovatunc360 (Turkey) wrote, 14 months ago:

Hi,

We are working on cancer mutation data, and we found an enrichment of mutations at the binding regions of a TF. Since we couldn't find a strong reason for this enrichment, we wanted to use feature selection methods.

We thought of a model along the lines of:

Y (mutation count in these regions) ~ DHS sites + histone marks + other TF binding events (such as CTCF, EP300, etc.) + RNA-seq reads

So in this matrix we will have a single row for each binding event of our TF, and the variables will be either categorical (such as DHS site) or numerical (such as RNA-seq reads).

I have seen that people have applied random forest algorithms to predict mutation occurrence in specific regions. But our aim is not to predict anything, but simply to ask, "What is the cause of the mutation occurrence?" Therefore, I am not sure I need to separate my data into two subsets (train vs. test).

Please forgive my ignorance of the terminology, and consider me a frustrated grad student.

Best regards,

Tunc.

Tags: machine learning
— written 14 months ago by morovatunc360; modified 14 months ago by cfay0
There are many ways you can do this. Random forests are a good choice: after training, you can look at the "variable importance", which ranks the variables of your model by their contribution to the prediction. You can check the variable-importance section of the caret package documentation.
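The answer refers to R's caret package; a minimal sketch of the same idea in Python with scikit-learn is below. The feature names mirror the variables from the question, but the data is synthetic and the signal (mutation counts driven by DHS status and RNA-seq reads) is simulated purely for illustration.

```python
# Sketch: random-forest variable importance (Python/scikit-learn equivalent
# of caret's varImp). Feature names echo the question; data is simulated.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "DHS_site": rng.integers(0, 2, n),        # categorical (0/1)
    "histone_mark": rng.integers(0, 2, n),    # categorical (0/1)
    "CTCF_bound": rng.integers(0, 2, n),      # categorical (0/1)
    "RNAseq_reads": rng.gamma(2.0, 50.0, n),  # numerical, skewed
})
# Simulated response: mutation count per binding region, driven by
# DHS status and expression (an assumption made for this example).
y = 2 * X["DHS_site"] + 0.01 * X["RNAseq_reads"] + rng.normal(0, 0.5, n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
# Rank variables by their contribution to the prediction
for name, imp in sorted(zip(X.columns, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

On this simulated data the informative variables (DHS_site, RNAseq_reads) rise to the top of the ranking, while the noise variables get near-zero importance; tree-based importances also handle the mix of categorical and numerical predictors without any special encoding of scale.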

Another choice is lasso regression, which tries to shrink the coefficients of unimportant variables to zero. One thing to consider is normalizing your variables if they are on different scales, so that the coefficients are comparable. There are some good tutorials on the lasso, for example here and here.

Hope it helps.

— written 14 months ago by Sirus770

@Sirus thank you very much for your comment. The part about dividing the data into train and test sets confuses me a lot. Can I only train on my data and not do any prediction? Like I said in the question, I don't want to predict anything. Is this possible with a random forest?

— written 14 months ago by morovatunc360
Sirus770 (Boston/USA) wrote, 14 months ago:

@morovatunc, to avoid over-fitting you can use all your data, but with, for example, 10-fold cross-validation (the caret package can do that for you). Then you'll get your variable importance. Theoretically, a signal that is truly important should be important in any subset of the genome, and 10-fold CV will help eliminate some of the noise.
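The idea above can be sketched as follows: fit a random forest on each of 10 cross-validation folds and average the variable importances across folds, so no single train/test split dominates. This is a Python/scikit-learn stand-in for what caret automates in R, on simulated data with the same made-up features as before.

```python
# Sketch: averaging random-forest variable importance over 10 CV folds,
# rather than relying on one train/test split. Simulated data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([
    rng.integers(0, 2, n),    # col 0: DHS site (informative)
    rng.integers(0, 2, n),    # col 1: histone mark (noise)
    rng.gamma(2.0, 50.0, n),  # col 2: RNA-seq reads (informative)
])
y = 2 * X[:, 0] + 0.01 * X[:, 2] + rng.normal(0, 0.5, n)

importances = []
for train_idx, _ in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X[train_idx], y[train_idx])       # fit on each fold's training part
    importances.append(rf.feature_importances_)

mean_imp = np.mean(importances, axis=0)      # importance averaged over folds
print(mean_imp)
```

An importance that stays high across all 10 folds is the cross-validated, "important in any subset" signal the answer is describing; a variable whose importance fluctuates fold to fold is more likely noise.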

Powered by Biostar version 2.3.0