Question: feature selection using random forest
newbie wrote (4 months ago):

Hi,

I need a little help. I have about a hundred samples that I have already classified into four classes (clusters). Now I'm interested in identifying the set of genes that best separates the samples into these classes, including both up- and down-regulated genes in each class.

For this I have already used a t-test, but I'm also interested in applying random forest for feature selection. My data looks like the example below; it is only a small excerpt.

[Image: example data, not reproduced here]

Can anyone please tell me how I can use the data above with random forest to find out which genes classify the samples into the different classes? Thank you.


Which type of data is it: RNA-Seq or microarray?

Reply written 4 months ago by Nicolas Rosewick

It is RNA-Seq data with 100 samples.

Reply written 4 months ago by newbie
dsull (UCLA) wrote (4 months ago):

You have four classes. Why are you using a t-test? You should be using ANOVA.
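As a rough illustration of the ANOVA suggestion, here is a minimal Python sketch that runs a one-way ANOVA per gene across four classes with `scipy.stats.f_oneway`. The data are synthetic stand-ins (the real table was only posted as an image), and the sketch assumes normalized, log-scale expression values rather than raw RNA-Seq counts. In practice the resulting p-values would also need a multiple-testing correction.

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic stand-in for the poster's data (assumption): 100 samples x 50 genes
# of log-scale expression, with four classes of 25 samples each.
rng = np.random.default_rng(0)
n_samples, n_genes = 100, 50
labels = np.repeat(np.arange(4), n_samples // 4)
expr = rng.normal(size=(n_samples, n_genes))

# Shift gene 0 by class so that there is a real difference to detect.
expr[:, 0] += labels * 2.0

# One array per class; with 2-D inputs, f_oneway tests every gene (column)
# separately along axis 0.
groups = [expr[labels == k] for k in range(4)]
f_stat, p_values = f_oneway(*groups, axis=0)

# Rank genes from smallest to largest p-value.
ranked = np.argsort(p_values)
print("top genes:", ranked[:5])
print("p-values:", p_values[ranked[:5]])
```

Gene 0 should come out on top here because it is the only gene that was constructed to differ between the classes.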

Second, since random forests report feature importances, you can use a random forest with recursive feature elimination (look up: recursive feature elimination with cross-validation) to find a set of features with the best predictive value.
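A minimal sketch of that idea in Python, using scikit-learn's `RFECV` around a `RandomForestClassifier`. This is only one possible setup, not a recommended configuration: the data are synthetic, and the step size, number of trees, and fold count are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption): 100 samples x 40 genes, four classes.
rng = np.random.default_rng(0)
n_samples, n_genes = 100, 40
y = np.repeat(np.arange(4), n_samples // 4)
X = rng.normal(size=(n_samples, n_genes))

# Make genes 0..3 informative: gene k is shifted up in class k.
for k in range(4):
    X[y == k, k] += 3.0

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Recursive feature elimination: repeatedly fit the forest, drop the least
# important genes (4 per round here), and use cross-validation to choose
# how many genes to keep.
selector = RFECV(
    estimator=forest,
    step=4,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    min_features_to_select=2,
)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)
print("number of genes kept:", selector.n_features_)
print("selected gene indices:", selected)
```

With real data, `X` would be the samples-by-genes expression matrix and `y` the class labels. `selector.ranking_` gives the elimination order for the genes that were dropped.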


Thank you. If you don't mind, could you please give me an example of how to do this? I'm very new to this type of analysis. Please give me an example using the data above. Thanks again.

Reply written 4 months ago by newbie

Here's an example:

https://topepo.github.io/caret/recursive-feature-elimination.html

If you're new, unfortunately, it's going to take some effort to read tutorials and write code. Using advanced supervised machine learning methods properly is not trivial (e.g. you'll need to understand hyperparameter tuning, metrics for measuring model performance, cross-validation, multiclass classification, etc.). Also, 100 samples is quite a small dataset, so I wonder why you want to use random forests in the first place, as opposed to selecting features using simpler generalized linear models (e.g. DESeq2).
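To give a feel for what hyperparameter tuning and cross-validation involve, here is a small Python sketch with scikit-learn (the linked tutorial uses R's caret instead). It is built on assumptions: the data are synthetic, the grid values are arbitrary, and feature selection uses a simple univariate filter (`SelectKBest`) rather than recursive feature elimination, to keep the example short. The point it tries to show is that feature selection sits inside the cross-validated pipeline, so the held-out samples never influence which genes are picked.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 100 samples x 40 genes, four classes.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(4), 25)
X = rng.normal(size=(100, 40))
for k in range(4):
    X[y == k, k] += 3.0  # gene k is shifted up in class k

# Feature selection and the classifier are fitted together on each training fold.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Small illustrative grid; real grids are usually larger.
param_grid = {
    "select__k": [4, 10],
    "forest__max_features": ["sqrt", 0.5],
}

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)  # evaluation

search = GridSearchCV(pipe, param_grid, cv=inner, scoring="balanced_accuracy")

# Nested cross-validation: the outer folds estimate how well the whole
# tune-then-fit procedure generalizes.
scores = cross_val_score(search, X, y, cv=outer, scoring="balanced_accuracy")
print("balanced accuracy per outer fold:", scores)
print("mean:", scores.mean())
```

With only 100 samples the per-fold scores will vary quite a bit, which is part of why a small dataset makes this kind of analysis hard to do reliably.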

Reply written 4 months ago by dsull