Question: Is sampling for training and testing necessary in random forest algorithms?
arronar (Austria) wrote, 2.4 years ago:

Hi.

I'm trying to apply a random forest algorithm to microarray results in order to get a list of the most significant predictors. So far I have been running the following pipeline.

library(randomForest)

buildForest <- function (RF.data, storeDir)
{
  acc = numeric()

  for(i in 1:50){

    # Random sampling: 70% for training, 30% for validation.
    # Resample until both subsets contain all 9 classes.
    y = z = 0
    while(y != 9 || z != 9){
      idx = sample(x = 1:nrow(RF.data), size = 0.7 * nrow(RF.data))

      train = RF.data[idx,]
      test = RF.data[-idx,]

      y = length(unique(train$classes))
      z = length(unique(test$classes))

    }

    # Fit the model with
    # mtry  : number of variables randomly sampled as candidates at each split
    # ntree : number of trees to grow
    rf = randomForest(classes~., data=train, mtry=7, ntree=2000, importance=TRUE)

    p = predict(rf, test)

    acc[i] = mean(test$classes == p)
    print(acc[i])
    # Keep track of and save the models that have high accuracy
    if(acc[i] > 0.65){
      saveRDS(rf, paste(storeDir, "/rf_", i, "_", acc[i], ".rds", sep=""))
    }
  }
}

As you can see, I resample 50 times (70% training, 30% testing) and keep only the models with accuracy over 65%.
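One thing worth noting about the while-loop above: instead of resampling until all classes happen to land in both subsets, a stratified split guarantees it by sampling within each class. A minimal sketch in Python of that idea (the function name and parameters here are my own, not part of any package):

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.7, seed=0):
    """Split row indices so that every class appears in both
    subsets, roughly preserving the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    train, test = [], []
    for lab, idx in by_class.items():
        rng.shuffle(idx)
        # keep at least one sample of each class on each side
        k = max(1, min(len(idx) - 1, round(train_frac * len(idx))))
        train.extend(idx[:k])
        test.extend(idx[k:])
    return sorted(train), sorted(test)

labels = ["a"] * 10 + ["b"] * 10 + ["c"] * 4
train, test = stratified_split(labels)
print({labels[i] for i in train}, {labels[i] for i in test})
```

With a split like this the retry loop becomes unnecessary, and rare classes can't be starved out of either subset by chance.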

And then I read an article that doesn't use a train/test split at all: its author fed all of the data into the algorithm.

So is my method wrong and unnecessary, or is it OK?

Thanks.

written 2.4 years ago by arronar • modified 2.4 years ago by Jean-Karim Heriche
Jean-Karim Heriche (EMBL Heidelberg, Germany) wrote, 2.4 years ago:

Why do you keep multiple models? What you're doing is a form of cross-validation, and usually one keeps only the best model. The need for cross-validation with random forests is debatable: the algorithm already bootstraps the data and randomizes the variables, so it is in principle robust to overfitting. You seem to be using the R package randomForest, so did you read the documentation and the linked algorithm documentation (in particular the section on the test set error rate)? Or check the random forests page by the authors of the algorithm, in particular the section "How random forests work".

written 2.4 years ago by Jean-Karim Heriche

I'm keeping the ones with accuracy > 65% and then the best one of them. I'm running CV 50 times because I realized that each run produces models of different accuracy; e.g. the first could have 34% accuracy while the second has 72%. If I don't use CV, will the algorithm keep the best one?

written 2.4 years ago by arronar