
Hi.

I'm trying to apply the random forest algorithm to microarray results in order to get a list of the most significant predictors. So far I have been running the following pipeline.

```
library(randomForest)

buildForest <- function(RF.data, storeDir)
{
  acc = numeric()
  for (i in 1:50) {
    # Random sampling: 70% for training, 30% for validation
    y = z = 0
    # Resample until both sets contain all 9 classes
    while (y != 9 || z != 9) {
      idx   = sample(x = 1:nrow(RF.data), size = 0.7 * nrow(RF.data))
      train = RF.data[idx, ]
      test  = RF.data[-idx, ]
      y = length(unique(train$classes))
      z = length(unique(test$classes))
    }
    # Fit the model with
    # mtry  : number of variables randomly sampled as candidates at each split
    # ntree : number of trees to grow
    rf = randomForest(classes ~ ., data = train, mtry = 7, ntree = 2000,
                      importance = TRUE)
    p = predict(rf, test)
    acc[i] = mean(test$classes == p)   # store each run's accuracy instead of overwriting
    print(acc[i])
    # Keep track of and save the models that have high accuracy
    if (acc[i] > 0.65) {
      saveRDS(rf, paste(storeDir, "/rf_", i, "_", acc[i], ".rds", sep = ""))
    }
  }
}
```

As you can see, I resample 50 times (70% training, 30% testing) and keep only the models whose accuracy exceeds 65%.

Then I read an article that doesn't use a train/test split at all: its author fed all of his data into the algorithm.
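If I understand correctly, that approach relies on the out-of-bag (OOB) error that `randomForest` computes internally when given the full data set. A minimal sketch of what I think the article did (using my own `RF.data` with its factor column `classes`; `mtry` and `ntree` values are just my current choices):

```r
library(randomForest)

# Fit on the full data set; each tree is grown on a bootstrap sample,
# so the rows left out of each tree provide an out-of-bag error estimate.
rf.all = randomForest(classes ~ ., data = RF.data, mtry = 7, ntree = 2000,
                      importance = TRUE)

# OOB error rate after all trees (last row of the running estimate)
print(rf.all$err.rate[rf.all$ntree, "OOB"])

# Variable importance, which is what I ultimately want for the predictors
varImpPlot(rf.all)
```

So instead of my 50 manual resamples, the OOB estimate would come from the single model trained on everything.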

So is my method wrong or unnecessary, or is it OK?

Thanks.

written 2.4 years ago by arronar
