Question: Random Forest returns "New factor levels not present in the training data"
Asked 9 months ago by arronar150 (Austria):

Hi.

I'm trying to run a random forest on some microarray data using the following code, but I'm getting the error in the title. As you will see below, I tried to work around this issue by following the Stack Overflow link commented in the code, but without success.

acc = numeric()

for(i in 1:20){

  # Random Sampling with 70-30% for training and validation respectively
  y = z = 0
  while(y != 9 || z != 9){
    sample = sample(x = 1:nrow(data) , size = 0.7 * nrow(data) )

    train = data[sample,]
    test = data[-sample,]

    y = length(unique(train$classes))
    z = length(unique(test$classes))

  }

  print(paste(y , z))

  # https://stat.ethz.ch/pipermail/r-help/2008-March/156608.html
  # https://stackoverflow.com/questions/17059432/random-forest-package-in-r-shows-error-during-prediction-if-there-are-new-fact
  test$classes <- as.character(test$classes)
  train$classes <- as.character(train$classes)
  test$isTest <- rep(1,nrow(test))
  train$isTest <- rep(0,nrow(train))

  fullSet <- rbind(test,train)
  fullSet$classes <- as.factor(fullSet$classes)

  test.new <- fullSet[fullSet$isTest==1,]
  train.new <- fullSet[fullSet$isTest==0,]

  test.new$isTest = NULL
  train.new$isTest = NULL

  print(levels(test.new$classes))
  print(levels(train.new$classes))

  # Fitting the model with
  # mtry : number of variables randomly sampled as candidates at each split
  # ntree : number of trees to grow
  rf = randomForest(classes~., data=as.matrix(train.new), mtry=5, ntree=2000, importance=TRUE)

  p = predict(rf, test.new)

  acc = mean(test.new$classes == p)
  print(acc)
  # Keep track and save the models that have high accuracy
  if(acc > 0.65){
    print(acc)
    saveRDS(rf , paste("./rf_models/rf_", i, "_", acc, ".rds", sep=""))
  }
}

The error I'm getting is :

Error in predict.randomForest(rf, test.new) :
  New factor levels not present in the training data
Calls: predict -> predict.randomForest
Execution halted

So I added the lines

print(levels(test.new$classes))
print(levels(train.new$classes))

in order to see whether the levels of the training and testing sets were different.

These lines printed:

[1] "Treat1" "Treat2" "Treat3" "Treat4" "Treat5" "Treat6" "Treat7" "Tg" "Wt"

[1] "Treat1" "Treat2" "Treat3" "Treat4" "Treat5" "Treat6" "Treat7" "Tg" "Wt"

Is there something I'm doing wrong? How can I approach this issue?

Tags: microarrays, machine learning, R
written 9 months ago by arronar150

I get this error for ForestDNM on new versions of GATK haplotype caller VCFs. If someone figures this out I would love to know the answer!

written 9 months ago by QVINTVS_FABIVS_MAXIMVS2.2k

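One thing I see is that you coerce your training data into a matrix with as.matrix(train.new) inside the randomForest() call. This can have unexpected consequences: a matrix can hold only one data type, so if any column is non-numeric, as.matrix() turns every column, factors included, into character strings (data.matrix() is the variant that replaces factors with integer codes 1, 2, 3, 4, etc., based on level order). Either way, the training data will already differ from the testing data, which is still a data frame with factors.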
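
A minimal sketch of the coercion discussed here, using a small made-up data frame (the column names gene1 and classes and all values are assumptions, not the poster's data). It only illustrates how base R coerces a data frame with a factor column; it does not reproduce the randomForest error itself:

```r
# Toy data frame: one numeric predictor and a factor outcome (hypothetical names)
df <- data.frame(
  gene1   = c(1.5, 2.5, 3.5),
  classes = factor(c("Wt", "Tg", "Wt"))
)

m_chr <- as.matrix(df)    # any non-numeric column forces a character matrix
m_int <- data.matrix(df)  # factors are replaced by their integer codes

print(typeof(m_chr))        # storage type of the coerced matrix
print(m_chr[, "classes"])   # labels survive, but only as plain strings
print(m_int[, "classes"])   # integer codes that depend on level order
```

In both cases the factor information (the set of levels) is gone, which is why passing the data frame itself to randomForest() is the safer choice.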

Another thing: when you split a data frame that has categorical variables / factors, it's good practice to relevel those factors in the new objects. In your case:

test$classes <- factor(as.character(test$classes))
train$classes <- factor(as.character(train$classes))
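
The releveling advice above can be sketched on a toy factor (the labels and indices are made up for illustration). It uses droplevels() as an alternative to factor(as.character(...)), and additionally shows one possible way to re-encode test labels against the training levels so that unseen labels show up as NA instead of as new levels:

```r
# Toy factor with three classes (hypothetical labels)
classes <- factor(c("Wt", "Tg", "Treat1", "Wt", "Tg"))

# Subsetting keeps all original levels, even ones no longer present
train_classes <- classes[c(1, 2, 4)]
print(levels(train_classes))   # still lists "Treat1"

# Drop the unused levels after the split
train_classes <- droplevels(train_classes)
print(levels(train_classes))

# Re-encode the test labels against the training levels;
# labels unseen in training become NA and are easy to spot
test_classes <- factor(as.character(classes[c(3, 5)]),
                       levels = levels(train_classes))
print(test_classes)
```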
modified 9 months ago • written 9 months ago by Kevin Blighe33k

I just realized that all predictor columns are factors. Could this be the cause of the problem? Should I convert them to numeric?

written 9 months ago by arronar150

Yes, that will also cause problems. They should be numeric, and the best way to avoid a situation like this is to go back through each step to determine where the numbers are being converted into factors.
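
One way to track down where numbers turned into factors is to look for entries that do not parse as numbers, since a single stray string in a column is enough to make R import it as text. A small sketch on a made-up column (the values "n/a" and "NULL" are assumptions about what such stray entries might look like):

```r
# Toy column as it might look after a bad import (values are made up)
x <- factor(c("1.2", "3.4", "n/a", "5.6", "NULL"))

# Convert via character; entries that are not numbers become NA
num <- suppressWarnings(as.numeric(as.character(x)))

# Entries that were present in the input but failed to parse
bad <- is.na(num) & !is.na(x)
print(which(bad))             # positions of the offending entries
print(as.character(x[bad]))   # the offending strings themselves
```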

Another problem, I believe, is with this piece of code:

randomForest(classes~., data=as.matrix(train.new), ...)

This will mean that classes (encoded as integers) is going to be included as both a predictor and the outcome. Your data should be the original data without the outcome variable, something like:

data=train.new[,-which(colnames(train.new) %in% "classes")]

I do something similar here with lasso (see the step 'Perform 10-fold cross validation'): A: How to exclude some of breast cancer subtypes just by looking at gene expressio

written 9 months ago by Kevin Blighe33k

I removed the as.matrix() and also converted the factors to numeric with the following code:

# Convert dataframe into numeric
data[,-20040] <- sapply(data[,-20040], as.character)
data[,-20040] <- sapply(data[,-20040], as.numeric)

They must have become factors while I was reading the file:

data <- read.table("genes.tsv",sep = "\t",header = TRUE, stringsAsFactors = FALSE)

As for the classes column you mentioned, it is not encoded as integers but as a factor (this is how randomForest wants it), and there is no need to exclude it from the training data. As far as I remember, randomForest() can handle this.
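
A quick base R check of this point, on a made-up data frame (gene1, gene2 and the values are assumptions). It only shows how a formula with a dot is expanded against a data frame; as far as I know, randomForest's formula interface builds on the same model-frame machinery, but that part is not exercised here:

```r
# Toy data frame with two predictors and the outcome (hypothetical names)
df <- data.frame(
  gene1   = c(1.5, 2.5, 3.5, 4.5),
  gene2   = c(0.1, 0.2, 0.3, 0.4),
  classes = factor(c("Wt", "Tg", "Wt", "Tg"))
)

# Expand the dot against the data frame and list the predictor terms
tt <- terms(classes ~ ., data = df)
predictors <- attr(tt, "term.labels")
print(predictors)   # the response on the left-hand side is not included
```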

Anyway, after removing as.matrix() and converting the gene expression values from factors to numeric, it now seems to work.

written 9 months ago by arronar150

Great that it is now resolved. On the conversion from factors to numeric values, please just double-check that it has done what you expected. This is R 'programming': it's messy, and things frequently turn out differently than expected!
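
A small sketch of what such a double check might look like, on made-up values. The classic pitfall is calling as.numeric() directly on a factor, which returns the internal level codes rather than the numbers; going through as.character() first, as done in the thread, avoids it:

```r
# Toy factor holding numbers as labels (values are made up)
f <- factor(c("10.5", "2.25", "7"))

wrong <- as.numeric(f)                 # integer level codes, not the values
right <- as.numeric(as.character(f))   # the actual numbers

print(wrong)
print(right)

# Simple sanity checks one might run on the converted data frame
df <- data.frame(g1 = right, g2 = c(0.5, 1.5, 2.5))
print(all(sapply(df, is.numeric)))   # is every column numeric?
print(sum(is.na(df)))                # how many NAs did the conversion introduce?
```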

modified 9 months ago • written 9 months ago by Kevin Blighe33k

Yeah. I noticed that. Everything seems to be right.

written 9 months ago by arronar150

QVINTVS_FABIVS_MAXIMVS, if your problem is different, then please post a new question.

written 9 months ago by Kevin Blighe33k

I figured it out. The VCFs I was working on had different Tranche levels, so the new data contained a factor level that the model had not been trained on.
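
A hedged sketch of how one could check for this situation before calling predict(). The tranche labels below are placeholders, not real VCF annotations, and the variable names are hypothetical:

```r
# Placeholder tranche labels; real VCF annotations will differ
train_tranche <- factor(c("tranche_A", "PASS", "PASS"))
new_tranche   <- factor(c("PASS", "tranche_B"))

# Levels present in the new data but absent from the training data
unseen <- setdiff(levels(new_tranche), levels(train_tranche))
print(unseen)

if (length(unseen) > 0) {
  message("Unseen levels: ", paste(unseen, collapse = ", "))
}
```

Running a check like this on every factor column makes it clear which variable triggers the "New factor levels" error.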

written 8 months ago by QVINTVS_FABIVS_MAXIMVS2.2k