Random Forest returns "New factor levels not present in the training data"
0
2
Entering edit mode
4.8 years ago
arronar ▴ 270

Hi.

I'm trying to run a random forest on some microarray data using the code below, but I'm getting the error in the title. As you will see, I tried to work around this issue by following the Stack Overflow link commented in the code, but without any success.

acc = numeric()

for(i in 1:20){

# Random Sampling with 70-30% for training and validation respectively
y = z = 0
while(y != 9 || z != 9){
sample = sample(x = 1:nrow(data) , size = 0.7 * nrow(data) )

train = data[sample,]
test = data[-sample,]

y = length(unique(train$classes))
z = length(unique(test$classes))

}

print(paste(y , z))

# https://stat.ethz.ch/pipermail/r-help/2008-March/156608.html
# https://stackoverflow.com/questions/17059432/random-forest-package-in-r-shows-error-during-prediction-if-there-are-new-fact
test$classes <- as.character(test$classes)
train$classes <- as.character(train$classes)
test$isTest <- rep(1,nrow(test))
train$isTest <- rep(0,nrow(train))

fullSet <- rbind(test,train)
fullSet$classes <- as.factor(fullSet$classes)

test.new <- fullSet[fullSet$isTest==1,]
train.new <- fullSet[fullSet$isTest==0,]

test.new$isTest = NULL
train.new$isTest = NULL

print(levels(test.new$classes))
print(levels(train.new$classes))

# Calculating the model with
# mtry : number of variables randomly sampled as candidates at each split
# ntree : number of trees to grow
rf = randomForest(classes~., data=as.matrix(train.new), mtry=5, ntree=2000, importance=TRUE)

p = predict(rf, test.new)

acc = mean(test.new$classes == p)
print(acc)

# Keep track of and save the models that have high accuracy
if(acc > 0.65){
print(acc)
saveRDS(rf , paste("./rf_models/rf_", i, "_", acc, ".rds", sep=""))
}
}

The error I'm getting is:

Error in predict.randomForest(rf, test.new) :
  New factor levels not present in the training data
Calls: predict -> predict.randomForest
Execution halted

I therefore added

print(levels(test.new$classes))
print(levels(train.new$classes))

to check whether the levels of the training and testing sets were different. These lines returned:

[1] "Treat1" "Treat2" "Treat3" "Treat4" "Treat5" "Treat6" "Treat7" "Tg" "Wt"
[1] "Treat1" "Treat2" "Treat3" "Treat4" "Treat5" "Treat6" "Treat7" "Tg" "Wt"

Am I doing something wrong? How can I approach such an issue?

R Machine Learning microarrays

0
Entering edit mode

I get this error from ForestDNM on new versions of GATK HaplotypeCaller VCFs. If someone figures this out, I would love to know the answer!

0
Entering edit mode

One thing I see is that you coerce your training data into a matrix with as.matrix(train.new) inside the randomForest() call. This can have unexpected consequences, one being that factors in train.new will be converted into 1, 2, 3, 4, etc., based on how they are ordered. Thus, they will already differ from the testing data. Another thing: when you split a data frame that has categorical variables / factors, it is good practice to re-level those factors in the new objects, in your case with:

test$classes <- factor(as.character(test$classes))
train$classes <- factor(as.character(train$classes))
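To see why coercing a mixed data frame to a matrix is problematic, here is a minimal base-R sketch (the data frame `df` and its columns are made up for illustration):

```r
# A data frame mixing a numeric predictor with a factor outcome
df <- data.frame(expr = c(1.2, 3.4, 5.6),
                 classes = factor(c("Wt", "Tg", "Wt")))

# as.matrix() finds a common type for all columns: everything becomes character
m <- as.matrix(df)
class(m[, "expr"])    # "character" -- the numeric column is no longer numeric

# data.matrix() instead replaces factor labels with their integer level codes
dm <- data.matrix(df)
dm[, "classes"]       # 2 1 2 -- "Wt" and "Tg" are gone, only the codes remain
```

Either way the factor labels do not survive the coercion, which is why the model and the (untouched) test data end up disagreeing about the class levels.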

0
Entering edit mode

I just realized that all predictor columns are factors. Could this be the cause of the problem? Should I convert them into numeric?

0
Entering edit mode

Yes, that will also create an issue: they should be numeric, and the best way to avoid a situation like that is to go back through each step to determine where the numbers are being converted into factors.
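One quick way to audit this is to tabulate the class of every column; a sketch with a made-up data frame where one expression column was accidentally read in as a factor:

```r
# Hypothetical data frame: g2 was accidentally read in as a factor
df <- data.frame(g1 = c(1.1, 2.2),
                 g2 = factor(c("0.5", "0.7")),
                 classes = factor(c("Wt", "Tg")))

# Count column types to spot unexpected factors among the predictors
table(sapply(df, class))

# List the offending columns by name
names(df)[sapply(df, is.factor)]   # "g2" "classes"
```

Running this on the full data frame before splitting shows exactly which predictor columns still need conversion.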

Another problem, I believe, is with this piece of code:

randomForest(classes~., data=as.matrix(train.new), ...)


This means that classes (encoded as integers) will be included as both a predictor and the outcome. Your data should be the original data without the outcome variable, something like:

data=train.new[,-which(colnames(train.new) %in% "classes")]


I do something similar here with lasso (see the step 'Perform 10-fold cross validation'): A: How to exclude some of breast cancer subtypes just by looking at gene expressio
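The column exclusion above can be sketched in base R on a toy data frame (names are made up; note that the formula interface `classes ~ .` already excludes the left-hand side from the predictors, so this matters mainly when passing a predictor matrix directly):

```r
df <- data.frame(g1 = 1:3, g2 = 4:6, classes = factor(c("a", "b", "a")))

# Drop the outcome column, keeping only the predictors
predictors <- df[, -which(colnames(df) %in% "classes"), drop = FALSE]
colnames(predictors)   # "g1" "g2"
```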

0
Entering edit mode

I removed the as.matrix() and also converted the factors to numeric with the following code:

# Convert dataframe into numeric
data[,-20040] <- sapply(data[,-20040], as.character)
data[,-20040] <- sapply(data[,-20040], as.numeric)


They must have become factors while I was reading the file, even though I used:

data <- read.table("genes.tsv",sep = "\t",header = TRUE, stringsAsFactors = FALSE)


As for the classes column you mentioned, it is not encoded as integers but as factors (this is the way randomForest wants it), and there is no need to exclude it from the training data itself. As far as I remember, randomForest() can handle this.

Anyway, it seems that after removing the as.matrix() call and converting the gene expression values from factors to numeric, it is now working.
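For anyone double-checking that conversion: the intermediate as.character() step is essential, because as.numeric() applied directly to a factor returns the internal level codes rather than the printed values. A minimal illustration:

```r
f <- factor(c("3", "1", "20"))

# Direct conversion returns the level codes (levels sort as "1", "20", "3")
as.numeric(f)                  # 3 1 2

# Going through as.character() recovers the actual values
as.numeric(as.character(f))    # 3 1 20
```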

0
Entering edit mode

Great that it is now resolved. Regarding the conversion from factors to numeric values, please just double-check that it has done what you expected. This is R programming; it's messy, and things frequently turn out in unexpected ways!

1
Entering edit mode

Yeah. I noticed that. Everything seems to be right.

0
Entering edit mode

QVINTVS_FABIVS_MAXIMVS, if your problem is different, then please post a new question.

0
Entering edit mode

I figured it out. The VCFs I was working on had different Tranche levels, so there was a factor level that the model was not trained on.