random forests regression permutation importance
18 months ago
bsp017 ▴ 50

I have been using a random forests regression model to predict a phenotype and then correlating the measured and predicted phenotypes (R). The input data are SNPs at sites across a genome for multiple individuals. I then measure the importance of each SNP using permutation importance, and some SNPs are given a negative value. If I remove these negative SNPs and re-run the model on the reduced SNP set, I find the R value improves. I can go through several iterations of this until reaching a maximal R value.

For example, I used 50,000 SNPs to correlate measured and predicted phenotype, which gives an R2 value of 0.7. I then measured the permutation importance of each of the 50,000 SNPs. Removing the SNPs with negative permutation importance leaves 37,000 SNPs, which as random forests input give an R2 value of 0.75. This process continues until reaching a maximal R2 value of 0.8 with 25,000 SNPs.
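
In pseudocode, the loop looks roughly like this (a minimal sketch using the randomForest R package; geno and pheno are placeholder names for my genotype table and phenotype vector, not my actual objects):

library(randomForest)

snps <- colnames(geno)
repeat {
    rf   <- randomForest(x = geno[, snps], y = pheno,
                         ntree = 500, importance = TRUE)
    imp  <- importance(rf, type = 1, scale = FALSE)  # permutation importance (%IncMSE)
    keep <- rownames(imp)[imp[, 1] > 0]              # drop SNPs with negative importance
    if (length(keep) == length(snps)) break          # stop once nothing is removed
    snps <- keep
}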

Is there any issue with this approach?

importance random forests permutation regression

In any kind of data analysis, removing outliers will result in a better correlation.

18 months ago
LChart 3.9k

This is a straightforward procedure, but you are very clearly overfitting: the importance scores are computed from the same data you then use to select features and re-train, so each pruning round is tuned to that data and the rising R2 reflects selection bias rather than genuinely better prediction. Based on your description I also suspect that you may be using too few trees. What is the ratio of (# trees)/(# features)?
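
One way to avoid this is to set aside a holdout partition that the selection loop never touches, and score the final SNP set against it exactly once. A rough sketch, assuming the randomForest package (geno, pheno, and snps are placeholder names):

set.seed(1)
n    <- nrow(geno)
hold <- sample(n, n %/% 5)     # ~20% holdout, untouched by feature selection
# ... run the importance-based pruning iterations on geno[-hold, ] only ...
rf_final <- randomForest(x = geno[-hold, snps], y = pheno[-hold], ntree = 500)
pred <- predict(rf_final, geno[hold, snps])
cor(pred, pheno[hold])^2       # honest R2 for the selected SNP set

If the R2 gain disappears on the holdout, the improvement was selection bias rather than signal.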


I have around 50k features and 500 trees. I have almost 500 observations, which I split 50/50 into training and test sets. I found that increasing the number of trees reduced the % Var explained.

This is how I run the model:

rf_reg <- randomForest(x = train[, colnames(train) != "Trait"],  # all SNP predictors
                       y = train$Trait,
                       ntree = 500, mtry = 2,
                       importance = TRUE, keep.inbag = TRUE,
                       do.trace = 100,    # print OOB progress every 100 trees
                       proximity = TRUE)  # the argument is 'proximity', not 'proximities'

and then to calculate feature importance:

rf1 <- randomForest(x = table[, colnames(table) != "Trait"],
                    y = table$Trait,
                    ntree = 500, mtry = 2, nodesize = 1,
                    replace = FALSE, importance = TRUE)

# type = 1 is permutation importance (%IncMSE); scale = FALSE keeps raw values
imp <- importance(rf1, type = 1, scale = FALSE)
18 months ago
LChart 3.9k

I notice you're setting mtry=2. This means that at every split, in every tree, only 2 predictors are even tested. With 500 trees and, say, 8 splits per tree, that is only 8,000 predictor draws, so at most 8,000 of your 50,000 predictors ever get tested.

At the very least, you need settings of ntree and mtry such that each predictor is expected to be tested at the root level at least once; otherwise your results will be very unstable. The only reason this approach is even effective in your case is the high degree of correlation between predictors (LD).
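
To put numbers on it: a given predictor has probability mtry/p of being among the candidates sampled at the root of any one tree, so the expected number of root-level tests per predictor is ntree * mtry / p. A back-of-the-envelope check in R:

p <- 50000; ntree <- 500; mtry <- 2
ntree * mtry / p   # 0.02 expected root-level tests per predictor
ceiling(p / mtry)  # you would need 25,000 trees at mtry = 2 to expect ~1
# for comparison, randomForest's default for regression is mtry = floor(p/3)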
