random forests regression permutation importance
18 months ago
bsp017 ▴ 50

I have been using a random forests regression model to predict a phenotype and then correlating the measured and predicted phenotypes (R). The input data are SNPs at sites across a genome for multiple individuals. I then measure the importance of each SNP using permutation importance, and some SNPs are given a negative value. If I remove these negative SNPs and re-run the model on the reduced SNP set, I find the R value improves. I can go through several iterations of this until reaching a maximal R value.

For example, I used 50,000 SNPs to correlate measured and predicted phenotype, which gives an R2 value of 0.7. I then measured the permutation importance of each of the 50,000 SNPs. Removing the SNPs with negative permutation importance leaves 37,000 SNPs, which as random forests input give an R2 value of 0.75. This process continues until reaching a maximal R2 value of 0.8 with 25,000 SNPs.
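
In pseudocode, the loop looks roughly like this (a minimal sketch using the randomForest R package; geno and pheno are placeholder names for my genotype table and phenotype vector, not my actual objects):

library(randomForest)

snps <- colnames(geno)
repeat {
    rf   <- randomForest(x = geno[, snps], y = pheno,
                         ntree = 500, importance = TRUE)
    imp  <- importance(rf, type = 1, scale = FALSE)  # permutation importance (%IncMSE)
    keep <- rownames(imp)[imp[, 1] > 0]              # drop SNPs with negative importance
    if (length(keep) == length(snps)) break          # stop once nothing is removed
    snps <- keep
}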

Is there any issue with this approach?

importance random forests permutation regression

In any kind of data analysis, removing outliers will result in a better correlation.

18 months ago
LChart 3.9k

This is a straightforward procedure, but you are very clearly overfitting: the importance scores are computed from the same data you then use to select features and re-train, so each pruning round is tuned to that data and the rising R2 reflects selection bias rather than genuinely better prediction. Based on your description I also suspect that you may be using too few trees. What is the ratio of (# trees)/(# features)?
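
One way to avoid this is to set aside a holdout partition that the selection loop never touches, and score the final SNP set against it exactly once. A rough sketch, assuming the randomForest package (geno, pheno, and snps are placeholder names):

set.seed(1)
n    <- nrow(geno)
hold <- sample(n, n %/% 5)     # ~20% holdout, untouched by feature selection
# ... run the importance-based pruning iterations on geno[-hold, ] only ...
rf_final <- randomForest(x = geno[-hold, snps], y = pheno[-hold], ntree = 500)
pred <- predict(rf_final, geno[hold, snps])
cor(pred, pheno[hold])^2       # honest R2 for the selected SNP set

If the R2 gain disappears on the holdout, the improvement was selection bias rather than signal.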


I have around 50k features and 500 trees. I have almost 500 observations, which I split 50/50 into training and test sets. I found that increasing the number of trees reduced the % Var explained.

This is how I run the model:

rf_reg <- randomForest(x = train[, colnames(train) != "Trait"],  # all SNP predictors
                       y = train$Trait,
                       ntree = 500, mtry = 2,
                       importance = TRUE, keep.inbag = TRUE,
                       do.trace = 100,    # print OOB progress every 100 trees
                       proximity = TRUE)  # the argument is 'proximity', not 'proximities'

and then to calculate feature importance:

rf1 <- randomForest(x = table[, colnames(table) != "Trait"],
                    y = table$Trait,
                    ntree = 500, mtry = 2, nodesize = 1,
                    replace = FALSE, importance = TRUE)

# type = 1 is permutation importance (%IncMSE); scale = FALSE keeps raw values
imp <- importance(rf1, type = 1, scale = FALSE)
18 months ago
LChart 3.9k

I notice you're setting mtry=2. This means that at every split, in every tree, only 2 predictors are even tested. With 500 trees and, say, 8 splits per tree, that is only 8,000 predictor draws, so at most 8,000 of your 50,000 predictors ever get tested.

At the very least, you need settings of ntree and mtry such that each predictor is expected to be tested at the root level at least once; otherwise your results will be very unstable. The only reason this approach is even effective in your case is the high degree of correlation between predictors (LD).
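
To put numbers on it: a given predictor has probability mtry/p of being among the candidates sampled at the root of any one tree, so the expected number of root-level tests per predictor is ntree * mtry / p. A back-of-the-envelope check in R:

p <- 50000; ntree <- 500; mtry <- 2
ntree * mtry / p   # 0.02 expected root-level tests per predictor
ceiling(p / mtry)  # you would need 25,000 trees at mtry = 2 to expect ~1
# for comparison, randomForest's default for regression is mtry = floor(p/3)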
