Question

Training A Random Forest Model On Unbalanced Training Datasets

11.3 years ago

mario.messih ▴ 30

I have a large dataset (>300,000 observations) that represents the distance (RMSD) between pairs of proteins. I'm building a random forest regression model that is supposed to predict the distance between any two proteins.

My problem is that I'm mostly interested in close matches (short distances), but my data distribution is heavily skewed: the majority of the distances are large. I don't really care how well the model predicts large distances; I want to make sure that it accurately predicts the distances of close models. However, when I train the model on the full dataset, its performance isn't good. What is the best way to sample the data so that the model predicts close-match distances as accurately as possible, without stratifying the data too much? Unfortunately, this skewed distribution is the real-world distribution that I'm going to validate and test the model on.

The following is my data distribution, where the first column is the distance and the second column is the number of observations in that distance range:

Distance  Observations
0          330
1          1903
2          12210
3          35486
4          54640
5          62193
6          60728
7          47874
8          33666
9          21640
10         12535
11         6592
12         3159
13         1157
14         349
15         86
16         12
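
To make the trade-off concrete, here is a minimal sketch of one possible approach: turning the counts in the table into per-bin weights that emphasise short distances without fully flattening the distribution. The helper `bin_weights` and its `cutoff` and `power` parameters are hypothetical names and values chosen for illustration; they are not taken from any library, and the defaults are assumptions, not tuned settings.

```python
# Observation counts per distance bin, copied from the table above.
COUNTS = {
    0: 330, 1: 1903, 2: 12210, 3: 35486, 4: 54640, 5: 62193,
    6: 60728, 7: 47874, 8: 33666, 9: 21640, 10: 12535, 11: 6592,
    12: 3159, 13: 1157, 14: 349, 15: 86, 16: 12,
}


def bin_weights(counts, cutoff=5, power=0.5):
    """Return a weight per distance bin (hypothetical helper, not a library call).

    Bins at or below `cutoff` are boosted by (largest_count / count) ** power;
    bins above it keep a raw weight of 1. power=0 leaves the data unweighted,
    power=1 gives every boosted bin the same total weight as the largest bin.
    Weights are rescaled so the average weight per observation is 1.
    """
    largest = max(counts.values())
    raw = {
        d: (largest / n) ** power if d <= cutoff else 1.0
        for d, n in counts.items()
    }
    total = sum(counts.values())
    scale = total / sum(raw[d] * counts[d] for d in counts)
    return {d: raw[d] * scale for d in counts}


weights = bin_weights(COUNTS)
for distance in sorted(weights):
    print(f"{distance:>2}  {weights[distance]:.3f}")
```

If scikit-learn is used, each observation's weight could be looked up from its distance bin and passed as `sample_weight` to `RandomForestRegressor.fit`. The same weights could instead serve as sampling probabilities. Either way, the held-out data would keep the original distribution, so that the evaluation still reflects the real-world skew.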

statistics prediction protein-structure • 2.7k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 11.3 years ago by mario.messih ▴ 30

I think that's a regular question.

ADD REPLY • link 11.3 years ago by Michael 56k