Training A Random Forest Model On Unbalanced Training Datasets
Entering edit mode
9.2 years ago
mario.messih ▴ 30

I have a large dataset (>300,000 observations) that represent the distance (RMSD) between proteins. I'm building a regression model (Random Forest) that is supposed to predict the distance between any two proteins.

My problem is that I'm more interested in close matches (short distances), however my data distribution is highly biased such that the majority of the distances are large. I don't really care how good the model will be able to predict large distances, so I want to make sure that the model will be able to accurately predict the distance of close models. However, when I train the model on the full data the performance of the model isn't good, so I wonder what is the best sampling way I can do such that I can guarantee that the model will predict the close matches distance as much accurately as possible and at the same time now to stratify the data so much since unfortunately this biased data distribution represent the real world data distribution that I'm going to validate and test the model on.

The following is my data distribution where the first column represents the distances and the second column represent the number of observations in this distance range:

Distance  Observations
0          330
1          1903
2          12210
3          35486
4          54640
5          62193
6          60728
7          47874
8          33666
9          21640
10         12535
11         6592
12         3159
13         1157
14         349
15         86
16         12
statistics prediction protein-structure • 2.4k views
Entering edit mode

I think that's a regular question.


Login before adding your answer.

Traffic: 2289 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6