Question: Training A Random Forest Model On Unbalanced Training Datasets
1
gravatar for mario.messih
6.3 years ago by
mario.messih30
mario.messih30 wrote:

I have a large dataset (>300,000 observations) that represent the distance (RMSD) between proteins. I'm building a regression model (Random Forest) that is supposed to predict the distance between any two proteins.

My problem is that I'm more interested in close matches (short distances), however my data distribution is highly biased such that the majority of the distances are large. I don't really care how good the model will be able to predict large distances, so I want to make sure that the model will be able to accurately predict the distance of close models. However, when I train the model on the full data the performance of the model isn't good, so I wonder what is the best sampling way I can do such that I can guarantee that the model will predict the close matches distance as much accurately as possible and at the same time now to stratify the data so much since unfortunately this biased data distribution represent the real world data distribution that I'm going to validate and test the model on.

The following is my data distribution where the first column represents the distances and the second column represent the number of observations in this distance range:

Distance  Observations
0          330
1          1903
2          12210
3          35486
4          54640
5          62193
6          60728
7          47874
8          33666
9          21640
10         12535
11         6592
12         3159
13         1157
14         349
15         86
16         12
ADD COMMENTlink modified 6.3 years ago by Michael Dondrup47k • written 6.3 years ago by mario.messih30

I think that's a regular question.

ADD REPLYlink written 6.3 years ago by Michael Dondrup47k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1099 users visited in the last hour