I have a large dataset (>300,000 observations) of distances (RMSD) between pairs of proteins. I'm building a regression model (Random Forest) to predict the distance between any two proteins.
My problem is that I'm mostly interested in close matches (short distances), but my data distribution is highly skewed: the majority of the distances are large. I don't really care how well the model predicts large distances; I want to make sure it accurately predicts the distances of close matches. However, when I train the model on the full data, its performance isn't good. What is the best way to sample the data so that the model predicts close-match distances as accurately as possible, while not stratifying the data too heavily? Unfortunately, this skewed distribution reflects the real-world distribution that I'm going to validate and test the model on.
The following is my data distribution, where the first column is the distance and the second column is the number of observations in that distance range:
Distance  Observations
0         330
1         1903
2         12210
3         35486
4         54640
5         62193
6         60728
7         47874
8         33666
9         21640
10        12535
11        6592
12        3159
13        1157
14        349
15        86
16        12
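
To put numbers on the imbalance, here is a minimal sketch that uses only the counts from the table above. It computes the share of observations that are close matches and derives one possible set of inverse-frequency weights per distance bin. The cutoff of 2 for a "close match" (`CLOSE_THRESHOLD`) and the weighting scheme itself are assumptions for illustration, not a recommended solution.

```python
# Counts copied from the table above: distance bin -> number of observations.
counts = {
    0: 330, 1: 1903, 2: 12210, 3: 35486, 4: 54640, 5: 62193,
    6: 60728, 7: 47874, 8: 33666, 9: 21640, 10: 12535, 11: 6592,
    12: 3159, 13: 1157, 14: 349, 15: 86, 16: 12,
}

total = sum(counts.values())

# Assumption: a "close match" is any observation in distance bin 2 or lower.
CLOSE_THRESHOLD = 2
close = sum(n for d, n in counts.items() if d <= CLOSE_THRESHOLD)
close_share = close / total

# Inverse-frequency weight per bin, scaled so that the average weight
# over all observations equals 1 (rare bins get large weights).
n_bins = len(counts)
weights = {d: total / (n_bins * n) for d, n in counts.items()}

print(f"total observations: {total}")
print(f"close matches (bin <= {CLOSE_THRESHOLD}): {close} ({close_share:.1%})")
for d in sorted(weights):
    print(f"bin {d:2d}: weight {weights[d]:.3f}")
```

With this cutoff, only about 4% of the observations are close matches. Per-observation weights like these could be passed through the `sample_weight` argument of scikit-learn's `RandomForestRegressor.fit`, which leaves the data itself unchanged. Note that pure inverse-frequency weights also boost the rare large distances (bins 14 to 16), so a scheme that only upweights the short distances may fit the goal better.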