I would appreciate your input on the following issue:
I am trying to build binary classifiers to assign variants to two different classes. The dataset is an annotated variant file with dimensions 187,643 x 203. The first column contains the class labels, with no NAs. The remaining columns hold allele frequency data from different populations/sub-populations. Here is a snapshot of the dataset:
Missing values (NAs) have to be dealt with before training a classifier. The challenge I am facing is the large number of NAs in the allele frequency data: the fraction of NAs per column ranges from 24% to 90%.
I was thinking of setting a cut-off for the NA fraction (let's say 30%), dropping the columns that exceed it, and then replacing the NAs in each remaining column with the class-specific mean. Alternatively, a deep learning library such as Datawig might be helpful for imputing the missing values.
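To make the first option concrete, here is a minimal pandas sketch of what I have in mind. The function name `drop_and_impute`, the column names `label`, `AF_a`, `AF_b`, and the toy values are made up for illustration; only the 30% cut-off comes from the description above.

```python
import numpy as np
import pandas as pd


def drop_and_impute(df, label_col, max_na_frac=0.30):
    """Drop feature columns with too many NAs, then fill the remaining
    NAs with the mean of the same column within the same class."""
    features = df.drop(columns=[label_col])

    # Fraction of missing values per feature column.
    na_frac = features.isna().mean()
    kept = features[na_frac[na_frac <= max_na_frac].index]

    # Per-class column means, broadcast back to the original row order.
    class_means = kept.groupby(df[label_col]).transform("mean")
    imputed = kept.fillna(class_means)

    return pd.concat([df[[label_col]], imputed], axis=1)


# Hypothetical toy data: AF_a has 25% NAs (kept), AF_b has 75% NAs (dropped).
toy = pd.DataFrame({
    "label": [0, 0, 1, 1],
    "AF_a": [0.1, np.nan, 0.5, 0.7],
    "AF_b": [np.nan, np.nan, np.nan, 0.2],
})
result = drop_and_impute(toy, "label", max_na_frac=0.30)
print(result)
```

Note that the class-specific mean needs the label, so this only works as written on labeled data; for unlabeled variants the imputation would have to be done differently.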
However, with a cut-off threshold I am going to lose some features that seem to be really important for this classification job. For example, the column holding the 1KG project allele frequency data (AF_TGP) has 64% missing values, but it can classify the samples into the two groups with an AUC [95% CI] of 73.23 [72.86 - 73.61]. I do want to keep this kind of feature in the dataset.
On the other hand, keeping the missing values requires algorithms that can handle them, such as k-NN, naive Bayes, and random forest. But it seems the sklearn implementations of these algorithms do not support missing values.
Update on 2021-08-25: I understand that discretizing numerical features is generally discouraged for most ML tasks, but in this case, if I convert AF_TGP into a categorical variable and assign a separate group name to the NA cases, I will be able to keep the feature. How does that sound to you?
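A minimal sketch of that discretization idea with pandas is below. The bin edges (0.01 and 0.05), the bin names, the helper `bin_with_missing`, and the toy values are all assumptions for illustration; I have not settled on actual thresholds.

```python
import numpy as np
import pandas as pd


def bin_with_missing(values, bins, labels, missing_label="missing"):
    """Bin a numeric Series and put NAs into their own category."""
    binned = pd.cut(values, bins=bins, labels=labels, include_lowest=True)
    # NAs stay NA after pd.cut, so register an extra category and fill it in.
    binned = binned.cat.add_categories([missing_label])
    return binned.fillna(missing_label)


# Hypothetical AF_TGP values, including missing entries.
af_tgp = pd.Series([0.0005, np.nan, 0.03, 0.4, np.nan], name="AF_TGP")

af_tgp_cat = bin_with_missing(
    af_tgp,
    bins=[0.0, 0.01, 0.05, 1.0],           # assumed cut points
    labels=["rare", "low_freq", "common"],  # assumed bin names
)

# One-hot encode so the feature can be fed to estimators that need numbers.
dummies = pd.get_dummies(af_tgp_cat, prefix="AF_TGP")
print(dummies)
```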
P.S. A high-level version of this post has been posted elsewhere as well.