Genotypic and clinical phenotypic feature selection methods for machine learning (in python)?
3 months ago
Tom ▴ 40

I have a set of genotypic and phenotypic features like this:

            SNP1  SNP2  SNP3  survival  blood_pressure  gender
patient1    0     1     1     23        24              0
patient2    1     0     2     34        4               1
patient3    1     1     2     43        23              1
patient4    2     1     0     23        3               2


I want to do feature selection on these mixed continuous and categorical features before feeding them into a machine learning algorithm. Does anyone know of a Python library (or Python code) that is suitable for this?

python machine-learning • 399 views
3 months ago
Mensur Dlakic ★ 20k

Assuming that IDs are in the first column, I don't see any categorical features.

Generally speaking, one wants to remove the features that are the same in all samples (zero variance means zero signal) or different in all samples (various IDs, private identification numbers such as SSNs, zip codes, etc).

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
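
A minimal sketch of the zero-variance filter, using `VarianceThreshold` from the link above. The data frame copies the table from the question, except that `SNP4` is a hypothetical constant column I added so there is something to drop, and `survival` is left out on the assumption that it is the target rather than a feature.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data modelled on the table in the question.
# SNP4 is hypothetical: it is identical in every patient.
X = pd.DataFrame(
    {
        "SNP1": [0, 1, 1, 2],
        "SNP2": [1, 0, 1, 1],
        "SNP3": [1, 2, 2, 0],
        "SNP4": [1, 1, 1, 1],
        "blood_pressure": [24, 4, 23, 3],
        "gender": [0, 1, 1, 2],
    },
    index=["patient1", "patient2", "patient3", "patient4"],
)

# threshold=0.0 removes only the columns that never change.
selector = VarianceThreshold(threshold=0.0)
selector.fit(X)

# Map the boolean mask back to column names so the result stays labelled.
kept_columns = X.columns[selector.get_support()].tolist()
X_reduced = X[kept_columns]
print(kept_columns)
```

Columns that are different in every sample (IDs and the like) are not caught by this filter and usually have to be dropped by name.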

If you plan to use linear models downstream, it may be a good idea to remove correlated variables.
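
One common way to do this, sketched here with pandas: compute the absolute pairwise correlation matrix and drop one column from every pair above a cutoff. The helper name `drop_correlated`, the 0.9 cutoff and the synthetic data are my own assumptions for illustration, not something from a specific library.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.9):
    """Drop the later column of each pair whose |Pearson r| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic demo: "a_copy" is almost a rescaled duplicate of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
demo = pd.DataFrame(
    {"a": a, "a_copy": 2 * a + 0.01 * rng.normal(size=100), "b": b}
)

reduced, dropped = drop_correlated(demo)
print(dropped)
```

Which member of a correlated pair gets dropped is arbitrary here (the later column), so it may be worth checking the list by hand when the features have biological meaning.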

Finally, there are many feature selection approaches. I recommend RFECV and FFSCV which can be found in these two packages:
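
For RFECV specifically, scikit-learn has an implementation in `sklearn.feature_selection.RFECV`; whether that is one of the packages meant here is not stated in the post. A rough sketch on synthetic data follows. The logistic regression estimator, the 5-fold split and the accuracy scoring are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a feature matrix: 15 features, 4 of them informative.
X, y = make_classification(
    n_samples=200,
    n_features=15,
    n_informative=4,
    n_redundant=0,
    random_state=0,
)

# Recursive feature elimination: repeatedly fit, drop the weakest feature,
# and use cross-validation to pick how many features to keep.
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X, y)

print("features kept:", selector.n_features_)
print("mask:", selector.support_)
X_selected = selector.transform(X)
```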

If working with regression targets, these could be useful as well:

https://scikit-learn.org/stable/modules/classes.html#regressors-with-variable-selection
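
As an illustration of that family, here is a sketch with `LassoCV`, where the L1 penalty sets some coefficients exactly to zero and so acts as a feature selector. The synthetic data and the choice of Lasso over the other regressors in that list are my assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: 20 features, only 3 carry signal.
X, y = make_regression(
    n_samples=200,
    n_features=20,
    n_informative=3,
    noise=5.0,
    random_state=0,
)

# Scale first so the L1 penalty treats all features on the same footing.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

# Features whose coefficient was not shrunk to zero are the "selected" ones.
coef = model.named_steps["lassocv"].coef_
selected = np.flatnonzero(coef)
print("selected feature indices:", selected)
```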

Modern tree-based ML methods usually don't need any feature selection as they automatically eliminate features that are not used in decision splits during training.


Amazing links and advice. Just on categorical variables: SNPs and Gender are categorical, aren't they?


Given how the data is formatted in the original post, all features are numerical. As you correctly noted, SNPs and Gender take discrete numerical values. It is safe to assume that the Gender column doesn't have many unique states, but it is impossible to know how many there are for the SNPs.

The tree methods I mentioned above, for example gradient boosting machines, can be instructed specifically to consider columns categorical even when their contents are purely numerical. To them it probably wouldn't make much of a difference.

For linear models, however, it does matter when numerical values are meant to represent categories rather than smaller/greater relationships. In such a case some kind of feature encoding is needed, such as weight of evidence or frequency encoding.
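
Frequency encoding is simple enough to sketch with plain pandas: each category code is replaced by the fraction of rows in which it occurs. The helper name `frequency_encode` is hypothetical, and the two columns are copied from the table in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"SNP1": [0, 1, 1, 2], "gender": [0, 1, 1, 2]},
    index=["patient1", "patient2", "patient3", "patient4"],
)


def frequency_encode(column):
    """Replace each category with its relative frequency in the column."""
    freqs = column.value_counts(normalize=True)
    return column.map(freqs)


encoded = df.apply(frequency_encode)
print(encoded)
```

In a real pipeline the frequencies would normally be computed on the training split only and then applied to the test split, to avoid leaking information.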