Genotypic and clinical phenotypic feature selection methods for machine learning (in python)?
23 months ago
Tom ▴ 40

I have a set of genotypic and phenotypic features like this:

              SNP1  SNP2  SNP3  survival  blood_pressure  gender
    patient1     0     1     1        23              24       0
    patient2     1     0     2        34               4       1
    patient3     1     1     2        43              23       1
    patient4     2     1     0        23               3       2

I want to do feature selection on these mixed continuous and categorical data features before inputting into a machine learning algorithm. Would someone know of a library in python (or python code) that is suitable for this?

python machine-learning
23 months ago
Mensur Dlakic ★ 27k

Assuming that IDs are in the first column, I don't see any categorical features.

Generally speaking, one wants to remove the features that are the same in all samples (zero variance means zero signal) or different in all samples (various IDs, private identification numbers such as SSNs, zip codes, etc).

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
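
As a minimal sketch of the linked API on toy data (the matrix below is made up), `VarianceThreshold` with its default threshold drops columns that are constant across samples:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# toy matrix: the middle column is constant across samples (zero variance)
X = np.array([[0, 5, 1],
              [1, 5, 2],
              [1, 5, 2],
              [2, 5, 0]])

selector = VarianceThreshold()          # default threshold=0.0 drops constants
X_reduced = selector.fit_transform(X)

print(selector.get_support())           # boolean mask of the kept columns
print(X_reduced.shape)                  # the constant column is gone
```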

If you plan to use linear models downstream, it may be a good idea to remove correlated variables.
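
One common pandas-based sketch for this (the 0.95 cutoff and the toy frame are assumptions, not a prescription) is to scan the upper triangle of the absolute correlation matrix and drop one column from each highly correlated pair:

```python
import numpy as np
import pandas as pd

# toy frame: SNP1_dup is a perfect copy of SNP1
df = pd.DataFrame({
    "SNP1": [0, 1, 1, 2],
    "SNP1_dup": [0, 1, 1, 2],
    "blood_pressure": [24, 4, 23, 3],
})

corr = df.corr().abs()
# keep only the upper triangle so each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)  # the duplicated column is flagged
```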

Finally, there are many feature selection approaches. I recommend RFECV (recursive feature elimination with cross-validation) and FFSCV (forward feature selection with cross-validation).
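
RFECV ships with scikit-learn; here is a minimal sketch on synthetic data (the logistic-regression estimator and CV settings are illustrative assumptions, not requirements):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# synthetic data standing in for the genotype/phenotype matrix
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# recursively drop the weakest feature, scoring each subset by cross-validation
selector = RFECV(LogisticRegression(max_iter=1000), cv=5)
selector.fit(X, y)

print(selector.n_features_)   # number of features RFECV decided to keep
print(selector.support_)      # boolean mask over the original columns
```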

If working with regression targets, these could be useful as well:

https://scikit-learn.org/stable/modules/classes.html#regressors-with-variable-selection
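
From that family, one hedged example is `LassoCV`, whose L1 penalty shrinks coefficients of uninformative columns toward zero (the synthetic regression target below is an assumption):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# only the first two columns drive the continuous target (e.g. survival time)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LassoCV(cv=5).fit(X, y)
kept = np.flatnonzero(model.coef_ != 0)

print(kept)  # indices of the columns with non-zero coefficients
```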

Modern tree-based ML methods usually don't need any feature selection as they automatically eliminate features that are not used in decision splits during training.
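
To illustrate, a fitted gradient boosting model exposes `feature_importances_`, where features never used in a split sit at (near-)zero importance (synthetic data assumed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=2, n_redundant=0, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)

# importances sum to 1; uninformative columns contribute almost nothing
for i, imp in enumerate(clf.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```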


Amazing links and advice! Just one question on categorical variables: SNPs and Gender are categorical, aren't they?


The way the data is formatted in the original post, all features are numerical. As you correctly noted, SNPs and Gender are discretely numerical. It is safe to assume that the Gender column doesn't have many unique states, but it is impossible to know how many there are for the SNPs.

The tree methods I mentioned above, for example gradient boosting machines, can be instructed specifically to consider columns categorical even when their contents are purely numerical. To them it probably wouldn't make much of a difference.

To linear models, however, it does matter when numerical values are meant to represent categories rather than smaller/greater relationships. In such a case, some kind of feature encoding is needed, such as weight-of-evidence or frequency encoding.
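
Frequency encoding, for example, can be sketched in a couple of pandas lines (the toy SNP column is an assumption): each category is replaced by its relative frequency in the column.

```python
import pandas as pd

df = pd.DataFrame({"SNP1": [0, 1, 1, 2]})

# map each category to the fraction of rows it occupies
freq = df["SNP1"].value_counts(normalize=True)
df["SNP1_freq"] = df["SNP1"].map(freq)

print(df)
```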

