I have calculated around 1000 molecular structure descriptors (2D and 3D) and fingerprints (such as MACCS, PubChem fingerprint and Substructure fingerprint) for protein binders and non-binders. I need to build a classifier that discriminates binders from non-binders. Before that, I need to remove redundant features (descriptors and fingerprints) and reduce the data to only the significant features that can differentiate binders from non-binders. I have a few questions:
- I know redundant features can be removed by computing the Pearson correlation coefficient (PCC) between features and dropping one feature from each highly correlated pair. When computing the PCC, is it necessary to treat descriptors and fingerprints independently, or can the PCC be computed over the whole dataset?
- Can logistic regression with an elastic net penalty be used to select descriptors and fingerprints?
- What other methods are commonly used for descriptor and fingerprint selection?
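
To make the first question concrete, here is a minimal sketch of the correlation filter I have in mind, applied to the whole feature matrix at once (descriptors and fingerprint bits side by side). The function name `drop_correlated`, the 0.95 cutoff and the toy data are my own assumptions for illustration, not an established recipe. Note that Pearson correlation between two binary fingerprint bits is the phi coefficient, so the computation itself is defined for both feature types.

```python
import numpy as np


def drop_correlated(X, threshold=0.95):
    """Greedy Pearson filter: return the column indices that are kept.

    Hypothetical helper; the 0.95 threshold is an assumed cutoff.
    """
    X = np.asarray(X, dtype=float)
    # Constant columns have zero variance, so their correlation is undefined.
    non_constant = np.flatnonzero(X.std(axis=0) > 0)
    if non_constant.size == 0:
        return non_constant
    # Absolute correlation matrix between the remaining columns.
    corr = np.abs(np.atleast_2d(np.corrcoef(X[:, non_constant], rowvar=False)))
    kept = []
    for j in range(corr.shape[0]):
        # Keep column j only if it is not highly correlated with any kept column.
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return non_constant[kept]


# Toy data (assumed): 3 continuous descriptors, 1 rescaled copy of the first
# descriptor, 2 random fingerprint bits and 1 constant bit, for 50 molecules.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(50, 3))
rescaled_copy = descriptors[:, 0] * 2.0 + 1.0
bits = rng.integers(0, 2, size=(50, 2))
constant_bit = np.zeros(50)
X = np.column_stack([descriptors, rescaled_copy, bits, constant_bit])

kept = drop_correlated(X)
print("kept columns:", kept.tolist())
```

In this toy matrix, column 3 is a rescaled copy of column 0 and column 6 is constant, so both are dropped. The filter is greedy, so which member of a correlated pair survives depends on column order.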
All suggestions and helpful comments are appreciated.