I want to run a binary SVM classifier on a dataset of about 10,000 small-molecule drugs. The classifier should determine which of these drugs are structural analogues of a specific drug. Each instance (drug) will have five attributes, each holding some type of molecular descriptor.
For the training set, should I train the classifier on drugs whose relationship to the target drug is already known? More specifically, on a dataset with equal numbers of known structural analogues and known non-analogues? And do I then test the classifier on the 10,000 small-molecule drugs that have not yet been determined to be analogues?
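
To make the workflow I have in mind concrete, here is a minimal sketch, assuming Python with NumPy and scikit-learn (neither is fixed yet). All of the data is synthetic placeholder data drawn at random, and the array names, the set size of 200 labeled drugs, and the RBF kernel are assumptions for illustration only. Because the 10,000 drugs have no known labels, the last step is written as prediction rather than as a scored test, and accuracy is estimated by cross-validation on the labeled set instead.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical labeled set: 100 known analogues (label 1) and
# 100 known non-analogues (label 0), five descriptors per drug.
# Real molecular descriptors would replace these random values.
n_per_class = 100
X_analogues = rng.normal(loc=1.0, scale=1.0, size=(n_per_class, 5))
X_non_analogues = rng.normal(loc=-1.0, scale=1.0, size=(n_per_class, 5))
X_labeled = np.vstack([X_analogues, X_non_analogues])
y_labeled = np.array([1] * n_per_class + [0] * n_per_class)

# Hypothetical unlabeled set: the 10,000 drugs whose status is unknown.
X_unlabeled = rng.normal(loc=0.0, scale=1.5, size=(10_000, 5))

# Scale the descriptors, then fit an SVM (kernel choice is an assumption).
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Estimate accuracy with 5-fold cross-validation on the labeled drugs only.
cv_scores = cross_val_score(model, X_labeled, y_labeled, cv=5)
print("mean cross-validated accuracy:", cv_scores.mean())

# Fit on all labeled drugs, then predict labels for the unlabeled drugs.
model.fit(X_labeled, y_labeled)
predicted = model.predict(X_unlabeled)
print("drugs predicted to be analogues:", int(predicted.sum()))
```

In this sketch the labeled set is balanced by construction, which matches the equal-numbers idea above; whether that balance is actually needed is part of what I am asking.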