Question: Confused about data sets for SVM algorithm
0
gravatar for spacegeek1212
3.3 years ago by
spacegeek121230 wrote:

I am seeking to run a binary SVM classifier on a dataset of about 10,000 small molecule drugs. I want the classifier to determine which of these drugs are structural analogues to a specific drug. Each instance, or drug, will have five attributes containing some type of molecular descriptor.

For the training set, should I run the classifier on a dataset of drugs that are already known to be structural analogues of the drug? More specifically, a dataset with an even number of known structural analogues and non-analogues? And then do I test my classifier on the 10,000 small molecule drugs which have not yet been determined to be analogues?

ADD COMMENTlink modified 16 months ago by Biostar ♦♦ 20 • written 3.3 years ago by spacegeek121230
2
gravatar for Asaf
3.3 years ago by
Asaf6.5k
Israel
Asaf6.5k wrote:

You better start with defining a question and then see which tool (or tools) is the most appropriate. You might find out that unsupervised learning (clustering) would work better for you
Having that said, the input of SVM (training set) is a table of features of the observations and a binary vector specifying the group of each observation. The number of rows in the table and the vector should be the same. You can train your classifier with this data, people usually do a "leave X out cross validation" to test the performances of the classifier. The procedure uses a part of the training set to train the classifier and test it on the part left aside.
If you are convinced that your classifier is good you can predict the results for the 10'000 small molecules you are testing.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Asaf6.5k

Thanks! So would I have to manually specify the group of each observation?

ADD REPLYlink written 3.3 years ago by spacegeek121230
1

You should have this information. The groups are: analogues and non analogues. If you don't have this data for a set of molecules then you don't have a training set.

ADD REPLYlink written 3.3 years ago by Asaf6.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1749 users visited in the last hour