I am currently building a binary classification model and have created an input file for svm-train. The input file has 453 lines, 4 features, and 2 classes (0 and 1), i.e.
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39
...
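For context, a minimal sketch of how a file in this one-line-per-sample libsvm format can be written from plain label and feature arrays (the values below are placeholders, not my real data):

# Sketch: write vectors in libsvm's sparse "label index:value ..." format.
# Feature values here are invented placeholders.
labels = [0, 1, 1]
features = [
    [15.0, 40.0, 30.0, 15.0],
    [22.73, 40.91, 36.36, 0.0],
    [31.82, 27.27, 22.73, 18.18],
]
with open("train.libsvm", "w") as f:
    for label, vec in zip(labels, features):
        pairs = " ".join(f"{i}:{v}" for i, v in enumerate(vec, start=1))
        f.write(f"{label} {pairs}\n")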
My problem is that when I count the lines in the model generated by svm-train, it contains 12 fewer data vectors than the input file. The model file has 450 lines in total, of which the first 9 are a header showing the various parameters generated, i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
Therefore 12 lines in total from the original input of 453 have gone. I am new to SVMs and was hoping that someone could shed some light on why this might have happened.
Thanks in advance
Do you know what the output file is supposed to be/contain? This depends on the software implementation you're using, which you haven't told us. My guess is that it contains the model, and hence the vectors in it are the support vectors. If so, this means that your model needs almost all of the training data to represent the training set, which suggests either that there's room for improvement or that there's not much structure in the data.
Yes, the output file is the model. However, I now believe that in generating the model, it has removed lines where the label and all the feature values are exactly the same. To explain: my input is a set of miRNAs which have been classified as 1 or 0 depending on whether or not they are involved in a particular process (1 = yes, 0 = no). The input file looks something like:
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Here, lines one and three are exactly the same and as a result will be removed from the output model. My question is then both why svm-train would do this and how I can get around it (whilst using the same features)?
In the output line you gave, I noticed this: total_sv 441, which means that your model has 441 support vectors. If your training set had 453 vectors, then 453 - 441 = 12 and there is nothing missing. I don't see why you would want the whole training set in the model. Now, if your input has duplicates, they are usually removed (but again, this is software dependent). So if the support vectors are the whole training set minus the duplicates, the model is probably not very good, or your data doesn't have enough structure to separate the two classes. Also, the formats of the input and output files depend on the software you're using.
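To make this concrete, here is a small sketch using scikit-learn's SVC, which wraps the same libsvm library underlying svm-train (the data below are random placeholders). It shows that the fitted model stores only the support vectors, yet still produces a prediction for every input vector:

import numpy as np
from sklearn.svm import SVC

# Placeholder data: 453 samples, 4 features, random labels.
rng = np.random.default_rng(0)
X = rng.random((453, 4)) * 100
y = rng.integers(0, 2, size=453)

model = SVC(kernel="rbf", gamma=1.0)
model.fit(X, y)

# The model keeps only the support vectors, typically fewer than 453.
print("support vectors kept:", model.support_vectors_.shape[0])
print("per class:", model.n_support_)

# Predictions are still made for every sample, support vector or not.
print("predictions for all 453 samples:", model.predict(X).shape)

Either way, the model file is not meant to list every training vector; svm-predict can still be run on the full input file to classify all 453 vectors.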
My apologies, I don't follow what you are saying. The training set (input file) has 453 vectors and the model (output file) has 441 support vectors (i.e. 453 - 12).
NOTE: these are not actually duplicates. Each vector in the input file is in fact a different miRNA; however, the vectors that svm-train removed when generating the output model do have identical labels and identical values for every feature. My features are the A, U, G and C percentage content of each individual miRNA.
Is there any way around the removal of these "duplicates", as I will want to classify all the miRNAs? Or are you saying that the features (i.e. A, U, G and C percentage content) are insufficient and I may need to use different features?
Now I don't understand what the problem is. Do you mean that the support vectors in the model are all identical? I think it would be more useful if you told us what software you use and showed what you've done.

As a first step, you could check whether nucleotide composition has any chance of separating the two classes. Because your data has only 4 dimensions, you could try visualizing the classes in a scatterplot matrix (e.g. the pairs() function in R) or in an MDS or PCA projection.

Another consideration is that for compositional data, a transformation (e.g. a log-ratio) may be useful. This is because you have redundant information due to the constraint that the features of each vector sum to 100%.
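For example, here is a Python sketch of both suggestions (the R route would be pairs(); the data below are random placeholders standing in for your A/U/G/C percentages). It draws a scatterplot matrix coloured by class, then applies a centred log-ratio transform to remove the sum-to-100% constraint:

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Placeholder compositional data: each row sums to 100% (A, U, G, C content).
rng = np.random.default_rng(0)
raw = rng.random((453, 4))
X = 100 * raw / raw.sum(axis=1, keepdims=True)
y = rng.integers(0, 2, size=453)  # placeholder class labels

df = pd.DataFrame(X, columns=["A", "U", "G", "C"])

# Scatterplot matrix coloured by class: the Python analogue of R's pairs().
scatter_matrix(df, c=y, diagonal="hist", figsize=(8, 8))
plt.show()

# Centred log-ratio (CLR) transform to break the sum-to-100% constraint.
# A small pseudo-count guards against zero percentages (e.g. the 4:0.0 above).
Z = np.log(X + 1e-6)
clr = Z - Z.mean(axis=1, keepdims=True)

If the two classes show no separation at all in these plots, that would support the suspicion that composition alone is not informative enough and different features are needed.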