LIBSVM: Get support vectors from model file - machine-learning

This may be a weird request, so some explanation first. I recently had a sudden HD crash and lost a data file I was using to generate model files with libSVM. I do have the SVM model and scaling file that I generated from this data file, and I was wondering if there is a way to generate a data file from the support vectors in the model file, something like model_sv_to_instances(model, &instances), since the process for obtaining instances is very costly. (I know it won't be the same as the original, but it's still better than nothing.) I'm using a probabilistic SVM with an RBF kernel.

If you open a given model file in any text editor, you will find something like this:
svm_type c_svc
kernel_type sigmoid
gamma 0.5
coef0 0
nr_class 2
total_sv 4
rho 0
label 0 1
nr_sv 2 2
SV
1 1:0 2:0
1 1:1 2:1
-1 1:1 2:0
-1 1:0 2:1
The interesting part for you comes after the line containing SV:
1 1:0 2:0
1 1:1 2:1
-1 1:1 2:0
-1 1:0 2:1
Those are the data points that were selected as support vectors, so you just have to parse the file (a small parsing sketch follows the example below). The format is as follows:
[label] [index1]:[value1] [index2]:[value2] ... [indexN]:[valueN]
For instance, from my example you can conclude that my training set was:
x y desired val
0 0 -1
0 1 1
1 0 1
1 1 -1
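If it helps, here is a minimal sketch (plain Python) of how the SV section could be parsed back into a libsvm-format data file. It assumes a two-class c_svc model like the one above, and uses the usual libsvm convention that a positive coefficient corresponds to the first label on the "label" line; model_sv_to_instances and the file names are just the hypothetical names from the question.

def model_sv_to_instances(model_path, data_path):
    with open(model_path) as f:
        lines = f.read().splitlines()

    sv_start = lines.index("SV") + 1                 # support vectors follow the SV marker
    label_line = next(l for l in lines[:sv_start] if l.startswith("label"))
    labels = label_line.split()[1:]                  # e.g. ["0", "1"]

    with open(data_path, "w") as out:
        for line in lines[sv_start:]:
            if not line.strip():
                continue
            parts = line.split()
            coef = float(parts[0])                   # alpha_i * y_i for a two-class c_svc
            label = labels[0] if coef > 0 else labels[1]
            out.write(label + " " + " ".join(parts[1:]) + "\n")

model_sv_to_instances("old.model", "recovered_data.txt")

For multi-class models the coefficient columns (and the label mapping) are more involved, so treat this as a starting point rather than a general converter.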
A few considerations and warnings. The ratio between the number of SVs and the number of data points depends on the parameters that you used; in some cases the reduction is large and you will recover very few SVs in comparison with your original data.
Another thing to keep in mind is that this reduction is likely to change the problem: if you train again with only the SVs as data points, you will probably get a completely different model with a completely different set of parameters.
Good luck!

In the case of RBF you are lucky. According to the libsvm FAQ you can extract the support vectors from the model file:
In the model file, after parameters and other information such as labels, each line represents a support vector.
But remember, these are only the support vectors, which are only a fraction of your original input data.

To the best of my knowledge, SVM models in general, and libSVM models in particular, consist of only the support vectors. These vectors represent the borderline between the classes; most probably, they don't represent the vast majority of your data points. So, unfortunately, I don't think there's a way to regenerate your data from the model.
Having said that, I can think of an esoteric case where there might be some value to the model: there are companies specializing in recovering data in such cases (e.g. from crashed HDs). The recovered data sometimes has gaps, and in certain cases the model might be reverse-engineered to fill in some of the missing spots. However, this is very theoretical.
EDIT: as the other answers state, the proportion of data points represented by the support vectors might vary, depending on the specific problem and parameters. However, as stated above, in most common cases, you'll be able to reconstruct only a small fraction of your original data set.

Related

Should I normalize my features before throwing them into RNN?

I am playing some demos about recurrent neural network.
I noticed that the scale of my data differs a lot from column to column, so I am considering doing some preprocessing before I feed data batches into my RNN. The close column is the target I want to predict in the future.
open high low volume price_change p_change ma5 ma10 \
0 20.64 20.64 20.37 163623.62 -0.08 -0.39 20.772 20.721
1 20.92 20.92 20.60 218505.95 -0.30 -1.43 20.780 20.718
2 21.00 21.15 20.72 269101.41 -0.08 -0.38 20.812 20.755
3 20.70 21.57 20.70 645855.38 0.32 1.55 20.782 20.788
4 20.60 20.70 20.20 458860.16 0.10 0.48 20.694 20.806
ma20 v_ma5 v_ma10 v_ma20 close
0 20.954 351189.30 388345.91 394078.37 20.56
1 20.990 373384.46 403747.59 411728.38 20.64
2 21.022 392464.55 405000.55 426124.42 20.94
3 21.054 445386.85 403945.59 473166.37 21.02
4 21.038 486615.13 378825.52 461835.35 20.70
My question is: is preprocessing the data with, say, StandardScaler from sklearn necessary in my case? And why?
(You are welcome to edit my question)
It will be beneficial to normalize your training data. Feeding features with widely different scales into your model will cause the network to weight the features unequally, which can lead to a false prioritisation of some features over others in the learned representation.
Although the discussion on data preprocessing is somewhat controversial, both on when exactly it is necessary and on how to correctly normalize the data for each given model and application domain, there is a general consensus in machine learning that a mean-subtraction step followed by a general normalization step is helpful.
In the case of mean subtraction, the mean of every individual feature is subtracted from the data, which can be interpreted geometrically as centering the data around the origin. This applies to every dimension.
Normalizing the data after the mean-subtraction step brings every data dimension to approximately the same scale. Note that, as mentioned above, the different features lose any prioritization over each other after this step. If you have good reason to think that the different scales in your features carry information that the network needs in order to capture the underlying patterns in your dataset, then normalization will be harmful. A standard approach is to scale the inputs to have a mean of 0 and a variance of 1.
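As a concrete illustration, here is a short sketch of that mean-0/variance-1 scaling with scikit-learn's StandardScaler; the toy matrix below is made up and only mimics the price-vs-volume scale difference in your data.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two columns on very different scales (e.g. price vs. volume).
X = np.array([[20.64, 163623.62],
              [20.92, 218505.95],
              [21.00, 269101.41]])

# Fit the scaler on the training data only, then reuse it for validation/test batches.
scaler = StandardScaler()          # subtracts the per-column mean, divides by the std
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))       # ~0 for every column
print(X_scaled.std(axis=0))        # ~1 for every column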
Further preprocessing operations may be helpful in specific cases, such as performing PCA or whitening on your data. Look into the excellent CS231n notes ("Setting up the data and the model") for further reference on these topics as well as for a more detailed explanation of the points above.
Definitely yes. Most neural networks work best with data between 0 and 1 or between -1 and 1 (depending on the output function). Also, when some inputs are larger than others, the network will "think" they are more important, which can make learning very slow: the network first has to lower the weights on those inputs.
I found this: https://arxiv.org/abs/1510.01378
If you normalize, it may improve convergence, so you will get shorter training times.

understanding "Deep MNIST for Experts"

I am trying to understand Deep MNIST for Experts. I have a fairly clear idea of how neural networks and deep learning work at a high level, but I struggle with the details.
In the tutorial they first write and run a simple one-layer model. This includes defining the model x*W+b, calculating the cross-entropy, minimizing it by gradient descent, and evaluating the result.
The first part I found quite easy to run and understand.
In the second part they build a simple multi-layer network and apply some convolutions and pooling. However, here things start to get tricky. They write:
We can now implement our first layer. It will consist of convolution, followed by max pooling. The convolution will compute 32 features for each 5x5 patch.
A 5x5 patch should equal 25 pixels, right? Why would you extract 32 features from 25 pixels? Why do you want more features than you have data points? How does this even make sense? It feels like they are "upscaling" a problem from 25 dimensions to 32 dimensions, and that 7 of the 32 dimensions should be redundant.
Secondly, the convolution uses the function truncated_normal, which just picks random values close to the mean. Why is this a good model for modelling handwritten numbers?
Thirdly, the second layer in the network seems to do the same thing again. Are more layers just better? Could I have achieved the same results with a single layer?
I think a visual model can greatly reduce the difficulty of understanding, so perhaps this can help you understand better:
http://scs.ryerson.ca/~aharley/vis/conv/
This is a 3D visualization of a convolutional neural network. It has two convolution layers, each followed by a max-pooling layer, and you can click any 3D cube in each layer to check its value.
In general, you have to read a lot about CNNs/NNs before trying to understand what is really going on. These examples are not really meant to be an introductory course on NNs; they assume you already know what CNNs are.
A 5x5 patch should equal 25 pixels, right? Why would you extract 32 features from 25 pixels? Why do you want more features than you have data points? How does this even make sense? It feels like they are "upscaling" a problem from 25 dimensions to 32 dimensions, and that 7 of the 32 dimensions should be redundant.
This is a completely different "level of abstraction"; you are comparing unrelated objects to each other, which is obviously confusing. They are creating 32 filters, each of which linearly maps the whole image by moving a 5x5 convolution kernel across it. For example, one such filter could be an edge detector:
0 0 0 0 0
0 0 0 0 0
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
another can detect diagonal lines
1 0 0 0 0
1 1 0 0 0
1 1 1 0 0
1 1 1 1 0
1 1 1 1 1
etc. Why 32? It is just a magic number, found empirically. This is actually quite a small number in terms of CNNs (notice that just to detect basic edges in greyscale images you already need 8 different filters!).
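To make the "filter sliding over the image" idea concrete, here is a small NumPy-only sketch (the toy image is made up) that applies one 5x5 kernel, like the edge detector above, to every 5x5 patch of an image; a convolutional layer with 32 features simply learns 32 such kernels in parallel.

import numpy as np

def conv2d_valid(image, kernel):
    # Slide a k x k kernel over the image (no padding, stride 1).
    k = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# The horizontal-edge detector from the example above.
edge_kernel = np.array([[0, 0, 0, 0, 0],
                        [0, 0, 0, 0, 0],
                        [1, 1, 1, 1, 1],
                        [1, 1, 1, 1, 1],
                        [1, 1, 1, 1, 1]], dtype=float)

# Toy 8x8 "image": dark on top, bright at the bottom, i.e. a horizontal edge.
image = np.zeros((8, 8))
image[4:, :] = 1.0

response = conv2d_valid(image, edge_kernel)
print(response)   # strongest responses where the kernel overlaps the bright region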
Secondly, the convolution uses the function truncated_normal, which just picks random values close to the mean. Why is this a good model for modelling handwritten numbers?
This is the initializer of the weights. It is not a "model for modelling handwritten numbers"; it is simply a starting point for the optimization of this part of the parameter space. Why a normal distribution? We have some mathematical intuition for how to initialize neural nets, especially assuming ReLU activations. It is important to initialize randomly, which ensures that many of your neurons will be initially active, so you do not get zero derivatives (and thus no ability to learn with typical optimizers).
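Here is a tiny sketch of that idea in plain NumPy (not the tutorial's exact TensorFlow code): draw small random weights from a normal distribution and re-draw anything further than two standard deviations from the mean, which is what "truncated normal" means here. The [5, 5, 1, 32] shape mirrors the first-layer weights in the tutorial.

import numpy as np

def truncated_normal(shape, stddev=0.1, rng=np.random.default_rng(0)):
    # Sample N(0, stddev^2) and resample any value outside +/- 2*stddev.
    w = rng.normal(0.0, stddev, size=shape)
    bad = np.abs(w) > 2 * stddev
    while bad.any():
        w[bad] = rng.normal(0.0, stddev, size=bad.sum())
        bad = np.abs(w) > 2 * stddev
    return w

# Weights for 32 filters of size 5x5 on a single-channel (grayscale) input.
W_conv1 = truncated_normal((5, 5, 1, 32))
print(W_conv1.shape, W_conv1.std())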
Thirdly, the second layer in the network seems to do the same thing again. Are more layers just better? Could I have achieved the same results with a single layer?
In principle you can model everything with a single-hidden-layer feed-forward net, even without convolutions. However, it might require exponentially many more hidden units, and perfect optimization strategies which we do not have (and which may not even exist!). The depth of the network gives you the ability to express more complex (and in some cases more useful) features with fewer parameters, plus we know more or less how to optimize it. However, you should avoid the common pitfall of assuming that "deeper is better". This is not true in general; it is true when the important features of your data can be efficiently expressed as a hierarchical structure of abstractions. This holds for images (more and more complex patterns: first edges, then lines and curves, then patches, then more complex concepts, etc.) as well as for text, sound, etc., but before you try to apply DL to a new task you should ask yourself whether this is (or at least might be) true for your case. Using a model that is too complex is usually far worse than using one that is too simple.

libsvm not giving support vectors / no support vectors

I am using jlibsvm to do SVM for regression. My data set is very small (42 samples). When I use this dataset to create the model using epsilon-SVR with a sigmoid kernel, no support vectors are generated.
This is what I get in my model file :
svm_type epsilon_svr
kernel_type sigmoid
gamma 0.02380952425301075
coef0 0.0
label
rho -66.42803
total_sv 0
probA -1.0
SV
When I use some other data set from the libsvm website, I get a model file with support vectors just fine.
Can someone please suggest why no support vectors are being generated for my data set?
My data set file is formatted correctly, so no issues there.
This could mean that the best fit found, given your data and the hyperparameters, is to predict the same output for all samples.
Are your samples unbalanced? What is the number of positive and negative samples? You might want to try adding a weighting to positive/negative samples to account for that.
It could also be that the samples are hard to separate given their structure and the kernel type. Have you tried a different kernel?
With only 42 data samples, maybe you could add them to your question and get better answers.
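If it helps the debugging, here is a hedged sketch (using scikit-learn's SVR rather than jlibsvm, with made-up data) of how you could quickly compare kernels and see how many support vectors each one keeps:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 6))                       # 42 samples, as in the question
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=42)

for kernel in ("sigmoid", "rbf", "linear"):
    model = SVR(kernel=kernel, epsilon=0.1).fit(X, y)
    print(kernel, "support vectors:", len(model.support_vectors_))

If one kernel keeps zero (or all) support vectors, the epsilon tube or the kernel parameters probably do not match the scale of your targets and features, so scaling the data and tuning epsilon/gamma is worth trying.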

representing data for usage with libsvm (sparse or not)

I'd like to do some data mining on function stack traces, and for this I am using libsvm,
representing the data in a sparse format for speed of processing. Each stack trace is an instance and the variables are functions, e.g.:
class1 F1,F2,F1,F456,F3
class2 F4,F4,F4,F56,F3000
...
Somewhere I have an ever-growing registry of seen unique functions; this is where the function indexes come from. Ideally, I'd like to represent the aforementioned instances in a sparse format, partitioned into 5 variables like this:
1 1:1 2:1 1:1 456:1 3:1
2 4:1 4:1 4:1 56:1 3000:1
This is not possible in libsvm's format, so I am adding the size of the total function registry as an offset for each position to avoid index clashes. If we suppose there are 3000 functions in total, this is how the first instance looks now:
1 1:1 3002:1 6001:1 9456:1 12003:1
This works as long as the number of functions doesn't change, but that is not the case: new functions are added all the time, so I would have to redo the whole thing.
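For reference, here is a small sketch of that offset scheme in plain Python (the toy registry and function names are invented), which also shows why every index has to be recomputed whenever the registry grows:

# Map each (position, function) pair to a unique libsvm feature index by giving
# every position its own block of size registry_size.
registry = {"F1": 1, "F2": 2, "F3": 3, "F4": 4, "F56": 56, "F456": 456}  # toy registry
registry_size = 3000          # assume room for 3000 functions, as in the example

def to_libsvm_line(label, trace):
    feats = []
    for position, func in enumerate(trace):                 # position 0, 1, 2, ...
        index = position * registry_size + registry[func]   # block offset + function id
        feats.append(f"{index}:1")
    return f"{label} " + " ".join(feats)

print(to_libsvm_line(1, ["F1", "F2", "F1", "F456", "F3"]))
# -> 1 1:1 3002:1 6001:1 9456:1 12003:1
# If registry_size changes, every previously generated index changes too.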
I am using a sparse format, but suggestions on other formats are welcome too. I am able to use the data with Weka in a dense format, using the function names as variables, and it works, just much slower than with libsvm.
thanks!
You have a few options:
a) redo the whole thing each time (I think generating the inputs for libsvm is faster than running libsvm itself :))
b) use even numbers for the first thing and odd numbers for the other thing. So your example would look like:
1 2:1 3:1 1:1 911:1 5:1
This avoids collisions and you don't have to redo the whole thing :)

svm scaling input values

I am using libSVM.
Say my feature values are in the following format:
instance1 : f11, f12, f13, f14
instance2 : f21, f22, f23, f24
instance3 : f31, f32, f33, f34
instance4 : f41, f42, f43, f44
..............................
instanceN : fN1, fN2, fN3, fN4
I think there are two kinds of scaling that can be applied:
1. Scale each instance (row) vector so that it has zero mean and unit variance:
((f11, f12, f13, f14) - mean((f11, f12, f13, f14))) ./ std((f11, f12, f13, f14))
2. Scale each column of the above matrix to a range, for example [-1, 1].
In my experiments with the RBF kernel (libSVM), I found that the second scaling (2) improves the results by about 10%. I did not understand why (2) gives me improved results.
Could anybody explain to me the reason for applying scaling, and why the second option gives improved results?
The standard thing to do is to make each dimension (or attribute, or column (in your example)) have zero mean and unit variance.
This brings each dimension of the SVM input to the same magnitude. From http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf:
The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems. We recommend linearly scaling each attribute to the range [-1,+1] or [0,1].
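For reference, a short sketch (with scikit-learn and made-up data) of per-column standardization versus linear scaling of each column to [-1, 1], which is the second option in the question:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0,   100.0],
              [2.0,   500.0],
              [3.0,  2500.0],
              [4.0, -2000.0]])

# (1) zero mean / unit variance per column
X_std = StandardScaler().fit_transform(X)

# (2) linear scaling of each column to the range [-1, 1]
X_minmax = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(X_std)
print(X_minmax)

In both cases, fit the scaling parameters on the training set only and apply the same transform to the test set (svm-scale does this via its save/restore options).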
I believe that it comes down to your original data a lot.
If your original data has SOME extreme values for some columns, then in my opinion you lose some definition when scaling linearly, for example in the range [-1,1].
Let's say that you have a column where 90% of the values are between 100 and 500, and in the remaining 10% the values go as low as -2000 and as high as +2500.
If you scale this data linearly, then you'll have:
-2000 -> -1 ## <- The min in your scaled data
+2500 -> +1 ## <- The max in your scaled data
100 -> -0.06666666666666665
234 -> -0.007111111111111068
500 -> 0.11111111111111116
You could argue that the discernibility between what was originally 100 and 500 is smaller in the scaled data in comparison to what it was in the original data.
In the end, I believe it very much comes down to the specifics of your data, and I believe the 10% performance improvement is largely coincidental; you will certainly not see a difference of this magnitude in every dataset you try both scaling methods on.
At the same time, in the paper linked in the other answer, you can clearly see that the authors recommend scaling the data linearly.
I hope someone finds this useful!
The accepted answer speaks of "standard scaling", which is not efficient for high-dimensional data stored in sparse matrices (text data is a typical use case); in such cases, you may resort to "max scaling" and its variants, which work with sparse matrices.
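For example, here is a hedged sketch of that idea with scikit-learn (MaxAbsScaler is one implementation of "max scaling" that preserves sparsity; the data below is made up):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# A sparse document-term style matrix: mostly zeros, a few large counts.
X = csr_matrix(np.array([[0, 3, 0, 120],
                         [5, 0, 0,   0],
                         [0, 1, 8,  40]], dtype=float))

# MaxAbsScaler divides each column by its maximum absolute value, so zeros stay
# zeros and the matrix stays sparse; standard scaling would subtract the mean
# and therefore densify the data.
X_scaled = MaxAbsScaler().fit_transform(X)
print(X_scaled.toarray())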

Resources