SVM parameter tuning - opencv

I am a newbie to using SVM for classification. I want to tune the SVM parameters with the TrainAuto function in EmguCV, but I don't know what range (min-max values) of the parameters below I should give this function to search:
1- C (for poly and RBF kernels)
2- Gamma (for poly and RBF kernels)
3- Coefficient (for poly kernel)
4- Degree (for poly kernel)
What are the ranges of these parameters?
Do these ranges depend on the number of samples?
Since I get an out-of-memory allocation error, can I set the step parameter to a large value and, once I have found approximately optimal values, narrow the ranges and search again with a smaller step?

In general, you should consider the RBF kernel first, unless you have a good reason not to. Here are some guides on how to choose the parameters C and gamma:
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
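Purely as an illustration of the coarse-to-fine grid search those guides describe - using scikit-learn here rather than EmguCV's TrainAuto, so the API is my assumption, not the one you asked about - the usual starting grids from the libsvm guide are exponentially spaced (C in 2^-5..2^15, gamma in 2^-15..2^3) and do not depend on the number of samples:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data just to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Coarse, exponentially spaced grid (the ranges suggested in the libsvm guide).
coarse = {
    "svc__C": 2.0 ** np.arange(-5, 16, 2),
    "svc__gamma": 2.0 ** np.arange(-15, 4, 2),
}
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
search = GridSearchCV(pipe, coarse, cv=5).fit(X, y)
print("coarse best:", search.best_params_)

# Fine grid around the coarse optimum, with a smaller step.
best_C = np.log2(search.best_params_["svc__C"])
best_g = np.log2(search.best_params_["svc__gamma"])
fine = {
    "svc__C": 2.0 ** np.arange(best_C - 2, best_C + 2.25, 0.25),
    "svc__gamma": 2.0 ** np.arange(best_g - 2, best_g + 2.25, 0.25),
}
search = GridSearchCV(pipe, fine, cv=5).fit(X, y)
print("fine best:", search.best_params_)
```

The two-stage search is exactly the "large step first, then a smaller range with a smaller step" idea from the question; it also keeps memory use down because each grid stays small.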

Related

OpenCV4Android SVM is not giving the correct prediction

I am new to machine learning and OpenCV. I have taken a set of 10 images for each emotion (neutral and happy) from the Cohn-Kanade face database. Then I extracted the facial features from each image, put them in my trainingData matrix, and assigned the label for the respective emotion (example: 0 for neutral and 1 for happy).
I have used the RBF kernel with gamma = 0.1 and C = 1. Once trained, I pass the facial features extracted from live smartphone camera frames for prediction. The prediction always returns 1.
If I increase the number of training samples for the neutral expression (example: 15 neutral images and 10 happy images), then the prediction always returns 0, and if there is an equal number of images for each expression in the training samples, then the SVM prediction always returns 1.
Why is the SVM behaving this way? How can I check whether I am using the right values for gamma and C? Also, does the SVM depend on the resolution of the training and testing images?
I would ask you to post the SVM function so we can understand your code. Secondly, I have used SVMs before, and you need to normalize the training data and the labels. You should also make sure you are using the correct classifier, as not all classifiers are supported. Follow this link for some tutorials: http://docs.opencv.org/3.0-beta/modules/ml/doc/support_vector_machines.html
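A minimal sketch of the normalization step, using scikit-learn purely for brevity (the feature matrices below are hypothetical stand-ins for your extracted facial features; OpenCV's ml module has its own training API):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for the extracted facial features:
# one row per training image, one column per feature.
train_features = np.random.rand(20, 64).astype(np.float32)
labels = np.array([0] * 10 + [1] * 10)   # 0 = neutral, 1 = happy

# Scale features to zero mean / unit variance; SVMs are sensitive to feature scale.
scaler = StandardScaler().fit(train_features)
clf = SVC(kernel="rbf", gamma=0.1, C=1.0).fit(scaler.transform(train_features), labels)

# The *same* scaler must be applied to features from the live camera frames.
test_features = np.random.rand(1, 64).astype(np.float32)
print(clf.predict(scaler.transform(test_features)))
```

Forgetting to apply the identical scaling to the live-camera features is a common reason a trained model keeps predicting a single class.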
As for your other questions: unfortunately you have to find the best combination of gamma and C yourself, which is somewhat of a drawback of SVMs. https://www.quora.com/What-are-C-and-gamma-with-regards-to-a-support-vector-machine
Yes, the SVM does depend on the resolution, as your features/feature vectors change with the resolution, and hence so do the inputs to the classifier.
P.S. This should ideally be in the comments, but unfortunately I don't have enough points to do that.

Support Vector Machine: Feature Transformation

How do I apply the transformation to the test data when I have the trained SVM model in hand? I am trying to simulate the SVM output from the mathematical equations and the trained SVM model (using the RBF kernel). How do I do that?
In SVM, some of the common kernels used are the linear kernel K(x_i, x_j) = x_i^T x_j, the polynomial kernel K(x_i, x_j) = (gamma * x_i^T x_j + r)^d, and the RBF kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
Here x_i and x_j represent two samples. Now if the data has, say, 5 samples, does this transformation include all combinations of two samples to generate the transformed feature space, like x1 and x1, x1 and x2, x1 and x3, ..., x4 and x5, x5 and x5?
If the data has two features, then a polynomial transformation of order 2 transforms the input to 3 dimensions, as explained here on slide 15: http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf
Now how can we find a similar explanation for the transformation using the RBF kernel? I am trying to write code for transforming the test data so that I can apply the trained SVM model to it.
This is way more complex than that. In short: you do not map your data directly into the feature space. You simply change the dot product to the one induced by the kernel. What happens "inside" the SVM when you work with a polynomial kernel is that each point is (indirectly) transformed into an O(d^p)-dimensional space (where d is the input data dimension and p the degree of the polynomial kernel). From a mathematical perspective you work with some (often unknown) projection phi_K(x) which has the property that K(x, y) = <phi_K(x), phi_K(y)>, and nothing more. In SVM implementations you do not need the actual data representation (as phi_K(x) is usually huge, sometimes even infinite, as in the RBF case); instead the SVM needs the vector of dot products of your point with each element of the training set.
Thus what you do (in implementations, not from a math perspective) is provide:
During training: the whole Gram matrix G, defined as G_ij = K(x_i, x_j), where x_i is the i-th training sample.
During testing: when you get a new point y, you provide it to the SVM as a vector of dot products H such that H_i = K(y, x_i), where again the x_i are your training points (in fact you only need the values for the support vectors, but many implementations, like libsvm, actually require a vector of the size of the training set - you can simply put 0 for K(y, x_j) if x_j is not a support vector).
Just remember that this is not the same as training a linear SVM "on top" of the above representation. This is just the way implementations usually accept your data, as they need a definition of the dot product (a function), and it is often easier to pass numbers than functions (though some of them, like scikit-learn's SVC module, actually accept a function as the kernel parameter).
So what is the RBF kernel? It is actually a mapping from points to a function space of normal distributions with means at your training points. The dot product is then just the integral from -inf to +inf of the product of two such functions. Sounds complex? It is at first sight, but it is a really nice trick, worth understanding!
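A minimal sketch of the above, assuming scikit-learn's SVC with kernel='precomputed' (libsvm behaves the same way underneath): pass the Gram matrix at training time and the vector of kernel values against the training points at test time.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 4))
y_train = (X_train[:, 0] * X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(5, 4))

gamma = 0.5

# Training: supply the full Gram matrix G_ij = K(x_i, x_j).
G = rbf_kernel(X_train, X_train, gamma=gamma)
clf = SVC(kernel="precomputed", C=1.0).fit(G, y_train)

# Testing: supply H_i = K(y, x_i) for each training point x_i.
H = rbf_kernel(X_test, X_train, gamma=gamma)
print(clf.predict(H))

# Same predictions as letting SVC compute the RBF kernel itself.
clf2 = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_train, y_train)
print(clf2.predict(X_test))
```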

Support Vector Machines understanding

Recently, I have been going through lectures and texts, trying to understand how SVMs enable us to work in a higher-dimensional space.
In normal logistic regression, we use the features as they are, but in SVMs we use a mapping which helps us attain a non-linear decision boundary.
Normally we work directly with features, but with the help of the kernel trick we can find relations in the data using squares of the features, products between them, etc. Is this correct?
We do this with the help of a kernel.
Now, I understand that a polynomial kernel corresponds to a known feature vector, but I am unable to understand what the Gaussian kernel corresponds to (I am told an infinite-dimensional feature vector, but what?).
Also, I am unable to grasp the concept that the kernel is a measure of similarity between training examples. How is this part of the SVM's working?
I have spent a lot of time trying to understand these, but in vain. Any help would be much appreciated!
Thanks in advance :)
Normally we work directly with features, but with the help of the kernel trick we can find relations in the data using squares of the features, products between them, etc. Is this correct?
Even when using a kernel you still work with features; you can simply exploit more complex relations between them. As in your example: a polynomial kernel gives you access to low-degree polynomial relations between features (such as squares or products of features).
Now, I understand that a polynomial kernel corresponds to a known feature vector, but I am unable to understand what the Gaussian kernel corresponds to (I am told an infinite-dimensional feature vector, but what?).
The Gaussian kernel maps your feature vector to an unnormalized Gaussian probability density function. In other words, you map each point onto a space of functions, where your point is now a Gaussian centered at this point (with variance corresponding to the hyperparameter gamma of the Gaussian kernel). A kernel is always a dot product between vectors; in particular, in the function space L2 we define the classic dot product as an integral over the product, so
<f,g> = integral (f*g) (x) dx
where f,g are Gaussian distributions.
Luckily, for two Gaussian densities the integral of their product is itself a Gaussian function (of the distance between their means), which is why the Gaussian kernel looks so similar to the pdf of the Gaussian distribution.
Also, I am unable to grasp the concept that the kernel is a measure of similarity between training examples. How is this part of the SVM's working?
As mentioned before, a kernel is a dot product, and a dot product can be seen as a measure of similarity (it is maximized when two vectors point in the same direction). However, it does not work the other way around: you cannot use every similarity measure as a kernel, because not every similarity measure is a valid dot product.
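A tiny sketch of the similarity view, assuming the usual form K(x, y) = exp(-gamma * ||x - y||^2): the kernel value is 1 when the two points coincide and decays toward 0 as they move apart.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Gaussian/RBF kernel: maximal (1.0) for identical points, decaying with distance."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
print(rbf(x, x))                        # 1.0   (identical points)
print(rbf(x, np.array([1.5, 2.5])))     # ~0.78 (nearby point)
print(rbf(x, np.array([5.0, -3.0])))    # ~0.0  (distant point)
```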
Just a bit of introduction about SVMs before I start answering the question; this will help you get an overview. The SVM's task is to find the margin-maximizing hyperplane that best separates the data. We have the soft-margin formulation of the SVM, also known as the primal form, and its equivalent dual form. The dual form of the SVM makes use of the kernel trick.
The kernel trick partially replaces feature engineering (FE), which is the most important step in machine learning when we have datasets that are not linear (e.g. datasets in the shape of concentric circles).
Now you can transform this dataset from non-linear to linearly separable by either FE or the kernel trick. With FE you can square each of the features in this dataset; it then becomes a linearly separable dataset, and you can apply techniques like logistic regression, which work best for linear data.
With the kernel trick you can use the polynomial kernel, whose general form is (a + x_i^T x_j)^d, where a and d are constants and d specifies the degree (if the degree is 2 we call it quadratic, and so on). Now let's say we apply d = 2, so our equation becomes (a + x_i^T x_j)^2. Suppose we have 2 features in our original dataset (e.g. for x_1 the vector is [x_11, x_12] and for x_2 the vector is [x_21, x_22]); when we apply the polynomial kernel, the implicit feature vectors are 6-dimensional. So we have transformed the features from 2-d to 6-d. Intuitively, the higher the dimension of your data, the better the SVM works, because it will eventually transform the features to an even higher-dimensional space. In fact, high dimensionality is the best case for an SVM.
Now you can see that both the kernel trick and feature engineering transform the (concentric-circle) dataset; the difference is that we do FE explicitly, whereas the kernel trick comes implicitly with the SVM (see the sketch below). There is also a general-purpose kernel known as the radial basis function (RBF) kernel, which can be used when you don't know a suitable kernel in advance.
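To make the FE-versus-kernel comparison concrete, here is a rough sketch on a concentric-circles toy dataset (scikit-learn and the specific parameters are my own choices): squaring the features lets a linear model separate the classes, while an RBF-kernel SVM handles the raw features directly.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the raw features.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

# Raw features with a linear model: close to chance.
print("linear on raw features:   ", LogisticRegression().fit(X, y).score(X, y))

# Explicit feature engineering: square the features, the circles become separable.
X_sq = X ** 2
print("linear on squared features:", LogisticRegression().fit(X_sq, y).score(X_sq, y))

# Kernel trick: the RBF SVM does the non-linear mapping implicitly.
print("RBF SVM on raw features:  ", SVC(kernel="rbf", gamma=1.0).fit(X, y).score(X, y))
```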
The RBF kernel has a parameter (sigma); as a function of the distance between two points, the kernel value follows a Gaussian-shaped curve whose width is controlled by sigma.
You can think of the kernel simply as a similarity measure: the smaller the distance between two points, the higher the similarity.

why the number of the support vector is zero?

I use the libsvm tool to do the classification, but the number of support vectors turns out to be zero, which means there are no support vectors. Why? What does it mean and what are the main reasons?
It means that the parameters and data used for training led to a trivial model (one that always predicts the same class). So what are the most probable reasons?
A bad choice of SVM parameters (C, gamma, degree, depending on the kernel used)
Wrong data (inconsistent; check what exactly you are feeding in as the X and Y matrices)

Query about SVM mapping of input vector? And SVM optimization equation

I have read through a lot of papers and understand the basic concept of a support vector machine at a very high level. You give it a training input vector which has a set of features, and based on how the "optimization function" evaluates this input vector (let's call it x, and let's say we're talking about text classification), the text associated with the input vector x is classified into one of two pre-defined classes. This is only in the case of binary classification.
So my first question is about the procedure described above: all the papers first say that this training input vector x is mapped to a higher- (maybe infinite-) dimensional space. What does this mapping achieve, and why is it required? Let's say the input vector x has 5 features; who decides which "higher dimension" x is going to be mapped to?
Second question is about the following optimization equation:
min (1/2) w^T w + C * Σ_{i=1..n} ξ_i
So I understand that w has something to do with the margin of the hyperplane from the support vectors, and I know that C is some sort of penalty, but I don't know what it is a penalty for. Also, what does ξ_i represent in this case?
A simple explanation of the second question would be much appreciated as I have not had much luck understanding it by reading technical papers.
When they talk about mapping to a higher-dimensional space, they mean that the kernel accomplishes the same thing as mapping the points to a higher-dimensional space and then taking dot products there. SVMs are fundamentally linear classifiers, but if you use kernels, they're linear in a space that's different from the original data space.
To be concrete, let's talk about the kernel
K(x, y) = (xy + 1)^2 = (xy)^2 + 2xy + 1,
where x and y are each real numbers (one-dimensional). Note that
(x^2, sqrt(2) x, 1) • (y^2, sqrt(2) y, 1) = x^2 y^2 + 2 x y + 1
has the same value. So K(x, y) = phi(x) • phi(y), where phi(a) = (a^2, sqrt(2) a, 1), and doing an SVM with this kernel (the inhomogeneous polynomial kernel of degree 2) is the same as if you first mapped your 1d points into this 3d space and then used a linear kernel.
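A quick numerical check of that identity (just a sketch; any 1-d values of x and y will do):

```python
import numpy as np

def K(x, y):
    # Inhomogeneous polynomial kernel of degree 2 on 1-d inputs.
    return (x * y + 1) ** 2

def phi(a):
    # Explicit feature map into the 3-d space.
    return np.array([a ** 2, np.sqrt(2) * a, 1.0])

x, y = 1.7, -0.3
print(K(x, y))           # 0.2401
print(phi(x) @ phi(y))   # 0.2401 -- identical, so the kernel really is a dot product
```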
The popular Gaussian RBF kernel function is equivalent to mapping your points into an infinite-dimensional Hilbert space.
You're the one who decides what feature space it's mapped into, when you pick a kernel. You don't necessarily need to think about the explicit mapping when you do that, though, and it's important to note that the data is never actually transformed into that high-dimensional space explicitly - then infinite-dimensional points would be hard to represent. :)
The ξ_i are the "slack variables". Without them, SVMs would never be able to account for training sets that aren't linearly separable -- which most real-world datasets aren't. The ξ_i in some sense measure how far data points on the wrong side of the margin would have to be pushed to reach the correct side. C is a parameter that determines how much it costs you to increase the ξ_i (that's why it's multiplied there).
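A rough illustration of how C trades margin width against slack, on noisy, non-separable toy data (scikit-learn and all parameter values here are my own choices): a small C tolerates many margin violations, while a large C penalizes the ξ_i heavily and fits the training set more tightly.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Noisy, overlapping classes: not linearly separable.
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.1, class_sep=0.8,
                           random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=1.0, C=C).fit(X, y)
    print(f"C={C:>6}: train accuracy={clf.score(X, y):.2f}, "
          f"support vectors={clf.n_support_.sum()}")
```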
1) The mapping to the higher-dimensional space happens through the kernel mechanism. However, when evaluating a test sample, the higher-dimensional space need not be explicitly computed. (Clearly this must be the case, because we cannot represent infinite dimensions on a computer.) For instance, radial basis function kernels imply infinite-dimensional spaces, yet we don't need to map into this infinite-dimensional space explicitly. We only need to compute K(x_sv, x_test), where x_sv is one of the support vectors and x_test is the test sample (a sketch appears at the end of this answer).
The specific higher-dimensional space is determined by the choice of kernel and its parameters; the training procedure then chooses a set of support vectors and their corresponding weights.
2) C is the weight associated with the cost of not being able to classify the training set perfectly. The optimization equation says to trade off between the two undesirable cases of imperfect classification and a small margin. The ξ_i variables represent by how much we are unable to classify instance i of the training set correctly, i.e., the training error of instance i.
See Chris Burges' tutorial on SVM's for about the most intuitive explanation you're going to get of this stuff anywhere (IMO).
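As a sketch of point 1) above (and of the earlier question about simulating the output of a trained RBF model), assuming scikit-learn's SVC as the trained model: the decision value for a test sample is a weighted sum of K(x_sv, x_test) over the support vectors plus a bias.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
gamma = 0.2
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_test = X[0]

# Manual evaluation: sum over support vectors of (dual coefficient) * K(x_sv, x_test) + b.
k_vals = np.exp(-gamma * np.sum((clf.support_vectors_ - x_test) ** 2, axis=1))
manual = clf.dual_coef_[0] @ k_vals + clf.intercept_[0]

print(manual)                              # same value as ...
print(clf.decision_function([x_test])[0])  # ... the library's decision function
```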

Resources