New to machine learning and I was going over the topic of support vector machines. Can someone verify if I'm correct in saying that dual representations relate to support vectors in the way that if the weights over the training data is not equal to zero, we can then deduce it as a support vector and the fewer the support vectors there are, the more sparse the solution?
Thanks so much in advance.
The dual representation is the expression of a solution as a linear combination of training point locations (their actual location in input space if the kernel is linear; or their location in a high-dimensional feature space induced by the kernel, if non-linear). So the dual representation consists of a bunch of weights—one number corresponding to each data point. Those data points for which the corresponding weight is non-zero? yep, they're the support vectors.
Related
PREMISE:
I'm really new to Computer Vision/Image Processing and Machine Learning (luckily, I'm more expert on Information retrieval), so please be kind with this filthy peasant! :D
MY APPLICATION:
We have a mobile application where the user takes a photo (the query) and the system returns the most similar picture thas was previously taken by some other user (the dataset element). Time performances are crucial, followed by precision and finally by memory usage.
MY APPROACH:
First of all, it's quite obvious that this is a 1-Nearest Neighbor problem (1-NN). LSH is a popular, fast and relatively precise solution for this problem. In particular, my LSH impelementation is about using Kernalized Locality Sensitive Hashing to achieve a good precision to translate a d-dimension vector to a s-dimension binary vector (where s<<d) and then use Fast Exact Search in Hamming Space
with Multi-Index Hashing to quickly find the exact nearest neighbor between all the vectors in the dataset (transposed to hamming space).
In addition, I'm going to use SIFT since I want to use a robust keypoint detector&descriptor for my application.
WHAT DOES IT MISS IN THIS PROCESS?
Well, it seems that I already decided everything, right? Actually NO: in my linked question I face the problem about how to represent the set descriptor vectors of a single image into a vector. Why do I need it? Because a query/dataset element in LSH is vector, not a matrix (while SIFT keypoint descriptor set is a matrix). As someone suggested in the comments, the commonest (and most efficient) solution is using the Bag of Features (BoF) model, which I'm still not confident with yet.
So, I read this article, but I have still some questions (see QUESTIONS below)!
QUESTIONS:
First and most important question: do you think that this is a reasonable approach?
Is k-means used in the BoF algorithm the best choice for such an application? What are alternative clustering algorithms?
The dimension of the codeword vector obtained by the BoF is equal to the number of clusters (so k parameter in the k-means approach)?
If 2. is correct, bigger is k then more precise is the BoF vector obtained?
There is any "dynamic" k-means? Since the query image must added to the dataset after the computation is done (remember: the dataset is formed by the images of all submitted queries) the cluster can change in time.
Given a query image, is the process to obtain the codebook vector the same as the one for a dataset image, e.g. we assign each descriptor to a cluster and the i-th dimension of the resulting vector is equal to the number of descriptors assigned to the i-th cluster?
It looks like you are building codebook from a set of keypoint features generated by SIFT.
You can try "mixture of gaussians" model. K-means assumes that each dimension of a keypoint is independent while "mixture of gaussians" can model the correlation between each dimension of the keypoint feature.
I can't answer this question. But I remember that the SIFT keypoint, by default, has 128 dimensions. You probably want a smaller number of clusters like 50 clusters.
N/A
You can try Infinite Gaussian Mixture Model or look at this paper: "Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis and Michael Jordan!
Not sure if I understand this question.
Hope this help!
If one trains a model using a SVM from kernel data, the resultant trained model contains support vectors. Now consider the case of training a new model using the old data already present plus a small amount of new data as well.
SO:
Should the new data just be combined with the support vectors from the previously formed model to form the new training set. (If yes, then how to combine the support vectors with new graph data? I am working on libsvm)
Or:
Should the new data and the complete old data be combined together and form the new training set and not just the support vectors?
Which approach is better for retraining, more doable and efficient in terms of accuracy and memory?
You must always retrain considering the entire, newly concatenated, training set.
The support vectors from the "old" model might not be support vectors anymore in case some "new points" are closest to the decision boundary. Behind the SVM there is an optimization problem that must be solved, keep that in mind. With a given training set, you find the optimal solution (i.e. support vectors) for that training set. As soon as the dataset changes, such solution might not be optimal anymore.
The SVM training is nothing more than a maximization problem where the geometrical and functional margins are the objective function. Is like maximizing a given function f(x)...but then you change f(x): by adding/removing points from the training set you have a better/worst understanding of the decision boundary since such decision boundary is known via sampling where the samples are indeed the patterns from your training set.
I understand your concerned about time and memory efficiency, but that's a common problem: indeed training the SVMs for the so-called big data is still an open research topic (there are some hints regarding backpropagation training) because such optimization problem (and the heuristic regarding which Lagrange Multipliers should be pairwise optimized) are not easy to parallelize/distribute on several workers.
LibSVM uses the well-known Sequential Minimal Optimization algorithm for training the SVM: here you can find John Pratt's article regarding the SMO algorithm, if you need further information regarding the optimization problem behind the SVM.
Idea 1 has been already examined & assessed by research community
anyone interested in faster and smarter aproach (1) -- re-use support-vectors and add new data -- kindly review research materials published by Dave MUSICANT and Olvi MANGASARIAN on such their method referred as "Active Support Vector Machine"
MATLAB implementation: available from http://research.cs.wisc.edu/dmi/asvm/
PDF:[1] O. L. Mangasarian, David R. Musicant; Active Support Vector Machine Classification; 1999
[2] David R. Musicant, Alexander Feinberg; Active Set Support Vector Regression; IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 15, NO. 2, MARCH 2004
This is a purely theoretical thought on your question. The idea is not bad. However, it needs to be extended a bit. I'm looking here purely at the goal to sparsen the training data from the first batch.
The main problem -- which is why this is purely theoretical -- is that your data is typically not linear separable. Then the misclassified points are very important. And they will spoil what I write below. Furthermore the idea requires a linear kernel. However, it might be possible to generalise to other kernels
To understand the problem with your approach lets look at the following support vectors (x,y,class): (-1,1,+),(-1,-1,+),(1,0,-). The hyperplane is the a vertical line going trough zero. If you would have in your next batch the point (-1,-1.1,-) the max margin hyperplane would tilt. This could now be exploited for sparsening. You calculate the - so to say - minimal margin hyperplane between the two pairs ({(-1,1,+),(1,0,-)}, {(-1,-1,+),(1,0,-)}) of support vectors (in 2d only 2 pairs. higher dimensions or non-linear kernel might be more). This is basically the line going through these points. Afterwards you classify all data points. Then you add all misclassified points in either of the models, plus the support vectors to the second batch. Thats it. The remaining points can't be relevant.
Besides the C/Nu problem mentioned above. The curse of dimensionality will obviously kill you here
An image to illustrate. Red: support vectors, batch one, Blue, non-support vector batch one. Green new point batch two.
Redline first Hyperplane, Green minimal margin hyperplane which misclassifies blue point, blue new hyperplane (it's a hand fit ;) )
Recently,I have been going through lectures and texts and trying to understand how SVM's enable use to work in higher dimensional space.
In normal logistic regression,we use the features as it is..but in SVM's we use a mapping which helps us attain a non linear decision boundary.
Normally we work directly with features..but with the help of kernel trick we can find relations in data using square of the features..product between them etc..is this correct?
We do this with the help of kernel.
Now..i understand that a polynomial kernel corresponds to a known feature vector..but i am unable to understand what gaussian kernel corresponds to( i am told an infinite dimensional feature vector..but what?)
Also,I am unable to grasp the concept that kernel is a measure of similiarity between training examples..how is this a part of the SVM's working?
I have spent lot of time trying to understand these..but in vain.Any help would be much apppreciated!
Thanks in advance :)
Normally we work directly with features..but with the help of kernel trick we can find relations in data using square of the features..product between them etc..is this correct?
Even using a kernel you still work with features, you can simply exploit more complex relations of these features. Such as in your example - polynomial kernel gives you access to low-degree polynomial relations between features (such as squares, or products of features).
Now..i understand that a polynomial kernel corresponds to a known feature vector..but i am unable to understand what gaussian kernel corresponds to( i am told an infinite dimensional feature vector..but what?)
Gaussian kernel maps your feature vector to the unnormalized Gaussian probability density function. In other words, you map each point onto a space of functions, where your point is now a Gaussian centered in this point (with variance corresponding to the hyperparameter gamma of the gaussian kernel). Kernel is always a dot product between vectors. In particular, in function spaces L2 we define classic dot product as an integral over the product, so
<f,g> = integral (f*g) (x) dx
where f,g are Gaussian distributions.
Luckily, for two Gaussian densities, integral of their product is also a Gaussian, this is why gaussian kernel is so similar to the pdf function of the gaussian distribution.
Also,I am unable to grasp the concept that kernel is a measure of similiarity between training examples..how is this a part of the SVM's working?
As mentioned before, kernel is a dot product, and dot product can be seen as a measure of similarity (it is maximized when two vectors have the same direction). However it does not work the other way around, you cannot use every similarity measure as a kernel, because not every similarity measure is a valid dot product.
just a bit of introduction about svm before i start answering the question. This will help you get overview about svm. Svm task is to find the best margin-maximing hyperplane that best separates the data. We have soft margin representation of svm which is also known as primal form and its equivalent form is dual form of svm. Dual form of svm makes the use of kernel trick.
kernel trick is partially replacing the feature engineering which is the most important step in machine learning when we have datasets that are not linear (eg. datasets in shape of concentric circles).
Now you can transform this dataset from non-linear to linear by both FE and kernel trick. By FE you can square each of the features in this dataset and it will transform into linear dataset and then you can apply techniques like logisitic regression which work best for linear data.
In kernel trick you can use the polynomial kernel whose general form is (a + x_i(transpose)x_j)^d where a and d are constants and d specifies the degree, suppose if the degree is 2 then we can say it is quadratic and likewise. now lets say we apply d =2 now our equation becomes (a + x_i(transpose)x_j)^2. lets the we 2 features in our original dataset (eg. for x_1 vector is [x_11,x__12] and x_2 the vector is [x_21,x_22]) now when we apply polynomial kernel on this we get we get 6-d vectors. Now we have transformed the features from 2-d to 6-d.Now you can see intuitively that higher your dimension of data better would svm work because it will eventually transform that features to a higher space. Infact the best case of svm is if you have high dimensionality, then go for svm.
Now you can see both kernel trick and Feature Engineering solve and transforms the dataset(concentric circle one) but the difference is we are doing FE explicitly but kernel trick comes implicitly with svm. There is also a general purpose kernel know as Radial basis function kernel which can be used when you don't know the kernel in advance.
RBF kernel has a parameter (sigma) if the value of sigma is set to 1, then you get a curve that looks like gaussian curve.
You can consider kernel as a similarity metric only and can interpret if the distance between the points is less, higher will be the similarity.
What the meaning of the word "support" in the context of Support Vector Machine, which is a supervised learning model?
Copy-pasted from Wikipedia:
Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.
In SVMs the resulting separating hyper-plane is attributed to a sub-set of data feature vectors (i.e., the ones that their associated Lagrange multipliers are greater than 0). These feature vectors were named support vectors because intuitively you could say that they "support" the separating hyper-plane or you could say that for the separating hyper-plane the support vectors play the same role as the pillars to a building.
Now formally, paraphrasing Bernhard Schoelkopf 's and Alexander J. Smola's book titled "Learning with Kernels" page 6:
"In the searching process of the unique optimal hyper-plane we consider hyper-planes with normal vectors w that can be represented as general linear combinations (i.e., with non-uniform coefficients) of the training patterns. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce computational cost of evaluating the decision function. The hyper-plane will then only depend on a sub-set of the training patterns called Support Vectors."
That is, the separating hyper-plane depends on those training data feature vectors, they influence it, it's based on them, consequently they support it.
In a kernel space, the simplest way to represent the separating hyperplane is by the distance to data instances. These data instances are called "support vectors".
The kernel space could be infinite. But as long as you can compute the kernel similarity to the support vectors, you can test which side of the hyperplane an object is, without actually knowing what this infinite dimensional hyperplane looks like.
In 2d, you could of course just produce an equation for the hyperplane. But this doesn't yield any actual benefits, except for understanding the SVM.
I have some problems with understanding the kernels for non-linear SVM.
First what I understood by non-linear SVM is: using kernels the input is transformed to a very high dimension space where the transformed input can be separated by a linear hyper-plane.
Kernel for e.g: RBF:
K(x_i, x_j) = exp(-||x_i - x_j||^2/(2*sigma^2));
where x_i and x_j are two inputs. here we need to change the sigma to adapt to our problem.
(1) Say if my input dimension is d, what will be the dimension of the
transformed space?
(2) If the transformed space has a dimension of more than 10000 is it
effective to use a linear SVM there to separate the inputs?
Well it is not only a matter of increasing the dimension. That's the general mechanism but not the whole idea, if it were true that the only goal of the kernel mapping is to increase the dimension, one could conclude that all kernels functions are equivalent and they are not.
The way how the mapping is made would make possible a linear separation in the new space.
Talking about your example and just to extend a bit what greeness said, RBF kernel would order the feature space in terms of hyperspheres where an input vector would need to be close to an existing sphere in order to produce an activation.
So to answer directly your questions:
1) Note that you don't work on feature space directly. Instead, the optimization problem is solved using the inner product of the vectors in the feature space, so computationally you won't increase the dimension of the vectors.
2) It would depend on the nature of your data, having a high dimensional pattern would somehow help you to prevent overfitting but not necessarily will be linearly separable. Again, the linear separability in the new space would be achieved because the way the map is made and not only because it is in a higher dimension. In that sense, RBF would help but keep in mind that it might not perform well on generalization if your data is not locally enclosed.
The transformation usually increases the number of dimensions of your data, not necessarily very high. It depends. The RBF Kernel is one of the most popular kernel functions. It adds a "bump" around each data point. The corresponding feature space is a Hilbert space of infinite dimensions.
It's hard to tell if a transformation into 10000 dimensions is effective or not for classification without knowing the specific background of your data. However, choosing a good mapping (encoding prior knowledge + getting right complexity of function class) for your problem improves results.
For example, the MNIST database of handwritten digits contains 60K training examples and 10K test examples with 28x28 binary images.
Linear SVM has ~8.5% test error.
Polynomial SVM has ~ 1% test error.
Your question is a very natural one that almost everyone who's learned about kernel methods has asked some variant of. However, I wouldn't try to understand what's going on with a non-linear kernel in terms of the implied feature space in which the linear hyperplane is operating, because most non-trivial kernels have feature spaces that it is very difficult to visualise.
Instead, focus on understanding the kernel trick, and think of the kernels as introducing a particular form of non-linear decision boundary in input space. Because of the kernel trick, and some fairly daunting maths if you're not familiar with it, any kernel function satisfying certain properties can be viewed as operating in some feature space, but the mapping into that space is never performed. You can read the following (fairly) accessible tutorial if you're interested: from zero to Reproducing Kernel Hilbert Spaces in twelve pages or less.
Also note that because of the formulation in terms of slack variables, the hyperplane does not have to separate points exactly: there's an objective function that's being maximised which contains penalties for misclassifying instances, but some misclassification can be tolerated if the margin of the resulting classifier on most instances is better. Basically, we're optimising a classification rule according to some criteria of:
how big the margin is
the error on the training set
and the SVM formulation allows us to solve this efficiently. Whether one kernel or another is better is very application-dependent (for example, text classification and other language processing problems routinely show best performance with a linear kernel, probably due to the extreme dimensionality of the input data). There's no real substitute for trying a bunch out and seeing which one works best (and make sure the SVM hyperparameters are set properly---this talk by one of the LibSVM authors has the gory details).