What is a Tensor kernel in machine learning? What is the difference between Tensor kernels with common kernels like RBF kernel?
When tensor kernels are used and what their advantages and disadvantages?
This is more a semantic issue. A Kernel is an encapsulated function.
In order to apply it to a data-tensor we need to convert the kernel to a tensor shape, but in the end it is still the same function.
Understand tensors. They "are just high dimensional matrices".
Understand ML Kernels. Basically a kernel is an encapsulated function. The kernel trick is used to pass a linear function over a (none-linear) kernel or filter, in order to achieve a complex none-linear function in a less complex way. This allows a linear function learning none-linear features, saving a lot of computational power, since
using a polynomial function can become too expensive for large (or complicated) datasets training. I.e., we can compute linear functions using none-linear kernels to learn none-linear patterns.
So, RBF is represented as a tensor kernel (exponential gaussian kernel), but this is just the most popular kernel in machine learning to learn none-linear representations (specially used in SVM).
SVR using linear and non-linear kernels.
There are other tensor kernels that you could use instead, like the Sigmoid Kernel or the polynomial kernel.
Related
Recently,I have been going through lectures and texts and trying to understand how SVM's enable use to work in higher dimensional space.
In normal logistic regression,we use the features as it is..but in SVM's we use a mapping which helps us attain a non linear decision boundary.
Normally we work directly with features..but with the help of kernel trick we can find relations in data using square of the features..product between them etc..is this correct?
We do this with the help of kernel.
Now..i understand that a polynomial kernel corresponds to a known feature vector..but i am unable to understand what gaussian kernel corresponds to( i am told an infinite dimensional feature vector..but what?)
Also,I am unable to grasp the concept that kernel is a measure of similiarity between training examples..how is this a part of the SVM's working?
I have spent lot of time trying to understand these..but in vain.Any help would be much apppreciated!
Thanks in advance :)
Normally we work directly with features..but with the help of kernel trick we can find relations in data using square of the features..product between them etc..is this correct?
Even using a kernel you still work with features, you can simply exploit more complex relations of these features. Such as in your example - polynomial kernel gives you access to low-degree polynomial relations between features (such as squares, or products of features).
Now..i understand that a polynomial kernel corresponds to a known feature vector..but i am unable to understand what gaussian kernel corresponds to( i am told an infinite dimensional feature vector..but what?)
Gaussian kernel maps your feature vector to the unnormalized Gaussian probability density function. In other words, you map each point onto a space of functions, where your point is now a Gaussian centered in this point (with variance corresponding to the hyperparameter gamma of the gaussian kernel). Kernel is always a dot product between vectors. In particular, in function spaces L2 we define classic dot product as an integral over the product, so
<f,g> = integral (f*g) (x) dx
where f,g are Gaussian distributions.
Luckily, for two Gaussian densities, integral of their product is also a Gaussian, this is why gaussian kernel is so similar to the pdf function of the gaussian distribution.
Also,I am unable to grasp the concept that kernel is a measure of similiarity between training examples..how is this a part of the SVM's working?
As mentioned before, kernel is a dot product, and dot product can be seen as a measure of similarity (it is maximized when two vectors have the same direction). However it does not work the other way around, you cannot use every similarity measure as a kernel, because not every similarity measure is a valid dot product.
just a bit of introduction about svm before i start answering the question. This will help you get overview about svm. Svm task is to find the best margin-maximing hyperplane that best separates the data. We have soft margin representation of svm which is also known as primal form and its equivalent form is dual form of svm. Dual form of svm makes the use of kernel trick.
kernel trick is partially replacing the feature engineering which is the most important step in machine learning when we have datasets that are not linear (eg. datasets in shape of concentric circles).
Now you can transform this dataset from non-linear to linear by both FE and kernel trick. By FE you can square each of the features in this dataset and it will transform into linear dataset and then you can apply techniques like logisitic regression which work best for linear data.
In kernel trick you can use the polynomial kernel whose general form is (a + x_i(transpose)x_j)^d where a and d are constants and d specifies the degree, suppose if the degree is 2 then we can say it is quadratic and likewise. now lets say we apply d =2 now our equation becomes (a + x_i(transpose)x_j)^2. lets the we 2 features in our original dataset (eg. for x_1 vector is [x_11,x__12] and x_2 the vector is [x_21,x_22]) now when we apply polynomial kernel on this we get we get 6-d vectors. Now we have transformed the features from 2-d to 6-d.Now you can see intuitively that higher your dimension of data better would svm work because it will eventually transform that features to a higher space. Infact the best case of svm is if you have high dimensionality, then go for svm.
Now you can see both kernel trick and Feature Engineering solve and transforms the dataset(concentric circle one) but the difference is we are doing FE explicitly but kernel trick comes implicitly with svm. There is also a general purpose kernel know as Radial basis function kernel which can be used when you don't know the kernel in advance.
RBF kernel has a parameter (sigma) if the value of sigma is set to 1, then you get a curve that looks like gaussian curve.
You can consider kernel as a similarity metric only and can interpret if the distance between the points is less, higher will be the similarity.
In the disadvantages of SVM they say
If the number of features is much greater than the number of samples,
the method is likely to give poor performances.
What is a good alternative in such case ?
You can keep an eye on the libsvm guide for beginners, in the section C.1 it gives you the answer and an example of exactly what you have asked:
C.1 Number of instances << number of features Many microarray data in
bioinformatics are of this type. We consider the Leukemia data from
the LIBSVM data sets (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/
datasets). The training and testing sets have 38 and 34 instances,
respectively. The number of features is 7,129, much larger than the
number of instances. We merge the two les and compare the cross
validation accuracy of using the RBF and the linear kernels:
RBF kernel with parameter selection
$ cat leu leu.t > leu.combined
$ python grid.py leu.combined
...
8.0 3.0517578125e-05 97.2222
(Best C=8.0, = 0:000030518 with ve-fold cross-validation rate=97.2222%)
Linear kernel with parameter selection
$ python grid.py -log2c -1,2,1 -log2g 1,1,1 -t 0 leu.combined
...
0.5 2.0 98.6111
(Best C=0.5 with ve-fold cross-validation rate=98.61111%)
Though grid.py was designed for the RBF kernel, the
above way checks various C using the linear kernel (-log2g 1,1,1 sets
a dummy ).
The cross-validation accuracy of using the linear kernel
is comparable to that of using the RBF kernel. Apparently, when the
number of features is very large, one may not need to map the data.
In addition to LIBSVM, the LIBLINEAR software mentioned below is also
effective for data in this case.
So as you can see you can use SVM with linear kernels and obtain good results.
For all SVM versions, like c-svm, v-svm, soft margin svm etc., can a support vector not be a training sample?
No, it can't. A support vector is always a sample from the training set.
This is a good thing, because it means SVMs are oblivious to the internal structure of their samples and their support vectors. Only the kernel function, which is separate from the SVM proper, has to know about the structure of samples. While most kernels operate on vectors of numbers, there exist kernels that operate on strings, trees, graphs, you name it.
(Note that linear support vector machines can be trained without taking support vectors into account. I.e., when you train a linear model under hinge loss with appropriate regularization using an algorithm such as SGD, you get a model that is equivalent to an SVM with a linear kernel, but where the support vectors are implicit.)
I have this confusion related to kernel svm. I read that with kernel svm the number of support vectors retained is large. That's why it is difficult to train and is time consuming. I didn't get this part why is it difficult to optimize. Ok I can say that noisy data requires large number of support vectors. But what does it have to do with the training time.
Also I was reading another article where they were trying to convert non linear SVM kernel to linear SVM kernel. In the case of linear kernel it is just the dot product of the original features themselves. But in the case of non linear one it is RBF and others. I didn't get what they mean by "manipulating the kernel matrix imposes significant computational bottle neck". As far as I know, the kernel matrix is static isn't it. For linear kernel it is just the dot product of the original features. In the case of RBF it is using the gaussian kernel. So I just need to calculate it once, then I am done isn't it. So what's the point of manipulating and the bottleneck thinkg
Support Vector Machine (SVM) (Cortes and Vap- nik, 1995) as the state-of-the-art classification algo- rithm has been widely applied in various scientific do- mains. The use of kernels allows the input samples to be mapped to a Reproducing Kernel Hilbert S- pace (RKHS), which is crucial to solving linearly non- separable problems. While kernel SVMs deliver the state-of-the-art results, the need to manipulate the k- ernel matrix imposes significant computational bottle- neck, making it difficult to scale up on large data.
It's because the kernel matrix is a matrix that is N rows by N columns in size where N is the number of training samples. So imagine you have 500,000 training samples, then that would mean the matrix needs 500,000*500,000*8 bytes (1.81 terabytes) of RAM. This is huge and would require some kind of parallel computing cluster to deal with in any reasonable way. Not to mention the time it takes to compute each element. For example, if it took your computer 1 microsecond to compute 1 kernel evaluation then it would take 69.4 hours to compute the entire kernel matrix. For comparison, a good linear solver can handle a problem of this size in a few minutes or an hour on a regular desktop workstation. So that's why linear SVMs are preferred.
To understand why they are so much faster you have to take a step back and think about how these optimizers work. At the highest level you can think of them as searching for a function that gives the correct outputs on all the training samples. Moreover, most solvers are iterative in the sense that they have a current best guess at what this function should be and in each iteration they test it on the training data and see how good it is. Then they update the function in some way to improve it. They keep doing this until they find the best function.
Keeping this in mind, the main reason why linear solvers are so fast is because the function they are learning is just a dot product between a fixed size weight vector and a training sample. So in each iteration of the optimization it just needs to compute the dot product between the current weight vector and all the samples. This takes O(N) time, moreover, good solvers converge in just a few iterations regardless of how many training samples you have. So the working memory for the solver is just the memory required to store the single weight vector and all the training samples. This means the entire process takes only O(N) time and O(N) bytes of RAM.
A non-linear solver on the other hand is learning a function that is not just a dot product between a weight vector and a training sample. In this case, it is a function that is the sum of a bunch of kernel evaluations between a test sample and all the other training samples. So in this case, just evaluating the function you are learning against one training sample takes O(N) time. Therefore, to evaluate it against all training samples takes O(N^2) time. There have been all manner of clever tricks devised to try and keep the non-linear function compact to speed this up. But all of them are at least a little bit heuristic or approximate in some sense while good linear solvers find exact solutions. So that's part of the reason for the popularity of linear solvers.
What is the difference between K-means clustering and vector quantization?
They seem to be very similar.
I'm dealing with Hidden Markov Models and I need to extract symbols from feature vectors.
In order to extract symbols, do I do vector quantization or k-means clustering?
The way I understand it, K-means is one type of vector quantization.
The K-means algorithms is the specialization of the celebrated "Lloyd I" quantization algorithm to the case of empirical distributions. (cf. Lloyd)
The Lloyd I algorithm is proved to yield a sequence of quantizers with a decreasing quadratic distortion. However, except in the special case of one-dimensional log-concave distributions, it dos not always converge to a quadratic optimal quantizer. (There are local minimums for the quantization error, especially when dealing with empirical distribution i.e. for the clustering problem.)
A method that converges (always) toward an optimal quantizer is the so-called CLVQ algorithms, which also generalizes to the problem of more general L^p quantization. It is a kind of Stochastic Gradient method. (cf. Pagès)
There are also some approaches based on genetic algorithms. (cf. Hamida et al.), and/or classical optimization procedures for the one dimensional case that converge faster (Pagès, Printems).