The parameter b from SVM in case of Gaussian kernel - machine-learning

I read about SVM with the Gaussian kernel, and as I understand it, the Gaussian kernel essentially says that if a point is near some training point, then the "color" (label) of these points will be the same. If I ask for the "color" of a point that is far away from all data points, the kernel terms are all close to 0, which means something like "I don't know". And I know that if I ask the predictor for the color of such a point, it will return b. The proof is below.
Can we say that, in the case of the Gaussian kernel, b is equal to 0?

No. The final decision function of an SVM with a Gaussian kernel is given below:
f(x) = sign( SUM_i alpha_i y_i K(x_i, x) + b )
and the Gaussian kernel is as follows:
K(x, y) = exp( -||x - y||^2 / (2 sigma^2) )
We can see the sigma parameter and the signum function here. The end result is greater than zero for one class and less than zero for the other.
So to find a surface (instead of a line as in the linear case) that separates these two classes, we need to adjust the values of b and sigma. These values vary from problem to problem; hence b is not necessarily zero.
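To make this concrete, here is a minimal sketch (using scikit-learn's SVC and made-up data, nothing from the original question) showing that far from the training points the decision value collapses to b, and that this b is in general not zero:

```python
# Far away from all training points every Gaussian kernel term K(x_i, x) ~ 0,
# so the decision value reduces to the bias b, whatever b happens to be.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # training points near the origin
y = (X[:, 0] > 0).astype(int)          # two "colors"

clf = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X, y)

far_point = np.array([[100.0, 100.0]])   # far away from every training point
print(clf.decision_function(far_point))  # approximately equal to...
print(clf.intercept_)                    # ...the bias b, which need not be 0
```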
Researchers have used different optimization algorithms to find good values, e.g. Particle Swarm Optimization, Grey Wolf Optimization, etc.
For example, a very small value of sigma leads to overfitting, while a very large value results in underfitting. So it should be optimized.
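As an illustration of that tuning step, here is a hedged sketch using a plain cross-validated grid search on synthetic data (scikit-learn parameterizes the RBF kernel by gamma, which corresponds to 1/(2*sigma^2)); a PSO or GWO optimizer would simply replace the grid:

```python
# Tuning C and gamma (i.e. sigma) of an RBF SVM by cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 10],   # a range of candidate sigmas
}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```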
For more information, you can read the following open-access article as an example:
"The Impact of Different Kernel Functions on the Performance of Scintillation Detection Based on Support Vector Machines"
I have used both the PSO and GWO optimization algorithms to optimize the key parameters of a least-squares support vector machine (LS-SVM) in my open-access research article, linked below:
Optimization of LSSVM parameters reference


Should I normalize the inputs in my Neural Network?

First, some context.
I'm taking on a very ambitious project: making a Neural Network capable of playing Chess at a decent level. I might not succeed, but I'm doing it mainly to learn how to approach this kind of machine learning.
I've decided I want to train the network using a genetic algorithm to fine tune the weights after different neural nets have fought against each other in a few games of chess.
Each neuron uses a hyperbolic tangent (range (-1, 1)) to normalize the data after it has been processed, but there is no normalization of the input yet before it enters the network.
I've taken some inspiration from the Giraffe chess engine, particularly the inputs.
They are going to look sort of like this:
First layer:
number of remaining White Pawns (0-8)
number of remaining Black Pawns (0-8)
number of remaining White Knights (0-2)
number of remaining Black Knights (0-2)
....
Second layer, still on the same level as the first:
Position of Pawn 1 (probably going with 2 values, x[0-7] and y[0-7])
Position of Pawn 2
...
Position of Queen 1
Position of Queen 2
...
Third layer, again on the same level as the previous two. The data is only going to "crosstalk" after the next layer of abstraction.
Values of pieces attacked by Pawn1 (this is going to be in the 0-12 ish range)
Values of pieces attacked by Pawn2
...
Value of pieces attacked by Bishop1
You get the idea.
If you didn't, here's a terrible Paint representation of what I mean:
The question is: should I normalize the input data before it is read by the Neural Network?
I feel like squishing the data might not be such a good idea but I really don't have the competence to make a conclusive call.
I hope someone here can enlighten me on the subject and if you think I should normalize the data, I would like it if you could suggest some ways of doing so.
Thanks!
You shouldn't need to normalize anything inside the network. The point of machine learning is to train the weights and biases to learn a non-linear function; in your example it'd be a static chess evaluation. Thus, your second "Normalized" blue vertical bar (near the final output) is unnecessary.
Note: hidden layer is better terminology than abstraction layer, so I'll use it instead.
The other normalization you have before the hidden layers is optional but recommended. It also depends on what input we're talking about.
The Giraffe paper writes in page 18:
"Each slot has normalized x coordinate, normalized y coordinate ..."
Chess has 64 squares, so without normalization the range would be [0, 1, ..., 63]. This is very discrete and the range is much wider than that of the other inputs (more on that later). It makes sense to normalize it to something more manageable and comparable to the other inputs. The paper doesn't say how exactly the coordinates get normalized, but I don't see why a [0...1] range wouldn't work. It makes sense to normalize chess squares (or coordinates).
Other inputs, such as whether there's a queen on the board, are simply true or false, and thus require no normalization. For example, the Giraffe paper writes on page 18:
... whether piece is present or absent ...
Clearly, you wouldn't normalize it.
Answer to your question
If you represent the Piece Count Layer as in Giraffe, you shouldn't need to normalize. But if you prefer a discrete count representation such as [0..8] (and a count can go even higher, since there could be up to 9 queens in chess after promotions), you might want to normalize.
If you represent the Piece Position Layer with chess squares, you should normalize, just like Giraffe does.
Giraffe doesn't normalize the Piece Attack/Defense Layer, possibly because it represents the information as the lowest-valued attacker and defender of each square. Unfortunately, the paper doesn't explicitly state how this is done. Your implementation might require normalization, so use your common sense.
Without any prior assumption about which features would be more relevant for the model, you should normalize them to a comparable scale.
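As a rough illustration, here is a minimal sketch (the feature names and ranges are assumptions based on your description, not Giraffe's actual encoding) of bringing the three input groups onto a comparable [0, 1] scale:

```python
import numpy as np

def normalize_inputs(piece_counts, piece_squares, attacked_values):
    # piece_counts: raw counts such as 0-8 pawns, 0-2 knights, ...
    # piece_squares: square indices 0-63 for each tracked piece
    # attacked_values: summed values of attacked pieces, roughly 0-12 per piece
    counts = np.asarray(piece_counts, dtype=float) / 8.0        # 0-8 -> [0, 1]
    files = (np.asarray(piece_squares) % 8) / 7.0               # x coordinate in [0, 1]
    ranks = (np.asarray(piece_squares) // 8) / 7.0              # y coordinate in [0, 1]
    attacks = np.asarray(attacked_values, dtype=float) / 12.0   # 0-12 -> [0, 1]
    return np.concatenate([counts, files, ranks, attacks])

# Example: 8 white pawns and 2 knights; a pawn on e2 (square 12) and a knight on g1 (square 6);
# the pawn attacks nothing, the knight attacks a pawn worth 1.
print(normalize_inputs([8, 2], [12, 6], [0, 1]))
```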
EDITED
Let me answer your comment. Normalization is not the correct term; what you're talking about is an activation function (https://en.wikipedia.org/wiki/Activation_function). Normalization and activation functions are not the same thing.

Bag of Features / Visual Words + Locality Sensitive Hashing

PREMISE:
I'm really new to Computer Vision/Image Processing and Machine Learning (luckily, I'm more of an expert on Information Retrieval), so please be kind to this filthy peasant! :D
MY APPLICATION:
We have a mobile application where the user takes a photo (the query) and the system returns the most similar picture that was previously taken by some other user (the dataset element). Time performance is crucial, followed by precision and finally by memory usage.
MY APPROACH:
First of all, it's quite obvious that this is a 1-Nearest Neighbor problem (1-NN). LSH is a popular, fast and relatively precise solution for this problem. In particular, my LSH implementation uses Kernelized Locality-Sensitive Hashing to achieve good precision, translating a d-dimensional vector into an s-dimensional binary vector (where s << d), and then uses Fast Exact Search in Hamming Space with Multi-Index Hashing to quickly find the exact nearest neighbor among all the vectors in the dataset (transposed to Hamming space).
In addition, I'm going to use SIFT, since I want a robust keypoint detector & descriptor for my application.
WHAT IS MISSING IN THIS PROCESS?
Well, it seems that I have already decided everything, right? Actually NO: in my linked question I face the problem of how to represent the set of descriptor vectors of a single image as a single vector. Why do I need that? Because a query/dataset element in LSH is a vector, not a matrix (while a SIFT keypoint descriptor set is a matrix). As someone suggested in the comments, the most common (and most efficient) solution is the Bag of Features (BoF) model, which I'm still not confident with yet.
So, I read this article, but I still have some questions (see QUESTIONS below)!
QUESTIONS:
First and most important question: do you think that this is a reasonable approach?
Is the k-means used in the BoF algorithm the best choice for such an application? What are the alternative clustering algorithms?
Is the dimension of the codeword vector obtained by BoF equal to the number of clusters (so the k parameter in the k-means approach)?
If the previous point is correct, does a bigger k give a more precise BoF vector?
Is there any "dynamic" k-means? Since the query image must be added to the dataset after the computation is done (remember: the dataset is formed by the images of all submitted queries), the clusters can change over time.
Given a query image, is the process to obtain the codebook vector the same as for a dataset image, i.e. we assign each descriptor to a cluster and the i-th dimension of the resulting vector is equal to the number of descriptors assigned to the i-th cluster?
It looks like you are building a codebook from a set of keypoint features generated by SIFT.
You can try a "mixture of Gaussians" model. K-means assumes that each dimension of a keypoint is independent, while a mixture of Gaussians can model the correlation between the dimensions of the keypoint feature (a minimal sketch is given after these answers).
I can't answer this question. But I remember that a SIFT keypoint, by default, has 128 dimensions. You probably want a smaller number of clusters, like 50.
N/A
You can try Infinite Gaussian Mixture Model or look at this paper: "Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis and Michael Jordan!
Not sure if I understand this question.
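To make the mixture-of-Gaussians suggestion and the histogram encoding described in the last question a bit more concrete, here is a minimal sketch (with random stand-in descriptors instead of real SIFT output) of fitting a codebook and encoding an image as a histogram of codeword assignments:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for the stacked SIFT descriptors of all dataset images (128-d each).
all_descriptors = rng.normal(size=(5000, 128))

n_codewords = 50  # the "k" of the codebook
# covariance_type='full' would capture correlations between descriptor dimensions,
# at a much higher cost; 'diag' keeps this toy example fast.
gmm = GaussianMixture(n_components=n_codewords, covariance_type='diag', random_state=0)
gmm.fit(all_descriptors)

def bof_vector(descriptors, model, k):
    # Assign each descriptor to its most likely component and count assignments:
    # the i-th entry is the number of descriptors falling into codeword i.
    assignments = model.predict(descriptors)
    return np.bincount(assignments, minlength=k).astype(float)

# Encoding a (hypothetical) query image with 300 SIFT descriptors.
query_descriptors = rng.normal(size=(300, 128))
print(bof_vector(query_descriptors, gmm, n_codewords))
```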
Hope this helps!

Custom kernels for SVM, when to apply them?

I am new to the machine learning field and right now I am trying to get a grasp of how the most common learning algorithms work and to understand when to apply each of them. At the moment I am learning how Support Vector Machines work and have a question about custom kernel functions.
There is plenty of information on the web about the more standard (linear, RBF, polynomial) kernels for SVMs. I, however, would like to understand when it is reasonable to go for a custom kernel function. My questions are:
1) What are other possible kernels for SVMs?
2) In which situations would one apply custom kernels?
3) Can a custom kernel substantially improve the prediction quality of an SVM?
1) What are other possible kernels for SVMs?
There are infinitely many of these; see for example the list of kernels implemented in pykernels (which is far from exhaustive):
https://github.com/gmum/pykernels
Linear
Polynomial
RBF
Cosine similarity
Exponential
Laplacian
Rational quadratic
Inverse multiquadratic
Cauchy
T-Student
ANOVA
Additive Chi^2
Chi^2
MinMax
Min/Histogram intersection
Generalized histogram intersection
Spline
Sorensen
Tanimoto
Wavelet
Fourier
Log (CPD)
Power (CPD)
2) In which situations would one apply custom kernels?
Basically in two cases:
"simple" ones give very bad results, or
the data is specific in some sense, so that in order to apply traditional kernels you would have to degenerate it. For example, if your data is in a graph format, you cannot apply an RBF kernel, as a graph is not a constant-size vector; you need a graph kernel to work with such an object without some kind of information-losing projection. Also, sometimes you have an insight into the data: you know about some underlying structure that might help the classifier. One such example is periodicity: if you know there is a kind of recurring effect in your data, it might be worth looking for a specific kernel.
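As a concrete example of the periodicity case, here is a minimal sketch (synthetic data, and a hand-rolled exp-sine-squared style kernel chosen only for illustration) of passing a custom kernel callable to scikit-learn's SVC:

```python
import numpy as np
from sklearn.svm import SVC

def periodic_kernel(X, Y, period=1.0, length_scale=0.5):
    # K(x, y) = exp(-2 * sin^2(pi * |x - y| / period) / length_scale^2)
    dists = np.abs(X[:, None, 0] - Y[None, :, 0])
    return np.exp(-2.0 * np.sin(np.pi * dists / period) ** 2 / length_scale ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(200, 1))
y = (np.sin(2 * np.pi * X[:, 0]) > 0).astype(int)   # label repeats with period 1

# SVC accepts any callable that returns the Gram matrix between two sets of samples.
clf = SVC(kernel=periodic_kernel).fit(X, y)
print(clf.score(X, y))   # should be close to 1 if the kernel captures the periodic structure
```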
3) Can a custom kernel substantially improve the prediction quality of an SVM?
Yes. In particular, there always exists a (hypothetical) Bayes-optimal kernel, defined as:
K(x, y) = 1 if arg max_l P(l|x) == arg max_l P(l|y), and 0 otherwise
In other words, if one has the true probability P(l|x) of label l being assigned to a point x, then we can create a kernel which pretty much maps your data points onto one-hot encodings of their most probable labels, thus leading to Bayes-optimal classification (as it attains the Bayes risk).
In practice it is of course impossible to get such a kernel, as it means that you have already solved your problem. However, it shows that there is a notion of an "optimal kernel", and obviously none of the classical ones is of this type (unless your data comes from very simple distributions). Furthermore, each kernel is a kind of prior over decision functions: the closer your induced family of functions is to the actual one, the more probable it is that you will get a reasonable classifier with an SVM.
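As a toy demonstration of this hypothetical optimal kernel, here is a sketch where the labels are a deterministic function of x, so the kernel can actually be written down; everything here (the data and the labeling rule) is made up purely for illustration:

```python
# Labels are a deterministic function of x, so arg max_l P(l|x) is just the true
# label and K(x, y) = 1 iff the labels agree. Feeding this precomputed kernel to an
# SVM gives perfect classification, but building K already required knowing the answer.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(100, 5)), rng.normal(size=(50, 5))
label = lambda X: (X[:, 0] > 0).astype(int)       # the "true" arg max_l P(l|x)
y_train, y_test = label(X_train), label(X_test)

K_train = (y_train[:, None] == y_train[None, :]).astype(float)   # train x train Gram matrix
K_test = (y_test[:, None] == y_train[None, :]).astype(float)     # test x train kernel values

clf = SVC(kernel='precomputed').fit(K_train, y_train)
print(clf.score(K_test, y_test))   # 1.0, but only because we cheated
```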

Support Vector Machine kernel types

Popular kernel functions used in Support Vector Machines are Linear, Radial Basis Function and Polynomial. Can someone please explain what this kernel function is in a simple way? :) As I am new to this area, I don't clearly understand the importance of these kernel types.
Let us start from the beginning. A support vector machine is a linear model and it always looks for a hyperplane to separate one class from another. I will focus on the two-dimensional case because it is easier to comprehend and possible to visualize to give some intuition; however, bear in mind that the same is true in higher dimensions (lines simply change into planes, parabolas into paraboloids, etc.).
Kernel in very short words
What kernels do is change the definition of the dot product in the linear formulation. What does that mean? The SVM works with dot products, defined for finite dimensions as <x,y> = x^T y = SUM_{i=1}^d x_i y_i. This more or less captures the similarity between two vectors (but also the geometrical operation of projection; it is also heavily related to the angle between the vectors). What the kernel trick does is change each occurrence of <x,y> in the math of the SVM into K(x,y), saying "K is the dot product in SOME space", and for each kernel there exists a mapping f_K such that K(x,y) = <f_K(x), f_K(y)>. The trick is that you do not use f_K directly, but just compute these dot products, which saves you tons of time (sometimes an infinite amount, as f_K(x) might have an infinite number of dimensions). OK, so what does this mean for us? We still "live" in the space of x, not of f_K(x). The result is quite nice: if you build a hyperplane in the space of f_K, separate your data, and then look back at the space of x (you might say you project the hyperplane back through f_K^{-1}), you get non-linear decision boundaries! The type of boundary depends on f_K, f_K depends on K; thus, the choice of K will (among other things) affect the shape of your boundary.
Linear kernel
Here we in fact do not have any kernel; you just have the "normal" dot product, thus in 2D your decision boundary is always a line.
As you can see, we can separate most of the points correctly, but due to the "stiffness" of our assumption we will never capture all of them.
Poly
Here, our kernel induces a space of polynomial combinations of our features, up to a certain degree. Consequently, we can work with slightly "bent" decision boundaries, such as parabolas for degree=2.
As you can see, we separated even more points! OK, can we get all of them by using higher-order polynomials? Let's try degree 4!
Unfortunately not. Why? Because polynomial combinations are not flexible enough; they will not "bend" our space hard enough to capture what we want (maybe that is not so bad? I mean, look at this point, it looks like an outlier!).
RBF kernel
Here, our induced space is a space of Gaussian distributions... each point becomes the probability density function (up to scaling) of a normal distribution. In such a space, dot products are integrals (as we have an infinite number of dimensions!), and consequently we have extreme flexibility; in fact, using such a kernel you can separate everything (but is that a good thing?).
Rough comparison
OK, so what are the main differences? I will now sort these three kernels under a few measures:
time of SVM learning: linear < poly < rbf
ability to fit any data: linear < poly < rbf
risk of overfitting: linear < poly < rbf
risk of underfitting: rbf < poly < linear
number of hyperparameters: linear (0) < rbf (2) < poly (3)
how "local" is particular kernel: linear < poly < rbf
So which one should you choose? It depends. Vapnik and Cortes (inventors of the SVM) supported quite well the idea that you should always try to fit the simplest model possible, and only if it underfits go for more complex ones. So you should generally start with a linear model (kernel, in the case of SVM) and, if it gets really bad scores, switch to poly/rbf (however, remember that they are much harder to work with due to the number of hyperparameters).
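Here is a minimal sketch of that workflow on an assumed synthetic dataset, using scikit-learn's SVC:

```python
# Start with the linear kernel; only move on to poly/rbf if cross-validation shows it underfits.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for name, model in [('linear', SVC(kernel='linear')),
                    ('poly',   SVC(kernel='poly', degree=3)),
                    ('rbf',    SVC(kernel='rbf', gamma='scale'))]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(name, round(score, 3))
# On this non-linear dataset the linear kernel typically underfits and rbf scores best;
# on linearly separable data the simpler kernel would be the better (and cheaper) choice.
```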
All images were made using a nice applet on the libSVM site. Give it a try, nothing gives you more intuition than lots of images and interaction :-)
https://www.csie.ntu.edu.tw/~cjlin/libsvm/

Support Vector Machines understanding

Recently, I have been going through lectures and texts, trying to understand how SVMs enable us to work in a higher-dimensional space.
In normal logistic regression, we use the features as they are, but in SVMs we use a mapping which helps us attain a non-linear decision boundary.
Normally we work directly with features, but with the help of the kernel trick we can find relations in the data using squares of the features, products between them, etc. Is this correct?
We do this with the help of a kernel.
Now, I understand that a polynomial kernel corresponds to a known feature vector, but I am unable to understand what the Gaussian kernel corresponds to (I am told an infinite-dimensional feature vector, but what?).
Also, I am unable to grasp the concept that a kernel is a measure of similarity between training examples. How is this a part of the SVM's working?
I have spent a lot of time trying to understand these, but in vain. Any help would be much appreciated!
Thanks in advance :)
Normally we work directly with features, but with the help of the kernel trick we can find relations in the data using squares of the features, products between them, etc. Is this correct?
Even using a kernel you still work with features; you can simply exploit more complex relations between these features. As in your example: a polynomial kernel gives you access to low-degree polynomial relations between features (such as squares, or products of features).
Now, I understand that a polynomial kernel corresponds to a known feature vector, but I am unable to understand what the Gaussian kernel corresponds to (I am told an infinite-dimensional feature vector, but what?).
The Gaussian kernel maps your feature vector to an unnormalized Gaussian probability density function. In other words, you map each point onto a space of functions, where your point is now a Gaussian centered at this point (with variance corresponding to the hyperparameter gamma of the Gaussian kernel). A kernel is always a dot product between vectors. In particular, in the function space L2 we define the classic dot product as an integral over the product, so
<f,g> = integral (f*g) (x) dx
where f,g are Gaussian distributions.
Luckily, for two Gaussian densities, the integral of their product is again a Gaussian (as a function of the distance between their centers); this is why the Gaussian kernel is so similar to the pdf of the Gaussian distribution.
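If you want to convince yourself numerically, here is a small sketch (a 1-D toy, not part of the original argument) checking that the L2 inner product of two Gaussian bumps is itself a Gaussian function of the distance between their centers:

```python
import numpy as np
from scipy.integrate import quad

def gaussian_bump(x, center, sigma):
    # Unnormalized Gaussian "feature function" for a 1-D point.
    return np.exp(-(x - center) ** 2 / (2 * sigma ** 2))

def l2_inner_product(a, b, sigma):
    # <f_a, f_b> = integral of f_a(x) * f_b(x) dx over the real line.
    value, _ = quad(lambda x: gaussian_bump(x, a, sigma) * gaussian_bump(x, b, sigma),
                    -np.inf, np.inf)
    return value

a, b, sigma = 0.3, 1.7, 1.0
numeric = l2_inner_product(a, b, sigma)
# Closed form: sqrt(pi) * sigma * exp(-(a - b)^2 / (4 * sigma^2)), i.e. an RBF in (a - b).
closed_form = np.sqrt(np.pi) * sigma * np.exp(-(a - b) ** 2 / (4 * sigma ** 2))
print(numeric, closed_form)  # the two values agree up to numerical error
```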
Also, I am unable to grasp the concept that a kernel is a measure of similarity between training examples. How is this a part of the SVM's working?
As mentioned before, a kernel is a dot product, and a dot product can be seen as a measure of similarity (it is maximized when two vectors have the same direction). However, it does not work the other way around: you cannot use every similarity measure as a kernel, because not every similarity measure is a valid dot product.
Just a bit of an introduction about SVM before I start answering the question; this will help you get an overview of SVM. The SVM's task is to find the margin-maximizing hyperplane that best separates the data. We have the soft-margin representation of the SVM, which is also known as the primal form, and its equivalent, the dual form of the SVM. The dual form of the SVM makes use of the kernel trick.
The kernel trick partially replaces feature engineering, which is the most important step in machine learning when we have datasets that are not linear (e.g. datasets in the shape of concentric circles).
Now you can transform this dataset from non-linear to linearly separable both by FE and by the kernel trick. With FE you can square each of the features in this dataset, which transforms it into a linearly separable dataset, and then you can apply techniques like logistic regression that work best for linear data.
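Here is a minimal sketch of the feature-engineering route on such a dataset (synthetic concentric circles from scikit-learn; the exact scores are illustrative):

```python
# The concentric-circles dataset is not linearly separable in (x1, x2), but it
# becomes linearly separable once we add the squared features.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Raw features: a linear model struggles.
print(cross_val_score(LogisticRegression(), X, y, cv=5).mean())      # roughly chance level

# Engineered features: append x1^2 and x2^2; now a hyperplane separates the classes.
X_fe = np.hstack([X, X ** 2])
print(cross_val_score(LogisticRegression(), X_fe, y, cv=5).mean())   # close to 1.0
```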
With the kernel trick you can use the polynomial kernel, whose general form is (a + x_i^T x_j)^d, where a and d are constants and d specifies the degree; if the degree is 2 we call it quadratic, and so on. Now let's say we apply d = 2, so our equation becomes (a + x_i^T x_j)^2. Let's say we have 2 features in our original dataset (e.g. the vector x_1 is [x_11, x_12] and the vector x_2 is [x_21, x_22]); when we apply the polynomial kernel on these, we implicitly get 6-d vectors. Now we have transformed the features from 2-d to 6-d. You can see intuitively that the higher the dimension of your data, the better the SVM can work, because it will eventually transform the features into an even higher-dimensional space. In fact, SVMs shine when you have high-dimensional data, so in that case go for an SVM.
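And here is a small sketch (an assumed numeric example, not from the original text) verifying that this degree-2 polynomial kernel really is an ordinary dot product of explicit 6-d feature vectors:

```python
import numpy as np

def poly2_kernel(x, y, a=1.0):
    return (a + x @ y) ** 2

def phi(x, a=1.0):
    # Explicit 6-d feature map for the degree-2 polynomial kernel on 2-d input.
    x1, x2 = x
    return np.array([a,
                     np.sqrt(2 * a) * x1,
                     np.sqrt(2 * a) * x2,
                     x1 ** 2,
                     x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x_1 = np.array([0.5, -1.2])
x_2 = np.array([2.0, 0.3])
print(poly2_kernel(x_1, x_2))      # kernel value computed implicitly
print(phi(x_1) @ phi(x_2))         # the same value via the explicit 6-d vectors
```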
Now you can see that both the kernel trick and feature engineering transform the dataset (the concentric-circles one), but the difference is that we do FE explicitly, while the kernel trick comes implicitly with the SVM. There is also a general-purpose kernel known as the Radial Basis Function kernel, which can be used when you don't know which kernel to choose in advance.
The RBF kernel has a parameter (sigma); if the value of sigma is set to 1, you get a curve that looks like a Gaussian curve.
You can consider a kernel simply as a similarity measure: the smaller the distance between two points, the higher their similarity.
