I'm reading Andrew Ng's Machine Learning notes, but the definition of the functional margin confused me:
I can understand that the geometric margin is the distance from x to its hyperplane, but how should I understand the functional margin? And why is its formula defined like that?
Think of it like this: w^T x_i + b is the model's prediction for the i-th data point, and y_i is its label. If the prediction and the ground truth have the same sign, then gamma_i will be positive. The further "inside" the class boundary this instance is, the bigger gamma_i will be: this is better because, summed over all i, you get greater separation between your classes. If the prediction and the label don't agree in sign, then this quantity will be negative (an incorrect decision by the predictor), which will reduce your margin, and it will be reduced more the more incorrect you are (analogous to slack variables).
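To make that concrete, here is a minimal NumPy sketch (the hyperplane and the labelled points are made up for illustration): the first two points are classified correctly and get positive functional margins, while the third is misclassified and gets a negative one.

import numpy as np

w, b = np.array([2.0, -1.0]), 0.5                   # an arbitrary example hyperplane
X = np.array([[3.0, 1.0], [0.2, 0.1], [2.0, 1.0]])  # data points
y = np.array([1, 1, -1])                            # their labels

for x_i, y_i in zip(X, y):
    gamma_i = y_i * (w @ x_i + b)   # functional margin of (w, b) w.r.t. (x_i, y_i)
    print(x_i, y_i, gamma_i)        # positive = correct; larger = deeper inside the class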
Functional Margin:
This gives the position of the point with respect to the plane (which side it lies on), but not the actual magnitude of its distance from the plane.
Geometric Margin:
This gives the distance between the given training example and the given plane.
You can convert the functional margin to the geometric margin based on the following two hypotheses:
||w|| == 1, so (w^T)x + b == ((w^T)x + b)/||w||, which is the geometric distance from the point x to the hyperplane (w^T)x + b = 0.
There are only two categories of targets, so y_i can only be +1 or -1. Therefore, if the sign of y_i matches the side of the hyperplane on which the point x lies (y_i > 0 when (w^T)x + b > 0, y_i < 0 when (w^T)x + b < 0), multiplying by y_i is simply equivalent to taking the absolute value of the distance (w^T)x + b.
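As a quick numerical check of the two hypotheses above, here is a small NumPy sketch (the hyperplane 3*x1 + 4*x2 - 5 = 0 and the point are chosen arbitrarily): dividing the functional margin by ||w|| recovers the Euclidean distance from the point to the hyperplane.

import numpy as np

w, b = np.array([3.0, 4.0]), -5.0      # hyperplane 3*x1 + 4*x2 - 5 = 0, ||w|| = 5
x_i, y_i = np.array([2.0, 3.0]), 1     # a correctly labelled point

functional = y_i * (w @ x_i + b)             # 13.0, depends on the scale of (w, b)
geometric = functional / np.linalg.norm(w)   # 2.6, the Euclidean distance to the plane
print(functional, geometric)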
Regarding this part of the question:
And why is its formula defined like that?
Explanation: The functional margin doesn't tell us the exact distance of different points from the separating plane/line.
For instance, consider the following lines: they are all the same line, but the functional margin would vary (a limitation of the functional margin).
2*x + 3*y + 1 = 0
4*x + 6*y + 2 = 0
20*x + 30*y +10 = 0
The functional margin just gives an idea of the confidence of our classification, nothing concrete.
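Here is a short NumPy sketch of that limitation, using the three equivalent lines above and an arbitrary test point: the functional margin changes with the scaling, while the geometric margin stays the same.

import numpy as np

# The same line written at three different scales (from the example above)
hyperplanes = [(np.array([2.0, 3.0]), 1.0),
               (np.array([4.0, 6.0]), 2.0),
               (np.array([20.0, 30.0]), 10.0)]
x, y = np.array([1.0, 1.0]), 1          # an arbitrary test point and label

for w, b in hyperplanes:
    functional = y * (w @ x + b)
    geometric = functional / np.linalg.norm(w)
    print(functional, geometric)        # functional margin varies, geometric margin does not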
For more details, please also see the reference below, from Andrew Ng's lecture notes:
If y(i) = 1, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need w^T x + b to be a large positive number. Conversely, if y(i) = −1, then for the functional margin to be large, we need w^T x + b to be a large negative number. Moreover, if y(i)(w^T x + b) > 0, then our prediction on this example is correct. (Check this yourself.) Hence, a large functional margin represents a confident and a correct prediction.
The functional margin is the quantity that gets rescaled:
geometric margin = functional margin / norm(w).
So, when norm(w) = 1, the functional margin equals the geometric margin.
I am having difficulty interpreting the shape of the SVM margin.
In both of the following examples, the RBF kernel is used:
Here the separator is almost the same in the two cases. In the case of the larger gamma (the second example), it is the margin that became more influenced by the shape of the data. What is the significance of these particular training samples at the bottom being support vectors? What is the significance of the samples further at the bottom being inside the margin?
By definition of the radial kernel, when gamma is large, only samples near x give significant terms in the sum in the formula for f(x).
Hence, when gamma is large and x is far from any sample, f(x) is very small.
The margin boundary is the curve where f(x) = 1 (beyond it, f(x) > 1). Given point 2, it is now clear why the margin has the observed shape when gamma is large.
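Here is a small scikit-learn sketch of points 1 and 2 on made-up toy data: with a large gamma, the kernel terms all vanish at a point far from the training samples, so the decision value collapses towards the intercept.

import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + [4.0, 0.0]])
y = np.array([0] * 20 + [1] * 20)

far_point = np.array([[2.0, 6.0]])    # well away from both clusters
for gamma in (0.01, 10.0):
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    # The kernel contribution is decision_function minus the intercept; with a
    # large gamma it vanishes at points far from the training samples.
    print(gamma, clf.decision_function(far_point), clf.intercept_)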
I understood the overall SVM algorithm, including the Lagrangian duality and so on, but I am not able to understand why the Lagrange multiplier is strictly greater than zero specifically for support vectors.
Thank you.
This might be a late answer but I am putting my understanding here for other visitors.
The Lagrange multiplier, usually denoted by α, is a vector of the weights of all the training points as support vectors.
Suppose there are m training examples. Then α is a vector of size m. Now focus on the ith element of α, αi. It captures the weight of the ith training example as a support vector: a higher value of αi means the ith training example is more important as a support vector, i.e. when a prediction is made, that training example weighs more heavily in the decision.
Now coming to the OP's concern:
I am not able to understand why particularly the Lagrangian multiplier
is greater than zero for support vectors.
It is just a construct. When you say αi = 0, it just means that the ith training example has zero weight as a support vector. Equivalently, you can say that the ith example is not a support vector.
Side note: One of the KKT conditions is complementary slackness: αi gi(w) = 0 for all i. A support vector must lie on the margin, which implies gi(w) = 0, so αi may or may not be zero; either way the complementary slackness condition is satisfied.
For αi = 0, you can choose whether or not to call such a point a support vector, based on the discussion above. But for a non-support vector, αi must be zero to satisfy complementary slackness, since gi(w) is not zero.
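A quick way to see this in practice is scikit-learn's SVC, which only stores the multipliers of the support vectors (a sketch on made-up separable data; dual_coef_ holds y_i * alpha_i):

import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data
X = np.array([[0, 0], [1, 1], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

print(clf.support_)     # indices of the support vectors (the points with alpha_i > 0)
print(clf.dual_coef_)   # y_i * alpha_i for those support vectors only
# Every training point not listed in clf.support_ implicitly has alpha_i = 0,
# which is exactly the complementary slackness statement above.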
I can't figure this out either...
If we take a simple example, say 3 data points, two of the positive class (y_i = 1): (1,2) and (3,1), and one negative (y_i = -1): (-1,-1), and we solve using Lagrange multipliers, we get a perfect w = (0.25, 0.5) and b = -0.25, but one of our alphas is negative (a1 = 6/32, a2 = -1/32, a3 = 5/32).
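A negative alpha in a hand calculation usually means that a point assumed to lie on the margin is not actually a support vector; dropping it (setting its alpha to 0) and re-solving the remaining conditions gives a feasible solution. One way to check is to solve the same three-point problem numerically, e.g. with scikit-learn (a sketch, using a very large C to approximate the hard margin):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [3, 1], [-1, -1]], dtype=float)
y = np.array([1, 1, -1])

clf = SVC(kernel="linear", C=1e9).fit(X, y)

print(clf.coef_, clf.intercept_)    # the optimal w and b found by the solver
print(clf.support_)                 # only these points have alpha_i > 0
print(np.abs(clf.dual_coef_))       # their alpha_i (dual_coef_ stores y_i * alpha_i)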
For a given dataset of 2-D input data, we apply the SVM learning
algorithm and achieve an optimal decision plane:
H(x) = x^1 + 2x^2 + 3
What is the margin of this SVM?
I've been looking at this for hours trying to work out how to answer this. I think it's meant to be relatively simple but I've been searching through my learning material and cannot find how I'm meant to answer this.
I'd appreciate some help on the steps I should use to solve this.
Thanks.
It is impossible to calculate the margin with only the given optimal decision plane. You should be given the support vectors, or at least some samples from each class.
Anyway, you can follow these steps:
1- Calculate the Lagrange multipliers (alphas). I don't know which environment you work in, but you can use MATLAB's quadratic programming solver, quadprog(); it is not hard to use.
2- Find the support vectors. Remember, only the alphas of support vectors are non-zero (the alphas of all other samples are zero), so you can identify the support vectors of each class.
3- Calculate the w vector, which is orthogonal to the optimal hyperplane. You can use the summation below to calculate this vector: w = sum_i alpha(i) * y(i) * phi(x(i))
where,
alpha(i): alphas (lagrange multipliers) of support vector;
y(i) : labels of samples (say -1 or +1);
phi() : kernel function;
x(i) : support vectors.
4- Take one support vector from each class, let's say SV1 from class 1 and SV2 from class 2. Now you can calculate the margin using vector projection and the dot product:
margin = < (SV1 - SV2), w > / norm(w)
where,
<(SV1 - SV2), w> : dot product of vector (SV1 - SV2) and vector w
norm(w) : norm of vector w
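Here is a Python sketch of steps 1-4 using scikit-learn's SVC in place of MATLAB's quadprog() (SVC solves the same dual problem internally); the toy data is made up, since the question gives no samples:

import numpy as np
from sklearn.svm import SVC

# Step 0: some separable toy data (the original question provides none)
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # steps 1-2: solve the QP, find the SVs

w = clf.coef_.ravel()                         # step 3: w = sum_i alpha(i) * y(i) * x(i)
sv = clf.support_vectors_
sv_labels = y[clf.support_]

sv1 = sv[sv_labels == 1][0]                   # step 4: one SV from each class
sv2 = sv[sv_labels == -1][0]
margin = np.dot(sv1 - sv2, w) / np.linalg.norm(w)
print(margin)                                 # for a hard margin this equals 2 / norm(w)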
I am taking this course on neural networks on Coursera by Geoffrey Hinton (not current).
I have a very basic doubt about weight spaces.
https://d396qusza40orc.cloudfront.net/neuralnets/lecture_slides%2Flec2.pdf
Page 18.
If I have a weight vector (bias is 0) such as [w1=1, w2=2] and training cases {1,2,-1} and {2,1,1},
where I guess {1,2} and {2,1} are the input vectors, how can this be represented geometrically?
I am unable to visualize it. Why does a training case give a plane that divides the weight space into two? Could somebody explain this using coordinate axes in 3 dimensions?
The following is the text from the ppt:
1.Weight-space has one dimension per weight.
2.A point in the space has particular setting for all the weights.
3.Assuming that we have eliminated the threshold, each training case can be represented as a hyperplane through the origin.
My doubt is in the third point above. Kindly help me understand.
It's probably easier to explain if you look deeper into the math. Basically, a single layer of a neural net performs some function on your input vector, transforming it into a different vector space.
You don't want to jump right into thinking of this in 3 dimensions. Start smaller; it's easy to make diagrams in 1-2 dimensions, and nearly impossible to draw anything worthwhile in 3 dimensions (unless you're a brilliant artist), and being able to sketch this stuff out is invaluable.
Let's take the simplest case, where you're taking in an input vector of length 2. You have a weight vector of dimension 2x1, which implies an output vector of length one (effectively a scalar).
In this case it's pretty easy to imagine that you've got something of the form:
input = [x, y]
weight = [a, b]
output = ax + by
If we assume that weight = [1, 3], we can see, and hopefully intuit that the response of our perceptron will be something like this:
With the behavior being largely unchanged for different values of the weight vector.
It's easy to imagine then, that if you're constraining your output to a binary space, there is a plane, maybe 0.5 units above the one shown above that constitutes your "decision boundary".
As you move into higher dimensions this becomes harder and harder to visualize, but if you imagine that that plane shown isn't merely a 2-d plane, but an n-d plane or a hyperplane, you can imagine that this same process happens.
Since actually creating the hyperplane requires either the input or the output to be fixed, you can think of giving your perceptron a single training value as creating a "fixed" [x, y] value. This can be used to create a hyperplane. Sadly, this cannot be effectively visualized, as 4-d drawings are not really feasible in a browser.
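To tie this back to the numbers in the question, here is a tiny NumPy sketch (the candidate weight vectors are arbitrary): each training case x defines the hyperplane x . w = 0 through the origin of weight space, and a weight vector classifies that case correctly only if it lies on the side of that hyperplane matching the target.

import numpy as np

# Training cases from the question: input vector and target, bias = 0
cases = [(np.array([1.0, 2.0]), -1), (np.array([2.0, 1.0]), 1)]

# Some candidate weight vectors, i.e. points in weight space
candidates = [np.array([1.0, -1.0]), np.array([1.0, 2.0]), np.array([-2.0, -1.0])]

for w in candidates:
    # w is on the correct side of every training case's hyperplane x . w = 0
    # exactly when sign(x . w) matches the target for every case
    ok = all(np.sign(x @ w) == t for x, t in cases)
    print(w, ok)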
Hope that clears things up, let me know if you have more questions.
I have encountered this question on SO while preparing a large article on linear combinations (it's in Russian, https://habrahabr.ru/post/324736/). It has a section on the weight space and I would like to share some thoughts from it.
Let's take a simple case of linearly separable dataset with two classes, red and green:
The illustration above is in the data space X, where samples are represented by points and the weight coefficients constitute a line. It can be conveyed by the following formula:
w^T * x + b = 0
But we can rewrite it the other way around, making the x component a vector of coefficients and w a vector of variables:
x^T * w + b = 0
because the dot product is symmetric. Now it can be visualized in the weight space in the following way:
where red and green lines are the samples and blue point is the weight.
The possible weights are further limited to the area below (shown in magenta):
which could be visualized in dataspace X as:
Hope it clarifies dataspace/weightspace correlation a bit. Feel free to ask questions, will be glad to explain in more detail.
The "decision boundary" for a single layer perceptron is a plane (hyper plane)
where n in the image is the weight vector w, in your case w={w1=1,w2=2}=(1,2) and the direction specifies which side is the right side. n is orthogonal (90 degrees) to the plane)
A plane always splits a space into 2 naturally (extend the plane to infinity in each direction)
You can also try feeding different values into the perceptron and finding where the response is zero (that happens only on the decision boundary).
Recommend you read up on linear algebra to understand it better:
https://www.khanacademy.org/math/linear-algebra/vectors_and_spaces
For a perceptron with 1 input & 1 output layer, there can only be 1 LINEAR hyperplane. And since there is no bias, the hyperplane cannot shift away from the origin, so it will always pass through the origin. However, if there is a bias, it may no longer pass through the origin.
I think the reason a training case can be represented as a hyperplane is because...
Let's say
[j,k] is the weight vector and
[m,n] is the training-input
training-output = jm + kn
Given that a training case in this perspective is fixed and the weights varies, the training-input (m, n) becomes the coefficient and the weights (j, k) become the variables.
Just as in any text book where z = ax + by is a plane,
training-output = jm + kn is also a plane defined by training-output, m, and n.
The equation of a plane passing through the origin is written in the form:
ax + by + cz = 0
If a = 1, b = 2, c = 3, the equation of the plane can be written as:
x + 2y + 3z = 0
So, in XYZ space, this is the plane x + 2y + 3z = 0.
Now, in the weight space, every dimension represents a weight. So, if the perceptron has 10 weights, the weight space will be 10-dimensional.
Equation of the perceptron: ax + by + cz <= 0 ==> Class 0
ax + by + cz > 0 ==> Class 1
In this case, a, b & c are the weights and x, y & z are the input features.
In the weight space, a, b & c are the variables (axes).
So, for every training example, e.g. (x, y, z) = (2, 3, 4), a hyperplane would be formed in the weight space whose equation would be:
2a + 3b + 4c = 0
passing through the origin.
I hope, now, you understand it.
Consider that we have 2 weights, so w = [w1, w2]. Suppose we have the input x = [x1, x2] = [1, 2]. If you use the weights to make a prediction, you have z = w1*x1 + w2*x2 and prediction y = z > 0 ? 1 : 0.
Suppose the label for the input x is 1. Thus, we hope y = 1, and so we want z = w1*x1 + w2*x2 > 0. Written as a vector multiplication, z = (w^T)x, so we want (w^T)x > 0. The geometric interpretation of this expression is that the angle between w and x is less than 90 degrees. For example, the green vector is a candidate for w that would give the correct prediction of 1 in this case. In fact, any vector that lies on the same side of the line w1 + 2*w2 = 0 as the green vector gives the correct answer. However, if it lies on the other side, as the red vector does, then it gives the wrong answer.
However, suppose the label is 0. Then the case would just be the reverse.
The above case gives the intuitive understanding and illustrates the 3 points in the lecture slide. The training case x determines the plane, and depending on the label, the weight vector must lie on one particular side of that plane to give the correct answer.
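A tiny NumPy check of that angle interpretation (the two candidate weight vectors are arbitrary): (w^T)x > 0 exactly when the angle between w and x is below 90 degrees.

import numpy as np

x = np.array([1.0, 2.0])    # training input, label 1
for w in (np.array([2.0, 1.0]), np.array([-1.0, -3.0])):
    cos = w @ x / (np.linalg.norm(w) * np.linalg.norm(x))
    angle = np.degrees(np.arccos(cos))
    print(w, w @ x > 0, round(angle, 1))   # correct prediction iff the angle < 90 degrees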
A geometric margin is simply the Euclidean distance from a certain x (data point) to the hyperplane.
What is the intuitive explanation to what a functional margin is?
Note: I realize that a similar question has been asked here:
How to understand the functional margin in SVM ?
However, the answer given there explains the equation, but not its meaning (as I understood it).
"A geometric margin is simply the euclidean distance between a certain x (data point) to the hyperlane. "
I don't think that is a proper definition for the geometric margin, and I believe that is what is confusing you. The geometric margin is just a scaled version of the functional margin.
You can think of the functional margin just as a test that tells you whether a particular point is properly classified or not. And the geometric margin is the functional margin divided by ||w||.
If you check the formula for the functional margin, gamma_i = y_i((w^T)x_i + b):
You can notice that, independently of the label, the result will be positive for properly classified points (e.g. sign(1*5) = +1 and sign(-1*-5) = +1) and negative otherwise. If you divide that by ||w||, you get the geometric margin.
Why does the geometric margin exist?
Well, to maximize the margin you need more than just the sign; you need a notion of magnitude. The functional margin gives you a number, but without a reference you can't tell whether the point is actually far away from or close to the decision plane. The geometric margin tells you not only whether the point is properly classified, but also the magnitude of that distance in units of ||w||.
The functional margin represents the correctness and confidence of the prediction, provided the magnitude of the vector w orthogonal to the hyperplane is held constant.
Regarding correctness, the functional margin should always be positive: for a correctly classified sample, y is -1 when wx + b is negative and y is +1 when wx + b is positive, so the product y(wx + b) is positive. If the functional margin is negative, the sample has been assigned to the wrong group.
Regarding confidence, the functional margin can change for two reasons: 1) the sample (y_i and x_i) changes, or 2) the vector w orthogonal to the hyperplane is rescaled (by scaling w and b together).
If the vector w orthogonal to the hyperplane remains the same all the time, no matter how large its magnitude is, we can judge how confidently a point is placed on the right side: the larger the functional margin, the more confident we can be that the point is classified correctly.
But since the functional margin is defined without keeping the magnitude of w fixed, we define the geometric margin as mentioned above: the functional margin is normalized by the magnitude of w to obtain the geometric margin of a training example. Under this constraint, the value of the geometric margin depends only on the samples and not on the scaling of w.
The geometric margin is invariant to rescaling of the parameters, which is the only difference between the geometric margin and the functional margin.
EDIT:
The introduction of the functional margin plays two roles: 1) it builds intuition for maximizing the geometric margin, and 2) it transforms the geometric-margin maximization problem into minimizing the magnitude of the vector orthogonal to the hyperplane.
Since scaling the parameters w and b changes nothing meaningful, and the functional margin scales in exactly the same way as the parameters, then just as we could arbitrarily require ||w|| = 1 (and maximize the geometric margin directly), we can instead rescale the parameters so that the functional margin equals 1 (and then minimize ||w||).
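Here is a small NumPy sketch of that rescaling on made-up separable data: dividing (w, b) by the smallest functional margin puts the classifier in the canonical form where the minimum functional margin is 1, and the geometric margin then equals 1 / ||w||.

import numpy as np

w, b = np.array([4.0, 2.0]), -6.0
X = np.array([[3.0, 1.0], [4.0, 2.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

functional = y * (X @ w + b)          # functional margins of all samples
scale = functional.min()              # smallest one (the margin-defining points)

w_c, b_c = w / scale, b / scale       # canonical form: minimum functional margin = 1
print((y * (X @ w_c + b_c)).min())            # 1.0
print(functional.min() / np.linalg.norm(w))   # geometric margin of the closest point ...
print(1 / np.linalg.norm(w_c))                # ... equals 1 / ||w_c||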
Check Andrew Ng's Lecture Notes from Lecture 3 on SVMs (notation changed to make it easier to type without mathjax/TeX on this site):
"Let’s formalize the notions of the functional and geometric margins
. Given a
training example (x_i, y_i) we define the functional margin of (w, b) with
respect to the training example
gamma_i = y_i( (w^T)x_i + b )
Note that if y_i > 0 then for the functional margin to be large (i.e., for
our prediction to be confident and correct), we need (w^T)x + b to be a large
positive number. Conversely, if y_i < 0, then for the functional margin
to be large, we need (w^T)x + b to be a large negative number. Moreover, if
y_i( (w^T)x_i + b) > 0
then our prediction on this example is correct. (Check this yourself.) Hence, a large functional margin represents a confident and a correct prediction."
Page 3 from the Lecture 3 PDF linked at the materials page linked above.
Without going into unnecessary complications, here, in the simplest terms, is how one can think of and relate the functional and geometric margins.
Think of the functional margin, represented as 𝛾̂, as a measure of the correctness of a classification for a data unit. For a data unit x with parameters w and b and a given class y = 1, the functional margin is positive only when y and (wx + b) have the same sign, which is to say the point is correctly classified.
But we do not rely only on whether or not this classification is correct. We need to know how correct we are, or what degree of confidence we have in this classification. For this we need a different measure, called the geometric margin, represented as 𝛾, which can be expressed as below:
𝛾 = 𝛾̂ / ||𝑤||
So, the geometric margin 𝛾 is a scaled version of the functional margin 𝛾̂. If ||w|| == 1, then the geometric margin is the same as the functional margin: our measure of confidence coincides with our measure of correctness in assigning a data unit to a particular class.
This scaling by ||w|| gives us a measure of confidence in our correctness, and we always try to maximise this confidence.
The functional margin is like a binary or boolean-valued variable: have we correctly classified a particular data unit or not? So it cannot be maximised. The geometric margin for the same data unit, however, gives a magnitude to our confidence and tells us how correct we are. So this is what we can maximise.
And we aim for a larger margin by means of the geometric margin, because the wider the margin, the more confidence we have in our classification.
As an analogy, a wider road (larger margin => higher geometric margin) gives us the confidence to drive much faster, since it lessens the chance of hitting any pedestrians or trees (the data units in our training set), whereas on a narrower road (smaller margin => smaller geometric margin) one has to be a lot more cautious (less confidence) not to hit anything. So we always desire wider roads (a larger margin), and that's why we aim to maximise the geometric margin.