The significance of the shape of the SVM margin

The significance of the shape of the SVM margin - machine-learning

I am having difficulty interpreting the shape of the SVM margin.
In both of the following examples, the RBF kernel is used:
Here the separator is almost the same in the two cases. In the case of the larger gamma (the second example), it is the margin that became more influenced by the shape of the data. What is the significance of these particular training samples at the bottom being support vectors? What is the significance of the samples further at the bottom being inside the margin?

By definition of the radial kernel, when gamma is large, only samples near x give significant terms in the sum in the formula for f(x).
Hence, when gamma is large and x is far from any sample, f(x) is very small.
Margin is a line beyond which f(x) is greater than 1. Given the item 2, it is now clear why the margin has the observed shape when gamma is large.

Related

Why does support vectors in SVM have alpha (Lagrangian multiplier) greater than zero?

I understood the overall SVM algorithm consisting of Lagrangian Duality and all, but I am not able to understand why particularly the Lagrangian multiplier is greater than zero for support vectors.
Thank you.

This might be a late answer but I am putting my understanding here for other visitors.
Lagrangian multiplier, usually denoted by α is a vector of the weights of all the training points as support vectors.
Suppose there are m training examples. Then α is a vector of size m. Now focus on any ith element of α: αi. It is clear that αi captures the weight of the ith training example as a support vector. Higher value of αi means that ith training example holds more importance as a support vector; something like if a prediction is to be made, then that ith training example will be more important in deriving the decision.
Now coming to the OP's concern:
I am not able to understand why particularly the Lagrangian multiplier
is greater than zero for support vectors.
It is just a construct. When you say αi=0, it is just that ith training example has zero weight as a support vector. You can instead also say that that ith example is not a support vector.
Side note: One of the KKT's conditions is the complementary slackness: αigi(w)=0 for all i. For a support vector, it must lie on the margin which implies that gi(w)=0. Now αi can or cannot be zero; anyway it is satisfying the complementary slackness condition.
For αi=0, you can choose whether you want to call such points a support vector or not based on the discussion given above. But for a non-support vector, αi must be zero for satisfying the complementary slackness as gi(w) is not zero.

I can't figure this out too...
If we take a simple example, say of 3 data points, 2 of positive class (yi=1): (1,2) (3,1) and one negative (yi=-1): (-1,-1) - and we calculate using Lagrange multipliers, we will get a perfect w (0.25,0.5) and b = -0.25, but one of our alphas was negative (a1 = 6/32, a2 = -1/32, a3 = 5/32).

Meaning of Histogram on Tensorboard

I am working on Google Tensorboard, and I'm feeling confused about the meaning of Histogram Plot. I read the tutorial, but it seems unclear to me. I really appreciate if anyone could help me figure out the meaning of each axis for Tensorboard Histogram Plot.
Sample histogram from TensorBoard

I came across this question earlier, while also seeking information on how to interpret the histogram plots in TensorBoard. For me, the answer came from experiments of plotting known distributions.
So, the conventional normal distribution with mean = 0 and sigma = 1 can be produced in TensorFlow with the following code:
import tensorflow as tf
cwd = "test_logs"
W1 = tf.Variable(tf.random_normal([200, 10], stddev=1.0))
W2 = tf.Variable(tf.random_normal([200, 10], stddev=0.13))
w1_hist = tf.summary.histogram("weights-stdev_1.0", W1)
w2_hist = tf.summary.histogram("weights-stdev_0.13", W2)
summary_op = tf.summary.merge_all()
init = tf.initialize_all_variables()
sess = tf.Session()
writer = tf.summary.FileWriter(cwd, session.graph)
sess.run(init)
for i in range(2):
writer.add_summary(sess.run(summary_op),i)
writer.flush()
writer.close()
sess.close()
Here is what the result looks like:
.
The horizontal axis represents time steps.
The plot is a contour plot and has contour lines at the vertical axis values of -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, and 1.5.
Since the plot represents a normal distribution with mean = 0 and sigma = 1 (and remember that sigma means standard deviation), the contour line at 0 represents the mean value of the samples.
The area between the contour lines at -0.5 and +0.5 represent the area under a normal distribution curve captured within +/- 0.5 standard deviations from the mean, suggesting that it is 38.3% of the sampling.
The area between the contour lines at -1.0 and +1.0 represent the area under a normal distribution curve captured within +/- 1.0 standard deviations from the mean, suggesting that it is 68.3% of the sampling.
The area between the contour lines at -1.5 and +1-.5 represent the area under a normal distribution curve captured within +/- 1.5 standard deviations from the mean, suggesting that it is 86.6% of the sampling.
The palest region extends a little beyond +/- 4.0 standard deviations from the mean, and only about 60 per 1,000,000 samples will be outside of this range.
While Wikipedia has a very thorough explanation, you can get the most relevant nuggets here.
Actual histogram plots will show several things. The plot regions will grow and shrink in vertical width as the variation of the monitored values increases or decreases. The plots may also shift up or down as the mean of the monitored values increases or decreases.
(You may have noted that the code actually produces a second histogram with a standard deviation of 0.13. I did this to clear up any confusion between the plot contour lines and the vertical axis tick marks.)

#marc_alain, you're a star for making such a simple script for TB, which are hard to find.
To add to what he said the histograms showing 1,2,3 sigma of the distribution of weights. which is equivalent to the 68th,95th, and 98th percentiles. So think if you're model has 784 weights, the histogram shows how the values of those weights change with training.
These histograms are probably not that interesting for shallow models, you could imagine that with deep networks, weights in high layers might take a while to grow because of the logistic function being saturated. Of course I'm just mindlessly parroting this paper by Glorot and Bengio, in which they study the weights distribution through training and show how the logistic function is saturated for the higher layers for quite a while.

When plotting histograms, we put the bin limits on the x-axis and the count on the y-axis. However, the whole point of histogram is to show how a tensor changes over times. Hence, as you may have already guessed, the depth axis (z-axis) containing the numbers 100 and 300, shows the epoch numbers.
The default histogram mode is Offset mode. Here the histogram for each epoch is offset in the z-axis by a certain value (to fit all epochs in the graph). This is like seeing all histograms places one after the other, from one corner of the ceiling of the room (from the mid point of the front ceiling edge to be precise).
In the Overlay mode, the z-axis is collapsed, and the histograms become transparent, so you can move and hover over to highlight the one corresponding to a particular epoch. This is more like the front view of the Offset mode, with only outlines of histograms.
As explained in the documentation here:
tf.summary.histogram
takes an arbitrarily sized and shaped Tensor, and compresses it into a
histogram data structure consisting of many bins with widths and
counts. For example, let's say we want to organize the numbers [0.5,
1.1, 1.3, 2.2, 2.9, 2.99] into bins. We could make three bins:
a bin containing everything from 0 to 1 (it would contain one element, 0.5),
a bin containing everything from 1-2 (it would contain two elements, 1.1 and 1.3),
a bin containing everything from 2-3 (it would contain three elements: 2.2, 2.9 and 2.99).
TensorFlow uses a similar approach to create bins, but unlike in our
example, it doesn't create integer bins. For large, sparse datasets,
that might result in many thousands of bins. Instead, the bins are
exponentially distributed, with many bins close to 0 and comparatively
few bins for very large numbers. However, visualizing
exponentially-distributed bins is tricky; if height is used to encode
count, then wider bins take more space, even if they have the same
number of elements. Conversely, encoding count in the area makes
height comparisons impossible. Instead, the histograms resample the
data into uniform bins. This can lead to unfortunate artifacts in
some cases.
Please read the documentation further to get the full knowledge of plots displayed in the histogram tab.

Roufan,
The histogram plot allows you to plot variables from your graph.
w1 = tf.Variable(tf.zeros([1]),name="a",trainable=True)
tf.histogram_summary("firstLayerWeight",w1)
For the example above the vertical axis would have the units of my w1 variable. The horizontal axis would have units of the step which I think is captured here:
summary_str = sess.run(summary_op, feed_dict=feed_dict)
summary_writer.add_summary(summary_str, **step**)
It may be useful to see this on how to make summaries for the tensorboard.
Don

Each line on the chart represents a percentile in the distribution over the data: for example, the bottom line shows how the minimum value has changed over time, and the line in the middle shows how the median has changed. Reading from top to bottom, the lines have the following meaning: [maximum, 93%, 84%, 69%, 50%, 31%, 16%, 7%, minimum]
These percentiles can also be viewed as standard deviation boundaries on a normal distribution: [maximum, μ+1.5σ, μ+σ, μ+0.5σ, μ, μ-0.5σ, μ-σ, μ-1.5σ, minimum] so that the colored regions, read from inside to outside, have widths [σ, 2σ, 3σ] respectively.

Geometric representation of Perceptrons (Artificial neural networks)

I am taking this course on Neural networks in Coursera by Geoffrey Hinton (not current).
I have a very basic doubt on weight spaces.
https://d396qusza40orc.cloudfront.net/neuralnets/lecture_slides%2Flec2.pdf
Page 18.
If I have a weight vector (bias is 0) as [w1=1,w2=2] and training case as {1,2,-1} and {2,1,1}
where I guess {1,2} and {2,1} are the input vectors. How can it be represented geometrically?
I am unable to visualize it? Why is training case giving a plane which divides the weight space into 2? Could somebody explain this in a coordinate axes of 3 dimensions?
The following is the text from the ppt:
1.Weight-space has one dimension per weight.
2.A point in the space has particular setting for all the weights.
3.Assuming that we have eliminated the threshold each hyperplane could be represented as a hyperplane through the origin.
My doubt is in the third point above. Kindly help me understand.

It's probably easier to explain if you look deeper into the math. Basically what a single layer of a neural net is performing some function on your input vector transforming it into a different vector space.
You don't want to jump right into thinking of this in 3-dimensions. Start smaller, it's easy to make diagrams in 1-2 dimensions, and nearly impossible to draw anything worthwhile in 3 dimensions (unless you're a brilliant artist), and being able to sketch this stuff out is invaluable.
Let's take the simplest case, where you're taking in an input vector of length 2, you have a weight vector of dimension 2x1, which implies an output vector of length one (effectively a scalar)
In this case it's pretty easy to imagine that you've got something of the form:
input = [x, y]
weight = [a, b]
output = ax + by
If we assume that weight = [1, 3], we can see, and hopefully intuit that the response of our perceptron will be something like this:
With the behavior being largely unchanged for different values of the weight vector.
It's easy to imagine then, that if you're constraining your output to a binary space, there is a plane, maybe 0.5 units above the one shown above that constitutes your "decision boundary".
As you move into higher dimensions this becomes harder and harder to visualize, but if you imagine that that plane shown isn't merely a 2-d plane, but an n-d plane or a hyperplane, you can imagine that this same process happens.
Since actually creating the hyperplane requires either the input or output to be fixed, you can think of giving your perceptron a single training value as creating a "fixed" [x,y] value. This can be used to create a hyperplane. Sadly, this cannot be effectively be visualized as 4-d drawings are not really feasible in browser.
Hope that clears things up, let me know if you have more questions.

I have encountered this question on SO while preparing a large article on linear combinations (it's in Russian, https://habrahabr.ru/post/324736/). It has a section on the weight space and I would like to share some thoughts from it.
Let's take a simple case of linearly separable dataset with two classes, red and green:
The illustration above is in the dataspace X, where samples are represented by points and weight coefficients constitutes a line. It could be conveyed by the following formula:
w^T * x + b = 0
But we can rewrite it vice-versa making x component a vector-coefficient and w a vector-variable:
x^T * w + b = 0
because dot product is symmetrical. Now it could be visualized in the weight space the following way:
where red and green lines are the samples and blue point is the weight.
More possible weights are limited to the area below (shown in magenta):
which could be visualized in dataspace X as:
Hope it clarifies dataspace/weightspace correlation a bit. Feel free to ask questions, will be glad to explain in more detail.

The "decision boundary" for a single layer perceptron is a plane (hyper plane)
where n in the image is the weight vector w, in your case w={w1=1,w2=2}=(1,2) and the direction specifies which side is the right side. n is orthogonal (90 degrees) to the plane)
A plane always splits a space into 2 naturally (extend the plane to infinity in each direction)
you can also try to input different value into the perceptron and try to find where the response is zero (only on the decision boundary).
Recommend you read up on linear algebra to understand it better:
https://www.khanacademy.org/math/linear-algebra/vectors_and_spaces

For a perceptron with 1 input & 1 output layer, there can only be 1 LINEAR hyperplane. And since there is no bias, the hyperplane won't be able to shift in an axis and so it will always share the same origin point. However, if there is a bias, they may not share a same point anymore.

I think the reason why a training case can be represented as a hyperplane because...
Let's say
[j,k] is the weight vector and
[m,n] is the training-input
training-output = jm + kn
Given that a training case in this perspective is fixed and the weights varies, the training-input (m, n) becomes the coefficient and the weights (j, k) become the variables.
Just as in any text book where z = ax + by is a plane,
training-output = jm + kn is also a plane defined by training-output, m, and n.

Equation of a plane passing through origin is written in the form:
ax+by+cz=0
If a=1,b=2,c=3;Equation of the plane can be written as:
x+2y+3z=0
So,in the XYZ plane,Equation: x+2y+3z=0
Now,in the weight space;every dimension will represent a weight.So,if the perceptron has 10 weights,Weight space will be 10 dimensional.
Equation of the perceptron: ax+by+cz<=0 ==> Class 0
ax+by+cz>0 ==> Class 1
In this case;a,b & c are the weights.x,y & z are the input features.
In the weight space;a,b & c are the variables(axis).
So,for every training example;for eg: (x,y,z)=(2,3,4);a hyperplane would be formed in the weight space whose equation would be:
2a+3b+4c=0
passing through the origin.
I hope,now,you understand it.

Consider we have 2 weights. So w = [w1, w2]. Suppose we have input x = [x1, x2] = [1, 2]. If you use the weight to do a prediction, you have z = w1*x1 + w2*x2 and prediction y = z > 0 ? 1 : 0.
Suppose the label for the input x is 1. Thus, we hope y = 1, and thus we want z = w1*x1 + w2*x2 > 0. Consider vector multiplication, z = (w ^ T)x. So we want (w ^ T)x > 0. The geometric interpretation of this expression is that the angle between w and x is less than 90 degree. For example, the green vector is a candidate for w that would give the correct prediction of 1 in this case. Actually, any vector that lies on the same side, with respect to the line of w1 + 2 * w2 = 0, as the green vector would give the correct solution. However, if it lies on the other side as the red vector does, then it would give the wrong answer.
However, suppose the label is 0. Then the case would just be the reverse.
The above case gives the intuition understand and just illustrates the 3 points in the lecture slide. The testing case x determines the plane, and depending on the label, the weight vector must lie on one particular side of the plane to give the correct answer.

SVM - what is a functional margin?

A geometric margin is simply the euclidean distance between a certain x (data point) to the hyperlane.
What is the intuitive explanation to what a functional margin is?
Note: I realize that a similar question has been asked here:
How to understand the functional margin in SVM ?
However, the answer given there explains the equation, but not its meaning (as I understood it).

"A geometric margin is simply the euclidean distance between a certain x (data point) to the hyperlane. "
I don't think that is a proper definition for the geometric margin, and I believe that is what is confusing you. The geometric margin is just a scaled version of the functional margin.
You can think the functional margin, just as a testing function that will tell you whether a particular point is properly classified or not. And the geometric margin is functional margin scaled by ||w||
If you check the formula:
You can notice that independently of the label, the result would be positive for properly classified points (e.g sig(1*5)=1 and sig(-1*-5)=1) and negative otherwise. If you scale that by ||w|| then you will have the geometric margin.
Why does the geometric margin exists?
Well to maximize the margin you need more that just the sign, you need to have a notion of magnitude, the functional margin would give you a number but without a reference you can't tell if the point is actually far away or close to the decision plane. The geometric margin is telling you not only if the point is properly classified or not, but the magnitude of that distance in term of units of |w|

The functional margin represents the correctness and confidence of the prediction if the magnitude of the vector(w^T) orthogonal to the hyperplane has a constant value all the time.
By correctness, the functional margin should always be positive, since if wx + b is negative, then y is -1 and if wx + b is positive, y is 1. If the functional margin is negative then the sample should be divided into the wrong group.
By confidence, the functional margin can change due to two reasons: 1) the sample(y_i and x_i) changes or 2) the vector(w^T) orthogonal to the hyperplane is scaled (by scaling w and b).
If the vector(w^T) orthogonal to the hyperplane remains the same all the time, no matter how large its magnitude is, we can determine how confident the point is grouped into the right side. The larger that functional margin, the more confident we can say the point is classified correctly.
But if the functional margin is defined without keeping the magnitude of the vector(w^T) orthogonal to the hyperplane the same, then we define the geometric margin as mentioned above. The functional margin is normalized by the magnitude of w to get the geometric margin of a training example. In this constraint, the value of the geometric margin results only from the samples and not from the scaling of the vector(w^T) orthogonal to the hyperplane.
The geometric margin is invariant to the rescaling of the parameter, which is the only difference between geometric margin and functional margin.
EDIT:
The introduction of functional margin plays two roles: 1) intuit the maximization of geometric margin and 2) transform the geometric margin maximization issue to the minimization of the magnitude of the vector orthogonal to the hyperplane.
Since scaling the parameters w and b can result in nothing meaningful and the parameters are scaled in the same way as the functional margin, then if we can arbitrarily make the ||w|| to be 1(results in maximizing the geometric margin) we can also rescale the parameters to make them subject to the functional margin being 1(then minimize ||w||).

Check Andrew Ng's Lecture Notes from Lecture 3 on SVMs (notation changed to make it easier to type without mathjax/TeX on this site):
"Let’s formalize the notions of the functional and geometric margins
. Given a
training example (x_i, y_i) we define the functional margin of (w, b) with
respect to the training example
gamma_i = y_i( (w^T)x_i + b )
Note that if y_i > 0 then for the functional margin to be large (i.e., for
our prediction to be confident and correct), we need (w^T)x + b to be a large
positive number. Conversely, if y_i < 0, then for the functional margin
to be large, we need (w^T)x + b to be a large negative number. Moreover, if
y_i( (w^T)x_i + b) > 0
then our prediction on this example is correct. (Check this yourself.) Hence, a large functional margin represents a confident and a correct prediction."
Page 3 from the Lecture 3 PDF linked at the materials page linked above.

Not going into unnecessary complications about this concept, but in the most simple terms here is how one can think of and relate functional and geometric margin.
Think of functional margin -- represented as 𝛾̂, as a measure of correctness of a classification for a data unit. For a data unit x with parameters w and b and given class y = 1, the functional margin is 1 only when y and (wx + b) are both of the same sign - which is to say are correctly classified.
But we do not just rely on if we are correct or not in this classification. We need to know how correct we are, or what is the degree of confidence that we have in this classification. For this we need a different measure, and this is called geometric margin -- represented as 𝛾, and it can be expressed as below:
𝛾 = 𝛾̂ / ||𝑤||
So, geometric margin 𝛾 is a scaled version of functional margin 𝛾̂. If ||w|| == 1, then the geometric margin is same as functional margin - which is to say we are as confident in the correctness of this classification as we are correct in classifying a data unit to a particular class.
This scaling by ||w|| gives us the measure of confidence in our correctness. And we always try to maximise this confidence in our correctness.
Functional margin is like a binary valued or boolean valued variable: if we have correctly classified a particular data unit or not. So, this cannot be maximised. However, geometric margin for the same data unit gives a magnitude to our confidence, and tells us how correct we are.So, this we can maximise.
And we aim for larger margin through the means of geometric margin because the wider the margin the more is the confidence in our classification.
As an analogy, say a wider road (larger margin => higher geometric margin) gives higher confidence to drive must faster as it lessens the chance of hitting any pedestrian or trees (our data units in the training set), but on the narrower road (smaller margin => smaller geometric margin), one has to be a lot more cautious to not hit (lesser confidence) any pedestrian or trees. So, we always desire wider roads (larger margin), and that's why we aim to maximise it by maximising our geometric margin.

How to determine the window size of a Gaussian filter

Gaussian smoothing is a common image processing function, and for an introduction of Gaussian filtering, please refer to here. As we can see, one parameter: standard derivation will determine the shape of Gaussian function. However, when we perform convolution with Gaussian filtering, another parameter: the window size of Gaussian filter should also be determined at the same time. For example, when we use fspecial function provided by MATLAB, not only the standard derivation but also the window size must be provided. Intuitively, the larger the Gaussian standard derivation is the bigger the Gaussian kernel window should. However, there is no general rule about how to set the right window size. Any ideas? Thanks!

The size of the mask drives the filter amount. A larger size, corresponding to a larger convolution mask, will generally result in a greater degree of filtering. As a kinda trade-off for greater amounts of noise reduction, larger filters also affect the details quality of the image.
That's as milestone. Now coming to the Gaussian filter, the standard deviation is the main parameter. If you use a 2D filter, at the edge of the mask you will probably desire the weights to approximate 0.
To this respect, as I already said, you can choose a mask with a size which is generally three times the standard deviation. This way, almost the whole Gaussian bell is taken into account and at the mask's edges your weights will asymptotically tend to zero.
I hope this helps.

Here is a good reference.
After discretizing, pixel with distance greater than 3 sigma have negligible weights. See this
As already pointed, 6sigma, implies 3sigma both ways
Size of convolution matrix to be used for filtering would inadvertently be 6sigma by 6sigma, because of points 1 and 2 above.
Here how you can obtain the discrete Gaussian.
Finally, the size of the standard deviation(and therefore the Kernel used) depends on how much noise you suspect to be in the image. Clearly, a larger convolution kernel implies farther pixels get to contribute to the new value of the centre pixel as opposed to a smaller kernel.

Given sigma and the minimal weight epsilon in the filter you can solve for the necessary radius of the filter x:
For example if sigma = 1 then the gaussian is greater than epsilon = 0.01 when x <= 2.715 so a filter radius = 3 (width = 2*3 + 1 = 7) is sufficient.
sigma = 0.5, x <= 1.48, use radius 2
sigma = 1, x <= 2.715, use radius 3
sigma = 1.5, x <= 3.84, use radius 4
sigma = 2, x <= 4.89, use radius 5
sigma = 2.5, x <= 5.88, use radius 6
If you reduce/increase epsilon then you will need a larger/smaller radius.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart