Let H denote the set of axis-parallel rectangles in Rn. Each
rectangle defines a binary classifier that assigns label +1 to points
inside the rectangle and label -1 to points outside the rectangle.
Each rectangle is defined by an interval [ai, bi] in each dimension 1 ≤ i ≤ n. A point x = (x1, ..., xn) ∈ Rn is in the rectangle if ai ≤ xi ≤ bi for 1 ≤ i ≤ n.
What is the VC dimension of H? Justify your answer.
It's typed using LaTeX; this is the original text.
From my understanding, the VC dimension is the largest integer d such that there exists a sample of size d that can be shattered by the hypothesis set H, but in this case, how do we work that out when H is a set of rectangles?
For an axis-aligned rectangle in Rn, the VC dimension is (at least) 2*n.
Consider first a rectangle in R1, which is just a pair of min/max values (a1, b1) along the x1 axis (not actually a rectangle). This pair of values has a VC dimension of 2, because for any two points in R1 you can set (a1, b1) so that one, both, or neither of the two points lies between them. But when you add a third point along that axis, there is no way to include the two outer points in the range (a1, b1) without also including the middle point.
To make it easier to visualize, suppose the two points lie at x1 = -2 and x1 = 2, respectively. You can shatter the set by defining your rectangle bounds (a1, b1) as
(-3, 3) -> both points included
(-3, 1) -> first point only
(-1, 3) -> second point only
(-1, 1) -> neither point included
Now suppose you add an additional dimension so your space is in R2 (i.e., now it is a true rectangular region). Add two more points and let the x1 coordinates of the new points be 0 so they will always be included along the first dimension (for the rectangles defined above) and set the x2 coordinates of the new points to -2 and 2, respectively. Also, set the x2 coordinates of the first two points to 0.
When you set (a1, b1, a2, b2) to (-3, 3, -3, 3) you will include all 4 points. And you can obviously exclude any point simply by reducing the magnitude of one of the 4 bounds from 3 to 1. Since you can include any subset of the 4 points by changing the corresponding bound of the R2 rectangle, the VC dimension of the R2 rectangle is at least 4.
It should be fairly obvious that you can repeat this procedure for an arbitrary number of dimensions. Each time you add a new dimension xi, set the xi coordinate of all points to 0 except for the two new points you add for that dimension (they will be at xi = -2 and xi = 2, respectively).
Since you can shatter 2 additional points for every dimension, then for an axis-aligned rectangle in Rn, the VC dimension will be at least 2*n.
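If it helps to see the construction concretely, here is a small Python/NumPy sanity check. The point placement and the bound values 1/3 follow the description above; the variable names and the choice n = 3 are mine.

import numpy as np
from itertools import chain, combinations

# In R^n, put two points on each axis (at -2 and +2), 0 everywhere else.
# Any subset of the 2n points can be picked out by a rectangle whose bound
# on each axis is 3 (to include) or 1 (to exclude) on the corresponding side.
n = 3
points = []
for i in range(n):
    for s in (-2.0, 2.0):
        p = np.zeros(n)
        p[i] = s
        points.append(p)
points = np.array(points)                 # shape (2n, n)

def all_subsets(items):
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

for subset in all_subsets(range(2 * n)):
    lo, hi = np.full(n, -3.0), np.full(n, 3.0)
    for j in range(2 * n):
        if j not in subset:               # exclude point j by shrinking one bound
            axis, side = divmod(j, 2)
            if side == 0:
                lo[axis] = -1.0           # point j sits at -2 on this axis
            else:
                hi[axis] = 1.0            # point j sits at +2 on this axis
    inside = np.all((points >= lo) & (points <= hi), axis=1)
    assert set(np.flatnonzero(inside)) == set(subset)

print(f"all 2^{2 * n} subsets of the {2 * n} points realized, so VC dim >= {2 * n}")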
In the picture of the SVM from Wikipedia, at the lower left corner, pointed to by the red arrow, there is b / ||w||. How is that calculated? In other words, why is that line in the picture at b / ||w||? Thanks.
The line represents the affine subspace of points x whose scalar product with the weight vector w yields b. As w is typically not a unit vector (i.e. need not have length 1), one has to divide b by the norm (the "length") of w to get the actual distance from the origin.
More precisely: imagine a vector x starting at the origin and reaching a point on the red line, and let u be the unit vector in the direction of w, i.e. u = w / ||w||. The scalar product <x, u> multiplied by u is the projection of x onto the unit vector u, and its length corresponds to the distance of the red line from the origin. If you calculate the scalar product <x, w> instead (written as x*w in the graphic), you get that same projection scaled by ||w||; its value is b (that is in fact how b is defined), so to get back the distance from the origin you have to calculate b / ||w||.
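A tiny numerical illustration of that last step; w and b are arbitrary example values I picked.

import numpy as np

w = np.array([3.0, 4.0])                 # weight vector, ||w|| = 5
b = 10.0

u = w / np.linalg.norm(w)                # unit vector in the direction of w
x = (b / np.linalg.norm(w)) * u          # point of the line w.x = b closest to the origin

print(np.dot(w, x))                      # 10.0: x really lies on the line w.x = b
print(np.linalg.norm(x))                 # 2.0 = b / ||w||: its distance from the origin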
I modified equation 9.12 in http://www.deeplearningbook.org/contents/convnets.html to center the MxN convolution kernel.
That gives the following expression (take it on faith for now) for the gradient, assuming 1 input and 1 output channel (to simplify):
dK(krow, kcol) = sum(G(row, col) * V(row+krow-M/2, col+kcol-N/2); row, col)
To read the above, the single element of dK at krow, kcol is equal to the sum over all of the rows and cols of the product of G times a shifted V. Note G and V have the same dimensions. We will define going outside V to result in a zero.
For example, in one dimension, if G is [a b c d], V is [w x y z], and M is 3, then the first sum is dot (G, [0 w x y]), the second sum is dot (G, [w x y z]), and the third sum is dot (G, [x y z 0]).
ArrayFire has a shift operation, but it does a circular shift, rather than a shift with zero insertion. Also, the kernel sizes MxN are typically small, e.g., 7x7, so it seems a more optimal implementation would read in G and V once only, and accumulate over the kernel.
For that 1D example, we would read in a and w,x and start with [a*0 aw ax]. Then we read in b,y and add [bw bx by]. Then read in c,z and add [cx cy cz]. Then read in d and finally add [dy dz d*0].
Is there a direct way to compute dK in ArrayFire? I can't help but think this is some kind of convolution, but I've been unable to wrap my head around what the convolution would look like.
Ah so. For a 3x3 dK array, I use unwrap to convert my MxN input arrays to two MxN column vectors. Then I do 9 dot products of shifted subsets of the two column vectors. No, that doesn't work since the shift is in 2 dimensions.
So I need to create intermediate arrays of 1 x (MxN) and (MxN) x 9 in size, where each column of the latter is a shifted MxN window of the original with a pad border of zeros of size 1, and then do a matrix multiply.
Hmm, that sometimes requires too much memory. So the final solution is to do a gfor over the 3x3 output and, for each loop iteration, do a dot product of the unwrapped-once G and the unwrapped-repeatedly V.
Agreed?
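For what it's worth, here is a plain NumPy reference for the dK expression above (not ArrayFire, and it assumes odd M and N); it just makes the zero-padded shifts explicit:

import numpy as np

def dK_reference(G, V, M, N):
    # dK(krow, kcol) = sum over (row, col) of G(row, col) * V(row+krow-M/2, col+kcol-N/2),
    # with reads outside V treated as zero. G and V have the same shape.
    R, C = G.shape
    Vp = np.pad(V, ((M // 2, M // 2), (N // 2, N // 2)))   # zero border
    dK = np.zeros((M, N))
    for krow in range(M):
        for kcol in range(N):
            window = Vp[krow:krow + R, kcol:kcol + C]      # shifted view of V
            dK[krow, kcol] = np.sum(G * window)
    return dK

# 1D example from above, written as 1 x 4 arrays with M = 1, N = 3:
G = np.array([[1.0, 2.0, 3.0, 4.0]])     # a b c d
V = np.array([[5.0, 6.0, 7.0, 8.0]])     # w x y z
print(dK_reference(G, V, 1, 3))          # [dot(G,[0 w x y]), dot(G,[w x y z]), dot(G,[x y z 0])]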
Consider an image matrix in which I have multiple line segments, and I have information such as the start point, end point, length, centroid, and slope of each line segment. In this scenario, how do I find the line segments that are nearest to a particular line segment? Also, once I have the nearest line segments, is it possible to detect rectangles if they exist? An example image is in this link: sample.
The geometry of segment/segment distance is not so simple.
Imagine two line segments in general position, and dilate one of them. I mean, draw two parallel segments at a distance d and two half circles of radius d centered on the endpoints. This is the locus of constant distance d from the segment. You need to find the smallest d such that the other segment is hit.
You can decompose the plane into three areas: the band between the two perpendiculars through the endpoints, and the two outer half planes. You can clip the other segment against these three areas and process the clipped segments separately.
Inside the band, either the segments intersect and the distance is zero, or they don't and the distance is the shortest distance between the line of support of the first segment and the endpoints of the other.
Inside the half planes, the distance is the shortest of the distances between the considered endpoint and both endpoints of the other segment, and the distance between the endpoint and the other segment, provided the endpoint projects inside the other segment.
ALTERNATIVE:
Maybe it is easier to use the parametric equations of the two segments and minimize the (squared) distance, like:
Min over (p, q) of ((Xa (1-p) + Xb p) - (Xc (1-q) + Xd q))^2 + ((Ya (1-p) + Yb p) - (Yc (1-q) + Yd q))^2, under the constraints 0 <= p, q <= 1.
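As a quick sketch of that idea in Python (using SciPy's bounded minimizer; the function and variable names are my own, and a closed-form treatment appears further down):

import numpy as np
from scipy.optimize import minimize

def segment_distance(a, b, c, d):
    # Minimize the squared distance between (1-p)*a + p*b and (1-q)*c + q*d
    # over 0 <= p, q <= 1, exactly as in the expression above.
    a, b, c, d = (np.asarray(v, dtype=float) for v in (a, b, c, d))

    def sq_dist(pq):
        p, q = pq
        diff = ((1 - p) * a + p * b) - ((1 - q) * c + q * d)
        return diff @ diff

    res = minimize(sq_dist, x0=[0.5, 0.5], bounds=[(0, 1), (0, 1)])
    return np.sqrt(res.fun)

# two parallel horizontal segments one unit apart:
print(segment_distance((0, 0), (1, 0), (0, 1), (1, 1)))   # ~1.0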
First, you have to encode all the points in homogeneous coordinates [x, y, 1]^T, since this creates a symmetric relation between lines and points. Namely, in homogeneous coordinates the intersection of two lines l1 and l2 is a point p = l1 x l2, where x means the vector (cross) product. By the same coin, the line that passes through two points p1, p2 is l = p1 x p2. The line a segment lies on can therefore be expressed as l = p1 x p2 = [a, b, c]^T. The line equation is then l^T . p = 0, or in Cartesian coordinates a*x + b*y + c = 0.
As for your task, there are two cases:
1. the segments cross, and then their intersection can simply be calculated as l1 x l2;
2. the segments don't cross, and then the closest points between the two segments are determined by their 4 endpoints. To calculate the 4 possible distances and choose the smallest, use this formula for the distance between a point x0 and the line through x1 and x2 (a small sketch follows below): |(x2-x1) x (x1-x0)| / |x2-x1|
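Here is a small NumPy sketch of that point-to-line formula (the function name is mine):

import numpy as np

def point_to_line_distance(x0, x1, x2):
    # distance from point x0 to the line through x1 and x2: |(x2-x1) x (x1-x0)| / |x2-x1|
    x0, x1, x2 = (np.asarray(p, dtype=float) for p in (x0, x1, x2))
    return abs(np.cross(x2 - x1, x1 - x0)) / np.linalg.norm(x2 - x1)

print(point_to_line_distance((0, 1), (0, 0), (1, 0)))   # 1.0: distance from (0,1) to the x-axis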
Let the segments be AB and CD, and running parameters along them p and q, such that 0 <= p, q <= 1.
Using vectors, the squared distance between any two points on the segments is given by:
D² = (AC - p AB + q CD)²
Let us minimize this expression by zeroing the derivatives wrt p and q:
AB.(AC - p AB + q CD) = 0
CD.(AC - p AB + q CD) = 0
When AB and CD are not parallel, this implies AC - p AB + q CD = 0, which gives you the intersection point by solving a 2x2 system, and the distance is zero.
But it can turn out that p (or q) falls out of the allowed range, let p < 0 (or p > 1). In this case, we recast the problem with p = 0 (p = 1). This amounts to finding the distance (A, CD).
D² = (AC + q CD)²
yielding
CD.(AC + q CD) = 0
even easier.
And if it turns out that q is also out of range, let q < 0, we end up with the distance (A, C):
D² = AC²
Similarly for the other out-of-range cases.
In case of parallel segments, the 2x2 system is indeterminate (both equations are equivalent). You need to solve:
CD.AC - p CD.AB + q CD² = 0
It suffices to try all four combinations with p/q = 0/1 and see if the left-hand side takes different signs. This proves that a solution exists and the distance is the same as the distance (A, CD). Otherwise, the answer is one of the endpoint-to-endpoint distances.
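Putting the derivation above into code, here is one way it could look in Python/NumPy. This is a sketch: instead of walking the out-of-range cases explicitly, it compares the interior critical point of the quadratic with the clamped minimizers on the four edges p = 0, p = 1, q = 0, q = 1 of the box, which covers the same cases.

import numpy as np

def closest_segment_distance(A, B, C, D, eps=1e-12):
    # Minimize D^2 = (AC - p AB + q CD)^2 over 0 <= p, q <= 1.
    # Returns (distance, p, q).
    A, B, C, D = (np.asarray(P, dtype=float) for P in (A, B, C, D))
    AB, CD, AC = B - A, D - C, C - A

    def d2(p, q):
        r = AC - p * AB + q * CD
        return float(r @ r)

    clamp = lambda t: min(1.0, max(0.0, t))
    candidates = []
    for p in (0.0, 1.0):                      # fixed p: distance from a point to segment CD
        w = AC - p * AB
        q = clamp(-(w @ CD) / (CD @ CD)) if CD @ CD > eps else 0.0
        candidates.append((p, q))
    for q in (0.0, 1.0):                      # fixed q: distance from a point to segment AB
        w = AC + q * CD
        p = clamp((w @ AB) / (AB @ AB)) if AB @ AB > eps else 0.0
        candidates.append((p, q))
    a, b, c = AB @ AB, AB @ CD, CD @ CD       # the 2x2 system from the derivation
    det = a * c - b * b
    if det > eps:                             # segments not parallel: interior critical point
        p = (c * (AB @ AC) - b * (CD @ AC)) / det
        q = (b * (AB @ AC) - a * (CD @ AC)) / det
        if 0.0 <= p <= 1.0 and 0.0 <= q <= 1.0:
            candidates.append((p, q))
    p, q = min(candidates, key=lambda pq: d2(*pq))
    return np.sqrt(d2(p, q)), p, q

print(closest_segment_distance((0, 0), (1, 0), (2, 1), (3, 1)))   # (sqrt(2), 1.0, 0.0): B to C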
I am beginner in neural networks. I am learning about perceptrons.
My question is: why is the weight vector perpendicular to the decision boundary (hyperplane)?
I have referred to many books, but they all mention that the weight vector is perpendicular to the decision boundary; none of them say why.
Can anyone give me an explanation or reference to a book?
The weights are simply the coefficients that define a separating plane. For the moment, forget about neurons and just consider the geometric definition of a plane in N dimensions:
w1*x1 + w2*x2 + ... + wN*xN - w0 = 0
You can also think of this as being a dot product:
w*x - w0 = 0
where w and x are both length-N vectors. This equation holds for all points on the plane. Recall that we can multiply the above equation by a constant and it still holds so we can define the constants such that the vector w has unit length. Now, take out a piece of paper and draw your x-y axes (x1 and x2 in the above equations). Next, draw a line (a plane in 2D) somewhere near the origin. w0 is simply the perpendicular distance from the origin to the plane and w is the unit vector that points from the origin along that perpendicular. If you now draw a vector from the origin to any point on the plane, the dot product of that vector with the unit vector w will always be equal to w0 so the equation above holds, right? This is simply the geometric definition of a plane: a unit vector defining the perpendicular to the plane (w) and the distance (w0) from the origin to the plane.
Now our neuron is simply representing the same plane as described above but we just describe the variables a little differently. We'll call the components of x our "inputs", the components of w our "weights", and we'll call the distance w0 a bias. That's all there is to it.
Getting a little beyond your actual question, we don't really care about points on the plane. We really want to know which side of the plane a point falls on. While w*x - w0 is exactly zero on the plane, it will have positive values for points on one side of the plane and negative values for points on the other side. That's where the neuron's activation function comes in but that's beyond your actual question.
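As a tiny numeric illustration of that last point (w and w0 below are arbitrary values I chose, with w already unit length):

import numpy as np

w = np.array([0.6, 0.8])                 # unit-length normal to the plane
w0 = 1.0                                 # distance from the origin to the plane

for x in (np.array([0.6, 0.8]),          # a point on the plane (w.x = 1)
          np.array([2.0, 2.0]),          # a point on one side
          np.array([0.0, 0.0])):         # the origin, on the other side
    print(x, np.sign(w @ x - w0))        # 0 on the plane, +1 on one side, -1 on the other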
Intuitively, in a binary problem the weight vector points in the direction of the '1'-class, while the '0'-class is found when pointing away from the weight vector. The decision boundary should thus be drawn perpendicular to the weight vector.
See the image for a simplified example: You have a neural network with only 1 input which thus has 1 weight. If the weight is -1 (the blue vector), then all negative inputs will become positive, so the whole negative spectrum will be assigned to the '1'-class, while the positive spectrum will be the '0'-class. The decision boundary in a 2-axis plane is thus a vertical line through the origin (the red line). Simply said it is the line perpendicular to the weight vector.
Let's go through this example with a few values. The output of the perceptron is class 1 if the sum of all inputs * weights is larger than 0 (the default threshold); otherwise, if the output is smaller than the threshold of 0, the class is 0. Your input has value 1. The weight applied to this single input is -1, so 1 * -1 = -1, which is less than 0. The input is thus assigned class 0 (note: class 0 and class 1 could just as well have been called class A or class B; don't confuse them with the input and weight values). Conversely, if the input is -1, then input * weight is -1 * -1 = 1, which is larger than 0, so the input is assigned to class 1. If you try every input value, you will see that all the negative values in this example have an output larger than 0, so all of them belong to class 1. All positive values will have an output smaller than 0 and will therefore be classified as class 0. Draw the line which separates all positive and negative input values (the red line) and you will see that this line is perpendicular to the weight vector.
Also note that the weight vector is only used to modify the inputs to fit the wanted output. What would happen without a weight vector? An input of 1 would result in an output of 1, which is larger than the threshold of 0, so the class is '1'.
The second image on this page shows a perceptron with 2 inputs and a bias. The first input has the same weight as my example, while the second input has a weight of 1. The corresponding weight vector together with the decision boundary are thus changed as seen in the image. Also the decision boundary has been translated to the right due to an added bias of 1.
Here is a viewpoint from a more fundamental linear algebra/calculus standpoint:
The general equation of a plane is Ax + By + Cz = D (can be extended for higher dimensions). The normal vector can be extracted from this equation: [A B C]; it is the vector orthogonal to every other vector that lies on the plane.
Now, if we have a weight vector [w1 w2 w3], when is w^T * x >= 0 (to get a positive classification) and when is w^T * x < 0 (to get a negative classification)? WLOG, we can also use w^T * x >= d. Now, do you see where I am going with this?
The weight vector is the same as the normal vector from the first section. And as we know, this normal vector (and a point) define a plane: which is exactly the decision boundary. Hence, because the normal vector is orthogonal to the plane, then so too is the weight vector orthogonal to the decision boundary.
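A quick numerical way to see the same thing: take two points on the decision boundary w^T x = d; the vector between them lies inside the boundary, and its dot product with w is zero. (w and d below are arbitrary values of mine.)

import numpy as np

w = np.array([2.0, -1.0, 3.0])           # weight / normal vector
d = 4.0                                  # threshold

def point_on_boundary(x1, x2):
    # choose the first two coordinates freely and solve w.x = d for the third
    x3 = (d - w[0] * x1 - w[1] * x2) / w[2]
    return np.array([x1, x2, x3])

p = point_on_boundary(0.0, 0.0)
q = point_on_boundary(5.0, -2.0)
print(np.dot(w, q - p))                  # ~0 (up to rounding): w is perpendicular to q - p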
Start with the simplest form, ax + by = 0, weight vector is [a, b], feature vector is [x, y]
Then y = (-a/b)x is the decision boundary with slope -a/b
The weight vector has slope b/a
If you multiply those two slopes together, result is -1
This proves decision boundary is perpendicular to weight vector
Although the question was asked 2 years ago, I think many students will have the same doubts. I reached this answer because I asked the same question.
Now, just think of X, Y (a Cartesian coordinate system is a coordinate system that specifies each point uniquely in a plane by a pair of numerical coordinates, which are the signed distances from the point to two fixed perpendicular directed lines [from Wikipedia]).
If Y = 3X (in geometry, Y is perpendicular to X), then let w = 3, so Y = wX and w = Y/X; if we want to draw the relation between X and w, we will have two perpendicular lines, just as when we draw the relation between X and Y. So always think of the w coefficient as perpendicular to X and Y.
For a set of real-valued functions F = {f : X -> R}, how do I calculate the *pseudo-dimension of F?
Providing an example would help my understanding, such as the case
where X = {(x, y, z) : 0 < x < a, 0 < y < b, 0 < z < c}
*The pseudo-dimension is a generalization of the VC dimension
Dimensions like this are meant to capture the number of degrees of freedom of a concept class, which you label as F in your question. Intuitively, the VC dimension and pseudo dimension will often be a number very close to what you might guess as the number of degrees of freedom.
For example, the set of rectangles in the plane has VC dimension 4, since you can shatter a set of at most 4 points. (Pick any four points that are placed at the end-points of a + sign; you can assign any desired +/- signs to those four points by choosing the appropriate rectangle.) And 4 is a good guess for a number of dimensions for rectangles, since you can specify any rectangle with (x,y) of a certain corner along with (width, height).
For pseudo dimension, you are basically taking the indicator sets {(x,y) : f(x) > y} and taking the VC dimension of that new set.
For example, let F = { f(x) : f(x) = kx for some real number k }. Then the indicator sets will be { (x, y) : kx > y } for each k. In other words, you'll get all the lower half-planes bounded by lines through the origin (excluding the vertical line x = 0). This class only has VC dimension 1, but that makes sense because your only degree of freedom in F is the choice of k.
By the way, this question could also be asked on metaoptimize.com.
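A rough numerical illustration of the { (x, y) : kx > y } example above (not a proof; the two points and the sampled range of k are arbitrary choices of mine):

import numpy as np

pts = np.array([[1.0, 0.0],              # (x, y) pairs
                [2.0, 1.0]])

patterns = set()
for k in np.linspace(-100, 100, 200001): # sweep the single parameter k
    patterns.add(tuple(k * pts[:, 0] > pts[:, 1]))

print(len(patterns), patterns)           # only 3 of the 4 possible labelings appear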
Just don't get too confused about VC dimensions: you can easily create examples of families of functions that have 1 degree of freedom and infinite VC dimension, for example:
F = { 1_{sin(ax) > 0} : a in R }
It has only one degree of freedom, but it is easy to see that for every n there is a set of n elements that you can shatter with this family of functions, so the VC dimension is infinite. There are also examples of families that have an infinite number of free parameters but nevertheless have finite VC dimension, like hyperplanes w*x + b on an infinite-dimensional space where the norm of w is bounded.
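To make the sin(ax) example concrete, here is a small check using one standard construction (points x_i = 10^-i; for target labels y_i in {0, 1} choose a = pi * (1 + sum_i (1 - y_i) * 10^i)); all 2^n labelings of the n points are realized even though a is a single real parameter:

import numpy as np
from itertools import product

n = 4
xs = np.array([10.0 ** -(i + 1) for i in range(n)])

for labels in product([0, 1], repeat=n):
    a = np.pi * (1 + sum((1 - y) * 10 ** (i + 1) for i, y in enumerate(labels)))
    predicted = tuple(int(v) for v in (np.sin(a * xs) > 0))
    assert predicted == labels, (labels, predicted)

print(f"all 2^{n} labelings of {n} points realized with a single parameter a")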