Pseudo-dimension of real-valued functions - machine-learning

For a set of real-valued functions F = {f : X -> R}, how do I calculate the pseudo-dimension of F?
An example would help my understanding, e.g. a case
where X = {(x,y,z) : 0 < x < a, 0 < y < b, 0 < z < c}.
(The pseudo-dimension is a generalization of the VC dimension.)

Dimensions like this are meant to capture the number of degrees of freedom of a concept class, which you label as F in your question. Intuitively, the VC dimension and pseudo dimension will often be a number very close to what you might guess as the number of degrees of freedom.
For example, the set of axis-aligned rectangles in the plane has VC dimension 4, since you can shatter a set of at most 4 points. (Pick four points placed at the end-points of a + sign; you can assign any desired +/- labels to those four points by choosing an appropriate rectangle.) And 4 is a good guess for the number of degrees of freedom of rectangles, since you can specify any such rectangle with the (x,y) of a certain corner along with (width, height).
For the pseudo-dimension, you basically take the indicator sets {(x,y) : f(x) > y} for each f in F and compute the VC dimension of that new family of sets.
For example, let F = { f(x) : f(x) = kx for some real number k }. Then the indicator sets will be { (x,y) : kx > y } for each k. In other words, you'll get all the lower half-planes that go through the origin, excluding the vertical line x=0. This set only has VC dimension 1, but that makes sense because your only degree of freedom in F is to choose k.
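To make that concrete, here is a small brute-force sketch (the two points and the sweep range for k are made up, and this only illustrates the claim rather than proving it): sweep k and record which 0/1 label patterns the indicator {(x,y) : kx > y} can produce on a fixed pair of points.
x = [1, 2];                      % x-coordinates of two points (both > 0)
y = [0.5, 2.0];                  % y-coordinates (the "witness" thresholds)
patterns = zeros(0, 2);          % distinct 0/1 label patterns seen so far
for k = linspace(-10, 10, 4001)
  labels = double(k .* x > y);   % membership in {(x,y) : k*x > y} at both points
  patterns = unique([patterns; labels], "rows");
end
disp(patterns)                   % only 3 of the 4 patterns appear, so this pair
                                 % is not shattered, while a single point clearly is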
By the way, this question could also be asked on metaoptimize.com.

Just don't get too confused by VC dimensions: you can easily create examples of families of functions that have one degree of freedom and infinite VC dimension, for example:
F = { 1_{sin(ax) > 0} : a in R }
It has only one degree of freedom, but it is easy to see that for every n there is a set of n points that you can shatter with this family of functions, so the VC dimension is infinite. There are also examples of families that have an infinite number of free parameters but nevertheless have finite VC dimension, like hyperplanes w*x + b on an infinite-dimensional space where the norm of w is bounded.
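Here is a brute-force sketch of the sin(ax) example for n = 3 (the points x = 1, 2, 3 and the grid over a are assumptions made purely for illustration; the general claim needs a careful choice of points for every n):
x = [1, 2, 3];                          % three sample points (an assumption)
patterns = zeros(0, 3);                 % distinct 0/1 labelings seen so far
for a = linspace(0.01, 2*pi, 20000)
  labels = double(sin(a .* x) > 0);     % the indicator 1_{sin(ax) > 0} at each point
  patterns = unique([patterns; labels], "rows");
end
printf("distinct labelings found: %d of %d\n", size(patterns, 1), 2 ^ numel(x));
Sweeping the single parameter a already produces all 2^3 = 8 labelings of these three points.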

Related

Feature Scaling with Octave

I want to do feature scaling on datasets using means and standard deviations, and my code is below; but apparently it is not universal code, since it seems to work with only one dataset. So I am wondering what is wrong with my code. Any help will be appreciated! Thanks!
X is the dataset I am currently using.
mu = mean(X);
sigma = std(X);
m = size(X, 1);
mu_matrix = ones(m, 1) * mu;
sigma_matrix = ones(m, 1) * sigma;
featureNormalize = (X-mu_matrix)/sigma;
Thank you for clarifying what you think the code should be doing in the comments.
My answer will effectively explain why what you think is happening is not what is actually happening.
First let's talk about the mean and std functions. When their input is a vector (whether this is vertically or horizontally aligned), then this will return a single number which is the mean or standard deviation of that vector respectively, as you might expect.
However, when the input is a matrix, you need to know what it does differently. Unless you specify the direction (dimension) along which to calculate the mean / std, it will operate down each column (i.e. along dimension 1), returning a single number for each column. Therefore, the end result of this operation will be a horizontal (row) vector.
Therefore, both mu and sigma will be horizontal vectors in your code.
Now let's move on to the 'matrix multiplication' operator (i.e. *).
When using the matrix multiplication operator, if you multiply a horizontal vector by a vertical vector (i.e. the usual matrix multiplication, an inner product), your output is a single number (a scalar). However, if you reverse the orientations and multiply a vertical vector by a horizontal one, you will in fact be calculating a 'Kronecker product' (an outer product) instead. Since the output of the * operation is completely defined by the rows of the first input and the columns of the second input, whether you're getting an inner product or a Kronecker product is implicit and entirely dependent on the orientation of your inputs.
Therefore, in your case, the line mu_matrix = ones(m, 1) * mu; is not in fact appending a vector of ones, like you say. It is in fact performing the Kronecker product between a vertical vector of ones and the horizontal vector that is your mu, effectively creating an m-by-n matrix with mu repeated vertically for m rows.
Therefore, at the end of this operation, as the variable naming would suggest, mu_matrix is in fact a matrix (same with sigma_matrix), having the same size as X.
Your final step is X - mu_matrix, which gives you, at each element, the difference between X and mu at that element. Then you "divide" by the sigma matrix.
Here is why I asked if you were sure you should be using ./ instead of /.
/ is the matrix division operator. With / you are effectively performing matrix multiplication by an inverse matrix, since D / S is mathematically equivalent to D * inv(S). It seems to me you should be using ./ instead, to simply divide each element by the standard deviation of that column; that is why you had to repeat the horizontal vector over m rows in sigma_matrix, so that you could use it for element-wise division. What you are trying to do is normalise each row (i.e. observation) of a particular column by the standard deviation that is specific to that column (i.e. feature).
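Putting that together, the intended normalisation with element-wise division might look like the sketch below (X here is a made-up dataset and X_norm holds the result; recent Octave versions also broadcast automatically, so the explicit replication with ones(m, 1) is optional):
X = [1 200; 2 300; 3 600];               % made-up dataset, one column per feature
mu = mean(X);                            % 1-by-n row vector of column means
sigma = std(X);                          % 1-by-n row vector of column std devs
m = size(X, 1);
mu_matrix = ones(m, 1) * mu;             % mu repeated for each of the m rows
sigma_matrix = ones(m, 1) * sigma;       % sigma repeated the same way
X_norm = (X - mu_matrix) ./ sigma_matrix;   % element-wise division, not /
disp(X_norm)                             % with broadcasting: (X - mu) ./ sigma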

Why does fundamental matrix have 7 degrees of freedom?

There are 9 parameters in the fundamental matrix to relate the pixel co-ordinates of left and right images but only 7 degrees of freedom (DOF).
The reasoning for this on several pages that I've searched says :
Homogeneous equations mean we lose a degree of freedom
The determinant of F = 0, therefore we lose another degree of freedom.
I don't understand why those 2 reasons mean we lose 2 DOF - can someone explain it?
We initially have 9 DOF because the fundamental matrix is composed of 9 parameters, which implies that we need 9 corresponding points to compute the fundamental matrix (F). But because of the following two reasons, we only need 7 corresponding points.
Reason 1
We lose 1 DOF because we are using homogeneous coordinates. This is basically a way to represent nD points in vector form by adding an extra dimension, i.e. a 2D point (0,2) can be represented as [0,2,1], or in general [x,y,1]. There are useful properties when using homogeneous coordinates with 2D/3D transformations, but I'm going to assume you know them.
Now given the expression p and p' representing pixel coordinates:
p'=[u',v',1] and p=[u,v,1]
the fundamental matrix:
F = [f1,f2,f3]
[f4,f5,f6]
[f7,f8,f9]
and fundamental matrix equation:
(transposed p')Fp = 0
when we multiply this expression out in algebraic form, we get the following:
uu'f1 + vu'f2 + u'f3 + uv'f4 + vv'f5 + v'f6 + uf7 + vf8 + f9 = 0.
Writing this as a homogeneous linear system Af = 0 (basically a factorization of the above formula), we get two components, A and f.
A:
[uu',vu',u', uv',vv',v',u,v,1]
f (f is essentially the fundamental matrix in vector form):
[f1,f2,f3,f4,f5,f6,f7,f8,f9]
Now if we look at the components of the vector A, we have 8 unknowns plus one known value, 1, because of the homogeneous coordinates, and therefore we only need 8 equations now.
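As a quick sketch of the Af = 0 form (the pixel coordinates are made up and an arbitrary matrix stands in for F), one correspondence gives one row of A, and A*f reproduces the epipolar expression from above:
u  = 10;  v  = 20;       % pixel in the left image, p = [u; v; 1]
up = 12;  vp = 18;       % matching pixel in the right image, p' = [u'; v'; 1]
F  = [1 2 3; 4 5 6; 7 8 9];             % arbitrary stand-in for the fundamental matrix
f  = [1; 2; 3; 4; 5; 6; 7; 8; 9];       % the same entries flattened row-wise
A  = [u*up, v*up, up, u*vp, v*vp, vp, u, v, 1];   % the row derived above
printf("A*f = %g,  p'*F*p = %g\n", A*f, [up vp 1] * F * [u; v; 1]);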
Reason 2
det F = 0.
A determinant is a value that can be obtained from a square matrix.
I'm not entirely sure about the mathematical details of this property but I can still infer the basic idea, and, hopefully, you can as well.
Basically given some matrix A
A = [a,b,c]
[d,e,f]
[g,h,i]
The determinant can be computed using this formula:
det A = aei+bfg+cdh-ceg-bdi-afh
If we look at the determinant using the fundamental matrix, the algebra would look something like this:
F = [f1,f2,f3]
[f4,f5,f6]
[f7,f8,f9]
det F = (f1*f5*f9)+(f2*f6*f7)+(f3*f4*f8)-(f3*f5*f7)-(f2*f4*f9)-(f1*f6*f8)
Now we know the determinant of the fundamental matrix is zero:
det F = (f1*f5*f9)+(f2*f6*f7)+(f3*f4*f8)-(f3*f5*f7)-(f2*f4*f9)-(f1*f6*f8) = 0
So, if we work out only 7 of the parameters of the fundamental matrix (with the scale fixed as in Reason 1), we can work out the remaining parameter using the above determinant equation.
Therefore the fundamental matrix has 7DOF.
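A small numerical sketch of that last step (the first eight entries below are made-up values): det(F) is linear in f9, with the (3,3) cofactor f1*f5 - f2*f4 as its coefficient, so f9 can be solved for directly.
f = [0.2, -1.1, 0.7, 0.4, 0.9, -0.3, 1.5, -0.6];   % made-up values for f1 ... f8
F0 = [f(1), f(2), f(3);
      f(4), f(5), f(6);
      f(7), f(8), 0];                % f9 temporarily set to 0
cof = f(1)*f(5) - f(2)*f(4);         % coefficient of f9 in det(F)
f9 = -det(F0) / cof;                 % choose f9 so that the determinant vanishes
F = F0;
F(3, 3) = f9;
printf("det(F) = %g, rank(F) = %d\n", det(F), rank(F));   % ~0 and 2, up to rounding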
The reasons why F has only 7 degrees of freedom are
F is a 3x3 homogeneous matrix. Homogeneous means there is a scale ambiguity in the matrix, so the scale doesn't matter (as shown in @Curator Corpus's example). This drops one degree of freedom.
F is a matrix with rank 2. It is not a full rank matrix, so it is singular and its determinant is zero (Proof here). The reason why F is a matrix with rank 2 is that it is mapping a 2D plane (image1) to all the lines (in image 2) that pass through the epipole (of image 2).
Hope it helps.
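A tiny sketch of the scale ambiguity mentioned above (all numbers are made up): multiplying F by a nonzero scalar only scales the epipolar residual, so F and s*F describe exactly the same constraint.
F  = [1, 2, 3; 4, 5, 6; 7, 8, 9];    % an example rank-2 matrix (its determinant is 0)
p1 = [120; 85; 1];                   % made-up homogeneous pixel coordinates
p2 = [132; 80; 1];
r1 = p2' * F * p1;
r2 = p2' * (5 * F) * p1;             % the residual simply scales by 5
printf("residuals: %g and %g (ratio %g)\n", r1, r2, r2 / r1);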
As for the highest-voted answer by nbro, I think reason two can be interpreted this way: the matrix F has rank 2, so its determinant is zero, which acts as a constraint on the entries. With that constraint, we only need 7 points to determine the rest of the variables (f1-f8): 8 equations, 8 variables, leaving only one solution. So there are 7 DOF.

What input x maximizes the activation function in an autoencoder hidden layer?

Hi, while reading Stanford's machine learning material on autoencoders, I found a formula that is hard to prove by myself. Link to Material
Question is:
" What input image x would cause ai to be maximally activated? "
Screen shot of the Question and Context:
Many thanks to your answers in advance!
While this can be rigorously solved using the KKT conditions and Lagrange multipliers, there is a more intuitive way to figure out the result. I assume that f(.) is a monotone increasing, sigmoid type of nonlinearity (ReLU is also valid). So, finding the maximum of w1x1+...+w100x100 + b under the constraint (x1)^2+...+(x100)^2 <= 1 is equivalent to finding the maximum of f(w1x1+...+w100x100 + b) under the same constraint.
Note that g = w1x1+...+w100x100 + b is a linear function of the x terms (name it g, so we can refer to it later). So the direction of largest increase is the same at every point (x1,...,x100) in the domain of that function: the gradient. The gradient is simply (w1,w2,...,w100) at any point in the domain, which means that if we go in the direction of (w1,w2,...,w100), independent of where we start, we obtain the largest increase in the function. To make things simpler and to allow us to visualize, assume that we are in the R^2 space and the function is w1x1 + w2x2 + b:
The optimum x1 and x2 are constrained to lie in or on the circle C: (x1)^2 + (x2)^2 = 1. Assume that we are at the origin (0,0). If we go in the direction of the gradient (blue arrow) (w1,w2), we are going to attain the largest value of the function where the blue arrow intersects the circle. That intersection has the coordinates c*(w1,w2), and it satisfies c^2(w1^2 + w2^2) = 1, where c is a scalar coefficient. c is easily solved as c = 1 / sqrt(w1^2 + w2^2). Then at the intersection we have x1 = w1/sqrt(w1^2 + w2^2) and x2 = w2/sqrt(w1^2 + w2^2), which is the solution we seek. This can be extended in the same way to the 100-dimensional case.
You may ask why we started at the origin and not some other point in the circle. Note that the red line is perpendicular to the gradient vector and the function is constant along that line. Draw that (u1,u2) line, preserving its orientation, arbitrarily with the constraint that it intersects the circle C. Then choose any point on the line such that it lies within the circle. On the (u1,u2) line you start at the same value of the function g wherever you are. Then, as you go in the (w1,w2) direction, the longest path you can take while staying within the circle always goes through the origin, which means it is the path along which you increase the function g the most.
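A quick numerical sanity check of the claimed maximizer (the weight vector and bias below are randomly made up, with w standing for the weight row of a single hidden unit): x = w / norm(w) should beat any other unit-norm input.
w = randn(100, 1);                      % made-up weights of one hidden unit
b = 0.1;                                % made-up bias
x_star = w / norm(w);                   % the claimed maximizer on the unit ball
best_random = -Inf;
for t = 1:1000
  x = randn(100, 1);
  x = x / norm(x);                      % a random point on the unit sphere
  best_random = max(best_random, w' * x + b);
end
printf("w'*x_star + b = %.4f, best of 1000 random unit inputs = %.4f\n", ...
       w' * x_star + b, best_random);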

Geometric representation of Perceptrons (Artificial neural networks)

I am taking this course on Neural networks in Coursera by Geoffrey Hinton (not current).
I have a very basic doubt on weight spaces.
https://d396qusza40orc.cloudfront.net/neuralnets/lecture_slides%2Flec2.pdf
Page 18.
If I have a weight vector (bias is 0) of [w1=1, w2=2] and training cases {1,2,-1} and {2,1,1},
where I guess {1,2} and {2,1} are the input vectors, how can this be represented geometrically?
I am unable to visualize it. Why does a training case give a plane which divides the weight space into two? Could somebody explain this using coordinate axes in 3 dimensions?
The following is the text from the ppt:
1.Weight-space has one dimension per weight.
2.A point in the space represents a particular setting of all the weights.
3.Assuming that we have eliminated the threshold, each training case can be represented as a hyperplane through the origin.
My doubt is in the third point above. Kindly help me understand.
It's probably easier to explain if you look deeper into the math. Basically, a single layer of a neural net performs some function on your input vector, transforming it into a different vector space.
You don't want to jump right into thinking of this in 3-dimensions. Start smaller, it's easy to make diagrams in 1-2 dimensions, and nearly impossible to draw anything worthwhile in 3 dimensions (unless you're a brilliant artist), and being able to sketch this stuff out is invaluable.
Let's take the simplest case, where you're taking in an input vector of length 2 and you have a weight vector of dimension 2x1, which implies an output vector of length one (effectively a scalar).
In this case it's pretty easy to imagine that you've got something of the form:
input = [x, y]
weight = [a, b]
output = ax + by
If we assume that weight = [1, 3], we can see, and hopefully intuit that the response of our perceptron will be something like this:
The behavior is largely unchanged for different values of the weight vector.
It's easy to imagine then, that if you're constraining your output to a binary space, there is a plane, maybe 0.5 units above the one shown above that constitutes your "decision boundary".
As you move into higher dimensions this becomes harder and harder to visualize, but if you imagine that that plane shown isn't merely a 2-d plane, but an n-d plane or a hyperplane, you can imagine that this same process happens.
Since actually creating the hyperplane requires either the input or the output to be fixed, you can think of giving your perceptron a single training value as creating a "fixed" [x,y] value. This can be used to create a hyperplane. Sadly, this cannot effectively be visualized, as 4-d drawings are not really feasible in a browser.
Hope that clears things up, let me know if you have more questions.
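As a tiny sketch of the 2-input case above, using the weight vector [1, 3] and the suggested 0.5 threshold as the decision boundary (the sample inputs are made up):
weight = [1; 3];
% Each row of `inputs` is one [x, y] input.
inputs = [0.5, 0.0;
          0.0, 0.5;
          0.2, 0.05;
          -1.0, 1.0];
output = inputs * weight;      % ax + by for every input at once
class  = output > 0.5;         % which side of the plane output = 0.5 each input falls on
disp([inputs, output, class])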
I have encountered this question on SO while preparing a large article on linear combinations (it's in Russian, https://habrahabr.ru/post/324736/). It has a section on the weight space and I would like to share some thoughts from it.
Let's take a simple case of linearly separable dataset with two classes, red and green:
The illustration above is in the dataspace X, where samples are represented by points and the weight coefficients constitute a line. It can be conveyed by the following formula:
w^T * x + b = 0
But we can rewrite it the other way around, making x the coefficient vector and w the variable vector:
x^T * w + b = 0
because the dot product is symmetric. Now it can be visualized in the weight space in the following way:
where red and green lines are the samples and blue point is the weight.
Moreover, the possible weights are limited to the area below (shown in magenta):
which could be visualized in dataspace X as:
Hope it clarifies dataspace/weightspace correlation a bit. Feel free to ask questions, will be glad to explain in more detail.
The "decision boundary" for a single layer perceptron is a plane (hyper plane)
where n in the image is the weight vector w, in your case w={w1=1,w2=2}=(1,2) and the direction specifies which side is the right side. n is orthogonal (90 degrees) to the plane)
A plane always splits a space into 2 naturally (extend the plane to infinity in each direction)
You can also try to input different values into the perceptron and find where the response is zero (which happens only on the decision boundary).
Recommend you read up on linear algebra to understand it better:
https://www.khanacademy.org/math/linear-algebra/vectors_and_spaces
For a perceptron with one input layer and one output layer, there can only be one LINEAR hyperplane. And since there is no bias, the hyperplane won't be able to shift along an axis, so it will always pass through the origin. However, if there is a bias, the hyperplanes may no longer share a common point.
I think the reason why a training case can be represented as a hyperplane is the following.
Let's say
[j,k] is the weight vector and
[m,n] is the training-input
training-output = jm + kn
Given that, in this perspective, the training case is fixed and the weights vary, the training input (m, n) becomes the coefficients and the weights (j, k) become the variables.
Just as in any text book where z = ax + by is a plane,
training-output = jm + kn is also a plane defined by training-output, m, and n.
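A small sketch of that view (the training input and the candidate weights below are made up): with the training input fixed, the weights giving output zero form a line through the origin in weight space, and the sign of the output tells you which side of that line a weight vector lies on.
m = 1;  n = 2;                          % fixed training input (made up)
w_on    = [ 2; -1];                     % lies on the line j*m + k*n = 0
w_above = [ 1;  1];                     % one side of that line
w_below = [-1; -1];                     % the other side
printf("on the line: %g,  one side: %g,  other side: %g\n", ...
       [m n] * w_on, [m n] * w_above, [m n] * w_below);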
The equation of a plane passing through the origin is written in the form:
ax + by + cz = 0
If a = 1, b = 2, c = 3, the equation of the plane can be written as:
x + 2y + 3z = 0
So, in the XYZ space, the equation is x + 2y + 3z = 0.
Now, in the weight space, every dimension represents a weight. So if the perceptron has 10 weights, the weight space is 10-dimensional.
Equation of the perceptron: ax + by + cz <= 0 ==> Class 0
ax + by + cz > 0 ==> Class 1
In this case, a, b and c are the weights, and x, y and z are the input features.
In the weight space, a, b and c are the variables (axes).
So, for every training example, e.g. (x,y,z) = (2,3,4), a hyperplane is formed in the weight space whose equation is:
2a + 3b + 4c = 0
passing through the origin.
I hope you understand it now.
Suppose we have 2 weights, so w = [w1, w2], and an input x = [x1, x2] = [1, 2]. If you use the weights to make a prediction, you have z = w1*x1 + w2*x2 and the prediction y = z > 0 ? 1 : 0.
Suppose the label for the input x is 1. Thus we hope y = 1, and so we want z = w1*x1 + w2*x2 > 0. Written as a vector product, z = (w^T)x, so we want (w^T)x > 0. The geometric interpretation of this expression is that the angle between w and x is less than 90 degrees. For example, the green vector is a candidate for w that would give the correct prediction of 1 in this case. Actually, any vector that lies on the same side of the line w1 + 2*w2 = 0 as the green vector would give the correct solution. However, if it lies on the other side, as the red vector does, then it would give the wrong answer.
However, suppose the label is 0. Then the case would just be the reverse.
The above case gives the intuition and illustrates the 3 points in the lecture slide. The training case x determines the plane, and depending on the label, the weight vector must lie on one particular side of the plane to give the correct answer.
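To tie this back to the numbers in the question (weight [1, 2], training cases {1, 2, -1} and {2, 1, 1}), here is a small check that assumes the third number in each case is the desired sign of the output: each training case acts as the normal of a plane through the origin in weight space, and its label says on which side of that plane the weight must lie.
w  = [1; 2];                     % weight vector from the question
x1 = [1; 2];  label1 = -1;       % training case {1, 2, -1}
x2 = [2; 1];  label2 =  1;       % training case {2, 1,  1}
% A case is satisfied when the output has the same sign as its label.
printf("case 1: w'*x = %g (needs to be < 0 for label -1), satisfied: %d\n", ...
       w' * x1, (w' * x1) * label1 > 0);
printf("case 2: w'*x = %g (needs to be > 0 for label  1), satisfied: %d\n", ...
       w' * x2, (w' * x2) * label2 > 0);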

Why is weight vector orthogonal to decision plane in neural networks

I am beginner in neural networks. I am learning about perceptrons.
My question is: why is the weight vector perpendicular to the decision boundary (hyperplane)?
I have referred to many books, but all of them mention that the weight vector is perpendicular to the decision boundary; none of them say why.
Can anyone give me an explanation or reference to a book?
The weights are simply the coefficients that define a separating plane. For the moment, forget about neurons and just consider the geometric definition of a plane in N dimensions:
w1*x1 + w2*x2 + ... + wN*xN - w0 = 0
You can also think of this as being a dot product:
w*x - w0 = 0
where w and x are both length-N vectors. This equation holds for all points on the plane. Recall that we can multiply the above equation by a constant and it still holds so we can define the constants such that the vector w has unit length. Now, take out a piece of paper and draw your x-y axes (x1 and x2 in the above equations). Next, draw a line (a plane in 2D) somewhere near the origin. w0 is simply the perpendicular distance from the origin to the plane and w is the unit vector that points from the origin along that perpendicular. If you now draw a vector from the origin to any point on the plane, the dot product of that vector with the unit vector w will always be equal to w0 so the equation above holds, right? This is simply the geometric definition of a plane: a unit vector defining the perpendicular to the plane (w) and the distance (w0) from the origin to the plane.
Now our neuron is simply representing the same plane as described above but we just describe the variables a little differently. We'll call the components of x our "inputs", the components of w our "weights", and we'll call the distance w0 a bias. That's all there is to it.
Getting a little beyond your actual question, we don't really care about points on the plane. We really want to know which side of the plane a point falls on. While w*x - w0 is exactly zero on the plane, it will have positive values for points on one side of the plane and negative values for points on the other side. That's where the neuron's activation function comes in but that's beyond your actual question.
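A quick numerical version of that argument (the plane below is made up): take two points on the plane w*x = w0; their difference lies within the plane and its dot product with w is zero.
w  = [3; 4] / norm([3; 4]);      % unit normal of the plane (made-up direction)
w0 = 2;                          % distance of the plane from the origin
t  = [-w(2); w(1)];              % a direction perpendicular to w
p1 = w0 * w + 1.5 * t;           % two points on the plane: start at the foot of the
p2 = w0 * w - 0.7 * t;           % perpendicular (w0*w) and move along t
printf("w'*p1 = %g, w'*p2 = %g, w'*(p1 - p2) = %g\n", ...
       w' * p1, w' * p2, w' * (p1 - p2));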
Intuitively, in a binary problem the weight vector points in the direction of the '1'-class, while the '0'-class is found when pointing away from the weight vector. The decision boundary should thus be drawn perpendicular to the weight vector.
See the image for a simplified example: You have a neural network with only 1 input which thus has 1 weight. If the weight is -1 (the blue vector), then all negative inputs will become positive, so the whole negative spectrum will be assigned to the '1'-class, while the positive spectrum will be the '0'-class. The decision boundary in a 2-axis plane is thus a vertical line through the origin (the red line). Simply said it is the line perpendicular to the weight vector.
Lets go through this example with a few values. The output of the perceptron is class 1 if the sum of all inputs * weights is larger than 0 (the default threshold), otherwise if the output is smaller than the threshold of 0 then the class is 0. Your input has value 1. The weight applied to this single input is -1, so 1 * -1 = -1 which is less than 0. The input is thus assigned class 0 (NOTE: class 0 and class 1 could have just been called class A or class B, don't confuse them with the input and weight values). Conversely, if the input is -1, then input * weight is -1 * -1 = 1, which is larger than 0, so the input is assigned to class 1. If you try every input value then you will see that all the negative values in this example have an output larger than 0, so all of them belong to class 1. All positive values will have an output of smaller than 0 and therefore will be classified as class 0. Draw the line which separates all positive and negative input values (the red line) and you will see that this line is perpendicular to the weight vector.
Also note that the weight vector is only used to modify the inputs to fit the wanted output. What would happen without a weight vector? An input of 1 would result in an output of 1, which is larger than the threshold of 0, so the class is '1'.
The second image on this page shows a perceptron with 2 inputs and a bias. The first input has the same weight as my example, while the second input has a weight of 1. The corresponding weight vector together with the decision boundary are thus changed as seen in the image. Also the decision boundary has been translated to the right due to an added bias of 1.
Here is a viewpoint from a more fundamental linear algebra/calculus standpoint:
The general equation of a plane is Ax + By + Cz = D (can be extended for higher dimensions). The normal vector can be extracted from this equation: [A B C]; it is the vector orthogonal to every other vector that lies on the plane.
Now, if we have a weight vector [w1 w2 w3], we ask when w^T * x >= 0 (to get a positive classification) and when w^T * x < 0 (to get a negative classification). WLOG, we can also use w^T * x >= d. Now, do you see where I am going with this?
The weight vector is the same as the normal vector from the first section. And as we know, this normal vector (and a point) define a plane: which is exactly the decision boundary. Hence, because the normal vector is orthogonal to the plane, then so too is the weight vector orthogonal to the decision boundary.
Start with the simplest form, ax + by = 0, where the weight vector is [a, b] and the feature vector is [x, y].
Then y = (-a/b)x is the decision boundary, with slope -a/b.
The weight vector has slope b/a.
If you multiply those two slopes together, the result is -1.
This shows that the decision boundary is perpendicular to the weight vector.
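The same check with concrete numbers (a and b are made up):
a = 2;  b = 5;                       % weight vector [a, b]
slope_boundary = -a / b;             % slope of the decision boundary a*x + b*y = 0
slope_weight   =  b / a;             % slope of the weight vector's direction
printf("product of slopes = %g\n", slope_boundary * slope_weight);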
Although the question was asked 2 years ago, I think many students will have the same doubts. I reached this answer because I asked the same question.
Now, just think of X, Y (a Cartesian coordinate system is a coordinate system that specifies each point uniquely in a plane by a pair of numerical coordinates, which are the signed distances from the point to two fixed perpendicular directed lines [from Wikipedia]).
If Y = 3X, we draw the relation between X and Y on two perpendicular axes. Now let w = 3, so Y = wX and w = Y/X; if we want to draw the relation between X and w, we again use two perpendicular axes, just as when we draw the relation between X and Y. So always think of the w-coordinate as being drawn perpendicular to X and Y.
