Alternating Least Squares Derivative - machine-learning

I'm trying to understand recommender systems that use ALS by reading this post: https://blog.insightdatascience.com/explicit-matrix-factorization-als-sgd-and-all-that-jazz-b00e4d9b21ea
I don't understand how the second line of the equation follows from the first line.
Moreover, x(u)_T seems to have dimensions k x 1
and Y_T seems to have dimensions k x m.
In that case, how can these matrices be multiplied?
Thanks!

Related

Unable to understand transition from one equation to another in linear regression. (From Ch-2, The elements of Statistical Learning)

While reading linear regression in Ch-2 of book "The elements of Statistical Learning", I came across 2 equations and I failed to understand how the 2nd was derived from the first.
Background:
How do we fit the linear model to a set of training data? There are
many different methods, but by far the most popular is the method of
least squares. In this approach, we pick the coefficients β to minimize the
residual sum of squares
Equation 1: RSS(β) = Σ_{i=1}^{N} (y_i − x_i^T β)^2
RSS(β) is a quadratic function of the parameters, and hence its minimum
always exists, but may not be unique. The solution is easiest to characterize
in matrix notation. We can write
Equation 2: RSS(β) = (y − Xβ)^T (y − Xβ)
where X is an N × p matrix with each row an input vector, and y is an
N-vector of the outputs in the training set.
1st equation:
2nd equation:
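I got it. The RHS of the 2nd equation is in matrix form. To get the 1st equation, you transpose one factor of the RHS of the 2nd equation and multiply it out (this is how matrix multiplication is done): the product of the transposed residual vector with itself is the sum of the squared residuals.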
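A quick numerical check of that equivalence, as a minimal sketch. It assumes the two equations are the usual sum form, RSS(β) = Σ (y_i − x_i^T β)^2, and matrix form, RSS(β) = (y − Xβ)^T (y − Xβ). The data is random and purely illustrative.

```python
import numpy as np

# Illustrative random data (not from the book): N samples, p coefficients.
rng = np.random.default_rng(0)
N, p = 5, 3
X = rng.normal(size=(N, p))   # each row is one input vector x_i
y = rng.normal(size=N)        # outputs
beta = rng.normal(size=p)     # an arbitrary coefficient vector

# Sum form: add up the squared residual of every sample.
rss_sum = sum((y[i] - X[i] @ beta) ** 2 for i in range(N))

# Matrix form: residual vector transposed times itself.
r = y - X @ beta
rss_matrix = r.T @ r

print(rss_sum, rss_matrix)
```

Both numbers agree up to floating-point rounding, for any choice of β.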

Hough Transform Accumulator to Cartesian

I'm studying a course on vision systems and one of the questions posed was;
For the accumulator shown;
Determine the most likely r,θ combination representing the straight line of the greatest strength in the original image.
From my understanding of the accumulator this would be r = 60, θ = 150 as the 41 votes is the highest number of votes in this cluster of large votes. Am I correct with this combination?
And hence calculate the equation of this line in the form y = mx + c
I'm not sure of the conversion steps required to convert the r = 60, θ = 150 to y = mx + c with the information given since r = 60, θ = 150 denotes 1 point on the line.
State the resolution of your answer and give your reasoning
I assume the resolution has to do with the step sizes of the accumulator and not the actual resolution of the original image, since that's irrelevant to the edges detected in the image.
Any guidance on the above 3 points would be greatly appreciated!
Yes, this is correct.
This is asking you for the slope and intercept of the line given r and theta. r and theta are not one point on the line; they are one point of the accumulator. r and theta describe a line using the line equation in polar (normal) form: r = x cos(θ) + y sin(θ). This is the cool thing about the Hough transform: every line in one space (i.e. image space) can be described by a point in another space (r, theta). This could be done with m and b from the line equation y = mx + b, but as we all know, m is undefined for vertical lines. This is the reason the polar line equation is used. It is important to note that the r and theta from the HT describe the segment from the origin to the closest point on the actual line in the image. This means your image line y = mx + b is orthogonal to that segment. The wiki article on the HT describes this well and shows examples. I would recommend drawing a diagram of your r and theta extending to a line like this:
Then use trig to get two points on the red line. Two points are enough to give you m and b from the line equation.
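The same conversion can be sketched in code. This assumes the normal form r = x cos(θ) + y sin(θ) with θ in degrees and a y axis pointing up. If your course uses a different angle or axis convention, the signs change. The function name is hypothetical.

```python
import math

def polar_to_slope_intercept(r, theta_deg):
    """Convert (r, theta) of r = x*cos(theta) + y*sin(theta) to (m, c) of y = m*x + c."""
    theta = math.radians(theta_deg)
    s = math.sin(theta)
    if abs(s) < 1e-12:
        # theta = 0 or 180 degrees: the line is vertical, so m is undefined.
        raise ValueError("vertical line: slope is undefined")
    # Solve r = x*cos(theta) + y*sin(theta) for y.
    m = -math.cos(theta) / s
    c = r / s
    return m, c

m, c = polar_to_slope_intercept(60, 150)
print(m, c)

# Sanity check: the foot of the perpendicular from the origin lies on the line.
x0 = 60 * math.cos(math.radians(150))
y0 = 60 * math.sin(math.radians(150))
print(abs(y0 - (m * x0 + c)) < 1e-9)
```

Under these assumptions the slope is cot of nothing exotic, just -cos(θ)/sin(θ), and the intercept is r/sin(θ), so r = 60, θ = 150 gives a slope of about 1.73 and an intercept of about 120.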
I'm not entirely sure what "resolution" refers to in this context. But it does seem like your line estimator will have some precision loss since r is every 20 mm and theta is every 15 degrees. Perhaps it is asking what degree of error you could get given an accumulator of this resolution.

Why does fundamental matrix have 7 degrees of freedom?

There are 9 parameters in the fundamental matrix to relate the pixel co-ordinates of left and right images but only 7 degrees of freedom (DOF).
The reasoning for this on several pages that I've searched says :
Homogenous equations means we lose a degree of freedom
The determinant of F = 0, therefore we lose another degree of freedom.
I don't understand why those 2 reasons mean we lose 2 DOF - can someone explain it?
We initially have 9 DOF because the fundamental matrix is composed of 9 parameters, which implies that we need 9 corresponding points to compute the fundamental matrix (F). But because of the following two reasons, we only need 7 corresponding points.
Reason 1
We lose 1 DOF because we are using homogeneous coordinates. This is basically a way to represent n-D points in vector form by adding an extra dimension, e.g. a 2D point (0,2) can be represented as [0,2,1], or in general [x,y,1]. There are useful properties when using homogeneous coordinates with 2D/3D transformations, but I'm going to assume you know that.
Now given the expression p and p' representing pixel coordinates:
p'=[u',v',1] and p=[u,v,1]
the fundamental matrix:
F = [f1,f2,f3]
[f4,f5,f6]
[f7,f8,f9]
and fundamental matrix equation:
p'^T F p = 0
when we multiply this expression out in algebraic form, we get the following:
uu'f1 + vu'f2 + u'f3 + uv'f4 + vv'f5 + v'f6 + uf7 + vf8 + f9 = 0.
In a homogeneous system of linear equation form Af=0 (basically the factorization of the above formula), we get two components A and f.
A:
[uu',vu',u', uv',vv',v',u,v,1]
f (f is essentially the fundamental matrix in vector form):
[f1,f2,f3,f4,f5,f6,f7,f8,f9]
Now, because the system Af = 0 is homogeneous, any nonzero multiple of a solution f is also a solution, so F is only determined up to scale. Fixing that scale (for example by setting f9 = 1 or ||f|| = 1) leaves 8 unknowns, and therefore we only need 8 equations now.
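A small sketch of Reason 1 in code: it checks that the row A built from one correspondence reproduces p'^T F p, and that scaling f does not change whether A·f is zero. The points and the matrix are random placeholders, not a real camera pair.

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.normal(size=(3, 3))            # arbitrary 3x3 matrix standing in for F
u, v, u2, v2 = rng.normal(size=4)      # u2, v2 play the role of u', v'
p = np.array([u, v, 1.0])
p_prime = np.array([u2, v2, 1.0])

# Row of A for this correspondence, in the order used above.
a_row = np.array([u*u2, v*u2, u2, u*v2, v*v2, v2, u, v, 1.0])
f = F.reshape(-1)                      # row-major: [f1, ..., f9]

lhs = p_prime @ F @ p                  # p'^T F p
rhs = a_row @ f                        # A f for a single row

# Build an f0 that satisfies a_row @ f0 = 0, then rescale it.
f0 = f - (a_row @ f) / (a_row @ a_row) * a_row
scaled = 3.7 * f0

print(np.isclose(lhs, rhs), np.isclose(a_row @ f0, 0.0), np.isclose(a_row @ scaled, 0.0))
```

Since every multiple of a solution is again a solution, the overall scale of F carries no information, which is where the first degree of freedom goes.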
Reason 2
det F = 0.
A determinant is a value that can be obtained from a square matrix.
I'm not entirely sure about the mathematical details of this property but I can still infer the basic idea, and, hopefully, you can as well.
Basically given some matrix A
A = [a,b,c]
[d,e,f]
[g,h,i]
The determinant can be computed using this formula:
det A = aei+bfg+cdh-ceg-bdi-afh
If we look at the determinant using the fundamental matrix, the algebra would look something like this:
F = [f1,f2,f3]
[f4,f5,f6]
[f7,f8,f9]
det F = (f1*f5*f9)+(f2*f6*f7)+(f3*f4*f8)-(f3*f5*f7)-(f2*f4*f9)-(f1*f6*f8)
Now we know the determinant of the fundamental matrix is zero:
det F = (f1*f5*f9)+(f2*f6*f7)+(f3*f4*f8)-(f3*f5*f7)-(f2*f4*f9)-(f1*f6*f8) = 0
So, once the overall scale is fixed, if we work out 7 of the remaining 8 parameters of the fundamental matrix, we can work out the last one using the above determinant equation.
Therefore the fundamental matrix has 7DOF.
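To make Reason 2 concrete, here is a hedged sketch that checks the determinant formula against NumPy and then forces a matrix to rank 2. Zeroing the smallest singular value is a commonly used way to impose det F = 0; it is not something this answer prescribes, and the input matrix is random.

```python
import numpy as np

def det3(M):
    """Determinant of a 3x3 matrix via aei + bfg + cdh - ceg - bdi - afh."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a*e*i + b*f*g + c*d*h - c*e*g - b*d*i - a*f*h

rng = np.random.default_rng(2)
M = rng.normal(size=(3, 3))            # a generic, full-rank 3x3 matrix

# Impose the rank-2 constraint by dropping the smallest singular value.
U, S, Vt = np.linalg.svd(M)
S[-1] = 0.0
F = U @ np.diag(S) @ Vt

print(np.isclose(det3(M), np.linalg.det(M)))   # formula agrees with NumPy
print(abs(det3(F)) < 1e-9)                     # determinant is now zero
print(np.linalg.matrix_rank(F))                # rank of the constrained matrix
```

The constrained matrix has rank 2, so its nine entries are tied together by one extra equation on top of the scale ambiguity.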
The reasons why F has only 7 degrees of freedom are
F is a 3x3 homogeneous matrix. Homogeneous means there is a scale ambiguity in the matrix, so the scale doesn't matter (as shown in #Curator Corpus 's example). This drops one degree of freedom.
F is a matrix with rank 2. It is not a full rank matrix, so it is singular and its determinant is zero (Proof here). The reason why F is a matrix with rank 2 is that it is mapping a 2D plane (image1) to all the lines (in image 2) that pass through the epipole (of image 2).
Hope it helps.
As for the highest-voted answer by nbro, I think it can be interpreted this way: by reason two, F has rank 2, so det F = 0 acts as a constraint on the entries of f. With that constraint, we only need 7 points to determine the remaining variables (f1-f8): 7 point equations plus the constraint give 8 equations in 8 variables, which determines F (the determinant constraint is cubic, so there can be up to three solutions). So there are 7 DOF.

What is the relationship between least squares and X'Xθ = X'y?

I have three points: (1,1), (2,3), (3, 3.123). I assume the hypothesis is y = θ1*x + θ2, and I want to do linear regression on the three points. I have two methods to calculate θ:
Method-1: Least Square
import numpy as np
# get an approximate solution using least squares
X = np.array([[1,1],[2,1],[3,1]])
y = np.array([1,3,3.123])
# rcond=None selects NumPy's current default cutoff and avoids a FutureWarning
theta = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta)
Method-2: Matrix multiplication
We have the following derivation process:
# rank(X)=2, rank(X|y)=3, so there is no exact solution.
print(np.linalg.matrix_rank(X))
print(np.linalg.matrix_rank(np.c_[X,y]))
# solve the normal equations X'Xθ = X'y
theta = np.linalg.inv(X.T.dot(X)).dot(X.T.dot(y))
print(theta)
Both method-1 and method-2 give the result [ 1.0615 0.25133333], so it seems that method-2 is equivalent to least squares. But I don't know why. Can anyone explain the underlying principle of their equivalence?
Both approaches are equivalent, because the least squares method is
θ = argmin (Xθ-Y)'(Xθ-Y) = argmin ||Xθ-Y||^2 = argmin ||Xθ-Y||, which means you are minimizing the length of the vector (Xθ-Y), i.e. the distance between Xθ and Y. X is a constant matrix, so Xθ is a vector in the column space of X. The shortest distance is attained when Xθ equals the projection of Y onto the column space of X (this is easy to see from a picture). That gives Y^(hat) = Xθ = X(X'X)^(-1)X'Y, where X(X'X)^(-1)X' is the projection matrix onto the column space of X. Multiplying both sides on the left by X' gives X'Xθ = X'X(X'X)^(-1)X'Y = X'Y, which is exactly (X'X)θ = X'Y. You can find a full proof in any linear algebra book.
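The projection argument can be checked numerically on the three points from the question. This is a minimal sketch; the names P, y_hat and residual are introduced here for illustration.

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([1.0, 3.0, 3.123])

# Solve the normal equations (X'X) theta = X'y without forming an explicit inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Projection matrix onto the column space of X, and the projected y.
P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = P @ y

# The residual is orthogonal to every column of X: X'(y - X theta) = 0.
residual = y - X @ theta

print(theta)
print(np.allclose(y_hat, X @ theta), np.allclose(X.T @ residual, 0.0))
```

The orthogonality condition X'(y − Xθ) = 0 is just the normal equations rearranged, which is why both methods return the same θ.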

Geometric representation of Perceptrons (Artificial neural networks)

I am taking this course on Neural networks in Coursera by Geoffrey Hinton (not current).
I have a very basic doubt on weight spaces.
https://d396qusza40orc.cloudfront.net/neuralnets/lecture_slides%2Flec2.pdf
Page 18.
If I have a weight vector (bias is 0) as [w1=1,w2=2] and training case as {1,2,-1} and {2,1,1}
where I guess {1,2} and {2,1} are the input vectors. How can it be represented geometrically?
I am unable to visualize it? Why is training case giving a plane which divides the weight space into 2? Could somebody explain this in a coordinate axes of 3 dimensions?
The following is the text from the ppt:
1.Weight-space has one dimension per weight.
2.A point in the space has particular setting for all the weights.
3.Assuming that we have eliminated the threshold, each training case could be represented as a hyperplane through the origin.
My doubt is in the third point above. Kindly help me understand.
It's probably easier to explain if you look deeper into the math. Basically, a single layer of a neural net performs some function on your input vector, transforming it into a different vector space.
You don't want to jump right into thinking of this in 3-dimensions. Start smaller, it's easy to make diagrams in 1-2 dimensions, and nearly impossible to draw anything worthwhile in 3 dimensions (unless you're a brilliant artist), and being able to sketch this stuff out is invaluable.
Let's take the simplest case, where you're taking in an input vector of length 2, you have a weight vector of dimension 2x1, which implies an output vector of length one (effectively a scalar)
In this case it's pretty easy to imagine that you've got something of the form:
input = [x, y]
weight = [a, b]
output = ax + by
If we assume that weight = [1, 3], we can see, and hopefully intuit that the response of our perceptron will be something like this:
With the behavior being largely unchanged for different values of the weight vector.
It's easy to imagine then, that if you're constraining your output to a binary space, there is a plane, maybe 0.5 units above the one shown above that constitutes your "decision boundary".
As you move into higher dimensions this becomes harder and harder to visualize, but if you imagine that that plane shown isn't merely a 2-d plane, but an n-d plane or a hyperplane, you can imagine that this same process happens.
Since actually creating the hyperplane requires either the input or the output to be fixed, you can think of giving your perceptron a single training value as creating a "fixed" [x,y] value. This can be used to create a hyperplane. Sadly, this cannot be visualized effectively, as 4-d drawings are not really feasible in a browser.
Hope that clears things up, let me know if you have more questions.
I have encountered this question on SO while preparing a large article on linear combinations (it's in Russian, https://habrahabr.ru/post/324736/). It has a section on the weight space and I would like to share some thoughts from it.
Let's take a simple case of linearly separable dataset with two classes, red and green:
The illustration above is in the dataspace X, where samples are represented by points and the weight coefficients constitute a line. It can be expressed by the following formula:
w^T * x + b = 0
But we can rewrite it vice-versa making x component a vector-coefficient and w a vector-variable:
x^T * w + b = 0
because dot product is symmetrical. Now it could be visualized in the weight space the following way:
where red and green lines are the samples and blue point is the weight.
More possible weights are limited to the area below (shown in magenta):
which could be visualized in dataspace X as:
Hope it clarifies dataspace/weightspace correlation a bit. Feel free to ask questions, will be glad to explain in more detail.
The "decision boundary" for a single-layer perceptron is a plane (hyperplane),
where n in the image is the weight vector w, in your case w={w1=1,w2=2}=(1,2), and its direction specifies which side is the right side. n is orthogonal (90 degrees) to the plane.
A plane always splits a space into 2 naturally (extend the plane to infinity in each direction).
You can also try to input different values into the perceptron and find where the response is zero (only on the decision boundary).
Recommend you read up on linear algebra to understand it better:
https://www.khanacademy.org/math/linear-algebra/vectors_and_spaces
For a perceptron with 1 input & 1 output layer, there can only be 1 LINEAR hyperplane. And since there is no bias, the hyperplane can't shift along any axis, so it will always pass through the origin. However, if there is a bias, the hyperplane no longer has to pass through the origin.
I think the reason why a training case can be represented as a hyperplane is because...
Let's say
[j,k] is the weight vector and
[m,n] is the training-input
training-output = jm + kn
Given that a training case in this perspective is fixed and the weights vary, the training-input (m, n) becomes the coefficients and the weights (j, k) become the variables.
Just as in any text book where z = ax + by is a plane,
training-output = jm + kn is also a plane defined by training-output, m, and n.
The equation of a plane passing through the origin is written in the form:
ax+by+cz=0
If a=1, b=2, c=3, the equation of the plane can be written as:
x+2y+3z=0
So, in XYZ space, the equation is x+2y+3z=0.
Now, in the weight space, every dimension represents a weight. So, if the perceptron has 10 weights, the weight space will be 10-dimensional.
Equation of the perceptron: ax+by+cz<=0 ==> Class 0
ax+by+cz>0 ==> Class 1
In this case, a, b & c are the weights and x, y & z are the input features.
In the weight space, a, b & c are the variables (axes).
So, for every training example, e.g. (x,y,z)=(2,3,4), a hyperplane is formed in the weight space whose equation is:
2a+3b+4c=0
passing through the origin.
I hope you understand it now.
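A small sketch of that last step, using the training example (2,3,4) from above. The three weight vectors are hypothetical picks, chosen to sit on the hyperplane 2a+3b+4c=0 and on either side of it.

```python
import numpy as np

x = np.array([2.0, 3.0, 4.0])              # training example (x, y, z)

def classify(weights, features):
    """Class 1 if a*x + b*y + c*z > 0, otherwise class 0 (no bias term)."""
    return 1 if weights @ features > 0 else 0

w_on_plane = np.array([3.0, -2.0, 0.0])    # 2*3 + 3*(-2) + 4*0 = 0
w_positive = np.array([1.0, 1.0, 1.0])     # 2 + 3 + 4 = 9 > 0
w_negative = np.array([-1.0, -1.0, -1.0])  # -9 < 0

for w in (w_on_plane, w_positive, w_negative):
    print(w, w @ x, classify(w, x))
```

Every weight vector on one side of the hyperplane assigns this example to class 1 and every weight vector on the other side assigns it to class 0, which is the sense in which one training case splits the weight space in two.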
Consider we have 2 weights. So w = [w1, w2]. Suppose we have input x = [x1, x2] = [1, 2]. If you use the weight to do a prediction, you have z = w1*x1 + w2*x2 and prediction y = z > 0 ? 1 : 0.
Suppose the label for the input x is 1. Thus, we hope y = 1, and so we want z = w1*x1 + w2*x2 > 0. In vector notation, z = (w ^ T)x, so we want (w ^ T)x > 0. The geometric interpretation of this expression is that the angle between w and x is less than 90 degrees. For example, the green vector is a candidate for w that would give the correct prediction of 1 in this case. Actually, any vector that lies on the same side of the line w1 + 2 * w2 = 0 as the green vector would give the correct prediction. However, if it lies on the other side, as the red vector does, it would give the wrong answer.
However, suppose the label is 0. Then the case would just be the reverse.
The above case gives the intuition and illustrates the 3 points in the lecture slide. The training case x determines the plane, and depending on the label, the weight vector must lie on one particular side of the plane to give the correct answer.
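The angle interpretation can be sketched as follows, using the input x = [1, 2] from above. The candidate weight vectors are hypothetical stand-ins for the green and red vectors in the figure.

```python
import numpy as np

def angle_deg(w, x):
    """Angle between w and x in degrees."""
    cos = (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

x = np.array([1.0, 2.0])            # training input with label 1

w_good = np.array([1.0, 2.0])       # same side as x: w.x = 5 > 0
w_bad = np.array([-2.0, -1.0])      # other side: w.x = -4 < 0
w_boundary = np.array([2.0, -1.0])  # on the line w1 + 2*w2 = 0: w.x = 0

for w in (w_good, w_bad, w_boundary):
    print(w, w @ x, angle_deg(w, x))
```

A positive dot product always goes with an angle below 90 degrees and a negative one with an angle above 90 degrees, so the line w1 + 2*w2 = 0 is exactly the set of weights at 90 degrees to x.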
