Generalised Fibonacci

How do I find the nth term of the following recurrence in O(log n) time?
F[n]=F[n-1]+F[n-3]
F[2]=1;
F[3]=2;
F[4]=3;
F[5]=4;
I could not create the required matrix to exponentiate.
Sorry if this seems somewhat off-topic.
Thanks, but I found the answer:
https://math.stackexchange.com/questions/891964/generalised-fibonacci/891979#891979
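For completeness, here is a minimal Octave sketch of the matrix approach the question asks about (the function name gen_fib is mine, and the seeds are taken from the question). The state vector [F(k); F(k-1); F(k-2)] advances one step when multiplied by a 3x3 transition matrix, so F(n) falls out of an O(log n) exponentiation by squaring:
function f = gen_fib(n)
  % [F(k+1); F(k); F(k-1)] = A * [F(k); F(k-1); F(k-2)]
  A = [1 0 1; 1 0 0; 0 1 0];    % row 1 encodes F(k+1) = F(k) + F(k-2)
  v = [3; 2; 1];                % seeds [F(4); F(3); F(2)] from the question
  if n <= 4
    f = v(5 - n);               % read the base cases directly (n >= 2)
    return
  end
  k = n - 4;
  R = eye(3);
  while k > 0                   % exponentiation by squaring: O(log n) multiplies
    if mod(k, 2) == 1
      R = R * A;
    end
    A = A * A;
    k = floor(k / 2);
  end
  f = [1 0 0] * R * v;          % first component of A^(n-4) * v is F(n)
end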

The nth Fibonacci number is given by:
f(n) = Floor(phi^n / sqrt(5) + 1/2)
where
phi = (1 + sqrt(5)) / 2
Assuming that the primitive mathematical operations (+, -, * and /) are O(1), you can use this result to compute the nth Fibonacci number in O(log n) time (O(log n) because the exponentiation in the formula can be done by repeated squaring).
With reference to:
nth fibonacci number in sublinear time
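A quick Octave check of this closed form (note that this is the standard Fibonacci sequence, not the generalised recurrence from the question, and floating-point error will break the formula for large n):
phi = (1 + sqrt(5)) / 2;
fib = @(n) floor(phi.^n / sqrt(5) + 1/2);   % element-wise, so it accepts a vector
fib(1:10)                                   % ans = 1 1 2 3 5 8 13 21 34 55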


Where does X*theta in the cost function come from?

In Linear Regression, there is a cost function:
J(theta) = 1/(2m) * SUM_{i=1..m} (h_theta(x(i)) - y(i))^2, where the hypothesis is h_theta(x) = theta' * x.
The code in Octave is:
function J = computeCost(X, y, theta)
%COMPUTECOST Compute cost for linear regression
% J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
% parameter for linear regression to fit the data points in X and y
% Initialize some useful values
m = length(y); % number of training examples
% You need to return the following variables correctly
J = 0;
% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
% You should set J to the cost.
H = X*theta;
S = (H - y).^2;
J = 1 / (2*m) * sum(S);
% =========================================================================
end
Could someone tell me why sigma(h_theta(x(i))) is equal to the vectorization X*theta?
Thanks
Could someone tell me why sigma(h_theta(x(i))) is equal to the vectorization X*theta?
That is not the case. At no point in this code does sigma(h(x_i)) get computed separately. The variable H is not equal to that value; it is a (column) vector that stores the values
`h(x_i)=dot_product(x_i,theta)`
for all examples.
The formula that you give in LaTeX just says to sum (h(x_i)-y_i)^2 over all examples. What you want to avoid is computing h(x_i) for each of those examples sequentially, because that would be time consuming. From the definition of h(x), you know that
# I've written the more general case; the case `n==1` corresponds to your LaTeX formula
h(x_i)=[1 x_i1 ... x_in]*[theta_0 theta_1 ... theta_n]'
The matrix X is of size m x (n+1) (one row per example, with the leading 1 included), where m is the number of examples. So each row of the vector
H=X*theta #H is a vector of size m*1
will correspond to a single h(x_i).
Knowing this, you can see that
S=(H-y).^2 #S is a vector of size m*1
is a vector such that each element is one of the (h(x_i)-y_i)^2. So, you just need to sum all of them with sum(S) to get the value of the sigma from your LaTeX formula.
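A quick way to convince yourself is to compute the cost both ways on a tiny made-up data set (all numbers below are arbitrary) and check that the loop and the vectorized form agree:
X = [1 2; 1 3; 1 4];            % m = 3 examples, intercept column plus one feature
y = [2; 3; 4];
theta = [0.5; 1];
m = length(y);
J_loop = 0;                     % h(x_i) computed example by example
for i = 1:m
  h_i = X(i, :) * theta;        % scalar hypothesis for example i
  J_loop = J_loop + (h_i - y(i))^2;
end
J_loop = J_loop / (2 * m);
J_vec = 1 / (2 * m) * sum((X * theta - y).^2);   % H = X*theta, all h(x_i) at once
disp([J_loop, J_vec])           % both print 0.125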
I have used Octave notation and syntax for writing matrices: a comma separates entries within a row, a semicolon separates rows, and a single quote denotes transpose.
The second line of the LaTeX expression in the question is valid with just one training example: x is an '(f+1) x 1' matrix, i.e. a column vector. Specifically x = [x0; x1; x2; x3; .... xf]
x0 is always '1'. Here 'f' is the number of features.
theta = [theta0; theta1; theta2; theta3; .... thetaf].
'theta' is a column vector or '(f+1) x 1' matrix. theta0 is the intercept term.
In this special case with one training example, the '1 x (f+1)' matrix theta' can be multiplied with x to give the correct '1 x 1' hypothesis matrix, i.e. a real number.
h = theta' * x as in the second line of the Latex expression is valid.
But the expression m = length(y) indicates that there are multiple training examples. With 'm' training examples, X is an 'm x (f+1)' matrix.
To simplify, let there be two training examples each with 'f' features.
X = [ x1; x2].
(Please note 1 and 2 inside the brackets are not exponential terms but indexes for the training examples).
Here, x1 = [ x01, x11, x21, x31, .... xf1 ]
and
x2 = [ x02, x12, x22, x32, .... xf2].
So X is a '2 x (f+1)' matrix.
Now to answer the question, theta' is a '1 x (f+1)' matrix and X is a '2 x (f+1)' matrix. With this, the valid expression is X * theta.
The LaTeX expression theta' * X becomes invalid.
The expected hypothesis matrix in my example, 'h', should have two predicted values (two real numbers), one for each of the two training examples. 'h' is a '2 x 1' matrix or column vector.
The hypothesis can be obtained by using the expression X * theta, which is valid and algebraically correct: multiplying a '2 x (f+1)' matrix with an '(f+1) x 1' matrix results in a '2 x 1' hypothesis matrix.
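A minimal Octave sketch of this two-example case (the feature values are made up):
x1 = [1, 5, 7];  x2 = [1, 6, 8];   % each row is [x0 x1 x2] with x0 = 1, so f = 2
X = [x1; x2];                      % 2 x (f+1)
theta = [0.1; 0.2; 0.3];           % (f+1) x 1 column vector
h = X * theta                      % 2 x 1 column of predictions
% theta' * X                       % error: nonconformant arguments (1x3 * 2x3)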

Why theta*X and not theta'*X in practice?

While doing Andrew Ng's ML MOOC, he explains in the theory that theta'*X gives us the hypothesis, but in the coursework we use theta*X. Why is that so?
theta'*X is used to calculate the hypothesis for a single training example, when X is a vector. You have to take the transpose, theta', to match the h(x) definition.
In practice, since you have more than one training example, X is a matrix (your training set) of dimension m x n, where m is the number of training examples and n the number of features.
Now, you want to calculate h(x) for all your training examples with your theta parameter in just one move, right?
Here is the trick: theta has to be an n x 1 vector. Then, when you do the matrix-vector multiplication X*theta, you obtain an m x 1 vector with the h(x) of every training example in your training set (the X matrix). Matrix multiplication builds the vector h(x) row by row, doing exactly the math of the h(x) definition for each training example.
You can do the math by hand; I did, and now it is clear. Hope I can help someone. :)
In mathematics, a 'vector' is always defined as a vertically-stacked array, e.g. x = [x1; x2; x3], and signifies a single point in a 3-dimensional space.
A 'horizontal' vector typically signifies an array of observations, e.g. [x1, x2, x3] is a tuple of 3 scalar observations.
Equally, a matrix can be thought of as a collection of vectors. E.g., [a1 b1 c1 d1; a2 b2 c2 d2; a3 b3 c3 d3] is a collection of four 3-dimensional vectors, one per column.
A scalar can be thought of as a matrix of size 1x1, and therefore its transpose is the same as the original.
More generally, an n-by-m matrix W can also be thought of as a transformation from an m-dimensional vector x to an n-dimensional vector y, since multiplying that matrix with an m-dimensional vector will yield a new n-dimensional one. If your 'matrix' W is '1xn', then this denotes a transformation from an n-dimensional vector to a scalar.
Therefore, notationally, it is customary to introduce the problem from the mathematical notation point of view, e.g. y = Wx.
However, for computational reasons, sometimes it makes more sense to perform the calculation as a "vector times a matrix" rather than "matrix times a vector". Since (Wx)' === x'W', sometimes we solve the problem like that, and treat x' as a horizontal vector. Also, if W is not a matrix, but a scalar, then Wx denotes scalar multiplication, and therefore in this case Wx === xW.
I don't know the exercises you speak of, but my assumption would be that in the course he introduced theta as a proper, vertical vector, but then transposed it to perform proper calculations, i.e. a transformation from a vector of n-dimensions to a scalar (which is your prediction).
Then in the exercises, presumably you were either dealing with a scalar 'theta', so there was no point transposing it and it was left as theta for convenience, or theta was defined as a horizontal (i.e. transposed) vector to begin with for some reason (e.g. printing convenience), and was then left in that state when performing the necessary transformation.
I don't know what the dimensions of your theta and X are (you haven't provided anything), but it all depends on the dimensions of X, theta and the hypothesis. Let's say m is the number of features and n the number of examples. Then, if theta is an mx1 vector and X is an nxm matrix, X*theta is an nx1 hypothesis vector.
But you will get the same values, as a 1xn row vector, if you calculate theta'*X'. You can also get the same result with theta*X if theta is 1xm and X is mxn.
Edit:
As @Tasos Papastylianou pointed out, the same result is also obtained if X is mxn: then (theta.'*X).' or X.'*theta are the answers. If the hypothesis should be a 1xn vector, then theta.'*X is the answer. If theta is 1xm, X is mxn and the hypothesis is 1xn, then theta*X is also a correct answer.
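These equivalences are easy to check numerically in Octave (arbitrary numbers, following this answer's convention of theta being mx1 and X being nxm):
theta = [0.5; -1];          % m x 1 (m = 2 features)
X = [1 2; 3 4; 5 6];        % n x m (n = 3 examples)
h1 = X * theta              % n x 1 hypothesis vector: [-1.5; -2.5; -3.5]
h2 = (theta' * X')'         % same values, via the identity (A*B)' == B'*A'
Xt = X';                    % the m x n storage discussed in the Edit
h3 = (theta.' * Xt).'       % n x 1 again
h4 = Xt.' * theta           % n x 1 again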
I had the same problem (ML course, linear regression).
After spending time on it, here is how I see it: there is a confusion between the x(i) vector and the X matrix.
About the hypothesis h(xi) for an xi vector (xi belongs to R3x1); theta belongs to R3x1 as well:
theta = [t0; t1; t2] # R(3x1)
theta' = [t0 t1 t2] # R(1x3)
xi = [1 ; xi,1 ; xi,2] # R(3x1)
theta' * xi => t0 + t1.xi,1 + t2.xi,2
= h(xi) (which is R1x1 => a real number)
so theta' * xi works here
About the vectorization equation
In this case X is not the same thing as x (a vector). It is a matrix with m rows and n+1 columns (m = number of examples and n = number of features, to which we add the t0 term).
Therefore, from the previous example with n = 2,
the matrix X is an m x 3 matrix:
X = [1 x0,1 x0,2 ; 1 x1,1 x1,2 ; .... ; 1 xi,1 xi,2 ; ...; 1 xm,1 xm,2]
If you want to vectorize the equation for the algorithm, you need to consider that for each row i you will get h(xi) (a real number),
so you need to implement X * theta.
That will give you, for each row i:
[ 1 xi,1 xi,2] * [t0 ; t1 ; t2] = t0 + t1.xi,1 + t2.xi,2
Hope it helps
I have used Octave notation and syntax for writing matrices: a comma separates entries within a row, a semicolon separates rows, and a single quote denotes transpose.
In the course theory under discussion, theta = [theta0; theta1; theta2; theta3; .... thetaf].
'theta' is therefore a column vector or '(f+1) x 1' matrix. Here 'f' is the number of features. theta0 is the intercept term.
With just one training example, x is an '(f+1) x 1' matrix, i.e. a column vector. Specifically x = [x0; x1; x2; x3; .... xf]
x0 is always '1'.
In this special case the '1 x (f+1)' matrix theta' can be multiplied with x to give the correct '1x1' hypothesis matrix, i.e. a real number.
h = theta' * x is a valid expression.
But the coursework deals with multiple training examples. If there are 'm' training examples, X is an 'm x (f+1)' matrix.
To simplify, let there be two training examples each with 'f' features.
X = [ x1; x2].
(Please note 1 and 2 inside the brackets are not exponential terms but indexes for the training examples).
Here, x1 = [ x01, x11, x21, x31, .... xf1 ]
and
x2 = [ x02, x12, x22, x32, .... xf2].
So X is a '2 x (f+1)' matrix.
Now to answer the question, theta' is a '1 x (f+1)' matrix and X is a '2 x (f+1)' matrix. With this, the following expressions are not valid.
theta' * X
theta * X
The expected hypothesis matrix, 'h', should have two predicted values (two real numbers), one for each of the two training examples. 'h' is a '2 x 1' matrix or column vector.
The hypothesis can be obtained only by using the expression X * theta, which is valid and algebraically correct: multiplying a '2 x (f+1)' matrix with an '(f+1) x 1' matrix results in a '2 x 1' hypothesis matrix.
When Andrew Ng first introduced x in the cost function J(theta), x is a column vector, aka
[x0; x1; ... ; xn]
i.e. the entries x0 through xn stacked vertically.
However, in the first programming assignment we are given X, which is an (m x n) matrix (# training examples x features per training example). The discrepancy comes from the fact that, in the file, the individual x vectors (training samples) are stored as horizontal row vectors rather than as vertical column vectors!
This means the X matrix you see is actually an X' (X transpose) matrix!
Since we have X', we need to make our code work given that our equation expects h(theta) = theta' * X (when the vectors in matrix X are column vectors).
We have the linear algebra identity for matrix and vector multiplication:
(A*B)' == (B') * (A'), as shown here: Properties of Transposes
let t = theta
given: h(t) = t' * X
h(t)' = (t' * X)'
      = X' * t
Now we have our variables in the format they were actually given to us. What I mean is that our input file really holds X', and theta is as normal, so multiplying them in the order specified above gives a practically equivalent output to the one he taught us to use, which was theta' * X. Since we are summing all the elements of h(t)' at the end, it does not matter that it is transposed for the final calculation. However, if you wanted h(t), not h(t)', you could always take your computed result and transpose it, because
(A')' == A
However, for the coursera machine learning programming assignment 1, this is unnecessary.
This is because the computer has the coordinate (0,0) positioned on the top left, while geometry has the coordinate (0,0) positioned on the bottom left.

Is division by zero included in QF_NRA?

Is division by zero included in QF_NRA?
The SMT-LIB standard is confusing in this matter. The paper where the standard is defined simply does not discuss this point; in fact, NRA and QF_NRA do not appear anywhere in that document. Some information is provided on the standard's website. Reals are defined as including:
- all terms of the form (/ m n) or (/ (- m) n) where
- m is a numeral other than 0,
- n is a numeral other than 0 and 1,
- as integers, m and n have no common factors besides 1.
This explicitly excludes zero from the denominator when it comes to constant values. However, later, division is defined as:
- / as a total function that coincides with the real division function
for all inputs x and y where y is non-zero,
This is followed up by a note:
Since in SMT-LIB logic all function symbols are interpreted as total
functions, terms of the form (/ t 0) *are* meaningful in every
instance of Reals. However, the declaration imposes no constraints
on their value. This means in particular that
- for every instance theory T and
- for every closed terms t1 and t2 of sort Real,
there is a model of T that satisfies (= t1 (/ t2 0)).
This is seemingly contradictory, because the first quote says that (/ m 0) is not a number in QF_NRA, but the latter quote says that / is a function such that (= t1 (/ t2 0)) is satisfiable for any t1 and t2.
The de-facto reality on the ground is that division by zero seems to be included in SMT-LIB, despite the statement that (/ m n) is only a Real number if n is nonzero. This is related to a previous question of mine: y=1/x, x=0 satisfiable in the reals?
the first quote says that (/ m 0) is not a number
No, it is a number; the standard just does not say what number it is.
but the latter quote says that / is a function such that (= t1 (/ t2 0)) is satisfiable for any t1 and t2
This is correct.
You need to get away from the school mentality that says "division by zero is not allowed!". It is undefined, and undefined means that there is no axiom that specifies what value it is. (This is true in school as well.)
There is no difference between a / 0 and f(a) where f is some uninterpreted function. What is f(1234)? It's undefined, so Z3 is allowed to pick any value at all; Z3 can fill in any function it likes.
Therefore, a / 0 == b is satisfiable, and any a and b are OK. But (a / 0) == (a / 0) + 1 is false.
Mathematical operators are just functions. The standard partially specifies those functions.
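Both claims are easy to check with any SMT-LIB solver (e.g. Z3); a minimal sketch (the model values the solver picks are solver-dependent):
(set-logic QF_NRA)
(declare-fun a () Real)
(declare-fun b () Real)
(push 1)
(assert (= (/ a 0) b))              ; sat: the solver may assign a/0 any value
(check-sat)
(pop 1)
(assert (= (/ a 0) (+ (/ a 0) 1)))  ; unsat: a/0 is *some* fixed value, and x = x + 1 has no solution
(check-sat)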

How to determine time complexity of EM algorithm of probabilistic PCA?

I was studying probabilistic PCA from Bishop's book, where an EM algorithm is provided to calculate the principal subspace.
Here M is an MxM matrix, W is a DxM matrix and (xn − x) is a Dx1 vector.
Later in the book there is statement regarding the time complexity:
"Instead, the most computationally demanding steps are those involving sums over
the data set that are O(NDM)."
I was wondering if anyone can help me understanding the time complexity of the algorithm. Thanks in advance.
Let us go one by one
E[zn] = M^-1 W' (xn - x)
M^-1 can be precomputed, so you do not pay O(M^3) every time you need this kind of value, but rather a single O(M^3) cost per iteration.
Apart from that, it is a multiplication of matrices of sizes (MxM) * (MxD) * (Dx1), which is O(M^2 D);
the result is of size Mx1.
E[zn zn'] = sigma^2 M^-1 + E[zn]E[zn]'
sigma^2 M^-1 is just multiplication by a constant, thus linear in the size of the matrix: O(M^2);
the second operation is an outer product of Mx1 and 1xM vectors, so the result is MxM again and also takes O(M^2);
the result is an M x M matrix.
Wnew = [SUM (xn-x) E[zn]'] [SUM E[zn zn']]^-1
The first part is an N-times repeated (summed) multiplication of a Dx1 matrix by a 1xM one, thus complexity O(NDM); the result is of size D x M.
The second part is again a sum of N elements, each an M x M matrix, thus O(NM^2) in total.
Finally we invert the M x M sum, which is O(M^3), and compute the product of a D x M and an M x M matrix, which is O(DM^2), again resulting in a D x M matrix.
sigma^2new = 1/(ND) SUM[ ||xn-x||^2 - 2 E[zn]' Wnew' (xn-x) + Tr(E[zn zn'] Wnew' Wnew) ]
Again we sum N times, this time over a three-term expression. The first part is just a (squared) norm, computed in O(D) (linear in the size of the vector). The second term is a multiplication of (1 x M), (M x D) and (D x 1) matrices, with complexity O(MD) per example, thus O(NMD) in total. The last part is about multiplying three matrices of sizes (M x M), (M x D) and (D x M), which would be O(M^2 D) per example (times N); but you only need the trace, and you can precompute Wnew'Wnew, so this part reduces to the trace of a product of two MxM matrices, which is O(M^2) (times N).
In total you get O(M^3) + O(NMD) + O(M^2 D) + O(M^2 N), and I suppose there is an assumption that M <= D <= N, thus O(NDM).
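For concreteness, here is a minimal Octave sketch of one EM iteration, written so the per-term costs above stay visible (the function name and the centred-data layout Xc, of size D x N, are my choices; the updates follow the equations as restated in this answer):
function [W, s2] = ppca_em_step(Xc, W, s2)
  [D, N] = size(Xc);
  M = size(W, 2);                       % latent dimensionality
  Minv = inv(W' * W + s2 * eye(M));     % the MxM matrix "M" above: O(M^3), once per iteration
  Ez = Minv * (W' * Xc);                % all E[zn] as columns: O(NDM)
  S1 = Xc * Ez';                        % SUM (xn-x) E[zn]' : D x M, O(NDM)
  S2 = N * s2 * Minv + Ez * Ez';        % SUM E[zn zn'] : M x M, O(NM^2)
  W = S1 / S2;                          % Wnew = S1 * S2^-1 : O(DM^2) + O(M^3)
  t1 = sum(Xc(:) .^ 2);                 % SUM ||xn-x||^2 : O(ND)
  t2 = -2 * sum(sum(Ez .* (W' * Xc)));  % SUM -2 E[zn]' Wnew' (xn-x) : O(NDM)
  t3 = trace(S2 * (W' * W));            % Tr((SUM E[zn zn']) Wnew'Wnew); Wnew'Wnew costs O(DM^2)
  s2 = (t1 + t2 + t3) / (N * D);        % sigma^2 update
end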

Weighing Samples in a Decision Tree

I've constructed a decision tree that takes every sample equally weighted. Now I want to construct a decision tree which gives different weights to different samples. Is the only change I need to make in finding the expected entropy before calculating information gain? I'm a little confused how to proceed; please explain.
For example: consider a node containing p positive and n negative samples. The node's entropy will be -p/(p+n) log(p/(p+n)) - n/(p+n) log(n/(p+n)). Now suppose a split is found that somehow divides the parent node into two child nodes. Suppose child 1 contains p' positives and n' negatives (so child 2 contains p-p' and n-n'). For child 1 we calculate entropy as we did for the parent and take the probability of reaching it, i.e. (p'+n')/(p+n). The expected reduction in entropy will then be entropy(parent) - (prob of reaching child1 * entropy(child1) + prob of reaching child2 * entropy(child2)), and the split with the maximum information gain will be chosen.
Now I want to do the same procedure when we have weights available for each sample. What changes need to be made? What changes need to be made specifically for AdaBoost (using stumps only)?
(I guess this is the same idea as in some comments, e.g., @Alleo.)
Suppose you have p positive examples and n negative examples. Let's denote the weights of examples to be:
a1, a2, ..., ap ---------- weights of the p positive examples
b1, b2, ..., bn ---------- weights of the n negative examples
Suppose
a1 + a2 + ... + ap = A
b1 + b2 + ... + bn = B
As you pointed out, if the examples have unit weights, the entropy would be:
- p/(p+n) log(p/(p+n)) - n/(p+n) log(n/(p+n))
Now you only need to replace p with A and replace n with B and then you can obtain the new instance-weighted entropy.
- A/(A+B) log(A/(A+B)) - B/(A+B) log(B/(A+B))
Note: nothing fancy here. What we did was just figure out the weighted importance of the groups of positive and negative examples. When examples are equally weighted, the importance of the positive examples is proportional to the ratio of the number of positives w.r.t. the number of all examples. When examples are non-equally weighted, we just perform a weighted average to get the importance of the positive examples.
Then you follow the same logic to choose the attribute with largest Information Gain by comparing entropy before splitting and after splitting on an attribute.
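A minimal Octave sketch of this replacement (the function names are mine; it assumes binary labels and a given left/right partition of the samples):
function H = weighted_entropy(w, y)
  % w: sample weights; y: logical labels (true = positive)
  A = sum(w(y));  B = sum(w(~y));     % weighted "counts" of positives and negatives
  p = [A, B] / (A + B);
  p = p(p > 0);                       % convention: 0 * log(0) = 0
  H = -sum(p .* log2(p));
end
function G = weighted_info_gain(w, y, left)
  % left: logical index of the samples routed to child 1
  wl = sum(w(left)) / sum(w);         % weighted probability of reaching child 1
  G = weighted_entropy(w, y) ...
      - wl * weighted_entropy(w(left), y(left)) ...
      - (1 - wl) * weighted_entropy(w(~left), y(~left));
end
For AdaBoost with stumps, the same idea applies: at each boosting round you evaluate candidate stumps under the current sample weights, commonly by weighted error or, equivalently in spirit, by this weighted information gain.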
