Is the following set convex? - machine-learning

Is the following set convex? - machine-learning

The set of points closer to a given point than a given set, i.e.,
{x | ||x − x0|| ≤ ||x − y|| for all y ∈ S}
where S ⊆ R^n
It reminds me of an Euclidean ball, but I don't know how to proceed to check if it's convex or not.
( In this case it's the 2-norm used above ).

The link: https://see.stanford.edu/materials/lsocoee364a/hw1sol.pdf
2.7 provides the answer, where you prove that the set is a halfspace and thus the set is convex.

Related

Why does cubical agda choose the particular two component homogeneous path composition operator it does?

In the implementation of the prelude for cubical agda, there is a definition of 3 component path composition
_∙∙_∙∙_ : w ≡ x → x ≡ y → y ≡ z → w ≡ z
This definition feels reasonably natural and clean to me. But then to get the 2 component path composition operator there are 3 choices: One for fixing each of the arguments as refl.
The standard ∙ is done by fixing the first argument, and another implementation (∙') is given for fixing the third argument. There is then a proof that they are the same. But the version fixing the 2nd argument is not discussed.
It seems to me that the 2nd argument version (call it ∘) has a nice property that
sym (p1 ∘ p2) is equal to sym p2 ∘ sym p1 definitionally. This seems like it can reduce the book keeping in some proofs.
Are there reasons why this version is not the standard version? Are there other computational properties that are better with the standard version?

Why thetaX not theta'X in practical?

While doing MOOC on ML by Andrew Ng, he in theory explains theta'*X gives us hypothesis and while doing coursework we use theta*X. Why it's so?

theta'*X is used to calculate the hypothesis for a single training example when X is a vector. Then you have to calculate theta' to get to the h(x) definition.
In the practice, since you have more than one training example, X is a Matrix (your training set) with "m x n" dimension where m is the number of your training examples and n your number of features.
Now, you want to calculate h(x) for all your training examples with your theta parameter in just one move right?
Here is the trick: theta has to be a n x 1 vector then when you do Matrix-Vector Multiplication (X*theta) you will obtain an m x 1 vector with all your h(x)'s training examples in your training set (X matrix). Matrix multiplication will create the vector h(x) row by row making the corresponding math and this will be equal to the h(x) definition at each training example.
You can do the math by hand, I did it and now is clear. Hope i can help someone. :)

In mathematics, a 'vector' is always defined as a vertically-stacked array, e.g. , and signifies a single point in a 3-dimensional space.
A 'horizontal' vector, typically signifies an array of observations, e.g. is a tuple of 3 scalar observations.
Equally, a matrix can be thought of as a collection of vectors. E.g., the following is a collection of four 3-dimensional vectors:
A scalar can be thought of as a matrix of size 1x1, and therefore its transpose is the same as the original.
More generally, an n-by-m matrix W can also be thought of as a transformation from an m-dimensional vector x to an n-dimensional vector y, since multiplying that matrix with an m-dimensional vector will yield a new n-dimensional one. If your 'matrix' W is '1xn', then this denotes a transformation from an n-dimensional vector to a scalar.
Therefore, notationally, it is customary to introduce the problem from the mathematical notation point of view, e.g. y = Wx.
However, for computational reasons, sometimes it makes more sense to perform the calculation as a "vector times a matrix" rather than "matrix times a vector". Since (Wx)' === x'W', sometimes we solve the problem like that, and treat x' as a horizontal vector. Also, if W is not a matrix, but a scalar, then Wx denotes scalar multiplication, and therefore in this case Wx === xW.
I don't know the exercises you speak of, but my assumption would be that in the course he introduced theta as a proper, vertical vector, but then transposed it to perform proper calculations, i.e. a transformation from a vector of n-dimensions to a scalar (which is your prediction).
Then in the exercises, presumably you were either dealing with a scalar 'theta' so there was no point transposing it, and was left as theta for convenience or, theta was now defined as a horizontal (i.e. transposed) vector to begin with for some reason (e.g. printing convenience), and then was left in that state when performing the necessary transformation.

I don't know what the dimensions for your theta and X are (you haven't provided anything) but actually it all depends on the X, theta and hypothesis dimensions. Let's say m is the number of features and n - the number of examples. Then, if theta is a mx1 vector and X is a nxm matrix then X*theta is a nx1 hypothesis vector.
But you will get the same result if calculate theta'*X. You can also get the same result with theta*X if theta is 1xm and X - mxn
Edit:
As #Tasos Papastylianou pointed out the same result will be obtained if X is mxn then (theta.'*X).' or X.'*theta are answers. If the hypothesis should be a 1xn vector then theta.'*X is an answer. If theta is 1xm, X - mxn and the hypothesis is 1xn then theta*X is also a correct answer.

i had the same problem for me. (ML course, linear regression)
after spending time on it, here is how i see it: there is a confusion between the x(i) vector and the X matrix.
About the hypothesis h(xi) for a xi vector (xi belongs to R3x1), theta belongs to R3x1
theta = [to;t1;t2] #R(3x1)
theta' = [to t1 t2] #R(1x3)
xi = [1 ; xi1 ; xi2] #(R3x1)
theta' * xi => to + t1.xi,1 +t2.xi,2
= h(xi) (which is a R1x1 => a real number)
to the theta'*xi works here
About the vectorization equation
in this case X is not the same thing as x (vector). it is a matrix with m rows and n+1 col (m =number of examples and n number of features on which we add the to term)
therefore from the previous example with n= 2,
the matrix X is a m x 3 matrix
X = [1 xo,1 xo,2 ; 1 x1,1 x1,2 ; .... ; 1 xi,1 xi,2 ; ...; 1 xm,1 xm,2]
if you want to vectorize the equation for the algorithm, you need to consider for that for each row i, you will have the h(xi) (a real number)
so you need to implement X * theta
that will give you for each row i
[ 1 xi,1 xi,2] * [to ; t1 ; t2] = to + t1.xi,1 + t2.xi,2
Hope it helps

I have used octave notation and syntax for writing matrices: 'comma' for separating column items, 'semicolon' for separating row items and 'single quote' for Transpose.
In the course theory under discussion, theta = [theta0; theta1; theta2; theta3; .... thetaf].
'theta' is therefore a column vector or '(f+1) x 1' matrix. Here 'f' is the number of features. theta0 is the intercept term.
With just one training example, x is a '(f+1) x 1' matrix or a column vector. Specifically x = [x0; x1; x2; x3; .... xf]
x0 is always '1'.
In this special case the '1 x (f+1)' matrix formed by taking theta' and x could be multiplied to give the correct '1x1' hypothesis matrix or a real number.
h = theta' * x is a valid expression.
But the coursework deals with multiple training examples. If there are 'm' training examples, X is a 'm x (f+1)' matrix.
To simplify, let there be two training examples each with 'f' features.
X = [ x1; x2].
(Please note 1 and 2 inside the brackets are not exponential terms but indexes for the training examples).
Here, x1 = [ x01, x11, x21, x31, .... xf1 ]
and
x2 = [ x02, x12, x22, x32, .... xf2].
So X is a '2 x (f+1)' matrix.
Now to answer the question, theta' is a '1 x (f+1)' matrix and X is a '2 x (f+1)' matrix. With this, the following expressions are not valid.
theta' * X
theta * X
The expected hypothesis matrix, 'h', should have two predicted values (two real numbers), one for each of the two training examples. 'h' is a '2 x 1' matrix or column vector.
The hypothesis can be obtained only by using the expression, X * theta which is valid and algebraically correct. Multiplying a '2 x (f+1)' matrix with a '(f+1) x 1' matrix resulting in a '2 x 1' hypothesis matrix.

When Andrew Ng first introduced x in the cost function J(theta), x is a column vector
aka
[x0; x1; ... ; xn]
i.e.
x0;
x1;
...;
xn
However, in the first programming assignment, we are given X, which is an (m * n) matrix, (# training examples * features per training example). The discrepancy comes with the fact that from file the individual x vectors(training samples) are stored as horizontal row vectors rather than the vertical column vectors!!
This means the X matrix you see is actually an X' (X Transpose) matrix!!
Since we have X', we need to make our code work given our equation is looking for h(theta) = theta' * X(when the vectors in matrix X are column vectors)
we have the linear algebra identity for matrix and vector multiplication:
(A*B)' == (B') * (A') as shown here Properties of Transposes
let t = theta,
given, h(t) = t' * X
h(t)' = (t' X)'
= X' * t
Now we have our variables in the format they were actually given to us. What I mean is that our input file really holds X' and theta is normal, so multiplying them in the order specified above will give a practically equivilant output to that he taught us to use which was theta' * X. Since we are summing all the elements of h(t)' at the end it doesn't matter that it is transposed for the final calculation. However, if you wanted h(t), not h(t)', you could always take your computed result and transpose it because
(A')' == A
However, for the coursera machine learning programming assignment 1, this is unnecessary.

This is because the computer has the coordinate (0,0) positioned on the top left, while geometry has the coordinate (0,0) positioned on the bottom left.
enter image description here

Can all context free grammars be converted to NFA/DFA?

I've seen this post about how to convert context free grammar to a DFA:
Automata theory : Conversion of a Context free grammar to a DFA
However, just wondering can all context free grammars be converted to DFA/NFA? What about context free grammars that cannot be expressed as a regular expression? Ex. S->(S) | ()
Thanks!

Only regular languages can be converted to a DFA, and not all CFGs represent regular languages, including the one in the question.
So the answer is "no".
NFAs are not more expressive than DFAs, so the above statement would still be true if you replaced DFA with NFA
A CFG represents a regular language if it is right- or left-linear. But the mere fact that a CFG is not left- or right-linear proves nothing. For example, S→a | a S a happens to generate the same language as S→a | S a a.

Yes ... if the F in "DFA" is replaced by I to get "DIA", but no ... for DFA, itself; and I will show how this works for your example at the end. In fact, all languages have DIA's whose state diagrams reside on a single Universal State Diagram as sub-diagrams thereof.
Consider your example, but rewrite it as S → u S v, S → w. This grammar, like all grammars, is algebraically a system of inequations over a certain partially ordered algebra. In particular, it can be rewritten as
S ⊇ {u}S{v}, S ⊇ {w},
or equivalently as
S ⊇ {u}S{v} ∪ {w}.
The object identified by the grammar is the least solution to the system. Since the system is a fixed point system S ⊇ f(S) = {u}S{v} ∪ {w}, then the least solution may also be described as the least fixed point solution and it is denoted μx f(x) = μx({u}x{v} ∪ {w}).
The ordering relation, for this algebra here, is subset ordering y ⊆ x ⇔ x ⊇ y. The operations include a product AB ≡ { ab: a ∈ A, b ∈ B }, defined element-wise (where, component-wise, the product is word concatenation, with ab being the concatenation of a and b). The product has {1} as an identity, where 1 denotes the empty word. Both word concatenation and product satisfy the fundamental properties
(xy)z = x(yz) [Associativity]
and
xe = x = ex [Identity property]
with the respective identities e = 1 (for concatenation) or e = {1} (for set product). The algebra is called a Monoid.
The simplest and most direct monoid formed from the elements X = {u,v,w} is the Free Monoid X* = {u,v,w}*, which is equivalently described as the set of all words of finite length (including the empty word, 1, of length 0) formed from u, v and w. It is possible to frame the question in terms of more general monoids, but (as the literature usually does) I will restrict it to free monoids.
The family of languages over X is one and the same as the family 𝔓M of subsets A ⊆ M of the monoid M = X*; the defining condition being A ∈ 𝔓M ⇔ A ⊆ M. Other distinguished subfamilies exist, such as the families ℜM ⊆ ℭM ⊆ 𝔗M ⊆ 𝔓M, respectively, of rational, context-free and Turing (or recursively enumerable) languages. The second of these ℭM, which is what your question is concerned with, are given by context-free grammars and are identified as the least fixed point solutions to the corresponding fixed point system of inequations.
Over 𝔓M, one can define the left-quotient operation v\A = { w ∈ M: vw ∈ A }, for each word v ∈ M and subset A ∈ 𝔓M. Because M = X* is a free monoid, it can be decomposed uniquely into left-quotients on the individual elements of X, by the properties 1\A = A, and (vw)\A = w\(v\A).
Correspondingly, one can define a state transition on each x ∈ X by x: A → x\A, treating each subset A ∈ 𝔓M as a state. Together, 𝔓M comprises the state set of the Universal State Diagram over M. Because M = X* is a free monoid, every element of M is either of the form xw for some x ∈ X and w ∈ X*, or is the empty word 1. The decomposition is unique: xw ≠ 1 for any x ∈ X or w ∈ X* and xw = x'w' for x, x' ∈ X and w, w' ∈ X*, only if x = x' and w = w'. Therefore, every A ∈ 𝔓M decomposes uniquely into a partition in a manner analogous to Taylor's Theorem as
A = A₀ ∪ ⋃_{x∈X} {x} x\A.
where A₀ ≡ A ∩ {1} is either {1} if 1 ∈ A or is ∅ if 1 ∉ A. The states for which A₀ = {1} may be regarded as the Final States in the Universal State Diagram.
The analogy to Taylor's Theorem is not too far-removed, since the left-quotient satisfies an analogue of the Product Rule
x\(AB) = (x\A) B ∪ A₀ (x\B)
so it is also denoted as a partial derivative x\A = ∂A/∂x: the Brzozowski Derivative, so that the decomposition rule could just as well be written as:
A = A₀ ∪ ⋃_{x∈X} {x} ∂A/∂x.
What you actually have is an infinite fixed-point system of inequations
A ⊇ A₀ ∪ ⋃_{x∈X} {x} ∂A/∂x for all A ∈ 𝔓M,
with variables A ∈ 𝔓M ranging over all of 𝔓M, whose right-hand sides are all right-linear in the variables. The sets, themselves, are the least fixed point solution to their own system (and to all closed subsystems of the universal system that contain that set as a variable).
Choosing different states as start states yields the different DIA's contained within it. Every minimal DIA (and every minimal DFA) of every language over X is contained in it.
In particular, in this diagram, you can consider the largest subdiagram accessible from a specific state A ∈ 𝔓M. All the states that can be accessed from A are left-quotients by words in M. So, together they comprise a family δA ≡ { v\A: v ∈ M }. The subdiagram consisting only of these states gives you the minimal DIA for the language A, where A, itself, is treated as the start state of the DIA.
If δA is finite, then the I is an F and it's actually a DFA - and that's what you're looking for. Which states in 𝔓M have DIA that are actually DFA's? The regular ones - the ones in the subfamily ℜM ⊆ 𝔓M. This is the case when M = X* is a free monoid. I'm not totally sure if this can also be proven for non-free monoids (like X* × Y*, whose rational subsets ℜ(X* × Y*) are one and the same as what are known as rational transductions) ... because of the reliance on the Taylor's Formula decomposition. There is still something like a Taylor's Theorem, but the decompositions are not necessarily partitions or unique, any longer.
For larger subfamilies of 𝔓M, the DIA are necessarily infinite; but their transitions may possess a sufficient degree of symmetry to allow both the states and transition rules to be wrapped up more succinctly. Correspondingly, one can distinguish different families of DIA by what symmetry properties they possess.
For your example, X = {u,v,w} and M = {u,v,w}*. The subset identified by your grammar is S = {uⁿ w vⁿ: n = 0, 1, 2, ...}. We can define the following sets
S(n) = S {vⁿ}, T(n) = {vⁿ}, for n = 0, 1, 2, ...
The sub-diagram of states accessible from S consists of all the states
δS = { S(n): n = 0, 1, 2, ... } ∪ { T(n): n = 0, 1, 2, ... } ∪ { ∅ }
The state transitions are the following
u: S(n) → S(n+1)
v: T(n+1) → T(n)
w: S(n) → T(n)
with x: A → ∅ in all other cases for x ∈ {u,v,w} and A ∈ δS. The sole final state is T(0).
As you can see, the DIA is infinite and is not a DFA at all. If you were to draw out the diagram, you would see an infinite ladder with S = S(0) being the start state T(0) = {1} the final state, with all the u transitions climbing up a rung, all the v transitions climbing down a rung, and the w transitions crossing over on a rung.
The symmetry is captured by factoring the state set into
δS = {S,T}×{0,1,2,3,⋯} ∪ {∅}
with S(n) rewritten as (S,n) and T(n) as (T,n). This includes a finite set of states Q = {S,T} for a finite state "control" and a set of states D = {0,1,2,3,⋯} for a "device"; as well as the empty set ∅ for the fail state. That device is none other than a counter, and this DIA is just a one-counter automaton in disguise.
All of the classical automata models posed in the literature have a similar form, when expressed as DIA. They contain a state set Q×D ∪ {∅} that includes a finite set Q for the "finite state control" and a (generally infinite) state set D for the device, along with the fail state ∅. The restrictions or constraints on the device correspond to what types of symmetries are contained in the underlying DIA. A deterministic PDA, with two stack symbols {a,b} for instance, has a device state set D = {a,b}* (consisting of all stack words), and an underlying DIA that has the form of an infinite binary tree with copies of Q residing at each node.
You can best see this by writing out and graphing the DIA for the Dyck language, which is given by the grammar D₂ → b D₂ d D₂, D₂ → p D₂ q D₂, D₂ → 1 as a language over X = {b,d,p,q} and subset of M = X* = {b,d,p,q}*; i.e. as the least-fixed point D₂ = μx ({b}x{d}x ∪ {p}x{q}x ∪ {1}).
Every subset in A ⊆ ℭM can be expressed in terms of a subset in A' ⊆ ℜM[b,d,p,q] of the free extension of the monoid M by indeterminates {b,d,p,q}, by carrying out insertions of {b,d,p,q} in suitable places in A, such that the result upon applying the identities {bd} = {1} = {pq}, {bq} = ∅ = {pd}, and xy = yx for x ∈ M and y ∈ {b,d,p,q} will yield A, itself, from A'.
This result (known, but unpublished since the 1990's and published only in 2022) is the algebraic form of the Chomsky-Schützenberger Theorem and is true for all monoids M. For instance, it holds for the non-free monoid M = X* × Y*, where the corresponding family ℭ(X* × Y*) comprise the push-down transductions from X to Y (or "simple syntax directed translations"; aka yacc-like grammars).
So, there is also something like a DFA even for these classes of DIA; provided you include transition arrows for {b,d,p,q}. For your example, A = μx({u}x{v} ∪ {w}), you have A' = {b}{up,qv,w}*{d} and you can easily write down the corresponding DFA. That automaton is just the one-counter machine, itself, with "b" interpreted as "start up at count 0", "d" as "check for count 0 and finish", "p" as "add one to the count" and "q" as "check for count greater than 0 and subtract 1". With respect to the algebraic rules given for {b,d,p,q}, A' is not just a representation of A, it is actually is A: A' = A.

How to determine time complexity of EM algorithm of probabilistic PCA?

I was studying probabilistic pca from bishop's book, there an EM algo is provdied to calculate principal subspace.
Here M is MxM matrix, W is DxM matrix and (xn − x) is vector Dx1 matrix.
Later in the book there is statement regarding the time complexity:
"Instead, the most computationally demanding steps are those involving sums over
the data set that are O(NDM)."
I was wondering if anyone can help me understanding the time complexity of the algorithm. Thanks in advance.

Let us go one by one
E[zn] = M^-1 W' (xn - x)
M^-1 can be precomputed, thus you do not pay O(M^3) everytime you need this kind of value, but rather a single O(M^3) cost at the end
despite that it is multiplication of matrices of sizes MxM * MxD * Dx1 which is O(M^2D)
result is of size Mx1
E[zn zn'] = sigma^2 M^-1 + E[zn]E[zn]'
sigma^2 M^-1 is just multiplication by constant thus linear in size of matrix, O(M^2)
second operation is outer product of Mx1 and 1xM vectors, thus result is MxM again, and takes O(M^2) too
result is M x M matrix
Wnew = [SUM (xn-x) E[zn]'][SUM E[zn zn']]
First part is N times repeated (sum) operation of multiplying Dx1 matrix by 1xM, thus complexity is O(NDM); result is of size D x M
Second part is again sum of N elements, each being a matrix of M x M, thus in total O(NM^2)
Finally we compute product of D x M and M x M, which is O(DM^2), and again results in D x M matrix
sigma^2new = 1/ND SUM[||xn-x||^2 - 2E[zn]'Wnew'(xn-x) + Tr(E[zn zn']Wnew'Wnew)]
Again we sum N times, this time 3 element sum - first part is just a norm, thus we compute it in O(D) (linear in size of vectors), second term is multiplication of 1 x M, M x D and D x 1 resulting in complexity of O(MD) (per each of iterations, thus in total O(NMD)), and last part is again about multiplying three matrices of sizes M x M, M x D, D x M thus leading to O(M^3D) (*N), but you just need the trace and you can precompute Wnew'Wnew, thus this part is just trace of MxM times MxM matrices leading to O(M^2) (*N)
In total you get O(M^3) + O(NMD) + O(M^2D) + O(M^2N), and I suppose there is an assumption that M<=D<=N thus O(NMD)

Curious case LL(1) or not?

I have stumbled upon a very curious case:
Consider
1) S -> Ax
2) & 3) A->alpha|beta
4) alpha-> b
5) & 6) beta -> epsilon | x
Now I checked and this grammar doesn't defy any rules of LL(1) grammars. But when I construct the parsing table, there are some collisions.
First Sets
S => {b,x}
A=>{b,x,epsilon}
alpha=>{b}
beta=> {x,epsilon}
Follow sets
S=> {$}
A => {x}
alpha => {x}
beta => {x}
Here is the parsing table **without considering** the RHS's which can produce
epsilons
x b $
S 1 1
A 3 2
alpha b
beta 6
So far so good, but when we do consider RHS's that can derive epsilon, we get collisions in the table!
So is this LL(1) or not?

So is this LL(1) or not?
First(A) contains x, and Follow(A) contains x. Since A can derive empty, and there is an intersection between First(A) and Follow(A), it is not LL(1).

I am really sorry, its a blunder on my part.
Actually it doesn't satisfy all the rules of LL(1) grammars
beta-> epsilon | x
hence first(x)^follow(beta) should be disjoint but thats not the case!!
Sorryy!!

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Is the following set convex? - machine-learning

The set of points closer to a given point than a given set, i.e., {x | ||x − x0|| ≤ ||x − y|| for all y ∈ S} where S ⊆ R^n It reminds me of an Euclidean ball, but I don't know how to proceed to check if it's convex or not. ( In this case it's the 2-norm used above ).

The link: https://see.stanford.edu/materials/lsocoee364a/hw1sol.pdf 2.7 provides the answer, where you prove that the set is a halfspace and thus the set is convex.

Related

Why does cubical agda choose the particular two component homogeneous path composition operator it does?

Why thetaX not theta'X in practical?

Can all context free grammars be converted to NFA/DFA?

How to determine time complexity of EM algorithm of probabilistic PCA?

Curious case LL(1) or not?

Categories

Resources

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Is the following set convex? - machine-learning

The set of points closer to a given point than a given set, i.e., {x | ||x − x0|| ≤ ||x − y|| for all y ∈ S} where S ⊆ R^n It reminds me of an Euclidean ball, but I don't know how to proceed to check if it's convex or not. ( In this case it's the 2-norm used above ).

The link: https://see.stanford.edu/materials/lsocoee364a/hw1sol.pdf 2.7 provides the answer, where you prove that the set is a halfspace and thus the set is convex.

Related

Why does cubical agda choose the particular two component homogeneous path composition operator it does?

Why theta*X not theta'*X in practical?

Can all context free grammars be converted to NFA/DFA?

How to determine time complexity of EM algorithm of probabilistic PCA?

Curious case LL(1) or not?

Categories

Resources

Why thetaX not theta'X in practical?