About validating a kernel - machine-learning

I've a test in a few days and I've a few issues with some of the subjects.
Let's start with kernels, basically I understood that a kernel needs to be positive semidefinite and symmetric in order to be valid. Is that enough? For example the following kernel, kernel(x,y) = 2 * k1(x,y) for some k1 which is a valid kernel. Is that valid? My question is how can I distinguish between a valid kernel and a nonvalid kernel if I'm given a kernel in the test ?

There are three requirements to apply Mercer's theorem:
K is continuous
K is symmetric
K is positive semidefinite
If you have these three properties, you have a valid kernel.
For example the following kernel, kernel(x,y) = 2 * k1(x,y) for some k1 which is a valid kernel. Is that valid?
Yes, it is easy to show that given proper kernels K1, K2:
aK1, for any a>0
K1 + K2
are valid kernels, thus you also get that for any a, b > 0 aK1 + bK2 is a valid kernel.
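The "easy to show" step can be written out as a short derivation. This is only a sketch for the positive semidefiniteness part (symmetry and continuity carry over directly); the points x_1, ..., x_n and coefficients c_1, ..., c_n are arbitrary and introduced here just for the argument.

```latex
\sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j \,\bigl(aK_1 + bK_2\bigr)(x_i, x_j)
  = a \underbrace{\sum_{i,j} c_i c_j K_1(x_i, x_j)}_{\ge 0}
  + b \underbrace{\sum_{i,j} c_i c_j K_2(x_i, x_j)}_{\ge 0}
  \;\ge\; 0
  \qquad \text{for } a, b > 0 .
```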
My question is how can I distinguish between a valid kernel and a nonvalid kernel if I'm given a kernel in the test ?
There is no magic way. The problem is really hard for generic functions. Thus on the test I would expect either non-kernels which are easy to falsify (they lack some property of a typical dot product), or valid kernels which can be proved valid either through Mercer's theorem or through construction.
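As a rough illustration of "easy to falsify": one pair of points whose Gram matrix is not positive semidefinite is enough to rule a candidate out. The sketch below uses a made-up candidate, K(x, y) = |x - y|, and a hand-rolled 2x2 test; the helper names are hypothetical, and passing the test for one pair of points proves nothing about validity.

```python
def gram(kernel, points):
    """Gram matrix of `kernel` on `points`, as a list of rows."""
    return [[kernel(a, b) for b in points] for a in points]

def is_psd_2x2(m, tol=1e-12):
    """A symmetric 2x2 matrix is PSD iff both diagonal entries
    and the determinant are non-negative."""
    (a, b), (c, d) = m
    symmetric = abs(b - c) <= tol
    return symmetric and a >= -tol and d >= -tol and a * d - b * c >= -tol

def distance(x, y):
    # candidate that is NOT a kernel: K(x, x) = 0 but K(x, y) > 0
    return abs(x - y)

def linear(x, y):
    # ordinary dot product in one dimension, a valid kernel
    return x * y

points = [1.0, 2.0]
print(is_psd_2x2(gram(distance, points)))  # Gram matrix [[0, 1], [1, 0]], determinant -1
print(is_psd_2x2(gram(linear, points)))    # Gram matrix [[1, 2], [2, 4]], determinant 0
```

The distance candidate fails because a dot product must satisfy K(x, y)^2 <= K(x, x) K(y, y), which is exactly the determinant condition above.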
In particular, another way of proving something is a kernel is by explicitly finding phi, since by definition for every kernel K there exists phi such that
K(x,y) = <phi(x), phi(y)>
so if you can find phi that has this property - you prove that K is a kernel.
For example, let K be a graph kernel defined as K(G1, G2) = number of vertices shared by G1 and G2. It is easy to show that if we take phi(G) = one-hot encoding of the vertices in G, then
K(G1,G2) = <phi(G1), phi(G2)>
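A minimal sketch of this construction, assuming each graph is represented only by its set of vertex labels drawn from a fixed, ordered universe (the labels and helper names are made up for illustration):

```python
def one_hot(vertices, universe):
    """Indicator vector of a vertex set over a fixed, ordered universe."""
    return [1 if v in vertices else 0 for v in universe]

def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def shared_vertices_kernel(g1, g2):
    """K(G1, G2) = number of vertices the two graphs have in common."""
    return len(g1 & g2)

universe = ["a", "b", "c", "d", "e"]
g1 = {"a", "b", "c"}
g2 = {"b", "c", "d", "e"}

phi1 = one_hot(g1, universe)
phi2 = one_hot(g2, universe)
print(shared_vertices_kernel(g1, g2), dot(phi1, phi2))  # the two numbers agree
```

Each coordinate contributes 1 to the dot product exactly when the vertex is in both graphs, which is why the two quantities coincide.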

Z3: express linear algebra properties

I would like to prove properties of expressions involving matrices and vectors (potentially large size, but size is fixed).
For example I want to prove that the outcome of an expression is a diagonal matrix or a triangular matrix, or it is positive definite, ...
To that end I'd like to encode well known properties and identities from linear algebra, such as:
||x + y|| <= ||x|| + ||y||
(A * B) * C = A * (B * C)
det(A * B) = det(A) * det(B)
Tr(zA) = z * Tr(A)
(I + AB) ^ (-1) = I - A(I + BA) ^ (-1) * B
...
I have attempted to implement this in Z3. But even for simple properties it returns unknown or times out. I've tried with array theory and quantifiers.
I'd like to know if this problem can be solved with an SMT solver, or is it not suited for this kind of problem? Could you give a hint by giving a small example?

You can certainly use Z3 to do this.
I have constructed a small example here, which defines the identity matrix and what it means to be a diagonal matrix, and then proves that the identity matrix is diagonal.
So, it is definitely possible to do this kind of work in Z3. Though you may find you have a better time using a tool built on top of Z3 that has more interactive proving features, such as Dafny or F*.

Gradient from non-trainable weights function

I'm trying to implement a self-written loss function. My pipeline is as follows
x -> {constant computation} = x_feature -> machine learning training -> y_feature -> {constant computation} = y_produced
These "constant computations" are necessary to bring out the differences between the desired output and the produced output.
So if I take the L2 norm of the difference between y_produced and y_original, how should I incorporate this loss into the original loss?
Please Note that y_produced has a different dimension than y_feature.

As long as you are using differentiable operations there is no difference between "constant transformations" and "learnable ones". There is no such distinction, look even at the linear layer of a neural net
f(x) = sigmoid( W * x + b )
Is it constant or learnable? W and b are trained, but "sigmoid" is not, yet the gradient flows the same way, no matter if something is a variable or not. In particular the gradient wrt. x is the same for
g(x) = sigmoid( A * x + c )
where A and c are constants.
The only problem you will encounter is using non-differentiable operations, such as argmax, sorting, indexing, sampling etc. These operations do not have a well-defined gradient, so you cannot directly use first-order optimisers with them. As long as you stick with differentiable ones, the problem described does not really exist: there is no difference between "constant transformations" and any other transformations, regardless of changes of size etc.
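To make this concrete, here is a small scalar sketch of the pipeline from the question: a trainable sigmoid layer followed by a fixed, differentiable post-processing step. The post-processing function, the numbers and all names are hypothetical; the real pipeline is vector-valued and would normally rely on a framework's automatic differentiation rather than a hand-written chain rule.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def layer(x, w, b):
    """Trainable part: y_feature = sigmoid(w * x + b)."""
    return sigmoid(w * x + b)

def post(y):
    """Fixed, differentiable 'constant computation' (made up for this sketch)."""
    return 3.0 * y * y

def dpost_dy(y):
    return 6.0 * y

def loss(w, b, x, target):
    """Squared error measured on y_produced, i.e. after the fixed step."""
    return (post(layer(x, w, b)) - target) ** 2

def dloss_dw(w, b, x, target):
    """Chain rule: the gradient passes through post() exactly as through sigmoid()."""
    y_feature = layer(x, w, b)
    y_produced = post(y_feature)
    return 2.0 * (y_produced - target) * dpost_dy(y_feature) * y_feature * (1.0 - y_feature) * x

w, b, x, target = 0.7, -0.2, 1.5, 2.0
eps = 1e-6
numeric = (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2 * eps)
analytic = dloss_dw(w, b, x, target)
print(numeric, analytic)  # the two estimates should agree closely
```

The finite-difference check treats the whole pipeline as a black box, so agreement with the chain-rule value shows that the fixed step needs no special handling.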

Machine Learning: Why xW+b instead of Wx+b?

I started to learn Machine Learning. Now I tried to play around with TensorFlow.
Often I see examples like this:
pred = tf.add(tf.mul(X, W), b)
I also saw such a line in a plain numpy implementation. Why is x*W+b always used instead of W*x+b? Is there an advantage if matrices are multiplied in this way? I see that it is possible (if X, W and b are transposed), but I do not see an advantage. In math class at school we always used Wx+b.
Thank you very much

This is the reason:
By default w is a vector of weights, and in maths a vector is considered a column, not a row.
X is a collection of data: an n x d matrix, where n is the number of samples and d the number of features (upper case X is the full n x d matrix, lower case x is a single sample, a 1 x d matrix).
To multiply both correctly, so that each weight meets its own feature, you must use X*w+b:
With X*w you multiply every feature by its corresponding weight, and by adding b you add the bias term to every prediction.
If you multiply w * X you multiply a (d x 1) matrix by an (n x d) matrix, and the inner dimensions do not match, so it makes no sense.
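The shape argument can be checked with a tiny hand-written matrix product. This is only a sketch: the `matmul` helper and the numbers are made up, w is stored as a d x 1 column, and in real code a library such as numpy would do the multiplication.

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows; refuse mismatched inner dimensions."""
    if len(A[0]) != len(B):
        raise ValueError("shape mismatch: (%d x %d) * (%d x %d)"
                         % (len(A), len(A[0]), len(B), len(B[0])))
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # n x d = 2 samples, 3 features
w = [[0.1], [0.2], [0.3]]      # d x 1 column of weights
b = 0.5

# X * w + b: one prediction per sample, shape n x 1
pred = [[v + b for v in row] for row in matmul(X, w)]
print(pred)

# w * X: (d x 1) * (n x d) has mismatched inner dimensions
try:
    matmul(w, X)
    mismatch = False
except ValueError as err:
    mismatch = True
    print(err)
```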

I was also confused by this. I guess it may be a matter of dimensions. For an n x m matrix W and an n-dimensional vector x, xW+b can easily be viewed as mapping an n-dimensional feature to an m-dimensional feature, i.e. you can think of W as an n-dimension -> m-dimension operation, whereas Wx+b (x must be an m-dimensional vector now) becomes an m-dimension -> n-dimension operation, which looks less comfortable in my opinion. :D

Fast Exact Solvers for Chromatic Number

Finding the chromatic number of a graph is an NP-Hard problem, so there isn't a fast solver 'in theory'. Is there any publicly available software that can compute the exact chromatic number of a graph quickly?
I'm writing a Python script that computes the chromatic number of many graphs, but it is taking too long even for small graphs. I am working with a wide range of graphs that can be sparse or dense but usually have fewer than 10,000 nodes. I formulated the problem as an integer program and passed it to Gurobi to solve. Do you have recommendations for software, different IP formulations, or different Gurobi settings to speed this up?
import networkx as nx
from gurobipy import GRB, Model, quicksum
# create test graph
n = 50
p = 0.5
G = nx.erdos_renyi_graph(n, p)
# compute chromatic number -- ILP solve
m = Model('chrom_num')
# upper bound on the number of colors needed (max degree + 1)
k = max(d for _, d in G.degree()) + 1
# create k binary variables, y_0 ... y_{k-1}, to indicate whether color j is used
y = []
for j in range(k):
    y.append(m.addVar(vtype=GRB.BINARY, name='y_%d' % j, obj=1))
# create n * k binary variables, x_{l,j} that is 1 if node l is colored with j
x = []
for l in range(n):
    x.append([])
    for j in range(k):
        x[-1].append(m.addVar(vtype=GRB.BINARY, name='x_%d_%d' % (l, j), obj=0))
# objective is to minimize colors used --> sum of y_0 ... y_{k-1}
# (coefficients were set through obj= above, so only the sense is set here)
m.ModelSense = GRB.MINIMIZE
m.update()
# add constraint -- each node gets exactly one color (sum of colors used is 1)
for u in range(n):
    m.addConstr(quicksum(x[u]) == 1, name='NC_%d' % u)
# add constraint -- keep track of colors used (y_j is set high if color j is used at any node)
for u in range(n):
    for j in range(k):
        m.addConstr(x[u][j] <= y[j], name='SH_%d_%d' % (u, j))
# add constraint -- adjacent nodes have different colors
for u in range(n):
    for v in G[u]:
        if v > u:
            for j in range(k):
                m.addConstr(x[u][j] + x[v][j] <= 1, name='ADJ_%d_%d_COL_%d' % (u, v, j))
# update model, solve, return the chromatic number
m.update()
m.optimize()
chrom_num = int(round(m.ObjVal))
I am looking to compute exact chromatic numbers although I would be interested in algorithms that compute approximate chromatic numbers if they have reasonable theoretical guarantees such as constant factor approximation, etc.

You might want to try a SAT solver or a Max-SAT solver. I expect that they will work better than a reduction to an integer program, since I think colorability is closer to satisfiability.
SAT solvers receive a propositional Boolean formula in Conjunctive Normal Form and output whether the formula is satisfiable. The following problem COL_k is in NP:
Input: Graph G and natural number k.
Output: Whether G is k-colorable.
To solve COL_k you encode it as a propositional Boolean formula with one propositional variable for each pair (u,c) consisting of a vertex u and a color 1<=c<=k. You need to write clauses which ensure that every vertex is colored by at least one color. You also need clauses to ensure that each edge is proper, i.e. its two endpoints do not share a color.
Then you just do a binary search to find the value of k such that G is k-colorable but not (k-1)-colorable.
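The encoding described above might be sketched as follows. The variable numbering, the function names and the brute-force satisfiability check are all assumptions made for illustration. The brute-force check only stands in for a real SAT solver and is usable on tiny graphs only. For brevity the search goes upward from k = 1 instead of using binary search; the first satisfiable k is the chromatic number either way.

```python
from itertools import product

def var(u, c, k):
    """1-based DIMACS variable number for 'vertex u has color c'."""
    return u * k + c + 1

def coloring_cnf(n, edges, k):
    """Clauses (lists of signed ints) that are satisfiable iff the graph is k-colorable."""
    clauses = []
    # every vertex is colored by at least one color
    for u in range(n):
        clauses.append([var(u, c, k) for c in range(k)])
    # every edge is proper: its endpoints never share a color
    for u, v in edges:
        for c in range(k):
            clauses.append([-var(u, c, k), -var(v, c, k)])
    return clauses

def to_dimacs(num_vars, clauses):
    """Render clauses in the DIMACS CNF text format read by SAT solvers."""
    lines = ["p cnf %d %d" % (num_vars, len(clauses))]
    lines += [" ".join(map(str, cl)) + " 0" for cl in clauses]
    return "\n".join(lines)

def brute_force_sat(num_vars, clauses):
    """Exponential stand-in for a SAT solver; only for tiny instances."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in cl) for cl in clauses):
            return True
    return False

def chromatic_number(n, edges):
    """Smallest k for which the k-coloring formula is satisfiable."""
    for k in range(1, n + 1):
        if brute_force_sat(n * k, coloring_cnf(n, edges, k)):
            return k
    return 0  # empty graph

triangle = [(0, 1), (1, 2), (0, 2)]
five_cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(chromatic_number(3, triangle))
print(chromatic_number(5, five_cycle))
print(to_dimacs(9, coloring_cnf(3, triangle, 3)).splitlines()[0])
```

With a real solver, the string from `to_dimacs` would be written to a file and passed to the solver in place of `brute_force_sat`.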
There are various free SAT solvers. I have used Lingeling successfully, but you can find many others on the SAT competition website. They all use the same input and output format. Google "MiniSAT User Guide: How to use the MiniSAT SAT Solver" for an explanation on this format.
You can also use a Max-SAT solver, again consult the Max-SAT competition website. They can solve the Partial Max-SAT problem, in which clauses are partitioned into hard clauses and soft clauses. Here, the solver finds the maximal number of soft clauses which can be satisfied while also satisfying all of the hard clauses, see the input format in the Max-SAT competition website (under rules->details).
You can formulate the chromatic number problem as one Max-SAT problem (as opposed to several SAT problems as above). In this sense, Max-SAT is a better fit. On the other hand, I have the impression that SAT solvers generally perform better than Max-SAT solvers. I don't have any experience with this kind of solver, so cannot say anything more.

Practicing Kernel trick in SVM

I am reading the theory of SVM. In kernel trick, what I understand is, if we have a data which is not linear separable in the original dimensions n, we use the kernel to map the data to a higher space to be linear separable (we have to choose the right kernel depending on the data set, etc). However, when I watched this video of Andrew ng Kernel SVM, What I understand is we can map original data into a smaller space which make me confused!? Any explanation.
Could you explain me how does RBF kernel work to map each original data sample x1(x11,x12,x13,....,x1n) to a higher space (with dimensions m) to be X1(X11,X12,X13,...,X1m) with a concrete example. Also, what I understand is the kernel compute the inner product of the transformed data (so there is an other transformation before the RBF, which means that RBF transform implicitly the data to a higher space but How?).
other thing: the kernel is a function k(x,x1):(R^n)^2->R =g(x).g(x1), with g is a transformation function, how to define g in the case of RBF kernel?
Suppose that we are in the test set, What I understand is x is the sample to be classified and x1 is the support vector (because only the support vectors will be used to calculate the hyperplane). in the case of RBF
k(x,x1)=exp(-(x-x1)^2/2sigma), so where is the transformation?
Last question: Admit that the RBF do the mapping to a higher dimension m, it is possible to show this m? I want to see the theoretical reality.
I want to implement SVM with RBF kernel. What is the m here and how to choose it? How to implement kernel trick in practice?

Could you explain me how does RBF kernel work to map each original data sample x1(x11,x12,x13,....,x1n) to a higher space (with dimensions m) to be X1(X11,X12,X13,...,X1m) with a concrete example. Also, what I understand is the kernel compute the inner product of the transformed data (so there is an other transformation before the RBF, which means that RBF transform implicitly the data to a higher space but How?).
Exactly as you said: the kernel is an inner product in the projected space, not the projection itself. The whole trick is that you never transform your data, because it is computationally too expensive to do so.
other thing: the kernel is a function k(x,x1):(R^n)^2->R =g(x).g(x1), with g is a transformation function, how to define g in the case of RBF kernel?
For the RBF kernel, g is actually a mapping from R^n into a space of functions (L2, the square-integrable functions), and each point is mapped to an unnormalized Gaussian with mean x and variance sigma^2/2. Thus (up to some normalizing constant A that we will drop)
g(x) = N(x, sigma^2/2)[z] / A # notice this is not a number but a function of z!
and now the inner product in the space of functions is the integral of the product over the whole domain, thus
K(x, y) = <g(x), g(y)>
= INT_{R^n} N(x, sigma^2/2)[z] N(y, sigma^2/2)[z] / A^2 dz
= B exp(-||x-y||^2 / (2*sigma^2))
where B is some constant factor (normalization) depending solely on sigma^2, thus we can drop it (as scaling does not really matter here) for computational simplicity. Each Gaussian has variance sigma^2/2 because integrating the product of two Gaussians adds their variances, which gives the sigma^2 in the kernel.
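The integral can be checked numerically in one dimension. The sketch below assumes each point is mapped to an unnormalized Gaussian bump with variance sigma^2/2 (this is what makes the 2*sigma^2 denominator come out) and approximates the integral with a plain Riemann sum. The function names, integration range and step count are arbitrary choices for illustration.

```python
import math

def bump(z, center, sigma):
    """Unnormalized Gaussian in z with mean `center` and variance sigma^2 / 2."""
    return math.exp(-(z - center) ** 2 / sigma ** 2)

def inner_product(x, y, sigma, lo=-20.0, hi=20.0, steps=40000):
    """Riemann-sum approximation of INT bump(z; x) * bump(z; y) dz over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(bump(lo + i * h, x, sigma) * bump(lo + i * h, y, sigma)
                   for i in range(steps + 1))

def rbf(x, y, sigma):
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2))

sigma, x, y = 1.3, 0.4, -1.1
# dividing by <g(x), g(x)> removes the constant factor B
ratio = inner_product(x, y, sigma) / inner_product(x, x, sigma)
print(ratio, rbf(x, y, sigma))  # the two values should agree closely
```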
Suppose that we are in the test set, What I understand is x is the sample to be classified and x1 is the support vector (because only the support vectors will be used to calculate the hyperplane). in the case of RBF k(x,x1)=exp(-(x-x1)^2/2sigma), so where is the transformation?
As said before, the transformation is never used explicitly. You simply show that the inner product of your hyperplane with the transformed point can again be expressed through inner products with the support vectors, so you never transform anything and just use kernels:
<w, g(x)> = < SUM_{i=1}^N alpha_i y_i g(sv_i), g(x)>
= SUM_{i=1}^N alpha_i y_i <g(sv_i), g(x)>
= SUM_{i=1}^N alpha_i y_i K(sv_i, x)
where sv_i is i'th support vector, alpha_i is the per-sample weight (Lagrange multiplier) found during the optimization process and y_i is label of i'th support vector.
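A small sketch of this identity, using made-up support vectors, multipliers and labels rather than the output of a trained SVM. With the linear kernel g is the identity, so w can be formed explicitly and compared with the kernel sum. With the RBF kernel only the kernel sum is available. The bias term is left out, as in the formula above.

```python
import math

def linear(a, b):
    """Linear kernel: the plain inner product, so g(x) = x."""
    return sum(p * q for p, q in zip(a, b))

def rbf(a, b, sigma=1.0):
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def decision(x, svs, alphas, labels, kernel):
    """SUM_i alpha_i * y_i * K(sv_i, x); no explicit feature map is needed."""
    return sum(a * y * kernel(sv, x) for sv, a, y in zip(svs, alphas, labels))

# hypothetical values, chosen only for illustration
svs = [(1.0, 2.0), (-1.0, 0.5), (0.0, -1.0)]
alphas = [0.5, 0.8, 0.3]
labels = [1, -1, 1]
x = (0.3, 0.7)

# linear case: build w = SUM_i alpha_i * y_i * sv_i and take <w, x> directly
w = [sum(a * y * sv[d] for sv, a, y in zip(svs, alphas, labels)) for d in range(2)]
explicit = linear(w, x)
via_kernel = decision(x, svs, alphas, labels, linear)
print(explicit, via_kernel)  # same value, computed two ways

# RBF case: w lives in a function space, so only the kernel form can be evaluated
score_rbf = decision(x, svs, alphas, labels, rbf)
print(score_rbf)
```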
Last question: Admit that the RBF do the mapping to a higher dimension m, it is possible to show this m? I want to see the theoretical reality.
In this case m is infinite: your new space is a space of functions R^n -> R, so a single vector (function) is specified by a continuum of values (as many as there are real numbers), one for each possible input from R^n (it is a simple set theory result that R^n for any positive n has the size of the continuum). Thus in terms of pure mathematics, m = |R|, and in set theory this is the so-called Beth_1 (https://en.wikipedia.org/wiki/Beth_number).
I want to implement SVM with RBF kernel. What is the m here and how to choose it? How to implement kernel trick in practice?
You do not choose m, it is defined by the kernel itself. Implementing the kernel trick in practice requires expressing all your optimization routines in a form where training points are used solely inside inner products, and then replacing those inner products with kernel calls. This is way too complex to describe in SO form.
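A full kernel SVM solver does not fit here, but the "replace inner products with kernel calls" idea can be shown on a much simpler algorithm. The sketch below is a kernel perceptron, not an SVM. It is swapped in only because its training loop is a few lines long, and the XOR-style data and function names are made up for illustration.

```python
import math

def rbf(a, b, sigma=1.0):
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def train_kernel_perceptron(X, y, kernel, max_epochs=100):
    """Perceptron in dual form: alpha[i] counts the mistakes made on sample i.
    Training points appear only inside kernel(...) calls."""
    alpha = [0] * len(X)
    for _ in range(max_epochs):
        mistakes = 0
        for j in range(len(X)):
            score = sum(alpha[i] * y[i] * kernel(X[i], X[j]) for i in range(len(X)))
            if y[j] * score <= 0:
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(x, X, y, alpha, kernel):
    score = sum(a * yi * kernel(xi, x) for xi, yi, a in zip(X, y, alpha))
    return 1 if score > 0 else -1

# XOR-like data: not linearly separable in the original two dimensions
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y = [-1, -1, 1, 1]

alpha = train_kernel_perceptron(X, y, rbf)
predictions = [predict(x, X, y, alpha, rbf) for x in X]
print(alpha, predictions)
```

Swapping `rbf` for another kernel function changes the implicit feature space without touching the training loop, which is the practical meaning of the kernel trick.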
