Intuition behind the KKT conditions

What is the physical significance of the KKT conditions in the Lagrangian dual formulation when solving the SVM primal optimization problem?

Mathematically, the KKT conditions are the conditions that an optimal solution must satisfy when strong duality holds, meaning the solution to the dual problem attains the same value as the solution of the primal problem.
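For reference, for the standard soft-margin SVM primal (the textbook formulation; the question doesn't state one), the KKT conditions read:

```latex
% KKT conditions for the soft-margin SVM primal:
%   min_{w,b,xi}  (1/2)||w||^2 + C sum_i xi_i
%   s.t.          y_i (w^T x_i + b) >= 1 - xi_i,   xi_i >= 0
\begin{aligned}
&\text{stationarity:} && w = \textstyle\sum_i \alpha_i y_i x_i,\quad
  \textstyle\sum_i \alpha_i y_i = 0,\quad \alpha_i + \mu_i = C\\
&\text{primal feasibility:} && y_i(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0\\
&\text{dual feasibility:} && \alpha_i \ge 0,\quad \mu_i \ge 0\\
&\text{complementary slackness:} && \alpha_i\bigl[y_i(w^\top x_i + b) - 1 + \xi_i\bigr] = 0,\quad \mu_i\,\xi_i = 0
\end{aligned}
```

The complementary slackness line is the one driving the intuition below: if a point lies strictly outside the margin, the bracket is nonzero, so its α_i must be 0.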
Conceptually, when you run an SVM solver, the KKT conditions (complementary slackness in particular) are the reason the multipliers α_i of all non-boundary points end up going to 0, and consequently the support vectors are the only points with α_i > 0.
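A quick way to see this empirically (a minimal sketch using scikit-learn and a synthetic dataset, neither of which is mentioned in the question):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two separable blobs; fit a linear soft-margin SVM.
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Complementary slackness in action: only the support vectors carry
# nonzero multipliers; every other alpha_i has been driven to 0.
print(f"{len(clf.support_)} of {len(X)} points have alpha > 0")
print(clf.dual_coef_)  # y_i * alpha_i, stored for the support vectors only
```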


Does Drake guarantee optimizations will succeed if the constraints are satisfied by the initial guess?

If I create a mathematical program in Drake, with some constraints and some costs, and give it an initial guess that satisfies all the constraints, are all Drake solvers guaranteed to find a solution? And is that solution guaranteed to have a cost less than or equal to the cost of the initial guess?
No, none of Drake's supported solvers guarantee these.
I suppose you have a nonlinear, non-convex optimization problem. For these problems Drake supports the SNOPT, IPOPT and NLopt solvers. None of these solvers guarantees constraint satisfaction during the iterative optimization process; temporarily violating the constraints enables the solver to find possibly better solutions. For example, if you have the problem
min x
s.t. x² = 1
and you start from the initial guess x = 1, you would probably hope the solver finds the better solution x = -1; but to jump from x = 1 to x = -1, the solver has to go through intermediate values (like x = 0.5, 0, ...) which violate the constraint.
Nor is the final cost guaranteed to be less than or equal to the cost of the initial guess; the solver could end up with a worse solution. In non-convex optimization there isn't much we can guarantee. If you find the solver returns a worse solution than your initial guess, you can just use the initial guess.
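To make the toy example concrete, here is a minimal sketch using Drake's Python bindings (assuming the `pydrake.solvers` module layout; exact import paths vary across Drake versions):

```python
import numpy as np
from pydrake.solvers import MathematicalProgram, Solve

prog = MathematicalProgram()
x = prog.NewContinuousVariables(1, "x")
prog.AddCost(x[0])                  # min x
prog.AddConstraint(x[0] ** 2 == 1)  # s.t. x^2 = 1

# Start from the feasible guess x = 1. A local solver may simply stay
# there: reaching the better solution x = -1 would require passing
# through infeasible intermediate values like x = 0.5 or x = 0.
result = Solve(prog, np.array([1.0]))
print(result.is_success(), result.GetSolution(x))
```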

How to leverage Z3 SMT solver for ILP problems

Problem
I'm trying to use z3 to disprove reachability assertions on a Petri net.
So I declare N state variables v_0, ..., v_{N-1}, which are non-negative integers, one for each place of the Petri net.
My main strategy, given an atomic proposition P on states, is the following:
1. Compute (with an exterior engine) any "easy" positive invariants as linear constraints on the variables, of the form alpha_0 * v_0 + ... = constant with only positive or zero alpha_i. Then check_sat whether any state reachable under these constraints satisfies P; if unsat, conclude. Else:
2. Compute (externally to z3) generalized invariants, where the alpha_i can be negative as well, and check_sat again; conclude if unsat. Else:
3. Add one non-negative variable t_i per transition of the system, and assert the Petri net state equation: any reachable state has a Parikh firing count vector (a value of the t_i's) such that M0, the initial state, plus the product of this Parikh vector by the incidence matrix gives the reached state. This step introduces many new variables, and involves multiplying variables by incidence-matrix constants, but it stays a linear integer programming problem (a minimal encoding sketch follows this list).
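As announced above, here is a minimal z3py sketch of step 3 on a hypothetical two-place, two-transition net (the net, the incidence matrix, and the property P are illustrative, not taken from my actual models):

```python
from z3 import Int, SolverFor

C = [[1, -1],   # incidence matrix C[p][i]: effect of transition i on place p
     [-1, 1]]
M0 = [1, 0]     # initial marking

v = [Int(f"v_{p}") for p in range(2)]  # one state variable per place
t = [Int(f"t_{i}") for i in range(2)]  # one firing count per transition

s = SolverFor("QF_LIA")
for ti in t:
    s.add(ti >= 0)
for p in range(2):
    s.add(v[p] >= 0)
    # state equation: reached marking = M0 + C * Parikh vector
    s.add(v[p] == M0[p] + sum(C[p][i] * t[i] for i in range(2)))

s.add(v[1] >= 2)  # the atomic proposition P we want to disprove
print(s.check())  # unsat here proves P unreachable in this over-approximation
```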
I separate the steps because, since I want UNSAT, any check_sat that returns unsat stops the procedure; the last step in particular is very costly.
I have issues with larger models, where I get prohibitively long answer times or even the dreaded "unknown" answer, particularly when adding the state equation (step 3).
Background
So, besides splitting the problem into incrementally harder segments, I've tried setting the logic to QF_LRA rather than QF_LIA, and declaring the variables as Real rather than Int.
This overapproximation is computationally friendly (z3 is fast on these!), but unfortunately for many models the solutions found are not integers, nor does an integer solution exist.
So I've tried keeping Reals but specifying that each variable is either = 0 or >= 1, to remove solutions with fractions of firings below 1. This does eliminate spurious solutions, but it "kills" z3 (timeout or unknown) in many cases; the problem is apparently much harder (even harder than with plain integers).
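For concreteness, that "= 0 or >= 1" encoding looks like this in z3py (the variable names are illustrative):

```python
from z3 import Real, Or, SolverFor

t = [Real(f"t_{i}") for i in range(3)]
s = SolverFor("QF_LRA")
for ti in t:
    # each firing count is either exactly 0 or at least 1; the
    # disjunction is what makes this much harder than plain QF_LRA
    s.add(Or(ti == 0, ti >= 1))
print(s.check())  # trivially sat on this toy instance, but scales badly
```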
Examples
I don't have a small example to show, though I can produce some easily. The problem is that if I go for QF_LIA, it gets prohibitively slow beyond a certain number of variables. As a point of reference: there are many more transitions than places, so adding the state equation really ups the variable count.
This code generates the examples I'm asking about.
Slides 5 and 6 of this general presentation express precisely the problem I'm encoding, and slides 7 and 8 develop what an "unsat" result gives us, if you want more mathematical background.
I'm generating problems from the Model Checking Contest, with up to thousands of places (primary variables) and in some cases more than a hundred thousand transitions. Those are the extremes; the middle range that I would really like to handle is a few thousand places and maybe 20 thousand transitions.
Reals plus the greater-than-or-equal-to-1 constraint is not a good solution even for some smaller problems, and Integers are slow from the get-go.
I could try Reals, then iterate into Integers if I get a non-integral solution. I have not tried that; though it involves pretty much killing and restarting the solver, it might be a decent approach on my benchmark set.
What I'm looking for
I'm looking for Z3 settings that can help it better deal with the problems I'm feeding it, some way to give it insight into their structure.
I have an a priori idea about what could solve these problems: traditionally they've been fed to ILP solvers. So I'm hoping to trigger a simplex of some sort, but maybe there are conditions preventing z3 from using the "good" solution strategy in some cases.
I've become a decent-level SMT/Z3 user, but I've never played with the fine-grained :options settings to guide the solver.
Have any of you tried feeding what are basically ILP problems to SMT, and found option settings or particular encodings that help it deploy the right strategies? Thanks.

Is the Bias unit in neural networks always one?

I have been studying neural networks for a couple of weeks and noticed that all guides and documentation either never mention the bias unit or always assume it to be 1.
Is there any reason or case where we would want a bias unit not to be 1? Or to have it as an adjustable parameter in the network?
Edit: I'm sorry, I'm new to Stack Overflow and found similar questions here, so I thought this was a good place to ask. Thank you for correcting me.
Edit: When people refer to the bias, they are in most cases referring to the bias weight (the weight attached to the constant bias unit).
The bias unit is also the reason we get the equation for the bias Δb in back-propagation as:
Δb = ΔY * 1 (the "* 1" is normally left out, as it has no effect on the equation)
Hope that clears things up.
This question is better suited for Cross Validated or maybe Data Science (it is not about code at all).
I think you have a misunderstanding: the bias term is a trainable parameter that is learned and updated during training, just like the weights.
I think I know the source of your confusion (correct me if I'm wrong): in many places, the bias term is incorporated into the input vector x as a constant 1 element.
So if we have the input x = (x_1, ..., x_n), the output of some linear operation can be written as:

y = w_1*x_1 + ... + w_n*x_n + b

where the trained parameters are the weights w_1, ..., w_n and the bias b. But it can also be written in the following way:

y = w_0*x_0 + w_1*x_1 + ... + w_n*x_n, with x_0 = 1 and w_0 = b

But, despite the fact that we have the constant 1 in the input, since w_0 = b is still one of the trainable parameters, the bias can still be anything.
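A tiny numpy sketch of the two equivalent formulations (the numbers are made up for illustration):

```python
import numpy as np

x = np.array([2.0, 3.0])   # input
w = np.array([0.5, -1.0])  # trained weights
b = 0.25                   # trained bias: it can be any value

y1 = w @ x + b             # bias written as a separate term

x_aug = np.concatenate(([1.0], x))  # prepend the constant bias unit x_0 = 1
w_aug = np.concatenate(([b], w))    # the bias becomes the weight w_0
y2 = w_aug @ x_aug                  # identical result

assert np.isclose(y1, y2)  # the unit is fixed at 1; the bias is learned
```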

Genetic algorithm - shortest path in weighted graph

I want to make a genetic algorithm that solves a shortest-path problem in a weighted, connected graph. It is similar to the travelling salesman problem, but instead of a fully-connected graph, mine is just connected.
My idea is to randomly generate, for each chromosome, a path consisting of n-1 nodes in binary form, where the numbers indicate nodes in the path. Then I will choose the best depending on the sum of weights (if I can't go from A to B, I would give it a penalty) and crossover/mutate bits in it. Will it work? It feels a little like a smaller version of brute force. Is there a better way?
Thanks!
A genetic algorithm is pretty much a "smaller version of brute force". It is just a metaheuristic, not an optimization method with decent convergence guarantees. It basically depends on randomness to provide new solutions, so it is a "slightly better random search".
So "will it work"? Yes, it will do something, as long as you have enough randomness in mutation it will even (eventually) converge to optimum. Will it work better than a random search? Hard to say, this depends on dozens of factors, not only your encoding, but also all the hyperparameters used etc. in general genetic algorithms are about trials and errors. In particular representation of chromosomes which does not loose any information (yours does not) does not matter, meaning that everything depends on clever implementation of crossover and mutation (as long as chromosomes do not loose any information they are all equivalent).
You can use a permutation-coding GA. In permutation coding, you give the start and end points, and the GA searches for the best chromosome according to your fitness function. Candidate solutions (chromosomes) will look like 2-5-4-3-1, 2-3-1-4-5, 1-2-5-4-3, etc. So your solution depends on your fitness function. (Look at the GA package for R to apply a permutation GA easily.)
Connections are constraints for your problem. My best advice is to create a constraint matrix like this:
FirstPoint SecondPoint Connected
A B true
A C true
A E false
... ... ...
In the standard TSP, only distances are considered. In your fitness function, you have to consult this matrix and add a penalty to the returned value for each false.
Example chromosome: A-B-E-D-C
A-B: 1
B-E: 1
E-D: 4
D-C: 3
Fitness value: 9
Example chromosome: A-E-B-C-D
A-E: penalty
E-B: 1
B-C: 6
C-D: 3
Fitness value: 10 + penalty value.
Because your constraint is a hard constraint, you can use the maximum integer value as the penalty. The GA will find the best solution. :)
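A minimal sketch of the penalized fitness described above (the dictionary-based `weights` representation and the finite `PENALTY` constant are my own illustrative choices; in practice a large finite penalty is safer than the literal max integer value, which can overflow when costs are summed):

```python
# a large finite penalty; big enough to dominate any real path cost
PENALTY = 10**6

def fitness(path, weights):
    """weights[(a, b)] is the edge cost; a missing key means 'not connected'."""
    return sum(weights.get(edge, PENALTY) for edge in zip(path, path[1:]))

weights = {("A", "B"): 1, ("B", "E"): 1, ("E", "D"): 4, ("D", "C"): 3,
           ("E", "B"): 1, ("B", "C"): 6, ("C", "D"): 3}

print(fitness(["A", "B", "E", "D", "C"], weights))  # 9
print(fitness(["A", "E", "B", "C", "D"], weights))  # 10 + penalty
```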

Why is inference in Markov Random Fields hard?

I'm studying Markov Random Fields and, apparently, inference in MRFs is hard / computationally expensive. Specifically, Kevin Murphy's book Machine Learning: A Probabilistic Perspective says the following:
"In the first term, we fix y to its observed values; this is sometimes called the clamped term. In the second term, y is free; this is sometimes called the unclamped term or contrastive term. Note that computing the unclamped term requires inference in the model, and this must be done once per gradient step. This makes training undirected graphical models harder than training directed graphical models."
Why are we performing inference here? I understand that we're summing over all y's, which seems expensive, but I don't see where we're actually estimating any parameters. Wikipedia also talks about inference, but only about calculating the conditional distribution and needing to sum over all non-specified nodes... but that's not what we're doing here, is it?
Alternatively, does anyone have good intuition on why inference in MRFs is difficult?
Sources:
Chapter 19 of ML:PP: https://www.cs.ubc.ca/~murphyk/MLbook/pml-print3-ch19.pdf
(the specific passage is quoted above)
When training your CRF, you want to estimate your parameters, \theta.
In order to do this, you can differentiate your loss function (Equation 19.38) with respect to \theta, set it to 0, and solve for \theta.
You can't analytically solve the equation for \theta this way, though. You can, however, minimise Equation 19.38 by gradient descent. Since the loss function is convex, gradient descent is guaranteed to find the globally optimal solution when it converges.
Equation 19.41 is the actual gradient you need to compute in order to do gradient descent. The first term is easy (and computationally cheap) to compute, because you are summing over the observed values of y: that is the clamped term. The second term is where inference comes in: you are no longer summing over the observed value of y, but taking an expectation of the feature/potential values over all possible configurations of y under the model's current distribution, which requires computing marginals (equivalently, the partition function). That expectation must be recomputed at every gradient step, and this is what makes training undirected models expensive.
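To see concretely why the unclamped term is expensive, here is a small sketch (my own toy model, not Murphy's code): the gradient of a log-linear MRF's log-likelihood is the clamped feature vector minus the expected feature vector under the model, and computing that expectation by brute force means enumerating all 2^n configurations at every gradient step:

```python
import itertools
import numpy as np

def features(y):
    # toy pairwise features on a chain of binary variables:
    # 1 if neighbouring variables agree, 0 otherwise
    return np.array([float(a == b) for a, b in zip(y, y[1:])])

def gradient(theta, y_observed, n):
    # unclamped term: expectation of the features under p(y | theta),
    # computed here by brute-force enumeration over all 2**n states.
    # This enumeration is exactly the inference that must be redone
    # at every gradient step.
    configs = list(itertools.product([0, 1], repeat=n))
    scores = np.array([theta @ features(y) for y in configs])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()  # normalising constant = partition function
    expected = sum(p * features(y) for p, y in zip(probs, configs))
    return features(y_observed) - expected  # clamped minus unclamped

theta = np.zeros(3)
print(gradient(theta, (1, 1, 0, 0), n=4))  # one gradient step's worth of work
```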
