High amount of polynomials in many variables - z3

I have somewhere between 20 and 200 polynomials in about 100 to 200 variables. All of them have a form similar to this one:
x(6)(1)(1)*y(1)(1)^2+x(6)(2)(1)*y(1)(1)*y(2)(1)+x(6)(2)(2)*y(2)(1)^2+x(6)(3)(1)*y(1)(1)*y(3)(1)+x(6)(3)(2)*y(2)(1)*y(3)(1)+x(6)(3)(3)*y(3)(1)^2+x(6)(4)(1)*y(1)(1)*y(4)(1)+x(6)(4)(2)*y(2)(1)*y(4)(1)+x(6)(4)(3)*y(3)(1)*y(4)(1)+x(6)(4)(4)*y(4)(1)^2+x(6)(5)(1)*y(1)(1)*y(5)(1)+x(6)(5)(2)*y(2)(1)*y(5)(1)+x(6)(5)(3)*y(3)(1)*y(5)(1)+x(6)(5)(4)*y(4)(1)*y(5)(1)+x(6)(5)(5)*y(5)(1)^2
This is output from Singular. The brackets are just indices for the variables, so this is a degree 3 polynomial in about 20 variables, and all coefficients are +1 or -1.
Can Z3 solve the following problem in a reasonable time, or is it not even worth trying Z3 here?
Is there a real assignment x such that 50 such polynomials are zero at x and 50 others are non-zero at x?
Thanks in advance

Impossible to say without trying. Z3 has a decision procedure for nonlinear real arithmetic, so in theory, yes, it can answer these questions. But how quick it will be is anybody's guess. The community would appreciate it if you actually tried and reported what you find!
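To actually try it, the polynomials have to be handed to Z3 in some form. Below is a minimal sketch, not the poster's setup, that writes such a problem as an SMT-LIB script in the QF_NRA logic using only the Python standard library. The monomial representation (a sign plus a list of variable names) and the flattened names like x_6_1_1 are assumptions made for the illustration.

```python
def monomial(coeff, variables):
    """Render coeff * v1 * v2 * ... as an SMT-LIB term; coeff is +1 or -1."""
    body = variables[0] if len(variables) == 1 else "(* " + " ".join(variables) + ")"
    return body if coeff == 1 else f"(- {body})"


def polynomial(monomials):
    """Render a sum of monomials as an SMT-LIB term."""
    terms = [monomial(c, vs) for c, vs in monomials]
    return terms[0] if len(terms) == 1 else "(+ " + " ".join(terms) + ")"


def smt_script(must_vanish, must_not_vanish):
    """Build a QF_NRA script: some polynomials = 0, others != 0."""
    names = sorted({v for p in must_vanish + must_not_vanish
                    for _, vs in p for v in vs})
    lines = ["(set-logic QF_NRA)"]
    lines += [f"(declare-const {n} Real)" for n in names]
    lines += [f"(assert (= {polynomial(p)} 0.0))" for p in must_vanish]
    lines += [f"(assert (not (= {polynomial(p)} 0.0)))" for p in must_not_vanish]
    lines.append("(check-sat)")
    return "\n".join(lines)


# Two tiny stand-in polynomials (hypothetical, much shorter than the real ones).
p = [(1, ["x_6_1_1", "y_1_1", "y_1_1"]), (1, ["x_6_2_1", "y_1_1", "y_2_1"])]
q = [(-1, ["x_6_2_2", "y_2_1", "y_2_1"])]
script = smt_script([p], [q])
print(script)
```

The printed text can be saved to a file and passed to the z3 command-line tool.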

Now I did just try it. The problem where only 80 polynomials have to vanish and none has to be nonzero works really well: it takes about half an hour. This is most likely because there are "simple" answers in that case, meaning lots of zeros.
But as soon as I add one polynomial and require it to be nonzero, it becomes much worse. After 1 day there is still no result. Since I do not know whether there even is a positive answer, Z3 probably has to try everything, so this was to be expected, I think.
One more question: assuming there is no answer to the problem, meaning Z3 eventually outputs "unsat", is there any way Z3 can report some kind of progress during its search, so that I can at least get some kind of worst-case time estimate?


How to leverage Z3 SMT solver for ILP problems

Problem
I'm trying to use z3 to disprove reachability assertions on a Petri net.
So I declare N state variables v_0, ..., v_(N-1), which are positive integers, one for each place of a Petri net.
My main strategy given an atomic proposition P on states is the following :
1. Compute (with an external engine) any "easy" positive invariants as linear constraints on the variables, of the form alpha_0 * v_0 + ... = constant with only positive or zero alpha_i, then check_sat whether any state satisfying these constraints also satisfies P. If unsat, conclude; else
2. compute (externally to z3) generalized invariants, where the alpha_i can be negative as well, and check_sat. Conclude if unsat; else
3. add one positive variable t_i per transition of the system, and assert the Petri net state equation: any reachable state has a Parikh firing count vector (a valuation of the t_i) such that M0, the initial state, plus the product of the incidence matrix by this Parikh vector gives the reached state. This step introduces many new variables and involves multiplying variables by constant coefficients, but it stays a linear integer programming problem.
I separate the steps because since I want UNSAT, any check_sat that returns UNSAT stops the procedure, and the last step in particular is very costly.
I have issues with larger models, where I get prohibitively long answer times or even the dreaded "unknown" answer, particularly when adding state equation (step 3).
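For readers unfamiliar with the state equation in step 3, here is a small sketch of how it could be written out as a QF_LIA script. It is an illustration under assumptions, not the encoding generated by the poster's tool: the function name, the incidence-matrix layout (rows are places, columns are transitions), and passing P as a ready-made SMT-LIB string are all choices made for this example. Variables are constrained to be >= 0 here.

```python
def num(k):
    """SMT-LIB has no negative numerals, so -3 is written (- 3)."""
    return str(k) if k >= 0 else f"(- {-k})"


def state_equation_smt(m0, incidence, prop):
    """Assert v = m0 + C * t over non-negative integers, plus a property P.

    incidence[p][t] is the net token change in place p when transition t fires.
    prop is an SMT-LIB formula over v0, v1, ... (hypothetical input format).
    """
    n_places, n_trans = len(incidence), len(incidence[0])
    lines = ["(set-logic QF_LIA)"]
    for t in range(n_trans):
        lines += [f"(declare-const t{t} Int)", f"(assert (>= t{t} 0))"]
    for p in range(n_places):
        lines += [f"(declare-const v{p} Int)", f"(assert (>= v{p} 0))"]
    for p, row in enumerate(incidence):
        terms = [f"(* {num(c)} t{t})" for t, c in enumerate(row) if c != 0]
        rhs = num(m0[p]) if not terms else "(+ " + " ".join([num(m0[p])] + terms) + ")"
        lines.append(f"(assert (= v{p} {rhs}))")
    lines += [f"(assert {prop})", "(check-sat)"]
    return "\n".join(lines)


# One transition moving a token from place 0 to place 1, starting with one token.
script = state_equation_smt(m0=[1, 0], incidence=[[-1], [1]], prop="(>= v1 2)")
print(script)
```

In this toy net, v0 = 1 - t0 and v0 >= 0 force t0 <= 1, so v1 = t0 can never reach 2, and the constraints have no integer solution.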
Background
So besides splitting the problem into incrementally harder segments, I've tried setting the logic to QF_LRA rather than QF_LIA, and declaring the variables as Reals rather than integers.
This overapproximation is computationally friendly (z3 is fast on these !) but unfortunately for many models the solutions are not integers, nor is there an integer solution.
So I've tried setting Reals, but specifying that each variable is either =0 or >=1, to remove solutions with fractions of firings < 1. This does eliminate spurious solutions, but it "kills" z3 (timeout or unknown) in many cases, the problem is obviously much harder (e.g. harder than with just integers).
Examples
I don't have a small example to show, though I can produce some easily. The problem is if I go for QF_LIA it gets prohibitively slow at some number of variables. As a metric, there are many more transitions than places, so adding the state equation really ups the variable count.
This code is generating the examples I'm asking about.
This general presentation slides 5 and 6 express the problem I'm encoding precisely, and slides 7 and 8 develop the results of what "unsat" gives us, if you want more mathematical background.
I'm generating problems from the Model Checking Contest, with up to thousands of places (primary variables) and in some cases more than a hundred thousand transitions. These are the extremes; the middle range, which I would really like to deal with, is a few thousand places and maybe 20 thousand transitions.
Reals + the greater than 1 constraint is not a good solution even for some smaller problems. Integers are slow from the get-go.
I could try Reals and then switch to Integers if I get a non-integral solution. I have not tried that. It involves pretty much killing and restarting the solver, but it might be a decent approach on my benchmark set.
What I'm looking for
I'm looking for some settings for Z3 that can better help it deal with the problems I'm feeding it, give it some insight.
I have some a priori idea about what could solve these problems, traditionally they've been fed to ILP solvers. So I'm hoping to trigger a simplex of some sort, but maybe there are conditions preventing z3 from using the "good" solution strategy in some cases.
I've become a decent level SMT/Z3 user, but I've never played with the fine settings of :options, to guide the solver.
Have any of you tried feeding what are basically ILP problems to an SMT solver, and found option settings or particular encodings that help it deploy the right solution strategies? Thanks.

Why is my p value too low with > 1,000,000 obs (p<<0.01)

I know that generally a low P value is good since I want to reject the H0 hypothesis. But my problem is an odd one, and I would appreciate any help or insight you may give me.
I work with huge data sets (n > 1,000,000), each representing data of one year. I am required to analyse the data and find out whether the mean of the year is significantly different than the mean of the previous year. Yet everyone would prefer it to be non-significant instead of significant.
By "significant" I mean that I want to be able to tell my boss, "look, these non-significant changes are noise, while these significant changes represent something real to consider."
The problem is that simply comparing the two averages with a t-test always results in a significant difference, even if the difference is very, very small (probably due to the huge sample size) and falls within the range that is acceptable in practice. So the way I see it, a p value does not function well for my needs.
What do you think I should do?
There is nothing wrong with the p value. Even slight effects with this number of observations will be flagged for significance. You have rightfully asserted that the effect size for such a sample is very weak. This basically nullifies whatever argument can be made for using the p value alone for "significance"...while the effect can be determined to not be by chance, its actual usefulness in the real world is likely low given it doesn't produce anything predictable.
For a comprehensive book on this subject, see the often-cited book by Jacob Cohen on power analysis. You can also check out my recent post on Cross Validated regarding two regression models with significant p values for predictors, but with radically different predictive power.
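The point about sample size can be checked with a small calculation. The sketch below uses made-up numbers (two yearly means of 100.1 and 100.0, a common standard deviation of 10, and 1,000,000 observations per year) and a normal approximation to the two-sample test, which is reasonable at this sample size. It reports the p value next to Cohen's d as an effect size.

```python
import math


def two_sample_summary(mean_a, mean_b, sd, n):
    """z statistic, two-sided p value, and Cohen's d for two groups of size n."""
    standard_error = sd * math.sqrt(2.0 / n)       # SE of the difference in means
    z = (mean_a - mean_b) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2.0))         # two-sided normal tail
    d = abs(mean_a - mean_b) / sd                  # effect size in SD units
    return z, p, d


# Hypothetical yearly data: a difference of 1% of one standard deviation.
z, p_value, cohens_d = two_sample_summary(100.1, 100.0, 10.0, 1_000_000)
print(f"z = {z:.2f}, p = {p_value:.2e}, Cohen's d = {cohens_d:.3f}")
```

The p value is far below 0.01 even though the effect size is only about 0.01 standard deviations, well under the 0.2 that Cohen's conventions label "small". Reporting the effect size (or a confidence interval for the difference) alongside the p value is one way to separate "statistically detectable" from "large enough to matter".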

Data based estimation of missing values

I have a problem at hand for which I feel there should be a rather elegant solution, but at this point I am having trouble finding the right search terms or taking the first step in the right direction.
Basics:
I have a high-dimensional data space with D = 19 and about 100 points in that space (100 measurements). With PCA and dimensionality estimation algorithms, I have already confirmed that the latent space on which the points lie is relatively low dimensional (at most 5 dimensions or so). Therefore, I think that in general what I am asking is not impossible.
The problem:
Now, based on incomplete measurements of a new point, I would like to estimate the missing values. The problem is that I do not know which values will be missing. Basically all combinations of missing values are (somewhat) similarly likely: I could have 1 missing value, 19 missing values, or something in between. In a perfect world, the algorithm I am looking for not only gives an estimate of the missing values, but also some error measure.
To further illustrate, I attach an image with the raw data. The x-axis shows the 19 individual measured parameters and the y-axis gives the values of those parameters. You can see that the measurements are highly correlated. So even if I specify only one measurement/dimension, I should be able to give a somewhat reliable estimate of the rest.
Does anyone have any pointers for me? Any thoughts or advice would be really helpful!
Thanks,
Thomas
The Right Way (TM) to handle missing data is to average (i.e., integrate) over the missing variables, given the values of any known variables. A Bayesian belief network is a formalization of this idea. If you can say more about what the variables are, I can say more about how to go about building a suitable belief network.
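One concrete and much simplified instance of "average over the missing variables given the known ones" is to assume the 19 measurements are jointly Gaussian. That assumption is not stated anywhere in the thread and is cruder than a full belief network, so treat the following as a sketch. Split a new point into its observed part x_o and its missing part x_m, and partition the mean \mu and covariance \Sigma estimated from the 100 complete points in the same way. The estimate of the missing values and its uncertainty are then:

```latex
\begin{aligned}
\hat{x}_m &= \mu_m + \Sigma_{mo}\,\Sigma_{oo}^{-1}\,(x_o - \mu_o), \\
\operatorname{Cov}(x_m \mid x_o) &= \Sigma_{mm} - \Sigma_{mo}\,\Sigma_{oo}^{-1}\,\Sigma_{om}.
\end{aligned}
```

The same two formulas cover every pattern of missing values, and the diagonal of the conditional covariance serves as an error measure. If all 19 values are missing, the estimate reduces to \mu with covariance \Sigma. With only about 100 points, the estimated \Sigma is noisy, so some regularization (for example keeping only the leading principal components) would probably be needed in practice.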

How to fine tune input parameters for ALSWRFactorizer in Apache Mahout?

So I have been using Apache Mahout for building a recommendation system. I am interested in using the SVD matrix factorization method.
I would like to know how I can fine tune the input parameters for:
ALSWRFactorizer(dataModel, no_of_hidden_features, lambda, iterations)
I have tried varying the value of lambda from 0.05 to 0.065, and my recommendation scores increased and then decreased. I thus selected 0.05945 as the value where the scores reached their peak.
Is this the only approach I can use to estimate no_of_iterations and hidden_features? (The values are rising and then decreasing; I expect no_of_hidden_features to be between 20 and 30.)
Moreover is this the right approach even?
EDIT: Well I ran a couple more tests, and I seem to have zeroed in on using 20 hidden features, lambda = 0.0595, 20 iterations.
However I'd appreciate any answers explaining how I can do it in a better way.
So I came across this paper:
Application of Dimensionality Reduction in Recommender System
In section 4.3 they have essentially followed the same steps that I have done. After spending a day or two going through Google results, iterative experimentation seems to be the only answer for fine tuning these parameters.
Not sure what you mean that your "scores" increased then decreased. If you are describing running a precision type cross-validation test after applying each parameter iteration then you did the right thing. The values you came up with are very close to the "rules of thumb" for ALS-WR.
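The one-parameter-at-a-time experimentation described in the question can be made systematic with a plain grid search over all three parameters at once. The sketch below is in Python rather than Java and does not call Mahout. The `toy_score` function is a made-up stand-in for "build the factorizer with these parameters and run a cross-validated evaluation", and its peak location is arbitrary.

```python
import itertools


def grid_search(score, features, lambdas, iterations):
    """Return the (features, lambda, iterations) triple with the highest score."""
    candidates = itertools.product(features, lambdas, iterations)
    return max(candidates, key=lambda c: score(*c))


def toy_score(n_features, lam, n_iterations):
    """Hypothetical evaluation result; replace with a real cross-validation run."""
    return (-((n_features - 20) ** 2) / 100
            - ((lam - 0.06) * 100) ** 2
            - ((n_iterations - 20) ** 2) / 400)


best = grid_search(
    toy_score,
    features=[10, 20, 30],
    lambdas=[0.05, 0.055, 0.06, 0.065],
    iterations=[10, 20, 30],
)
print(best)
```

Searching the parameters jointly matters when they interact, for example when the best lambda shifts as the number of features grows. The cost is one full evaluation per grid point, so the grid has to stay coarse.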

Genetic Algorithm, large population vs small one

I'm wondering if there is a general rule of thumb for population sizing. I've read in a book that 2x the chromosome length is a good starting point. Am I correct in assuming then that if I had an equation with 5 variables, I should have a population of 10?
I'm also wondering if the following is correct:
Larger Population Size.
Pros:
Larger diversity so more likely to pick up on traits which return a good fitness.
Cons:
Requires longer to process.
vs
Smaller Population Size.
Pros:
Larger number of generations experienced per unit time.
Cons:
Mutation will have to be more prominent in order to compensate for the smaller population?
EDIT
A little additional info: say I have an equation which has 5 unknown parameters. For each parameter I have anywhere between 10 and 50 values I would like to try assigning to it. So for example
variable1 = 20 different values
variable2 = 15 different values
...
I thought a GA would be a decent approach to such a problem as the search space is quite large, i.e. the worst case for the above would be 312,500,000 permutations (unless I have screwed up?): n!/(n-k)! where n = 50 and k = 1 gives 50 choices per variable => 50 * 50 * 50 * 50 * 50
Unfortunately the number of parameters and the range of values to check can vary a lot, so I was looking for some sort of rule of thumb as to how large I should set the population.
Thanks for your help. If there is any more info you need, or you prefer to discuss in one of the chatrooms, just give me a shout.
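The size of the search space is simply the product of the number of candidate values per variable. A quick check of the worst case from the question, plus a mixed case where the counts for the third to fifth variables are invented for illustration:

```python
import math

# Worst case from the question: 5 variables with 50 candidate values each.
worst_case = math.prod([50] * 5)

# Mixed case: 20 and 15 come from the question, the other three are made up.
mixed_case = math.prod([20, 15, 10, 50, 30])

print(worst_case, mixed_case)
```

So the 312,500,000 figure is right for the worst case, and the space shrinks quickly when some variables have fewer candidate values.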
I'm not sure where you read that 2x the chromosome length is a good starting point, but I'm guessing it's a book that concentrated on larger problems.
If you only have five variables, a genetic algorithm is probably not the right choice for converging upon a solution. With a chromosome length of five you're probably going to find that you very quickly reach a non-deterministic (it will change between runs) local minimum and then slowly iterate around that space until you find the true local minimum.
However, if you are insistent on using a GA I would suggest abandoning that rule of thumb for this problem and really think about starting population as a measure of how far from the final solution you expect a random solution to be.
The reason that many rules of thumb depend on chromosome length is that it is a decent proxy for this: if I have a hundred variables, any given randomly generated DNA sequence is going to be further from ideal than if I had only one variable.
Additionally, if you're worried about computation intensity I'm going to go ahead and say that it shouldn't be an issue since you're dealing with such a small solution set. I think a better rule of thumb for smaller sets like this would be along the lines of:
(ln(chromosome_length*(solution_space/granularity)/mutation_rate))^2
Probably with a constant thrown in to scale for the particular problem.
It's definitely not a great rule of thumb (no rule is) but here's my logic for it:
Chromosome length is just a proxy for size of solution space, so taking into account the size of the solution space will necessarily increase the accuracy of this proxy
A smaller mutation rate necessitates a larger population size to compensate for the fact that you are more prone to get caught in local minima
Any rule of thumb should scale logarithmically since a genetic algorithm is akin to a tree search of your solution space.
The squared term was mostly the result of trying this out, but it looks like the logarithmic scaling was a little aggressive, though the general shape seemed right.
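The rule of thumb above is easy to evaluate for a concrete case. In the sketch below, reading solution_space/granularity as "number of distinct values per variable" is an interpretation rather than something the formula's description spells out, and the `scale` argument stands for the problem-specific constant mentioned above.

```python
import math


def population_size(chromosome_length, solution_space, granularity,
                    mutation_rate, scale=1.0):
    """(ln(chromosome_length * (solution_space / granularity) / mutation_rate))^2."""
    inner = chromosome_length * (solution_space / granularity) / mutation_rate
    return scale * math.log(inner) ** 2


# 5 variables, 50 candidate values each (assumed reading), 1% mutation rate.
size = population_size(5, 50.0, 1.0, 0.01)
print(round(size))
```

For these inputs the result is on the order of a hundred individuals. A ten times smaller mutation rate raises it to roughly 150, in line with the stated logic that lower mutation needs a larger population.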
However I think a better choice would be to start at a reasonable number (100) and try iterating up and down until you find a population size that seems to balance accuracy with execution speed.
As with most genetic algorithm parameters, population size is highly dependent on the problem. There are certain factors that can help to point in the direction of whether you should have a large or small population size, but a lot of the time testing different values against a known solution before running it on your problem is a good idea (if this is possible, of course).
A population size of 10 does seem rather small though. You say you have an equation with five variables. Is your problem represented by a chromosome of 5 values? It seems small for a chromosome and if this is the case it's likely that using a genetic algorithm may not be the best way to solve the problem. Perhaps if you give a bit more detail on your problem and how you are representing it people may have a better idea of how to advise you.
I'd also add that your cons for large and small population sizes aren't exactly correct. A larger population size does take longer to process per generation than a small one, but since it can often solve the problem in fewer generations, the overall processing time isn't necessarily longer. Again, it's highly dependent on the problem. With a smaller population size, mutation shouldn't have to be more prominent. Mutation is generally used to stop the genetic algorithm from becoming stuck in a local maximum and should usually be a very small value. A small population is more likely to become stuck in a local maximum, but if you have a mutation value which is too high you may be nullifying the natural improvement of the genetic algorithm.
