(Sub)optimal way to get a legit range info when using a SMT constraint with Z3

This question is related to my previous question, "Is it possible to get a legit range info when using a SMT constraint with Z3".
So it seems that efficiently finding the maximum range is not feasible for typical 32-bit vectors and the like. On the other hand, I am wondering whether it is feasible to find some "sub-maximum" range, which would hopefully be more efficient to compute. We may also want a "safe" guarantee: every element in the sub-maximum range must satisfy the constraint, although other solutions outside the range may satisfy it as well.
I am currently exploring whether model counting techniques could make sense in this setting. Any thoughts would be appreciated very much. Thanks.

General case
This is not just a question of efficiency. Consider a problem where you have two variables a and b, and a single constraint:
a != b
What's the range of b? (maximum or otherwise?)
You can say all values are legitimate. But that would be wrong, as obviously the choice of a impacts the choice of b. The more variables you have around, the more complicated the problem will become. I don't think the problem is even well defined in this case, so searching for a solution (efficient or otherwise) doesn't make much sense.
Single variable assumption
Having said that, I think you can come up with a solution if you assume there's precisely one variable in the system. (Or, alternatively, if you fix all the other variables to some predefined constants.) If you're willing to go down this path, then you can implement a binary-search-style algorithm to find a reasonably sized range by repeatedly checking the satisfiability of the quantified formula
Exists([b], And(b >= minBound, b <= maxBound, Not(constraints)))
Once you get unsat for this, you have your range. So long as you get sat, you can adjust your minBound/maxBound to search within smaller ranges. In the worst case, this can turn into a linear walk, but you can "cut-down" this search by making sure you go down a significant size in each step. That could be a parameter to the whole search, depending on how large you want your intervals to be. It'll have to be a choice between trying to find a maximal range, and how long you want to spend in this search. Of course, if you cut-down too much, you can miss a big interval, but that's the cost of efficiency.
Example1 (Good case) There's a single constraint that says b != 5. Then your search will be quick and depending on which branch you'll go, you'll either find [0, 4] or [6, 255] assuming 8-bit words.
Example2 (Bad case) There's a single constraint that says b is even. Then your search will exhibit worst-case behavior, and if your "cut-down" size is 1, you'll possibly iterate 255 times before you settle down on [0, 0]; assuming z3 gives you the maximum odd number in each call.
I hope that illustrates the point. In general, though, I'd assume you'd be closer to the "good case" for practical applications and even if your cut-down size is minimal you can most likely converge in a few iterations. Of course, this entirely depends on your problem domain, but I'd expect it to hold for software analysis in general.
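The shrinking loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a Z3 program: find_counterexample is a hypothetical brute-force stand-in for the solver query Exists([b], And(b >= minBound, b <= maxBound, Not(constraints))), the word size is assumed to be 8 bits, and on every "sat" answer the search keeps the larger side of the counterexample. The "cut-down" step parameter is left out for brevity.

```python
def find_counterexample(constraint, lo, hi):
    """Hypothetical stand-in for the solver call: return some b in [lo, hi]
    that violates the constraint, or None if there is none (the 'unsat' case).
    Scanning from the top mimics a solver that returns the largest violator."""
    for b in range(hi, lo - 1, -1):
        if not constraint(b):
            return b
    return None


def safe_range(constraint, lo=0, hi=255):
    """Shrink [lo, hi] until every value in it satisfies the constraint."""
    while lo <= hi:
        cex = find_counterexample(constraint, lo, hi)
        if cex is None:
            return lo, hi          # unsat: the whole range is safe
        left_size = cex - lo       # number of values in [lo, cex - 1]
        right_size = hi - cex      # number of values in [cex + 1, hi]
        if left_size >= right_size:
            hi = cex - 1           # keep the part below the counterexample
        else:
            lo = cex + 1           # keep the part above the counterexample
    return None                    # no value satisfies the constraint


print(safe_range(lambda b: b != 5))      # good case
print(safe_range(lambda b: b % 2 == 0))  # bad case
```

Under these assumptions the first call settles on (6, 255) after a single counterexample, while the second walks down one odd number at a time and ends at (0, 0), matching the two examples above.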

How to scale % change based features so that they are viewed "similarly" by the model

I have some features that are zero-centered values and are supposed to represent the change between a current value and a previous value. Generally speaking, I believe there should be some symmetry between these values, i.e. there should be roughly the same number of positive values as negative values, and they should operate on roughly the same scale.
When I try to scale my samples using MaxAbsScaler, I notice that the negative values for this feature get almost completely drowned out by the positive values, and I don't really have any reason to believe my positive values should be that much larger than my negative values.
What I've noticed is that, fundamentally, the magnitudes of percentage change values are not symmetrical in scale. For example, a value that goes from 50 to 200 results in a 300.0% change, while a value that goes from 200 to 50 results in a -75.0% change. I get that there is a reason for this, but in terms of my feature, I don't see a reason why a change from 50 to 200 should be 3x+ more "important" than the same change in value in the opposite direction.
Given this information, I do not believe there is any reason to want my model to treat a change of 200-50 as a "lesser" change than a change of 50-200. Since I am trying to represent the change of a value over time, I want to abstract this pattern so that my model can "visualize" the change of a value over time the same way a person would.
Right now i am solving this by using this formula
if curr > prev:
    return curr / prev - 1
else:
    return (prev / curr - 1) * -1
And this does seem to treat changes in value similarly regardless of the direction, i.e. from the example above, 50>200 = 300 and 200>50 = -300. Is there a reason why I shouldn't be doing this? Does this accomplish my goal? Has anyone run into similar dilemmas?
This is a discussion question and it's difficult to know the right answer to it without knowing the physical relevance of your feature. You are calculating a percentage change, and a percent change is dependent on the original value. I am not a big fan of a custom formula only to make percent change symmetric since it adds a layer of complexity when it is unnecessary in my opinion.
If you want the change to be symmetric, you can try a direct difference or a factor change. There's nothing to suggest that a difference or a factor change is less correct than a percent change. So, depending on the physical relevance of your feature, each of the following symmetric measures would be a correct way to measure change:
Difference change -> 50 to 200 yields 150, 200 to 50 yields -150
Factor change with logarithm -> 50 to 200 yields log(4), 200 to 50 yields log(1/4) = -log(4)
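A small sketch of the two symmetric measures listed above. The function names are made up for illustration, and the log version assumes both values are strictly positive:

```python
import math


def difference_change(prev, curr):
    # Plain difference: swapping prev and curr only flips the sign.
    return curr - prev


def log_factor_change(prev, curr):
    # Log of the ratio: log(curr/prev) == -log(prev/curr).
    # Assumes prev and curr are both strictly positive.
    return math.log(curr / prev)


print(difference_change(50, 200), difference_change(200, 50))
print(log_factor_change(50, 200), log_factor_change(200, 50))
```

Both measures give values of equal magnitude and opposite sign when the direction of the change is reversed.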
You're having trouble because you haven't brought the abstract questions into your paradigm.
"... my model can "visualize" ... same way a person would."
In this paradigm, you need a metric for "same way". There is no such empirical standard. You've dropped both of the simple standards -- relative error and absolute error -- and you posit some inherently "normal" standard that doesn't exist.
Yes, we run into these dilemmas: choosing a success metric. You've chosen a classic example from "How To Lie With Statistics"; depending on the choice of starting and finishing proportions and the error metric, you can "prove" all sorts of things.
This brings us to your central question:
Does this accomplish my goal?
We don't know. First of all, you haven't given us your actual goal. Rather, you've given us an indefinite description and a single example of two data points. Second, you're asking the wrong entity. Make your changes, run the model on your data set, and examine the properties of the resulting predictions. Do those properties satisfy your desired end result?
For instance, given your posted data points, (200, 50) and (50, 200), how would other examples fit in, such as (1, 4), (1000, 10), etc.? If you're simply training on the proportion of change over the full range of values involved in that transaction, your proposal is just what you need: use the higher value as the basis. Since you didn't post any representative data, we have no idea what sort of distribution you have.

Neo4j floating point sum different results

I am using neo4j to calculate some statistics on a data set. For that I am often using sum on a floating point value. I am getting different results depending on the circumstances. For example, a query that does this:
...
WITH foo
ORDER BY foo.fooId
RETURN SUM(foo.Weight)
Returns different result than the query that simply does the sum:
...
RETURN SUM(foo.Weight)
The differences are minuscule (293.07724195098984 vs 293.07724195099007), but they are enough to make simple equality checks fail. Another example: a different instance of the database, loaded with the same data using the same loading process, can produce the same issue (the DBs might not be 1:1; the load order of some relations might differ). I took the raw values that neo4j sums (by simply removing the SUM()) and verified that they are the same in all cases (different DBs and ordered/not ordered).
What are my options here? I don't mind losing some precision (I already tried to cut down the precision from 15 to 12 decimal places but that did not seem to work), but I need the results to match up.
Because of rounding errors, floating-point addition is not associative: (a+b)+c != a+(b+c) in general.
The result of every operation is rounded to fit the floating-point format, so (a+b)+c is computed as round(round(a+b)+c) while a+(b+c) is computed as round(a+round(b+c)).
As an obvious illustration, consider the operation (2^-100 + 1 - 1). If interpreted as (2^-100 + 1) - 1, it returns 0, because 1 + 2^-100 would require more precision than IEEE 754 floats or doubles provide and can only be represented as 1.0. By contrast, 2^-100 + (1 - 1) correctly returns 2^-100, which can be represented by either floats or doubles.
This is a trivial example, but such rounding errors may occur after every operation, which explains why floating-point operations are not associative.
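The example above can be reproduced with Python doubles. The second part is an illustration of my own (the values are made up): it sums the same three numbers in two different orders with a plain left-to-right loop, which is roughly what a different row order does to an aggregate.

```python
tiny = 2.0 ** -100

left_first = (tiny + 1.0) - 1.0    # tiny is lost when it is added to 1.0
right_first = tiny + (1.0 - 1.0)   # tiny survives because 1.0 - 1.0 is exact

print(left_first, right_first)


def naive_sum(values):
    # Plain left-to-right accumulation, rounding after every addition.
    total = 0.0
    for v in values:
        total += v
    return total


# Same values, different order, different result.
print(naive_sum([1e16, 1.0, -1e16]))
print(naive_sum([1e16, -1e16, 1.0]))
```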
Databases generally do not return data in a guaranteed order, and depending on the actual order, the operations will be performed differently. That explains the behaviour that you see.
For this reason, it is generally not a good idea to do equality comparisons on floats. The usual advice is to replace a==b by a check that abs(a-b) is "sufficiently" small.
What counts as "sufficiently" small may depend on your algorithm. Floats are equivalent to ~6-7 decimal digits and doubles to 15-16 decimal digits (and I think doubles are what your DB uses). Depending on the number of computations, the last 1-3 decimals may be affected.
The best is probably to use
abs(a-b)<relative-error*max(abs(a),abs(b))
where relative-error must be adjusted to your problem. Something around 10^-13 is probably reasonable, but you must experiment, as rounding errors depend on the number of computations, on the dispersion of the values and on what you consider "equal" for your problem.
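A minimal sketch of that comparison, applied to the two sums from the question. The helper name and the default tolerance of 1e-13 are assumptions to be tuned; it uses <= rather than < so that two exact zeros compare equal. Python's standard math.isclose performs a similar relative test.

```python
import math


def nearly_equal(a, b, relative_error=1e-13):
    # Relative comparison: the allowed gap scales with the larger magnitude.
    return abs(a - b) <= relative_error * max(abs(a), abs(b))


x = 293.07724195098984
y = 293.07724195099007

print(x == y)                              # exact equality fails
print(nearly_equal(x, y))                  # relative comparison succeeds
print(math.isclose(x, y, rel_tol=1e-13))   # standard-library equivalent
```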
Look at this site for a discussion on comparison methods. And read What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg that discusses, among others, these problems.

NeuroEvolution of Augmenting Topologies (NEAT) and global innovation number

I was not able to find out why we should have a global innovation number for every new connection gene in NEAT.
From my limited knowledge of NEAT, every innovation number corresponds directly to a node_in, node_out pair, so why not use this pair of IDs instead of the innovation number? What new information does the innovation number carry? Chronology?
Update
Is it an algorithm optimization?
Note: this more of an extended comment than an answer.
You encountered a problem I also just encountered whilst developing a NEAT version for javascript. The original paper published in ~2002 is very unclear.
The original paper contains the following:
Whenever a new gene appears (through structural mutation), a global innovation number is incremented and assigned to that gene. The innovation numbers thus represent a chronology of the appearance of every gene in the system. [..]; innovation numbers are never changed. Thus, the historical origin of every gene in the system is known throughout evolution.
But the paper is very unclear about the following case. Say we have two 'identical' (same structure) networks.
The networks above were initial networks; the networks have the same innovation ID, namely [0, 1]. So now the networks randomly mutate an extra connection.
Boom! By chance, they mutated to the same new structure. However, the connection ID's are completely different, namely [0, 2, 3] for parent1 and [0, 4, 5] for parent2 as the ID is globally counted.
But the NEAT algorithm fails to determine that these structures are the same. When one of the parents scores higher than the other, it's not a problem. But when the parents have the same fitness, we have a problem.
Because the paper states:
In composing the offspring, genes are randomly chosen from either parent at matching genes, whereas all excess or disjoint genes are always included from the more fit parent, or if they are equally fit, from both parents.
So if the parents are equally fit, the offspring will have connections [0, 2, 3, 4, 5], which means that some nodes have double connections. If you remove the global innovation counter and just assign IDs by looking at node_in and node_out, you avoid this problem.
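To make the duplication concrete, here is a rough sketch of the equal-fitness crossover rule quoted above. The genome representation (a dict from innovation number to a (node_in, node_out) pair) and the node numbers are assumptions chosen for illustration; weights and enabled flags are omitted.

```python
import random


def crossover_equal_fitness(parent1, parent2, rng=random):
    """Each parent maps innovation number -> (node_in, node_out)."""
    child = {}
    for innovation in sorted(set(parent1) | set(parent2)):
        if innovation in parent1 and innovation in parent2:
            # Matching gene: pick from either parent at random.
            child[innovation] = rng.choice(
                [parent1[innovation], parent2[innovation]])
        else:
            # Disjoint or excess gene: with equal fitness both parents contribute.
            child[innovation] = parent1.get(innovation, parent2.get(innovation))
    return child


# Both parents grew the same structure, but under different innovation numbers.
parent1 = {0: (0, 2), 2: (1, 3), 3: (3, 2)}
parent2 = {0: (0, 2), 4: (1, 3), 5: (3, 2)}

child = crossover_equal_fitness(parent1, parent2)
print(sorted(child))            # innovation numbers inherited by the child
print(sorted(child.values()))   # the connections those numbers stand for
```

The child carries five genes but only three distinct connections, which is the double-connection situation described above.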
So when you have equally fit parents, yes you have optimized the algorithm. But this is almost never the case.
Quite interesting: in the newer version of the paper, they actually removed that bolded line! Older version here.
By the way, you can solve this problem by assigning IDs based on node_in and node_out using pairing functions instead of assigning innovation IDs. This creates quite interesting neural networks when fitness is equal.
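One well-known pairing function is the Cantor pairing function; the answer does not say which one was used, so treat this choice as an assumption. It maps each ordered pair of non-negative integers to a distinct integer, so the same (node_in, node_out) connection always gets the same ID regardless of when it appeared.

```python
def cantor_pair(node_in, node_out):
    """Map an ordered pair of non-negative integers to a unique integer."""
    s = node_in + node_out
    return s * (s + 1) // 2 + node_out


# The same connection gets the same ID in every genome.
print(cantor_pair(1, 3), cantor_pair(3, 2))

# The mapping is direction-sensitive: 1 -> 3 and 3 -> 1 get different IDs.
print(cantor_pair(1, 3), cantor_pair(3, 1))
```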
I can't provide a detailed answer, but the innovation number enables certain functionality within the NEAT model to be optimal (like calculating the species of a gene), as well as allowing crossover between the variable length genomes. Crossover is not necessary in NEAT, but it can be done, due to the innovation number.
I got all my answers from here:
http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf
It's a good read
During crossover, we have to consider two genomes that share a connection between the two same nodes in their personal neural networks. How do we detect this collision without iterating both genome's connection genes over and over again for each step of crossover? Easy: if both connections being examined during crossover share an innovation number, they are connecting the same two nodes because they received that connection from the same common ancestor.
Easy Example:
If I am a genome with a specific connection gene with innovation number 'i', my children that take gene 'i' from me may eventually cross over with each other in 100 generations. We have to detect when these two evolved versions (alleles) of my gene 'i' are in collision to prevent taking both. Taking two of the same gene would cause the phenotype to probably loop and crash, killing the genotype.
When I created my first implementation of NEAT I thought the same: why would you keep an innovation number tracker, and why would you use it only for one generation? Wouldn't it be better not to keep it at all and use a key-value pair with the connected nodes?
Now that I am implementing my third revision I can see what Kenneth Stanley tried to do with them and why he wanted to keep them only for one generation.
When a connection is created, it starts its optimization at that moment. The innovation number marks its origin. If the same connection pops up in another generation, it starts its optimization then. Innovation numbers try to separate the genes which come from a common ancestor, so that those which have been optimized for many generations are not put side by side with one that was just generated. If the same connection is found in two genomes, that means that the gene comes from the same origin and thus can be aligned.
Imagine then that you have your generation champion. Some of its genes will have a 50 percent chance of being lost, because the aligned genes are treated equally.
What is better...? I haven't seen any experiments comparing the two approaches.
Kenneth Stanley also addressed this issue in the NEAT users page: https://www.cs.ucf.edu/~kstanley/neat.html
Should a record of innovations be kept around forever, or only for the current
generation?
In my implementation of NEAT, the record is only kept for a generation, but there
is nothing wrong with keeping them around forever. In fact, it may work better.
Here is the long explanation:
The reason I didn't keep the record around for the entire run in my
implementation of NEAT was because I felt that calling something the same
mutation that happened under completely different circumstances was not
intuitive. That is, it is likely that several generations down the line, the
"meaning" or contribution of the same connection relative to all the other
connections in a network is different than it would have been if it had appeared
generations ago. I used a single generation as a yardstick for this kind of
situation, although that is admittedly ad hoc.
That said, functionally speaking, I don't think there is anything wrong with
keeping innovations around forever. The main effect is to generate fewer species.
Conversely, not keeping them around leads to more species..some of them
representing the same thing but separated nonetheless. It is not currently clear
which method produces better results under what circumstances.
Note that as species diverge, calling a connection that appeared in one species a
different name than one that appeared earlier in another just increases the
incompatibility of the species. This doesn't change things much since they were
incompatible to begin with. On the other hand, if the same species adds a
connection that it added in an earlier generation, that must mean some members of
the species had not adopted that connection yet...so now it is likely that the
first "version" of that connection that starts being helpful will win out, and
the other will die away. The third case is where a connection has already been
generally adopted by a species. In that case, there can be no mutation creating
the same connection in that species since it is already taken. The main point is,
you don't really expect too many truly similar structures with different markings
to emerge, even with only keeping the record around for 1 generation.
Which way works best is a good question. If you have any interesting experimental
results on this question, please let me know.
My third revision will allow both options. I will add more information to this answer when I have results about it.

SARSA Implementation

I am learning about SARSA algorithm implementation and had a question. I understand that the general "learning" step takes the form of:
Robot (r) is in state s. There are four actions available:
North (n), East (e), West (w) and South (s)
such that the list of Actions,
a = {n,w,e,s}
The robot randomly picks an action, and updates as follows:
Q(a,s) = Q(a,s) + L[r + DQ(a',s1) - Q(a,s)]
Where L is the learning rate, r is the reward associated with (a,s), Q(a',s1) is the expected reward from an action a' in the new state s1 and D is the discount factor.
Firstly, I don't understand the role of the term - Q(a,s). Why are we re-subtracting the current Q-value?
Secondly, when picking actions a and a', why do these have to be random? I know that in some implementations of SARSA all possible Q(s', a') are taken into account and the highest value is picked. (I believe this is Epsilon-Greedy?) Why not do this also to pick which Q(a,s) value to update? Or why not update all Q(a,s) for the current s?
Finally, why is SARSA limited to one-step lookahead? Why not, say, also look into a hypothetical Q(s'',a'')?
I guess overall my questions boil down to what makes SARSA better than a breadth-first or depth-first search algorithm?
Why do we subtract Q(a,s)? r + DQ(a',s1) is the reward that we got on this run through for taking action a in state s. In theory, this is the value that Q(a,s) should be set to. However, we won't always take the same action after taking action a in state s, and the rewards associated with going to future states will change in the future. So we can't just set Q(a,s) equal to r + DQ(a',s1). Instead, we just want to push it in the right direction so that it will eventually converge on the right value. So we look at the error in prediction, which requires subtracting Q(a,s) from r + DQ(a',s1). This is the amount by which we would need to change Q(a,s) in order to make it perfectly match the reward that we just observed. Since we don't want to do that all at once (we don't know if this is always going to be the best option), we multiply this error term by the learning rate, L, and add this value to Q(a,s) for a more gradual convergence on the correct value.
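As a rough sketch of that update, here is one step of the rule from the question written out in code. The table layout (a dict keyed by (state, action)), the state and action names, and the values L=0.5 and D=0.9 are assumptions for illustration only.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, L=0.5, D=0.9):
    """Apply Q(a,s) = Q(a,s) + L[r + D*Q(a',s1) - Q(a,s)] in place."""
    # Prediction error: observed target minus the current estimate.
    error = r + D * Q[(s_next, a_next)] - Q[(s, a)]
    # Move only a fraction L of the way toward the target.
    Q[(s, a)] += L * error
    return Q[(s, a)]


Q = defaultdict(float)   # unexplored entries start at 0
print(sarsa_update(Q, "s0", "n", 1.0, "s1", "e"))
print(sarsa_update(Q, "s0", "n", 1.0, "s1", "e"))
```

With a reward of 1 and an unexplored next state, the estimate moves from 0 to 0.5 and then to 0.75: each update closes half of the remaining gap rather than jumping straight to the observed value.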
Why do we pick actions randomly? The reason to not always pick the next state or action in a deterministic way is basically that our guess about which state is best might be wrong. When we first start running SARSA, we have a table full of 0s. We put non-zero values into the table by exploring those areas of state space and finding that there are rewards associated with them. As a result, something not terrible that we have explored will look like a better option than something that we haven't explored. Maybe it is. But maybe the thing that we haven't explored yet is actually way better than we've already seen. This is called the exploration vs exploitation problem - if we just keep doing things that we know work, we may never find the best solution. Choosing next steps randomly ensures that we see more of our options.
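The epsilon-greedy rule mentioned in the question is one common way to balance the two: explore at random with a small probability, otherwise exploit the best known action. A minimal sketch, assuming the Q-table is a dict keyed by (state, action) and unexplored entries count as 0:

```python
import random


def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else the best known one."""
    if rng.random() < epsilon:
        return rng.choice(actions)   # explore
    # Exploit: highest current estimate; unexplored pairs default to 0.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


actions = ["n", "w", "e", "s"]
Q = {("s0", "e"): 1.0}
print(epsilon_greedy(Q, "s0", actions, epsilon=0.0))   # always exploits
print(epsilon_greedy(Q, "s0", actions, epsilon=1.0))   # always explores
```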
Why can't we just take all possible actions from a given state? This will force us to basically look at the entire learning table on every iteration. If we're using something like SARSA to solve the problem, the table is probably too big to do this for in a reasonable amount of time.
Why can SARSA only do one-step look-ahead? Good question. The idea behind SARSA is that it's propagating expected rewards backwards through the table. The discount factor, D, ensures that in the final solution you'll have a trail of gradually increasing expected rewards leading to the best reward. If you filled in the table at random, this wouldn't always be true. This doesn't necessarily break the algorithm, but I suspect it leads to inefficiencies.
Why is SARSA better than search? Again, this comes down to an efficiency thing. The fundamental reason that anyone uses learning algorithms rather than search algorithms is that search algorithms are too slow once you have too many options for states and actions. In order to know the best action to take from any other state action pair (which is what SARSA calculates), you would need to do a search of the entire graph from every node. This would take O(s*(s+a)) time. If you're trying to solve real-world problems, that's generally too long.

Package for fast determination of similarity between two bit sequences

I need to compare a query bit sequence with a database of up to a million bit sequences. All bit sequences are 100 bits long. I need the lookup to be as fast as possible. Are there any packages out there for fast determination of the similarity between two bit sequences? --Edit-- The bit sequences are position sensitive.
I have seen a possible algorithm on Bit Twiddling Hacks but if there is a ready made package that would be better.
If the database is rather static, you may want to build a tree data structure on it.
Search the tree recursively or in multiple threads and per search keep an actual difference variable. If the actual difference becomes greater than what you would consider 'similar', abort the search.
E.g. Suppose we have the following tree:
              root
       0               1
   0       1       0       1
 0   1   0   1   0   1   0   1
If you want to look for patterns similar to 011, and only want to allow 1 different bit at most, search like this (recursively or multi-threaded):
Start at the root
  Take the left branch (0), this is equal, so difference is still 0
    Take the left branch (0), this is different, so difference becomes 1, which is still acceptable
      Take the left branch (0), this is different, so difference becomes 2, which is too high. Abort looking in this branch.
      Take the right branch (1), this is equal, so difference remains 1, continue to search in this branch (not shown here)
    Take the right branch (1), this is equal, so difference remains 0, go on
      Take the left branch (0), this is different, so difference becomes 1, which is still acceptable, go on.
This goes on until you have found your bit patterns.
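The walk above can be sketched as a recursive search over a binary trie. This is a toy illustration, not a tuned implementation: patterns are assumed to be strings of '0' and '1', the trie is built from nested dicts, and all names are made up for the example.

```python
def build_trie(patterns):
    """Build a trie where each node is a dict mapping '0'/'1' to a child node."""
    root = {}
    for pattern in patterns:
        node = root
        for bit in pattern:
            node = node.setdefault(bit, {})
    return root


def search(node, query, max_diff, depth=0, diff=0, prefix=""):
    """Return all stored patterns within max_diff differing bits of query."""
    if depth == len(query):
        return [prefix]                      # reached a leaf within budget
    matches = []
    for bit, child in node.items():
        new_diff = diff + (bit != query[depth])
        if new_diff <= max_diff:             # otherwise abort this branch
            matches.extend(
                search(child, query, max_diff, depth + 1, new_diff, prefix + bit))
    return matches


patterns = [format(i, "03b") for i in range(8)]   # the full 3-level tree above
trie = build_trie(patterns)
print(sorted(search(trie, "011", max_diff=1)))
```

For the 3-bit tree and the query 011 with at most one differing bit, the search returns 001, 010, 011 and 111 and never descends into the subtrees that already exceed the limit.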
If your bit patterns are more dynamic and being updated in your application, you will have to update the tree.
If memory is a problem, consider going to 64-bit.
If you want to look up the, let's say 50, most matching patterns, and we can assume that the input data set is rather static (or can be dynamically updated), you can repeat the initial phase of the previous comment, so:
For every bit pattern, count the bits.
Store the bit patterns in a multimap keyed by bit count (std::multimap if you use the STL; Java probably has something similar)
Then, use the following algorithm:
Make 2 collections: one for storing the found patterns, one for storing possibly good patterns (this second collection should probably be map, mapping 'distances' to patterns)
Take your own pattern and count the bits, assume this is N
Look in the multimap at index N, all these patterns will have the same sum, but not necessarily be completely identical
Compare all the patterns at index N. If they are equal store the result in the first collection. If they are not equal, store the result in the second collection/map, using the difference as key.
Look in the multimap at index N-1, all these patterns will have a distance of 1 or more
Compare all the patterns at index N-1. If they have a distance of 1, store them in the first collection. If they have a larger distance, store the result in the second collection/map, using the difference as key.
Repeat for index N+1
Now look in the second collection/map and see if there is something stored with distance 1. If it is, remove them from the second collection/map and store them in the first collection.
Repeat this for distance 2, distance 3, ... until you have enough patterns.
If the number of required patterns is not too big, and the average distance is also not too big, then the number of real compares between patterns is probably only a few %.
Unfortunately, since the patterns will be distributed using a Gaussian curve, there will still be quite some patterns to check. I didn't do a mathematical check on it, but in practice, if you don't want too many patterns out of the millions, and the average distance is not too far, you should be able to find the set of most-close patterns by checking only a few percent of the total bit patterns.
Please keep me updated of your results.
I came up with a second alternative.
For every one of the million bit patterns, count the number of set bits and store the bit patterns in an STL multimap keyed by that count (if you're writing in C++).
Then count the number of bits in your pattern. Suppose you have N bits set in your bit pattern.
If you now want to allow at most D differences, look up all the bit patterns in the multi_map having N-D, N-D+1, ..., N-1, N, N+1, ... N+D-1, N+D bits.
Unfortunately, the division of bit patterns in the multi_map will follow a Gaussian pattern, which means that in practice you will still have to compare quite some bit patterns.
(Originally I thought this could be solved by counting even 0's and uneven 1's but this isn't true.)
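Assuming that you want to allow 1 difference, you have to look up 3 slots in the multimap out of the 101 possible slots (0 to 100 bits set). If the patterns were spread evenly over the slots, that would leave you with about 3% of the bit patterns to do a full compare; because of the Gaussian distribution mentioned above, the share is larger when N is close to 50.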
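A rough sketch of this bucket-by-bit-count idea. The answer talks about an STL multimap; here a dict of lists plays that role, patterns are assumed to be stored as integers, and the full compare is an XOR followed by a bit count. All names are made up for the example.

```python
from collections import defaultdict


def popcount(x):
    # Number of set bits in a non-negative integer.
    return bin(x).count("1")


def build_index(patterns):
    """Group the patterns by their number of set bits."""
    index = defaultdict(list)
    for pattern in patterns:
        index[popcount(pattern)].append(pattern)
    return index


def find_similar(index, query, max_diff):
    """Return patterns whose Hamming distance to query is at most max_diff."""
    n = popcount(query)
    hits = []
    # Only buckets N-D .. N+D can contain patterns within D differences.
    for count in range(n - max_diff, n + max_diff + 1):
        for candidate in index.get(count, []):
            if popcount(candidate ^ query) <= max_diff:   # full compare
                hits.append(candidate)
    return hits


index = build_index(range(8))              # toy database: all 3-bit patterns
print(sorted(find_similar(index, 0b011, max_diff=1)))
```

The bucket lookup only narrows down the candidates; every candidate still needs the full XOR compare, because two patterns with the same bit count can differ in many positions.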
