I am trying to model job-shop scheduling problems using Z3. Specifically, let's say I have a set of tasks, each of which may depend on other tasks. I wish to minimize the time at which the last tasks are scheduled, i.e., the makespan.
Since there can be more than one job that depends on other jobs but has no forward dependencies (i.e., no job depends on it), a simple minimize operation in Z3 may not suffice, and Z3 doesn't provide a max function over a list.
Hence, to solve this, I am considering adding a fake job that depends on all such jobs and then minimizing the time at which this fake job is scheduled. I wonder whether this approach is scalable, as I would need to add constraints to many jobs.
Is this the only approach or are there other more elegant means?
You can define max yourself using a chain of ite calls, assuming you know exactly how many jobs there are. See here: Use Z3 and SMT-LIB to get a maximum of two values
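For illustration, here is a minimal sketch using Z3's Java bindings and the Optimize engine (the end times end_0..end_2 and their bounds are made-up placeholders; your actual scheduling constraints would define them):

import com.microsoft.z3.*;

public class MakespanSketch {
    public static void main(String[] args) {
        Context ctx = new Context();
        Optimize opt = ctx.mkOptimize();

        // Placeholder end times for three terminal jobs.
        IntExpr[] end = new IntExpr[3];
        for (int i = 0; i < end.length; i++) {
            end[i] = ctx.mkIntConst("end_" + i);
            opt.Add(ctx.mkGe(end[i], ctx.mkInt(i + 1))); // dummy lower bounds
        }

        // makespan = max(end_0, end_1, end_2), built as a chain of ite terms.
        ArithExpr makespan = end[0];
        for (int i = 1; i < end.length; i++) {
            makespan = (ArithExpr) ctx.mkITE(ctx.mkGe(end[i], makespan), end[i], makespan);
        }

        opt.MkMinimize(makespan);
        if (opt.Check() == Status.SATISFIABLE) {
            System.out.println(opt.getModel());
        }
    }
}

Equivalently, you can declare a single makespan variable, assert makespan >= end_i for every terminal job, and minimize that variable; this achieves the effect of your "fake job" without adding any extra scheduling constraints.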
My team uses a lot of aggregators (custom counters) in many of the Dataflow pipelines we run for monitoring and analysis purposes.
We mostly write DoFn classes to do so, but we sometimes use Combine.perKey(), writing our own combine class that implements SerializableFunction<Iterable<T>, S> (usually, in our case, T and S are the same). Some of the jobs we run have a small fraction of very hot keys, and we would like to use some of the features offered by Combine (such as hot-key fanout), but there is one issue with this approach.
It appears that aggregators are only available within a DoFn, and I am wondering whether there is a way around this, or whether this is a feature likely to be added in the future. Mostly, we use a bunch of custom counters to count the number of certain events/objects of different types for analysis and monitoring purposes. In some cases, we can probably apply another DoFn after the Combine step to do this, but in other cases we really need to count things during the combine process: for instance, we want to know the distribution of objects over keys, to understand how many hot keys we have and where the line falls between hot keys and very hot keys. There are a few other cases that seem tricky to us.
I searched around, but I couldn't find many resources on how one can use aggregators during the Combine step, so any help will be really appreciated!
If needed, I can perhaps describe what kind of Combine step we use and what we are trying to count, but it'll take some time and I'd like to have a general solution around this.
This is not currently possible. In the future (as part of Apache Beam), it is likely to be possible to define metrics (which are like aggregators) within a CombineFn, which should address this.
In the meantime, for your use case you can do as you describe. You can have a Combine.perKey(), and then have multiple steps consuming the result -- one for your actual processing and others to report various metrics.
You could also look at the methods in CombineFns, which allow creating a composed CombineFn. For instance, you could compose your CombineFn with a simple Count, so that the reporting DoFn can report the number of elements in each key (consuming the Count) while the actual processing DoFn consumes the result of your CombineFn.
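For example, here is a hedged sketch against the Apache Beam Java API (Sum.ofIntegers() stands in for your own CombineFn; the tag names and method are made up):

import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.transforms.CombineFns.CoCombineResult;
import org.apache.beam.sdk.values.*;

public class ComposedCombine {
    // Tags identifying the two outputs of the composed CombineFn.
    static final TupleTag<Integer> SUM_TAG = new TupleTag<Integer>() {};
    static final TupleTag<Long> COUNT_TAG = new TupleTag<Long>() {};

    static PCollection<KV<String, CoCombineResult>> combineWithCount(
            PCollection<KV<String, Integer>> input) {
        // Both component CombineFns consume the value unchanged.
        SimpleFunction<Integer, Integer> identity =
                new SimpleFunction<Integer, Integer>() {
                    @Override public Integer apply(Integer v) { return v; }
                };
        return input.apply(Combine.perKey(
                CombineFns.compose()
                        .with(identity, Sum.ofIntegers(), SUM_TAG)                // your actual processing
                        .with(identity, Count.<Integer>combineFn(), COUNT_TAG))); // per-key element count
    }
    // Downstream, each DoFn picks out the piece it needs:
    //   Integer sum = c.element().getValue().get(SUM_TAG);
    //   Long count = c.element().getValue().get(COUNT_TAG);
}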
I am looking into the possibility of implementing a data-analysis algorithm using Google Cloud Dataflow. Mind you, I have no experience with Dataflow yet; I am just doing some research on whether it can fulfill my needs.
Part of my algorithm contains conditional iteration, that is, it continues until some condition is met:
PCollection data = ...;
while (needsMoreWork(data)) {
    data = doAStep(data);
}
I have looked around in the documentation, and as far as I can see, I am only able to do "iterations" if I know the exact number of iterations before the pipeline starts. In that case my pipeline construction code can just create a sequential pipeline with a fixed number of steps.
The only "solution" I can think of is to run each iteration in a separate pipeline, store the intermediate data in some database, and then decide in my pipeline construction code whether or not to launch a new pipeline for the next iteration. This seems like an extremely inefficient solution!
Are there any good ways to perform this kind of conditional iteration in Google Cloud Dataflow?
Thanks!
For the time being, the two options you've mentioned are both reasonable. You could even combine the two approaches: create a pipeline which does a few iterations (each becoming a no-op if needsMoreWork is false), and then have a main Java program that submits that pipeline repeatedly until needsMoreWork is false.
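A rough sketch of that outer driver (runPipelineOnce and needsMoreWork are hypothetical helpers you would write; a blocking runner is assumed, and each round persists its output for the next one):

public class IterativeDriver {
    public static void main(String[] args) {
        // Each round writes its result to `out`; the next round reads it back.
        String in = "gs://my-bucket/iter-0"; // made-up paths
        for (int round = 1; ; round++) {
            String out = "gs://my-bucket/iter-" + round;
            runPipelineOnce(in, out);        // build, submit, and wait for one pipeline
            if (!needsMoreWork(out)) break;  // inspect the persisted output
            in = out;
        }
    }

    static void runPipelineOnce(String in, String out) { /* construct and run() the pipeline */ }

    static boolean needsMoreWork(String out) { /* e.g., read a marker file */ return false; }
}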
We've seen this use case a few times and hope to address it natively in the future. Native support is being tracked in https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/50.
Background:
I'm a software-engineering student, and I have been checking out several algorithms for recommendation systems. One of these algorithms, collaborative filtering, has a lot of loops in it: it has to go through all of the users and, for each user, all of the ratings they have made on movies or other rateable items.
I was thinking of implementing it in Ruby for a Rails app.
The point is, there is a lot of data to be processed, so:
1. Should this be done in the database, using regular queries or PL/SQL or something similar? (Testing databases is extremely time-consuming and hard, especially for these kinds of algorithms.)
2. Should I have a background job that caches the results of the algorithm? (If so, the data is processed in memory; if there are millions of users, how well does this scale?)
3. Should I run the algorithm on every request, or every x requests? (Again, the data is processed in memory.)
The Question:
I know there are things that do this, like Apache Mahout, but they rely on Hadoop for scaling. Is there another way out? Is there a Mahout or machine-learning equivalent for Ruby, and if so, where does the computation take place?
Here are my thoughts on each of the options:
1. No, it should not. Some calculations would be much faster to run in your database and some would not, but it would be hard and time-consuming to test exactly which calculations should be run in your DB, and you would probably find that some parts of the algorithm are slow in PostgreSQL or whatever you use.
More importantly, this is not the right place to run logic: as you say yourself, it would be hard to test, and it's bad practice overall. It would also affect the performance of all your requests each time the DB has to calculate the algorithm. And the DB would still use a lot of memory processing this, so that isn't an advantage either.
2. By far the best solution; see below for more explanation.
3. This is a much better solution than number one, but it would make your app's performance very unstable: sometimes all resources would be free for normal requests, and sometimes all of them would be consumed by your calculations.
Option 2 is the best solution, as it doesn't interfere with the performance of the rest of your app and is much easier to scale, since it works in isolation. If, for example, you find that your worker can't keep up, you can just add more running processes.
More importantly, you would be able to run the background processes on a separate server, and thereby easily monitor memory and resource usage and scale the server as necessary.
Even for real-time updates a background job is the best solution (if, of course, the calculation is not small enough to be done within the request). You could create a "high priority" queue that has enough resources to almost always be empty. If you need to show the result to the user without a reload, you would have to add some kind of push notification after a background job completes; this notification could then trigger an update on the page through JavaScript (you can also check out the new live-streaming feature of Rails 4).
I would recommend something like Sidekiq with Redis. You could then cache the results in Memcached, or you could recalculate the result each time; that really depends on how often you need the result. With this solution, however, it would be much easier to set up a stable cache if you want one.
Where I work, we have an application that runs some heavy queries with a lot of calculations like this. Each night these jobs are queued and then run on an isolated server over the next few hours. This scales really well and is also easy to monitor with New Relic.
Hope this helps, and makes sense (I know my English isn't perfect), but please feel free to ask if I misunderstood something or you have more questions.
I am using Z3 to solve the path conditions produced by a symbolic executor, which explores the state space in depth-first order, quite similarly to CUTE, DART or (possibly) SAGE. We are experimenting with different ways of using Z3. At one extreme, we send every query to Z3 and (reset) it right after. At the other, we (push) every additional branch constraint, and upon backtracking (pop) the minimum number of scopes necessary to correctly weaken the path condition. The problem is that no strategy seems to work better than the others in all circumstances. Pushing seems to offer the best advantage, but we have met a few cases where resetting Z3 after every query is more than an order of magnitude faster than doing push/pop. Note that communication overhead is negligible: almost all the time is spent inside check-sat.
Does anyone have any experience to share, or some indication of the state kept internally by Z3 (lemmas, etc.) that could help clarify its behavior? And what about the behavior of other SMT solvers?
The next release (v4.3.2) will expose a feature that may be useful for you. In Z3, the default solver combines a non-incremental solver and an incremental one. When push/pop are used (or when multiple checks are made without invoking reset), Z3 uses the incremental solver. In the next release, we can provide a timeout for the incremental solver: if it can't solve the problem within the given timeout, Z3 automatically switches to the non-incremental one. Perhaps, with this feature, you will be able to get the best of both worlds. To get the source code for the next release candidate, use:
git clone https://git01.codeplex.com/z3 -b rc
To compile it, use:
cd z3
python scripts/mk_make.py
cd build
make
To set the timeout for the incremental solver, we have to provide the following command line option:
combined_solver.solver2_timeout=<time in milliseconds>
If you are using the programmatic APIs, you can use the new API:
Z3_global_param_set(Z3_string param_id, Z3_string param_value)
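For example, to impose a five-second limit on the incremental solver from code (the value is in milliseconds):
Z3_global_param_set("combined_solver.solver2_timeout", "5000");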
Note that the next release will have a new framework for setting parameters, which allows the user to set parameters for internal Z3 modules.
While the stack can give us nested function calls (and probably more), what would a queue give us? Call-after-exit? Would there be any use at all?
Are there any readings on this topic?
I'm curious, this is not homework.
I think you're looking at this backwards: it's simply not true that someone somewhere decided arbitrarily to use a stack and this determined the structure of programs from then on. It's the other way round: programmers wanted arbitrarily nested (and recursive) subroutine calls, and developed the stack structure to implement this. Queues are used to implement different requirements (e.g. scheduling, breadth-first graph traversal).
A queue can be used for tasks: a job queue. A language could support procedure calls that insert tasks into a queue, as in the sketch below.
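Here is a minimal sketch (made-up names) of such a "call queue": invoking a procedure merely enqueues it, so it runs only after the current procedure has exited; this is exactly the call-after-exit behavior asked about, and it is how event loops schedule tasks.

import java.util.ArrayDeque;
import java.util.Queue;

public class CallQueue {
    private final Queue<Runnable> tasks = new ArrayDeque<>();

    void call(Runnable task) { tasks.add(task); } // "call" = schedule for later

    void run() { // drive until no tasks remain
        while (!tasks.isEmpty()) {
            tasks.poll().run();
        }
    }

    public static void main(String[] args) {
        CallQueue q = new CallQueue();
        q.call(() -> {
            System.out.println("outer: before call");
            q.call(() -> System.out.println("inner: runs after outer exits"));
            System.out.println("outer: after call");
        });
        q.run(); // prints both outer lines first, then the inner one
    }
}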
I think this is somewhat related to what functional programming is all about. For instance, monads are a way of describing your program as a chain of sequential operations, each taking the result of the previous operation as its input.
It's called Cheney-on-the-MTA.