Is there a dask equivalent to maxtasksperchild? - dask

We have jobs that interact with native code, and there are unavoidable memory leaks while the worker is processing tasks. The simple solution to our problem has been to restart the worker after a specified number of tasks.
We are migrating from Python's multiprocessing, which has a useful maxtasksperchild option that shuts workers down after a specified number of tasks.
Is there something built into Dask that is comparable to maxtasksperchild?
As a workaround, we are keeping track of the workers that have completed a task by appending their worker address to the result payload and calling retire_workers on the client side manually.
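For reference, a rough sketch of the multiprocessing pattern we are migrating away from (the task function and counts are purely illustrative):

    from multiprocessing import Pool

    def leaky_task(x):
        # calls into native code that leaks a little memory on each invocation
        return x * 2

    if __name__ == "__main__":
        # each worker process is recycled after 50 tasks, which bounds the leak
        with Pool(processes=4, maxtasksperchild=50) as pool:
            results = pool.map(leaky_task, range(1000))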

No, there is no such equivalent in Dask.
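A rough sketch of the manual workaround described in the question, with a purely illustrative task function, threshold, and scheduler address; it relies on distributed's get_worker, as_completed, and Client.retire_workers:

    from dask.distributed import Client, as_completed, get_worker

    MAX_TASKS_PER_WORKER = 100  # illustrative threshold

    def leaky_task(x):
        # ... call into the native code that leaks memory ...
        result = x * 2
        # report which worker ran the task alongside the result
        return result, get_worker().address

    client = Client("scheduler-address:8786")  # illustrative address
    task_counts = {}

    futures = client.map(leaky_task, range(1000))
    for future in as_completed(futures):
        result, address = future.result()
        task_counts[address] = task_counts.get(address, 0) + 1
        if task_counts[address] >= MAX_TASKS_PER_WORKER:
            # note: on a static cluster you may also need to start a replacement worker
            client.retire_workers(workers=[address])
            task_counts.pop(address)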

Related

Communicate progress of work inside a Dask delayed task back to Client thread

I would like to use a Dask delayed task to call an external program, which outputs its progress to STDOUT. In the delayed task, I plan to monitor the STDOUT and would like to update the Client process that is waiting for the delayed task with progress information extracted from the STDOUT. Is there a recommended way for a delayed task to communicate with its Client process, or do I need to roll my own?
You could achieve this kind of flow with any of the coordination primitives or actors provided by dask. From your description, the Queue or pubsub mechanisms seem like the best fit. Note that all of these are intended for low-frequency, low-volume communication.
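As a rough sketch of the Queue approach (the external command and queue name are illustrative, and it assumes a Queue looked up by name inside a task connects through the worker's client):

    import subprocess
    from dask import delayed
    from dask.distributed import Client, Queue

    @delayed
    def run_external(cmd, queue_name="progress"):
        queue = Queue(queue_name)  # looked up by name from within the task
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            queue.put(line.strip())  # keep these updates low-frequency and small
        queue.put(None)              # sentinel: the external program has finished
        return proc.wait()

    client = Client()
    progress = Queue("progress")
    future = client.compute(run_external(["my_program", "--verbose"]))  # illustrative command

    while True:
        update = progress.get()      # blocks until the task reports something
        if update is None:
            break
        print("progress:", update)
    print("exit code:", future.result())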

Waiting for external dependencies in dask

Context:
I'm using custom dask graphs to manage and distribute computations.
Problem:
Some tasks include reading in files that are produced outside of dask and not necessarily available at the time of calling dask.get(graph, result_key).
Question:
Having the I/O tasks wait for the files is not an option, as this would block workers. Is there (or what would be) a good way to let dask wait for the files to become available and only then execute the I/O tasks?
Thanks a lot for any thoughts!
It sounds like you might want to use some of the more real-time features of Dask, described here.
You might consider making tasks that use secede and rejoin, or use async/await-style programming and only launch tasks once your client process notices that the files exist.
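A rough sketch of the secede/rejoin approach, with a purely illustrative file path and polling interval:

    import os
    import time
    from dask.distributed import Client, secede, rejoin

    def load_when_ready(path, poll_interval=5):
        # step out of the worker's thread pool so waiting does not block other tasks
        secede()
        while not os.path.exists(path):
            time.sleep(poll_interval)
        # re-enter the thread pool before doing the real I/O and computation
        rejoin()
        with open(path) as f:
            return f.read()

    client = Client()
    future = client.submit(load_when_ready, "/data/produced_elsewhere.dat")  # illustrative path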

Running process scheduler in Dask distributed

Local dask allows using the process scheduler. Workers in dask distributed use a ThreadPoolExecutor to compute tasks. Is it possible to replace the ThreadPoolExecutor with a ProcessPoolExecutor in dask distributed? Thanks.
The distributed scheduler allows you to work with any number of processes, via any of the deployment options, and each of these processes can have one or more threads. Thus you can choose whatever mix of threads and processes you see fit.
The simplest expression of this is with the LocalCluster (same as Client() by default):
cluster = LocalCluster(n_workers=W, threads_per_worker=T, processes=True)
makes W workers with T threads each (which can be 1).
As things stand, the implementation of workers uses a thread pool internally, and you cannot swap in a process pool in its place.
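A fuller version of the LocalCluster example above, with purely illustrative worker and thread counts:

    from dask.distributed import Client, LocalCluster

    # one single-threaded process per worker approximates the local "processes" scheduler
    cluster = LocalCluster(n_workers=4, threads_per_worker=1, processes=True)
    client = Client(cluster)

    # Client() accepts the same keywords and builds the LocalCluster for you:
    # client = Client(n_workers=4, threads_per_worker=1, processes=True)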

All tasks assigned to one worker when using Dask in adaptive mode

When using Dask normally, things work fine. However, when I use Dask with an adaptive cluster, I find that sometimes all the tasks get assigned to a single worker. Why is this?
This should be considered a usability bug, and it would be reasonable to file an issue about it.
However, to explain what is going on (at least as of today, 2018-08-09), what probably happens is:

1. Your scheduler first has no tasks, and so has no workers assigned to it.
2. You submit a lot of work from a client; the scheduler responds and asks for many workers.
3. The first worker arrives and the scheduler hands it all of the work.
4. Milliseconds later, several other workers arrive. The scheduler then proceeds to load balance between the available workers.
Ideally, the load-balancing heuristics should handle this situation. There were older versions of Dask where this performed less well, but usually it is fine. I recommend first updating your dask and distributed packages to the newest possible releases, and if that doesn't help, reporting an issue with a minimal example if possible.
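For reference, a minimal sketch of the adaptive setup the question describes (the worker limits are illustrative):

    from dask.distributed import Client, LocalCluster

    # start with no workers and let the cluster scale between 1 and 10 on demand
    cluster = LocalCluster(n_workers=0, threads_per_worker=2)
    cluster.adapt(minimum=1, maximum=10)
    client = Client(cluster)

    # a burst of work triggers the scale-up described above; once the extra
    # workers arrive, the scheduler should rebalance tasks among them
    futures = client.map(lambda x: x ** 2, range(1000))
    results = client.gather(futures)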

Should spawn be used in Erlang whenever I have a non dependent asynchronous function?

If I have a function that can be executed asynchronously, has no dependencies, and no other functions require its results directly, should I use spawn? In my scenario I want to keep consuming a message queue, so spawning would relieve my blocking loop, but if there are other situations where I can distribute function calls as much as possible, will that negatively affect my application?
Overall, what would be the pros and cons of using spawn?
Unlike operating system processes or threads, Erlang processes are very lightweight. There is minimal overhead in starting, stopping, and scheduling new processes. You should be able to spawn as many of them as you need (the maximum per VM is in the hundreds of thousands). The actor model Erlang implements allows you to think about what is actually happening in parallel and write your programs to express that directly. Avoid complicating your logic with work queues if you can.
Spawn a process whenever it makes logical sense, and optimize only when you have to.
The first thing that comes to mind is the size of the parameters. They will be copied from your current process to the new one, and if the parameters are huge this may be inefficient.
Another problem that may arise is bloating the VM with so many processes that your system becomes unresponsive. You can overcome this by using a pool of worker processes, or a special monitor process that allows only a limited number of such processes to run at once.
so spawning would relieve my blocking loop
If you are in a situation where a loop will receive many messages requiring independent actions, don't hesitate to spawn a new process for each message; this way you take advantage of the multicore capabilities (if any) of your computer. As kjw0188 says, Erlang processes are very lightweight, and if the system hits the limit on the number of processes alive in parallel (assuming your code is reasonable), it is more likely that the application is overloading the capability of the node.
