Communicate progress of work inside a Dask delayed task back to the Client thread

I would like to use a Dask delayed task to call an external program, which writes its progress to STDOUT. Inside the delayed task, I plan to monitor that STDOUT and update the client process waiting on the task with progress information extracted from it. Is there a recommended way for a delayed task to communicate with its client process, or do I need to roll my own?

You could achieve this kind of flow with any of the coordination primitives or actors provided by Dask. From your description, the Queue or pub/sub mechanisms seem like the best fit. Note that all of these are intended for low-frequency, low-volume communication.
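Here is a minimal sketch of the Queue approach, assuming a hypothetical external command ("my_tool"), a queue named "progress", and a sentinel value to mark the end of the stream; Client, Queue, get_client and delayed are real Dask APIs, the rest is illustrative.

import subprocess

from dask import delayed
from dask.distributed import Client, Queue, get_client

DONE = "__done__"  # hypothetical sentinel marking the end of the progress stream

@delayed
def run_external_tool(cmd, queue_name):
    # Attach to the shared queue by name from inside the task.
    q = Queue(queue_name, client=get_client())
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:   # monitor the external program's STDOUT
        q.put(line.strip())    # forward each progress line to the client
    q.put(DONE)
    return proc.wait()

if __name__ == "__main__":
    client = Client()  # local cluster, for illustration only
    q = Queue("progress", client=client)
    future = client.compute(run_external_tool(["my_tool", "--verbose"], "progress"))

    while True:
        msg = q.get()          # blocks until the task publishes something
        if msg == DONE:
            break
        print("progress:", msg)

    print("exit code:", future.result())

The same shape works with Pub/Sub (dask.distributed.Pub and Sub) if you want several listeners instead of a single consumer.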

Related

Mass Transit: Using AWS SQS and Job Consumers

Currently we're using Hangfire for scheduling and running long lived tasks. We need these tasks to be able to be retried in the event of an ungraceful shutdown, which Hangfire handles for us.
We're looking to move to a producer/consumer model and I've built a basic prototype with MassTransit and AWS SQS, but I have some concerns about how to handle a task that is mid-processing during an ungraceful shutdown.
I understand that eventually the SQS visibility timeout will expire and the queued item will be picked up for processing again, but setting that timeout isn't trivial as the length of tasks can be quite varied and I'd prefer if the task could immediately resume/retry processing when the application starts up again.
I got to reading about Job Consumers and they seem better suited to this type of scenario, but all the examples I've seen use RabbitMQ. I'm wondering if it's possible/appropriate to do this using SQS, or if there's a better approach?
Thank you for taking the time to read this question :)
MassTransit will extend the visibility timeout as long as the consumer is still running.
I believe SQS has an upper limit of something like 12 hours, but you should look it up to be sure.
Job Consumers have significantly greater requirements (sagas, temporary queues, etc.) and SQS is really annoying about not having auto-delete/expiring queues, so I'd stick to a regular consumer if you can swing it.

Is there a dask equivalent to maxtasksperchild?

We have jobs which interact with native code, and there are unavoidable memory leaks while a worker processes a task. The simple solution to our problem has been to restart the worker after a specified number of tasks.
We are migrating from Python's multiprocessing, which has a useful maxtasksperchild option that closes down workers after a specified number of tasks.
Is there something built-in in dask that is comparable to maxtasksperchild?
As a workaround, we are keeping track of the workers who have completed a task by appending their worker address to the result payload and calling retire_workers on the client side manually.
No, there is no such equivalent in Dask.
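Here is a rough sketch of the workaround described in the question: tag each result with the address of the worker that ran it, count completed tasks per worker on the client, and retire a worker once it crosses a threshold. The task body and the max_tasks value are illustrative; get_worker, as_completed and Client.retire_workers are real dask.distributed APIs.

from collections import Counter

from dask.distributed import Client, as_completed, get_worker

def leaky_task(x):
    result = x * 2  # stand-in for the call into leaky native code
    return result, get_worker().address  # append the worker address to the payload

if __name__ == "__main__":
    client = Client()
    max_tasks = 100  # illustrative threshold
    counts = Counter()

    futures = client.map(leaky_task, range(1000))
    for future in as_completed(futures):
        result, addr = future.result()
        counts[addr] += 1
        if counts[addr] >= max_tasks:
            client.retire_workers(workers=[addr])  # gracefully retire the leaky worker
            counts.pop(addr, None)

Note that retire_workers only closes the worker; whether a replacement appears depends on how the cluster is deployed. If recycling everything at once is acceptable, client.restart() will restart all workers, at the cost of losing in-progress work.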

How can I get result of Dask compute on a different machine than the one that submitted it?

I am using Dask behind a Django server and the basic setup I have is summarised here: https://github.com/MoonVision/django-dask-demo/ where the Dask client can be found here: https://github.com/MoonVision/django-dask-demo/blob/master/demo/daskmanager/daskmanager.py
I want to be able to separate the saving of a task's result from the server that submitted it, for robustness and scalability. I would also like more detailed information about the processing status of the task; right now the future's status is always pending, even while the task is processing. Having a rough estimate of percent complete would also be great.
Right now, if the web server were to die, the client would get deleted and the task would stop as no client is still holding the future. I can get around this by using fire_and_forget but I then have no way to save the task status and result when it completes.
Ways I see to track the status and save the result after a fire_and_forget:
1. A scheduler plugin that sends all transitions to an AMQP server (RabbitMQ). I like the robustness and being able to subscribe to certain messages emitted by the scheduler, knowing every message will be processed. I'm not sure how I could get the result itself with this method. I could manually add a node to the end of every graph to save the result, but I would rather have it happen behind the scenes.
2. get_task_stream on a separate server, or used in some way. With this, it seems I could miss some messages if that server were to go down, so it seems like a worse version of option 1.
3. Some other option?
What would be the best way to accomplish this?
Edit: Just tested and it seems when the client that submitted a task shuts down, all futures it created are moved from processing to forgotten, even if calling fire_and_forget.
You probably want to look at Dask's coordination primitives like Queues and Pub/Sub. My guess is that putting your futures into a queue would solve your problem.
https://docs.dask.org/en/latest/futures.html#coordination-primitives
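A minimal sketch of that suggestion: the web server puts its future into a named distributed Queue so the scheduler keeps a reference to it, and a separate long-lived process connected to the same scheduler pulls futures out and saves their results. The scheduler address ("tcp://scheduler:8786"), the queue name and the toy task are illustrative assumptions.

from dask.distributed import Client, Queue

def submit_side():
    # Runs in the web server handling the request.
    client = Client("tcp://scheduler:8786")
    q = Queue("results", client=client)
    future = client.submit(sum, [1, 2, 3])
    q.put(future)  # the queue holds a reference to the future on the scheduler

def consumer_side():
    # Runs in a separate "saver" service that can outlive the web server.
    client = Client("tcp://scheduler:8786")
    q = Queue("results", client=client)
    while True:
        future = q.get()         # blocks until a future arrives
        value = future.result()  # wait for the task to finish
        print("saving", value)   # persist status/result to the database here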

How to correctly use Resque workers?

I have the following tasks to do in a rails application:
Download a video
Trim the video with FFmpeg to a given range (e.g. 00:02 - 00:09)
Convert the video to a given format
Move the converted video to a folder
Since I wanted to make this happen in background jobs, I used one Resque worker that processes a queue.
For the first job, I have created a queue like this:
#queue = :download_video, which does its task, and at the end of the task I move on to the next one by calling Resque.enqueue(ConvertVideo, name, itemId). In this way, I have created a chain of jobs, each enqueued when the previous one finishes.
This is very wrong, since if the first job starts enqueueing the other jobs (one from another), then everything gets blocked on one worker until the first list of queued jobs is finished.
How should this be optimised? I tried adding more workers to this way of enqueueing jobs, but the results are wrong and unpredictable.
Another aspect is that each job is saving a status in the database and I need the jobs to be processed in the right order.
Should each worker handle a single job type from the list above, meaning at least 4 workers? If I doubled that to 8 workers, would it be an improvement?
Have you considered using Sidekiq?
As the Sidekiq documentation says:
resque uses redis for storage and processes messages in a single-threaded process. The redis requirement makes it a little more difficult to set up, compared to delayed_job, but redis is far better as a queue than a SQL database. Being single-threaded means that processing 20 jobs in parallel requires 20 processes, which can take a lot of memory.
sidekiq uses redis for storage and processes jobs in a multi-threaded process. It's just as easy to set up as resque but more efficient in terms of raw processing speed. Your worker code does need to be thread-safe.
So you should have two kinds of jobs: download video and convert video. The download jobs can run in parallel (you can limit that if you want), and each result is then placed in one queue (the "in-between" queue) before being converted by multiple convert jobs in parallel.
I hope that helps; this link explains the Sidekiq best practices quite well: https://github.com/mperham/sidekiq/wiki/Best-Practices
As #Ghislaindj noted, Sidekiq might be an alternative, largely because it offers plugins that control execution ordering.
See this list:
https://github.com/mperham/sidekiq/wiki/Related-Projects#execution-ordering
Nonetheless, yes, you should be using different queues and more workers that are specific to each queue. So you have one set of workers all working on the :download_video queue, and then other workers attached to the :convert_video queue, etc.
If you want to continue using Resque another approach would be to use delayed execution, so when you enqueue your subsequent jobs you specify a delay parameter.
Resque.enqueue_in(10.seconds, ConvertVideo, name, itemId)
The downside to using delayed execution in Resque is that it requires the resque-scheduler package, so you're introducing a new dependency:
https://github.com/resque/resque-scheduler
For comparison Sidekiq has delayed execution natively available.
Have you considered merging all four tasks into just one? In this case you can have any number of workers, and one of them will do the whole job. It will work very predictably; you can even estimate how much time the task will take to finish. You also avoid the problem of one subtask taking longer than all the others and piling up in the queue.

Erlang VM: scheduler runtime information

I was searching for a way to retrieve information about how scheduling is done during a program's execution: which processes are on which scheduler, whether they move between schedulers, which process is active on each scheduler, whether each scheduler runs on one core, etc.
Any ideas or related documentation/articles/anything?
I would suggest you take a look at the following tracing/profiling options:
erlang:system_profile/2
It has options for monitoring scheduler and run queue (runnable_procs) activity.
The scheduler option will report
{profile, scheduler, Id, State, NoScheds, Ts}
where State will tell you if it is active or not. NoScheds reports the number of currently active schedulers (if I remember correctly).
The runnable_procs option will let you know if a process is put into or removed from a run queue of a particular scheduler.
If you have a system that supports DTrace, you can use the Erlang DTrace probes being developed to see exactly when process scheduling events occur.
For example, I wrote a simple one-liner that shows the number of nanoseconds that pass between sending a message to a process and having the recipient process scheduled for execution (give or take a few nanoseconds for cross-core clock variance and the like).
