I'm a new user of Dask. I have some code in which I use Dask for parallelization. Is there an easy way, like a flag for example, to run the code with Dask off, that is, in serial?
See the docs
You can set the scheduler to be the single-thread, in-process one, known as "sync" (or "single-threaded") within a context block:
import dask

with dask.config.set(scheduler='sync'):
    ...  # do stuff
or, to change the default globally until further notice:
dask.config.set(scheduler='sync')
# do stuff
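If what you want is a single on/off switch in your own script, one option (a minimal sketch; the USE_DASK flag and the choice of 'threads' are just illustrative, not something Dask mandates) is to pick the scheduler from that flag once at startup:

import dask

USE_DASK = False  # hypothetical flag: flip to True to get parallel execution back

dask.config.set(scheduler='threads' if USE_DASK else 'sync')

# everything below runs serially when USE_DASK is False
# result = my_dask_collection.compute()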
EDIT:
My question was badly put, so I deleted it and am rephrasing it entirely here.
The tl;dr:
I'm trying to assign each computation to a designated worker that fits the computation type.
In more detail:
I'm trying to run a simulation, so I represent it using a class of the form:
class Simulation:
    def __init__(self, first_client: Client, second_client: Client):
        self.first_client = first_client
        self.second_client = second_client

    def first_calculation(self, input):
        with self.first_client.as_current():
            ...
            return output

    def second_calculation(self, input):
        with self.second_client.as_current():
            ...
            return output

    def run(self, input):
        return self.second_calculation(self.first_calculation(input))
This format has downsides, such as the fact that the simulation object is not pickleable.
I could edit the Simulation object to contain only scheduler addresses rather than clients, for example, but I feel there must be a better solution. For instance, I would like the simulation object to work the following way:
class Simulation:
    def first_calculation(self, input):
        client = dask.distributed.get_client()
        with client.as_current():
            ...
            return output
    ...
The thing is, the Dask workers best suited for the first calculation are different from the Dask workers best suited for the second calculation, which is why my Simulation object has two clients connecting to two different schedulers to begin with. Is there any way to have only one client but two schedulers, and to make the client know to send first_calculation to the first scheduler and second_calculation to the second one?
Dask will chop up large computations into smaller tasks that can run in parallel. Those tasks are then submitted by the client to the scheduler, which in turn schedules them on the available workers.
Sending the client object to a Dask scheduler will likely not work due to the serialization issue you mention.
You could try one of two approaches:
Depending on how you actually run those worker machines, you could specify different types of workers for different tasks. If you run on Kubernetes, for example, you could try to leverage the node pool functionality to make different worker types available.
An easier approach using your existing infrastructure would be to bring the result of your first computation back to the machine the client is running on, with something like .compute(), and then use that data as input for the second computation, as in the sketch below. In this case you're sending the actual data over the network instead of the client. If the size of that data becomes an issue, you can always write the intermediate results to something like S3.
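A minimal sketch of that pattern, assuming two separate clusters (the scheduler addresses, input_data, and the first_calculation/second_calculation functions are placeholders taken from the question, not real endpoints):

from dask.distributed import Client

first_client = Client('tcp://first-scheduler:8786')    # hypothetical address
second_client = Client('tcp://second-scheduler:8786')  # hypothetical address

# run the first stage on the first cluster and pull the result back locally
intermediate = first_client.submit(first_calculation, input_data).result()

# feed the materialised data to the second cluster
final = second_client.submit(second_calculation, intermediate).result()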
Dask does support pinning specific tasks to specific workers with dask.annotate. Here's an example snippet, where the delayed_sum task was sent to one worker and the doubled task to the other. The assert statements check that those tasks really were restricted to those workers. With annotate you shouldn't need separate clusters. You'll also need the most recent versions of Dask and Distributed for this to work, because of a recent bug fix.
import distributed
import dask
from dask import delayed
local_cluster = distributed.LocalCluster(n_workers=2)
client = distributed.Client(local_cluster)
workers = list(client.scheduler_info()['workers'].keys())
with dask.annotate(workers=workers[0]):
    delayed_sum = delayed(sum)([1, 2])

with dask.annotate(workers=workers[1]):
    doubled = delayed_sum * 2
# use persist so scheduler doesn't clean up
# wrap in a distributed.wait to make sure they're there when we check the scheduler
distributed.wait([doubled.persist(), delayed_sum.persist()])
worker_restrictions = local_cluster.scheduler.worker_restrictions
assert worker_restrictions[delayed_sum.key] == {workers[0]}
assert worker_restrictions[doubled.key] == {workers[1]}
I have a task that I need to enqueue immediately after the request is created and have it done ASAP.
For this purpose, I created a config/sidekiq.yml file where I defined this:
---
:queues:
- default
- [critical, 10]
And for the respective worker, I set this:
class GeneratePDFWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'critical', retry: false

  def perform(order_id)
    ...
Then I call this worker:
GeneratePDFWorker.perform_async(@order.id)
So I am testing this. But I found this post, where it is said that if I want to execute the task immediately, I should call:
GeneratePDFWorker.new.perform(@order.id)
So my question is: should I use the combination of a (critical) queue + the new (GeneratePDFWorker.new.perform) method? Does it make sense?
Also, how can I verify that the task is executed as critical?
Thank you
So my question is - should I use the combination of a (critical) queue + the new (GeneratePDFWorker.new.perform) method? Does it make sense?
Using GeneratePDFWorker.new.perform will run the code right there and then, like normal, inline code (in a blocking manner, not async). You can't define a queue, because it's not being queued.
As Walking Wiki mentioned, GeneratePDFWorker.new.perform(@order.id) will call the worker synchronously. So if you did this from a controller action, the request would block until the perform method completed.
I think your approach of using priority queues for critical tasks with Sidekiq is the way to go. As long as you have enough Sidekiq workers, and your queue isn't backlogged, the task should run almost immediately so the benefit of running your worker in-process is pretty much nil. So I'd say yes, it does make sense to queue in this case.
Also, you're probably aware of this, but Sidekiq has a great monitoring UI: https://github.com/mperham/sidekiq/wiki/Monitoring. This should make it easy to get reliable, detailed metrics on the performance of your workers.
should I use the combination of a (critical) queue?
Me:
Yes, you can use the critical queue if you want. A queue with a weight of 2 will be checked twice as often as a queue with a weight of 1.
Tips:
Keep the number of queues as small as possible. Sidekiq is not designed to handle a tremendous number of queues.
Also keep weights as simple as possible. If you want queues always processed in a specific order, just declare them in order without weights, as in the sketch below.
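For example, a strictly ordered sidekiq.yml (a sketch reusing the queue names from the question) would look like this; Sidekiq then always drains critical before looking at default:

---
:queues:
  - critical
  - default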
the new (GeneratePDFWorker.new.perform) method?
Me: No. Running the worker synchronously inside the request thread is bad in the first place. It will hamper your application's performance, since your application server stays busy for longer, and that gets expensive. What would be the point of using Sidekiq then?
With delayed_job, I was able to do simple operations like this:
@foo.delay.increment!(:myfield)
Is it possible to do the same with Rails' new ActiveJob? (without creating a whole bunch of job classes that do these small operations)
ActiveJob is merely an abstraction on top of various background job processors, so many capabilities depend on which provider you're actually using. But I'll try not to depend on any particular backend.
Typically, a job provider consists of a persistence mechanism and runners. When offloading a job, you write it into the persistence mechanism in some way, and later one of the runners retrieves and runs it. So the question is: can you express your job data in a format compatible with any action you need?
That will be tricky.
Let's define what a job is, then. For instance, it could be a single method call. Assume this syntax:
Model.find(42).delay.foo(1, 2)
We can use the following format:
{
  class: 'Model',
  id: '42', # whatever
  method: 'foo',
  args: [
    1, 2
  ]
}
Now how do we build such a hash from a given call and enqueue it to a job queue?
First of all, as it appears, we'll need to define a class that has a method_missing to catch the called method name:
class JobMacro
  attr_accessor :data

  def initialize(record = nil)
    self.data = {}

    if record.present?
      self.data[:class] = record.class.to_s
      self.data[:id] = record.id
    end
  end

  def method_missing(action, *args)
    self.data[:method] = action.to_s
    self.data[:args] = args

    GenericJob.perform_later(data)
  end
end
The job itself will have to reconstruct that expression like so (a full job class sketch follows below):
data[:class].constantize.find(data[:id]).public_send(data[:method], *data[:args])
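A minimal sketch of that GenericJob (the name simply matches the call in method_missing above; depending on your queue adapter, hash keys may come back as strings, hence the symbolize_keys):

class GenericJob < ApplicationJob
  queue_as :default

  def perform(data)
    data = data.symbolize_keys  # some adapters deserialize keys as strings
    data[:class].constantize
                .find(data[:id])
                .public_send(data[:method], *data[:args])
  end
end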
Of course, you'll have to define the delay macro on your model. It may be best to factor it out into a module, since the definition is quite generic:
def delay
  JobMacro.new(self)
end
It does have some limitations:
Only supports running jobs on persisted ActiveRecord models. A job needs a way to reconstruct the callee to call the method on, and I've picked the most probable one. You could also use marshalling, but I consider that unreliable: the unmarshalled object may be invalid by the time the job gets to execute. The same goes for GlobalID.
It uses Ruby's reflection. That's a tempting solution to many problems, but it isn't fast and is a bit risky security-wise, so use this approach cautiously.
Only one method call. No procs (you could probably do that with the ruby2ruby gem). It relies on the job provider to serialize arguments properly; if it fails to, help it with your own code. For instance, que uses JSON internally, so whatever works in JSON works in que. Symbols, for instance, don't.
Things will break in spectacular ways at first.
So make sure to set up your debugging tools before starting off.
An example of this is Sidekiq's backward (Delayed::Job) compatibility extension for ActiveRecord.
As far as I know, this is currently not supported. You can easily simulate this feature using a custom-defined proxy-job that accepts a model or instance, a method to be performed and a list of arguments.
However, for the sake of code testing and maintainability, this shortcut is not a good approach. It's more effective (even if you need to write a little bit more of code) to have a specific job for everything you want to enqueue. It forces you to think more about the design of your app.
I wrote a gem that can help you with that: https://github.com/cristianbica/activejob-perform_later. But be aware that I believe having methods all around your code that might be executed in workers is the perfect recipe for disaster if not handled carefully :)
I have a controller that spins off 6 Sidekiq workers for faster parallel processing of a large file. Before that, however, I want to provide these workers with a few variables that should be available across all of them, because the variables themselves are fairly memory-intensive. (They are only read from, not written to, so the usual concurrency issues don't exist.)
In other words, my controller looks like this:
def foo
  $bar1 = ....
  $bar2 = ...
  worker.perform_async()...
  worker2.perform_async()...
end
I don't want to put those global vars into the perform methods' arguments, because serializing them to Redis chokes the entire thing. My issue is that the workers cannot see these variables and die with a NoMethodError (i.e. trying to call .first on one of them raises that error because the variable is nil for the workers).
How come? Is there any other way to do this that won't kill my memory? (I.e. I don't want to take up most of the memory with six copies of the same large array.)
Sidekiq runs in a separate process, so it doesn't share memory with the process that enqueues the worker.
If the data is static, you might want to load it at the start of the Sidekiq process (for example when you configure the Sidekiq server), as in the sketch below.
If it changes per task, you should model it so that you have a global repository to hold it (if Redis is not a good fit for this, maybe you can try memcached).
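A minimal sketch of the first option (the LARGE_LOOKUP constant and load_large_dataset helper are illustrative names, not part of Sidekiq):

# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  # hypothetical loader; runs once when the Sidekiq process boots,
  # so every job thread in that process shares the same in-memory object
  LARGE_LOOKUP = load_large_dataset
end

The jobs can then read LARGE_LOOKUP directly, and nothing large has to be serialized through Redis.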
Is there any way to move a Resque job between two different queues?
We sometimes get into a situation where we have a big queue, and for a job near the end of it we find a need to "bump up its priority." We thought an easy way might be to simply move it to another queue that has a worker waiting for high-priority jobs.
This happens rarely and is usually a case where we get a special call from a customer, so scaling, re-engineering don't seem totally necessary.
There is nothing built into Resque for this. You can use rpoplpush, like:
module Resque
  def self.move_queue(source, destination)
    r = Resque.redis
    r.llen("queue:#{source}").times do
      r.rpoplpush("queue:#{source}", "queue:#{destination}")
    end
  end
end
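Usage would then be something like the line below (the queue names are just examples); note that this moves every job currently in the source queue, not a single job:

Resque.move_queue('bulk_imports', 'critical')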
https://gist.github.com/rafaelbandeira3/7088498
If it's a rare occurrence you're probably better off just manually pushing a new job into a shorter queue. You'll want to make sure that your system has a way to identify that the job has already run and to bail out so that when the job in the long queue is finally reached it is not processed again (if double processing is a problem for you).