If I spin up a Dask cluster with N workers and then submit more than N jobs using cluster.compute, does Dask try to run all the jobs simultaneously (by scheduling more than one job on each worker), or are the jobs queued and run sequentially?
My recent experience suggests the latter. Each job is pretty memory intensive, and submitting more jobs than workers causes them all to crash due to memory issues.
Is there a way to force Dask to strictly run only one job on one worker at a time and queue the other jobs?
The default behavior is set by the size of the cluster: if the number of workers is greater than 4, Dask tries to guess at a good number of threads to use in each worker. If you want to change this behavior, you can set the number of threads per worker with the threads_per_worker keyword argument when creating the cluster:
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(threads_per_worker=1)
client = Client(cluster)
client.compute(...)
If you're using an SSHCluster, you need to pass the number of threads per worker through the worker_options argument:
from dask.distributed import Client, SSHCluster

# "host1", "host2", "host3" are placeholder hostnames; the first entry runs the scheduler
cluster = SSHCluster(["host1", "host2", "host3"], worker_options={"nthreads": 1})
client = Client(cluster)
client.compute(...)
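As a quick sanity check (a small sketch; n_workers=4 is just an example value), you can confirm that each worker really has a single thread before submitting the memory-heavy jobs:
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# Maps each worker address to its thread count; every value should be 1 here,
# so at most one job runs per worker and the rest wait in the scheduler's queue.
print(client.nthreads())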
I submit a Dask task like this:
client = Client(cluster)
future = client.submit(
    # dask task
    my_dask_task,  # a task that consumes at most 100 MiB
    # task arguments
    arg1,
    arg2,
)
Everything works fine.
Now I set some constraints:
client = Client(cluster)
future = client.submit(
    # dask task
    my_dask_task,  # a task that consumes at most 100 MiB
    # task arguments
    arg1,
    arg2,
    # resource constraints at the Dask scheduler level
    resources={
        'process': 1,
        'memory': 100*1024*1024  # 100 MiB
    }
)
The problem is that, in this case, the future is never resolved and the Python program waits forever, even with only 'process': 1 and/or a very small amount of RAM like 'memory': 10. So it's weird.
Beyond this reduced example, in my real-world application a given Dask worker is configured with multiple processes and may therefore run multiple tasks at the same time.
So I want to declare the amount of RAM each task needs, to prevent the Dask scheduler from running tasks on a given Dask worker in a way that can lead to out-of-memory errors.
Why doesn't it work as expected? How can I debug it?
Thank you
Adding to @pavithraes's comment: the resources argument to client.submit and other scheduling calls does NOT modify the available workers. Instead, it creates a constraint on which workers can be used for the given tasks. Importantly, the terms you use here, "process" and "memory", are not interpreted by Dask in terms of physical hardware; they are simply labels you can define, which Dask uses to filter the available workers down to only those that match your tag criteria.
From the dask docs:
Resources listed in this way are just abstract quantities. We could equally well have used terms “mem”, “memory”, “bytes” etc. above because, from Dask’s perspective, this is just an abstract term. You can choose any term as long as you are consistent across workers and clients.
It’s worth noting that Dask separately track number of cores and available memory as actual resources and uses these in normal scheduling operation.
This is why your tasks hang forever: the scheduler is waiting for workers that advertise the resources you requested to appear, so that it can schedule these tasks. Unless you create workers with these tags applied, the jobs will never start.
See the Dask docs on specifying and using worker resources, and especially the section on Specifying Resources, for more information about how to configure workers so that such resource constraints can be applied.
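For example, here is a minimal sketch, assuming a scheduler already running at tcp://scheduler:8786 and reusing the 'process'/'memory' labels from the question (the exact --resources syntax is covered in those docs). Start each worker so that it advertises the matching abstract resources:
dask-worker tcp://scheduler:8786 --resources "process=1 memory=104857600"
Then the submit call from the question has eligible workers to land on:
from dask.distributed import Client

client = Client('tcp://scheduler:8786')
future = client.submit(
    my_dask_task, arg1, arg2,
    # only workers advertising at least this much 'process' and 'memory' qualify
    resources={'process': 1, 'memory': 100*1024*1024},
)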
I've been using Dask for a good while, but I still don't know whether there is a queue system for tasks by default. Let's say we have a local cluster and a client for it:
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
cli = Client(cluster)
I'd like to run the following not in parallel but task after task (submit after submit, or future after future):
import time
a, b = cli.submit(time.sleep, 5), cli.submit(time.sleep, 1)
It's easy to see that these run at the same time, since future b finishes before future a. My question is the following:
Is it possible to force that future b does not start before future a finishes?
If tasks are heavy, I don't want all of them running at the same time; I'd like some kind of queue system. Is dask-jobqueue the way to go, or what? I have no external queue systems available (Slurm etc.).
Or does the Dask scheduler somehow take care not to overburden workers by scheduling too many simultaneous tasks?
To have one task depend on another, you have some options:
don't launch tasks until the previous ones have finished, e.g. by calling .result() on a future to wait for it; in this case, Dask isn't doing much for you
set up your cluster to limit the number of worker threads to as many tasks as you think can comfortably run simultaneously, with appropriate arguments to LocalCluster (this is the preferred solution; see the sketch after the code below)
have a task explicitly depend on a previous one, e.g.:
import time

def sleepme(t, *args):
    time.sleep(t)
    print("done", t)

f1 = client.submit(sleepme, 5)
f2 = client.submit(sleepme, 1, f1)  # won't run until f1 is done
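For the second option, a minimal sketch (assuming a single local machine) that limits the cluster to one running task at a time, so additional futures simply wait in the scheduler's queue:
import time
from dask.distributed import Client, LocalCluster

# One worker process with a single thread: only one task executes at a time,
# and any further submissions stay queued on the scheduler until the thread frees up.
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
client = Client(cluster)

a = client.submit(time.sleep, 5)
b = client.submit(time.sleep, 1)  # starts only once the single worker thread is free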
I have a function that uses image processing routines which are themselves multithreaded. I distribute many of those function calls on a Dask cluster.
First, I started a scheduler on a host: dask-scheduler. Then I started the workers: dask-worker --nthreads 1 --memory-limit 0.9 tcp://scheduler:8786.
The Python code looks similar to this:
import SimpleITK as sitk

def func(filename):
    sitk.ProcessObject.SetGlobalDefaultNumberOfThreads(4)  # limit to four threads
    img = sitk.ReadImage(filename)
    # Do more stuff and store resulting image
    # SimpleITK is already multithreaded
    return 'someresult'

# [...]

from distributed import Client

client = Client('tcp://scheduler:8786')

futures = client.map(func, ['somefile', 'someotherfile'])
for result in client.gather(futures):
    print(result)
Right now, I have to set the number of threads for each worker to one in order not to overcommit the CPU on the worker node. But in some cases it makes sense to limit the number of cores used by SimpleITK, because the gain from more threads is not that high; instead, I could run multiple function calls in parallel on the same host.
But in that case I would have to calculate all the core usage by hand.
Ideally, I would like to set an arbitrary number of cores each function may use, and Dask should decide how many parallel function invocations are started on each node, given the number of available threads. I.e., is it possible to specify the number of threads a function will use?
No, Dask is not able to limit the number of threads spawned by some function, and it doesn't attempt to measure this either.
The only thing I can think of that you might want to do is use Dask's abstract resources, where you control how much of each labelled quantity is available per worker and how much each task needs in order to run.
futures = client.map(func, ['somefile', 'someotherfile'], resources=...)
I don't see an obvious way to assign resources to workers using Cluster() (i.e., the default LocalCluster); you may need to use the CLI.
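A rough sketch of that idea, assuming the scheduler address from the question and an arbitrary label "cpus" (any consistent label works, since abstract resources are not tied to physical hardware):
# Start each worker so it advertises an abstract 'cpus' budget, e.g. via the CLI:
# dask-worker tcp://scheduler:8786 --nthreads 4 --resources "cpus=4"

from distributed import Client

client = Client('tcp://scheduler:8786')

# Each invocation of func declares that it needs 4 'cpus'; with the worker budget
# above, at most one such task runs per worker at a time.
futures = client.map(func, ['somefile', 'someotherfile'], resources={'cpus': 4})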
Our company is currently leveraging prefect.io for data workflows (ELT, report generation, ML, etc.). We have just started adding the ability to do parallel task execution, which is powered by Dask. Our flows are executed using ephemeral AWS Fargate containers, which use a Dask LocalCluster with a certain number of workers, threads, and processes passed into the LocalCluster object.
Our journey on Dask will look very much like this:
Continue using a single-machine LocalCluster until we outgrow the maximum CPU/memory allowed
When we outgrow a single container, spawn additional worker containers from the initial container (a la dask-kubernetes) and join them to the LocalCluster.
We're currently starting with containers that have 256 CPU units (.25 vCPU) and 512 MB of memory, and pinning the LocalCluster to n_workers=1 and threads_per_worker=3 to get a reasonable amount of parallelism. However, this really is guesswork: 1 worker since it's a machine with less than 1 vCPU, and 3 threads because that doesn't sound crazy based on my previous experience running other Python-based applications in Fargate. This seems to work fine in a very simple example that just maps a function against a list of items.
RENEWAL_TABLES = [
    'Activity',
    'CurrentPolicyTermStaus',
    'PolicyRenewalStatus',
    'PolicyTerm',
    'PolicyTermStatus',
    'EndorsementPolicyTerm',
    'PolicyLifeState'
]

RENEWAL_TABLES_PAIRS = [
    (i, 1433 + idx) for idx, i in enumerate(RENEWAL_TABLES)
]
@task(state_handlers=[HANDLER])
def dummy_step():
    LOGGER.info('Dummy Step...')
    sleep(15)

@task(state_handlers=[HANDLER])
def test_map(table):
    LOGGER.info('table: {}...'.format(table))
    sleep(15)

with Flow(Path(__file__).stem, SCHEDULE, state_handlers=[HANDLER]) as flow:
    first_step = dummy_step()
    test_map.map(RENEWAL_TABLES_PAIRS).set_dependencies(upstream_tasks=[first_step])
I see no more than 3 tasks executed at once.
I would really like to understand how best to configure n_workers (single machine), threads, and processes as we expand from a single machine to adding remote workers. I know it depends on my workload, but you could see a combination of things in a single flow where one task does an extract from a database to a CSV and another task runs a pandas computation. I have seen suggestions online that threads should equal the number of CPUs requested, but it seems like you can still achieve parallelism with less than one CPU in Fargate.
Any feedback would be appreciated and could help others looking to leverage Dask in a more ephemeral fashion.
Given that Fargate increments from .25 -> .50 -> 1 -> 2 -> 4 vCPUs, I think it's safe to go with a 1 worker to 1 vCPU setup. However, it would be helpful to understand how to choose a good upper limit for the number of threads per worker, given how Fargate vCPU allotment works.
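Not an authoritative answer, but as a starting-point sketch for the container size mentioned above (the memory_limit value is an assumption matching the 512 MB task; drop threads_per_worker to 1 for CPU-bound pandas work):
from dask.distributed import Client, LocalCluster

# Rough sizing for a .25 vCPU / 512 MB Fargate task: one worker process, a few
# threads for I/O-bound steps such as database extracts, and a memory limit
# matching the container so Dask can spill or pause before Fargate kills the task.
cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=3,
    memory_limit="512MB",
)
client = Client(cluster)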
My Heroku Rails app maintains a large, frequently changing list of keywords.
I want to spawn N workers that will divide this list of keywords equally among themselves and work on them until they are restarted (I restart them every time the list of keywords changes). Once restarted, they divide up the keywords again and churn away.
For example: Let's say I have 1,000 keywords.
If I spawn up 1 worker, that worker will take 1,000 keywords.
If I spawn up 10 workers, each worker will take 100 keywords.
If I spawn up 1,000 workers, each worker will take 1 keyword.
Workers basically just open a connection with Twitter for their set of keywords and process incoming tweets that match those keywords.
Any ideas on how to set up the Procfile and delegate X keywords between N workers?
Here's a naive, pseudo-manual approach just for demonstration. However, I want to be able to spawn an arbitrary number of workers that will automatically split the keywords amongst themselves.
Procfile:
keywordstreamer0: bundle exec ruby keyword_streamer.rb 0
keywordstreamer1: bundle exec ruby keyword_streamer.rb 1
keyword_streamer.rb
streamer_id = ARGV.shift.to_i # 0 or 1
# Split all keywords into array of two groups and take the group
# that matches this worker id (so the two workers take different groups)
keywords = Keyword.all.split_into_groups_of(2)[streamer_id]
# Example work loop
TwitterStream.track(keywords).each do |incoming_tweet|
  process(incoming_tweet)
end
Then, in my app, when I need to restart my keyword workers:
["keywordstreamer0", "keywordstreamer1"].each do |streamer|
restart(streamer)
end
I'd instead like to be able to spawn N of these workers, but I'm having trouble working out a solution. I'd appreciate any high-level suggestions!
If you're just processing one keyword at a time, in no particular order or grouping, you could just use a queue.
Each worker simply fetches the next keyword off the queue (or perhaps the next batch of keywords, for performance), does the work, and then saves the results somewhere. You don't need to worry about partitioning the workload, since the workers will simply ask for more work when they're ready, allowing you to scale to N workers without needing each worker to know about the total size of the available workload.
There are many possible ways you can implement queues for your data. A couple of more specialized ones that I've used before are AMQP and Redis, but that's hardly an exhaustive list.
I'm going to take a guess and say that since you've got Keyword.all in your example code, and you're on Heroku, you're using Postgres. You can also emulate a queue in Postgres without too much difficulty, although it obviously won't perform as well as a purpose-built queue.
Here's one way of doing it:
Add a status column to your keywords. It will have 3 values: ready, in-progress, and complete. The default value for the status column is ready.
The pseudocode for your worker would look like this:
loop do
  keyword = Keyword.where(:status => "ready").limit(1).first
  keyword.update_attributes!(:status => "in-progress")

  result = process(keyword)
  save_result_somewhere(result)

  keyword.update_attributes!(:status => "complete")
end
I've left out a bunch of implementation details, like gracefully handling the queue being empty, the initial setup of the queue, batching, locking so that two workers don't grab the same keyword, and so on. But that's the gist of it. This should perform adequately for modest values of N, at least up to 10 or so workers. Beyond that you may want to consider a purpose-built queuing technology.
Once your queue is set up, every single worker is identical and autonomous. Just heroku ps:scale worker=N and you're done!