Streamz + Dask worker occupancy is lower than I'd expect

I've got a Dask cluster with 32 workers running on a local machine and have tried to run a Streamz workflow against it, but I'm only seeing a couple of the workers occupied at any given time.
I see increased occupancy when running locally using:
client = Client(n_workers=32, processes=True, threads_per_worker=1, memory_limit='32GB')
but still nowhere near 32 workers are occupied at any given time (max about 8).
Why is this, and why does the task stream appear to show more tasks running in parallel than the occupancy would suggest?
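For context, the original workflow isn't shown above; a hypothetical minimal Streamz-on-Dask pipeline of this shape (process and the input range are placeholders, not from the question) looks roughly like:
import time
import streamz.dask  # registers the Dask nodes (scatter/gather/buffer) on Stream
from dask.distributed import Client
from streamz import Stream

def process(x):
    # placeholder for the real per-element work
    time.sleep(1)
    return x * 2

client = Client(n_workers=32, threads_per_worker=1, processes=True)

source = Stream()
(source.scatter()       # ship each element to the Dask cluster
       .map(process)    # run the work on a worker
       .buffer(32)      # allow up to 32 futures in flight at once
       .gather()        # pull results back to the local process
       .sink(print))    # consume the results

for item in range(100):
    source.emit(item)
In pipelines like this, back-pressure is one common reason for low occupancy: without a buffer stage, each emit can block until its result is gathered, so only a handful of tasks are in flight at a time.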

Related

How many dask jobs per worker

If I spin up a dask cluster with N workers and then submit more than N jobs using cluster.compute, does dask try to run all the jobs simultaneously (by scheduling more than 1 job on each worker) or are the jobs queued and run sequentially?
My recent experience of doing this seems to suggest the latter. Each job is pretty memory intensive and submitting more jobs than workers causes them all to crash due to memory issues.
Is there a way to force dask to strictly run only 1 job on 1 worker at a time and queue the other jobs?
The default behavior is set by the size of the cluster. If the number of workers is greater than 4, Dask tries to guess at a good number of threads to use in each worker. If you want to change this behavior, you can set the number of threads per worker with the threads_per_worker keyword argument when creating the cluster:
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(threads_per_worker=1)
client = Client(cluster)
client.compute(...)
If you're using an SSHCluster, you need to pass the number of threads per worker as an argument to the worker:
from dask.distributed import Client, SSHCluster

# host names are placeholders; the first host runs the scheduler
cluster = SSHCluster(["scheduler-host", "worker-host-1"], worker_options={"nthreads": 1})
client = Client(cluster)
client.compute(...)

How to pick the proper number of threads, workers, and processes for Dask when running in an ephemeral environment as a single machine and a cluster

Our company is currently leveraging prefect.io for data workflows (ELT, report generation, ML, etc). We have just started adding the ability to do parallel task execution, which is powered by Dask. Our flows are executed using ephemeral AWS Fargate containers, which will use Dask LocalCluster with a certain number of workers, threads, processes passed into the LocalCluster object.
Our journey on Dask will look very much like this:
Continue using a single-machine LocalCluster until we outgrow the maximum CPU/memory allowed
When we outgrow a single container, spawn additional worker containers (à la dask-kubernetes) and join them to the LocalCluster on the initial container.
We're currently starting with containers that have 256 CPU units (.25 vCPU) and 512 MB of memory, and pinning the LocalCluster to n_workers=1 and threads_per_worker=3 to get a reasonable amount of parallelism. However, this really is guesswork: n_workers=1 since it's a machine with less than 1 vCPU, and 3 threads because that doesn't sound crazy to me based on my previous experience running other Python-based applications in Fargate. This seems to work fine in a very simple example that just maps a function against a list of items.
from pathlib import Path
from time import sleep

from prefect import task, Flow

# LOGGER, HANDLER and SCHEDULE are assumed to be defined elsewhere in the script.

RENEWAL_TABLES = [
    'Activity',
    'CurrentPolicyTermStaus',
    'PolicyRenewalStatus',
    'PolicyTerm',
    'PolicyTermStatus',
    'EndorsementPolicyTerm',
    'PolicyLifeState',
]
RENEWAL_TABLES_PAIRS = [
    (i, 1433 + idx) for idx, i in enumerate(RENEWAL_TABLES)
]

@task(state_handlers=[HANDLER])
def dummy_step():
    LOGGER.info('Dummy Step...')
    sleep(15)

@task(state_handlers=[HANDLER])
def test_map(table):
    LOGGER.info('table: {}...'.format(table))
    sleep(15)

with Flow(Path(__file__).stem, SCHEDULE, state_handlers=[HANDLER]) as flow:
    first_step = dummy_step()
    test_map.map(RENEWAL_TABLES_PAIRS).set_dependencies(upstream_tasks=[first_step])
I see no more than 3 tasks executed at once.
I would really like to understand how to best configure n_workers (single machine), threads, and processes as we expand from a single machine to adding remote workers. I know it depends on my workload, but you could see a combination of things in a single flow, where one task does an extract from a database to a CSV and another task runs a pandas computation. I have seen suggestions online that threads should equal the number of CPUs requested, per the documentation, but it seems like you can still achieve parallelism with less than one CPU in Fargate.
Any feedback would be appreciated and could help others looking to leverage Dask in a more ephemeral setting.
Given that Fargate increments from .25 -> .50 -> 1 -> 2 -> 4 for vCPU, I think it's safe to go with a 1 worker to 1 vCPU setup. However, it would be helpful to understand how to choose a good upper limit for the number of threads per worker given how Fargate vCPU allotment works.
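For concreteness, a minimal sketch of the kind of pinned LocalCluster described above; the numbers mirror the question's 0.25 vCPU / 512 MB container and are illustrative, not a recommendation:
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=1,            # one worker process for a sub-1-vCPU container
    threads_per_worker=3,   # the question's guess; tasks run on these threads
    processes=True,
    memory_limit='512MB',   # keep the worker inside the container's memory
)
client = Client(cluster)
With this configuration at most threads_per_worker tasks run concurrently, which matches the "no more than 3 tasks executed at once" observation above.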

Why does a dask worker fail due to MemoryError on a "small" size task? [Dask.bag]

I am running a pipeline on multiple images. The pipeline consists of reading the images from the file system, doing some processing on each of them, then saving the images back to the file system. However, the dask workers fail due to MemoryError.
Is there a way to ensure the dask workers don't load too many images into memory? i.e. wait until there is enough space on a worker before starting the processing pipeline on a new image.
I have one scheduler and 40 workers, each with 4 cores and 15 GB of RAM, running CentOS 7. I am trying to process 125 images in a batch; each image is fairly large but small enough to fit on a worker; around 3 GB is required for the whole process.
I tried to process a smaller number of images and it works great.
EDITED
from dask.distributed import Client, LocalCluster

# LocalCluster is used here to show the config of the workers on the actual cluster
client = Client(LocalCluster(n_workers=2, resources={'process': 1}))

# read, foo, save, n and the real paths are defined elsewhere in the pipeline
paths = ['list', 'of', 'paths']

# Read the file data from each path
data = client.map(read, paths, resources={'process': 1})

# Apply foo to the data n times
for _ in range(n):
    data = client.map(foo, data, resources={'process': 1})

# Save the processed data
data = client.map(save, data, resources={'process': 1})

# Retrieve results
client.gather(data)
I expected the images to be processed as space became available on the workers, but it seems like the images are all loaded simultaneously on the different workers.
EDIT:
My issue is that all tasks get assigned to workers and they don't have enough memory. I found how to limit the number of tasks a worker handles at a single moment (see https://distributed.readthedocs.io/en/latest/resources.html#resources-are-applied-separately-to-each-worker-process).
However, with that limit, when I execute my tasks they all finish the read step, then the process step, and finally the save step. This is an issue since the images are spilled to disk.
Would there be a way to make every task finish before starting a new one?
e.g. on Worker-1: read(img1)->process(img1)->save(img1)->read(img2)->...
Dask does not generally know how much memory a task will need; it can only know the size of the outputs, and only once they are finished. This is because Dask simply executes a Python function and then waits for it to complete, and all sorts of things can happen within a Python function. You should generally expect as many tasks to begin as you have available worker cores, as you are finding.
If you want a smaller total memory load, then your solution should be simple: have a small enough number of workers, so that if all of them are using the maximum memory that you can expect, you still have some spare in the system to cope.
Regarding the EDIT: you may want to try running optimize on the graph before submission (although this should happen anyway, I think), as it sounds like your linear chains of tasks should be "fused". http://docs.dask.org/en/latest/optimize.html
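As a rough illustration (a common pattern, not taken from the answer above): wrapping read, process and save into a single function keeps each image's chain inside one task, so intermediates never sit in distributed memory waiting for the next stage. The names read, foo, save, n and paths are reused from the question's own snippet:
from dask.distributed import Client

client = Client('scheduler-address:8786')  # placeholder scheduler address

def handle_image(path):
    # read -> process (n times) -> save for one image, all inside a single task
    img = read(path)
    for _ in range(n):
        img = foo(img)
    return save(img)

futures = client.map(handle_image, paths)
client.gather(futures)
This gives the per-image ordering asked for in the EDIT (read(img1) -> process(img1) -> save(img1) -> read(img2) -> ...) without relying on graph fusion.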

Dask scheduler behavior while reading/retrieving large datasets

This is a follow-up to this question.
I'm experiencing problems with persisting a large dataset in distributed memory. I have a scheduler running on one machine and 8 workers each running on their own machines connected by 40 gigabit ethernet and a backing Lustre filesystem.
Problem 1:
import dask.array

# client is an existing dask.distributed.Client; DataSlicer is the user's own class
ds = DataSlicer(dataset) # ~600 GB dataset
dask_array = dask.array.from_array(ds, chunks=(13507, -1, -1), name=False) # ~22 GB chunks
dask_array = client.persist(dask_array)
When inspecting the Dask status dashboard, I see all 28 tasks being assigned to and processed by one worker while the other workers do nothing. Additionally, when every task has finished processing and the tasks are all in the "In memory" state, only 22 GB of RAM (i.e. the first chunk of the dataset) is actually stored on the cluster. Access to indices within the first chunk is fast, but any other indices force a new round of reading and loading the data before the result returns. This seems contrary to my belief that .persist() should pin the complete dataset across the memory of the workers once it finishes execution. In addition, when I increase the chunk size, one worker often runs out of memory and restarts due to being assigned multiple huge chunks of data.
Is there a way to manually assign chunks to workers instead of the scheduler piling all of the tasks on one process? Or is this abnormal scheduler behavior? Is there a way to ensure that the entire dataset is loaded into RAM?
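As a hedged aside (not from the original post), two things one might try in order to check and spread the persisted chunks: wait on the persisted array, then ask the scheduler to rebalance what is stored:
from dask.distributed import wait

dask_array = client.persist(dask_array)
wait(dask_array)      # block until every chunk is actually in distributed memory
client.rebalance()    # ask the scheduler to spread the stored chunks across workers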
Problem 2:
I found a temporary workaround by treating each chunk of the dataset as its own separate dask array and persisting each one individually.
import dask.array as da

# lazy_slices, shapes and dtype are built elsewhere from the same dataset
dask_arrays = [da.from_delayed(lazy_slice, shape, dtype, name=False)
               for lazy_slice, shape in zip(lazy_slices, shapes)]
for i in range(len(dask_arrays)):
    dask_arrays[i] = client.persist(dask_arrays[i])
I tested the bandwidth from persisted and published dask arrays to several parallel readers by calling .compute() on different chunks of the dataset in parallel. I could never achieve more than 2 GB/s aggregate bandwidth from the dask cluster, far below our network's capabilities.
Is the scheduler the bottleneck in this situation, i.e. is all data being funneled through the scheduler to my readers? If this is the case, is there a way to get in-memory data directly from each worker? If this is not the case, what are some other areas in dask I may be able to investigate?
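The benchmark itself isn't reproduced here; a simplified sketch of the kind of parallel-read test described (it uses threads inside one client rather than the separate readers and published datasets of the original, and the reader count is an assumption):
import time
from concurrent.futures import ThreadPoolExecutor

def read_chunk(i):
    # pull one persisted chunk back from the cluster
    return dask_arrays[i].compute()

n_readers = 4  # assumed number of parallel readers
start = time.time()
with ThreadPoolExecutor(n_readers) as pool:
    results = list(pool.map(read_chunk, range(n_readers)))
elapsed = time.time() - start

total_bytes = sum(r.nbytes for r in results)
print('aggregate bandwidth: {:.2f} GB/s'.format(total_bytes / elapsed / 1e9))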

Spark JobServer, memory settings for release

I've set up a spark-jobserver to enable complex queries on a reduced dataset.
The jobserver executes two operations:
Sync with the main remote database: it makes a dump of some of the server's tables, reduces and aggregates the data, saves the result as a Parquet file, and caches it as a SQL table in memory. This operation will be done every day (a rough sketch of this step is shown below);
Queries: when the sync operation is finished, users can perform complex SQL queries on the aggregated dataset, possibly exporting the result as a CSV file. Every user can do only one query at a time and waits for its completion.
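For illustration only, a minimal PySpark sketch of the kind of sync step described in operation 1; the JDBC URL, table name, credentials and aggregation are placeholders, not details from the original setup:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('daily-sync').getOrCreate()

# Placeholder connection details and table name
df = spark.read.jdbc(
    url='jdbc:postgresql://main-db-host/prod',
    table='policy_terms',
    properties={'user': 'etl', 'password': '...'},
)

# Reduce/aggregate the dump (placeholder aggregation)
reduced = df.groupBy('policy_id').count()

# Save as Parquet, then cache the result as an in-memory SQL table
reduced.write.mode('overwrite').parquet('/data/aggregated.parquet')
aggregated = spark.read.parquet('/data/aggregated.parquet')
aggregated.createOrReplaceTempView('aggregated')
spark.catalog.cacheTable('aggregated')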
The biggest table (before and after the reduction, which also includes some joins) has almost 30M rows, with at least 30 fields.
Currently I'm working on a dev machine with 32 GB of RAM dedicated to the job server, and everything runs smoothly. The problem is that in production we have the same amount of RAM, shared with a PredictionIO server.
I'm asking how to determine the memory configuration to avoid memory leaks or crashes for Spark.
I'm new to this, so every reference or suggestion is accepted.
Thank you
Take an example: if you have a server with 32 GB of RAM, you would set the following parameter:
spark.executor.memory = 32g
Take note (the quote below, from the Cloudera post linked at the end, assumes a cluster of six nodes, each with 16 cores and 64 GB of RAM):
The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
63GB + the executor memory overhead won't fit within the 63GB capacity of the NodeManagers. The application master will take up a core on one of the nodes, meaning that there won't be room for a 15-core executor on that node. 15 cores per executor can lead to bad HDFS I/O throughput.
A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?
This config results in three executors on all nodes except for the one with the AM, which will have two executors. --executor-memory was derived as (63 GB / 3 executors per node) = 21; 21 * 0.07 = 1.47; 21 - 1.47 ~ 19.
This is explained here if you want to know more: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
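To make the quoted arithmetic easy to rerun for other node sizes, here is a small Python sketch of the same calculation; the 7% overhead factor and the executors-per-node count come from the quote above, and the 32 GB case is this question's server:
def executor_memory_gb(node_ram_gb, executors_per_node, overhead_fraction=0.07):
    # Memory per executor after subtracting Spark's memory overhead,
    # following the quoted rule: 63 / 3 = 21; 21 - 21 * 0.07 ~ 19
    per_executor = node_ram_gb / executors_per_node
    return per_executor - per_executor * overhead_fraction

print(executor_memory_gb(63, 3))   # ~19.5 -> round down to 19G
print(executor_memory_gb(32, 1))   # ~29.8 for the 32 GB single-server case
Applied to the 32 GB machine in this question, the same rule would suggest leaving a couple of GB of headroom for overhead rather than giving the executor the full 32 GB.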
