Local Dask allows using the process scheduler. Workers in dask distributed use a ThreadPoolExecutor to compute tasks. Is it possible to replace the ThreadPoolExecutor with a ProcessPoolExecutor in dask distributed? Thanks.
The distributed scheduler allows you to work with any number of processes, via any of the deployment options. Each of these can have one or more threads. Thus, you have the flexibility to choose your favourite mix of threads and processes as you see fit.
The simplest expression of this is with the LocalCluster (same as Client() by default):
cluster = LocalCluster(n_workers=W, threads_per_worker=T, processes=True)
makes W workers with T threads each (which can be 1).
As things stand, the implementation of workers uses a thread pool internally, and you cannot swap in a process pool in its place.
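For concreteness, a minimal runnable sketch of the above with example values filled in (4 workers with 1 thread each is only an illustration):

from dask.distributed import Client, LocalCluster

# Four single-threaded worker processes: tasks run in separate processes
# rather than in the threads of a single process.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, processes=True)
client = Client(cluster)

# Submit work as usual; the scheduler spreads it across the processes.
futures = client.map(lambda x: x ** 2, range(10))
print(client.gather(futures))

client.close()
cluster.close()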
In Temporal, why is a single Worker per service sufficient? Doesn't it become a bottleneck when the system needs to scale? Is a Worker a single-threaded or a multi-threaded process?
I have gone through the Temporal documentation but couldn't understand why a single Worker per client service is sufficient.
I also tried creating a different task queue for different workflows and created a new worker (using the workerfactory.newWorker(..) method, creating two workers in the same process) to listen on the new task queue. When I observed the workers in the Temporal UI, I saw the same worker id for both task queues.
In many production scenarios, a single Worker is not sufficient, and people run a pool of multiple Workers, each with the same Workflows and/or Activities registered, and polling the same Task Queue.
To tell when a single Worker isn't sufficient, you can look at metrics:
https://docs.temporal.io/application-development/worker-performance
Is a Worker a single-threaded or multi-threaded process?
It depends on which SDK. The Java SDK has multi-threaded Workers: see for example
https://www.javadoc.io/static/io.temporal/temporal-sdk/1.7.0/io/temporal/worker/WorkerFactoryOptions.Builder.html#setMaxWorkflowThreadCount-int-
You can give different Worker instances different identities with:
https://www.javadoc.io/static/io.temporal/temporal-sdk/1.7.0/io/temporal/client/WorkflowClientOptions.Builder.html#setIdentity-java.lang.String-
I'm setting up Dask, and I can use dask for multiprocessing just fine.
I run into issues, however, when I want to use pre-configured Dask workers. They don't have the same imports as my main process.
How do I add custom imports to Dask workers so that all futures running on those workers can operate effectively?
Ideally your Dask workers should all have the same software environment. Typically this is guaranteed outside of Dask with Docker images or with a Network File System (NFS). There are some other solutions like Client.upload_file, which can be useful for small scripts.
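For the small-script case, a sketch using Client.upload_file (my_helpers.py and my_helpers.transform are hypothetical names):

from dask.distributed import Client

client = Client("scheduler-address:8786")  # placeholder address

# Ship a local module to the workers so tasks can import it there.
client.upload_file("my_helpers.py")

def task(x):
    import my_helpers  # imported on the worker, not the client
    return my_helpers.transform(x)

future = client.submit(task, 42)
print(future.result())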
We have jobs which interact with native code and there are unavoidable memory leaks while the worker is processing the task. The simple solution for our problems has been to restart the worker after a specified number of tasks.
We are migrating from python's multiprocessing which has a useful maxtasksperchild option which closes down the workers after a specified number of tasks.
Is there something built-in in dask that is comparable to maxtasksperchild?
As a workaround, we are keeping track of the workers who have completed a task by appending their worker address to the result payload and calling retire_workers on the client side manually.
No, there is no such equivalent in Dask.
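A minimal sketch of the workaround described above, assuming each task returns the address of the worker that ran it and the client retires workers once they pass a threshold (MAX_TASKS_PER_WORKER and process_item are illustrative names, not Dask API):

from collections import Counter
from dask.distributed import Client, as_completed, get_worker

MAX_TASKS_PER_WORKER = 100  # illustrative threshold

def run_task(x):
    result = process_item(x)  # stand-in for the leaky native-code call
    return result, get_worker().address

client = Client("scheduler-address:8786")  # placeholder address
counts = Counter()

futures = client.map(run_task, range(1000))
for future in as_completed(futures):
    result, address = future.result()
    counts[address] += 1
    if counts[address] >= MAX_TASKS_PER_WORKER:
        # Manually retire the worker so its process shuts down
        # and releases the leaked memory.
        client.retire_workers(workers=[address])
        counts[address] = 0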
When running a Dask worker I notice that there are a few extra threads beyond what I was expecting. How many threads should I expect to see running from a Dask Worker and what are they doing?
Dask workers have the following threads:
A pool of threads in which to run tasks. This is typically somewhere between 1 and the number of logical cores on the computer.
One administrative thread to manage the event loop, communication over (non-blocking) sockets, responding to fast queries, the allocation of tasks onto worker threads, etc.
A couple of threads that are used for optional compression and (de)serialization of messages during communication
One thread to monitor and profile the two items above
Additionally, by default there is an additional Nanny process that watches the worker. This process has a couple of its own threads for administration.
These are internal details as of October 2018 and may change without notice.
People who run into "too many threads" issues are often running tasks that are themselves multi-threaded, and so end up with an N-squared threading problem. Often the solution is to set environment variables like OMP_NUM_THREADS=1, but this depends on the exact libraries you're using.
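As a hedged sketch of one way to apply that advice with a local cluster: set the variable in the parent process before the workers start, so the spawned worker processes inherit it (whether OMP_NUM_THREADS is the right knob depends on which libraries your tasks use):

import os

# Must be set before the worker processes (and their OpenMP/BLAS
# thread pools) start, so they inherit the setting.
os.environ["OMP_NUM_THREADS"] = "1"

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=2, processes=True)
client = Client(cluster)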
I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster.
The terms I came across are: thread, process, processor, node, worker, scheduler.
My question is how to set the number of each, and if there is a strict or recommend relationship between any of these. For example:
1 worker per node with n processes for the n cores on the node
Are threads and processes the same concept? In dask-mpi I have to set nthreads, but they show up as processes in the client.
Any other suggestions?
By "node" people typically mean a physical or virtual machine. That node can run several programs or processes at once (much like how my computer can run a web browser and text editor at once). Each process can parallelize within itself with many threads. Processes have isolated memory environments, meaning that sharing data within a process is free, while sharing data between processes is expensive.
Typically things work best on larger nodes (like 36 cores) if you cut them up into a few processes, each of which have several threads. You want the number of processes times the number of threads to equal the number of cores. So for example you might do something like the following for a 36 core machine:
Four processes with nine threads each
Twelve processes with three threads each
One process with thirty-six threads
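As a concrete sketch, the first of these layouts could be expressed with the distributed scheduler's LocalCluster (the same process/thread split can be given to dask-worker or other deployment tools):

from dask.distributed import Client, LocalCluster

# Four worker processes with nine threads each: 4 * 9 = 36 cores.
cluster = LocalCluster(n_workers=4, threads_per_worker=9)
client = Client(cluster)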
Typically one decides between these choices based on the workload. The difference here is due to Python's Global Interpreter Lock, which limits parallelism for some kinds of data. If you are working mostly with Numpy, Pandas, Scikit-Learn, or other numerical programming libraries in Python, then you don't need to worry about the GIL, and you probably want to prefer a few processes with many threads each. This helps because it allows data to move freely between your cores, since it all lives in the same process.

However, if you're doing mostly pure-Python programming, like dealing with text data, dictionaries/lists/sets, and doing most of your computation in tight Python for loops, then you'll want to prefer many processes with few threads each. This incurs extra communication costs, but lets you bypass the GIL.
In short, if you're using mostly numpy/pandas-style data, try to get at least eight threads or so in a process. Otherwise, maybe go for only two threads in a process.