Dask: Would storage network speed cause a worker to die?

I am running a process that writes large files across the storage network. I can run the process in a simple loop and get no failures. I can also run it with distributed and jobqueue during off-peak hours and no workers fail. However, when I run the same command during peak hours, I get workers killing themselves.
I have ample memory for the task and plenty of workers, so I am not sitting in a queue.
The error logs usually have a series of messages about exceeding garbage collection limits, followed by a "Worker killed with Signal 9".

Signal 9 suggests that the process has violated some system limit, not that Dask itself has decided that the worker should die. Since this only happens under high disk I/O at busy times, I agree that the network storage is the likely culprit: for example, a lot of writes have been buffered but are not being flushed through the relatively low available bandwidth.
Dask also uses local storage for temporary files, and "local" might in fact be the network storage. If you have real local disks on the nodes, you should use those; if not, consider turning off disk spilling altogether. https://docs.dask.org/en/latest/setup/hpc.html#local-storage
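If the nodes do have local disks, a minimal sketch of that configuration might look like the following (the paths, the SLURM cluster type, and the resource numbers are assumptions, not taken from the question):

# Point Dask's scratch/spill directory at a node-local disk and/or disable
# spilling entirely, so workers never write temporary data to network storage.
import dask
from dask.distributed import Client
from dask_jobqueue import SLURMCluster  # assuming a SLURM-based cluster

dask.config.set({
    "temporary-directory": "/local/scratch",       # node-local disk (placeholder path)
    "distributed.worker.memory.spill": False,      # never spill to disk
    "distributed.worker.memory.target": False,     # never start moving data to disk
})

cluster = SLURMCluster(
    cores=8,
    memory="32GB",
    local_directory="/local/scratch",              # worker scratch space on the local disk
)
client = Client(cluster)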

Related

Are there any limits on the number of Dask workers/cores/threads?

I am seeing some performance degradation in my data analysis when I go beyond 25 workers, each with 192 threads. Are there any limits on the scheduler? There is no visible load on communication (InfiniBand is used), CPU, or RAM.
For example, initially I have 170K HDF files on the Lustre filesystem:
ddf = dd.read_hdf(hdf5files, key="G18", mode="r")
ddf.repartition(npartitions=4096).to_parquet(splitspath + "gdr3-input-cache")
The code runs slower on 64 workers than on 25. It looks like the scheduler is heavily overloaded during the initial task-graph construction phase.
EDIT:
dask-2021.06.0
distributed-2021.06.0
There are many potential bottlenecks. Here are some hints.
Yes, the scheduler is a single process through which all tasks must pass, and it introduces an overhead per task (<1 ms) just to manipulate its internal state and send the task out to a worker. So the more tasks per second you have, the larger the fraction of total time that overhead takes; with ~170K input files, the read tasks alone cost the scheduler on the order of minutes of pure bookkeeping.
Similarly, if you have a lot of workers, you will have a lot of network traffic for both distribution of tasks and any data shuffling between workers. More workers, more traffic.
Thirdly, Python uses a global lock, the GIL, when running code. Even when your tasks are GIL-friendly (e.g., array/dataframe ops), threads may still need the GIL sometimes, and this can cause contention and degraded performance.
Finally, you say you are using Lustre, so you have many tasks simultaneously hitting network storage, which has its own limitations, both for metadata access and for data traffic.
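One quick way to see how much the scheduler is being asked to do is to build the graph without executing it and count its tasks. A rough sketch using the names from the question (the 1 ms per-task figure is only an order-of-magnitude estimate):

# Build the graph lazily and count tasks before launching the real run.
import dask.dataframe as dd

ddf = dd.read_hdf(hdf5files, key="G18", mode="r")
print("input partitions:", ddf.npartitions)

write = ddf.repartition(npartitions=4096).to_parquet(
    splitspath + "gdr3-input-cache", compute=False
)
n_tasks = len(write.__dask_graph__())
print("tasks in graph:", n_tasks)
print("rough scheduler-side overhead: ~%.0f s" % (n_tasks * 1e-3))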

Difference between dask distributed wait(future) vs future.result()?

For waiting on future completion in a Dask distributed cluster, what's the difference between these two APIs? Is there any?
wait: https://docs.dask.org/en/latest/futures.html#waiting-on-futures
result(): https://docs.dask.org/en/latest/futures.html#distributed.Future.result
If there is a difference, what would be the more efficient way to block until the result is available?
Thanks!
wait blocks further execution until the futures are completed, and once they are, the code proceeds. result transfers the result of a future from the worker to the client machine. In most cases, it is probably more efficient to leave futures with the workers until the client actually needs their results.
For example, imagine that you are coordinating calculations from a small laptop with 10GB of RAM connected to a cluster whose workers have 50GB of memory each. If the data you are processing is around 20GB, the workers will have no problem doing the calculations; however, if you call .result() just intending to wait for execution to complete, the workers will try to send those 20GB of data to you, which will crash your laptop session.
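A small sketch of the distinction, using a throwaway local cluster and a toy computation:

# wait() blocks until the futures finish but leaves the results on the workers;
# result() additionally pulls a result back to the client process.
from dask.distributed import Client, wait

client = Client()                          # local cluster, for illustration only

futures = client.map(lambda x: x ** 2, range(1000))

wait(futures)                              # block until everything is finished; data stays on the workers

print(futures[10].result())                # only now is that one result transferred to the client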

What happens to ECS containers that exceed soft memory limit when there is memory contention?

Say I have an instance with 2G memory, and a task/container with 0.5G soft memory limit, and 0.75G hard memory limit.
The instance is running 3 containers, each consuming 0.6G of memory. Now a 4th container needs to be added. What happens to the 3 running containers? Is their memory allocation reduced? Or are they migrated to another instance? And if there is no other instance, will the 4th container be placed at all?
I understand how soft and hard CPU limits work, since CPU is a dynamic resource (the application can handle spikes in free CPU). In the case of memory, however, you cannot really take away memory from a container that is already using it.
The 4th container will not be able to spawn, and you will get an error like the one below.
(service sample) was unable to place a task because no container instance met all of its requirements. The closest matching (container-instance 05016874-f518-4b7a-a817-eb32a4d387f1) has insufficient memory available. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide.
You need to add another ECS instance if you want to schedule the 4th container. The other 3 containers will remain in a steady state; nothing like a reduction of their memory allocation happens in the cluster. If there is no additional instance, your service will stay in an unsteady state and continue to give you the error above.
Ref: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html
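For reference, the soft and hard limits in the question correspond to the memoryReservation and memory fields of a container definition. A hedged sketch of registering such a task definition with boto3 (the family, container name, and image are placeholders; the memory values mirror the 0.5G / 0.75G figures above):

# Register a task definition with a 512 MiB soft limit and a 768 MiB hard limit.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="sample",
    containerDefinitions=[
        {
            "name": "app",
            "image": "my-repo/sample:latest",   # placeholder image
            "memoryReservation": 512,           # soft limit, in MiB
            "memory": 768,                      # hard limit, in MiB
        }
    ],
)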
Actually, memory can be reclaimed from running processes. For example, the kernel may evict memory that is backed by files (such as the code of the process itself); if that data is needed again, the kernel can page it back in. This is explained a little in this blog post: https://chrisdown.name/2018/01/02/in-defence-of-swap.html
If the task is scheduled on that node but the kernel fails to reclaim enough memory to avoid an out-of-memory situation, then one of the processes will be killed by the kernel; Docker will detect that and kill the container, and ECS will notice. I'm not sure whether ECS will try to reschedule the dead task on the same instance or a different one. It probably depends.

Multiple workers on machine - Memory management ( Resque - Rails )

We've migrated our Resque background workers from a ton of individual instances on Heroku to between four and ten m4.2xlarge (32GB RAM) instances on EC2, which is much cheaper and faster.
A few things I've noticed: the Heroku instances we were using had 1GB of RAM and rarely ran out of memory. I am currently allocating 24 workers to one machine, so about 1.3GB of memory per worker. However, because the processing power on these machines is so much greater, I think the OS has trouble reclaiming the memory fast enough and each worker ends up consuming more on average. Most of the day the system has 17-20GB of memory free, but when the memory-intensive jobs run, all 24 workers grab a job at almost the same time and then start growing. They get through a few jobs, but then the system hasn't had time to reap memory and crashes if there is no intervention.
I've written a daemon to pause the workers before a crash and wait for the OS to free memory. I could reduce the number of workers per machine overall or have half of them unsubscribe from the problematic queues, but I feel there must be a better way to manage this. I would prefer to be using more than 20% of memory for 99% of the day.
The workers are set up to fork a process when they pick up a job from the queue. The master worker processes are run as services managed with Upstart. I'm aware there are a number of managers, such as God and Monit, which simply restart the process when it consumes a certain amount of memory. That seems like a heavy-handed solution which will end with too many jobs killed under normal circumstances.
Is there a better strategy I can use to get higher utilization with a lowered risk of running into Errno::ENOMEM?
System specs:
OS : Ubuntu 12.04
Instance : m4.2xlarge
Memory : 32 GB

When is it appropriate to increase the async-thread size from zero?

I have been reading the documentation trying to understand when it makes sense to increase the async-thread pool size via the +A N switch.
I am perfectly prepared to benchmark, but I was wondering if there were a rule-of-thumb for when one ought to suspect that growing the pool size from 0 to N (or N to N+M) would be helpful.
Thanks
The BEAM runs Erlang code in special threads it calls schedulers. By default it starts one scheduler for every core in your processor. This can be controlled at start-up time, for instance if you don't want to run Erlang on all cores but want to "reserve" some for other things. Normally, when you do a file I/O operation, it runs in a scheduler, and since file I/O operations are relatively slow, they block that scheduler while they are running, which can affect the real-time properties. Normally you don't do that much file I/O, so it is not a problem.
The asynchronous thread pool consists of OS threads which are used for I/O operations. Normally the pool is empty, but if you use the +A option at start-up time, the BEAM will create extra threads for this pool. These threads will then only be used for file I/O operations, which means that the scheduler threads no longer block waiting for file I/O and the real-time properties improve. Of course this has a cost, as OS threads aren't free. The thread types don't mix: scheduler threads are just scheduler threads and async threads are just async threads.
If you are writing linked-in drivers for ports, these can also use the async thread pool, but you have to detect yourself whether it has been started.
How many you need is very much up to your application. By default none are started. Like #demeshchuk, I have also heard that Riak likes to have a large async thread pool because it opens many files. My only advice is to try it and measure, as with all optimisation.
By default, the number of threads in a running Erlang VM is equal to the number of logical processor cores (if you are using SMP, of course).
From my experience, increasing the +A parameter may give some performance improvement when you have many simultaneous file I/O operations, but I doubt that increasing +A will improve overall process performance, since the BEAM's scheduler is extremely fast and optimized.
Speaking of exact numbers: that totally depends on your application, I think. Say, in the case of Riak, where the maximum number of open files is more or less predictable, you can set +A to that maximum, or several times less if it's way too big (by default it's 64, by the way). If your application handles, say, millions of files and serves them to web clients, that's another story; most likely you will want to run some benchmarks with your own code and your own environment.
Finally, I believe I've never seen +A set to more than a hundred. That doesn't mean you can't set it higher, but there's likely no point in it.
