Difference between dask distributed wait(future) vs future.result()? - dask

For waiting from future completion in Dask distributed cluster, what's the difference between these two APIs? Are there any?
wait: https://docs.dask.org/en/latest/futures.html#waiting-on-futures
result(): tttps://docs.dask.org/en/latest/futures.html#distributed.Future.result
If there's any difference, what would be the more efficient way to block until result is available?
Thanks!

wait blocks further execution until the futures are completed, and once they are, the code proceeds. result transfers the result of the future from the worker to the client computer. In most cases, it’s probably more efficient to leave future with workers until the client needs them.
For example, imagine that you are coordinating calculations using a small laptop with 10GB ram which is connected to a cluster that has workers with memory of 50GB each. If the data you are processing is around 20GB, then the workers will have no problem doing calculations, however if you try to use .result() with the intention to just wait for execution to complete, then the workers will try to send to you 20GB of data each, which will crash your laptop session.

Related

are there any limits on number of the dask workers/cores/threads?

I am seeing some performance degradation on my data analysis when I go more than 25 workers, each with 192 threads. Are there any limits on scheduler? there is no load footprint on communication(ib is used) or cpu or ram).
for example initially I have 170K hdf files on the lustrefs:
ddf=dd.read_hdf(hdf5files,key="G18",mode="r")
ddf.repartition(npartitions=4096).to_parquet(splitspath+"gdr3-input-cache")
the code is running slower on 64 workers than 25. looks like the scheduler on initial tasks design phase is very overloaded.
EDIT:
dask-2021.06.0
distributed-2021.06.0
There are many potential bottlenecks. Here are some hints.
Yes, the scheduler is a single process through which all tasks must pass, and it introduces an overhead per task (<1ms) just to manipulate its internal state and send . So, if you have many tasks per second, you will see the overhead take a larger fraction of the total time.
Similarly, if you have a lot of workers, you will have a lot of network traffic for both distribution of tasks and any data shuffling between workers. More workers, more traffic.
Thirdly, python uses a global lock, the GIL, when running code. Even when your tasks are GIL-friendly (e.g., array/dataframe ops), threads may still need the GIL sometimes, and this can cause contention and degraded performance.
Finally, you say you are using lustre, so you have many tasks simultaneously hitting network storage, which will have its own limitations both for metadata access and for data traffic.

Do you have to use worker pools in Erlang?

I have a server I am creating (a messaging service) and I am doing some preliminary tests to benchmark it. So far, the fastest way to process the data is to do it directly on the process of the user and to use worker pools. I have tested spawning and that is unbelievable slow.
The test is just connecting 10k users, and having each one send 15kb of data a couple of times at the same time(or trying too atleast) and having the server process the data (total length, headers, and payload).
The issue I have with worker pools is its only fast when you have enough workers to offset the amount of connections. For example, if you have 500k, or 1 million users, you would need more workers to process all the concurrent data coming in. And, as for my testing, having 1000 workers would make it unusable.
So my question is the following: When does it make sense to use pools of workers? Will there be a tipping point where I would have to use workers to process the data to free up the user process? How many workers is too much, is 500,000 too much?
And, if workers are the way to go (for those massive concurrent distributed servers), I am guessing you can dynamically create/delete as you need?
Any literature is also appreciated!
Thanks for your answer!
Maybe worker pools are not the best tool for your problem. If I were you I would try using Jay Nelson's epocxy, which gives you a very basic backpressure mechanism while still letting you parallelize your tasks. From that library I would check either concurrency fount or concurrency control tools.

Execution window time

I've read an article in the book elixir in action about processes and scheduler and have some questions:
Each process get a small execution window, what is does mean?
Execution windows is approximately 2000 function calls?
What is a process implicitly yield execution?
Let's say you have 10,000 Erlang/Elixir processes running. For simplicity, let's also say your computer only has a single process with a single core. The processor is only capable of doing one thing at a time, so only a single process is capable of being executed at any given moment.
Let's say one of these processes has a long running task. If the Erlang VM wasn't capable of interrupting the process, every single other process would have to wait until that process is done with its task. This doesn't scale well when you're trying to handle tens of thousands of requests.
Thankfully, the Erlang VM is not so naive. When a process spins up, it's given 2,000 reductions (function calls). Every time a function is called by the process, it's reduction count goes down by 1. Once its reduction count hits zero, the process is interrupted (it implicitly yields execution), and it has to wait its turn.
Because Erlang/Elixir don't have loops, iterating over a large data structure must be done recursively. This means that unlike most languages where loops become system bottlenecks, each iteration uses up one of the process' reductions, and the process cannot hog execution.
The rest of this answer is beyond the scope of the question, but included for completeness.
Let's say now that you have a processor with 4 cores. Instead of only having 1 scheduler, the VM will start up with 4 schedulers (1 for each core). If you have enough processes running that the first scheduler can't handle the load in a reasonable amount of time, the second scheduler will take control of the excess processes, executing them in parallel to the first scheduler.
If those two schedulers can't handle the load in a reasonable amount of time, the third scheduler will take on some of the load. This continues until all of the processors are fully utilized.
Additionally, the VM is smart enough not to waste time on processes that are idle - i.e. just waiting for messages.
There is an excellent blog post by JLouis on How Erlang Does Scheduling. I recommend reading it.

Should spawn be used in Erlang whenever I have a non dependent asynchronous function?

If I have a function that can be executed asynchronously without any dependencies and no other functions require its results directly, should I use spawn ? In my scenario I want to proceed to consume a message queue, so spawning would relif my blocking loop, but if there are other situations where I can distribute function calls as much as possible, will that affect negatively my application ?
Overall, what would be the pros and cons of using Spawn.
Unlike operating system processes or threads, Erlang processes are very light weight. There is minimal overhead in starting, stopping, and scheduling new processes. You should be able to spawn as many of them as you need (the max per vm is in the hundreds of thousands). The Actor model Erlang implements allows you to think about what is actually happening in parallel and write your programs to express that directly. Avoid complicating your logic with work queues if you can avoid it.
Spawn a process whenever it makes logical sense, and optimize only when you have to.
The first thing that come in mind is the size of parameters. They will be copied from your current process to the new one and if the parameters are huge it may be inefficient.
Another problem that may arise is bloating VM with such amount of processes that your system will become irresponsive. You can overcome this problem by using pool of worker processes or special monitor process that will allow to work only limited amount of such processes.
so spawning would relif my blocking loop
If you are in the situation that a loop will receive many messages requiring independant actions, don't hesitate and spawn new processes for each message processing, this way you will take advantage of the multicore capabilities (if any) of your computer. As kjw0188 says, the Erlang processes are very light weight and if the system hits the limit of process numbers alive in parallel (with the assumption that you are doing reasonable code) it is more likely that the application is overloading the capability of the node.

When is it appropriate to increase the async-thread size from zero?

I have been reading the documentation trying to understand when it makes sense to increase the async-thread pool size via the +A N switch.
I am perfectly prepared to benchmark, but I was wondering if there were a rule-of-thumb for when one ought to suspect that growing the pool size from 0 to N (or N to N+M) would be helpful.
Thanks
The BEAM runs Erlang code in special threads it calls schedulers. By default it will start a scheduler for every core in your processor. This can be controlled and start up time, for instance if you don't want to run Erlang on all cores but "reserve" some for other things. Normally when you do a file I/O operation then it is run in a scheduler and as file I/O operations are relatively slow they will block that scheduler while they are running. Which can affect the real-time properties. Normally you don't do that much file I/O so it is not a problem.
The asynchronous thread pool are OS threads which are used for I/O operations. Normally the pool is empty but if you use the +A at startup time then the BEAM will create extra threads for this pool. These threads will then only be used for file I/O operations which means that the scheduler threads will no longer block waiting for file I/O and the real-time properties are improved. Of course this costs as OS threads aren't free. The threads don't mix so scheduler threads are just scheduler threads and async threads are just async threads.
If you are writing linked-in drivers for ports these can also use the async thread pool. But you have to detect when they have been started yourself.
How many you need is very much up to your application. By default none are started. Like #demeshchuk I have also heard that Riak likes to have a large async thread pool as they open many files. My only advice is to try it and measure. As with all optimisation?
By default, the number of threads in a running Erlang VM is equal to the number of processor logical cores (if you are using SMP, of course).
From my experience, increasing the +A parameter may give some performance improvement when you are having many simultaneous file I/O operations. And I doubt that increasing +A might increase the overall processes performance, since BEAM's scheduler is extremely fast and optimized.
Speaking of the exact numbers – that totally depends on your application I think. Say, in case of Riak, where the maximum number of opened files is more or less predictable, you can set +A to this maximum, or several times less if it's way too big (by default it's 64, BTW). If your application contains, like, millions of files, and you serve them to web clients – that's another story; most likely, you might want to run some benchmarks with your own code and your own environment.
Finally, I believe I've never seen +A more than a hundred. Doesn't mean you can't set it, but there's likely no point in it.

Resources