When is it appropriate to increase the async-thread size from zero? - erlang

I have been reading the documentation trying to understand when it makes sense to increase the async-thread pool size via the +A N switch.
I am perfectly prepared to benchmark, but I was wondering if there were a rule-of-thumb for when one ought to suspect that growing the pool size from 0 to N (or N to N+M) would be helpful.
Thanks

The BEAM runs Erlang code in special threads it calls schedulers. By default it will start a scheduler for every core in your processor. This can be controlled and start up time, for instance if you don't want to run Erlang on all cores but "reserve" some for other things. Normally when you do a file I/O operation then it is run in a scheduler and as file I/O operations are relatively slow they will block that scheduler while they are running. Which can affect the real-time properties. Normally you don't do that much file I/O so it is not a problem.
The asynchronous thread pool are OS threads which are used for I/O operations. Normally the pool is empty but if you use the +A at startup time then the BEAM will create extra threads for this pool. These threads will then only be used for file I/O operations which means that the scheduler threads will no longer block waiting for file I/O and the real-time properties are improved. Of course this costs as OS threads aren't free. The threads don't mix so scheduler threads are just scheduler threads and async threads are just async threads.
If you are writing linked-in drivers for ports these can also use the async thread pool. But you have to detect when they have been started yourself.
How many you need is very much up to your application. By default none are started. Like #demeshchuk I have also heard that Riak likes to have a large async thread pool as they open many files. My only advice is to try it and measure. As with all optimisation?

By default, the number of threads in a running Erlang VM is equal to the number of processor logical cores (if you are using SMP, of course).
From my experience, increasing the +A parameter may give some performance improvement when you are having many simultaneous file I/O operations. And I doubt that increasing +A might increase the overall processes performance, since BEAM's scheduler is extremely fast and optimized.
Speaking of the exact numbers – that totally depends on your application I think. Say, in case of Riak, where the maximum number of opened files is more or less predictable, you can set +A to this maximum, or several times less if it's way too big (by default it's 64, BTW). If your application contains, like, millions of files, and you serve them to web clients – that's another story; most likely, you might want to run some benchmarks with your own code and your own environment.
Finally, I believe I've never seen +A more than a hundred. Doesn't mean you can't set it, but there's likely no point in it.

Related

are there any limits on number of the dask workers/cores/threads?

I am seeing some performance degradation on my data analysis when I go more than 25 workers, each with 192 threads. Are there any limits on scheduler? there is no load footprint on communication(ib is used) or cpu or ram).
for example initially I have 170K hdf files on the lustrefs:
ddf=dd.read_hdf(hdf5files,key="G18",mode="r")
ddf.repartition(npartitions=4096).to_parquet(splitspath+"gdr3-input-cache")
the code is running slower on 64 workers than 25. looks like the scheduler on initial tasks design phase is very overloaded.
EDIT:
dask-2021.06.0
distributed-2021.06.0
There are many potential bottlenecks. Here are some hints.
Yes, the scheduler is a single process through which all tasks must pass, and it introduces an overhead per task (<1ms) just to manipulate its internal state and send . So, if you have many tasks per second, you will see the overhead take a larger fraction of the total time.
Similarly, if you have a lot of workers, you will have a lot of network traffic for both distribution of tasks and any data shuffling between workers. More workers, more traffic.
Thirdly, python uses a global lock, the GIL, when running code. Even when your tasks are GIL-friendly (e.g., array/dataframe ops), threads may still need the GIL sometimes, and this can cause contention and degraded performance.
Finally, you say you are using lustre, so you have many tasks simultaneously hitting network storage, which will have its own limitations both for metadata access and for data traffic.

Best practices in setting number of dask workers

I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster.
The terms I came across are: thread, process, processor, node, worker, scheduler.
My question is how to set the number of each, and if there is a strict or recommend relationship between any of these. For example:
1 worker per node with n processes for the n cores on the node
threads and processes are the same concept? In dask-mpi I have to set nthreads but they show up as processes in the client
Any other suggestions?
By "node" people typically mean a physical or virtual machine. That node can run several programs or processes at once (much like how my computer can run a web browser and text editor at once). Each process can parallelize within itself with many threads. Processes have isolated memory environments, meaning that sharing data within a process is free, while sharing data between processes is expensive.
Typically things work best on larger nodes (like 36 cores) if you cut them up into a few processes, each of which have several threads. You want the number of processes times the number of threads to equal the number of cores. So for example you might do something like the following for a 36 core machine:
Four processes with nine threads each
Twelve processes with three threads each
One process with thirty-six threads
Typically one decides between these choices based on the workload. The difference here is due to Python's Global Interpreter Lock, which limits parallelism for some kinds of data. If you are working mostly with Numpy, Pandas, Scikit-Learn, or other numerical programming libraries in Python then you don't need to worry about the GIL, and you probably want to prefer few processes with many threads each. This helps because it allows data to move freely between your cores because it all lives in the same process. However, if you're doing mostly Pure Python programming, like dealing with text data, dictionaries/lists/sets, and doing most of your computation in tight Python for loops then you'll want to prefer having many processes with few threads each. This incurs extra communication costs, but lets you bypass the GIL.
In short, if you're using mostly numpy/pandas-style data, try to get at least eight threads or so in a process. Otherwise, maybe go for only two threads in a process.

How to determine concurrency (threads) while using shoryuken for background jobs?

In my Ruby on Rails application, I'm using shoryouken for background processing. I've many sqs queues (6-7) in my application. One of the queue has 2000-3000 jobs and it takes around 3 hours for the worker to process these 2-3k jobs with a default concurrency of 25. So based on what factors can we decide to increase the concurrency (which is the number of threads to process jobs). Please do comment if anything is unclear in the question.
Concurrency defaults to 25, but can be changed by altering your shoryuken.yml configuration (see below) or by adding the concurrency argument as so: shoryuken -c {desiredCount}
concurrency: 25 # Update with your desired value.
delay: 25 # The delay in seconds to pause a queue when it's empty. Default 0
queues:
- [high_priority, 6]
- [default, 2]
- [low_priority, 1]
You will need to test the optimal value for performance as you'll run into I/O and CPU bottlenecks as number of concurrent threads rises. Once you've reached the optimal value for your instance(s), you'll need to either increase the number of instances running this job or upgrade the instance(s).
If the bottleneck exists instead on your DB or other resource, you'll need to adjust it accordingly. (Not likely to be the case, but included for thoroughness' sake)
EDIT: Optimizing Performance
In response to your question on optimizing the thread count, the quickest/best way to determine the optimal concurrency value is to change concurrency and measure real-world throughput. There's other approaches, but the golden rule for performance is always to measure in a live production environment. Synthetic benchmarks are only helpful to the extent that they mirror real-time performance. (See also: premature optimization).
This is a case where you can easily end up overthinking things (then again, overthinking things is a perennial problem in development). Just measure with the appropriate metrics (CPU utilization, memory utilization, number of jobs completed per minute), and change the number of threads until you either maximize throughput or run into a bottleneck.
If your tasks are CPU bound you'll see your CPU utilization maxing out. If your tasks are I/O bound you'll see that after some point an increase in concurrent threads does not translate to an increase in throughput even though your CPU utilization fails to rise.
An I/O bottleneck can happen when any of the resources you're reading/writing are unable to keep up with your CPU demands. This includes system resources (memory, disk space), your database performance (DB CPU utilization, read/write limits), as well as other APIs you're connecting with. Network capacity is also a theoretical bottleneck but if it was you'd be big enough to have hired someone with experience in this area. Because there's so many different ways for this to happen, the only real way to figure it out what the bottlenecks are is to have your monitoring in place.
Re: formula, the short answer is that there's no one formula that you can use in this case. The long answer is probably yes, but you'd arrive at the optimum value in the course of collecting all the values you'd need to calculate it.
EDIT 2 : Concurrency, Latency, and Throughput
I realized I forgot to add one more piece of advice. When you're working with background tasks that users are not waiting for, your throughput (jobs per unit of time) is the only thing you want to optimize. Do not optimize for individual job time. It also means you cannot profile the current (and presumably un-bound) performance and get useful data because bottlenecks/constraints are target dependent. The constraints that exist for throughput will NOT be the same as the constraints that exist for individual task time.
(Technically speaking, your concurrency setting is your current constraint)
Three main factors are
Number of Cores
Type of Job - I/O or CPU bound
Is there another application or process running on server
Ideally for a cpu bound task keep number of thread to number of cpu cores.
For I/O bound task it requires benchmarking and calculating wait time for an I/O, and then you can decide the optimal value. For rough estimate if you have 4 cores than for I/O bound task you must keep at max 8 threads.
If you have your rails app running on the same then you will need to reduce number of cores.
Increasing the number of cores will not increase your performance if your system doesnt support.
Refer : http://baddotrobot.com/blog/2013/06/01/optimum-number-of-threads/

Should spawn be used in Erlang whenever I have a non dependent asynchronous function?

If I have a function that can be executed asynchronously without any dependencies and no other functions require its results directly, should I use spawn ? In my scenario I want to proceed to consume a message queue, so spawning would relif my blocking loop, but if there are other situations where I can distribute function calls as much as possible, will that affect negatively my application ?
Overall, what would be the pros and cons of using Spawn.
Unlike operating system processes or threads, Erlang processes are very light weight. There is minimal overhead in starting, stopping, and scheduling new processes. You should be able to spawn as many of them as you need (the max per vm is in the hundreds of thousands). The Actor model Erlang implements allows you to think about what is actually happening in parallel and write your programs to express that directly. Avoid complicating your logic with work queues if you can avoid it.
Spawn a process whenever it makes logical sense, and optimize only when you have to.
The first thing that come in mind is the size of parameters. They will be copied from your current process to the new one and if the parameters are huge it may be inefficient.
Another problem that may arise is bloating VM with such amount of processes that your system will become irresponsive. You can overcome this problem by using pool of worker processes or special monitor process that will allow to work only limited amount of such processes.
so spawning would relif my blocking loop
If you are in the situation that a loop will receive many messages requiring independant actions, don't hesitate and spawn new processes for each message processing, this way you will take advantage of the multicore capabilities (if any) of your computer. As kjw0188 says, the Erlang processes are very light weight and if the system hits the limit of process numbers alive in parallel (with the assumption that you are doing reasonable code) it is more likely that the application is overloading the capability of the node.

Rails app connection pool size, avoiding max pool size issues

I am running a JRuby on Rails application. I see a lot of this randomly in my logs:
The max pool size is currently 5; consider increasing it
I understand I can increase the max pool size in my configuration to address this. The problem I'm looking to address is to understand what the optimal number should be. I am trying to avoid contention issues for connections. Clearly setting this number to something obnoxiously large will not work either.
Is there a general protocol to follow to know your apps optimal pool size setting?
From here,
The optimum size of a thread pool depends on the number of processors available and the nature of the tasks on the work queue. On an N-processor system for a work queue that will hold entirely compute-bound tasks, you will generally achieve maximum CPU utilization with a thread pool of N or N+1 threads.
For tasks that may wait for I/O to complete -- for example, a task that reads an HTTP request from a socket -- you will want to increase the pool size beyond the number of available processors, because not all threads will be working at all times. Using profiling, you can estimate the ratio of waiting time (WT) to service time (ST) for a typical request. If we call this ratio WT/ST, for an N-processor system, you'll want to have approximately N*(1+WT/ST) threads to keep the processors fully utilized.
Processor utilization is not the only consideration in tuning the thread pool size. As the thread pool grows, you may encounter the limitations of the scheduler, available memory, or other system resources, such the number of sockets, open file handles, or database connections.
So profile your application, if your threads are mostly cpu bound, then set the thread pools size to number of cores, or number of cores + 1. If you are spending most of your time waiting for database calls to complete, then experiment with a fairly large number of threads, and see how the application performs.

Resources