How do I allocate same amount of resource for all my tasks deployed on a Celery cluster? - docker

To compare and contrast the performance of three different algorithms in a scientific experiment, I am planning to use the Celery scheduler. These algorithms are implemented by three different tools. They may or may not have parallelism implemented, and I don't want to make any prior assumptions about that. The dataset contains 10K data points. All three tools are supposed to run on all the data points, which translates to 30K tasks handled by the scheduler. All I want is to allocate the same amount of resources to all the tools, across all the executions.
Assume my physical Ubuntu 18.04 server is equipped with 24 cores and 96 GB of RAM. Tasks are scheduled by 4 Celery workers, each handling a single task. I want to put an upper limit of 4 CPU cores and 16 GB of memory per task. Moreover, no two tasks should race for the same cores, i.e., the 4 tasks should be using 16 cores in total, each scheduled on its own set of cores.
Is there any means to accomplish this setup, either through Celery, cgroups, or any other mechanism? I want to refrain from using Docker, Kubernetes, or any VM-based approach unless it is absolutely required.

Dealing with CPU cores should be fairly easy: set the concurrency to 6, so that six single-task worker processes share the 24 cores at 4 cores apiece. Limiting memory usage is the hard part of the requirement, and I believe you can accomplish that by making the worker processes belong to a particular cgroup on which you have set a memory limit.
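As a rough sketch of the cgroup approach (assuming a cgroup v1 hierarchy mounted at /sys/fs/cgroup, root privileges, and placeholder values for the worker name, core range, and Celery app name; cgroup v2 uses different file names such as memory.max):

import os
import subprocess

CGROOT = "/sys/fs/cgroup"        # assumes cgroup v1 controllers mounted here
WORKER = "celery_worker_0"       # hypothetical name for one of the workers
CORES = "0-3"                    # 4 dedicated cores for this worker
MEM_LIMIT = str(16 * 1024**3)    # 16 GB in bytes

def write(path, value):
    with open(path, "w") as f:
        f.write(value)

# memory controller: hard-limit this worker to 16 GB
mem_dir = os.path.join(CGROOT, "memory", WORKER)
os.makedirs(mem_dir, exist_ok=True)
write(os.path.join(mem_dir, "memory.limit_in_bytes"), MEM_LIMIT)

# cpuset controller: give this worker its own 4 cores
cpu_dir = os.path.join(CGROOT, "cpuset", WORKER)
os.makedirs(cpu_dir, exist_ok=True)
write(os.path.join(cpu_dir, "cpuset.cpus"), CORES)
write(os.path.join(cpu_dir, "cpuset.mems"), "0")  # single NUMA node assumed

def enter_cgroups():
    # runs in the child after fork, before exec, so the worker process and
    # everything it later forks stays inside both cgroups
    pid = str(os.getpid())
    for d in (mem_dir, cpu_dir):
        write(os.path.join(d, "cgroup.procs"), pid)

subprocess.Popen(
    ["celery", "-A", "proj", "worker", "--concurrency=1", "-n", WORKER + "@%h"],
    preexec_fn=enter_cgroups,
)

Repeat with different names, core ranges, and queue assignments for the remaining workers; the kernel then enforces both the core pinning and the memory ceiling without any container runtime.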
An alternative would be to run Celery workers in containers with specified limits.
I would prefer not to do this, as there may be tasks (or a task with particular arguments) that allocate only a tiny amount of RAM, so it would be wasteful if you can't use the 4 GB of RAM while such a task runs.
It's a pity that Celery autoscaling is deprecated (it is one of the coolest features of Celery, IMHO). It should not be a difficult task to implement a Celery autoscaler that scales up/down depending on memory utilization.
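For what it's worth, a memory-aware scaler can still be sketched on top of Celery's internal Autoscaler class (an illustration only, assuming celery.worker.autoscale.Autoscaler and its _maybe_scale/scale_down hooks still exist in your Celery version, and using psutil, which is an extra dependency):

import psutil
from celery.worker.autoscale import Autoscaler

class MemoryAwareAutoscaler(Autoscaler):
    """Shrink the pool when system memory gets tight."""

    HIGH_WATERMARK = 85.0  # assumed threshold: scale down above 85% RAM usage

    def _maybe_scale(self, req=None):
        if psutil.virtual_memory().percent > self.HIGH_WATERMARK:
            self.scale_down(1)   # under memory pressure: drop one pool process
            return True
        return super()._maybe_scale(req)

You would then point the worker at it via the worker_autoscaler setting (e.g. worker_autoscaler = "myapp.autoscale:MemoryAwareAutoscaler", a hypothetical module path) and start the worker with --autoscale=MAX,MIN.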

Related

Are there any limits on the number of dask workers/cores/threads?

I am seeing some performance degradation in my data analysis when I go beyond 25 workers, each with 192 threads. Are there any limits on the scheduler? There is no load footprint on communication (InfiniBand is used), CPU, or RAM.
For example, initially I have 170K HDF files on LustreFS:
import dask.dataframe as dd

ddf = dd.read_hdf(hdf5files, key="G18", mode="r")
ddf.repartition(npartitions=4096).to_parquet(splitspath + "gdr3-input-cache")
The code runs slower on 64 workers than on 25. It looks like the scheduler is heavily overloaded during the initial task-graph construction phase.
EDIT:
dask-2021.06.0
distributed-2021.06.0
There are many potential bottlenecks. Here are some hints.
Yes, the scheduler is a single process through which all tasks must pass, and it introduces an overhead per task (<1 ms) just to manipulate its internal state and send the task out. So, if you have many tasks per second, you will see that overhead take a larger fraction of the total time.
Similarly, if you have a lot of workers, you will have a lot of network traffic for both distribution of tasks and any data shuffling between workers. More workers, more traffic.
Thirdly, Python uses a global lock, the GIL, when running code. Even when your tasks are GIL-friendly (e.g., array/dataframe ops), threads may still need the GIL sometimes, and this can cause contention and degraded performance.
Finally, you say you are using Lustre, so you have many tasks simultaneously hitting network storage, which will have its own limitations both for metadata access and for data traffic.
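If the per-task scheduler overhead turns out to dominate, one mitigation is simply to create fewer, larger tasks, e.g. by reading the HDF files in bigger chunks and repartitioning to fewer partitions. A sketch, reusing the hdf5files and splitspath variables from the question (the chunk size and partition count are illustrative, not recommendations):

import dask.dataframe as dd

# fewer, larger partitions mean fewer tasks for the single scheduler
# process to build and track
ddf = dd.read_hdf(hdf5files, key="G18", mode="r", chunksize=5_000_000)
ddf.repartition(npartitions=512).to_parquet(splitspath + "gdr3-input-cache")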

Best practices in setting number of dask workers

I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster.
The terms I came across are: thread, process, processor, node, worker, scheduler.
My question is how to set the number of each, and whether there is a strict or recommended relationship between any of these. For example:
1 worker per node with n processes for the n cores on the node
Are threads and processes the same concept? In dask-mpi I have to set nthreads, but they show up as processes in the client.
Any other suggestions?
By "node" people typically mean a physical or virtual machine. That node can run several programs or processes at once (much like how my computer can run a web browser and text editor at once). Each process can parallelize within itself with many threads. Processes have isolated memory environments, meaning that sharing data within a process is free, while sharing data between processes is expensive.
Typically things work best on larger nodes (like 36 cores) if you cut them up into a few processes, each of which have several threads. You want the number of processes times the number of threads to equal the number of cores. So for example you might do something like the following for a 36 core machine:
Four processes with nine threads each
Twelve processes with three threads each
One process with thirty-six threads
Typically one decides between these choices based on the workload. The difference comes down to Python's Global Interpreter Lock, which limits parallelism for some kinds of data. If you are working mostly with NumPy, Pandas, Scikit-Learn, or other numerical programming libraries in Python, then you don't need to worry about the GIL, and you probably want to prefer a few processes with many threads each. This helps because data can move freely between your cores, since it all lives in the same process. However, if you're doing mostly pure-Python programming, like dealing with text data, dictionaries/lists/sets, and doing most of your computation in tight Python for loops, then you'll want to prefer having many processes with few threads each. This incurs extra communication costs, but lets you bypass the GIL.
In short, if you're using mostly numpy/pandas-style data, try to get at least eight threads or so in a process. Otherwise, maybe go for only two threads in a process.
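For a single-machine illustration of the same trade-off (a sketch using dask.distributed.LocalCluster; the numbers mirror the 36-core example above, and in practice you would create only one of these clusters):

from dask.distributed import Client, LocalCluster

# numeric (NumPy/Pandas) workloads: few processes, many threads each
cluster = LocalCluster(n_workers=4, threads_per_worker=9)

# pure-Python workloads: many processes, few threads each, to sidestep the GIL
# cluster = LocalCluster(n_workers=12, threads_per_worker=3)

client = Client(cluster)

The equivalent knobs on the dask-worker command line are --nthreads and --nworkers (--nprocs in older releases).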

Multiple workers on machine - Memory management ( Resque - Rails )

We've migrated our Resque background workers from a ton of individual instances on Heroku to between four and ten m4.2xl (32GB Mem) instances on EC2, which is much cheaper and faster.
A few things I've noticed: the Heroku instances we were using had 1 GB of RAM and rarely ran out of memory. I am currently allocating 24 workers to one machine, so about 1.3 GB of memory per worker. However, because the processing power on these machines is so much greater, I think the OS has trouble reclaiming the memory fast enough and each worker ends up consuming more on average. Most of the day the system has 17-20 GB of memory free, but when the memory-intensive jobs run, all 24 workers grab a job almost at the same time and then start growing. They get through a few jobs, but then the system hasn't had time to reclaim memory and crashes if there is no intervention.
I've written a daemon to pause the workers before a crash and wait for the OS to free memory. I could reduce the number of workers per machine overall, or have half of them unsubscribe from the problematic queues; I just feel there must be a better way to manage this. I would prefer to be making use of more than 20% of memory 99% of the day.
The workers are set up to fork a process when they pick up a job from the queue. The master worker processes are run as services managed with Upstart. I'm aware there are a number of managers, such as God and Monit, which simply restart a process when it consumes a certain amount of memory. That seems like a heavy-handed solution which will end with too many jobs killed under normal circumstances.
Is there a better strategy I can use to get higher utilization with a lowered risk of running into Errno::ENOMEM?
System specs:
OS : Ubuntu 12.04
Instance : m4.2xlarge
Memory : 32 GB
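A minimal sketch of the kind of pause-before-crash daemon described in the question (shown in Python purely for illustration, though Resque itself is Ruby; it relies on Resque's documented USR2 "stop picking up new jobs" and CONT "resume" signals, and the thresholds and pgrep pattern are assumptions for this setup):

import os
import signal
import subprocess
import time

PAUSE_BELOW_MB = 2048    # pause workers when available memory drops under 2 GB
RESUME_ABOVE_MB = 6144   # resume once 6 GB is available again

def available_mb():
    with open("/proc/meminfo") as f:
        fields = dict(line.split(":", 1) for line in f)
    return int(fields["MemAvailable"].split()[0]) // 1024

def worker_pids():
    # assumes the Resque workers' proclines match this pattern
    out = subprocess.run(["pgrep", "-f", "resque-[0-9]"],
                         capture_output=True, text=True)
    return [int(pid) for pid in out.stdout.split()]

paused = False
while True:
    mem = available_mb()
    if not paused and mem < PAUSE_BELOW_MB:
        for pid in worker_pids():
            os.kill(pid, signal.SIGUSR2)   # Resque: stop picking up new jobs
        paused = True
    elif paused and mem > RESUME_ABOVE_MB:
        for pid in worker_pids():
            os.kill(pid, signal.SIGCONT)   # Resque: resume processing
        paused = False
    time.sleep(5)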

How to determine concurrency (threads) while using shoryuken for background jobs?

In my Ruby on Rails application, I'm using Shoryuken for background processing. I have many SQS queues (6-7) in my application. One of the queues has 2000-3000 jobs, and it takes around 3 hours for the worker to process these 2-3k jobs with the default concurrency of 25. So based on what factors can we decide to increase the concurrency (which is the number of threads used to process jobs)? Please do comment if anything is unclear in the question.
Concurrency defaults to 25, but can be changed by altering your shoryuken.yml configuration (see below) or by adding the concurrency argument like so: shoryuken -c {desiredCount}
concurrency: 25 # Update with your desired value.
delay: 25 # The delay in seconds to pause a queue when it's empty. Default 0
queues:
- [high_priority, 6]
- [default, 2]
- [low_priority, 1]
You will need to test for the optimal value for performance, as you'll run into I/O and CPU bottlenecks as the number of concurrent threads rises. Once you've reached the optimal value for your instance(s), you'll need to either increase the number of instances running this job or upgrade the instance(s).
If the bottleneck exists instead on your DB or other resource, you'll need to adjust it accordingly. (Not likely to be the case, but included for thoroughness' sake)
EDIT: Optimizing Performance
In response to your question on optimizing the thread count, the quickest/best way to determine the optimal concurrency value is to change the concurrency and measure real-world throughput. There are other approaches, but the golden rule for performance is always to measure in a live production environment. Synthetic benchmarks are only helpful to the extent that they mirror real-world performance. (See also: premature optimization.)
This is a case where you can easily end up overthinking things (then again, overthinking things is a perennial problem in development). Just measure with the appropriate metrics (CPU utilization, memory utilization, number of jobs completed per minute), and change the number of threads until you either maximize throughput or run into a bottleneck.
If your tasks are CPU-bound, you'll see your CPU utilization maxing out. If your tasks are I/O-bound, you'll see that beyond some point an increase in concurrent threads does not translate into an increase in throughput, even though your CPU utilization still has headroom.
An I/O bottleneck can happen when any of the resources you're reading from or writing to are unable to keep up with your CPU demands. This includes system resources (memory, disk space), your database performance (DB CPU utilization, read/write limits), as well as other APIs you're connecting to. Network capacity is also a theoretical bottleneck, but if it were the problem you'd be operating at a scale where you'd have hired someone with experience in this area. Because there are so many different ways for this to happen, the only real way to figure out what the bottlenecks are is to have your monitoring in place.
Re: formula, the short answer is that there's no one formula that you can use in this case. The long answer is probably yes, but you'd arrive at the optimum value in the course of collecting all the values you'd need to calculate it.
EDIT 2 : Concurrency, Latency, and Throughput
I realized I forgot to add one more piece of advice. When you're working with background tasks that users are not waiting for, your throughput (jobs per unit of time) is the only thing you want to optimize. Do not optimize for individual job time. It also means you cannot profile the current (and presumably un-bound) performance and get useful data because bottlenecks/constraints are target dependent. The constraints that exist for throughput will NOT be the same as the constraints that exist for individual task time.
(Technically speaking, your concurrency setting is your current constraint)
The three main factors are:
Number of cores
Type of job - I/O-bound or CPU-bound
Whether another application or process is running on the server
Ideally, for a CPU-bound task, keep the number of threads equal to the number of CPU cores.
For an I/O-bound task it requires benchmarking and calculating the wait time for an I/O operation; then you can decide the optimal value. As a rough estimate, if you have 4 cores, then for an I/O-bound task you should keep at most 8 threads (see the sketch after the reference below).
If you have your Rails app running on the same server, then you will need to leave some cores for it.
Increasing the number of threads will not increase your performance if your system cannot support them.
Refer to: http://baddotrobot.com/blog/2013/06/01/optimum-number-of-threads/
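As a quick illustration of where that "at most 8 threads for 4 cores" rough estimate comes from (a sketch; the threads = cores * (1 + wait_time / compute_time) rule of thumb is the one the linked post walks through, and the timings below are made-up numbers):

# rule-of-thumb sizing for an I/O-bound worker pool:
#   threads = cores * (1 + wait_time / compute_time)
cores = 4
wait_ms = 50      # assumed average time a job spends waiting on SQS/DB/network
compute_ms = 50   # assumed average CPU time per job

threads = int(cores * (1 + wait_ms / compute_ms))
print(threads)    # -> 8, matching the "4 cores => at most 8 threads" estimate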

When is it appropriate to increase the async-thread size from zero?

I have been reading the documentation trying to understand when it makes sense to increase the async-thread pool size via the +A N switch.
I am perfectly prepared to benchmark, but I was wondering if there were a rule-of-thumb for when one ought to suspect that growing the pool size from 0 to N (or N to N+M) would be helpful.
Thanks
The BEAM runs Erlang code in special threads it calls schedulers. By default it will start a scheduler for every core in your processor. This can be controlled at start-up time, for instance if you don't want to run Erlang on all cores but "reserve" some for other things. Normally, when you do a file I/O operation, it is run in a scheduler, and as file I/O operations are relatively slow, they will block that scheduler while they are running, which can affect the real-time properties. Normally you don't do that much file I/O, so it is not a problem.
The asynchronous thread pool consists of OS threads which are used for I/O operations. Normally the pool is empty, but if you use the +A flag at start-up time then the BEAM will create extra threads for this pool. These threads will then only be used for file I/O operations, which means that the scheduler threads no longer block waiting for file I/O and the real-time properties are improved. Of course this comes at a cost, as OS threads aren't free. The threads don't mix, so scheduler threads are just scheduler threads and async threads are just async threads.
If you are writing linked-in drivers for ports these can also use the async thread pool. But you have to detect when they have been started yourself.
How many you need is very much up to your application. By default none are started. Like @demeshchuk, I have also heard that Riak likes to have a large async thread pool as it opens many files. My only advice is to try it and measure, as with all optimisation.
By default, the number of threads in a running Erlang VM is equal to the number of processor logical cores (if you are using SMP, of course).
From my experience, increasing the +A parameter may give some performance improvement when you have many simultaneous file I/O operations. But I doubt that increasing +A will improve overall process performance, since BEAM's scheduler is extremely fast and well optimized.
Speaking of exact numbers – that totally depends on your application, I think. Say, in the case of Riak, where the maximum number of open files is more or less predictable, you can set +A to this maximum, or several times less if it's way too big (by default it's 64, BTW). If your application contains, like, millions of files and you serve them to web clients – that's another story; most likely, you might want to run some benchmarks with your own code and your own environment.
Finally, I believe I've never seen +A set to more than a hundred. That doesn't mean you can't set it higher, but there's likely no point in it.
