Can I repeat slow cluster node jobs at the end of a GNU Parallel job for faster completion?

At the end of processing a large suite of jobs on an inhomogeneous cluster it is highly likely that the slowest node will still be processing its last job while the fastest nodes will be idle. If there are orders of magnitude difference in job processing times between nodes, then this could result in a significant loss of time.
Is there a way for nodes, which become idle once all jobs have been distributed, to be assigned replicate jobs for those not yet returned?
I see the --resume-failed option, but this isn't exactly what I want. It's not that the slow node has failed to process a job per se. It's that a fast node that has become idle at the end of the set of jobs could finish the slowest node's job before the slowest node does.
I assume that this isn't already the default behaviour of GNU Parallel, because adding a very slow node to my cluster results in a much slower (by orders of magnitude) overall processing time.
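To make the idea concrete, here is a small simulation of the replication scheme I have in mind. It is only a model of the scheduling logic, not a GNU Parallel feature or invocation; the function name, the cost model (seconds per unit of work on each node) and the assumption of positive job sizes are all made up for illustration.

```python
def makespan(job_sizes, secs_per_unit, replicate=False):
    """Return the time at which the last job of the suite has a result.

    job_sizes      -- work units per job, dispatched in order
    secs_per_unit  -- per-node cost of one work unit (larger = slower node)
    replicate      -- if True, idle nodes re-run the job that would finish last
    """
    free_at = [0] * len(secs_per_unit)   # when each node next becomes idle
    finish = []                          # time of each job's first result

    # Normal dispatch: every job goes to whichever node is free first.
    for size in job_sizes:
        node = min(range(len(free_at)), key=free_at.__getitem__)
        free_at[node] += size * secs_per_unit[node]
        finish.append(free_at[node])

    while replicate and finish:
        # The straggler is the job whose result would arrive last.
        job = max(range(len(finish)), key=finish.__getitem__)
        # Pick the node that could deliver a replicate of it soonest.
        node = min(range(len(free_at)),
                   key=lambda n: free_at[n] + job_sizes[job] * secs_per_unit[n])
        replica_done = free_at[node] + job_sizes[job] * secs_per_unit[node]
        if replica_done >= finish[job]:
            break                        # no replicate can beat the original
        finish[job] = replica_done       # first result wins
        free_at[node] = replica_done

    return max(finish, default=0)


jobs = [1] * 8            # eight equal jobs
nodes = [1, 1, 100]       # two fast nodes and one node that is 100x slower
print(makespan(jobs, nodes), makespan(jobs, nodes, replicate=True))
```

The model does not cancel the slow copy once a replicate wins, so it slightly understates the benefit.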

Related

How to batch schedule dask_jobqueue jobs in DASK instead of concurrent?

By my reading of Dask-Jobqueue (https://jobqueue.dask.org/en/latest/), and by testing on our SLURM cluster, it seems when you set cluster.scale(n), and create client = Client(cluster), none of your jobs are able to start until all n of your jobs are able to start.
Suppose you have 999 jobs to run, and a cluster with 100 nodes or slots; worse yet, suppose other people share the cluster, and maybe some of them have long-running jobs. Admins sometimes need to do maintenance on some of the nodes, so they add and remove nodes. You never know how much parallelism you'll be able to get. You want the cluster scheduler to simply take 999 jobs (in slurm, these would be submitted via sbatch), run them in any order on any available nodes, store results in a shared directory, and have a dependent job (in slurm, that would be sbatch --dependency=) process the shared directory after all 999 jobs completed. Is this possible with DASK somehow?
It seems a fundamental limitation of the architecture, that all the jobs are expected to run in parallel, and the user must specify the degree of parallelism.
Your understanding is not correct. Dask can run with fewer than the specified number of jobs, just as you've asked for. It will use whatever resources arrive.
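A minimal sketch of what that looks like, assuming dask-jobqueue is installed and a Slurm cluster is reachable. The cores and memory values, the process function and the helper names are placeholders, not anything from the question.

```python
def make_slurm_client(max_jobs=100):
    """Build a Dask client backed by Slurm jobs (needs dask-jobqueue installed)."""
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    # cores/memory are placeholder values; use whatever your site requires.
    cluster = SLURMCluster(cores=1, memory="2GB")
    # Ask for up to max_jobs Slurm jobs. Workers join the pool as Slurm
    # starts them; tasks run on whichever workers have arrived so far.
    cluster.scale(jobs=max_jobs)
    return Client(cluster)


def process(i):
    """Stand-in for one of the 999 independent tasks."""
    return i * i


def run_suite(client, n_tasks=999):
    """Submit n_tasks tasks and collect the results once all have finished."""
    futures = client.map(process, range(n_tasks))
    return client.gather(futures)
```

Calling run_suite(make_slurm_client()) submits all 999 tasks up front, and the step that depends on them runs once gather returns. cluster.adapt(minimum_jobs=..., maximum_jobs=...) may be a better fit than a fixed scale if you also want idle jobs released.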

Is there any way to set numWorkers dynamically in the middle of dataflow job running?

I am using Google Dataflow at work.
I need to change the number of workers dynamically while a Dataflow batch job is running.
That's mainly because of Cloud Bigtable QPS.
We are using 3 Bigtable cluster nodes, and they can't handle receiving all the traffic from 500 workers at once.
So I need to reduce the number of workers (from 500 to 25) just before inserting all the processed data into Bigtable.
Is there any way to achieve this goal?
Dataflow does not provide the ability to manually change the resource allocation of a batch job while it is running, however:
1) We plan to incorporate throttling into our autoscaling algorithms, so Dataflow would detect that it needs to downsize while writing to your bigtable. I don't have a concrete ETA, but this is definitely on our roadmap.
2) Meanwhile, you can try to artificially limit the parallelism of your pipeline with a trick like this:
Take your PCollection<Something> (Something being the data type you're writing to bigtable)
Pipe it through a sequence of transforms: ParDo(pair with a random key drawn from 25 values, e.g. 0..24), GroupByKey, ParDo(ungroup and remove random key). You get, again, a PCollection<Something>
Write this collection to Bigtable.
The trick here is that there is no parallelization within a single key after a GroupByKey, so the result of GroupByKey is a collection of 25 key-value pairs (where the value is an Iterable<Something>) that can't be processed by more than 25 workers in parallel. The ParDo's following it will likely get fused together with the writing to Bigtable, and will thus have a parallelism of 25.
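The three transforms can be modelled in plain Python to show why the parallelism is capped. This is a sketch of the data flow only, not Dataflow or Beam code; the function names and the sample rows are made up for illustration.

```python
import random
from collections import defaultdict

NUM_KEYS = 25  # desired upper bound on write parallelism


def pair_with_random_key(elements, num_keys=NUM_KEYS):
    """First ParDo: attach a random key in [0, num_keys) to every element."""
    return [(random.randrange(num_keys), e) for e in elements]


def group_by_key(pairs):
    """GroupByKey: one (key, values) group per distinct key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def ungroup(groups):
    """Second ParDo: drop the keys and emit the original elements again."""
    return [value for values in groups.values() for value in values]


rows = list(range(10_000))                       # stand-in for the Bigtable rows
groups = group_by_key(pair_with_random_key(rows))
restored = ungroup(groups)                       # same rows, different order
```

However many rows go in, groups never holds more than NUM_KEYS entries, and each entry is the unit of work handed to a single worker.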
The caveat is that Dataflow is within its right to materialize any intermediate collections if it predicts that this will improve performance of the pipeline. It may even do this just for the sake of increasing the degree of parallelism (which goes explicitly against your goal in this example). But if you have an urgent job to run, I believe right now this will probably do what you want.
Meanwhile the only long-term solution I can suggest, until we have throttling, is to use a smaller limit on number of workers, or use a larger Bigtable cluster, or both.
There's a lot of relevant information in the GCP/Next talk "DATA & ANALYTICS: Analyzing 25 billion stock market events in an hour with NoOps on GCP".
FWIW, you can increase the number of nodes of Bigtable before your batch job, give Bigtable a few minutes to adjust, and then start your job. You can turn down the Bigtable cluster when you're done with the batch job.

Execution window time

I've read a section in the book Elixir in Action about processes and the scheduler and have some questions:
Each process gets a small execution window; what does that mean?
Is the execution window approximately 2000 function calls?
What does it mean for a process to implicitly yield execution?
Let's say you have 10,000 Erlang/Elixir processes running. For simplicity, let's also say your computer only has a single processor with a single core. The processor is only capable of doing one thing at a time, so only a single process is capable of being executed at any given moment.
Let's say one of these processes has a long running task. If the Erlang VM wasn't capable of interrupting the process, every single other process would have to wait until that process is done with its task. This doesn't scale well when you're trying to handle tens of thousands of requests.
Thankfully, the Erlang VM is not so naive. When a process is scheduled, it's given a budget of 2,000 reductions (roughly, function calls). Every time the process calls a function, its reduction count goes down by 1. Once the count hits zero, the process is interrupted (it implicitly yields execution), and it has to wait for its next turn.
Because Erlang/Elixir don't have loops, iterating over a large data structure must be done recursively. This means that, unlike in most languages, where a long loop can become a system bottleneck, each iteration uses up one of the process's reductions, so the process cannot hog execution.
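The counting scheme can be imitated with a toy round-robin scheduler. This is a Python model using generators, not how the VM is implemented; the budget of 2,000 comes from the description above, and the process names are invented.

```python
from collections import deque

REDUCTIONS_PER_SLICE = 2000


def process(calls):
    """A toy process: each yield stands for one function call (one reduction)."""
    for _ in range(calls):
        yield


def run(processes, budget=REDUCTIONS_PER_SLICE):
    """Run every process to completion; return the order of execution slices."""
    queue = deque(processes.items())
    trace = []
    while queue:
        name, proc = queue.popleft()
        trace.append(name)
        for _ in range(budget):
            try:
                next(proc)           # spend one reduction
            except StopIteration:
                break                # process finished within its slice
        else:
            # Budget used up: the process yields and goes to the back of the queue.
            queue.append((name, proc))
    return trace


trace = run({"long_task": process(5000), "short_task": process(10)})
print(trace)
```

The short process gets its turn after the long one has used only its first 2,000 reductions, instead of waiting for all 5,000 calls.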
The rest of this answer is beyond the scope of the question, but included for completeness.
Let's say now that you have a processor with 4 cores. Instead of only having 1 scheduler, the VM will start up with 4 schedulers (1 for each core). If you have enough processes running that the first scheduler can't handle the load in a reasonable amount of time, the second scheduler will take control of the excess processes, executing them in parallel to the first scheduler.
If those two schedulers can't handle the load in a reasonable amount of time, the third scheduler will take on some of the load. This continues until all of the processors are fully utilized.
Additionally, the VM is smart enough not to waste time on processes that are idle - i.e. just waiting for messages.
There is an excellent blog post by JLouis on How Erlang Does Scheduling. I recommend reading it.

How to limit both total number of Quartz jobs and number running on single node in a cluster

I have a 3-node cluster and need to run a specific Quartz job as follows:
At any given time there are many (say 30) of these jobs that need to be run.
Limit the number of instances of that Quartz job running at the same time across all nodes combined (to 10, because of system resources).
Limit the number of instances of that Quartz job running at the same time on a single server (to 5, because of CPU load).
How do I limit both the total number of simultaneous job instances to 10, and the number running on any one host to 5? Is this even possible?
Note that I cannot limit the number of threads as I have other jobs that need to run on the same servers at the same time, and those need threads as well.
Thanks.
While it does not exactly limit the concurrent job count, you can limit the maximum thread count with the thread pool configuration. See the Quartz Configuration Reference.
The Grails Quartz plugin comes with a handy script for installing the config file:
grails install-quartz-config
org.quartz.threadPool.threadCount
Can be any positive integer, although you should realize that only numbers between 1 and 100 are very practical. This is the number of threads that are available for concurrent execution of jobs. If you only have a few jobs that fire a few times a day, then 1 thread is plenty! If you have tens of thousands of jobs, with many firing every minute, then you probably want a thread count more like 50 or 100 (this highly depends on the nature of the work that your jobs perform, and your systems resources!).
I think that I found the answer.
The answer is to run two (or more) separate Quartz schedulers. A job in the first scheduler would schedule the job for the second scheduler, and the second would run them. The second scheduler could then be limited to (in this case) 5 threads, although the first scheduler could have more.
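Whatever mechanism enforces it, the rule the dispatching side has to apply is a two-level check. The sketch below models that rule in Python; it is not Quartz API, the class and method names are hypothetical, and in a real cluster the counts would have to live somewhere all nodes can see.

```python
from collections import Counter


class JobLimiter:
    """Admit a job only if both the cluster-wide and the per-host cap allow it."""

    def __init__(self, total_limit=10, per_host_limit=5):
        self.total_limit = total_limit
        self.per_host_limit = per_host_limit
        self.running = Counter()     # host -> number of running instances

    def try_start(self, host):
        if sum(self.running.values()) >= self.total_limit:
            return False             # cluster-wide cap reached
        if self.running[host] >= self.per_host_limit:
            return False             # this host is already at its cap
        self.running[host] += 1
        return True

    def finish(self, host):
        if self.running[host] > 0:
            self.running[host] -= 1


limiter = JobLimiter()
# Offer 10 jobs to each of the three hosts in turn.
started = [host for host in ("a", "b", "c") for _ in range(10)
           if limiter.try_start(host)]
```

Jobs that are refused stay queued and are offered again when finish is called for some host.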
Some information about this can be found in
http://quartz-scheduler.org/documentation/quartz-2.2.x/cookbook/MultipleSchedulers
However, I do not know how to implement two separate Quartz schedulers in Grails. If anyone could help with that I would appreciate it. There is an existing Stack Overflow question about this, although it is unanswered.

Erlang: Job Scheduling Over a Dynamic Set of Nodes

I need some advice writing a job scheduler in Erlang which is able to distribute jobs (external OS processes) over a set of worker nodes. A job can last from a few milliseconds to a few hours. The "scheduler" should be a global registry where jobs come in, get sorted and then get assigned and executed on connected "worker nodes". Worker nodes should be able to register on the scheduler by telling it how many jobs they are able to process in parallel (slots). Worker nodes should be able to join and leave at any time.
An Example:
Scheduler has 10 jobs waiting
Worker Node A connects and is able to process 3 jobs in parallel
Worker Node B connects and is able to process 1 job in parallel
Some time later, another worker node joins which is able to process 2 jobs in parallel
Questions:
I seriously spent some time thinking about the problem but I am still not sure which way to go. My current solution is to have a globally registered gen_server for the scheduler which holds the jobs in its state. Every worker node spawns N worker processes and registers them on the scheduler. The worker processes then pull a job from the scheduler (which is an infinite blocking call with {noreply, ...} if no jobs are currently available).
Here are some questions:
Is it a good idea to assign every new job to an existing worker, knowing that I will have to re-assign the job to another worker at the time new workers connect? (I think this is how the Erlang SMP scheduler does things, but reassigning jobs seems like a big headache to me)
Should I start a process for every worker processing slot and where should this process live: on the scheduler node or on the worker node? Should the scheduler make rpc calls to the worker node or would it be better for the worker nodes to pull new jobs and then execute them on their own?
And finally: Is this problem already solved and where to find the code for it? :-)
I already tried RabbitMQ for job scheduling but custom job sorting and deployment adds a lot of complexity.
Any advice is highly welcome!
Having read your answer in the comments, I'd still recommend using pool(3):
Spawning 100k processes is not a big deal for Erlang because spawning a process is much cheaper than in other systems.
One process per job is a very good pattern in Erlang: start a new process, run the job in that process, keeping all the state in it, and terminate the process after the job is done.
Don't bother with worker processes that process a job and wait for a new one. This is the way to go if you are using OS processes or threads, because spawning is expensive, but in Erlang this only adds unnecessary complexity.
The pool facility is useful as a low-level building block; the only thing it lacks for your purposes is the ability to start additional nodes automatically. What I would do is start with pool and a fixed set of nodes to get the basic functionality.
Then add some extra logic that watches the load on the nodes, e.g. the way pool does it with statistics(run_queue). If you find that all nodes are over a certain load threshold, just slave:start/2,3 a new node on an extra machine and use pool:attach/1 to add it to your pool.
This won't rebalance old running jobs, but new jobs will automatically be moved to the newly started node since it's still idle.
With this you can have a fast pool-controlled distribution of incoming jobs and a slower, totally separate way of adding and removing nodes.
If you get all this working and still find, after some real-world benchmarking please, that you need rebalancing of jobs, you can always build something into the jobs' main loops: on receiving a rebalance message, a job can respawn itself via the pool master, passing its current state as an argument.
Most importantly, just go ahead and build something simple and working, and optimize it later.
My solution to the problem:
"distributor" - gen_server,
"worker" - gen_server.
"distributor" starts "workers" using slave:start_link; each "worker" is started with a max_processes parameter.
"distributor" behavior:
handle_call(submit, ...)
* put the job on the queue,
* cast check_queue to itself
handle_cast(check_queue, ...)
* gen_server:call every worker with report_load to get its load (current_processes / max_processes),
* find the least busy worker,
* if the chosen worker's load is < 1, gen_server:call it with submit and the next job, if any, and remove that job from the queue
"worker" behavior (trap_exit = true):
handle_call(report_load, ...)
* return current_processes / max_processes,
handle_call(submit, ...)
* spawn_link the job,
handle_info({'EXIT', Pid, Reason}, ...)
* gen_server:cast check_queue to the distributor
In fact it is more complex than that, as I need to track running jobs and kill them if I need to, but that is easy to implement in such an architecture.
This is not a dynamic set of nodes though, but you can start a new node from the distributor whenever you need to.
P.S. Looks similar to pool, but in my case I am submitting port processes, so I need to limit them and have better control of what is going where.
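The distributor/worker logic above can be modelled in a few lines to check the dispatch rule. This is a single-process Python sketch, not Erlang and not the actual implementation. Three things in it are assumptions: check_queue drains the queue in a loop rather than handling one job per cast, add_worker stands in for starting a new node, and the example numbers are taken from the question.

```python
from collections import deque


class Worker:
    def __init__(self, name, max_processes):
        self.name = name
        self.max_processes = max_processes
        self.running = set()

    def load(self):
        # Same measure as report_load: current_processes / max_processes.
        return len(self.running) / self.max_processes


class Distributor:
    def __init__(self, workers=()):
        self.workers = list(workers)
        self.queue = deque()

    def submit(self, job):
        self.queue.append(job)
        self.check_queue()

    def add_worker(self, worker):
        self.workers.append(worker)
        self.check_queue()

    def job_exited(self, worker, job):
        # Corresponds to the worker seeing 'EXIT' and casting check_queue.
        worker.running.discard(job)
        self.check_queue()

    def check_queue(self):
        while self.queue and self.workers:
            worker = min(self.workers, key=Worker.load)   # least busy worker
            if worker.load() >= 1:
                break                                     # every slot is taken
            worker.running.add(self.queue.popleft())


a, b = Worker("A", 3), Worker("B", 1)
d = Distributor([a, b])
for job in range(10):
    d.submit(job)
```

With 10 jobs and 3 + 1 slots, four jobs start immediately and six wait in the queue until a job exits or another worker is added.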
