How does GNU Parallel handle multithreaded jobs?

Does GNU Parallel change job assignment based upon how many threads each job uses?
For example, on an 8 core machine, does
parallel 4ThreadedProcess
automatically detect that each job uses 4 threads and thus run only 2 jobs at a time?
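
For what it's worth, GNU Parallel does not detect how many threads a job spawns: by default it simply runs one job per CPU core (newer versions count hardware threads), so you cap the job slots yourself with -j. A minimal sketch, reusing the hypothetical 4ThreadedProcess and made-up input names from the question:

# Run at most 2 jobs at a time, so 2 jobs x 4 threads fill the 8 cores:
parallel -j 2 4ThreadedProcess ::: input1 input2 input3 input4

# Same idea expressed relative to the machine: 25% of 8 cores = 2 slots.
parallel -j 25% 4ThreadedProcess ::: inputs/*.dat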

Related

Google Cloud Dataflow + Java 8 vs Java 11: same pipeline, different CPU utilization on workers

I have a Beam 2.25.0 pipeline that gets some data, generates a bunch more data (does a fanout), repartitions the new data, and runs computations on that generated data in parallel. The machines I specify for the job are n1-highmem-4 and I specify 40 workers max.
It works fine under Java 8: all workers provided to the job are fully utilized (>90% CPU). Throughput is 40 elements/s.
When I recompile and re-run the pipeline to use Java 11, the same number of workers are provided to the job, but they only reach 30% CPU utilization, and throughput is lower, under 18 elements/s.
To get the job to approach the same throughput, I have to specify the --numberOfWorkerHarnessThreads=4 flag, and even then throughput is still not the 40 elements/s I get when I run the pipeline under Java 8.
What could be the difference between using Java 8 vs Java 11 for the pipeline? And why wouldn't the pipeline running under Java 11 automatically utilize the workers the same way as under Java 8?
I also tried recompiling and using Beam 2.26.0 for the Java 11 pipeline execution, but it had the same throughput.
There is a bug in Beam that makes batch pipelines on Java 11 default to using only 1 worker harness thread. Specifying --numberOfWorkerHarnessThreads=4 makes the pipeline use 4 threads instead.
You can see the workers did use around ~25% CPU, which (since n1-highmem-4 is a 4-core machine, as it looks from the post) means one saturated core: 100% / 4 cores = 25%.
Looking at the Jira issue, the fix should be in 2.26.0, but it may have been delayed to 2.27.0.
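
For illustration, a hedged sketch of passing that flag when launching the pipeline; the main class, project, and region are placeholder assumptions, while the options themselves are standard Dataflow pipeline options:

# Match the harness thread count to the 4 vCPUs of an n1-highmem-4
# to work around the Java 11 batch default of a single thread.
mvn compile exec:java \
  -Dexec.mainClass=com.example.MyPipeline \
  -Dexec.args="--runner=DataflowRunner \
    --project=my-project --region=us-central1 \
    --workerMachineType=n1-highmem-4 --maxNumWorkers=40 \
    --numberOfWorkerHarnessThreads=4"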

Massive-Distributed Parallel Execution of tasks

We are currently struggling with the following task: we need to run a Windows application (only a single instance can run at a time) 1000 times with different input parameters, and a single run can take up to several hours. It feels like the same problem as any video-rendering farm, where each frame should be computed independently and in parallel, but it is not rendering.
So far we have tried to execute it with Jenkins Pipeline jobs. We used the parallel step in a pipeline and let Jenkins queue and execute the application, and we use Jenkins label expressions to let Jenkins choose which job can run on which node.
The current limitation in Jenkins is with massively parallel jobs (https://issues.jenkins-ci.org/browse/JENKINS-47724). When the queue contains several hundred jobs, adding new ones takes much longer, and it gets worse as the queue grows. The main problem: Jenkins starts executing the parallel pipeline part-jobs only after all of them have been added to the queue.
We have already investigated some ideas for solving this problem:
Python Distributed: https://distributed.readthedocs.io/en/latest/
a. For single functions it looks great, but a complete run like ours in Jenkins (deploy, execute, collect results) looks complex
b. Bidirectional client-server communication is needed, so there is no way to bring it online through a NAT (VM server)
BOINC: https://boinc.berkeley.edu/
a. As far as we understand, we would have to extend the backend massively to get our jobs running; configuring the jobs in BOINC would mean writing a lot of new automation code
b. We currently need a pre-deployed application that can distinguish between different inputs; there is no equivalent of the Jenkins label expression
Any ideas how to solve it?
Thanks in advance
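
Given this page's topic, one hedged idea to add: GNU Parallel itself can fan jobs out over SSH. A minimal sketch, assuming the nodes accept SSH logins (e.g. via an OpenSSH server on the Windows hosts), that app.exe and params/*.cfg stand in for the real application and inputs, and that the app writes its result to <input-basename>.out:

# nodes.txt holds one SSH login per line; -j 1 allows one job at a
# time per node, matching the single-instance constraint.
# --trc transfers each input file to the node, returns the named
# output file, and cleans up the transferred files afterwards.
parallel --sshloginfile nodes.txt -j 1 --trc {.}.out app.exe {} ::: params/*.cfg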

Difference between agents and worker threads

I'm working on running NUnit console runners using Jenkins. These tests connect to a Selenium Grid (which is also run by Jenkins), so I want to limit their level of parallelism to avoid having agents starve while waiting for a free node on the grid.
So far I haven't managed to figure out what exactly the difference is between an agent and a worker thread in NUnit... I suspect the agent can manage threads, but that's only a guess. Thanks :)
An agent is a separate process running tests for an assembly. A worker is a thread, within a process, running the tests for a particular assembly.
Theoretically, an agent process could have multiple appdomains, each domain could have multiple assemblies and each assembly could have multiple worker threads.
Practically, however, the normal thing to do is to have one process per assembly, so that there is no need for multiple domains, and each process will run some specified number of worker threads to run tests for the assembly. In some contexts, you may prefer to only run processes in parallel and not have any parallelism within the assembly - it's the approach that is most likely to work without any change to your tests, which you may not have designed with parallelism in mind.
Agents do not "manage" threads. They simply run the framework in a process and the framework decides how many threads to use depending on the attributes you have applied.
Using multiple agents is the only way to run NUnit V2 tests in parallel, since the V2 framework is ignorant of parallelism.
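
As a concrete illustration of the process-level-only approach, a hedged sketch of the nunit3-console flags involved (the assembly names are hypothetical): --agents caps how many agent processes run at once, and --workers caps the worker threads inside each one.

# Two assemblies run in two parallel agent processes, each single-threaded,
# so at most 2 tests talk to the Selenium Grid at any moment.
nunit3-console --agents=2 --workers=1 Tests.UI.A.dll Tests.UI.B.dll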

How to limit concurrent matrix/multi-configuration builds in Jenkins

I have a multi-configuration job that uses a large number of VMs for testing.
The Axis are something like:
30 VM slaves, 5 configurations, 5 different configurations
I would not like to run these sequentially, as the jobs would take forever. However, the default number of simultaneous runs is using up enough resources that I am getting random failures and disconnects.
Is there are way to specify the maximum number of simultaneous tests within this single running job?
I think you have to use a matrix job to trigger builds of a separate job that does the real build. Then you can use the Throttle Concurrent Builds Plugin to limit the number of parallel executions of that job started by the matrix.
For a multi-configuration (matrix) project:
First you need to create a throttle category. In this case the name is qa-aut, and I limit the number of concurrent builds, overall and per node, to 2. The node will have 4 executors available.
In your job configuration, make sure you don't run the configurations sequentially.
Set up build throttling by selecting "Throttle this project as part of one or more categories", the "Multi-Project Throttle Category" (qa-aut), and "Throttle Matrix configuration builds". You can leave the rest of the values blank.
Make sure your node/master has enough executors available. In this case, the master will have 4 executors available.
Execute your multi-configuration job.
Instead of using all 4 available executors, you will see it uses only 2 executors (2 threads), as specified in the category.

Is it possible to force concurrent jobs to run in separate Sidekiq processes?

One of the benefits of Sidekiq over Resque is that it can run multiple jobs in the same process. The drawback, however, is that I can't figure out how to force a set of concurrent jobs to run in different processes.
Here's my use case: say I have to generate 64M rows of data, and I have 8 vCPUs on an Amazon EC2 instance. I'd like to carve the task up into 8 concurrent jobs generating 8M rows each. The problem is that if I'm running 8 Sidekiq processes, Sidekiq will sometimes decide to run 2 or more of the jobs in the same process, so it doesn't use all 8 vCPUs and takes much longer to finish. Is there any way to tell Sidekiq which worker to use, or to force it to spread a group of jobs evenly amongst processes?
The answer is that you can't easily, by design. Specialization is what leads to SPOFs (single points of failure).
You can create a custom queue for each process and then create one job for each queue.
You can use JRuby which doesn't suffer the same flaw.
You can execute the processing as a rake task which will spawn one process per job, ensuring an even load.
You can carve up 64 jobs instead of 8 and get a more even load that way.
I would probably do the last of these, unless the resulting I/O crushes the machine.
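
For completeness, a hedged sketch of the queue-per-process option; the shard queue names and the GenerateRows job class are hypothetical:

# Start 8 Sidekiq processes, each pinned to its own queue with a single
# worker thread (-c 1), so no two of the big jobs share a process.
for i in 0 1 2 3 4 5 6 7; do
  bundle exec sidekiq -q "shard$i" -c 1 &
done

# Enqueue exactly one 8M-row job per queue via Sidekiq's client API.
for i in 0 1 2 3 4 5 6 7; do
  bundle exec rails runner \
    "Sidekiq::Client.push('queue' => 'shard$i', 'class' => 'GenerateRows', 'args' => [$i, 8000000])"
done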
