PBS array job parallelization - gnu-parallel

I am trying to submit a job on a high-performance computing cluster that needs to run a Python script, let's say, 10000 times. I used GNU parallel, but the IT team emailed me to say that my job was creating too many ssh login entries in their monitoring system, and asked me to use job arrays instead. My code takes about 12 seconds per run. I believe I need the #PBS -J directive in my PBS script, but I am not sure whether the array will then run in parallel. I need to run my code on, say, 10 nodes with 16 cores each, i.e. 160 instances at the same time. How can I parallelize it so that many instances run at once and use all the resources I have?
Below is my initial PBS script with GNU parallel:
#PBS -P My_project
#PBS -N my_job
#PBS -l select=10:ncpus=16:mem=4GB
#PBS -l walltime=01:30:00
module load anaconda
module load parallel
parallel --joblog jobs.log --wd $PBS_O_WORKDIR -j $JOBSPERNODE --sshloginfile $PBS_NODEFILE --env PATH "python $PBS_O_WORKDIR/xyz.py" :::: inputs.txt
inputs.txt is a file with the integer values 0-9999, one per line, each of which is passed to my Python code as an argument. The runs are fully independent: the output of one instance does not affect another.

A little late, but I thought I'd answer anyway.
Arrays will run in parallel, but the number of jobs running at once will depend on the availability of nodes and the limit of jobs per user per queue. Essentially, each HPC will be slightly different.
Adding #PBS -J 1-10000 creates an array of 10000 subjobs. Assuming the syntax is the same as on the HPC I use, something like ID=$(sed -n "${PBS_ARRAY_INDEX}p" /path/to/inputs.txt) then picks the matching integer from inputs.txt: PBS array index 123 returns the 123rd line of inputs.txt.
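A minimal sketch of how this could look as a full job script, assuming PBS Pro style syntax (#PBS -J, PBS_ARRAY_INDEX) as above. Grouping 100 inputs per subjob is my own assumption, made because a 12-second run is short compared with typical scheduling overhead; the CHUNK value, the chunk_lines helper, and the resource requests are hypothetical and would need adapting to your cluster.

```shell
#!/bin/bash
#PBS -P My_project
#PBS -N my_job_array
#PBS -l select=1:ncpus=1:mem=4GB
#PBS -l walltime=00:30:00
#PBS -J 1-100

# Hypothetical chunk size: 100 input lines per subjob,
# so 100 subjobs cover the 10000 lines of inputs.txt.
CHUNK=100

# Print the lines of file $2 that belong to array index $1.
chunk_lines() {
    local idx=$1 file=$2
    local first=$(( (idx - 1) * CHUNK + 1 ))
    local last=$(( idx * CHUNK ))
    sed -n "${first},${last}p" "$file"
}

# Only do real work when PBS has set an array index for this subjob.
if [ -n "${PBS_ARRAY_INDEX:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    chunk_lines "$PBS_ARRAY_INDEX" inputs.txt | while read -r arg; do
        python xyz.py "$arg" < /dev/null
    done
fi
```

With CHUNK=1 and #PBS -J 1-10000 this reduces to the one-line-per-subjob scheme described above. How many subjobs actually run at once is still decided by the scheduler and the per-user limits.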
Alternatively, if each run takes only 12 seconds and you have 10000 iterations, a plain for loop on a single core would also get through the entire process, in about 33.33 hours (10000 x 12 s = 120000 s).
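The arithmetic can be checked with a few lines of shell. The 160 slots come from the 10 nodes x 16 cores in the question. The parallel figure is an idealized lower bound that assumes all slots are available at once and ignores queueing and startup overhead.

```shell
runs=10000
secs_per_run=12
slots=160   # 10 nodes x 16 cores, taken from the question

serial_secs=$(( runs * secs_per_run ))
rounds=$(( (runs + slots - 1) / slots ))    # ceiling division
parallel_secs=$(( rounds * secs_per_run ))

echo "serial:   ${serial_secs} s"
echo "parallel: ${rounds} rounds, ${parallel_secs} s"
```

This prints 120000 s for the serial loop (the 33.33 hours above) and 63 rounds, 756 s (about 12.6 minutes) for the ideal parallel case.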


Using parallel on a slurm cluster

I am looking for a command to parallelize the following command with Gnu Parallel:
OpenSees 1.tcl
OpenSees is an exe file which is OpenSees.exe in windows and OpenSees in Linux. I want to do parallel processing with parameter study. OpenSees is a seismic analysis tool. 1.tcl is an input file for it.
Please bear in mind that the input files run from 1.tcl to 360.tcl, and I would like to set the number of processors (i.e. how many executions run side by side). Normally there are parallel versions of OpenSees that use MPI, but I am asking about the sequential version.
This is the Slurm shell script I used. It only worked on one machine and I could not add more than one, so I wrote further shell scripts for the numbers from 29 up to 360. These are the relevant parts of the script:
#SBATCH -n 28 # total number of cores
#SBATCH -N 1 # machine number
parallel --bar ./OpenSees {}.tcl ::: {1..28}

Adding 15K tasks to Jenkins build queue

I have a Jenkins server with about 50 slaves attached.
I'm trying to stress test the Jenkins build queue because I haven't found any documentation about its limits.
I have a simple parameterized job with just one step; BRANCH and COUNT are the job parameters. The job sleeps for a random amount of time between 10 and 30 seconds:
SEC=$(shuf -i10-30 -n1)
sleep $SEC
I would like to run this job 15K times.
At first I tried to use Jenkins REST API from command line :
for c in $(seq 1 15000); do curl -X POST http://<server ip>:8080/job/TEST_SIMPLE/buildWithParameters --data-urlencode "token=TEST" --data-urlencode "BRANCH=<branch name>" --data-urlencode "COUNT=${c}"; done
But after an hour only 4K tasks had been submitted, so I killed the loop and purged the Jenkins build queue.
My second try was another job that triggers this 'TEST_SIMPLE' job from a system Groovy script by calling the 'job.scheduleBuild' API. It has been running for 1.5 hours and has submitted only 8K of the 15K tasks.
It seems that tasks are added to the queue only when a slave takes one from the queue.
The purpose of this effort is to replace a very old executor/dispatcher for our test suite, which contains a large number of tests (~15K). I am doing a POC with Jenkins because we already use it for our builds and to run this old executor.
So my questions are:
1. Is there a limit on the size of the build queue?
2. Is there a way to submit this many requests very quickly?
If I understand your post correctly you would like to know:
Is there a limit to the build queue?
How can I submit lots of requests to Jenkins very fast?
Let me answer the second question first. To submit requests faster you can use parallel xargs with curl. This lets you run several curl processes at the same time.
seq 1 20 | xargs -P 10 -I cnt curl -X POST http://<server ip>:8080/job/TEST_SIMPLE/buildWithParameters --data-urlencode "token=TEST" --data-urlencode "BRANCH=<branch name>" --data-urlencode "COUNT=cnt"
Here -I cnt runs curl once per input line and substitutes that line wherever 'cnt' appears, and -P 10 runs up to 10 curl processes in parallel. A separate -n 1 is not needed: -I already implies one input line per command, and GNU xargs treats -n and -I as mutually exclusive.
Depending on the number of processors on the machine you use to generate load, you could use a very high number of processes. Go too high and your ability to generate load will drop off as the processes compete for the processor. On most modern laptops I would start at 50 and push it until the fans sounded like a jet engine, but that's just me.
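Before pointing this at a real server, the substitution can be checked with a dry run. The sketch below puts echo in place of curl, so nothing is sent to Jenkins; the "would POST" text and the choice of 4 processes are only for illustration.

```shell
# Dry run: echo stands in for curl, so no request is made.
# -P 4 runs up to four processes at once;
# -I cnt replaces "cnt" with each input line, one invocation per line.
out=$(seq 1 20 | xargs -P 4 -I cnt echo "would POST COUNT=cnt")

# Parallel runs may finish out of order, so sort numerically on the value.
echo "$out" | sort -t= -k2 -n | head -n 3
echo "total lines: $(echo "$out" | wc -l | tr -d ' ')"
```

Once the printed lines look right, echo and its message can be swapped back for the curl command, and the -P value raised step by step.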
As for the queue limit, it will certainly be bounded by memory and disk, but I would have to dig into the source to see whether there are other constraints.

Use all cores to make OpenCV 3 [duplicate]

Quick question: what is the compiler flag to allow g++ to spawn multiple instances of itself in order to compile large projects quicker (for example 4 source files at a time for a multi-core CPU)?
You can do this with make - with gnu make it is the -j flag (this will also help on a uniprocessor machine).
For example if you want 4 parallel jobs from make:
make -j 4
You can also run gcc in a pipe with
gcc -pipe
This will pipeline the compile stages, which will also help keep the cores busy.
If you have additional machines available too, you might check out distcc, which will farm compiles out to those as well.
There is no such flag, and having one runs against the Unix philosophy of having each tool perform just one function and perform it well. Spawning compiler processes is conceptually the job of the build system. What you are probably looking for is the -j (jobs) flag to GNU make, a la
make -j4
Or you can use pmake or similar parallel make systems.
People have mentioned make but bjam also supports a similar concept. Using bjam -jx instructs bjam to build up to x concurrent commands.
We use the same build scripts on Windows and Linux and using this option halves our build times on both platforms. Nice.
If using make, invoke it with -j. From man make:
-j [jobs], --jobs[=jobs]
Specifies the number of jobs (commands) to run simultaneously.
If there is more than one -j option, the last one is effective.
If the -j option is given without an argument, make will not limit the
number of jobs that can run simultaneously.
If you want to script this, or detect the number of cores available (which can vary a lot if you run in many environments), you can use Python's ubiquitous cpu_count() function, like this:
make -j $(python3 -c 'import multiprocessing as mp; print(int(mp.cpu_count() * 1.5))')
If you're asking why 1.5 I'll quote user artless-noise in a comment above:
The 1.5 number is because of the noted I/O bound problem. It is a rule of thumb. About 1/3 of the jobs will be waiting for I/O, so the remaining jobs will be using the available cores. A number greater than the cores is better and you could even go as high as 2x.
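The same rule of thumb can be sketched without Python, assuming nproc from GNU coreutils is available (it may not be on non-Linux systems). The command is only printed here so the snippet is safe to run anywhere; the variable names are my own.

```shell
cores=$(nproc)                 # processing units available to this process
jobs=$(( cores * 3 / 2 ))      # 1.5x rule of thumb, in integer arithmetic
if [ "$jobs" -lt 1 ]; then
    jobs=1
fi

# Printed rather than executed; drop "echo" to start the build.
echo make -j "$jobs"
```

On a 4-core machine this gives 6 jobs. On a single core the integer division rounds 1.5 down to 1.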
make will do this for you. Investigate the -j and -l switches in the man page. I don't think g++ is parallelizable.
distcc can also be used to distribute compiles not only on the current machine, but also on other machines in a farm that have distcc installed.
I'm not sure about g++, but if you're using GNU Make then "make -j N" (where N is the number of jobs make may run at once) will allow make to run multiple g++ jobs at the same time (so long as the files do not depend on each other).
GNU parallel
I was making a synthetic compilation benchmark and couldn't be bothered to write a Makefile, so I used:
sudo apt-get install parallel
ls | grep -E '\.c$' | parallel -t --will-cite "gcc -c -o '{.}.o' '{}'"
{.} takes the input argument and removes its extension
-t prints out the commands being run to give us an idea of progress
--will-cite removes the request to cite the software if you publish results using it...
parallel is so convenient that I could even do a timestamp check myself:
ls | grep -E '\.c$' | parallel -t --will-cite "\
  if ! [ -f '{.}.o' ] || [ '{}' -nt '{.}.o' ]; then
    gcc -c -o '{.}.o' '{}'
  fi
"
xargs -P can also run jobs in parallel, but it is a bit less convenient for manipulating the extension or running multiple commands; see "Calling multiple commands through xargs".
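For comparison, here is a rough sketch of the xargs variant. It is a dry run that prints the gcc command for two made-up file names (main.c and util.c are hypothetical) instead of compiling, to show the extra sh -c wrapper needed to strip the extension.

```shell
# "sh -c '...' _ {}" passes the file name as $1, so ${1%.c} can strip ".c".
# echo makes this a dry run; remove it to actually invoke gcc.
cmds=$(printf '%s\n' main.c util.c |
    xargs -P 2 -I{} sh -c 'echo "gcc -c -o ${1%.c}.o $1"' _ {})

echo "$cmds" | sort
```

The wrapper is the inconvenient part: parallel's {.} does the same job as the ${1%.c} expansion without a nested shell.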
Parallel linking was asked at: Can gcc use multiple cores when linking?
TODO: I think I read somewhere that compilation can be reduced to matrix multiplication, so maybe it is also possible to speed up single file compilation for large files. But I can't find a reference now.
Tested in Ubuntu 18.10.

watching memory in PBS

I'm running a job on a cluster (using PBS) that runs out of memory. I'm trying to print the memory status of each node separately while my job is running. I created a shell script and call it from inside my job submission script, but when I submit the job I get a "permission denied" error on the line that calls the script. I don't understand why I get that error.
Secondly, I was thinking of putting 'watch free' or 'watch ps aux' in my script file, but now I wonder whether that would leave my submitted job stuck in the memory-watching script, so that it never reaches the main line that calls my parallel program.
Ultimately, how can I log memory usage in PBS for the jobs I'm submitting? My code is a C++ program using the MRMPI (MPI MapReduce) library.
To see how much memory is being used throughout the job, run qstat -f:
$ qstat -f | grep used
resources_used.cput = 00:02:51
resources_used.energy_used = 0
resources_used.mem = 6960kb
resources_used.vmem = 56428kb
resources_used.walltime = 00:01:26
To examine past jobs you can look in the accounting file. This is located in the server_priv/accounting directory, the default is /var/spool/torque/server_priv/accounting/.
The entries look like this:
09/14/2015 10:52:11;E;202.napali;user=dbeer group=company jobname=intense.sh queue=batch ctime=1442248534 qtime=1442248534 etime=1442248534 start=1442248536 owner=dbeer#napali exec_host=napali/0-2 Resource_List.neednodes=1:ppn=3 Resource_List.nodect=1 Resource_List.nodes=1:ppn=3 session=20415 total_execution_slots=3 unique_node_count=1 end=0 Exit_status=0 resources_used.cput=1989 resources_used.energy_used=0 resources_used.mem=9660kb resources_used.vmem=58500kb resources_used.walltime=995
NOTE: if your ssh access to computing nodes of the cluster is closed, this method won't work!
This is how I ended up doing this. It might not be the best way but it works:
In summary, I added some short sleep periods between my map and reduce steps by calling the C++ sleep() function. I also wrote a script that ssh's into the nodes my job is running on and writes the memory status of those nodes to a file (using the 'free' or 'top' commands).
More detailed: in my PBS job script, somewhere before the call to my binary, I added this line:
#this goes in job script, before the call to the job binary:
cat $PBS_NODEFILE > /some/path/nodelist.log
This writes a list of the nodes that my job runs on, into a file.
I have a second script "watchmem.sh":
for i in $(seq 60)
do
    while read -r line
    do
        ssh "$line" 'bash -s' < /some/path/remote.sh "$line"
    done < /some/path/nodelist.log
    sleep 10
done
This script reads the file nodelist.log that we generated before, performs an ssh into each node and calls a third (and last script), remote.sh, on each of those nodes.
remote.sh contains the commands that we run on every node of our job. In this case it prints the current time and the result of 'free' into separate files for each node:
echo "Current time : $(date)" >> $1
free >> $1 #this can be replaced by top by specifying a -n for it
Comparing the times in these files with the times I print from my binary lets me work out the memory consumption (alloc/dealloc) in each step.
The sleep periods in my job are there to make sure my scripts capture the memory status between steps. The 'sleep 10' in my script avoids unnecessary writes to the file; this period should be comparable to the sleep duration in the main job.
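On the question of whether a watcher would block the job: one possible variation, not the method described above, is to run the sampling loop in the background of the job script itself. This sketch only covers the node the script runs on, so it does not replace the ssh approach for multi-node jobs. The log_memory helper, its arguments, and the 'sleep 2' stand-in for the real binary are all hypothetical.

```shell
# Hypothetical helper: append a timestamp and the output of `free` to
# LOGFILE, COUNT times, pausing INTERVAL seconds between samples.
log_memory() {
    local count=$1 interval=$2 logfile=$3
    for i in $(seq "$count"); do
        echo "Current time : $(date)" >> "$logfile"
        free >> "$logfile" 2>&1 || true   # keep sampling even if `free` fails
        sleep "$interval"
    done
}

logfile=$(mktemp)
log_memory 3 1 "$logfile" &    # "&" keeps the watcher from blocking the job
watcher=$!

sleep 2                        # stand-in for the call to the real job binary

wait "$watcher"                # or: kill "$watcher" once the binary has finished
echo "samples written: $(grep -c 'Current time' "$logfile")"
```

Because the watcher runs with a fixed sample count in the background, the main program starts immediately and the script cannot get stuck in the monitoring loop.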

To use lsf bsub command without all the verbosity output

My problem is this: I have a bash script that does some work and then submits 800 bsub jobs like this:
rm -f ~/.count-*
for i in `ls _some_files_`; do
    bsub -I "grep _something_ $i > $of" &
    pids="${!} ${pids}"
done
wait ${pids}
The script then processes the output files $of and echoes the results.
The trouble is that I got a lot of lines like:
Job <7536> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on hostA>>
That is 800 copies of the 3 lines above. Is there a way to suppress these LSF lines?
I've tried in the loop above:
bsub -I "grep _something_ $i > $of" &> /dev/null
It does remove the LSF verbosity, but instead of submitting almost all 800 jobs at once and finishing in less than 4 minutes, it submits just a few jobs at a time and I have to wait more than an hour for the script to finish.
AFAIK LSF bsub doesn't have an option to suppress all this verbosity. What can I do here?
You can suppress this output by setting the environment variable BSUB_QUIET to any value (including empty) before calling bsub. So, before your loop, you could add:
export BSUB_QUIET=
Then, if you want to return to the normal output, you can clear the variable with:
unset BSUB_QUIET
Hope that helps you out.
Have you considered using job dependencies and post-processing the log files?
1) Run each "child" job without the interactive option ("-I"/"-Is") and write its output to a separate file. Submit each job with a job name (see -J); the job name can define a job array.
2) Make your final job dependent on the children finishing (see -w).
Besides running concurrently across the cluster, another advantage of this approach is that your overall process is not susceptible to I/O issues.
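A rough sketch of those two steps, under several assumptions: files.txt is a hypothetical list with one input file per line, grepjob and collect are made-up job names, and process_results.sh is a placeholder for the post-processing script. By default the snippet only prints the bsub commands (a dry run); setting BSUB=bsub would submit them for real.

```shell
# Dry run by default: the commands are echoed, not submitted.
BSUB=${BSUB:-echo bsub}

# Step 1: one array job with 800 elements. %I in the -o name is the array
# index; LSB_JOBINDEX is set by LSF inside each element, so the command is
# single-quoted to delay expansion until the element runs.
children=$($BSUB -J "grepjob[1-800]" -o "out.%I" \
    'grep _something_ "$(sed -n "${LSB_JOBINDEX}p" files.txt)"')

# Step 2: the final job starts only after every element of grepjob is done.
final=$($BSUB -J "collect" -w "done(grepjob)" -o collect.log \
    ./process_results.sh)

echo "$children"
echo "$final"
```

Each element writes its grep output to its own out.N file, so the final job can read those files instead of relying on interactive output.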
