Use all cores to make OpenCV 3 [duplicate] - opencv

Quick question: what is the compiler flag to allow g++ to spawn multiple instances of itself in order to compile large projects quicker (for example 4 source files at a time for a multi-core CPU)?

You can do this with make - with gnu make it is the -j flag (this will also help on a uniprocessor machine).
For example if you want 4 parallel jobs from make:
make -j 4
You can also run gcc in a pipe with
gcc -pipe
This will pipeline the compile stages, which will also help keep the cores busy.
If you have additional machines available too, you might check out distcc, which will farm compiles out to those as well.
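As a concrete sketch for the OpenCV case in the question title (assuming a CMake-generated Makefile build, which is how OpenCV is normally built; paths are placeholders):
mkdir -p build && cd build
cmake ..                  # configure as usual
make -j"$(nproc)"         # spawn one compile job per available core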

There is no such flag, and having one runs against the Unix philosophy of having each tool perform just one function and perform it well. Spawning compiler processes is conceptually the job of the build system. What you are probably looking for is the -j (jobs) flag to GNU make, a la
make -j4
Or you can use pmake or similar parallel make systems.

People have mentioned make but bjam also supports a similar concept. Using bjam -jx instructs bjam to build up to x concurrent commands.
We use the same build scripts on Windows and Linux and using this option halves our build times on both platforms. Nice.

If using make, invoke it with -j. From man make:
-j [jobs], --jobs[=jobs]
Specifies the number of jobs (commands) to run simultaneously.
If there is more than one -j option, the last one is effective.
If the -j option is given without an argument, make will not limit the
number of jobs that can run simultaneously.
And most notably, if you want to script or detect the number of cores you have available (this can vary a lot depending on your environment, especially if you run in many environments), you may use the ubiquitous Python function cpu_count():
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.cpu_count
Like this:
make -j $(python3 -c 'import multiprocessing as mp; print(int(mp.cpu_count() * 1.5))')
If you're asking why 1.5 I'll quote user artless-noise in a comment above:
The 1.5 number is because of the noted I/O bound problem. It is a rule of thumb. About 1/3 of the jobs will be waiting for I/O, so the remaining jobs will be using the available cores. A number greater than the cores is better and you could even go as high as 2x.

make will do this for you. Investigate the -j and -l switches in the man page. I don't think g++ is parallelizable.
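For example, a hedged combination of the two switches (assuming nproc from GNU coreutils is available) would be:
make -j"$(nproc)" -l"$(nproc)"   # cap jobs at the core count and hold back new jobs once the load average reaches it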

distcc can also be used to distribute compiles not only on the current machine, but also on other machines in a farm that have distcc installed.
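A minimal sketch of that setup, assuming distcc is installed on every machine and the host names below are placeholders:
export DISTCC_HOSTS="localhost buildbox1 buildbox2"   # hypothetical helper machines
make -j12 CC="distcc gcc" CXX="distcc g++"            # make fans the compile jobs out through distcc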

I'm not sure about g++, but if you're using GNU Make then "make -j N" (where N is the number of threads make can create) will allow make to run multiple g++ jobs at the same time (so long as the files do not depend on each other).

GNU parallel
I was making a synthetic compilation benchmark and couldn't be bothered to write a Makefile, so I used:
sudo apt-get install parallel
ls | grep -E '\.c$' | parallel -t --will-cite "gcc -c -o '{.}.o' '{}'"
Explanation:
{.} takes the input argument and removes its extension
-t prints out the commands being run to give us an idea of progress
--will-cite removes the request to cite the software if you publish results using it...
parallel is so convenient that I could even do a timestamp check myself:
ls | grep -E '\.c$' | parallel -t --will-cite "\
if ! [ -f '{.}.o' ] || [ '{}' -nt '{.}.o' ]; then
gcc -c -o '{.}.o' '{}'
fi
"
xargs -P can also run jobs in parallel, but it is a bit less convenient to do the extension manipulation or run multiple commands with it: Calling multiple commands through xargs
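For comparison, a rough xargs -P equivalent (a sketch only; the small sh -c wrapper is what reproduces parallel's {.} extension stripping):
ls | grep -E '\.c$' | xargs -P "$(nproc)" -I {} sh -c 'gcc -c -o "${1%.c}.o" "$1"' _ {}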
Parallel linking was asked at: Can gcc use multiple cores when linking?
TODO: I think I read somewhere that compilation can be reduced to matrix multiplication, so maybe it is also possible to speed up single file compilation for large files. But I can't find a reference now.
Tested in Ubuntu 18.10.

Related

Optimizing OpenGrok on a large code base

I have a server instance with 4 cores and 32 GB RAM, running Ubuntu 20.04.3 LTS. On this machine an OpenGrok instance runs as a Docker container.
Inside the Docker container it uses AdoptOpenJDK:
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
Eclipse OpenJ9 VM AdoptOpenJDK-11.0.11+9 (build openj9-0.26.0, JRE 11 Linux amd64-64-Bit Compressed References 20210421_975 (JIT enabled, AOT enabled)
OpenJ9 - b4cc246d9
OMR - 162e6f729
JCL - 7796c80419 based on jdk-11.0.11+9)
The code base that the opengrok-indexer scans is 320 GB, and indexing takes 21 hours.
What I figured out is that if I disable the history option it takes less time. Is there a way to reduce this time while keeping the history flag set?
Here is my index command:
opengrok-indexer -J=-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -J=-Djava.util.logging.config.file=/usr/share/tomcat10/conf/logging.properties -J=-XX:-UseGCOverheadLimit -J=-Xmx30G -J=-Xms30G -J=-server -a /var/opengrok/dist/lib/opengrok.jar -- -R /var/opengrok/etc/read-only.xml -m 256 -c /usr/bin/ctags -s /var/opengrok/src/ -d /var/opengrok/data --remote on -H -P -S -G -W /var/opengrok/etc/configuration.xml --progress -v -O on -T 3 --assignTags --search --remote on -i *.so -i *.o -i *.a -i *.class -i *.jar -i *.apk -i *.tar -i *.bz2 -i *.gz -i *.obj -i *.zip"
Thank you for your help in advance.
Kind Regards
Siegfried
You should try to increase the number of threads using the following options:
--historyThreads number
The number of threads to use for history cache generation on repository level. By default the number of threads will be set to the number of available CPUs.
Assumes -H/--history.
--historyFileThreads number
The number of threads to use for history cache generation when dealing with individual files.
By default the number of threads will be set to the number of available CPUs.
Assumes -H/--history.
-T, --threads number
The number of threads to use for index generation, repository scan
and repository invalidation.
By default the number of threads will be set to the number of available
CPUs. This influences the number of spawned ctags processes as well.
Take a look at the "renamedHistory" option too. Theoretically "off" is the default, but this has a huge impact on the index time, so it's worth checking:
--renamedHistory on|off
Enable or disable generating history for renamed files.
If set to on, makes history indexing slower for repositories
with lots of renamed files. Default is off.
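For illustration, the flags to add or adjust in the existing index command (after the --) might look like this; the value 4 is only a placeholder matching the 4-core machine, and the rest of the original command stays as it is:
-H --historyThreads 4 --historyFileThreads 4 -T 4 --renamedHistory off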

PBS array job parallelization

I am trying to submit a job on a high-performance compute cluster that needs to run a Python code, let's say, 10000 times. I used GNU parallel, but then the IT team sent me a mail stating that my job is creating too many SSH login logs in their monitoring system. They asked me to use job arrays instead. My code takes about 12 seconds to run. I believe I need to use the #PBS -J statement in my PBS script, but I am not sure whether it will run in parallel. I need to execute my code on, let's say, 10 nodes with 16 cores each, i.e. 160 instances of my code running in parallel. How can I parallelize it, i.e. run many instances of my code at a given time, utilizing all the resources I have?
Below is the initial pbs script with gnu parallel:
#!/bin/bash
#PBS -P My_project
#PBS -N my_job
#PBS -l select=10:ncpus=16:mem=4GB
#PBS -l walltime=01:30:00
module load anaconda
module load parallel
cd $PBS_O_WORKDIR
JOBSPERNODE=16
parallel --joblog jobs.log --wd $PBS_O_WORKDIR -j $JOBSPERNODE --sshloginfile $PBS_NODEFILE --env PATH "python $PBS_O_WORKDIR/xyz.py" :::: inputs.txt
inputs.txt is a file with integer values 0-9999 on each line, which is fed to my Python code as an argument. The code is highly independent and the output of one instance does not affect another.
A little late, but thought I'd answer anyway.
Arrays will run in parallel, but the number of jobs running at once will depend on the availability of nodes and the limit of jobs per user per queue. Essentially, each HPC will be slightly different.
Adding #PBS -J 1-10000 will create an array of 10000 jobs, and assuming the syntax is the same as the HPC I use, something like ID=$(sed -n "${PBS_ARRAY_INDEX}p" /path/to/inputs.txt) will then be the integers from inputs.txt whereby PBS array number 123 will return the 123rd line of inputs.txt.
Alternatively, since you're on an HPC, if the jobs are only taking 12 seconds each, and you have 10000 iterations, then a for loop will also complete the entire process in 33.33 hours.
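A minimal sketch of such an array script, assuming the same modules and file layout as the original script (the per-sub-job resource request and walltime are guesses to adjust for your site):
#!/bin/bash
#PBS -P My_project
#PBS -N my_job_array
#PBS -J 1-10000
#PBS -l select=1:ncpus=1:mem=4GB
#PBS -l walltime=00:05:00
module load anaconda
cd $PBS_O_WORKDIR
# Each sub-job pulls its own line from inputs.txt, as described above.
ID=$(sed -n "${PBS_ARRAY_INDEX}p" inputs.txt)
python "$PBS_O_WORKDIR/xyz.py" "$ID"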

Compare perf-stat results to that of likwid-perfctr results

I want to compare the output of perf-stat with that of likwid-perfctr. Is there a way to do that? I tried running two commands, one for perf-stat and the other for likwid-perfctr.
The commands are:
sudo perf stat -C 2 -e instructions, BR_INST_RETIRED.ALL_BRANCHES,branches,rc004,INST_RETIRED.ANY ./loop
sudo likwid-perfctr -C 2 -g MYLIST1 -f ./loop
The first command uses perf-stat, which (among other things) captures branch and instruction counts, some of them redundantly. The second command uses likwid-perfctr, which captures similar data. Just to mention, I wrote my own group called MYLIST1 for likwid-perfctr.
But when I compare the two results, they turn out to be quite different.
Output Comparison
Looking at the output, INSTR_RETIRED_ANY is 15552 in perf stat versus 190594 in likwid-perfctr, and branches are 3168 vs 42744.
I'm not sure what I'm doing wrong, or whether there is a way to do this properly.

Using parallel on a slurm cluster

I am looking for a command to parallelize the following command with Gnu Parallel:
OpenSees 1.tcl
OpenSees is an executable, OpenSees.exe on Windows and OpenSees on Linux. I want to do parallel processing for a parameter study. OpenSees is a seismic analysis tool, and 1.tcl is an input file for it.
Please bear in mind that the input files go from 1.tcl to 360.tcl, and I would like to define the number of processors (i.e., how many parallel executions run side by side). There are parallel versions of OpenSees with MPI, but I am asking about the sequential version.
This is the Slurm shell script I used, but it only worked on one machine; I could not add more than one machine, so I built more shell scripts for the numbers above 28 up to 360. These are the relevant parts of the script:
#!/bin/bash
#SBATCH -n 28 # total number of cores
#SBATCH -N 1 # machine number
parallel --bar ./OpenSees {}.tcl ::: {1..28}

GNU parallel: deleting line from joblog breaks parallel updating it

If you run GNU parallel with --joblog path/to/logfile and then delete a line from said logfile while parallel is running, GNU parallel is no longer able to append future completed jobs to it.
Execute this MWE:
#!/usr/bin/bash
parallel -j1 -n0 --joblog log sleep 1 ::: $(seq 10) &
sleep 5 && sed -i '$ d' log
If you tail -f log prior to execution, you can see that parallel keeps writing to this file. However, if you cat log after 10 seconds, you will see that nothing was written to the actual file now on disk after the third entry or so.
What's the reason behind this? Is there a way to delete something from the file and have GNU parallel be able to still write to it?
Some background as to why this happened:
Using GNU parallel, I started a few jobs on remote machines with --sshloginfile. I then needed to pkill a few jobs on one of the machines because a colleague needed to use it (and I subsequently removed the machine from the sshloginfile so that parallel wouldn't reuse it for new runs). If you pkill those processes started on the remote machine, they get an Exitval of 0 (it looks like they finished without issues; you can't tell that they were killed). I wanted to remove them immediately from the joblog so that when I restart parallel --resume later, parallel can have a look at the joblog and determine what's missing.
Turns out, this was a bad idea, as now my joblog is useless.
While @MarkSetchell is absolutely right in his comment, the root problem here is that man sed is lying:
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if SUFFIX supplied)
sed -i does not edit files in place.
What it does is make a temporary file in the same directory, copy the input file to the temporary file while doing the editing, and finally rename the temporary file to the input file's name. Similar to this:
sed '$ d' log > sedXxO11P
mv sedXxO11P log
It is clear that the original log and sedXxO11P have different inodes - let us call them ino1 and ino2. GNU Parallel has ino1 open and really does not know about the existence of ino2. GNU Parallel will happily append to ino1 completely unaware that when it closes the file, the file will vanish because it has already been unlinked.
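You can see the inode swap for yourself, for example with GNU stat:
stat -c %i log    # note the inode number
sed -i '$ d' log
stat -c %i log    # a different number: sed replaced the file rather than editing it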
So you need to change the content of the file without changing the inode:
#!/usr/bin/bash
seq 10 | parallel -j1 -n0 --joblog log sleep 1 &
sleep 5
# Obvious race condition here:
# Anything appended to log before sed is done is lost.
# This can be avoided by suspending parallel while running this
tmp=$RANDOM$$
cp log $tmp
(rm $tmp; sed '$ d' >log) < $tmp
wait
cat log
This works right now. But do not expect this to be a supported feature - ever.
