Limit total number of jobs within nested or concurrent GNU Parallel invocations

This is a continuation of this question and this question on nested GNU Parallel. Ultimately, what I want to achieve is to leave my Makefile untouched except for changing the SHELL= variable, and to distribute jobs using parallel across all my machines.
Is there a way to ensure that concurrent executions of GNU Parallel respect the --jobs clause specified in the outer invocation? Or is there some other way to impose a limit on the total number of jobs across parallel invocations? For example, I'd like the inner slot in the output below to always be 1; the slot 1-2 on the third line of the output violates that condition.
~• inner_par="parallel -I // --slotreplace '/%/' --seqreplace '/#/'"
~• cmd='echo id {#}-/#/, slot {%}-/%/, arg {}-//'
~• seq 2 | parallel -j 1 "seq {} | $inner_par $cmd"
id 1-1, slot 1-1, arg 1-1
id 2-1, slot 1-1, arg 2-1
id 2-2, slot 1-2, arg 2-2
~•

Are you looking for sem?
parallel -j 10 parallel -j 20 sem -j 30 --id myid mycmd
This will start 200 sems, but only run 30 mycmds in parallel.
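A minimal way to see the shared limit in action (a sketch; sleep stands in for mycmd and the --id name demo is arbitrary): the outer parallel may keep up to 4 sems going at once, but the shared semaphore lets only 2 of the sleeps run at any time.
seq 8 | parallel -j4 "sem -j2 --id demo 'sleep 2; echo {} done'"
# wait for any commands still queued or running under the semaphore
sem --id demo --wait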

Related

GNU Parallel: thread id

With GNU Parallel's -j option it is possible to specify the number of concurrent jobs. Is it possible to get an id of the thread running the job? By thread id I mean a number from 1 to 12 on my machine with 12 threads. As of now I use the following workaround:
doit() {
  # combine the outer loop index ($1) and the inner argument ($2) into a unique id
  let var=$1*12+$2
  echo $var $2
}
export -f doit
for ((i=0; i<2; ++i)); do
  parallel -j12 doit ::: $i ::: {1..12}
done
This has the problem that every iteration of the loop waits for all 12 threads to finish. I am only interested in not running iterations with the same thread id concurrently.
My motivation is that every thread takes a write lock on one of 12 files. I have exactly 12 files, and as soon as a thread working on one file finishes, the next thread could immediately use that file again.
As @MarkSetchell writes, you should use the replacement string {%}, which gives the job slot number:
parallel --line-buffer -j12 'echo starting job {#} on {%}; sleep {=$_=rand()*30=}; echo finishing job {#} on {%}' ::: {1..50}
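Applied to the write-lock use case, a minimal sketch along these lines should work (the function name doit and the file_N.txt naming are made up for illustration): pass {%} into the job so that each job slot only ever touches its own file.
doit() {
  slot=$1   # job slot number, 1..12
  arg=$2
  # no two concurrent jobs share a slot, so no two concurrent jobs write to the same file
  echo "processed $arg" >> "file_${slot}.txt"
}
export -f doit
parallel -j12 doit {%} {} ::: {1..50}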

How to get GNU Parallel to report every file processed?

I would like to keep track of GNU parallel in a simple log file and would like it to emit the name of each file as it starts/ends (either or both are equally fine). It seems --verbose is too verbose for this.
If you make a profile that does the logging:
echo 'echo {} >> my.log;' > ~/.parallel/log
Then you can do this:
parallel -J log seq {} ::: 1 2 3
But since the profile uses {} you need to mention {} explicitly.
THIS DOES NOT WORK:
parallel -J log seq ::: 1 2 3
If you are not looking for --joblog then please explain how your needs differ.
--joblog is covered in 7.7 (p. 59) in GNU Parallel 2018 (paper copy: http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html or download it at: https://doi.org/10.5281/zenodo.1146014).
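For reference, a bare --joblog run records one line per finished job (sequence number, host, start time, runtime, bytes sent/received, exit value, signal, and the command), along these lines:
parallel --joblog my.log seq ::: 1 2 3
# my.log now contains a header plus one tab-separated line per job
cat my.log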

Snakemake memory limiting

In Snakemake, I have 5 rules. For each one I set the memory limit with the resources mem_mb option.
It looks like this:
rule assembly:
    input:
        file1 = os.path.join(MAIN_DIR, "1.txt"),
        file2 = os.path.join(MAIN_DIR, "2.txt"),
        file3 = os.path.join(MAIN_DIR, "3.txt")
    output:
        foldr = dir,
        file4 = os.path.join(dir, "A.png"),
        file5 = os.path.join(dir, "A.tsv")
    resources:
        mem_mb=100000
    shell:
        "pythonscript.py -i {input.file1} -v {input.file2} -q {input.file3} --cores 5 -o {output.foldr}"
I want to limit the memory usage of the whole Snakefile by doing something like:
snakemake --snakefile mysnakefile_snakefile --resources mem_mb=100000
so that the jobs do not each use up to 100 GB (with 5 rules that would mean a 500 GB allocation), but rather all running jobs together stay within a maximum of 100 GB (5 jobs, 100 GB total).
The command-line argument sets the total limit. The Snakemake scheduler will ensure that, for the set of running jobs, the sum of their mem_mb resources does not exceed that total limit.
I think this is exactly what you want, isn't it? You just need to set the expected per-job memory in the rule itself. Note that Snakemake does not measure this for you; you have to define the value yourself in the rule. E.g., if you expect your job to use 100 MB of memory, put mem_mb=100 into that rule.
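To make that concrete with made-up numbers: if each of the 5 rules declared, say, mem_mb=40000, then running
snakemake --snakefile mysnakefile_snakefile --cores 16 --resources mem_mb=100000
would let the scheduler run at most two such jobs at a time, since admitting a third would push the sum of the declared mem_mb values over the 100000 limit (--cores 16 is just a placeholder; newer Snakemake versions require some --cores value).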

run command taking two arguments with GNU parallel

I have a Perl program that takes two arguments: a dictionary file composed of English words, one per line, and a file with concatenated words, also one per line, something like this:
lovetoplayguitar
...
...
So normally the program is used like this:
perl ./splitwords.pl words-en.txt bigfile.txt
It prints results to stdout.
I am trying to put it through GNU parallel like this:
time parallel -n 2 -j8 -k perl ./splitwords.pl {1} {2} ::: words-en.txt bigfile.txt > splitted.txt
but it doesn't work that way. I have tried many combinations so far but have been unable to run it using parallel.
EDIT
Actually this seems to be working; however, it is using only one core. Why?
This will chop bigfile into 1 MB chunks:
cat bigfile.txt | parallel --pipe --cat -k perl ./splitwords.pl words-en.txt {}
If the Perl script only reads the file, then this will be faster:
cat bigfile.txt | parallel --pipe --fifo -k perl ./splitwords.pl words-en.txt {}
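If the default 1 MB chunks turn out to be too small for this workload, the chunk size can be raised with --block; the 10M value below is only an illustration:
cat bigfile.txt | parallel --pipe --block 10M --cat -k perl ./splitwords.pl words-en.txt {}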

How can I stop gnu parallel jobs when any one of them terminates?

Suppose I am running N jobs with the following GNU parallel command:
seq $N | parallel -j 0 --progress ./job.sh
How can I invoke parallel to kill all running jobs and accept no more as soon as any one of them exits?
You can use --halt:
seq $N | parallel -j 0 --halt 2 './job.sh; exit 1'
A small problem with that solution is that you cannot tell if job.sh failed.
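In newer versions of GNU Parallel the same effect can be had without wrapping the command (and without throwing away its exit status) by using the keyword form of --halt; this is a sketch assuming a version that supports now,done=1:
seq $N | parallel -j 0 --progress --halt now,done=1 ./job.sh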
You may also use killall perl. It is not an accurate way, but it is easy to remember.
