How to get GNU Parallel to report every file processed? - gnu-parallel

I would like to keep track of GNU parallel in a simple log file and would like it to emit the name of each file as it starts or ends (either or both would be fine). It seems --verbose is too verbose for this.

If you make a profile that does the logging:
echo 'echo {} >> my.log;' > ~/.parallel/log
Then you can do this:
parallel -J log seq {} ::: 1 2 3
But since the profile uses {} you need to mention {} explicitly.
THIS DOES NOT WORK:
parallel -J log seq ::: 1 2 3
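If a profile feels too indirect, the same start/end logging can be written inline in the command. This is only a minimal sketch: it assumes a scratch directory, reuses the log name my.log from above, and uses seq as a stand-in for the real job.

```shell
# Work in a scratch directory so my.log starts out empty (assumption for this demo).
cd "$(mktemp -d)" || exit 1

# Each job appends a "start" line, runs the real command, then appends an "end" line.
parallel 'echo "start {}" >> my.log; seq {} > /dev/null; echo "end {}" >> my.log' ::: 1 2 3

cat my.log
```

Lines from different jobs may interleave, so the order of entries in my.log is not guaranteed.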

If you are not looking for --joblog then please explain how your needs differ.
--joblog is covered in 7.7 (p. 59) in GNU Parallel 2018 (paper copy: http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html or download it at: https://doi.org/10.5281/zenodo.1146014).
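For reference, a small sketch of what --joblog produces. The file name my.joblog and the scratch directory are assumptions made for the demo; seq again stands in for the real job.

```shell
# Scratch directory so the demo does not touch existing files.
cd "$(mktemp -d)" || exit 1

# --joblog writes a header line, then one tab-separated line per finished job.
parallel --joblog my.joblog 'seq {} > /dev/null' ::: 1 2 3

# Columns: Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
cat my.joblog
```

A job's line is written when the job finishes, so the job log records ends rather than starts.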

Related

Executing bash script on multiple lines inside multiple files in parallel using GNU parallel

I want to use GNU parallel for the following problem:
I have a few files each with several lines of text. I would like to understand how I can run a script (code.sh) on each line of text of each file and for each file in parallel. I should be able to write out the output of the operation on each input file to an output file with a different extension.
Seems this is a case of multiple parallel commands running parallel over all files and then running parallel for all lines inside each file.
This is what I used:
ls mydata_* |
parallel -j+0 'cat {} | parallel -I ./explore-bash.sh > {.}.out'
I do not know how to do this using GNU parallel. Please help.
Your solution seems reasonable. You just need to remove -I:
ls mydata_* | parallel -j+0 'cat {} | parallel ./explore-bash.sh > {.}.out'
Depending on your setup this may be faster, as it will only run n jobs at a time, whereas the solution above will run n*n jobs in parallel (n = number of cores):
ls mydata_* | parallel -j1 'cat {} | parallel ./explore-bash.sh > {.}.out'
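To try the nested form end to end, here is a self-contained sketch. The two sample files and the body of explore-bash.sh are made up for illustration; the real script would do the actual per-line work. The inner -k is an addition that keeps each output file in input line order.

```shell
# Scratch directory with two small input files (hypothetical sample data).
cd "$(mktemp -d)" || exit 1
printf 'a\nb\n' > mydata_1.txt
printf 'c\nd\n' > mydata_2.txt

# Stand-in for the real per-line script: it receives one line as its first argument.
cat > explore-bash.sh <<'EOF'
#!/bin/sh
echo "processed: $1"
EOF
chmod +x explore-bash.sh

# Outer parallel: one file at a time. Inner parallel: one job per line of that file.
# {.} is the file name without its extension, so mydata_1.txt yields mydata_1.out.
ls mydata_* | parallel -j1 'cat {} | parallel -k ./explore-bash.sh > {.}.out'

cat mydata_1.out mydata_2.out
```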

Replacement string not working in GNU parallel

I have the script run_md.py which produces the file test.dcd from the input file named test.pdb.
I want to execute the same command on multiple input files (test*.pdb) on a remote server using GNU parallel and transfer the result back to the local computer. Therefore, I'm using the following command:
parallel --trc {.}.dcd -j 2 -S $SERVER1 './run_md.py {} 1000' ::: test*.pdb
The command is running as expected on the server using 2 slots. However, the files are not transferred back and I get the following error:
rsync: link_stat "/home/bougui/{.}.dcd" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.1]
It looks like the replacement string is not working. How can I make it work?
Below is the output of parallel --version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: http://www.gnu.org/software/parallel
When using GNU Parallel for a publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
What you are doing is 100% correct. So something on your system is breaking this. Please try this on another system and if possible follow REPORTING BUGS from man parallel.
The bug reported in that thread has been fixed and this feature works well with the latest version of GNU parallel (20160622). The GNU parallel version 20130922 packaged with Debian 8.5 is buggy for the usage of {.} string replacement, as described below:
With more testing, I found that the output file must be specified with a replacement string in the command run in parallel.
For testing purpose, you can find below a complete example that others can run:
echo This is input_file > input_file && parallel --trc {}.out -S $SERVER1 cat {} ">"{}.out ::: input_file
The example above works well. When I use the replacement string {.} as below:
echo This is input_file > input_file.in && parallel --trc {.}.out -S $SERVER1 cat {} ">"{.}.out ::: input_file.in
It works as well. However, if I don't specify {.}.out in the command run in parallel, as below:
echo This is input_file > input_file.in && parallel --trc {.}.out -S $SERVER1 cat {} ">"input_file.out ::: input_file.in
... I reproduce the error:
rsync: link_stat "/home/bouvier/{.}.out" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.1]
rsync: [Receiver] write error: Broken pipe (32)
Therefore the output file must be specified in the command run in parallel.
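As a quick local sanity check, the expected expansion of {.} can be printed without any server. This sketch does not exercise --trc (which is where the bug above shows up); it only shows what {.}.dcd should turn into. The file names are placeholders and need not exist.

```shell
# {} is the full argument; {.} is the argument with its extension removed.
# -k keeps the output in the same order as the input.
expanded=$(parallel -k echo '{}' '{.}.dcd' ::: test1.pdb test2.pdb)
echo "$expanded"
```

If the second column shows test1.dcd and test2.dcd, the replacement string itself is being expanded in the command, and the problem is limited to the transfer options.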

run command taking two arguments with GNU parallel

I have a perl program that takes two arguments, dictionary file composed of
english words one per line, and file with concatenated words also one per
line, something like this:
lovetoplayguitar
...
...
So normally program is used like:
perl ./splitwords.pl words-en.txt bigfile.txt
It prints results to stdout.
I am trying to put it through GNU parallel like this:
time parallel -n 2 -j8 -k perl ./splitwords.pl {1} {2} ::: words-en.txt bigfile.txt > splitted.txt
but it doesn't work that way.. Tried many combinations so far but was unable
to run it using parallel.
EDIT
Actually this seems to be working, however it is using only one core..? Why..?
This will chop bigfile into 1 MB chunks:
cat bigfile.txt | parallel --pipe --cat -k perl ./splitwords.pl words-en.txt {}
If the Perl script only reads the file then this will be faster:
cat bigfile.txt | parallel --pipe --fifo -k perl ./splitwords.pl words-en.txt {}
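The original attempt used one core because ::: words-en.txt bigfile.txt with -n 2 produces exactly one job. With --pipe, each chunk of stdin becomes its own job. A small sketch to see the chunking, using generated numbers instead of bigfile.txt and wc -l instead of the Perl script; the 100k block size is chosen only to make several chunks from a small input.

```shell
cd "$(mktemp -d)" || exit 1

# 200000 numbered lines are roughly 1.3 MB, so 100k blocks give several chunks.
# --cat stores each chunk in a temporary file whose name replaces {}.
seq 200000 | parallel --pipe --block 100k --cat -k 'wc -l < {}' > chunk_sizes.txt

# One line per chunk (one job each); chunks are split at line boundaries.
chunks=$(wc -l < chunk_sizes.txt | tr -d ' ')
total=$(awk '{ s += $1 } END { print s }' chunk_sizes.txt)
echo "chunks: $chunks, total lines: $total"
```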

How can I stop gnu parallel jobs when any one of them terminates?

Suppose I am running N jobs with the following gnu parallel command:
seq $N | parallel -j 0 --progress ./job.sh
How can I invoke parallel to kill all running jobs and accept no more as soon as any one of them exits?
You can use --halt:
seq $N | parallel -j 0 --halt 2 './job.sh; exit 1'
A small problem with that solution is that you cannot tell if job.sh failed.
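You may also use killall perl. It is not a precise approach (GNU Parallel is a Perl program, so this also kills any other Perl processes you are running), but it is easy to remember.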
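Newer versions of GNU Parallel accept a more direct form, --halt now,done=1, which stops everything as soon as any one job finishes, whether it succeeded or failed. A sketch under the assumption that your version supports this syntax; sleep stands in for job.sh and finished.log is a made-up file name.

```shell
cd "$(mktemp -d)" || exit 1
start=$(date +%s)

# The 1-second job finishes first; parallel then kills the two 20-second jobs.
parallel -j0 --halt now,done=1 'sleep {}; echo {} >> finished.log' ::: 1 20 20 2>/dev/null

elapsed=$(( $(date +%s) - start ))
echo "parallel returned after ${elapsed}s"
cat finished.log
```

With this form there is no need to append exit 1 to the command, so the exit status of job.sh itself is not masked.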

bash gnu parallel argfile syntax

I just discovered GNU parallel and I'm having some trouble running a simple parallel task. I have a simulation running over multiple values and I'd like to split it up to run in parallel using command line args. From the docs, it seems you can run parallel mycommand :::: myargfile in which myargfile contains the various arguments you would like to feed your command, in parallel. However, I didn't see any information on how the args should be listed and assumed a myargfile like this would work:
--pmin 0 --pmax 0.1
--pmin 0.1 --pmax 0.2
...
mycommand --pmin 0 --pmax 0.1 executes no problem. But when I run parallel mycommand :::: myargfile I get error: unknown option pmin 0 --pmax 0.1 (caught and decoded courtesy boost program options). parallel echo :::: myargfile correctly prints out the arguments. It's as if they are being wrapped in a string which the program can't read and not fed like they are from a standard bash script.
What's going on? How can I make this work?
Following @DmitriChubarov's link to https://stackoverflow.com/a/6258206/1328439 , I discovered that I was lacking the --colsep flag:
parallel --colsep ' ' mycommand :::: myargfile
successfully executes.
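The difference is easy to see with a tiny helper that reports how many arguments it received. count_args.sh is a hypothetical stand-in for mycommand, and the argfile holds the two lines shown in the question.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' '--pmin 0 --pmax 0.1' '--pmin 0.1 --pmax 0.2' > myargfile

# Hypothetical stand-in for mycommand: prints its argument count.
cat > count_args.sh <<'EOF'
#!/bin/sh
echo "$# args"
EOF
chmod +x count_args.sh

# Without --colsep, each line is quoted and passed as ONE argument.
without=$(parallel -k ./count_args.sh :::: myargfile)
# With --colsep ' ', each line is split on spaces into separate arguments.
with=$(parallel -k --colsep ' ' ./count_args.sh :::: myargfile)

echo "without --colsep: $without"
echo "with --colsep:    $with"
```

This matches the error in the question: the whole line arrived as a single option string, which the option parser could not recognize.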
After digging through manual and help pages I came up with this example. Perhaps it will save someone out there. :)
#!/usr/bin/env bash
COMMANDS=(
"cnn -a mode=flat"
"cnn -a mode=xxx"
"cnn_x -a mode=extreme"
)
parallel --verbose --progress --colsep ' ' scrapy crawl {.} ::: "${COMMANDS[@]}"
