bash GNU parallel argfile syntax

I just discovered GNU parallel and I'm having some trouble running a simple parallel task. I have a simulation running over multiple values and I'd like to split it up to run in parallel using command line args. From the docs, it seems you can run parallel mycommand :::: myargfile, where myargfile contains the various arguments you would like to feed your command, in parallel. However, I didn't see any information on how the args should be listed and assumed a myargfile like this would work:
--pmin 0 --pmax 0.1
--pmin 0.1 --pmax 0.2
...
mycommand --pmin 0 --pmax 0.1 executes with no problem. But when I run parallel mycommand :::: myargfile I get error: unknown option pmin 0 --pmax 0.1 (caught and decoded courtesy of Boost program options). parallel echo :::: myargfile correctly prints out the arguments. It's as if each line is being passed as a single string which the program can't parse, rather than as separate arguments the way a standard bash invocation would pass them.
What's going on? How can I make this work?

Following @DmitriChubarov's link to https://stackoverflow.com/a/6258206/1328439, I discovered that I was lacking the --colsep flag:
parallel --colsep ' ' mycommand :::: myargfile
successfully executes.
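For illustration, here is a minimal, self-contained sketch of the difference --colsep makes; the mycommand below is a hypothetical stand-in that just prints the argument vector it receives:
cat > mycommand <<'EOF'
#!/usr/bin/env bash
# stand-in for the real simulation: report how the arguments arrived
printf 'got %d args:' "$#"
printf ' [%s]' "$@"
echo
EOF
chmod +x mycommand
printf '%s\n' '--pmin 0 --pmax 0.1' '--pmin 0.1 --pmax 0.2' > myargfile
# without --colsep, each line of myargfile is passed as ONE argument
parallel ./mycommand :::: myargfile
# with --colsep ' ', every space-separated field becomes its own argument
parallel --colsep ' ' ./mycommand :::: myargfile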

After digging through the manual and help pages I came up with this example. Perhaps it will save someone out there some time. :)
#!/usr/bin/env bash
COMMANDS=(
  "cnn -a mode=flat"
  "cnn -a mode=xxx"
  "cnn_x -a mode=extreme"
)
parallel --verbose --progress --colsep ' ' scrapy crawl {.} ::: "${COMMANDS[@]}"

Related

Executing bash script on multiple lines inside multiple files in parallel using GNU parallel

I want to use GNU parallel for the following problem:
I have a few files, each with several lines of text. I would like to understand how I can run a script (code.sh) on each line of each file, processing the files themselves in parallel as well. The output for each input file should be written to an output file with a different extension.
This seems to be a case of running parallel over all files and, for each file, running parallel again over all of its lines.
This is what I used:
ls mydata_* |
parallel -j+0 'cat {} | parallel -I ./explore-bash.sh > {.}.out'
I do not know how to do this using GNU parallel. Please help.
Your solution seems reasonable. You just need to remove -I:
ls mydata_* | parallel -j+0 'cat {} | parallel ./explore-bash.sh > {.}.out'
Depending on your setup this may be faster, as it will only run n jobs, whereas the solution above will run n*n jobs in parallel (n = number of cores):
ls mydata_* | parallel -j1 'cat {} | parallel ./explore-bash.sh > {.}.out'
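For a concrete picture, here is a hedged sketch with fabricated data and a stand-in explore-bash.sh that just tags each line it receives (the file names and contents are assumptions, not from the question):
printf 'a\nb\nc\n' > mydata_1
printf 'd\ne\nf\n' > mydata_2
cat > explore-bash.sh <<'EOF'
#!/usr/bin/env bash
# stand-in for code.sh: echo the single line passed as an argument
echo "processed: $1"
EOF
chmod +x explore-bash.sh
ls mydata_* | parallel -j1 'cat {} | parallel ./explore-bash.sh > {.}.out'
cat mydata_1.out mydata_2.out   # one "processed: ..." line per input line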

Curl a URL list from a file and make it faster with parallel

Right now I'm using the following code:
while read num; do
    M=$(curl "myurl/$num")
    echo "$M"
done < s.txt
where s.txt contains a list (1 per line) of a part of the url.
Is it correct to assume that curl runs sequentially here?
Or does it run in threads/jobs/multiple connections at a time?
I've found this online:
parallel -k curl -s "http://example.com/locations/city?limit=100\&offset={}" ::: $(seq 100 100 30000) > out.txt
The problem is that my sequence comes from a file or from a variable (one element per line) and I can't adapt it to my needs.
I haven't fully understood how to pass the list to parallel.
Should I save all the curl commands in a file and run them with parallel -a?
parallel -j100 -k curl myurl/{} < s.txt
Consider spending an hour walking through man parallel_tutorial. Your command line will love you for it.
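As a hedged sketch of the full pipeline (the base URL is a placeholder; swap in your own myurl), keeping curl quiet with -s and preserving input order with -k:
printf '%s\n' 1 2 3 > s.txt   # one URL suffix per line (placeholder values)
# -j100: up to 100 concurrent fetches; {} is replaced by one line of s.txt per job
parallel -j100 -k 'curl -s "https://example.com/locations/{}"' < s.txt > out.txt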

Replacement string not working in GNU parallel

I have the script run_md.py which produces the file test.dcd from the input file named test.pdb.
I want to execute the same command on multiple input files (test*.pdb) on a remote server using GNU parallel and transfer the result back to the local computer. Therefore, I'm using the following command:
parallel --trc {.}.dcd -j 2 -S $SERVER1 './run_md.py {} 1000' ::: test*.pdb
The command is running as expected on the server using 2 slots. However, the files are not transferred back and I get the following error:
rsync: link_stat "/home/bougui/{.}.dcd" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.1]
It looks like the replacement string is not working. How can I make it work?
Below is the output of parallel --version:
GNU parallel 20130922
Copyright (C) 2007,2008,2009,2010,2011,2012,2013 Ole Tange and Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
GNU parallel comes with no warranty.
Web site: http://www.gnu.org/software/parallel
When using GNU Parallel for a publication please cite:
O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
;login: The USENIX Magazine, February 2011:42-47.
What you are doing is 100% correct. So something on your system is breaking this. Please try this on another system and if possible follow REPORTING BUGS from man parallel.
The bug reported in that thread has been fixed, and this feature works well with the latest version of GNU parallel (20160622). The GNU parallel version 20130922 packaged with Debian 8.5 is buggy for the usage of the {.} replacement string, as described below:
With more testing I found that the output file must be specified with a replacement string in the command run in parallel.
For testing purpose, you can find below a complete example that others can run:
echo This is input_file > input_file && parallel --trc {}.out -S $SERVER1 cat {} ">"{}.out ::: input_file
The example above works well. When I use the replacement string {.} as below:
echo This is input_file > input_file.in && parallel --trc {.}.out -S $SERVER1 cat {} ">"{.}.out ::: input_file
It works as well. However, if I don't specify {.}.out in the command run in parallel, as below:
echo This is input_file > input_file.in && parallel --trc {.}.out -S $SERVER1 cat {} ">"input_file.out ::: input_file
... I reproduce the error:
rsync: link_stat "/home/bouvier/{.}.out" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [Receiver=3.1.1]
rsync: [Receiver] write error: Broken pipe (32)
Therefore the output file must be specified in the command run in parallel.

run command taking two arguments with GNU parallel

I have a Perl program that takes two arguments: a dictionary file composed of English words, one per line, and a file with concatenated words, also one per line, something like this:
lovetoplayguitar
...
...
So normally the program is used like this:
perl ./splitwords.pl words-en.txt bigfile.txt
It prints results to stdout.
I am trying to put it through GNU parallel like this:
time parallel -n 2 -j8 -k perl ./splitwords.pl {1} {2} ::: words-en.txt bigfile.txt > splitted.txt
but it doesn't work that way. I've tried many combinations so far but have been unable to run it using parallel.
EDIT
Actually this seems to be working; however, it is using only one core. Why?
This will chop bigfile into 1 MB chunks:
cat bigfile.txt | parallel --pipe --cat -k perl ./splitwords.pl words-en.txt {}
If the Perl script only reads the file, then this will be faster:
cat bigfile.txt | parallel --pipe --fifo -k perl ./splitwords.pl words-en.txt {}
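To compare the two variants in isolation, here is a minimal hedged sketch that swaps the Perl script for wc -l and fabricates bigfile.txt, so it can be run as-is:
printf 'lovetoplayguitar\n%.0s' {1..200000} > bigfile.txt   # a few MB of fake input
# --cat: each ~1 MB chunk is written to a temp file whose name replaces {}
cat bigfile.txt | parallel --pipe --cat -k wc -l {}
# --fifo: {} becomes a named pipe instead, skipping the temp-file copy;
# this is only safe when the command reads its input exactly once
cat bigfile.txt | parallel --pipe --fifo -k wc -l {}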

GNU Parallel: suppress warning when input is read from terminal

When input is read from terminal, GNU Parallel always displays a warning:
parallel: Warning: Input is read from the terminal. Only experts do this on purpose. Press CTRL-D to exit.
But sometimes I do want to read from terminal (e.g., when I'm copy & pasting stuff from elsewhere entry by entry). Is it possible to turn off this warning? I couldn't find such an option in man parallel or man parallel_tutorial.
Note that I don't want a cheap solution like 2>/dev/null, since warning messages from other programs will be turned off, too. For instance, consider the following simple script:
#!/bin/bash
function print12 () {
echo "printing $1 to stdout"
echo "printing $1 to stderr" >/dev/stderr
}
export -f print12
SHELL=/bin/bash parallel -k print12 2>/dev/null
Messages printed to stderr will all be suppressed.
Just realized that I can do a cat or some read </dev/tty to achieve my desired effect. But let's just focus on the original question.
It cannot be turned off. But see it as praise: since you are doing it on purpose, you are an expert (at least in the eyes of GNU Parallel).
As it is just a warning, you are free to paste your arguments and have them run: The warning does not stop GNU Parallel from reading your input.
If you really do not like the warning:
cat | parallel ...
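For example, a concrete hedged variant of that workaround:
# stdin is now a pipe rather than a terminal, so the warning is not printed;
# type or paste one entry per line, then press CTRL-D to finish
cat | parallel -k 'echo got: {}'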
