Executing bash script on multiple lines inside multiple files in parallel using GNU parallel

I want to use GNU parallel for the following problem:
I have a few files, each with several lines of text. I would like to understand how I can run a script (code.sh) on each line of text of each file, and on each file in parallel. The output of the operation on each input file should be written to an output file with a different extension.
This seems to be a case of running parallel over all files and, inside that, running parallel over all lines of each file.
This is what I used:
ls mydata_* |
parallel -j+0 'cat {} | parallel -I ./explore-bash.sh > {.}.out'
I do not know how to do this using GNU parallel. Please help.

Your solution seems reasonable. You just need to remove -I:
ls mydata_* | parallel -j+0 'cat {} | parallel ./explore-bash.sh > {.}.out'
Depending on your setup this may be faster, as it will only run n jobs, whereas the solution above will run n*n jobs in parallel (n = number of cores):
ls mydata_* | parallel -j1 'cat {} | parallel ./explore-bash.sh > {.}.out'
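For reference, the inner parallel reads the piped lines and appends each line as an argument to the script. The real explore-bash.sh is not shown in the question; a hypothetical stand-in that receives the line as $1 would look something like:
#!/usr/bin/env bash
# Hypothetical stand-in for explore-bash.sh: the inner parallel appends
# one line of the input file as the first argument.
line="$1"
# ... do the real per-line work here; this sketch just echoes it back.
echo "processed: ${line}"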

Related

klocwork in parallel: how append kwtables into existing build or how kwadmin publish results from mutiple kwtable folders

Our project is really huge, and combining all targets into a single compilation + analysis + publish takes too long to finish. So I'd like to run the Klocwork analysis in parallel.
Here is what I have right now (the targets are split into various sub-tasks):
kw-analysis -> kw-analysis-sub-1
            -> kw-analysis-sub-2
            -> ...
            -> kw-analysis-sub-n

Each sub-task handles:
1. Compile a single target and generate the build spec kwinject_<target_name>.out:
   $ export KWWRAP_HOOKS_DIR='/temp/kw/hooks'
   $ export PATH=${KWWRAP_HOOKS_DIR}:$PATH
   $ make <target_name>
   $ kwinject --trace-in "/temp/kw/kwwrap.trace" --output "kwinject_<target_name>.out"
2. Trace and analyze each target:
   $ kwbuildproject --url "<https://url:port>/<project_name>" [-I] --table-directory kwtable_<target_name> kwinject_<target_name>.out
3. Archive the kwtable_<target_name> folder.

The leading job then:
1. Copies all kwtable_<target_name> folders from the sub-analysis jobs (downstream jobs).
2. Deploys and publishes the results to the Klocwork server once for all targets.
   <<<<<< this is the key point of parallel analysis
As far as I know, publishing a single kwtable can be done with:
$ kwadmin --url <https://url:port> load --name <build_name> <project_name> kwtable_<target_name>
However, kwadmin seems to support neither loading multiple kwtables at once:
kwadmin load --name <build_name> ... kwtable_<target_name_1> kwtable_<target_name_2> ...
nor appending additional results to an existing build:
$ kwadmin load --name <build_name> ... kwtable_<target_name_1>      <- create the build first
$ kwadmin "append" --name <build_name> ... kwtable_<target_name_2>  <- append the results from another kwtable folder to <build_name>
So, is there any way I can run the Klocwork analysis in parallel? BTW, I'm using Jenkins as the integration tool.
Breaking the project into multiple pieces and running the analyses in parallel may sometimes take more time than building it as a single project with Klocwork. The reason is that Klocwork will analyze all the dependent files multiple times when you perform multiple builds/analyses that are actually parts of a single project. (Parallel analysis is a benefit when the modules/files/pieces you build in parallel have no dependencies on each other.)
Klocwork can perform incremental/delta analysis when you pass the --incremental argument as part of the kwbuildproject command. This should save build time.
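Based only on the commands already quoted in this thread, keeping one build and re-running the analysis incrementally would look roughly like this (unverified sketch; the placeholders come from the question, and the exact behaviour depends on your Klocwork version):
$ kwbuildproject --url "<https://url:port>/<project_name>" --incremental --table-directory kwtable kwinject.out
$ kwadmin --url <https://url:port> load --name <build_name> <project_name> kwtable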

Delete lines of many files using grep and GNU parallel

I have a directory with many files that all end in "_all.txt". I want to delete all lines containing either a "*" or a "-" from each of these files and send the cleaned output to files ending in "_all_cleaned.txt".
Right now I am using a for loop as follows:
for file in *_all.txt; do
    filename=$(echo $file | cut -d '_' -f 1)
    grep -vwE "(*|-)" "${file}" > "${filename}_all_cleaned.txt"
done
I would like to be able to do this in parallel using GNU parallel, so that the command is executed on each file on a different compute node instead of one node processing all the files sequentially.
How can I incorporate GNU parallel into this?
If the files are in the login dir on the servers (i.e. the dir you get by ssh server1 pwd):
parallel -Sserver1,server2 'grep -vwE "(*|-)" {} > {=s/.txt$/_cleaned.txt=}' ::: *.txt
If it is the same dir relative to $HOME (e.g. /home/me/my/dir):
parallel --wd . -Sserver1,server2 'grep -vwE "(*|-)" {} > {=s/.txt$/_cleaned.txt=}' ::: *.txt
If it is /different/dir:
parallel --wd /different/dir -Sserver1,server2 'grep -vwE "(*|-)" {} > {=s/.txt$/_cleaned.txt=}' ::: *.txt
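As an aside, the {=s/.txt$/_cleaned.txt=} part is a Perl substitution applied to each input filename, so *_all.txt becomes *_all_cleaned.txt. If you only want local parallelism first (no remote servers), the same command should work without -S; a minimal, untested sketch:
parallel 'grep -vwE "(*|-)" {} > {=s/.txt$/_cleaned.txt=}' ::: *_all.txt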

run command taking two arguments with GNU parallel

I have a Perl program that takes two arguments: a dictionary file composed of
English words, one per line, and a file with concatenated words, also one per
line, something like this:
lovetoplayguitar
...
...
So normally the program is used like:
perl ./splitwords.pl words-en.txt bigfile.txt
It prints results to stdout.
I am trying to put it through GNU parallel like this:
time parallel -n 2 -j8 -k perl ./splitwords.pl {1} {2} ::: words-en.txt bigfile.txt > splitted.txt
but it doesn't work that way. I've tried many combinations so far but was unable
to run it using parallel.
EDIT
Actually this seems to be working; however, it is using only one core. Why?
This will chop bigfile into 1 MB chunks:
cat bigfile.txt | parallel --pipe --cat -k perl ./splitwords.pl words-en.txt {}
If the Perl script only reads the file, then this will be faster:
cat bigfile.txt | parallel --pipe --fifo -k perl ./splitwords.pl words-en.txt {}
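(The original attempt used only one core because -n 2 with just two arguments produces exactly one job.) The 1 MB chunk size is parallel's default --block; assuming larger chunks suit your data, you can raise it, for example:
cat bigfile.txt | parallel --pipe --block 10M --cat -k perl ./splitwords.pl words-en.txt {}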

How can I stop gnu parallel jobs when any one of them terminates?

Suppose I am running N jobs with the following gnu parallel command:
seq $N | parallel -j 0 --progress ./job.sh
How can I invoke parallel to kill all running jobs and accept no more as soon as any one of them exits?
You can use --halt:
seq $N | parallel -j 0 --halt 2 './job.sh; exit 1'
A small problem with that solution is that you cannot tell if job.sh failed.
You may also use killall perl. It's not an accurate way, but it is easy to remember.
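If your GNU parallel is new enough to support the when,why=value form of --halt, a variant that stops as soon as any job finishes, without the exit 1 wrapper, would be (sketch; check your version's man page):
seq $N | parallel -j 0 --halt now,done=1 ./job.sh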

bash gnu parallel argfile syntax

I just discovered GNU parallel and I'm having some trouble running a simple parallel task. I have a simulation running over multiple values and I'd like to split it up to run in parallel using command line args. From the docs, it seems you can run parallel mycommand :::: myargfile, where myargfile contains the various arguments you would like to feed your command, in parallel. However, I didn't see any information on how the args should be listed and assumed a myargfile like this would work:
--pmin 0 --pmax 0.1
--pmin 0.1 --pmax 0.2
...
mycommand --pmin 0 --pmax 0.1 executes with no problem. But when I run parallel mycommand :::: myargfile I get the error unknown option pmin 0 --pmax 0.1 (caught and decoded courtesy of Boost program options). parallel echo :::: myargfile correctly prints out the arguments. It's as if they are being wrapped in a single string which the program can't parse, rather than being passed as they would be from a standard bash script.
What's going on? How can I make this work?
Following @DmitriChubarov's link to https://stackoverflow.com/a/6258206/1328439, I discovered that I was missing the --colsep flag:
parallel --colsep ' ' mycommand :::: myargfile
successfully executes.
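To see what --colsep changes: it splits each line of the argfile into separate fields, so the options reach the command as individual arguments (available as {1}, {2}, ... or appended in order) instead of one quoted string. A quick check with echo standing in for mycommand:
parallel --colsep ' ' echo 'pmin={2} pmax={4}' :::: myargfile
# prints "pmin=0 pmax=0.1", "pmin=0.1 pmax=0.2", ... one line per argfile row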
After digging through the manual and help pages, I came up with this example. Perhaps it will save someone out there. :)
#!/usr/bin/env bash
COMMANDS=(
    "cnn -a mode=flat"
    "cnn -a mode=xxx"
    "cnn_x -a mode=extreme"
)
parallel --verbose --progress --colsep ' ' scrapy crawl {.} ::: "${COMMANDS[@]}"

Resources