How can I stop gnu parallel jobs when any one of them terminates? - gnu-parallel

Suppose I am running N jobs with the following gnu parallel command:
seq $N | parallel -j 0 --progress ./job.sh
How can I invoke parallel to kill all running jobs and accept no more as soon as any one of them exits?

You can use --halt:
seq $N | parallel -j 0 --halt 2 './job.sh; exit 1'
A small problem with that solution is that you cannot tell whether job.sh itself failed, since every command now exits with status 1.

You may also use killall perl. It is not a precise way of doing it, but it is easy to remember.
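Newer versions of GNU parallel also document a done condition for --halt (check your version's man page). A minimal sketch, assuming your parallel supports it, that stops on the first job to finish, whether it succeeded or failed, without masking job.sh's exit status:
# kill the remaining jobs and accept no new ones as soon as any job finishes
seq $N | parallel -j 0 --halt now,done=1 ./job.sh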

Related

Executing bash script on multiple lines inside multiple files in parallel using GNU parallel

I want to use GNU parallel for the following problem:
I have a few files each with several lines of text. I would like to understand how I can run a script (code.sh) on each line of text of each file and for each file in parallel. I should be able to write out the output of the operation on each input file to an output file with a different extension.
It seems this is a case of an outer parallel running over all the files and an inner parallel running over the lines inside each file.
This is what I used:
ls mydata_* |
parallel -j+0 'cat {} | parallel -I ./explore-bash.sh > {.}.out'
I do not know how to do this using GNU parallel. Please help.
Your solution seems reasonable. You just need to remove -I:
ls mydata_* | parallel -j+0 'cat {} | parallel ./explore-bash.sh > {.}.out'
Depending on your setup this may be faster, as it will only run n jobs, whereas the solution above will run n*n jobs in parallel (n = number of cores):
ls mydata_* | parallel -j1 'cat {} | parallel ./explore-bash.sh > {.}.out'
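A self-contained way to try the difference (the file names and the trivial explore-bash.sh below are only for illustration, not from the question):
# two toy input files, two lines each
printf 'a\nb\n' > mydata_1
printf 'c\nd\n' > mydata_2
# a stand-in for the real script: it just echoes its argument
printf '#!/bin/bash\necho "processed: $1"\n' > explore-bash.sh
chmod +x explore-bash.sh
ls mydata_* | parallel -j1 'cat {} | parallel ./explore-bash.sh > {.}.out'
# mydata_1.out and mydata_2.out now hold one "processed: ..." line per input line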

How to monitor resources during slurm job?

I'm running jobs on our university cluster (regular user, no admin rights), which uses the SLURM scheduling system, and I'm interested in plotting CPU and memory usage over time, i.e. while the job is running. I know about sacct and sstat and I was thinking of including these commands in my submission script, e.g. something along the lines of
#!/bin/bash
#SBATCH <options>
# Running the actual job in background
srun my_program input.in output.out &
# While loop that records resources
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
FIRST=0
#sleep time in seconds
STIME=15
while [ "$JobStatus" != "COMPLETED" ]; do
#update job status
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
if [ "$JobStatus" == "RUNNING" ]; then
if [ $FIRST -eq 0 ]; then
sstat --format=AveCPU,AveRSS,MaxRSS -P -j ${SLURM_JOB_ID} >> usage.txt
FIRST=1
else
sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j ${SLURM_JOB_ID} >> usage.txt
fi
sleep $STIME
elif [ "$JobStatus" == "PENDING" ]; then
sleep $STIME
else
sacct -j ${SLURM_JOB_ID} --format=AllocCPUS,ReqMem,MaxRSS,AveRSS,AveDiskRead,AveDiskWrite,ReqCPUS,AllocCPUs,NTasks,Elapsed,State >> usage.txt
JobStatus="COMPLETED"
break
fi
done
However, I'm not really convinced of this solution:
- sstat unfortunately doesn't show how many CPUs are used at the moment (only the average)
- MaxRSS is also not helpful if I try to record memory usage over time
- there still seems to be some error (the script doesn't stop after the job finishes)
Does anyone have an idea how to do that properly? Maybe even with top or htop instead of sstat? Any help is much appreciated.
Slurm offers a plugin to record a profile of a job (CPU usage, memory usage, and even disk/network I/O for some technologies) into an HDF5 file. The file contains a time series for each tracked measure, and you can choose the time resolution.
You can activate it with
#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
See the Slurm documentation of the --profile option for the details.
To check that this plugin is installed, run
scontrol show config | grep AcctGatherProfileType
It should output AcctGatherProfileType = acct_gather_profile/hdf5.
The files are created in the folder referred to by the ProfileHDF5Dir Slurm configuration parameter (in slurm.conf).
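Once the job has finished, the per-node profile files can usually be merged into a single HDF5 file with sh5util (shipped with Slurm; whether it is available depends on your site's installation). A sketch, with the job id as a placeholder:
# inside the batch script: record the task time series (CPU, memory, ...)
#SBATCH --profile=task
# after the job has completed, merge the per-node files for inspection
sh5util -j <jobid> -o profile.h5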
As for your script, you could try replacing sstat with an SSH connection to the compute nodes to run ps. Assuming pdsh or clush is installed, you could run something like:
pdsh -j $SLURM_JOB_ID ps -u $USER -o pid,state,cputime,%cpu,rssize,command --columns 100 >> usage.txt
This will give you CPU and memory usage per process.
As a final note, your script never stops by itself: the while loop is part of the job, so the job can only finish when the loop exits, and the loop only exits when the job finishes. The condition "$JobStatus" == "COMPLETED" can therefore never be observed from within the script; by the time the job is marked completed, the script has already been killed.
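One way around that circular dependency (a sketch, not part of the original answer) is to run the monitoring loop in the background and stop it explicitly once srun returns:
#!/bin/bash
#SBATCH <options>
# record resource usage every 15 s while the job step is running
(
    while true; do
        sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j ${SLURM_JOB_ID} >> usage.txt
        sleep 15
    done
) &
MONITOR_PID=$!
# run the real work in the foreground; the loop above keeps sampling
srun my_program input.in output.out
# the job step has finished, so the monitor can be stopped
kill $MONITOR_PID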

Convert unix script to use gnu parallel

I have the following piece of code, which works as expected. It ensures that 2 processes are always spawned, and if any process fails, the script comes to a halt.
I have worked with GNU parallel earlier on simple one-line scripts and they have worked really well. I'm sure the one below can be made simpler too.
The sleeper function in reality is MUCH more complex than one shown below.
The objective is that GNU parallel will call the sleeper function in parallel and also handle errors.
sleeper(){
    stat=$1
    sleep 5
    echo "Status is $1"
    return $1
}
PROCS=2
errfile="errorfile"
rm "$errfile"
while read LINE && [ ! -f "$errfile" ]
do
    while [ ! -f "$errfile" ]
    do
        NUM=$(jobs | wc -l)
        if [ $NUM -lt $PROCS ]; then
            (sleeper $LINE || echo "bad exit status" > "$errfile") &
            break
        else
            sleep 2
        fi
    done
done<sleep_file
wait
Thanks
What you are looking for is --halt (requires version 20150622):
sleeper(){
    stat=$1
    sleep 5
    echo "Status is $1"
    return $1
}
export -f sleeper
parallel -j2 --halt now,fail=1 -v sleeper ::: 0 0 0 1 0 1 0
If you do not want the sleeper to get killed (maybe you want it to finish so it cleans up), then use --halt soon,fail=1 to let the running jobs complete without starting new ones.
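Since the original script read its input from sleep_file, the same file can be fed straight to parallel with the :::: file syntax:
export -f sleeper
# one argument per line of sleep_file, two jobs at a time, abort on first failure
parallel -j2 --halt now,fail=1 sleeper :::: sleep_file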

Trigger an action on timeout using gnu parallel

Is there a way to trigger an action (such as sending an email to an administrator) if a task spawned by gnu parallel times out?
Use --joblog (in combination with --timeout): a job that timed out is recorded in the log with Exitval = -1.
seq 100000 | parallel --joblog jl.log echo >> foo &
# Parse jl.log and act on timed-out jobs (Exitval = -1)
tail -n+1 -f jl.log | parallel --header : echo {Exitval}
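Building on that, a sketch of a complete pipeline; the timeout value, job command, and mail invocation are placeholders, not from the original answer:
# run the jobs with a 10-second timeout and keep a job log
seq 100000 | parallel --timeout 10 --joblog jl.log ./job.sh {} &
# column 7 of the tab-separated log is Exitval; -1 marks a timed-out job
tail -n+2 -f jl.log | awk -F'\t' '$7 == -1 { system("echo job " $1 " timed out | mail -s timeout admin@example.com") }'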

bash gnu parallel argfile syntax

I just discovered GNU parallel and I'm having some trouble running a simple parallel task. I have a simulation running over multiple values and I'd like to split it up to run in parallel using command line args. From the docs, it seems you can run parallel mycommand :::: myargfile, where myargfile contains the various arguments you would like to feed your command, in parallel. However, I didn't see any information on how the args should be listed and assumed a myargfile like this would work:
--pmin 0 --pmax 0.1
--pmin 0.1 --pmax 0.2
...
mycommand --pmin 0 --pmax 0.1 executes no problem. But when I run parallel mycommand :::: myargfile I get error: unknown option pmin 0 --pmax 0.1 (caught and decoded courtesy of boost program options). parallel echo :::: myargfile correctly prints out the arguments. It's as if each line is being passed as one long quoted string that the program can't parse, rather than as separate arguments the way a normal bash command line would pass them.
What's going on? How can I make this work?
Following @DmitriChubarov's link to https://stackoverflow.com/a/6258206/1328439 , I discovered that I was missing the --colsep flag:
parallel --colsep ' ' mycommand :::: myargfile
successfully executes.
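If you want to check the composed command lines before actually running anything, --dryrun prints them instead of executing them (here with the example argfile from the question):
parallel --dryrun --colsep ' ' mycommand :::: myargfile
# prints:
#   mycommand --pmin 0 --pmax 0.1
#   mycommand --pmin 0.1 --pmax 0.2
#   ...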
After digging through the manual and help pages, I came up with this example. Perhaps it will save someone out there. :)
#!/usr/bin/env bash
COMMANDS=(
"cnn -a mode=flat"
"cnn -a mode=xxx"
"cnn_x -a mode=extreme"
)
parallel --verbose --progress --colsep ' ' scrapy crawl {.} ::: "${COMMANDS[@]}"
