I have the following piece of code, which works as expected. It ensures that 2 processes are always spawned, and if any process fails, the script comes to a halt.
I have worked with GNU parallel earlier on simple one line scripts and they have worked really well.I'm sure the one below too can be made simpler.
The sleeper function in reality is MUCH more complex than one shown below.
The objective is that GNU parallel will call sleeper function in parallel and also do error handling
`sleeper(){
stat=$1
sleep 5
echo "Status is $1"
return $1
}
PROCS=2
errfile="errorfile"
rm "$errfile"
while read LINE && [ ! -f "$errfile" ]
do
while [ ! -f "$errfile" ]
do
NUM=$(jobs | wc -l)
if [ $NUM -lt $PROCS ]; then
(sleeper $LINE || echo "bad exit status" > "$errfile") &
break
else
sleep 2
fi
done
done<sleep_file
wait`
Thanks
What you are looking for is --halt (requires version 20150622):
sleeper(){
stat=$1
sleep 5
echo "Status is $1"
return $1
}
export -f sleeper
parallel -j2 --halt now,fail=1 -v sleeper ::: 0 0 0 1 0 1 0
If you do not want the sleeper to get killed (maybe you want it to finish so it cleans up), then use --halt soon,fail=1 to let the running jobs complete without starting new ones.
Related
I made a bash script for Nagios to test with Nagiosgraph. Rrd files are however not being created for this script. Default plugins that come with Nagios work well with Nagiosgraph and rrd files of those plugins are also present.
Here is the script:
#!/bin/bash
checkgpu=$( nvidia-smi --format=csv --query-gpu=utilization.gpu | awk '/[[:digit:]]+[[:space:]]%/ { tot+=$1;cnt++ } END { print tot/cnt }' | cut -d$
output="Load Average: $checkgpu"
if [ $checkgpu -ge 0 ]
then
echo "OK- $output"
exit 0
elif [ $checkgpu -eq 101 ]
then
echo "WARNING- $output"
exit 1
elif [ $checkgpu -eq 102 ]
then
echo "CRITICAL- $output"
exit 2
else
echo "UNKNOWN- $output"
exit 3
fi
What should i do to make this script work with Nagiosgraph/Performance data ?
Have a look at the development guidelines: https://nagios-plugins.org/doc/guidelines.html#AEN200
The expected format for perfdata is 'label'=value[UOM];[warn];[crit];[min];[max] which can look something like this:
PING ok - Packet loss = 0%, RTA = 0.80 ms | percent_packet_loss=0, rta=0.80
The pipe (|) character tells Nagios that the plugin output has ended and performance data starts.
Note that the above example does not specify UOM (unit of measurement, like percent), nor does it specify any warn/crit thresholds for the data, or min/max values for the graphs. These are all optional.
I'm running jobs on our university cluster (regular user, no admin rights), which uses the SLURM scheduling system and I'm interested in plotting the CPU and memory usage over time, i.e while the job is running. I know about sacct and sstat and I was thinking to include these commands in my submission script, e.g. something in the line of
#!/bin/bash
#SBATCH <options>
# Running the actual job in background
srun my_program input.in output.out &
# While loop that records resources
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
FIRST=0
#sleep time in seconds
STIME=15
while [ "$JobStatus" != "COMPLETED" ]; do
#update job status
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
if [ "$JobStatus" == "RUNNING" ]; then
if [ $FIRST -eq 0 ]; then
sstat --format=AveCPU,AveRSS,MaxRSS -P -j ${SLURM_JOB_ID} >> usage.txt
FIRST=1
else
sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j ${SLURM_JOB_ID} >> usage.txt
fi
sleep $STIME
elif [ "$JobStatus" == "PENDING" ]; then
sleep $STIME
else
sacct -j ${SLURM_JOB_ID} --format=AllocCPUS,ReqMem,MaxRSS,AveRSS,AveDiskRead,AveDiskWrite,ReqCPUS,AllocCPUs,NTasks,Elapsed,State >> usage.txt
JobStatus="COMPLETED"
break
fi
done
However, I'm not really convinced of this solution:
sstat unfortunately doesn't show how many cpus are used at the
moment (only average)
MaxRSS is also not helpful if I try to record memory usage over time
there still seems to be some error (script doesn't stop after job finishes)
Does anyone have an idea how to do that properly? Maybe even with top or htop instead of sstat? Any help is much appreciated.
Slurm offers a plugin to record a profile of a job (PCU usage, memory usage, even disk/net IO for some technologies) into a HDF5 file. The file contains a time series for each measure tracked, and you can choose the time resolution.
You can activate it with
#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
See the documentation here.
To check that this plugin is installed, run
scontrol show config | grep AcctGatherProfileType
It should output AcctGatherProfileType = acct_gather_profile/hdf5.
The files are created in the folder referred to in the ProfileHDF5Dir Slurm configuration parameter (in slurm.conf)
As for your script, you could try replacing sstat with an SSH connection to the compute nodes to run ps. Assuming pdsh or clush is installed, you could run something like:
pdsh -j $SLURM_JOB_ID ps -u $USER -o pid,state,cputime,%cpu,rssize,command --columns 100 >> usage.txt
This will give you CPU and memory usage per process.
As a final note, your job never terminates simply because it will terminate when the while loop terminates, and the while loop will terminate when the job terminates... The condition "$JobStatus" == "COMPLETED" will never be observed from within the script. When the job is completed, the script is killed.
Suppose I am running N jobs with the following gnu parallel command:
seq $N | parallel -j 0 --progress ./job.sh
How can I invoke parallel to kill all running jobs and accept no more as soon as any one of them exits?
You can use --halt:
seq $N | parallel -j 0 --halt 2 './job.sh; exit 1'
A small problem with that solution is that you cannot tell if job.sh failed.
You may also use killall perl. It's not accurate way, but easy to remember
I've got a small script called "onewhich". Its purpose is to behave like which, except that it will only give the FIRST occurrence of any executables specified as options, as found in the order they'd appear in the path.
So for example, if my path is /opt/bin:/usr/bin:/bin, and I have both /opt/bin/runme and /usr/bin/runme, then the command onewhich runme would return /opt/bin/runme.
But if I also have a /usr/bin/doit, then the command onewhich doit runme would return /usr/bin/doit instead.
The idea is to walk through the path, check for each executable specified, and if it exists, show it and exit.
Here's the script so far.
#!/bin/sh
for what in "$#"; do
for loc in `echo "${PATH}" | awk -vRS=: 1`; do
if [ -f "${loc}/${what}" ]; then
echo "${loc}/${what}"
exit 0
fi
done
done
exit 1
The problem is, I want to be better about PATH directories with special characters. Every second shell question here on StackOverflow talks about how bad it is to parse paths with tools like awk and sed. There's even a bash faq entry about it. (Proviso: I'm not using bash for this, but the recommendation is still valid.)
So I tried rewriting the script to separate paths in a pipe, like this"
#!/bin/sh
for what in "$#"; do
echo "${PATH}" | awk -vRS=: 1 | while read loc ; do
if [ -f "${loc}/${what}" ]; then
echo "${loc}/${what}"
exit 0
fi
done
done
exit 1
I'm not sure if this gives me any real advantage (since $loc is still inside quotes), but it also doesn't work because for some reason, the exit 0 seems to be ignored. Or ... it exits something (the sub-shell with the while loop that terminates the pipe, maybe), but the script exits with a value of 1 every time.
What's a better way to step through directories in ${PATH} without the risk that special characters will confuse things?
Alternately, am I reinventing the wheel? Is there maybe a way to do this that's built in to existing shell tools?
This needs to run in both Linux and FreeBSD, which is why I'm writing it in Bourne instead of bash.
Thanks.
This doesn't directly answer your question, but does eliminate the need to parse PATH at all:
onewhich () {
for what in "$#"; do
which "$what" 2>/dev/null && break
done
}
This just calls which on each command on the input list until it finds a match.
To parse PATH, you can simply set `IFS=':'.
if [ "${IFS:-x}" = "${IFS-x}" ]; then
# Only preserve the value of IFS if it is currently set
OLDIFS=$IFS
fi
IFS=":"
for f in $PATH; do # Do not quote $PATH, to allow word splitting
echo $f
done
if [ "${OLDIFS:-x}" = "${OLDIFS-x}" ]; then
IFS=$OLDIFS
fi
The above will fail if any of the directories in PATH actually contain colons.
Your first method looks to me as if it should work. In practical terms, if it's really the $PATH you'll be searching, it's unlikely you'll have spaces and newlines embedded in directories there. If you do, it's probably time to refactor.
But still, I don't think you're at risk from the possibility of bad names clobbering your loop, since you're wrapping variables in quotes. At worst, I suspect you might miss the odd valid executable, but I can't see how the script would generate errors. (I don't see how the script would miss valid executables, and I haven't tested - I'm just saying I don't see problems at first glance.)
As for your second question, about the loop, I think you've hit the nail on the head. When you run a pipe like this | that | while condition; do things; done, the while loop runs in its own shell at the end of the pipe. Exiting that shell may terminate the actions of the pipe, but that only brings you back to the parent shell, which has its own thread of execution that terminates with exit 1.
As for a better way to do this, I would consider which.
#!/bin/sh
for what in "$#"; do
which "$what"
done | head -1
And if you really want the exit values as well:
#!/bin/sh
for what in "$#"; do
which "$what" && exit 0
done
exit 1
The second might even be fewer resources, as it doesn't have to open a file handle and pipe through head.
You can also split your path using IFS. For example, if you wanted to wrap your loops the other way around, you could do this:
#!/bin/sh
IFS=":"
for loc in $PATH; do
for what in "$#"; do
if [ -x "$loc"/"$what" ]; then
echo "$loc"/"$what"
exit 0
fi
done
done
exit 1
Note that under normal circumstances, you might want to save the old value of $IFS, but you seem to be doing things in a stand-alone script, so the "new" value gets thrown out when the script exits.
All the above code is untested. YMMV.
Another way to get around the need to parse PATH at all is to run the builtin type command in new shell with a stripped environment (i. e. there simply are no functions or aliases to look up; cf. env -i sh -c 'type cmd 2>/dev/null).
# using `cmd` instead of $(cmd) for portability
onewhich() {
ec=0 # exit code
for cmd in "$#"; do
command -p env -i PATH="$PATH" sh -c '
export LC_ALL=C LANG=C
cmd="$1"
path="`type "$cmd" 2>/dev/null`"
if [ X"$path" = "X" ]; then
printf "%s\n" "error: command \"${cmd}\" not found in PATH" 1>&2
exit 1
else
case "$path" in
*\ /*)
path="/${path#*/}"
printf "%s\n" "$path";;
*)
printf "%s\n" "error: no disk file: $path" 1>&2
exit 1;;
esac
exit 0
fi
' _ "$cmd"
[ $? != 0 ] && ec=1
done
[ $ec != 0 ] && return 1
}
onewhich awk ls sed
onewhich builtin
onewhich if
Since which on success returns two full command paths if two commands are specified as arguments, exit 0 in the first onewhich script above aborts the program prematurely. In addition, if two commands are specified as arguments to which, the exit code of which is set to 1 even if only one command lookup failed (cf. which awk sedxyz ls; echo $?). To mimic this behaviour of the which command it is necessary to toggle on/off two variables (cnt and nomatches below).
onewhich() (
IFS=":"
nomatches=0
for cmd in "$#"; do
cnt=0
for loc in $PATH ; do
if [ $cnt = 0 ] && [ -x "$loc"/"$cmd" ]; then
echo "$loc"/"$cmd"
cnt=1
fi
done
[ $cnt = 0 ] && nomatches=1
done
[ $nomatches = 1 ] && exit 1 || exit 0 # exit 1: at least one cmd was not in PATH
)
onewhich awk ls sed
onewhich awk lsxyz sed
onewhich builtin
onewhich if
I'm piping some output of a command to egrep, which I'm using to make sure a particular failure string doesn't appear in.
The command itself, unfortunately, won't return a proper non-zero exit status on failure, that's why I'm doing this.
command | egrep -i -v "badpattern"
This works as far as giving me the exit code I want (1 if badpattern appears in the output, 0 otherwise), BUT, it'll only output lines that don't match the pattern (as the -v switch was designed to do). For my needs, those lines are the most interesting lines.
Is there a way to have grep just blindly pass through all lines it gets as input, and just give me the exit code as appropriate?
If not, I was thinking I could just use perl -ne "print; exit 1 if /badpattern/". I use -n rather than -p because -p won't print the offending line (since it prints after running the one-liner). So, I use -n and call print myself, which at least gives me the first offending line, but then output (and execution) stops there, so I'd have to do something like
perl -e '$code = 0; while (<>) { print; $code = 1 if /badpattern/; } exit $code'
which does the whole deal, but is a bit much, is there a simple command line switch for grep that will just do what I'm looking for?
Actually, your perl idea is not bad. Try:
perl -pe 'END { exit $status } $status=1 if /badpattern/;'
I bet this is at least as fast as the other options being suggested.
$ tee /dev/tty < ~/.bashrc | grep -q spam && echo spam || echo no spam
How about doing a redirect to /dev/null, hence removing all lines, but you still get the exit code?
$ grep spam .bashrc > /dev/null
$ echo $?
1
$ grep alias .bashrc > /dev/null
$ echo $?
0
Or you can simply use the -q switch
-q, --quiet, --silent
Quiet; do not write anything to standard output. Exit
immediately with zero status if any match is found, even if an
error was detected. Also see the -s or --no-messages option.
(-q is specified by POSIX.)