docker-compose healthcheck retry frequency != interval - docker

I recently set up healthchecks in my docker-compose config.
It is doing great and I like it. Here's a typical example:
services:
app:
healthcheck:
test: curl -sS http://127.0.0.1:4000 || exit 1
interval: 5s
timeout: 3s
retries: 3
start_period: 30s
My container is quite slow to boot, hence I set up a 30 seconds start_period.
But it doesn't really fit my expectation: I don't need check every 5 seconds, but I need to know when the container is ready for the first time as soon as possible for my orchestration, and since my start_period is approximative, if it is not ready yet at first check, I have to wait for interval before retry.
What I'd like to have is:
While container is not healthy, retry every 5 seconds
Once it is healthy, check every 1 minute
Ain't there a way to achieve this out-of-the-box with docker-compose?
I could write a custom script to achieve this, but I'd rather have a native solution if it is possible.

Unfortunately, this is not possible out of the box.
All the duration set are final. They can't be changed depending on the container state.
However, according to the documentation, the probe does not seem to wait for the start_period to finish before checking your test. The only thing it does is that any failure hapenning during start_period will not be considered as an error.
Below is the sentence that make me think that :
start_period provides initialization time for containers that need time to bootstrap. Probe failure during that period will not be counted towards the maximum number of retries. However, if a health check succeeds during the start period, the container is considered started and all consecutive failures will be counted towards the maximum number of retries.
I encourage you to test if this is really the case as I've never really paid any attention if the healthcheck is tested during the start period or not.
And if it is the case, you can probably increase your start_period if you're unsure about the duration and also increase the interval in order to find a good compromise.

I wrote a script that does this, though I'd rather find a native solution:
#!/bin/sh
HEALTHCHECK_FILE="/root/.healthchecked"
COMMAND=${*?"Usage: healthcheck_retry <COMMAND>"}
if [ -r "$HEALTHCHECK_FILE" ]; then
LAST_HEALTHCHECK=$(date -r "$HEALTHCHECK_FILE" +%s)
# FIVE_MINUTES_AGO=$(date -d 'now - 5 minutes' +%s)
FIVE_MINUTES_AGO=$(echo "$(( $(date +%s)-5*60 ))")
echo "Healthcheck file present";
# if (( $LAST_HEALTHCHECK > $FIVE_MINUTES_AGO )); then
if [ $LAST_HEALTHCHECK -gt $FIVE_MINUTES_AGO ]; then
echo "Healthcheck too recent";
exit 0;
fi
fi
if $COMMAND ; then
echo "\"$COMMAND\" succeed: updating file";
touch $HEALTHCHECK_FILE;
exit 0;
else
echo "\"$COMMAND\" failed: exiting";
exit 1;
fi
Which I use: test: /healthcheck_retry.sh curl -fsS localhost:4000/healthcheck
The pain is that I need to make sure the script is available in every container, so I have to create an extra volume for this:
image: postgres:11.6-alpine
volumes:
- ./scripts/utils/healthcheck_retry.sh:/healthcheck_retry.sh

Related

In Jenkins job, behave tests stops after any failure

I have created a jenkins "freestyle" job, in which I am trying to run multiple BDD testing process. Following is the "commands" I have put in "Jenins/Build/execute shell" section:
cd ~/FEXT_BETA_BDD
rm -rf allure_reports allure-reports allure-results
pip install behave
pip install selenium
pip install -r features/requirements.txt
# execute features in plan section
behave -f allure_behave.formatter:AllureFormatter -f pretty -o ./allure-reports
./features/plan/*.feature
# execute features in blueprint section
behave -f allure_behave.formatter:AllureFormatter -f pretty -o ./allure-reports
./features/blueprint/*.feature
What I have found is in Jenkins, if there is any test case intermittent failure, such message is shown in the Console Output:
"
...
0 features passed, 1 failed, 0 skipped
0 scenarios passed, 1 failed, 0 skipped
3 steps passed, 1 failed, 1 skipped, 0 undefined
Took 2m48.770s
Build step 'Execute shell' marked build as failure
"
And the leftover test cases are skipped. But if I was to run the behave command on my local host directly, I don't get this type of behaviour. The failure will be detected and the remaining test cases continues till all are finished.
So How may I work around this issue in Jenkins ?
Thanks,
Jack
You may try the following syntax:
set +e
# execute features in plan section
behave -f allure_behave.formatter:AllureFormatter -f pretty -o ./allure-reports
./features/plan/*.feature || echo 'ALERT: Build failed while running the plan section'
# execute features in blueprint section
behave -f allure_behave.formatter:AllureFormatter -f pretty -o ./allure-reports
./features/blueprint/*.feature || echo 'ALERT: Build failed while running the blueprint section'
# Restoring original configuration
set -e
Note:
Goal of set -e is to cause the shell to abort any time an error occurs. If you will see your log output, you will notice sh -xe at the start of execution which confirms that Execute Shell in Jenkins uses -e option. So, to disable it, you can use +e instead. However, it's good to restore it once your purpose is fulfilled so that subsequent commands produce expected result.
Ref: https://superuser.com/questions/1113014/what-would-set-e-and-set-x-commands-do-in-the-context-of-a-shell-script
The ConsoleOutput from the SummaryReporter above indicates that you have only one feature with one scenario (that fails). Behave has no such thing that it stops when the first scenario fails.
An early abortion of the test run can only occur if critical things happen:
A failure/exception in the before_all() hook occurs
A critical exception is raised (SystemExit, KeyboardInterrupt) to end the test run
Your implementation tells behave to abort the test run (make sense on critical failures when all other tests will also fail; why waste the time)
BUT: If the test run is aborted early, all the features/scenarios that are not executed yet are reported as untested counts in the SummaryReporter.
...
0 features passed, 1 failed, 0 skipped, 2 untested
0 scenarios passed, 1 failed, 0 skipped, 3 untested
0 steps passed, 1 failed, 0 skipped, 0 undefined, 6 untested
HINT: Untested counts are normally hidden. They are only shown if the counter is not zero (greater than zero).
This is not the case in your description.
SEE ALSO:
behave: features/runner.abort_by_user.feature

check_cpu + nsclient : set critical threshold only on 5min period

I am using centreon (nagios) to monitor the CPUs of some VMs using NSClient. In my case it makes only sense to set the critical state of the cpu probe if the average cpu load is > 95 over the 5m period. Is this achievable ?
I cannot find documentation on how to specify that in the critical param
Default command
check_cpu
Returns
CPU Load ok
'total 5m load'=0%;80;90 'total 1m load'=0%;80;90 'total 5s load'=7%;80;90
Command with specific threshold (but all time period can match)
check_cpu "critical=load > 90"
It is not exactly what I wanted to do but what I did is the following
check_nrpe -u -H XX.XXX.X.XXX -c check_cpu -a "crit=load > 95" "warn=load > 90" time=5m
Which limits the output to the 5m time period.
Note that to execute this from centreon you have to set the following variables inside the nsclient.ini file (waisted a lot of time on that one)
[/settings/NRPE/server]
allow nasty characters=true
[/settings/external scripts]
allow nasty characters=true
Check this script,
define service{
use generic-service
host_name xxx
service_description CPU Load
check_command check_nrpe!check_load
contact_groups sysadmin
}
---
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
You can try something like that
check_nrpe -u -H XX.XXX.X.XXX -c check_cpu -a "warning=time = '5m' and load > 80" "critical=time = '5m' and load > 90" show-all
You can also check the documentation for more info.

How to monitor resources during slurm job?

I'm running jobs on our university cluster (regular user, no admin rights), which uses the SLURM scheduling system and I'm interested in plotting the CPU and memory usage over time, i.e while the job is running. I know about sacct and sstat and I was thinking to include these commands in my submission script, e.g. something in the line of
#!/bin/bash
#SBATCH <options>
# Running the actual job in background
srun my_program input.in output.out &
# While loop that records resources
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
FIRST=0
#sleep time in seconds
STIME=15
while [ "$JobStatus" != "COMPLETED" ]; do
#update job status
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
if [ "$JobStatus" == "RUNNING" ]; then
if [ $FIRST -eq 0 ]; then
sstat --format=AveCPU,AveRSS,MaxRSS -P -j ${SLURM_JOB_ID} >> usage.txt
FIRST=1
else
sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j ${SLURM_JOB_ID} >> usage.txt
fi
sleep $STIME
elif [ "$JobStatus" == "PENDING" ]; then
sleep $STIME
else
sacct -j ${SLURM_JOB_ID} --format=AllocCPUS,ReqMem,MaxRSS,AveRSS,AveDiskRead,AveDiskWrite,ReqCPUS,AllocCPUs,NTasks,Elapsed,State >> usage.txt
JobStatus="COMPLETED"
break
fi
done
However, I'm not really convinced of this solution:
sstat unfortunately doesn't show how many cpus are used at the
moment (only average)
MaxRSS is also not helpful if I try to record memory usage over time
there still seems to be some error (script doesn't stop after job finishes)
Does anyone have an idea how to do that properly? Maybe even with top or htop instead of sstat? Any help is much appreciated.
Slurm offers a plugin to record a profile of a job (PCU usage, memory usage, even disk/net IO for some technologies) into a HDF5 file. The file contains a time series for each measure tracked, and you can choose the time resolution.
You can activate it with
#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
See the documentation here.
To check that this plugin is installed, run
scontrol show config | grep AcctGatherProfileType
It should output AcctGatherProfileType = acct_gather_profile/hdf5.
The files are created in the folder referred to in the ProfileHDF5Dir Slurm configuration parameter (in slurm.conf)
As for your script, you could try replacing sstat with an SSH connection to the compute nodes to run ps. Assuming pdsh or clush is installed, you could run something like:
pdsh -j $SLURM_JOB_ID ps -u $USER -o pid,state,cputime,%cpu,rssize,command --columns 100 >> usage.txt
This will give you CPU and memory usage per process.
As a final note, your job never terminates simply because it will terminate when the while loop terminates, and the while loop will terminate when the job terminates... The condition "$JobStatus" == "COMPLETED" will never be observed from within the script. When the job is completed, the script is killed.

xvfb-run unreliable when multiple instances invoked in parallel

Can you help me, why I get sometimes (50:50):
webkit_server.NoX11Error: Cannot connect to X. You can try running with xvfb-run.
When I start the script in parallel as:
xvfb-run -a python script.py
You can reproduce this yourself like so:
for ((i=0; i<10; i++)); do
xvfb-run -a xterm &
done
Of the 10 instances of xterm this starts, 9 of them will typically fail, exiting with the message Xvfb failed to start.
Looking at xvfb-run 1.0, it operates as follows:
# Find a free server number by looking at .X*-lock files in /tmp.
find_free_servernum() {
# Sadly, the "local" keyword is not POSIX. Leave the next line commented in
# the hope Debian Policy eventually changes to allow it in /bin/sh scripts
# anyway.
#local i
i=$SERVERNUM
while [ -f /tmp/.X$i-lock ]; do
i=$(($i + 1))
done
echo $i
}
This is very bad practice: If two copies of find_free_servernum run at the same time, neither will be aware of the other, so they both can decide that the same number is available, even though only one of them will be able to use it.
So, to fix this, let's write our own code to find a free display number, instead of assuming that xvfb-run -a will work reliably:
#!/bin/bash
# allow settings to be updated via environment
: "${xvfb_lockdir:=$HOME/.xvfb-locks}"
: "${xvfb_display_min:=99}"
: "${xvfb_display_max:=599}"
# assuming only one user will use this, let's put the locks in our own home directory
# avoids vulnerability to symlink attacks.
mkdir -p -- "$xvfb_lockdir" || exit
i=$xvfb_display_min # minimum display number
while (( i < xvfb_display_max )); do
if [ -f "/tmp/.X$i-lock" ]; then # still avoid an obvious open display
(( ++i )); continue
fi
exec 5>"$xvfb_lockdir/$i" || continue # open a lockfile
if flock -x -n 5; then # try to lock it
exec xvfb-run --server-num="$i" "$#" || exit # if locked, run xvfb-run
fi
(( i++ ))
done
If you save this script as xvfb-run-safe, you can then invoke:
xvfb-run-safe python script.py
...and not worry about race conditions so long as no other users on your system are also running xvfb.
This can be tested like so:
for ((i=0; i<10; i++)); do xvfb-wrap-safe xchat & done
...in which case all 10 instances correctly start up and run in the background, as opposed to:
for ((i=0; i<10; i++)); do xvfb-run -a xchat & done
...where, depending on your system's timing, nine out of ten will (typically) fail.
This questions was asked in 2015.
In my version of xvfb (2:1.20.13-1ubuntu1~20.04.2), this problem has been fixed.
It looks at /tmp/.X*-lock to find an available port, and then runs Xvfb. If Xvfb fails to start, it finds a new port and retries, up to 10 times.

Supervisord keeps repeating a process with an exit of zero.

I am working on having supervisord send a curl request upon start. however, despite my best efforts, I set auto restart to false, it keeps running the script.
[program:slack-client]
priority=99
autorestart=false
command=bash -c 'SOME BASH COMMAND'
The error that keeps repeating is
INFO exited: slack-client (exit status 0; not expected)
I faced with similar behavior of rsyslog. This workaround helped me (but I'm not sure if that's exactly your case):
[program:rsyslog]
command = rsyslogd -n -c3
startsecs = 5
stopwaitsecs = 5

Resources