check_cpu + nsclient : set critical threshold only on 5min period - monitor

I am using Centreon (Nagios) to monitor the CPUs of some VMs using NSClient. In my case it only makes sense to put the CPU probe into the critical state if the average CPU load over the 5m period is > 95. Is this achievable?
I cannot find documentation on how to specify that in the critical parameter.
Default command
check_cpu
Returns
CPU Load ok
'total 5m load'=0%;80;90 'total 1m load'=0%;80;90 'total 5s load'=7%;80;90
Command with a specific threshold (but any of the time periods can match)
check_cpu "critical=load > 90"

It is not exactly what I wanted to do, but this is what I ended up with:
check_nrpe -u -H XX.XXX.X.XXX -c check_cpu -a "crit=load > 95" "warn=load > 90" time=5m
Which limits the output to the 5m time period.
Note that to execute this from Centreon you have to set the following options in the nsclient.ini file (I wasted a lot of time on that one):
[/settings/NRPE/server]
allow nasty characters=true
[/settings/external scripts]
allow nasty characters=true

Check this script,
define service{
    use generic-service
    host_name xxx
    service_description CPU Load
    check_command check_nrpe!check_load
    contact_groups sysadmin
}
---
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20

You can try something like that
check_nrpe -u -H XX.XXX.X.XXX -c check_cpu -a "warning=time = '5m' and load > 80" "critical=time = '5m' and load > 90" show-all
You can also check the documentation for more info.
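To make the effect of restricting the thresholds to the 5m window concrete, here is a minimal Python sketch (not part of NSClient++ or the Nagios plugins) that parses a perfdata line like the one shown in the question and derives the state from the 'total 5m load' value only. The function names and the default thresholds of 90 and 95 are assumptions chosen for illustration.

```python
import re

# Matches perfdata entries such as 'total 5m load'=0%;80;90
PERF_RE = re.compile(r"'([^']+)'=(\d+(?:\.\d+)?)%")


def parse_perfdata(line):
    """Return {label: value} for each percentage entry in a perfdata line."""
    return {label: float(value) for label, value in PERF_RE.findall(line)}


def state_for_5m(line, warn=90.0, crit=95.0):
    """Return (nagios_exit_code, message) based on the 5m value only.

    Exit codes follow the usual plugin convention:
    0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    """
    load = parse_perfdata(line).get("total 5m load")
    if load is None:
        return 3, "UNKNOWN: no 'total 5m load' entry in perfdata"
    if load > crit:
        return 2, f"CRITICAL: 5m load {load:g}%"
    if load > warn:
        return 1, f"WARNING: 5m load {load:g}%"
    return 0, f"OK: 5m load {load:g}%"
```

With this logic a short spike in the 5s or 1m value never changes the state; only a sustained 5m average above the threshold does.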


Vertica's vsql.exe returns errorlevel 0 when facing ERROR 3326: Execution time exceeded run time cap

I am using vsql.exe on an external Vertica database for which I don't have any administrative access. I use some views with simple SELECT+FROM+WHERE queries.
These queries work just fine 90% of the time, but sometimes, randomly, I get this error:
ERROR 3326:  Execution time exceeded run time cap of 00:00:45
The strange thing is that this error can happen way after those 45 seconds, even after 3 minutes. I've been told this is related to having different resource pools, but anyway I don't want to dig into that.
The problem is that when this occurs, vsql.exe returns errorlevel 0 and there is (apparently almost) no way to know this failed.
The output of the query is stored in a csv file. When it succeeds, it ends with (#### rows). But when it fails with this error, it just stops at any point of the csv, and its resulting size is around half of what's expected. This is of course not what you would expect when an error occurs, like no output or an empty one.
If there is a connection error or if the query has syntax errors, the errorlevel is not 0, so in those cases it behaves as expected.
I've tried many things, like increasing the timeout or adding -v ON_ERROR_STOP=ON to the vsql.exe parameters, but none of that helped.
I've googled a lot and found many people having this error, but the solutions are mostly related to increasing the timeouts, not related to the errorlevel returned.
Any help will be greatly appreciated.
TL;DR: how can I detect an error 3326 in a batch file like this?
@echo off
vsql.exe -h <hostname> -U <user> -w <pwd> -o output.csv -Ac "SELECT ....;"
echo %errorlevel% is always 0
if errorlevel 1 echo Error!! But this is never displayed.
Now that's really unexpected to me. I don't have Windows available just now, but trying on my Mac - at first just triggering a deliberate error:
$ vsql -h zbook -d sbx -U dbadmin -w $VSQL_PASSWORD -v ON_ERROR_STOP=ON -Ac "select * from foobarfoo"
ERROR 4566: Relation "foobarfoo" does not exist
$ echo $?
1
With ON_ERROR_STOP set to ON, this should be the behaviour everywhere.
Could you try what I did above through Windows, just with echo %ERRORLEVEL% instead of echo $?, just from the Windows command prompt and not in a batch file?
Next test: I run on resource pool general in my little test database, so I temporarily modify it to a runtime cap of 30 sec, run a silly query that will take over 30 seconds with ON_ERROR_STOP set to ON, collect the value returned by vsql and set the runtime cap of general back to NONE. I also have the %VSQL_*% env variables set so I don't have to repeat them all the time:
rem Windows way to set environment variables for vsql:
set VSQL_HOST=zbook
set VSQL_DATABASE=sbx
set VSQL_USER=dbadmin
set VSQL_PASSWORD=***masked***
Now for the test (in Linux/macOS, a trailing backslash escapes the newline, which lets you wrap a long shell command over several lines; use the caret (^) in Windows for that):
marco ~/1/Vertica/supp $ # set a runtime cap
marco ~/1/Vertica/supp $ vsql -i -c \
"alter resource pool general runtimecap '00:00:30'"
ALTER RESOURCE POOL
Time: First fetch (0 rows): 116.326 ms. All rows formatted: 116.730 ms
marco ~/1/Vertica/supp $ vsql -v ON_ERROR_STOP=ON -iAc \
"select count(*) from one_million_rows a cross join one_million_rows b"
ERROR 3326: Execution time exceeded run time cap of 00:00:30
marco ~/1/Vertica/supp $ # test the return code
marco ~/1/Vertica/supp $ echo $?
1
marco ~/1/Vertica/supp $ # clear the runtime cap
marco ~/1/Vertica/supp $ vsql -i -c \
"alter resource pool general runtimecap NONE "
ALTER RESOURCE POOL
Time: First fetch (0 rows): 11.148 ms. All rows formatted: 11.383 ms
So it works in my case. Your line:
if errorlevel 1 echo Error!! But this is never displayed.
... never echoes anything because the previous line, with echo will return 0 to the shell, overriding the previous errorlevel.
Try it command by command on your Windows command prompt, and see what happens. Just echo %errorlevel%, without evaluating it.
And I notice that you are trying to export to CSV format. Then, try this:
Format the output unaligned (-A)
set the field separator to comma (-F ',')
remove the footer '(n rows)' (-P footer)
limit the output to 5 rows in the query for test
(I show the output before redirecting to file):
marco ~/1/Vertica/supp $ vsql -A -F ',' -P footer -c "select * from one_million_rows limit 5"
id,id_desc,dob,category,busid,revenue
0,0,1950-01-01,1,====== boss ========,0.000
1,-1,1950-01-02,2,kbv-000001kbv-000001,0.010
2,-2,1950-01-03,3,kbv-000002kbv-000002,0.020
3,-3,1950-01-04,4,kbv-000003kbv-000003,0.030
4,-4,1950-01-05,5,kbv-000004kbv-000004,0.040
Not aligning is much faster than aligning.
Then, as you spend most time in the fetching of the rows (that's because you get a timeout in the middle of an output file write process), try fetching more rows at a time than the default 1000. You will need to play with the value, depending on the network settings at your site until you get your best value:
-v ROWS_AT_A_TIME=10000
Once you're happy with the tested output, try this command (change the SELECT for your needs, of course ....):
marco ~/1/Vertica/supp $ vsql -A -F ',' -P footer \
-v ON_ERROR_STOP=ON -v ROWS_AT_A_TIME=10000 -o one_million_rows.csv \
-c "select * from one_million_rows"
marco ~/1/Vertica/supp $ wc -l one_million_rows.csv
1000001 one_million_rows.csv
The table actually contains one million rows. Note the line count in the file: 1,000,001. That's the title line included, but the footer (1000000 rows) removed.
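If the exit code still cannot be trusted in your environment, the truncation described in the question can also be detected from the output itself. The following is a hedged Python sketch, not a vsql feature: it assumes the '(N rows)' footer is kept (so no -P footer), that the query output is captured from stdout instead of being written with -o, and that vsql writes its ERROR lines to stderr. The names output_is_complete and run_checked are hypothetical.

```python
import re
import subprocess

# A complete result ends with a footer such as "(1000000 rows)" or "(1 row)".
FOOTER_RE = re.compile(r"^\(\d+ rows?\)$")


def output_is_complete(text):
    """True if the last non-empty line is a '(N rows)' footer."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and FOOTER_RE.match(lines[-1]) is not None


def run_checked(cmd):
    """Run cmd (a list of arguments) and return (ok, stdout).

    ok is False if the exit code is non-zero, if stderr contains an
    'ERROR ' line, or if the footer is missing from stdout.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    failed = proc.returncode != 0 or "ERROR " in proc.stderr
    ok = (not failed) and output_is_complete(proc.stdout)
    return ok, proc.stdout
```

A caller would pass the vsql command line as a list, write stdout to the CSV file only when ok is true, and exit with a non-zero code otherwise.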

Snakemake memory limiting

In Snakemake, I have 5 rules. For each one I set the memory limit using the mem_mb option under resources.
It looks like this:
rule assembly:
    input:
        file1 = os.path.join(MAIN_DIR, "1.txt"), \
        file2 = os.path.join(MAIN_DIR, "2.txt"), \
        file3 = os.path.join(MAIN_DIR, "3.txt")
    output:
        foldr = dir, \
        file4 = os.path.join(dir, "A.png"), \
        file5 = os.path.join(dir, "A.tsv")
    resources:
        mem_mb=100000
    shell:
        " pythonscript.py -i {input.file1} -v {input.file2} -q {input.file3} --cores 5 -o {output.foldr} "
I want to limit the memory usage of the whole Snakefile by doing something like:
snakemake --snakefile mysnakefile_snakefile --resources mem_mb=100000
That way the jobs would not use 100 GB each (with 5 rules, that would mean 500 GB of memory allocated); instead, all of them together would use at most 100 GB (5 jobs, 100 GB allocated in total). Is that possible?
The command line argument sets the total limit. The Snakemake scheduler will ensure that for the set of running jobs, the sum of the mem_mb resources will not exceed the total limit.
I think this is exactly what you want, isn't it? You just need to set the per-job expected memory in the rule itself. Note that Snakemake does not measure this for you. You have to define that value yourself in the rule. E.g., if you expect your job to use 100MB memory, put mem_mb=100 into that rule.
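The budget rule can be illustrated with a toy model. This is only a sketch of the constraint described above, not Snakemake's actual scheduler; pick_runnable and its greedy, in-order selection are assumptions made for the example.

```python
def pick_runnable(pending, running, total_mem_mb):
    """Pick pending jobs, in order, while the summed mem_mb stays in budget.

    pending and running map job names to their declared mem_mb.
    """
    used = sum(running.values())
    chosen = []
    for name, mem_mb in pending.items():
        if used + mem_mb <= total_mem_mb:
            chosen.append(name)
            used += mem_mb
    return chosen


# Five jobs declaring 100000 MB each, with a global budget of 100000 MB:
jobs = {f"job{i}": 100000 for i in range(1, 6)}
print(pick_runnable(jobs, {}, 100000))  # only one job fits at a time
```

So with mem_mb=100000 in every rule and --resources mem_mb=100000 on the command line, the jobs effectively run one after another. Declaring a smaller, realistic mem_mb per rule lets several of them run side by side within the same budget.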

Generating data for MANET nodes location in a specific format using ns2 setdest

I am following this link and trying to implement the scenarios there.
So I need to generate data for MANET nodes representing their location in this format:
Current time - latest x - latest y - latest update time - previous x - previous y - previous update time
with the use of setdest tool with these options:
1500 by 300 grid, ran for 300 seconds and used pause times of 20s and maximum velocities of 2.5 m/s.
So I came up with this command:
./setdest -v 2 -n 10 -s 2.5 -m 10 -M 50 -t 300 -p 20 -x 1500 -y 300 > test1.tcl
which worked and generated a tcl file, but I don't know how I can obtain the data in the required format.
setdest -v 2 -n 10 -s 2.5 -m 10 -M 50 -t 300 -p 20 -x 1500 -y 300 > test1.tcl
That is not a Tcl file: it is a "scen" (scenario) file with 1,700 "ns" commands. I renamed your file to "test1.scen", and it is now used in the MANET examples, in the simulation example aodv-manet-20.tcl:
set val(cp) "test1.scen" ;#Connection Pattern
Please be aware that time settings are maximum time. "Long time settings" were useful ~20 years ago when computers were slow. (Though there are complex simulations lasting half an hour to one hour.)
Link, manet-examples-1.tar.gz https://drive.google.com/file/d/0B7S255p3kFXNR05CclpEdVdvQm8/view?usp=sharing
Edit: New example added → manet0-16-nam.tcl → → https://drive.google.com/file/d/0B7S255p3kFXNR0ZuQ1l6YnlWRGc/view?usp=sharing
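As for getting from the scenario file to the requested columns, here is a rough Python sketch. It rests on assumptions: that movement lines have the shape shown in the comment, and that "latest" and "previous" can be taken as the last two waypoints issued to a node. Those are destinations, not interpolated positions at the current time. The function names are hypothetical.

```python
import re

# Assumed shape of a movement line in the generated scenario file:
#   $ns_ at 2.000000 "$node_(0) setdest 90.44 44.89 1.37"
MOVE_RE = re.compile(
    r'\$ns_ at ([\d.]+) "\$node_\((\d+)\) setdest ([\d.]+) ([\d.]+) ([\d.]+)"'
)


def parse_moves(text):
    """Return (time, node, x, y, speed) tuples for every movement line."""
    return [
        (float(t), int(n), float(x), float(y), float(s))
        for t, n, x, y, s in MOVE_RE.findall(text)
    ]


def location_row(moves, node, now):
    """Build (now, latest x, latest y, latest t, previous x, previous y, previous t).

    Uses the last two waypoints issued to `node` at or before `now`;
    returns None when fewer than two are available.
    """
    seen = sorted(m for m in moves if m[1] == node and m[0] <= now)
    if len(seen) < 2:
        return None
    (pt, _, px, py, _), (lt, _, lx, ly, _) = seen[-2], seen[-1]
    return (now, lx, ly, lt, px, py, pt)
```

Reading test1.tcl into a string, passing it to parse_moves and calling location_row per node and timestamp yields one record per node in the column order requested above.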

How to monitor resources during slurm job?

I'm running jobs on our university cluster (regular user, no admin rights), which uses the SLURM scheduling system, and I'm interested in plotting the CPU and memory usage over time, i.e. while the job is running. I know about sacct and sstat and I was thinking of including these commands in my submission script, e.g. something along the lines of
#!/bin/bash
#SBATCH <options>
# Running the actual job in background
srun my_program input.in output.out &
# While loop that records resources
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
FIRST=0
#sleep time in seconds
STIME=15
while [ "$JobStatus" != "COMPLETED" ]; do
    #update job status
    JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
    if [ "$JobStatus" == "RUNNING" ]; then
        if [ $FIRST -eq 0 ]; then
            sstat --format=AveCPU,AveRSS,MaxRSS -P -j ${SLURM_JOB_ID} >> usage.txt
            FIRST=1
        else
            sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j ${SLURM_JOB_ID} >> usage.txt
        fi
        sleep $STIME
    elif [ "$JobStatus" == "PENDING" ]; then
        sleep $STIME
    else
        sacct -j ${SLURM_JOB_ID} --format=AllocCPUS,ReqMem,MaxRSS,AveRSS,AveDiskRead,AveDiskWrite,ReqCPUS,AllocCPUs,NTasks,Elapsed,State >> usage.txt
        JobStatus="COMPLETED"
        break
    fi
done
However, I'm not really convinced of this solution:
sstat unfortunately doesn't show how many CPUs are used at the moment (only the average)
MaxRSS is also not helpful if I try to record memory usage over time
there still seems to be some error (the script doesn't stop after the job finishes)
Does anyone have an idea how to do that properly? Maybe even with top or htop instead of sstat? Any help is much appreciated.
Slurm offers a plugin to record a profile of a job (CPU usage, memory usage, even disk/net IO for some technologies) into an HDF5 file. The file contains a time series for each measure tracked, and you can choose the time resolution.
You can activate it with
#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>
See the documentation here.
To check that this plugin is installed, run
scontrol show config | grep AcctGatherProfileType
It should output AcctGatherProfileType = acct_gather_profile/hdf5.
The files are created in the folder referred to in the ProfileHDF5Dir Slurm configuration parameter (in slurm.conf)
As for your script, you could try replacing sstat with an SSH connection to the compute nodes to run ps. Assuming pdsh or clush is installed, you could run something like:
pdsh -j $SLURM_JOB_ID ps -u $USER -o pid,state,cputime,%cpu,rssize,command --columns 100 >> usage.txt
This will give you CPU and memory usage per process.
As a final note, your job never terminates simply because it will terminate when the while loop terminates, and the while loop will terminate when the job terminates... The condition "$JobStatus" == "COMPLETED" will never be observed from within the script. When the job is completed, the script is killed.
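One way around that circular wait is to watch the background process itself rather than the job state. The sketch below is in Python and uses hypothetical helper names: sample_while_running takes samples until the child process exits, and parse_parsable turns pipe-delimited output with a header row (the shape that sstat -P produces) into dictionaries.

```python
import subprocess
import time


def parse_parsable(text):
    """Turn pipe-delimited text with a header row into a list of dicts."""
    rows = [line.split("|") for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    header = rows[0]
    return [dict(zip(header, row)) for row in rows[1:]]


def sample_while_running(cmd, sampler, interval=15.0):
    """Start cmd, then call sampler() every `interval` seconds until it exits.

    Returns (exit_code, samples). The loop ends because it polls the
    child process directly, not the state of the enclosing job.
    """
    proc = subprocess.Popen(cmd)
    samples = []
    while proc.poll() is None:
        samples.append(sampler())
        time.sleep(interval)
    return proc.returncode, samples
```

In a submission script, cmd would be the srun command line and sampler a small function that runs sstat and passes its output to parse_parsable. The final sacct summary can then be collected after sample_while_running returns.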

Sending a nagios alert when graphite does not get data

I am collecting some metrics using graphite, but sometimes there is no data coming into it (probably because the server has gone down, or there is no network connectivity). I want nagios to send me an alert during such an event. How do I do that?
You could use the check_file_age script from nagios-plugins to check a single known datapoint of interest per system that you are collecting data from.
check_file_age -w 600 -c 1800 /opt/graphite/storage/whisper/servers/$(hostname -f)/cpu/idl.wsp
That would warn you once the metric file has not been updated for 10 minutes (-w 600) and go critical after 30 minutes (-c 1800).
Alternatively, you could run a find command over all the points and report any that have not been updated in n hours.
#!/bin/bash
OLD_GRAPHS=$(find /opt/graphite/storage/whisper -mmin +120 -type f | wc -l)
if [[ "${OLD_GRAPHS}" -gt 0 ]]; then
    echo "Found ${OLD_GRAPHS} graph(s) without an update in 120 minutes"
    exit 1
fi
echo "All graphs are up to date"
exit 0
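The same check can be sketched in Python, which also makes it easy to report which files are stale. This is an illustrative equivalent of the script above rather than an existing plugin; stale_files and check are hypothetical names, and the 120-minute default mirrors the find example.

```python
import os
import time


def stale_files(root, max_age_seconds, now=None):
    """Return sorted paths under root not modified within max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > max_age_seconds:
                stale.append(path)
    return sorted(stale)


def check(root, max_age_seconds=120 * 60):
    """Return (exit_code, message) with the same exit codes as the script above."""
    stale = stale_files(root, max_age_seconds)
    if stale:
        minutes = max_age_seconds // 60
        return 1, f"Found {len(stale)} graph(s) without an update in {minutes} minutes"
    return 0, "All graphs are up to date"
```

A wrapper would call check("/opt/graphite/storage/whisper"), print the message and pass the code to sys.exit so that Nagios picks up the state.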
