I am new to Nagios.
I am trying to configure the "check_disk" service for one host, but I am not getting the expected results.
I should get emails when disk usage goes beyond 80%.
There is already a service defined for this task with multiple hosts, as below:
define service{
        use                     local-service           ; Name of service template to use
        host_name               localhost, host1, host2, host3, host4, host5, host6
        service_description     Root Partition
        check_command           check_local_disk!20%!10%!/
        contact_groups          unix-admins,db-admins
        }
The Issue:
Further, I tried to test a single host, i.e. "host2". The current usage of host2 is as follows:
# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rootvg-rootvol 94G 45G 45G 50% /
So, to get instant emails, I wrote another service as below, where warning is set to less than 60% free space and critical to less than 40% free space.
define service{
        use                     local-service
        host_name               host2
        service_description     Root Partition again
        check_command           check_local_disk!60%!40%!/
        contact_groups          dev-admins
        }
But I still do not receive any emails for it.
Where is it going wrong?
The "check_local_disk" command is defined as below:
define command{
        command_name    check_local_disk
        command_line    $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
        }
Your command definition is currently set up to check only your Nagios server's disk, not the remote hosts (such as host2). You need to define a new command definition that executes check_disk on the remote host via NRPE (Nagios Remote Plugin Executor).
On the Nagios server, define the following:
define command {
        command_name    check_remote_disk
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_disk -a $ARG1$ $ARG2$ $ARG3$
        register        1
        }
define service{
        use                     generic-service
        host_name               host1, host2, host3, host4, host5, host6
        service_description     Root Partition
        check_command           check_remote_disk!20%!10%!/
        contact_groups          unix-admins,db-admins
        }
Restart the Nagios service.
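For example, something like the following; the exact service name (nagios vs. nagios3/nagios4) depends on your distribution and packaging, so treat it as an assumption:
sudo systemctl restart nagios
# or, on sysvinit-based systems:
sudo service nagios restart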
On the remote host:
Ensure you have the NRPE plugin installed.
Instructions for Ubuntu: http://tecadmin.net/install-nrpe-on-ubuntu/
Instructions for CentOS / RHEL: http://sharadchhetri.com/2013/03/02/how-to-install-and-configure-nagios-nrpe-in-centos-and-red-hat/
Ensure there is a command defined for check_disk on the remote host. This is usually included in nrpe.cfg, but commented out; you'd have to un-comment the line (see the sketch after these steps).
Ensure you have the check_disk plugin installed on the remote host. Mine is located at: /usr/lib64/nagios/plugins/check_disk
Ensure that the allowed_hosts field of nrpe.cfg includes the IP address / hostname of your Nagios server.
Ensure that the dont_blame_nrpe field of nrpe.cfg is set to 1 to allow command-line arguments to NRPE commands: dont_blame_nrpe=1
If you made any changes, restart the nrpe service.
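Taken together, the nrpe.cfg edits described above might look like the sketch below; the config path, plugin path, and Nagios server IP are placeholders for your environment:
# /etc/nagios/nrpe.cfg on the remote host (path varies by distribution)
allowed_hosts=127.0.0.1,<nagios-server-ip>
dont_blame_nrpe=1
# un-commented command; the thresholds arrive as arguments from the Nagios server
command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
You can then test the whole chain manually from the Nagios server (adjust the check_nrpe path to your install, i.e. wherever $USER1$ points):
/usr/local/nagios/libexec/check_nrpe -H host2 -c check_disk -a 20% 10% /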
I have uWSGI running behind an Nginx proxy server. I tried to benchmark my backend instance using ApacheBench. At one point I get a "Too many open files (24)" error when I run the command ab -c 1100 -n 2000 https://example.com/test.
I changed the ulimits of my ECS instance as well as the Docker containers, and confirmed it by typing ulimit -n, which returns 100000 in both locations.
I cross-checked the limits of the individual Nginx and uWSGI processes by opening /proc/<PID>/limits, where Max open files is set to 100000.
The worker_connections and worker_rlimit_nofile parameters in /etc/nginx/nginx.conf are also set to the highest possible limit.
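For reference, a minimal sketch of that per-process check, assuming <pid> is replaced with the PID of an Nginx or uWSGI worker:
# print the effective open-files limit of a running process
grep "Max open files" /proc/<pid>/limits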
I created a service called accountservice and later scaled it to 3 replicas. In my service I get the IP address of the producing service instance and populate it in the JSON response. The question is that every time I run curl $manager-ip:6767/accounts/10000 the returned IP is the same as before (I tried 100 times).
manager-ip environment variable:
set -x manager-ip (docker-machine ip swarm-manager-1)
Here's my Dockerfile:
FROM iron/base
EXPOSE 6767
ADD accountservice-linux-amd64 /
ADD healthchecker-linux-amd64 /
HEALTHCHECK --interval=3s --timeout=3s CMD ["./healthchecker-linux-amd64", "-port=6767"] || exit 1
ENTRYPOINT ["./accountservice-linux-amd64"]
And here's my automation script to build and run the service:
#!/usr/bin/env fish
set -x GOOS linux
set -x CGO_ENABLED 0
set -x GOBIN ""
eval (docker-machine env swarm-manager-1)
go get
go build -o accountservice-linux-amd64 .
pushd ./healthchecker
go get
go build -o ../healthchecker-linux-amd64 .
popd
docker build -t azbshiri/accountservice .
docker service rm accountservice
docker service create \
--name accountservice \
--network my_network \
--replicas=1 \
-p 6767:6767 \
-p 6767:6767/udp \
azbshiri/accountservice
And here's the function I call to get the IP:
package common

import "net"

// GetIP returns the first non-loopback IPv4 address found on the host's interfaces.
func GetIP() string {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return "error"
	}
	for _, addr := range addrs {
		// Only consider IP networks that are not the loopback interface.
		ipnet, ok := addr.(*net.IPNet)
		if ok && !ipnet.IP.IsLoopback() {
			if ipnet.IP.To4() != nil {
				return ipnet.IP.String()
			}
		}
	}
	panic("Unable to determine local IP address (non loopback). Exiting.")
}
And I scale the service using the command below:
docker service scale accountservice=3
A few things:
Your results are normal. By default, a Swarm service has a VIP (virtual IP) in front of the service tasks to act as a load balancer. Trying to reach that service from inside the virtual network will only show that IP.
If you want to use a round-robin approach and skip the VIP, you could create the service with --endpoint-mode=dnsrr; that would then return a different service task for each DNS request (but your client might be caching DNS names, causing it to show the same IP, which is why the VIP is usually better).
If you wanted to get a list of IPs for the task replicas, do a dig tasks.<servicename> from inside the service's network (see the sketch after this list).
If you wanted to test something easy, have your service create a random string, or use the hostname, on startup and return that so you can tell the replicas apart when accessing them. An easy example is to run one service using the image elasticsearch:2, which will return JSON on port 9200 with a different random name per container.
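A rough sketch of the dnsrr variant and the task lookup, reusing the service and network names from the question (note that --endpoint-mode dnsrr cannot be combined with ports published through the ingress mesh, so the -p flags are dropped here):
docker service create \
  --name accountservice \
  --network my_network \
  --replicas=3 \
  --endpoint-mode dnsrr \
  azbshiri/accountservice

# from a container attached to my_network (one that has dig installed):
dig tasks.accountservice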
We need to run the "composer" command from outside of the Docker containers' network.
When I specify the orderer and peer host names (e.g. peer0.org1.example.com) in the /etc/hosts file, the "composer" command seems to work.
However, if I specify the server's IP address, it does not work. Here is a sample.
$ composer network list -p hlfv1 -n info-share-bc -i PeerAdmin -s secret
✖ List business network info-share-bc
Error trying to ping. Error: Error trying to query chaincode. Error: Connect Failed
Command succeeded
This is a command example when I specify the host name in /etc/hosts.
$ composer network list -p hlfv1 -n info-share-bc -i PeerAdmin -s secret
✔ List business network info-share-bc
name: info-share-bc
models:
- org.hyperledger.composer.system
- bc.share.info
<snip>
I believe that when the server name cannot be resolved, we should specify the option called "ssl-target-name-override" in the Hyperledger Node.js SDK, as described here:
https://jimthematrix.github.io/Remote.html
- ssl-target-name-override {string} Used in test environment only,
when the server certificate's hostname (in the 'CN' field) does not
match the actual host endpoint that the server process runs at,
the application can work around the client TLS verify failure by
setting this property to the value of the server certificate's hostname
Is there any option to specify the host name in the connection profile (connection.json)?
Found a workaround: the hostnameOverride option in the connection profile resolved the connection issue.
"eventURL": "grpcs://<target-host>:17053",
"hostnameOverride": "peer0.org1.example.com",
In my CI chain I execute end-to-end tests after a "docker-compose up". Unfortunately my tests often fail because, even though the containers are properly started, the programs running inside them are not ready yet.
Is there an elegant way to verify that my setup is completely started before running my tests?
You could poll the required services to confirm they are responding before running the tests.
curl has built-in retry logic, or it's fairly trivial to build retry logic around some other type of service test.
#!/bin/bash
await(){
    local url=${1}
    local seconds=${2:-30}
    curl --max-time 5 --retry 60 --retry-delay 1 \
        --retry-max-time ${seconds} "${url}" \
        || exit 1
}

docker-compose up -d

await http://container_ms1:3000
await http://container_ms2:3000

run-ze-tests
The alternative to polling is an event-based system.
If all your services push notifications to an external service (scaeda gave the example of a log file, or you could use something like Amazon SNS), your services emit a "started" event. You can then subscribe to those events and run whatever you need once everything has started.
Docker 1.12 added the HEALTHCHECK Dockerfile instruction. Maybe this is available via Docker events?
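As a sketch of both approaches, assuming <container_name> is one of your compose containers and its image defines a HEALTHCHECK:
# one-shot: read the current health state of a container
docker inspect --format '{{.State.Health.Status}}' <container_name>

# or stream health transitions (emits events such as "health_status: healthy")
docker events --filter event=health_status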
If you have control over the Docker engine in your CI setup, you could execute docker logs [Container_Name] and read out the last line, which could be emitted by your application.
RESULT=$(docker logs [Container_Name] 2>&1 | grep [Search_String])
logs output example:
Agent pid 13
Enter passphrase (empty for no passphrase): Enter same passphrase again: Identity added: id_rsa (id_rsa)
#host SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6
#host SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6
parse specific line:
RESULT=$(docker logs ssh_jenkins_test 2>&1 | grep Enter)
result:
Enter passphrase (empty for no passphrase): Enter same passphrase again: Identity added: id_rsa (id_rsa)
I was reading a blog post on the Percona Monitoring Plugins and how you can monitor a Galera cluster using the pmp-check-mysql-status plugin. Below is the link to the blog demonstrating that:
https://www.percona.com/blog/2013/10/31/percona-xtradb-cluster-galera-with-percona-monitoring-plugins/
The commands in this tutorial are run on the command line. I wish to try these commands in a Nagios .cfg file, e.g. monitor.cfg. How do I write the services for the commands used in this tutorial?
This was my attempt, and I cannot figure out the best parameters to use for check_command in the service definition. I suspect that is where the problem is.
So, inside my /etc/nagios3/conf.d/monitor.cfg file, I have the following:
define host{
        use             generic-host
        host_name       percona-server
        alias           percona
        address         127.0.0.1
        }
## Check for a Primary Cluster
define command{
        command_name    check_mysql_status
        command_line    /usr/lib/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_status -C == -T str -c non-Primary
        }
define service{
        use                     generic-service
        hostgroup_name          mysql-servers
        service_description     Cluster
        check_command           pmp-check-mysql-status!wsrep_cluster_status!==!str!non-Primary
        }
When I run Nagios and go to monitor it, I get this message on the Nagios dashboard:
status: UNKNOWN; /usr/lib/nagios/plugins/pmp-check-mysql-status: 31: shift: can't shift that many
You verified that:
/usr/lib/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_status -C == -T str -c non-Primary
works fine on the command line on the target host? I suspect there's a shell escaping issue with the ==.
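If the escaping of == is indeed the culprit, here is a minimal sketch of a parameterised command definition that quotes the operator; the command_name is changed to match the check_command used in your service definition, which is an assumption about your intent:
define command{
        command_name    pmp-check-mysql-status
        command_line    /usr/lib/nagios/plugins/pmp-check-mysql-status -x $ARG1$ -C '$ARG2$' -T $ARG3$ -c $ARG4$
        }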
Does this work well for you? /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_flow_control_paused -w 0.1 -c 0.9