Nagios Percona Monitoring Plugin - monitoring

I was reading a blog post on Percona Monitoring Plugins and how you can somehow monitor a Galera cluster using pmp-check-mysql-status plugin. Below is the link to the blog demonstrating that:
https://www.percona.com/blog/2013/10/31/percona-xtradb-cluster-galera-with-percona-monitoring-plugins/
The commands in this tutorial are run on the command line. I wish to try these commands in a Nagios .cfg file e.g, monitor.cfg. How do i write the services for the commands used in this tutorial?
This was my attempt and i cannot figure out what the best parameters to use for check_command on the service. I am suspecting that where the problem is.
So inside my /etc/nagios3/conf.d/monitor.cfg file, i have the following:
define host{
use generic-host
host_name percona-server
alias percona
address 127.0.0.1
}
## Check for a Primary Cluster
define command{
command_name check_mysql_status
command_line /usr/lib/nagios/plugins/pmp-check-
mysql-status -x wsrep_cluster_status -C == -T str -c non-Primary
}
define service{
use generic-service
hostgroup_name mysql-servers
service_description Cluster
check_command pmp-check-mysql-
status!wsrep_cluster_status!==!str!non-Primary
}
When i run the command Nagios and go to monitor it, i get this message in the Nagios dashboard:
status: UNKNOWN; /usr/lib/nagios/plugins/pmp-check-mysql-status: 31:
shift: can't shift that many

You verified that:
/usr/lib/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_status -C == -T str -c non-Primary
works fine on command line on the target host? I suspect there's a shell escape issue with the ==
Does this work well for you? /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_flow_control_paused -w 0.1 -c 0.9

Related

Remote GNU Parallel job gets "/bin/bash: Permission denied"

Having a problem where running a GNU Parallel job in distributed mode (ie. across multiple machines via the --sshloginfile) and finding that even though the job is running on each machine as the same user (or at least dictated that way in the file being given to the --sshloginfile (eg. myuser#myhostname00x)), getting a "Permission denied" error when the job tries to access a file. This occurs despite being able to (passwordless) ssh into the remote nodes in question and ls the files that the Parallel job claims it has no permissions for (the specified path is to a filesystem that is shared and NFS mounted on all the nodes).
Have a list file of nodes like
me#host001
me#host005
me#host006
and the actual Parallel job looks like
bcpexport() {
<do stuff to arg $1 to BCP copy to a MSSQL DB>
}
export -f bcpexport
parallel -q -j 10 --sshloginfile $basedir/src/parallel-nodes.txt --env $bcpexport \
bcpexport {} "$TO_SERVER_ODBCDSN" $DB $TABLE $USER $PASSWORD $RECOMMEDED_IMPORT_MODE $DELIMITER \
::: $DATAFILES/$TARGET_GLOB
where the $DATAFILES/$TARGET_GLOB glob pattern returns files from a directory. Running this job in single node mode works fine, but when running across all the nodes in the parallel-nodes.txt file throws
/bin/bash: line 27: /path/to/file001: Permission denied
/bin/bash: line 27: /path/to/file002: Permission denied
...and so on for all the files...
If anyone knows what could be going on here, advice or debugging suggestions would be appreciated.
I think the problem is the additional $:
parallel [...] --env $bcpexport bcpexport {} [...]
Unless you set the shell variable $bcpexport to something you probably meant bcpexport (no $) instead.
If $bcpexport is undefined, then it will be replace with nothing by the shell. Thus --env will eat the next argument, so you will really be running:
parallel [...] --env bcpexport {} [...]
which will execute {} as a command, which is exactly what you experience.
So try this instead:
parallel [...] --env bcpexport bcpexport {} [...]

How to know if load balancing works in Docker Swarm?

I created a service called accountservice and replicated it 3 times after. In my service I get IP address of the producing service instance and populate it in JSON response. The question is everytime I run curl $manager-ip:6767/accounts/10000 the returned IP is the same as before (I tried 100 times)
manager-ip environment variable:
set -x manager-ip (docker-machine ip swarm-manager-1)
Here's my Dockerfile:
FROM iron/base
EXPOSE 6767
ADD accountservice-linux-amd64 /
ADD healthchecker-linux-amd64 /
HEALTHCHECK --interval=3s --timeout=3s CMD ["./healthchecker-linux-amd64", "-port=6767"] || exit 1
ENTRYPOINT ["./accountservice-linux-amd64"]
And here's my automation script to build and run service:
#!/usr/bin/env fish
set -x GOOS linux
set -x CGO_ENABLED 0
set -x GOBIN ""
eval (docker-machine env swarm-manager-1)
go get
go build -o accountservice-linux-amd64 .
pushd ./healthchecker
go get
go build -o ../healthchecker-linux-amd64 .
popd
docker build -t azbshiri/accountservice .
docker service rm accountservice
docker service create \
--name accountservice \
--network my_network \
--replicas=1 \
-p 6767:6767 \
-p 6767:6767/udp \
azbshiri/accountservice
And here's the function I call to get the IP:
package common
import "net"
func GetIP() string {
addrs, err := net.InterfaceAddrs()
if err != nil {
return "error"
}
for _, addr := range addrs {
ipnet, ok := addr.(*net.IPNet)
if ok && !ipnet.IP.IsLoopback() {
if ipnet.IP.To4() != nil {
return ipnet.IP.String()
}
}
}
panic("Unable to determine local IP address (non loopback). Exiting.")
}
And I scale the service using the command below:
docker service scale accountservice=3
A few things:
Your results are normal. By default, a Swarm service has a VIP (virtual IP) in front of the service tasks to act as a load balancer. Trying to reach that service from inside the virtual network will only show that IP.
If you want to use a round-robin approach and skip the VIP, you could create a service with --endpoint-mode=dnsrr that would then return a different service task for each DNS request (but your client might be caching DNS names, causing that to show the same IP, which is why VIP is usually better).
If you wanted to get a list of IP's for task replicas, do a dig tasks.<servicename> inside the service's network.
If you wanted to test something easy, have your service create a random string, or use hostname on startup and return that so you can tell the different replicas when accessing. A easy example is to run one service using image elasticsearch:2 which will return JSON on port 9200 with a different random name per container.

How to know if my program is completely started inside my docker with compose

In my CI chain I execute end-to-end tests after a "docker-compose up". Unfortunately my tests often fail because even if the containers are properly started, the programs contained in my containers are not.
Is there an elegant way to verify that my setup is completely started before running my tests ?
You could poll the required services to confirm they are responding before running the tests.
curl has inbuilt retry logic or it's fairly trivial to build retry logic around some other type of service test.
#!/bin/bash
await(){
local url=${1}
local seconds=${2:-30}
curl --max-time 5 --retry 60 --retry-delay 1 \
--retry-max-time ${seconds} "${url}" \
|| exit 1
}
docker-compose up -d
await http://container_ms1:3000
await http://container_ms2:3000
run-ze-tests
The alternate to polling is an event based system.
If all your services push notifications to an external service, scaeda gave the example of a log file or you could use something like Amazon SNS. Your services emit a "started" event. Then you can subscribe to those events and run whatever you need once everything has started.
Docker 1.12 did add the HEALTHCHECK build command. Maybe this is available via Docker Events?
If you have control over the docker engine in your CI setup you could execute docker logs [Container_Name] and read out the last line which could be emitted by your application.
RESULT=$(docker logs [Container_Name] 2>&1 | grep [Search_String])
logs output example:
Agent pid 13
Enter passphrase (empty for no passphrase): Enter same passphrase again: Identity added: id_rsa (id_rsa)
#host SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6
#host SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6
parse specific line:
RESULT=$(docker logs ssh_jenkins_test 2>&1 | grep Enter)
result:
Enter passphrase (empty for no passphrase): Enter same passphrase again: Identity added: id_rsa (id_rsa)

check_disk not generating alerts: nagios

I am new to nagios.
I am trying to configure the "check_disk" service for one host but I am not getting the expected results.
I should get the emails when when disk usage goes beyond 80%.
So, There is already service defined for this task with multiple hosts, as below:
define service{
use local-service ; Name of service template to use
host_name localhost, host1, host2, host3, host4, host5, host6
service_description Root Partition
check_command check_local_disk!20%!10%!/
contact_groups unix-admins,db-admins
}
The Issue:
Further I tried to test single host i.e. "host2". The current usage of host2 is as follow:
# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rootvg-rootvol 94G 45G 45G 50% /
So to get instant emails, I written another service as below, where warning set to <60% and critical set to <40%.
define service{
use local-service
host_name host2
service_description Root Partition again
check_command check_local_disk!60%!40%!/
contact_groups dev-admins
}
But still I am not receive any emails for the same.
Where it going wrong.
The "check_local_disk" command is defined as below:
define command{
command_name check_local_disk
command_line $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
}
Your command definition currently is setup to only check your Nagios server's disk, not the remote hosts (such as host2). You need to define a new command definition to execute check_disk on the remote host via NRPE (Nagios Remote Plugin Execution).
On Nagios server, define the following:
define command {
command_name check_remote_disk
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_disk -a $ARG1$ $ARG2$ $ARG3$
register 1
}
define service{
use genric-service
host_name host1, host2, host3, host4, host5, host6
service_description Root Partition
check_command check_remote_disk!20%!10%!/
contact_groups unix-admins,db-admins
}
Restart the Nagios service.
On the remote host:
Ensure you have NRPE plugin installed.
Instructions for Ubuntu: http://tecadmin.net/install-nrpe-on-ubuntu/
Instructions for CentOS / RHEL: http://sharadchhetri.com/2013/03/02/how-to-install-and-configure-nagios-nrpe-in-centos-and-red-hat/
Ensure there is a command defined for check_disk on the remote host. This is usually included in nrpe.cfg, but commented-out. You'd have to un-comment the line.
Ensure you have the check_disk plugin installed on the remote host. Mine is located at: /usr/lib64/nagios/plugins/check_disk
Ensure that allowed_hosts field of nrpe.cfg includes the IP address / hostname of your Nagios server.
Ensure that dont_blame_nrpe field of nrpe.cfg is set to 1 to allow command line arguments to NRPE commands: dont_blame_nrpe=1
If you made any changes, restart the nrpe service.

Distributed RabbitMQ Nodes don't recognize each other

I'm working on a RabbitMQ distributed POC and I'm stuck at the basics of clustering the nodes.
I'm trying to follow the rabbit's tutorial on clustering so this is my reference.
After installing erlang (R14B04) and rabbit (2.8.2-1) I've copied the .erlang.cookie file contents from one node to the other two.
I wasn't sure about how to get erlang to notice this change to I had to restart the machines themselves (pretty brute force but I don't know erlang at all).
In addtion I opened in iptables 4369 and 5 additional ports for communications and placed under
/usr/lib64/erlang/bin/sys.config the following config:
{kernel,[{inet_dist_listen_min, XX00},{inet_dist_listen_max,XX05}]}]
Then another restart (dumb I know) to verify erlang takes these into consideration but still when I run:
rabbitmqctl cluster rabbit#HostName1
I get:
Clustering node rabbit#HostName2 with [rabbit#HostName1] ...
Error: {no_running_cluster_nodes,[rabbit#HostName1],
[rabbit#HostName1]}
There is a chance my fiddling with the erlang.cookie or with the ports did not succeed but I don't know how to check them. I tried typing erl in the cmd and then erl_epmd:names() or other commands to get more information but I'm probably way off in erlang land.
Would truly appreciate any help
Update:
I tried pinging two erlang nodes manually and got pang back.
I did the following:
Connected to two nodes, stopped rabbitmq (wasn't sure if needed but to be sure), started erlang like so (erl -sname dilbert and erl -sname dilbert2) when the erlang command line started i ran node(). on each of them and got dilbert#HostName1 and dilbert2#HostName2 respectively. I then tried to run net_adm:ping('dilbert'). and net_adm:ping('dilbert#HostName1'). with the single quote and without them from both nodes (changed names of course) and got on all 8 cases pang.
When I ran nodes(). on one of the machines I got back an empty array.
I've also tried to allow all traffic in the firewall (script) and then try to run the above commands (don't worry they're back on now) and still got back pang.
Update2:
For some reason I had cookies mismatch which I needed to resolve (thanks #kjw0188 for the suggestion [I ran erlang:get_cookie(). in the erlang command line]).
This did not help and I needed to stop iptables completely (not sure why but I'll figure it soon) and load the erlang node with -name dilbert#my-ip because my rackspace servers have no dns-name. This finally enabled me to get a pong and see the nodes see each other (nodes(). returns a non-empty array after the ping).
The problem I'm facing now is how to instruct RabbitMQ to use -name instead of -sname when starting erlang.
So I had multiple issues with connecting my two RabbitMQ nodes-
I'll add that my nodes are hosted on rackspace, and so don't have a default exposable hostname, and require iptables since there is no DMZ or built in security group concept like amazon.
Problems:
1. Cookie- Not sure how or why but I had multiple instances of .erlang.cookie (in /root, in my home directory and in /var/lib/rabbitmq/) I kept only the one in rabbitmq and verified all nodes have the same cookie.
2. IPTables- In order for the nodes to communicate I needed to open the epmd port and the range of ports for the actual communication inet_dist_listen_min inet_dist_listen_max.
/sbin/iptables -A INPUT -i eth1 -p tcp --dport ${epmd} -s ${otherNode} -j ACCEPT
/sbin/iptables -A INPUT -i eth1 -p tcp --dport ${inet_dist_listen_min}:${inet_dist_listen_max} -s ${otherNode} -j ACCEPT
empd is the usuall 4369 port and for the other range use whatever range you want.
${otherNode} is the ip of my other node.
I also needed to configure erlang through rabbitmq to use these ports (see config file at end)
3. HostName- Seeing as I don't have a hostname I needed to edit the rabbit scripts to use -name and not -sname (the first tells erlang to take the whole name, the latter stands for short name and thus appends an # symbol and the hostname).
This was accomplished by editing:
/usr/lib/rabbitmq/bin/rabbitmqctl
Added at the beginning the definition of the RABBITMQ_NODE_IP_ADDRESS property
DEFAULT_NODE_IP_ADDRESS=auto
DEFAULT_NODE_PORT=5672
[ "x" = "x$RABBITMQ_NODE_IP_ADDRESS" ] && RABBITMQ_NODE_IP_ADDRESS=${NODE_IP_ADDRESS}
[ "x" = "x$RABBITMQ_NODE_PORT" ] && RABBITMQ_NODE_PORT=${NODE_PORT}
[ "x" = "x$RABBITMQ_NODE_IP_ADDRESS" ] && [ "x" != "x$RABBITMQ_NODE_PORT" ] && RABBITMQ_NODE_IP_ADDRESS=${DEFAULT_NODE_IP_ADDRESS}
[ "x" != "x$RABBITMQ_NODE_IP_ADDRESS" ] && [ "x" = "x$RABBITMQ_NODE_PORT" ] && RABBITMQ_NODE_PORT=${DEFAULT_NODE_PORT}
and in the actual erl command I changed
-sname ${RABBITMQ_NODENAME} \ to
-name ${RABBITMQ_NODENAME}#${RABBITMQ_NODE_IP_ADDRESS}\.
This made rabbitmq listen only on the specified ip address (specified in the config file at the end) and load with that ip instead of the usuall hostname.
edited /usr/lib/rabbitmq/bin/rabbitmq-server
Changed the actual erl command from -sname ${RABBITMQ_NODENAME} \ to -name ${RABBITMQ_NODENAME}#${RABBITMQ_NODE_IP_ADDRESS}\
Added a rabbit conf (/etc/rabbitmq/rabbitmq-env.conf) file with-
#the ip address which rabbit should use, this is to limit rabbit to only use internal rackspace communication and not publicly accessible ports
NODE_IP_ADDRESS=myIpAdress
#had to change the nodename becaue otherwise rabbitmq used rabbit#Hostname and not only rabbit
NODENAME=myCompany
#This instructed rabbit to instruct erlang which ports it should use for its communications with other nodes
export SERVER_ERL_ARGS="$SERVER_ERL_ARGS -kernel inet_dist_listen_min somePort -kernel inet_dist_listen_max someOtherBiggerPort"
Some resources which helped me along the way:
RabbitMQ Clustering Guide
Clustering RabbitMQ servers for High Availability
rabbitmq-env.conf(5) manual page
Node communication by public IP address erlang mailing list (The middle post)
Configuring RabbitMQ Cluster on Cloud
Hope this will help anyone else.
EDIT:
Not sure how I was mistaken but it seemed my erlang-rabbit port instructions were not taken into consideration or were not enough. Ended up having to allow all communications between the two nodes...
One thing to really watch out for is whitespace of any kind in the erlang cookie file, especially line breaks AFTER the contents of the cookie. So long as both are identical, things are okay, but when one has a line break and the other doesn't, thing won't work.
Background: I was facing the same issue while setting up Rabbitmq cluster. I was using 2 docker containers running on my host-machine, which is equivalent to 2 separate nodes and I could not create a cluster of these two.
Solution: 1. Make sure you have same erlang cookie on all your cluster nodes, the default location is /var/lib/rabbitmq/.erlang.cookie. This file is used for authentication, so make sure, you have it same on all the nodes. After changing the .erlang.cookie restart your rabbitmq service.
Make sure that nodes are accessible from one other, use ping or telnet to check the connection.
Check that /etc/hosts have correct entries, for example if rabbit2 wants to join cluster rabbit1, /etc/hosts of rabbit2 should contain.
172.68.1.6 rabbit1
172.68.1.7 rabbit2
Now stop service using $rabbitmqctl stop_app followed by $rabbitmqctl join_cluster rabbit#rabbit1, start your service by rabbitmqctl start_app and check $rabbitmqctl cluster_status to see weather you have joined the cluster or not.
I followed the rabbitmq official documentation to setup the cluster.
to change RabbitMQ sname/name behaviour you can edit the scripts:
rabbitmq-multi
rabbitmq-server
rabbitmqctl
Example
In script rabbitmqctl there is the following piece of code:
exec erl \
-pa "${RABBITMQ_HOME}/ebin" \
-noinput \
-hidden \
${RABBITMQ_CTL_ERL_ARGS} \
-sname rabbitmqctl$$ \
-s rabbit_control \
-nodename $RABBITMQ_NODENAME \
-extra "$#"
You have to change it in:
exec erl \
-pa "${RABBITMQ_HOME}/ebin" \
-noinput \
-hidden \
${RABBITMQ_CTL_ERL_ARGS} \
-name rabbitmqctl$$ \
-s rabbit_control \
-nodename $RABBITMQ_NODENAME \
-extra "$#"
http://pearlin.info/?p=1672
so you need to copy the cookie from the node you trying to connect
example :- rabbit#node1
rabbit#node2
go to rabbit#node1 and copy the cookie from cat /var/lib/rabbitmq/.erlang.cookie
go to rabbit#node2 remove the current cookie and paste the new one.
on same node
/usr/sbin/rabbitmqctl stop_app
/usr/sbin/rabbitmqctl reset
/usr/sbin/rabbitmqctl cluster rabbit#node1
should do it.
same documented here.
http://pearlin.info/?p=1672

Resources