Distributed RabbitMQ Nodes don't recognize each other - erlang

I'm working on a RabbitMQ distributed POC and I'm stuck at the basics of clustering the nodes.
I'm trying to follow the rabbit's tutorial on clustering so this is my reference.
After installing erlang (R14B04) and rabbit (2.8.2-1) I've copied the .erlang.cookie file contents from one node to the other two.
I wasn't sure how to get erlang to notice this change, so I had to restart the machines themselves (pretty brute force, but I don't know erlang at all).
In addition I opened port 4369 plus 5 additional ports for communication in iptables, and placed under
/usr/lib64/erlang/bin/sys.config the following config:
[{kernel,[{inet_dist_listen_min, XX00},{inet_dist_listen_max, XX05}]}].
Then another restart (dumb, I know) to verify erlang takes these into consideration, but still when I run:
rabbitmqctl cluster rabbit@HostName1
I get:
Clustering node rabbit@HostName2 with [rabbit@HostName1] ...
Error: {no_running_cluster_nodes,[rabbit@HostName1],
                                 [rabbit@HostName1]}
There is a chance my fiddling with the erlang.cookie or with the ports did not succeed, but I don't know how to check them. I tried typing erl in the cmd and then erl_epmd:names() or other commands to get more information, but I'm probably way off in erlang land.
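For reference, this is roughly the kind of check I was attempting (the cookie value and port in the output are made up):

erl -sname test
(test@HostName1)1> erlang:get_cookie().
'SOMECOOKIE'
(test@HostName1)2> net_adm:names().
{ok,[{"rabbit",25672}]}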
Would truly appreciate any help
Update:
I tried pinging two erlang nodes manually and got pang back.
I did the following:
Connected to two nodes, stopped rabbitmq (wasn't sure if needed, but just to be sure), and started erlang on each (erl -sname dilbert and erl -sname dilbert2). When the erlang command line started I ran node(). on each of them and got dilbert@HostName1 and dilbert2@HostName2 respectively. I then tried to run net_adm:ping('dilbert'). and net_adm:ping('dilbert@HostName1'). with the single quotes and without them, from both nodes (changing names accordingly), and in all 8 cases got pang.
When I ran nodes(). on one of the machines I got back an empty array.
I've also tried to allow all traffic in the firewall (script) and then try to run the above commands (don't worry they're back on now) and still got back pang.
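Concretely, the failing session looked roughly like this (hostnames are placeholders):

erl -sname dilbert      # on HostName1
erl -sname dilbert2     # on HostName2
(dilbert2@HostName2)1> net_adm:ping('dilbert@HostName1').
pang
(dilbert2@HostName2)2> nodes().
[]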
Update2:
For some reason I had a cookie mismatch which I needed to resolve (thanks @kjw0188 for the suggestion [I ran erlang:get_cookie(). in the erlang command line]).
This did not help, and I needed to stop iptables completely (not sure why, but I'll figure it out soon) and load the erlang node with -name dilbert@my-ip, because my rackspace servers have no DNS name. This finally enabled me to get a pong and see the nodes see each other (nodes(). returns a non-empty list after the ping).
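The shape of the working invocation, with placeholder IPs standing in for my real ones:

erl -name dilbert@10.0.0.1      # on node 1
erl -name dilbert2@10.0.0.2     # on node 2
(dilbert2@10.0.0.2)1> net_adm:ping('dilbert@10.0.0.1').
pong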
The problem I'm facing now is how to instruct RabbitMQ to use -name instead of -sname when starting erlang.

So I had multiple issues with connecting my two RabbitMQ nodes:
I'll add that my nodes are hosted on rackspace, and so don't have a default resolvable hostname, and require iptables since there is no DMZ or built-in security group concept like amazon has.
Problems:
1. Cookie- Not sure how or why, but I had multiple instances of .erlang.cookie (in /root, in my home directory and in /var/lib/rabbitmq/). I kept only the one in /var/lib/rabbitmq/ and verified all nodes have the same cookie.
2. IPTables- In order for the nodes to communicate I needed to open the epmd port and the range of ports for the actual communication (inet_dist_listen_min to inet_dist_listen_max):
/sbin/iptables -A INPUT -i eth1 -p tcp --dport ${epmd} -s ${otherNode} -j ACCEPT
/sbin/iptables -A INPUT -i eth1 -p tcp --dport ${inet_dist_listen_min}:${inet_dist_listen_max} -s ${otherNode} -j ACCEPT
epmd is the usual port 4369; for the other range use whatever range you want.
${otherNode} is the IP of my other node.
I also needed to configure erlang through rabbitmq to use these ports (see the config file at the end).
3. HostName- Seeing as I don't have a hostname, I needed to edit the rabbit scripts to use -name and not -sname (the former tells erlang to take the whole name; the latter stands for short name and thus appends an @ symbol and the hostname).
This was accomplished by editing:
/usr/lib/rabbitmq/bin/rabbitmqctl
Added the definition of the RABBITMQ_NODE_IP_ADDRESS property at the beginning:
DEFAULT_NODE_IP_ADDRESS=auto
DEFAULT_NODE_PORT=5672
[ "x" = "x$RABBITMQ_NODE_IP_ADDRESS" ] && RABBITMQ_NODE_IP_ADDRESS=${NODE_IP_ADDRESS}
[ "x" = "x$RABBITMQ_NODE_PORT" ] && RABBITMQ_NODE_PORT=${NODE_PORT}
[ "x" = "x$RABBITMQ_NODE_IP_ADDRESS" ] && [ "x" != "x$RABBITMQ_NODE_PORT" ] && RABBITMQ_NODE_IP_ADDRESS=${DEFAULT_NODE_IP_ADDRESS}
[ "x" != "x$RABBITMQ_NODE_IP_ADDRESS" ] && [ "x" = "x$RABBITMQ_NODE_PORT" ] && RABBITMQ_NODE_PORT=${DEFAULT_NODE_PORT}
and in the actual erl command I changed
-sname ${RABBITMQ_NODENAME} \
to
-name ${RABBITMQ_NODENAME}@${RABBITMQ_NODE_IP_ADDRESS} \
This made rabbitmq listen only on the specified ip address (specified in the config file at the end) and load with that ip instead of the usual hostname.
edited /usr/lib/rabbitmq/bin/rabbitmq-server
Changed the actual erl command from -sname ${RABBITMQ_NODENAME} \ to -name ${RABBITMQ_NODENAME}@${RABBITMQ_NODE_IP_ADDRESS} \
Added a rabbit conf (/etc/rabbitmq/rabbitmq-env.conf) file with-
#the ip address which rabbit should use; this is to limit rabbit to only use internal rackspace communication and not publicly accessible ports
NODE_IP_ADDRESS=myIpAddress
#had to change the nodename because otherwise rabbitmq used rabbit@Hostname and not only rabbit
NODENAME=myCompany
#This instructs rabbit to tell erlang which ports it should use for its communications with other nodes
export SERVER_ERL_ARGS="$SERVER_ERL_ARGS -kernel inet_dist_listen_min somePort -kernel inet_dist_listen_max someOtherBiggerPort"
Some resources which helped me along the way:
RabbitMQ Clustering Guide
Clustering RabbitMQ servers for High Availability
rabbitmq-env.conf(5) manual page
Node communication by public IP address erlang mailing list (The middle post)
Configuring RabbitMQ Cluster on Cloud
Hope this will help anyone else.
EDIT:
Not sure how I was mistaken, but it seemed my erlang-rabbit port instructions were not taken into consideration or were not enough. I ended up having to allow all communication between the two nodes...

One thing to really watch out for is whitespace of any kind in the erlang cookie file, especially line breaks AFTER the contents of the cookie. As long as both are identical, things are okay, but when one has a line break and the other doesn't, things won't work.
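A quick way to catch stray whitespace is to compare byte counts and checksums rather than eyeballing the files (assuming the default cookie location):

wc -c /var/lib/rabbitmq/.erlang.cookie
md5sum /var/lib/rabbitmq/.erlang.cookie

Identical byte counts and matching checksums on every node mean the cookies really are byte-for-byte identical.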

Background: I was facing the same issue while setting up a Rabbitmq cluster. I was using 2 docker containers running on my host machine, which is equivalent to 2 separate nodes, and I could not create a cluster of the two.
Solution: 1. Make sure you have the same erlang cookie on all your cluster nodes; the default location is /var/lib/rabbitmq/.erlang.cookie. This file is used for authentication, so make sure it is the same on all the nodes. After changing the .erlang.cookie, restart your rabbitmq service.
2. Make sure the nodes are accessible from one another; use ping or telnet to check the connection.
3. Check that /etc/hosts has correct entries; for example, if rabbit2 wants to join cluster rabbit1, /etc/hosts of rabbit2 should contain:
172.68.1.6 rabbit1
172.68.1.7 rabbit2
4. Now stop the app using rabbitmqctl stop_app, join the cluster with rabbitmqctl join_cluster rabbit@rabbit1, start it again with rabbitmqctl start_app, and check rabbitmqctl cluster_status to see whether you have joined the cluster or not (see the sequence below).
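Laid out as a plain command sequence on rabbit2, that is:

rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@rabbit1
rabbitmqctl start_app
rabbitmqctl cluster_status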
I followed the rabbitmq official documentation to setup the cluster.

To change the RabbitMQ sname/name behaviour you can edit the scripts:
rabbitmq-multi
rabbitmq-server
rabbitmqctl
Example
In script rabbitmqctl there is the following piece of code:
exec erl \
    -pa "${RABBITMQ_HOME}/ebin" \
    -noinput \
    -hidden \
    ${RABBITMQ_CTL_ERL_ARGS} \
    -sname rabbitmqctl$$ \
    -s rabbit_control \
    -nodename $RABBITMQ_NODENAME \
    -extra "$@"
You have to change it to:
exec erl \
    -pa "${RABBITMQ_HOME}/ebin" \
    -noinput \
    -hidden \
    ${RABBITMQ_CTL_ERL_ARGS} \
    -name rabbitmqctl$$ \
    -s rabbit_control \
    -nodename $RABBITMQ_NODENAME \
    -extra "$@"

You need to copy the cookie from the node you're trying to connect to.
Example: rabbit@node1 and rabbit@node2.
Go to rabbit@node1 and copy the cookie from cat /var/lib/rabbitmq/.erlang.cookie.
Go to rabbit@node2, remove the current cookie and paste the new one.
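For example, one way to do the copy (the cookie value here is hypothetical, and echo -n matters because a trailing newline will break authentication):

# on rabbit@node1
cat /var/lib/rabbitmq/.erlang.cookie
# on rabbit@node2
echo -n 'HYPOTHETICALCOOKIE' > /var/lib/rabbitmq/.erlang.cookie
chmod 400 /var/lib/rabbitmq/.erlang.cookie
chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie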
Then, on the same node, run:
/usr/sbin/rabbitmqctl stop_app
/usr/sbin/rabbitmqctl reset
/usr/sbin/rabbitmqctl cluster rabbit@node1
That should do it.
The same is documented here: http://pearlin.info/?p=1672

Related

Hyperledger sawtooth with docker (Test network tutorial). Connectivity problem between the nodes of the network

I am trying to set up a sawtooth network as in the following tutorial.
I use the following docker-compose.yaml file, as instructed in the tutorial, to create a sawtooth network of 5 nodes using the pbft consensus engine.
The problem is that once I try to check whether peering has occurred on the network by submitting a peers query to the REST API on the first node from the shell container, I get a connection refused answer:
curl: (7) Failed to connect to sawtooth-rest-api-default-0 port 8008: Connection refused
Connectivity among the containers seems to be working fine (I have checked with ping from inside the containers).
I suspect that the problem stems from the following line of the docker-compose.yaml file:
sawtooth-validator -vv \
--endpoint tcp://validator-0:8800 \
--bind component:tcp://eth0:4004 \
--bind consensus:tcp://eth0:5050 \
--bind network:tcp://eth0:8800 \
--scheduler parallel \
--peering static \
--maximum-peer-connectivity 10000
and more specifically the --bind option. I noticed that eth0 is not resolved properly to the IP of the container network, but instead to the loopback:
[screenshot: terminal output for validator 0]
Do you believe that this could be the problem or is there something else I might have overlooked?
Thank you
Looks like the moment I post something here, the answer magically reveals itself.
The backslash characters were not interpreted correctly, so the --bind options were not taken into account and the default is the loopback.
What I did to fix it is either put the whole command on the same line (as shown below) or use a double backslash.
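i.e., collapsed onto one logical line, the entrypoint becomes (same flags as above):

sawtooth-validator -vv --endpoint tcp://validator-0:8800 --bind component:tcp://eth0:4004 --bind consensus:tcp://eth0:5050 --bind network:tcp://eth0:8800 --scheduler parallel --peering static --maximum-peer-connectivity 10000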

Docker swarm: guarantee high availability after restart

I have an issue using Docker swarm.
I have 3 replicas of a Python web service running on Gunicorn.
The issue is that when I restart the swarm service after a software update, an old running service is killed, then a new one is created and started. But in the short period when the old service is already killed and the new one hasn't fully started yet, network messages are already routed to the new instance, resulting in 502 bad gateway errors (I proxy to the service from nginx).
I use --update-parallelism 1 --update-delay 10s options, but this doesn't eliminate the issue, only slightly reduces chances of getting the 502 error (because there are always at least 2 services running, even if one of them might be still starting up).
So, following what I've proposed in comments:
Use the HEALTHCHECK feature of Dockerfile: Docs. Something like:
HEALTHCHECK --interval=5m --timeout=3s \
CMD curl -f http://localhost/ || exit 1
Knowing that Docker Swarm honors this healthcheck during service updates, it's relatively easy to have a zero-downtime deployment.
But as you mentioned, you have a resource-intensive health check and need larger healthcheck intervals.
In that case, I recommend you customize your healthcheck to do the first run immediately and the real checks whenever current_minute % 5 == 0, with the healthcheck command itself running every 30s:
HEALTHCHECK --interval=30s --timeout=3s \
CMD /service_healthcheck.sh
healthcheck.sh
#!/bin/bash
CURRENT_MINUTE=$(date +%M)
INTERVAL_MINUTE=5
do_healthcheck() {
    curl -f http://localhost/ || exit 1
}
# Always run the check on the very first invocation
if [ ! -f /tmp/healthcheck.first.run ]; then
    do_healthcheck
    touch /tmp/healthcheck.first.run
    exit 0
fi
# Run only each minute that is a multiple of $INTERVAL_MINUTE
# (10# forces base-10 so minutes like "08" aren't parsed as octal)
[ $((10#$CURRENT_MINUTE % INTERVAL_MINUTE)) -eq 0 ] && do_healthcheck
exit 0
Remember to COPY the healthcheck.sh to /service_healthcheck.sh (and chmod +x it), so the path matches the CMD above.
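For reference, the corresponding Dockerfile lines might look like this (a sketch, assuming healthcheck.sh sits next to the Dockerfile):

COPY healthcheck.sh /service_healthcheck.sh
RUN chmod +x /service_healthcheck.sh
HEALTHCHECK --interval=30s --timeout=3s \
CMD /service_healthcheck.sh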
There are some known issues (e.g. moby/moby #30321) with rolling upgrades in docker swarm with the current 17.05 and earlier releases (and doesn't look like all the fixes will make 17.06). These issues will result in connection errors during a rolling upgrade like you're seeing.
If you have a true zero downtime deployment requirement and can't solve this with a client side retry, then I'd recommend putting in some kind of blue/green switch in front of your swarm and do the rolling upgrade to the non-active set of containers until docker finds solutions to all of the scenarios.

Nagios Percona Monitoring Plugin

I was reading a blog post on Percona Monitoring Plugins and how you can somehow monitor a Galera cluster using pmp-check-mysql-status plugin. Below is the link to the blog demonstrating that:
https://www.percona.com/blog/2013/10/31/percona-xtradb-cluster-galera-with-percona-monitoring-plugins/
The commands in this tutorial are run on the command line. I wish to try these commands in a Nagios .cfg file, e.g. monitor.cfg. How do I write the services for the commands used in this tutorial?
This was my attempt, and I cannot figure out the best parameters to use for check_command on the service. I suspect that's where the problem is.
So inside my /etc/nagios3/conf.d/monitor.cfg file, I have the following:
define host{
    use             generic-host
    host_name       percona-server
    alias           percona
    address         127.0.0.1
}
## Check for a Primary Cluster
define command{
    command_name    check_mysql_status
    command_line    /usr/lib/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_status -C == -T str -c non-Primary
}
define service{
    use                 generic-service
    hostgroup_name      mysql-servers
    service_description Cluster
    check_command       pmp-check-mysql-status!wsrep_cluster_status!==!str!non-Primary
}
When I run Nagios and go to monitor it, I get this message in the Nagios dashboard:
status: UNKNOWN; /usr/lib/nagios/plugins/pmp-check-mysql-status: 31: shift: can't shift that many
You verified that
/usr/lib/nagios/plugins/pmp-check-mysql-status -x wsrep_cluster_status -C == -T str -c non-Primary
works fine on the command line on the target host? I suspect there's a shell escape issue with the ==.
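One more thing to check, as a guess rather than a verified fix: the service's check_command starts with pmp-check-mysql-status, but the command block above defines command_name check_mysql_status, and those two names must match. An untested sketch that lines the names up, quotes the ==, and passes the values through Nagios argument macros:

define command{
    command_name    pmp-check-mysql-status
    command_line    /usr/lib/nagios/plugins/pmp-check-mysql-status -x $ARG1$ -C '$ARG2$' -T $ARG3$ -c $ARG4$
}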
Does this work well for you? /usr/lib64/nagios/plugins/pmp-check-mysql-status -x wsrep_flow_control_paused -w 0.1 -c 0.9

iperf, sctp command not recognized in command-promt

I'm using iperf3, which is supposedly a rewritten version of iperf. The reason I'm using it is that I love iperf for TCP and UDP throughput, and I now want to test SCTP throughput between my endpoints.
However, when I try to use the --sctp option that I've seen people using, it says the option is not recognized. Is it the implementation I'm using that has not implemented this option?
https://github.com/esnet/iperf
This is the implementation I'm using; I can't find any obvious documentation of the SCTP options related to it. Most SCTP iperf implementations are added manually in the tests, and the source code is often not provided.
Any help would be appreciated!
Get a copy of iperf which supports the lksctp module of the linux kernel. Install it using the standard process. (If it fails, please let us know the error message, the operating system, and the kernel details.) Now, to use SCTP in iperf, these are the proper syntaxes.
For creating an SCTP server,
iperf -z -s
(-z is for selecting the SCTP protocol and -s is for server.)
For creating an SCTP client,
iperf -z -c <host address> -t <time duration for the connection in second>s -i <interval of the time to print the bandwidth in terminal in second>s
(-z for SCTP, -c is for client. Host address should be the ip address of the server where iperf -z -s is already running. -t is to specify the communication time duration. -i is to specify the interval to show the bandwidth.)
Example:
iperf -z -c 0.0.0.0 -t 10s -i 2s
Here the communication time is 10 seconds and it'll report the bandwidth for each 2 seconds interval.
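For completeness: iperf3 (the esnet rewrite mentioned in the question) uses an --sctp flag instead of -z, but only when it has been built with SCTP support. In that case the equivalent commands would be (address and durations illustrative):

iperf3 -s
iperf3 -c 192.0.2.1 --sctp -t 10 -i 2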
P.S.
(1) To use iperf for SCTP, you must enable the SCTP module in the kernel and recompile it. The kernel version must be 2.6 or above. Check it using uname -a or uname -r. If you have a lower one, then download a new kernel from The Linux Kernel Archives. And compile it by enabling SCTP.
First check if it is already enabled or not by running these two commands in the terminal.
modprobe sctp
lsmod | grep sctp
If you get any output then SCTP is already enabled.
(2) If iperf with -z still fails, try the following. Say the two machines are 'A' and 'B'.
First make 'A' the server and 'B' the client. It won't succeed, so exit using `ctrl + z` and kill iperf using `pkill -9 iperf`.
Then make 'B' the server and 'A' the client. It may succeed. If it fails again, kill iperf using the above command and repeat step 1; it might succeed this time.
(The 2nd solution works for me with fedora 20 and kernel 2.6 and above.)
I couldn't find any recent answers through googling, so I thought I would leave an answer here for those looking to install iperf3 with SCTP support on RHEL / CentOS.
You'll need to install lksctp-tools-devel first and build iperf3 from source to enable SCTP support; installing iperf3 3.17 via yum did not enable SCTP for me, even with lksctp-tools-devel present.
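A sketch of the build, assuming git and a C toolchain are available (iperf3's configure picks up SCTP when lksctp-tools-devel is installed):

sudo yum install -y lksctp-tools-devel gcc make git
git clone https://github.com/esnet/iperf.git
cd iperf
./configure
make
sudo make install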

Inform me when site (server) is online again

When I ping one site it returns "Request timed out". I want to make a little program that will inform me (a beep or something like that) when this server is online again. It doesn't matter in which language. I think it should be a very simple script with a few lines of code. So how do I write it?
Some implementations of ping allow you to specify conditions for exiting after receipt of packets:
On Mac OS X, use ping -a -o $the_host
ping will keep trying (by default)
-a means beep when a packet is received
-o means exit when a packet is received
On Linux (Ubuntu at least), use ping -a -c 1 -w inf $the_host
-a means beep when a packet is received
-c 1 specifies the number of packets to send before exit (in this case 1)
-w inf specifies the deadline for when ping exits no matter what (in this case Infinite)
when -c and -w are used together, -c becomes number of packets received before exit
Either can be chained to perform your next command, e.g. to ssh into the server as soon as it comes up (with a gap between to allow sshd to actually start up):
# ping -a -o $the_host && sleep 3 && ssh $the_host
Don't forget the notify sound, like echo "^G"! Just to be different, here's Windows batch:
C:\> more pingnotify.bat
:AGAIN
ping -n 1 %1
IF ERRORLEVEL 1 GOTO AGAIN
sndrec32 /play /close "C:\Windows\Media\Notify.wav"
C:\> pingnotify.bat localhost
:)
One way is to run ping in a loop, e.g.
while ! ping -c 1 host; do sleep 1; done
(You can redirect the output to /dev/null if you want to keep it quiet.)
On some systems, such as Mac OS X, ping may also have the -a -o options (as per another answer) available, which will cause it to keep pinging until a response is received. However, the ping on many (most?) Linux systems does not have the -o option, and the roughly equivalent -c 1 -w 0 still exits if the network returns an error.
Edit: If the host does not respond to ping or you need to check the availability of service on a certain port, you can use netcat in the zero I/O mode:
while ! nc -w 5 -z host port; do sleep 1; done
The -w 5 specifies a 5 second timeout for each individual attempt. Note that with netcat you can even list multiple ports (or port ranges) to scan until one of them becomes available.
Edit 2: The loops shown above keep trying until the host (or port) is reached. Add your alert command after them, e.g. beep or pop-up a window.
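For example, combining the netcat loop with a terminal bell (host and port are placeholders):

while ! nc -w 5 -z the-host 22; do sleep 1; done
printf '\a'
echo "the-host is reachable again"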
