Distributed Erlang - network split recovery and using heart with distributed applications - network-programming

I have a standard situation, two distributed Erlang nodes, one master one standby.
When I stop the master, the standby comes on (failover); when I start the master again, the standby stops (takeover). Everything works fine as long as heart is not turned on and there is no network split.
However, when I disconnect the master from the network, after 60 seconds or so the standby gives me the error message ** removing (timedout) connection ** and starts up as if the master node had stopped. This makes sense to me: it can't tell whether the master is alive or not, and since the connection to the master node can't be re-established, the master is removed from the nodes() list. Let's pretend for a moment that this is the desired outcome.
The problem is that, when the connection is restored, I have the master and the standby running at the same time, and the standby is oblivious to the fact that the master is running. Pinging the standby during the master's init does not solve the issue: I checked nodes() on the standby after doing so, and it sees the master node, but it still continues to run.
My solution for now has been to create a process that monitors all nodes above it in the hierarchy; if any of them are online (can be pinged), the process calls erlang:halt() to terminate the standby node. It works for simple situations, but maybe someone can tell me if there is a better way? I found a similar problem described on the Elixir forum, so it is probably a known Erlang problem without an easy solution. https://elixirforum.com/t/distributed-application-network-split/10007
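A minimal sketch of that watchdog, with a made-up module name, node list and poll interval:

-module(split_watchdog).
-export([start_link/1]).

%% Nodes :: [node()] -- the nodes above this one in the hierarchy.
start_link(Nodes) ->
    {ok, spawn_link(fun() -> loop(Nodes) end)}.

loop(Nodes) ->
    %% If any higher-priority node answers a ping, this standby has to go.
    case lists:any(fun(N) -> net_adm:ping(N) =:= pong end, Nodes) of
        true  -> erlang:halt();
        false -> timer:sleep(5000),
                 loop(Nodes)
    end.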
If you don't want two nodes running in parallel during a network split, I'm guessing an outside monitoring application needs to be used?
The second major issue is heart. If heart is turned on as is, the failover never happens. If heart is running with a sleep before it calls start, it stops the failover node when it calls application start. So even when it can't start the master, due to it not having access to vital resources for example, it stops the failover node and doesn't bring it back up after it fails to start the master. I don't know whether heart is simply not supposed to be used with a distributed application, or whether there is an option to run some Erlang code that checks if the resources are available before attempting to start the node and before stopping the failover node?
The documentation on heart is not great; it is very hard to find any examples of HEART_COMMAND. I found a way to set HEART_COMMAND to a script from within my application, but there is a limit on how long the argument can be, and from what I can tell it is not as long as stated in the documentation. The following, for example, sets a sleep timer for 60 seconds before calling application start again. It doesn't solve any issues, because after 60 seconds it stops the failover node and hangs if the master node can't start.
heart:set_cmd("sleep 60; ./bin/myapp start")
The solution I've ended up with for now is letting heart of the main release start another release, a pre-loader, which does a preliminary check that all resources are available: if they are, it starts the main release/application, and if they are not, it keeps checking forever. This way the main app runs on the failover node without interruption, so the main release has heart turned on and the pre-loader does not. I ended up using a bash script because I needed to do more work than I could fit into heart:set_cmd/1, so I'm still calling heart:set_cmd(Dir ++ "/priv/myHeartScript.sh " ++ Arg1 ++ " " ++ Arg2), but don't get carried away with the Args as there is a limit on size! I also used environment variables, set in vm.args with -env, to pass data to the script, such as the pre-loader path/name. This allowed me to avoid having to edit the script during deployment as well.
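Roughly, the wiring looks like this; the variable, script and path names here are made up:

# vm.args -- turn heart on and hand the pre-loader location to the heart script
-heart
-env PRELOADER_DIR /opt/myapp/preloader

%% during startup of the main app: point heart at the wrapper script,
%% keeping the argument string short because of the size limit mentioned above
Dir = code:priv_dir(myapp),            %% assumes the script ships in the app's priv dir
Pre = os:getenv("PRELOADER_DIR"),      %% set via -env in vm.args above
heart:set_cmd(Dir ++ "/myHeartScript.sh " ++ Pre).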
If anyone has a better solution PLEASE let me know.
UPDATE
The team at Erlang Solutions was kind enough to shed some light on the subject. Basically, nobody they know uses Erlang's built-in distributed application model. Everything revolves around the data, and as long as it is available in redundant databases you can spin up new application instances at any time. They recommend using cloud hosts that can spin up new servers when one goes down, or using a redundant node design: have, say, 5 nodes up in parallel, and if a few go down you can restart them manually or by other means.
As for me, I can say that getting heart to start a pre-loader release/app gets the job done, but it gets complicated fast. Launching the app now requires provisioning several extra sys.config/vm.args/rebar.config files. I will be looking into their suggestions for the next iteration.
UPDATE
Moved away from using the Erlang distributed model. Now using RabbitMQ to send heartbeats to all nodes, including itself. If a node is receiving heartbeats from itself and no other node, it is the master; if it is receiving heartbeats from more than one node, it uses some attribute, such as the node name, to choose the master. You don't have to use RabbitMQ, but you need to make sure all nodes can reach the same destination and consume from it.
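Assuming each node keeps track of which node names it has recently heard heartbeats from, the election rule boils down to something like this sketch:

%% HeardFrom :: [node()] -- nodes we have received heartbeats from recently
choose_master([])        -> undefined;                   %% no heartbeats seen yet
choose_master(HeardFrom) ->
    case lists:usort(HeardFrom) of
        [Self] when Self =:= node() -> node();           %% only hearing ourselves: act as master
        [First | _]                 -> First             %% several nodes alive: e.g. lowest name wins
    end.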
Also, DevOps opposes using heart. They prefer to use standard Linux tools to monitor the application status and restart it after a crash or a server reboot.

Related

Why is there no "start_monitor" for gen_server?

Is there any particular reason why there is no start_monitor as an equivalent of spawn_monitor?
Is this simply not needed since gen_servers are usually started by supervisors?
I would like to get a notification when my temporary workers crash. What is the recommended way to do this in an OTP application?
My first idea was to have a gen_server which would monitor workers started by a dynamic supervisor.
More Info:
As far as I know, supervisors provide controlled start, shutdown, and controlled restart in case of a crash (to get back to a well-defined state).
In addition to that, I would like to run a function when a worker process crashes.
For example, I have C nodes which connect to the Erlang node. Since a C node can't monitor processes (AFAIK) and is also limited in other ways in how it can interact with Erlang, I have a "proxy" process for each connecting C node in order to keep the C node as simple as possible.
The C nodes do RPC calls to Erlang using ei_rpc_to and process messages from the connected Erlang node. Messages are either results of RPC calls or "out-of-band" data/info for the C node.
The Erlang "proxy" process monitors its C node using monitor_node to detect if it vanished, but I also need a mechanism for informing the C node that its proxy process crashed. One way of detecting this would be when it does the next rpc call, since it would obviously fail, but since I already have the "out-of-band" message processing in place, I wanted to use that.
Another use case would be clients which make REST requests to the Erlang cluster. This in turn starts workers that perform some tasks (which may take a long time). After a while the external client may want to get the status of the task. The worker can, for example, update the status in a Mnesia table, but if it crashes, who will update the table with the failure status?
I know there are many ways of achieving this, but I would like to know what is the Erlang way of doing this.
2nd Edit:
After reading the docs I saw that in a gen_server, terminate will get called (if it is defined with a matching clause). Would this be a viable alternative to a separate monitoring process? It looks a bit messy, since terminate does not get called when receiving 'EXIT' from other processes, so I would also need to trap exits.
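For what it's worth, one common shape for the monitoring gen_server idea above is a sketch like the following, where the gen_server asks the dynamic supervisor for children and monitors them itself; the supervisor name and the notify_c_node/2 helper are hypothetical:

-module(worker_watcher).
-behaviour(gen_server).
-export([start_link/0, start_worker/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Ask the (assumed simple_one_for_one) supervisor for a child and monitor it.
start_worker(Args) ->
    gen_server:call(?MODULE, {start_worker, Args}).

init([]) ->
    {ok, #{}}.

handle_call({start_worker, Args}, _From, Monitored) ->
    {ok, Pid} = supervisor:start_child(my_worker_sup, [Args]),
    Ref = erlang:monitor(process, Pid),
    {reply, {ok, Pid}, Monitored#{Ref => Pid}}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Runs for every monitored worker that exits, normally or not.
handle_info({'DOWN', Ref, process, Pid, Reason}, Monitored) ->
    notify_c_node(Pid, Reason),          %% hypothetical: push the out-of-band message
    {noreply, maps:remove(Ref, Monitored)};
handle_info(_Other, State) ->
    {noreply, State}.

notify_c_node(_Pid, _Reason) ->
    ok.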

How to (or should I) monitor or ensure running of a monitoring software?

I'm writing a system/service monitoring software, and my primary goal is to make it as failsafe as possible.
Right now, I have a binary script which starts the master process, which forks off children which do the actual monitoring and reporting. The master only manages the restarting of children if they fail, and some communication between the children.
Given this level of failsafe, is it advisable to add another layer of monitoring for the master process?
Supposing my code is in a high-level language (Python et al.), would it make sense to wrap my software in an initscript or shell script which watches it, or would that be redundant?
This reminds me of that old worm that consisted of two processes: if one of the processes was killed, the other one would respawn it, and vice versa.
If this software is supposed to run on Linux, you can simply use /etc/inittab with the respawn option.
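For example, an entry along these lines (the id and the path to the monitor binary are made up) makes init restart the master process whenever it exits:

# /etc/inittab
mm:2345:respawn:/usr/local/bin/mymonitor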

OTP Distributed Application Takeover

I'm dynamically loading an Erlang application into the system based on a config file, which precludes me from starting the distributed app at boot time. I'm able to get failover to work, but not failback (or, in OTP terms, takeover).
Say I have NodeA running the app, and NodeB as a failover node. I pull the cord on NodeA, and the application migrates to NodeB. This is expected. But when I bring NodeA back online and try to call application:start(MyApp) I get:
{error, {shutdown,{myapp, start, [normal,["config.xml"]]}}}
Which is indicative of the app failing to start up.
No matter; it fails to start because I already have the supervisors running on NodeB, and I've net_adm:ping'ed the nodes together.
I would imagine I can call application:takeover/2 on MyApp to get control of the node back and kill the application on the other node, but that gives me:
{error,{not_running_distributed, MyApp}}
So this doesn't work either. My node priority list is [NodeA, {NodeB, NodeC}], so I would think that the app would know to move to the higher-priority node once it is back online.
How do I go about implementing Takeover in this scenario?
You may need to use application:takeover/2 instead.
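For what it's worth, {not_running_distributed, MyApp} is the error the application controller returns when the calling node does not know the app as a distributed application; that is configured through the kernel's distributed parameter, not in the app itself. A rough sketch of such a sys.config matching the priority list above (node names, timeout and sync settings are made up):

[{kernel,
  [{distributed, [{myapp, 5000, ['node_a@host1', {'node_b@host2', 'node_c@host3'}]}]},
   {sync_nodes_optional, ['node_b@host2', 'node_c@host3']},
   {sync_nodes_timeout, 30000}]}].

With something like this in place on every node (adjusting the sync lists per node), the application controller decides where the app runs, and moving it back to the higher-priority node can be requested with application:takeover(myapp, permanent).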

What is the significance of a Mnesia Master Node in a cluster

I am running two Erlang nodes with a replicated Mnesia database. Whenever I try to start one of them while Mnesia is NOT running on the other one, mnesia:wait_for_tables(?TABS, ?TIMEOUT) hangs on the node it is called from. I need a structure where, if both nodes are down, I can start working with one while the other stays down, and later decide to bring the other one up and continue to work well. I need to be sure that the node that kept running updates the one that comes up later. Does this necessarily require me to have one as the master?
Edit:
Oh, I've got it. The database I was using had a couple of fragmented tables, and some of the fragments had been distributed across the network for load balancing. So Mnesia on one host would try to load them across the network and would fail, since Mnesia on the other one was down!
I guess this has nothing to do with a Mnesia master node. But I would still love to understand its significance, because I've not used it before, yet I always play with distributed schemas.
Thanks again...
Mnesia master nodes are used to resolve split-brain situations in a fairly brutal fashion. If mnesia discovers a split-brain situation, it will issue an event, "running partitioned network". One way to respond to this would be to set master nodes to the "island" that you want to keep, and then restart the other nodes. When they come back up, they will unconditionally load tables from the master nodes.
There is another mechanism in Mnesia, called force load. One should be very careful with it, but consider the case where you have two nodes, A and B: you terminate B (A logs B as down), then terminate A, then restart B. B will have no info about when A went down, so it will refuse to load tables that have a copy on A. If you know that A is not coming back soon, you could choose to call mnesia:force_load_table(Tab) on B, which will cause it to run with its own copies. Once A comes back up, it will detect that B is up and will load tables from it. As you can see, there are several other scenarios where you can end up with an inconsistent database. Mnesia will not fix that, but it tries to provide tools to resolve the situation if it arises. In the scenario above, unfortunately, Mnesia will give you no hints, but it is possible to create an application that detects the problem.
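As a sketch of that detection on the island you want to keep (deciding which island wins is application policy, not something Mnesia does for you):

handle_partition() ->
    mnesia:subscribe(system),
    receive
        {mnesia_system_event,
         {inconsistent_database, running_partitioned_network, OtherNode}} ->
            %% keep this island's copies; the losing nodes (e.g. OtherNode) must
            %% then be restarted and will reload tables from the master nodes
            mnesia:set_master_nodes([node()]),
            error_logger:warning_msg("partitioned against ~p, master nodes set~n",
                                     [OtherNode])
    end.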

How to monitor a remote erlang node which was down and is restarting

My application runs in an Erlang cluster with, usually, two or more nodes. There's active monitoring between the nodes (using erlang:monitor_node) which works fine; I can detect and react to the fact that a node that was up is now down.
But how do I then find out that the node has restarted and is back in business? I can of course periodically ping the node until it is back up, but is there a better way that I've simply missed? Are process groups a better way of achieving this?
(Edited to add)
I think the answer suggesting a technique like electing a supervisor is the thought process I was missing. I'll look into that and mark this question as done.
Just an idea, but how about having the restarting node itself explicitly inform the supervisor/monitoring node that it has finished restarting and that it is available again?
You could use a recurring "heartbeat message" for this purpose, or come up with a custom message specifically meant to be sent once after successful initialization. Something along the lines of:
start(SupervisorPid) ->
    SupervisorPid ! {hello, self()},
    mainloop().
You could create a global_group and then use global_group:monitor_nodes(true) to monitor the other nodes within the same global group. The process that is monitoring the nodes will get nodeup and nodedown messages.
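A sketch of such a process (net_kernel:monitor_nodes(true) delivers the same nodeup/nodedown messages if you are not using global groups):

start() ->
    spawn(fun() ->
                  ok = global_group:monitor_nodes(true),
                  loop()
          end).

loop() ->
    receive
        {nodeup, Node}   -> io:format("~p is back in business~n", [Node]), loop();
        {nodedown, Node} -> io:format("~p went down~n", [Node]), loop()
    end.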
