Automatically restarting Erlang applications - erlang

I recently ran into a bug where an entire Erlang application died, yielding a log message that looked like this:
=INFO REPORT==== 11-Jun-2010::11:07:25 ===
application: myapp
exited: shutdown
type: temporary
I have no idea what triggered this shutdown, but the real problem I have is that it didn't restart itself. Instead, the now-empty Erlang VM just sat there doing nothing.
Now, from the research I've done, it looks like there are other "start types" you can give an application: 'transient' and 'permanent'.
If I start a Supervisor within an application, I can tell it to make a particular process transient or permanent, and it will automatically restart it for me. However, according to the documentation, if I make an application transient or permanent, it doesn't restart it when it dies, but rather it kills all the other applications as well.
What I really want to do is somehow tell the Erlang VM that a particular application should always be running, and if it goes down, restart it. Is this possible to do?
(I'm not talking about implementing a supervisor on top of my application, because then it's a catch 22: what if my supervisor process crashes? I'm looking for some sort of API or setting that I can use to have Erlang monitor and restart my application for me.)
Thanks!

You should be able to fix this in the top-level supervisor: set the restart strategy to allow one million restarts every second, and the application should never crash. Something like:
init(_Args) ->
    {ok, {{one_for_one, 1000000, 1},
          [{ch3, {ch3, start_link, []},
            permanent, brutal_kill, worker, [ch3]}]}}.
(Example adapted from the OTP Design Principles User Guide.)

You can use heart to restart the entire VM if it goes down, then use a permanent application type to make sure that the VM exits when your application exits.
Ultimately you need something above your application that you can trust, whether it is a supervisor process, the Erlang VM, or some shell script you wrote. Whatever you pick, it will always be a problem if that layer fails too.

Use Monit, then set up your application to terminate by using a supervisor for the whole application with a reasonable restart frequency. If the application terminates, the VM terminates, and Monit restarts everything.
I could never get Heart to be reliable enough, as it only restarts the VM once, and it doesn't deal well with a kill -9 of the Erlang VM.

Related

How can I stop other application in gen_server:terminate/2?

I'm building my own server with Erlang/OTP and got stuck on a problem with Mnesia.
I start Mnesia in gen_server:init/1 of my worker and stop it in gen_server:terminate/2 of the same worker.
Unfortunately, when mnesia:stop/0 is called as a result of application:stop(myApplication) or init:stop(), the application hangs and ends up with this:
=SUPERVISOR REPORT==== 23-Jun-2021::16:54:12.048000 ===
    supervisor: {local,temp_sup}
    errorContext: shutdown_error
    reason: killed
    offender: [{pid,<0.159.0>},
               {id,myMnesiaTest_sup},
               {mfargs,{myMnesiaTest_sup,start_link,[]}},
               {restart_type,permanent},
               {shutdown,10000},
               {child_type,supervisor}]
Of course, this doesn't happen if gen_server:terminate/2 is never called (with the trap_exit flag set to false), but then Mnesia doesn't stop either.
I don't understand why one application cannot be stopped from within another, and I'd like to know whether it's OK not to call mnesia:stop() at the end of my application.

The reason you cannot stop Mnesia while your application is stopping is that the application_controller process is busy stopping your application at that moment. This is a classic deadlock: one gen_server (in this case quite indirectly) makes a synchronous call to another gen_server, which in turn wants to make a synchronous call back to the first one.
You can break the deadlock by asynchronously shutting down Mnesia after your application stopped. Try calling from your terminate/2 timer:apply_after(0, mnesia, stop, []) for example. (Just spawning a process to do the call is not ideal, it would still belong to your application and would get killed when the application terminates.)
But most of the time you don't really have to bother with stopping Mnesia. Erlang applications by convention leave their dependencies started when stopped. And in case your application is terminated by init:stop(), it will take care of stopping all other applications anyway, including Mnesia.
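As a rough sketch of the asynchronous shutdown described above, the gen_server below schedules mnesia:stop/0 from terminate/2 instead of calling it directly. The module name db_worker and the 100 ms delay are assumptions made for illustration. A non-zero delay is used on the assumption that the call is then issued by the timer server rather than by a process belonging to the stopping application. Mnesia is assumed to have been started elsewhere.

```erlang
%% db_worker.erl: hypothetical worker that shuts Mnesia down asynchronously.
-module(db_worker).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, terminate/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    %% Trap exits so terminate/2 runs when the supervisor shuts us down.
    process_flag(trap_exit, true),
    {ok, #{}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    %% Do not call mnesia:stop() here directly: application_controller is
    %% busy stopping this application. Schedule the call instead.
    %% The 100 ms delay is an illustrative assumption.
    {ok, _TRef} = timer:apply_after(100, mnesia, stop, []),
    ok.
```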

How do you hook to an application in Erlang

I do not understand how you hook into an Erlang application, since it does not return a Pid.
Consider the snippet below. I am starting a process that receives messages to handle, but my application behaviour does not return anything.
How do I get hold of the Pid I am interested in when using the application behaviour?
.app
{application, simple_app,
 [{description, "something"},
  {mod, {simple_app, []}},
  {modules, [proc]}]}.
app
-module(simple_app).
-behaviour(application).
-export([start/2, stop/1]).

start(_StartType, _StartArgs) ->
    proc:start().

stop(_State) ->
    ok.
module
-module(proc).
%% loop/0 must be exported because it is spawned via ?MODULE.
-export([start/0, loop/0]).

start() ->
    Pid = spawn_link(?MODULE, loop, []),
    {ok, Pid}.

loop() ->
    receive
        {From, Message} ->
            From ! {ok, Message},
            loop();
        _ ->
            loop()
    end.
P.S. I am trying to understand how I get the root Pid so that I can use it to issue commands. In my case I need the Pid returned by proc:start/0. If my root were a supervisor, I would need the Pid of the supervisor. The application does not return a Pid, so how do I hook into it?
So the question is: when starting the application, wouldn't I need it to return a Pid that I can then issue commands against?

Your application must depend on kernel and stdlib. You should list them in your .app file, for example:
{application, simple_app,
 [{description, "something"},
  {mod, {simple_app, []}},
  {modules, [proc]},
  {applications, [kernel, stdlib]}]}.
When you want to start your app, you should use the application module, which is part of the kernel application.
It starts some processes to manage your application and handle its I/O. It calls YOUR_APP:start(_, _), and this function must return {ok, Pid}, where Pid is normally a process running the supervisor behaviour. We often call it the root supervisor of the app.
So you have to define an application behaviour (as you did) and a supervisor behaviour.
This supervisor process may start your workers which are doing anything your app wants to do.
If you want to start a process, you define its start specification in your supervisor module. So kernel starts your app and your app starts your supervisor and your supervisor starts your worker(s).
You can register your worker pid with a name and you can send it messages by using its name.
If you have lots of workers you can use a pool of pids which maintains your worker pids.
I think it's OK to play with spawn and spawn_link and sending messages manually to processes. But in production code we usually don't do this. We use OTP behaviours and they do this for us in a reliable and clean manner.
I think it's better to write some gen_servers (another behaviour) and play with handle_call and handle_cast, etc callbacks. Then run some gen_servers under a supervision tree and play with the supervisor API to kill or terminate its children, etc. Then start writing a complete application.
Remember to read the documentation for behaviours carefully.
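To make the chain "app starts supervisor, supervisor starts worker" a bit more concrete, here is a minimal sketch of a root supervisor. The module name simple_sup and the worker module proc_server are hypothetical; proc_server is assumed to be a gen_server that registers itself under a local name in its own start_link/0. The restart limits are arbitrary illustrative values.

```erlang
%% simple_sup.erl: hypothetical root supervisor for simple_app.
-module(simple_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    %% Register the supervisor locally so it can be found by name.
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Allow at most 5 restarts within 10 seconds (illustrative values).
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    %% proc_server is an assumed gen_server module, not part of the question.
    Child = #{id => proc_server,
              start => {proc_server, start_link, []},
              restart => permanent,
              shutdown => 5000,
              type => worker,
              modules => [proc_server]},
    {ok, {SupFlags, [Child]}}.
```

With something like this in place, simple_app:start/2 would return the result of simple_sup:start_link(), and a worker registered as proc_server could be looked up with whereis(proc_server) or addressed by name directly.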

Exit the VM when an application stops running

I've got an Erlang application packed with Rebar that's meant to be run as a service. It clusters with other instances of itself.
One thing I've noticed is that if the application crashes on one node, the Erlang VM remains up even when the application reaches its supervisor's restart limit and vanishes forever. The result is that other nodes in the cluster don't notice anything until they try to talk to the application.
Is there a simple way to link the VM to the root supervisor, so that the application takes down the whole VM when it dies?

When starting your application with application:start/2, you can pass the optional Type argument as one of the atoms permanent, transient or temporary. I guess you are looking for permanent.
As mentioned in application:start/2:
If a permanent application terminates, all other applications and the entire Erlang node are also terminated.
If a transient application terminates with Reason == normal, this is reported but no other applications are terminated. If a transient application terminates abnormally, all other applications and the entire Erlang node are also terminated.
If a temporary application terminates, this is reported but no other applications are terminated.
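A small sketch of what that looks like in code, assuming a hypothetical helper module named node_guard. It only wraps application:start/2 with the permanent type, so if the application later terminates, the node goes down with it.

```erlang
%% node_guard.erl: hypothetical helper around application:start/2.
-module(node_guard).
-export([start_permanent/1]).

%% Start App as a permanent application. If App terminates later, all other
%% applications and the Erlang node are terminated as well.
%% Returns ok or {error, Reason}, exactly as application:start/2 does.
start_permanent(App) when is_atom(App) ->
    application:start(App, permanent).
```

For example, node_guard:start_permanent(myapp) could be called from whatever code boots the node, where myapp stands in for your own application name.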

OTP application:stop(..) kills all spawned processes, not just spawn_linked ones?

I've set up a simple test case at https://github.com/bvdeenen/otp_super_nukes_all that shows that an OTP application:stop() actually kills all processes spawned by its children, even the ones that are not linked.
The test case consists of one gen_server (registered as par) spawning a plain Erlang process (registered as par_worker) and a gen_server (registered as reg_child), which also spawns a plain Erlang process (registered as child_worker). Calling application:stop(test_app) does a normal termination on the 'par' gen_server, but an exit(kill) on all others!
Is this nominal behaviour? If so, where is it documented, and can I disable it? I want the processes I spawn (without linking) from my gen_server to stay alive when the application terminates.
Thanks
Bart van Deenen
The application manual says (for the stop/1 function):
Last, the application master itself terminates. Note that all processes with the
application master as group leader, i.e. processes spawned from a process belonging
to the application, thus are terminated as well.
So I guess you can't modify this behavior.
EDIT: You might be able to change the group_leader of the started process with group_leader(GroupLeader, Pid) -> true (see: http://www.erlang.org/doc/man/erlang.html#group_leader-2). Changing the group_leader might allow you to avoid killing your process when the application ends.
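A minimal sketch of the group leader idea, assuming a hypothetical module named detach. Pointing the new process at the init process as its group leader is one commonly seen choice, not something the manual prescribes, and I/O from such a process may be handled differently as a result. Treat this as something to experiment with rather than a guaranteed way to survive application:stop/1.

```erlang
%% detach.erl: hypothetical helper that spawns a process outside the
%% calling application's group.
-module(detach).
-export([spawn_detached/1]).

%% Spawn Fun in a new, unlinked process. Before running Fun, the process
%% replaces its own group leader with the init process, so the application
%% master is no longer its group leader.
spawn_detached(Fun) when is_function(Fun, 0) ->
    spawn(fun() ->
                  true = group_leader(whereis(init), self()),
                  Fun()
          end).
```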

I made that mistake too, and found out that this is how it has to work.
When the application stops, all the processes it spawned die with it, whether they are registered or not.
If that did not happen, we would have to track every running process and figure out which ones are orphaned and which are not. You can guess how difficult that would be. Think of Unix ppid and pid: if you kill the parent, the children die too. So I think it has to be this way.
If you want processes that are independent of your application, you can send a message to another application asking it to start them:
other_application_module:start_process(ProcessInfo).

Supervisors with backoff

I have a supervisor with two worker processes: a TCP client which handles connection to a remote server and an FSM which handles the connection protocol.
Handling TCP errors in the child process complicates code significantly. So I'd prefer to "let it crash", but this has a different problem: when the server is unreachable, the maximum number of restarts will be quickly reached and the supervisor will crash along with my entire application, which is quite undesirable for this case.
What I'd like is to have a restart strategy with back-off; failing that, it would be good enough if the supervisor was aware when it is restarted due to a crash (i.e. had it passed as a parameter to the init function). I've found this mailing list thread, but is there a more official/better tested solution?

You might find our supervisor cushion to be a good starting point. I use it to slow down the restart of things that must be running, but are failing quickly on startup (such as ports that are encountering a resource problem).

I've had this problem many times working with Erlang and tried many solutions. I think the best I've found is to have an extra process that is started by the supervisor and in turn starts the child that might crash.
It starts the child on start-up, awaits child exits and restarts the child (with a delay) or exits as appropriate. I think this is simpler than the back-off server (which you link to) as you only need to keep state regarding a single child.
Another solution that I've used is to start the child processes as transient and have a separate process that polls and issues restarts to any processes that have crashed.
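The "extra process" approach might look roughly like the sketch below. The module name delayed_restarter and its start_link/2 interface are made up for illustration: StartFun is assumed to be a zero-arity fun that returns {ok, Pid} for a process linked to the caller, and DelayMs is a fixed delay rather than a true exponential back-off.

```erlang
%% delayed_restarter.erl: hypothetical wrapper that restarts one child
%% after a fixed delay instead of immediately.
-module(delayed_restarter).
-behaviour(gen_server).

-export([start_link/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link(StartFun, DelayMs) when is_function(StartFun, 0) ->
    gen_server:start_link(?MODULE, {StartFun, DelayMs}, []).

init({StartFun, DelayMs}) ->
    %% Trap exits so a crashing child arrives as an 'EXIT' message.
    process_flag(trap_exit, true),
    {ok, Pid} = StartFun(),
    {ok, #{start => StartFun, delay => DelayMs, child => Pid}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The current child died: schedule a restart instead of restarting now.
handle_info({'EXIT', Pid, _Reason}, #{child := Pid, delay := Delay} = State) ->
    erlang:send_after(Delay, self(), restart_child),
    {noreply, State#{child := undefined}};
handle_info(restart_child, #{start := StartFun, delay := Delay} = State) ->
    case StartFun() of
        {ok, Pid} ->
            {noreply, State#{child := Pid}};
        _Error ->
            %% Start failed (for example, server unreachable): try again later.
            erlang:send_after(Delay, self(), restart_child),
            {noreply, State}
    end;
%% Ignore anything else, including stale 'EXIT' messages from failed starts.
handle_info(_Msg, State) ->
    {noreply, State}.
```

The supervisor then only ever sees the wrapper, so a flapping child does not eat into the supervisor's restart intensity. Replacing the fixed delay with a growing one kept in the state map would turn this into an actual back-off.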

First, catch early termination of the child by calling process_flag(trap_exit, true) in your init/1.
Then decide how long you want to delay the restart, for example 10 seconds, and schedule it in handle_info/2:
handle_info({'EXIT', _Pid, Reason}, State) ->
    erlang:send_after(10000, self(), {die, Reason}),
    {noreply, State};
Lastly, let the process die with
handle_info({die, Reason}, State) ->
    {stop, Reason, State};

Resources