Erlang freezes on supervisor:start_child - erlang

I want to start a supervisor with a process that would spawn more processes linked to the supervisor. The program freezes at supervisor:start_child.
The supervisor starts the main child:
% supervisor (only part shown)
init([]) ->
MainApp = ?CHILD_ARG(mainapp, worker, [self()]),
{ok, { {one_for_one, 5, 10}, [MainApp]} }.
The main child starts here:
% mainapp (gen_server)
start_link([SuperPid]) when is_pid(SuperPid) ->
io:format("Mainapp started~n"),
gen_server:start_link({local, ?MODULE}, ?MODULE, [SuperPid], []).
init([SuperPid]) ->
{ok, _Pid} = start_child(childapp, SuperPid), % <-- here start the other
{ok, #state{sup=SuperPid}}.
start_child(Module, SuperPid) -> % Module = childapp
io:format("start child before~n"), % printed
ChildSpec = ?CHILD(Module, worker),
{ok, Pid} = supervisor:start_child(SuperPid, ChildSpec), % <-- here freezes
io:format("start child after~n"), % not printed
{ok, Pid}.
And the other child source contains
% childapp
start_link([]) ->
io:format("Child started~n"),
gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
%% gen_server interface
init([]) ->
{ok, #state{}}.
This is the output I get when running the app:
erl -pa ebin -eval "application:start(mysuptest)"
Erlang R16B01 (erts-5.10.2) [source-bdf5300] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.2 (abort with ^G)
1> Mainapp started
start child before
and here it stops: it freezes and does not return to the Erlang shell prompt as usual. I do not get any caught error or any other message. Any ideas? Am I starting the child properly?

When you start a child process, the supervisor's call returns only after the child's init has returned (if the child is a gen_server, start_link blocks until init returns). You are starting the main gen_server from the supervisor, so the supervisor is waiting for mainapp's init to return. Meanwhile, mainapp calls supervisor:start_child, which blocks because the supervisor is still waiting for mainapp. The result is a deadlock.
One possible solution is not to call start_child from mainapp's init at all, but to do it asynchronously after init returns.
For this, mainapp can send a cast message to itself and start the child in the corresponding handle_cast clause (as shown below), or it can spawn another process that starts the child and sends the result (the child Pid) back to mainapp.
init([SuperPid]) ->
gen_server:cast(self(), {start, SuperPid}), % <-- send a cast message to itself
{ok, #state{sup=SuperPid}}.
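The corresponding handle_cast clause then performs the actual start_child call once init/1 has returned; a minimal sketch (the {start, SuperPid} message shape is just an illustration):
handle_cast({start, SuperPid}, State) ->
    {ok, _Pid} = start_child(childapp, SuperPid), % no deadlock: init/1 has already returned
    {noreply, State};
handle_cast(_Msg, State) ->
    {noreply, State}.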
Another preferred solution is having a supervision tree. The child process can have its own supervisor and the mainapp calls the child's supervisor to start the child process.
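A rough sketch of that supervision-tree variant, with made-up module names (child_sup is a new simple_one_for_one supervisor started under the top supervisor; it must appear before mainapp in the top supervisor's child list if mainapp starts workers from its init):
%% child_sup.erl -- dedicated supervisor for childapp workers (illustrative)
-module(child_sup).
-behaviour(supervisor).
-export([start_link/0, start_child/0, init/1]).
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).
%% mainapp calls this instead of supervisor:start_child(SuperPid, ...)
start_child() ->
    supervisor:start_child(?MODULE, [[]]). % results in childapp:start_link([])
init([]) ->
    ChildSpec = {childapp, {childapp, start_link, []},
                 transient, 5000, worker, [childapp]},
    {ok, {{simple_one_for_one, 5, 10}, [ChildSpec]}}.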

Related

Erlang: Supervisor start_child succeeds but no child is added

I'm working on building a supervisor in Erlang that looks like this:
-module(a_sup).
-behaviour(supervisor).
%% API
-export([start_link/0, init/1]).
start_link() ->
{ok, supervisor:start_link({local,?MODULE}, ?MODULE, [])}.
init(_Args) ->
RestartStrategy = {simple_one_for_one, 5, 3600},
ChildSpec = {
a_gen_server,
{a_gen_server, start_link, []},
permanent,
brutal_kill,
worker,
[a_gen_server]
},
{ok, {RestartStrategy,[ChildSpec]}}.
And this is how my gen_server looks:
-module(a_gen_server).
-behavior(gen_server).
%% API
-export([start_link/2, init/1]).
start_link(Name, {X, Y}) ->
gen_server:start_link({local, Name}, ?MODULE, [Name, {X,Y}], []),
ok.
init([Name, {X,Y}]) ->
process_flag(trap_exit, true),
io:format("~p: position {~p,~p}~n",[Name, X, Y]),
{ok, {X,Y}}.
My gen_server works completely fine. When I run the supervisor as:
1> c(a_sup).
{ok,a_sup}
2> Pid = a_sup:start_link().
{ok,{ok,<0.85.0>}}
3> supervisor:start_child(a_sup, [Hello, {4,3}]).
Hello: position {4,3}
{error,ok}
I couldn't understand where the {error, ok} is coming from, and if there is an error, then what is causing it. So this is what I get when I check the status of the children:
> supervisor:count_children(a_sup).
[{specs,1},{active,0},{supervisors,0},{workers,0}]
This means that there are no children registered with the supervisor yet, despite it calling the gen_server's init callback and spawning a process? Clearly some error is preventing the call from completing successfully, but I can't seem to gather any hints to figure it out.
The problem is that a_gen_server:start_link (since that's the function used in the child spec) is expected to return {ok, Pid}, not just ok.
As the docs put it:
The start function must create and link to the child process, and must
return {ok,Child} or {ok,Child,Info}, where Child is the pid of the
child process and Info any term that is ignored by the supervisor.
The start function can also return ignore if the child process for
some reason cannot be started, in which case the child specification
is kept by the supervisor (unless it is a temporary child) but the
non-existing child process is ignored.
If something goes wrong, the function can also return an error tuple
{error,Error}.
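In other words, return the result of gen_server:start_link/4 directly instead of discarding it; a minimal sketch of the fix:
start_link(Name, {X, Y}) ->
    gen_server:start_link({local, Name}, ?MODULE, [Name, {X, Y}], []).
(The {ok,{ok,<0.85.0>}} seen earlier hints at the opposite problem in a_sup:start_link/0, which wraps an already wrapped result; returning supervisor:start_link/3's result directly fixes that too.)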

How can I know when it's the last cycle of my process restarted by the supervisor in erlang

I have a simple_one_for_one supervisor which has gen_fsm children.
I want each gen_fsm child to send a message only the last time it terminates.
Is there any way to know when is the last cycle?
here's my supervisor:
-module(data_sup).
-behaviour(supervisor).
%% API
-export([start_link/0,create_bot/3]).
%% Supervisor callbacks
-export([init/1]).
%%-compile(export_all).
%%%===================================================================
%%% API functions
%%%===================================================================
start_link() ->
supervisor:start_link({local, ?MODULE}, ?MODULE, []).
init([]) ->
RestartStrategy = {simple_one_for_one, 0, 1},
ChildSpec = {cs_fsm, {cs_fsm, start_link, []},
permanent, 2000, worker, [cs_fsm]},
Children = [ChildSpec],
{ok, {RestartStrategy, Children}}.
create_bot(BotId, CNPJ,Pid) ->
supervisor:start_child(?MODULE, [BotId, CNPJ, Pid]).
The Pid is the pid of the process that starts the supervisor and gives the orders to start the children.
-module(cs_fsm).
-behaviour(gen_fsm).
-compile(export_all).
-define(SERVER, ?MODULE).
-define(TIMEOUT, 5000).
-record(params, {botId, cnpj, executionId, pid}).
%%%===================================================================
%%% API
%%%===================================================================
start_link(BotId, CNPJ, Pid) ->
io:format("start_link...~n"),
Params = #params{botId = BotId, cnpj = CNPJ, pid = Pid},
gen_fsm:start_link(?MODULE, Params, []).
%%%===================================================================
%%% gen_fsm callbacks
%%%===================================================================
init(Params) ->
io:format("initializing~n"),
process_flag(trap_exit, true),
{ok, requesting_execution, Params, 0}.
requesting_execution(timeout,Params) ->
io:format("erqusting execution"),
{next_state, finished, Params,?TIMEOUT}.
finished(timeout, Params) ->
io:format("finished :)~n"),
{stop, normal, Params}.
terminate(shutdown, _StateName, Params) ->
Params#params.pid ! {terminated, self(),Params},
ok;
terminate(_Reason, _StateName, _Params) ->
ok.
my point is that if the process fails in any of the states it should send a message only if it is the last time it is restarted by the supervisor (according to its restart strategy).
If the gen_fsm fails, does it restart from the same state with the same state data? If not, how can I make that happen?
You can send the message from the Module:terminate/3 function, which is called when one of the StateName functions returns {stop,Reason,NewStateData} to indicate that the gen_fsm should be stopped.
gen_fsm is a finite state machine so you decide how it transitions between states. Something that triggers the last cycle may also set something in the StateData that is passed to Module:StateName/3 so that the function that handles the state knows it's the last cycle. It's hard to give a more specific answer unless you provide some code which we could analyze and comment on.
EDIT after further clarification:
A supervisor doesn't notify its children how many times it has restarted them, and it also can't notify a child that this is the last restart. The latter is simply because it doesn't know that a restart is going to be the last one until the child actually crashes once more, which the supervisor can't possibly predict. Only after the child has crashed can the supervisor calculate how many times the child crashed during the period and decide whether it is allowed to restart the child once more, or whether that was the last restart and it is now time for the supervisor to die as well.
However, nothing stops the child from recording, e.g. in an ETS table, how many times it has been restarted. But that of course won't help with deducing which restart is the last one.
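A rough sketch of such bookkeeping, assuming a public named ETS table (restart_counts is a made-up name) created once by a process that outlives the workers:
%% created once, e.g. by whoever starts the supervisor:
ets:new(restart_counts, [named_table, public, set]),
%% in the child's init/1, bump a per-child counter:
init(Params) ->
    Key = Params#params.botId,
    N = case ets:lookup(restart_counts, Key) of
            [] -> 1;
            [{Key, Prev}] -> Prev + 1
        end,
    ets:insert(restart_counts, {Key, N}),
    io:format("start number ~p for ~p~n", [N, Key]),
    {ok, requesting_execution, Params, 0}.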
Edit 2:
When the supervisor restarts the child it starts it from scratch using the standard init function. Any previous state of the child before it crashed is lost.
Please note that a crash is an exceptional situation and it's not always possible to recover the state, because the crash could have corrupted it. Instead of trying to recover the state or asking the supervisor when it's done restarting the child, why not prevent the crash from happening in the first place? You have two options:
I. Use try/catch to catch any exceptional situation and act accordingly. It's possible to catch any error that would otherwise crash the process and cause the supervisor to restart it. You can add try/catch to any entry function inside the gen_fsm process so that the error condition is caught before it crashes the server, as in the two example functions below:
read() ->
try
try_home() orelse try_path(?MAIN_CFG) orelse
begin io:format("Some Error", []) end
catch
throw:Term -> {error, Term}
end.
try_read(Path) ->
try
file:consult(Path)
catch
error:Error -> {error, Error}
end.
II. Spawn a new process to handle the job and trap EXIT signals when the process dies. This lets the gen_fsm handle a job asynchronously and handle any errors in a custom way (not necessarily by restarting the process, as a supervisor would do). The Error Handling section of the Erlang documentation explains how to trap exit signals from child processes, and below is an example of trapping exit signals in a gen_server. Check the handle_info function, which contains a few clauses that trap different types of EXIT messages from child processes.
init([Cfg, Id, Mode]) ->
process_flag(trap_exit, true),
(...)
handle_info({'EXIT', _Pid, normal}, State) ->
{noreply, State};
handle_info({'EXIT', _Pid, noproc}, State) ->
{noreply, State};
handle_info({'EXIT', Pid, Reason}, State) ->
log_exit(Pid, Reason),
check_done(error, Pid, State);
handle_info(_, State) ->
{noreply, State}.

Can simple_one_for_one children only be terminated if the shutdown strategy is set to brutal_kill?

The supervisor is an OTP behavior.
init([]) ->
RoomSpec = {mod_zytm_room, {mod_zytm_room, start_link, []},
transient, brutal_kill, worker, [mod_zytm_room]},
{ok, {{simple_one_for_one, 10, 10000}, [RoomSpec]}}.
The above code invokes the child's terminate callback.
But if I change brutal_kill to an integer timeout (e.g. 6000), the terminate callback is never invoked.
I see an explanation in the Erlang document:
The dynamically created child processes of a simple-one-for-one
supervisor are not explicitly killed, regardless of shutdown strategy,
but are expected to terminate when the supervisor does (that is, when
an exit signal from the parent process is received).
But I cannot fully understand it. Does this mean that exit(Pid, kill) can terminate a simple_one_for_one child while exit(Pid, shutdown) can't?
===================================update====================================
mod_zytm_room_sup.erl
-module(mod_zytm_room_sup).
-behaviour(supervisor).
-export([start_link/0, init/1, open_room/1, close_room/1]).
start_link() ->
supervisor:start_link({local, ?MODULE}, ?MODULE, []).
init([]) ->
RoomSpec = {mod_zytm_room, {mod_zytm_room, start_link, []},
transient, brutal_kill, worker, [mod_zytm_room]},
{ok, {{simple_one_for_one, 10, 10000}, [RoomSpec]}}.
open_room(RoomId) ->
supervisor:start_child(?MODULE, [RoomId]).
close_room(RoomPid) ->
supervisor:terminate_child(?MODULE, RoomPid).
mod_zytm_room.erl
-module(mod_zytm_room).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_cast/2, handle_info/2, handle_call/3, code_change/3, terminate/2]).
start_link(RoomId) ->
gen_server:start_link(?MODULE, [RoomId], []).
init([RoomId]) ->
{ok, []}.
terminate(_, _) ->
error_logger:info_msg("~p terminated:~p", [?MODULE, self()]),
ok.
...other callbacks omitted.
mod_zytm_sup.erl
-module(mod_zytm_sup).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_cast/2, handle_info/2, handle_call/3, code_change/3, terminate/2]).
start_link() ->
gen_server:start_link(?MODULE, [], []).
init([]) ->
{ok, []}.
%% invoked by an erlang:send_after event.
handle_info({'CLOSE_ROOM', RoomPid}, State) ->
mod_zytm_room_sup:close_room(RoomPid),
{noreply, State}.
...other callbacks omitted.
Both mod_zytm_sup and mod_zytm_room_sup are part of the system supervision tree; mod_zytm_sup asks mod_zytm_room_sup to create or close mod_zytm_room processes.
Sorry, I got the wrong result earlier.
To make it clear:
The brutal_kill strategy kills the child process immediately.
The terminate callback is invoked if the simple_one_for_one child's shutdown value is an integer timeout and the child has declared process_flag(trap_exit, true) in its init callback.
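A minimal sketch of that change in mod_zytm_room's init/1 (everything else stays as shown above):
init([_RoomId]) ->
    process_flag(trap_exit, true), %% needed so the supervisor's shutdown exit reaches terminate/2
    {ok, []}.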
FYI, Manual on Erlang doc:
If the gen_server is part of a supervision tree and is ordered by its
supervisor to terminate, this function will be called with
Reason=shutdown if the following conditions apply:
the gen_server has been set to trap exit signals, and the shutdown
strategy as defined in the supervisor's child specification is an
integer timeout value, not brutal_kill.
The dynamically created child processes of a simple-one-for-one
supervisor are not explicitly killed, regardless of shutdown strategy,
but are expected to terminate when the supervisor does (that is, when
an exit signal from the parent process is received).
Note that this is no longer true. Since Erlang/OTP R15A, dynamic children are explicitly terminated as per the shutdown strategy.

What OTP behaviors should I use for such module?

I have a simple Erlang module and I want to rewrite it based on OTP principles, but I cannot determine which OTP template I should use.
Module's code:
-module(main).
-export([start/0, loop/0]).
start() ->
Mypid = spawn(main, loop, []),
register( main, Mypid).
loop() ->
receive
[Pid, getinfo] -> Pid! [self(), welcome],
io:fwrite( "Got ~p.~n", [Pid] ),
% spawn new process here
loop();
quit -> ok;
X ->
io:fwrite( "Got ~p.~n", [ X ] ),
% spawn new process here
loop()
end.
gen_server would be fine.
A couple of things:
it is bad practice to send messages to yourself
messages are usually tuples, not lists, because their structure is fixed
despite your comment, you do not spawn a new process;
the call to loop/0 just re-enters the same loop.
The gen_server's init would hold the body of your start/0. API functions proxy your calls via gen_server to the handle_call clauses. To spawn a new process on a function call, add a spawn to the body of the desired handle_call. Do not use handle_info to handle incoming messages; instead of sending raw messages, call the gen_server API and 'translate' your call into gen_server:call or gen_server:cast, e.g.
-record(state, {}).
start_link() ->
gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
init(_) ->
{ok, #state{}}.
welcome(Arg) ->
gen_server:cast(?MODULE, {welcome, Arg}).
handle_cast({welcome, Arg}, State) ->
io:format("gen_server PID: ~p~n", [self()]),
spawn(?MODULE, some_fun, [Arg]),
{noreply, State}.
some_fun(Arg) ->
io:format("Incoming arguments ~p to me: ~p~n", [Arg, self()]).
I have never compiled above, but it should give you the idea.

ejabberd supervisor module

I need to keep a gen_mod process running as it loops every minute and does some cleanup. However once every few days it will crash and I'll have to manually start it back up again.
I could use a basic example of hooking a supervised child into ejabberd_sup so it keeps running. I am struggling to understand the examples that use gen_server.
Thanks for the help.
Here's an example module combining ejabberd's gen_mod and OTP's gen_server. Explanation is inlined in the code.
-module(cleaner).
-behaviour(gen_server).
-behaviour(gen_mod).
%% gen_mod requires these exports
-export([start/2, stop/1]).
%% these are exports for gen_server
-export([start_link/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
terminate/2, code_change/3]).
-define(INTERVAL, timer:minutes(1)).
-record(state, {}).
%% ejabberd calls this function when this module is loaded
%% basically it adds gen_server defined by this module to
%% ejabberd main supervisor
start(Host, Opts) ->
Proc = gen_mod:get_module_proc(Host, ?MODULE),
ChildSpec = {Proc,
{?MODULE, start_link, [Host, Opts]},
permanent,
1000,
worker,
[?MODULE]},
supervisor:start_child(ejabberd_sup, ChildSpec).
%% this is called by ejabberd when module is unloaded, so it
%% does the opposite of start/2 :)
stop(Host) ->
Proc = gen_mod:get_module_proc(Host, ?MODULE),
supervisor:terminate_child(ejabberd_sup, Proc),
supervisor:delete_child(ejabberd_sup, Proc).
%% it will be called by supervisor when it is time to start
%% this gen_server under control of supervisor
start_link(_Host, _Opts) ->
gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
%% it is an initialization function for gen_server
%% it starts a timer, which sends 'tick' message periodically to itself
init(_) ->
timer:send_interval(?INTERVAL, self(), tick),
{ok, #state{}}.
handle_call(_Request, _From, State) ->
Reply = ok,
{reply, Reply, State}.
handle_cast(_Msg, State) ->
{noreply, State}.
%% this function is called whenever gen_server receives a 'tick' message
handle_info(tick, State) ->
State2 = do_cleanup(State),
{noreply, State2};
handle_info(_Info, State) ->
{noreply, State}.
terminate(_Reason, _State) ->
ok.
code_change(_OldVsn, State, _Extra) ->
{ok, State}.
%% this function is called by handle_info/2 when tick message is received
%% so put all cleanup code here
do_cleanup(State) ->
%% do all cleanup work here
State.
This blog post gives a good explanation of how gen_servers work. Of course, make sure to re-read the OTP design principles on gen_server and supervisor.
Ejabberd module development is described here.
