I'm working on building a supervisor in Erlang that looks like this:
-module(a_sup).
-behaviour(supervisor).
%% API
-export([start_link/0, init/1]).
start_link() ->
    {ok, supervisor:start_link({local, ?MODULE}, ?MODULE, [])}.

init(_Args) ->
    RestartStrategy = {simple_one_for_one, 5, 3600},
    ChildSpec = {
        a_gen_server,
        {a_gen_server, start_link, []},
        permanent,
        brutal_kill,
        worker,
        [a_gen_server]
    },
    {ok, {RestartStrategy, [ChildSpec]}}.
And this is how my gen_server looks:
-module(a_gen_server).
-behavior(gen_server).
%% API
-export([start_link/2, init/1]).
start_link(Name, {X, Y}) ->
    gen_server:start_link({local, Name}, ?MODULE, [Name, {X, Y}], []),
    ok.

init([Name, {X, Y}]) ->
    process_flag(trap_exit, true),
    io:format("~p: position {~p,~p}~n", [Name, X, Y]),
    {ok, {X, Y}}.
My gen_server works completely fine. When I run the supervisor as:
1> c(a_sup).
{ok,a_sup}
2> Pid = a_sup:start_link().
{ok,{ok,<0.85.0>}}
3> supervisor:start_child(a_sup, [Hello, {4,3}]).
Hello: position {4,3}
{error,ok}
I couldn't understand where the {error, ok} is coming from, and if there is an error, then what is causing it. So this is what I get when I check the status of the children:
> supervisor:count_children(a_sup).
[{specs,1},{active,0},{supervisors,0},{workers,0}]
This suggests that no children are registered with the supervisor yet, even though it called the gen_server's init callback and spawned a process. Clearly some error is preventing the call from completing successfully, but I can't find any hints to figure out what it is.
The problem is that a_gen_server:start_link (since that's the function used in the child spec) is expected to return {ok, Pid}, not just ok.
As the docs put it:
The start function must create and link to the child process, and must
return {ok,Child} or {ok,Child,Info}, where Child is the pid of the
child process and Info any term that is ignored by the supervisor.
The start function can also return ignore if the child process for
some reason cannot be started, in which case the child specification
is kept by the supervisor (unless it is a temporary child) but the
non-existing child process is ignored.
If something goes wrong, the function can also return an error tuple
{error,Error}.
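In this case the minimal fix is to have a_gen_server:start_link return the result of gen_server:start_link/4 directly; a sketch based on the code above, keeping the same arguments:

start_link(Name, {X, Y}) ->
    %% Return {ok, Pid} (or {error, Reason}) straight from gen_server:start_link/4
    %% so the supervisor can link to and keep track of the child.
    gen_server:start_link({local, Name}, ?MODULE, [Name, {X, Y}], []).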
Related
I have a simple_one_for_one supervisor which has gen_fsm children.
I want each gen_fsm child to send a message only on the last time it terminates.
Is there any way to know when the last cycle is?
Here's my supervisor:
-module(data_sup).
-behaviour(supervisor).
%% API
-export([start_link/0,create_bot/3]).
%% Supervisor callbacks
-export([init/1]).
%%-compile(export_all).
%%%===================================================================
%%% API functions
%%%===================================================================
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    RestartStrategy = {simple_one_for_one, 0, 1},
    ChildSpec = {cs_fsm, {cs_fsm, start_link, []},
                 permanent, 2000, worker, [cs_fsm]},
    Children = [ChildSpec],
    {ok, {RestartStrategy, Children}}.

create_bot(BotId, CNPJ, Pid) ->
    supervisor:start_child(?MODULE, [BotId, CNPJ, Pid]).
The Pid is the pid of the process that starts the supervisor and gives the orders to start the children.
-module(cs_fsm).
-behaviour(gen_fsm).
-compile(export_all).
-define(SERVER, ?MODULE).
-define(TIMEOUT, 5000).
-record(params, {botId, cnpj, executionId, pid}).
%%%===================================================================
%%% API
%%%===================================================================
start_link(BotId, CNPJ, Pid) ->
    io:format("start_link...~n"),
    Params = #params{botId = BotId, cnpj = CNPJ, pid = Pid},
    gen_fsm:start_link(?MODULE, Params, []).
%%%===================================================================
%%% gen_fsm callbacks
%%%===================================================================
init(Params) ->
    io:format("initializing~n"),
    process_flag(trap_exit, true),
    {ok, requesting_execution, Params, 0}.

requesting_execution(timeout, Params) ->
    io:format("requesting execution~n"),
    {next_state, finished, Params, ?TIMEOUT}.

finished(timeout, Params) ->
    io:format("finished :)~n"),
    {stop, normal, Params}.

terminate(shutdown, _StateName, Params) ->
    Params#params.pid ! {terminated, self(), Params},
    ok;
terminate(_Reason, _StateName, _Params) ->
    ok.
My point is that if the process fails in any of the states, it should send a message only if this is the last time it is restarted by the supervisor (according to its restart strategy).
If the gen_fsm fails, does it restart from the same state with the same state data? If not, how can I make that happen?
You can put the message sending in the Module:terminate/3 function, which is called when one of the StateName functions returns {stop,Reason,NewStateData} to indicate that the gen_fsm should be stopped.
gen_fsm is a finite state machine so you decide how it transitions between states. Something that triggers the last cycle may also set something in the StateData that is passed to Module:StateName/3 so that the function that handles the state knows it's the last cycle. It's hard to give a more specific answer unless you provide some code which we could analyze and comment on.
EDIT after further clarification:
The supervisor doesn't notify its children which restart this is, and it also can't notify a child that this is the last restart. The latter is simply because it doesn't know that a restart is going to be the last one until the child process actually crashes once more, which the supervisor can't possibly predict. Only after the child has crashed can the supervisor calculate how many times the child crashed during the given period of time and decide whether it is allowed to restart the child once more, or whether that was the last restart and it is now time for the supervisor to die as well.
However, nothing stops the child from recording, e.g. in an ETS table, how many times it has been restarted. Of course that still won't help with deducing which restart is the last one.
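For completeness, a minimal sketch of that idea, assuming a public named ETS table (restart_counts is a hypothetical name) created once by a process that outlives the children, and an OTP release recent enough to have ets:update_counter/4 with a default:

%% Created once, e.g. by the process that starts the supervisor:
%%   ets:new(restart_counts, [named_table, public, set]).
init(Params = #params{botId = BotId}) ->
    %% Bump this bot's counter on every (re)start; defaults to 0 if absent.
    Count = ets:update_counter(restart_counts, BotId, {2, 1}, {BotId, 0}),
    io:format("~p started ~p time(s)~n", [BotId, Count]),
    {ok, requesting_execution, Params, 0}.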
Edit 2:
When the supervisor restarts the child it starts it from scratch using the standard init function. Any previous state of the child before it crashed is lost.
Please note that a crash is an exceptional situation and it's not always possible to recover the state, because the crash could have corrupted it. Instead of trying to recover the state or asking the supervisor when it's done restarting the child, why not prevent the crash from happening in the first place? You have two options:
I. Use try/catch to catch any exceptional situations and act accordingly. It's possible to catch any error that would otherwise crash the process and cause the supervisor to restart it. You can add try/catch to any entry function inside the gen_fsm process so that any error condition is caught before it crashes the server. See example function 1 or example function 2:
read() ->
    try
        try_home() orelse try_path(?MAIN_CFG) orelse
            begin io:format("Some Error", []) end
    catch
        throw:Term -> {error, Term}
    end.

try_read(Path) ->
    try
        file:consult(Path)
    catch
        error:Error -> {error, Error}
    end.
II. Spawn a new process to handle the job and trap EXIT signals when the process dies. This allows the gen_fsm to handle a job asynchronously and handle any errors in a custom way (not necessarily by restarting the process, as a supervisor would do). This section titled Error Handling explains how to trap exit signals from child processes, and this is an example of trapping signals in a gen_server. Check the handle_info function, which contains a few clauses to trap different types of EXIT messages from child processes.
init([Cfg, Id, Mode]) ->
    process_flag(trap_exit, true),
    (...)

handle_info({'EXIT', _Pid, normal}, State) ->
    {noreply, State};
handle_info({'EXIT', _Pid, noproc}, State) ->
    {noreply, State};
handle_info({'EXIT', Pid, Reason}, State) ->
    log_exit(Pid, Reason),
    check_done(error, Pid, State);
handle_info(_, State) ->
    {noreply, State}.
I want to start a supervisor with a process that would spawn more processes linked to the supervisor. The program freezes at supervisor:start_child.
The supervisor starts the main child:
% supervisor (only part shown)
init([]) ->
    MainApp = ?CHILD_ARG(mainapp, worker, [self()]),
    {ok, {{one_for_one, 5, 10}, [MainApp]}}.
The main child starts here:
% mainapp (gen_server)
start_link([SuperPid]) when is_pid(SuperPid) ->
    io:format("Mainapp started~n"),
    gen_server:start_link({local, ?MODULE}, ?MODULE, [SuperPid], []).

init([SuperPid]) ->
    {ok, _Pid} = start_child(childapp, SuperPid), % <-- here start the other
    {ok, #state{sup=SuperPid}}.

start_child(Module, SuperPid) -> % Module = childapp
    io:format("start child before~n"), % printed
    ChildSpec = ?CHILD(Module, worker),
    {ok, Pid} = supervisor:start_child(SuperPid, ChildSpec), % <-- here freezes
    io:format("start child after~n"), % not printed
    {ok, Pid}.
And the other child source contains
% childapp
start_link([]) ->
    io:format("Child started~n"),
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% gen_server interface
init([]) ->
    {ok, #state{}}.
What I get as output when running the app is:
erl -pa ebin -eval "application:start(mysuptest)"
Erlang R16B01 (erts-5.10.2) [source-bdf5300] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.2 (abort with ^G)
1> Mainapp started
start child before
and here it stops: it freezes and does not return to the Erlang shell as usual. I do not get any error or any other messages. Any ideas? Am I starting the child properly?
When you start a child process, the call to the supervisor returns only after the child's init has returned (if the child is a gen_server, start_link blocks until init returns). You are starting the main gen_server from the supervisor, so the supervisor is waiting for mainapp to return. Meanwhile, mainapp is calling supervisor:start_child, which blocks because the supervisor is still waiting for mainapp to return. This results in a deadlock.
One possible solution is to not call start_child in mainapp's init and to do it asynchronously after init returns.
For this you can send a cast message to yourself and start the child when handling it, or you can spawn another process which starts the child and sends the response (the child's pid) back to mainapp:
init([SuperPid]) ->
    gen_server:cast(self(), {start, SuperPid}), % <-- send a cast message to itself
    {ok, #state{sup=SuperPid}}.
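The corresponding handle_cast clause (a sketch, reusing the start_child/2 helper and #state{} record from the question) then performs the blocking call only after init/1 has returned:

handle_cast({start, SuperPid}, State) ->
    %% Safe here: init/1 has already returned, so the supervisor is no longer
    %% blocked waiting for mainapp and can process the start_child request.
    {ok, _Pid} = start_child(childapp, SuperPid),
    {noreply, State}.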
Another preferred solution is having a supervision tree. The child process can have its own supervisor and the mainapp calls the child's supervisor to start the child process.
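A rough sketch of that layout, with the child specs written out instead of the ?CHILD macros (the module name child_sup is hypothetical, and in this layout mainapp:start_link would no longer need the supervisor pid):

%% In the top-level supervisor: supervise mainapp and a dedicated child_sup.
init([]) ->
    ChildSup = {child_sup, {child_sup, start_link, []},
                permanent, infinity, supervisor, [child_sup]},
    MainApp  = {mainapp, {mainapp, start_link, []},
                permanent, 5000, worker, [mainapp]},
    {ok, {{one_for_one, 5, 10}, [ChildSup, MainApp]}}.

%% In child_sup: a plain one_for_one supervisor that starts with no children.
init([]) ->
    {ok, {{one_for_one, 5, 10}, []}}.

%% In mainapp: ask child_sup (registered locally, not mainapp's own supervisor)
%% to start a child, so nothing calls back into a supervisor that is still
%% blocked in its own startup.
start_child(Module) ->
    ChildSpec = {Module, {Module, start_link, [[]]},
                 permanent, 5000, worker, [Module]},
    supervisor:start_child(child_sup, ChildSpec).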
The supervisor is an OTP behavior.
init([]) ->
    RoomSpec = {mod_zytm_room, {mod_zytm_room, start_link, []},
                transient, brutal_kill, worker, [mod_zytm_room]},
    {ok, {{simple_one_for_one, 10, 10000}, [RoomSpec]}}.
The above code invokes the child's terminate callback.
But if I change brutal_kill to an integer timeout (e.g. 6000), the terminate callback is never invoked.
I see an explanation in the Erlang document:
The dynamically created child processes of a simple-one-for-one
supervisor are not explicitly killed, regardless of shutdown strategy,
but are expected to terminate when the supervisor does (that is, when
an exit signal from the parent process is received).
But I cannot fully understand it. Does it mean that exit(Pid, kill) can terminate a simple_one_for_one child while exit(Pid, shutdown) can't?
===================================update====================================
mod_zytm_room_sup.erl
-module(mod_zytm_room_sup).
-behaviour(supervisor).
-export([start_link/0, init/1, open_room/1, close_room/1]).
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    RoomSpec = {mod_zytm_room, {mod_zytm_room, start_link, []},
                transient, brutal_kill, worker, [mod_zytm_room]},
    {ok, {{simple_one_for_one, 10, 10000}, [RoomSpec]}}.

open_room(RoomId) ->
    supervisor:start_child(?MODULE, [RoomId]).

close_room(RoomPid) ->
    supervisor:terminate_child(?MODULE, RoomPid).
mod_zytm_room.erl
-module(mod_zytm_room).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_cast/2, handle_info/2, handle_call/3, code_change/3, terminate/2]).
start_link(RoomId) ->
    gen_server:start_link(?MODULE, [RoomId], []).

init([RoomId]) ->
    {ok, []}.

terminate(_, _) ->
    error_logger:info_msg("~p terminated: ~p", [?MODULE, self()]),
    ok.
...other callbacks omitted.
mod_zytm_sup.erl
-module(mod_zytm_sup).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_cast/2, handle_info/2, handle_call/3, code_change/3, terminate/2]).
start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    {ok, []}.

%% invoked by an erlang:send_after event.
handle_info({'CLOSE_ROOM', RoomPid}, State) ->
    mod_zytm_room_sup:close_room(RoomPid),
    {noreply, State}.
...other callbacks omitted.
Both mod_zytm_sup and mod_zytm_room_sup are part of a system supervision tree; mod_zytm_sup invokes mod_zytm_room_sup to create or close mod_zytm_room processes.
Sorry, I got the result wrong.
To make it clear:
The brutal_kill strategy kills the child process immediately.
The terminate callback will be invoked if the simple_one_for_one child's shutdown value is an integer timeout, and the child must set process_flag(trap_exit, true) in its init callback, as sketched below.
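A sketch of the corresponding change to mod_zytm_room:init/1 (together with an integer shutdown value such as 6000 in RoomSpec instead of brutal_kill):

init([_RoomId]) ->
    %% Trap exits so the 'shutdown' exit signal from the supervisor results
    %% in terminate/2 being called instead of the process exiting immediately.
    process_flag(trap_exit, true),
    {ok, []}.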
FYI, from the Erlang documentation:
If the gen_server is part of a supervision tree and is ordered by its
supervisor to terminate, this function will be called with
Reason=shutdown if the following conditions apply:
the gen_server has been set to trap exit signals, and the shutdown
strategy as defined in the supervisor's child specification is an
integer timeout value, not brutal_kill.
The dynamically created child processes of a simple-one-for-one
supervisor are not explicitly killed, regardless of shutdown strategy,
but are expected to terminate when the supervisor does (that is, when
an exit signal from the parent process is received).
Note that this is no longer true. Since Erlang/OTP R15A, dynamic children are explicitly terminated as per the shutdown strategy.
I have a simple Erlang module and I want to rewrite it based on OTP principles, but I cannot determine which OTP template I should use.
Module's code:
-module(main).
-export([start/0, loop/0]).
start() ->
    Mypid = spawn(main, loop, []),
    register(main, Mypid).

loop() ->
    receive
        [Pid, getinfo] ->
            Pid ! [self(), welcome],
            io:fwrite("Got ~p.~n", [Pid]),
            %% spawn new process here
            loop();
        quit -> ok;
        X ->
            io:fwrite("Got ~p.~n", [X]),
            %% spawn new process here
            loop()
    end.
gen_server would be fine.
A couple of things:
it is bad practice to send messages to yourself
messages are usually tuples, not lists, because they are not dynamic
despite your comment, you do not spawn the new process; the call to loop/0 just re-enters the same loop
The gen_server's init/1 would hold the body of your start/0. The API functions wrap your calls and proxy them via gen_server to the handle_call/handle_cast callbacks. To spawn a new process on a function call, add a spawn to the body of the desired handle_call (or handle_cast) clause. Do not use handle_info to handle incoming messages -- instead of sending raw messages, call the gen_server API and 'translate' your call into a gen_server:call or cast, e.g.:
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init(_) ->
    {ok, #state{}}.

welcome(Arg) ->
    gen_server:cast(?MODULE, {welcome, Arg}).

handle_cast({welcome, Arg}, State) ->
    io:format("gen_server PID: ~p~n", [self()]),
    spawn(?MODULE, some_fun, [Arg]),
    {noreply, State}.

some_fun(Arg) ->
    io:format("Incoming arguments ~p to me: ~p~n", [Arg, self()]).
I have not compiled the above, but it should give you the idea.
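To actually compile it, the fragment also needs the module attributes, the state record, and stubs for the remaining gen_server callbacks; a minimal sketch, assuming the module is called main:

-module(main).
-behaviour(gen_server).

-export([start_link/0, welcome/1, some_fun/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-record(state, {}).

%% Stubs for the callbacks not used above, so the module compiles cleanly.
handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_info(_Info, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.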
I'm trying to implement a simple supervisor and just have it restart child processes if they fail. But I don't even know how to spawn more than one process under a supervisor! I looked at simple supervisor code on this site and found something.
-module(echo_sup).
-behaviour(supervisor).
-export([start_link/0]).
-export([init/1]).
start_link() ->
    {ok, Pid} = supervisor:start_link(echo_sup, []),
    unlink(Pid).

init(_Args) ->
    {ok, {{one_for_one, 5, 60},
          [{echo_server, {echo_server, start_link, []},
            permanent, brutal_kill, worker, [echo_server]},
           {echo_server2, {echo_server2, start_link, []},
            permanent, brutal_kill, worker, [echo_server2]}]}}.
I assumed that putting the "echo_server2" entry in the init() function would spawn another process under this supervisor, but I end up getting an exception exit: shutdown message.
The files "echo_server" and "echo_server2" contain the same code, just with different names. So I'm just confused now.
-module(echo_server2).
-behaviour(gen_server).
-export([start_link/0]).
-export([echo/1, crash/0]).
-export([init/1, handle_call/3, handle_cast/2]).
start_link() ->
    {ok, Pid} = gen_server:start_link({local, echo_server2}, echo_server2, [], []),
    unlink(Pid).

%% public api
echo(Text) ->
    gen_server:call(echo_server2, {echo, Text}).

crash() ->
    gen_server:call(echo_server2, crash).

%% behaviours
init(_Args) ->
    {ok, none}.

handle_call(crash, _From, State) ->
    X = 1,
    {reply, X = 2, State};
handle_call({echo, Text}, _From, State) ->
    {reply, Text, State}.

handle_cast(_, State) ->
    {noreply, State}.
First you need to read some docs about OTP/gen_server and OTP/supervisors. You have a few errors in your code.
1) In the echo_sup module, change your start_link function as follows:
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).
I don't know why you unlink/1 after the process has been started.
2) In both echo servers, change the start_link function to:
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
You should not change the return value of this function, because the supervisor expects one of these values:
{ok,Pid} | ignore | {error,Error}
You don't need two different modules just to run two instances of the same server. The conflict problem is due to the tag in the child specification which has to be unique. It is the first element in the tuple. So you could have something like:
[{echo_server, {echo_server, start_link, []},
  permanent, brutal_kill, worker, [echo_server]},
 {echo_server2, {echo_server, start_link, []},
  permanent, brutal_kill, worker, [echo_server]}]}}.
Why do you unlink the child processes? The supervisor uses these links to supervise its children. The error you are getting is because the supervisor expects the functions which start the children to return {ok,ChildPid}; this is how it gets the pid of each child. So when it gets another return value it fails the startup of the children and then gives up itself, all according to how it is supposed to work.
If you want to register both servers, you could modify the start_link function to take the name to use as an argument, so you can pass it in explicitly through the child spec. So:
start_link(Name) ->
    gen_server:start_link({local, Name}, ?MODULE, [], []).
and
[{echo_server, {echo_server, start_link, [echo_server]},
  permanent, brutal_kill, worker, [echo_server]},
 {echo_server2, {echo_server, start_link, [echo_server2]},
  permanent, brutal_kill, worker, [echo_server]}]}}.
Using the module name as the registered name for a server is just a convention which only works if you run one instance of the server.
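With the parameterised start_link above, the public API functions also need to take the server's registered name (or pid) as an argument, since there are now two instances of the same module; a small sketch:

echo(Server, Text) ->
    %% Server is the registered name (echo_server or echo_server2) or a pid.
    gen_server:call(Server, {echo, Text}).

%% e.g. echo_server:echo(echo_server, "hello") or
%%      echo_server:echo(echo_server2, "hello").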