Erlang supervisor does not restart child - erlang

I'm trying to learn about erlang supervisors. I have a simple printer process that prints hello every 3 seconds. I also have a supervisor that must restart the printer process if any exception occurs.
Here is my code:
test.erl:
-module(test).
-export([start_link/0]).

start_link() ->
    io:format("started~n"),
    Pid = spawn_link(fun() -> loop() end),
    {ok, Pid}.

loop() ->
    timer:sleep(3000),
    io:format("hello~n"),
    loop().
test_sup.erl:
-module(test_sup).
-behaviour(supervisor).
-export([start_link/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init(_Args) ->
    SupFlags = #{strategy => one_for_one, intensity => 1, period => 5},
    ChildSpecs = [#{id => test,
                    start => {test, start_link, []},
                    restart => permanent,
                    shutdown => brutal_kill,
                    type => worker,
                    modules => [test]}],
    {ok, {SupFlags, ChildSpecs}}.
Now I run this program, start the supervisor with the test_sup:start_link(). command, and after a few seconds I raise an exception. Why does the supervisor not restart the printer process?
Here is the shell output:
1> test_sup:start_link().
started
{ok,<0.36.0>}
hello
hello
hello
hello
2> erlang:error(err).
=ERROR REPORT==== 13-Dec-2016::00:57:10 ===
** Generic server test_sup terminating
** Last message in was {'EXIT',<0.34.0>,
                        {err,
                         [{erl_eval,do_apply,6,
                           [{file,"erl_eval.erl"},{line,674}]},
                          {shell,exprs,7,
                           [{file,"shell.erl"},{line,686}]},
                          {shell,eval_exprs,7,
                           [{file,"shell.erl"},{line,641}]},
                          {shell,eval_loop,3,
                           [{file,"shell.erl"},{line,626}]}]}}
** When Server state == {state,
                         {local,test_sup},
                         one_for_one,
                         [{child,<0.37.0>,test,
                           {test,start_link,[]},
                           permanent,brutal_kill,worker,
                           [test]}],
                         undefined,1,5,[],0,test_sup,[]}
** Reason for termination ==
** {err,[{erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,674}]},
         {shell,exprs,7,[{file,"shell.erl"},{line,686}]},
         {shell,eval_exprs,7,[{file,"shell.erl"},{line,641}]},
         {shell,eval_loop,3,[{file,"shell.erl"},{line,626}]}]}
** exception error: err
** exception error: err

Here's the architecture you've created with your files:
test_sup (supervisor)
      ^
      |
      v
 test (worker)
Then you start your supervisor by calling start_link() in the shell. This creates another bidirectional link:
     shell
      ^
      |
      v
test_sup (supervisor)
      ^
      |
      v
 test (worker)
With a bidirectional link, if either side dies, the other side is killed.
When you run erlang:error, you're causing an error in your shell!
Your shell is linked to your supervisor, so Erlang kills the supervisor in response. By chain reaction, your worker gets killed too.
I think you intended to send the error condition to your worker rather than the shell:
Determine the Pid of your worker with supervisor:which_children/1.
Call erlang:exit(Pid, Reason) on the worker's Pid.
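Put together, a session along those lines might look like this (a sketch; the child-tuple shape is the [{Id, Pid, Type, Modules}] list documented for supervisor:which_children/1, and the pids are illustrative):

```erlang
%% Find the worker under test_sup and send the exit to it directly.
1> {ok, _Sup} = test_sup:start_link().
2> [{test, WorkerPid, worker, [test]}] = supervisor:which_children(test_sup).
3> exit(WorkerPid, err).  %% only the worker dies; the supervisor reacts
```

Because the child is declared permanent, the supervisor should start a fresh loop process as soon as it receives the worker's exit signal.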

When you execute erlang:error(err)., you are killing the calling process: your shell.
Since you used start_link to start the supervisor, the supervisor is killed as well, and the loop process with it.
The shell is automatically restarted (thanks to a supervisor of its own), but nobody restarts your test supervisor, so the loop cannot be restarted either.
To make this test you should do:
in module test:
start_link() ->
    Pid = spawn_link(fun() -> loop() end),
    io:format("started ~p~n", [Pid]),
    {ok, Pid}.
you will get a prompt:
started <0.xx.0>
where <0.xx.0> is the loop pid, and in the shell you can call
exit(pid(0,xx,0), err).
to kill the loop only.
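With that change, the supervisor gets to do its job: killing only the loop triggers a restart. A session might look roughly like this (a sketch; pids and message ordering are illustrative):

```erlang
1> test_sup:start_link().
started <0.62.0>
{ok,<0.61.0>}
hello
2> exit(pid(0,62,0), err).
true
started <0.63.0>   %% the permanent child is restarted by the supervisor
hello
```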

Related

Can't start process in erlang node

I have two Erlang nodes: node01 is 'vm01@192.168.146.128' and node02 is 'vm02@192.168.146.128'. I want to start a process on node01 by calling spawn(Node, Mod, Fun, Args) from node02, but I always get a useless pid.
Node connection is ok:
(vm02@192.168.146.128)14> net_adm:ping('vm01@192.168.146.128').
pong
Module is in the path of node01 and node02:
(vm01@192.168.146.128)7> m(remote_process).
Module: remote_process
MD5: 99784aa56b4feb2f5feed49314940e50
Compiled: No compile time info available
Object file: /src/remote_process.beam
Compiler options: []
Exports:
init/1
module_info/0
module_info/1
start/0
ok
(vm02@192.168.146.128)20> m(remote_process).
Module: remote_process
MD5: 99784aa56b4feb2f5feed49314940e50
Compiled: No compile time info available
Object file: /src/remote_process.beam
Compiler options: []
Exports:
init/1
module_info/0
module_info/1
start/0
ok
However, the spawn is not successful:
(vm02@192.168.146.128)21> spawn('vm01@192.168.146.128', remote_process, start, []).
I'm on node 'vm01@192.168.146.128'
<9981.89.0>
My pid is <9981.90.0>
(vm01@192.168.146.128)8> whereis(remote_process).
undefined
The process is able to run on the local node:
(vm02@192.168.146.128)18> remote_process:start().
I'm on node 'vm02@192.168.146.128'
My pid is <0.108.0>
{ok,<0.108.0>}
(vm02@192.168.146.128)24> whereis(remote_process).
<0.115.0>
But it fails on remote node. Can anyone give me some idea?
Here is the source code remote_process.erl:
-module(remote_process).
-behaviour(supervisor).
-export([start/0, init/1]).

start() ->
    {ok, Pid} = supervisor:start_link({global, ?MODULE}, ?MODULE, []),
    {ok, Pid}.

init([]) ->
    io:format("I'm on node ~p~n", [node()]),
    io:format("My pid is ~p~n", [self()]),
    {ok, {{one_for_one, 1, 5}, []}}.
You are registering your process globally, so a plain whereis/1 will not find it. The function to retrieve it is global:whereis_name(remote_process).
Edit: it works if
the 2 nodes are connected (check with nodes()),
the process is registered with the global module,
the process is still alive.
If any of these conditions is not satisfied, you will get undefined.
Edit 2: start node 1 with werl -sname p1 and type in the shell:
(p1@W7FRR00423L)1> c(remote_process).
{ok,remote_process}
(p1@W7FRR00423L)2> remote_process:start().
I'm on node p1@W7FRR00423L
My pid is <0.69.0>
{ok,<0.69.0>}
(p1@W7FRR00423L)3> global:whereis_name(remote_process).
<0.69.0>
(p1@W7FRR00423L)4>
then start a second node with werl -sname p2 and type in its shell (it is OK to connect the second node later; the global registration is updated when necessary):
(p2@W7FRR00423L)1> net_kernel:connect_node(p1@W7FRR00423L).
true
(p2@W7FRR00423L)2> nodes().
[p1@W7FRR00423L]
(p2@W7FRR00423L)3> global:whereis_name(remote_process).
<7080.69.0>
(p2@W7FRR00423L)4>
Edit 3:
In your test you are spawning a process P1 on the remote node, which executes the function remote_process:start/0.
This function calls supervisor:start_link/3, which spawns a new supervisor process P2 and links the calling process to it. After this, P1 has nothing left to do, so it dies, causing the linked process P2 to die too, and you get undefined from the global:whereis_name call.
In my test, I start the process from the shell of the remote node; the shell does not die after I evaluate remote_process:start/0, so the supervisor process does not die and global:whereis_name finds the requested pid.
If you want the supervisor to survive the call, you need an intermediate process that is spawned without a link, so it will not die with its parent. Here is a small example based on your code:
-module(remote_process).
-behaviour(supervisor).
-export([start/0, init/1, local_spawn/0, remote_start/1]).

remote_start(Node) ->
    spawn(Node, ?MODULE, local_spawn, []).

local_spawn() ->
    %% spawn without a link, so start_wait_stop will survive
    %% the death of the local_spawn process
    spawn(fun start_wait_stop/0).

start_wait_stop() ->
    start(),
    receive
        stop -> ok
    end.

start() ->
    io:format("start (~p)~n", [self()]),
    {ok, Pid} = supervisor:start_link({global, ?MODULE}, ?MODULE, []),
    {ok, Pid}.

init([]) ->
    io:format("I'm on node ~p~n", [node()]),
    io:format("My pid is ~p~n", [self()]),
    {ok, {{one_for_one, 1, 5}, []}}.
In the shell of node 1 you get:
(p1@W7FRR00423L)1> net_kernel:connect_node(p2@W7FRR00423L).
true
(p1@W7FRR00423L)2> c(remote_process).
{ok,remote_process}
(p1@W7FRR00423L)3> global:whereis_name(remote_process).
undefined
(p1@W7FRR00423L)4> remote_process:remote_start(p2@W7FRR00423L).
<7080.68.0>
start (<7080.69.0>)
I'm on node p2@W7FRR00423L
My pid is <7080.70.0>
(p1@W7FRR00423L)5> global:whereis_name(remote_process).
<7080.70.0>
(p1@W7FRR00423L)6> global:whereis_name(remote_process).
undefined
and in node 2:
(p2@W7FRR00423L)1> global:registered_names(). % before step 4
[]
(p2@W7FRR00423L)2> global:registered_names(). % after step 4
[remote_process]
(p2@W7FRR00423L)3> rp(processes()).
[<0.0.0>,<0.1.0>,<0.4.0>,<0.30.0>,<0.31.0>,<0.33.0>,
 <0.34.0>,<0.35.0>,<0.36.0>,<0.37.0>,<0.38.0>,<0.39.0>,
 <0.40.0>,<0.41.0>,<0.42.0>,<0.43.0>,<0.44.0>,<0.45.0>,
 <0.46.0>,<0.47.0>,<0.48.0>,<0.49.0>,<0.50.0>,<0.51.0>,
 <0.52.0>,<0.53.0>,<0.54.0>,<0.55.0>,<0.56.0>,<0.57.0>,
 <0.58.0>,<0.62.0>,<0.64.0>,<0.69.0>,<0.70.0>]
ok
(p2@W7FRR00423L)4> pid(0,69,0) ! stop. % between steps 5 and 6
stop
(p2@W7FRR00423L)5> global:registered_names().
[]

Erlang simple_one_for_one supervisor does not restart child

I have a test module and a simple_one_for_one supervisor.
test.erl
-module(test).
-export([
    run/1,
    do_job/1
]).

run(Fun) ->
    test_sup:start_child([Fun]).

do_job(Fun) ->
    Pid = spawn(Fun),
    io:format("started ~p~n", [Pid]),
    {ok, Pid}.
test_sup.erl
-module(test_sup).
-behaviour(supervisor).
-export([start_link/0]).
-export([init/1]).
-export([start_child/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init(_Args) ->
    SupFlags = #{strategy => simple_one_for_one, intensity => 2, period => 20},
    ChildSpecs = [#{id => test,
                    start => {test, do_job, []},
                    restart => permanent,
                    shutdown => brutal_kill,
                    type => worker,
                    modules => [test]}],
    {ok, {SupFlags, ChildSpecs}}.

start_child(Args) ->
    supervisor:start_child(?MODULE, Args).
I start the supervisor in the shell with test_sup:start_link(). After that I run test:run(fun() -> erlang:throw(err) end). I expect the do_job function to be restarted 2 times, but it never is. What is the problem?
Here is shell:
1> test_sup:start_link().
{ok,<0.36.0>}
2> test:run(fun() -> erlang:throw(err) end).
started <0.38.0>
{ok,<0.38.0>}
3>
=ERROR REPORT==== 16-Dec-2016::22:08:41 ===
Error in process <0.38.0> with exit value:
{{nocatch,err},[{erlang,apply,2,[]}]}
Restarting children is contrary to how simple_one_for_one supervisors are defined. Per the supervisor docs:
Functions delete_child/2 and restart_child/2 are invalid for simple_one_for_one supervisors and return {error,simple_one_for_one} if the specified supervisor uses this restart strategy.
In other words, what you're asking for can never happen. That's because a simple_one_for_one supervisor is intended for dynamic children that are defined on the fly, by passing in additional startup args when you request a child. Other supervisors are able to restart their children because the startup args are statically defined in the supervisor.
Basically, this type of supervisor is strictly for ensuring a tidy shutdown when you need a dynamic pool of workers.
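To illustrate the "dynamic children" point, here is a sketch of how a simple_one_for_one pool is normally driven, reusing the test_sup above (the anonymous funs are placeholders for real work):

```erlang
%% Each start_child call appends its own args to the statically
%% defined {test, do_job, []} start MFA, so every child can differ.
{ok, _Sup} = test_sup:start_link(),
{ok, _C1} = test_sup:start_child([fun() -> timer:sleep(1000) end]),
{ok, _C2} = test_sup:start_child([fun() -> timer:sleep(5000) end]).
```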

Erlang supervisor shutdowns after running child

I have a test module and a one_for_one supervisor.
test.erl
-module(test).
-export([do_job/1, run/2, start_worker/1]).

run(Id, Fun) ->
    test_sup:start_child(Id, [Fun]).

do_job(Fun) ->
    Fun().

start_worker(Args) ->
    Pid = spawn_link(test, do_job, Args),
    io:format("started ~p~n", [Pid]),
    {ok, Pid}.
test_sup.erl
-module(test_sup).
-behaviour(supervisor).
-export([start_link/0]).
-export([init/1]).
-export([start_child/2]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init(_Args) ->
    SupFlags = #{strategy => one_for_one, intensity => 2, period => 20},
    {ok, {SupFlags, []}}.

start_child(Id, Args) ->
    ChildSpecs = #{id => Id,
                   start => {test, start_worker, [Args]},
                   restart => transient,
                   shutdown => brutal_kill,
                   type => worker,
                   modules => [test]},
    supervisor:start_child(?MODULE, ChildSpecs).
Now I start the supervisor in the shell and run test:run(id, fun() -> erlang:throw(err) end). It works, and start_worker/1 runs three times, but after that an exception occurs, the supervisor process shuts down, and I must start it again manually with test_sup:start_link(). What is the problem?
Shell:
1> test_sup:start_link().
{ok,<0.36.0>}
2> test:run(id, fun() -> erlang:throw(err) end).
started <0.38.0>
started <0.39.0>
started <0.40.0>
{ok,<0.38.0>}
=ERROR REPORT==== 16-Dec-2016::23:31:50 ===
Error in process <0.38.0> with exit value:
{{nocatch,err},[{test,do_job,1,[]}]}
=ERROR REPORT==== 16-Dec-2016::23:31:50 ===
Error in process <0.39.0> with exit value:
{{nocatch,err},[{test,do_job,1,[]}]}
=ERROR REPORT==== 16-Dec-2016::23:31:50 ===
Error in process <0.40.0> with exit value:
{{nocatch,err},[{test,do_job,1,[]}]}
** exception exit: shutdown
What is the problem?
There is no "problem". It's working exactly as you told it to:
To prevent a supervisor from getting into an infinite loop of child process terminations and restarts, a maximum restart intensity is defined using two integer values specified with keys intensity and period in the above map. Assuming the values MaxR for intensity and MaxT for period, then, if more than MaxR restarts occur within MaxT seconds, the supervisor terminates all child processes and then itself.
Your supervisor's configuration says, "If I have to restart a child more than two times (intensity) in 20 seconds (period), then something is wrong, so just shut down." As to why you have to restart the supervisor manually, it's because your supervisor isn't supervised itself. Otherwise, the supervisor's supervisor might try to restart it based on its own configuration.
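For completeness, a minimal sketch of "supervising the supervisor", assuming a hypothetical top-level module top_sup that owns the test_sup from the question (module name and intensity values are illustrative):

```erlang
-module(top_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    ChildSpecs = [#{id => test_sup,
                    start => {test_sup, start_link, []},
                    restart => permanent,
                    shutdown => infinity,  %% give a supervisor child time to stop its own children
                    type => supervisor,
                    modules => [test_sup]}],
    {ok, {SupFlags, ChildSpecs}}.
```

When test_sup gives up after exceeding its own restart intensity, top_sup would see the exit and restart it according to its own intensity/period settings.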

Supervisor crashes when worker does, where Supervisor shouldn't

I have an Erlang supervisor that supervises a worker server based on gen_server. I start my supervisor from the shell, and it in turn starts my worker server with no problems. It looks like this:
start_link() ->
    supervisor:start_link({local, ?SERVER}, ?MODULE, []).
But when I crash my worker server, my supervisor crashes with it for an unknown reason.
I found a fix for this on the internet; I use this:
start_link_shell() ->
    {ok, Pid} = supervisor:start_link({local, ?SERVER}, ?MODULE, []),
    unlink(Pid).
Now it works fine, but I don't understand why. Can anyone explain?
Update
This is my init function:
%%%===================================================================
init([]) ->
    %% Building supervisor specification
    RestartStrategy = one_for_one,
    MaxRestarts = 2,
    MaxSecondsBetweenRestarts = 5000,
    SupFlags = {RestartStrategy, MaxRestarts, MaxSecondsBetweenRestarts},
    %% Building child specification
    Restart = permanent,
    Shutdown = 2000, % milliseconds the child is given to clean up after the shutdown signal
    Type = worker,
    ChildSpec = {'db_server',
                 {'db_server', start_link, []},
                 Restart,
                 Shutdown,
                 Type,
                 ['db_server']},
    %% Putting supervisor and child specifications in the return value
    {ok, {SupFlags, [ChildSpec]}}.
As per this link:
The problem with testing supervisors from the shell is that the supervisor process is linked to the shell process. When the gen_server process crashes, the exit signal is propagated up to the shell, which crashes and gets restarted. This setup should be used for testing only; otherwise, the OTP application should start the supervisor and be linked to it.
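A sketch of the shell-testing variant that also preserves the usual {ok, Pid} return shape (assuming the question's ?SERVER macro; note that unlink/1 itself returns true, which is why the original start_link_shell/0 returned true instead of {ok, Pid}):

```erlang
%% Detach the supervisor from the shell right after starting it,
%% so a worker crash cannot propagate through the shell link.
start_in_shell() ->
    {ok, Pid} = supervisor:start_link({local, ?SERVER}, ?MODULE, []),
    true = unlink(Pid),
    {ok, Pid}.
```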

Registering a global process in a startup script

I wrote a supervisor (shown below).
It only has one child process, which I get from locations:start_link/0. I expect it to start up a supervisor and register itself globally. That way, I can get to it by using global:whereis_name/1.
When I start the supervisor through the shell it works as expected:
$ erl
1> locator_sup:start_link().
registering global supervisor
starting it....
supervisor <0.34.0>
{ok,<0.34.0>}
Then I can get to it by its global name, locator_sup:
2> global:whereis_name( locator_sup ).
<0.34.0>
But I want to start the system using a startup script, so I tried starting the system like so:
$ erl -s locator_sup start_link
registering global supervisor
starting it....
supervisor <0.32.0>
It seems that the init function for the supervisor is being called, but when I try to find the supervisor by its global name, I get undefined
1> global:whereis_name( locator_sup ).
undefined
So my question is, why does the supervisor process only get registered when I use start_link from the shell?
The supervisor module:
-module(locator_sup).
-behaviour(supervisor).

%% API
-export([start_link/0]).

%% Supervisor callbacks
-export([init/1]).

%% ===================================================================
%% API functions
%% ===================================================================
start_link() ->
    io:format("registering global supervisor\n"),
    {ok, E} = supervisor:start_link({global, ?MODULE}, ?MODULE, []),
    io:format("supervisor ~p\n", [E]),
    {ok, E}.

%% ===================================================================
%% Supervisor callbacks
%% ===================================================================
%% only going to start the gen_server that keeps track of locations
init(_) ->
    io:format("starting it....\n"),
    {ok, {{one_for_one, 1, 60},
          [{locations, {locations, start_link, []},
            permanent, brutal_kill, worker, [locations]}]}}.
One likely reason is that you are starting your node without distributed mode enabled.
First, add these parameters to see what happens during startup: erl -boot start_sasl.
Second, add a node name (this automatically enables distributed mode): ... -sname my_node
So the startup command will look like:
erl -boot start_sasl -sname my_node -s locator_sup start_link
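Once the node is up, a quick way to verify both points (a sketch; node name and pid depend on your machine, so no concrete output is shown):

```erlang
%% In the shell started by the command above:
1> node().                            %% should not be nonode@nohost
2> global:whereis_name(locator_sup).  %% a pid here means registration worked
```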
