The following code is from RabbitMQ's supervisor2.erl. Its job is to kill the supervisor's children by doing the following for every child:
monitor the child and send it a trappable exit signal
start a timer
if the timer fires, send an untrappable exit signal (kill)
My question is about the 'EXIT' and 'DOWN' signals.
If the child doesn't trap the exit signal, the supervisor will receive two messages, first the 'EXIT' and then the 'DOWN', is that right? Is that sequence strictly guaranteed?
If the child traps the exit signal, the supervisor will receive only one message, just the 'DOWN', is that right?
terminate_simple_children(Child, Dynamics, SupName) ->
    Pids = dict:fold(fun (Pid, _Args, Pids) ->
                         erlang:monitor(process, Pid),
                         unlink(Pid),
                         exit(Pid, child_exit_reason(Child)),
                         [Pid | Pids]
                     end, [], Dynamics),
    TimeoutMsg = {timeout, make_ref()},
    TRef = timeout_start(Child, TimeoutMsg),
    {Replies, Timedout} =
        lists:foldl(
          fun (_Pid, {Replies, Timedout}) ->
                  {Reply, Timedout1} =
                      receive
                          TimeoutMsg ->
                              Remaining = Pids -- [P || {P, _} <- Replies],
                              [exit(P, kill) || P <- Remaining],
                              receive {'DOWN', _MRef, process, Pid, Reason} ->
                                      {{error, Reason}, true}
                              end;
                          {'DOWN', _MRef, process, Pid, Reason} ->
                              {child_res(Child, Reason, Timedout), Timedout};
                          {'EXIT', Pid, Reason} -> %%<==== strict signal, first EXIT, then DOWN.
                              receive {'DOWN', _MRef, process, Pid, _} ->
                                      {{error, Reason}, Timedout}
                              end
                      end,
                  {[{Pid, Reply} | Replies], Timedout1}
          end, {[], false}, Pids),
    timeout_stop(Child, TRef, TimeoutMsg, Timedout),
    ReportError = shutdown_error_reporter(SupName),
    [case Reply of
         {_Pid, ok}        -> ok;
         {Pid, {error, R}} -> ReportError(R, Child#child{pid = Pid})
     end || Reply <- Replies],
    ok.
There are two things you are confusing here:
First, the child is the one trapping exits, but you are looking at the supervisor code. What the child does with exit signals does not directly affect the supervisor.
Second, the kill exit signal cannot be trapped. It always kills the child.
supervisor2 has a monitor on the child. This means it is guaranteed to get a 'DOWN' message, and this code is primarily concerned with that kind of message. If supervisor2 is also trapping exits and is linked to the child, it will get the 'EXIT' message in addition.
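To make that concrete, here is a minimal sketch (my own addition, not RabbitMQ code) of a process that traps exits, links to a child, and also monitors it; it receives two separate messages when the child dies. I would not rely on a documented ordering between the two, and note that the supervisor2 code above handles either arrival order:
-module(exit_down_demo).
-export([demo/0]).

demo() ->
    process_flag(trap_exit, true),
    Child = spawn_link(fun() -> receive die -> exit(boom) end end),
    MRef = erlang:monitor(process, Child),
    Child ! die,
    %% One message is {'EXIT', Child, boom} (from the link),
    %% the other is {'DOWN', MRef, process, Child, boom} (from the monitor).
    First  = receive Msg1 -> Msg1 end,
    Second = receive Msg2 -> Msg2 end,
    {MRef, First, Second}.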
I am new to Erlang and I am trying to pass the result obtained from one function to another through message passing, but I am unsure why it gives me badarg errors when I do e.g. main ! {numbers, 1, 100}. I have tried to make some changes, as seen in the commented-out portions, but the message didn't get passed when using From. Here is my code:
-module(numbers).
-export([start_main/0, main/2, proc/0, sumNumbers/2]).

main(Total, N) ->
    if
        N == 0 ->
            io:format("Total: ~p ~n", [Total]);
        true ->
            true
    end,
    receive
        {numbers, Low, High} ->
            Half = round(High/2),
            Pid1 = spawn(numbers, proc, []),
            Pid2 = spawn(numbers, proc, []),
            Pid1 ! {work, Low, Half},
            Pid2 ! {work, Half+1, High};
            %% Pid1 ! {work, self(), Low, Half},
            %% Pid2 ! {work, self(), Half+1, High},
        {return, Result} ->
            io:format("Received ~p ~n", [Result]),
            main(Total+Result, N-1)
    end.

proc() ->
    receive
        {work, Low, High} ->
            io:format("Processing ~p to ~p ~n", [Low, High]),
            Res = sumNumbers(Low, High),
            main ! {return, Res},
            proc()
        %% {work, From, Low, High} ->
        %%     io:format("Processing ~p to ~p ~n", [Low, High]),
        %%     Res = sumNumbers(Low, High),
        %%     From ! {return, Res},
        %%     proc()
    end.

sumNumbers(Low, High) ->
    Lst = lists:seq(Low, High),
    lists:sum(Lst).

start_main() ->
    register(main, spawn(numbers, main, [0, 2])).
If you get a badarg from main ! Message, it means that there is no process registered under the name main. Either it didn't get started or registered properly to begin with, or it has died at some point. You can add some more print statements to follow what happens.
You can call erlang:registered/0 in the shell to see which names are registered.
When you send the first message to main, it is handled by the receive statement clause matching {numbers,1,100}. In that clause you spawn the proc() function twice and then fall out of the receive statement, so main/2 returns.
Thus the main process dies and is no longer registered, and as soon as the proc processes try to send their messages with main ! {return, Res}, you get the badarg error.
You need to recursively call the main function, as the {return, Result} clause does with main(Total+Result, N-1), to keep the main process alive (see RichardC's answer).
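A minimal sketch of the fix (my rendering of the point above, not RichardC's exact code): make the {numbers, Low, High} clause loop, so the registered main process is still alive when the {return, Res} messages arrive:
{numbers, Low, High} ->
    Half = round(High/2),
    Pid1 = spawn(numbers, proc, []),
    Pid2 = spawn(numbers, proc, []),
    Pid1 ! {work, Low, Half},
    Pid2 ! {work, Half+1, High},
    main(Total, N);   %% recurse instead of falling off the end of main/2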
Executing this function:
start_main() ->
register(main, spawn(numbers, main, [0,2])).
will start a process called main, which executes the function numbers:main/2. After the function numbers:main/2 starts executing, it enters a receive clause and waits for a message.
If you then execute the function proc/0, it too will enter a receive clause and wait for a message. If you send the process running proc/0 a message, like {work, 1, 3}, then proc/0 will send a message, like {return, 6}, to numbers:main/2; and numbers:main/2 will display some output. Here's the proof:
~/erlang_programs$ erl
Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V9.3 (abort with ^G)
1> c(numbers).
{ok,numbers}
2> numbers:start_main().
true
3> Pid = spawn(numbers, proc, []).
<0.73.0>
4> Pid ! {work, 1, 3}.
Processing 1 to 3
{work,1,3}
Received 6
5>
As you can see, there's no badarg error here. When you claim something happens, you need to back it up with evidence; merely saying that it happens is not enough.
Here is an example trace where I call erlang:monitor/2 several times on the same Pid:
1> Loop = fun F() -> F() end.
#Fun<erl_eval.30.99386804>
2> Pid = spawn(Loop).
<0.71.0>
3> erlang:monitor(process, Pid).
#Ref<0.2485499597.1470627842.126937>
4> erlang:monitor(process, Pid).
#Ref<0.2485499597.1470627842.126942>
5> erlang:monitor(process, Pid).
#Ref<0.2485499597.1470627842.126947>
The references returned by commands #4 and #5 are different from the one returned by #3, meaning that it is possible to create multiple monitor references between the current process and Pid. Is there a practical case where you would need or use multiple monitor references to the same process?
I would expect this to return the same reference (returning a new one would perhaps imply that the old one had failed/crashed), following the same logic that exists for link/1.
Imagine you use a third-party library which does this (basically what the OTP *:call/* functions do):
call(Pid, Request) ->
    call(Pid, Request, ?DEFAULT_TIMEOUT).

call(Pid, Request, Timeout) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {call, self(), MRef, Request},
    receive
        {answer, MRef, Result} ->
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, _, _, Info} ->
            {error, Info}
    after Timeout ->
        erlang:demonitor(MRef, [flush]),
        {error, timeout}
    end.
and then you use it in code of your own that already monitors the same process Pid before (indirectly) calling call/2,3:
my_fun1(Service) ->
    MRef = erlang:monitor(process, Service),
    ok = check_if_service_runs(MRef),
    my_fun2(Service),
    mind_my_stuf(),
    ok = check_if_service_runs(MRef),
    erlang:demonitor(MRef, [flush]),
    return_some_result().

check_if_service_runs(MRef) ->
    receive
        {'DOWN', MRef, _, _, Info} -> {down, Info}
    after 0 -> ok
    end.

my_fun2(S) -> my_fun3(S).

%% ... and many layers of other stuff and modules

my_fun3(S) -> call(S, hello).
What a nasty surprise it would be if erlang:monitor/2 always returned the same reference, and erlang:demonitor/1,2 therefore removed your previous monitor. It would be a source of ugly and unsolvable bugs. Remember that your code coexists with libraries and other processes as part of a huge system, and Erlang was made by experienced people who thought this through. Maintainability is key here.
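As a quick illustration (my sketch, not from any library): monitors are fully independent, so demonitoring one reference leaves the other intact, and each remaining monitor delivers its own 'DOWN' message:
demo() ->
    Pid = spawn(fun() -> receive stop -> ok end end),
    MRef1 = erlang:monitor(process, Pid),
    MRef2 = erlang:monitor(process, Pid),
    erlang:demonitor(MRef1, [flush]),   %% removes only the first monitor
    Pid ! stop,
    receive
        {'DOWN', MRef2, process, Pid, normal} ->
            second_monitor_still_fired
    end.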
I am trying to use OTP style in a project and have one OTP-interface question: which solution is more popular/beautiful?
What I have:
a web server built with mochiweb
one process that spawns many (1000-2000) children
The children hold state (netflow speed). The parent process proxies messages to the children and creates new children when needed.
In mochiweb I have one page showing the speed of all actors; here is how it is built:
nf_collector ! {get_abonents_speed, self()},
receive
    {abonents_speed_count, AbonentsCount} ->
        ok
end,
%% write http header, chunked,
%% and while AbonentsCount != 0, receive speeds and write http chunks
This is not OTP style, as far as I understand. Possible solutions:
A synchronous API function that collects all the speeds and returns them as one list. But I want to write them to the client as they arrive.
An API function that takes a callback as an argument:
nf_collector:get_all_speeds(fun (Speed) -> Resp:write_chunk(templater(Speed)) end)
Return an iterator:
One of the results of get_all_speeds will be a function with a receive block. Every call of it returns {ok, Speed, Next}; at the end it returns {'end'}.
get_all_speeds() ->
    nf_collector ! {get_abonents_speed, self()},
    receive
        {abonents_speed_count, AbonentsCount} ->
            ok
    end,
    {ok, fun() ->
             create_receive_fun(AbonentsCount)
         end}.

create_receive_fun(0) ->
    {'end'};                    %% 'end' is a reserved word, so it has to be quoted
create_receive_fun(Count) ->
    receive
        {abonent_speed, Speed} ->
            Speed
    end,
    %% wrap the recursion in a fun so the caller pulls one speed per call
    {ok, Speed, fun() -> create_receive_fun(Count-1) end}.
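For what it's worth, a hypothetical consumer of that iterator could look like this (my sketch; Resp:write_chunk and templater are the mochiweb bits from above, not defined here):
write_all_speeds(Resp) ->
    {ok, Next} = get_all_speeds(),
    drain(Next, Resp).

drain(Next, Resp) ->
    case Next() of
        {'end'} ->
            ok;
        {ok, Speed, Next1} ->
            Resp:write_chunk(templater(Speed)),   %% stream each speed as it arrives
            drain(Next1, Resp)
    end.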
Spawn your 'children' from a supervisor:
-module(ch_sup).
-behaviour(supervisor).
-export([start_link/0, init/1, start_child/1]).

start_link() -> supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% the strategy tuple needs restart intensity and period; a bare
%% {simple_one_for_one} is not a valid supervisor spec
init([]) ->
    {ok, {{simple_one_for_one, 1, 5},
          [{ch, {ch, start_link, []}, transient, 1000, worker, [ch]}]}}.

start_child(Data) -> supervisor:start_child(?MODULE, [Data]).
Start them with ch_sup:start_child/1 (Data is whatever).
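For instance (a hypothetical call; AbonentIds stands for whatever collection of identifiers you keep):
[ch_sup:start_child(AbonentId) || AbonentId <- AbonentIds].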
Implement your children as a gen_server:
-module(ch).
-behaviour(gen_server).

-record(?MODULE, {speed}).

...

get_speed(Pid, Timeout) ->
    try
        gen_server:call(Pid, get, Timeout)
    catch
        exit:{timeout, _} -> timeout;
        exit:{noproc, _}  -> died
    end.

...

handle_call(get, _From, St) -> {reply, {ok, St#?MODULE.speed}, St}.
You can now use the supervisor to get the list of running children and query them, though you have to accept the possibility of a child dying between getting the list of children and calling it, and obviously a child could for some reason be alive but not respond, or respond with an error, etc.
The get_speed/2 function above returns either {ok, Speed}, died, or timeout. It remains for you to filter appropriately according to your application's needs; that's easy with a list comprehension. Here are a few.
Just the speeds:
[Speed || {ok, Speed} <-
    [ch:get_speed(Pid, 1000) || Pid <-
        [Pid || {undefined, Pid, worker, [ch]} <-
            supervisor:which_children(ch_sup)]]].
Pid and speed tuples:
[{Pid, Speed} || {Pid, {ok, Speed}} <-
    [{Pid, ch:get_speed(Pid, 1000)} || Pid <-
        [Pid || {undefined, Pid, worker, [ch]} <-
            supervisor:which_children(ch_sup)]]].
All results, including timeouts and 'died' results for children that died before you got to them:
[{Pid, Any} || {Pid, Any} <-
    [{Pid, ch:get_speed(Pid, 1000)} || Pid <-
        [Pid || {undefined, Pid, worker, [ch]} <-
            supervisor:which_children(ch_sup)]]].
In most situations you almost certainly don't want anything other than the speeds, because what are you going to do about deaths and timeouts? You want those that die to be respawned by the supervisor, so the problem is more or less fixed by the time you know about it, and timeouts, as with any fault, are a separate problem, to be dealt with in whatever way you see fit... There's no need to mix the fault-fixing logic with the data-retrieval logic though.
Now, the problem with all these, which I think you were getting at in your post, is that the timeout of 1000 is for each call, and the calls are made synchronously one after the other, so for 1000 children with a 1-second timeout it could take 1000 seconds to produce no results. Making the timeout 1 ms might be the answer, but to do it properly is a bit more complicated:
get_speeds() ->
    ReceiverPid = self(),
    Ref = make_ref(),
    Pids = [Pid || {undefined, Pid, worker, [ch]} <-
                       supervisor:which_children(ch_sup)],
    lists:foreach(
        fun(Pid) ->
            spawn(fun() -> ReceiverPid ! {Ref, ch:get_speed(Pid, 1000)} end)
        end,
        Pids),
    receive_speeds(Ref, length(Pids), os_milliseconds(), 1000).

receive_speeds(_Ref, 0, _StartTime, _Timeout) ->
    [];
receive_speeds(Ref, Remaining, StartTime, Timeout) ->
    Time = os_milliseconds(),
    TimeLeft = max(0, Timeout - (Time - StartTime)),   %% 'after' needs a non-negative value
    receive
        {Ref, acc_timeout} ->    %% nothing in this snippet sends acc_timeout; kept from the original
            [];
        {Ref, {ok, Speed}} ->
            [Speed | receive_speeds(Ref, Remaining-1, StartTime, Timeout)];
        {Ref, _} ->
            receive_speeds(Ref, Remaining-1, StartTime, Timeout)
    after TimeLeft ->
        []
    end.

os_milliseconds() ->
    {MegaSecs, Secs, MicroSecs} = os:timestamp(),
    (MegaSecs * 1000000 + Secs) * 1000 + MicroSecs div 1000.
Here each call is spawned in a different process and the replies are collected until the 'master timeout' expires or they have all been received.
The code has largely been cut-and-pasted from various works I have lying around, and edited manually and by search-replace to anonymise it and remove surplus, so it's probably mostly of compilable quality, but I don't promise I didn't break anything.
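One small aside (my suggestion, not part of the original answer): on reasonably recent OTP releases you can let the VM do the millisecond arithmetic, and gain immunity to wall-clock adjustments, by swapping the helper for the monotonic clock:
os_milliseconds() ->
    erlang:monotonic_time(millisecond).   %% differences stay valid even if the system clock is reset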
I'm a beginner at Erlang and I've been working through "Learn You Some Erlang For Great Good!". I use a modified version of this example code where the critic has a parameter:
critic(Count) ->
    receive
        {From, {"Rage Against the Turing Machine", "Unit Testify"}} ->
            From ! {self(), {"They are great!", Count}};
        {From, {"System of a Downtime", "Memoize"}} ->
            From ! {self(), {"They're not Johnny Crash but they're good.", Count}};
        {From, {"Johnny Crash", "The Token Ring of Fire"}} ->
            From ! {self(), {"Simply incredible.", Count}};
        {From, {_Band, _Album}} ->
            From ! {self(), {"They are terrible!", Count}}
    end,
    critic(Count).
Which is spawned like this:
restarter() ->
    process_flag(trap_exit, true),
    Pid = spawn_link(?MODULE, critic, [my_atom]),
    register(critic, Pid),
    receive
        {'EXIT', Pid, normal} -> % not a crash
            ok;
        {'EXIT', Pid, shutdown} -> % manual termination, not a crash
            ok;
        {'EXIT', Pid, _} ->
            restarter()
    end.
The module is used like this:
1> c(linkmon).
{ok,linkmon}
2> Monitor = linkmon:start_critic().
<0.163.0>
3> linkmon:judge("Rage Against the Turing Machine", "Unit Testify").
{"They are great!",my_atom}
Now, when I change "my_atom" to a simple number (like 255) the monitor crashes:
1> c(linkmon).
{ok,linkmon}
2> Monitor = linkmon:start_critic().
=ERROR REPORT==== 14-Jul-2013::20:42:20 ===
Error in process <0.173.0> with exit value: {badarg,[{erlang,register,[critic,<0.174.0>],[]},{linkmon,restarter,0,[{file,"linkmon.erl"},{line,16}]}]}
However, it does work when I send [1] (so the code is "spawn(....., [[255]]).").
Why can't I pass a single number? Just skimming over the documentation of spawn/3 doesn't really tell me anything... except maybe that I missed something and a number is not an Erlang term. But then how do I pass a number?
The error message says that the call to register(critic, Pid) on line 16 crashes with badarg even though the arguments look OK. This can happen if the process referred to by Pid is already dead (for example, if it crashes immediately because you passed the wrong number of arguments), or if another process is already registered under that name. Ensure that the length of the argument list in spawn(Mod, Fun, [...]) matches the arity of your critic function, and call whereis(critic) in the shell to check whether an old process is blocking the name from being reused.
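To spell out the arity point with the names from the question (my sketch): spawn_link/3 takes a list of arguments whose length must match the function's arity, so for critic/1 any one-element list works:
spawn_link(?MODULE, critic, [255]),       %% calls critic(255)     -- fine
spawn_link(?MODULE, critic, [my_atom]),   %% calls critic(my_atom) -- fine
spawn_link(?MODULE, critic, [1, 100]).    %% would call critic(1, 100); there is no
                                          %% critic/2, so the child dies with undef
                                          %% and register(critic, Pid) then gets a
                                          %% dead pid => badarg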
Quis custodiet ipsos custodes? -- (Decimus Iunius Iuvenalis)
I have the following setup:
On one node ('one@erlang.enzo') a server process is running which has a watchdog running on another node ('two@erlang.enzo'). When the server starts up, it starts its watchdog on the remote node. When the server exits ungracefully, the watchdog starts the server again. When the watchdog exits, the server restarts it.
The server is started as part of the runlevel after the network is up.
The server also monitors the remote node and starts a watchdog as soon as it (i.e. the node) comes online. Now connection losses between server and watchdog can have two reasons: First, the network may go down; second, the node may crash or be killed.
My code seems to work, but I have the slight suspicion that the following is happening:
When the watchdog node is shut down (or killed or crashed) and is restarted, the server correctly restarts its watchdog.
But when the network fails and the watchdog node keeps running, the server starts a new watchdog when connection is reestablished and leaves one zombie watchdog behind.
My questions are:
(A) Do I create zombies?
(B) In the case of a network loss, how can the server check if the watchdog is still alive (and vice versa)?
(C) If B is possible, how can I reconnect the old server and the old watchdog?
(D) What other major (and minor) flaws do you, distinguished reader, spot in my setup?
EDIT: The die and kill_dog messages are for faking ungraceful exits and won't make it beyond debugging.
Here goes the code:
-module (watchdog).
-compile (export_all).

init () ->
    io:format ("Watchdog: Starting @ ~p.~n", [node () ] ),
    process_flag (trap_exit, true),
    loop ().

loop () ->
    receive
        die -> 1 / 0;
        {'EXIT', _, normal} ->
            io:format ("Watchdog: Server shut down.~n");
        {'EXIT', _, _} ->
            io:format ("Watchdog: Restarting server.~n"),
            spawn ('one@erlang.enzo', server, start, [] );
        _ -> loop ()
    end.
-module (server).
-compile (export_all).

start () ->
    io:format ("Server: Starting up.~n"),
    register (server, spawn (fun init/0) ).

stop () ->
    whereis (server) ! stop.

init () ->
    process_flag (trap_exit, true),
    monitor_node ('two@erlang.enzo', true),
    loop (down, none).

loop (Status, Watchdog) ->
    {NewStatus, NewWatchdog} = receive
        die -> 1 / 0;
        stop -> {stop, none};
        kill_dog ->
            Watchdog ! die,
            {Status, Watchdog};
        {nodedown, 'two@erlang.enzo'} ->
            io:format ("Server: Watchdog node has gone down.~n"),
            {down, Watchdog};
        {'EXIT', Watchdog, noconnection} ->
            {Status, Watchdog};
        {'EXIT', Watchdog, Reason} ->
            io:format ("Server: Watchdog has died of ~p.~n", [Reason] ),
            {Status, spawn_link ('two@erlang.enzo', watchdog, init, [] ) }
        ;
        _ -> {Status, Watchdog}
    after 2000 ->
        case Status of
            down -> checkNode ();
            up -> {up, Watchdog}
        end
    end,
    case NewStatus of
        stop -> ok;
        _ -> loop (NewStatus, NewWatchdog)
    end.

checkNode () ->
    net_adm:world (),
    case lists:any (fun (Node) -> Node =:= 'two@erlang.enzo' end, nodes () ) of
        false ->
            io:format ("Server: Watchdog node is still down.~n"),
            {down, none};
        true ->
            io:format ("Server: Watchdog node has come online.~n"),
            monitor_node ('two@erlang.enzo', true),
            Watchdog = spawn_link ('two@erlang.enzo', watchdog, init, [] ),
            {up, Watchdog}
    end.
Using the global module to register the watchdog should address your concern:
watchdog.erl:
-module (watchdog).
-compile (export_all).

init () ->
    io:format ("Watchdog: Starting @ ~p.~n", [node () ] ),
    process_flag (trap_exit, true),
    global:register_name (watchdog, self ()),
    loop ().

loop () ->
    receive
        die -> 1 / 0;
        {'EXIT', _, normal} ->
            io:format ("Watchdog: Server shut down.~n");
        {'EXIT', _, _} ->
            io:format ("Watchdog: Restarting server.~n"),
            spawn ('one@erlang.enzo', server, start, [] );
        _ -> loop ()
    end.
server.erl:
checkNode () ->
    net_adm:world (),
    case lists:any (fun (Node) -> Node =:= 'two@erlang.enzo' end, nodes () ) of
        false ->
            io:format ("Server: Watchdog node is still down.~n"),
            {down, none};
        true ->
            io:format ("Server: Watchdog node has come online.~n"),
            global:sync (), %% not sure if this is necessary
            case global:whereis_name (watchdog) of
                undefined ->
                    io:format ("Watchdog process is dead"),
                    Watchdog = spawn_link ('two@erlang.enzo', watchdog, init, [] );
                Watchdog ->
                    io:format ("Watchdog process is still alive")
            end,
            {up, Watchdog}
    end.
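One extra detail worth hedging on (my addition, aimed at question C): when global:whereis_name/1 finds the surviving watchdog, the old link is already gone; it was dropped when the connection failed, which is where the {'EXIT', Watchdog, noconnection} message came from. So the server should re-link before reusing it, e.g. with a minimal variant of the second clause:
Watchdog when is_pid (Watchdog) ->
    link (Watchdog), %% re-establish the link that was lost with the connection
    io:format ("Watchdog process is still alive")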