Quis custodiet ipsos custodes? -- (Decimus Iunius Iuvenalis)
I have the following setup:
On one node ('one@erlang.enzo') a server process is running, which has a watchdog running on another node ('two@erlang.enzo'). When the server starts up, it starts its watchdog on the remote node. When the server exits ungracefully, the watchdog starts the server again. When the watchdog exits, the server starts it again.
The server is started as part of the runlevel after the network is up.
The server also monitors the remote node and starts a watchdog as soon as it (i.e. the node) comes online. Now connection losses between server and watchdog can have two reasons: First, the network may go down; second, the node may crash or be killed.
My code seems to work, but I have the slight suspicion that the following is happening:
When the watchdog node is shut down (or killed or crashed) and is restarted, the server correctly restarts its watchdog.
But when the network fails and the watchdog node keeps running, the server starts a new watchdog when connection is reestablished and leaves one zombie watchdog behind.
My questions are:
(A) Do I create zombies?
(B) In the case of a network loss, how can the server check if the watchdog is still alive (and vice versa)?
(C) If B is possible, how can I reconnect the old server and the old watchdog?
(D) What other major (and minor) flaws do you, distinguished reader, spot in my setup?
EDIT: The die and kill_dog messages are for faking ungraceful exits and won't make it beyond debugging.
Here goes the code:
-module (watchdog).
-compile (export_all).
init () ->
    io:format ("Watchdog: Starting @ ~p.~n", [node () ] ),
    process_flag (trap_exit, true),
    loop ().
loop () ->
    receive
        die -> 1 / 0;
        {'EXIT', _, normal} ->
            io:format ("Watchdog: Server shut down.~n");
        {'EXIT', _, _} ->
            io:format ("Watchdog: Restarting server.~n"),
            spawn ('one@erlang.enzo', server, start, [] );
        _ -> loop ()
    end.
-module (server).
-compile (export_all).
start () ->
    io:format ("Server: Starting up.~n"),
    register (server, spawn (fun init/0) ).
stop () ->
    whereis (server) ! stop.
init () ->
    process_flag (trap_exit, true),
    monitor_node ('two@erlang.enzo', true),
    loop (down, none).
loop (Status, Watchdog) ->
    {NewStatus, NewWatchdog} = receive
        die -> 1 / 0;
        stop -> {stop, none};
        kill_dog ->
            Watchdog ! die,
            {Status, Watchdog};
        {nodedown, 'two@erlang.enzo'} ->
            io:format ("Server: Watchdog node has gone down.~n"),
            {down, Watchdog};
        {'EXIT', Watchdog, noconnection} ->
            {Status, Watchdog};
        {'EXIT', Watchdog, Reason} ->
            io:format ("Server: Watchdog has died of ~p.~n", [Reason] ),
            {Status, spawn_link ('two@erlang.enzo', watchdog, init, [] ) };
        _ -> {Status, Watchdog}
    after 2000 ->
        case Status of
            down -> checkNode ();
            up -> {up, Watchdog}
        end
    end,
    case NewStatus of
        stop -> ok;
        _ -> loop (NewStatus, NewWatchdog)
    end.
checkNode () ->
    net_adm:world (),
    case lists:any (fun (Node) -> Node =:= 'two@erlang.enzo' end, nodes () ) of
        false ->
            io:format ("Server: Watchdog node is still down.~n"),
            {down, none};
        true ->
            io:format ("Server: Watchdog node has come online.~n"),
            monitor_node ('two@erlang.enzo', true),
            Watchdog = spawn_link ('two@erlang.enzo', watchdog, init, [] ),
            {up, Watchdog}
    end.
Registering the watchdog with the global module should address your concern:
watchdog.erl:
-module (watchdog).
-compile (export_all).
init () ->
    io:format ("Watchdog: Starting @ ~p.~n", [node () ] ),
    process_flag (trap_exit, true),
    global:register_name (watchdog, self ()),
    loop ().
loop () ->
    receive
        die -> 1 / 0;
        {'EXIT', _, normal} ->
            io:format ("Watchdog: Server shut down.~n");
        {'EXIT', _, _} ->
            io:format ("Watchdog: Restarting server.~n"),
            spawn ('one@erlang.enzo', server, start, [] );
        _ -> loop ()
    end.
server.erl:
checkNode () ->
    net_adm:world (),
    case lists:any (fun (Node) -> Node =:= 'two@erlang.enzo' end, nodes () ) of
        false ->
            io:format ("Server: Watchdog node is still down.~n"),
            {down, none};
        true ->
            io:format ("Server: Watchdog node has come online.~n"),
            global:sync (), %% not sure if this is necessary
            case global:whereis_name (watchdog) of
                undefined ->
                    io:format ("Watchdog process is dead~n"),
                    Watchdog = spawn_link ('two@erlang.enzo', watchdog, init, [] );
                Watchdog ->
                    io:format ("Watchdog process is still alive~n")
            end,
            {up, Watchdog}
    end.
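The lookup this answer relies on can be tried on a single node. The following is only a sketch: the module name global_lookup_demo and the registered name demo_watchdog are made up for illustration, and the short polling loop is there because global removes the name of a dead process asynchronously.

```erlang
-module(global_lookup_demo).
-export([run/0]).

%% Register a process globally, look it up, then check that the name
%% disappears once the process has exited.
run() ->
    Pid = spawn(fun() -> receive stop -> ok end end),
    yes = global:register_name(demo_watchdog, Pid),
    Found = global:whereis_name(demo_watchdog),
    MRef = erlang:monitor(process, Pid),
    Pid ! stop,
    receive {'DOWN', MRef, process, Pid, _} -> ok end,
    %% global cleans up names of dead processes asynchronously, so poll briefly.
    Gone = wait_until_unregistered(demo_watchdog, 50),
    {Found =:= Pid, Gone}.

wait_until_unregistered(_Name, 0) ->
    false;
wait_until_unregistered(Name, N) ->
    case global:whereis_name(Name) of
        undefined -> true;
        _Pid -> timer:sleep(20), wait_until_unregistered(Name, N - 1)
    end.
```

If the lookup returns a pid, the server could presumably call link/1 on it to re-establish the link that was broken by the connection loss, instead of spawning a second watchdog.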
Here is an example trace where I'm able to call erlang:monitor/2 on the same Pid:
1> Loop = fun F() -> F() end.
#Fun<erl_eval.30.99386804>
2> Pid = spawn(Loop).
<0.71.0>
3> erlang:monitor(process, Pid).
#Ref<0.2485499597.1470627842.126937>
4> erlang:monitor(process, Pid).
#Ref<0.2485499597.1470627842.126942>
5> erlang:monitor(process, Pid).
#Ref<0.2485499597.1470627842.126947>
The expressions returned by instruction #4 and #5 are different than #3, meaning that it is possible to create multiple monitor references between the current process and Pid. Is there a practical case where you would need or use multiple monitor references to the same process?
I would expect this to return the same reference (returning a new one would perhaps imply that the old one had failed/crashed), following the same logic that exists for link/1.
Imagine you use a third-party library which does this (basically what the OTP *:call/* functions do):
call(Pid, Request) ->
call(Pid, Request, ?DEFAULT_TIMEOUT).
call(Pid, Request, Timeout) ->
MRef = erlang:monitor(process, Pid),
Pid ! {call, self(), MRef, Request},
receive
{answer, MRef, Result} ->
erlang:demonitor(MRef, [flush]),
{ok, Result};
{'DOWN', MRef, _, _, Info} ->
{error, Info}
after Timeout ->
erlang:demonitor(MRef, [flush]),
{error, timeout}
end.
and then you use it in your code where you would monitor the same process Pid and then call function call/2,3.
my_fun1(Service) ->
MRef = erlang:monitor(process, Service),
ok = check_if_service_runs(MRef),
my_fun2(Service),
mind_my_stuf(),
ok = check_if_service_runs(MRef),
erlang:demonitor(MRef, [flush]),
return_some_result().
check_if_service_runs(MRef) ->
receive
{'DOWN', MRef, _, _, Info} -> {down, Info}
after 0 -> ok
end.
my_fun2(S) -> my_fun3(S).
% and a many layers of other stuff and modules
my_fun3(S) -> call(S, hello).
What a nasty surprise it would be if erlang:monitor/2,3 would always return the same reference and if erlang:demonitor/1,2 would remove your previous monitor. It would be a source of ugly and unsolvable bugs. You should start to think that there are libraries, other processes, your code is part of a huge system and Erlang was made by experienced people who thought it through. Maintainability is key here.
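A small experiment makes the independence of the references visible. This is a sketch with a made-up module name (monitor_demo): the Inner monitor stands in for the one a library call would take and release, while the Outer monitor belongs to the caller and must keep working.

```erlang
-module(monitor_demo).
-export([run/0]).

run() ->
    Pid = spawn(fun() -> receive stop -> ok end end),
    Outer = erlang:monitor(process, Pid),   % the caller's own monitor
    Inner = erlang:monitor(process, Pid),   % e.g. taken inside a library call
    true = Outer =/= Inner,
    %% The "library" cleans up only its own reference.
    true = erlang:demonitor(Inner, [flush]),
    exit(Pid, kill),
    Result = receive
                 {'DOWN', Outer, process, Pid, Reason} -> {outer_down, Reason}
             after 1000 ->
                 no_down
             end,
    %% No 'DOWN' message should arrive for the released reference.
    Stray = receive {'DOWN', Inner, _, _, _} -> true after 0 -> false end,
    {Result, Stray}.
```

Calling monitor_demo:run() should show that the outer monitor still delivers its 'DOWN' message after the inner one has been removed.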
I'm experimenting with distributed Erlang. Here's what I have:
loop()->
receive {From, ping} ->
io:format("received ping from ~p~n", [From]),
From ! pong,
loop();
{From, Fun} when is_function(Fun) ->
io:format("executing function ~p received from ~p~n", [Fun, From]),
From ! Fun(),
loop()
end.
test_remote_node_can_execute_sent_clojure()->
Pid = spawn(trecias, fun([])-> loop() end),
Pid ! {self(), fun()-> erlang:nodes() end},
receive Result ->
Result = [node()]
after 300 ->
timeout
end.
I get: Can not start erlang:apply,[#Fun<tests.1.123107452>,[]] on trecias
The node I execute the test on runs on the same machine as the node 'trecias'. Both nodes can load the same code.
Any ideas what is amiss?
In the spawn call, you've specified the node name as trecias, but you need to specify the full node name including the hostname, e.g. trecias@localhost.
Also, the function you pass to spawn/2 must take zero arguments, but the one in the code above takes one argument (and crashes if that argument isn't the empty list). Write it as fun() -> loop() end instead.
When spawning an anonymous function on a remote node, you also need to make sure that the module is loaded on both nodes, with the same version. Otherwise you'll get a badfun error.
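One way to sidestep part of the problem is to start the remote process with spawn/4 and a module, function and argument list, so that only the closure sent later in a message depends on matching module versions. The sketch below uses a made-up module name (spawn_demo) and tags the reply as {result, Value}, which differs from the original code; passing node() as the argument lets it run on a single node.

```erlang
-module(spawn_demo).
-export([run/1, loop/0]).

loop() ->
    receive
        {From, ping} ->
            From ! pong,
            loop();
        {From, Fun} when is_function(Fun, 0) ->
            From ! {result, Fun()},
            loop();
        stop ->
            ok
    end.

%% Node must be a full node name such as 'trecias@localhost';
%% node() works for a local trial run.
run(Node) ->
    Pid = spawn(Node, ?MODULE, loop, []),
    Pid ! {self(), ping},
    Pong = receive pong -> pong after 1000 -> timeout end,
    Pid ! {self(), fun() -> node() end},
    Where = receive {result, N} -> N after 1000 -> timeout end,
    Pid ! stop,
    {Pong, Where}.
```

The closure still needs the same version of the defining module on the remote node, so the last remark above applies to it as well.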
I'm trying to use OTP style in a project and have a question about OTP interface design. Which solution is more popular or elegant?
What I have:
a web server built with mochiweb
one process that spawns many (1000-2000) children.
The children hold state (netflow speed). The process proxies messages to the children and creates new children when needed.
In mochiweb I have one page showing the speed of all actors. Here is how it is produced:
nf_collector ! {get_abonents_speed, self()},
receive
{abonents_speed_count, AbonentsCount} ->
ok
end,
%% write http header, chunked
%% and while AbonentsCount != 0, receive speed and write http
As I understand it, this is not OTP style. Possible solutions:
A synchronous API function collects all the speeds and returns them as a list. But I want to write each speed to the client as soon as it arrives.
One argument of the API function is a callback:
nf_collector:get_all_speeds(fun (Speed) -> Resp:write_chunk(templater(Speed)) end)
Return an iterator:
One of the results of get_all_speeds will be a function containing a receive block. Each call to it returns {ok, Speed}; at the end it returns {end}.
get_all_speeds() ->
nf_collector ! {get_abonents_speed, self()},
receive
{abonents_speed_count, AbonentsCount} ->
ok
end,
{ok, fun() ->
create_receive_fun(AbonentsCount)
end}.
create_receive_fun(0)->
{end};
create_receive_fun(Count)->
receive
{abonent_speed, Speed} ->
Speed
end,
{ok, Speed, create_receive_fun(Count-1)}.
Spawn your 'children' from a supervisor:
-module(ch_sup).
-behaviour(supervisor).
-export([start_link/0, init/1, start_child/1]).
start_link() -> supervisor:start_link({local, ?MODULE}, ?MODULE, []).
init([]) -> {ok, {{simple_one_for_one, 5, 10}, [{ch, {ch, start_link, []}, transient, 1000, worker, [ch]}]}}. %% 5 restarts per 10 s is an assumed limit; tune as needed
start_child(Data) -> supervisor:start_child(?MODULE, [Data]).
Start them with ch_sup:start_child/1 (Data is whatever).
Implement your children as a gen_server:
-module(ch).
-behaviour(gen_server).
-record(?MODULE, {speed}).
...
get_speed(Pid, Timeout) ->
try
gen_server:call(Pid, get, Timeout)
catch
exit:{timeout, _} -> timeout;
exit:{noproc, _} -> died
end
.
...
handle_call(get, _From, St) -> {reply, {ok, St#?MODULE.speed}, St}.
You can now use the supervisor to get the list of running children and query them, though you have to accept the possibility of a child dying between getting the list of children and calling them, and obviously a child could for some reason be alive but not respond, or respond with an error, etc.
The get_speed/2 function above returns either {ok, Speed}, died or timeout. It remains for you to filter appropriately according to your application's needs, which is easy with a list comprehension. Here are a few.
Just the speeds:
[Speed || {ok, Speed} <- [ch:get_speed(Pid, 1000) || Pid <-
[Pid || {undefined, Pid, worker, [ch]} <-
supervisor:which_children(ch_sup)
]
]].
Pid and speed tuples:
[{Pid, Speed} || {Pid, {ok, Speed}} <-
[{Pid, ch:get_speed(Pid, 1000)} || Pid <-
[Pid || {undefined, Pid, worker, [ch]} <-
supervisor:which_children(ch_sup)]
]
].
All results, including timeouts and 'died' results for children that died before you got to them:
[{Pid, Any} || {Pid, Any} <-
[{Pid, ch:get_speed(Pid, 1000)} || Pid <-
[Pid || {undefined, Pid, worker, [ch]} <-
supervisor:which_children(ch_sup)]
]
].
In most situations you almost certainly don't want anything other than the speeds, because what are you going to do about deaths and timeouts? You want those that die to be respawned by the supervisor, so the problem is more or less fixed by the time you know about it, and timeouts, as with any fault, are a separate problem, to be dealt with in whatever way you see fit... There's no need to mix the fault fixing logic with the data retrieval logic though.
Now, the problem with all of these, which I think you were getting at in your post (though I'm not quite sure), is that the timeout of 1000 applies to each call, and the calls are made synchronously one after the other. So for 1000 children with a 1 second timeout, it could take 1000 seconds to produce no results. Making the timeout 1 ms might be the answer, but doing it properly is a bit more complicated:
get_speeds() ->
ReceiverPid = self(),
Ref = make_ref(),
Pids = [Pid || {undefined, Pid, worker, [ch]} <-
supervisor:which_children(ch_sup)],
lists:foreach(
fun(Pid) -> spawn(
fun() -> ReceiverPid ! {Ref, ch:get_speed(Pid, 1000)} end
) end,
Pids),
receive_speeds(Ref, length(Pids), os_milliseconds(), 1000)
.
receive_speeds(_Ref, 0, _StartTime, _Timeout) ->
[];
receive_speeds(Ref, Remaining, StartTime, Timeout) ->
Time = os_milliseconds(),
TimeLeft = max(0, Timeout - Time + StartTime), %% 'after' rejects negative timeouts
receive
{Ref, acc_timeout} ->
[];
{Ref, {ok, Speed}} ->
[Speed | receive_speeds(Ref, Remaining-1, StartTime, Timeout)];
{Ref, _} ->
receive_speeds(Ref, Remaining-1, StartTime, Timeout)
after TimeLeft ->
[]
end
.
os_milliseconds() ->
%% os:timestamp() returns {MegaSecs, Secs, MicroSecs}
{OsMegSecs, OsSecs, OsMicroSecs} = os:timestamp(),
(OsMegSecs*1000000 + OsSecs)*1000 + OsMicroSecs div 1000
.
Here each call is spawned in a different process and the replies collected, until the 'master timeout' or they have all been received.
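The same idea can be written against a monotonic clock, which avoids converting timestamps by hand. This is a sketch rather than a drop-in replacement: the module gather_demo and its gather/2 function are hypothetical, the zero-arity funs stand in for the ch:get_speed/2 calls, and erlang:monotonic_time(millisecond) assumes a reasonably recent OTP release.

```erlang
-module(gather_demo).
-export([gather/2]).

%% Funs is a list of zero-arity funs, each returning {ok, Speed} or
%% anything else (for example timeout or died).
gather(Funs, TimeoutMs) ->
    Parent = self(),
    Ref = make_ref(),
    [spawn(fun() -> Parent ! {Ref, F()} end) || F <- Funs],
    Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
    collect(Ref, length(Funs), Deadline).

collect(_Ref, 0, _Deadline) ->
    [];
collect(Ref, Remaining, Deadline) ->
    %% One shared deadline for all replies; never pass a negative timeout.
    TimeLeft = max(0, Deadline - erlang:monotonic_time(millisecond)),
    receive
        {Ref, {ok, Speed}} ->
            [Speed | collect(Ref, Remaining - 1, Deadline)];
        {Ref, _Other} ->
            collect(Ref, Remaining - 1, Deadline)
    after TimeLeft ->
        []
    end.
```

Replies that arrive after the deadline stay in the caller's mailbox. Because they carry a unique reference they cannot be confused with a later call, but a long-running caller may want to flush them.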
Code has largely been cut-n-pasted from various works I have lying round, and edited manually and by search replace, to anonymise it and remove surplus, so it's probably mostly compilable quality, but I don't promise I didn't break anything.
There is a locked-door example for gen_fsm in the Erlang OTP System Documentation. I have a question about its timeout. I will copy the code here first:
-module(code_lock).
-behaviour(gen_fsm).
-export([start_link/1]).
-export([button/1]).
-export([init/1, locked/2, open/2]).
start_link(Code) ->
gen_fsm:start_link({local, code_lock}, code_lock, lists:reverse(Code), []).
button(Digit) ->
gen_fsm:send_event(code_lock, {button, Digit}).
init(Code) ->
{ok, locked, {[], Code}}.
locked({button, Digit}, {SoFar, Code}) ->
case [Digit|SoFar] of
Code ->
do_unlock(),
{next_state, open, {[], Code}, 30000};
Incomplete when length(Incomplete)<length(Code) ->
{next_state, locked, {Incomplete, Code}};
_Wrong ->
{next_state, locked, {[], Code}}
end.
open(timeout, State) ->
do_lock(),
{next_state, locked, State}.
Here is the question: when the door is open and I press a button, the gen_fsm receives a {button, Digit} event in the open state, and an error occurs. But if I add this clause after the open function:
open(_Event, State) ->
{next_state, open, State}.
then if I press a button within the 30 s, the timeout never occurs and the door stays open forever. What should I do?
Thanks.
Update:
I know I could use send_event_after or something like that, but I don't think it is a good idea, because in a complex application the state you expected to handle the message may have changed by then.
For example, suppose I have a function to lock the door manually within the 30 s after it was opened. Then locked will handle the timeout message, which is not the expected behaviour.
You could maintain the remaining timeout in StateData. To do this, add a third item to the tuple:
init(Code) ->
{ok, locked, {[], Code, infinity}}.
You'll need to change locked to set the initial value:
locked({button, Digit}, {SoFar, Code, _Until}) ->
case [Digit|SoFar] of
Code ->
do_unlock(),
Timeout = 30000,
Now = to_milliseconds(os:timestamp()),
Until = Now + Timeout,
{next_state, open, {[], Code, Until}, Timeout};
Incomplete when length(Incomplete)<length(Code) ->
{next_state, locked, {Incomplete, Code, infinity}};
_Wrong ->
{next_state, locked, {[], Code, infinity}}
end.
And, if a button is pressed while open, calculate the new timeout and go around again:
open({button, _Digit}, {_SoFar, _Code, Until} = State) ->
Now = to_milliseconds(os:timestamp()),
Timeout = max(0, Until - Now), %% guard against a negative timeout if the deadline has just passed
{next_state, open, State, Timeout};
You'll also need the following helper function:
to_milliseconds({Me, S, Mu}) ->
(Me * 1000 * 1000 * 1000) + (S * 1000) + (Mu div 1000).
You should specify a timeout in the new clause "open(_Event, State)".
Because the next state is entered without a timeout, no timeout ever fires and the door remains open forever.
The newly defined clause should be
open(_Event, State) ->
{next_state, open, State, 30000}. %% State should be re-initialized
Using the fsm timeout, it is not possible - as far as I know - to avoid the re-initialization of it:
If you don't specify a new timeout when you skip the event while the door is open, it will remain open forever, as you notice.
If you specify one, it will restart from the beginning.
If none of these solutions satisfy you, you can use an external process to create the timeout:
-module(code_lock).
-behaviour(gen_fsm).
-export([start_link/1]).
-export([button/1,stop/0]).
-export([init/1, locked/2, open/2,handle_event/3,terminate/3]).
start_link(Code) ->
gen_fsm:start_link({local, code_lock}, code_lock, lists:reverse(Code), []).
button(Digit) ->
gen_fsm:send_event(code_lock, {button, Digit}).
stop() ->
gen_fsm:send_all_state_event(code_lock, stop).
init(Code) ->
{ok, locked, {[], Code}}.
locked({button, Digit}, {SoFar, Code}) ->
case [Digit|SoFar] of
Code ->
do_unlock(),
timeout(10000,code_lock),
{next_state, open, {[], Code}};
Incomplete when length(Incomplete)<length(Code) ->
{next_state, locked, {Incomplete, Code}};
_Wrong ->
{next_state, locked, {[], Code}}
end.
open(timeout, State) ->
do_lock(),
{next_state, locked, State};
open(_, State) ->
{next_state, open, State}.
handle_event(stop, _StateName, StateData) ->
{stop, normal, StateData}.
terminate(normal, _StateName, _StateData) ->
ok.
do_lock() -> io:format("locking the door~n").
do_unlock() -> io:format("unlocking the door~n").
timeout(X,M) ->
spawn(fun () -> receive
after X -> gen_fsm:send_event(M,timeout)
end
end).
There are a number of functions in the timer module that do this and are preferable to my custom example.
Maybe a better use of the FSM timeout would be in the locked state:
wait for the first digit without a timeout
a digit is entered and the code is complete -> test it and continue without a timeout (locked or open depending on the code entered)
a digit is entered and the code is not complete -> store it and continue with a timeout
if an unexpected event occurs -> restart from the beginning without a timeout
if the timeout fires -> restart from the beginning without a timeout
EDIT:
To Bin Wang: what you say in your update is correct, but you cannot avoid handling this situation. I don't know of any built-in function that covers your use case. To satisfy it you will need to handle the unexpected timeout message in the locked state, and to avoid having multiple timeouts running you will also need to stop the current one before going to the locked state. Note that stopping the timer does not spare you from handling the timeout message in the locked state, because there is a race between the message that stops the timer and the timeout itself. For one of my applications I wrote a general-purpose apply_after function that can be cancelled, suspended and resumed:
applyAfter_link(T, F, A) ->
V3 = time_ms(),
spawn_link(fun () -> applyAfterp(T, F, A, V3) end).
applyAfterp(T, F, A, Time) ->
receive
cancel -> ok;
suspend when T =/= infinity ->
applyAfterp(infinity, F, A, T + Time - time_ms());
suspend ->
applyAfterp(T, F, A, Time);
resume when T == infinity ->
applyAfterp(Time, F, A, time_ms());
resume ->
Tms = time_ms(), applyAfterp(T + Time - Tms, F, A, Tms)
after T ->
%% io:format("apply after time ~p, function ~p, arg ~p , stored time ~p~n",[T,F,A,Time]),
catch F(A)
end.
time_us() ->
{M, S, U} = erlang:now(),
1000000 * (1000000 * M + S) + U.
time_ms() -> time_us() div 1000.
You will need to store the Pid of the timeout process in the FSM state.
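The race described above can also be handled by tagging each timeout with a reference and ignoring any timeout whose reference is no longer current. The sketch below is not gen_fsm code: it uses a plain process and the BIFs erlang:start_timer/3 and erlang:cancel_timer/1, the module name timer_ref_demo is made up, and the stale message is injected by hand to simulate a timeout that slipped through before the cancel.

```erlang
-module(timer_ref_demo).
-export([run/0]).

run() ->
    %% Door opens: start a timer and remember its reference.
    Old = erlang:start_timer(5000, self(), lock),
    %% Door is locked manually and opened again: cancel the old timer
    %% and start a fresh one.
    _ = erlang:cancel_timer(Old),
    Current = erlang:start_timer(100, self(), lock),
    %% Simulate a timeout from the old timer that was already in flight.
    self() ! {timeout, Old, lock},
    wait(Current, []).

wait(Current, Log) ->
    receive
        {timeout, Current, lock} ->
            %% Only the current timer is allowed to lock the door.
            lists:reverse([locked | Log]);
        {timeout, _StaleRef, lock} ->
            wait(Current, [ignored_stale | Log])
    end.
```

In an FSM the current reference would live in the state data next to the code, so that every state can tell whether a timeout message still applies.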
The following code is also from RabbitMQ's supervisor2.erl. Its purpose is to terminate the supervisor's children. For every child it does the following:
monitor the child
send a trappable exit signal
start a timer
if the timer expires, send an untrappable exit signal (kill).
My question is about the EXIT and DOWN signals.
If the child doesn't trap the exit signal, the supervisor will receive two signals: first the EXIT signal and then the DOWN signal. Is that right? Is this order strictly guaranteed?
If the child traps the exit signal, the supervisor will receive only one signal, the DOWN signal. Is that right?
terminate_simple_children(Child, Dynamics, SupName) ->
Pids = dict:fold(fun (Pid, _Args, Pids) ->
erlang:monitor(process, Pid),
unlink(Pid),
exit(Pid, child_exit_reason(Child)),
[Pid | Pids]
end, [], Dynamics),
TimeoutMsg = {timeout, make_ref()},
TRef = timeout_start(Child, TimeoutMsg),
{Replies, Timedout} =
lists:foldl(
fun (_Pid, {Replies, Timedout}) ->
{Reply, Timedout1} =
receive
TimeoutMsg ->
Remaining = Pids -- [P || {P, _} <- Replies],
[exit(P, kill) || P <- Remaining],
receive {'DOWN', _MRef, process, Pid, Reason} ->
{{error, Reason}, true}
end;
{'DOWN', _MRef, process, Pid, Reason} ->
{child_res(Child, Reason, Timedout), Timedout};
{'EXIT', Pid, Reason} -> %%<==== strict signal, first EXIT, then DOWN.
receive {'DOWN', _MRef, process, Pid, _} ->
{{error, Reason}, Timedout}
end
end,
{[{Pid, Reply} | Replies], Timedout1}
end, {[], false}, Pids),
timeout_stop(Child, TRef, TimeoutMsg, Timedout),
ReportError = shutdown_error_reporter(SupName),
[case Reply of
{_Pid, ok} -> ok;
{Pid, {error, R}} -> ReportError(R, Child#child{pid = Pid})
end || Reply <- Replies],
ok.
There are two things you are confusing here:
First, the child is the one trapping exits, but you are looking at the supervisor code. What the child does with exit signals does not directly affect the supervisor.
Second, the kill exit signal cannot be trapped. It always kills the child.
The supervisor2 has a monitor on the child. This means it is guaranteed to get a 'DOWN' message and this code is concerned about getting that kind of message. If supervisor2 is also trapping exits, it will get the 'EXIT' message in addition.
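This can be checked with a small experiment. The sketch below uses a made-up module name (exit_down_demo) and deliberately does not rely on the order in which the two messages arrive. The second function mirrors the unlink/1 call in the supervisor2 code, after which only the 'DOWN' message is expected.

```erlang
-module(exit_down_demo).
-export([run_linked/0, run_unlinked/0]).

%% Linked, trapping exits and monitoring: both messages arrive.
run_linked() ->
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> receive after infinity -> ok end end),
    MRef = erlang:monitor(process, Pid),
    exit(Pid, kill),
    Exit = receive {'EXIT', Pid, R1} -> R1 after 1000 -> none end,
    Down = receive {'DOWN', MRef, process, Pid, R2} -> R2 after 1000 -> none end,
    process_flag(trap_exit, Old),
    {Exit, Down}.

%% Unlinked before the kill, as supervisor2 does: only 'DOWN' arrives.
run_unlinked() ->
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(fun() -> receive after infinity -> ok end end),
    MRef = erlang:monitor(process, Pid),
    unlink(Pid),
    exit(Pid, kill),
    Down = receive {'DOWN', MRef, process, Pid, R} -> R after 1000 -> none end,
    Exit = receive {'EXIT', Pid, R1} -> R1 after 100 -> none end,
    process_flag(trap_exit, Old),
    {Exit, Down}.
```

The 'EXIT' clause in the supervisor2 code is then presumably there for a child that had already exited before unlink/1 was called, in which case its 'EXIT' message may already be sitting in the mailbox.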