erlang node not responding - erlang

I received this message in the Erlang console on the first@localhost node:
=ERROR REPORT==== 1-Jan-2011::23:19:28 ===
** Node 'second@localhost' not responding **
** Removing (timedout) connection **
My question is: what is the timeout in this case? How much time passes before this event is triggered?
How do I prevent this "horror"? I can only restore normal operation by restarting the node...
But what is the right way?
Thank you, and Happy New Year!

Grepping for the not responding string in the Erlang source code, you can see how the message is generated in the dist_util module in the kernel application (con_loop function).
{error, not_responding} ->
    error_msg("** Node ~p not responding **~n"
              "** Removing (timedout) connection **~n",
              [Node]),
Within the module, the following documentation is present, explaining the logic behind ticks and not responding nodes:
%%
%% Send a TICK to the other side.
%%
%% This will happen every 15 seconds (by default)
%% The idea here is that every 15 secs, we write a little
%% something on the connection if we haven't written anything for
%% the last 15 secs.
%% This will ensure that nodes that are not responding due to
%% hardware errors (Or being suspended by means of ^Z) will
%% be considered to be down. If we do not want to have this
%% we must start the net_kernel (in erlang) without its
%% ticker process, In that case this code will never run
%% And then every 60 seconds we also check the connection and
%% close it if we havn't received anything on it for the
%% last 60 secs. If ticked == tick we havn't received anything
%% on the connection the last 60 secs.
%% The detection time interval is thus, by default, 45s < DT < 75s
%% A HIDDEN node is always (if not a pending write) ticked if
%% we haven't read anything as a hidden node only ticks when it receives
%% a TICK !!
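The arithmetic in that comment can be sketched as follows. This is a minimal illustration, not code from the kernel application: the module name ticktime_window and its functions are hypothetical. It assumes the documented behaviour that ticks go out every net_ticktime / 4 seconds, with net_ticktime defaulting to 60.

```erlang
-module(ticktime_window).
-export([window/1, current/0]).

%% Hypothetical helper: detection window (in seconds) for a given
%% net_ticktime. Ticks are sent every NetTicktime / 4 seconds, so a
%% silent peer is noticed somewhere between NetTicktime - Tick and
%% NetTicktime + Tick.
window(NetTicktime) ->
    Tick = NetTicktime / 4,
    {NetTicktime - Tick, NetTicktime + Tick}.

%% Ask the running node for its current setting. On a node that was
%% not started as a distributed node this may return 'ignored'.
current() ->
    net_kernel:get_net_ticktime().
```

With the default of 60, window(60) gives the 45 s to 75 s range quoted above. The value can be changed at startup with the kernel parameter net_ticktime (for example erl -kernel net_ticktime 120), and it should be the same on all connected nodes.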
Hope this helps a bit.

Related

Keeping a process alive just to link other processes

In Programming Erlang by Joe Armstrong, Chapter 12, "Making a Set of Processes That All Die Together", the following code is given:
% (Some variables are renamed and comments added for extra clarity)
start(WorkerFuns) ->
    spawn(fun() ->
        % Parent process
        [spawn_link(WorkerFun) || WorkerFun <- WorkerFuns],
        receive
        after infinity -> true
        end
    end).
The resulting processes are linked as such:
        +- parent -+
       /     |      \
      /      |       \
worker1   worker2 .. workerN
If a worker crashes, then the parent crashes, and then the remaining workers crash as well. However, if all of the workers exit normally, then the parent process lives forever, albeit in a suspended state.
While Erlang processes are supposed to be cheap, if start/1 is called many times in a long-running service, one process—the parent—appears to be "leaked" every time all workers exit normally.
Is this ever a problem in practice? And is the extra code to properly account for when all workers exit normally (see below), worth it?
start(WorkerFuns) ->
    spawn(fun() ->
        % Parent process
        process_flag(trap_exit, true),
        [spawn_link(WorkerFun) || WorkerFun <- WorkerFuns],
        parent_loop(length(WorkerFuns))
    end).
parent_loop(0) ->
    % All workers exited normally
    true;
parent_loop(RemainingWorkers) ->
    receive
        {'EXIT', _WorkerPid, normal} ->
            parent_loop(RemainingWorkers - 1);
        {'EXIT', _WorkerPid, CrashReason} ->
            exit(CrashReason)
    end.
Your analysis is correct. The code as given does not account for normal termination of the workers and will leave a dangling process. The space leak will be about 2 kb per invocation, so in a large system you're not likely to notice it unless you call start/1 a thousand times or more, but for a system expected to run "forever" you should definitely add the extra code.
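To get a feel for the per-process cost on your own system, you could measure an idle parent directly. The sketch below is only an illustration: the module and function names are made up, and the number it returns depends on the Erlang/OTP version and word size.

```erlang
-module(leak_probe).
-export([idle_parent_bytes/0]).

%% Hypothetical probe: spawn a process that waits forever, like the
%% parent in the book's example, and report its memory footprint.
idle_parent_bytes() ->
    Pid = spawn(fun() -> receive after infinity -> true end end),
    {memory, Bytes} = erlang:process_info(Pid, memory),
    exit(Pid, kill),   % clean up so the probe itself does not leak
    Bytes.
```

Multiplying the result by the expected number of start/1 calls gives a rough estimate of how much memory the dangling parents would hold.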

Are erlang:send_after/3 and timer:send_after/3 intended to behave differently?

I wanted to send a message to a process after a delay, and discovered erlang:send_after/4.
When looking at the docs it looked like this is exactly what I wanted:
erlang:send_after(Time, Dest, Msg, Options) -> TimerRef
Starts a timer. When the timer expires, the message Msg is sent to the
process identified by Dest.
However, it doesn't seem to work when the destination is running on another node: it tells me one of the arguments is bad.
1> P = spawn('node@host', module, function, [Arg]).
<10585.83.0>
2> erlang:send_after(1000, P, {123}).
** exception error: bad argument
in function erlang:send_after/3
called as erlang:send_after(1000,<10585.83.0>,{123})
Doing the same thing with timer:send_after/3 appears to work fine:
1> P = spawn('node@host', module, function, [Arg]).
<10101.10.0>
2> timer:send_after(1000, P, {123}).
{ok,{-576458842589535,#Ref<0.1843049418.1937244161.31646>}}
And, the docs for timer:send_after/3 state almost the same thing as the erlang version:
send_after(Time, Pid, Message) -> {ok, TRef} | {error, Reason}
Evaluates Pid ! Message after Time milliseconds.
So the question is, why do these two functions, which on the face of it do the same thing, behave differently? Is erlang:send_after broken, or mis-advertised? Or maybe timer:send_after isn't doing what I think it is?
TL;DR
Your assumption is correct: these are intended to do the same thing, but are implemented differently.
Discussion
Things in the timer module such as timer:send_after/2,3 work through the gen_server that defines that as a service. Like any other service, this one can get overloaded if you assign a really huge number of tasks (timers to track) to it.
erlang:send_after/3,4, on the other hand, is a BIF implemented directly within the runtime and therefore has access to system primitives like the hardware timer. If you have a ton of timers this is definitely the way to go. In most programs you won't notice the difference, though.
There is actually a note about this in the Erlang Efficiency Guide:
3.1 Timer Module
Creating timers using erlang:send_after/3 and erlang:start_timer/3 , is much more efficient than using the timers provided by the timer module in STDLIB. The timer module uses a separate process to manage the timers. That process can easily become overloaded if many processes create and cancel timers frequently (especially when using the SMP emulator).
The functions in the timer module that do not manage timers (such as timer:tc/3 or timer:sleep/1), do not call the timer-server process and are therefore harmless.
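For reference, here is a minimal sketch of the BIF used against a local process, including cancellation. The module name and the {tick, Ref} message shape are made up for the example; erlang:send_after/3 and erlang:cancel_timer/1 are the real BIFs.

```erlang
-module(send_after_sketch).
-export([fire/0, cancel/0]).

%% Start a 50 ms timer addressed to the calling process and wait for it.
fire() ->
    Ref = make_ref(),
    _Timer = erlang:send_after(50, self(), {tick, Ref}),
    receive
        {tick, Ref} -> fired
    after 1000 -> timeout
    end.

%% Start a long timer, cancel it right away, and confirm that the
%% message never arrives. cancel_timer/1 returns the remaining time
%% in milliseconds when the timer was still pending.
cancel() ->
    Ref = make_ref(),
    Timer = erlang:send_after(5000, self(), {tick, Ref}),
    Remaining = erlang:cancel_timer(Timer),
    receive
        {tick, Ref} -> {fired, Remaining}
    after 100 -> {cancelled, Remaining}
    end.
```

Both functions only ever target self(), so they stay within the same-node case that the BIF handles.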
A workaround
A workaround to gain the efficiency of the BIF without the same-node restriction is to have a process of your own that does nothing but wait for a message to forward to another node:
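-module(foo_forward).
-export([start/0, send_after/4, cancel/1]).
% Obviously this is an example only. You would want to write this to
% be compliant with proc_lib, write a proper init/N and integrate with
% OTP. Note that this snippet is missing the OTP service functions.
start() ->
    Parent = self(),
    spawn(fun() -> loop(Parent, [], none) end).
% Start the timer on the local node, addressed to the forwarder process,
% which relays Message to Dest (possibly on another node) when it fires.
send_after(Forwarder, Time, Dest, Message) ->
    erlang:send_after(Time, Forwarder, {forward, Dest, Message}).
cancel(TimerRef) ->
    erlang:cancel_timer(TimerRef).
loop(Parent, Debug, State) ->
    receive
        {forward, Dest, Message} ->
            Dest ! Message,
            loop(Parent, Debug, State);
        {system, From, Request} ->
            % Needs system_continue/3 and friends, which are among the
            % missing OTP service functions mentioned above.
            sys:handle_system_msg(Request, From, Parent, ?MODULE, Debug, State);
        Unexpected ->
            error_logger:warning_msg("Received message: ~tp~n", [Unexpected]),
            loop(Parent, Debug, State)
    end.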
The above example is a bit shallow, but hopefully it expresses the point. It should be possible to get the efficiency of the BIF erlang:send_after/3,4 while still sending messages across nodes and keeping the freedom to cancel a message using erlang:cancel_timer/1.
But why?
The puzzle (and bug) is why erlang:send_after/3,4 does not want to work across nodes. The example you provided above looks a bit odd as the first assignment to P was the Pid <10101.10.0>, but the crashed call was reported as <10585.83.0> -- clearly not the same.
For the moment I do not know why erlang:send_after/3,4 doesn't work, but I can say with confidence that the mechanism of operation between the two is not the same. I'll look into it, but I imagine that the BIF version is actually doing some funny business within the runtime to gain efficiency and as a result signalling the target process by directly updating its mailbox instead of actually sending an Erlang message on the higher Erlang-to-Erlang level.
Maybe it is good that we have both, but this should definitely be clearly marked in the docs, and it evidently is not (I just checked).
There is some difference in timeout order if you have many timers.
The example below shows erlang:send_after does not guarantee order, but
timer:send_after does.
1> A = lists:seq(1,10).
[1,2,3,4,5,6,7,8,9,10]
2> [erlang:send_after(100, self(), X) || X <- A].
...
3> flush().
Shell got 2
Shell got 3
Shell got 4
Shell got 5
Shell got 6
Shell got 7
Shell got 8
Shell got 9
Shell got 10
Shell got 1
ok
4> [timer:send_after(100, self(), X) || X <- A].
...
5> flush().
Shell got 1
Shell got 2
Shell got 3
Shell got 4
Shell got 5
Shell got 6
Shell got 7
Shell got 8
Shell got 9
Shell got 10
ok

How to kill an infinite loop process in Erlang

If I create a module with this code below
start_nonstop() ->
    spawn(fun() ->
        Pid = spawn(?MODULE, nonstop, [0]),
        timer:sleep(1000),
        exit(Pid, kill)
    end).
nonstop(N) ->
    io:format("number: ~B~n", [N + 1]),
    nonstop(N + 1).
and call start_nonstop() from the Erlang shell, I see an endless series of
number: 1
number: 2
...
which means that the nonstop(N) process was not killed as expected by calling exit(Pid,kill)...
What am I doing wrong? Obviously, this code is a mockup, but I think there is always the chance that some logic bug in a process might result in an infinite loop behaviour similar to this one.
I supposed this could be handled by Erlang, but if not, how can I have an Erlang application be protected regarding these kind of situations?
Which patterns of "infinite loops" can Erlang break? For example, if I put a sleep in the middle of the nonstop(N) functions, Erlang can break the infinite loop, but if I put an erlang:yield() it still cannot break from the infinite loop ...
In this case the infinite process is local to the one trying to kill it. But, what if the infinite process was in a different (e.g., remote) Erlang VM? Could it be killed then?
I am a newbie, and I am evaluating Erlang before I put too much effort in learning and using it for serious applications.
Thanks
In this code, you spawn two processes.
In start_nonstop(), you spawn a process, which we can call Process1. Then, inside Process1, you spawn another process, which we can call Process2.
The work of Process2 is:
nonstop(N) ->
    io:format("number: ~B~n", [N + 1]),
    nonstop(N + 1).
It just keeps calling io:format("number: ~B~n", [N + 1]) until Process1 kills it.
In my environment, Process2 does get killed. By that point, however, N has become very large, as the output shows:
number: 51321
number: 51322
number: 51323
number: 51324
number: 51325
number: 51326
number: 51327
number: 51328
number: 51329
number: 51330
number: 51331
number: 51332
7>
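One way to check this without wading through the printed output is to monitor the looping process and wait for its exit reason. The following is a small sketch under that assumption; the module name and the spin/1 loop are made up, and the loop deliberately does no I/O.

```erlang
-module(kill_demo).
-export([run/0]).

%% A tight loop with no receive and no I/O.
spin(N) ->
    spin(N + 1).

%% Spawn the loop with a monitor, let it run briefly, kill it, and
%% return the exit reason reported by the monitor.
run() ->
    {Pid, Ref} = spawn_monitor(fun() -> spin(0) end),
    timer:sleep(100),
    exit(Pid, kill),
    receive
        {'DOWN', Ref, process, Pid, Reason} -> Reason
    after 1000 -> still_running
    end.
```

Here run() should return killed, since the scheduler preempts the loop and the untrappable kill signal terminates it.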

erlang supervisor best way to handle ibrowse:send_req conn_failed

I'm new to Erlang and having a bit of trouble getting my head around the new paradigm!
OK, so I have this internal function within an OTP gen_server:
my_func() ->
    Result = ibrowse:send_req(?ROOTPAGE,[{"User-Agent",?USERAGENT}],get),
    case Result of
        {ok, "200", _, Xml} -> %<<do some stuff that won't interest you>>
            ,ok;
        {error,{conn_failed,{error,nxdomain}}} -> <<what the heck do I do here?>>
    end.
If I leave out the case for handling the connection failed then I get an exit signal propagated to the supervisor and it gets shut down along with the server.
What I want to happen (at least I think this is what I want to happen) is that on a connection failure I'd like to pause and then retry send_req say 10 times and at that point the supervisor can fail.
If I do something ugly like this...
{error,{conn_failed,{error,nxdomain}}} -> stop()
it shuts down the server process and, yes, I get to use my restart strategy (try 10 times within 10 seconds) until it fails, which is also the desired result. However, the return value from the server to the supervisor is 'ok', when I would really like to return {error,error_but_please_dont_fall_over_mr_supervisor}.
I strongly suspect in this scenario that I'm supposed to handle all the business stuff like retrying failed connections within 'my_func' rather than trying to get the process to stop and then having the supervisor restart it in order to try it again.
Question: what is the 'Erlang way' in this scenario ?
I'm new to erlang too.. but how about something like this?
The code is long just because of the comments. My solution (I hope I've understood correctly your question) will receive the maximum number of attempts and then do a tail-recursive call, that will stop by pattern-matching the max number of attempts with the next one. Uses timer:sleep() to pause to simplify things.
%% @doc Instead of having my_func/0, you have
%% my_func/1, so we can "inject" the max number of
%% attempts. This one will call your tail-recursive
%% one
my_func(MaxAttempts) ->
    my_func(MaxAttempts, 0).
%% @doc This one will match when the maximum number
%% of attempts have been reached, terminates the
%% tail recursion.
my_func(MaxAttempts, MaxAttempts) ->
    {error, too_many_retries};
%% @doc Here's where we do the work, by having
%% an accumulator that is incremented with each
%% failed attempt.
my_func(MaxAttempts, Counter) ->
    io:format("Attempt #~B~n", [Counter]),
    % Simulating the error here.
    Result = {error,{conn_failed,{error,nxdomain}}},
    case Result of
        {ok, "200", _, Xml} -> ok;
        {error,{conn_failed,{error,nxdomain}}} ->
            % Wait, then tail-recursive call.
            timer:sleep(1000),
            my_func(MaxAttempts, Counter + 1)
    end.
EDIT: If this code is in a supervised process, I think it's better to use a simple_one_for_one supervisor, where you can dynamically add whatever workers you need. This avoids delaying initialization due to timeouts (in a one_for_one supervisor the workers are started in order, and sleeping at that point will stop the other processes from initializing).
EDIT2: Added an example shell execution:
1> c(my_func).
my_func.erl:26: Warning: variable 'Xml' is unused
{ok,my_func}
2> my_func:my_func(5).
Attempt #0
Attempt #1
Attempt #2
Attempt #3
Attempt #4
{error,too_many_retries}
With 1s delays between each printed message.
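The same idea can be written so that the request itself is passed in as a fun, which makes it easy to try without ibrowse. This is only a sketch: the module name retry_sketch and the function with_retries/3 are hypothetical, and it treats any {error, {conn_failed, _}} result as retryable.

```erlang
-module(retry_sketch).
-export([with_retries/3]).

%% Call Request() up to AttemptsLeft times, sleeping DelayMs
%% milliseconds after each connection failure. Any other result,
%% success or not, is returned to the caller unchanged.
with_retries(_Request, 0, _DelayMs) ->
    {error, too_many_retries};
with_retries(Request, AttemptsLeft, DelayMs) when AttemptsLeft > 0 ->
    case Request() of
        {error, {conn_failed, _Reason}} ->
            timer:sleep(DelayMs),
            with_retries(Request, AttemptsLeft - 1, DelayMs);
        Other ->
            Other
    end.
```

Inside the gen_server, the fun would wrap the ibrowse:send_req call from the question, and the caller can decide whether {error, too_many_retries} should crash the process so the supervisor takes over.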

Should/can I do nested receives for TCP data?

Can I nest receive {tcp, Socket, Bin} -> calls? For example, I have a top-level loop called Loop which, upon receipt of TCP data, calls a function, parse_header, to parse header data (an integer which indicates the kind of data to follow and thus its size). After that I need to receive the entire payload before moving on. I might only receive 4 bytes when I need a full 20 bytes, and would like to call receive in a separate function called parse_payload. So the call chain would look like loop->parse_header->parse_payload, and I would like parse_payload to call receive {tcp, Socket, Bin} ->. I don't know if this is OK or if I'm going to completely mess things up and can only do it in the Loop function. Can someone enlighten me? If I am allowed to do this, am I violating some sort of best practice?
Maybe you can check the sample code for "Erlang Programming".
The download page is Erlang Programming Source Code
In the file socket_examples.erl, please check the "receive_data" function.
To parse messages, I think you should first decide how to separate one message from the next (fixed length or a terminating byte), then parse each message's header and payload.
receive_data(Socket, SoFar) ->
    receive
        {tcp,Socket,Bin} -> %% (3)
            receive_data(Socket, [Bin|SoFar]);
        {tcp_closed,Socket} -> %% (4)
            list_to_binary(lists:reverse(SoFar)) %% (5)
    end.
You can also set a gen_tcp socket in passive mode. This way, the owning process won't receive the input as messages but has to fetch it using gen_tcp:recv(Socket, ByteCount), which returns either {ok, Input} or {error, Reason}. As this call waits indefinitely for the bytes, you might want to add a timeout using gen_tcp:recv/3. (Erlang documentation of gen_tcp:recv)
While at first glance it might seem the process is now completely unable to react to messages sent to it, there is the following workaround improving the situation a bit:
f1(X) ->
    receive
        message1 ->
            ... do something ...,
            f1(X);
        message2 ->
            ... do something ...,
            f1(X)
    after 0 -> % timeout in ms
        {ok, Input} = gen_tcp:recv(Socket, ByteCount, Timeout),
        ... do something ..., % maybe call gen_tcp:recv a few more times
        f1(X)
    end.
If you don't add a timeout to gen_tcp:recv here, other processes could wait ages for f1 to handle their messages.
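As a concrete illustration of the passive-mode approach, the sketch below reads a header first and then exactly as many payload bytes as the header announces. The framing is an assumption made for the example (a 4-byte big-endian length followed by the payload), and the module name and demo/0 loopback setup are hypothetical.

```erlang
-module(framed_recv).
-export([read_frame/1, demo/0]).

%% Read one frame from a passive-mode binary socket: a 4-byte length
%% header, then that many payload bytes. gen_tcp:recv/3 only returns
%% once the requested number of bytes has arrived (or on timeout).
read_frame(Socket) ->
    {ok, <<Len:32>>} = gen_tcp:recv(Socket, 4, 5000),
    read_payload(Socket, Len).

%% A length of 0 is handled separately because recv with length 0
%% means "return whatever is available", not "return nothing".
read_payload(_Socket, 0) ->
    <<>>;
read_payload(Socket, Len) ->
    {ok, Payload} = gen_tcp:recv(Socket, Len, 5000),
    Payload.

%% Loopback round trip: send one framed message and read it back.
demo() ->
    Opts = [binary, {active, false}, {packet, raw}],
    {ok, Listen} = gen_tcp:listen(0, Opts),
    {ok, Port} = inet:port(Listen),
    {ok, Client} = gen_tcp:connect({127,0,0,1}, Port, Opts),
    {ok, Server} = gen_tcp:accept(Listen, 5000),
    Msg = <<"hello">>,
    ok = gen_tcp:send(Client, <<(byte_size(Msg)):32, Msg/binary>>),
    Payload = read_frame(Server),
    ok = gen_tcp:close(Client),
    ok = gen_tcp:close(Server),
    ok = gen_tcp:close(Listen),
    Payload.
```

Because each recv call asks for an exact byte count, the header and payload parsing can live in separate functions without any risk of one of them consuming bytes meant for the other.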
