Resolving a deadlock between two gen_servers - erlang

While browsing the code of an Erlang application, I came across an interesting design problem. Let me describe the situation, but I can't post any code because of a PIA, sorry.
The code is structured as an OTP application in which two gen_server modules are responsible for allocating some kind of resources. The application has run perfectly for some time and we haven't really had big issues.
The tricky part begins when the first gen_server needs to check whether the second has enough resources left. A call is issued to the second gen_server, which itself calls a utility library that (in a very, very special case) issues a call back to the first gen_server.
I'm relatively new to Erlang, but I think this situation is going to make the two gen_servers wait for each other.
This is probably a design problem but I just wanted to know if there is any special mechanism built into OTP that can prevent this kind of "hangs".
Any help would be appreciated.
EDIT :
To summarize the answers: if you have a situation where two gen_servers call each other cyclically, you'd better spend some more time on the application design.
Thanks for your help :)

This is called a deadlock and could/should be avoided at the design level. Below is a possible workaround and some subjective points that hopefully help you avoid making a mistake.
While there are ways to work around your problem, "waiting" is exactly what the call is doing.
One possible work around would be to spawn a process from inside A which calls B, but does not block A from handling the call from B. This process would reply directly to the caller.
In server A:
handle_call(do_spaghetti_call, From, State) ->
    spawn(fun() -> gen_server:reply(From, call_server_B(more_spaghetti)) end),
    {noreply, State};

handle_call(spaghetti_callback, _From, State) ->
    {reply, foobar, State}.
In server B:
handle_call(more_spaghetti, _From, State) ->
    {reply, gen_server:call(server_a, spaghetti_callback), State}.
For me this is very complex and super hard to reason about. I think you could even call it spaghetti code without offending anyone.
On another note, while the above might solve your problem, you should think hard about what calling like this actually implies. For example, what happens if server A executes this call many times? What happens if at any point there is a timeout? How do you configure the timeouts so they make sense? (The innermost call must have a shorter timeout than the outer calls, etc).
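To make the timeout nesting concrete, here is a minimal sketch (the server names, request terms, and timeout values are made up for illustration): the inner call must be given strictly less time than the outer one, otherwise the outer caller can give up while the inner call is still waiting.

```erlang
%% Sketch only: names and values are illustrative, not from the original code.
%% The caller gives server A 5 seconds:
Result = gen_server:call(server_a, {do, Request}, 5000),

%% ...while server A (or its spawned helper) gives server B strictly less,
%% so the inner call fails before the outer deadline expires and A can
%% still report the error within its own deadline:
InnerResult = gen_server:call(server_b, more_spaghetti, 2000).
```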
I would change the design, even if it is painful, because when you allow this to exist and work around it, your system becomes very hard to reason about. IMHO, complexity is the root of all evil and should be avoided at all costs.

It is mostly a design issue where you need to make sure that there are no long blocking calls from gen_server1. This can quite easily be done by spawning a small fun which takes care of your call to gen_server2 and then delivers the result to gen_server1 when done.
You would have to keep track of the fact that gen_server1 is waiting for a response from gen_server2. Something like this maybe:
handle_call(Msg, From, S) ->
    Self = self(),
    spawn(fun() ->
              Res = gen_server:call(gen_server2, Msg),
              gen_server:cast(Self, {reply, Res})
          end),
    {noreply, S#state{ from = From }}.

handle_cast({reply, Res}, S = #state{ from = From }) ->
    gen_server:reply(From, Res),
    {noreply, S#state{ from = undefined }}.
This way gen_server1 can serve requests from gen_server2 without hanging. You would of course also need to do proper error propagation from the small process, but you get the general idea.

Another way of doing it, which I think is better, is to make this (resource) information passing asynchronous. Each server reacts and does what it is supposed to when it gets an asynchronous my_resource_state message from the other server. It can also prompt the other server to send its resource state with an asynchronous send_me_your_resource_state message. As both these messages are asynchronous they will never block, and a server can process other requests while it is waiting for a my_resource_state message from the other server after prompting it.
Another benefit of having the message asynchronous is that servers can send off this information without being prompted when they feel it is necessary, for example "help me I am running really low!" or "I am overflowing, do you want some?".
The two replies from #Lukas and #knutin actually do it asynchronously, but they do it by spawning a temporary process, which can then make synchronous calls without blocking the servers. It is easier to use asynchronous messages straight off, and clearer in intent as well.
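A minimal sketch of what such an asynchronous exchange could look like (all names here, including the helpers free_resources/1 and note_peer_resources/2, are made up for illustration, not from the original application):

```erlang
%% Ask the other server to report its resource state; never blocks.
request_resource_state(OtherServer) ->
    gen_server:cast(OtherServer, {send_me_your_resource_state, self()}).

handle_cast({send_me_your_resource_state, From}, State) ->
    %% Reply asynchronously; we keep serving other requests meanwhile.
    gen_server:cast(From, {my_resource_state, free_resources(State)}),
    {noreply, State};
handle_cast({my_resource_state, Free}, State) ->
    %% React to the peer's report, e.g. rebalance or refuse allocations.
    {noreply, note_peer_resources(Free, State)}.
```

Since neither server ever waits inside a handle_* callback, the cyclic dependency can no longer deadlock; at worst a report arrives late.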

Related

Reference vs pid?

I'm not entirely sure of the differences between a pid and a reference, and when to use which.
If I were to spawn a new process with spawn/1, I get a pid. I can kill it with the pid, no? Why would I need a reference?
Likewise I see monitor/1 receiving a message with a ref and a pid.
Thanks!
A pid is a process identifier. You get one when you create a new process with spawn, or you can get your own pid with self(). It allows you to interact with a given process: especially to send messages to it with Pid ! Message, and some other things, like killing it explicitly (which you should not do) or obtaining process information with erlang:process_info.
You can also create relations between processes with erlang:link(Pid) and erlang:monitor(process, Pid) (between the Pid process and the process executing this function). In short, these give you "notifications" when the other process dies.
A reference is just an almost-unique value (of a different type). One might say it gives you a reference to "here and now" that you can recognize later. For example, if we send a message to another process and expect a response, we would like to make sure that the message we receive is associated with our request, and is not just any message from someone else. The easiest way to do this is to tag the message with a unique value, and wait for a response carrying exactly the same tag.
Tag = make_ref(),
Pid ! {Tag, Message},
receive
    {Tag, Response} ->
        ...
end
In this code, using pattern matching, we make sure that (we wait in receive until) the Response is exactly for the Message we sent, no matter what other messages arrive from other processes. This is the most common use of references you will encounter.
And now back to monitors. When calling Ref = monitor(process, Pid) we establish this special connection with the Pid process. The Ref that is returned is just a unique reference that we can use to demonitor the process. That is all.
One might ask: if we are able to create a monitor with a Pid, why do we need a Ref for demonitoring? Couldn't we just use the Pid again? In theory we could, but monitors are implemented in such a way that multiple monitors can be established between the same two processes. So when demonitoring, we have to remove only one such connection. It is done this way to make monitoring more transparent: if you have a library function that creates and removes one monitor, you would not like to interfere with other libraries and functions and the monitors they might be using.
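For example, a small sketch of two independent monitors on the same process (Pid stands for any process you already know about):

```erlang
%% Each monitor/2 call returns a distinct reference, so two monitors
%% on the same pid can be removed independently.
Ref1 = erlang:monitor(process, Pid),
Ref2 = erlang:monitor(process, Pid),

%% Removes only the first monitor; [flush] also discards any 'DOWN'
%% message for Ref1 that may already be in our mailbox.
true = erlang:demonitor(Ref1, [flush]),

%% Ref2 is still active: if Pid dies, we will receive
%% {'DOWN', Ref2, process, Pid, Reason}.
```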
According to this page:
References are erlang objects with exactly two properties:
They can be created by a program (using make_ref/0), and,
They can be compared for equality.
You should use one whenever you need to bind a unique identifier to some "object". At any time you can generate a new one using erlang:make_ref/0. The documentation says:
make_ref() -> reference()
Returns an almost unique reference.
The returned reference will re-occur after approximately 2^82 calls;
therefore it is unique enough for practical purposes.
When you call the erlang:monitor/2 function, it returns a reference so that you can cancel the monitor later (with the erlang:demonitor/1 function). This reference only identifies that particular call to erlang:monitor/2. If you need to operate on the process (kill it, for example), you still have to use the process pid.
Likewise I see monitor/1 receiving a message with a ref and pid number.
Yep, a monitor sends messages like {'DOWN', Ref, process, Pid, Reason}. Which to use (pid or ref) depends only on your application logic, but (IMO) in most usual cases it does not matter which.
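For example, a small sketch of waiting for that 'DOWN' message (matching on the Ref guarantees it belongs to this particular monitor and not some other one on the same pid):

```erlang
Ref = erlang:monitor(process, Pid),
receive
    {'DOWN', Ref, process, Pid, Reason} ->
        io:format("~p exited with reason ~p~n", [Pid, Reason])
end
```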

Erlang: how to deal with long running init callback?

I have a gen_server that, when started, attempts to start a certain number of child processes (usually 10-20) under a supervisor in the supervision tree. The gen_server's init callback invokes supervisor:start_child/2 for each child process needed. The call to supervisor:start_child/2 is synchronous, so it doesn't return until the child process has started. All the child processes are also gen_servers, so the start_link call doesn't return until the init callback returns. In the init callback a call is made to a third-party system, which may take a while to respond (I discovered this issue when calls to the third-party system were timing out after 60 seconds). In the meantime the init call has blocked, meaning supervisor:start_child/2 is also blocked, so the whole time the gen_server process that invoked supervisor:start_child/2 is unresponsive. Calls to that gen_server time out while it is waiting on the start_child function to return, which can easily last for 60 seconds or more. I would like to change this, as my application is suspended in a sort of half-started state while it is waiting.
What is the best way to resolve this issue?
The only solution I can think of is to move the code that interacts with the third-party system out of the init callback and into a handle_cast callback. This would make the init callback faster. The disadvantage is that I would need to call gen_server:cast/2 after all the child processes have been started.
Is there a better way of doing this?
One approach I've seen is the use of a timeout in init/1 together with handle_info/2.
init(Args) ->
    %% {ok, State, Timeout}: a zero timeout makes handle_info(timeout, ...)
    %% run before any other message is handled.
    {ok, {timeout_init, Args}, 0}.

...

handle_info(timeout, {timeout_init, Args}) ->
    %% do your initialization
    {noreply, ActualServerState};   % this time no need for a timeout
handle_info( ....
Almost all callback results can be returned with an additional timeout parameter, which is basically the time to wait for another message. If that time passes, handle_info/2 is called with the atom timeout and the server's state. In our case, with a timeout equal to 0, the timeout should occur even before gen_server:start finishes, meaning that handle_info should be called even before we are able to return the pid of our server to anyone else. So this timeout_init should be the first call made to our server, and gives us some assurance that we finish initialization before handling anything else.
If you don't like this approach (it is not really readable), you might try sending a message to self in init/1:
init(Args) ->
    self() ! {finish_init, Args},
    {ok, no_state_yet}.

...

handle_info({finish_init, Args} = _Message, no_state_yet) ->
    %% finish whateva
    {noreply, ActualServerState};
handle_info( ... % other clauses
Again, you are making sure that the message to finish initialization is sent to this server as soon as possible, which is very important in the case of gen_servers that register under some atom.
EDIT After some more careful study of OTP source code.
Such an approach is good enough when you communicate with your server through its pid, mainly because the pid is returned after your init/1 function returns. But it is a little different in the case of gen_* processes started with start/4 or start_link/4, where the process is automatically registered under a name. There is one race condition you could encounter, which I would like to explain in a little more detail.
If the process is registered, one usually simplifies all calls and casts to the server, like:
count() ->
    gen_server:cast(?SERVER, count).
Here ?SERVER is usually the module name (an atom), and this will work just fine as long as some registered (and alive) process exists under that name. And of course, under the hood this cast is a standard Erlang message send with !. Nothing magical about it; almost the same as what you do in your init with self() ! {finish ....
But in our case we assume one more thing: not just the registration part, but also that our server has finished its initialization. Of course, since we are dealing with a message box, it is not really important how long something takes, but it is important which message we receive first. So to be exact, we would like to receive the finish_init message before receiving any count message.
Unfortunately, such a scenario could happen. This is due to the fact that gen_* processes in OTP are registered before the init/1 callback is called. So in theory, while one process is inside the start function, just past the registration part, another one could find our server and send it a count message, and only after that would init/1 be called and send the finish_init message. The chances are small (very, very small), but it could still happen.
There are three solutions to this.
The first would be to do nothing. In the case of such a race condition, the handle_cast would fail (with a function clause error, since our state is the no_state_yet atom), and the supervisor would just restart the whole thing.
The second would be to ignore this bad message/state incident. This is easily achieved with
... ;
handle_cast(_, State) ->
    {noreply, State}.
as your last clause. Unfortunately, most people using templates use this unfortunate (IMHO) pattern.
In both of these cases you could lose a count message. If that is really a problem, you could still try to fix it by changing the last clause to
... ;
handle_cast(Message, no_state_yet) ->
    gen_server:cast(?SERVER, Message),
    {noreply, no_state_yet}.
but this has other obvious drawbacks, and I would prefer the "let it fail" approach.
The third option is registering the process a little later. Rather than using start/4 and asking for automatic registration, use start/3, receive the pid, and register it yourself:
start(Args) ->
    {ok, Pid} = gen_server:start(?MODULE, Args, []),
    register(?SERVER, Pid),
    {ok, Pid}.
This way we send the finish_init message before registration, and before anyone else could send a count message.
But such an approach has its own drawbacks, mainly the registration itself, which could fail in a few different ways. One could always check how OTP handles that and duplicate the code. But that is another story.
So in the end it all depends on what you need, or even what problems you will encounter in production. It is important to have some idea of what could go wrong, but I personally wouldn't try to fix any of it until I actually suffered from such a race condition.
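As a side note not covered in the answers above: since OTP 21, init/1 can also return {ok, State, {continue, Term}}, which makes the gen_server run handle_continue/2 before any other message is processed, avoiding both the zero-timeout trick and the self-message race. A sketch (the record fields and the slow call third_party:connect/1 are hypothetical):

```erlang
-record(state, {args, conn}).

init(Args) ->
    %% Return immediately, so supervisor:start_child/2 is not blocked.
    {ok, #state{args = Args}, {continue, connect}}.

handle_continue(connect, State = #state{args = Args}) ->
    %% The slow work happens here, after init/1 has returned but
    %% before any call/cast/info message is handled.
    Conn = third_party:connect(Args),  %% hypothetical slow call
    {noreply, State#state{conn = Conn}}.
```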

erlang typechecking

From what I understand there is no way of type-checking the messages sent in Erlang.
Let's say I start a module with the following receive loop:
loop(State) ->
    receive
        {insert, _} ->
            io:fwrite("insert\n", []),
            loop(State);
        {view, _} ->
            io:fwrite("view\n", []),
            loop(State)
    after 10000 ->
        ok
    end.
There is no way for me to check what people are sending to the process, and no way to check that it's type safe.
Are there any easy workarounds?
The one I have come up with is using functions in the module being called, like:
send_insert(Message) ->
    whereis(my_event_handler) ! {insert, Message},
    ok.
This way at least I can add the -spec send_insert(string()) -> ok. spec to the module, so I have limited the error to my module.
Is there a more standard way of doing type checking on messages?
There is the sheriff project that solves your problem. You can use it to check values against their types as defined through typespecs.
I would say that having a function like send_insert in your module, that just sends a message to the process, is good practice not just for type checking. If you need to change the message format some time in the future, you'll know that you only need to change that function and possibly its callers, which is easier to track down than finding all places that send a message of a certain format to some process (which may or may not be the process whose code you're refactoring). Also, since any callers will need to specify the module name, the code becomes a little more self-documenting; you'll know what process that message is supposed to go to.
(BTW, whereis(my_event_handler) ! {insert, Message} can be written as my_event_handler ! {insert, Message}.)
Well, if what you need is just some basic type (and maybe range) checking, you can use guards:
receive
    {insert, Message} when is_list(Message) ->
        io:fwrite("insert\n", []),
        loop(State);
Unfortunately, because of some constraints (guards must be free of any side-effects, for example) there's no way to write your own guard functions.
AFAIK, "-spec" is only for documentation purposes and static analysis (Dialyzer); it will not check your types at runtime.
As you correctly say, there's no type checking per se, but you can use a mix of pattern matching and guards to make things fail. Nevertheless, this is all defensive programming; you should just let it crash, and have a supervisor tree restart whatever needs to be restarted. The logs and crash reports should give you enough information to know what went wrong and act accordingly.

Erlang: simple pubsub for processes — is my approach okay?

Disclaimer: I'm pretty new to Erlang and OTP.
I want a simple pubsub in Erlang/OTP, where processes can subscribe at some "hub" and receive a copy of the messages that are sent to that hub.
I know about gen_event, but it processes events in one single event manager process, while I want every subscriber to be a separate, autonomous process. Also, I was unable to grok gen_event's handler supervision. Unfortunately, the Google results were full of XMPP (Ejabberd) and RabbitMQ links, so I didn't find anything relevant to my idea.
My idea is that such a pubsub model maps seamlessly onto a supervision tree. So I thought to extend the supervisor (a gen_server under the hood) to be able to send a cast message to all its children.
I've hacked this in my quick-and-dirty custom "dispatcher" behavior:
-module(dispatcher).
-extends(supervisor).
-export([notify/2, start_link/2, start_link/3, handle_cast/2]).

start_link(Mod, Args) ->
    gen_server:start_link(dispatcher, {self(), Mod, Args}, []).

start_link(SupName, Mod, Args) ->
    gen_server:start_link(SupName, dispatcher, {SupName, Mod, Args}, []).

notify(Dispatcher, Message) ->
    gen_server:cast(Dispatcher, {message, Message}).

handle_cast({message, Message}, State) ->
    {reply, Children, State} = supervisor:handle_call(which_children, dummy, State),
    Pids = lists:filter(fun(Pid) -> is_pid(Pid) end,
                        lists:map(fun({_Id, Child, _Type, _Modules}) -> Child end,
                                  Children)),
    [gen_server:cast(Pid, Message) || Pid <- Pids],
    {noreply, State}.
However, while everything seems to work fine at first glance (children receive messages and are seamlessly restarted when they fail), I wonder whether this was a good idea.
Could someone, please, criticize (or approve) my approach, and/or recommend some alternatives?
I've recently used gproc to implement pubsub. The example from the readme does the trick.
subscribe(EventType) ->
    %% Gproc notation: {p, l, Name} means {(p)roperty, (l)ocal, Name}
    gproc:reg({p, l, {?MODULE, EventType}}).

notify(EventType, Msg) ->
    Key = {?MODULE, EventType},
    gproc:send({p, l, Key}, {self(), Key, Msg}).
From your code it looks to me like gen_event handlers are a perfect match.
The handler callbacks are called from one central process dispatching the messages, but these callbacks shouldn't do much work.
So if you need an autonomous process with its own state for the subscribers, just send a message from the event callback.
Usually these autonomous processes would be gen_servers, and you would just call gen_server:cast from your event callbacks.
Supervision is a separate issue, that can be handled by the usual supervision infrastructure that comes with OTP. How you want to do supervision depends on the semantics of your subscriber processes. If they are all identical servers, you could use a simple_one_for_one for example.
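A sketch of such a simple_one_for_one supervisor for identical subscribers (the module name subscriber and the function start_subscriber/1 are illustrative; map-style specs require OTP 18+):

```erlang
%% Supervisor init: one child template, instantiated on demand.
init([]) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 5, period => 10},
    Child = #{id => subscriber,
              start => {subscriber, start_link, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

%% Start one subscriber per call; the extra argument list is
%% appended to subscriber:start_link's arguments.
start_subscriber(Args) ->
    supervisor:start_child(?MODULE, [Args]).
```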
In the init callback of the subscriber processes you can put the gen_event:add_handler call that adds them to the event manager.
You can even use the event manager as supervisor if you use the gen_event:add_sup_handler function to add your processes if the semantics of this suits you.
Online resources for understanding gen_event better: Learn you some Erlang chapter
Otherwise, the Erlang books all have some gen_event introduction; probably the most thorough one can be found in Erlang and OTP in Action.
Oh and BTW: I wouldn't hack up your own supervisors for this.
A very simple example where you do it all yourself is my very basic chat_demo, which is a simple web-based chat server. Look at chat_backend.erl (or chat_backend.lfe if you like parentheses), which allows users to subscribe; they will then be sent all messages that arrive at the backend. It does not fit into supervision trees, though the modification is simple (it does use proc_lib to get better error messages).
Some time ago, I read about ØMQ (ZeroMQ), which has a bunch of bindings for different programming languages.
http://www.zeromq.org/
http://www.zeromq.org/bindings:erlang
If it does not have to be a pure Erlang solution, this could be an option.

Why are error_logger messages in different order on the console compared to error_logger_mf file

I'm looking at error_logger messages on the console and storing them in a file with error_logger_mf at the same time.
The messages appear in a completely different order when I look at the file compared to the console.
The timestamps all show the same value, so it's going pretty fast, and I do understand that messages could get out of order when sent from different processes.
But I always thought that once they reach the error_logger, they are kept in the same order when they are sent to the different event handlers.
What I see is that in the file (when I look at it with rb) the events come out in a saner order than on the console.
Clarification:
It is clear that the order in which messages from different processes arrive at error_logger is not to be taken too seriously.
What I don't understand is the difference in order when I compare the disk log to the screen log.
I added an answer as community wiki with my partial findings below; please edit if you know additional points.
Update: this is still unresolved, feel free to add to this community wiki if you know something
I did some digging in the source, but found no solution to the riddle so far.
I looked into error_logger_tty_h.erl, which should be responsible for output to the console:
handle_event({_Type, GL, _Msg}, State) when node(GL) =/= node() ->
    {ok, State};
handle_event(Event, State) ->
    write_event(tag_event(Event)),
    {ok, State}.
So events that have a group leader on another node are ignored; everything not ignored is passed through write_event/1, which does some formatting and then outputs the result with:
format(String) -> io:format(user, String, []).
format(String, Args) -> io:format(user, String, Args).
In user.erl, where io:format sends its io_request, we have a single server loop calling a cascade of functions that ultimately send the text to the tty port.
At no point are messages sent from more than one process!
So I can't see any way for the messages to change order while travelling to the tty.
Where else can the order of reports change, depending on whether the messages are sent to the tty or to mf?