Understanding supervisor duty in Erlang/Elixir

I wrote a new library called director.
It's a supervisor library.
One of its features is that you can give director a fun of arity 2, and director will call that function on every crash of a process; the first argument is the crash reason and the second is the crash count. For example:
-module(director_test).
-behaviour(director).
-export([start_link/0, init/1]).

start_link() ->
    director:start_link(?MODULE, []).

init([]) ->
    ChildSpec = #{id => foo,
                  start => {m, f, args},
                  plan => [fun my_plan/2],
                  count => infinity},
    {ok, [ChildSpec]}.

my_plan(normal, Count) when Count rem 10 == 0 ->
    %% Every 10th time the process crashes with reason normal,
    %% director restarts it after waiting 3000 milliseconds.
    {restart, 3000};
my_plan(normal, _Count) ->
    %% Otherwise, if the process crashed with reason normal,
    %% director restarts it immediately.
    restart;
my_plan(killed, _Count) ->
    %% If the process was killed, director deletes it from its children.
    delete;
my_plan(Reason, _Count) ->
    %% For any other reason, director crashes with reason
    %% {foo_crashed, Reason}.
    {stop, {foo_crashed, Reason}}.
I announced my library on Slack and people were wondering about writing a new supervisor in this way!
Someone said, "I tend to not let the supervisor handle back-off".
In the end they did not give me clear information, and I think I need to know more about the supervisor and its duties.
I think that a supervisor is a process that should understand when to restart which child, when to delete which child, and when not to restart which child. Am I right?
Can you tell me some good features of the OTP supervisor that I do not have in director? (List of director's features)

You are mixing the ideas of supervision and management.
Supervision is already a part of OTP. It is the basic idea that:
No process can ever possibly become an orphan
Crashes will be restarted or aborted, and this is an architectural decision made before internal logic is written.
Crashes can be logged externally (handled by a process other than whatever failed).
Error handling code, crash forensics, and so on never occur as part of supervision. Ever. (Complex logic leads to complex weirdness, and supervision needs to be simple, robust, and reliable.)
Management is something that may or may not be present in your system, so it is left up to you. It is the idea that you would have a single (usually named) process that guides the overall high-level task your (supervised) workers are doing. Having a manager process gives you a single point of control for the overall effort being done -- which also means it is a single place where you can tell that overall effort to start, stop, suspend itself, etc., and it is where you could add additional logic about selective restarts based on some crash condition.
Think of "supervision" as a low-level, system framework type idea. It is always the same in all programs just like opening a file or handling a network socket would be. Think of management as one discrete chunk of the actual problem your program needs to solve to accomplish its work.
Management may or may not be complex. Supervision must always be uniform and simple. Giving a supervisor too much responsibility makes it difficult to understand and debug, and often leads to business problems -- an overloaded supervisor can be a major problem in a system. Don't burden your supervisors with high-level management tasks.
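To make the split concrete, here is a minimal sketch; the module names (job_sup, job_manager, job_worker) are invented for illustration. The supervisor only restarts things, while the manager -- an ordinary gen_server -- is where back-off or selective-restart logic would live.

%% Supervision stays dumb and uniform: restart whatever dies.
-module(job_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Children = [%% The manager is an ordinary gen_server holding any
                %% back-off / selective-restart / pause logic you need.
                #{id => job_manager,
                  start => {job_manager, start_link, []}},
                #{id => job_worker,
                  start => {job_worker, start_link, []}}],
    {ok, {SupFlags, Children}}.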
I wrote an article about the "service -> worker pattern" in Erlang a while back. Hopefully it informs more than it confuses: https://zxq9.com/archives/1311

Please do not take this personally. You asked for feedback and I'm trying to give it to you.
After quickly looking at the docs and the code, I think the main problems with your library are:
You are introducing complexity in an area where it's normally not needed. In the vast majority of Erlang programs you don't want to analyse why a process has crashed; analysing it is prone to errors. So the "normal" solution is just to restart the process. If you introduce any logic at this point, you probably introduce some errors too. Such a program is harder to reason about, and the advantages are, at the least, disputable.
You are assuming that the exit reason is the reason why the process exited. This is not necessarily true: the reason could have been propagated from one of its linked processes. If you really wanted to react to all possible exit reasons, you would have to compute a transitive closure over the process's exit reasons, all its children's exit reasons, all their children's exit reasons, and so on. And you would have to change it whenever any of the components changes, which is a very error-prone approach. The complexity introduced (see 1) explodes very badly.
You introduce some "introspection" logic outside the context where the internal logic should ideally be kept -- that is, knowledge about the internal workings of the process is used outside of its module, in the director's plan. This breaks encapsulation somewhat. The "normal" supervisor knows just how to start the process; it doesn't need any more information about the process internals.
Last but not least: you are probably solving a non-existent problem. Instead of developing a whole new solution, you should clearly identify the problems of the existing solution and try to solve them very directly and minimally.

Related

Does a gen_server restart copy its state?

The Erlang world does not use try-catch as much as mainstream languages do. I want to know how the performance of restarting a process compares with try-catch in mainstream languages.
An Erlang process has its own small stack and heap, which are actually allocated on the OS heap. Why is it efficient to restart it?
I hope someone can give me deep insight into what BEAM does when a restart operation is invoked on a process.
Besides, what about a gen_server, which maintains state in its process? Will a state copy operation happen when the gen_server restarts?
Thanks
I recommend having a read of https://ferd.ca/the-zen-of-erlang.html
Here's my understanding: restarting is effective for fixing "Heisenbugs", which only happen when the (Erlang) process is in some weird state and/or trying to handle a "weird" message.
The presumption is that you revert to a known good state (by restarting), which should handle all normal messages correctly. Restarting is not meant to "fix all the problems", and certainly not things like bad configuration or a missing internet connection. By this definition we can see it's very dangerous to copy the state when the crash happened and try to recover from that, because doing so defeats the whole point of going back to a known state.
The second point is: say this process only crashes when handling an action that only 0.001% (or whatever percentage is considered negligible) of all your users actually use, and it's not really important (e.g. a minor UI detail); then it's totally fine to just let it crash and restart, without needing to fix it. I think this can be a productivity enabler in such cases.
Regarding your questions in the OP comment: yes, it's just whatever your init callback returns; you can either build the entire starting state there or source it from other places, depending entirely on the use case.
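As a hedged sketch of that idea (the module and state shape are invented): on every start and restart, init/1 rebuilds a known good state from scratch instead of trying to recover whatever the crashed process held.

-module(counter_server).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Called on every start *and* every restart: we always begin from a
%% known good state rather than trying to salvage the crashed one.
init([]) ->
    {ok, #{count => 0}}.

handle_call(get, _From, State = #{count := N}) ->
    {reply, N, State}.

handle_cast(bump, State = #{count := N}) ->
    {noreply, State#{count := N + 1}}.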

Best-effort OTP supervision

What I'd like to do is change my supervisor to make a best effort to keep children running, but give up if their crash rate exceeds the intensity. That way the remainder of the children keep running. This doesn't appear to be possible with the existing supervisor configurations, though, so it looks like my only option may be to implement my own supervisor so I can have it behave this way when it receives EXIT.
Is there a way to implement custom OTP supervisor behavior like this without writing your own supervisor?
It sounds to me like what you want is an individual supervisor for each child, responsible for keeping it alive up to a limit, as you say, and as a layer above that have a single supervisor (one-for-one or simple-one-for-one) whose children are marked as temporary, so that when one of them gives up, the rest stay running.
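A rough sketch of that layering, under hypothetical module names (best_effort_sup and child_sup); the per-child supervisor's intensity/period encode the crash rate you are willing to tolerate before giving up on that child:

%% Top level: one_for_one with *temporary* children. When a per-child
%% supervisor exhausts its restart intensity and dies, it is not
%% restarted, and its siblings keep running.
-module(best_effort_sup).
-behaviour(supervisor).
-export([start_link/1, init/1]).

start_link(ChildMFAs) ->
    supervisor:start_link(?MODULE, ChildMFAs).

init(ChildMFAs) ->
    Children = [#{id => Id,
                  %% child_sup (hypothetical) is a one_for_one supervisor
                  %% holding a single permanent worker, with intensity and
                  %% period set to the crash rate you will tolerate.
                  start => {child_sup, start_link, [MFA]},
                  restart => temporary,
                  type => supervisor}
                || {Id, MFA} <- ChildMFAs],
    {ok, {#{strategy => one_for_one}, Children}}.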
You can't "extend" Supervisor to add different supervision behaviour, but you don't have to start from scratch either. The :supervisor module itself is implemented on top of :gen_server, so I would consult the source code of :supervisor (which you can find here) if you do find yourself needing some kind of custom supervision behaviour; it will give you a base to build from to avoid some of the pitfalls which you are likely to encounter.
I can expand my answer about alternative solutions once I have a better idea of your use case. As I mentioned in my comment, it sounds to me that you are likely doing something during init/1 of your processes which is prone to failure; init/1 is not the place to handle those things, because if it becomes impossible to succeed at that action temporarily, you will almost certainly blow the max restart intensity of the supervisor.
For example, let's assume you have a process which talks to the database, and requires a database connection; you do not want to try and connect to the database during init/1. Rather you should acquire the connection post-init (perhaps on first-use, or by immediately sending a post-init message to the process using Process.send_after(self(), :connect, 0)), and if the connection fails, return something like {:error, :database_unavailable} to any callers while you attempt to re-establish the connection. Designing with this approach will allow your supervision tree to remain stable, and it instead pushes the decision on how to deal with failure down to the clients who likely have better information on how it impacts them (i.e., should they retry the operation, return an error to their caller, exit with an exception, etc.)
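The same pattern as an Erlang sketch (db_client and its functions are hypothetical): defer the connection until after init/1 and report an error to callers while disconnected.

-module(db_server).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    %% Do NOT connect here; just schedule a post-init connect attempt.
    self() ! connect,
    {ok, #{conn => undefined}}.

handle_info(connect, State) ->
    case db_client:connect() of                 %% hypothetical DB client
        {ok, Conn} ->
            {noreply, State#{conn := Conn}};
        {error, _Reason} ->
            %% Retry later instead of crashing out of init/1 and
            %% blowing the supervisor's restart intensity.
            erlang:send_after(1000, self(), connect),
            {noreply, State}
    end.

%% While disconnected, push the failure decision to the caller.
handle_call({query, _Q}, _From, State = #{conn := undefined}) ->
    {reply, {error, database_unavailable}, State};
handle_call({query, Q}, _From, State = #{conn := Conn}) ->
    {reply, db_client:query(Conn, Q), State}.

handle_cast(_Msg, State) ->
    {noreply, State}.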
You can use director too; it's more flexible for solving this problem.

Is there an Erlang behaviour that can act on its own instead of waiting to be called?

I'm writing an Erlang application that requires actively polling some remote resources, and I want the process that does the polling to fit into the OTP supervision trees and support all the standard facilities like proper termination, hot code reloading, etc.
However, the two default behaviours, gen_server and gen_fsm, seem to only support operation based on callbacks. I could abuse gen_server to do that through calls to self, or abuse gen_fsm by having a single state that always loops to itself with a timeout of 0, but I'm not sure that's safe (i.e. that it doesn't exhaust the stack or accumulate unread messages in the mailbox).
I could make my process into a special process and write all that handling myself, but that effectively makes me reimplement the Erlang equivalent of the wheel.
So is there a behavior for code like this?
loop(State) ->
    NewState = do_stuff(State), % without waiting to be called
    loop(NewState).
And if not, is there a safe way to trick default behaviours into doing this without exhausting the stack or accumulating messages over time or something?
The standard way of doing that in Erlang is by using erlang:send_after/3. See this SO answer and also this example implementation.
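For reference, a minimal sketch of that approach inside a gen_server (poll_remote/1 stands in for your actual polling work):

-module(poller).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(INTERVAL, 5000).   %% poll every 5 seconds

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    erlang:send_after(?INTERVAL, self(), poll),
    {ok, #{}}.

%% Message-based polling: between polls the process sits in receive,
%% so it stays responsive to calls, casts, and system messages.
handle_info(poll, State) ->
    NewState = poll_remote(State),
    erlang:send_after(?INTERVAL, self(), poll),
    {noreply, NewState}.

handle_call(_Msg, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

poll_remote(State) ->
    State.   %% placeholder for the real remote-resource poll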
Is it possible that you could employ an essentially non-OTP-compliant process? Although to be a good OTP citizen you do ideally want to make your long-running processes into gen_servers and gen_fsms, sometimes you have to look beyond the standard-issue rule book and consider why the rules exist.
What if, for example, your supervisor starts your gen_server, and your gen_server spawns another process (let's call it the active_poll process), and they link to each other so that they share fate (if one dies, the other dies)? The active_poll process is now indirectly supervised by the supervisor that spawned the gen_server, because if it dies, so will the gen_server, and they will both get restarted. The only problem you really have to solve now is code upgrade, but this is not too difficult: your gen_server gets a code_change callback when the code is to be upgraded, and it can simply send a message to the active_poll process, which can make an appropriate fully qualified function call, and bingo, it's running the new code.
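A rough sketch of that arrangement, under invented names (poll_bridge, active_poll_loop, do_poll):

-module(poll_bridge).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2,
         code_change/3, active_poll_loop/1]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    %% The link gives shared fate: if either process dies, both die,
    %% so the supervisor restarts the pair together.
    Pid = spawn_link(fun() -> active_poll_loop(#{}) end),
    {ok, #{poller => Pid}}.

handle_call(_Msg, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

code_change(_OldVsn, State = #{poller := Pid}, _Extra) ->
    Pid ! upgrade,   %% nudge the loop onto the new code
    {ok, State}.

%% Non-OTP loop, indirectly supervised through the link above.
active_poll_loop(State) ->
    NewState = do_poll(State),
    receive
        upgrade ->
            ?MODULE:active_poll_loop(NewState)   %% fully qualified -> new code
    after 1000 ->
        active_poll_loop(NewState)
    end.

do_poll(State) ->
    State.   %% placeholder for the real polling work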
If this doesn't suit you for some reason and/or you MUST use gen_server/gen_fsm/similar directly...
I'm not sure that writing a 'special process' really gives you very much. Even if you wrote a special process correctly, such that it is in theory compliant with the OTP design principles, it could still be ineffective in practice if it blocks or busy-waits in a loop somewhere and doesn't invoke sys when it should, so you really have at most a small optimisation over using gen_server/gen_fsm with a zero timeout (or with an async message handler that does the polling and sends a message to self to trigger the next poll).
If whatever you are doing to actively poll can block (such as a blocking socket read, for example), this is really big trouble, as a gen_server, gen_fsm, or special process will all be stopped from fulfilling their usual obligations -- which they would normally meet either because the callback returns, in the case of gen_server/gen_fsm, or because receive is called and the sys module invoked explicitly, in the case of a special process.
If what you are doing to actively poll is non-blocking, though, you can do it, but if you poll without any delay it effectively becomes a busy wait. (It's not quite one, because the loop will include a receive call somewhere, which means the process will yield, giving the scheduler a voluntary opportunity to run other processes, but it's not far off, and it will still be a relative CPU hog.) If you can have a 1 ms delay between each poll, that makes a world of difference versus polling as rapidly as you can. It's not ideal, but if you MUST, it'll work. So use a timeout (as big as you can without it becoming a problem), or have an async message handler that does the polling and sends a message to self to trigger the next poll.

Non-defensive programming in Erlang

The following lines appear in http://aosabook.org/en/riak.html, in the second paragraph of section 15.1, "An Abridged Introduction to Erlang":
"Calling the function with a negative number will result in a run time error, as none of the clauses match. Not handling this case is an example of non-defensive programming, a practice encouraged in Erlang."
Two questions: what is the idiomatic way to handle the resulting error in Erlang, and why would this be better than explicitly covering all cases, as in languages like OCaml or Haskell?
If you write no code for error cases, letting the system generate a run-time error, you get at least three advantages:
the code is smaller, easier to read, and focused on the function it implements.
in error cases, the system raises an error compliant with the OTP standard, so you benefit for free from all the OTP mechanisms for handling the situation at the appropriate level.
you automatically avoid the "lasagna" error-detection syndrome, where many code layers track the same error case.
Now you can concentrate on error management: where will you handle the errors? Erlang offers the classical way with the try and catch statements, and a more idiomatic way with the OTP supervision tree and the link and monitor mechanisms.
In a few words, you have control over the consequences of one process's crash (which processes will crash with it, which processes will be informed) and sophisticated means to restart them.
It is important to keep in mind that in Erlang you generally start many small processes, each with a very limited role and responsibility, and in that case letting them crash and restart really makes sense.
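To make this concrete with a small, illustrative example (not taken from the book): the function below has no clause for negative input, so a bad call crashes the calling process, and the error is observed via a monitor rather than defended against at the call site.

-module(nondefensive).
-export([fact/1, run/0]).

%% Defined only for non-negative integers; a negative argument simply
%% has no matching clause and raises a function_clause error at run time.
fact(0) -> 1;
fact(N) when is_integer(N), N > 0 -> N * fact(N - 1).

%% The failure is handled at the process level, not at the call site.
run() ->
    {Pid, Ref} = spawn_monitor(fun() -> fact(-1) end),
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            io:format("worker crashed: ~p~n", [Reason])
    end.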
I am a fan of the learnyousomeerlang web site where you will find 3 chapters related to the error management:
Errors and Exceptions
Errors and Processes
Who Supervises The Supervisors?

Handling the cleanup of the gen_server state

I have a gen_server running which must clean up its state whenever it is stopped normally or crashes unexpectedly. The cleanup basically consists of deleting a few files.
At the moment, when the gen_server crashes or is stopped normally, the cleanup is done in terminate/2.
Is there any reason why terminate/2 would not be called if the gen_server crashes?
Should some other process monitor the gen_server, waiting to do the cleanup, in case the gen_server dies unexpectedly?
So, the code is like this:
terminate(normal, State) ->
    %% Invoked when the process stops normally.
    cleanup(State);   %% placeholder for the actual file deletion
terminate(_Error, State) ->
    %% Invoked when the process crashes.
    cleanup(State).
EDIT: I found this email on the official mailing list, which talks about the same thing:
http://groups.google.com/group/erlang-programming/browse_thread/thread/9a1ba2d974775ce8
As Adam says below, if we want to avoid trapping exits in the gen_server, we could use different approaches.
But if we trap exits, terminate/2 seems to be a safe place to do the cleanup, as it will always be called. Furthermore, we must correctly handle the 'EXIT' messages arriving at terminate/2 and handle_call/3, and propagate the errors correctly between workers and supervisors.
terminate/2 is called when a crash occurs inside the gen_server, even if it doesn't trap exits; it will not be called if the gen_server receives an 'EXIT' signal from some other process linked to it. If you need to clean up in that case, it should trap exits (using process_flag(trap_exit, true)).
This behaviour is a bit unfortunate because it makes it difficult to write a reliable shutdown procedure for a gen_server process. Also, it is not a good habit to trap exits just for the sake of being able to run terminate/2, since you might catch a lot of other errors, which makes the system harder to debug.
I would consider three options:
Handle the leftover files when the next instance of the process starts (for example, in init/1)
Trap exits, clean up the files, and then crash again with the same reason
Have a third process which monitors the gen_server and whose only purpose is to clean up the files
Option 1 is probably the best, since at least the code doesn't trap exits and you get persistent state for free. Option 2 is not so nice, for the reasons described above: it can hide and obscure other errors. Option 3 is messy because the cleanup process might not be done before the gen_server is started again.
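For reference, a sketch of option 1, showing only the relevant callback (the wildcard pattern and file names are invented):

%% Option 1: the next instance cleans up whatever the previous one
%% left behind. No exit trapping required, and a crash between the
%% cleanup and the next start loses nothing.
init([]) ->
    Leftovers = filelib:wildcard("/tmp/my_app_*.tmp"),  %% hypothetical pattern
    lists:foreach(fun(F) -> file:delete(F) end, Leftovers),
    {ok, #{}}.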
Think carefully about why you want to clean up, and if it really has to be done when the process crashes (it is a bug, after all). Be careful that you don't end up doing too much defensive programming.
This is quite fresh and relevant
When does terminate/2 get called in a gen_server?
