Best-effort OTP supervision - erlang

What I'd like to do is change my supervisor so that it makes a best effort to keep each child running, but gives up on a child whose crash rate exceeds the intensity. That way the remainder of the children keep running. This doesn't appear to be possible with the existing supervisor configurations, though, so it looks like my only option may be to implement my own supervisor so I can have it behave this way when it receives EXIT.
Is there a way to implement custom OTP supervisor behavior like this without writing your own supervisor?

It sounds to me like what you want is an individual supervisor for each child, responsible for keeping it alive up to a limit, as you say, and as a layer above that have a single supervisor (one-for-one or simple-one-for-one) whose children are marked as temporary, so that when one of them gives up, the rest stay running.
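A minimal sketch of that two-level layout follows. The module name best_effort_sup, the {Id, MFA} shape of the worker list, and the intensity and period values are all assumptions made for illustration; none of them come from the question.

```erlang
-module(best_effort_sup).
-behaviour(supervisor).

-export([start_link/1, init/1]).

%% Workers is a list of {Id, {M, F, A}} tuples (a hypothetical shape).
start_link(Workers) ->
    supervisor:start_link(?MODULE, {top, Workers}).

%% Top level: one temporary child per worker. Each child is itself a
%% small supervisor, so when it gives up it is not restarted and its
%% siblings keep running.
init({top, Workers}) ->
    SupFlags = #{strategy => one_for_one, intensity => 1, period => 5},
    Children = [#{id => Id,
                  start => {supervisor, start_link,
                            [?MODULE, {guard, Id, MFA}]},
                  restart => temporary,
                  type => supervisor}
                || {Id, MFA} <- Workers],
    {ok, {SupFlags, Children}};

%% Guard level: keeps a single worker alive until it exceeds 5 restarts
%% within 10 seconds, then terminates. Only this branch is lost.
init({guard, Id, MFA}) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    {ok, {SupFlags, [#{id => Id, start => MFA, restart => permanent}]}}.
```

Because the guard supervisors are temporary children, a guard that exceeds its own intensity is simply dropped by the top-level supervisor rather than counted against the top-level restart limit.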

You can't "extend" Supervisor to add different supervision behaviour, but you don't have to start from scratch either. The :supervisor module itself is implemented on top of :gen_server, so I would consult the source code of :supervisor (which you can find here) if you do find yourself needing some kind of custom supervision behaviour; it will give you a base to build from to avoid some of the pitfalls which you are likely to encounter.
I can expand my answer about alternative solutions once I have a better idea of your use case. As I mentioned in my comment, it sounds to me that you are likely doing something during init/1 of your processes which is prone to failure; init/1 is not the place to handle those things, because if it becomes impossible to succeed at that action temporarily, you will almost certainly blow the max restart intensity of the supervisor.
For example, let's assume you have a process which talks to the database, and requires a database connection; you do not want to try and connect to the database during init/1. Rather you should acquire the connection post-init (perhaps on first-use, or by immediately sending a post-init message to the process using Process.send_after(self(), :connect, 0)), and if the connection fails, return something like {:error, :database_unavailable} to any callers while you attempt to re-establish the connection. Designing with this approach will allow your supervision tree to remain stable, and it instead pushes the decision on how to deal with failure down to the clients who likely have better information on how it impacts them (i.e., should they retry the operation, return an error to their caller, exit with an exception, etc.)
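Written in Erlang, the post-init connection idea might look like the sketch below. The module name db_client, the query/2 API, the retry interval, and the connect/0 and run_query/2 placeholders are all hypothetical; a real driver would replace the placeholders.

```erlang
-module(db_client).
-behaviour(gen_server).

-export([start_link/0, query/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(RETRY_MS, 1000).   % assumed retry interval

start_link() ->
    gen_server:start_link(?MODULE, [], []).

query(Pid, Sql) ->
    gen_server:call(Pid, {query, Sql}).

%% init/1 never touches the database, so it cannot fail because of it.
init([]) ->
    erlang:send_after(0, self(), connect),
    {ok, #{conn => undefined}}.

%% While disconnected, callers get an error instead of a crash.
handle_call({query, _Sql}, _From, #{conn := undefined} = State) ->
    {reply, {error, database_unavailable}, State};
handle_call({query, Sql}, _From, #{conn := Conn} = State) ->
    {reply, run_query(Conn, Sql), State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Connect after init; on failure, schedule another attempt.
handle_info(connect, State) ->
    case connect() of
        {ok, Conn} ->
            {noreply, State#{conn := Conn}};
        {error, _Reason} ->
            erlang:send_after(?RETRY_MS, self(), connect),
            {noreply, State}
    end;
handle_info(_Other, State) ->
    {noreply, State}.

%% Hypothetical placeholders standing in for a real database driver.
connect() -> {error, not_implemented}.
run_query(_Conn, _Sql) -> {error, not_implemented}.
```

With the placeholder connect/0 always failing, every query returns {error, database_unavailable} while the process itself stays up, which is the behaviour the supervision tree needs.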

You can also use director; it is more flexible for solving this problem.

Related

Does a gen_server restart copy its state?

The Erlang world does not use try-catch the way other languages do. I want to know how restarting a process compares, in performance, with try-catch in a mainstream language.
An Erlang process has its own small stack and heap, which are actually allocated on the OS heap. Why is it effective to restart it?
I hope someone can give me deeper insight into what BEAM does when a restart is invoked on a process.
Also, what about a gen_server, which maintains state in its process? Does restarting a gen_server cause the state to be copied?
Thanks
I recommend having a read of https://ferd.ca/the-zen-of-erlang.html
Here's my understanding: restart is effective for fixing "Heisenbug" which only happens when the (Erlang) process is in some weird state and/or trying to handle a "weird" message.
The presumption is that you revert to a known good state (by restarting), which should handle all normal messages correctly. Restart is not meant to "fix all the problems", and certainly not things like bad configuration or a missing internet connection. By this definition we can see that it is very dangerous to copy the state when the crash happened and try to recover from that, because it defeats the whole point of going back to a known state.
The second point is this: say the process only crashes when handling an action that only 0.001% (or whatever percentage is considered negligible) of all your users actually use, and it's not really important (e.g. a minor UI detail). Then it's totally fine to just let it crash and restart, without needing to fix it. I think it can be a productivity enabler in these cases.
Regarding your questions in the OP comment: yes, the state after a restart is just whatever your init callback returns. You can either build the entire starting state there or source it from other places; it depends entirely on the use case.

Understanding supervisor duty in Erlang/Elixir

I wrote a new library called director.
It's a supervisor library.
One of its features is that you can give director a fun of arity 2, and director will call that function on every crash of the process. The first argument is the crash reason and the second is the crash count, for example:
-module(director_test).
-behaviour(director).

-export([start_link/0, init/1]).

start_link() ->
    director:start_link(?MODULE, []).

init([]) ->
    ChildSpec = #{id => foo,
                  start => {m, f, args},
                  plan => [fun my_plan/2],
                  count => infinity},
    {ok, [ChildSpec]}.

my_plan(normal, Count) when Count rem 10 == 0 ->
    %% On every 10th crash with reason normal, director waits
    %% 3000 milliseconds and then restarts the process.
    {restart, 3000};
my_plan(normal, _Count) ->
    %% On any other crash with reason normal, director restarts it.
    restart;
my_plan(killed, _Count) ->
    %% If the process was killed, director deletes it from its children.
    delete;
my_plan(Reason, _Count) ->
    %% For other reasons, director crashes with reason {foo_crashed, Reason}.
    {stop, {foo_crashed, Reason}}.
I announced my library in Slack and they were wondering about writing a new supervisor in this way!
Someone said, "I tend to not let the supervisor handle back-off".
In the end they did not give me clear information, and I think I need to know more about supervisors, their duties, and so on.
I think that a supervisor is a process that should understand when to restart a child, when to delete a child, and when not to restart a child. Am I right?
Can you tell me some good features of OTP/Supervisor that I do not have in Director? (List of director's features)
You are mixing the ideas of supervision and management.
Supervision is already a part of OTP. It is the basic idea that:
No process can ever possibly become an orphan
Crashes will be restarted or aborted, and this is an architectural decision made before internal logic is written.
Crashes can be logged externally (handled by a process other than whatever failed).
Error handling code, crash forensics, and so on never occur as part of supervision. Ever. (Complex logic leads to complex weirdness, and supervision needs to be simple, robust, and reliable.)
Management is something that may or may not be present in your system, so it is left up to you. It is the idea that you would have a single (usually named) process that guides the overall high-level task that your (supervised) workers are doing. Having a manager process gives you a single point of control for the overall effort being done -- which also means it is a single place you can tell that overall effort to start, stop, suspend itself, etc. and this is where you could add additional logic about selective restarts based on some crash condition.
Think of "supervision" as a low-level, system framework type idea. It is always the same in all programs just like opening a file or handling a network socket would be. Think of management as one discrete chunk of the actual problem your program needs to solve to accomplish its work.
Management may or may not be complex. Supervision must always be uniform and simple. Giving supervisors too much responsibility makes them difficult to understand and debug, and often leads to business problems: an overloaded supervisor can be a major problem in a system. Don't burden your supervisors with high-level management tasks.
I wrote an article about the "service -> worker pattern" in Erlang a while back. Hopefully it informs more than it confuses: https://zxq9.com/archives/1311
Please do not take this personally. You have asked for a feedback and I'm trying to give it to you.
After quickly looking at the docs and the code, I think the main problems with your library are:
1. You are introducing complexity in an area where it is normally not needed. In the vast majority of Erlang programs you don't want to analyse why a process has crashed. Analysing it is error-prone, so the "normal" solution is just to restart the process. If you introduce any logic at this point, you probably introduce some errors too. Such a program is harder to reason about, and the advantages are disputable at best.
2. You are assuming that the exit reason is the reason why the process has exited. This is not necessarily true. The reason could have been propagated from one of its linked processes. If you really wanted to react to all possible exit reasons, you would have to take the transitive closure over all the process's exit reasons, all its children's exit reasons, all their children's exit reasons, and so on. And you would have to change it whenever any of the components changes, which is a very bad and very error-prone approach. The complexity introduced (see 1) explodes very badly.
3. You introduce some "introspection" logic outside the context where the internal logic should ideally be kept, i.e. some knowledge about the internal workings of the process is used outside of its module, in the director's plan. This breaks encapsulation somewhat. The "normal" supervisor knows only how to start the process; it doesn't need any more information about the process internals.
4. Last but not least: you are probably solving a non-existent problem. Instead of developing a whole new solution, you should clearly identify the problems of the existing solution and try to solve them very directly and minimally.

Is it common practice to kill a process with exit/2?

In my app, I plan to have many worker processes, that can potentially spend hours doing their work.
I want the user to be able to stop and delete the workers.
Is it acceptable to kill the process with exit/2?
Will it terminate the process even if it's in the middle of doing some work (i.e. downloading a file)?
Do supervisors offer a similar mechanism for stopping and removing children that are in the middle of doing some work?
"Is it acceptable to kill the process with exit/2? Will it terminate the process even if it's in the middle of doing some work (i.e. downloading a file)?"
Yes. To terminate a process you may use exit/2, as you said. The termination behaviour differs depending on whether the Reason argument is normal, kill, or some other reason.
It is explained very well in the Error Handling documentation; for a more detailed explanation see this.
So you may choose whatever fits your application.
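To make the difference between the reasons concrete, here is a small experiment, offered as a sketch rather than a reference. The module name exit_demo and the 50 ms sleeps are assumptions; the sleeps only give the spawned process time to set its flag and react to the signal.

```erlang
-module(exit_demo).
-export([run/0]).

%% Spawn a process that blocks until it receives `stop`, send it an exit
%% signal with the given reason, and report whether it is still alive
%% shortly afterwards.
alive_after(Reason, TrapExit) ->
    Pid = spawn(fun() ->
                    process_flag(trap_exit, TrapExit),
                    receive stop -> ok end
                end),
    timer:sleep(50),               % let the process set its flag
    exit(Pid, Reason),
    timer:sleep(50),               % let the signal take effect
    Alive = is_process_alive(Pid),
    Pid ! stop,                    % clean up if it survived
    Alive.

run() ->
    [{normal,   alive_after(normal, false)},    % signal is ignored
     {shutdown, alive_after(shutdown, false)},  % process terminates
     {trapped,  alive_after(shutdown, true)},   % becomes an 'EXIT' message
     {kill,     alive_after(kill, true)}].      % cannot be trapped
```

A process that traps exits can finish or clean up its work when it sees the 'EXIT' message, whereas kill stops it unconditionally, even in the middle of a download.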
"Do supervisors offer a similar mechanism for stopping and removing children that are in the middle of doing some work?"
Yes. As mentioned in the comment, this is documented in detail in Erlang's Supervisor documentation. I suggest you read all of it carefully, but the main parts you're looking for are:
Defining the child_spec() when starting a child (mainly the shutdown and restart options).
terminate_child/2 for the actual termination of a child.
delete_child/2 for deleting a child after calling terminate_child/2.
You can read more about it here.
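Put together, those three pieces could be used as in the following sketch. The module name worker_sup, the add_worker/2 and remove_worker/1 helpers, and the particular shutdown and restart values are assumptions for illustration.

```erlang
-module(worker_sup).
-behaviour(supervisor).

-export([start_link/0, add_worker/2, remove_worker/1, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Start with no children; workers are added dynamically.
init([]) ->
    {ok, {#{strategy => one_for_one}, []}}.

%% `shutdown => 5000` gives a worker that traps exits five seconds to
%% clean up before it is killed; `restart => transient` restarts it only
%% after an abnormal exit.
add_worker(Id, MFA) ->
    supervisor:start_child(?MODULE, #{id => Id,
                                      start => MFA,
                                      restart => transient,
                                      shutdown => 5000}).

%% Stop the worker, even mid-task, then forget its child specification.
remove_worker(Id) ->
    case supervisor:terminate_child(?MODULE, Id) of
        ok -> supervisor:delete_child(?MODULE, Id);
        {error, _} = Error -> Error
    end.
```

A child stopped with terminate_child/2 is not restarted by the supervisor, so deleting its specification afterwards removes the last trace of it.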

How to implement status in Erlang?

I am thinking about an Erlang program that has many workers (receive loops). These workers almost always manipulate their status at the same time, i.e. massively concurrently, and there are so many workers that keeping their status in mnesia would cause performance problems. So I am thinking of passing the status as an argument in each loop and then writing it to mnesia some time later. Is this good practice? Is there a better way to do this? (Roughly speaking, I'm looking for something like an instance with attributes in an object-oriented language.)
Thanks.
With Erlang, it is a good habit to see processes as actors, each with a dedicated and limited role. With this in mind, you will see that your problem splits into different categories, such as:
Maintain the state of a connection with a user over the Internet,
Keep information such as login, user profile, friends, shop-cart...
log events
...
For each role you have to decide whether the state information must outlive the process.
In a lot of cases that is not necessary (case 1), and the solution is simply to keep the state in the argument of the process's loop function. I encourage you to look at the OTP behaviours; gen_server and gen_fsm are made for this.
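Keeping the state in the loop argument can be sketched as follows. The module name status_worker and its start/1, update/2 and status/1 functions are hypothetical, and a gen_server would give the same result with less hand-written plumbing.

```erlang
-module(status_worker).
-export([start/1, update/2, status/1]).

start(Initial) ->
    spawn(fun() -> loop(Initial) end).

%% Ask the worker to replace its status with Fun(OldStatus).
update(Pid, Fun) ->
    Pid ! {update, Fun},
    ok.

%% Read the current status, giving up after one second.
status(Pid) ->
    Ref = make_ref(),
    Pid ! {get, self(), Ref},
    receive
        {Ref, Status} -> Status
    after 1000 ->
        timeout
    end.

%% The status lives only in the argument of the tail-recursive loop.
loop(Status) ->
    receive
        {update, Fun} ->
            loop(Fun(Status));
        {get, From, Ref} ->
            From ! {Ref, Status},
            loop(Status)
    end.
```

For example, after P = status_worker:start(0) and status_worker:update(P, fun(N) -> N + 1 end), a call to status_worker:status(P) returns 1.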
Case 2 obviously manipulates permanent data, which must survive a process crash or even a hardware crash. This data will be stored using dets, mnesia, or any database suited to your problem (Redis, CouchDB ...).
It is important to limit the information stored in an external database; otherwise you will not benefit from the very powerful feature of having no side effects. In other words, it is a very bad idea to have process behaviour that depends on external information.

Is the process dictionary appropriate in this case?

I've read several comments here and elsewhere suggesting that Erlang's process dictionary was a bad idea and should die. Normally, as a total Erlang newbie, I'd just avoid it. However, in this situation my other options aren't great.
I have a main dispatcher function that looks something like this:
dispatch(State) ->
    receive
        {cmd1, Params} ->
            NewState = do_cmd1_stuff(Params, State),
            dispatch(NewState);
        {cmd2, Params} ->
            NewState = do_cmd2_stuff(Params, State),
            dispatch(NewState);
        BadMsg ->
            log_error(BadMsg),
            dispatch(State)
    end.
Obviously, my names are more meaningful to me, but that's the gist of it. Deep down in a function called by a function called by a function called by do_cmd2_stuff(), I want to send out messages to all my users telling them about something I've done. In order to do that, I need to get the list of users from the point where I send the messages. The user list doesn't lend itself easily to sticking in the global state, since that's just one data structure representing the only block of data on which I operate.
The way I see it, I have a couple unpleasant options other than using the process dictionary. I can send the user list through all the various levels of functions down to the very bottom one that does the broadcasting. That's unpleasant because it causes all my functions to gain a parameter, whether they really care about it or not.
Alternatively, I could have all the do_cmdN_stuff() functions return a message to send. That's not great either though, since sending the message may not be the last thing I want to do and it clutters up my dispatcher with a bunch of {Msg, NewState} tuples. Furthermore, some of the functions might not have any messages to send some of the time.
Like I said earlier, I'm very new to Erlang. Maybe someone with more experience can point me at a better way. Is there one? Is the process dictionary appropriate in this case?
The general rule is that if you have doubts, you shouldn't use the process dictionary.
If the two options you mentioned aren't good enough (I personally like the one where you return the messages to send) and what you want is some particular piece of code to track users and forward messages to them, maybe what you want to do is have a process holding that info.
Pid ! {forward, Msg}
where Pid will take care of sending everything to a bunch of other processes. Now, you would still need to pass the Pid around, unless you give it a name in some registry so it can be found, using register/2, global, or gproc.
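A process of that kind, registered locally with register/2, might look like this sketch. The module name broadcaster and its start/1 and forward/1 functions are assumptions.

```erlang
-module(broadcaster).
-export([start/1, forward/1]).

%% Users is the list of recipient pids. The process is registered under
%% the module name, so callers do not have to carry its pid around.
start(Users) ->
    Pid = spawn(fun() -> loop(Users) end),
    register(?MODULE, Pid),
    {ok, Pid}.

%% Can be called from any depth of the call stack.
forward(Msg) ->
    ?MODULE ! {forward, Msg},
    ok.

loop(Users) ->
    receive
        {forward, Msg} ->
            lists:foreach(fun(User) -> User ! Msg end, Users),
            loop(Users)
    end.
```

The deeply nested function then only needs to call broadcaster:forward(Msg), and the user list never has to travel through the intermediate functions.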
A simple answer would be to nest your global within a state record, which is then threaded through the system, at least at the top level. This makes it easy to add new fields to the state in the future, which is not an uncommon occurrence, and allows you to keep your global state data structure untouched. So initially
-record(state, {users=[],state_data}).
Defining it as a record makes it easy to access and extend when necessary.
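As a sketch of how such a record is used, the hypothetical module below wraps it in a few helper functions. The names state_rec, new/1, add_user/2 and users/1 are assumptions.

```erlang
-module(state_rec).
-export([new/1, add_user/2, users/1]).

-record(state, {users = [], state_data}).

%% Wrap the existing data structure without changing it.
new(Data) ->
    #state{state_data = Data}.

%% Only the users field is touched; state_data is carried along as is.
add_user(User, #state{users = Users} = State) ->
    State#state{users = [User | Users]}.

users(#state{users = Users}) ->
    Users.
```

Functions that care only about state_data can pattern match on that field and ignore the rest, so adding further fields later does not ripple through the code.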
As you mentioned, you can always pass the user list as an extra parameter; that's not so bad.
If you don't want to do this just put it in State. You can have a special State just for this part of the calculation that also contains the user list.
Then there always is the possibility of putting it in ETS or in another server process.
What exactly to do is hard to recommend since it depends a lot on your exact application and preferences.
Just choose from the possibilities mentioned as if the process dictionary didn't exist. Maybe your code needs restructuring if none of the variants looks elegant; there is always some better way without the process dictionary.
It's really bad that it is still there, because it's alluring to many beginning Erlang users.
You really should not use the process dictionary. I accept using it only if:
1. It is a short-lived process.
2. I have full control over the process from spawn to termination, i.e. I use a minimal and well-known set of external modules.
3. I badly need the performance gain, meaning I must avoid the data copying that comes with ets, and dict/gb_tree is too slow (for GC reasons).
Condition 1 is not your case: you are using it in a server. I don't know whether condition 2 is your case. Condition 3 is not your case, because you need a list of recipients, so you gain nothing from the process dictionary being a very fast key/value store. In your case I don't see any reason why you should not include what you need in your State. IMHO State is exactly the right place for it.
It's an interesting question because it involves the fundamentals of functional design.
My opinion:
Try as much as possible to make the function return the messages, then send them. This separates the two different tasks nicely, and separates the purely functional task from the one that causes side effects.
If this isn't possible, pass the receivers as an argument even if it's a bit messy. If the broadcasting function uses that data, it should be given to it explicitly, for clarity and predictability.
Using ETS as Peer Stritzinger suggests is really not any better than the PD: both hide the fact that the broadcasting function uses the receiver list and make it dependent on global data.
I'm not sure about the Erlang way of encapsulating some state in a process, as I GIVE TERRIBLE ADVICE suggests. Is it really any better than ETS or PD?
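The first option above, returning the messages and sending them afterwards, can be sketched like this. The module name pure_dispatch, the cmd2 behaviour and the notice message are hypothetical stand-ins for the real commands.

```erlang
-module(pure_dispatch).
-export([handle/2, deliver/2]).

%% Pure part: compute the new state and the messages to send.
handle({cmd2, Params}, State) ->
    NewState = [Params | State],
    {[{notice, Params}], NewState};
handle(_Other, State) ->
    {[], State}.                       % nothing to announce

%% Side-effecting part, kept in a single place.
deliver(Msgs, Users) ->
    lists:foreach(
      fun(User) ->
              lists:foreach(fun(Msg) -> User ! Msg end, Msgs)
      end,
      Users).
```

Returning a list of messages, possibly empty, also covers the case where a command has nothing to send some of the time.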
"clutters up my dispatcher with a bunch of {Msg, NewState}"
This is my experience also, that you often end up like this. It's not particularly pretty, but functional design seems to encourage this. Could some language feature be introduced to make it more beautiful and natural?
EDIT:
6 years ago I wrote:
Could some language feature be introduced to make it more beautiful and natural?
After learning much more about functional programming I have realised that examples of this are state-monads and do-notation that are found in Haskell.
I would consider sending a special message to self() from deep inside the call stack, and handling it at the top level dispatch method that you've sketched, where list of users is available.
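A rough sketch of that idea, reusing the shape of the dispatcher from the question: the module name self_dispatch, the broadcast and done message tags, and the list used as state are assumptions.

```erlang
-module(self_dispatch).
-export([start/1]).

start(Users) ->
    spawn(fun() -> dispatch(Users, []) end).

%% Top level: the only place that knows the user list.
dispatch(Users, State) ->
    receive
        {cmd2, Params} ->
            dispatch(Users, do_cmd2_stuff(Params, State));
        {broadcast, Msg} ->
            lists:foreach(fun(User) -> User ! Msg end, Users),
            dispatch(Users, State)
    end.

%% Stands in for the deeply nested call; it needs no user list and
%% simply posts a request to its own mailbox.
do_cmd2_stuff(Params, State) ->
    self() ! {broadcast, {done, Params}},
    [Params | State].
```

One consequence of this design is that the broadcast is sent only after the current command has finished and the dispatcher reaches its next receive.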
