Supervisor callback for child normal exit - erlang

I am creating a test app with one supervisor that uses the simple_one_for_one strategy and many worker children added to it dynamically. How can I implement a callback (or receive a message) in the supervisor that is invoked when a child exits normally?
The main goal is to notify some other process that all supervised worker processes are done and it's time to show the final report.
How should I design this kind of behaviour? Should I create my own behaviour that combines supervisor and gen_server, or is there a way to do this with the standard OTP behaviours?

There are two ways to do such a notification. The first is to simply monitor the child from the beginning. By using erlang:monitor/2, a third party can find out whether a process is alive or not. When the monitored process dies, the monitoring process receives a 'DOWN' message that includes the exit reason.
The other way is to send a message from the process' terminate/2 function (terminate/3 if it's a gen_fsm). This is far more brittle because the terminate function will not be called in all circumstances.
The monitor option is far superior.
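As a rough sketch of the monitor approach (the module name report_monitor and the work/1 placeholder are made up for illustration, and plain spawned processes stand in for supervised workers), a coordinating process can collect one 'DOWN' message per worker and only then produce the report. With a real supervisor, the same loop should work if the coordinator calls erlang:monitor(process, Pid) on each pid returned by supervisor:start_child/2, assuming the children are temporary or transient so that a normal exit is not followed by a restart.

```erlang
-module(report_monitor).
-export([run/1]).

%% Start N hypothetical workers, monitor each one, and return once
%% every worker has exited.
run(N) ->
    Refs = [start_worker(I) || I <- lists:seq(1, N)],
    wait(Refs, []).

%% spawn_monitor/1 spawns and monitors atomically, so an early exit
%% cannot be missed.
start_worker(I) ->
    {_Pid, Ref} = spawn_monitor(fun() -> work(I) end),
    Ref.

%% Placeholder for real work; returning from the fun is a normal exit.
work(I) ->
    timer:sleep(10 * I).

%% Collect one 'DOWN' message per outstanding monitor reference.
wait([], Reasons) ->
    {all_done, lists:reverse(Reasons)};
wait(Refs, Reasons) ->
    receive
        {'DOWN', Ref, process, _Pid, Reason} ->
            case lists:member(Ref, Refs) of
                true  -> wait(lists:delete(Ref, Refs), [Reason | Reasons]);
                false -> wait(Refs, Reasons)
            end
    end.
```

The exit reasons are kept so the final report can tell normal exits from crashes.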

Related

Is there an Erlang behaviour that can act on its own instead of waiting to be called?

I'm writing an Erlang application that requires actively polling some remote resources, and I want the process that does the polling to fit into the OTP supervision trees and support all the standard facilities like proper termination, hot code reloading, etc.
However, the two default behaviours, gen_server and gen_fsm seem to only support operation based on callbacks. I could abuse gen_server to do that through calls to self or abuse gen_fsm by having a single state that always loops to itself with a timeout 0, but I'm not sure that's safe (i.e. doesn't exhaust the stack or accumulate unread messages in the mailbox).
I could make my process into a special process and write all that handling myself, but that effectively makes me reimplement the Erlang equivalent of the wheel.
So is there a behavior for code like this?
    loop(State) ->
        NewState = do_stuff(State), % without waiting to be called
        loop(NewState).
And if not, is there a safe way to trick default behaviours into doing this without exhausting the stack or accumulating messages over time or something?
The standard way of doing that in Erlang is by using erlang:send_after/3. See this SO answer and also this example implementation.
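A minimal sketch of the erlang:send_after/3 pattern, assuming a hypothetical module called poller whose "poll" only increments a counter: the server schedules a poll message to itself, handles it in handle_info/2, and schedules the next one.

```erlang
-module(poller).
-behaviour(gen_server).
-export([start_link/1, polls/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% IntervalMs is an illustrative parameter: the delay between polls.
start_link(IntervalMs) ->
    gen_server:start_link(?MODULE, IntervalMs, []).

%% Return how many polls have run so far.
polls(Pid) ->
    gen_server:call(Pid, polls).

init(IntervalMs) ->
    schedule(IntervalMs),
    {ok, #{interval => IntervalMs, count => 0}}.

handle_call(polls, _From, State = #{count := Count}) ->
    {reply, Count, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% The timer message arrives like any other message, so calls, casts
%% and system messages are still served between polls.
handle_info(poll, State = #{interval := IntervalMs, count := Count}) ->
    %% Placeholder for the real polling work.
    schedule(IntervalMs),
    {noreply, State#{count := Count + 1}};
handle_info(_Other, State) ->
    {noreply, State}.

schedule(IntervalMs) ->
    erlang:send_after(IntervalMs, self(), poll).
```

Because every iteration goes back through the gen_server receive loop, nothing accumulates on the stack, and at most one poll message is pending at a time.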
Is it possible that you could employ an essentially non-OTP-compliant process? Although to be a good OTP citizen you ideally want to make your long-running processes into gen_servers and gen_fsms, sometimes you have to look beyond the standard-issue rule book and consider why the rules exist.
What if, for example, your supervisor starts your gen_server, and your gen_server spawns another process (let's call it the active_poll process), and they link to each other so that they have shared fate (if one dies the other dies). The active_poll process is now indirectly supervised by the supervisor that spawned the gen_server, because if it dies, so will the gen_server, and they will both get restarted. The only problem you really have to solve now is code upgrade, but this is not too difficult: your gen_server gets a code_change callback call when the code is to be upgraded, and it could simply send a message to the active_poll process, which can make an appropriate fully qualified function call, and bingo, it's running the new code.
If this doesn't suit you for some reason and/or you MUST use gen_server/gen_fsm/similar directly...
I'm not sure that writing a 'special process' really gives you very much. Even if you wrote a special process correctly, such that it is in theory compliant with OTP design principles, it could still be ineffective in practice if it blocks or busy-waits in a loop somewhere and doesn't invoke sys when it should. So you really gain at most a small optimisation over using gen_server/gen_fsm with a zero timeout (or an async message handler which does the polling and sends a message to self to trigger the next poll).
If whatever you are doing to actively poll can block (a blocking socket read, for example), this is really big trouble, as a gen_server, a gen_fsm or a special process will all be stopped from fulfilling their usual obligations (which they can normally meet either because the callback returns, in the case of gen_server/gen_fsm, or because receive is called and the sys module invoked explicitly, in the case of a special process).
If what you are doing to actively poll is non-blocking, though, you can do it, but if you poll without any delay it effectively becomes a busy wait (not quite, because the loop will include a receive call somewhere, which means the process will yield and give the scheduler a voluntary opportunity to run other processes, but it's not far off, and it will still be a relative CPU hog). If you can have a 1 ms delay between each poll, that makes a world of difference compared with polling as rapidly as you can. It's not ideal, but if you MUST, it'll work. So use a timeout (as big as you can without it becoming a problem), or have an async message handler which does the polling and sends a message to self to trigger the next poll.
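The timeout variant can be sketched as follows, with a hypothetical module timeout_poller and a counter standing in for the real poll. Every callback returns the delay as the third or fourth element of its result; gen_server then delivers the atom timeout to handle_info/2 if no other message arrives within that period.

```erlang
-module(timeout_poller).
-behaviour(gen_server).
-export([start_link/1, polls/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% DelayMs is an illustrative parameter: the idle time before each poll.
start_link(DelayMs) ->
    gen_server:start_link(?MODULE, DelayMs, []).

polls(Pid) ->
    gen_server:call(Pid, polls).

%% State is {DelayMs, PollCount}; the trailing DelayMs arms the timeout.
init(DelayMs) ->
    {ok, {DelayMs, 0}, DelayMs}.

%% Each callback must re-arm the timeout, otherwise polling stops.
handle_call(polls, _From, {DelayMs, Count} = State) ->
    {reply, Count, State, DelayMs}.

handle_cast(_Msg, {DelayMs, _Count} = State) ->
    {noreply, State, DelayMs}.

handle_info(timeout, {DelayMs, Count}) ->
    %% Placeholder for the real polling work.
    {noreply, {DelayMs, Count + 1}, DelayMs};
handle_info(_Other, {DelayMs, _Count} = State) ->
    {noreply, State, DelayMs}.
```

One trade-off to keep in mind: the timeout only fires when the mailbox stays empty for the whole delay, so a server that receives a steady stream of messages may poll rarely or never. The send-to-self approach does not have that property.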

Erlang process termination: Where/When does it happen?

Consider processes all linked in a tree, either a formal supervision tree, or some ad-hoc structure.
Now, consider some child or worker down in this tree, with a parent or supervisor above it. I have two questions.
We would like to "gracefully" exit this process if it needs to be killed or shutdown, because it could be halfway through updating some account balance. Assume we have properly coded up some terminate function and connected this process to others with the proper plumbing. Now assume this process is in its main loop doing work. The signal to terminate comes in. Where exactly (or possibly the question should be WHEN EXACTLY) does this termination happen? In other words, when will terminate be called? Will the thing just preempt itself right in the middle of the loop it is running and call terminate? Will it wait until the end of the loop but before starting the loop again? Will it only do it while in receive mode? Etc.
Same question but without terminate function having been coded. Assume parent process is a supervisor, and this child is following normal OTP conventions. Parent tells child to shutdown, or parent crashes or whatever. The child is in its main loop. When/where/how does shutdown occur? In the middle of the main loop? After it? Etc.
It is quite nicely explained in the docs (sections 12.4, 12.5, 12.6, 12.7).
There are two cases:
1. Your process terminates because of some bad logic.
It throws an error, so it can be in the middle of its work, and this could be bad. If you want to prevent that, you can define a mechanism that involves two processes. The first one begins the transaction, the second one does the actual work and, after that, the first one commits the changes. If something bad happens to the second process (it dies because of an error), the first one simply does not commit the changes.
2. You are trying to kill the process from outside, for example when your supervisor restarts or a linked process dies.
In this case you can also be in the middle of something, but Erlang gives you the trap_exit flag. It means that, instead of dying, the process receives a message that you can handle. That in turn means that the terminate function is called only once you get back to the receive block. So the process finishes one chunk of work and, when it is ready for the next, it calls terminate and then dies.
So you can bypass the exit by using trap_exit. You can also bypass trap_exit by sending exit(Pid, kill), which terminates the process even if it traps exits.
There is no way to bypass exit(Pid, kill), so be careful when using it.
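To illustrate the difference between a trappable exit signal and kill, here is a small sketch (the module name exit_demo is made up). A process that traps exits receives the shutdown signal as an ordinary {'EXIT', From, Reason} message and keeps running, while exit(Pid, kill) ends it with reason killed.

```erlang
-module(exit_demo).
-export([run/0]).

run() ->
    Parent = self(),
    Pid = spawn(fun() ->
                    process_flag(trap_exit, true),
                    Parent ! ready,
                    loop(Parent)
                end),
    %% Wait until the child is trapping exits before signalling it.
    receive ready -> ok end,
    Ref = erlang:monitor(process, Pid),
    %% A trappable exit signal is converted into a message.
    exit(Pid, shutdown),
    Trapped = receive {trapped, R} -> R after 1000 -> timeout end,
    %% kill cannot be trapped; the monitor reports reason killed.
    exit(Pid, kill),
    Final = receive {'DOWN', Ref, process, Pid, Reason} -> Reason
            after 1000 -> timeout end,
    {Trapped, Final}.

%% The message is only seen when the process is back in receive,
%% which is why work in progress is not interrupted.
loop(Parent) ->
    receive
        {'EXIT', _From, Reason} ->
            Parent ! {trapped, Reason},
            loop(Parent)
    end.
```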

How to restart child with custom state using Erlang OTP supervisor behaviour?

I'm using OTP supervisor behaviour to supervise and restart child processes. However when the child dies I want to restart it with the same state it had before the crash.
If I write my own custom supervisor, I can just receive the {'EXIT', Pid, Reason} message and act upon it. When using the OTP supervisor behaviour, however, it is all managed by OTP and I have no control over it. The only callback function I implement is init.
Is there any standard approach in a case like this? How can I customise the state of a child being restarted dynamically by the OTP supervisor? How can I get the Pid of the terminating process using OTP? Or maybe it's possible to get the state of the child just before termination, and then restore the child to the same state it had before it crashed?
Restarting with the same state is possibly not a good idea. A bad state probably led the process to crash, and if you restart with the same state, it will crash again. But if you want this, use an external resource to keep the state (like ets or mnesia).
Without knowing any details about what you are doing, I can imagine a world where the following makes sense:
1. the supervisor creates an ETS table and passes the table identifier to each child
2. a child process starts and, based on some relevant attribute of the child, consults the ETS table to look for state to load
3. every time a child's state changes it writes it to the ETS table
So, if I had 12 child processes representing the 12 Tribes of Cobol, each would use its name as the key to the ETS table to look for state left behind by a previous incarnation upon starting. And each process would update the table (again using its name as the key) whenever its state changed.
The supervisor will automatically restart a killed child, and step 2, above, would be executed in the child's init function. Step 3 would be dealt with in the child's handle_call, handle_cast and handle_info functions (I am making some assumptions about the nature of your processes). There are a number of restart strategies available via the supervisor that can even restart siblings if desired.
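A minimal sketch of steps 2 and 3, assuming a hypothetical child module ets_counter whose whole state is a counter, and a public ETS table created beforehand by a longer-lived process such as the supervisor:

```erlang
-module(ets_counter).
-behaviour(gen_server).
-export([start_link/2, increment/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Table is a public ETS table owned by a longer-lived process;
%% Name is the key under which this child stores its state.
start_link(Table, Name) ->
    gen_server:start_link(?MODULE, {Table, Name}, []).

increment(Pid) ->
    gen_server:call(Pid, increment).

%% Step 2: look for state left behind by a previous incarnation.
init({Table, Name}) ->
    Count = case ets:lookup(Table, Name) of
                [{Name, Saved}] -> Saved;
                []              -> 0
            end,
    {ok, {Table, Name, Count}}.

%% Step 3: write the state back every time it changes.
handle_call(increment, _From, {Table, Name, Count}) ->
    New = Count + 1,
    true = ets:insert(Table, {Name, New}),
    {reply, New, {Table, Name, New}}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

The table has to be public (or owned by the writer) for the child to insert into it, and it disappears if its owner dies, so the owner should be a process that outlives the children.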
Hope this gives you some thoughts.
I think this sort of customization of the OTP supervisor behaviour can't be done easily. The way OTP supervisors are designed forces you to follow some strict design practices. The most important one in this case is that a supervisor shouldn't do anything apart from monitoring its children and restarting them in case of abnormal termination. There should be no additional logic in the supervisor, so as not to introduce bugs into the supervisors, which are a critical part of the supervision tree and of fault tolerance.
when the child dies I want to restart it with the same state it had before the crash
- this is bad practice in general because child might've died because of the corrupted state it had before termination and restarting it with the same state in such case will surely cause problems
Is there any standard approach in case like this?
Customizing the state of the children within the supervisor before restarting them goes against good supervisor design practice. Therefore this kind of task is usually done differently, for example by introducing another process, such as a gen_server, which is responsible for starting children via the supervisor (supervisor:start_child) and maintaining monitors on all of them. This additional process can do any required customization before starting a new child.
How to get Pid of the terminating process using OTP?
- in the additional process which starts children via supervisor:start_child you can monitor them and then listen to DOWN messages. For example in case of gen_server you would use handle_info function as below:
    handle_info({'DOWN', Ref, process, Pid, _Reason}, S) ->
        handle_down_worker(Ref, Pid, S).
Or maybe it's possible to get the state of the child just before termination, and then restore the child to the same state it had before it crashed?
- Correct me if I'm wrong, but I think it is not possible in Erlang to send, along with the 'DOWN' message, the state the child had just before termination. If that were possible, then I could just handle a message similar to {'DOWN', Pid, Reason, State} and restart the process with the same state or part of it. But then, I'm thinking: how could you preserve the state of a suddenly dying child which was, for example, killed with exit(Pid, kill)? I doubt that would be possible.

handle saving of transient gen_servers states when using a key-to-pid mechanism

I would like to know how to handle saving of transient gen_servers states when they are associated with a key.
To associate keys with processes, I use a process called pidstore. Pidstore may also start processes.
I give a Key and an M,F,A to pidstore; it looks for the key in global, then either returns the pid if found, or applies the MFA (which must return {ok, Pid}), registers the Pid with the key in global and returns the Pid.
I may have many inactive gen_servers with a possibly huge state. So I've set up the handle_info callback to save the state in my database and then stop the process. The gen_servers are declared transient in their supervisor, so they won't be restarted until something needs them again.
Here is where the problems start: if I look up a process by its key, say {car, 23}, during the saving step of handle_info in the process that represents {car, 23}, I'll get the pid back as intended, because the process is still saving and has not finished. So I'll call my process with gen_server:call, but I'll never get a response (and will hit the default 5-second timeout) because the process is stopping. (PROBLEM A)
To solve this problem, the process could unregister itself from global, then save its state, then stop. But if I need it after it has unregistered but before the save is finished, I will start a new process, and this process could load stale values from the database. (PROBLEM B)
To solve this in turn, I could ensure that loads and saves in the db are enqueued and cannot run concurrently. This could be a bottleneck. (PROBLEM C)
I've been thinking about another solution: my processes, before saving, could tell the pidstore that they are busy. The pidstore would keep a list of busy processes and respond 'busy' to any request for these keys.
When the save is done, the process would tell the pidstore no_more_busy, and the pidstore could start a new process when asked for that key. (Even if the old process has not finished, it is done saving, so it can take its time to die alone.)
This seems a bit messy to me, but it feels simpler to make several attempts to get the Pid from the key than to wrap every call to a gen_server to handle the possible timeouts (when the process is finishing but still registered in global).
I'm a bit confused by all of these half-problems and half-solutions. What design do you use in this situation, or how can I avoid this situation?
I hope my message is legible; please tell me about English errors too.
Thank you
Maybe you want to do the save to DB part in a gen_server:call. That would prevent other calls from coming in while you are writing to DB.
Generally it sounds like you have created a process registry. You might want to look into gproc (https://github.com/uwiger/gproc), which does a very good job at that if you want to register locally. With gproc you can do exactly what you described above: use a key to register a process. Maybe it would be good enough to register with gproc in your init function and unregister when writing to the DB. You could also write to the DB in your terminate function.
For now I have decided to stick with the Erlang "let it crash" philosophy. If a process receives messages as it is shutting down, those messages will not be answered and will trigger a gen_server:call/* timeout.
I think it will be tedious to handle this timeout in the right place. I have not decided where yet, but this is specific to my application, so it is beside the point here.
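One possible place to handle that timeout is a thin wrapper around gen_server:call/2. This is only a sketch with a made-up module name, safe_call; it relies on gen_server:call exiting the caller with a {Reason, Location} tuple when the call times out or the server is already gone.

```erlang
-module(safe_call).
-export([call/2]).

%% Turn the exit raised by gen_server:call/2 into a return value.
%% Reason is, for example, timeout when no reply arrives in time,
%% or noproc when the process no longer exists.
call(Pid, Request) ->
    try
        {ok, gen_server:call(Pid, Request)}
    catch
        exit:{Reason, _Location} ->
            {error, Reason}
    end.
```

A caller that gets {error, timeout} or {error, noproc} could then look the key up again and retry, which keeps the retry logic in one function instead of around every call site.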

Handling the cleanup of the gen_server state

I have a running gen_server which must clean up its state whenever it is stopped normally or crashes unexpectedly. The cleanup basically consists of deleting a few files.
At the moment, when the gen_server crashes or is stopped normally, the cleanup is done in terminate/2.
Is there any reason why terminate/2 would not be called if the gen_server crashes?
Should some other process monitor the gen_server and do the cleanup if the gen_server dies unexpectedly?
So, the code is like this:
    terminate(normal, State) ->
        % Invoked when the process stops normally
        cleanup(State); % cleanup/1 is a placeholder that deletes the files
    terminate(_Error, State) ->
        % Invoked when the process crashes
        cleanup(State).
EDIT: I found this email in the official mailing list which is talking about the same thing:
http://groups.google.com/group/erlang-programming/browse_thread/thread/9a1ba2d974775ce8
As Adam says below, if we want to avoid trapping exits in the gen_server, we could use different approaches.
But if we trap exits, terminate/2 seems to be a safe place to do the cleanup, as it will always be called. Furthermore, we must correctly handle the case where 'EXIT' is passed to terminate/2 and to handle_call/3, and try to propagate the errors correctly between workers and supervisors.
terminate/2 is called when a crash occurs inside the gen_server even if it doesn't trap exits. It will not be called if the gen_server receives an 'EXIT' from some other process linked to it; if you need to clean up in that case, it should trap exits (using process_flag(trap_exit, true)).
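The effect of trap_exit on terminate/2 can be checked with a small experiment. This is a sketch using a made-up module, cleanup_demo: the server is started unlinked, gets an exit signal with reason shutdown from the test process, and reports back from terminate/2 if that callback runs. Since the signal does not come from the server's parent, the trapped 'EXIT' message is delivered to handle_info/2, which stops the server and so triggers terminate/2.

```erlang
-module(cleanup_demo).
-behaviour(gen_server).
-export([run/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

%% Start a server, send it an exit signal, and report whether
%% terminate/2 ran. TrapExit is true or false.
run(TrapExit) ->
    {ok, Pid} = gen_server:start(?MODULE, {self(), TrapExit}, []),
    Ref = erlang:monitor(process, Pid),
    exit(Pid, shutdown),
    receive {'DOWN', Ref, process, Pid, _} -> ok end,
    receive
        {terminated, Reason} -> {terminate_called, Reason}
    after 100 ->
        terminate_not_called
    end.

%% The state is just the pid to notify from terminate/2.
init({Notify, TrapExit}) ->
    process_flag(trap_exit, TrapExit),
    {ok, Notify}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Only reached when trapping exits; stopping here runs terminate/2.
handle_info({'EXIT', _From, Reason}, State) ->
    {stop, Reason, State};
handle_info(_Other, State) ->
    {noreply, State}.

terminate(Reason, Notify) ->
    Notify ! {terminated, Reason},
    ok.
```

Calling cleanup_demo:run(true) and cleanup_demo:run(false) side by side shows the two behaviours described above.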
This behavior is a bit unfortunate because it makes it difficult to write a reliable shutdown procedure for a gen_server process. Also, it is not a good habit to trap exits just for the sake of being able to run terminate/2, since you might catch a lot of other errors which makes it harder to debug the system.
I would consider three options:
1. Handle the leftover files when the next instance of the process starts (for example, in init/1)
2. Trap exits, clean up the files, and then crash again with the same reason
3. Have a third process which monitors the gen_server and whose only purpose is to clean up the files
Option 1 is probably the best option, since at least the code doesn't trap exits and you get persistent state for free. Option 2 is not so nice for the reasons described above: it can hide and obscure other errors. Option 3 is messy because the cleanup process might not be done before the gen_server is started again.
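For option 1, a helper along these lines could be called at the top of init/1. It is only a sketch: the module name stale_files and the "*.tmp" naming scheme are assumptions, since the question does not say how the files are named.

```erlang
-module(stale_files).
-export([cleanup/1]).

%% Delete leftover files in Dir that match the assumed "*.tmp"
%% pattern and return how many were found.
cleanup(Dir) ->
    Stale = filelib:wildcard(filename:join(Dir, "*.tmp")),
    %% Errors from file:delete/1 are ignored so that a file removed
    %% by someone else does not stop the server from starting.
    lists:foreach(fun(File) -> _ = file:delete(File) end, Stale),
    length(Stale).
```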
Think carefully about why you want to clean up, and if it really has to be done when the process crashes (it is a bug, after all). Be careful that you don't end up doing too much defensive programming.
This is quite fresh and relevant
When does terminate/2 get called in a gen_server?
