Erlang process termination: Where/When does it happen? - erlang

Consider processes all linked in a tree, either a formal supervision tree, or some ad-hoc structure.
Now, consider some child or worker down in this tree, with a parent or supervisor above it. I have two questions.
We would like to "gracefully" exit this process if it needs to be killed or shutdown, because it could be halfway through updating some account balance. Assume we have properly coded up some terminate function and connected this process to others with the proper plumbing. Now assume this process is in its main loop doing work. The signal to terminate comes in. Where exactly (or possibly the question should be WHEN EXACTLY) does this termination happen? In other words, when will terminate be called? Will the thing just preempt itself right in the middle of the loop it is running and call terminate? Will it wait until the end of the loop but before starting the loop again? Will it only do it while in receive mode? Etc.
Same question but without terminate function having been coded. Assume parent process is a supervisor, and this child is following normal OTP conventions. Parent tells child to shutdown, or parent crashes or whatever. The child is in its main loop. When/where/how does shutdown occur? In the middle of the main loop? After it? Etc.

It is quite nicely explained in the docs (sections 12.4, 12.5, 12.6, 12.7).
There are two cases:
Your process terminated due to some bad logic.
It throws an error, so it can be in the middle of work, and this can be bad. If you want to prevent that, you can define a mechanism that involves two processes: the first one begins the transaction, the second one does the actual work, and after that the first one commits the changes. If something bad happens to the second process (it dies because of an error), the first one simply does not commit the changes.
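As a rough sketch of that idea (begin_transaction/0 and commit/0 are placeholders, not anything from the answer), the first process can monitor the worker and commit only on a normal exit:

run_transaction(WorkFun) ->
    begin_transaction(),                          % placeholder: open the transaction
    {Pid, Ref} = spawn_monitor(WorkFun),          % second process does the actual work
    receive
        {'DOWN', Ref, process, Pid, normal} ->
            commit();                             % worker finished, so commit
        {'DOWN', Ref, process, Pid, _Reason} ->
            ok                                    % worker died, changes are not committed
    end.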
You are trying to kill the process from outside, for example when your supervisor restarts it or a linked process dies.
In this case you can also be in the middle of something, but Erlang gives you the trap_exit flag. It means that instead of dying, the process will receive a message that you can handle. That in turn means that the terminate function will be called once you get to the receive block. So the process will finish one chunk of work, and when it is ready for the next one, it will call terminate and then die.
So you can intercept the exit by using trap_exit. You can also bypass trap_exit by sending exit(Pid, kill), which terminates the process even if it traps exits.
There is no way to bypass exit(Pid, kill), so be careful with it.
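To make the timing concrete, here is a minimal sketch (mine, not the answerer's) of a plain worker loop that traps exits; do_work/1 and terminate/2 stand in for the real work and cleanup:

-module(worker).
-export([start/0]).

start() ->
    spawn(fun() ->
                  process_flag(trap_exit, true),
                  loop(0)
          end).

loop(State) ->
    NewState = do_work(State),              % the current chunk of work always finishes
    receive
        {'EXIT', _From, Reason} ->
            terminate(Reason, NewState)     % cleanup happens here, at the receive
    after 0 ->
        loop(NewState)
    end.

do_work(State) ->                           % placeholder for the real work
    State + 1.

terminate(_Reason, _State) ->               % placeholder cleanup
    ok.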

Related

Why is spawn_link necessary in Erlang?

I understand that if you do spawn followed by link, the process may have died in the mean time. Why is that a problem? Can't link see that you're trying to link to a process that has already died? In that case couldn't it just behave as though the remote process died immediately after link was called?
I think it would be nice if you could do spawn and link separately, and not have to do them together in one atomic function, because a) that would make the language more orthogonal (spawn_link heavily overlaps with spawn and link) b) if I have a start function that's basically just a wrapper around spawn, I ALSO need to supply start_link. So the non-orthogonality is viral. Yuck!
Remember that links are bidirectional, so consider the case where process A spawns process B but then dies before being able to link to B. In this case, B has no idea that it is not linked to A, and it does not die when A dies.
With spawn_link this scenario can't happen because the spawn and link either occur together atomically, or they both fail.
Another reason, which cannot be emulated:
If a process is trapping exit signals, then if it first spawns and then tries to link to a process that dies immediately, the failed link will result in a {'EXIT',..., noproc} exit signal.
If spawn_link is used, however, the exit signal will always carry the real exit reason.
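A small sketch of the difference (function names made up):

% Non-atomic: if the caller dies between spawn/1 and link/1,
% the new process is left alive and unlinked.
start_worker_unsafe(Fun) ->
    Pid = spawn(Fun),
    link(Pid),
    Pid.

% Atomic: the spawn and the link happen together, so that window does not exist.
start_worker_safe(Fun) ->
    spawn_link(Fun).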

How to restart child with custom state using Erlang OTP supervisor behaviour?

I'm using OTP supervisor behaviour to supervise and restart child processes. However when the child dies I want to restart it with the same state it had before the crash.
If I write my own custom supervisor, I can just receive the {'EXIT', Pid, Reason} message and act upon it. When using the OTP supervisor behaviour, however, it is all managed by OTP and I have no control over it. The only callback function I implement is init.
Is there any standard approach in a case like this? How do I customise the state of a child being restarted dynamically by the OTP supervisor? How do I get the Pid of the terminating process using OTP? Or maybe it's possible to get the state of the child just before termination, and then restore the child to the same state it had before it crashed?
Possibly restarting with the same state is not a good idea. The wrong state probably led the process to crash, and if you restart it with the same state, it will crash again. But if you want this, use an external resource to keep it (like ets or mnesia).
Without knowing any details about what you are doing, I can imagine a world where the following makes sense:
the supervisor creates an ETS table and passes the table identifier to each child
a child process starts and, based on some relevant attribute of the child, consults the ETS table to look for state to load
every time a child's state changes it writes it to the ETS table
So, if I had 12 child processes representing the 12 Tribes of Cobol each would use its name as the key to the ETS table to look for state left behind by a previous incarnate upon starting. And each process would update the table (again using its name as the key) whenever its state changed.
The supervisor will automatically restart a killed child and step 2, above, would be executed in the child's init method. Step 3 would be dealt with in a child's handle_call, handle_cast and handle_info methods (I am making some assumptions about the nature of your processes). There are a number of restart strategies available via the supervisor that can even restart siblings if desired.
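A rough sketch of that setup, assuming a gen_server child and using made-up names (tribe, update/2):

% In the supervisor's init/1: create a public table and pass it to the child.
init([]) ->
    Tab = ets:new(child_state, [public, set]),
    Child = #{id => tribe, start => {tribe, start_link, [Tab]}},
    {ok, {#{strategy => one_for_one}, [Child]}}.

% In the child: reload any saved state in init/1 and save it whenever it changes.
init([Tab]) ->
    State = case ets:lookup(Tab, tribe) of
                [{tribe, Saved}] -> Saved;
                []               -> initial_state
            end,
    {ok, {Tab, State}}.

handle_cast(Msg, {Tab, State}) ->
    NewState = update(State, Msg),          % placeholder for the real state change
    ets:insert(Tab, {tribe, NewState}),     % persisted for the next incarnation
    {noreply, {Tab, NewState}}.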
Hope this gives you some thoughts.
I think this sort of customization of the OTP supervisor behaviour can't be done easily. The way OTP supervisors are designed forces me to follow some strict design practices. The most important one in this case is that a supervisor shouldn't do anything apart from monitoring its children and restarting them in case of abnormal termination. There should be no additional logic in the supervisor, so as not to introduce bugs into the supervisors, which are a critical part of the supervision tree and its fault tolerance.
when the child dies I want to restart it with the same state it had before the crash
- this is bad practice in general, because the child might have died because of the corrupted state it had before termination, and restarting it with the same state in such a case will surely cause problems
Is there any standard approach in a case like this?
Customizing the state of the children within the supervisor before restarting them goes against good supervisor design practice. Therefore this kind of task is usually done differently, for example by introducing another process, such as a gen_server, which would be responsible for starting children via the supervisor (supervisor:start_child) and maintaining monitors on all of them. This additional process could do any required customizations before starting a new child.
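For example (worker_sup and the Workers map are made-up names), the extra gen_server could do something like:

start_worker(Args, Workers) ->
    {ok, Pid} = supervisor:start_child(worker_sup, [Args]),
    Ref = erlang:monitor(process, Pid),
    % remember the start arguments so a replacement can be started with customized state
    maps:put(Ref, {Pid, Args}, Workers).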
How do I get the Pid of the terminating process using OTP?
- in the additional process which starts children via supervisor:start_child, you can monitor them and then listen for 'DOWN' messages. For example, in the case of a gen_server you would use the handle_info function as below:
handle_info({'DOWN', Ref, process, Pid, _Reason}, S) ->
    handle_down_worker(Ref, Pid, S).
Or maybe it's possible to get the state of the child just before termination, and then restore the child to the same state it had before it crashed?
- Correct me if I'm wrong, but I think it is not possible in Erlang to send, along with the 'DOWN' message, the state the child had just before termination. If that were possible, I could just handle a message similar to {'DOWN', Pid, Reason, State} and restart the process with the same state or part of it. But then, how could you preserve the state of a suddenly dying child which was, for example, killed with exit(Pid, kill)? I doubt that is possible.

handle saving of transient gen_servers states when using a key-to-pid mechanism

I would like to know how to handle saving of transient gen_servers states when they are associated with a key.
To associate keys with processes, I use a process called pidstore, which eventually starts processes.
I give a Key and an M,F,A to pidstore; it looks for the key in global, then either returns the pid if found, or applies the MFA (which must return {ok, Pid}), registers the Pid under the key in global, and returns the Pid.
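(Roughly, the lookup-or-start flow described above; get_pid/2 is just a made-up name, and races between two concurrent callers are ignored:)

get_pid(Key, {M, F, A}) ->
    case global:whereis_name(Key) of
        undefined ->
            {ok, Pid} = apply(M, F, A),            % start the process
            yes = global:register_name(Key, Pid),
            Pid;
        Pid ->
            Pid                                    % already running
    end.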
I may have many inactive gen_servers with possibly huge state. So I've set up the handle_info callback to save the state to my database and then stop the process. The gen_servers are considered transient in their supervisor, so they won't be restarted until something needs them again.
Here start the problems: if I call a process with its key, say {car, 23}, during the saving step of handle_info in the process which represents {car, 23}, I'll get the pid back as intended, because the process is saving and not yet finished. So I'll call my process with gen_server:call, but I'll never get a response (and hit the default 5 second timeout) because the process is stopping. (PROBLEM A)
To solve this problem, the process could unregister itself from global, then save its state, then stop. But if I need it after it's unregistered but before the save is finished, I will load a new process, and this process could read not-yet-updated values from the database. (PROBLEM B)
To solve this again, I could ensure that loading and saving in the DB are enqueued and cannot be concurrent. This could be a bottleneck. (PROBLEM C)
I've been thinking about another solution: my processes, before saving, could tell the pidstore that they are busy. The pidstore would keep a list of busy processes and respond 'busy' to any request for these keys.
When the save is done, the process would tell the pidstore no_more_busy, and the pidstore could start a new process when asked for a key. (Even if the old process is not finished, it is done saving, so it can just take its time to die alone.)
This seems a bit messy to me, but it feels simpler to make several attempts to get the Pid from the key instead of wrapping every call to a gen_server to handle the possible timeouts (when the process is finishing but still registered in global).
I'm a bit confused about all of these half-problems and half-solutions. What design do you use in this situation, or how can I avoid it altogether?
I hope my message is legible; please point out English errors too.
Thank you
Maybe you want to do the save to DB part in a gen_server:call. That would prevent other calls from coming in while you are writing to DB.
Generally it sounds like you have created a process registry. You might want to look into gproc (https://github.com/uwiger/gproc), which does a very good job at that if you want to register locally. With gproc you can do exactly what you described above: use a key to register a process. Maybe it would be good enough to register with gproc in your init function and unregister when writing to the DB. You could also write to the DB in your terminate function.
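A small sketch of that suggestion (the key shape and save_to_db/1 are just examples):

% In the worker's init/1: register under the key.
init([Key]) ->
    true = gproc:reg({n, l, Key}),          % n = unique name, l = local scope
    {ok, #{key => Key}}.

% Before the potentially slow save: unregister, then write and stop.
handle_info(save_and_stop, State = #{key := Key}) ->
    true = gproc:unreg({n, l, Key}),
    ok = save_to_db(State),                 % placeholder for the DB write
    {stop, normal, State}.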
For now I have decided to stick with the Erlang "let it crash" philosophy. If a process receives messages while it is shutting down, those messages will not be answered and will trigger a gen_server:call/* timeout.
I think it will be tedious to handle this timeout in the right place; I have not decided where yet, but that is specific to my application, so it is pointless to discuss here.

Handling the cleanup of the gen_server state

I have a gen_server running which must clean up its state whenever it is stopped normally or crashes unexpectedly. The cleanup basically consists in deleting a few files.
At the moment, when the gen_server crashes or is stopped normally, the cleanup is done in terminate/2.
Is there any reason why terminate/2 would not be called if the gen_server crashes?
Should there be another process monitoring the gen_server, waiting to do the cleanup if it dies unexpectedly?
So, the code is like this:
terminate(normal, _State) ->
    % Invoked when the process stops
    % Clean up the mess
    ok;
terminate(_Error, _State) ->
    % Invoked when the process crashes
    % Clean up the mess
    ok.
EDIT: I found this email in the official mailing list which is talking about the same thing:
http://groups.google.com/group/erlang-programming/browse_thread/thread/9a1ba2d974775ce8
As Adam says below, if we want to avoid trapping exits in the gen_server, we can use different approaches.
But if we do trap exits, terminate/2 seems to be a safe place to do the cleanup, as it will always be called. Furthermore, we must correctly handle the 'EXIT' messages delivered to terminate/2 and to handle_call/3, propagating the errors correctly between workers and supervisors.
terminate/2 is called when a crash occurs inside the gen_server, even if it doesn't trap exits. It will not be called if the gen_server receives an 'EXIT' signal from some other process linked to it; if you need to clean up in that case, it should trap exits (using process_flag(trap_exit, true)).
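In code, that is roughly the following (the file name is only an example):

init(Args) ->
    process_flag(trap_exit, true),              % so terminate/2 also runs on exit signals
    {ok, Args}.

terminate(_Reason, _State) ->
    _ = file:delete("/tmp/my_server.lock"),     % the cleanup: delete the few files
    ok.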
This behavior is a bit unfortunate because it makes it difficult to write a reliable shutdown procedure for a gen_server process. Also, it is not a good habit to trap exits just for the sake of being able to run terminate/2, since you might catch a lot of other errors which makes it harder to debug the system.
I would consider three options:
Handle the left over files when the next instance of the process starts (for example, in init/1)
Trap exits, clean up the files, and then crash again with the same reason
Have a 3rd process which monitors the gen_server whose only purpose is to clean up the files
Option 1 is probably the best, since at least the code doesn't trap exits and you get persistent state for free. Option 2 is not so nice for the reasons described above: it can hide and obscure other errors. Option 3 is messy because the cleanup process might not be done before the gen_server is started again.
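Option 1 could look roughly like this (the file pattern is purely illustrative):

init(Args) ->
    % remove anything a previous, crashed instance may have left behind
    Leftovers = filelib:wildcard("/tmp/my_server_*.tmp"),
    lists:foreach(fun file:delete/1, Leftovers),
    {ok, Args}.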
Think carefully about why you want to clean up, and if it really has to be done when the process crashes (it is a bug, after all). Be careful that you don't end up doing too much defensive programming.
This is quite fresh and relevant:
When does terminate/2 get called in a gen_server?

Supervisor callback for child normal exit

I am creating a test app where there is one supervisor with the simple_one_for_one strategy and many worker children added to it dynamically. How do I implement a callback (or receive a message) in the supervisor that will be called when a child exits normally?
The main goal is to notify some other process that all supervised worker processes are done and it's time to show the final report.
How should I design this kind of behaviour? Should I create my own behaviour that combines supervisor and gen_server, or is there a way to do this with the standard OTP behaviours?
There are two ways to do such a notification. The first is to simply monitor the child from the beginning. By using erlang:monitor/2, a third party can tell whether a process is alive or not. When the monitored process dies, this is turned into a message that gives the exit reason to the monitoring process.
The other way would be to use a bit of message sending in the process' terminate/2 function (terminate/3 if it's a gen_fsm). This is far more brittle, because the terminate function will not be called in all circumstances.
The monitor option is far superior.
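A minimal sketch of the monitoring approach (the notifier side only; names are made up):

watch(WorkerPid) ->
    Ref = erlang:monitor(process, WorkerPid),
    receive
        {'DOWN', Ref, process, WorkerPid, normal} ->
            worker_done;                     % normal exit: the worker finished its job
        {'DOWN', Ref, process, WorkerPid, Reason} ->
            {worker_failed, Reason}          % abnormal exit: the supervisor will restart it
    end.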
