handle saving of transient gen_servers states when using a key-to-pid mechanism - erlang

I would like to know how to handle saving the state of transient gen_servers when each one is associated with a key.
To associate keys with processes, I use a process called pidstore. Pidstore also starts the processes when needed.
I give pidstore a Key and an {M, F, A}. It looks up the key in global, then either returns the pid if found, or applies the MFA (which must return {ok, Pid}), registers the Pid under the key in global and returns the Pid.
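The lookup-or-start step described above can be sketched roughly as follows. This is only an illustration: the module name pidstore_sketch and the function get_pid/2 are assumptions, not the poster's code, and the function is meant to run inside the single pidstore process so that lookups for a key are serialised.

```erlang
-module(pidstore_sketch).
-export([get_pid/2]).

%% Hypothetical helper: return the pid registered under Key in global,
%% starting and registering a new process when none exists.
get_pid(Key, {M, F, A}) ->
    case global:whereis_name(Key) of
        undefined ->
            %% The MFA must return {ok, Pid}, as in the description above.
            {ok, Pid} = apply(M, F, A),
            yes = global:register_name(Key, Pid),
            Pid;
        Pid when is_pid(Pid) ->
            Pid
    end.
```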
I may have many inactive gen_servers, each with a possibly huge state. So I have set up the handle_info callback to save the state in my database and then stop the process. The gen_servers are declared transient in their supervisor, so they won't be restarted until something needs them again.
Here the problems start: if I look up a process by its key, say {car, 23}, during the saving step of handle_info in the process that represents {car, 23}, I get the pid back as intended, because the process is still saving and has not finished. So I call my process with gen_server:call, but I never get a response (and hit the default 5-second timeout) because the process is stopping. (PROBLEM A)
To solve this problem, the process could unregister itself from global, then save its state, then stop. But if I need it after it has unregistered and before the save has finished, I will start a new process, and this process could load stale values from the database. (PROBLEM B)
To solve this in turn, I could ensure that loads and saves in the database are enqueued and cannot run concurrently. This could be a bottleneck. (PROBLEM C)
I have been thinking about another solution: my processes, before saving, could tell the pidstore that they are busy. The pidstore would keep a list of busy processes and respond 'busy' to any request for these keys.
When the save is done, the process would tell the pidstore no_more_busy, and the pidstore could start a new process when asked for that key. (Even if the old process has not finished, it is done saving, so it can take its time to die on its own.)
This seems a bit messy to me, but it feels simpler to make several attempts to get the Pid from the key than to wrap every call to a gen_server to handle the possible timeouts (when the process is finishing but still registered in global).
I'm a bit confused by all of these half-problems and half-solutions. What design do you use in this situation, or how can I avoid this situation?
I hope my message is legible; please tell me about English errors too.
Thank you

Maybe you want to do the save to the DB inside a gen_server:call. That would prevent other calls from being handled while you are writing to the DB.
Generally, it sounds like you have created a process registry. You might want to look into gproc (https://github.com/uwiger/gproc), which does a very good job at that if you want to register locally. With gproc you can do exactly what you described above: use a key to register a process. Maybe it would be good enough to register with gproc in your init function and unregister when writing to the DB. You could also write to the DB in your terminate function.
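A minimal sketch of the "unregister, then save in terminate" idea, under stated assumptions: it uses global rather than gproc so that it has no external dependency, the module name car_server and the 60-second idle period are made up, and load_state/1 and save_state/2 are hypothetical stubs standing in for the real database calls.

```erlang
-module(car_server).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

-define(IDLE_MS, 60000).  % assumed idle period before the state is flushed

start_link(Key) ->
    gen_server:start_link({global, Key}, ?MODULE, Key, []).

init(Key) ->
    {ok, #{key => Key, data => load_state(Key)}, ?IDLE_MS}.

handle_call(Request, _From, State) ->
    %% Placeholder reply; every return value re-arms the idle timeout.
    {reply, {ok, Request}, State, ?IDLE_MS}.

handle_cast(_Msg, State) ->
    {noreply, State, ?IDLE_MS}.

handle_info(timeout, State) ->
    %% No message for ?IDLE_MS: stop and let terminate/2 do the save.
    {stop, normal, State};
handle_info(_Other, State) ->
    {noreply, State, ?IDLE_MS}.

terminate(_Reason, #{key := Key, data := Data}) ->
    %% Unregister first so that new lookups stop returning this pid, then save.
    global:unregister_name(Key),
    save_state(Key, Data).

%% Hypothetical persistence stubs; replace with real database calls.
load_state(_Key) -> #{}.
save_state(_Key, _Data) -> ok.
```

This addresses PROBLEM A only. A process started for the same key between the unregister and the end of the save can still read stale data (PROBLEM B), so the load and save for one key still need to be ordered somehow.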

For now I have decided to stick with the Erlang « let it crash » philosophy. If a process receives messages while it is shutting down, those messages will not be answered and will trigger a gen_server:call/* timeout.
I think it will be tedious to handle this timeout in the right place. I have not yet decided where, but that is specific to my application, so there is no point in discussing it here.
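If the timeout is handled at the call site, one possible shape is a small wrapper that turns the exit raised by gen_server:call/2 into an error tuple. The module name safe_call and the returned error atoms are assumptions chosen for illustration.

```erlang
-module(safe_call).
-export([call/2]).

%% Hypothetical wrapper: convert the exits raised by gen_server:call/2
%% into tagged tuples so the caller can decide whether to retry.
call(Pid, Request) ->
    try
        {ok, gen_server:call(Pid, Request)}
    catch
        exit:{timeout, _}  -> {error, timeout};   % no reply within 5 seconds
        exit:{noproc, _}   -> {error, noproc};    % process already gone
        exit:{normal, _}   -> {error, stopped};   % stopped during the call
        exit:{shutdown, _} -> {error, stopped}
    end.
```

A caller that gets {error, noproc} or {error, stopped} can ask the pidstore for the key again, which starts a fresh process.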

Related

Best-effort OTP supervision

What I'd like to do is change my supervisor to make a best effort to keep children running, but give up if their crash rate exceeds the intensity. That way the remainder of the children keep running. This doesn't appear to be possible with the existing supervisor configurations, though, so it looks like my only option may be to implement my own supervisor so I can have it behave this way when it receives EXIT.
Is there a way to implement custom OTP supervisor behavior like this without writing your own supervisor?
It sounds to me like what you want is an individual supervisor for each child, responsible for keeping it alive up to a limit, as you say, and as a layer above that have a single supervisor (one-for-one or simple-one-for-one) whose children are marked as temporary, so that when one of them gives up, the rest stay running.
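A rough Erlang sketch of that two-layer layout, assuming map-based child specs. The module name best_effort_sup, the function start_guard/1 and the limit of 3 restarts in 10 seconds are illustrative choices, not taken from the question.

```erlang
-module(best_effort_sup).
-behaviour(supervisor).
-export([start_link/1, start_guard/1, init/1]).

%% WorkerSpecs is a list of ordinary child spec maps, one per worker.
start_link(WorkerSpecs) ->
    supervisor:start_link(?MODULE, {top, WorkerSpecs}).

%% One "guard" supervisor is started per worker.
start_guard(WorkerSpec) ->
    supervisor:start_link(?MODULE, {guard, WorkerSpec}).

init({top, WorkerSpecs}) ->
    %% Guards are temporary: when one gives up, it is not restarted
    %% and its exit does not count against the top-level supervisor.
    Guards = [#{id => {guard, maps:get(id, Spec)},
                start => {?MODULE, start_guard, [Spec]},
                restart => temporary,
                type => supervisor}
              || Spec <- WorkerSpecs],
    {ok, {#{strategy => one_for_one}, Guards}};
init({guard, WorkerSpec}) ->
    %% Each guard restarts its single worker up to the assumed limit.
    {ok, {#{strategy => one_for_one, intensity => 3, period => 10},
          [WorkerSpec]}}.
```

When a worker exceeds its guard's restart intensity, only that guard terminates; the other guards and their workers keep running.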
You can't "extend" Supervisor to add different supervision behaviour, but you don't have to start from scratch either. The :supervisor module itself is implemented on top of :gen_server, so I would consult the source code of :supervisor (which you can find here) if you do find yourself needing some kind of custom supervision behaviour; it will give you a base to build from to avoid some of the pitfalls which you are likely to encounter.
I can expand my answer about alternative solutions once I have a better idea of your use case. As I mentioned in my comment, it sounds to me that you are likely doing something during init/1 of your processes which is prone to failure; init/1 is not the place to handle those things, because if it becomes impossible to succeed at that action temporarily, you will almost certainly blow the max restart intensity of the supervisor.
For example, let's assume you have a process which talks to the database, and requires a database connection; you do not want to try and connect to the database during init/1. Rather you should acquire the connection post-init (perhaps on first-use, or by immediately sending a post-init message to the process using Process.send_after(self(), :connect, 0)), and if the connection fails, return something like {:error, :database_unavailable} to any callers while you attempt to re-establish the connection. Designing with this approach will allow your supervision tree to remain stable, and it instead pushes the decision on how to deal with failure down to the clients who likely have better information on how it impacts them (i.e., should they retry the operation, return an error to their caller, exit with an exception, etc.)
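The same post-init pattern, sketched in Erlang to match the rest of this page. The module name lazy_conn, the one-second retry delay, and the stubs try_connect/0 and run_query/2 are all hypothetical; try_connect/0 randomly fails to simulate a database that is sometimes unavailable.

```erlang
-module(lazy_conn).
-behaviour(gen_server).
-export([start_link/0, query/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() -> gen_server:start_link(?MODULE, [], []).
query(Pid, Q) -> gen_server:call(Pid, {query, Q}).

init([]) ->
    %% Do not connect here; just schedule the attempt and return at once.
    self() ! connect,
    {ok, #{conn => undefined}}.

handle_call({query, _Q}, _From, #{conn := undefined} = S) ->
    {reply, {error, database_unavailable}, S};
handle_call({query, Q}, _From, #{conn := Conn} = S) ->
    {reply, {ok, run_query(Conn, Q)}, S}.

handle_cast(_Msg, S) -> {noreply, S}.

handle_info(connect, S) ->
    case try_connect() of
        {ok, Conn} ->
            {noreply, S#{conn := Conn}};
        {error, _Reason} ->
            %% Retry later instead of crashing and burning restart intensity.
            erlang:send_after(1000, self(), connect),
            {noreply, S}
    end;
handle_info(_Other, S) -> {noreply, S}.

%% Hypothetical stubs standing in for a real database client.
try_connect() ->
    case rand:uniform(2) of
        1 -> {ok, make_ref()};
        2 -> {error, unavailable}
    end.
run_query(_Conn, Q) -> Q.
```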
You can use director too, it's more flexible for solving this problem.

Is there an Erlang behaviour that can act on its own instead of waiting to be called?

I'm writing an Erlang application that requires actively polling some remote resources, and I want the process that does the polling to fit into the OTP supervision trees and support all the standard facilities like proper termination, hot code reloading, etc.
However, the two default behaviours, gen_server and gen_fsm seem to only support operation based on callbacks. I could abuse gen_server to do that through calls to self or abuse gen_fsm by having a single state that always loops to itself with a timeout 0, but I'm not sure that's safe (i.e. doesn't exhaust the stack or accumulate unread messages in the mailbox).
I could make my process into a special process and write all that handling myself, but that effectively makes me reimplement the Erlang equivalent of the wheel.
So is there a behavior for code like this?
loop(State) ->
    NewState = do_stuff(State), % without waiting to be called
    loop(NewState).
And if not, is there a safe way to trick default behaviours into doing this without exhausting the stack or accumulating messages over time or something?
The standard way of doing that in Erlang is by using erlang:send_after/3. See this SO answer and also this example implementation.
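A minimal sketch of the erlang:send_after/3 approach inside a gen_server. The module name poller, the polls counter and the stub do_stuff/0 are assumptions for illustration.

```erlang
-module(poller).
-behaviour(gen_server).
-export([start_link/1, polls/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link(IntervalMs) -> gen_server:start_link(?MODULE, IntervalMs, []).

%% Number of polls performed so far.
polls(Pid) -> gen_server:call(Pid, polls).

init(IntervalMs) ->
    %% Trigger the first poll right after init/1 returns.
    self() ! poll,
    {ok, #{interval => IntervalMs, polls => 0}}.

handle_call(polls, _From, #{polls := N} = S) -> {reply, N, S}.
handle_cast(_Msg, S) -> {noreply, S}.

handle_info(poll, #{interval := IntervalMs, polls := N} = S) ->
    ok = do_stuff(),
    %% Schedule the next poll only after this one is done, so poll
    %% messages cannot pile up in the mailbox.
    erlang:send_after(IntervalMs, self(), poll),
    {noreply, S#{polls := N + 1}};
handle_info(_Other, S) -> {noreply, S}.

%% Hypothetical stub for the actual polling work.
do_stuff() -> ok.
```

Because each callback returns to the gen_server loop, the stack does not grow, and system messages (code change, shutdown) are handled between polls.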
Is it possible that you could employ an essentially non-OTP-compliant process? Although to be a good OTP citizen you ideally want to make your long-running processes into gen_servers and gen_fsms, sometimes you have to look beyond the standard-issue rule book and consider why the rules exist.
What if, for example, your supervisor starts your gen_server, and your gen_server spawns another process (let's call it the active_poll process), and they link to each other so that they share fate (if one dies, the other dies)? The active_poll process is now indirectly supervised by the supervisor that started the gen_server, because if it dies, so will the gen_server, and they will both get restarted. The only problem you really have to solve now is code upgrade, but this is not too difficult: your gen_server gets a code_change callback call when the code is to be upgraded, and it could simply send a message to the active_poll process, which can make an appropriate fully qualified function call, and bingo, it's running the new code.
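The active_poll process could look roughly like this. It is a sketch, not a complete design: the message names upgrade and stop and the stub poll_once/0 are assumptions, and start_link/1 is expected to be called from the owning gen_server so that the two processes are linked.

```erlang
-module(active_poll).
-export([start_link/1, loop/1]).

%% Call this from the owning gen_server (for example in init/1).
start_link(IntervalMs) ->
    {ok, spawn_link(?MODULE, loop, [IntervalMs])}.

loop(IntervalMs) ->
    receive
        upgrade ->
            %% Fully qualified call: jumps to the newest loaded version
            %% of this module. The gen_server sends this from code_change.
            ?MODULE:loop(IntervalMs);
        stop ->
            ok
    after IntervalMs ->
        poll_once(),
        loop(IntervalMs)
    end.

%% Hypothetical stub for the actual polling work.
poll_once() -> ok.
```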
If this doesn't suit you for some reason and/or you MUST use gen_server/gen_fsm/similar directly...
I'm not sure that writing a 'special process' really gives you very much. Even if you wrote a special process correctly, such that it is in theory compliant with the OTP design principles, it could still be ineffective in practice if it blocks or busy-waits in a loop somewhere and doesn't invoke sys when it should. So you really have at most a small optimisation over using gen_server/gen_fsm with a zero timeout (or having an async message handler which does the polling and sends a message to self to trigger the next poll).
If whatever you are doing to actively poll can block (such as a blocking socket read), this is really big trouble, as a gen_server, gen_fsm or special process will all be stopped from fulfilling their usual obligations (which they would normally meet either because the callback returns, in the case of gen_server/gen_fsm, or because receive is called and the sys module is invoked explicitly, in the case of a special process).
If what you are doing to actively poll is non blocking though, you can do it, but if you poll without any delay then it effectively becomes a busy wait (it's not quite because the loop will include a receive call somewhere, which means the process will yield, giving the scheduler voluntary opportunity to run other processes, but it's not far off, and it will still be a relative CPU hog). If you can have a 1ms delay between each poll that makes a world of difference vs polling as rapidly as you can. It's not ideal, but if you MUST, it'll work. So use a timeout (as big as you can without it becoming a problem), or have an async message handler which does the polling and sends a message to self to trigger the next poll.

Erlang process termination: Where/When does it happen?

Consider processes all linked in a tree, either a formal supervision tree, or some ad-hoc structure.
Now, consider some child or worker down in this tree, with a parent or supervisor above it. I have two questions.
We would like to "gracefully" exit this process if it needs to be killed or shutdown, because it could be halfway through updating some account balance. Assume we have properly coded up some terminate function and connected this process to others with the proper plumbing. Now assume this process is in its main loop doing work. The signal to terminate comes in. Where exactly (or possibly the question should be WHEN EXACTLY) does this termination happen? In other words, when will terminate be called? Will the thing just preempt itself right in the middle of the loop it is running and call terminate? Will it wait until the end of the loop but before starting the loop again? Will it only do it while in receive mode? Etc.
Same question but without terminate function having been coded. Assume parent process is a supervisor, and this child is following normal OTP conventions. Parent tells child to shutdown, or parent crashes or whatever. The child is in its main loop. When/where/how does shutdown occur? In the middle of the main loop? After it? Etc.
It is quite nicely explained in the docs (sections 12.4, 12.5, 12.6, 12.7).
There are two cases:
1. Your process terminates because of some bad logic. It raises an error, so it can be in the middle of its work, and this could be bad. If you want to prevent that, you can define a mechanism that involves two processes. The first one begins the transaction, the second one does the actual work, and after that the first one commits the changes. If something bad happens to the second process (it dies because of an error), the first one simply does not commit the changes.
2. You are trying to kill the process from outside, for example when your supervisor restarts or a linked process dies. In this case you can also be in the middle of something, but Erlang gives you the trap_exit flag. It means that instead of dying, the process receives a message that you can handle. That in turn means that the terminate function will be called once you get to the receive block. So the process will finish one chunk of work, and when it is ready for the next one, it will call terminate and then die.
So you can bypass the exit by using trap_exit. You can also bypass trap_exit by sending exit(Pid, kill), which terminates the process even if it traps exits.
There is no way to bypass exit(Pid, kill), so be careful with using it.
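For a gen_server, the trap_exit case can be sketched as follows. The module name graceful_worker and the printed message are illustrative only.

```erlang
-module(graceful_worker).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, terminate/2]).

start_link() -> gen_server:start_link(?MODULE, [], []).

init([]) ->
    %% With trap_exit set, a shutdown from the supervisor arrives as a
    %% message, so terminate/2 runs between callbacks rather than in the
    %% middle of one.
    process_flag(trap_exit, true),
    {ok, #{}}.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

terminate(Reason, _State) ->
    %% Finish or roll back any half-done work here.
    io:format("cleaning up, reason: ~p~n", [Reason]),
    ok.
```

An exit(Pid, kill) still terminates this process immediately, without calling terminate/2.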

Restarting erlang process and preserving state

I have a supervisor process which starts a number of child processes. Currently, when a child dies, I spawn a new process with a new Pid. This means I lose the state information of the child process that has just died. I want my clients to always communicate with child processes using the same identifier, even though a child process may die and be restarted by the supervisor.
I was thinking of registering child processes with unique names and storing child state in ets table. The question is - what is the recommended way of approaching such problem in Erlang?
Thanks!
Storing process state in an ets table would work for keeping your state around between crashes, and I usually use the global registry for giving processes persistent names. (Player 200 would be registered as {player, 200}.) I don't recommend using the local registry because it requires that you use atoms and if you have many child processes, you can chew up your limit of atoms in a hurry by creating them dynamically (like player_200, player_201, etc.)
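Registering under a tuple key such as {player, 200} can be done directly through the {global, Name} form accepted by gen_server. A minimal sketch, with the module name player and the get_id/1 function chosen for illustration:

```erlang
-module(player).
-behaviour(gen_server).
-export([start_link/1, get_id/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% The name is any term, so no atoms are created dynamically.
start_link(Id) ->
    gen_server:start_link({global, {player, Id}}, ?MODULE, Id, []).

%% Clients address the process by key, not by pid, so the identifier
%% stays valid across restarts.
get_id(Id) ->
    gen_server:call({global, {player, Id}}, get_id).

init(Id) -> {ok, Id}.
handle_call(get_id, _From, Id) -> {reply, Id, Id}.
handle_cast(_Msg, Id) -> {noreply, Id}.
```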
Storing child state in the ets table has its own risks and issues, though. If a child crashes between the moment when an error occurs and when it saves to the ets table, you should be alright. However, what if you process data that causes the child to save garbage state, then crash on processing the next message? You'll restart the process, load the bad state from the ets table, and crash on your next message again. There are certainly ways to deal with this, but you should be aware that it is a possibility and work around it.
While Erlang hides the problems of distributing an ets table to all processes, it does so at the cost of CPU and potential contentions. If you're pushing a lot of changes to your ets table, you're going to pay for it in performance.
If your children are crashing, shouldn't you be looking for a way to remove the erroneous conditions anyway? I would usually take a process crash as something that I need to root-cause and fix.
Using ETS tables is probably the way to go for keeping the state. Vinoski's article discusses how to make it possible to restart a crashed process while keeping the ETS table data.
As @user30997 points out, the data in the table may actually be the reason the process crashed, so on restart you might want to validate the table (or set a limit on how many times the process will be restarted).
For associating processes with IDs, you should take a look at gproc, which is great for this.
Use event sourcing: persist all events, and replay them to reconstruct the state. If you need fast replays, take snapshots. See the example below:
https://github.com/bryanhunter/cqrs-with-erlang/tree/ndc-oslo
In fact, it would be nice to build a complete framework based on this example.
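The core of the replay idea is a fold over the stored events. The sketch below uses a made-up account example (event names deposited and withdrawn, module name account_events) and is not taken from the linked repository.

```erlang
-module(account_events).
-export([replay/1, replay/2]).

%% Rebuild the state from scratch.
replay(Events) -> replay(0, Events).

%% Rebuild from a snapshot plus the events recorded after it.
replay(Snapshot, Events) ->
    lists:foldl(fun apply_event/2, Snapshot, Events).

%% Each event is applied to the current state to produce the next one.
apply_event({deposited, Amount}, Balance) -> Balance + Amount;
apply_event({withdrawn, Amount}, Balance) -> Balance - Amount.
```

For example, account_events:replay([{deposited, 100}, {withdrawn, 30}]) returns 70, and account_events:replay(70, [{deposited, 5}]) returns 75.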

How can I restore process state after a crash?

What's a good way to persist state when restarting a crashed process?
I have a supervisor in an OTP application that watches several "subsystem" gen_servers.
For example, one is a "weather" subsystem that generates a new weather state every 15 minutes and handles queries for the current state of the weather. (Think of the lemonade stand game.)
If that gen_server crashes, I want it to be restarted, but it should be restarted with the most recent weather state, not some arbitrary state hardcoded in init(). It wouldn't make sense for the simulation state to suddenly go from "hail storm" to "pleasant and breezy" just because of the crash.
I hesitate to use mnesia or ETS to store the state after every update because of the added complexity; is there an easier way?
As long as the state only has to survive during runtime, I would suggest using ETS. The value is far greater than the complexity. The API is simple, and if you're working with named tables, access is simple too. You only have to create the table before your gen_server is started by the supervisor.
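A sketch of the named-table approach for the weather example. The table name weather_state, the default pleasant_and_breezy and the functions create_table/0, current/0 and set/1 are assumptions. create_table/0 must be called from a long-lived process, such as the supervisor's init/1, because the table is destroyed when its owner dies.

```erlang
-module(weather).
-behaviour(gen_server).
-export([create_table/0, start_link/0, current/0, set/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Call once from a process that outlives the gen_server.
create_table() ->
    weather_state = ets:new(weather_state, [named_table, public, set]),
    ok.

start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).
current() -> gen_server:call(?MODULE, current).
set(Weather) -> gen_server:call(?MODULE, {set, Weather}).

init([]) ->
    %% After a restart, pick up the last saved state if there is one.
    Weather = case ets:lookup(weather_state, current) of
                  [{current, Saved}] -> Saved;
                  [] -> pleasant_and_breezy
              end,
    {ok, Weather}.

handle_call(current, _From, Weather) ->
    {reply, Weather, Weather};
handle_call({set, New}, _From, _Old) ->
    %% Persist every update so a crash loses nothing.
    true = ets:insert(weather_state, {current, New}),
    {reply, ok, New}.

handle_cast(_Msg, Weather) -> {noreply, Weather}.
```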
Two - more complex - alternatives:
Build a pair of processes, one for the job to do, one for the state maintenance. Due to the simplicity of the second one it would be really reliable.
A really silly one would be to replace the child spec in the supervisor, with the current state as the argument, each time the state changes. (smile) No, just kidding.
is there an easier way?
When the process dies, it sends the supervisor a message containing its State, so you can store this value in the supervisor (in mnesia or in the supervisor's state). When your server starts (in init), it has to make a sync call to the supervisor to get the State value. I don't have a real example, but I hope it makes sense.
Anyway, I don't really see a problem with storing State in mnesia.
Sorry for my English :)

Resources