How can I restore process state after a crash? - erlang

What's a good way to persist state when restarting a crashed process?
I have a supervisor in an OTP application that watches several "subsystem" gen_servers.
For example, one is a "weather" subsystem that generates a new weather state every 15 minutes and handles queries for the current state of the weather. (Think the lemonade stand game)
If that gen_server crashes, I want it to be restarted, but it should be restarted with the most recent weather state, not some arbitrary state hardcoded in init(). It wouldn't make sense for the simulation state to suddenly go from "hail storm" to "pleasant and breezy" just because of the crash.
I hesitate to use mnesia or ETS to store the state after every update because of the added complexity; is there an easier way?

As long as the state only has to survive during runtime, I would suggest using ETS. The value far outweighs the complexity. The API is simple, and if you're working with named tables the access is simple too. You only have to create the table before your gen_server is started by the supervisor.
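A minimal sketch of that idea, assuming a hypothetical module weather_store and a table named weather_state (neither name comes from the question). The table must be created by a process that outlives the gen_server, because an ETS table is deleted when its owner dies.

```erlang
-module(weather_store).
-export([init_table/0, save/1, load/1]).

%% Hypothetical table name used only for this sketch.
-define(TABLE, weather_state).

%% Call this from a process that outlives the gen_server, for example
%% the process that starts the supervisor. The caller becomes the owner.
init_table() ->
    ets:new(?TABLE, [named_table, public, set]).

%% Overwrite the stored state after every update.
save(State) ->
    true = ets:insert(?TABLE, {current, State}),
    ok.

%% Return the last saved state, or Default if nothing was saved yet.
load(Default) ->
    case ets:lookup(?TABLE, current) of
        [{current, State}] -> State;
        [] -> Default
    end.
```

The gen_server would call weather_store:load/1 in init/1 and weather_store:save/1 whenever it generates a new weather state.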
Two - more complex - alternatives:
Build a pair of processes, one for the job to do, one for the state maintenance. Due to the simplicity of the second one it would be really reliable.
A really silly one would be to replace the supervisor's child spec, with the current state as the argument, each time the state changes. (smile) No, just kidding.
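The first alternative (a second process that only keeps state) could look roughly like this. The module name state_keeper and the functions store/2 and fetch/2 are made up for the sketch; the point is that the keeper does so little that it is unlikely to crash.

```erlang
-module(state_keeper).
-behaviour(gen_server).
-export([start_link/0, store/2, fetch/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Workers call this after every state change.
store(Key, Value) -> gen_server:cast(?MODULE, {store, Key, Value}).

%% Workers call this from init/1 to get their last state back.
fetch(Key, Default) -> gen_server:call(?MODULE, {fetch, Key, Default}).

%% The keeper's own state is just a map from Key to the saved value.
init([]) -> {ok, #{}}.

handle_call({fetch, Key, Default}, _From, Map) ->
    {reply, maps:get(Key, Map, Default), Map}.

handle_cast({store, Key, Value}, Map) ->
    {noreply, Map#{Key => Value}}.
```

The keeper should be started before the workers and should not be restarted together with them, otherwise the saved state is lost along with it.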

is there an easier way?
When a process dies, it sends a message to the supervisor that contains the State of the process, so you can store this value in the supervisor (in mnesia or in the supervisor's state). When your server starts, it has to make a sync call to the supervisor (in init) to get the State value. I don't have a real example, but I hope it makes sense.
Anyway, I don't really see a problem with storing State in mnesia.
sorry my English :)

Related

How to implement status in Erlang?

I am thinking about an Erlang program that has many workers (receive loops). These workers almost always manipulate their status at the same time, i.e. massively concurrently, and there are so many workers that keeping their status in mnesia would cause performance problems. So I am thinking of passing the status as an argument in each loop and writing it to mnesia some time later. Is this good practice? Is there a better way to do this? (Roughly speaking, I'm looking for something like an instance with attributes in an object-oriented language.)
Thanks.
With Erlang, it is a good habit to see processes as actors with a dedicated and limited role. With this in mind, you will split your problem into different categories, like:
Maintain the state of a connection with a user over the Internet,
Keep information such as login, user profile, friends, shop-cart...
log events
...
For each role you will have to decide whether the state information must survive the process.
In a lot of cases it is not necessary (case 1), and the solution is simply to keep the state in the argument of the process's loop function. I encourage you to look at the OTP behaviours; gen_server and gen_fsm are made for this.
Case 2 obviously manipulates permanent data which must survive a process crash or even a hardware crash. This data will be stored using dets, mnesia or any database suited to your problem (Redis, CouchDB ...).
It is important to limit the information stored in an external database, otherwise you will not benefit from a very powerful feature: the lack of side effects. In other words, it is a very bad idea to have process behaviour that depends on external information.
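To illustrate case 1, here is a small sketch of a process that keeps its state only in the argument of its loop function. The counter module is invented for the example; a gen_server does the same thing behind its callbacks.

```erlang
-module(counter).
-export([start/0, increment/1, value/1]).

%% The state (a count) lives only in the argument of loop/1.
start() -> spawn(fun() -> loop(0) end).

increment(Pid) -> Pid ! increment, ok.

%% Ask the process for its current state and wait for the tagged reply.
value(Pid) ->
    Ref = make_ref(),
    Pid ! {value, self(), Ref},
    receive
        {Ref, Count} -> Count
    after 1000 ->
        {error, timeout}
    end.

%% Each message is handled by recursing with the new state.
loop(Count) ->
    receive
        increment ->
            loop(Count + 1);
        {value, From, Ref} ->
            From ! {Ref, Count},
            loop(Count)
    end.
```

If this process crashes, the count is gone, which is exactly the trade-off described above for state that does not need to survive.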

How to restart child with custom state using Erlang OTP supervisor behaviour?

I'm using OTP supervisor behaviour to supervise and restart child processes. However when the child dies I want to restart it with the same state it had before the crash.
If I write my own custom supervisor, I can just receive the {'EXIT', Pid, Reason} message and act upon it. When using the OTP supervisor behaviour, however, it is all managed by OTP and I have no control over it. The only callback function I implement is init.
Is there any standard approach in a case like this? How can I customise the state of a child being restarted dynamically by the OTP supervisor? How do I get the Pid of the terminating process using OTP? Or maybe it's possible to get the state of the child just before termination, and then restore the child to the same state it had before it crashed?
Restarting with the same state may not be a good idea. A bad state probably led the process to crash, and if you restart with the same state, it will crash again. But if you want this, use an external resource to keep it (like ets or mnesia).
Without knowing any details about what you are doing, I can imagine a world where the following makes sense:
the supervisor creates an ETS table and passes the table identifier to each child
a child process starts and, based on some relevant attribute of the child, consults the ETS table to look for state to load
every time a child's state changes it writes it to the ETS table
So, if I had 12 child processes representing the 12 Tribes of Cobol, each would use its name as the key to the ETS table to look for state left behind by a previous incarnation upon starting. And each process would update the table (again using its name as the key) whenever its state changed.
The supervisor will automatically restart a killed child and step 2, above, would be executed in the child's init method. Step 3 would be dealt with in a child's handle_call, handle_cast and handle_info methods (I am making some assumptions about the nature of your processes). There are a number of restart strategies available via the supervisor that can even restart siblings if desired.
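A sketch of the child side of these steps, under some assumptions: the module name tribe and the write/read API are invented, and the supervisor is assumed to have created the table in its own init/1 with ets:new(tribes, [public, set]) and to pass the table identifier in the child's start arguments.

```erlang
-module(tribe).
-behaviour(gen_server).
-export([start_link/2, write/2, read/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Table is the ETS identifier handed down by the supervisor (assumption).
start_link(Name, Table) ->
    gen_server:start_link(?MODULE, {Name, Table}, []).

write(Pid, Value) -> gen_server:call(Pid, {write, Value}).
read(Pid) -> gen_server:call(Pid, read).

%% Step 2: look for state left behind by a previous incarnation.
init({Name, Table}) ->
    Value = case ets:lookup(Table, Name) of
                [{Name, Saved}] -> Saved;
                [] -> undefined
            end,
    {ok, {Name, Table, Value}}.

%% Step 3: write every state change to the table.
handle_call({write, Value}, _From, {Name, Table, _Old}) ->
    true = ets:insert(Table, {Name, Value}),
    {reply, ok, {Name, Table, Value}};
handle_call(read, _From, {_Name, _Table, Value} = State) ->
    {reply, Value, State}.

handle_cast(_Msg, State) -> {noreply, State}.
```

Because the supervisor owns the table, the data survives any child crash but not a crash of the supervisor itself.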
Hope this gives you some thoughts.
I think this sort of customization of the OTP supervisor behaviour can't be done easily. The way OTP supervisors are designed forces me to follow some strict design practices. The most important one in this case is that a supervisor shouldn't do anything apart from monitoring its children and restarting them in case of abnormal termination. There should be no additional logic in the supervisor, so as not to introduce bugs into supervisors, which are a critical part of the supervision tree and of fault tolerance.
when the child dies I want to restart it with the same state it had before the crash
- this is bad practice in general, because the child might have died because of the corrupted state it had before termination, and restarting it with the same state in that case will surely cause problems
Is there any standard approach in a case like this?
Customizing the state of the children within the supervisor before restarting them goes against good supervisor design practice. This kind of task is therefore usually done differently, for example by introducing another process, such as a gen_server, which is responsible for starting children via the supervisor (supervisor:start_child) and maintaining monitors on all the processes. This additional process can do any required customization before starting a new child.
How to get Pid of the terminating process using OTP?
- in the additional process which starts children via supervisor:start_child, you can monitor them and then listen for 'DOWN' messages. For example, in a gen_server you would use the handle_info function as below:
handle_info({'DOWN', Ref, process, Pid, _Reason}, S) ->
    %% handle_down_worker/3 is your own function; it receives the monitor
    %% reference and the pid of the worker that went down.
    handle_down_worker(Ref, Pid, S).
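A fuller sketch of such an additional process. It is simplified: instead of calling supervisor:start_child directly, it takes a start function that must return {ok, Pid}, which in a real system would wrap supervisor:start_child/2. The module name worker_watcher and its API are invented for the example.

```erlang
-module(worker_watcher).
-behaviour(gen_server).
-export([start_link/0, watch/2, down_log/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% StartFun must return {ok, Pid}; in a real system it would wrap
%% supervisor:start_child/2 (assumption of this sketch).
watch(Key, StartFun) -> gen_server:call(?MODULE, {watch, Key, StartFun}).

%% Returns [{Key, Pid, Reason}] for every watched process that went down.
down_log() -> gen_server:call(?MODULE, down_log).

init([]) -> {ok, #{refs => #{}, log => []}}.

handle_call({watch, Key, StartFun}, _From, #{refs := Refs} = S) ->
    {ok, Pid} = StartFun(),
    Ref = erlang:monitor(process, Pid),
    {reply, {ok, Pid}, S#{refs := Refs#{Ref => Key}}};
handle_call(down_log, _From, #{log := Log} = S) ->
    {reply, lists:reverse(Log), S}.

handle_cast(_Msg, S) -> {noreply, S}.

%% The 'DOWN' message carries the pid and the exit reason, not the state.
handle_info({'DOWN', Ref, process, Pid, Reason},
            #{refs := Refs, log := Log} = S) ->
    case maps:take(Ref, Refs) of
        {Key, Rest} ->
            {noreply, S#{refs := Rest, log := [{Key, Pid, Reason} | Log]}};
        error ->
            {noreply, S}
    end;
handle_info(_Other, S) -> {noreply, S}.
```

This is the place where a restart with customized arguments could be triggered, keeping that logic out of the supervisor.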
Or maybe it's possible to get the state of the child just before termination, and then restore the child to the same state it had before it crashed?
- Correct me if I'm wrong, but I don't think it is possible in Erlang to send the state the child had just before termination along with the 'DOWN' message. If that were possible, I could just handle a message like {'DOWN', Pid, Reason, State} and restart the process with the same state, or part of it. But then, how could you preserve the state of a child that dies suddenly, for example one killed with exit(Pid, kill)? I doubt that would be possible.

Restarting erlang process and preserving state

I have a supervisor process which starts a number of child processes. Currently, when a child dies I spawn a new process with a new Pid. This means I lose the state information of the child process which has just died. I want my clients to always communicate with child processes using the same identifier, despite the fact that a child process may die and be restarted by the supervisor.
I was thinking of registering child processes with unique names and storing child state in an ets table. The question is: what is the recommended way of approaching such a problem in Erlang?
Thanks!
Storing process state in an ets table would work for keeping your state around between crashes, and I usually use the global registry for giving processes persistent names. (Player 200 would be registered as {player, 200}.) I don't recommend using the local registry because it requires that you use atoms and if you have many child processes, you can chew up your limit of atoms in a hurry by creating them dynamically (like player_200, player_201, etc.)
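A minimal sketch of registering a gen_server under a tuple name in the global registry. The player module and its score API are invented for illustration; the {global, Name} forms are the standard gen_server ones.

```erlang
-module(player).
-behaviour(gen_server).
-export([start_link/1, score/1, add_points/2]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Register under a tuple name in the global registry; no atom is created.
start_link(Id) ->
    gen_server:start_link({global, {player, Id}}, ?MODULE, Id, []).

%% Clients address the process by Id, never by pid.
score(Id) -> gen_server:call({global, {player, Id}}, score).
add_points(Id, N) -> gen_server:call({global, {player, Id}}, {add, N}).

init(_Id) -> {ok, 0}.

handle_call(score, _From, Score) -> {reply, Score, Score};
handle_call({add, N}, _From, Score) -> {reply, ok, Score + N}.

handle_cast(_Msg, Score) -> {noreply, Score}.
```

When the supervisor restarts the process, it registers under the same name again, so clients keep using {player, 200} without knowing the new pid.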
Storing child state in the ets table has its own risks and issues, though. If a child crashes between the moment when an error occurs and when it saves to the ets table, you should be alright. However, what if you process data that causes the child to save garbage state, then crash on processing the next message? You'll restart the process, load the bad state from the ets table, and crash on your next message again. There are certainly ways to deal with this, but you should be aware that it is a possibility and work around it.
While Erlang hides the problems of distributing an ets table to all processes, it does so at the cost of CPU and potential contention. If you're pushing a lot of changes to your ets table, you're going to pay for it in performance.
If your children are crashing, shouldn't you be looking for a way for them to remove the erroneous conditions, anyway? I would usually take a process crash as something that I needed to root cause and fix.
Using ETS tables is probably the way to go for keeping the state. Vinoski's article discusses how to make it possible to restart a crashed process while keeping the ETS table data.
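One way to keep a table alive across a worker crash is the ETS heir option combined with ets:give_away/3. The sketch below is my own illustration of that mechanism, not code from the article; the module name table_keeper and the atoms handed_over and returned are made up.

```erlang
-module(table_keeper).
-export([start/0, claim/1]).

%% Start the keeper. It creates the table and names itself as heir, so the
%% table comes back to it whenever the current owner dies.
start() ->
    spawn(fun() ->
        Tab = ets:new(kept_state, [set, protected, {heir, self(), returned}]),
        loop(Tab)
    end).

%% Called by a (re)started worker: ask the keeper to hand over the table.
claim(Keeper) ->
    Keeper ! {claim, self()},
    receive
        {'ETS-TRANSFER', Tab, Keeper, handed_over} -> {ok, Tab}
    after 1000 ->
        {error, timeout}
    end.

%% The keeper owns the table and waits for a worker to claim it.
loop(Tab) ->
    receive
        {claim, Worker} ->
            true = ets:give_away(Tab, Worker, handed_over),
            wait_for_return(Tab)
    end.

%% The worker owns the table; the heir option returns it if the worker dies.
wait_for_return(Tab) ->
    receive
        {'ETS-TRANSFER', Tab, _DeadOwner, returned} -> loop(Tab)
    end.
```

A restarted worker calls table_keeper:claim/1 in its init and finds the rows its predecessor wrote.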
As @user30997 points out, the data in the table may actually be the reason the process crashed, so on restart you might want to validate the table (or set a limit on how many times the process will be restarted...).
For associating processes with IDs you should take a look at gproc, which is great for this.
Use event sourcing: persist all events, and replay them to reconstruct the state. If you need fast replays, take a snapshot. See the example below:
https://github.com/bryanhunter/cqrs-with-erlang/tree/ndc-oslo
In fact, it would be nice to build a complete framework based on this example.
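Independently of the linked repository, the replay idea can be sketched in a few lines. The event names and the initial state here are invented, and the log is an in-memory list where a real system would use durable storage.

```erlang
-module(weather_events).
-export([new/0, record/2, current/1, apply_event/2]).

%% The log is a list of events, newest first. A real system would append
%% them to durable storage instead (assumption of this sketch).
new() -> [].

record(Event, Log) -> [Event | Log].

%% Rebuild the state by replaying every event, oldest first.
current(Log) ->
    lists:foldl(fun apply_event/2, initial_state(), lists:reverse(Log)).

initial_state() -> #{sky => clear, temperature => 20}.

%% Each event is a small, pure state transition.
apply_event({sky_changed, Sky}, State) ->
    State#{sky := Sky};
apply_event({temperature_changed, Delta}, #{temperature := T} = State) ->
    State#{temperature := T + Delta}.
```

A snapshot would simply be a saved result of current/1 plus the position in the log it corresponds to, so that a replay can start from there.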

Handle saving of transient gen_server states when using a key-to-pid mechanism

I would like to know how to handle saving the state of transient gen_servers when they are associated with a key.
To associate keys with processes, I use a process called pidstore. Pidstore also starts the processes when necessary.
I give a Key and an {M, F, A} to pidstore. It looks for the key in global, then either returns the pid if found, or applies the MFA (which must return {ok, Pid}), registers the Pid with the key in global, and returns the Pid.
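The lookup-or-start step described above might look like this. It is a guess at the shape of the code, not the poster's actual pidstore; the module name pidstore_sketch is invented.

```erlang
-module(pidstore_sketch).
-export([get_pid/2]).

%% Return the pid registered for Key, starting it via {M, F, A} if needed.
get_pid(Key, {M, F, A}) ->
    case global:whereis_name(Key) of
        undefined ->
            {ok, Pid} = apply(M, F, A),
            case global:register_name(Key, Pid) of
                yes ->
                    Pid;
                no ->
                    %% Someone else registered the key first: drop ours
                    %% and return the winner (may be undefined if it died).
                    unlink(Pid),
                    exit(Pid, shutdown),
                    global:whereis_name(Key)
            end;
        Pid when is_pid(Pid) ->
            Pid
    end.
```

Running all lookups through a single pidstore process, as in the question, serializes them and avoids most of the race handled in the no branch.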
I may have many inactive gen_servers with a possibly huge state. So I've set up the handle_info callback to save the state in my database and then stop the process. The gen_servers are considered transient in their supervisor, so they won't be restarted until something needs them again.
Here the problems start: if I call a process by its key, say {car, 23}, during the saving step of handle_info in the process which represents {car, 23}, I'll get the pid back as intended, because the process is still saving and has not finished. So I'll call my process with gen_server:call, but I'll never get a response (and will hit the default 5 second timeout) because the process is stopping. (PROBLEM A)
To solve this problem, the process could unregister itself from global, then save its state, then stop. But if I need it after it has unregistered and before the save has finished, I will start a new process, and this process could load stale values from the database. (PROBLEM B)
To solve this in turn, I could ensure that loads and saves in the db are queued and cannot run concurrently. This could be a bottleneck. (PROBLEM C)
I've been thinking about another solution: my processes, before saving, could tell the pidstore that they are busy. The pidstore would keep a list of busy processes and respond 'busy' to any request for these keys.
When the save is done, the process would tell the pidstore no_more_busy, and the pidstore could then start a new process when asked for the key. (Even if the old process has not finished, it's done saving, so it can just take its time to die alone.)
This seems a bit messy to me, but it feels simpler to make several attempts to get the Pid from the key than to wrap every call to a gen_server to handle the possible timeouts (when the process is finishing but still registered in global).
I'm a bit confused by all of these half-problems and half-solutions. What design do you use in this situation, or how can I avoid this situation?
I hope my message is legible; please tell me about English errors too.
Thank You
Maybe you want to do the save to DB part in a gen_server:call. That would prevent other calls from coming in while you are writing to DB.
Generally it sounds like you have created a process registry. You might want to look into gproc (https://github.com/uwiger/gproc), which does a very good job of that if you want to register locally. With gproc you can do exactly what you described above: use a key to register a process. Maybe it would be good enough to register with gproc in your init function and unregister when writing to the DB. You could also write to the DB in your terminate function.
For now I've decided to stick with the Erlang "let it crash" philosophy. If a process receives messages as it is shutting down, those messages will not be answered and will trigger a gen_server:call timeout.
I think it will be tedious to handle this timeout in the right place. I have not decided where yet, but this is specific to my application, so it is beside the point here.
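For reference, wrapping the call in one place could look like the sketch below. The module name safe_call is invented; the exit reasons matched are the ones gen_server:call raises when the server does not answer in time, does not exist, or stops while the call is pending.

```erlang
-module(safe_call).
-export([call/2]).

%% Wrap gen_server:call/2 so callers get a tagged error instead of exiting
%% when the target is gone or does not answer within the default 5 seconds.
call(ServerRef, Request) ->
    try gen_server:call(ServerRef, Request) of
        Reply -> {ok, Reply}
    catch
        exit:{timeout, _} -> {error, timeout};
        exit:{noproc, _} -> {error, noproc};
        exit:{normal, _} -> {error, stopped};
        exit:{shutdown, _} -> {error, stopped}
    end.
```

A caller that gets {error, _} can look the key up again and retry, which keeps the retry policy out of the individual call sites.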

Is membase a good persistence layer for an Erlang game server?

I aim to create a browser game where players can set up buildings.
Each building will have several modules (engines, offices, production lines, ...). Each module may have one or more actions running, like the creation of 200 'item X' with ingredients Y, Z.
The game server will be built with Erlang: an OTP application as the server itself, and nitrogen as the web front end.
I need persistence of data. I was thinking about the following:
When somebody or something interacts with a building, or a timer representing some production line expires, a supervisor spawns a gen_server (if not already spawned) which loads the state of the building from a database, so the gen_server can answer messages like 'add this module', 'start this action', 'store this production to warehouse', 'die', etc.
But when a building doesn't receive any messages for X seconds or minutes, it will terminate (thanks to the gen_server timeout feature) and write its current state back to the database.
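The idle shutdown described here can be sketched with the gen_server timeout value. In this sketch the database is replaced by a caller-supplied SaveFun, and the module name, the API and the state layout are all assumptions made for illustration.

```erlang
-module(building).
-behaviour(gen_server).
-export([start_link/3, add_module/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% SaveFun(Id, Modules) stands in for the database write (assumption).
%% IdleMs is how long the process may stay idle before it saves and stops.
start_link(Id, SaveFun, IdleMs) ->
    gen_server:start_link(?MODULE, {Id, SaveFun, IdleMs}, []).

add_module(Pid, Module) -> gen_server:call(Pid, {add_module, Module}).

%% Loading from the database is omitted; start with no modules.
init({Id, SaveFun, IdleMs}) ->
    {ok, #{id => Id, save => SaveFun, idle => IdleMs, modules => []}, IdleMs}.

%% Returning the timeout with every reply re-arms the idle timer.
handle_call({add_module, M}, _From, #{modules := Ms, idle := Idle} = S) ->
    {reply, ok, S#{modules := [M | Ms]}, Idle}.

handle_cast(_Msg, #{idle := Idle} = S) -> {noreply, S, Idle}.

%% No message arrived within the idle period: save and stop normally.
handle_info(timeout, #{id := Id, save := SaveFun, modules := Ms} = S) ->
    SaveFun(Id, Ms),
    {stop, normal, S};
handle_info(_Other, #{idle := Idle} = S) ->
    {noreply, S, Idle}.
```

With a transient restart type in the supervisor, the normal exit after saving does not trigger a restart.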
So, as it will be a (soft) real-time game, the gen_server must be set up very quickly. I was thinking of membase as the database, because it's known to have very good response times.
My question is: when a gen_server is up and running, its state takes up some memory, and this state is also present in the memory handled by membase, so the state uses twice its size in memory. Is that a bad design?
Is membase a good solution for handling persistence in my case? Would mnesia be a better choice, or something else?
I fear the mnesia 2 GB (or 4?) table size limit, because at the moment I don't know the average state size of my gen_servers (buildings in this example, but also players, production lines, whatever), and I may someday have more than 1 player :)
Thank you
I agree with Hynek -Pichi- Vychodil. Riak is a great thing for key-value storage.
We use Riak almost 95% for the same thing you described. Everything has worked so far without any issues. If you hit a performance limit with Riak, add more nodes and you're good to go!
Another cool thing about Riak is its very low performance degradation over time. You can find more information about benchmarking Riak here: http://joyeur.com/2010/10/31/riak-smartmachine-benchmark-the-technical-details/
In case you go with it:
a driver: https://github.com/basho/riak-erlang-client
a connection pool you may need to work with it: https://github.com/dweldon/riakpool
About membase and memory usage: I also tried membase, but I found that it is not suitable for my tasks (membase claims fault tolerance, but I could not set it up so that it actually coped with faults; even with help from the membase guys I didn't succeed). So at the moment I use the following architecture: all players that are online and playing the game are represented as player processes (gen_server). All data and business logic for each player is in its player process. From time to time each player process decides to save its state in Riak.
So far this seems to be a very fast and efficient approach.
Update: Now we are with PostgreSQL. It is awesome!
You can look at bitcask or other Riak backends to store your data. Avoiding IPC is definitely a good idea, so keep it inside Erlang.

Resources