Restarting an Erlang process and preserving state

I have a supervisor process that starts a number of child processes. Currently, when a child dies, I spawn a new process with a new Pid. This means I lose the state information of the child process that has just died. I want my clients to always communicate with the child processes using the same identifier, even though a child process may die and be restarted by the supervisor.
I was thinking of registering the child processes under unique names and storing the child state in an ETS table. The question is: what is the recommended way of approaching such a problem in Erlang?
Thanks!

Storing process state in an ETS table would work for keeping your state around between crashes, and I usually use the global registry for giving processes persistent names. (Player 200 would be registered as {player, 200}.) I don't recommend using the local registry, because it requires atoms as names, and if you have many child processes you can quickly exhaust the atom limit by creating names dynamically (player_200, player_201, etc.).
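A minimal sketch of the global-registry naming, assuming a hypothetical player gen_server whose state is a map. The name {global, {player, Id}} is an ordinary tuple, so no atom is created per player, and clients keep using the same name after a supervisor restarts the process. The state itself still resets in init/1; it would have to be reloaded from somewhere such as an ETS table.

```erlang
-module(player).
-behaviour(gen_server).

%% Public API of the hypothetical player server.
-export([start_link/1, get_score/1, add_points/2]).
%% gen_server callbacks.
-export([init/1, handle_call/3, handle_cast/2]).

%% Every player is registered in the global registry under {player, Id}.
%% Tuples are ordinary terms, so no atoms are created per player.
name(Id) -> {global, {player, Id}}.

start_link(Id) ->
    gen_server:start_link(name(Id), ?MODULE, Id, []).

%% Clients address the player by its stable name, never by pid, so the
%% same call keeps working after a supervisor restarts the process.
get_score(Id) -> gen_server:call(name(Id), get_score).

add_points(Id, N) -> gen_server:call(name(Id), {add_points, N}).

init(Id) -> {ok, #{id => Id, score => 0}}.

handle_call(get_score, _From, State = #{score := S}) ->
    {reply, S, State};
handle_call({add_points, N}, _From, State = #{score := S}) ->
    {reply, ok, State#{score := S + N}}.

handle_cast(_Msg, State) -> {noreply, State}.
```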
Storing child state in the ETS table has its own risks and issues, though. If a child crashes after an error occurs but before it saves to the ETS table, you should be all right. However, what if you process data that causes the child to save garbage state and then crash on processing the next message? You'll restart the process, load the bad state from the ETS table, and crash on the next message again. There are certainly ways to deal with this, but you should be aware that it is a possibility and work around it.
While Erlang hides the problems of sharing an ETS table among processes, it does so at the cost of CPU time (terms are copied into and out of the table) and potential contention. If you're pushing a lot of changes to your ETS table, you're going to pay for it in performance.
If your children are crashing, shouldn't you be looking for a way to remove the conditions that cause the errors, anyway? I usually treat a process crash as something whose root cause I need to find and fix.

Using ETS tables is probably the way to go for keeping the state. Vinoski's article discusses how to make it possible to restart a crashed process while keeping the ETS table data.
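The article's code is not reproduced here, but one well-known way to keep ETS data across a crash is the heir option: a long-lived keeper process creates the table, names itself heir, and hands the table to the worker with ets:give_away/3. When the worker dies, ETS transfers ownership back to the keeper (sending it an 'ETS-TRANSFER' message) instead of deleting the table. A minimal sketch; the module name ets_keeper and the claim protocol are hypothetical:

```erlang
-module(ets_keeper).
-export([start/0, claim/1]).

%% Start the long-lived keeper. It creates the table with itself as heir,
%% so whenever a worker that owns the table dies, ownership (and the
%% data) is transferred back to the keeper instead of being deleted.
start() ->
    spawn(fun() ->
                  Tab = ets:new(worker_state, [set, {heir, self(), none}]),
                  loop(Tab, true)
          end).

%% Called by a (re)started worker: ask the keeper for the table and wait
%% until ownership has been handed over.
claim(Keeper) ->
    Keeper ! {claim, self()},
    receive
        {'ETS-TRANSFER', Tab, Keeper, claimed} -> {ok, Tab}
    after 5000 ->
            {error, timeout}
    end.

%% Owned is true while the keeper itself owns the table. Claims are only
%% accepted then; a claim that arrives early waits in the mailbox until
%% the 'ETS-TRANSFER' caused by the dead worker has been received.
loop(Tab, Owned) ->
    receive
        {claim, Worker} when Owned ->
            true = ets:give_away(Tab, Worker, claimed),
            loop(Tab, false);
        {'ETS-TRANSFER', Tab, _DeadWorker, none} ->
            loop(Tab, true)
    end.
```

In a real application the keeper would itself be a supervised process started before the worker, so that it outlives worker crashes.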
As user30997 points out, the data in the table may actually be the reason the process crashed, so on restart you might want to validate the table (or set a limit on how many times the process will be restarted).
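A minimal sketch of validating on restart, assuming a hypothetical state shape #{count => non_neg_integer()} and a made-up validation rule. The load function is meant to be called from init/1: a saved entry that fails validation is deleted, so garbage state cannot crash every restart in a row.

```erlang
-module(restore_state).
-export([load/2, save/3]).

%% Hypothetical state shape for the example: #{count => non_neg_integer()}.
default_state() -> #{count => 0}.

%% Validation rule (an assumption; use whatever invariants your state has).
valid_state(#{count := N}) when is_integer(N), N >= 0 -> true;
valid_state(_) -> false.

save(Tab, Key, State) -> ets:insert(Tab, {Key, State}).

%% Reuse the saved state only if it passes validation; otherwise delete
%% it and start fresh.
load(Tab, Key) ->
    case ets:lookup(Tab, Key) of
        [{Key, State}] ->
            case valid_state(State) of
                true ->
                    {restored, State};
                false ->
                    true = ets:delete(Tab, Key),
                    {fresh, default_state()}
            end;
        [] ->
            {fresh, default_state()}
    end.
```

The restart limit is what the supervisor's intensity and period flags already provide: more than intensity restarts within period seconds makes the supervisor itself give up.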
For associating processes with IDs, take a look at gproc, which is great for this.

Use event sourcing: persist all events and replay them to reconstruct the state. If you need fast replays, take snapshots. See the example below:
https://github.com/bryanhunter/cqrs-with-erlang/tree/ndc-oslo
In fact, it would be nice to build a complete framework based on this example.
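The core of the replay idea fits in a few lines. This is not code from the linked repository; it is a minimal sketch with a hypothetical bank-balance domain, where state is a fold over the event log and a snapshot lets a restart replay only the newer events.

```erlang
-module(event_replay).
-export([apply_event/2, replay/2, snapshot/2, restore/2]).

%% Hypothetical domain: an account balance rebuilt from events.
apply_event({deposited, N}, Balance) -> Balance + N;
apply_event({withdrawn, N}, Balance) -> Balance - N.

%% Rebuild state by folding every persisted event over an initial state.
replay(Events, Initial) ->
    lists:foldl(fun apply_event/2, Initial, Events).

%% A snapshot records the state after the first Version events.
snapshot(Events, Version) ->
    {Version, replay(lists:sublist(Events, Version), 0)}.

%% Fast restore: start from the snapshot and replay only newer events.
restore({Version, State}, Events) ->
    replay(lists:nthtail(Version, Events), State).
```

For example, with Events = [{deposited, 100}, {withdrawn, 30}, {deposited, 5}], a full replay from 0 gives 75, snapshot(Events, 2) gives {2, 70}, and restoring from that snapshot replays one event to reach 75 again.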

Related

ETS entry limit, to use as a cache server

My idea is to use ETS as a temporary cache for my GenServer state.
For example, when I restart my application, the GenServer state should be transferred to ETS, and when the application starts again, the GenServer should be able to get the state from there.
I would like to keep it simple, so all of the GenServer's state (a Map) should take a single entry. Is there a limit on entry size?
Another approach would be to simply create a file and load it again when needed. Maybe this is even better/simpler, but I am not sure :)
With an ETS table, the app could start on a completely different host and connect to the cache node (ETS).
This can certainly be done in a wide variety of ways. You could have a separate store, like Mnesia (whose ram_copies and disc_copies tables are ETS tables under the hood), Redis, or a plain database. In the latter two cases, you would need to serialize your GenServer state to a binary and back, using :erlang.term_to_binary and :erlang.binary_to_term respectively.
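In Erlang syntax the same calls are erlang:term_to_binary/1 and erlang:binary_to_term/1. A minimal sketch (module name hypothetical) of both options: serializing a whole state map for an external store, and keeping it unserialized as a single ETS entry, since an ETS value can be any term, including a map.

```erlang
-module(state_blob).
-export([to_store/1, from_store/1, cache_put/3, cache_get/2]).

%% Serialize a whole state map to a binary, e.g. for Redis or a SQL column.
to_store(State) -> erlang:term_to_binary(State).

%% For binaries that come from outside the node, prefer
%% erlang:binary_to_term(Bin, [safe]).
from_store(Bin) -> erlang:binary_to_term(Bin).

%% In ETS no serialization is needed: the whole map is one entry's value.
cache_put(Tab, Key, State) -> ets:insert(Tab, {Key, State}).

cache_get(Tab, Key) ->
    case ets:lookup(Tab, Key) of
        [{Key, State}] -> {ok, State};
        [] -> not_found
    end.
```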
If you are dealing with multiple GenServer processes that need to be cached this way, e.g. every GenServer represents a unique customer's shopping cart, then that unique identifier can be used as the key under which the state is stored and later retrieved. This is particularly useful when you are running your shopping application on multiple nodes behind a load balancer, and each new request from a customer can be routed round-robin to any node.
When the request comes in:
fetch the unique identifier belonging to that customer, one way or another,
fetch the stored contents from wherever they live (Mnesia/Redis/...),
spawn a new GenServer process initialized with those contents,
perform the operations required for that request,
store the modified shopping cart back into Redis/Mnesia/wherever,
tear down the GenServer, and
respond to the request with whatever data is required.
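The steps above can be sketched in Erlang as follows. Assumptions: a public ETS table stands in for Redis/Mnesia, the customer identifier is passed in directly, a cart is a plain list of items, and the module name cart_request is hypothetical.

```erlang
-module(cart_request).
-behaviour(gen_server).
-export([handle_request/3]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Store is an ETS table standing in for Redis/Mnesia (an assumption).
handle_request(Store, CustomerId, {add_item, Item}) ->
    %% Fetch the stored cart for this customer (empty if none yet).
    Cart = case ets:lookup(Store, CustomerId) of
               [{CustomerId, Saved}] -> Saved;
               [] -> []
           end,
    %% Spawn a short-lived gen_server initialized with that cart.
    {ok, Pid} = gen_server:start(?MODULE, Cart, []),
    %% Perform the operations the request needs.
    ok = gen_server:call(Pid, {add_item, Item}),
    NewCart = gen_server:call(Pid, get_cart),
    %% Store the modified cart back.
    true = ets:insert(Store, {CustomerId, NewCart}),
    %% Tear the process down and respond.
    ok = gen_server:stop(Pid),
    {ok, NewCart}.

init(Cart) -> {ok, Cart}.

handle_call({add_item, Item}, _From, Cart) -> {reply, ok, [Item | Cart]};
handle_call(get_cart, _From, Cart) -> {reply, Cart, Cart}.

handle_cast(_Msg, Cart) -> {noreply, Cart}.
```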
Based on the benchmarks of ETS vs. Redis I ran on my local machine, it is no surprise that ETS is the more performant way to go, but ElastiCache is an excellent alternative if you don't want to bother setting up a dedicated Mnesia store.
If instead the state belongs to a specific GenServer that needs to keep running, then you are most likely looking at failover rather than managing individual user requests.
In that case, you could use the terminate/2 callback (https://hexdocs.pm/elixir/GenServer.html#c:terminate/2) to persist the state to some store first, and have init make the GenServer look in that store and reuse the cached state accordingly.
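The same idea in Erlang, as a minimal sketch: a hypothetical persistent_server saves its state in terminate/2 and reads it back in init/1, with a public ETS table owned by a longer-lived process standing in for the external store. Two caveats: terminate/2 runs on a supervisor shutdown only if the process traps exits (hence the process_flag/2 call), and it never runs when the process is killed with exit(Pid, kill), so this is best-effort persistence.

```erlang
-module(persistent_server).
-behaviour(gen_server).
-export([start/1, bump/1, value/1]).
-export([init/1, handle_call/3, handle_cast/2, terminate/2]).

%% Store: a public ETS table owned by a longer-lived process (assumption).
start(Store) -> gen_server:start(?MODULE, Store, []).

bump(Pid) -> gen_server:call(Pid, bump).
value(Pid) -> gen_server:call(Pid, value).

init(Store) ->
    %% Trap exits so that a supervisor shutdown also runs terminate/2.
    process_flag(trap_exit, true),
    %% Reuse the cached state if a previous incarnation saved one.
    Count = case ets:lookup(Store, ?MODULE) of
                [{_, Saved}] -> Saved;
                [] -> 0
            end,
    {ok, {Store, Count}}.

handle_call(bump, _From, {Store, N}) -> {reply, ok, {Store, N + 1}};
handle_call(value, _From, {_Store, N} = S) -> {reply, N, S}.

handle_cast(_Msg, S) -> {noreply, S}.

%% Persist the state on the way down (not run on exit(Pid, kill)).
terminate(_Reason, {Store, N}) ->
    ets:insert(Store, {?MODULE, N}),
    ok.
```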
The complicated part is the scenario where you have multiple applications running: which key will you use so that the crashed application reinitializes the GenServer with the correct state?
There are several open-ended questions here that depend on your exact use case, but what has been presented so far should give you a fair idea of when this caching solution makes sense and how to start implementing it.

How to implement state in Erlang?

I am designing an Erlang program with many workers (receive loops). These workers almost always update their state at the same time, i.e. massively concurrently, and there are so many of them that keeping their state in Mnesia would cause performance problems. So I am thinking of passing the state as an argument on each loop iteration and writing it to Mnesia some time later. Is this good practice? Is there a better way to do this? (Roughly speaking, I'm looking for something like an object instance with attributes in an object-oriented language.)
Thanks.
With Erlang, it is a good habit to view processes as actors, each with a dedicated and limited role. With this in mind, you will split your problem into different categories, such as:
1. Maintaining the state of a connection with a user over the Internet
2. Keeping information such as login, user profile, friends, shopping cart...
3. Logging events
...
For each role, you will have to decide whether the state information must survive the process.
In many cases this is not necessary (case 1), and the solution is simply to keep the state in the argument of the process's loop function. I encourage you to look at the OTP behaviors; gen_server and gen_fsm are made for this (gen_fsm has since been deprecated in favor of gen_statem).
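A minimal sketch of case 1, with a hypothetical counter as the state: the state lives only in the argument of loop/1, and each message produces the state for the next iteration. gen_server packages exactly this pattern with less boilerplate.

```erlang
-module(counter_loop).
-export([start/0, incr/1, value/1]).

%% The state is the argument of loop/1; nothing is stored elsewhere.
start() -> spawn(fun() -> loop(0) end).

incr(Pid) ->
    Pid ! incr,
    ok.

%% Synchronous read: tag the request with a unique reference.
value(Pid) ->
    Ref = make_ref(),
    Pid ! {value, self(), Ref},
    receive
        {Ref, Value} -> Value
    after 5000 ->
            exit(timeout)
    end.

loop(Count) ->
    receive
        incr -> loop(Count + 1);
        {value, From, Ref} -> From ! {Ref, Count}, loop(Count)
    end.
```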
Case 2 obviously manipulates permanent data, which must survive a process crash or even a hardware crash. This data will be stored using DETS, Mnesia, or any database suited to your problem (Redis, CouchDB, ...).
It is important to limit the information stored in an external database; otherwise you lose the benefit of a very powerful property, the absence of side effects. In other words, it is a very bad idea to make a process's behavior depend on external information.

Handling the saving of transient gen_server states when using a key-to-pid mechanism

I would like to know how to handle saving the state of transient gen_servers when they are associated with a key.
To associate keys with processes, I use a process called pidstore. Pidstore also starts processes when needed.
I give a Key and an {M, F, A} to pidstore; it looks up the key in global, then either returns the pid if found, or applies MFA (which must return {ok, Pid}), registers the Pid under the key in global, and returns the Pid.
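The lookup-or-start step described above might look like this. It is a minimal sketch with a hypothetical module name, assuming all calls are serialized (for example, made from inside the single pidstore process), so two callers never start the same key at once.

```erlang
-module(pidstore_lookup).
-export([lookup_or_start/2]).

%% Return the pid registered in global under Key, or start one with
%% {M, F, A} (which must return {ok, Pid}) and register it under Key.
lookup_or_start(Key, {M, F, A}) ->
    case global:whereis_name(Key) of
        Pid when is_pid(Pid) ->
            {ok, Pid};
        undefined ->
            {ok, Pid} = apply(M, F, A),
            %% Crashes with badmatch if another node registered Key in the
            %% meantime; the serialization assumption rules this out locally.
            yes = global:register_name(Key, Pid),
            {ok, Pid}
    end.
```

As a quick check, gen_event:start/0 (which returns {ok, Pid}) can serve as the MFA: the first call for {car, 23} starts a process, and a second call returns the same pid.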
I may have many inactive gen_servers with possibly huge states. So I've set up the handle_info callback to save the state to my database and then stop the process. The gen_servers are transient children of their supervisor, so they won't be restarted until something needs them again.
Here is where the problems start: if I call a process by its key, say {car, 23}, while the process representing {car, 23} is in the saving step of handle_info, I'll get the pid back as intended, because the process is still saving and has not finished. So I'll call my process with gen_server:call, but I'll never get a response (and will hit the default 5-second timeout) because the process is stopping. (PROBLEM A)
To solve this problem, the process could unregister itself from global, then save its state, then stop. But if I need it after it has unregistered but before the save is finished, a new process will be started, and it could load stale values from the database. (PROBLEM B)
To solve this in turn, I could ensure that loading and saving in the DB are queued and cannot run concurrently. This could be a bottleneck. (PROBLEM C)
I've been thinking about another solution: before saving, my processes could tell the pidstore that they are busy. The pidstore would keep a list of busy processes and respond 'busy' to any request for these keys.
When the save is done, the process would tell the pidstore no_more_busy, and the pidstore could start a new process when asked for the key. (Even if the old process has not finished, it is done saving, so it can take its time to die alone.)
This seems a bit messy to me, but it feels simpler to make several attempts to get the Pid from the key than to wrap every gen_server call to handle possible timeouts (when the process is finishing but still registered in global).
I'm a bit confused by all these half-problems and half-solutions. What design do you use in this situation, or how can I avoid it?
I hope my message is readable; please point out English errors too.
Thank you
Maybe you want to do the save-to-DB part inside a gen_server:call. That would keep other calls from being handled while you are writing to the DB.
Generally, it sounds like you have created a process registry. You might want to look into gproc (https://github.com/uwiger/gproc), which does a very good job of that if you want to register locally. With gproc you can do exactly what you described above: use a key to register a process. Maybe it would be good enough to register with gproc in your init function and unregister when writing to the DB. You could also write to the DB in your terminate function.
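A minimal sketch of "register in init, unregister when writing to the DB", using global instead of gproc so it runs without extra dependencies. The module idle_car, the 200 ms idle period, and SaveFun (a stand-in for the database write) are hypothetical. This ordering saves before unregistering: a call that arrives during the save still reaches the process, and when the process stops without replying, gen_server:call in the caller exits immediately with reason normal rather than waiting for the timeout, so the caller can retry and get a fresh process that loads the data just saved.

```erlang
-module(idle_car).
-behaviour(gen_server).
-export([start/3]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Hypothetical idle period after which the process saves and stops.
-define(IDLE_MS, 200).

%% SaveFun(Key, State) stands in for "write the state to the database".
start(Key, State, SaveFun) ->
    gen_server:start(?MODULE, {Key, State, SaveFun}, []).

%% Register under Key from init/1 (global here, gproc in the answer).
init({Key, State, SaveFun}) ->
    yes = global:register_name(Key, self()),
    {ok, {Key, State, SaveFun}, ?IDLE_MS}.

handle_call(get_state, _From, S = {_Key, State, _SaveFun}) ->
    {reply, State, S, ?IDLE_MS}.

handle_cast(_Msg, S) ->
    {noreply, S, ?IDLE_MS}.

%% No message for IDLE_MS: save first, then unregister, then stop.
handle_info(timeout, S = {Key, State, SaveFun}) ->
    SaveFun(Key, State),
    global:unregister_name(Key),
    {stop, normal, S};
handle_info(_Other, S) ->
    {noreply, S, ?IDLE_MS}.
```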
For now, I decided to stick with Erlang's "let it crash" philosophy. If a process receives messages while it is shutting down, those messages will not be answered and will trigger a gen_server:call/* timeout.
It will be tedious to handle this timeout in the right place, and I have not decided where yet, but that is specific to my application, so it is beside the point here.
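Wherever that handling ends up, it could be a small retry wrapper like the following minimal sketch (module name, retry count, and 50 ms pause are hypothetical). It catches the exits gen_server:call raises when the target is not registered (noproc), stopped while the call was pending (normal), or did not answer in time (timeout). Retrying after a timeout can execute a request twice, so this only suits idempotent requests.

```erlang
-module(retry_call).
-export([call/3]).

%% Call the gen_server registered in global under Key, retrying up to
%% Retries times on noproc, normal, or timeout exits.
call(Key, Request, Retries) ->
    try
        gen_server:call({global, Key}, Request)
    catch
        exit:{Reason, {gen_server, call, _}}
          when Retries > 0,
               (Reason =:= noproc orelse Reason =:= normal
                orelse Reason =:= timeout) ->
            %% Give the registry a moment to settle before retrying.
            timer:sleep(50),
            call(Key, Request, Retries - 1)
    end.
```

Once the retries are used up, the original exit propagates to the caller, which keeps the "let it crash" behavior.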

How to automatically delete specs of terminated children in a dynamic supervisor

No knowledge of USB is needed for this question; I describe the setup only to make the example more concrete.
I'm trying to implement a dynamic supervisor for specific devices on a USB bus.
These devices have addresses and appear and disappear during the lifetime of the system.
For each device I need a dynamic child for my supervisor.
These children are transient, so once they crash or terminate we don't restart them (because probably they are gone then).
I have a process that scans the USB port at certain times and produces a list of all addresses of the USB devices I want to handle.
I plan to call supervisor:which_children/1 before each scan to find out which devices are present but have no child process running.
In order to find out which addresses have children running I plan to create Id atoms for the childspec that contain the addresses (there are only a few addresses possible), e.g. adr_12 if the child handles address 12.
When I try to start/restart missing children, I run into the somewhat ugly situation that the child specs are not automatically deleted when a transient child terminates or crashes (at least I think so). So I would need code like this:
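case supervisor:start_child(my_sup, Spec) of
    {error, already_present} ->
        %% restart_child/2 takes the child Id, not the whole spec
        %% (this assumes a map-style child spec)
        supervisor:restart_child(my_sup, maps:get(id, Spec));
    Other ->
        Other
end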
Then there is the problem that I don't know if supervisor:which_children/1 also returns already terminated children.
So it would be best if children were deleted once they terminate.
Somehow all this feels inelegant to me so I'm asking myself (and you):
How can I resolve this most elegantly?
Is it better not to use a supervisor at all in this situation?
My gut feeling/knee-jerk reaction is: you need to use a simple_one_for_one supervisor for them, so that a child is removed from the supervisor when it stops. If you need to be able to grab a specific process for communication, I would use the gproc application for that (or an ETS table).
It sounds to me like the children you want to add dynamically to your supervisor are very similar to each other. Maybe a simple_one_for_one supervisor is what you need. These supervisors are "a simplified version of the one_for_one supervisor, where all child processes are dynamically added instances of the same process." Every child has the same child spec, so you don't need to specify it when you call supervisor:start_child/2.
Also, note that the above idea of creating atoms (e.g. adr_12) dynamically can be dangerous: the number of atoms in an Erlang system is limited (1,048,576 by default, adjustable with the +t emulator flag), and atoms are never garbage collected. See the documentation for details.
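A minimal sketch of the simple_one_for_one suggestion that also avoids per-address atoms by registering each child in global under the tuple {usb_device, Addr} (global stands in for gproc or an ETS table here). The module name usb_sup and the trivial device_loop/0 are hypothetical. On the which_children/1 question: in a one_for_one supervisor a terminated child still appears, with undefined in place of its pid, whereas a simple_one_for_one child simply disappears from the list once it stops.

```erlang
-module(usb_sup).
-behaviour(supervisor).
-export([start_link/0, start_device/1]).
-export([init/1, start_worker/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% simple_one_for_one: [Addr] is appended to the start MFA's arguments,
%% so this ends up calling usb_sup:start_worker(Addr).
start_device(Addr) ->
    supervisor:start_child(?MODULE, [Addr]).

init([]) ->
    SupFlags = #{strategy => simple_one_for_one, intensity => 5, period => 10},
    Device = #{id => usb_device,
               start => {?MODULE, start_worker, []},
               restart => transient},
    {ok, {SupFlags, [Device]}}.

%% Runs inside the supervisor process, so starts are serialized. The child
%% is registered in global under {usb_device, Addr}: a tuple, not an atom.
start_worker(Addr) ->
    case global:whereis_name({usb_device, Addr}) of
        undefined ->
            Pid = spawn_link(fun device_loop/0),
            yes = global:register_name({usb_device, Addr}, Pid),
            {ok, Pid};
        _Running ->
            {error, already_running}
    end.

%% Hypothetical device handler: idles until told to stop, then exits
%% normally; being transient, it is not restarted and is removed.
device_loop() ->
    receive
        stop -> ok
    end.
```

The scanner can then start every address it sees and ignore {error, already_running}, with no child-spec bookkeeping.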

How can I restore process state after a crash?

What's a good way to persist state when restarting a crashed process?
I have a supervisor in an OTP application that watches several "subsystem" gen_servers.
For example, one is a "weather" subsystem that generates a new weather state every 15 minutes and handles queries for the current state of the weather. (Think of the lemonade stand game.)
If that gen_server crashes, I want it to be restarted, but it should be restarted with the most recent weather state, not some arbitrary state hardcoded in init(). It wouldn't make sense for the simulation state to suddenly go from "hail storm" to "pleasant and breezy" just because of the crash.
I hesitate to use mnesia or ETS to store the state after every update because of the added complexity; is there an easier way?
As long as the state only has to survive at runtime, I would suggest using ETS. The benefit far outweighs the complexity. The API is simple, and if you're working with named tables, access is simple too. You only have to create the table before your gen_server is started by the supervisor, in a process that outlives the gen_server (for example in the supervisor's init/1), because an ETS table is deleted when its owner dies.
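A minimal sketch of that setup for the weather example. The table name weather_state, the default sunny, and the atoms used as weather states are assumptions. create_table/0 is meant to be called from the supervisor's init/1, which runs in the supervisor process, so the supervisor owns the table and it survives weather crashes.

```erlang
-module(weather).
-behaviour(gen_server).
-export([create_table/0, start_link/0, set/1, current/0]).
-export([init/1, handle_call/3, handle_cast/2]).

-define(TAB, weather_state).

%% Call from the supervisor's init/1, before the child is started.
create_table() ->
    ets:new(?TAB, [named_table, public, set]).

start_link() -> gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

set(W) -> gen_server:call(?MODULE, {set, W}).
current() -> gen_server:call(?MODULE, current).

init([]) ->
    %% Resume the last saved weather; the default is used only on first start.
    W = case ets:lookup(?TAB, current) of
            [{current, Saved}] -> Saved;
            [] -> sunny
        end,
    {ok, W}.

handle_call({set, W}, _From, _Old) ->
    %% Save every update so that a restart resumes from here.
    true = ets:insert(?TAB, {current, W}),
    {reply, ok, W};
handle_call(current, _From, W) ->
    {reply, W, W}.

handle_cast(_Msg, W) -> {noreply, W}.
```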
Two more complex alternatives:
Build a pair of processes: one for the job to do, one for maintaining the state. Because the second one is so simple, it would be really reliable.
A really silly one would be to replace the supervisor's child spec with one carrying the current state as its argument each time the state changes. (smile) No, just kidding.
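The first alternative, a dedicated state-keeper process, could look like this minimal sketch (module and function names hypothetical). The keeper only holds the latest state handed to it, so there is very little that can make it crash; the worker calls store/2 after each change and fetch/1 from its init/1. The keeper would be started outside the worker, e.g. as an earlier sibling under the same supervisor.

```erlang
-module(state_keeper).
-export([start/1, store/2, fetch/1]).

%% The keeper holds nothing but the last state handed to it.
start(Initial) ->
    spawn(fun() -> loop(Initial) end).

%% The worker calls store/2 after every change...
store(Keeper, State) ->
    Keeper ! {store, State},
    ok.

%% ...and fetch/1 from its init/1 to resume where the last instance stopped.
fetch(Keeper) ->
    Ref = make_ref(),
    Keeper ! {fetch, self(), Ref},
    receive
        {Ref, State} -> State
    after 5000 ->
            exit(keeper_timeout)
    end.

loop(State) ->
    receive
        {store, New} -> loop(New);
        {fetch, From, Ref} -> From ! {Ref, State}, loop(State)
    end.
```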
is there an easier way?
When the process dies, its supervisor only receives the exit reason, not the State, so the process would have to send its State (for example from terminate/2) to another process that stores it (in Mnesia or in its own state); a standard supervisor cannot do this itself. When your server starts again, init/1 makes a synchronous call to that process to get the State back. I don't have a real example, but I hope it makes sense.
Anyway, I don't really see a problem with storing State in Mnesia.
Sorry for my English :)
