Is there any particular reason why there is no start_monitor as an equivalent of spawn_monitor?
Is this simply not needed since gen_servers are usually started by supervisors?
I would like to get notifications when my temporary workers crash. What is the recommended way to do this in an OTP application?
My first idea was to have a gen_server which would monitor workers started by a dynamic supervisor.
More Info:
As far as I know supervisors provide controlled start, shutdown and controlled restart in case of a crash (to get back to a well defined state).
In addition to that, I would like to run a function when a worker process crashes.
For example, I have C nodes which connect to the Erlang node. Since a C node can't monitor processes (AFAIK) and is also limited in other ways in how it can interact with Erlang, I have a "proxy" process for each connecting C node in order to keep the C node as simple as possible.
The C nodes do rpc calls to Erlang using ei_rpc_to and process messages from the connected Erlang node. Messages are either results of rpc calls or "out-of-band" data/info for the C node.
The Erlang "proxy" process monitors its C node using monitor_node to detect if it vanished, but I also need a mechanism for informing the C node that its proxy process crashed. One way of detecting this would be when it does the next rpc call, since it would obviously fail, but since I already have the "out-of-band" message processing in place, I wanted to use that.
Another use case would be clients which do REST requests to the Erlang cluster. Each request in turn starts workers that perform some tasks (which may take a long time). After a while the external client may want to get the status of the task. The worker can, for example, update the status in a Mnesia table, but if it crashes, who will update the table with the failure status?
I know there are many ways of achieving this, but I would like to know what is the Erlang way of doing this.
2nd Edit:
After reading the docs I saw that in a gen_server, terminate will get called (if it is defined with a matching clause). Would this be a viable alternative to a separate monitoring process? This looks a bit messy, since terminate does not get called when receiving 'EXIT' from other processes, so I would also need to trap exits.
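A minimal sketch of the "separate monitoring process" idea from the question, assuming a hypothetical module name worker_watcher and a hypothetical watch/2 API (neither comes from OTP). The watcher monitors each worker with erlang:monitor/2 and runs a callback when the worker exits with a reason other than normal or shutdown. The callback could update a Mnesia row or send an out-of-band message to a C node.

```erlang
-module(worker_watcher).
-behaviour(gen_server).

%% Hypothetical API, for illustration only.
-export([start_link/0, watch/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Monitor Pid and call OnCrash(Reason) if it exits abnormally.
watch(Pid, OnCrash) when is_pid(Pid), is_function(OnCrash, 1) ->
    gen_server:call(?MODULE, {watch, Pid, OnCrash}).

%% State is a map from monitor reference to crash callback.
init([]) ->
    {ok, #{}}.

handle_call({watch, Pid, OnCrash}, _From, Watched) ->
    Ref = erlang:monitor(process, Pid),
    {reply, ok, Watched#{Ref => OnCrash}}.

handle_cast(_Msg, Watched) ->
    {noreply, Watched}.

%% A monitored worker has terminated.
handle_info({'DOWN', Ref, process, _Pid, Reason}, Watched) ->
    case maps:take(Ref, Watched) of
        {OnCrash, Rest} ->
            notify(OnCrash, Reason),
            {noreply, Rest};
        error ->
            {noreply, Watched}
    end;
handle_info(_Other, Watched) ->
    {noreply, Watched}.

%% Clean exits are not reported; everything else is treated as a crash.
notify(_OnCrash, normal) -> ok;
notify(_OnCrash, shutdown) -> ok;
notify(_OnCrash, {shutdown, _}) -> ok;
notify(OnCrash, Reason) ->
    %% A failing callback must not take the watcher down with it.
    try OnCrash(Reason) catch _:_ -> ok end,
    ok.
```

A worker (or whoever starts it) would call worker_watcher:watch(Pid, Fun) right after the supervisor returns the pid. Because this relies on monitors rather than links, neither the worker nor the watcher needs to trap exits.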
Related
Problem: I have a type FlyServer and need to iterate through all the Fly processes for various computations on the server.
How do I accomplish this?
One option is to have a GenServer list of all the FlyServer processes. But what if it crashes? And what if a player crashes and for whatever reason the GenServer keeping track of the processes isn't notified --- chime in if that scenario is unrealistic please.
I advise you to start your servers using a supervisor with a call to supervisor:start_child/2. The supervisor should use the strategy simple_one_for_one which is meant to create and supervise processes of the same kind.
Then you can get an up-to-date list of all the children using the function supervisor:which_children/1.
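As a rough sketch of this suggestion: the module name fly_sup and the trivial Fly process below are made up for illustration, and a real FlyServer would be a gen_server in its own module. Children are started with supervisor:start_child/2 and listed with supervisor:which_children/1.

```erlang
-module(fly_sup).
-behaviour(supervisor).

%% Hypothetical names, for illustration only.
-export([start_link/0, start_fly/1, fly_pids/0]).
-export([start_fly_link/1]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% With simple_one_for_one, the list given here is appended to the
%% argument list in the child spec's start MFA.
start_fly(Name) ->
    supervisor:start_child(?MODULE, [Name]).

%% Pids of all Fly processes that are currently alive.
fly_pids() ->
    [Pid || {_Id, Pid, _Type, _Mods} <- supervisor:which_children(?MODULE),
            is_pid(Pid)].

init([]) ->
    SupFlags = #{strategy => simple_one_for_one, intensity => 5, period => 10},
    Child = #{id => fly,
              start => {?MODULE, start_fly_link, []},
              restart => temporary,
              shutdown => 5000,
              type => worker,
              modules => [?MODULE]},
    {ok, {SupFlags, [Child]}}.

%% Stand-in for a real FlyServer:start_link/1. It runs inside the
%% supervisor process, so the new process is linked to the supervisor.
start_fly_link(Name) ->
    Pid = proc_lib:spawn_link(fun() -> fly_loop(Name) end),
    {ok, Pid}.

fly_loop(_Name) ->
    receive stop -> ok end.
```

Since the supervisor is the source of truth, there is no separate list to get out of sync. A Fly process that dies simply disappears from the next fly_pids() result.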
Every time a Fly process contacts the server, you can add its pid to a list, where the list is part of the gen_server's State.
The server can then monitor the Fly process, which means that when a Fly process terminates, the server will get sent a special message.
The server can implement a handle_info/2 clause that pattern matches the special message (a 'DOWN' tuple) and then removes the terminated process's pid from the list.
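The steps above could look roughly like this. The module name fly_registry and its functions are hypothetical, and the state is kept as a map from pid to monitor reference rather than a plain list so that a pid is never monitored twice.

```erlang
-module(fly_registry).
-behaviour(gen_server).

%% Hypothetical API, for illustration only.
-export([start_link/0, register_fly/1, flies/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

register_fly(Pid) when is_pid(Pid) ->
    gen_server:call(?MODULE, {register, Pid}).

flies() ->
    gen_server:call(?MODULE, flies).

%% State: #{Pid => MonitorRef}
init([]) ->
    {ok, #{}}.

handle_call({register, Pid}, _From, Flies) ->
    case maps:is_key(Pid, Flies) of
        true ->
            {reply, ok, Flies};
        false ->
            Ref = erlang:monitor(process, Pid),
            {reply, ok, Flies#{Pid => Ref}}
    end;
handle_call(flies, _From, Flies) ->
    {reply, maps:keys(Flies), Flies}.

handle_cast(_Msg, Flies) ->
    {noreply, Flies}.

%% The "special message": sent by the runtime when a monitored process ends.
handle_info({'DOWN', _Ref, process, Pid, _Reason}, Flies) ->
    {noreply, maps:remove(Pid, Flies)};
handle_info(_Other, Flies) ->
    {noreply, Flies}.
```

A monitor set on a pid that is already dead still delivers a 'DOWN' message (with reason noproc), so a Fly process that dies just before registering is cleaned up as well.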
One option is to have a GenServer list of all the FlyServer processes.
But what if it crashes?
Then terminate(Reason, State) will be called in the callback module, which can save State to an ets, dets, or mnesia table. Of course, if someone trips over the cord that connects the server running the FlyServer to an electrical outlet, then execution will immediately halt and terminate() will not be called. See distributed erlang for solutions.
I am just reading Manning's Erlang & OTP In Action. Very good book, I think. It contains a nice TCP server example but I'd like to write a UDP server. This is how I structured my app so far.
my_app                          % app behaviour
|-- my_sup                      % root supervisor
    |-- my_server               % gen_server to open UDP connection and dispatch
    |-- my_worker_sup           % simple_one_for_one supervisor to start workers
        |-- my_worker_server    % gen_server worker
So, my_app starts my_sup, which in turn starts my_worker_sup and my_server. The UDP connection is opened in my_server in active mode such that handle_info/2 is invoked on each new UDP message in response to which I call my_worker_sup:start_child/2 to pass the message to a new worker process for processing. (The last call to start_child/2 is in fact, as per the book's recommendation, wrapped in an API function to hide some of the details, but this is essentially what happens.)
Am I suffering from OTP fever? Should the my_worker_server really implement the gen_server behaviour? Do I need my_worker_sup at all?
I set it up like this so that I can use my_worker_sup as a factory via the start_child/2 call, but I only use the worker's init/1 and handle_info(timeout,State) functions to first set up state and then to process the message before shutting the worker down.
Should I just spawn the worker directly? Is another behaviour better suited, perhaps?
Thanks,
HC
The key answer to this question is: "how do you want your application to crash?"
If a worker dies, then what should happen? If this should stop everything, including the UDP connection, then surely you can just spawn_link them under the my_server directly, no supervisor tree needed. But if you want them to be able to gracefully restart or something else, then the above diagram is usually better. Perhaps add a monitor on the workers from my_server so it can keep a book of who is alive.
In my utp erlang library, I have almost the same construction. A master handles the UDP socket and forwards to workers based on a routing table kept in ETS. Each worker keeps a connection state and can handle the incoming information.
Since you don't track state, your best bet is probably to run the workers via proc_lib:spawn_link and hook them to the simple_one_for_one supervisor as transient processes. That way, too many crashes will be propagated up the supervisor tree, but workers are still allowed to exit with reason normal. This allows you to have them run exactly once.
Note that you could also handle everything directly in my_server, but then you will not be able to process data concurrently. This may or may not be acceptable. The general rule is to spawn a new process when you have work that needs to run concurrently with other work, that blocks, or that otherwise behaves in a way that would hold up the caller.
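A sketch of that arrangement, reusing the my_worker_sup name from the question. The start_child/2 arguments, the ReplyTo pid and the process_packet/2 body are placeholders for whatever the real worker does with a UDP message.

```erlang
-module(my_worker_sup).
-behaviour(supervisor).

-export([start_link/0, start_child/2]).
-export([start_worker/2, process_packet/2]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Called by my_server for every incoming UDP message.
start_child(Packet, ReplyTo) ->
    supervisor:start_child(?MODULE, [Packet, ReplyTo]).

init([]) ->
    SupFlags = #{strategy => simple_one_for_one, intensity => 10, period => 1},
    Worker = #{id => worker,
               start => {?MODULE, start_worker, []},
               %% transient: restarted after a crash, left alone after a
               %% normal exit, so a successful worker runs exactly once.
               restart => transient,
               shutdown => brutal_kill,
               type => worker,
               modules => [?MODULE]},
    {ok, {SupFlags, [Worker]}}.

%% Runs in the supervisor process, so spawn_link links worker and supervisor.
start_worker(Packet, ReplyTo) ->
    Pid = proc_lib:spawn_link(?MODULE, process_packet, [Packet, ReplyTo]),
    {ok, Pid}.

%% Placeholder work: report the packet size. Returning from this function
%% ends the process with reason normal.
process_packet(Packet, ReplyTo) when is_binary(Packet) ->
    ReplyTo ! {processed, byte_size(Packet)},
    ok.
```

No gen_server is involved: the worker is a plain function run by proc_lib, which still gives it OTP-style crash reports. If more than intensity crashes happen within period seconds, the supervisor itself gives up and the failure moves up the tree.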
I'm still kind of new to the Erlang/OTP world, so I guess this is a pretty basic question. Nevertheless I'd like to know what's the correct way of doing the following.
Currently, I have an application with a top supervisor. The latter will supervise workers that call gen_tcp:accept (sleeping on it) and then spawn a process for each accepted connection. Note: To this question, it is irrelevant where the listen() is done.
My question is about the correct way of making these workers (the ones that sleep on gen_tcp:accept) respect the otp design principles, in such a way that they can handle system messages (to handle shutdown, trace, etc), according to what I've read here: http://www.erlang.org/doc/design_principles/spec_proc.html
So,
Is it possible to use one of the available behaviors like gen_fsm or gen_server for this? My guess would be no, because of the blocking call to gen_tcp:accept/1. Is it still possible to do it by specifying an accept timeout? If so, where should I put the accept() call?
Or should I code it from scratch (i.e. not using an existing behavior) like the examples in the above link? In this case, I thought about a main loop that calls gen_tcp:accept/2 instead of gen_tcp:accept/1 (i.e. specifying a timeout), immediately followed by a receive block, so I can process the system messages. Is this correct/acceptable?
Thanks in advance :)
As Erlang is event driven, it is awkward to deal with code that blocks as accept/{1,2} does.
Personally, I would have a supervisor which has a gen_server for the listener, and another supervisor for the accept workers.
Hand-roll an accept worker that times out (gen_tcp:accept/2), effectively polling (the awkward part), rather than receiving a message for status.
This way, if a worker dies, it gets restarted by the supervisor above it.
If the listener dies, it restarts, but not before restarting the worker tree and supervisor that depended on that listener.
Of course, if the top supervisor dies, it gets restarted.
However, if you call supervisor:terminate_child/2 on the tree, then you can effectively disable the listener and all acceptors for that socket. Later, supervisor:restart_child/2 can restart the whole listener+acceptor worker pool.
If you want an app to manage this for you, cowboy implements the above. Although HTTP-oriented, it easily supports a custom handler for whatever protocol is to be used instead.
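The disable/re-enable step can be sketched as follows. The names net_sup, pool, disable/0 and enable/0 are invented for the example, and the child here is a dummy process standing in for the listener+acceptor subtree.

```erlang
-module(net_sup).
-behaviour(supervisor).

%% Hypothetical names, for illustration only.
-export([start_link/0, disable/0, enable/0]).
-export([start_pool/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Stops the child but keeps its child spec in the supervisor.
disable() ->
    supervisor:terminate_child(?MODULE, pool).

%% Starts the child again from the retained child spec.
enable() ->
    supervisor:restart_child(?MODULE, pool).

init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 10},
    Pool = #{id => pool,
             start => {?MODULE, start_pool, []},
             restart => permanent,
             shutdown => 5000,
             type => worker},
    {ok, {SupFlags, [Pool]}}.

%% Stand-in for starting the real listener+acceptor subtree.
start_pool() ->
    Pid = proc_lib:spawn_link(fun pool_loop/0),
    {ok, Pid}.

pool_loop() ->
    receive stop -> ok end.
```

In a real tree the pool child would itself be a supervisor (type => supervisor), so terminating it shuts down the listener and every acceptor below it in one call.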
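For the hand-rolled acceptor, a minimal sketch of a "special process" along the lines of the design principles linked in the question might look like this. The module name acceptor, the 500 ms poll interval and the echo handler are assumptions made for the example. The sys and proc_lib calls are the standard ones for special processes.

```erlang
-module(acceptor).

-export([start_link/1]).
-export([init/2]).
-export([system_continue/3, system_terminate/4, system_code_change/4]).

start_link(ListenSocket) ->
    proc_lib:start_link(?MODULE, init, [self(), ListenSocket]).

init(Parent, ListenSocket) ->
    Debug = sys:debug_options([]),
    proc_lib:init_ack(Parent, {ok, self()}),
    loop(Parent, Debug, ListenSocket).

loop(Parent, Debug, ListenSocket) ->
    %% Poll: wait at most 500 ms for a connection (assumed interval).
    case gen_tcp:accept(ListenSocket, 500) of
        {ok, Socket} -> handle_connection(Socket);
        {error, timeout} -> ok;
        {error, Error} -> exit({accept_failed, Error})
    end,
    %% Then handle any pending system message without blocking.
    receive
        {system, From, Request} ->
            sys:handle_system_msg(Request, From, Parent, ?MODULE,
                                  Debug, ListenSocket)
    after 0 ->
        loop(Parent, Debug, ListenSocket)
    end.

%% Hand the socket to a new process so the acceptor can keep accepting.
handle_connection(Socket) ->
    Pid = spawn(fun() -> receive go -> echo(Socket) end end),
    case gen_tcp:controlling_process(Socket, Pid) of
        ok -> Pid ! go;
        {error, _} -> exit(Pid, kill)
    end.

%% Placeholder protocol handler: echo everything back.
echo(Socket) ->
    case gen_tcp:recv(Socket, 0) of
        {ok, Data} -> gen_tcp:send(Socket, Data), echo(Socket);
        {error, _} -> gen_tcp:close(Socket)
    end.

%% Callbacks required by sys:handle_system_msg/6.
system_continue(Parent, Debug, ListenSocket) ->
    loop(Parent, Debug, ListenSocket).

system_terminate(Reason, _Parent, _Debug, _ListenSocket) ->
    exit(Reason).

system_code_change(ListenSocket, _Module, _OldVsn, _Extra) ->
    {ok, ListenSocket}.
```

The process does not trap exits, so a shutdown signal from its supervisor terminates it directly. System requests such as sys:get_state/1 or sys:suspend/1 are answered within one poll interval.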
I've actually found the answer in another question: Non-blocking TCP server using OTP principles and here http://20bits.com/article/erlang-a-generalized-tcp-server
EDIT: The specific answer that was helpful to me was: https://stackoverflow.com/a/6513913/727142
You can make it as a gen_server similar to this one: https://github.com/alinpopa/qerl/blob/master/src/qerl_conn_listener.erl.
As you can see, this process is doing tcp accept and processing other messages (e.g. stop(Pid) -> gen_server:cast(Pid,{close}).)
HTH,
Alin
In all Erlang supervisor examples I have seen so far, there usually is a "master" supervisor which supervises the whole tree (or at least is the root node in the supervisor tree). What if the "master" supervisor breaks? How should the "master" supervisor be supervised? Is there any typical pattern?
The top supervisor is started in your application's start/2 callback using start_link, which means that it is linked to the application process. If the application process receives an exit signal from the top supervisor dying, it does one of two things:
If the application is started as a permanent application, the entire node is terminated (and maybe restarted using HEART).
If the application is started as temporary, the application stops running; no restart attempts will be made.
Typically a supervisor is set up to "only" supervise other processes, which means that no user-written code is executed by the supervisor, so it is very unlikely to crash.
Of course, this cannot be enforced ... So the typical pattern is to not have any application-specific logic in the supervisor ... It should only supervise and do nothing else.
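To make the start/2 link concrete, here is a small sketch. The names demo_app and demo_sup are invented, and both behaviours are put into one module only to keep the example in a single file (normally the supervisor lives in its own module).

```erlang
-module(demo_app).
-behaviour(application).
-behaviour(supervisor).

%% application callbacks
-export([start/2, stop/1]).
%% supervisor callback
-export([init/1]).

%% The application master calls start/2. start_link/3 links the top
%% supervisor to it, so the master notices if the supervisor dies.
start(_StartType, _StartArgs) ->
    supervisor:start_link({local, demo_sup}, ?MODULE, []).

stop(_State) ->
    ok.

%% An empty top supervisor: no application logic runs here, which is
%% why it is very unlikely to crash on its own.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 1, period => 5},
    {ok, {SupFlags, []}}.
```

How the death of demo_sup is handled depends on how the application was started: application:start(demo_app, permanent) takes the node down with it, while application:start(demo_app, temporary), which is also what application:start/1 uses, only stops the application.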
Good question. I have to concur that all of the examples and tutorials mostly ignore the issue - even if occasionally someone mentions the issue (without providing an example solution):
If you want reliability, use at least two computers, and then make them supervise each other. How to actually implement that with OTP, however, appears (with the current state of documentation and tutorials) to be somewhere between well hidden and secret.
I have been researching Mobile Agents, and was wondering if it is possible to send a running process to another node in Erlang. I know it is possible to send a message to a process on another node. I know it is possible to load a module on all nodes in a cluster. Is it possible to move a process that might be in some state on a particular node to another node and resume its state? That is, does Erlang provide strong mobility? Or is it possible to provide strong mobility in Erlang?
Yes, it is possible, but there is no "Move process to node" call. However, if the process is built with a feature for migration, you can certainly do it by sending the function of the process and its state to another node and arrange for a spawn there. To get the identity of the process right, you will need to use either the global process registry or gproc, as the process will change pid.
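A rough sketch of such a migration-aware process, assuming the module (here called migrate, a made-up name) is loaded on both nodes and that the state is a plain term with no ETS tables or process-dictionary entries behind it. The process hands its state to a fresh process on the target node, which takes over the global name.

```erlang
-module(migrate).

%% Hypothetical API, for illustration only.
-export([start/2, state/1, move/2]).
-export([loop/2, resume/3]).

%% Start a process holding State and register it under a global name.
start(Name, State) ->
    Pid = spawn(?MODULE, loop, [Name, State]),
    yes = global:register_name(Name, Pid),
    Pid.

%% Ask the named process for its current state.
state(Name) ->
    global:send(Name, {get, self()}),
    receive {state, State} -> State after 5000 -> timeout end.

%% Ask the named process to continue on Node.
move(Name, Node) ->
    global:send(Name, {move, Node, self()}),
    receive {moved, NewPid} -> {ok, NewPid} after 5000 -> {error, timeout} end.

loop(Name, State) ->
    receive
        {get, From} ->
            From ! {state, State},
            loop(Name, State);
        {put, NewState} ->
            loop(Name, NewState);
        {move, Node, ReplyTo} ->
            %% Ship name and state to a new process on Node, then let this
            %% process end. Messages still in this mailbox are lost.
            spawn(Node, ?MODULE, resume, [Name, State, ReplyTo]),
            ok
    end.

%% Runs on the target node: take over the global name, then carry on.
resume(Name, State, ReplyTo) ->
    yes = global:re_register_name(Name, self()),
    ReplyTo ! {moved, self()},
    loop(Name, State).
```

Clients address the process only through the global name, so the change of pid is invisible to them. A production version would also need to drain or forward the old mailbox.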
There are other considerations as well: The process might be using an ETS table whose data are not present on the other node, or it may have stored stuff in the process dictionary (state from the random module comes to mind).
The general consensus in Erlang is that processes are not meant to move between machines. Rather, one either arranges for a takeover of applications between nodes should a node die, or distributes the system so that data are already present on another machine. In any case, the main problem of making state persistent in the event of errors still holds, mobility or not, and distribution is a nice tool to solve the persistence problem.