In my app, I plan to have many worker processes that can potentially spend hours doing their work.
I want the user to be able to stop and delete the workers.
Is it acceptable to kill the process with exit/2?
Will it terminate the process even if it's in the middle of doing some work (e.g. downloading a file)?
Do supervisors offer a similar mechanism for stopping and removing children that are in the middle of doing some work?
Is it acceptable to kill the process with exit/2? Will it terminate the process even if it's in the middle of doing some work (e.g. downloading a file)?
Yes. In order to terminate a process you may use exit/2, as you said. The termination procedure differs depending on whether you set the Reason argument to normal, to kill, or to any other reason.
It is explained very well in the Error Handling documentation; for a more detailed explanation, see this.
So you may choose whatever fits your application.
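For illustration, a minimal sketch of the three cases (WorkerPid is a hypothetical worker process, not something from your code):

exit(WorkerPid, normal),    %% ignored unless the worker traps exits
exit(WorkerPid, shutdown),  %% ordinary termination request; a trapping worker receives {'EXIT', From, shutdown}
exit(WorkerPid, kill).      %% cannot be trapped; the worker dies immediately, even mid-download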
Do supervisors offer a similar mechanism for stopping and removing children that are in the middle of doing some work?
Yes. As mentioned in the comment, there is very good, detailed documentation for it in Erlang's Supervisor documentation. I suggest you read all of it carefully, but the main parts you're looking for are:
Defining the child_spec() when starting a child (mainly the shutdown and restart options).
terminate_child/2 for the actual termination of a child.
delete_child/2 for deleting a child after calling terminate_child/2.
You can read more about it here.
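For illustration only, here is a sketch of how those calls fit together; the supervisor name my_sup, the worker module my_worker, and the child id download_worker are assumed names, not part of any real API:

ChildSpec = #{id => download_worker,
              start => {my_worker, start_link, []},
              restart => transient,
              shutdown => 5000},   %% give the worker 5 seconds to clean up before it is killed
{ok, _Pid} = supervisor:start_child(my_sup, ChildSpec),
%% ... later, when the user stops and deletes the worker:
ok = supervisor:terminate_child(my_sup, download_worker),
ok = supervisor:delete_child(my_sup, download_worker).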
In an example such as examples/allegro_hand, where a main thread advances the simulator and another sends commands to it over LCM, what's the cleanest way for each process to kill the other?
I'm struggling to kill the side process when the main process dies. I've wrapped the AdvanceTo with a try, and catch the error thrown when "MultibodyPlant's discrete update solver failed to converge".
I can manually publish a boolean with drake::lcm::Publish within the catch block. In the side process, I subscribe and use something like this HandleStatus to process incoming messages. The corresponding HandleStatus isn't called unless I add a while(0 == lcm_.handleTimeout(10)) like this. When I do, the side process gets stuck waiting for a message, which doesn't come unless the simulation throws. Any advice for how to handle this case?
I'm able to kill the main process (allegro_single_object_simulation) by sending a boolean over LCM from the other (run_twisting_mug), AdvanceTo-ing to a smaller timestep within the main process, and checking the received boolean after each of the smaller AdvanceTos. This seems to work reliably, but may not be the cleanest solution.
If I'm thinking about this the wrong way and there's a better way to run an example like this, please let me know. Thanks!
We often use a process manager, like https://github.com/RobotLocomotion/libbot/tree/master/bot2-procman, to start and manage all of our processes. The ROS ecosystem has similar tools.
procman is open and available for you to use, but we don't consider it officially "supported" by the Drake developers.
The Erlang world does not use try-catch as much as mainstream languages do. I want to know how the performance of restarting a process compares with using try-catch in a mainstream language.
An Erlang process has its own small stack and heap, which are actually allocated on the OS heap. Why is it effective to restart it?
I hope someone can give me deeper insight into what the BEAM does when a restart operation is invoked on a process.
Besides, what about a gen_server, which maintains state in its process? Will the state be copied when the gen_server restarts?
Thanks
I recommend having a read of https://ferd.ca/the-zen-of-erlang.html
Here's my understanding: restarting is effective for fixing a "Heisenbug", which only happens when the (Erlang) process is in some weird state and/or trying to handle a "weird" message.
The presumption is that you revert to a known good state (by restarting), which should handle all normal messages correctly. Restarting is not meant to "fix all the problems", and certainly not things like bad configuration or a missing internet connection. By this definition we can see that it is very dangerous to copy the state when the crash happened and try to recover from that, because doing so defeats the whole point of going back to a known state.
The second point is: say this process only crashes when handling an action that only 0.001% (or whatever percentage is considered negligible) of all your users actually use, and it's not really important (e.g. a minor UI detail); then it's totally fine to just let it crash and restart, and you don't need to fix it. I think this can be a productivity enabler in such cases.
Regarding your question in the OP comment: yes, the starting state is just whatever your init callback returns; you can either build the entire starting state there or source it from other places, depending entirely on the use case.
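A minimal sketch of that point with an assumed gen_server: whatever init/1 returns is the known good state the process comes back to after every restart.

init([]) ->
    %% Build (or fetch) the starting state here; every restart goes back through this.
    State = #{jobs => [], retries => 0},
    {ok, State}.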
What I'd like to do is change my supervisor to make a best effort to keep children running, but give up if their crash rate exceeds the intensity. That way the remainder of the children keep running. This doesn't appear to be possible with the existing supervisor configurations, though, so it looks like my only option may be to implement my own supervisor so I can have it behave this way when it receives EXIT.
Is there a way to implement custom OTP supervisor behavior like this without writing your own supervisor?
It sounds to me like what you want is an individual supervisor for each child, responsible for keeping it alive up to a limit, as you say, and, as a layer above that, a single supervisor (one-for-one or simple-one-for-one) whose children are marked as temporary, so that when one of them gives up, the rest stay running.
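A rough sketch of that layout, with assumed names: each per_worker_sup supervises one my_worker with its own restart intensity, and the top-level supervisor starts every per_worker_sup as a temporary child, so a per-worker supervisor that gives up is not restarted and the others keep running.

%% per_worker_sup: gives up after 5 crashes within 60 seconds
init(WorkerArgs) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 60},
    Child = #{id => worker, start => {my_worker, start_link, [WorkerArgs]}},
    {ok, {SupFlags, [Child]}}.

%% In the top-level supervisor, each per_worker_sup child spec uses:
%%   restart => temporary, type => supervisor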
You can't "extend" Supervisor to add different supervision behaviour, but you don't have to start from scratch either. The :supervisor module itself is implemented on top of :gen_server, so I would consult the source code of :supervisor (which you can find here) if you do find yourself needing some kind of custom supervision behaviour; it will give you a base to build from to avoid some of the pitfalls which you are likely to encounter.
I can expand my answer about alternative solutions once I have a better idea of your use case. As I mentioned in my comment, it sounds to me that you are likely doing something during init/1 of your processes which is prone to failure; init/1 is not the place to handle those things, because if it becomes impossible to succeed at that action temporarily, you will almost certainly blow the max restart intensity of the supervisor.
For example, let's assume you have a process which talks to the database, and requires a database connection; you do not want to try and connect to the database during init/1. Rather you should acquire the connection post-init (perhaps on first-use, or by immediately sending a post-init message to the process using Process.send_after(self(), :connect, 0)), and if the connection fails, return something like {:error, :database_unavailable} to any callers while you attempt to re-establish the connection. Designing with this approach will allow your supervision tree to remain stable, and it instead pushes the decision on how to deal with failure down to the clients who likely have better information on how it impacts them (i.e., should they retry the operation, return an error to their caller, exit with an exception, etc.)
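The same idea sketched in Erlang (module and helper names such as db_connect/1 are assumptions for illustration, not a real API; remaining callbacks omitted):

init(Args) ->
    self() ! connect,                       %% defer the risky work; init/1 stays fast and safe
    {ok, #{conn => undefined, args => Args}}.

handle_info(connect, State = #{args := Args}) ->
    case db_connect(Args) of                %% hypothetical helper returning {ok, Conn} | {error, Reason}
        {ok, Conn} ->
            {noreply, State#{conn := Conn}};
        {error, _Reason} ->
            erlang:send_after(1000, self(), connect),   %% retry later instead of crashing
            {noreply, State}
    end.

handle_call(_Request, _From, State = #{conn := undefined}) ->
    %% push the failure decision to the caller while the connection is down
    {reply, {error, database_unavailable}, State}.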
You can use director too; it's more flexible for solving this problem.
I wrote a new library called director.
It's a supervisor library.
One of its features is that you can give a fun of arity 2 to director, and director will call that function for every crash of a process; the first argument is the crash reason and the second is the crash count. For example:
-module(director_test).
-behaviour(director).
-export([start_link/0, init/1]).

start_link() ->
    director:start_link(?MODULE, []).

init([]) ->
    ChildSpec = #{id => foo,
                  start => {m, f, args},
                  plan => [fun my_plan/2],
                  count => infinity},
    {ok, [ChildSpec]}.

my_plan(normal, Count) when Count rem 10 == 0 ->
    %% Every 10th time the process crashes with reason normal,
    %% director restarts it after waiting 3000 milliseconds.
    {restart, 3000};
my_plan(normal, _Count) ->
    %% Otherwise, if the process crashed with reason normal, director restarts it immediately.
    restart;
my_plan(killed, _Count) ->
    %% If the process was killed, director deletes it from its children.
    delete;
my_plan(Reason, _Count) ->
    %% For any other reason, director stops with reason {foo_crashed, Reason}.
    {stop, {foo_crashed, Reason}}.
I announced my library in Slack and they were wondering about writing a new supervisor in this way!
Someone said that "I tend to not let the supervisor handle back-off".
In the end they did not give me clear information, and I think I need to know more about the supervisor and its duties, etc.
I think that a supervisor is a process that should understand when to restart which child, when to delete which child, and when not to restart which child. Am I right?
Can you tell me some good features of the OTP supervisor that I do not have in Director? (List of director's features)
You are mixing the ideas of supervision and management.
Supervision is already a part of OTP. It is the basic idea that:
No process can ever possibly become an orphan
Crashes will be restarted or aborted, and this is an architectural decision made before internal logic is written.
Crashes can be logged externally (handled by a process other than whatever failed).
Error handling code, crash forensics, and so on never occur as part of supervision. Ever. (Complex logic leads to complex weirdness, and supervision needs to be simple, robust, and reliable.)
Management is something that may or may not be present in your system, so it is left up to you. It is the idea that you would have a single (usually named) process that guides the overall high-level task that your (supervised) workers are doing. Having a manager process gives you a single point of control for the overall effort being done, which also means it is a single place where you can tell that overall effort to start, stop, suspend itself, etc., and this is where you could add additional logic about selective restarts based on some crash condition.
Think of "supervision" as a low-level, system framework type idea. It is always the same in all programs just like opening a file or handling a network socket would be. Think of management as one discrete chunk of the actual problem your program needs to solve to accomplish its work.
Management may or may not be complex. Supervision must always be uniform and simple. Giving a supervisor too much responsibility makes them difficult to understand and debug, and often leads to business problems -- an overloaded supervisor can be a major problem in a system. Don't burden your supervisors with high-level management tasks.
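A minimal sketch of such a manager, with assumed names (job_manager, and workers exposing a hypothetical my_worker:stop/1): one registered process that owns the high-level decisions while the supervisors stay simple.

-module(job_manager).
-behaviour(gen_server).
-export([start_link/0, stop_all/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

stop_all() ->
    gen_server:call(?MODULE, stop_all).

init([]) ->
    {ok, #{workers => []}}.

handle_call(stop_all, _From, State = #{workers := Pids}) ->
    %% the business decision to stop everything lives here, not in a supervisor
    [my_worker:stop(Pid) || Pid <- Pids],   %% my_worker:stop/1 is an assumed API
    {reply, ok, State#{workers := []}};
handle_call(_Other, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.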
I wrote an article about the "service -> worker pattern" in Erlang a while back. Hopefully it informs more than it confuses: https://zxq9.com/archives/1311
Please do not take this personally. You have asked for feedback and I'm trying to give it to you.
After quickly looking at the docs and the code, I think the main problems with your library are:
1) You are introducing some complexity in an area where it's normally not needed. In the vast majority of Erlang programs you don't want to analyse why a process has crashed. Analysing it is prone to errors. So the "normal" solution is just to restart the process. If you introduce any logic at this point, you probably introduce some errors too. Such a program is harder to reason about, and the advantages are disputable at best.
2) You are assuming that the exit reason is the reason why the process itself exited. This is not necessarily true: the reason could have been propagated from one of its linked processes. If you really wanted to react to all possible exit reasons, you would have to compute a transitive closure over the process's exit reasons, all its children's exit reasons, all their children's exit reasons, and so on. You would also have to change it whenever any of the components changes, which is a very bad practice and very error prone. And the complexity introduced (see 1) explodes very badly.
3) You introduce some "introspection" logic outside the context where the internal logic should ideally be kept, i.e. some knowledge about the internal workings of the process is used outside of its module, in the director's plan. This breaks encapsulation somewhat. The "normal" supervisor knows just how to start the process; it doesn't need any more information about the process internals.
4) Last but not least: you are probably solving a non-existent problem. Instead of developing a whole new solution, you should clearly identify the problems with an existing solution and try to solve them very directly and minimally.
I'd like some thoughts on whether using fork{} to 'background' a process from a Rails app is such a good idea or not...
From what I gather, fork{my_method; Process#setsid} does in fact do what it's supposed to do:
1) creates another process with a different PID
2) doesn't interrupt the calling process (e.g. it continues w/o waiting for the fork to finish)
3) executes the child until it finishes
...which is cool, but is it a good idea? What exactly is fork doing? Does it create a duplicate instance of my entire Rails Mongrel/Passenger instance in memory? If so, that would be very bad. Or does it somehow do it without consuming a huge swath of memory?
My ultimate goal was to do away with my background daemon/queue system in favor of forking these processes (primarily sending emails) -- but if this won't save memory then it's definitely a step in the wrong direction
The fork does make a copy of your entire process and, depending on exactly how you are hooked up to the application server, a copy of that as well. As noted in the other discussion, this is done with copy-on-write, so it's tolerable. Unix is built around fork(2), after all, so it has to manage it fairly fast. Note that any partially buffered I/O, open files, and lots of other stuff are also copied, as well as the state of the program that is spring-loaded to write them out, which would be incorrect.
I have a few thoughts:
Are you using Action Mailer? It seems like email would be easily done with AM or by IO.popen of something. (popen will do a fork, but it is immediately followed by an exec.)
immediately get rid of all that state by executing Process.exec of another Ruby interpreter plus your functionality. If there is too much state to transfer or you really need to use those duplicated file descriptors, you might do something like IO#popen instead so you can send the subprocess work to do. The system will automatically share the pages containing the text of the Ruby interpreter of the subprocess with the parent.
in addition to the above, you might want to consider the use of the daemons gem. While your rails process is already a daemon, using the gem might make it easier to keep one background task running as a batch job server, and make it easy to start, monitor, restart if it bombs, and shut down when you do...
if you do exit from a fork(2)ed subprocess, use exit! instead of exit
having a message queue and a daemon already set up, like you do, kinda sounds like a good solution to me :-)
Be aware that it will prevent you from using JRuby on Rails as fork() is not implemented (yet).
The semantics of fork is to copy the entire memory space of the process into a new process, but many (most?) systems will do that by just making a copy of the virtual memory tables and marking it copy-on-write. That means that (at first, at least) it doesn't use that much more physical memory, just enough to make the new tables and other per-process data structures.
That said, I'm not sure how well Ruby, RoR, etc. interact with copy-on-write forking. In particular, garbage collection could be problematic if it touches many memory pages (causing them to be copied).