Cleanest way to kill Drake simulation from another process - drake

In an example such as examples/allegro_hand, where a main thread advances the simulator and another sends commands to it over LCM, what's the cleanest way for each process to kill the other?
I'm struggling to kill the side process when the main process dies. I've wrapped the AdvanceTo with a try, and catch the error thrown when
MultibodyPlant's discrete update solver failed to converge
I can manually publish a boolean with drake::lcm::Publish within the catch block. In the side process, I subscribe and use something like this HandleStatus to process incoming messages. The corresponding HandleStatus isn't called unless I add a while(0 == lcm_.handleTimeout(10)) like this. When I do, the side process gets stuck waiting for a message, which doesn't come unless the simulation throws. Any advice for how to handle this case?
I'm able to kill the main process (allegro_single_object_simulation) by sending a boolean over LCM from the other (run_twisting_mug), AdvanceTo-ing to a smaller timestep within the main process, and checking the received boolean after each of the smaller AdvanceTos. This seems to work reliably, but may not be the cleanest solution.
If I'm thinking about this the wrong way and there's a better way to run an example like this, please let me know. Thanks!

We often use a process manager, like https://github.com/RobotLocomotion/libbot/tree/master/bot2-procman
to start and manage all of our processes. The ROS ecosystem has similar tools.
procman is open and available for you to use, but we don't consider it officially "supported" by the drake developers.

Related

Is gen_server restart strategy copy state?

Erlang world not use try-catch as usual. I'm want to know how about performance when restart a process vs try-catch in mainstream language.
A Erlang process has it's small stack and heap concept which actually allocate in OS heap. Why it's effective to restart it?
Hope someone give me a deep in sight about Beam what to do when invoke a restart operation on a process.
Besides, how about use gen_server which maintain state in it's process. Will cause a copy state operate when gen_server restart?
Thanks
I recommend having a read of https://ferd.ca/the-zen-of-erlang.html
Here's my understanding: restart is effective for fixing "Heisenbug" which only happens when the (Erlang) process is in some weird state and/or trying to handle a "weird" message.
The presumption is that you revert to a known good state (by restarting), which should handle all normal messages correctly. Restart is not meant to "fix all the problems", and certainly not for things like bad configuration or missing internet connection. By this definition we can see it's very dangerous to copy the state when crash happened and try to recover from that, because this is defeating the whole point of going back to a known state.
The second point is, say this process only crashes when handling an action that only 0.001% (or whatever percentage is considered negligible) of all your users actually use, and it's not really important (e.g. a minor UI detail) then it's totally fine to just let it crash and restart, and don't need to fix it. I think it can be a productivity enabler for these cases.
Regarding your questions in the OP comment: yes just whatever your init callback returns, you can either build the entire starting state there or source from other places, totally depend on the use case.

Best-effort OTP supervision

What I'd like to do is change my supervisor to make a best effort to keep children running, but give up if their crash rate exceeds the intensity. That way the remainder of the children keep running. This doesn't appear to be possible with the existing supervisor configurations, though, so it looks like my only option may be to implement my own supervisor so I can have it behave this way when it receives EXIT.
Is there a way to implement custom OTP supervisor behavior like this without writing your own supervisor?
It sounds to me like what you want is an individual supervisor for each child, responsible for keeping it alive up to a limit, as you say, and as a layer above that have a single supervisor (one-for-one or simple-one-for-one) whose children are marked as temporary, so that when one of them gives up, the rest stay running.
You can't "extend" Supervisor to add different supervision behaviour, but you don't have to start from scratch either. The :supervisor module itself is implemented on top of :gen_server, so I would consult the source code of :supervisor (which you can find here) if you do find yourself needing some kind of custom supervision behaviour; it will give you a base to build from to avoid some of the pitfalls which you are likely to encounter.
I can expand my answer about alternative solutions once I have a better idea of your use case. As I mentioned in my comment, it sounds to me that you are likely doing something during init/1 of your processes which is prone to failure; init/1 is not the place to handle those things, because if it becomes impossible to succeed at that action temporarily, you will almost certainly blow the max restart intensity of the supervisor.
For example, let's assume you have a process which talks to the database, and requires a database connection; you do not want to try and connect to the database during init/1. Rather you should acquire the connection post-init (perhaps on first-use, or by immediately sending a post-init message to the process using Process.send_after(self(), :connect, 0)), and if the connection fails, return something like {:error, :database_unavailable} to any callers while you attempt to re-establish the connection. Designing with this approach will allow your supervision tree to remain stable, and it instead pushes the decision on how to deal with failure down to the clients who likely have better information on how it impacts them (i.e., should they retry the operation, return an error to their caller, exit with an exception, etc.)
You can use director too, it's more flexible for solving this problem.

grpc iOS stream, send only when GRXWriter.state is started?

I'm using grpc in iOS with bidirectional streams.
For the stream that I write to, I subclassed GRXWriter and I'm writing to it from a background thread.
I want to be as quick as possible. However, I see that GRXWriter's status switches between started and paused, and I sometimes get an exception when I write to it during the paused state. I found that before writing, I have to wait for GRXWriter.state to become started. Is this really a requirement? Is GRXWriter only allowed to write when its state is started? It switches very often between started and paused, and this feels like it may be slowing me down.
Another issue with this state check is that my code looks ugly. Is there any other way that I can use bidirectional streams in a nicer way? In C# grpc, I just get a stream that I write freely to.
Edit: I guess the reason I'm asking is this: in my thread that writes to GRXWriter, I have a while loop that keeps checking whether state is started and does nothing if it is not. Is there a better way to do this rather than polling the state?
The GRXWriter pauses because the gRPC Core only accepts one write operation pending at a time. The next one has to wait until the first one completes. So the GRPCCall instance will block the writer until the previous write is completed, by modifying its state!
In terms of the exception, I am not sure why you are getting the problem. GRXWriter is more like an abstract class and it seems you did your own implementation by inheriting from it. If you really want to do so, it might be helpful to refer to GRXBufferedPipe, which is an internal implementation. In particular, if you want to avoid waiting in a loop for writing, writing again in the setter of GRXWriter's state should be a good option.

In what cases would you use a process monitor instead of a try catch around a function

I know supervisors can monitor many processes and OTP supervisors provide nice defaults like if there's too many errors in X time then don't retry.
My question is when do you personally choose try catch over a process monitor.
The tools you are speaking of have two completely different use cases.
You should only run things in a process that needs to happen in parallel with something else. Therefore, I would argue that you should use try/catch when you need to catch errors in a situation when you are fine waiting for the result (i.e. a sequential program). You should use an external process for when you need to run something in parallel, and you should monitor it if you are interested in exceptions happening in that process.
That being said, there are of course edge cases where externalizing an activity into a process makes some sense, like special garbage collection corner cases for example (it is sometimes easier/faster to garbage collect an activity by just killing its process).
Performance wise there are so many parameters involved (overhead of try/catch vs monitor, the frequency your code is run at etc.) that you'd have to benchmark for your case.
I prefer to use try/catch only in situations where throwing of error is normal behaviour e.g. say, lookup_value/1 in grpoc (it throws badarg if key is not registered, while I'd expect and prefer to get undefined instead).
That's Erlang philosophy as I understand it: you should program for good case and not try to be defensive too much i.e. there are some cases where you should care while in most cases just let it crash.

How would someone create a preemptive scheduler for the Lua VM?

I've been looking at lua and lvm.c. I'd very much like to implement an interface to allow me to control the VM interpreter state.
Cooperative multitasking from within lua would not work for me (user contributed code)
The debug hook gets me only about 50% of the way there, instruction execution limits, but it raises an exception which just crashes the running lua code - but I need to be able to tweak it even further.
I want to create a system where 10's of thousands of lua user scripts are running - individual threads would not work, and the execution limits would cause headache for beginning developers, I'm going to control execution speeds too. but ultimately
while true do
end
will execute forever, and I really don't care that it is.
Any ideas, help or other implementations that I could look at?
EDIT: This is not about sandboxing pretend I'm an expert in that field for this conversation
EDIT: I do not want to use an internally ran lua code coroutine based controller.
EDIT: I want to run one thread, and manage a large number of user contributed lua scripts, an external process level control mechansim would not scale at all.
You can search for Lua Sandbox implementations; for example, this wiki page and SO question provide some pointers. Note that most of the effort in sandboxing is focused on not allowing you to execute bad code, but not necessarily on preventing infinite loops. For better control you may need to combine Lua sandboxing with something like LXC or cpulimit. (not relevant based on the comments)
If you are looking for something Lua-based, lightweight, but not necessarily 100% foolproof, then you can try running your client code in a separate coroutine and set a debug hook on that coroutine that will be triggered every N-th line. In that hook you can check if the process you are running exceeded its quotes. You also need to take care of new coroutines started as those need to have their own hooks set (you either need to disable coroutine.create/wrap or to replace them with something that sets the debug hook you need).
The code in this case may look like:
local coro = coroutine.create(client_func)
debug.sethook(coro, debug_hook, "l", 1000) -- trigger hook on every 1000th line
It's not foolproof, because it may block on some IO operation and the debug hook will not help there.
[Edit based on updated question and comments]
Between "no lua code coroutine based controller" and "no external process control mechanism" I don't think you are left with much choice. It may be that your only option is to run one VM per user script and somehow give ticks to those VMs (there was a recent question on SO on this, but I can't find it). Before going this route, I would still try to do this with coroutines (which should scale to tens of thousands easily; Tir claims supporting 1M active users with coroutine-based architecture).
The mechanism would roughly look like this: you install the debug hook as I shown above and from that hook you yield back to your controller, which then decides what other coroutine (user script) to resume. I have this very mechanism working in the Lua debugger I've been developing (although it only does it for one client script). This doesn't protect you from IO calls that can block and for that you may still need to have a watchdog at the VM level to see if it's been blocked for longer than needed.
If you need to serialize and deserialize running code fragments that preserve upvalues and such, then Pluto is probably your only option.
Look at implementing lua_lock and lua_unlock.
http://www.lua.org/source/5.1/llimits.h.html#lua_lock
Take a look at lulu. It is lua VM written on lua. It's for Lua 5.1
For newer version you need to do some work. But it's then you really can make a schelduler.
Take a look at this,
https://github.com/amilamad/preemptive-task-scheduler-for-lua
I maintain this project. It,s a non blocking preemptive scheduler for running lua code. Suitable for long running game scripts.

Resources