Erlang: create filewatcher - erlang

I have to implement file watcher functionality in Erlang: There should be a process that list files if specific directory and do something, when files appear.
I take a look at OTP. So at the moment I have following ideas:
1. Create Supervisor that will control gen_servers (one server per folder)
2. Create WatchServer - gen_server for each folder that I want to monitor.
3. Create ProcessFileServer - gen server that should do something with files )assume copy to different folder=
So First problem: WatchServer should not wait for request, it should generate one in predefined intervals.
At the moment I have created a timer in init/1 function and handle on_timer event in handle_info function.
Now questions:
1. Are there better ideas?
2. How should I inform ProcessFileServer that file found? It seams to me that it would be much more convenient create WatchServers and ProcessServers independently, but in this case I do not know to whom send message?
May be there are some similar project/libs available?

if you are using Linux, you can use inotify. It is a kernel service that lets you subscribe to file system events. Don't poll the filesystem, let the filesystem call you.
you can try https://github.com/massemanet/inotify for observing your directory.
Ulf

In Erlang it is very cheap to create processes (orders of magnitudes compared to other systems).
Therefore I recommend to create a new ProcessFileServer each time a new file to process is appearing. When it is done with just terminate the process with exit reason normal.
I would suggest the following structure:
top_supervisor
|
+-----------------------+-------------------------+
| |
directory_supervisor processing_supervisor
| simple_one_for_one
+----------+-----...-----+ |
| | | starts children transient
| | | |
dir_watcher_1 dir_watcher_2 dir_watcher_n +-------------+------+---...----+
| | |
proc_file_1 proc_file_2 proc_file_n
When a dir_watcher notices a new file appeared. It calls the processing_supervisors supervisor:start_child\2 function, with the extra parameter of the file pathe e.g.
The processing_supervisor should start its children with transient restart policy.
So if one of the proc_file servers is crashing it will be restarted, but when they terminate with exit reason normal they are not restarted. So you just exit normal when done and crash when whatever else happens.
If you don't overdo it, cyclic polling for files is Ok. If the system becomes loaded because of this polling you can investigate in kernel notification systems (e.g. FreeBSD KQUEUE or the higher level services building upon it on MacOSX) to send you a message when a file appears in a directory. These services however have a complexity because it is necessary for them to throw up their hands if too many events happen (otherwise they wouldn't be a performance improvement but the opposite). So you will have to have a robust polling solution as a fallback anyway.
So don't do premature optimization and start with polling, adding improvements (which would be isolated in the dir_watcher servers) when it gets necessary.
Regarding the comment what behaviour to use as dir_watcher process since it doesn't use much of gen_servers functionality:
There is no problem with only using part of gen_servers posibilities, in fact it is very common not to use all of it. In your case you only set up a timer in init and use handle_info to do your work. The rest of the gen_server is just the unchanged template.
If you later want changing parameters like poll frequency it is easy to add into this.
gen_fsm is much less used since it only fits a quite limited model and is not very flexible. I use it only when it really fits 100% to the requirement (which it does almost never).
In a case where you just want a simple plain Erlang server you can use the spawn functions in proc_lib to get just the minimal functionality to run under a supervisor.
A interesting way to write more natural Erlang code and still have the OTP advantages is plain_fsm, here you have the advantages of selective receive and flexible message handling needed especially when handling protocols paired with the nice features of OTP.
Having said all this: if I would write a dir_watcher I'd just use a gen_server and use only what I need. The unused functionality doesn't really cost you anything and everybody understands what it does.

I have written such a library, based on polling. (It would be nice to extend it to use inotify on platforms where this is supported.) It was originally meant to be used in EUnit, but I turned into a separate project instead. You can find it here:
https://github.com/richcarl/file_monitor

Related

Best-effort OTP supervision

What I'd like to do is change my supervisor to make a best effort to keep children running, but give up if their crash rate exceeds the intensity. That way the remainder of the children keep running. This doesn't appear to be possible with the existing supervisor configurations, though, so it looks like my only option may be to implement my own supervisor so I can have it behave this way when it receives EXIT.
Is there a way to implement custom OTP supervisor behavior like this without writing your own supervisor?
It sounds to me like what you want is an individual supervisor for each child, responsible for keeping it alive up to a limit, as you say, and as a layer above that have a single supervisor (one-for-one or simple-one-for-one) whose children are marked as temporary, so that when one of them gives up, the rest stay running.
You can't "extend" Supervisor to add different supervision behaviour, but you don't have to start from scratch either. The :supervisor module itself is implemented on top of :gen_server, so I would consult the source code of :supervisor (which you can find here) if you do find yourself needing some kind of custom supervision behaviour; it will give you a base to build from to avoid some of the pitfalls which you are likely to encounter.
I can expand my answer about alternative solutions once I have a better idea of your use case. As I mentioned in my comment, it sounds to me that you are likely doing something during init/1 of your processes which is prone to failure; init/1 is not the place to handle those things, because if it becomes impossible to succeed at that action temporarily, you will almost certainly blow the max restart intensity of the supervisor.
For example, let's assume you have a process which talks to the database, and requires a database connection; you do not want to try and connect to the database during init/1. Rather you should acquire the connection post-init (perhaps on first-use, or by immediately sending a post-init message to the process using Process.send_after(self(), :connect, 0)), and if the connection fails, return something like {:error, :database_unavailable} to any callers while you attempt to re-establish the connection. Designing with this approach will allow your supervision tree to remain stable, and it instead pushes the decision on how to deal with failure down to the clients who likely have better information on how it impacts them (i.e., should they retry the operation, return an error to their caller, exit with an exception, etc.)
You can use director too, it's more flexible for solving this problem.

How would someone create a preemptive scheduler for the Lua VM?

I've been looking at lua and lvm.c. I'd very much like to implement an interface to allow me to control the VM interpreter state.
Cooperative multitasking from within lua would not work for me (user contributed code)
The debug hook gets me only about 50% of the way there, instruction execution limits, but it raises an exception which just crashes the running lua code - but I need to be able to tweak it even further.
I want to create a system where 10's of thousands of lua user scripts are running - individual threads would not work, and the execution limits would cause headache for beginning developers, I'm going to control execution speeds too. but ultimately
while true do
end
will execute forever, and I really don't care that it is.
Any ideas, help or other implementations that I could look at?
EDIT: This is not about sandboxing pretend I'm an expert in that field for this conversation
EDIT: I do not want to use an internally ran lua code coroutine based controller.
EDIT: I want to run one thread, and manage a large number of user contributed lua scripts, an external process level control mechansim would not scale at all.
You can search for Lua Sandbox implementations; for example, this wiki page and SO question provide some pointers. Note that most of the effort in sandboxing is focused on not allowing you to execute bad code, but not necessarily on preventing infinite loops. For better control you may need to combine Lua sandboxing with something like LXC or cpulimit. (not relevant based on the comments)
If you are looking for something Lua-based, lightweight, but not necessarily 100% foolproof, then you can try running your client code in a separate coroutine and set a debug hook on that coroutine that will be triggered every N-th line. In that hook you can check if the process you are running exceeded its quotes. You also need to take care of new coroutines started as those need to have their own hooks set (you either need to disable coroutine.create/wrap or to replace them with something that sets the debug hook you need).
The code in this case may look like:
local coro = coroutine.create(client_func)
debug.sethook(coro, debug_hook, "l", 1000) -- trigger hook on every 1000th line
It's not foolproof, because it may block on some IO operation and the debug hook will not help there.
[Edit based on updated question and comments]
Between "no lua code coroutine based controller" and "no external process control mechanism" I don't think you are left with much choice. It may be that your only option is to run one VM per user script and somehow give ticks to those VMs (there was a recent question on SO on this, but I can't find it). Before going this route, I would still try to do this with coroutines (which should scale to tens of thousands easily; Tir claims supporting 1M active users with coroutine-based architecture).
The mechanism would roughly look like this: you install the debug hook as I shown above and from that hook you yield back to your controller, which then decides what other coroutine (user script) to resume. I have this very mechanism working in the Lua debugger I've been developing (although it only does it for one client script). This doesn't protect you from IO calls that can block and for that you may still need to have a watchdog at the VM level to see if it's been blocked for longer than needed.
If you need to serialize and deserialize running code fragments that preserve upvalues and such, then Pluto is probably your only option.
Look at implementing lua_lock and lua_unlock.
http://www.lua.org/source/5.1/llimits.h.html#lua_lock
Take a look at lulu. It is lua VM written on lua. It's for Lua 5.1
For newer version you need to do some work. But it's then you really can make a schelduler.
Take a look at this,
https://github.com/amilamad/preemptive-task-scheduler-for-lua
I maintain this project. It,s a non blocking preemptive scheduler for running lua code. Suitable for long running game scripts.

Event manager process in erlang. Named processes or Pids?

I have event manager process that dispatches events to subscribers (e.g. http_session_created, http_sesssion_destroyed). If Pid is used instead of named process, I must put it into functions to operate with event manager but if Named process is used, code will be more clear.
Which variant is right?
Thank you!
While there is no actual difference to the process naming a process, registering it, makes it global. You in essence you are telling the system that here is a global service which anyone can use.
From you description it more sounds like you are giving them names to save the, small, effort of carrying them around in your loop. If this is the case I would put their pids in a record with all the other state data you carry around. This much better indicates the type of the processes.
If you have a fixed set of "subscriber" processes, then use registered names IMO.
If, on the contrary, you have a publish/subscribe sort of architecture where subscribers come and go, then you need an infrastructure to track those and from this point it doesn't really matter if you use Pid() or names.
One of the drawbacks of using registered names is that you need to track them in your code base to avoid "collisions". So it is up to you: personally, I tend to favor named processes as, like you say, it makes reading the code clearer. One way or another, OTP doesn't care.

using Kernel#fork for backgrounding processes, pros? cons?

I'd like some thoughts on whether using fork{} to 'background' a process from a rails app is such a good idea or not...
From what I gather fork{my_method; Process#setsid} does in fact do what it's supposed to do.
1) creates another processes with a different PID
2) doesn't interrupt the calling process (e.g. it continues w/o waiting for the fork to finish)
3) executes the child until it finishes
..which is cool, but is it a good idea? What exactly is fork doing? Does it create a duplicate instance of my entire rails mongrel/passenger instance in memory? If so that would be very bad. Or, does it somehow do it without consuming a huge swath of memory.
My ultimate goal was to do away with my background daemon/queue system in favor of forking these processes (primarily sending emails) -- but if this won't save memory then it's definitely a step in the wrong direction
The fork does make a copy of your entire process, and, depending on exactly how you are hooked up to the application server, a copy of that as well. As noted in the other discussion this is done with copy-on-write so it's tolerable. Unix is built around fork(2), after all, so it has to manage it fairly fast. Note that any partially buffered I/O, open files, and lots of other stuff are also copied, as well as the state of the program that is spring-loaded to write them out, which would be incorrect.
I have a few thoughts:
Are you using Action Mailer? It seems like email would be easily done with AM or by Process.popen of something. (Popen will do a fork, but it is immediately followed by an exec.)
immediately get rid of all that state by executing Process.exec of another ruby interpreter plus your functionality. If there is too much state to transfer or you really need to use those duplicated file descriptors, you might do something like IO#popen instead so you can send the subprocess work to do. The system will share the pages containing the text of the Ruby interpreter of the subprocess with the parent automatically.
in addition to the above, you might want to consider the use of the daemons gem. While your rails process is already a daemon, using the gem might make it easier to keep one background task running as a batch job server, and make it easy to start, monitor, restart if it bombs, and shut down when you do...
if you do exit from a fork(2)ed subprocess, use exit! instead of exit
having a message queue and a daemon already set up, like you do, kinda sounds like a good solution to me :-)
Be aware that it will prevent you from using JRuby on Rails as fork() is not implemented (yet).
The semantics of fork is to copy the entire memory space of the process into a new process, but many (most?) systems will do that by just making a copy of the virtual memory tables and marking it copy-on-write. That means that (at first, at least) it doesn't use that much more physical memory, just enough to make the new tables and other per-process data structures.
That said, I'm not sure how well Ruby, RoR, etc. interacts with copy-on-write forking. In particular garbage collection could be problematic if it touches many memory pages (causing them to be copied).

Erlang gen_server vs stateless module

I've recently finished Joe's book and quite enjoyed it.
I'm since then started coding a soft realtime application with erlang and I have to say I am a bit confused at the use of gen_server.
When should I use gen_server instead of a simple stateless module?
I define a stateless module as follow:
- A module that takes it's state as a parameter (much like ETS/DETS) as opposed to keeping it internally (like gen_server)
Say for an invoice manager type module, should it initialize and return state which I'd then pass subsequently to it?
SomeState = InvoiceManager:Init(),
SomeState = InvoiceManager:AddInvoice(SomeState, AnInvoiceFoo).
Suppose I'd need multiple instances of the invoice manager state (say my application manages multiple companies each with their own invoices), should they each have a gen_server with internal state to manage their invoices or would it better fit to simply have the stateless module above?
Where is the line between the two?
(Note the invoice manage example above is just that, an example to illustrate my question)
I don't really think you can make that distinction between what you call a stateless module and gen_server. In both cases there is a recursive receive loop which carries state in at least one argument. This main loop handles requests, does work depending on the requests and, when necessary, sends results back the requesters. The main loop will most likely handle a number of administrative requests as well which may not be part of the main API/protocol.
The difference is that gen_server abstracts away the main receive loop and allows the user to only the write the actual user code. It will also handle many administrative OTP functions for you. The main difference is that the user code is in another module which means that you see the passed through state more easily. Unless you actually manage to write your code in one big receive loop and not call other functions to do the work there is no real difference.
Which method is better depends very much on what you need. Using gen_server will simplify your code and give you added functionality "for free" but it can be more restrictive. Rolling your own will give you more power but also you give more possibilities to screww things up. It is probably a little faster as well. What do you need?
It strongly depend of your needs and application design. When you need shared state between processes you have to use process to keep this state. Then gen_server, gen_fsm or other gen_* is your friend. You can avoid this design when your application is not concurrent or this design doesn't bring you some other benefits. For example break your application to processes will lead to simpler design. In other case sometimes you can choose single process design and using "stateless" modules for performance or such. "stateless" module is best choice for very simply stateless (pure functional) tasks. gen_server is often best choice for thinks that seems naturally "process". You must use it when you want share something between processes (using processes can be constrained by scalability or concurrency).
Having used both models, I must say that using the provided gen_server helps me stay structured more easily. I guess this is why it is included in the OTP stack of tools: gen_server is a good way to get the repetitive boiler-plate out of the way.
If you have shared state over multiple processes you should probably go with gen_server and if the state is just local to one process a stateless module will do fine.
I suppose your invoices (or whatever they stand for) should be persistent, so they would end up in an ETS/Mnesia table anyway. If this is so, you should create a stateless module where you put your API for accessing the invoice table.

Resources