I have a server application made in Erlang. In it I have an mnesia table
that store some information on photos. In the spirit of "everything is a
process" I decided to wrap that table in a gen_server module, so that the
gen_server module is the only one that directly accesses the table. Querying
and adding information to that table is done by sending messages to that process
(which has a registered name). The idea is that there will be several client
processes querying information from that table.
This works just fine, but that gen_server module has no state. Everything it
requires is stored in the mnesia table. So, I wonder if a gen_server is perhaps
not the best model for encapsulating that table?
Should I simply not make it a process, and instead only encapsulate the table
through the functions in that module? In case of a bug in that module, that
would cause the calling process to crash, which I think might be better, because
it would only affect a single client, as opposed to now, when it would cause the
gen_server process to crash, leaving everyone without access to the table (until
the supervisor restarts it).
Any input is greatly appreciated.
I guess according to Occam's razor there is no need for this gen_server to exist, especially since there is absolutely no state stored in it. Such process could be needed in situations when you need access to the table (or any other resource) to be strictly sequential (for example you might want to avoid any aborted transactions at cost of a bottleneck).
Encapsulating access to the table in a module is a good solution. It creates no additional complexity, while providing proper level of abstraction and encapsulation.
I'm not sure I understand why you've decided to encapsulate a table with a process. Mnesia is designed to mediate multiple concurrent accesses to tables, both locally and distributed across a cluster.
Creating an API module that performs all the particular table access operations and updates is a good idea as the API functions will convey your intent better in the code that calls them. It will be more readable than putting the mnesia operations directly into the calling code.
An API module also gives you the option to switch from mnesia to some other storage system later if you need to. Using mnesia transactions inside your API module protects you from some programming errors as mnesia will roll-back operations that crash. The API module will always be available to callers and allows any number of callers to perform operations concurrently, whereas a gen_server based API has a point of failure, the process, that can render the API unavailable.
The only thing a gen_server based API gives you over a pure-functional API is to serialize access to the table - which is an unusual requirement and unless you specifically need it, it will be a performance killer.
It may be a good idea to handle a mnesia table using single gen_server process when you want to use dirty access and avoid transactions. This approach might be faster than txs, but as usually you need to benchmark it.
Related
Erlang is great about cleaning things up by not having shared state. But what happens when you do want shared state? For example: configuration options, statistics gathering, event/callback servers. Spawning up a new processes with some record as state or using the process dictionary is a way to accomplish shared state. You would loop that process over and over and reply to any messages. Multiple processes will just query that process using what are essentially impure getter and setter functions that wrap around message passing, but here we just turned Erlang into an impure object that's slower than a java object because of the reduction system taking turns is slower than just having a memory mutex around each global state. It even has the possibility of having a mailbox overflow if we're not careful.
So what do you do if you want fast shared state? Reddis, a database, mnesia, spawns looping state? How do you make centralized state more purely functional in erlang?
Use a public (anyone can read or write) or protected (one writer, multiple readers) ets table created with the named_table option. Each process needing access to the shared state in the table can get to the table by its name.
I am thinking an Erlang program that has many workers (loop receive), these workers almost always manipulate their status at the same time, ie. massive concurrent, the amount of workers is so big that keep their status in mnesia will cause performance problem, so I am thinking pass the status as args in each loop, then write to mnesia some time later. Is this a good practice? Is there a better way to do this? (roughly speaking, I'm looking for something like an instance with attributes in the object oriented language)
Thanks.
With Erlang, it is a good habit to see the processes as actor with a dedicated and limited role. With this in mind you will see that you will split your problem in different categories like:
Maintain the state of a connection with a user over the Internet,
Keep information such as login, user profile, friends, shop-cart...
log events
...
for each role you will have to decide if the state information must survive to the process.
In a lot of cases it is not necessary (case 1) and the solution is simply to keep the state in the argument of loop funtion of the process. I encourage you to look at the OTP behaviors, the gen_server and gen_fsm are made for this.
The case 2 obviously manipulates permanent data which must survive to a process crash or even a hardware crash. These data will be stored using dets, mnesia or any database adapted to your problem (Redis, CouchDB ...).
It is important to limit the information stored into external database, otherwise you will not benefit of this very powerful feature which is the lack of side effect. In other words, it is a very bad idea to have process behavior which depends on external information.
My understanding of the message passing system is that it is serialized and therefore all the reads from different processes are serialized even if the data isn't changing. I would like to have the data read concurrently if possible to take advantage of distributed computing. Is this possible?
You are correct in that messages will be handled sequentially in a process receiving them.
If the data really is static (well, even if it changes sometimes) consider using an ETS table for this kind of scenario. ETS tables are highly optimized for concurrent access whenever applicable. Unless someone is writing to an ETS table (or row) all clients can read the data concurrently from the table.
If you have different processes on the same computer (IMO, this is not a distributed computing), binary type is not serialized, it is passed by reference. So you can read large block of data by many processes without actually copying it. The very idea of "data read concurrently" in a really distributed world doesn't seem right to me (ETS is not an exception).
P.S. Well, what I meant in the last statement was "it doesn't save you from serializing".
I've read several comments here and elsewhere suggesting that Erlang's process dictionary was a bad idea and should die. Normally, as a total Erlang newbie, I'd just avoid it. However, in this situation my other options aren't great.
I have a main dispatcher function that looks something like this:
dispatch(State) ->
receive
{cmd1, Params} ->
NewState = do_cmd1_stuff(Params, State),
dispatch(NewState);
{cmd2, Params} ->
NewState = do_cmd2_stuff(Params, State),
dispatch(NewState);
BadMsg ->
log_error(BadMsg),
dispatch(State)
end.
Obviously, my names are more meaningful to me, but that's the gist of it. Deep down in a function called by a function called by a function called by do_cmd2_stuff(), I want to send out messages to all my users telling them about something I've done. In order to do that, I need to get the list of users from the point where I send the messages. The user list doesn't lend itself easily to sticking in the global state, since that's just one data structure representing the only block of data on which I operate.
The way I see it, I have a couple unpleasant options other than using the process dictionary. I can send the user list through all the various levels of functions down to the very bottom one that does the broadcasting. That's unpleasant because it causes all my functions to gain a parameter, whether they really care about it or not.
Alternatively, I could have all the do_cmdN_stuff() functions return a message to send. That's not great either though, since sending the message may not be the last thing I want to do and it clutters up my dispatcher with a bunch of {Msg, NewState} tuples. Furthermore, some of the functions might not have any messages to send some of the time.
Like I said earlier, I'm very new to Erlang. Maybe someone with more experience can point me at a better way. Is there one? Is the process dictionary appropriate in this case?
The general rule is that if you have doubts, you shouldn't use the process dictionary.
If the two options you mentioned aren't good enough (I personally like the one where you return the messages to send) and what you want is some particular piece of code to track users and forward messages to them, maybe what you want to do is have a process holding that info.
Pid ! {forward, Msg}
where Pid will take care of sending everything to a bunch of other processes. Now, you would still need to pass the Pid around, unless you give it a name in some registry to find it. Either with register/2, global or gproc.
A simple answer would be to nest your global within a state record, which is then threaded through the system, at least at the stop level. This makes it easy to add new fields to the state in the future, not an uncommon occurrence, and allow you to keep your global state data structure untouched. So initially
-record(state, {users=[],state_data}).
Defining it as a record makes it easy to access and extend when necessary.
As you mentioned you can always pass the user list as extra param, thats not so bad.
If you don't want to do this just put it in State. You can have a special State just for this part of the calculation that also contains the user list.
Then there always is the possibility of putting it in ETS or in another server process.
What exactly to do is hard to recommend since it depends a lot on your exact application and preferences.
Just choose from the mentioned possibilities as if the process dictionary doesn't exist. Maybe your code needs restructuring if none of the variants look elegant, there always is some better way without the process dictionary.
Its really bad it is still there, because its alluring to many beginning Erlang users.
You really should not use process dictionary. I accept using dictionary only if
It is short living process.
I have full control about the process from spawn to termination i.e. I use minimum and well known set of external modules.
I need performance gain badly. It means avoid copy of data when using ets and dict/gb_tree is too slow (for GC reason).
ad 1. is not your case, you are using in server. ad 2. I don't know if it is your case. ad 3. is not your case because you need list of recipient so you don't gain nothing from that process dictionary is very fast key/value storage. In your case I don't see any reason why you should not include what you need to your State. IMHO State is exactly the right place for it.
Its an interesting question because it involves the fundamentals of functional design.
My opinion:
Try as much as possible to make the function return the messages, then send them. This separates the two different tasks nicely, and separates the purely functional task from the one that causes side effects.
If this isn't possible, pass receivers as argument even if its a bit messy. If the broadcasting function uses that data, it should be given to it explicitly, for clarity and predictability.
Using ETS as Peer Stritzinger suggests is really not any better than the PD, both hides the fact that the broadcasting function uses the receiver list and makes it dependent on global data.
I'm not sure about the Erlang way of encapsulating some state in a process, as I GIVE TERRIBLE ADVICE suggests. Is it really any better that ETS or PD?
clutters up my dispatcher with a bunch
of {Msg, NewState}
This is my experience also, that you often end up like this. It's not particularly pretty, but functional design seems to encourage this. Could some language feature be introduced to make it more beautiful and natural?
EDIT:
6 years ago I wrote:
Could some language feature be introduced to make it more beautiful and natural?
After learning much more about functional programming I have realised that examples of this are state-monads and do-notation that are found in Haskell.
I would consider sending a special message to self() from deep inside the call stack, and handling it at the top level dispatch method that you've sketched, where list of users is available.
I've recently finished Joe's book and quite enjoyed it.
I'm since then started coding a soft realtime application with erlang and I have to say I am a bit confused at the use of gen_server.
When should I use gen_server instead of a simple stateless module?
I define a stateless module as follow:
- A module that takes it's state as a parameter (much like ETS/DETS) as opposed to keeping it internally (like gen_server)
Say for an invoice manager type module, should it initialize and return state which I'd then pass subsequently to it?
SomeState = InvoiceManager:Init(),
SomeState = InvoiceManager:AddInvoice(SomeState, AnInvoiceFoo).
Suppose I'd need multiple instances of the invoice manager state (say my application manages multiple companies each with their own invoices), should they each have a gen_server with internal state to manage their invoices or would it better fit to simply have the stateless module above?
Where is the line between the two?
(Note the invoice manage example above is just that, an example to illustrate my question)
I don't really think you can make that distinction between what you call a stateless module and gen_server. In both cases there is a recursive receive loop which carries state in at least one argument. This main loop handles requests, does work depending on the requests and, when necessary, sends results back the requesters. The main loop will most likely handle a number of administrative requests as well which may not be part of the main API/protocol.
The difference is that gen_server abstracts away the main receive loop and allows the user to only the write the actual user code. It will also handle many administrative OTP functions for you. The main difference is that the user code is in another module which means that you see the passed through state more easily. Unless you actually manage to write your code in one big receive loop and not call other functions to do the work there is no real difference.
Which method is better depends very much on what you need. Using gen_server will simplify your code and give you added functionality "for free" but it can be more restrictive. Rolling your own will give you more power but also you give more possibilities to screww things up. It is probably a little faster as well. What do you need?
It strongly depend of your needs and application design. When you need shared state between processes you have to use process to keep this state. Then gen_server, gen_fsm or other gen_* is your friend. You can avoid this design when your application is not concurrent or this design doesn't bring you some other benefits. For example break your application to processes will lead to simpler design. In other case sometimes you can choose single process design and using "stateless" modules for performance or such. "stateless" module is best choice for very simply stateless (pure functional) tasks. gen_server is often best choice for thinks that seems naturally "process". You must use it when you want share something between processes (using processes can be constrained by scalability or concurrency).
Having used both models, I must say that using the provided gen_server helps me stay structured more easily. I guess this is why it is included in the OTP stack of tools: gen_server is a good way to get the repetitive boiler-plate out of the way.
If you have shared state over multiple processes you should probably go with gen_server and if the state is just local to one process a stateless module will do fine.
I suppose your invoices (or whatever they stand for) should be persistent, so they would end up in an ETS/Mnesia table anyway. If this is so, you should create a stateless module where you put your API for accessing the invoice table.