Erlang in-memory linked list -- Exhausting memory?

I tried to make an Erlang in-memory datastore that would receive messages and add them to a list. Here's the current incarnation. The trouble is, I'm receiving about 200 messages per second and this easily exhausts the memory available.
Once a minute, I send a {write, Pid} message that should clear out and clean up this list, but it doesn't look like it's being garbage collected.
What am I doing wrong? I think I'm approaching this from the completely wrong direction...
datastore(Db) ->
    receive
        {put, Data} ->
            datastore(lists:concat([Data,Db]));
        {write, Responder} ->
            ScratchName = "ScratchFile.dat",
            {ok, ScratchDevice} = file:open(ScratchName,[write]),
            file:write(ScratchDevice,Db),
            ok = file:close(ScratchDevice),
            Responder ! {load, ScratchName},
            datastore([])
    end.

My first comment is that file:open with [write] truncates the file before anything is written to it, so every pass through the write clause overwrites the previous data. If the Responder is slow to load the file, it may find data it did not expect there.
Second reaction is that you don't have to do this buffering yourself. If you open the file with the option {delayed_write, Size, Delay}, and set Size and Delay to values that fit your purpose, you get precisely what you are trying to implement here by just writing all the time.
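A minimal sketch of that second point, not the asker's code: the module name delayed_store, its functions, and the 64 KB / 2000 ms settings are all assumptions chosen for illustration. The file is opened inside the process that writes to it, and every message is written immediately; the buffering is left to the delayed_write option.

```erlang
-module(delayed_store).
-export([start/1, append/2, stop/1]).

%% Hypothetical wrapper: one process owns the file and writes every
%% message straight to it; the delayed_write option does the buffering.
start(FileName) ->
    spawn(fun() ->
        %% Assumed values: flush after 64 KB or 2000 ms, tune as needed.
        {ok, Device} = file:open(FileName,
                                 [append, {delayed_write, 64 * 1024, 2000}]),
        loop(Device)
    end).

append(Pid, Data) ->
    Pid ! {put, Data},
    ok.

stop(Pid) ->
    Pid ! {stop, self()},
    receive
        {stopped, Pid} -> ok
    end.

loop(Device) ->
    receive
        {put, Data} ->
            ok = file:write(Device, Data),
            loop(Device);
        {stop, From} ->
            %% Closing the file flushes whatever is still buffered.
            ok = file:close(Device),
            From ! {stopped, self()}
    end.
```

Because nothing accumulates in the process state, the heap of the writer stays small regardless of the message rate.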
Third reaction is that you are probably doing the wrong thing if you use a file to communicate between different parts of your system. What are you attempting to do?
P.S. If you need a new random filename, you can easily generate one with erlang:now/0 and io_lib:format/2. As an added bonus, the names will sort in creation order.

This is the wrong way to buffer in Erlang. ETS tables (http://www.erlang.org/doc/man/ets.html) are designed to hold thousands or millions of in-memory Erlang terms with ease. Please do not use lists or queues to hold that much data. If one part of your code receives or generates data that other parts of the application are supposed to consume, and you know the consumers will be slower than the producer, then you need a more robust way of buffering: ETS tables.
Another point is that processes are usually a point of failure in a system. If a process is used to buffer or hold essential data, even data that is short-lived but critical to the system, what happens when that process exits or dies? ETS tables of type public can give data access to all processes, even across applications, within the same VM. All processes can then read the data concurrently, as much as they want, while you ensure consistency by having a single writer/updater.
ETS tables fail far less often in an application than processes do. More recently, a function that helps rescue the data of a failing ETS table has also been introduced (ets:give_away/3).
Also, in a comment above you mentioned that you work for a large company. With large teams it is usually better to evaluate a number of options and test several of them intensively, depending on the nature of the application you are developing. To avoid side effects, identify which data structures are best suited to which job. For example, for in-memory storage that must handle 200 messages per second, lists and files would, if tested properly, lose against ETS tables.
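As a rough sketch of an ETS-backed buffer along these lines: the module name ets_buffer and its functions are hypothetical, and using erlang:unique_integer/1 as the key is an assumption made here to preserve arrival order.

```erlang
-module(ets_buffer).
-export([new/0, add/2, drain/1]).

%% The table belongs to the process that calls new/0 and disappears
%% when that process dies, unless it is handed over with ets:give_away/3.
new() ->
    ets:new(buffer, [ordered_set, public]).

%% Store each item under a strictly increasing key.
add(Tab, Data) ->
    Key = erlang:unique_integer([monotonic]),
    true = ets:insert(Tab, {Key, Data}),
    ok.

%% Return the buffered items in insertion order and delete only the rows
%% that were read, so items added during the drain are kept.
drain(Tab) ->
    Rows = lists:keysort(1, ets:tab2list(Tab)),
    [ets:delete(Tab, Key) || {Key, _} <- Rows],
    [Data || {_, Data} <- Rows].
```

The once-a-minute write from the question would call drain/1 and write the result to disk, leaving the table empty for the next interval.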

Related

Sending big lists between processes in Erlang / Elixir

Suppose you want to send a big list (1_000_000 entries for example) to a different process in Erlang / Elixir.
Does send block until the data of the whole list is sent, or is that somehow done asynchronously by the Erlang VM?
Is simply sending the list a bad practice? What are the alternatives?

Yes and no. Copying a huge structure blocks the sender, not the receiver. See my answer to a similar question.
Yes, it is bad practice if you do it regularly. Alternatives:
Refactor your code so you could partition data to different processes.
Store data in ets. (It's not a magic bullet, you still copy data in and out of ets but it depends on your access pattern.)
Store data as binary.
Which one fits depends on what you are doing. You should not have a big heap in a single process, so you should probably refactor your code.
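A small sketch of the "store data as binary" option, with a hypothetical module name (big_send). It relies on the fact that binaries larger than 64 bytes are reference-counted and shared between processes on the same node, so the send itself passes a reference rather than copying the payload. The receiver still pays for binary_to_term/1 if it needs the terms back.

```erlang
-module(big_send).
-export([demo/1]).

%% Encode a list once, send the binary to a worker, and get a result back.
demo(N) ->
    List = lists:seq(1, N),
    Bin = term_to_binary(List),
    Parent = self(),
    Pid = spawn(fun() ->
        receive
            {data, B} ->
                %% Decoding rebuilds the list on the receiver's heap.
                Parent ! {sum, self(), lists:sum(binary_to_term(B))}
        end
    end),
    Pid ! {data, Bin},
    receive
        {sum, Pid, Sum} -> Sum
    end.
```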

How to implement status in Erlang?

I am thinking about an Erlang program that has many workers (each a receive loop). These workers almost always change their status at the same time, i.e. massively concurrently, and there are so many of them that keeping their status in mnesia would cause performance problems. So I am thinking of passing the status as an argument in each loop and writing it to mnesia some time later. Is this good practice? Is there a better way to do this? (Roughly speaking, I'm looking for something like an instance with attributes in an object-oriented language.)
Thanks.

With Erlang, it is a good habit to see processes as actors with a dedicated and limited role. With this in mind, you will split your problem into different categories, such as:
1. Maintain the state of a connection with a user over the Internet,
2. Keep information such as login, user profile, friends, shop-cart...
3. log events
...
For each role you have to decide whether the state information must outlive the process.
In a lot of cases it is not necessary (case 1), and the solution is simply to keep the state in the argument of the process's loop function. I encourage you to look at the OTP behaviors; gen_server and gen_fsm are made for this.
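A minimal gen_server sketch of a worker that keeps its status in the loop argument. The module name worker, the count field, and the bump/status API are assumptions for illustration; the state is a map so it reads like an instance with attributes.

```erlang
-module(worker).
-behaviour(gen_server).
-export([start_link/0, bump/1, status/1]).
-export([init/1, handle_call/3, handle_cast/2]).

%% Client API (hypothetical names).
start_link() -> gen_server:start_link(?MODULE, [], []).
bump(Pid) -> gen_server:cast(Pid, bump).
status(Pid) -> gen_server:call(Pid, status).

%% The state lives only in the callback arguments; nothing is written
%% to mnesia or any other store here.
init([]) ->
    {ok, #{count => 0}}.

handle_call(status, _From, State) ->
    {reply, State, State}.

handle_cast(bump, State = #{count := N}) ->
    {noreply, State#{count := N + 1}}.
```

Writing the status to mnesia "some time later" could be done from a periodic message handled in handle_info/2, which is left out of this sketch.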
Case 2 obviously deals with permanent data that must survive a process crash or even a hardware crash. Such data is stored using dets, mnesia or any database suited to your problem (Redis, CouchDB ...).
It is important to limit the information stored in an external database; otherwise you lose the benefit of a very powerful feature, the lack of side effects. In other words, it is a very bad idea to have process behavior that depends on external information.

In erlang is it possible to have concurrent readers of data owned by another process?

My understanding of the message passing system is that it is serialized and therefore all the reads from different processes are serialized even if the data isn't changing. I would like to have the data read concurrently if possible to take advantage of distributed computing. Is this possible?

You are correct in that messages will be handled sequentially in a process receiving them.
If the data really is static (well, even if it changes sometimes) consider using an ETS table for this kind of scenario. ETS tables are highly optimized for concurrent access whenever applicable. Unless someone is writing to an ETS table (or row) all clients can read the data concurrently from the table.
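A short sketch of concurrent readers on one table, with a hypothetical module name (shared_read) and made-up data. The table is protected, so only the owner can write while any process can read, and the read_concurrency option asks ETS to optimize for concurrent reads.

```erlang
-module(shared_read).
-export([demo/0]).

%% The owner writes once; four reader processes look the value up
%% directly in the table without sending a message to the owner.
demo() ->
    Tab = ets:new(config, [set, protected, {read_concurrency, true}]),
    true = ets:insert(Tab, {answer, 42}),
    Parent = self(),
    Readers = [spawn(fun() ->
                   [{answer, Value}] = ets:lookup(Tab, answer),
                   Parent ! {self(), Value}
               end) || _ <- lists:seq(1, 4)],
    Results = [receive {Pid, Value} -> Value end || Pid <- Readers],
    true = ets:delete(Tab),
    Results.
```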

If your processes run on the same computer (which, IMO, is not distributed computing), large binaries (more than 64 bytes) are not serialized but passed by reference. So many processes can read a large block of data without actually copying it. The very idea of "data read concurrently" in a really distributed world doesn't seem right to me (ETS is not an exception).
P.S. Well, what I meant in the last statement was "it doesn't save you from serializing".

Is the process dictionary appropriate in this case?

I've read several comments here and elsewhere suggesting that Erlang's process dictionary was a bad idea and should die. Normally, as a total Erlang newbie, I'd just avoid it. However, in this situation my other options aren't great.
I have a main dispatcher function that looks something like this:
dispatch(State) ->
    receive
        {cmd1, Params} ->
            NewState = do_cmd1_stuff(Params, State),
            dispatch(NewState);
        {cmd2, Params} ->
            NewState = do_cmd2_stuff(Params, State),
            dispatch(NewState);
        BadMsg ->
            log_error(BadMsg),
            dispatch(State)
    end.
Obviously, my names are more meaningful to me, but that's the gist of it. Deep down in a function called by a function called by a function called by do_cmd2_stuff(), I want to send out messages to all my users telling them about something I've done. In order to do that, I need to get the list of users from the point where I send the messages. The user list doesn't lend itself easily to sticking in the global state, since that's just one data structure representing the only block of data on which I operate.
The way I see it, I have a couple unpleasant options other than using the process dictionary. I can send the user list through all the various levels of functions down to the very bottom one that does the broadcasting. That's unpleasant because it causes all my functions to gain a parameter, whether they really care about it or not.
Alternatively, I could have all the do_cmdN_stuff() functions return a message to send. That's not great either though, since sending the message may not be the last thing I want to do and it clutters up my dispatcher with a bunch of {Msg, NewState} tuples. Furthermore, some of the functions might not have any messages to send some of the time.
Like I said earlier, I'm very new to Erlang. Maybe someone with more experience can point me at a better way. Is there one? Is the process dictionary appropriate in this case?

The general rule is that if you have doubts, you shouldn't use the process dictionary.
If the two options you mentioned aren't good enough (I personally like the one where you return the messages to send) and what you want is some particular piece of code to track users and forward messages to them, maybe what you want to do is have a process holding that info.
Pid ! {forward, Msg}
where Pid takes care of sending everything to a bunch of other processes. You would still need to pass the Pid around, unless you give it a name in some registry so it can be found, either with register/2, global or gproc.
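A sketch of such a process, registered locally with register/2 so that deeply nested code can reach it by name. The module name broadcaster and the add_user message are assumptions for illustration.

```erlang
-module(broadcaster).
-export([start/1, forward/1]).

%% Start the forwarding process and register it under the module name.
start(Users) ->
    Pid = spawn(fun() -> loop(Users) end),
    true = register(?MODULE, Pid),
    Pid.

%% Can be called from anywhere in the call stack; no Pid is passed around.
forward(Msg) ->
    ?MODULE ! {forward, Msg},
    ok.

loop(Users) ->
    receive
        {forward, Msg} ->
            [User ! Msg || User <- Users],
            loop(Users);
        {add_user, User} ->
            loop([User | Users])
    end.
```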

A simple answer would be to nest your global within a state record, which is then threaded through the system, at least at the top level. This makes it easy to add new fields to the state in the future, not an uncommon occurrence, and allows you to keep your global state data structure untouched. So initially
-record(state, {users=[],state_data}).
Defining it as a record makes it easy to access and extend when necessary.
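A brief sketch of how that record might be created, updated, and read. The module name state_rec and the helper functions are hypothetical; only the record definition comes from the answer.

```erlang
-module(state_rec).
-export([new/2, add_user/2, users/1]).

-record(state, {users = [], state_data}).

%% Wrap the existing global data structure together with the user list.
new(Users, Data) ->
    #state{users = Users, state_data = Data}.

%% Updating one field leaves state_data untouched.
add_user(User, State = #state{users = Users}) ->
    State#state{users = [User | Users]}.

%% Pattern matching picks out only the field a function cares about.
users(#state{users = Users}) ->
    Users.
```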

As you mentioned, you can always pass the user list as an extra parameter; that's not so bad.
If you don't want to do this, just put it in State. You can have a special State just for this part of the calculation that also contains the user list.
Then there is always the possibility of putting it in ETS or in another server process.
What exactly to do is hard to recommend, since it depends a lot on your exact application and preferences.
Just choose from the mentioned possibilities as if the process dictionary didn't exist. Maybe your code needs restructuring if none of the variants looks elegant; there is always some better way without the process dictionary.
It's really bad that it is still there, because it is alluring to many beginning Erlang users.

You really should not use the process dictionary. I accept using it only if:
1. It is a short-lived process.
2. I have full control over the process from spawn to termination, i.e. I use a minimal and well-known set of external modules.
3. I badly need the performance gain, meaning I have to avoid the copying of data that ets involves, and dict/gb_trees is too slow (for GC reasons).
Ad 1: this is not your case, since you are using it in a server. Ad 2: I don't know whether this is your case. Ad 3: this is not your case, because you need a list of recipients, so you gain nothing from the process dictionary being a very fast key/value store. In your case I don't see any reason not to include what you need in your State. IMHO State is exactly the right place for it.

It's an interesting question because it involves the fundamentals of functional design.
My opinion:
Try as much as possible to make the function return the messages, then send them. This separates the two different tasks nicely, and separates the purely functional task from the one that causes side effects.
If this isn't possible, pass the receivers as an argument even if it's a bit messy. If the broadcasting function uses that data, it should be given to it explicitly, for clarity and predictability.
Using ETS as Peer Stritzinger suggests is really not any better than the PD: both hide the fact that the broadcasting function uses the receiver list, and both make it dependent on global data.
I'm not sure about the Erlang way of encapsulating some state in a process, as I GIVE TERRIBLE ADVICE suggests. Is it really any better than ETS or the PD?
"clutters up my dispatcher with a bunch of {Msg, NewState}"
This is my experience also, that you often end up like this. It's not particularly pretty, but functional design seems to encourage this. Could some language feature be introduced to make it more beautiful and natural?
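A small sketch of the return-the-messages style, using a made-up command set and a hypothetical module name (pure_dispatch). The command function is pure and returns {Msgs, NewState}; only the thin outer function performs the sends.

```erlang
-module(pure_dispatch).
-export([do_cmd/2, handle/3]).

%% Pure part: computes the new state and the messages to send,
%% but sends nothing. An empty list means "nothing to broadcast".
do_cmd({add, N}, Total) ->
    New = Total + N,
    {[{total_changed, New}], New};
do_cmd(noop, Total) ->
    {[], Total}.

%% Impure shell: the only place where messages are actually sent.
handle(Cmd, Total, Users) ->
    {Msgs, NewTotal} = do_cmd(Cmd, Total),
    [User ! Msg || User <- Users, Msg <- Msgs],
    NewTotal.
```

The pure function can be tested without spawning any processes, which is the main practical benefit of the split.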
EDIT:
Six years ago I wrote:
"Could some language feature be introduced to make it more beautiful and natural?"
After learning much more about functional programming, I have realised that examples of this are the state monad and do-notation found in Haskell.

I would consider sending a special message to self() from deep inside the call stack, and handling it at the top level dispatch method that you've sketched, where list of users is available.
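A sketch of that idea with hypothetical names (self_notify, deep_work, the broadcast message). The deep function only posts a message to its own process; the dispatcher, which holds the user list, does the broadcasting when it gets to that message.

```erlang
-module(self_notify).
-export([start/1]).

start(Users) ->
    spawn(fun() -> dispatch(0, Users) end).

%% Top-level loop: the only place that knows the user list.
dispatch(State, Users) ->
    receive
        {cmd, N} ->
            dispatch(deep_work(N, State), Users);
        {broadcast, Msg} ->
            [User ! Msg || User <- Users],
            dispatch(State, Users)
    end.

%% Stands in for a function deep in the call stack. It has no access
%% to the user list, so it asks its own process to broadcast later.
deep_work(N, State) ->
    New = State + N,
    self() ! {broadcast, {new_state, New}},
    New.
```

One consequence of this design is that the broadcast is queued behind any messages already waiting in the mailbox, so it is not sent at the exact point where the work was done.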

Use it for JSON data transfer

I am trying to use RabbitMQ for a distributed system that would work something like:
a producer puts in a queue a JSON-formatted list of order ids
several consumers pull from that queue, run the business logic on those order ids, and put the result (also JSON-formatted) into another queue
from the second queue, another consumer will take the data and pass it back to the caller
I am still very new to RabbitMQ and I am wondering if this model is the right approach, given that the data should come back as fast as possible (sometimes within seconds, 5 at most), so there are real-time requirements.
Also, how large can a message passed to a queue be? The JSON that the producer gets back will be fairly large, depending on what the consumer does.
Thanks for any ideas!

See page 47 in this presentation (InfoQ) for a great comparison between different messaging formats.

There's nothing wrong with the design you suggested.
The slight wrinkle is that enforcing "real time requirements" isn't straightforward. For instance, it's not currently possible to expire messages within a queue, so this would need to be handled by the clients when consuming messages.
The total size of messages in RabbitMQ <=1.8.1 was bounded by the amount of available RAM. As of 2.0.0, it's bounded by the amount of available disk space (i.e. rabbit will page messages to disk if it's running low on memory). Individual message sizes are recorded as 32-bit integers (IIRC), so individual messages cannot be larger than ~4GB; if this is a problem, consider saving the JSONs to network storage and passing some ID to them in the messages. Other than this, there aren't any constraints.
