Erlang/OTP framework's error_logger hangs under fairly high load

My application is basically a content-based router that routes MMS events.
The logger I am using is the one that comes with the OTP framework in SASL mode, error_logger.
The issue is:
I am using a client to generate MMS events with default values. This client (written in Java) can send a high load of events from multiple threads.
I am sending 100 events in 10 threads (each thread sending 10 MMS events) to my router written in Erlang/OTP.
The problem is that when my router receives such a high load, my logger hangs, i.e. it stops updating my log file. The router is still able to route the events, though.
The conclusions I have come up with are:
A scheduling problem in Erlang when such a high load of events is received (a separate process for each event).
A very unlikely deadlock state.
It might be due to sending events in multiple threads rather than sequentially. But I guess a router will be connected to multiple service-provider boxes, so I thought of sending events in threads.
Can anybody help me demystify the problem?

You already have a good answer, but I'll add to the discussion.
By default, error_logger uses cached write operations to disk. So one possibility is that you don't really notice this under low load, but under high load the writes get stuck in the cache for a while.
On a side note: there should be no problem having multiple threads making calls to Erlang.
Another way of testing this is to add your own logger to error_logger, and see what happens. Possibly printing to the shell or something else that is "fast".
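For example, a minimal custom handler of that kind could be a gen_event callback module that just prints every event to the shell. This is only a sketch (the module name shell_logger is made up); attach it with error_logger:add_report_handler(shell_logger, []):
-module(shell_logger).
-behaviour(gen_event).
-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

init([]) -> {ok, []}.

%% Every error_logger event lands here; printing to the shell is cheap,
%% so if these keep appearing while the log file stops updating, the
%% problem is in the file-writing handler, not in event delivery.
handle_event(Event, State) ->
    io:format("error_logger event: ~p~n", [Event]),
    {ok, State}.

handle_call(_Request, State) -> {ok, ok, State}.
handle_info(_Info, State) -> {ok, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.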

Which version of Erlang are you using? Prior to R14A (R13B04 maybe?), there was a performance penalty when you invoked a selective receive while the message queue contained a lot of messages. In a process that receives lots of messages (error_logger being the canonical example), this meant that if the process was barely keeping up with the load, a small spike in load could push the per-message processing cost up and keep it there, because the new processing cost was higher than the process could bear. This problem has been solved in R14A.
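For illustration, here is a toy sketch (not code from the question) of the pattern that used to hit this penalty:
%% The selective receive scans every queued message until one matches,
%% so its cost grows with the mailbox length. Since R14A, the compiler
%% and runtime can skip messages that were already queued when
%% make_ref/0 was called, because they can never contain the new Ref.
call(Server, Request) ->
    Ref = make_ref(),
    Server ! {request, self(), Ref, Request},
    receive
        {reply, Ref, Result} -> Result
    end.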
Secondly, why are you sending a high volume of events/calls/logs to a text logger? Formatting strings for output to a human-readable log file is a lot more expensive than using a binary disk_log, for instance. Reducing the cost of logging will help, but reducing the volume of logs will help even more. Maybe investigate exactly why you need to log these things and see whether you can't record them in another (less expensive) way.
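As a rough sketch of the cheaper alternative (the log name, file name, and sizes here are made up), a binary wrap log avoids all string formatting:
%% 10 wrap files of 1 MB each; the oldest file is overwritten first.
{ok, Log} = disk_log:open([{name, mms_log},
                           {file, "mms_events.log"},
                           {type, wrap},
                           {size, {1024*1024, 10}}]).
%% disk_log:log/2 stores the term itself in binary form; there is no
%% human-readable formatting cost on the hot path.
ok = disk_log:log(Log, {mms_event, 42, routed_ok}).
ok = disk_log:close(Log).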
Problems with error_logger are often symptoms of some other overload problem. Try looking at the message queue sizes for all your processes when this problem occurs and see if something else is backed up too. The following Erlang shell code might help:
%% process_info/2 returns 'undefined' for processes that die between
%% erlang:processes() and the call, so the generator below skips them.
[{P, Len} || P <- erlang:processes(),
             {message_queue_len, Len} <- [process_info(P, message_queue_len)]].


django channels and running event loop

For a game website, I want a player to compete against either a human or an AI.
I am using Django + Channels (Django 4.0.2, asgiref 3.5.0, channels 3.0.4).
This has been a long learning journey...
Human vs. human: the game takes place in the web browser, turn by turn. Each time a player connects, a WebSocket connection is opened; a move is sent through the socket, processed by the consumer (validated and saved in the database), and sent to the other player.
This is managed with synchronous code only.
Human vs. AI: I try to use the same route as before. A test branch checks whether the game is against the computer and, if so, computes a move instead of receiving one from the other end of the WebSocket. Computing this AI move can be a blocking operation, taking 2 to 5 seconds.
I don't want the receive method of the consumer to wait for the AI to return its move, since I have other operations to do quickly (like updating some information on the client side).
Then I thought I could easily take advantage of the event loop that supposedly already exists in the Channels framework: I could submit the AI thinking process to this loop and return the result to the client later through the consumer's send method.
However, when I write:
loop = asyncio.get_event_loop()
loop.create_task(my_AI_thinking())
Django raises a RuntimeError (the same as described here: https://github.com/django/asgiref/issues/278) telling me there is no running event loop.
The solution seemed to be to upgrade asgiref to 3.5.0, which I did, but the issue is not solved.
I think I am a little short on background here, and some enlightenment would help me understand the root cause of this failure.
My first questions would be:
In the combo Django + Channels + ASGI, which component is in charge of running the event loop?
How can I check whether an event loop is actually running, whatever the thread?
Maybe your answers will raise other questions.
Did you try running your event-loop example on Django 3.2 (and/or with a different Python version)? I experienced various problems with Django 4.0 & Python 3.10, so I am staying with Django 3.2 and Python 3.7/3.8/3.9 for now; maybe your errors are among those problems?
If you can't get the event loop running, I see two possible alternative solutions:
Open two WS connections: one only for the moves, and the other for all the other stuff, such as updating information on the player's UI, etc.
You can also use multiprocessing to "manually" offload the AI move calculation to another process, and then join the results again after receiving the move. To be honest, multiprocessing in Python is quite simple, and it's pretty handy if you are familiar with the idea of multithreaded applications.
Unfortunately, I have not yet used event loops in Channels myself; maybe someone more experienced in that matter will be able to better address your issue.

Unordered socket read & close notification using IOCP

Most server frameworks/examples using sockets and I/O completion ports deliver notifications in a way whose purpose I couldn't completely figure out.
Upon read, packets are processed and usually reordered, to work around thread scheduling making packets get handled out of order even though IOCP itself ensures a FIFO queue.
The problem is when a socket is closed, gracefully or because of an error. I have seen in both situations that, again because of the OS thread scheduler, the close notification may be delivered to the application (e.g. an HTTP server using the framework) before the notifications for data read earlier.
I think the close notification should be queued in such a way that the application receives it after the previous reads.
Is there an intended purpose behind the way most of the code I saw does it, or might my proposed behaviour be correct depending on the situation?
What you suggest makes sense, and I would imagine that any code that handles graceful close (a read returning 0 bytes) would do so by processing it after any preceding successful read. Errors coming out of GetQueuedCompletionStatus(), such as connection reset errors, etc., are harder to integrate into the receive flow, as they occur out of band as far as the receive data is concerned. Your question is a bit vague and depends very much on the code you're using and how you (or the people who wrote that code) want to handle these things. There is no single correct way, IMHO.

Is membase a good persistence layer for an Erlang game server?

I aim to create a browser game where players can set up buildings.
Each building will have several modules (engines, offices, production lines, ...). Each module will eventually have one or more actions running, like the creation of 200 'item X' from ingredients Y and Z.
The game server will be set up with Erlang: an OTP application as the server itself, and Nitrogen as the web front end.
I need persistence of data. I was thinking about the following:
When somebody or something interacts with a building, or a timer representing some production line fires, a supervisor spawns a gen_server (if not already spawned) which loads the state of the building from a database, so the gen_server can answer messages like 'add this module', 'start this action', 'store this production in the warehouse', 'die', etc.
When a building doesn't receive any messages for X seconds or minutes, it will terminate (thanks to the gen_server timeout feature) and write its current state back to the database.
So, as it will be a (soft) real-time game, the gen_server must start up very quickly. I was thinking of membase as the database, because it's known to have very good response times.
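A minimal sketch of that load-on-interaction / save-on-idle pattern (the module name, the db_load/db_save placeholders, and the timeout value are all hypothetical):
-module(building_server).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-define(IDLE_TIMEOUT, 60000).  %% stop after 60 s without messages

start_link(BuildingId) ->
    gen_server:start_link(?MODULE, BuildingId, []).

init(BuildingId) ->
    {ok, db_load(BuildingId), ?IDLE_TIMEOUT}.

%% Every handled message re-arms the idle timeout.
handle_call(Action, _From, State) ->
    {reply, ok, apply_action(Action, State), ?IDLE_TIMEOUT}.

handle_cast(_Msg, State) ->
    {noreply, State, ?IDLE_TIMEOUT}.

%% gen_server delivers 'timeout' when nothing arrived within ?IDLE_TIMEOUT.
handle_info(timeout, State) ->
    {stop, normal, State}.

%% The state is written back to the database on the way down.
terminate(_Reason, State) ->
    db_save(State).

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

%% Stand-ins for the real persistence layer and game logic:
db_load(BuildingId) -> {building, BuildingId, []}.
db_save(_State) -> ok.
apply_action(_Action, State) -> State.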
My question is: when a gen_server is up and running, its state occupies some memory, and the same state is also present in the memory handled by membase, so the state takes up twice its size in memory. Is that bad design?
Is membase a good solution to handle persistence in my case? Would mnesia be a better choice, or something else?
I fear mnesia's 2 GB (or 4?) table size limit, because at the moment I don't know the average state size of my gen_servers (buildings in this example, but also players, production lines, whatever), and I may someday have more than one player :)
Thank you
I agree with Hynek -Pichi- Vychodil: Riak is a great thing for key-value storage.
We use Riak for almost 95% of the same things you described. Everything works so far without any issues. In case you hit a performance limitation of Riak, add more nodes and it's good to go!
Another cool thing about Riak is its very low performance degradation over time. You can find more information about benchmarking Riak here: http://joyeur.com/2010/10/31/riak-smartmachine-benchmark-the-technical-details/
In case you go with it:
a driver: https://github.com/basho/riak-erlang-client
a connection pool you may need to work with it: https://github.com/dweldon/riakpool
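Saving and loading a process state with that driver looks roughly like this (host, port, bucket, and key are made up):
%% Connect to a Riak node over protocol buffers.
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087).
State = [{gold, 100}, {modules, []}].  %% example building state
%% Store the state as an opaque Erlang binary.
Obj = riakc_obj:new(<<"buildings">>, <<"building:42">>, term_to_binary(State)).
ok = riakc_pb_socket:put(Pid, Obj).
%% Load it back later.
{ok, Fetched} = riakc_pb_socket:get(Pid, <<"buildings">>, <<"building:42">>).
State = binary_to_term(riakc_obj:get_value(Fetched)).  %% matches: round-trip OK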
About membase and memory usage: I also tried membase, but I found that it is not suitable for my tasks (membase advertises fault tolerance, but I could not set it up so that it actually handled faults; even with help from the membase guys I didn't succeed). So at the moment I use the following architecture: all players that are online and playing the game are represented as player processes (gen_server). All the data and business logic for each player lives in its player process. From time to time, each player process decides to save its state in Riak.
So far this seems to be a very fast and efficient approach.
Update: we are now on PostgreSQL. It is awesome!
You can look at bitcask or the other Riak backends to store your data. Avoiding IPC is definitely a good idea, so keep it inside Erlang.

Erlang in-memory linked list -- Exhausting memory?

I tried to make an Erlang in-memory datastore that would receive messages and add them to a list. Here's the current incarnation. The trouble is, I'm receiving about 200 messages per second and this easily exhausts the memory available.
Once a minute, I send a {write, Pid} message that should clear out and clean up this list, but it doesn't look like it's being garbage collected.
What am I doing wrong? I think I'm approaching this from the completely wrong direction...
datastore(Db) ->
    receive
        {put, Data} ->
            datastore(lists:concat([Data, Db]));
        {write, Responder} ->
            ScratchName = "ScratchFile.dat",
            {ok, ScratchDevice} = file:open(ScratchName, [write]),
            file:write(ScratchDevice, Db),
            ok = file:close(ScratchDevice),
            Responder ! {load, ScratchName},
            datastore([])
    end.
My first spontaneous comment is that file:open with the write option will open the file, truncate it, and then write to it. So every pass through the loop overwrites any previous data, and if the Responder is slow loading the file, it could find data it did not expect.
Second reaction is that you don't have to do this buffering yourself. If you open the file with the option {delayed_write, Size, Delay}, and set Size and Delay to values that fit your purpose, you get precisely what you are trying to implement here by just writing all the time.
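A sketch of that option (the buffer size and delay here are arbitrary):
%% Buffer up to 64 kB or 2 seconds of writes inside the file driver;
%% append mode also avoids the truncate-on-open problem mentioned above.
{ok, Dev} = file:open("ScratchFile.dat",
                      [append, {delayed_write, 64 * 1024, 2000}]).
ok = file:write(Dev, <<"some data">>).
ok = file:close(Dev).  %% close flushes whatever is still buffered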
Third reaction is that you are probably doing the wrong thing if you use a file to communicate between different parts of your system. What are you attempting to do?
P.S. If you need a new random filename, you can easily generate one with erlang:now/0 and io_lib:format/2. As an added bonus, the names will sort in creation order.
This is a very wrong way to do buffering in Erlang. Data structures such as ETS (http://www.erlang.org/doc/man/ets.html) have been designed to handle thousands and millions of in-memory Erlang terms with ease. Please do not use lists or queues to hold large volumes of data. If one part of your code produces data that other parts of the application are supposed to consume, and you know the consumers will run at a slower rate than the producer, then you need a more robust way of buffering: ETS tables.
Another thing: processes are usually a point of failure in a system. If a process is used to buffer or hold on to essential data, even transient data that is critical to the system, what happens when that process exits or dies? ETS tables of type public are designed so that they can provide data access to all processes, even other applications, within the same VM. That way all processes can read as much as they want (concurrently), while you ensure consistency by having a single writer/updater.
ETS tables rarely fail in an application, compared with the frequency at which processes fail. More recently, a method was introduced that helps preserve the data in an ETS table when its owning process is failing: ets:give_away/3.
One more thing: in a comment above, you mentioned that you are working for a large company. With large teams it's better to evaluate a number of options and test them intensively against the kind of load your application will see. To avoid surprises, identify which data structures are best for which job. For example, for in-memory storage capable of handling 200 messages per second, if tested properly, lists and files would lose against ETS tables.
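A tiny sketch of the ETS-based buffer (the table name and message terms are made up): a single writer inserts, and any process can read concurrently.
%% One public, ordered table; created once by the owning process.
Tab = ets:new(msg_buffer, [named_table, public, ordered_set]).
%% Writer side: key each message with a monotonic integer so the
%% table preserves arrival order.
true = ets:insert(msg_buffer, {erlang:unique_integer([monotonic]), {put, data1}}).
%% Consumer side: take the oldest entry, process it, then delete it.
Key = ets:first(msg_buffer).  %% '$end_of_table' when the buffer is empty
[{Key, Msg}] = ets:lookup(msg_buffer, Key).
true = ets:delete(msg_buffer, Key).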

Use it for JSON data transfer

I am trying to use RabbitMQ for a distributed system that would work something like:
a producer puts in a queue a JSON-formatted list of order ids
several consumers pull from that queue, do the business logic with those order ids, and put the result (also JSON formatted) back into another queue
from the second queue, another consumer will take the data and pass it back to the caller
I am still very new to RabbitMQ and I am wondering whether this model is the right approach, given that the data should come back as fast as possible (sometimes in a matter of seconds, 5 at most), so there are near-real-time requirements.
Also, how large can a message passed to a queue be? The JSON the producer gets back will be fairly large, depending on what the consumer does.
Thanks for any ideas!
See page 47 in this presentation (InfoQ) for a great comparison between different messaging formats.
There's nothing wrong with the design you suggested.
The slight wrinkle is that enforcing "real time requirements" isn't straightforward. For instance, it's not currently possible to expire messages within a queue, so this would need to be handled by the clients when consuming messages.
The total size of messages in RabbitMQ <= 1.8.1 was bounded by the amount of available RAM. As of 2.0.0, it is bounded by the amount of available disk space (i.e. Rabbit will page messages to disk if it's running low on memory). Individual message sizes are recorded as 32-bit integers (IIRC), so a single message cannot be larger than ~4 GB; if that is a problem, consider saving the JSONs to network storage and passing an ID referencing them in the messages. Other than this, there aren't any constraints.
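If you use the RabbitMQ Erlang client, the producer side of the design you described looks roughly like this (the queue name and payload are made up; this is only a sketch assuming the amqp_client application is available):
-module(order_producer).
-export([publish/1]).
-include_lib("amqp_client/include/amqp_client.hrl").

%% Publish a JSON binary of order ids to a work queue via the default exchange.
publish(Json) when is_binary(Json) ->
    {ok, Conn} = amqp_connection:start(#amqp_params_network{}),
    {ok, Ch} = amqp_connection:open_channel(Conn),
    %% Make sure the work queue the consumers pull from exists.
    #'queue.declare_ok'{} =
        amqp_channel:call(Ch, #'queue.declare'{queue = <<"order_ids">>}),
    ok = amqp_channel:cast(Ch,
                           #'basic.publish'{exchange = <<>>,
                                            routing_key = <<"order_ids">>},
                           #amqp_msg{payload = Json}),
    ok = amqp_channel:close(Ch),
    ok = amqp_connection:close(Conn).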
