Erlang ETS memory fragmentation

Erlang ETS memory fragmentation - erlang

I have an erlang cluster where erlang:memory() 'total' is between 2-2.5GB from idle to busy time, day in day out. ets memory usage is around 440M and stays around there no matter what. The data within ets is heavily transient, completely changes throughout the day. Tomorrows data is guaranteed to have no commonality to today's.
Linux top says beam is using like 10 gigabytes. free -m 'used' agrees with that (the machine really only runs beam). The overall memory usage of the system grows regularly, like 1% per day on 16GB systems. There is some variance across nodes, but not by alot, and OS 'used' memory is always several times more than erlang:memory() total.
erlang:system_info({allocator, ets_alloc}) shows 20 allocators. Most have data that looks something like this (full output of command is here):
{mbcs_pool,[{blocks,2054},
{blocks_size,742672},
{carriers,10},
{carriers_size,17825792}]},
1) Does this mean that 742K bytes (words?) of memory are actually taking 17M of OS memory?
2) As this post suggests, should we add '+MEas bf' to the VM args, in order to reduce overhead?
3) What else can I do to avoid actually running out of memory?
This is R17.5 but we will be migrating to R19.3 in next deployment (this week). We don't have recon in the current deployment but will be adding it in the next deployment. Also, can't imagine this matters, but beam is running inside an alpine container.

In case someone else runs into this later: this was not actually leaked memory.
The default memory allocator strategy of erlang may not be optimal for your use, depending what you do, and depending on how erlang is configured to allocate blocks. Turns out, in some cases, "free" memory from erlang point of view won't necessarily be immediately released to the OS due to allocator fragmentation.
It's somewhat explained here: http://erlang.org/doc/man/erts_alloc.html
The default allocator strategy for the version of erlang we used at the time is aoffcbf (address order first fit carrier best fit). In our case, this resulted in very high memory fragmentation (10+GB overhead worth). When troubleshooting these things, erlang:system_info(allocator) and erlang:system_info({allocator, Alloc}) are your friend. Changing to aobff (address order best fit) resulted in much more efficient memory usage. In truth, as long as the machine didn't run out of physical memory, it wouldn't matter, but for us, we were getting dangerously close to the physical limit. And you do not want to start paging. With aobff, we never passed 4GB, even after the node being up 18 months. With the aoffcbf we would pass 10GB in a few weeks.
As always, YMMV, as it all depends what type, size, etc.. of blocks are allocated, and how long they live.

Related

memory management and segmentation faults in modern day systems (Linux)

In modern-day operating systems, memory is available as an abstracted resource. A process is exposed to a virtual address space (which is independent from address space of all other processes) and a whole mechanism exists for mapping any virtual address to some actual physical address.
My doubt is:
If each process has its own address space, then it should be free to access any address in the same. So apart from permission restricted sections like that of .data, .bss, .text etc, one should be free to change value at any address. But this usually gives segmentation fault, why?
For acquiring the dynamic memory, we need to do a malloc. If the whole virtual space is made available to a process, then why can't it directly access it?
Different runs of a program results in different addresses for variables (both on stack and heap). Why is it so, when the environments for each run is same? Does it not affect the amount of addressable memory available for usage? (Does it have something to do with address space randomization?)
Some links on memory allocation (e.g. in heap).
The data available at different places is very confusing, as they talk about old and modern times, often not distinguishing between them. It would be helpful if someone could clarify the doubts while keeping modern systems in mind, say Linux.
Thanks.

Technically, the operating system is able to allocate any memory page on access, but there are important reasons why it shouldn't or can't:
different memory regions serve different purposes.
code. It can be read and executed, but shouldn't be written to.
literals (strings, const arrays). This memory is read-only and should be.
the heap. It can be read and written, but not executed.
the thread stack. There is no reason for two threads to access each other's stack, so the OS might as well forbid that. Moreover, the tread stack can be de-allocated when the tread ends.
memory-mapped files. Any changes to this region should affect a specific file. If the file is open for reading, the same memory page may be shared between processes because it's read-only.
the kernel space. Normally the application should not (or can not) access that region - only kernel code can. It's basically a scratch space for the kernel and it's shared between processes. The network buffer may reside there, so that it's always available for writes, no matter when the packet arrives.
...
The OS might assume that all unrecognised memory access is an attempt to allocate more heap space, but:
if an application touches the kernel memory from user code, it must be killed. On 32-bit Windows, all memory above 1<<31 (top bit set) or above 3<<30 (top two bits set) is kernel memory. You should not assume any unallocated memory region is in the user space.
if an application thinks about using a memory region but doesn't tell the OS, the OS may allocate something else to that memory (OS: sure, your file is at 0x12341234; App: but I wanted to store my data there). You could tell the OS by touching the end of your array (which is unreliable anyways), but it's easier to just call an OS function. It's just a good idea that the function call is "give me 10MB of heap", not "give me 10MB of heap starting at 0x12345678"
If the application allocates memory by using it then it typically does not de-allocate at all. This can be problematic as the OS still has to hold the unused pages (but the Java Virtual Machine does not de-allocate either, so hey).
Different runs of a program results in different addresses for variables
This is called memory layout randomisation and is used, alongside of proper permissions (stack space is not executable), to make buffer overflow attacks much more difficult. You can still kill the app, but not execute arbitrary code.
Some links on memory allocation (e.g. in heap).
Do you mean, what algorithm the allocator uses? The easiest algorithm is to always allocate at the soonest available position and link from each memory block to the next and store the flag if it's a free block or used block. More advanced algorithms always allocate blocks at the size of a power of two or a multiple of some fixed size to prevent memory fragmentation (lots of small free blocks) or link the blocks in a different structures to find a free block of sufficient size faster.
An even simpler approach is to never de-allocate and just point to the first (and only) free block and holds its size. If the remaining space is too small, throw it away and ask the OS for a new one.
There's nothing magical about memory allocators. All they do is to:
ask the OS for a large region and
partition it to smaller chunks
without
wasting too much space or
taking too long.
Anyways, the Wikipedia article about memory allocation is http://en.wikipedia.org/wiki/Memory_management .
One interesting algorithm is called "(binary) buddy blocks". It holds several pools of a power-of-two size and splits them recursively into smaller regions. Each region is then either fully allocated, fully free or split in two regions (buddies) that are not both fully free. If it's split, then one byte suffices to hold the size of the largest free block within this block.

How to reserve memory for my application and leave a specified amount remaining?

I'm planning an application which will involve loading many pictures at one time and thus requires a large chunk of memory. For example, I might have 50 image objects created at once, taking a total of 1GB of RAM. But when the user goes to load 20 more pictures, I'd like to make sure that amount of memory is already reserved and ready.
Now this part might seem a little backwards from normal. Rather than specifying how much memory my application shall reserve, instead I need to specify how much memory to leave free for other applications, and adjust my application's memory periodically according to this specification. I must say I've never worked with reserving memory at all, and especially won't know how to leave this remaining available memory.
So for example, if the computer has 2048 MB of RAM, and the option is set to leave 50 MB free for other applications, and there is already 10MB of RAM being used by other apps, then it should reserve 2048-50-10 = 1988 MB for my app.
The trouble I foresee is suppose the user opens another application which requires 1GB. My app has to catch this and shrink its self.
Does this even sound like a feasible approach? Basically, I need to make sure there is as much memory reserved as possible at any given time, while leaving a decent amount available for other apps. Would it make a significant impact on performance if I do this, or not much at all? I might be loading and unloading images at rapid paces, and I don't want it to reserve/free this memory on demand, I want it to stay reserved.

+1 for Sertac's mentioning of how SQL Server rides the line of allocating memory it needs, but releasing memory when Windows complains.
Applications can receive Window's complaints by using the CreateMemoryResourceNotification:
hLowMemory := CreateMemoryResourceNotification(LowMemoryResourceNotification);
Applications can use memory resource notification events to scale the
memory usage as appropriate. If available memory is low, the
application can reduce its working set. If available memory is high,
the application can allocate more memory.
Any thread of the calling
process can specify the memory resource notification handle in a call
to the QueryMemoryResourceNotification function or one of the wait functions.
The state of the object is signaled when the specified
memory condition exists. This is a system-wide event, so all
applications receive notification when the object is signaled. Note
that there is a range of memory availability where neither the
LowMemoryResourceNotification or HighMemoryResourceNotification object
is signaled. In this case, applications should attempt to keep the
memory use constant.
But it's also worth mentioning that you might as well allocate memory that you need. Your operating system has a very sophisiticated set of algorithms to swap out the least used memory when memory pressure is high. You can take advantage of this by simply allocating all the memory that you need. When Windows starts to run low, it will find those pages of memory that you are using the least and swap them out to disk. (This is how a well-known reverse proxy works).
The only thing left is to decide if you want to free some images when Windows says it's running low on RAM. But if you're not using the memory, it is going to be swapped out to disk for you.

It's not realistic to account for other apps. Just ignore them. The system will page things in and out as needed. If you really wanted to do this you'd have to dynamically adapt to other processes as they start and finish. That's really not realistic. What's more it's not practical to inquire of other processes how much memory they need. Leave it all to the system.
Set a budget for your app and make sure you don't exceed it. Keep the most recently used images in memory and when you approach your memory budget throw away the least recently used images to make space.
If you are stressing the available resources then make sure you use FastMM and enable LARGE_ADDRESS_AWARE for your app so that you get 4GB address space when running on a 64 bit OS.

Should a process always consume the same amount of memory if executed in the same way?

Hi folks and thanks for your time in advance.
I'm currently extending our C# test framework to monitor the memory consumed by our application. The intention being that a bug is potentially raised if the memory consumption significantly jumps on a new build as resources are always tight.
I'm using System.Diagnostics.Process.GetProcessByName and then checking the PrivateMemorySize64 value.
During developing the new test, when using the same build of the application for consistency, I've seen it consume differing amounts of memory despite supposedly executing exactly the same code.
So my question is, if once an application has launched, fully loaded and in this case in it's idle state, hence in an identical state from run to run, can I expect the private bytes consumed to be identical from run to run?
I need to clarify that I can expect memory usage to be consistent as any degree of varience starts to reduce the effectiveness of the test as a degree of tolerance would need to be introduced, something I'd like to avoid.
So...
1) Should the memory usage be 100% consistent presuming the application is behaving consistenly? This was my expectation.
or
2) Is there is any degree of variance in the private byte usage returned by windows or in the memory it allocates when requested by an app?
Currently, if the answer is memory consumed should be consistent as I was expecteding, the issue lies in our app actually requesting a differing amount of memory.
Many thanks
H

Almost everything in .NET uses the runtime's garbage collector, and when exactly it runs and how much memory it frees depends on a lot of factors, many of which are out of your hands. For example, when another program needs a lot of memory, and you have a lot of collectable memory at hand, the GC might decide to free it now, whereas when your program is the only one running, the GC heuristics might decide it's more efficient to let collectable memory accumulate a bit longer. So, short answer: No, memory usage is not going to be 100% consistent.
OTOH, if you have really big differences between runs (say, a few megabytes on one run vs. half a gigabyte on another), you should get suspicious.

If the program is deterministic (like all embedded programs should be), then yes. In an OS environment you are very unlikely to get the same figures due to memory fragmentation and numerous other factors.
Update:
Just noted this a C# app, so no, but the numbers should be relatively close (+/- 10% or less).

cooperative memory usage across threads?

I have an application that has multiple threads processing work from a todo queue. I have no influence over what gets into the queue and in what order (it is fed externally by the user). A single work item from the queue may take anywhere between a couple of seconds to several hours of runtime and should not be interrupted while processing. Also, a single work item may consume between a couple of megabytes to around 2GBs of memory. The memory consumption is my problem. I'm running as a 64bit process on a 8GB machine with 8 parallel threads. If each of them hits a worst case work item at the same time I run out of memory. I'm wondering about the best way to work around this.
plan conservatively and run 4 threads only. The worst case shouldn't be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.
make each thread check available memory (or rather total allocated memory by all threads) before starting with a new item. Only start when more than 2GB memory are left. Recheck periodically, hoping that other threads will finish their memory hogs and we may start eventually.
try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding user choice) or simply adjust the number of running worker threads.
more ideas?
I'm currently tending towards number 2 because it seems simple to implement and solve most cases. However, I'm still wondering what standard ways of handling situations like this exist? The operating system must do something very similar on a process level after all...
regards,
Sören

So your current worst-case memory usage is 16GB. With only 8GB of RAM, you'd be lucky to have 6 or 7GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?
Basically you can either reduce memory consumption, or increase available memory. Your option 1, running only 4 threads, under-utilitises the CPU resources, which could halve your throughput - definitely sub-optimal.
Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to increase the cache size, a background process could start up and swap in its working set, and any number of other factors. For these reasons, the smaller the available memory, the less you can rely on it. Also, over time memory fragmentation can cause problems too.
Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.
So taking the "reduce consumption" strategy, do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (eg. random versus sequential) you could progressively load the data. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.

I have continued the discussion on Herb Sutter's blog and provoced some very helpful reader comments. Head over to Sutter's Mill if you are interested.
Thanks for all the suggestions so far!
Sören

Difficult to propose solutions without knowing exactly what you're doing, but how about considering:
See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory.
Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.
Hope that helps?!

Checking the amount of available RAM within a running program

A friend of mine was asked, during a job interview, to write a program that measures the amount of available RAM. The expected answer was using malloc() in a binary-search manner: allocating larger and larger portions of memory until getting a failure message, reducing the portion size, and summing the amount of allocated memory.
I believe that this method will measure the amount of virtual, not physical, memory. But I got curious about the matter.
Is there a way to tell the amount of available RAM from within the program, without using exec(dmesg |grep -i memory) ?

You are correct: malloc() makes no distinction between physical or virtual memory. In fact, that's the whole point of virtual memory: to make such details irrelevant to programs.
You can find out but it is OS-specific. For example, Linux.

The only way to do this is to use some OS-specific functionality. Using malloc() is useless for a number of reasons:
it measures virtual memory
the OS may well have per-process cap on memory allocations
allocating much more memory than is physically available often degrades the platforms stability to the point where "go back one" algorithm suggested in the question probably won't work

this is OS specific and you should collect such information from the OS services unless you want to make your own memory management layer

Using malloc() will only tell you how much memory can be allocated to a single process. There may be reasons why this is lower than the total amount of virtual memory. For instance, you might have OS quota or a per-process 32-bit-limited address space.
(And, of course, virtual memory >= RAM)

Very OS specific but for Linux the information about system memory is in /proc/meminfo. You can also probably use the sysctl interface (http://www.linuxjournal.com/article/2365) to get this data in a C program.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart