The go test tool has a profiler that can tell you the number of allocations performed inside your code.
However, seeing libraries such as this one:
https://github.com/valyala/fasthttp
stating "Zero memory allocations in hot paths"... what does that mean? and how do you achieve this in Go?
I personally don't like their choice of words, as it sounds like something a marketer would say... All they mean is that no allocations will occur in that code because a buffer has been allocated in advance for use there.
So to be clear, they mean "in this limited scope, no allocations will occur". How do you achieve this? By allocating a sufficiently large buffer in advance and then reusing it within that scope.
The intent of the package's author(s) is to speed up request handling by allocating up front, at the cost of using more memory (or at least holding on to memory more consistently; in theory the buffer could be the same size as what would otherwise be allocated).
If you're curious about the implementation details, take a look at files like byte_buffer.go and args.go and you'll find that there is a pool of buffer objects allocated in advance, so that your handler code doesn't have to do an allocation for the response body etc. Instead you obtain an already-allocated buffer from the pool, write the response data to it, and when you're done it's released back into the pool for reuse. In the standard scenario you'd instead allocate space for the response body, and after the response is returned that object would go out of scope and the memory would be freed. As mentioned above, moving all of this up front means that when your service starts up it will obtain and hold a larger amount of memory than a similar service built on net/http, which instead obtains and releases memory on an as-needed basis.
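For a rough idea of the pattern, here is a minimal sketch using the standard library's sync.Pool; it is not fasthttp's actual pool code, and handle is a made-up function standing in for a request handler:

package main

import (
    "bytes"
    "fmt"
    "sync"
)

// bufPool hands out reusable buffers so the handler below does not
// allocate a fresh one for every request.
var bufPool = sync.Pool{
    New: func() interface{} { return new(bytes.Buffer) },
}

func handle(body string) string {
    buf := bufPool.Get().(*bytes.Buffer) // reuse a previously allocated buffer if one is available
    buf.Reset()                          // clear old contents but keep the underlying capacity
    buf.WriteString("response: ")
    buf.WriteString(body)
    out := buf.String() // this copy still allocates; it is only here to return a value
    bufPool.Put(buf)    // hand the buffer back for the next request
    return out
}

func main() {
    fmt.Println(handle("hello"))
}

After the pool has warmed up, the WriteString calls reuse the buffer's existing capacity, which is the "zero allocations in hot paths" effect the README is describing.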
I'm using Instruments on a device to try to figure out whether I have any leaked or abandoned memory. Specifically, I am using Leaks and Allocations. While Instruments doesn't point out any leaks, that doesn't mean I don't have memory issues. I've been working on this for weeks, and I don't seem to be any closer to figuring out what issues I have (ugh).
I am testing a particular action by taking a heapshot after the action and repeating. After the first few "settling" generations, I can see that the growth and persistent counts all start out at a certain number (several KB). After many repeated iterations (say 10-20), some (not all) slowly drain to 0. It takes a while, but it does happen. The generations where persistent memory remains never actually show me anything that I find helpful, as the stack traces show only system libraries.
So my questions are:
What does this type of behavior indicate? Do I have memory issues? Is there some type of lazy release of memory somewhere?
In a sea of iterations that show persistent memory, what does one zero heap growth iteration mean?
If the stack trace for a particular generation points only to system libraries, does this mean the heap growth for that generation is valid or that there is a bug? Or could it still mean that there is something holding onto the memory on my end?
What does it mean when the stack trace shows your library and method, but it is greyed out like the system code and has a little house icon, vs a line with your library and method that is in black and has a little person icon?
If I have something like a retain cycle - wouldn't the persistent growth be consistent?
Any answers or insights would be extremely helpful!
I'll take a stab at your questions:
What does this type of behavior indicate? Do I have memory issues? Is there some type of lazy release of memory somewhere?
Since you can't know how the system frameworks manage their private memory needs, you must assume that yes, there could be lazy/deferred release of memory happening any time you call into the system frameworks, which in most apps is "all the time". Beyond not being able to rule it out, I can say with some certainty that there definitely are long-lived allocations triggered by seemingly-innocuous system framework usage. (See the discussion of UIWebView's long-lived memory use in this answer for an example.)
In a sea of iterations that show persistent memory, what does one zero heap growth iteration mean?
Hard to say. A good first-order guess might be that the heap growth associated with the iteration was somehow exactly offset by a lazy/deferred release of the memory allocated for a previous iteration.
If the stack trace for a particular generation points only to system libraries, does this mean the heap growth for that generation is valid or that there is a bug? Or could it still mean that there is something holding onto the memory on my end?
If Instruments shows heap growth, then that heap growth almost certainly exists. Whether that heap growth is something you have direct control over depends. If you make no calls into system frameworks (not likely), then it's definitely your fault. Once you make a call into the system frameworks, you have to accept the possibility that the framework might allocate memory that stays allocated after your call returns.
What does it mean when the stack trace shows your library and method, but it is greyed out like the system code and has a little house icon, vs a line with your library and method that is in black and has a little person icon?
Lines being greyed out indicates that Instruments doesn't have debug symbols for that line. That's all. It doesn't indicate anything specific with regard to memory use.
If I have something like a retain cycle - wouldn't the persistent growth be consistent?
If each iteration created a new object graph with cyclic retains, then yes, you would expect that each iteration would cause heap growth by at least the size of that object graph. That said, small object graphs can easily be lost in the "noise." If you have suspicions, one way is to have objects of a "suspect" class perform a huge allocation that will make them stand out from the "noise." For instance, make your object malloc a megabyte (or more) for every instance (and, obviously, free it when the instance is deallocated.) This can help problem areas stick out where they might not have originally.
Note: 32 bit application, which is not planned to be migrated to 64 bit.
I'm working with a very memory-consuming application and have pretty much optimized all the relevant paths with respect to memory allocation/de-allocation. (There are no memory leaks, no handle leaks, and no other kinds of leaks in the application itself, as far as I know and have tested. 3rd party libs which I cannot touch are of course candidates, but unlikely in my scenario.)
The application frequently allocates large one- and two-dimensional dynamic arrays of Single and of packed records of up to 4 Singles. By large I mean that 5000x5000 of record(Single, Single, Single, Single) is normal, and having even 6 or 7 such arrays in use at a given time is common. This is needed because there are a lot of cross-computations made on these arrays, and reading them from disk would be a real performance killer.
With that clarified, I am getting out-of-memory errors a lot because of these large dynamic arrays, which will not go away after I release them, no matter whether I SetLength them to 0 or Finalize them. This is of course something FastMM does in order to be fast, I know that much.
I am tracking both FastMM allocated blocks and process consumed memory (RAM + PF) by using:
function CurrentProcessMemory(AWaitForConsistentRead: boolean): Cardinal;
var
  MemCounters: TProcessMemoryCounters;
  LastRead: Cardinal;
  maxCnt: integer;
begin
  Result := 0; // stupid D2010 compiler warning
  maxCnt := 0;
  repeat
    Inc(maxCnt);
    // This is a stabilization loop:
    // in tight loops, the system doesn't get much chance to release allocated
    // resources, which in turn get falsely reported by this function as still
    // being used, resulting in a false-positive memory leak report in the
    // application. So we loop here, waiting, until the reported memory
    // becomes stable.
    LastRead := Result;
    MemCounters.cb := SizeOf(MemCounters);
    if GetProcessMemoryInfo(GetCurrentProcess,
                            @MemCounters,
                            SizeOf(MemCounters)) then
      Result := MemCounters.WorkingSetSize + MemCounters.PagefileUsage
    else
      RaiseLastOSError;
    if AWaitForConsistentRead and (LastRead <> 0) and (Abs(LastRead - Result) > 1024) then
    begin
      Sleep(60);
      Application.ProcessMessages;
    end;
  until (not AWaitForConsistentRead) or (Abs(LastRead - Result) < 1024) or (maxCnt > 1000);
  // A 60 second wait is a bit too much, so if the system is that "unstable",
  // let's just forget it.
end;
function CurrentFastMMMemory: Cardinal;
var
  mem: TMemoryManagerUsageSummary;
begin
  GetMemoryManagerUsageSummary(mem);
  Result := mem.AllocatedBytes + mem.OverheadBytes;
end;
I am running the code on a 64-bit computer and my top memory consumption before crashes is about 3.3 - 3.4 GB. After that, I get memory/resource-related crashes anywhere in the application. It took me some time to pin it down to the large dynamic array usage, which was buried in some 3rd party library.
The way I am getting around this is that I made the application resume itself from where it left off, by closing and re-starting itself with certain parameters.
This is all nice and dandy if memory consumption is fair and the current operation finishes.
The big problem happens when the current memory usage is 1 GB and the next operation to process requires 2.5 GB of memory or more. My current code limits itself to an upper value of 1.5 GB of used memory before resuming, but in this situation I'd have to drop the limit to under 1 GB, which would basically have the application resume itself after each operation, and even that would not guarantee that everything will be fine.
What if another operation has a larger data set to process and requires a total of 4 GB or more of memory?
Note that I am not talking about 4 GB of actual data in memory, but about memory consumed by allocating huge dynamic arrays which the OS doesn't get back once they are de-allocated, and hence still sees as consumed, so it adds up.
So, my next point of attack is to force FastMM to release all (or at least part) of its memory to the OS. I'm specifically targeting the huge dynamic arrays here. Again, these are in a 3rd party library, so re-coding that is not really among the top options. It's much easier and faster to tinker with the FastMM code and write a proc to release the memory.
I can't switch from FastMM, as currently the entire application and some of the 3rd party libs are heavily coded around the use of PushAllocationGroup in order to quickly find and pinpoint any memory leaks. I know I can write a dummy FastMM unit to solve the compilation references, but I would be left without this quick and certain leak detection.
In conclusion: is there any way I can force FastMM to release at least some of its large blocks to the OS? (Well, sure there is; the actual question is: did anybody already write it, and if so, do you mind sharing?)
Thanks
Later edit:
I will come up with a small relevant test application soon. It doesn't appear to be that easy to mock one up.
I doubt that the issue is actually down to FastMM. For huge memory blocks, FastMM will not do any sub-allocation. Your allocation request will be handled with a straight VirtualAlloc. And then deallocation is VirtualFree.
That's assuming that you are allocating those 380 MB objects in one contiguous block. I suspect that what you actually have are ragged 2D dynamic arrays, and they are not single allocations. A 5000x5000 ragged 2D dynamic array takes 5001 allocations to initialise: one for the row pointers, and 5000 for the rows. Those will be medium FastMM blocks, so there will be sub-allocation.
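To illustrate the difference in allocation counts (a Go sketch, purely for illustration since the question is about Delphi; the quad type stands in for the packed record of four singles):

package main

import "fmt"

// quad stands in for the packed record of four singles from the question.
type quad struct{ a, b, c, d float32 }

const rows, cols = 5000, 5000

// raggedAlloc builds a row-of-rows structure: one allocation for the outer
// slice plus 5000 allocations for the rows, i.e. 5001 in total.
func raggedAlloc() [][]quad {
    m := make([][]quad, rows)
    for i := range m {
        m[i] = make([]quad, cols)
    }
    return m
}

// flatAlloc builds the same logical matrix as one contiguous block,
// indexed manually as r*cols+c: a single large allocation.
func flatAlloc() []quad {
    return make([]quad, rows*cols)
}

func main() {
    flat := flatAlloc()
    flat[3*cols+7] = quad{1, 2, 3, 4} // element (3, 7) of the logical matrix
    fmt.Println(flat[3*cols+7])
}

The contiguous version is the kind of allocation a memory manager can obtain from the OS and give straight back, whereas the ragged version produces thousands of medium-sized blocks that get sub-allocated and retained.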
I think you are asking too much. In my experience, any time you need over 3GB of memory in a 32 bit process, it's game over. Fragmentation of address space will stop you before you run out of memory. You cannot hope for this to work. Switch to 64 bit, or use a cleverer, less demanding allocation pattern. Or do you really need dense 2D arrays? Can you use sparse storage?
If you cannot alleviate your memory demands that way, you could use memory mapped files. This would allow you to make use of the extra memory that your 64 bit system has. The system's disk cache can be larger than 4GB and so your app can traverse more than 4GB of memory without actually needing to hit the disk.
You could certainly try different memory managers. I honestly do not hold out any hope that it would help. You could write a trivial replacement memory manager that used HeapAlloc. And enable the low fragmentation heap (enabled by default from Vista on). But I sincerely doubt that it will help. I'm afraid that there won't be a quick fix for you. To resolve this you face a more fundamental modification to your code.
Your issue, as others have said, is most likely attributable to memory fragmentation. You could test this by using VirtualQuery to create a picture of how memory is allocated to your application. You will very likely find that although you may have more than enough total memory for a new array, you don't have enough contiguous memory.
FastMM already does a lot to try and avoid problems due to memory fragmentation. "Small" allocations are done at the low end of the address space, whereas "large" allocations are done at the high end. This avoids a common problem where a series of large then small allocations, followed by all the large allocations being released, results in a large amount of fragmented memory that is almost unusable. (Certainly unusable by anything slightly larger than the original large allocations.)
To see the benefits of FastMM's approach, imagine your memory laid out as follows:
Each digit represents a 100 MB block.
[0123456789012345678901234567890123456789]
Small allocations represented by "s".
Large allocations represented by capital letters.
[0sssss678901GGGGFFFFEEEEDDDDCCCCBBBBAAAA]
Now if you free all your large blocks, you should have no trouble performing similar large allocations later.
[0sssss6789012345678901234567890123456789]
The problem is that "large" and "small" are relative, and highly dependent on the nature of your application. FastMM defines a dividing line between "large" and "small". If you happen to have some small allocations that FastMM would classify as large, you may encounter the following problem.
[0sss4sGGGGsFFFFsEEEEsDDDDsCCCCsBBBBsAAAA]
Now if you free the large blocks you're left with:
[0sss4s6789s1234s6789s1234s6789s1234s6789]
And an attempt to allocate something larger than 400 MB will fail.
Options
You may be able to tweak the FastMM settings so that all your "small" allocations are also considered small by FastMM. However, there are a few situations where this won't work:
Any DLLs you use that allocate memory to your application but bypass FastMem may still cause fragmentation.
If you don't release all your large blocks together, those that remain may induce fragmentation which will slowly get worse over time.
You could take on the task of memory management yourself.
Allocate one very large block e.g. 3.5GB which you keep for the entire lifetime of the application.
Instead of using dynamic arrays, you determine the pointer locations to use when setting up a new array.
Of course the simplest alternative would be to go 64-bit.
You could consider alternate data structures.
Do you really need array lookup capability? If not, another structure that allocates in smaller chunks may suffice.
Even if you do need array lookup, consider a paged array: a combination of arrays and linked lists where data is stored in pages, with a linked list chaining the pages together.
A simple variant (since you mentioned your arrays are 2-dimensional) would be to leverage that: one dimension forms its own array providing a lookup into one of multiple arrays for the second dimension (see the sketch after this list).
Related to the alternate data structures option, consider storing some data on disk. Yes, performance will be slower, but if an efficient caching mechanism can be found, then maybe not by much. It would be better to be a little slower, but not crashing.
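As a rough sketch of the paged idea (in Go rather than Delphi, purely for illustration, and using an index slice of pages instead of a linked list; PagedArray and its page size are made up):

package main

import "fmt"

const pageSize = 1 << 16 // elements per page; each page is a modest, independent allocation

// PagedArray stores float32 values in fixed-size pages, so it never needs
// one huge contiguous block, and pages can be allocated piecemeal.
type PagedArray struct {
    pages [][]float32
    n     int
}

func NewPagedArray(n int) *PagedArray {
    numPages := (n + pageSize - 1) / pageSize
    p := &PagedArray{pages: make([][]float32, numPages), n: n}
    for i := range p.pages {
        p.pages[i] = make([]float32, pageSize)
    }
    return p
}

// Get and Set translate a flat index into a page number and an offset.
func (p *PagedArray) Get(i int) float32    { return p.pages[i/pageSize][i%pageSize] }
func (p *PagedArray) Set(i int, v float32) { p.pages[i/pageSize][i%pageSize] = v }

func main() {
    a := NewPagedArray(5000 * 5000) // same element count as a 5000x5000 matrix, row-major
    a.Set(3*5000+7, 42)             // element (3, 7) of the logical 2D layout
    fmt.Println(a.Get(3*5000 + 7))
}

Because no single allocation is ever larger than one page, address-space fragmentation stops being the limiting factor, at the cost of one extra indirection per access.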
Dynamic arrays are reference counted in Delphi, so they should be automatically released when they are not used anymore.
Like strings, they are handled with COW (copy on write) when shared/stored in several variables/objects. So it seems you have some kind of memory/reference leak (e.g. an object in memory that still holds a reference to an array).
Just to be sure: you are not doing any kind of low-level pointer tricks, are you?
So please, yes, post a test program (or send the complete program privately via email) so one of us can take a look at it.
I am working on a browser application in which I use a UIWebView for opening web pages. I run the Instruments tool with the Memory Monitor. I am totally confused by the terms used in Instruments and why they're important. Please answer some of my questions with proper reasons:
Is Live Bytes important for checking memory optimization or memory consumption? Why?
Why would I care about Overall Bytes / Real Memory if it also includes released objects?
When and why are these terms (Live Bytes / Overall Bytes / Real Memory) used?
Thanks
"Live Bytes" means "memory which has been allocated, but not yet deallocated." It's important because it's the most easily graspable measure of "how much memory your app is using."
"Overall Bytes" means "all memory which has ever been allocated including memory that has been deallocated." This is less useful, but gives you some idea of "heap churn." Churn leads to fragmentation, and heap fragmentation can be a problem (albeit a pretty obscure one these days.)
"Real Memory" is an attempt to distinguish how much physical RAM is in use (as opposed to how many bytes of address space are valid). This is different from "Live Bytes" because "Live Bytes" could include ranges of memory that correspond to memory-mapped files (or shared memory, or window backing stores, or whatever) that are not currently paged into physical RAM. Even if you don't use memory-mapped files or other exotic VM allocation methods, the system frameworks do, and you use them, so this distinction will always have some importance to every process.
EDIT: Since you're clearly concerned about memory use incurred by using UIWebView, let me see if I can shed some light on that:
There is a certain memory "price" to using UIWebView at all (i.e. global caches and the like). These include various global font caches, JavaScript JIT caches, and stuff like that. Most of these are going to behave like singletons: allocated the first time you use them (indirectly by using UIWebView) and never deallocated until the process ends. There are also some variable size global caches (like those that cache web responses; CFURL typically manages these) but those are expected to be managed by the system. The collective "weight" of these things with respect to UIWebView is, as you've seen, non-trivial.
I don't have any knowledge of UIKit or WebKit internals, but I would expect that if you had a discussion with someone who did, their response to the question of "Why is my use of UIWebView causing so much memory use?" would be two pronged: The first prong would be "this is the price of admission for using UIWebView -- it's basically like running a whole web browser in your process." The second prong would be "system framework caches are automatically managed by the system" by which they would mean that, for instance, the CFURL caches (which is one of the things that using UIWebView causes to be created) are managed by the system, so if a memory warning came in, the system frameworks would be responsible for evicting things from those caches to reduce the memory consumed by them; you have no control over those, and you just have to trust that the system frameworks will do what needs to be done. (That doesn't help you in the case where whatever the system cache managers do isn't aggressive enough for you, but you're not going to get any more control over them, so you need to attack the issue from another angle, either way.) If you're wondering why the memory use doesn't go down once you deallocate your UIWebView, this is your answer. There's a bunch of stuff it's doing behind the scenes, that you can't control.
The expectation that allocating, using, and then deallocating a UIWebView is a net-zero operation ignores some non-trivial, inherent and unavoidable side-effects. The existence of such side-effects is not (in and of itself) indicative of a bug in UIWebView. There are side effects like this all over the place. If you were to create a trivial application that did nothing but launch and then terminate after one spin of the run loop, and you set a breakpoint on exit(), and looked at the memory that had been allocated and never freed, there would be thousands of allocations. This is a very common pattern used throughout the system frameworks and in virtually every application.
What does this mean for you? It means that you effectively have two choices: Use UIWebView and pay the "price of admission" in memory consumption, or don't use UIWebView.
I suspect the answer to my question is language specific, so I'd like to know about C and C++. When I call free() on a buffer or use delete[], how does the program know how much memory to free?
Where is the size of the buffer or of the dynamically allocated array stored and why isn't it available to the programmer as well?
Each implementation will be different, but typically the runtime allocates a bit more than asked for, and uses some hidden fields at the start of the block to remember the allocated size. The address returned to the caller is therefore offset a bit from the start of the memory claimed from the heap.
It isn't available to the caller because the true amount of memory claimed from the heap is an implementation detail, and will vary between compilers and platforms. As for knowing how much the caller asked for, rather than how much was allocated from the heap... well, the language designers assume that the programmer is capable of remembering this if needed.
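As a toy mental model of the "hidden size field" idea (a hedged Go sketch, not how any real malloc is implemented; alloc and sizeOf are made-up names, and the slab is just a byte slice standing in for the heap):

package main

import (
    "encoding/binary"
    "fmt"
)

// A toy bump allocator over a single slab. Each allocation is preceded by
// an 8-byte header holding the requested size, mimicking how many real
// heaps keep bookkeeping data just before the pointer they return.
const headerSize = 8

var (
    slab = make([]byte, 1<<20)
    next = 0
)

// alloc reserves n bytes and returns the offset of the usable region,
// which sits just past the hidden size header.
func alloc(n int) int {
    binary.LittleEndian.PutUint64(slab[next:], uint64(n))
    off := next + headerSize
    next += headerSize + n
    return off
}

// sizeOf recovers the size of a block from its hidden header, which is the
// same trick free() relies on and why callers never pass a size to it.
func sizeOf(off int) int {
    return int(binary.LittleEndian.Uint64(slab[off-headerSize:]))
}

func main() {
    p := alloc(100)
    fmt.Println("block at offset", p, "has size", sizeOf(p)) // prints 100
}

A real allocator also records whether a block is free, links blocks into size classes, may round sizes up, and so on, which is exactly why the "how much did I ask for" number is treated as an implementation detail.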
The heap keeps track of all memory blocks, both allocated and free, specifically for that purpose. A typical (if naive) implementation allocates memory, uses several bytes at the beginning for bookkeeping, and returns the address past those bytes. On subsequent operations (free/realloc), it subtracts those few bytes to get to the bookkeeping area.
Some heap implementations (say, Windows' GlobalAlloc()) let you find out the block size given the starting address. But in the C/C++ RTL heap there is no such service.
Note that malloc() sometimes overallocates memory, so information about the malloc'ed block size would be of limited utility. C++'s new[]'ed arrays are another matter entirely: for those, knowing the exact array size is essential for array destruction to work properly. Still, there's no such thing in C++ as a dynamic_sizeof operator.
The memory allocator that gave you that chunk of memory is responsible for all that maintenance data. Typically it's stored at the beginning of the chunk (right before the actual address you use) so it's easy to access when freeing.
As for your other question: why should your app know about it? It's not your concern. This decouples memory allocation management from the app, so you can use different allocators (for performance or debugging reasons).
It's stored internally in a location dependent on the language/compiler/OS.
Sometimes it is available (.Length in C# for example), though that may only refer to how much memory you're allowed to use, and not the object's total size.
Usually because the size to free is stored somewhere within the allocated buffer. A common technique is to store the size in memory just before the returned pointer.
Why isn't such information available to the programmer? I don't really know. I guess it's because an implementation may be able to provide memory allocation without actually needing to store the size, and such an implementation (if it exists) shouldn't be penalized by the others.
It's not so much language specific. It's all done by the memory manager.
How it knows depends on how the memory manager manages memory. The general idea is that the memory manager allocates more memory than you ask for. It stores extra data about the allocated blocks of memory in those locations. Thus, when you release the memory, it uses the information stored in those locations (reconstructed based on the given pointer) and figures out how much actual memory to stop managing.
Don't confound deallocation and destruction.
free() knows the size of the memory because of some internal magic ("implementation-defined"), e.g. the allocator could keep a list of all the allocated memory regions indexed by their corresponding pointers and just look up the pointer to know what to deallocate; or that information could be stored next to the allocated memory itself in some hidden block of data.
The array-delete expression delete[] arr; does not only deallocate memory; it also invokes all the destructors. For that purpose, it is not sufficient to know just the memory size; we also need to know the number of elements. For that reason, new T[N] actually allocates more than sizeof(T) * N bytes of memory, so the array-deleter knows how many destructors to call. All that memory is properly deallocated by the corresponding delete operator.
I was going through some of the decisions made in Xara Xtreme, an open source SVG graphics application. Their memory management decision was quite intriguing to me, since I had naively taken on-demand dynamic allocation for granted as the way of writing object-oriented applications.
The explanation from the documentation is
How on earth can static allocations be efficient?
If you are used to large dynamic data structures, this may seem strange to you. Firstly, all our objects (and thus allocation size) are far smaller (on average) than each dynamic area allocation within a program such as Impression. This means that though there are likely to be many holes within memory, they are small. Also, we have far more allocated objects within memory, and thus these holes quickly get filled. Furthermore, virtual memory managers will free up any pages of memory that contain no allocations and give this memory back to the operating system so that it may be used again (either by us, or by another task).
We benefit greatly from the fact that whenever we allocate memory in this manner, we do not have to move any memory about. This proved a bottleneck in ArtWorks, which also had many small allocations being used concurrently.
In brief, the presence of plenty of small objects and the need to avoid moving memory are the reasons given for choosing static allocation. I don't have a clear understanding of the reasons mentioned.
Though this talks about static allocation, what I see from a cursory look at the code is that a block of memory is dynamically allocated at application start and kept alive until the application ends, roughly simulating static allocation.
Could you explain in what situations static allocation fares better than on-demand dynamic allocation, such that it would be considered the main mode of allocation in a serious application?
It's quicker because you avoid the overhead of calling a system routine to manage your storage. malloc() maintains a heap, so every request requires a scan for an appropriately-sized block, possibly resizing the block, updating the block list to mark this block as used, etc. If you're allocating a lot of small objects, this overhead can be excessive. With static allocation you can create an allocation pool and just maintain a simple bitmap to show which areas are in use. This assumes that each object is the same size, so you commonly create one pool per object type.
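Here is a minimal sketch of that pool-plus-bitmap scheme (in Go, purely for illustration; Pool and obj are made-up names, and a real implementation would hand out pointers and deal with concurrency):

package main

import "fmt"

// obj stands in for the fixed-size object type the pool serves.
type obj struct{ x, y, z float64 }

// Pool pre-allocates all storage once and tracks which slots are in use
// with a simple bitmap, so "allocating" is just finding a clear bit.
type Pool struct {
    slots []obj
    used  []uint64 // one bit per slot
}

func NewPool(n int) *Pool {
    return &Pool{slots: make([]obj, n), used: make([]uint64, (n+63)/64)}
}

// Get returns the index of a free slot, or -1 if the pool is exhausted.
func (p *Pool) Get() int {
    for w, bits := range p.used {
        if bits == ^uint64(0) {
            continue // this word of the bitmap is fully occupied
        }
        for b := 0; b < 64; b++ {
            i := w*64 + b
            if i >= len(p.slots) {
                return -1
            }
            if bits&(1<<uint(b)) == 0 {
                p.used[w] |= 1 << uint(b)
                return i
            }
        }
    }
    return -1
}

// Put marks a slot as free again; no memory is ever returned to the OS.
func (p *Pool) Put(i int) {
    p.used[i/64] &^= 1 << uint(i%64)
}

func main() {
    p := NewPool(1000)
    i := p.Get()
    p.slots[i] = obj{1, 2, 3}
    fmt.Println("got slot", i, p.slots[i])
    p.Put(i)
}

Since all the storage lives in one block allocated up front, there is no per-object heap bookkeeping and no fragmentation within the pool; the trade-off, as noted above, is one pool per object size.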
In short, there's really no such thing as static allocation other than the space allocated for your functions themselves and other read-only kinds of memory. (Do an assemble-only "gcc -S" and look for all the memory blocks, if you're interested.) If you're making and breaking objects, you're dynamically allocating. That being said, there's nothing to stop you from tightly controlling the allocation mechanism itself.
That's what functions like mallinfo() and mallopt() do for controlling how malloc() does its magic. However, that might not even be good enough for you. If you know all your chunks are going to be the same size, you can allocate and deallocate much more efficiently. And if you know you have 3 sizes of stuff, you can keep 3 arenas of memory each with their own allocator.
On top of this, there is the situation at runtime where the process doesn't have enough room and needs to ask the OS for more; that involves a system call, which is more expensive than just incrementing an array index. On Unix, it's usually brk() or sbrk() or the like, and that can take valuable time.
Another, rarer situation would be if you need to multiply-allocate things: say, 3 threads need to share information, and it only gets freed when all 3 release it. That's something nonstandard and not generally covered by typical mallopt() options, or even by pthread-specific memory or mutex/semaphore-locked chunks.
So if you have high speed optimization issues or you are running on an embedded system where you need to squeeze all you can out of the available memory, then "static allocation", or at least controlling the allocation mechanism, may be the way to go.