MPI internal buffer memory issue

MPI internal buffer memory issue - memory

I am dealing with a Fortran code with MPI parallelization where I'm running out of memory for intensive runs. I'm careful to allocate nearly all of the memory that is required at the beginning of my simulation. Subroutine static memory allocation is typically small, but if I was to run out of memory due to these subroutines, it would occur early in the simulation because the memory allocation should not grow as time progresses. My issue is that far into the simulation, I am running into memory errors such as:
Insufficient memory to allocate Fortran RTL message buffer, message #174 = hex 000000ae.
The only thing that I can think of is that my MPI calls are using memory that I cannot preallocate at the beginning of the simulation. I'm using mostly MPI_Allreduce, MPI_Alltoall, and MPI_Alltoallv while the simulation is running and sometimes I am passing large amounts of data. Could the memory issues be a result of internal buffers created by MPI? How can I prevent a surprise memory issue like this? Can this internal buffer grow during the simulation?
I've looked at Valgrind and besides the annoying MPI warnings, I'm not seeing any other memory issues.

It's a bit hard to tell if MPI is at fault here without knowing more details. You can try massif (one of the valgrind tools) to find out where memory is being allocated.
Be sure you don't introduce any resource leaks: If you create new MPI resources (communicators, groups, requests etc.), make sure to release them properly.
In general, be aware of the buffer sizes required for all-to-all communication, especially at large scale. Use MPI_IN_PLACE if possible, or send data in small chunks rather than as single large blocks.

Related

Is there a delay in reporting the vmRSS in the status file?

I have a large Fortran code in which I have edited the main routine to run multiple times in a loop.
I see the program memory growing as the loop runs, and am trying to track down where the memory is being leaked. (I have used massif, but it didn't help)
I am now attempting to monitor the memory use with reads of the /proc/'pid'/status file and examining vmRSS and vmSIZE.
My question is, as the loop runs multiple times the memory used grows at different points within the loop - despite the fact that the loop does the same thing each time.
So is there a delay in the reporting of memory use in the status file, and if so how would I go about tracking down where the memory is being allocated in this way.

Note that vmRSS is not solely due to what your program is doing, it's affected by other things going on in your system as well. If there is memory pressure, the OS may decide to swap out less used memory pages, unmap mapped pages, etc., thus reducing the RSS.
Also, when your code allocates memory e.g. with the ALLOCATE statement, RSS doesn't increase, just the vmSIZE. RSS only comes into play when you actually start writing to the allocated memory.

How to solve memory segmentation and force FastMM to release memory to OS?

Note: 32 bit application, which is not planned to be migrated to 64 bit.
I'm working with a very memory consuming application and have pretty much optimized all the relevant paths in respect to memory allocation/de-allocation. (there are no memory leaks, no handle leaks, no any other kind of leaks in the application itself AFAIK and tested. 3rd party libs which I cannot touch are of course candidates but unlikely in my scenario)
The application will frequently allocate large single and bi-dimensional dynamic arrays of single and packed records of up to 4 singles. By large I mean 5000x5000 of record(single,single,single,single) is normal. Also having even 6 or 7 such arrays in work at a given time. This is needed as there are a lot of cross-computations made on these arrays and having them read from disk would be a real performance killer.
Having this clarified, I am getting out of memory errors a lot because of these large dynamic arrays which will not go away after releasing them, no matter if I setlength them to 0 or finalize them. This is of course something FastMM is doing in order to be fast, I know that much.
I am tracking both FastMM allocated blocks and process consumed memory (RAM + PF) by using:
function CurrentProcessMemory(AWaitForConsistentRead:boolean): Cardinal;
var
MemCounters: TProcessMemoryCounters;
LastRead:Cardinal;
maxCnt:integer;
begin
result := 0;// stupid D2010 compiler warning
maxCnt := 0;
repeat
Inc(maxCnt);
// this is a stabilization loop;
// in tight loops, the system doesn't get
// much chance to release allocated resources, which in turn will get falsely
// reported by this function as still being used, resulting in a false-positive
// memory leak report in the application.
// so we do a tight loop here, waiting, until the application reported memory
// gets stable.
LastRead := result;
MemCounters.cb := SizeOf(MemCounters);
if GetProcessMemoryInfo(GetCurrentProcess,
#MemCounters,
SizeOf(MemCounters)) then
Result := MemCounters.WorkingSetSize + MemCounters.PagefileUsage
else
RaiseLastOSError;
if AWaitForConsistentRead and (LastRead <> 0) and (abs(LastRead - result)>1024) then
begin
sleep(60);
application.processmessages;
end;
until (not AWaitForConsistentRead) or (abs(LastRead - result)<1024) or (maxCnt>1000);
// 60 seconds wait is a bit too much
// so if the system is that "unstable", let's just forget it.
end;
function CurrentFastMMMemory:Cardinal;
var mem:TMemoryManagerUsageSummary;
begin
GetMemoryManagerUsageSummary(mem);
result := mem.AllocatedBytes + mem.OverheadBytes;
end;
I am running the code on a 64bit computer and my top memory consumption before crashes is about 3.3 - 3.4 GB. After that, I get memory/resources related crashes anywhere in the application. Took me some time to pin it down on the large dynamic arrays usage which were buried down in some 3rd party library.
The way I am getting over this is that I made the application resume itself from where it left off, by re-starting itself and closing with certain parameters.
This is all nice and dandy if memory consumption is fair and current operation finishes.
The big problem happens when the current memory usage is 1GB and the next operation to process requires 2.5 GB memory or more to be processed. My current code limited itself to an upper value of 1.5 GB used memory before resuming, but in this situation, I'd have to drop the limit down under 1 GB which would basically have the application resume itself after each operation and not even that guaranteeing that everything will be fine.
What if another operation will have a larger data set to process and it will require a total of 4GB or more memory?
To note that I am not talking about actual 4 GB in memory, but consumed memory by allocating huge dynamic arrays which the OS doesn't get back once de-allocated and hence it still sees it as consumed, so it adds up.
So, my next point of attack is to force fastmm to release all (or at least part of) memory to the OS. I'm specifically targeting the huge dynamic arrays here. Again, these are in a 3rd party library so re-coding that is not really in the top options. It's much easier and faster to tinker in the fastmm code and write a proc to release the memory.
I can't switch from FastMM as currently the entire application and some of the 3rd party libs are heavily coded around the use of PushAllocationGroup in order to quickly find and pinpoint any memory leaks. I know I can write a dummy FastMM unit to solve the compilation references, but I will be left without this quick and certain leak detection.
In conclusion: is there any way I can force FastMM to release at least some of it's large blocks to the OS? (well, sure there is, the actual question is: did anybody write it and if so, mind sharing?)
Thanks
later edit:
I will come up with a small relevant test application soon. It doesn't appear to be that easy to mock up one

I doubt that the issue is actually down to FastMM. For huge memory blocks, FastMM will not do any sub-allocation. Your allocation request will be handled with a straight VirtualAlloc. And then deallocation is VirtualFree.
That's assuming that you are allocating those 380MB objects in one contiguous block. I suspect that what you actually have are ragged 2D dynamic arrays. And they are not single allocations. a 5000x5000 ragged 2D dynamic arrays takes 5001 allocations to initialise. One for the row pointers, and 5000 for the rows. Those will be medium FastMM blocks. There will be sub-allocation.
I think you are asking too much. In my experience, any time you need over 3GB of memory in a 32 bit process, it's game over. Fragmentation of address space will stop you before you run out of memory. You cannot hope for this to work. Switch to 64 bit, or use a cleverer, less demanding allocation pattern. Or do you really need dense 2D arrays? Can you use sparse storage?
If you cannot alleviate your memory demands that way, you could use memory mapped files. This would allow you to make use of the extra memory that your 64 bit system has. The system's disk cache can be larger than 4GB and so your app can traverse more than 4GB of memory without actually needing to hit the disk.
You could certainly try different memory managers. I honestly do not hold out any hope that it would help. You could write a trivial replacement memory manager that used HeapAlloc. And enable the low fragmentation heap (enabled by default from Vista on). But I sincerely doubt that it will help. I'm afraid that there won't be a quick fix for you. To resolve this you face a more fundamental modification to your code.

Your issue as others have said is most likely attributable to memory fragmentation. You could test this by using VirtualQuery to create a picture of how memory is allocated to your application. You will very likely find that although you may have more than enough total memory for a new array, you don't have enough contiguous memory.
FastMem already does a lot to try and avoid problems due to memory fragmentation. "Small" allocations are done at the low end of the address space, whereas "large" allocations are done at the high end. This avoids a common problem where a series of large then small allocations followed by all large allocations being released results in a large amount of fragmented memory that is almost unusable. (Certainly unusable by anything slightly larger than the original large allocations.)
To see the benfits of FastMem's approach, imagine your memory layed out as follows:
Each digit represent a 100mb block.
[0123456789012345678901234567890123456789]
Small allocations represented by "s".
Large allocations repestented by capital letters.
[0sssss678901GGGGFFFFEEEEDDDDCCCCBBBBAAAA]
Now if you free all your large blocks, you should have no trouble performing similar large allocations later.
[0sssss6789012345678901234567890123456789]
The problem is that "large" and "small" are relative, and highly dependent on the nature of your application. FastMem defines a dividing line between "large" and "small". If you happen to have some small allocations that FastMem would classify as large, you may encounter the following problem.
[0sss4sGGGGsFFFFsEEEEsDDDDsCCCCsBBBBsAAAA]
Now if you free the large blocks you're left with:
[0sss4s6789s1234s6789s1234s6789s1234s6789]
And an attempt to allocate something larger than 400mb will fail.
Options
You may be able to tweak the FastMem settings so that all your "small" allocations are also considered small by FastMem. However, there are a few situations where this won't work:
Any DLLs you use that allocate memory to your application but bypass FastMem may still cause fragmentation.
If you don't release all your large blocks together, those that remain may induce fragmentation which will slowly get worse over time.
You could take on the task of memory management yourself.
Allocate one very large block e.g. 3.5GB which you keep for the entire lifetime of the application.
Instead of using dynamic arrays, you determine the pointer locations to use when setting up a new array.
Of course the simplest alternative would be to go 64-bit.
You could consider alternate data structures.
Do you really need array lookup capability? If not, another structure that allocates in smaller chunks may suffice.
Even if you do need array lookup, consider a paged array. Sparse arrays are a combination of arrays and linked lists. Data is stored on pages, with linked lists chaining each page.
A simple variant (since you mentioned your arrays are 2 dimensional) would be to leverage that: One dimension forms its own array providing a lookup into one of multiple arrays for the second dimension.
Related to the alternate data structures option, consider storing some data on disk. Yes performance will be slower. But if an efficient caching mechanism can be found, then maybe not so much. It would be better to be a little slower, but not crashing.

Dynamic arrays are reference counted in Delphi, so they should be automatic released when they are not used anymore.
Like strings, they are handled with COW (copy on write) when shared/stored in several variables/objects. So it seems you have some kind of memory/reference leak (e.g. an object in memory that holds still are reference to an array).
Just to be sure: you are not doing any kind of low level pointer tricks, aren't you?
So please yes, post a test program (or send the complete program private via email) so one of us can take a look at it.

How to analyze excessive memory consumption (PageFileUsage) in a Delphi Application?

This is a follow-up to this question: What could explain the difference in memory usage reported by FastMM or GetProcessMemoryInfo?
My Delphi XE application is using a very large amount of memory which sometimes lead to an out of memory exception. I'm trying to understand why and what is causing this memory usage and while FastMM is reporting low memory usage, when requesting for TProcessMemoryCounters.PageFileUsage I can clearly see that a lot of memory is used by the application.
I would like to understand what is causing this problem and would like some advise on how to handle it:
Is there a way to know what is contained in that memory and where it has been allocated ?
Is there some tool to track down memory usage by line/procedure in a Delphi application ?
Any general advise on how to handle such a problem ?
EDIT 1 : Here are two screenshots of FastMMUsageTracker indicating that memory has been allocate by the system.
Before process starts:
After process ends:
Legend: Light red is FastMM allocated and dark gray is system allocated.
I'd like to understand what is causing the system to use that much memory. Probably by understanding what is contained in that memory or what line of code or procedure did cause that allocation.
EDIT 2 : I'd rather not use the full version of AQTime for multiple reasons:
I'm using multiple virtual machines for development and their licensing system is a PITA (I'm already a registered user of TestComplete)
LITE version doesn't provide enough information and I won't waste money without making certain the FULL version will give me valuable information
Any other suggestions ?

Another problem might be heap fragmentation. This means you have enough memory free, but all the free blocks are to small. You might see it visually by using the source version of FastMM and use the FastMMUsageTracker.pas as suggested here.

You need a profiler, but even that won't be enough in lots of places and cases. Also, in your case, you would need the full featured AQTime, not the lite version that comes with Delphi XE and XE2. (AQTIME is extremely expensive, and annoyingly node-locked, so don't think I'm a shill for SmartBear software.)
The thing is that people often mistake AQTime Allocation Profiler as only a way to find leaks. It can also tell you where your memory goes, at least within the limits of the tool. While running, and consuming lots of memory, I click Run -> Get Results.
Here is one of my applications being profile in AQTime with its Allocation Profiler showing exactly what class is allocating how many instances on the heap and how much memory those use. Since you report low Delphi heap usage with FastMM, that tells me that most of AQTime's ability to analyze by delphi class name will also be useless to you. However by using AQTime's events and triggers, you might be able to figure out what areas of your application are causing you a "memory usage expense" and when those occur, what the expense is. AQTime's real-time instrumentation may be sufficient to help you narrow down the cause even though it might not find for you what function call is causing the most memory usage automatically.
The column names include "Object Name" which includes things like this:
* All delphi classes, and their instance count and heap usage.
* Virtual Memory blocks allocated via Win32 calls.
It can detect Delphi and C/C++ library allocations on the heap, and can see certain Windows-API level memory allocations.
Note the live count of objects, the amount of memory from the heap that is used.
I usually try to figure out the memory cost of a particular operation by measuring heap memory use before, and just after, some expensive operation, but before the cleanup (freeing) of the memory from that expensive operation. I can set event points inside AQTime and when a particular method gets hit or a flag gets turned on by me, I can measure before, and after values, and then compare them.
FastMM alone can not even detect a non-delphi allocation or an allocation from a heap that is not being managed by FastMM. AQTime is not limited in that way.

How do programs allocate large amounts of memory?

I have 3 questions concerning memory allocation that I thought better to put into one question than 3.
When memory is allocated as I understand, it is allocated on the heap, which is just 16mb. How hen do programs such as video games or modern browsers manage to use over 1GB?
Since it is obviously possible for this much memory to be used, why can it not be allocated at the start? I have found the most I can allocate in High Level Assembly language is around 100MB. This is a lot more than 16MB, and far less than I have 3, so where does this limitation come from?
Why allocate memory in the first place, rather than allocating variables and letting the compiler/system handle it?

When memory is allocated as I understand, it is allocated on the heap,
which is just 16mb. How hen do programs such as video games or modern
browsers manage to use over 1GB?
The heap can grow. It isn't limited to any value and certainly not 16MB. You can easily allocate 1GB of heap, just make a program test and you'll see.
Since it is obviously possible for this much memory to be used, why
can it not be allocated at the start? I have found the most I can
allocate in High Level Assembly language is around 100MB. This is a
lot more than 16MB, and far less than I have 3, so where does this
limitation come from?
I'm not sure why your OS isn't filling larger allocation requests. Perhaps due to memory fragmentation? It's going to be a problem specific to your setup, which you didn't share. I can allocation much more memory than that without an issue.
You can try to use the mmap system call if malloc (which uses the brk system call) is having some sort of issue. Note that for GNU libc, malloc actually uses mmap instead of brk when the allocation is large enough (over 128k I think).
Why allocate memory in the first place, rather than allocating
variables and letting the compiler/system handle it?
Variable must live in memory somewhere. What you are saying is "why manually manage memory? Why can't some algorithm do that for me?". It is actually very common for the compiler and a runtime component to handle allocation/freeing - it's called garbage collection.

How to use AQTime's memory allocation profiler in a program that uses a large amount of memory?

I'm finding AQTime hard to use because it interferes with the original program too much. If I have a program that uses, for example, 300MB of ram I can use AQTime's allocation profiler without a problem, and find out where most of the memory is being used. However I notice that running under AQTime, the original program uses more like 1GB while it's being profiled.
Right now I'm trying to reduce memory usage in a program which is using 1.4GB of memory. If I run it under AQTime, then the original program uses all of the 2GB address space and crashes. I can of course invent a smaller set of test data and estimate how the memory usage will scale with the full data set - but the reason I'm using a profiler in the first place is to try to avoid this sort of guesswork.
I already have AQTime set to 'Collect stack information - None' and all the check boxes to do with checking memory integrity are switched off, and I've tried restricting the area being profiled to just a few classes but this doesn't seem to improve anything. Is there a way to use AQTime that produces a smaller overhead? Or failing that, what other approaches are there to get a good idea of the memory being used?
The app is written in Delphi 2010 and I'm using AQTime 6.
NB: On top of the increased memory usage, running under AQTime slows the app down an awful lot, making the whole exercise not just impossible but impractical too :-P

AFAIK the allocation profiler will track memory block allocation regardless of profiling areas. Profiling areas are used to track classes instantiation. Of course memory-profiling an application that allocates a large amount of memory is a issue, you may try to use the LARGE_ADRESS_AWARE flag, and the /3GB boot switch, or use a 64 bit system (as long as you have at least 4GB of memory, or more). Also you can take snapshot of the application state before it crashes, to see where the memory is allocated. Profiling takes time, anyway, you may have to let it run for a while.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart