How to allocate a large array in the linux kernel? - memory

I'm writing a device mapper target.
I need to allocate a large array of ~30 byte records.
Right now I'm using flex_array for testing, but it seems to be limited to 4MB on 32-bit and 2MB on 64-bit. http://lxr.free-electrons.com/source/include/linux/flex_array.h
The size limit seems to be due to the code being limited to one level of indirection.
Is there something like flex_array that has multiple levels of indirection? I could probably hack something together but I don't want to reinvent the wheel if there is something already in.

Related

How do I increase maximum download size in an MVC application?

I am frankly stumped. This is beyond my experience.
I have a C# MVC program that generates a zip file in a MemoryStream for downloading. The action method is called by a button click to JavaScript.
The only problem is that in some cases the potential file size can easily exceed one Gig and from my reading, that is a common problem. I've tried upping the Maximum Allowed Content Length to 3000000000 in Request Filtering on IIS (IIS8). I've tried adding requestLimits maxAllowedContentLength to my web.config. I've even tried breaking up the zip through multiple calls to the action method (without success), although I have yet to get any confirmation/denial that this is even possible.
Is there any setting within IIS or my web.config that I could be overlooking? Could this be a company network issue, not solvable on an app developer's level?
Okay, so it's kind of hard to explain big concepts in 400 characters or less, so I think I'm just causing more confusion sticking in the comments section. Besides, I think we're close enough here to an "answer" as you're likely to get.
The default constructor of MemoryStream essentially sets the initial size to 0. In reality, the initial size is set to somewhere around 256, but since the initial size is mostly a guide, and it doesn't actually claim that space until its needed, it starts at 0.
Each time you write to the stream, it checks how much is being written versus the remaining size of the buffer array. If it can't fit the write, it creates a new, larger buffer array and copies the old buffer array into that. In this way, setting an initial size can help somewhat, in that you start off with a larger initial buffer array and you may not need to grow that buffer. You might have a better chance of getting a contiguous block of memory, which I'll explain the importance of in a bit, but that actually kind of works against you, as well. If you only need 1MB for the file, but you're initializing with 100MB and there's not 100MB of contiguous memory, you'll get an OutOfMemoryException, even though there might be 1MB of contiguous memory available.
Regardless of whether you initialize or not, there remains certain immutable facts. First, MemoryStream requires contiguous memory. Even if you technically have memory available on the system, it's possible you might not have large blocks of available memory. In other words if you have 4GB available, but it's all fragmented, even trying to create a 1GB stream in memory could fail, simply because it can't reserve 1GB of contiguous memory. Obviously, the larger the file you're tying to create in memory, the greater the chances that you're going to run into this issue. For this reason alone, I would say you're out of luck without raising the amount of system RAM. With 8GB and probably only 4-6GB actually available to IIS and then split up between worker processes and threads, the odds that you're going to be able to claim 25% or so of the available RAM as contiguous space, is highly unlikely.
The next immutable fact may or may not be relevant, but since you haven't specified, I'll mention it. If your web app is deployed as 32-bit, you'll have a hard limit of 2GB for any object, meaning a MemoryStream could never house more than 2GB (actually around 1.3-1.6GB as .NET code consumes some of that address space), and any attempt to make it do so will result in an OutOfMemoryException, even if you had some ridiculous amount of RAM on the system like 1TB+. If your app is 64-bit, this is less likely an issue as you can address a ton more memory, assuming it's compiled properly. You'd have to pretty much try to screw that up, though, so you should be fine.
Finally, multiple writes can cause an issue as well. As I said previously, the buffer array resizes (if necessary) in response to writes. Each time it resizes, the new buffer array must also be able to fit in contiguous address space. As a result, multiple resizes can cause you to bump into an OutOfMemoryException you wouldn't have hit if you had written all the data from the start. This is where initializing the MemoryStream can be helpful, but as I said before, it's also a double-edged sword, as your initial buffer size might be too great to begin with and you end up with an exception where you may have not had one letting it grow organically. Long and short, try to write everything to the stream in one go rather than piecemeal.

Dynamic Array Memory Allocation Strategies

I've written a 32bit program using a dynamic array to store a list of triangles with an unknown count. My current strategy is to estimate a very large number of triangles and then trim the list when all the triangles are created. In some cases I'll only allocate memory once in others I'll need to add to the allocation.
With a very large data set I'm running out of memory when my application is memory usage is about 1.2GB and since the allocation step is so large I feel like I may be fragmenting memory.
Looking at FastMM (memory manager) I see these constants which would suggest one of these as a good size to increment by.
ChunkSize = 64 * 1024;
MaximumSmallBlockSize = 32752;
LargeBlockGranularity = 64 * 1024;
Would one of these be an optimal size for increasing the size of an array?
Eventually this program will become 64bit but we're not quite ready for that step.
Your real problem here is not that you are running out of memory, but that the memory allocator cannot find a large enough block of contiguous address space. Some simple things you can do to help include:
Execute the code in a 64 bit process.
Add the LARGEADDRESSAWARE PE flag so that your process gets a 4GB address space rather than 2GB.
Beyond that the best you can do is allocate smaller blocks so that you avoid the requirement to store your large data structure in contiguous memory. Allocate memory in blocks. So, if you need 1GB of memory, allocate 64 blocks of size 16MB, for instance. The exact block size that you use can be tuned to your needs. Larger blocks result in better allocation performance, but smaller blocks allow you to use more address space.
Wrap this up in a container that presents an array like interface to the consumer, but internally stores the memory in non-contiguous blocks.
As far as I know, dynamic arrays in Delphi use contiguous address space (at least in the virtual memory address space.)
Since you are running out of memory at 1.2 gb, I guess that's the point where the memory manager can't find a block contiguous memory large enough to fit a larger array.
One way you can work around this limitation would be to implement your array as a collection of smaller array of (lets say) 200 mb in size. That should give you some more headroom before you hit the memory cap.
From the 1.2 gb value, I would guess your program isn't compiled to be "large address aware". You can see here how to compile your application like this.
One last trick would be to actually save the array data in a file. I use this trick for one of my application where I needed to load a few GB of images to be displayed in a grid. What I did was to create a file with the attribute FILE_ATTRIBUTE_TEMPORARY and FILE_FLAG_DELETE_ON_CLOSE and saved/loaded images from the resulting file. From CreateFile documentation:
A file is being used for temporary storage. File systems avoid writing
data back to mass storage if sufficient cache memory is available,
because an application deletes a temporary file after a handle is
closed. In that case, the system can entirely avoid writing the data.
Otherwise, the data is written after the handle is closed.
Since it makes use of cache memory, I believe it allows an application to use memory beyond the 32 bits limitation since the cache is managed by the OS and (as far as I know) not mapped inside the process' virtual memory space. After doing this change, performance were still pretty good. But I can't say if performances would still be good enough for your needs.

How to solve memory segmentation and force FastMM to release memory to OS?

Note: 32 bit application, which is not planned to be migrated to 64 bit.
I'm working with a very memory consuming application and have pretty much optimized all the relevant paths in respect to memory allocation/de-allocation. (there are no memory leaks, no handle leaks, no any other kind of leaks in the application itself AFAIK and tested. 3rd party libs which I cannot touch are of course candidates but unlikely in my scenario)
The application will frequently allocate large single and bi-dimensional dynamic arrays of single and packed records of up to 4 singles. By large I mean 5000x5000 of record(single,single,single,single) is normal. Also having even 6 or 7 such arrays in work at a given time. This is needed as there are a lot of cross-computations made on these arrays and having them read from disk would be a real performance killer.
Having this clarified, I am getting out of memory errors a lot because of these large dynamic arrays which will not go away after releasing them, no matter if I setlength them to 0 or finalize them. This is of course something FastMM is doing in order to be fast, I know that much.
I am tracking both FastMM allocated blocks and process consumed memory (RAM + PF) by using:
function CurrentProcessMemory(AWaitForConsistentRead:boolean): Cardinal;
var
MemCounters: TProcessMemoryCounters;
LastRead:Cardinal;
maxCnt:integer;
begin
result := 0;// stupid D2010 compiler warning
maxCnt := 0;
repeat
Inc(maxCnt);
// this is a stabilization loop;
// in tight loops, the system doesn't get
// much chance to release allocated resources, which in turn will get falsely
// reported by this function as still being used, resulting in a false-positive
// memory leak report in the application.
// so we do a tight loop here, waiting, until the application reported memory
// gets stable.
LastRead := result;
MemCounters.cb := SizeOf(MemCounters);
if GetProcessMemoryInfo(GetCurrentProcess,
#MemCounters,
SizeOf(MemCounters)) then
Result := MemCounters.WorkingSetSize + MemCounters.PagefileUsage
else
RaiseLastOSError;
if AWaitForConsistentRead and (LastRead <> 0) and (abs(LastRead - result)>1024) then
begin
sleep(60);
application.processmessages;
end;
until (not AWaitForConsistentRead) or (abs(LastRead - result)<1024) or (maxCnt>1000);
// 60 seconds wait is a bit too much
// so if the system is that "unstable", let's just forget it.
end;
function CurrentFastMMMemory:Cardinal;
var mem:TMemoryManagerUsageSummary;
begin
GetMemoryManagerUsageSummary(mem);
result := mem.AllocatedBytes + mem.OverheadBytes;
end;
I am running the code on a 64bit computer and my top memory consumption before crashes is about 3.3 - 3.4 GB. After that, I get memory/resources related crashes anywhere in the application. Took me some time to pin it down on the large dynamic arrays usage which were buried down in some 3rd party library.
The way I am getting over this is that I made the application resume itself from where it left off, by re-starting itself and closing with certain parameters.
This is all nice and dandy if memory consumption is fair and current operation finishes.
The big problem happens when the current memory usage is 1GB and the next operation to process requires 2.5 GB memory or more to be processed. My current code limited itself to an upper value of 1.5 GB used memory before resuming, but in this situation, I'd have to drop the limit down under 1 GB which would basically have the application resume itself after each operation and not even that guaranteeing that everything will be fine.
What if another operation will have a larger data set to process and it will require a total of 4GB or more memory?
To note that I am not talking about actual 4 GB in memory, but consumed memory by allocating huge dynamic arrays which the OS doesn't get back once de-allocated and hence it still sees it as consumed, so it adds up.
So, my next point of attack is to force fastmm to release all (or at least part of) memory to the OS. I'm specifically targeting the huge dynamic arrays here. Again, these are in a 3rd party library so re-coding that is not really in the top options. It's much easier and faster to tinker in the fastmm code and write a proc to release the memory.
I can't switch from FastMM as currently the entire application and some of the 3rd party libs are heavily coded around the use of PushAllocationGroup in order to quickly find and pinpoint any memory leaks. I know I can write a dummy FastMM unit to solve the compilation references, but I will be left without this quick and certain leak detection.
In conclusion: is there any way I can force FastMM to release at least some of it's large blocks to the OS? (well, sure there is, the actual question is: did anybody write it and if so, mind sharing?)
Thanks
later edit:
I will come up with a small relevant test application soon. It doesn't appear to be that easy to mock up one
I doubt that the issue is actually down to FastMM. For huge memory blocks, FastMM will not do any sub-allocation. Your allocation request will be handled with a straight VirtualAlloc. And then deallocation is VirtualFree.
That's assuming that you are allocating those 380MB objects in one contiguous block. I suspect that what you actually have are ragged 2D dynamic arrays. And they are not single allocations. a 5000x5000 ragged 2D dynamic arrays takes 5001 allocations to initialise. One for the row pointers, and 5000 for the rows. Those will be medium FastMM blocks. There will be sub-allocation.
I think you are asking too much. In my experience, any time you need over 3GB of memory in a 32 bit process, it's game over. Fragmentation of address space will stop you before you run out of memory. You cannot hope for this to work. Switch to 64 bit, or use a cleverer, less demanding allocation pattern. Or do you really need dense 2D arrays? Can you use sparse storage?
If you cannot alleviate your memory demands that way, you could use memory mapped files. This would allow you to make use of the extra memory that your 64 bit system has. The system's disk cache can be larger than 4GB and so your app can traverse more than 4GB of memory without actually needing to hit the disk.
You could certainly try different memory managers. I honestly do not hold out any hope that it would help. You could write a trivial replacement memory manager that used HeapAlloc. And enable the low fragmentation heap (enabled by default from Vista on). But I sincerely doubt that it will help. I'm afraid that there won't be a quick fix for you. To resolve this you face a more fundamental modification to your code.
Your issue as others have said is most likely attributable to memory fragmentation. You could test this by using VirtualQuery to create a picture of how memory is allocated to your application. You will very likely find that although you may have more than enough total memory for a new array, you don't have enough contiguous memory.
FastMem already does a lot to try and avoid problems due to memory fragmentation. "Small" allocations are done at the low end of the address space, whereas "large" allocations are done at the high end. This avoids a common problem where a series of large then small allocations followed by all large allocations being released results in a large amount of fragmented memory that is almost unusable. (Certainly unusable by anything slightly larger than the original large allocations.)
To see the benfits of FastMem's approach, imagine your memory layed out as follows:
Each digit represent a 100mb block.
[0123456789012345678901234567890123456789]
Small allocations represented by "s".
Large allocations repestented by capital letters.
[0sssss678901GGGGFFFFEEEEDDDDCCCCBBBBAAAA]
Now if you free all your large blocks, you should have no trouble performing similar large allocations later.
[0sssss6789012345678901234567890123456789]
The problem is that "large" and "small" are relative, and highly dependent on the nature of your application. FastMem defines a dividing line between "large" and "small". If you happen to have some small allocations that FastMem would classify as large, you may encounter the following problem.
[0sss4sGGGGsFFFFsEEEEsDDDDsCCCCsBBBBsAAAA]
Now if you free the large blocks you're left with:
[0sss4s6789s1234s6789s1234s6789s1234s6789]
And an attempt to allocate something larger than 400mb will fail.
Options
You may be able to tweak the FastMem settings so that all your "small" allocations are also considered small by FastMem. However, there are a few situations where this won't work:
Any DLLs you use that allocate memory to your application but bypass FastMem may still cause fragmentation.
If you don't release all your large blocks together, those that remain may induce fragmentation which will slowly get worse over time.
You could take on the task of memory management yourself.
Allocate one very large block e.g. 3.5GB which you keep for the entire lifetime of the application.
Instead of using dynamic arrays, you determine the pointer locations to use when setting up a new array.
Of course the simplest alternative would be to go 64-bit.
You could consider alternate data structures.
Do you really need array lookup capability? If not, another structure that allocates in smaller chunks may suffice.
Even if you do need array lookup, consider a paged array. Sparse arrays are a combination of arrays and linked lists. Data is stored on pages, with linked lists chaining each page.
A simple variant (since you mentioned your arrays are 2 dimensional) would be to leverage that: One dimension forms its own array providing a lookup into one of multiple arrays for the second dimension.
Related to the alternate data structures option, consider storing some data on disk. Yes performance will be slower. But if an efficient caching mechanism can be found, then maybe not so much. It would be better to be a little slower, but not crashing.
Dynamic arrays are reference counted in Delphi, so they should be automatic released when they are not used anymore.
Like strings, they are handled with COW (copy on write) when shared/stored in several variables/objects. So it seems you have some kind of memory/reference leak (e.g. an object in memory that holds still are reference to an array).
Just to be sure: you are not doing any kind of low level pointer tricks, aren't you?
So please yes, post a test program (or send the complete program private via email) so one of us can take a look at it.

Determining available video memory

When developing an OpenGL program, is there a way to poll from the system to find out just how many megabytes are available to store textures, etc?
Or is the standard approach these days just allocate memory and forget about everything?
Although the official stance remains "you don't need to know, you don't want to know, and it would not help you anyway", luckily at least two IHVs have shown a little more insight lately and offer extensions to query that information:
NVX_gpu_memory_info
ATI_meminfo
One nice thing about these extensions is that they have a least common denominator which is just what most people need, and you don't need to query extension support or do anything special, as they both work via glGetIntegerv.
In the easiest case, you can just initialize an array of 4 integers to zero (or some minimum default value that you'll assume in case the extensions don't work), then you call glGetIntegerv twice (with GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX and TEXTURE_FREE_MEMORY_ATI, respectively), and finally call glGetError to clear the error state. glGetIntegerv does not modify the pointed-to memory if it fails, nor does it crash or any other bad thing -- it merely sets the error state to GL_INVALID_ENUM.
Both extensions return a value in the first array position, the ATI one returns some values in the other 3 too.
n.b.: glAreTexturesResident has not been supported for almost a decade on mainstream hardware, in the same manner as texture priorities. The common mantra is that the driver writer knows much better than you anyway.
OpenGL doesn't give you this information. And frankly: There's only little benefit, simply because today we have multitasking operating systems. The OpenGL driver is responsible for swapping in texture data to/from system memory, if there's demand for it.
What OpenGL can do for you, is tell, if the textures you've uploaded are still resident in fast memory. The function is called "glAreTexturesResident". You can use this to gradually upload stuff to the GPU until you've filled up the GPU's memory. But keep in mind that you're not the only user of the GPU.

Data Storage with AWE Memory via collections / lists / other containers

Does anyone have any suggestions (product, toolsets, methods or other) for the storage and processing of custom data (delphi collections, binary trees, DIContainers etc) that DOES NOT restrict itself to a standard win32 memory address space? To put that in the extreme, is there anything off the shelf that can do the equivalent of holding a 10GB TList, thereby blowing the /3GB switch barrier and the 4GB 'windows on windows' limit?
What we ideally need is something that is pretty transparent to the Delphi application programmer, but allows very fast access to the data held in its structures, preferably via key lookup. The equivalent of a delphi colletion container would be fine, but its memory usage needs to be via AWE. It would also need to take care of mapping and unmapping the physical space it uses into the win32 process making use of it i.e. that would be the transaprent bit...
Moving the data into a database is not the answer - the information needs to remain memory resident for very fast access. The in-memory databases/tables that we've tried do not make use of AWE and also are slow at accessing. Our current Delphi data structures are fine, but straining the limits of win32 address space.
I'm going to be a complete dork, and tell you that I've made something even more advanced than what you're describing.... at work. So it's all closed source I'm afraid. Never saw anything like this anywhere. We combine VM, AWE, MMF and (soon) 32<>64 bit IPC into one big, mean data-processing machine, addressing up to 64 GB of memory, while processing hundreds of datasets, tens of GBs each.
But I can give you a few tips : AWE view-swapping is rather slow, because it forcibly pauses all running threads during the swap. Therefor, choose your window-sizes wisely (the smaller, the faster the swap - but call-overhead is lower with larger sizes ofcourse). We've settled with AWE view-sizes equal to the Windows default page-size (4 KB), but only because random-access performs best that way. Lineair data-access could run faster with bigger view-sizes.
Each view can map to any part of the allocated AWE memory, so one thing that can help is mapping only those pages into a view that need to be accessed - and try to save on unnessecary view-swaps (a priority-queue comes to mind).
Also, there should be a registration-mechanism somewhere in your design that handles the linkage between a view and the AWE memory behind this. And this better be thread-safe!
As for general usage : No, this doesn't fit in with regular Delphi classes. You should switch over to another concept altogether - and base your data-structures on that.
Anyway, good luck mate! You're going to need it... ;-)
There are system calls that can do this but it is not supported on all versions of Windows (in particular, Windows XP does not support AWE).
Transparency would be something of an issue as the API could not return pointers to objects. Mapping more than 4GB of RAM into a 4GB address space means that a 32 bit pointer could be ambiguous - you could potentially map different objects into the same location.
This ambiguity means that you would have to generate proxies for the objects which hold a handle that could be used to access the 'record'. Some SQL server versions use this technique to store disk buffers in AWE memory. An approach like this would probably work for something like rows in a matrix where the operations are done on the whole row. Finer grained access would be more fiddly.
In order to provide direct access to the mapped object you would have to implement a protocol where a temporary pointer to the mapped memory was made available. This would also require the object to be locked in memory while in use - again, bang goes your transparency.
Assuming you can get a 64 bit version of Delphi now you might be better off going to a 64 bit version of Windows for customers that need more RAM.
You state that you do not want to move to a database, but what about a database that specifically uses AWE?
I've not tried it personally, but would consider using products from this company for my own projects.
[Edit]: NexusDB is Delphi-friendly: it originated from the old Turbopower FlashFiler development (but has moved on a long way since then).
The issue with AWE it works very much alike the old, DOS-based EMS and XMS - if you ever used them. Basically, a range of addressable memory is reserved, and the memory outside the addressable range is then mapped to the addressable range when needed, and unmapped when no longer need, allowing other memory to be mapped at the same addresses. Thereby most non-AWE aware data structures or containers wouldn't work in such a scenario - probably a TMemoryStream descendant is easier to build. It should be easy enough to build a TList or the like that store data in AWE memory, it should keep track where the data are really stored and recall them when needed, adjusting addresses as well when data are mapped to addressable memory. I am not aware of any Delphi containers library using AWE, and there is another issue: desktop 32 bit operating systems can't use more than 4GB of physical RAM, a server version would be required, and the supported physical RAM depends on what version is used, see here for a complete list.
Assuming the data is loaded once in bulk and fits available memory, NexusDB AWE will be very very fast. The database can be created as an in-memory only DB and will then not need any further harddrive access while manipulating.
Sounds to me like you guys might consider dropping the current database SQL backend and going to a 100% NexusDB + AWE solution.
(Or rather, dropping the day to day access to the SQL backend, and having an export/sync function that can write out any required NexusDB reporting data to an MSSQL reporting db.)
W
Your situation sounds similar to ours, our application uses a huge datafile that we store in a memory-mapped file. The files are around 750MB, and we allocate data structures from them that use up to 1.5GB of RAM.
We have found no solution to the 4GB limit other than moving some of it off to FPC/Lazarus until Delphi is 64-bit, unfortunately. AWE does not work with Vista Home versions, also we couldn't get it to work with MMFs.
You could try memory-mapped files with a sliding window, meaning you dynamically create views of different chunks of the file depending on what part of it the application is using. Sounds like that won't work though because you need the entire file in memory at once.

Resources