I get an "EOutofresources - Not enough storage" when I try to work with BMP files when I try to set BMP.Height or BMP.Width. Imediatelly after these instructions, the stack trace is (in this order):
ntdll.dll.RtlLeaveCriticalSection, kernel32.dll.FileTimeToDosDateTime, GDI32.dll.GdiReleaseDC, GDI32.dll.PatBlt, kernel32.dll.ReadFile or like this:
|7E429130|user32.dll GetParent
|7C90FF2D|ntdll.dll RtlGetNtGlobalFlags
|77F15A00|GDI32.dll GdiReleaseDC
|7C83069E|kernel32.dll FileTimeToDosDateTime
|7C9010E0|ntdll.dll RtlLeaveCriticalSection
| |my function (where I set BMP.Height or BMP.Width)
At one point I was sure it had something to do with memory fragmentation - the system had enough free RAM to process my image, BUT the memory was fragmented so there was no block large enough to hold it. But then I saw it happen once 11 seconds after Windows start-up. My program had cycled through the image-processing loop ONLY once! So this could not be related to RAM fragmentation.
A different situation (but still related to drawing) where I got this error is below:
|77F16A7E|GDI32.dll IntersectClipRect
|77F16FE5|GDI32.dll BitBlt
|7E429011|user32.dll OffsetRect
|7E42A97D|user32.dll CallWindowProcA
|7E42A993|user32.dll CallWindowProcA
|7C9010E0|ntdll.dll RtlLeaveCriticalSection
|7E4196C2|user32.dll DispatchMessageA
|7E4196B8|user32.dll DispatchMessageA
|0058A2E1|UTest.exe UTest.dpr
|7C90DCB8|ntdll.dll ZwSetInformationThread
I think there is always a 'RtlLeaveCriticalSection' call in the stack trace after BMP.Height.
There is this post pointing to a possible solution that involves editing a Windows registry key. However, the post says it applies only to Win XP, while my error also appears on Win 7.
I see many similar posts (some of them closely connected to saving a file to disk), but so far nobody has come back to report that they fixed the error.
Update:
As you requested, this is the code where the error appears:
procedure TMyBitmap.SetLargeSize(iWidth, iHeight: Integer);
CONST ctBytesPerPixel = 3;
begin
 { Protect against huge/empty images }
 if iWidth < 1 then iWidth := 1 else
 if iWidth > 32768 then iWidth := 32768;
 if iHeight < 1 then iHeight := 1 else
 if iHeight > 32768 then iHeight := 32768;

 { Set image type }
 if iWidth * iHeight * ctBytesPerPixel > 9000000 {~9MB}
 then HandleType := bmDIB  { Pros and cons: -no hardware acceleration, +supports larger images }
 else HandleType := bmDDB;

 { Total size higher than 1GB? }
 if (iWidth * iHeight * ctBytesPerPixel) > 1*GB then
 begin
  Width := 8000;  { Set a smaller size }
  Height := 8000; { And raise an error }
  RAISE Exception.Create('Image is too large.');
 end;

 { Set size }
 Width := iWidth; <----------------- HERE
 Height := iHeight;
end;
From my experiments, the maximum bitmap size depends on:
The OS version (e.g. XP seems to allow smaller bitmap resources than Seven);
The OS platform (a 64-bit OS allows bigger resource allocation than a 32-bit OS);
The RAM currently installed (and free);
The number of bitmaps already allocated (since those are shared resources).
So you cannot be sure that a bitmap allocation will succeed when you start working with huge data (more than an on-screen bitmap resolution).
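Because of that, a defensive size check before touching Width/Height is worth doing. A minimal sketch of the arithmetic (in C here just to illustrate; the function name is my own, and the 32-bit cut-off is an assumption about 32-bit GDI):

```c
#include <stdint.h>

/* Rough size of the raw pixel buffer a w x h bitmap needs, at the
   given bytes-per-pixel. Returns 0 on 32-bit overflow so the caller
   can refuse the allocation instead of hitting EOutOfResources. */
static uint64_t bitmap_bytes(uint32_t w, uint32_t h, uint32_t bpp)
{
    uint64_t total = (uint64_t)w * h * bpp;
    return (total > UINT32_MAX) ? 0 : total;
}
```

The same check is what the `SetLargeSize` code above does with its `1*GB` guard, just with overflow made explicit.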
Here are some potential solutions (I've used some of them):
Allocate not a bitmap resource but a plain memory block to work with, then use the Win32 BitBlt API directly to draw it - but you'll have to write some dedicated processing functions (or use a third-party library), and on a 32-bit OS, IMHO the VirtualAlloc API (the one called by FastMM4 for big blocks of memory) won't be able to allocate more than 1 GB of contiguous memory;
An enhancement of the previous option: either use a 64-bit process to handle the huge RAM block (welcome, XE2 compiler), or use a file for temporary storage, then memory-map its content for processing (this is how Photoshop and others handle huge memory) - if you have enough RAM, using a temporary file won't necessarily be slower (no data will be written to disk);
Tile your big pictures into smaller ones - the JPEG library is able to render only a portion of the picture, and a tile will fit easily into a bitmap resource;
In all cases, avoid duplicated bitmap resources (since all bitmap resources are shared): for instance, if you are reading from a bitmap, copy its content into a temporary memory block or file, release its resource, then allocate your destination bitmap;
About performance: make it right, then make it fast - do not include "tricks" too early in your implementation (your customer may accept waiting a few seconds, but won't accept a global failure).
There is no perfect solution (my preference is to use smaller pictures, because processing then becomes easily multi-threaded, so it may speed up a lot on newer CPUs), and be aware that a resource allocation may work on your brand-new 64-bit Windows Seven PC but fail on your customer's 32-bit XP computer.
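For the tiling option, the bookkeeping is trivial - a sketch of the tile-grid arithmetic (plain C for illustration; the names are made up):

```c
/* How many tiles of tile_w x tile_h cover an image of img_w x img_h
   (the last row/column of tiles may be partial). */
static int tiles_needed(int img_w, int img_h, int tile_w, int tile_h)
{
    int across = (img_w + tile_w - 1) / tile_w;  /* ceiling division */
    int down   = (img_h + tile_h - 1) / tile_h;
    return across * down;
}
```

Each tile then stays comfortably inside the size a bitmap resource can hold, and tiles can be processed on separate threads.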
Related
I want to create a large bitmap with this code:
LargeBmp := TBitmap.Create;
try
LargeBmp.Width := 1000; // fine
LargeBmp.Height := 15000; // XP - EOutOfResources, Not enough memory, Win 7 - works fine
...
finally
FreeAndNil(LargeBmp);
end;
This code raises an EOutOfResources exception with message "Not enough memory" on Windows XP but works fine in Windows 7.
What is wrong? Why Not enough memory? It's only 60 MB.
Set the pixel format like this:
LargeBmp.PixelFormat := pf24Bit;
I had the same problem several times and that always solved it.
As was discussed already, if you don't set the pixel format Windows will see it as a device-dependent bitmap. When you set the pixel format you create a DIB (device-independent bitmap). This means it's independent of the display device (graphics card).
I've had this same problem, and wanted to point out that pf24bit is not the only option. In the Graphics unit, you also have:
TPixelFormat = (pfDevice, pf1bit, pf4bit, pf8bit, pf15bit, pf16bit, pf24bit, pf32bit, pfCustom);
For a project I'm working on, I found the 8 bit option worked the best for what I needed, since I had a very large bitmap (high resolution), but limited colors (I was creating the whole bitmap from some simple code).
So try out a few others, other than just pf24bit, to find what's optimal for your production environment. I saved quite a bit of internal memory with the pf8bit option.
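The saving is easy to quantify. One detail to keep in mind when comparing formats: DIB scanlines are padded to a 4-byte (DWORD) boundary, so the real footprint is stride × height, not simply width × bytes-per-pixel × height. A small sketch (C for illustration; `bitmap_mem` is my own name):

```c
/* Approximate pixel-buffer size for a w x h bitmap at the given
   bits per pixel, with each scanline padded to 4 bytes as in a DIB. */
static long long bitmap_mem(int w, int h, int bpp_bits)
{
    long long stride = ((long long)w * bpp_bits + 31) / 32 * 4;
    return stride * h;
}
```

For a 10000×10000 image, pf8bit needs a third of the memory pf24bit does, which matches the saving described above.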
A bitmap created this way is (by default) stored in a buffer whose size depends on the video driver, the OS and God knows what else.
This buffer can be pretty small (about 20-25 MB), and if you try to create more, it will fail.
To avoid this, try to create a DIB instead of a plain TBitmap, or change PixelFormat to pf24bit. This tells the system to create the bitmap in the user's memory instead of the GDI buffer.
Now, why doesn't it fail on Win 7, you ask? Probably because Win 7 uses GDI+ and Direct2D instead of plain GDI. Maybe another driver version, dunno.
I want to open some relative big files (jpg, gif, bmp) using as little RAM as possible.
Inside my program I need all open files converted to BMP so I can process them. However, the conversion from JPG to BMP takes 27.1 MB of RAM if I use the classic conversion code:
procedure ConvertJPG2BMP(FullFileName: string; BMP: TBitmap);
VAR JPG: TJpegImage;
begin
 JPG := TJpegImage.Create;
 TRY
  TRY
   JPG.LoadFromFile(FullFileName);
   BMP.Assign(JPG);
  EXCEPT
   { errors silently ignored }
  END;
 FINALLY
  FreeAndNil(JPG);
 end;
end;
because it uses two images at once (a JPEG that is then transferred to a bitmap).
--
However, if I use a TPicture to load the file, I use only 7.1 MB of RAM. But in this case TPicture.Bitmap is empty, and I need a valid TBitmap object.
Is there any way to load images from disk while keeping the mem footprint small?
--
(Test file: 1.JPG 2.74MB 3264x1840 pix)
A back-of-the-envelope calculation gives 6 megapixels. Assuming 32-bit colour, that takes you to 24 MB.
You aren't going to do any better than your current code.
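The arithmetic, spelled out (C, just for illustration):

```c
/* The test image from the question: 3264 x 1840 pixels. */
static long long image_bytes(int bytes_per_pixel)
{
    return 3264LL * 1840 * bytes_per_pixel;
}
```

At 4 bytes per pixel this is about 24 MB, and even a 24-bit bitmap still needs about 18 MB, so the observed 27.1 MB for a JPEG plus its decoded bitmap is in the expected range.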
The memory usage does not come from the JPEG library, but from the way you use it.
If you convert a JPEG into a TBitmap, it will create a bitmap resource, then uncompress the JPEG into the bitmap memory buffer.
You can paint directly from the JPEG content into the screen. Depending on the JPEG implementation, it will use (or not) a temporary TBitmap.
You are not tied to the JPEG unit supplied by Borland.
For instance, you may try calling directly the StretchDIBits() windows API from the uncompressed memory buffer, as such (this code is extracted from our SSE JPEG decoder):
procedure TJpegDecode.DrawTo(Canvas: TCanvas; X, Y: integer);
var BMI: TBitmapInfo;
begin
 if Self = nil then
  exit;
 ToBMI(BMI);
 StretchDIBits(Canvas.Handle, X, Y, width, height, 0, 0, width, height, pRGB,
  BMI, DIB_RGB_COLORS, SrcCopy);
end;
Creating a huge bitmap is sometimes not possible (at least under Windows XP), because it uses shared GDI resources, whereas using plain RAM and StretchDIBits will always work, even for huge content. You can create a memory-mapped file to handle the binary content, but just allocating the memory at once should suffice (Windows will use the hard drive only if short of RAM). With today's PCs, you should have enough RAM available even for big pictures (17 MB is not a big deal, even for your 3264x1840 pix).
Then, from this global uncompressed memory buffer containing raw pixel triplets, you can use a smaller bitmap corresponding to a region of the picture, then work on the region using StretchDIBits(aBitmap.Handle, ...). It will use fewer GDI resources.
You could also rely on GDI+ drawing, which will draw it without any temporary bitmap. See e.g. this open-source unit. From our testing it's very fast and can be used without any TBitmap. You could also ask for only a region of the whole picture and draw it using GDI+ on your bitmap canvas: this will use less RAM. Your exe will also be a bit smaller than with the default JPEG unit, and you'll be able to display and save not only JPEG but also GIF and TIFF formats.
If you want to minimize memory usage even further, you'll have to call the JPEG library directly, at the lowest level. It is able to uncompress only a region of the JPEG, so you would be able to minimize the RAM used. You may try the IJL library with Delphi - a bit old, but still working.
I will have to create a multi-threaded project soon. I have seen experiments ( delphitools.info/2011/10/13/memory-manager-investigations ) showing that the default Delphi memory manager has problems with multi-threading.
So I have found this SynScaleMM. Can anybody give some feedback on it, or on a similar memory manager?
Thanks
Our SynScaleMM is still experimental.
EDIT: Take a look at the more stable ScaleMM2 and the brand new SAPMM. But my remarks below are still worth following: the less allocation you do, the better you scale!
But it worked as expected in a multi-threaded server environment, and scaling is much better than with FastMM4 in some critical tests.
However, the memory manager is perhaps not the biggest bottleneck in multi-threaded applications. FastMM4 can work well, if you don't stress it.
Here is some advice (not dogmatic, just from experiment and knowledge of the low-level Delphi RTL) if you want to write a FAST multi-threaded application in Delphi:
Always use const for string or dynamic array parameters, as in MyFunc(const aString: String), to avoid allocating a temporary string for each call;
Avoid string concatenation (s := s + 'Blabla' + IntToStr(i)); rely instead on buffered writing such as TStringBuilder, available in recent versions of Delphi;
TStringBuilder is not perfect either: for instance, it will create a lot of temporary strings when appending numerical data, and will use the awfully slow SysUtils.IntToStr() function when you add an integer value - I had to rewrite a lot of low-level functions to avoid most string allocations in our TTextWriter class, as defined in SynCommons.pas;
Don't abuse critical sections: keep them as small as possible, and rely on atomic modifiers if you need concurrent access - see e.g. InterlockedIncrement / InterlockedExchangeAdd;
InterlockedExchange (from SysUtils.pas) is a good way of updating a buffer or a shared object: you create an updated version of some content in your thread, then exchange a shared pointer to the data (e.g. a TObject instance) in one low-level CPU operation. This notifies the change to the other threads, with very good multi-thread scaling. You'll have to take care of data integrity, but it works very well in practice;
Don't share data between threads; rather make your own private copy, or rely on read-only buffers (the RCU pattern is the best for scaling);
Don't use indexed access to string characters; rely on optimized functions like PosEx() instead;
Don't mix AnsiString/UnicodeString variables/functions, and check the generated asm code via Alt-F2 to track any hidden unwanted conversion (e.g. call UStrFromPCharLen);
Prefer var parameters in a procedure over a function returning a string (a function returning a string adds a UStrAsg/LStrAsg call, which has a LOCK prefix that will flush all CPU cores);
If you can, for your data or text parsing, use pointers and some static stack-allocated buffers instead of temporary strings or dynamic arrays;
Don't create a TMemoryStream each time you need one; rely on a private instance in your class, already sized with enough memory, into which you write data, using Position to track the end of the data rather than changing its Size (which is the memory block allocated by the MM);
Limit the number of class instances you create: try to reuse the same instance, and if you can, use some record/object pointers on already allocated memory buffers, mapping the data without copying it into temporary memory;
Always use test-driven development, with dedicated multi-threaded tests, trying to reach the worst-case limits (increase the number of threads and the data size, add some incoherent data, pause at random, stress network or disk access, benchmark with timing on real data...);
Never trust your instinct, but use accurate timing on real data and process.
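The InterlockedExchange pattern from the list above can be sketched language-neutrally with C11 atomics (the `Config` type and `publish` name are mine; `atomic_exchange` plays the role of InterlockedExchange):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct { int value; } Config;

/* The shared pointer all threads read from. */
static _Atomic(Config *) g_config = NULL;

/* Publish a privately-built version by swapping the shared pointer in
   one atomic operation; returns the previous version so the writer can
   dispose of it once no reader can still be using it. */
static Config *publish(Config *fresh)
{
    return atomic_exchange(&g_config, fresh);
}
```

Readers simply `atomic_load(&g_config)` and work on a consistent snapshot; no critical section is ever held.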
I tried to follow those rules in our Open Source framework, and if you take a look at our code, you'll find out a lot of real-world sample code.
If your app can accommodate GPL licensed code, then I'd recommend Hoard. You'll have to write your own wrapper to it but that is very easy. In my tests, I found nothing that matched this code. If your code cannot accommodate the GPL then you can obtain a commercial licence of Hoard, for a significant fee.
Even if you can't use Hoard in an external release of your code you could compare its performance with that of FastMM to determine whether or not your app has problems with heap allocation scalability.
I have also found that the memory allocators in the versions of msvcrt.dll distributed with Windows Vista and later scale quite well under thread contention, certainly much better than FastMM does. I use these routines via the following Delphi MM.
unit msvcrtMM;

interface

implementation

type
  size_t = Cardinal;

const
  msvcrtDLL = 'msvcrt.dll';

function malloc(Size: size_t): Pointer; cdecl; external msvcrtDLL;
function realloc(P: Pointer; Size: size_t): Pointer; cdecl; external msvcrtDLL;
procedure free(P: Pointer); cdecl; external msvcrtDLL;

function GetMem(Size: Integer): Pointer;
begin
  Result := malloc(Size);
end;

function FreeMem(P: Pointer): Integer;
begin
  free(P);
  Result := 0;
end;

function ReallocMem(P: Pointer; Size: Integer): Pointer;
begin
  Result := realloc(P, Size);
end;

function AllocMem(Size: Cardinal): Pointer;
begin
  Result := GetMem(Size);
  if Assigned(Result) then begin
    FillChar(Result^, Size, 0);
  end;
end;

function RegisterUnregisterExpectedMemoryLeak(P: Pointer): Boolean;
begin
  Result := False;
end;

const
  MemoryManager: TMemoryManagerEx = (
    GetMem: GetMem;
    FreeMem: FreeMem;
    ReallocMem: ReallocMem;
    AllocMem: AllocMem;
    RegisterExpectedMemoryLeak: RegisterUnregisterExpectedMemoryLeak;
    UnregisterExpectedMemoryLeak: RegisterUnregisterExpectedMemoryLeak
  );

initialization
  SetMemoryManager(MemoryManager);
end.
It is worth pointing out that your app has to be hammering the heap allocator quite hard before thread contention in FastMM becomes a hindrance to performance. Typically in my experience this happens when your app does a lot of string processing.
My main piece of advice for anyone suffering from thread contention on heap allocation is to re-work the code to avoid hitting the heap. Not only do you avoid the contention, but you also avoid the expense of heap allocation – a classic twofer!
It is locking that makes the difference!
There are two issues to be aware of:
Use of the LOCK prefix by Delphi itself (System.dcu);
How FastMM4 handles thread contention, and what it does after failing to acquire a lock.
Use of the LOCK prefix by Delphi itself
Borland Delphi 5, released in 1999, was the version that introduced the LOCK prefix in string operations. As you know, when you assign one string to another, it does not copy the whole string but merely increments the reference counter inside the string. If you modify the string, it is de-referenced: the reference counter is decremented and separate space is allocated for the modified string.
In Delphi 4 and earlier, the operations that increase and decrease the reference counter were normal memory operations. Programmers who used Delphi knew about this and, if they were using strings across threads (i.e. passing a string from one thread to another), used their own locking mechanism, only for the relevant strings. Programmers also used read-only string copies that did not modify the source string in any way and did not require locking, for example:
function AssignStringThreadSafe(const Src: string): string;
var
L: Integer;
begin
L := Length(Src);
if L <= 0 then Result := '' else
begin
SetString(Result, nil, L);
Move(PChar(Src)^, PChar(Result)^, L*SizeOf(Src[1]));
end;
end;
But in Delphi 5, Borland added the LOCK prefix to the string operations, and they became very slow compared to Delphi 4, even for single-threaded applications.
To overcome this slowness, programmers began to use "single-threaded" SYSTEM.PAS patch files with the LOCK prefixes commented out.
Please see https://synopse.info/forum/viewtopic.php?id=57&p=1 for more information.
FastMM4 Thread Contention
You can modify FastMM4 source code for a better locking mechanism, or use any existing FastMM4 fork, for example https://github.com/maximmasiutin/FastMM4
FastMM4 is not the fastest for multicore operation, especially when the number of threads is greater than the number of physical sockets, because, by default, on thread contention (i.e. when one thread cannot acquire access to data locked by another thread) it calls the Windows API function Sleep(0), and then, if the lock is still not available, enters a loop calling Sleep(1) after each check of the lock.
Each call to Sleep(0) incurs the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles. As for Sleep(1) - besides the costs associated with Sleep(0) - it also delays execution by at least 1 millisecond, ceding control to other threads and, if there are no threads waiting to be executed by a physical CPU core, putting the core to sleep, effectively reducing CPU usage and power consumption.
That's why, in multithreaded work with FastMM4, CPU usage never reaches 100% - because of the Sleep(1) issued by FastMM4. This way of acquiring locks is not optimal. A better way would have been a spin-lock of about 5000 pause instructions and, if the lock was still busy, a SwitchToThread() API call. If pause is not available (on very old processors with no SSE2 support) or the SwitchToThread() call is not available (on very old Windows versions, prior to Windows 2000), the best solution would be to use EnterCriticalSection/LeaveCriticalSection, which doesn't have the latency associated with Sleep(1) and also very effectively cedes control of the CPU core to other threads.
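The suggested scheme - a bounded pause spin-loop followed by yielding the core - looks roughly like this (sketched in C, with POSIX `sched_yield()` standing in for SwitchToThread(); the 5000-iteration bound is the one suggested above):

```c
#include <stdatomic.h>
#include <sched.h>

static atomic_flag g_lock = ATOMIC_FLAG_INIT;

/* Spin with "pause" for a bounded number of iterations, then cede the
   core with sched_yield() instead of sleeping for a whole millisecond. */
static void spin_lock(void)
{
    for (;;) {
        for (int i = 0; i < 5000; i++) {
            if (!atomic_flag_test_and_set_explicit(&g_lock,
                                                   memory_order_acquire))
                return;                  /* lock acquired */
#if defined(__x86_64__) || defined(__i386__)
            __builtin_ia32_pause();      /* the SSE2 "pause" hint */
#endif
        }
        sched_yield();                   /* lock still busy: yield */
    }
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&g_lock, memory_order_release);
}
```

Unlike Sleep(1), the yield returns as soon as the scheduler has no better use for the core, so the worst-case added latency is a time slice, not a fixed millisecond.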
The fork that I've mentioned uses a new approach to waiting for a lock, recommended by Intel in its Optimization Manual for developers: a spin-loop of pause + SwitchToThread(), and, if either of these is not available, critical sections instead of Sleep(). With these options, Sleep() will never be used; EnterCriticalSection/LeaveCriticalSection will be used instead. Testing has shown that using critical sections instead of Sleep (which was the FastMM4 default before) provides a significant gain in situations where the number of threads working with the memory manager is the same as or higher than the number of physical cores. The gain is even more evident on computers with multiple physical CPUs and Non-Uniform Memory Access (NUMA). I have implemented compile-time options to replace the original FastMM4 approach of Sleep(InitialSleepTime) followed by Sleep(AdditionalSleepTime) (i.e. Sleep(0) and Sleep(1)) with EnterCriticalSection/LeaveCriticalSection, to save the valuable CPU cycles wasted by Sleep(0) and to reduce the latency, which was hurt by at least 1 millisecond each time by Sleep(1); critical sections are much more CPU-friendly and have definitely lower latency than Sleep(1).
When these options are enabled, FastMM4-AVX checks (1) whether the CPU supports SSE2, and thus the "pause" instruction, and (2) whether the operating system has the SwitchToThread() API call; if both conditions are met, it uses a "pause" spin-loop for 5000 iterations and then SwitchToThread() instead of critical sections. If the CPU doesn't have the "pause" instruction or Windows doesn't have the SwitchToThread() API function, it will use EnterCriticalSection/LeaveCriticalSection.
You can see the test results, including some made on a computer with multiple physical CPUs (sockets), in that fork.
See also the Long Duration Spin-wait Loops on Hyper-Threading Technology Enabled Intel Processors article. Here is what Intel writes about this issue - and it applies to FastMM4 very well:
The long duration spin-wait loop in this threading model seldom causes a performance problem on conventional multiprocessor systems. But it may introduce a severe penalty on a system with Hyper-Threading Technology because processor resources can be consumed by the master thread while it is waiting on the worker threads. Sleep(0) in the loop may suspend the execution of the master thread, but only when all available processors have been taken by worker threads during the entire waiting period. This condition requires all worker threads to complete their work at the same time. In other words, the workloads assigned to worker threads must be balanced. If one of the worker threads completes its work sooner than others and releases the processor, the master thread can still run on one processor.
On a conventional multiprocessor system this doesn't cause performance problems because no other thread uses the processor. But on a system with Hyper-Threading Technology the processor the master thread runs on is a logical one that shares processor resources with one of the other worker threads.
The nature of many applications makes it difficult to guarantee that workloads assigned to worker threads are balanced. A multithreaded 3D application, for example, may assign the tasks for transformation of a block of vertices from world coordinates to viewing coordinates to a team of worker threads. The amount of work for a worker thread is determined not only by the number of vertices but also by the clipped status of the vertex, which is not predictable when the master thread divides the workload for working threads.
A non-zero argument in the Sleep function forces the waiting thread to sleep N milliseconds, regardless of the processor availability. It may effectively block the waiting thread from consuming processor resources if the waiting period is set properly. But if the waiting period is unpredictable from workload to workload, then a large value of N may make the waiting thread sleep too long, and a smaller value of N may cause it to wake up too quickly.
Therefore the preferred solution to avoid wasting processor resources in a long duration spin-wait loop is to replace the loop with an operating system thread-blocking API, such as the Microsoft Windows* threading API,
WaitForMultipleObjects. This call causes the operating system to block the waiting thread from consuming processor resources.
It refers to Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor application note.
You can also find a very good spin-loop implementation here at stackoverflow.
It also performs plain (non-locked) loads to check the lock state before issuing a lock-ed store, so as not to flood the CPU with locked operations in a loop, which would lock the bus.
FastMM4 per se is very good. Just improve the locking and you will get an excellent multi-threaded memory manager.
Please also be aware that each small block type is locked separately in FastMM4.
You can put padding between the small-block control areas, to make each area have its own cache line, not shared with other block sizes, and to make sure it begins at a cache-line-size boundary. You can use CPUID to determine the size of the CPU cache line.
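A sketch of such padding (C for illustration; the 64-byte line size is an assumption - query CPUID, as suggested above, for the real value):

```c
#define CACHE_LINE 64  /* assumption: determine the real size via CPUID */

/* Each small-block lock lives on its own cache line, so two threads
   hammering different block sizes never false-share a line. The
   alignment forces sizeof(PaddedLock) up to a full line. */
typedef struct {
    _Alignas(CACHE_LINE) int lock;
} PaddedLock;
```

In an array of these, consecutive locks land exactly one cache line apart.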
So, with locking correctly implemented to suit your needs (i.e. whether you need NUMA or not, whether to use locking on release, etc.), you may find that the memory allocation routines become several times faster and do not suffer so severely from thread contention.
FastMM deals with multi-threading just fine. It is the default memory manager for Delphi 2006 and up.
If you are using an older version of Delphi (Delphi 5 and up), you can still use FastMM. It's available on SourceForge.
You could use TopMM:
http://www.topsoftwaresite.nl/
You could also try ScaleMM2 (SynScaleMM is based on ScaleMM1), but I still have to fix a bug regarding inter-thread memory, so it's not production-ready yet :-(
http://code.google.com/p/scalemm/
The Delphi 6 memory manager is outdated and outright bad. We were using RecyclerMM both on a high-load production server and in a multi-threaded desktop application, and we had no issues with it: it's fast, reliable and doesn't cause excess fragmentation. (Fragmentation was the worst issue of Delphi's stock memory manager.)
The only drawback of RecyclerMM is that it isn't compatible with MemCheck out of the box. However, a small source alteration was enough to make it compatible.
I know I can reserve virtual memory using VirtualAlloc.
e.g. I can claim 1GB of virtual memory and then commit the first MB of it to put a growing array into.
When the array grows beyond 1MB I commit the 2nd MB, and so on.
This way I don't need to move the array around in memory as it grows; it just stays in place and the Intel/AMD virtual memory manager takes care of my problems.
However does FastMM support this structure, so I don't have to do my own memory management?
Pseudo code:
type
PBigArray = ^TBigArray;
TBigArray = array[0..0] of SomeRecord;
....
begin
VirtualMem:= FastMM.ReserveVirtualMemory(1GB);
PBigArray:= FastMM.ClaimPhysicalMemory(VirtualMem, 1MB);
....
procedure GrowBigArray
begin
FastMM.ClaimMorePhysicalMemory(PBigArray, 1MB {extra});
//will generate OOM exception when claim exceeds 1GB
Does FastMM support this?
No, FastMM4 (as of the latest version I looked at) does not explicitly support this. It's really not a functionality you would expect in a general purpose memory manager as it's trivially simple to do with VirtualAlloc calls.
NexusMM4 (which is part of NexusDB) does something that gives you a similar result, but without wasting all the address space before it is needed in the background.
If you make an initial large allocation (directly via GetMem, or indirectly via a dynamic array or such) the memory is allocated in just the size needed, via VirtualAlloc.
But if that allocation is then resized to a larger size, NexusMM will use a different way to allocate memory, which allows it to simply unmap the allocation from the address space and remap it again, at a larger size, when further reallocs take place.
This prevents the 2 major problems that most general-purpose memory managers have when reallocating:
during a normal realloc, the existing and the new allocation need to be present in the address space at the same time, temporarily doubling the address-space and physical-memory requirements;
during a normal realloc, the whole contents of the existing allocation need to be copied.
So with NexusMM you would get all the advantages of what you showed in your pseudo code (with the exceptions that the first realloc will involve a copy, and that growing your array might change its address) by simply using normal GetMem/ReallocMem/FreeMem calls.
I'm working on a Linux kernel driver that makes a chunk of physical memory available to user space. I have a working version of the driver, but it's currently very slow. So, I've gone back a few steps and tried making a small, simple driver to recreate the problem.
I reserve the memory at boot time using the kernel parameter memmap=2G$1G. Then, in the driver's __init function, I ioremap some of this memory, and initialize it to a known value. I put in some code to measure the timing as well:
#define RESERVED_REGION_SIZE (1 * 1024 * 1024 * 1024) // 1GB
#define RESERVED_REGION_OFFSET (1 * 1024 * 1024 * 1024) // 1GB
static int __init memdrv_init(void)
{
struct timeval t1, t2;
printk(KERN_INFO "[memdriver] init\n");
// Remap reserved physical memory (that we grabbed at boot time)
do_gettimeofday( &t1 );
reservedBlock = ioremap( RESERVED_REGION_OFFSET, RESERVED_REGION_SIZE );
do_gettimeofday( &t2 );
printk( KERN_ERR "[memdriver] ioremap() took %d usec\n", usec_diff( &t2, &t1 ) );
// Set the memory to a known value
do_gettimeofday( &t1 );
memset( reservedBlock, 0xAB, RESERVED_REGION_SIZE );
do_gettimeofday( &t2 );
printk( KERN_ERR "[memdriver] memset() took %d usec\n", usec_diff( &t2, &t1 ) );
// Register the character device
...
return 0;
}
I load the driver, and check dmesg. It reports:
[memdriver] init
[memdriver] ioremap() took 76268 usec
[memdriver] memset() took 12622779 usec
That's 12.6 seconds for the memset. That means the memset is running at 81 MB/sec. Why on earth is it so slow?
This is kernel 2.6.34 on Fedora 13, and it's an x86_64 system.
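The throughput figure can be checked directly (C, just the arithmetic from the dmesg output):

```c
/* Sustained throughput in MiB/s for `bytes` written in `usec`. */
static long long mib_per_sec(long long bytes, long long usec)
{
    return bytes * 1000000 / usec / (1024 * 1024);
}
```

One GiB in 12622779 us is indeed roughly 81 MiB/s - orders of magnitude below what cached DRAM writes should achieve, which points at the mapping, not the memory.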
EDIT:
The goal behind this scheme is to take a chunk of physical memory and make it available to both a PCI device (via the memory's bus/physical address) and a user space application (via a call to mmap, supported by the driver). The PCI device will then continually fill this memory with data, and the user-space app will read it out. If ioremap is a bad way to do this (as Ben suggested below), I'm open to other suggestions that'll allow me to get any large chunk of memory that can be directly accessed by both hardware and software. I can probably make do with a smaller buffer also.
See my eventual solution below.
ioremap allocates uncacheable pages, as you'd desire for access to a memory-mapped-io device. That would explain your poor performance.
You probably want kmalloc or vmalloc. The usual reference materials will explain the capabilities of each.
I don't think ioremap() is what you want there. You should only access the result (what you call reservedBlock) with readb, readl, writeb, memcpy_toio, etc. It is not even guaranteed that the return value is virtually mapped (although it apparently is on your platform). I'd guess the region is being mapped uncached (suitable for IO registers), leading to the terrible performance.
It's been a while, but I'm updating since I did eventually find a workaround for this ioremap problem.
Since we had custom hardware writing directly to the memory, it was probably more correct to mark it uncacheable, but it was unbearably slow and wasn't working for our application. Our solution was to only read from that memory (a ring buffer) once there was enough new data to fill a whole cache line on our architecture (I think that was 256 bytes). This guaranteed we never got stale data, and it was plenty fast.
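The read-side rule - consume only whole cache lines - reduces to a little offset arithmetic (C sketch; the names and the 256-byte line size are taken from the description above):

```c
#include <stddef.h>

#define LINE 256  /* cache-line size on the architecture in question */

/* Given monotonically increasing producer (head) and consumer (tail)
   byte offsets into the ring buffer, how many bytes are safe to read
   right now? Only whole lines, so a line the device may still be
   filling is never touched. */
static size_t readable(size_t head, size_t tail)
{
    size_t avail = head - tail;
    return avail - avail % LINE;  /* round down to full lines */
}
```

The consumer then advances `tail` by the returned amount, always landing on a line boundary.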
I have tried doing huge memory chunk reservations with memmap.
The ioremapping of this chunk gave me a mapped memory address space that is beyond a few terabytes.
When you ask to reserve 128GB of memory starting at 64GB, you see the following in /proc/vmallocinfo:
0xffffc9001f3a8000-0xffffc9201f3a9000 137438957568 0xffffffffa00831c9 phys=1000000000 ioremap
Thus the address space starts at 0xffffc9001f3a8000 (which is way too large).
Second, your observation is correct: even memset_io results in extremely large delays (tens of minutes) to touch all this memory.
So the time taken mainly has to do with address-space conversion and non-cacheable page loading.