Working with large arrays - OutOfRam - delphi

I have an algorithm where I create two bi-dimensional arrays like this:
TYPE
TPtrMatrixLine = array of byte;
TCurMatrixLine = array of integer;
TPtrMatrix = array of TPtrMatrixLine;
TCurMatrix = array of TCurMatrixLine;
function x
var
PtrsMX: TPtrMatrix;
CurMx : TCurMatrix;
begin
{ Try to allocate RAM }
SetLength(PtrsMX, RowNr+1, ColNr+1);
SetLength(CurMx , RowNr+1, ColNr+1);
for all rows do
for all cols do
FillMatrixWithData; <------- CPU intensive task. It could take up to 10-20 min
end;
The two matrices have always the same dimension.
Usually there are only 2000 lines and 2000 columns in the matrix but sometimes it can go as high as 25000x6000 so for both matrices I need something like 146.5 + 586.2 = 732.8MB of RAM.
The problem is that the two blocks need to be contiguous so in most cases, even if 500-600MB of free RAM doesn't seem much on a modern computer, I run out of RAM.
The algorithm fills the cells of the array with data based on the neighbors of that cell. The operations are just additions and subtractions.
The TCurMatrixLine is the one that takes a lot or RAM since it uses integers to store data. Unfortunately, values stored may have sign so I cannot use Word instead of integers. SmallInt is too small (my values are bigger than SmallInt, but smaller than Word). I hope that if there is any other way to implement this, it needs not to add a lot of overhead, since processing a matrix with so many lines/column already takes a lot of time. In other words I hope that decreasing memory requirements will not increase processing time.
Any idea how to decrease the memory requirements?
[I use Delphi 7]
Update
Somebody suggested that each row of my array should be an independent uni-dimensional array.
I create as many rows (arrays) as I need and store them in TList. Sound very good. Obviously there will be no problem allocation such small memory blocks. But I am afraid it will have a gigantic impact on speed. I use now
TCurMatrixLine = array of integer;
TCurMatrix = array of TCurMatrixLine;
because it is faster than TCurMatrix= array of array of integer (because of the way data is placed in memory). So, breaking the array in independent lines may affect the speed.

The suggestion of using a signed 2 byte integer will greatly aid you.
Another useful tactic is to mark your exe as being LARGE_ADDRESS_AWARE by adding {$SetPEFlags IMAGE_FILE_LARGE_ADDRESS_AWARE} to your .dpr file. This will only help if you are running on 64 bit Windows and will increase your address space from 2GB to 4GB.
It may not work on Delphi 7 (I seem to recall you are using D7) and you must be using FastMM since the old Borland memory manager isn't compatible with large address space. If $SetPEFlags isn't available you can still mark the exe with EDITBIN.
If you still encounter difficulties then yet another trick is to do allocate smaller sub-blocks of memory and use a wrapper class to handle mapping indices to the appropriate sub-block and offset within. You can use a default index property to make this transparent to the calling code.
Naturally a block allocated approach like this does incur some processing overhead but it's your best bet if you are having troubles with getting contiguous blocks.

If the absolute values of elements of CurMx fits word then you can store it in word and use another array of boolean for its sign. It reduces 1 byte for each element.

Have you considered to manually allocate the data structure on the heap?
...and measured how this will affect the memory usage and the performance?
Using the heap might actually increase speed and reduce the memory usage, because you can avoid the whole array to be copied from one memory segment to another memory segment. (Eg. if your FillMatrixWithData are declared with a non-const open array parameter).

Related

Does the order or syntax of allocate statement affect performance? (Fortran)

Because of having performance issues when passing a code from static to dynamic allocation, I started to wander about how memory allocation is managed in a Fortran code.
Specifically, in this question, I wander if the order or syntax used for the allocate statement makes any difference. That is, does it make any difference to allocate vectors like:
allocate(x(DIM),y(DIM))
versus
allocate(x(DIM))
allocate(y(DIM))
The syntax suggests that in the first case the program would allocate all the space for the vectors at once, possibly improving the performance, while in the second case it must allocate the space for one vector at a time, in such a way that they could end up far from each other. If not, that is, if the syntax does not make any difference, I wander if there is a way to control that allocation (for instance, allocating a vector for all space and using pointers to address the space allocated as multiple variables).
Finally, I notice now that I don't even know one thing: an allocate statement guarantees that at least a single vector occupies a contiguous space in memory (or the best it can?).
From the language standard point of view both ways how to write them are possible. The compiler is free to allocate the arrays where it wants. It normally calls malloc() to allocate some piece of memory and makes the allocatable arrays from that piece.
Whether it might allocate a single piece of memory for two different arrays in a single allocate statement is up to the compiler, but I haven't heard about any compiler doing that.
I just verified that my gfortran just calls __builtin_malloc two times in this case.
Another issue is already pointed out by High Performance Mark. Even when malloc() successfully returns, the actual memory pages might still not be assigned. On Linux that happens when you first access the array.
I don't think it is too important if those arrays are close to each other in memory or not anyway. The CPU can cache arrays from different regions of address space if it needs them.
Is there a way how to control the allocation? Yes, you can overload the malloc by your own allocator which does some clever things. It may be used to have always memory aligned to 32-bytes or similar purposes (example). Whether you will improve performance of your code by allocating things somehow close to each other is questionable, but you can have a try. (Of course this is completely compiler-dependent thing, a compiler doesn't have to use malloc() at all, but mostly they do.) Unfortunately, this will only works when the calls to malloc are not inlined.
There are (at least) two issues here, firstly the time taken to allocate the memory and secondly the locality of memory in the arrays and the impact of this on performance. I don't know much about the actual allocation process, although the links suggested by High Performance Mark and the answer by Vadimir F cover this.
From your question, it seems you are more interested in cache hits and memory locality given by arrays being next to each other. I would guess there is no guarantee either allocate statement ensures both arrays next to each other in memory. This is based on allocating arrays in a type, which in the fortran 2003 MAY 2004 WORKING DRAFT J3/04-007 standard
NOTE 4.20
Unless the structure includes a SEQUENCE statement, the use of this terminology in no way implies that these components are stored in this, or any other, order. Nor is there any requirement that contiguous storage be used.
From the discussion with Vadimir F, if you put allocatable arrays in a type and use the sequence keyword, e.g.
type botharrays
SEQUENCE
double precision, dimension(:), allocatable :: x, y
end type
this DOES NOT ensure they are allocated as adjacent in memory. For static arrays or lots of variables, a sequential type sounds like it may work like your idea of "allocating a vector for all space and using pointers to address the space allocated as multiple variables". I think common blocks (Fortran 77) allowed you to specify the relationship between memory location of arrays and variables in memory, but don't work with allocatable arrays either.
In short, I think this means you cannot ensure two allocated arrays are adjacent in memory. Even if you could, I don't see how this will result in a reduction in cache misses or improved performance. Even if you typically use the two together, unless the arrays are small enough that the cache will include multiple arrays in one read (assuming reads are allowed to go beyond array bounds) you won't benefit from the memory locality.

Linked List in OpenCL

I have 1000 float datas in an array. I want to separate into different classes, lets say 4 classes. Their sizes are unpredictable. I could easily hold them in a linked list in a CPU implementation, but in OpenCL kernel, is there an opportunity like that? In my mind there are 3 solution to this problem.
First, arrays with length 1000 constructed in number of classes, which is memory costly.
Second, I allocate an array with length 1000 and separate them into parts. However, I may transport the values from and index into different index, becuase I don't know the size of each classes and they may exceed the size which I provided for each.
Third, and better in my opinion, I get two different array with same length. One of them stores data, the other one stores pointers. For example, in i-th index of data array, the value is stored which belongs to 2nd class. Additionally in i-th index of pointer to the next data which belongs to 2nd class. But this is good for just atomic type (like int, float, char etc) linked lists.
I am new in OpenCL. I haven't known lots of features of it yet. If there is a better way, please don't share with me and others.
Using pointers on GPU is usually very bad idea. Major amount of data resides in global memory, and to fetch it quickly the access should be coalesced. Using pointers breaks the access pattern totally, making it essentially random. It's not very good on CPUs too since it cause a lot of cache misses, but CPUs have larger caches and "smarter" internal logic, so it's usually not so important, but sometimes cache-aware memory access pattern can increase CPU application's speed by nearly order of magnitude. On GPUs coalesced global memory access is one of most important optimizations, and pointers can't provide it.
If you are not extremely short on memory, I'd suggest to use first way and preallocate arrays large enough to hold all data. If you are really short on memory, you could use textures to store your data and pointer arrays, but it depends on the algorithm whether it would provide any benefits or not.

Fast access to a (sorted) TList

My project (running on Delphi 6!) requires a list of memory allocations (TMemoryAllocation) which I store within an object that also holds the information about the size of the allocation (FSize) and if the allocation is in use or free (FUsed). I use this basically as a GarbageCollector and a way to keep my application of allocating/deallocating memory all the time (and there are plenty allocations/deallocations needed).
Whenever my project needs an allocation it looks up the list to find a free allocation that fits the required size. To achieve that I use a simple for loop:
for I := 0 to FAllocationList.Count - 1 do
begin
if MemoryAllocation.FUsed and (MemoryAllocation.FSize = Size) then
...
The longer my application runs this list grows to several thousand items and it slows down considerably as I run it up very frequently (several times per second).
I am trying to find a way to accelerate this solution. I thought about sorting the TList by size of allocation. If I do that I should use some intelligent way of accessing the list for the particular size I need at every call. Is there some easy way to do this?
Another way I was thinking about was to have two TLists. One for the Unused and one of the Used allocations. That would mean though that I would have to Extract TList.Items from one list and add to the other all the time. And I would still need to use a for-loop to go through the (now) smaller list. Would this be the right way?
Other suggestions are VERY welcome as well!
You have several possibilities:
Of course, use a proven Memory Manager as FastMM4 or some others dedicated to better scale for multi-threaded applications;
Reinvent the wheel.
If your application is very sensitive to memory allocation, perhaps some feedback about reinventing the wheel:
Leverage your blocks size e.g. per 16 bytes multiples, then maintain one list per block size - so you can quickly reach the good block "family", and won't have to store each individual block size in memory (if it's own in the 32 bytes list, it is a 32 bytes blocks);
If you need reallocation, try to guess the best increasing factor to reduce memory copy;
Sort your blocks per size, then use binary search, which will be much faster than a plain for i := 0 to Count-1 loop;
Maintain a block of deleted items in the list, in which to lookup when you need a new item (so you won't have to delete the item, just mark it as free - this will speed up a lot if the list is huge);
Instead of using a list (which will have some speed issues when deleting or inserting sorted items with a huge number of items) use a linked list, for both items and freed items.
It's definitively not so simple, so you may want to look at some existing code before, or just rely on existing libraries. I think you don't have to code this memory allocation in your application, unless FastMM4 is not fast enough for you, which I'll doubt very much, because this is a great piece of code!

how to generate a bidimensional array with different "branch" lengths very fast

I am a Delphi programmer.
In a program I have to generate bidimensional arrays with different "branch" lengths.
They are very big and the operation takes a few seconds (annoying).
For example:
var a: array of array of Word;
i: Integer;
begin
SetLength(a, 5000000);
for i := 0 to 4999999 do
SetLength(a[i], Diff_Values);
end;
I am aware of the command SetLength(a, dim1, dim2) but is not applicable. Not even setting a min value (> 0) for dim2 and continuing from there because min of dim2 is 0 (some "branches" can be empty).
So, is there a way to make it fast? Not just by 5..10% but really FAST...
Thank you.
When dealing with a large amount of data, there's a lot of work that has to be done, and this places a theoretical minimum on the amount of time it can be done in.
For each of 5 million iterations, you need to:
Determine the size of the "branch" somehow
Allocate a new array of the appropriate size from the memory manager
Zero out all the memory used by the new array (SetLength does this for you automatically)
Step 1 is completely under your control and can possibly be optimized. 2 and 3, though, are about as fast as they're gonna get if you're using a modern version of Delphi. (If you're on an old version, you might benefit from installing FastMM and FastCode, which can speed up these operations.)
The other thing you might do, if appropriate, is lazy initialization. Instead of trying to allocate all 5 million arrays at once, just do the SetLength(a, 5000000); at first. Then when you need to get at a "branch", first check if its length = 0. If so, it hasn't been initialized, so initialize it to the proper length. This doesn't save time overall, in fact it will take slightly longer in total, but it does spread out the initialization time so the user doesn't notice.
If your initialization is already as fast as it will get, and your situation is such that lazy initialization can't be used here, then you're basically out of luck. That's the price of dealing with large amounts of data.
I just tested your exact code, with a constant for Diff_Values, timed it using GetTickCount() for rudimentary timing. If Diff_Values is 186 it takes 1466 milliseconds, if Diff_Values is 187 it fails with Out of Memory. You know, Out of Memory means Out of Address Space, not really Out of Memory.
In my opinion you're allocating so much data you run out of RAM and Windows starts paging, that's why it's slow. On my system I've got enough RAM for the process to allocate as much as it wants; And it does, until it fails.
Possible solutions
The obvious one: Don't allocate that much!
Figure out a way to allocate all data into one contiguous block of memory: helps with address space fragmentation. Similar to how a bi dimensional array with fixed size on the "branches" is allocated, but if your "branches" have different sizes, you'll need to figure a different mathematical formula, based on your data.
Look into other data structures, possibly ones that cache on disk (to brake the 2Gb address space limit).
In addition to Mason's points, here are some more ideas to consider:
If the branch lengths never change after they are allocated, and you have an upper bound on the total number of items that will be stored in the array across all branches, then you might be able to save some time by allocating one huge chunk of memory and divvying up the "branches" within that chunk yourself. Your array would become a 1 dimensional array of pointers, and each entry in that array points to the start of the data for that branch. You keep track of the "end" of the used space in your big block with a single pointer variable, and when you need to reserve space for a new "branch" you take the current "end" pointer value as the start of the new branch and increment the "end" pointer by the amount of space that branch requires. Don't forget to round up to dword boundaries to avoid misalignment penalties.
This technique will require more use of pointers, but it offers the potential of eliminating all the heap allocation overhead, or at least replacing the general purpose heap allocation with a purpose-built very simple, very fast suballocator that matches your specific use pattern. It should be faster to execute, but it will require more time to write and test.
This technique will also avoid heap fragmentation and reduces the releasing of all the memory to a single deallocation (instead of millions of separate allocations in your present model).
Another tip to consider: If the first thing you always do with the each newly allocated array "branch" is assign data into every slot, then you can eliminate step 3 in Mason's example - you don't need to zero out the memory if all you're going to do is immediately assign real data into it. This will cut your memory write operations by half.
Assuming you can fit the entire data structure into a contiguous block of memory, you can do the allocation in one shot and then take over the indexing.
Note: Even if you can't fit the data into a single contiguous block of memory, you can still use this technique by allocating multiple large blocks and then piecing them together.
First off form a helper array, colIndex, which is to contain the index of the first column of each row. Set the length of colIndex to RowCount+1. You build this by setting colIndex[0] := 0 and then colIndex[i+1] := colIndex[i] + ColCount[i]. Do this in a for loop which runs up to and including RowCount. So, in the final entry, colIndex[RowCount], you store the total number of elements.
Now set the length of a to be colIndex[RowCount]. This may take a little while, but it will be quicker than what you were doing before.
Now you need to write a couple of indexers. Put them in a class or a record.
The getter looks like this:
function GetItem(row, col: Integer): Word;
begin
Result := a[colIndex[row]+col];
end;
The setter is obvious. You can inline these access methods for increased performance. Expose them as an indexed property for convenience to the object's clients.
You'll want to add some code to check for validity of row and col. You need to use colIndex for the latter. You can make this checking optional with {$IFOPT R+} if you want to mimic range checking for native indexing.
Of course, this is a total non-starter if you want to change any of your column counts after the initial instantiation!

String to Byte [delphi]

I need to store my data into memory. My type data of my data is string. I want to minimize the memory usage. I guess I have to change string into byte. Am I right? If I convert string to byte, that means I have to convert string to TMemoryStream?
If you really want to convert it then this code will get it done
var
BinarySize: Integer;
InputString: string;
StringAsBytes: array of Byte;
begin
BinarySize := (Length(InputString) + 1) * SizeOf(Char);
SetLength(StringAsBytes, BinarySize);
Move(InputString[1], StringAsBytes[0], BinarySize);
But as already stated this will not save you memory. The ammount of it used will be practically the same. You will gain nothing from this alone. If you are having to many strings take a different approach. Like something from this list of choices:
Use a dictionary and only store each same string once
Only hold a portion of all strings in memory. Some sort of cache. Have others on hard drive and use streams to load them
If you have very large string consider compressing them.
If you are reading from file and you target is binary data, skip the string in the middle. Read the source directly into a byte buffer.
It is hard to give further help without knowing more about the problem.
EDIT:
If you really want a minimum memory footprint and you can live with a little lower speed (but still very fast) you can use Suffix Trie or B-Tree or event a simple Binary Tree. They can work directly from hard drive and can be very fast for searching. If you then cache a subset of the data to RAM, you get the optimal solution memory vs. speed wise.
Anyway given the ammount of data you claim to have it seems no memory optimization is needed at all. 22MB of RAM is hardly an issue and not worth optimizing.
Are you certain this is an optimization that is needed?
2000 lines that are 10 characters long is only 20000 characters.
In most environments, that's tiny. Most machines have considerably more RAM than that. Most disks are considerably larger than that. And, usually, sending and receiving that much information is trivial over the web.
Perhaps your situation is unique. Maybe you have large number of 20000 character data sets, or very slow web access over which to transmit this date, etc. But, I'd encourage you to consider whether you aren't perhaps trying to optimize something that even if you are very successful in implementing, won't significantly change your application's performance in the real world.
Make your storage type tutf8string. It can be simply assigned from tunicodestring, and conversion should be safe.

Resources