how to generate a bidimensional array with different "branch" lengths very fast - delphi

I am a Delphi programmer.
In a program I have to generate bidimensional arrays with different "branch" lengths.
They are very big and the operation takes a few seconds (annoying).
For example:
var a: array of array of Word;
i: Integer;
begin
SetLength(a, 5000000);
for i := 0 to 4999999 do
SetLength(a[i], Diff_Values);
end;
I am aware of the command SetLength(a, dim1, dim2) but is not applicable. Not even setting a min value (> 0) for dim2 and continuing from there because min of dim2 is 0 (some "branches" can be empty).
So, is there a way to make it fast? Not just by 5..10% but really FAST...
Thank you.

When dealing with a large amount of data, there's a lot of work that has to be done, and this places a theoretical minimum on the amount of time it can be done in.
For each of 5 million iterations, you need to:
Determine the size of the "branch" somehow
Allocate a new array of the appropriate size from the memory manager
Zero out all the memory used by the new array (SetLength does this for you automatically)
Step 1 is completely under your control and can possibly be optimized. 2 and 3, though, are about as fast as they're gonna get if you're using a modern version of Delphi. (If you're on an old version, you might benefit from installing FastMM and FastCode, which can speed up these operations.)
The other thing you might do, if appropriate, is lazy initialization. Instead of trying to allocate all 5 million arrays at once, just do the SetLength(a, 5000000); at first. Then when you need to get at a "branch", first check if its length = 0. If so, it hasn't been initialized, so initialize it to the proper length. This doesn't save time overall, in fact it will take slightly longer in total, but it does spread out the initialization time so the user doesn't notice.
If your initialization is already as fast as it will get, and your situation is such that lazy initialization can't be used here, then you're basically out of luck. That's the price of dealing with large amounts of data.

I just tested your exact code, with a constant for Diff_Values, timed it using GetTickCount() for rudimentary timing. If Diff_Values is 186 it takes 1466 milliseconds, if Diff_Values is 187 it fails with Out of Memory. You know, Out of Memory means Out of Address Space, not really Out of Memory.
In my opinion you're allocating so much data you run out of RAM and Windows starts paging, that's why it's slow. On my system I've got enough RAM for the process to allocate as much as it wants; And it does, until it fails.
Possible solutions
The obvious one: Don't allocate that much!
Figure out a way to allocate all data into one contiguous block of memory: helps with address space fragmentation. Similar to how a bi dimensional array with fixed size on the "branches" is allocated, but if your "branches" have different sizes, you'll need to figure a different mathematical formula, based on your data.
Look into other data structures, possibly ones that cache on disk (to brake the 2Gb address space limit).

In addition to Mason's points, here are some more ideas to consider:
If the branch lengths never change after they are allocated, and you have an upper bound on the total number of items that will be stored in the array across all branches, then you might be able to save some time by allocating one huge chunk of memory and divvying up the "branches" within that chunk yourself. Your array would become a 1 dimensional array of pointers, and each entry in that array points to the start of the data for that branch. You keep track of the "end" of the used space in your big block with a single pointer variable, and when you need to reserve space for a new "branch" you take the current "end" pointer value as the start of the new branch and increment the "end" pointer by the amount of space that branch requires. Don't forget to round up to dword boundaries to avoid misalignment penalties.
This technique will require more use of pointers, but it offers the potential of eliminating all the heap allocation overhead, or at least replacing the general purpose heap allocation with a purpose-built very simple, very fast suballocator that matches your specific use pattern. It should be faster to execute, but it will require more time to write and test.
This technique will also avoid heap fragmentation and reduces the releasing of all the memory to a single deallocation (instead of millions of separate allocations in your present model).
Another tip to consider: If the first thing you always do with the each newly allocated array "branch" is assign data into every slot, then you can eliminate step 3 in Mason's example - you don't need to zero out the memory if all you're going to do is immediately assign real data into it. This will cut your memory write operations by half.

Assuming you can fit the entire data structure into a contiguous block of memory, you can do the allocation in one shot and then take over the indexing.
Note: Even if you can't fit the data into a single contiguous block of memory, you can still use this technique by allocating multiple large blocks and then piecing them together.
First off form a helper array, colIndex, which is to contain the index of the first column of each row. Set the length of colIndex to RowCount+1. You build this by setting colIndex[0] := 0 and then colIndex[i+1] := colIndex[i] + ColCount[i]. Do this in a for loop which runs up to and including RowCount. So, in the final entry, colIndex[RowCount], you store the total number of elements.
Now set the length of a to be colIndex[RowCount]. This may take a little while, but it will be quicker than what you were doing before.
Now you need to write a couple of indexers. Put them in a class or a record.
The getter looks like this:
function GetItem(row, col: Integer): Word;
begin
Result := a[colIndex[row]+col];
end;
The setter is obvious. You can inline these access methods for increased performance. Expose them as an indexed property for convenience to the object's clients.
You'll want to add some code to check for validity of row and col. You need to use colIndex for the latter. You can make this checking optional with {$IFOPT R+} if you want to mimic range checking for native indexing.
Of course, this is a total non-starter if you want to change any of your column counts after the initial instantiation!

Related

Does the order or syntax of allocate statement affect performance? (Fortran)

Because of having performance issues when passing a code from static to dynamic allocation, I started to wander about how memory allocation is managed in a Fortran code.
Specifically, in this question, I wander if the order or syntax used for the allocate statement makes any difference. That is, does it make any difference to allocate vectors like:
allocate(x(DIM),y(DIM))
versus
allocate(x(DIM))
allocate(y(DIM))
The syntax suggests that in the first case the program would allocate all the space for the vectors at once, possibly improving the performance, while in the second case it must allocate the space for one vector at a time, in such a way that they could end up far from each other. If not, that is, if the syntax does not make any difference, I wander if there is a way to control that allocation (for instance, allocating a vector for all space and using pointers to address the space allocated as multiple variables).
Finally, I notice now that I don't even know one thing: an allocate statement guarantees that at least a single vector occupies a contiguous space in memory (or the best it can?).
From the language standard point of view both ways how to write them are possible. The compiler is free to allocate the arrays where it wants. It normally calls malloc() to allocate some piece of memory and makes the allocatable arrays from that piece.
Whether it might allocate a single piece of memory for two different arrays in a single allocate statement is up to the compiler, but I haven't heard about any compiler doing that.
I just verified that my gfortran just calls __builtin_malloc two times in this case.
Another issue is already pointed out by High Performance Mark. Even when malloc() successfully returns, the actual memory pages might still not be assigned. On Linux that happens when you first access the array.
I don't think it is too important if those arrays are close to each other in memory or not anyway. The CPU can cache arrays from different regions of address space if it needs them.
Is there a way how to control the allocation? Yes, you can overload the malloc by your own allocator which does some clever things. It may be used to have always memory aligned to 32-bytes or similar purposes (example). Whether you will improve performance of your code by allocating things somehow close to each other is questionable, but you can have a try. (Of course this is completely compiler-dependent thing, a compiler doesn't have to use malloc() at all, but mostly they do.) Unfortunately, this will only works when the calls to malloc are not inlined.
There are (at least) two issues here, firstly the time taken to allocate the memory and secondly the locality of memory in the arrays and the impact of this on performance. I don't know much about the actual allocation process, although the links suggested by High Performance Mark and the answer by Vadimir F cover this.
From your question, it seems you are more interested in cache hits and memory locality given by arrays being next to each other. I would guess there is no guarantee either allocate statement ensures both arrays next to each other in memory. This is based on allocating arrays in a type, which in the fortran 2003 MAY 2004 WORKING DRAFT J3/04-007 standard
NOTE 4.20
Unless the structure includes a SEQUENCE statement, the use of this terminology in no way implies that these components are stored in this, or any other, order. Nor is there any requirement that contiguous storage be used.
From the discussion with Vadimir F, if you put allocatable arrays in a type and use the sequence keyword, e.g.
type botharrays
SEQUENCE
double precision, dimension(:), allocatable :: x, y
end type
this DOES NOT ensure they are allocated as adjacent in memory. For static arrays or lots of variables, a sequential type sounds like it may work like your idea of "allocating a vector for all space and using pointers to address the space allocated as multiple variables". I think common blocks (Fortran 77) allowed you to specify the relationship between memory location of arrays and variables in memory, but don't work with allocatable arrays either.
In short, I think this means you cannot ensure two allocated arrays are adjacent in memory. Even if you could, I don't see how this will result in a reduction in cache misses or improved performance. Even if you typically use the two together, unless the arrays are small enough that the cache will include multiple arrays in one read (assuming reads are allowed to go beyond array bounds) you won't benefit from the memory locality.

Memory Locations of Variables when Using IA-32 Assembly Language

Quick question on memory locations in IA-32 assembly language that i cannot seem to find the answer for anywhere else.
On IA-32 each memory address is 4 bytes long (e.g. 0x0040120e). Each of these addresses points to a 1 byte value (or in the case of a larger value, the first byte of it). Now look at these two simple IA-32 assembly language statements:
var1 db 2
var2 db 3
This will place var1 and var2 in adjacent memory cells (let's say 0x0040120e and 0f). Now I realize that the define directive db allocates 1 byte to the value. But, in the case above I have two values (2 and 3) that in fact only requires two bits each, to be stored.
Questions:
When using the db directive, do these two values still consume a full byte, even though they are smaller than 1 byte?
Is using a full byte for values that could get away with less, still the common way to go (as we have so much memory that we don't care)?
Does integers 0 to 255 then generally take up 1 byte and integers 256 to (2^16 - 1) take up 2 bytes (a word), etc.?
Thank you,
Magnus
EDIT 1: Made questions more clear (apologies for the back and forth)
EDIT 2: Added a structured reply below, based on other posters' input
yes. the B in DB is for Byte.
You could use a nibble for each, like so:
combined db 0x23
but you'd have to
a) shift the result for 4 bits right if you need the "2".
b) mask the leftmost 4 bits if you need the "3".
Hardly worth the effort these days ;-)
Yes, since the architecture is byte-addressable and cannot address anything smaller than a byte.
This means that data requiring less than one byte will need to share its address with other data.
In practice this means that you're going to have to know which bits in the pointed-out byte are used for this particular value.
For hardware registers this sort of mapping is very common.
EDIT: Ah, you seem to mean "values of the same variable" when you said "2 and 3". I thought you meant 2-bit and 3-bit values. You need to decide how many bits are needed at most for a particular variable, for all the values you need that variable to be able to store. There are variable-length encodings for integers of course, but that's generally rarely used in assembly and not what you'd typically use for some general-purpose variable.
You generally should expect to reserve all bits required for all values that a variable need to hold, up front. Otherwise, if you're worried about "wasting memory", you would need to move all other variables as soon as you get some "vacant bits" somewhere. That would end up costing fantastically much. Also, knowing the size of a variable is constant makes it possible to generate (or write) the proper code to handle it, otherwise you would of course also need to explicitly store somewhere "the size of the value held in variable x is now y bits". That becomes extremely painful very very quickly.
My initial question was a bit unstructured, so for the benefit of other searchers stopping by here I will use the answers received from #unwind and #geert3 to create a structured response. Again, this was my fault due to the initial poor structuring and creds for the answers goes to #unwind and #geert3.
When using the db directive you allocate 1 byte to the variable, and even if the variable takes up less space than 1 byte, it will still consume that full 1-byte address spot. As one might guess, that wastes a few bits of memory, but that is okay as you have enough memory and not too bothered about wasting a couple of bits. The reason you want to use the full 1-byte memory location is that it is easier to reference the variable when it is alone in the address slot (see #geert3's note on how to access it if you use less than a byte), and additionally, in case you want to reuse the variable later, it is great to know you have space for any number up to 255.
Yes, see answer to 1
Yes, you would normally allocate multiples of a byte to a variable, in a byte-addressable system

Why can't I use an Int64 in a for loop?

I can write for..do process for integer value..
But I can't write it for int64 value.
For example:
var
i:int64;
begin
for i:=1 to 1000 do
end;
The compiler refuses to compile this, why does it refuse?
The Delphi compiler simply does not support Int64 loop counters yet.
Loop counters in a for loop have to be integers (or smaller).
This is an optimization to speed up the execution of a for loop.
Internally Delphi always uses an Int32, because on x86 this is the fastest datatype available.
This is documented somewhere deep in the manual, but I don't have a link handy right now.
If you must have a 64 bit loop counter, use a while..do or repeat..until loop.
Even if the compiler did allow "int64" in a Delphi 7 for-loop (Delphi 7???), it probably wouldn't complete iterating through the full range until sometime after the heat death of the Sun.
So why can't you just use an "integer"?
If you must use an int64 value ... then simply use a "while" loop instead.
Problem solved :)
Why to use a Int64 on a for-loop?
Easy to answer:
There is no need to do a lot of iterations to need a Int64, just do a loop from 5E9 to 5E9+2 (three iterations in total).
It is just that values on iteration are bigger than what Int32 can hold
An example:
procedure Why_Int64_Would_Be_Great_On_For_Loop;
const
StartValue=5000000000; // Start form 5E9, 5 thousand millons
Quantity=10; // Do it ten times
var
Index:Int64;
begin
for Index:=StartValue to StartValue+Quantity-1
do begin // Bla bla bla
// Do something really fast (only ten times)
end;
end;
That code would take no time at all, it is just that index value need to be far than 32bit integer limit.
The solution is to do it with a while loop:
procedure Equivalent_For_Loop_With_Int64_Index;
const
StartValue=5000000000; // Start form 5E9, 5 thousand millons
Quantity=10; // Do it ten times
var
Index:Int64;
begin
Index:=StartValue;
while Index<=StartValue+Quantity
do begin // Bla bla bla
// Do something really fast (only ten times)
Inc(Index);
end;
end;
So why the compiler refuses to compile the foor loop, i see no real reason... any for loop can be auto-translated into a while loop... and pre-compiler could do such before compiler (like other optimizations that are done)... the only reason i see is the lazy people that creates the compiler that did not think on it.
If for is optimized and so it is only able to use 32 bit index, then if code try to use a 64 bit index it can not be so optimized, so why not let pre-compiler optimizator to chage that for us... it only gives bad image to programmers!!!
I do not want to make anyone ungry...
I only just say something obvious...
By the way, not all people start a foor loop on zero (or one) values... sometimes there is the need to start it on really huge values.
It is allways said, that if you need to do something a fixed number of times you best use for loop instead of while loop...
Also i can say something... such two versions, the for-loop and the while-loop that uses Inc(Index) are equally fast... but if you put the while-loop step as Index:=Index+1; it is slower; it is really not slower because pre-compiler optimizator see that and use Inc(Index) instead... you can see if buy doing the next:
// I will start the loop from zero, not from two, but i first do some maths to avoid pre-compiler optimizator to convert Index:=Index+Step; to Inc(Index,Step); or better optimization convert it to Inc(Index);
Index:=2;
Step:=Index-1; // Do not put Step:=1; or optimizator will do the convertion to Inc()
Index:=Step-2; // Now fix, the start, so loop will start from zero
while Index<1000000 // 1E6, one millon iterations, from 0 to 999999
do begin
// Do something
Index:=Index+Step; // Optimizator will not change this into Inc(Index), since sees that Step has changed it's value before
end;
The optimizer can see a variable do not change its value, so it can convert it to a constant, then on the increment assign if adding a constant (variable:=variable+constant) it will optimize it to Inc(variable,constant) and in the case it sees such constant is 1 it will also optimes it to Inc(variable)... and such optimizatons in low level computer language are very noticeble...
In Low level computer language:
A normal add (variable:=variable1+variable2) implies two memory reads plus one sum plus one memory write... lot of work
But if is a (variable:=variable+othervariable) it can be optimized holding variable inside the processor cache.
Also if it is a (variable:=variable1+constant) it can also be optimized by holding constant on the processor cache
And if it is (variable:=variable+constant) both are cached on processor cache, so huge fast compared with other options, no acces to RAM is needed.
In such way pre-compiler optimizer do another important optimization... for-loops index variables are holded as processor registers... much more faster than processor cache...
Most mother processor do an extra optimization as well (at hardware level, inside the processor)... some cache areas (32 bit variables for us) seen that are intensivly used are stored as special registers to fasten access... and such for-loop / while-loop indexes are ones of them... but as i said.. most mother AMD proccesors (the ones that uses MP technology does that)... i do not yet know any Intel that do that!!! such optimization is more relevant when multi-core and on super-computing... so maybe that is the reason why AMD has it and Intel not!!!
I only want to show one "why", there are a lot more... another one could be as simple as the index is stored on a database Int64 field type, etc... there are a lot of reasons i know and a lot more i did not know yet...
I hope this will help to understand the need to do a loop on a Int64 index and also how to do it without loosing speed by correctly eficiently converting loop into a while loop.
Note: For x86 compiles (not for 64bit compilation) beware that Int64 is managed internally as two Int32 parts... and when modifing values there is an extra code to do, on adds and subs it is very low, but on multiplies or divisions such extra is noticeble... but if you really need Int64 you need it, so what else to do... and imagine if you need float or double, etc...!!!

Fast access to a (sorted) TList

My project (running on Delphi 6!) requires a list of memory allocations (TMemoryAllocation) which I store within an object that also holds the information about the size of the allocation (FSize) and if the allocation is in use or free (FUsed). I use this basically as a GarbageCollector and a way to keep my application of allocating/deallocating memory all the time (and there are plenty allocations/deallocations needed).
Whenever my project needs an allocation it looks up the list to find a free allocation that fits the required size. To achieve that I use a simple for loop:
for I := 0 to FAllocationList.Count - 1 do
begin
if MemoryAllocation.FUsed and (MemoryAllocation.FSize = Size) then
...
The longer my application runs this list grows to several thousand items and it slows down considerably as I run it up very frequently (several times per second).
I am trying to find a way to accelerate this solution. I thought about sorting the TList by size of allocation. If I do that I should use some intelligent way of accessing the list for the particular size I need at every call. Is there some easy way to do this?
Another way I was thinking about was to have two TLists. One for the Unused and one of the Used allocations. That would mean though that I would have to Extract TList.Items from one list and add to the other all the time. And I would still need to use a for-loop to go through the (now) smaller list. Would this be the right way?
Other suggestions are VERY welcome as well!
You have several possibilities:
Of course, use a proven Memory Manager as FastMM4 or some others dedicated to better scale for multi-threaded applications;
Reinvent the wheel.
If your application is very sensitive to memory allocation, perhaps some feedback about reinventing the wheel:
Leverage your blocks size e.g. per 16 bytes multiples, then maintain one list per block size - so you can quickly reach the good block "family", and won't have to store each individual block size in memory (if it's own in the 32 bytes list, it is a 32 bytes blocks);
If you need reallocation, try to guess the best increasing factor to reduce memory copy;
Sort your blocks per size, then use binary search, which will be much faster than a plain for i := 0 to Count-1 loop;
Maintain a block of deleted items in the list, in which to lookup when you need a new item (so you won't have to delete the item, just mark it as free - this will speed up a lot if the list is huge);
Instead of using a list (which will have some speed issues when deleting or inserting sorted items with a huge number of items) use a linked list, for both items and freed items.
It's definitively not so simple, so you may want to look at some existing code before, or just rely on existing libraries. I think you don't have to code this memory allocation in your application, unless FastMM4 is not fast enough for you, which I'll doubt very much, because this is a great piece of code!

Working with large arrays - OutOfRam

I have an algorithm where I create two bi-dimensional arrays like this:
TYPE
TPtrMatrixLine = array of byte;
TCurMatrixLine = array of integer;
TPtrMatrix = array of TPtrMatrixLine;
TCurMatrix = array of TCurMatrixLine;
function x
var
PtrsMX: TPtrMatrix;
CurMx : TCurMatrix;
begin
{ Try to allocate RAM }
SetLength(PtrsMX, RowNr+1, ColNr+1);
SetLength(CurMx , RowNr+1, ColNr+1);
for all rows do
for all cols do
FillMatrixWithData; <------- CPU intensive task. It could take up to 10-20 min
end;
The two matrices have always the same dimension.
Usually there are only 2000 lines and 2000 columns in the matrix but sometimes it can go as high as 25000x6000 so for both matrices I need something like 146.5 + 586.2 = 732.8MB of RAM.
The problem is that the two blocks need to be contiguous so in most cases, even if 500-600MB of free RAM doesn't seem much on a modern computer, I run out of RAM.
The algorithm fills the cells of the array with data based on the neighbors of that cell. The operations are just additions and subtractions.
The TCurMatrixLine is the one that takes a lot or RAM since it uses integers to store data. Unfortunately, values stored may have sign so I cannot use Word instead of integers. SmallInt is too small (my values are bigger than SmallInt, but smaller than Word). I hope that if there is any other way to implement this, it needs not to add a lot of overhead, since processing a matrix with so many lines/column already takes a lot of time. In other words I hope that decreasing memory requirements will not increase processing time.
Any idea how to decrease the memory requirements?
[I use Delphi 7]
Update
Somebody suggested that each row of my array should be an independent uni-dimensional array.
I create as many rows (arrays) as I need and store them in TList. Sound very good. Obviously there will be no problem allocation such small memory blocks. But I am afraid it will have a gigantic impact on speed. I use now
TCurMatrixLine = array of integer;
TCurMatrix = array of TCurMatrixLine;
because it is faster than TCurMatrix= array of array of integer (because of the way data is placed in memory). So, breaking the array in independent lines may affect the speed.
The suggestion of using a signed 2 byte integer will greatly aid you.
Another useful tactic is to mark your exe as being LARGE_ADDRESS_AWARE by adding {$SetPEFlags IMAGE_FILE_LARGE_ADDRESS_AWARE} to your .dpr file. This will only help if you are running on 64 bit Windows and will increase your address space from 2GB to 4GB.
It may not work on Delphi 7 (I seem to recall you are using D7) and you must be using FastMM since the old Borland memory manager isn't compatible with large address space. If $SetPEFlags isn't available you can still mark the exe with EDITBIN.
If you still encounter difficulties then yet another trick is to do allocate smaller sub-blocks of memory and use a wrapper class to handle mapping indices to the appropriate sub-block and offset within. You can use a default index property to make this transparent to the calling code.
Naturally a block allocated approach like this does incur some processing overhead but it's your best bet if you are having troubles with getting contiguous blocks.
If the absolute values of elements of CurMx fits word then you can store it in word and use another array of boolean for its sign. It reduces 1 byte for each element.
Have you considered to manually allocate the data structure on the heap?
...and measured how this will affect the memory usage and the performance?
Using the heap might actually increase speed and reduce the memory usage, because you can avoid the whole array to be copied from one memory segment to another memory segment. (Eg. if your FillMatrixWithData are declared with a non-const open array parameter).

Resources