How to handle billions of objects without "Outofmemory" error - delphi

I have an application which may needs to process billions of objects.Each object of is of TRange class type. These ranges are created at different parts of an algorithm which depends on certain conditions and other object properties. As a result, if you have 100 items, you can't directly create the 100th object without creating all the prior objects. If I create all the (billions of) objects and add to the collection, the system will throw Outofmemory error. Now I want to iterate through each object mainly for two purposes:
To apply an operation for each TRange object(eg:Output certain properties)
To get a cumulative sum of a certain property.(eg: Each range has a weight property and I want to retreive totalweight that is a sum of all the range weights).
How do I effectively create an Iterator for these object without raising Outofmemory?
I have handled the first case by passing a function pointer to the algorithm function. For eg:
procedure createRanges(aProc: TRangeProc);//aProc is a pointer to function that takes a //TRange
var range: TRange;
rangerec: TRangeRec;
begin
range:=TRange.Create;
try
while canCreateRange do begin//certain conditions needed to create a range
rangerec := ReturnRangeRec;
range.Update(rangerec);//don't create new, use the same object.
if Assigned(aProc) then aProc(range);
end;
finally
range.Free;
end;
end;
But the problem with this approach is that to add a new functionality, say to retrieve the Total weight I have mentioned earlier, either I have to duplicate the algorithm function or pass an optional out parameter. Please suggest some ideas.
Thank you all in advance
Pradeep

For such large ammounts of data you need to only have a portion of the data in memory. The other data should be serialized to the hard drive. I tackled such a problem like this:
I Created an extended storage that can store a custom record either in memory or on the hard drive. This storage has a maximum number of records that can live simultaniously in memory.
Then I Derived the record classes out of the custom record class. These classes know how to store and load themselves from the hard drive (I use streams).
Everytime you need a new or already existing record you ask the extended storage for such a record. If the maximum number of objects is exceeded, the storage streams some of the least used record back to the hard drive.
This way the records are transparent. You always access them as if they are in memory, but they may get loaded from hard drive first. It works really well. By the way RAM works in a very similar way so it only holds a certain subset of all you data on your hard drive. This is your working set then.
I did not post any code because it is beyond the scope of the question itself and would only confuse.

Look at TgsStream64. This class can handle a huge amounts of data through file mapping.
http://code.google.com/p/gedemin/source/browse/trunk/Gedemin/Common/gsMMFStream.pas

But the problem with this approach is that to add a new functionality, say to retrieve the Total weight I have mentioned earlier, either I have to duplicate the algorithm function or pass an optional out parameter.
It's usually done like this: you write a enumerator function (like you did) which receives a callback function pointer (you did that too) and an untyped pointer ("Data: pointer"). You define a callback function to have first parameter be the same untyped pointer:
TRangeProc = procedure(Data: pointer; range: TRange);
procedure enumRanges(aProc: TRangeProc; Data: pointer);
begin
{for each range}
aProc(range, Data);
end;
Then if you want to, say, sum all ranges, you do it like this:
TSumRecord = record
Sum: int64;
end;
PSumRecord = ^TSumRecord;
procedure SumProc(SumRecord: PSumRecord; range: TRange);
begin
SumRecord.Sum := SumRecord.Sum + range.Value;
end;
function SumRanges(): int64;
var SumRec: TSumRecord;
begin
SumRec.Sum := 0;
enumRanges(TRangeProc(SumProc), #SumRec);
Result := SumRec.Sum;
end;
Anyway, if you need to create billions of ANYTHING you're probably doing it wrong (unless you're a scientist, modelling something extremely large scale and detailed). Even more so if you need to create billions of stuff every time you want one of those. This is never good. Try to think of alternative solutions.

"Runner" has a good answer how to handle this!
But I would like to known if you could do a quick fix: make smaller TRange objects.
Maybe you have a big ancestor? Can you take a look at the instance size of TRange object?
Maybe you better use packed records?

This part:
As a result, if you have 100 items,
you can't directly create the 100th
object without creating all the prior
objects.
sounds a bit like calculating Fibonacci. May be you can reuse some of the TRange objects instead of creating redundant copies? Here is a C++ article describing this approach - it works by storing already calculated intermediate results in a hash map.

Handling billions of objects is possible but you should avoid it as much as possible. Do this only if you absolutely have to...
I did create a system once that needed to be able to handle a huge amount of data. To do so, I made my objects "streamable" so I could read/write them to disk. A larger class around it was used to decide when an object would be saved to disk and thus removed from memory. Basically, when I would call an object, this class would check if it's loaded or not. If not, it would re-create the object again from disk, put it on top of a stack and then move/write the bottom object from this stack to disk. As a result, my stack had a fixed (maximum) size. And it allowed me to use an unlimited amount of objects, with a reasonable good performance too.
Unfortunately, I don't have that code available anymore. I wrote it for a previous employer about 7 years ago. I do know that you would need to write a bit of code for the streaming support plus a bunch more for the stack controller which maintains all those objects. But it technically would allow you to create an unlimited number of objects, since you're trading RAM memory for disk space.

Related

What is the fastest mean to transfer a record in DCOM

I want to transfer some records with the following structure between two Windows PC computer using COM/DCOM. I prefer to transfer an array, say 100 members of TARec, at a time, not each record individually. Currently I am doing this using IStrings. I am looking to improve it using the raw records, to save the time to encode/decode the strings at both ends. Please share your experience.
type
TARec = record
A : TDateTime;
B : WORD;
C : Boolean;
D : Double;
end;
All the record's field type are OLE compatible. Many thanks in advance.
As Rudy suggests in the comments, if your data contains simple value types then a variant byte array can be a very efficient approach and quite simple to implement.
Since you have stated that your data already resides in an array, the basic approach would be:
Create a byte array of the required size to hold all your record data (use VarArrayCreate with type varByte)
Lock the array to obtain a pointer that is safe to use to reference the array contents in memory (VarArrayLock will lock and return a pointer to the array data)
Use CopyMemory to directly copy the data from your array of records to the byte array memory.
Unlock the variant array (VarArrayUnlock) and pass it through your COM/DCOM interface
On the other ('receiving') side you simply reverse the process:
Declare an array of records of the required size
Lock the variant byte array to obtain a pointer to the memory holding the bytes
Copy the byte array data into your record array
Unlock the byte array
This exact approach is something I have used very successfully in a very demanding COM/DCOM scenario (w.r.t efficiency/performance) in the past.
Things to be careful of:
If your data ever changes to include more complex types such as strings or dynamic arrays then additional work will be required to correctly transport these through a byte array.
If your data structure ever changes then the code on both sides of the interface will need to be updated accordingly. One way to protect against this is to incorporate some mechanism for the data to be identified as valid or not by the receiver. This could include a "version number" for example and/or a value (in a 'header' as part of the byte array, in addition to the array data, or passed as a separate parameter entirely - precise details don't really matter). If the receiver finds a version number or size that it is not expecting then it can report this gracefully rather than naively processing the data incorrectly and (most likely) crashing or throwing exceptions as a result.
Alignment/packing issues. Even with the same declaration for the record type, if code is compiled with different alignment settings then the size required for each record in memory could change (which is why a "version number" for the data structure format might not be reliable on its own). One way to avoid this would be to declare the record as packed, though this comes at the cost of a slight reduction in efficiency (and still relies on both sides of the interface agreeing that the data structure is packed).
There are just things to bear in mind however, not prescriptive. Just how complex/robust your implementation needs to be will be determined by your specific case.

About Data from TClientDataSet

I made a function that copies the Data from A TClientDataSet to B.
In production, the code will dynamically fill a TClientDataSet, like the following:
procedure LoadClientDataSet(const StringSql: String; ParamsAsArray: array of Variant; CdsToLoad: TClientDataSet);
begin
//There is a shared TClientDataSet that retrieves
//all data from a TRemoteDataModule in here.
//the following variables (and code) are here only to demonstration of the algorithm;
GlobalMainThreadVariable.SharedClientDataSet.CommandText := StringSql;
//Handle Parameters
GlobalMainThreadVariable.SharedClientDataSet.Open;
CdsToLoad.Data:= GlobalMainThreadVariable.SharedClientDataSet.Data;
GlobalMainThreadVariable.SharedClientDataSet.Close;
end;
That said:
It is safe to do so? (By safe I mean, what kind of exceptions should I expect and how to deal with them?)
How do I release ".Data's" memory?
The data storage residing behind the Data property is reference counted. Therefore you don't have to bother releasing it.
If you want to dive deeper into the intrinsics of TClientDataSet I recommend reading the really excellent book from Cary Jensen: Delphi in Depth: ClientDataSets
By assigning the Data property like you did duplicates the records. You have now two different isntances of TClientDataset with two different set of records with precisely the same structure, same row count and same field values.
It´s safe to do that if the receiving TClientDataset does not have any field structure previously defined or the existing structure is compatible with the Data being assigned. However, if we are talking about a huge number of records, the assignment may take long time and, in an extreme circumstance, it could exaust the computer´s memory (it all the depends on the computer´s configuration).
In order to release the Data, just close the dataset.
If you prefer to have two instances of TClientDataset but one single instance of the records, my suggestion is to use the TClientDataset.CloneCursor method, that instead of copying the Data, just assign a reference to it in a different dataset. In this case it´s the very same Data shared between two different datasets.

How can I avoid running out of memory with a growing TDictionary?

TDictionary<TKey,TValue> uses an internal array that is doubled if it is full:
newCap := Length(FItems) * 2;
if newCap = 0 then
newCap := 4;
Rehash(newCap);
This performs well with medium number of items, but if one gets to the upper limit it is very unfortunate, because it might throw an EOutOfMemory exception even if there is almost half of the memory still available.
Is there any way to influence this behaviour? How do other collection classes deal with this scenario?
You need to understand how a Dictionary works. A dictionary contains a list of "hash buckets" where the items you insert are placed. That's a finite number, so once you fill it up you need to allocate more buckets, there's no way around it. Since the assignment of objects-to-buckets is based on the result of a hash function, you can't simply add buckets to the end of the array and put stuff in there, you need to re-allocate the whole list of blocks, re-hash everything and put it in the (new) corresponding buckets.
Given this behavior, the only way to make the dictionary not re-allocate once full is to make sure it never gets full. If you know the number of items you'll insert in the dictionary pass it as a parameter to the constructor and you'll be done, no more dictionary reallocations.
If you can't do that (you don't know the number of items you'll have in the dictionary) you'll need to reconsider what made you select the TDictionary in the first place and select a data structure that offers better compromise for your particular algorithm. For example you could use binary search trees, as they do the balancing by rotating information in existing nodes, no need for re-allocations ever.

Which is more efficient way to deal with an Array of Records?

What is the more efficient way?
FUserRecords[I].CurrentInput:=FUserRecords[I].CurrentInput+typedWords;
or
var userRec: TUserRec;
...
userRec:=FUserRecords[I];
userRec.CurrentInput:=userRec.CurrentInput+typedWords;
FUserRecords[I]:=userRec;
In the case you have described, the first example will be the most efficient.
Records are value types, so when you do this line:
userRec:=FUserRecords[I];
You are in fact copying the contents of the record in the array into the local variable. And when you do the reverse, you are copying the information again. If you are iterating over the array, this can end up being quite slow if your array and the records are quite large.
If you want to go down the second path, to speed things up you can manipulate the record in the array directly using a pointer to the record, like this:
type
TUserRec = record
CurrentInput: string;
end;
PUserRec = ^TUserRec;
var
LUserRec: PUserRec;
...
LUserRec := #FUserRecords[I];
LUserRec^.CurrentInput := LUserRec^.CurrentInput + typedWords;
(as discussed in the comments, the carets (^) are optional. I only added them here for completeness).
Of course I should point out that you really only need to do this if there is in fact a performance problem. It is always good to profile your application before you hand code these sorts of optimisations.
EDIT: One caveat to all the discussion on the performance issue you are looking at is that if most of your data in the records are strings, then most of any performance lost in the example you have shown will be in the string concatenation, not in the record manipulation.
Also, the strings are basically stored in the record as pointers, so the record will actually be the size of any integers etc, plus the size of the pointers for the strings. So, when you concatenate to the string, it will no increase the record size. You can basically look at the string manipulation part of your code as a separate issue to the record manipulation.
You have mentioned in your comments below, this is for storing input from multiple keyboards, I can't imagine that you will have more than 5-10 items in the array. At that size, doing these steps to optimize the code is not really going to boost the speed that much.
I believe that both your first example and the pointer-using code above will end up with approximately the same performance. You should use which ever is the easiest for you to read and understand.
N#
Faster and more readable:
with FUserRecords[I] do
CurrentInput := CurrentInput + typedWords;

Approaches for caching calculated values

In a Delphi application we are working on we have a big structure of related objects. Some of the properties of these objects have values which are calculated at runtime and I am looking for a way to cache the results for the more intensive calculations. An approach which I use is saving the value in a private member the first time it is calculated. Here's a short example:
unit Unit1;
interface
type
TMyObject = class
private
FObject1, FObject2: TMyOtherObject;
FMyCalculatedValue: Integer;
function GetMyCalculatedValue: Integer;
public
property MyCalculatedValue: Integer read GetMyCalculatedValue;
end;
implementation
function TMyObject.GetMyCalculatedValue: Integer;
begin
if FMyCalculatedValue = 0 then
begin
FMyCalculatedValue :=
FObject1.OtherCalculatedValue + // This is also calculated
FObject2.OtherValue;
end;
Result := FMyCalculatedValue;
end;
end.
It is not uncommon that the objects used for the calculation change and the cached value should be reset and recalculated. So far we addressed this issue by using the observer pattern: objects implement an OnChange event so that others can subscribe, get notified when they change and reset cached values. This approach works but has some downsides:
It takes a lot of memory to manage subscriptions.
It doesn't scale well when a cached value depends on lots of objects (a list for example).
The dependency is not very specific (even if a cache value depends only on one property it will be reset also when other properties change).
Managing subscriptions impacts the overall performance and is hard to maintain (objects are deleted, moved, ...).
It is not clear how to deal with calculations depending on other calculated values.
And finally the question: can you suggest other approaches for implementing cached calculated values?
If you want to avoid the Observer Pattern, you might try to use a hashing approach.
The idea would be that you 'hash' the arguments, and check if this match the 'hash' for which the state is saved. If it does not, then you recompute (and thus save the new hash as key).
I know I make it sound like I just thought about it, but in fact it is used by well-known softwares.
For example, SCons (Makefile alternative) does it to check if the target needs to be re-built preferably to a timestamp approach.
We have used SCons for over a year now, and we never detected any problem of target that was not rebuilt, so their hash works well!
You could store local copies of the external object values which are required. The access routine then compares the local copy with the external value, and only does the recalculation on a change.
Accessing the external objects properties would likewise force a possible re-evaluation of those properties, so the system should keep itself up-to-date automatically, but only re-calculate when it needs to. I don't know if you need to take steps to avoid circular dependencies.
This increases the amount of space you need for each object, but removes the observer pattern. It also defers all calculations until they are needed, instead of performing the calculation every time a source parameter changes. I hope this is relevant for your system.
unit Unit1;
interface
type
TMyObject = class
private
FObject1, FObject2: TMyOtherObject;
FObject1Val, FObject2Val: Integer;
FMyCalculatedValue: Integer;
function GetMyCalculatedValue: Integer;
public
property MyCalculatedValue: Integer read GetMyCalculatedValue;
end;
implementation
function TMyObject.GetMyCalculatedValue: Integer;
begin
if (FObject1.OtherCalculatedValue &LT;&GT; FObjectVal1)
or (FObject2.OtherValue &LT;&GT; FObjectVal2) then
begin
FMyCalculatedValue :=
FObject1.OtherCalculatedValue + // This is also calculated
FObject2.OtherValue;
FObjectVal1 := FObject1.OtherCalculatedValue;
FObjectVal2 := Object2.OtherValue;
end;
Result := FMyCalculatedValue;
end;
end.
In my work I use Bold for Delphi that can manage unlimited complex structures of cached values depending on each other. Usually each variable only holds a small part of the problem. In this framework that is called derived attributes. Derived because the value is not saved in the database, It just depends on on other derived attributes or persistant attributes in the database.
The code behind such attribute is written in Delphi as a procedure or in OCL (Object Constraint Language) in the model. If you write it as Delphi code you have to subscribe to the depending variables. So if attribute C depends on A and B then whenever A or B changes the code for recalc C is called automatically when C is read. So the first time C is read A and B is also read (maybe from the database). As long as A and B is not changed you can read C and got very fast performance. For complex calculations this can save quite a lot of CPU-time.
The downside and bad news is that Bold is not offically supported anymore and you cannot buy it either. I suppose you can get if you ask enough people, but I don't know where you can download it. Around 2005-2006 it was downloadable for free from Borland but not anymore.
It is not ready for D2009 as someone have to port it to Unicode.
Another option is ECO with dot.net from Capable Objects. ECO is a plugin in Visual Studio. It is a supported framwork that have the same idea and author as Bold for Delphi. Many things are also improved, for example databinding is used for the GUI-components. Both Bold and ECO use a model as a central point with classes, attributes and links. Those can be persisted in a database or a xml-file. With the free version of ECO the model can have max 12 classes, but as I remember there is no other limits.
Bold and ECO contains lot more than derived attributes that makes you more productive and allow you to think on the problem instead of technical details of database or in your case how to cache values. You are welcome with more questions about those frameworks!
Edit:
There is actually a download link for Embarcadero registred users for Bold for Delphi for D7, quite old... I know there was updates for D2005, ad D2006.

Resources