What datatype/structure to store file list info? - delphi

I have an application that searches files on the computer (configurable path, type etc). Currently it adds information to a database as soon as a matching file is found. Rather than that I want to hold the information in memory for further manipulation before inserting to database. The list may contain a lot of items. I consider performance as important factor. I may need iterating thru the items, so a structure that can be coded easily is another key issue. and how can I achieve php style associative arrays for this job?

If you're using Delphi 2009, you can use a TDictionary. It takes two generic parameters. The first should be a string, for the filename, and the second would be whatever data type you're associating with. It also has three built-in enumerators, one for key-value pairs, one for keys only and one for values only, which makes iterating easy.

Another solution would be to use just a standard TStringList.
As long as it's sorted and has some duplicate setting other than dupAccept, you can use indexof or indexofname to find items in the list quickly.
It also has the Objects addition which allows you to store object information attached to the name. Starting with D2009, TStringList has the OwnsObject property which allows you to delegate object cleanup to the TStringList. Prior to D2009 you have to handle that yourself.

Much of this will depend on how you are going to use the list and to what scale. If you are going to use it as a stack, or queue, then a TList would work fine. If your needing to search through the list for a specific item then you will need something that allows faster retrieval. TDictionary (2009) or TStringList (pre 2009) would be the most likely choice.
Dynamic arrays are also a possiblity, but if you use them you will want to minimize the use of SetLength as each time it is called it will re-allocate memory. TList manages this for you, which is why I suggested using a TList. if you KNOW how many you will deal with in advance, then use a dynamic array, and set its length on the onset.
If you have more items than will fit in memory then your choices also change. At that point I would either use a database table, or a tFileStream to store the records to be processed, then seek to the beginning of the table/stream for processing.

Try using the AVL-Tree by http://sourceforge.net/projects/alcinoe/ as your associative Array. It has an iterate-method for fast iteration. You may need to derive from his baseclass and implement your own comparator, but it's easy to use.
Examples are included.

Related

Delphi object without Create? [duplicate]

In Delphi sane people use a class to define objects.
In Turbo Pascal for Windows we used object and today you can still use object to create an object.
The difference is that a object lives on the stack and a class lives on the heap.
And of course the object is depreciated.
Putting all that aside:
is there a benefit to be had, speed wise by using object instead of class?
I know that object is broken in Delphi 2009, but I've got a special use case1) where speed matters and I'm trying to find if using object will make my thing faster without making it buggy
This code base is in Delphi 7, but I may port it to Delphi 2007, haven't decided yet.
1) Conway's game of life
Long comment
Thanks all for pointing me in the right direction.
Let me explain a bit more. I'm trying to do a faster implementation of hashlife, see also here or here for simple sourcecode
The current record holder is golly, but golly uses a straight translation of Bill Gospher original lisp code (which is brilliant as an algorithm, but not optimized at the micro level at all). Hashlife enables you to calculate a generation in O(log(n)) time.
It does this by using a space/time trade off. And for this reason hashlife needs a lot of memory, gigabytes are not unheard of. In return you can calculate generation 2^128 (340282366920938463463374607431770000000) using generation 2^127 (170141183460469231731687303715880000000) in o(1) time.
Because hashlife needs to compute hashes for all sub-patterns that occur in a larger pattern, allocation of objects needs to be fast.
Here's the solution I've settled upon:
Allocation optimization
I allocate one big block of physical memory (user settable) lets say 512MB. Inside this blob I allocate what I call cheese stacks. This is a normal stack where I push and pop, but a pop can also be from the middle of the stack. If that happens I mark it on the free list (this is a normal stack). When pushing I check the free list first if nothing is free I push as normal. I'll be using records as advised it looks like the solution with the least amount of overhead.
Because of the way hashlife works, very little popping takes place and a lot of pushes. I keep separate stacks for structures of different sizes, making sure to keep memory access aligned on 4/8/16 byte boundaries.
Other optimizations
recursion removal
cache optimization
use of inline
precalculation of hashes (akin to rainbow tables)
detection of pathological cases and use of fall-back algorithm
use of GPU
For using normal OOP programming, you should always use the class kind. You'll have the most powerful object model in Delphi, including interface and generics (in later Delphi versions).
1. Records, pointers and objects
Records can be evil (slow hidden copy if you forgot to declare a parameter as const, record hidden slow cleanup code, a fillchar would make any string in record become a memory leak...), but they are sometimes very convenient to access a binary structure (e.g. some "smallish value"), via a pointer.
A dynamic array of tiny records (e.g. with one integer and one double field) will be much faster than a TList of small classes; with our TDynArray wrapper, you will have high-level access to the records, with serialization, sorting, hashing and such.
If using pointers, you must know what you are doing. It's definitively more preferable to stick with classes, and TPersistent if you want to use the magical "VCL component ownership model".
Inheritance is not allowed for records. You'll need either to use a "variant record" (using the case keyword in its type definition), either use nested records. When using C-like API, you'll sometimes have to use object-oriented structures. Using nested records or variant records is IMHO much less clear than the good old "object" inheritance model.
2. When to use object
But there are some places where objects are a good way of accessing already existing data.
Even the object model is better than the new record model, because it handles simple inheritance.
In a Blog entry last summer, I posted some possibilities to still use objects:
A memory mapped file, which I want to parse very quickly: a pointer to such an object is just great, and you still have methods at hand; I use this for TFileHeader or TFileInfo which map the .zip header, in SynZip.pas;
A Win32 structure, as defined by a API call, in which I put handy methods for easy access to the data (for that you may use record but if there is some object orientation in the struct - which is very common - you'll have to nest records, which is not the very handy);
A temporary structure defined on the stack, just used during a procedure: I use this for TZStream in SynZip.pas, or for our RTTI related classes, which map the Delphi generated RTTI in an Object-Oriented way not as the TypeInfo which is function/procedure oriented. By mapping the RTTI memory content directly, our code is faster than using the new RTTI classes created on the heap. We don't instanciate any memory, which, for an ORM framework like ours, is good for its speed. We need a lot of RTTI info, but we need it quick, we need it directly.
3. How object implementation is broken in modern Delphi
The fact that object is broken in modern Delphi is a shame, IMHO.
Normally, if you define a record on the stack, containing some reference-counted variables (like a string), it will be initialized by some compiler magic code, at the begin level of the method/function:
type TObj = object Int: integer; Str: string; end;
procedure Test;
var O: TObj
begin // here, an _InitializeRecord(#O,TypeInfo(TObj)) call is made
O.Str := 'test';
(...)
end; // here, a _FinalizeRecord(#O,TypeInfo(TObj)) call is made
Those _InitializeRecord and _FinalizeRecord will "prepare" then "release" the O.Str variable.
With Delphi 2010, I found out that sometimes, this _InitializeRecord() was not always made.
If the record has only some no public fields, the hidden calls are sometimes not generated by the compiler.
Just build the source again, and there will be...
The only solution I found out was using the record keyword instead of object.
So here is how the resulting code looks like:
/// used to store and retrieve Words in a sorted array
// - is defined either as an object either as a record, due to a bug
// in Delphi 2010 compiler (at least): this structure is not initialized
// if defined as a record on the stack, but will be as an object
TSortedWordArray = {$ifdef UNICODE}record{$else}object{$endif}
public
Values: TWordDynArray;
Count: integer;
/// add a value into the sorted array
// - return the index of the new inserted value into the Values[] array
// - return -(foundindex+1) if this value is already in the Values[] array
function Add(aValue: Word): PtrInt;
/// return the index if the supplied value in the Values[] array
// - return -1 if not found
function IndexOf(aValue: Word): PtrInt; {$ifdef HASINLINE}inline;{$endif}
end;
The {$ifdef UNICODE}record{$else}object{$endif} is awful... but the code generation error didn't occur since..
The resulting modifications in the source code are not huge, but a bit disappointing. I found out that older version of the IDE (e.g. Delphi 6/7) are not able to parse such declaration, so the class hierarchy will be broken in the editor... :(
Backward compatibility should include regression tests. A lot of Delphi users stay to this product because of the existing code. Breaking features are very problematic for the Delphi future, IMHO: if you have to rewrite a lot of code, which shouldn't you just switch the project to C# or Java?
Object was not the Delphi 1 method of setting up objects; it was the short-lived Turbo Pascal method of setting up objects, which was replaced by the Delphi TObject model in Delphi 1. It was kept around for backwards compatibility, but it should be avoided for a few reasons:
As you noted, it's broken in more recent versions. And AFAIK there are no plans to fix it.
It's a conceptualy wrong object model. The entire point of Object Oriented Programming, the one thing that really distinguishes it from procedural programming, is Liskov substitution (inheritance and polymorphism), and inheritance and value types don't mix.
You lose support for a lot of features that require TObject descendants.
If you really need value types that don't need to be dynamically allocated and initialized, you can use records instead. You can't inherit from them, but you can't do that very well with object either so you're not losing anything here.
As for the rest of the question, there aren't all that many speed benefits. The TObject model is plenty fast, especially if you're using the FastMM memory manager to speed up creation and destruction of objects, and if your objects contain lots of fields they can even be faster than records in a lot of cases, because they're passed by reference and don't have to be copied around for each function call.
When given a choice between "fast and possibly broken" and "fast and correct," always choose the latter.
Old-style objects offer no speed incentive over plain old records, so wherever you might be tempted to use old-style objects, you can use records instead without the risk of having uninitialized compiler-managed types or broken virtual methods. If your version of Delphi doesn't support records with methods, then just use standalone procedures instead.
Way back in older versions of Delphi which did not support records with methods then using object was the way to get your objects allocated on the stack. Very occasionally that would yield worthwhile performance benefits. Nowadays record is better. The only feature missing from record is the ability to inherit from another record.
You give up a lot when you change from class to record so only consider it if the performance benefits are overwhelming.

String lifetime management, in records

I am working on getting rid of shortstring.
One of the many places shortstring is currently used within our programs is in records.
Alot of these records are kept in AVL trees.
The AVL tree used is a generic one, holding a pointer to a number of bytes (ElemSize), which have worked well so far.
The memory for each record in the AVL tree is allocated with GetMem, and copied with Move.
However, with string being a pointer to a reference-counted structure, copying back the memory to a record no longer works, as the sting referenced is often freed (automatically by reference count).
With only a pointer and a size of the "data block", I assume it is not possible to have the reference count of the strings increased.
I'm looking for a way to get the reference count of the stings to be taken into account when storing the record in a AVL tree.
Can I pass the record type to the tree constructor, then cast the pointer to this type and thus get the references increased? Or a similar fix, where I can isolate the changes to primarily be in the AVL unit and calls to it's constructor.
Current code for allocation of space to store the record in AVL; XData is a pointer to the record to be stored:
New(RootPtr); { create new memory space }
GetMem(RootPtr^.TreeData, ElemSize);
WITH RootPtr^ DO BEGIN
{ copy data }
Move(XData^, RootPtr^.TreeData^, ElemSize);
In essence the question you are asking is:
How can I allocate, copy and deallocate a record when all I know about its type is its size?
The simple answer is that you can use GetMem, Move and FreeMem provided that the record does not contain managed types. You wish to work with records that contain Delphi strings, which are managed. And so your current approach using GetMem and Move does not suffice.
There are plenty of ways to solve this. You could write your own code to do reference counting, so long as you knew where in the record the managed types were. I don't recommend this. You could make your user data be a class and use polymorphism to help.
The option I'd like to discuss continues to support records and indeed allows the user to choose whatever type they like. The reasoning is as follows:
If the type contains managed types, then operating on it requires knowledge of the type. If the tree is to be generic, then it cannot have that knowledge. Ergo, the knowledge must be supplied by the user of the tree.
This leads you to events. Let the tree offer events that the user can supply handlers for. The types would look like this:
type
PTreeNodeUserData = type Pointer;
TTreeNodeCreateUserDataEvent = function: PTreeNodeUserData of object;
TTreeNodeDestroyUserDataEvent = procedure(Data: PTreeNodeUserData) of object;
TTreeNodeCopyUserDataEvent = procedure(Source, Dest: PTreeNodeUserData) of object;
Then you can arrange for your tree to publish events with these types that the user can subscribe to.
The point being that this allows the user of the tree to supply the missing knowledge about the user data type.
One of the main benefits of using records is the simplicity with which they can be copied (without using Move). So your best solution is to simply replace Move with a normal assignment operator :=. This will correctly consider the reference counts for all managed types involved.
Is there a particular reason you're not using the normal assignment operator?
PS: You need to ensure that the memory for all managed types (including long strings) is correctly initialised and finalised. I suggest you do some additional reading on the Initialize and Finalize routines.
The tree is general, it can hold a given lump of data. I hoped I could extend the functionality without making a new tree class per record.
In that case you need your "copy behaviour" to be variable depending on what it's working with. As couple of options:
If your tree is wrapped in a class you can easily modify it to use a callback event to perform the copy operation. (This option might be easiest even if you first have to work on encapsulating the tree in a class.)
Modify your nodes and/or data to be objects with polymorphic copy functionality. Then each subtype will know how to copy itself correctly, and you can write something along the lines of Root.TreeData := XData.CreateCopy;
If you are working at such a low level, and don't want compiler to help you, then you need to use PChar-strings instead of regular strings.

Delphi TStringList wrapper to implement on-the-fly compression

I have an application for storing many strings in a TStringList. The strings will be largely similar to one another and it occurs to me that one could compress them on the fly - i.e. store a given string in terms of a mixture of unique text fragments plus references to previously stored fragments. StringLists such as lists of fully-qualified path and filenames should be able to be compressed greatly.
Does anyone know of a TStringlist descendant that implement this - i.e. provides read and write access to the uncompressed strings but stores them internally compressed, so that a TStringList.SaveToFile produces a compressed file?
While you could implement this by uncompressing the entire stringlist before each access and re-compressing it afterwards, it would be unnecessarily slow. I'm after something that is efficient for incremental operations and random "seeks" and reads.
TIA
Ross
I don't think there's any freely available implementation around for this (not that I know of anyway, although I've written at least 3 similar constructs in commercial code), so you'd have to roll your own.
The remark Marcelo made about adding items in order is very relevant, as I suppose you'll probably want to compress the data at addition time - having quick access to entries already similar to the one being added, gives a much better performance than having to look up a 'best fit entry' (needed for similarity-compression) over the entire set.
Another thing you might want to read up about, are 'ropes' - a conceptually different type than strings, which I already suggested to Marco Cantu a while back. At the cost of a next-pointer per 'twine' (for lack of a better word) you can concatenate parts of a string without keeping any duplicate data around. The main problem is how to retrieve the parts that can be combined into a new 'rope', representing your original string. Once that problem is solved, you can reconstruct the data as a string at any time, while still having compact storage.
If you don't want to go the 'rope' route, you could also try something called 'prefix reduction', which is a simple form of compression - just start out each string with an index of a previous string and the number of characters that should be treated as a prefix for the new string. Be aware that you should not recurse this too far back, or access-speed will suffer greatly. In one simple implementation, I did a mod 16 on the index, to establish the entry at which prefix-reduction started, which gave me on average about 40% memory savings (this number is completely data-dependant of course).
You could try to wrap a Delphi or COM API around Judy arrays. The JudySL type would do the trick, and has a fairly simple interface.
EDIT: I assume you are storing unique strings and want to (or are happy to) store them in lexicographical order. If these constraints aren't acceptable, then Judy arrays are not for you. Mind you, any compression system will suffer if you don't sort your strings.
I suppose you expect general flexibility from the list (including delete operation), in this case I don't know about any out of the box solution, but I'd suggest one of the two approaches:
You split your string into words and
keep separated growning dictionary
to reference the words and save list of indexes internally
You implement something related to
zlib stream available in Delphi, but operating by the block that
for example can contains 10-100
strings. In this case you still have
to recompress/compress the complete
block, but the "price" you pay is lower.
I dont think you really want to compress TStrings items in memory, because it terribly ineffecient. I suggest you to look at TStream implementation in Zlib unit. Just wrap regular stream into TDecompressionStream on load and TCompressionStream on save (you can even emit gzip header there).
Hint: you will want to override LoadFromStream/SaveToStream instead of LoadFromFile/SaveToFile

Is it good practice to use a Dynamic Array in an object field?

I am refactoring some existing Delphi code into a class.
The current code uses a global variable defined as a dynamic array array of byte. At initialization time the code figures out the size of the array and uses SetLength to allocate it. It is convenient both as the buffer to obtain the data and as the runtime container for a later processing.
I want to move this variable as one of the object attributes.
But I am not sure if it is ok to maintain its type. Is it considered good practice?
The alternative I am considering is to tranform it to a dynamic container like a TList. I will keep the very same code for obtaining the data, with a local dynamic array but moving it to the container for the rest of its lifespan. Is it worth the effort? I know that elegance always pay off at the end, but I don't really see the value of the effort at this moment. Any thoughts?
Dynamic arrays are great, but really only for fixed dimentions. If they have to grow, especially in single record increments, this can cause eventual errors from the memory manager (and possible performance issues) since the array has to be reallocated and copied to the new bigger destination. TList does at least have a 'growing' mechanism that is called less frequently.
I know that elegance always pay off at the end,
Is that so? Note that changing working code always includes the risk to break something. It must IMHO be decided in every situation if the gained elegance is worth the risk.
In your case, if you add and remove items during runtime, I would use a TList since it is much easier for these operations. If you just initialize the length once and the arrays is constant after initialization you can just keep the dynamic array. There's definitely no "good practice" saying that you shouldn't use dynamic arrays.

TStringList vs. TList<string>

what is the difference in using a standard
type
sl: TStringList
compared to using a generic TList
type
sl: TList<string>
?
As far as I can see, both behave exactly the same.
Is it just another way of doing the same thing?
Are there situations where one would be better than the other?
Thanks!
TStringList is a descendant of TStrings.
TStringList knows how to sort itself alphabetically.
TStringList has an Objects property.
TStringList doesn't make your code incompatible with all previous versions of Delphi.
TStringList can be used as a published property. (A bug prevents generic classes from being published, for now.)
TStringList has been around a long time in Delphi before generics were around. Therefore, it has built up a handful of useful features that a generic list of strings would not have.
The generics version is just creating a new type that is identical to TList that works on the type of String. (.Add(), .Insert(), .Remove(), .Clear(), etc.)
TStringList has the basic TList type methods and other methods custom to working with strings, such as .SaveToFile() and .LoadFromFile()
If you want backwards compatibility, then TStringList is definitely the way to go.
If you want enhanced functionality for working with a list of Strings, then TStringList is the way to go.
If you have some basic coding fundamentals that you want to work with a list of any type, then perhaps you need to look away from TStringList.
As TStringList is a descendant of TStrings it is compatible with the Lines property of TMemo, Items of TListbox and TComboBox and other VCL components.
So can use
cbList.Items := StringList; // internally calls TStrings.Assign
I'd probably say if you want backwards compatibility use TStringList, and if you want forward compatibility (perhaps the option to change that list of strings to say list of Int64s in the future) then go for TList.
From memory point of view TStringList memory usage is increased with the size of TObject pointer added to each item. TList memory usage is increased with the size of pointer added to each item. If is needed just an array of strings without searching, replacing, sorting or associative operations, a dynamic array (array of string) should be just enough. This lacks of a good memory management of TStringList or TList, but in theory should use less memory.
The TStringlist is one very versatile class of Delphi. I used (and abused ;-) ) its Objects property many times. It's very interesting to quickly translate a delimited string to a control like a TMemo and similar ones (TListBox, TComboBox, just to list a few).
I just don't like much TList, as TStringList satisfied my needs without needing of treating pointers (as Tlist is a list of Pointer values).
EDIT: I confused the TList(list of pointers) with TList (generic list of strings). Sorry for that. My point stands: TStringList is just a lot more than a simple list of strings.
For most purposes that TStringList has been abused in the past, TObjectDictionary is better - it's faster and doesn't need sorting.
If you need a TStrings object (generally for UI stuff, since the VCL doesn't use generics much even for XE5) use TStringList - the required casting from TObject is annoying but not a showstopper.
TStringList has been used for far too long and has many advantages, all mentioned by Rob Kennedy.
The only real disadvantage of using it as a pair of a string and an object is the necessity of casting object to the actual type expected and stored in this list (when reading) and as far as I know Embarcadero did not provide Delphi 2009 and up VCL libraries with generic version of TStringList.
To overcome this limitation I implemented such list for internal use and for almost 3 years it serves it's purpose so I decided to share it today: https://github.com/t00/deltoo#tgenericstringlist
One important note - it changes the default property from Strings to Objects as in most cases when object is stored in a list it is also the mostly accessed property of it.

Resources