How can I avoid running out of memory with a growing TDictionary? - delphi

TDictionary<TKey,TValue> uses an internal array that is doubled if it is full:
newCap := Length(FItems) * 2;
if newCap = 0 then
newCap := 4;
Rehash(newCap);
This performs well with medium number of items, but if one gets to the upper limit it is very unfortunate, because it might throw an EOutOfMemory exception even if there is almost half of the memory still available.
Is there any way to influence this behaviour? How do other collection classes deal with this scenario?

You need to understand how a Dictionary works. A dictionary contains a list of "hash buckets" where the items you insert are placed. That's a finite number, so once you fill it up you need to allocate more buckets, there's no way around it. Since the assignment of objects-to-buckets is based on the result of a hash function, you can't simply add buckets to the end of the array and put stuff in there, you need to re-allocate the whole list of blocks, re-hash everything and put it in the (new) corresponding buckets.
Given this behavior, the only way to make the dictionary not re-allocate once full is to make sure it never gets full. If you know the number of items you'll insert in the dictionary pass it as a parameter to the constructor and you'll be done, no more dictionary reallocations.
If you can't do that (you don't know the number of items you'll have in the dictionary) you'll need to reconsider what made you select the TDictionary in the first place and select a data structure that offers better compromise for your particular algorithm. For example you could use binary search trees, as they do the balancing by rotating information in existing nodes, no need for re-allocations ever.

Related

Array of references that share an Arc

This one's kind of an open ended design question I'm afraid.
Anyway: I have a big two-dimensional array of stuff. This array is mutable, and is accessed by a bunch of threads. For now I've just been dealing with this as a Arc<Mutex<Vec<Vec<--owned stuff-->>>>, which has been fine.
The problem is that stuff is about to grow considerably in size, and I'll want to start holding references rather than complete structures. I could do this by inverting everything and going to Vec<Vec<Arc<Mutex>>, but I feel like that would be a ton of overhead, especially because each thread would need a complete copy of the grid rather than a single Arc/Mutex.
What I want to do is have this be an array of references, but somehow communicate that the items being referenced all live long enough according to a single top-level Arc or something similar. Is that possible?
As an aside, is Vec even the correct data type for this? For the grid in particular I really want a large, fixed-size block of memory that will live for the entire length of the program once it's initialized, and has a lot of reference locality (along either dimension.) Is there something else/more specialized I should be using?
EDIT:Giving some more specifics on my code (away from home so this is rough):
What I want:
Outer scope initializes a bunch of Ts and somehow collectively ensures they live long enough (that's the hard part)
Outer scope initializes a grid :Something<Vec<Vec<&T>>> that stores references to the Ts
Outer scope creates a bunch of threads and passes grid to them
Threads dive in and out of some sort of (problable RW) lock on grid, reading the Tsand changing the &Ts in the process.
What I have:
Outer thread creates a grid: Arc<RwLock<Vector<Vector<T>>>>
Arc::clone(& grid)s are passed to individual threads
Read-heavy threads mostly share the lock and sometimes kick each other out for the writes.
The only problem with this is that the grid is storing actual Ts which might be problematically large. (Don't worry too much about the RwLock/thread exclusivity stuff, I think it's perpendicular to the question unless something about it jumps out at you.)
What I don't want to do:
Top level creates a bunch of Arc<Mutex<T>> for individual T
Top level creates a `grid : Vec<Vec<Arc<Mutex>>> and passes it to threads
The problem with that is that I worry about the size of Arc/Mutex on every grid element (I've been going up to 2000x2000 so far and may go larger). Also while the threads would lock each other out less (only if they're actually looking at the same square), they'd have to pick up and drop locks way more as they explore the array, and I think that would be worse than my current RwLock implementation.
Let me start of by your "aside" question, as I feel it's the one that can be answered:
As an aside, is Vec even the correct data type for this? For the grid in particular I really want a large, fixed-size block of memory that will live for the entire length of the program once it's initialized, and has a lot of reference locality (along either dimension.) Is there something else/more specialized I should be using?
The documenation of std::vec::Vec specifies that the layout is essentially a pointer with size information. That means that any Vec<Vec<T>> is a pointer to a densely packed array of pointers to densely packed arrays of Ts. So if block of memory means a contiguous block to you, then no, Vec<Vec<T>> cannot give that you. If that is part of your requirements, you'd have to deal with a datatype (let's call it Grid) that is basically a (pointer, n_rows, n_columns) and define for yourself if the layout should be row-first or column-first.
The next part is that if you want different threads to mutate e.g. columns/rows of your grid at the same time, Arc<Mutex<Grid>> won't cut it, but you already figured that out. You should get clarity whether you can split your problem such that each thread can only operate on rows OR columns. Remember that if any thread holds a &mut Row, no other thread must hold a &mut Column: There will be an overlapping element, and it will be very easy for you to create a data races. If you can assign a static range of of rows to a thread (e.g. thread 1 processes rows 1-3, thread 2 processes row 3-6, etc.), that should make your life considerably easier. To get into "row-wise" processing if it doesn't arise naturally from the problem, you might consider breaking it into e.g. a row-wise step, where all threads operate on rows only, and then a column-wise step, possibly repeating those.
Speculative starting point
I would suggest that your main thread holds the Grid struct which will almost inevitably be implemented with some unsafe methods, e.g. get_row(usize), get_row_mut(usize) if you can split your problem into rows/colmns or get(usize, usize) and get(usize, usize) if you can't. I cannot tell you what exactly these should return, but they might even be custom references to Grid, which:
can only be obtained when the usual borrowing rules are fulfilled (e.g. by blocking the thread until any other GridRefMut is dropped)
implement Drop such that you don't create a deadlock
Every thread holds a Arc<Grid>, and can draw cells/rows/columns for reading/mutating out of the grid as needed, while the grid itself keeps book of references being created and dropped.
The downside of this approach is that you basically implement a runtime borrow-checker yourself. It's tedious and probably error-prone. You should browse crates.io before you do that, but your problem sounds specific enough that you might not find a fitting solution, let alone one that's sufficiently documented.

Swift 3 and Index of a custom linked list collection type

In Swift 3 Collection indices have to conform to Comparable instead of Equatable.
Full story can be read here swift-evolution/0065.
Here's a relevant quote:
Usually an index can be represented with one or two Ints that
efficiently encode the path to the element from the root of a data
structure. Since one is free to choose the encoding of the “path”, we
think it is possible to choose it in such a way that indices are
cheaply comparable. That has been the case for all of the indices
required to implement the standard library, and a few others we
investigated while researching this change.
In my implementation of a custom linked list collection a node (pointing to a successor) is the opaque index type. However, given two instances, it is not possible to tell if one precedes another without risking traversal of a significant part of the chain.
I'm curious, how would you implement Comparable for a linked list index with O(1) complexity?
The only idea that I currently have is to somehow count steps while advancing the index, storing it within the index type as a property and then comparing those values.
Serious downside of this solution is that indices must be invalidated when mutating the collection. And while that seems reasonable for arrays, I do not want to break that huge benefit linked lists have - they do not invalidate indices of unchanged nodes.
EDIT:
It can be done at the cost of two additional integers as collection properties assuming that single linked list implements front insert, front remove and back append. Any meddling around in the middle would anyway break O(1) complexity requirement.
Here's my take on it.
a) I introduced one private integer type property to my custom Index type: depth.
b) I introduced two private integer type properties to the collection: startDepth and endDepth, which both default to zero for an empty list.
Each front insert decrements the startDepth.
Each front remove increments the startDepth.
Each back append increments the endDepth.
Thus all indices startIndex..<endIndex have a reflecting integer range startDepth..<endDepth.
c) Whenever collection vends an index either by startIndex or endIndex it will inherit its corresponding depth value from the collection. When collection is asked to advance the index by invoking index(_ after:) I will simply initialize a new Index instance with incremented depth value (depth += 1).
Conforming to Comparable boils down to comparing left-hand side depth value to the right-hand side one.
Note that because I expand the integer range from both sides as well, all the depth values for the middle indices remain unchanged (thus are not invalidated).
Conclusion:
Traded benefit of O(1) index comparisons at the cost of minor increase in memory footprint and few integer increments and decrements. I expect index lifetime to be short and number of collections relatively small.
If anyone has a better solution I'd gladly take a look at it!
I may have another solution. If you use floats instead of integers, you can gain kind of O(1) insertion-in-the-middle performance if you set the sortIndex of the inserted node to a value between the predecessor and the successor's sortIndex. This would require to store (and update) the predecessor's sortIndex on your nodes (I imagine this should not be to hard since it is only changed on insertion or removal and it can always be propagated 'up').
In your index(after:) method you need to query the successor node, but since you use your node as index, that is be straightforward.
One caveat is the finite precision of floating points, so if on insertion you the distance between the two sort indices are two small, you need to reindex at least part of the list. Since you said you only expect small scale, I would just go through the hole list and use the position for that.
This approach has all the benefits of your own, with the added benefit of good performance on insertion in the middle.

Why do we use data structures? (when no dynamic allocation is needed)

I'm pretty sure this is a silly newbie question but I didn't know it so I had to ask...
Why do we use data structures, like Linked List, Binary Search Tree, etc? (when no dynamic allocation is needed)
I mean: wouldn't it be faster if we kept a single variable for a single object? Wouldn't that speed up access time? Eg: BST possibly has to run through some pointers first before it gets to the actual data.
Except for when dynamic allocation is needed, is there a reason to use them?
Eg: using linked list/ BST / std::vector in a situation where a simple (non-dynamic) array could be used.
Each thing you are storing is being kept in it's own variable (or storage location). Data structures apply organization to your data. Imagine if you had 10,000 things you were trying to track. You could store them in 10,000 separate variables. If you did that, then you'd always be limited to 10,000 different things. If you wanted more, you'd have to modify your program and recompile it each time you wanted to increase the number. You might also have to modify the code to change the way in which the calculations are done if the order of the items changes because the new one is introduced in the middle.
Using data structures, from simple arrays to more complex trees, hash tables, or custom data structures, allows your code to both be more organized and extensible. Using an array, which can either be created to hold the required number of elements or extended to hold more after it's first created keeps you from having to rewrite your code each time the number of data items changes. Using an appropriate data structure allows you to design algorithms based on the relationships between the data elements rather than some fixed ordering, giving you more flexibility.
A simple analogy might help to understand. You could, for example, organize all of your important papers by putting each of them into separate filing cabinet. If you did that you'd have to memorize (i.e., hard-code) the cabinet in which each item can be found in order to use them effectively. Alternatively, you could store each in the same filing cabinet (like a generic array). This is better in that they're all in one place, but still not optimum, since you have to search through them all each time you want to find one. Better yet would be to organize them by subject, putting like subjects in the same file folder (separate arrays, different structures). That way you can look for the file folder for the correct subject, then find the item you're looking for in it. Depending on your needs you can use different filing methods (data structures/algorithms) to better organize your information for it's intended use.
I'll also note that there are times when it does make sense to use individual variables for each data item you are using. Frequently there is a mixture of individual variables and more complex structures, using the appropriate method depending on the use of the particular item. For example, you might store the sum of a collection of integers in a variable while the integers themselves are stored in an array. A program would need to be pretty simple though before the introduction of data structures wouldn't be appropriate.
Sorry, but you didn't just find a great new way of doing things ;) There are several huge problems with this approach.
How could this be done without requring programmers to massively (and nontrivially) rewrite tons of code as soon as the number of allowed items changes? Even when you have to fix your data structure sizes at compile time (e.g. arrays in C), you can use a constant. Then, changing a single constant and recompiling is sufficent for changes to that size (if the code was written with this in mind). With your approach, we'd have to type hundreds or even thousands of lines every time some size changes. Not to mention that all this code would be incredibly hard to read, write, maintain and verify. The old truism "more lines of code = more space for bugs" is taken up to eleven in such a setting.
Then there's the fact that the number is almost never set in stone. Even when it is a compile time constant, changes are still likely. Writing hundreds of lines of code for a minor (if it exists at all) performance gain is hardly ever worth it. This goes thrice if you'd have to do the same amount of work again every time you want to change something. Not to mention that it isn't possible at all once there is any remotely dynamic component in the size of the data structures. That is to say, it's very rarely possible.
Also consider the concept of implicit and succinct data structures. If you use a set of hard-coded variables instead of abstracting over the size, you still got a data structure. You merely made it implicit, unrolled the algorithms operating on it, and set its size in stone. Philosophically, you changed nothing.
But surely it has a performance benefit? Well, possible, although it will be tiny. But it isn't guaranteed to be there. You'd save some space on data, but code size would explode. And as everyone informed about inlining should know, small code sizes are very useful for performance to allow the code to be in the cache. Also, argument passing would result in excessive copying unless you'd figure out a trick to derive the location of most variables from a few pointers. Needless to say, this would be nonportable, very tricky to get right even on a single platform, and liable to being broken by any change to the code or the compiler invocation.
Finally, note that a weaker form is sometimes done. The Wikipedia page on implicit and succinct data structures has some examples. On a smaller scale, some data structures store much data in one place, such that it can be accessed with less pointer chasing and is more likely to be in the cache (e.g. cache-aware and cache-oblivious data structures). It's just not viable for 99% of all code and taking it to the extreme adds only a tiny, if any, benefit.
The main benefit to datastructures, in my opinion, is that you are relationally grouping them. For instance, instead of having 10 separate variables of class MyClass, you can have a datastructure that groups them all. This grouping allows for certain operations to be performed because they are structured together.
Not to mention, having datastructures can potentially enforce type security, which is powerful and necessary in many cases.
And last but not least, what would you rather do?
string string1 = "string1";
string string2 = "string2";
string string3 = "string3";
string string4 = "string4";
string string5 = "string5";
Console.WriteLine(string1);
Console.WriteLine(string2);
Console.WriteLine(string3);
Console.WriteLine(string4);
Console.WriteLine(string5);
Or...
List<string> myStringList = new List<string>() { "string1", "string2", "string3", "string4", "string5" };
foreach (string s in myStringList)
Console.WriteLine(s);

How to handle billions of objects without "Outofmemory" error

I have an application which may needs to process billions of objects.Each object of is of TRange class type. These ranges are created at different parts of an algorithm which depends on certain conditions and other object properties. As a result, if you have 100 items, you can't directly create the 100th object without creating all the prior objects. If I create all the (billions of) objects and add to the collection, the system will throw Outofmemory error. Now I want to iterate through each object mainly for two purposes:
To apply an operation for each TRange object(eg:Output certain properties)
To get a cumulative sum of a certain property.(eg: Each range has a weight property and I want to retreive totalweight that is a sum of all the range weights).
How do I effectively create an Iterator for these object without raising Outofmemory?
I have handled the first case by passing a function pointer to the algorithm function. For eg:
procedure createRanges(aProc: TRangeProc);//aProc is a pointer to function that takes a //TRange
var range: TRange;
rangerec: TRangeRec;
begin
range:=TRange.Create;
try
while canCreateRange do begin//certain conditions needed to create a range
rangerec := ReturnRangeRec;
range.Update(rangerec);//don't create new, use the same object.
if Assigned(aProc) then aProc(range);
end;
finally
range.Free;
end;
end;
But the problem with this approach is that to add a new functionality, say to retrieve the Total weight I have mentioned earlier, either I have to duplicate the algorithm function or pass an optional out parameter. Please suggest some ideas.
Thank you all in advance
Pradeep
For such large ammounts of data you need to only have a portion of the data in memory. The other data should be serialized to the hard drive. I tackled such a problem like this:
I Created an extended storage that can store a custom record either in memory or on the hard drive. This storage has a maximum number of records that can live simultaniously in memory.
Then I Derived the record classes out of the custom record class. These classes know how to store and load themselves from the hard drive (I use streams).
Everytime you need a new or already existing record you ask the extended storage for such a record. If the maximum number of objects is exceeded, the storage streams some of the least used record back to the hard drive.
This way the records are transparent. You always access them as if they are in memory, but they may get loaded from hard drive first. It works really well. By the way RAM works in a very similar way so it only holds a certain subset of all you data on your hard drive. This is your working set then.
I did not post any code because it is beyond the scope of the question itself and would only confuse.
Look at TgsStream64. This class can handle a huge amounts of data through file mapping.
http://code.google.com/p/gedemin/source/browse/trunk/Gedemin/Common/gsMMFStream.pas
But the problem with this approach is that to add a new functionality, say to retrieve the Total weight I have mentioned earlier, either I have to duplicate the algorithm function or pass an optional out parameter.
It's usually done like this: you write a enumerator function (like you did) which receives a callback function pointer (you did that too) and an untyped pointer ("Data: pointer"). You define a callback function to have first parameter be the same untyped pointer:
TRangeProc = procedure(Data: pointer; range: TRange);
procedure enumRanges(aProc: TRangeProc; Data: pointer);
begin
{for each range}
aProc(range, Data);
end;
Then if you want to, say, sum all ranges, you do it like this:
TSumRecord = record
Sum: int64;
end;
PSumRecord = ^TSumRecord;
procedure SumProc(SumRecord: PSumRecord; range: TRange);
begin
SumRecord.Sum := SumRecord.Sum + range.Value;
end;
function SumRanges(): int64;
var SumRec: TSumRecord;
begin
SumRec.Sum := 0;
enumRanges(TRangeProc(SumProc), #SumRec);
Result := SumRec.Sum;
end;
Anyway, if you need to create billions of ANYTHING you're probably doing it wrong (unless you're a scientist, modelling something extremely large scale and detailed). Even more so if you need to create billions of stuff every time you want one of those. This is never good. Try to think of alternative solutions.
"Runner" has a good answer how to handle this!
But I would like to known if you could do a quick fix: make smaller TRange objects.
Maybe you have a big ancestor? Can you take a look at the instance size of TRange object?
Maybe you better use packed records?
This part:
As a result, if you have 100 items,
you can't directly create the 100th
object without creating all the prior
objects.
sounds a bit like calculating Fibonacci. May be you can reuse some of the TRange objects instead of creating redundant copies? Here is a C++ article describing this approach - it works by storing already calculated intermediate results in a hash map.
Handling billions of objects is possible but you should avoid it as much as possible. Do this only if you absolutely have to...
I did create a system once that needed to be able to handle a huge amount of data. To do so, I made my objects "streamable" so I could read/write them to disk. A larger class around it was used to decide when an object would be saved to disk and thus removed from memory. Basically, when I would call an object, this class would check if it's loaded or not. If not, it would re-create the object again from disk, put it on top of a stack and then move/write the bottom object from this stack to disk. As a result, my stack had a fixed (maximum) size. And it allowed me to use an unlimited amount of objects, with a reasonable good performance too.
Unfortunately, I don't have that code available anymore. I wrote it for a previous employer about 7 years ago. I do know that you would need to write a bit of code for the streaming support plus a bunch more for the stack controller which maintains all those objects. But it technically would allow you to create an unlimited number of objects, since you're trading RAM memory for disk space.

TStringList, Dynamic Array or Linked List in Delphi?

I have a choice.
I have a number of already ordered strings that I need to store and access. It looks like I can choose between using:
A TStringList
A Dynamic Array of strings, and
A Linked List of strings (singly linked)
and Alan in his comment suggested I also add to the choices:
TList<string>
In what circumstances is each of these better than the others?
Which is best for small lists (under 10 items)?
Which is best for large lists (over 1000 items)?
Which is best for huge lists (over 1,000,000 items)?
Which is best to minimize memory use?
Which is best to minimize loading time to add extra items on the end?
Which is best to minimize access time for accessing the entire list from first to last?
On this basis (or any others), which data structure would be preferable?
For reference, I am using Delphi 2009.
Dimitry in a comment said:
Describe your task and data access pattern, then it will be possible to give you an exact answer
Okay. I've got a genealogy program with lots of data.
For each person I have a number of events and attributes. I am storing them as short text strings but there are many of them for each person, ranging from 0 to a few hundred. And I've got thousands of people. I don't need random access to them. I only need them associated as a number of strings in a known order attached to each person. This is my case of thousands of "small lists". They take time to load and use memory, and take time to access if I need them all (e.g. to export the entire generated report).
Then I have a few larger lists, e.g. all the names of the sections of my "virtual" treeview, which can have hundreds of thousands of names. Again I only need a list that I can access by index. These are stored separately from the treeview for efficiency, and the treeview retrieves them only as needed. This takes a while to load and is very expensive memory-wise for my program. But I don't have to worry about access time, because only a few are accessed at a time.
Hopefully this gives you an idea of what I'm trying to accomplish.
p.s. I've posted a lot of questions about optimizing Delphi here at StackOverflow. My program reads 25 MB files with 100,000 people and creates data structures and a report and treeview for them in 8 seconds but uses 175 MB of RAM to do so. I'm working to reduce that because I'm aiming to load files with several million people in 32-bit Windows.
I've just found some excellent suggestions for optimizing a TList at this StackOverflow question:
Is there a faster TList implementation?
Unless you have special needs, a TStringList is hard to beat because it provides the TStrings interface that many components can use directly. With TStringList.Sorted := True, binary search will be used which means that search will be very quick. You also get object mapping for free, each item can also be associated with a pointer, and you get all the existing methods for marshalling, stream interfaces, comma-text, delimited-text, and so on.
On the other hand, for special needs purposes, if you need to do many inserts and deletions, then something more approaching a linked list would be better. But then search becomes slower, and it is a rare collection of strings indeed that never needs searching. In such situations, some type of hash is often used where a hash is created out of, say, the first 2 bytes of a string (preallocate an array with length 65536, and the first 2 bytes of a string is converted directly into a hash index within that range), and then at that hash location, a linked list is stored with each item key consisting of the remaining bytes in the strings (to save space---the hash index already contains the first two bytes). Then, the initial hash lookup is O(1), and the subsequent insertions and deletions are linked-list-fast. This is a trade-off that can be manipulated, and the levers should be clear.
A TStringList. Pros: has extended functionality, allowing to dynamically grow, sort, save, load, search, etc. Cons: on large amount of access to the items by the index, Strings[Index] is introducing sensible performance lost (few percents), comparing to access to an array, memory overhead for each item cell.
A Dynamic Array of strings. Pros: combines ability to dynamically grow, as a TStrings, with the fastest access by the index, minimal memory usage from others. Cons: limited standard "string list" functionality.
A Linked List of strings (singly linked). Pros: the linear speed of addition of an item to the list end. Cons: slowest access by the index and searching, limited standard "string list" functionality, memory overhead for "next item" pointer, spead overhead for each item memory allocation.
TList< string >. As above.
TStringBuilder. I does not have a good idea, how to use TStringBuilder as a storage for multiple strings.
Actually, there are much more approaches:
linked list of dynamic arrays
hash tables
databases
binary trees
etc
The best approach will depend on the task.
Which is best for small lists (under
10 items)?
Anyone, may be even static array with total items count variable.
Which is best for large lists (over 1000 items)?
Which is best for huge lists (over 1,000,000 items)?
For large lists I will choose:
- dynamic array, if I need a lot of access by the index or search for specific item
- hash table, if I need to search by the key
- linked list of dynamic arrays, if I need many item appends and no access by the index
Which is best to minimize memory use?
dynamic array will eat less memory. But the question is not about overhead, but about on which number of items this overhead become sensible. And then how to properly handle this number of items.
Which is best to minimize loading time to add extra items on the end?
dynamic array may dynamically grow, but on really large number of items, memory manager may not found a continous memory area. While linked list will work until there is a memory for at least a cell, but for cost of memory allocation for each item. The mixed approach - linked list of dynamic arrays should work.
Which is best to minimize access time for accessing the entire list from first to last?
dynamic array.
On this basis (or any others), which data structure would be preferable?
For which task ?
If your stated goal is to improve your program to the point that it can load genealogy files with millions of persons in it, then deciding between the four data structures in your question isn't really going to get you there.
Do the math - you are currently loading a 25 MB file with about 100000 persons in it, which causes your application to consume 175 MB of memory. If you wish to load files with several millions of persons in it you can estimate that without drastic changes to your program you will need to multiply your memory needs by n * 10 as well. There's no way to do that in a 32 bit process while keeping everything in memory the way you currently do.
You basically have two options:
Not keeping everything in memory at once, instead using a database, or a file-based solution which you load data from when you need it. I remember you had other questions about this already, and probably decided against it, so I'll leave it at that.
Keep everything in memory, but in the most space-efficient way possible. As long as there is no 64 bit Delphi this should allow for a few million persons, depending on how much data there will be for each person. Recompiling this for 64 bit will do away with that limit as well.
If you go for the second option then you need to minimize memory consumption much more aggressively:
Use string interning. Every loaded data element in your program that contains the same data but is contained in different strings is basically wasted memory. I understand that your program is a viewer, not an editor, so you can probably get away with only ever adding strings to your pool of interned strings. Doing string interning with millions of string is still difficult, the "Optimizing Memory Consumption with String Pools" blog postings on the SmartInspect blog may give you some good ideas. These guys deal regularly with huge data files and had to make it work with the same constraints you are facing.
This should also connect this answer to your question - if you use string interning you would not need to keep lists of strings in your data structures, but lists of string pool indexes.
It may also be beneficial to use multiple string pools, like one for names, but a different one for locations like cities or countries. This should speed up insertion into the pools.
Use the string encoding that gives the smallest in-memory representation. Storing everything as a native Windows Unicode string will probably consume much more space than storing strings in UTF-8, unless you deal regularly with strings that contain mostly characters which need three or more bytes in the UTF-8 encoding.
Due to the necessary character set conversion your program will need more CPU cycles for displaying strings, but with that amount of data it's a worthy trade-off, as memory access will be the bottleneck, and smaller data size helps with decreasing memory access load.
One question: How do you query: do you match the strings or query on an ID or position in the list?
Best for small # strings:
Whatever makes your program easy to understand. Program readability is very important and you should only sacrifice it in real hotspots in your application for speed.
Best for memory (if that is the largest constrained) and load times:
Keep all strings in a single memory buffer (or memory mapped file) and only keep pointers to the strings (or offsets). Whenever you need a string you can clip-out a string using two pointers and return it as a Delphi string. This way you avoid the overhead of the string structure itself (refcount, length int, codepage int and the memory manager structures for each string allocation.
This only works fine if the strings are static and don't change.
TList, TList<>, array of string and the solution above have a "list" overhead of one pointer per string. A linked list has an overhead of at least 2 pointers (single linked list) or 3 pointers (double linked list). The linked list solution does not have fast random access but allows for O(1) resizes where trhe other options have O(lgN) (using a factor for resize) or O(N) using a fixed resize.
What I would do:
If < 1000 items and performance is not utmost important: use TStringList or a dyn array whatever is easiest for you.
else if static: use the trick above. This will give you O(lgN) query time, least used memory and very fast load times (just gulp it in or use a memory mapped file)
All mentioned structures in your question will fail when using large amounts of data 1M+ strings that needs to be dynamically chaned in code. At that Time I would use a balances binary tree or a hash table depending on the type of queries I need to maken.
From your description, I'm not entirely sure if it could fit in your design but one way you could improve on memory usage without suffering a huge performance penalty is by using a trie.
Advantages relative to binary search tree
The following are the main advantages
of tries over binary search trees
(BSTs):
Looking up keys is faster. Looking up a key of length m takes worst case
O(m) time. A BST performs O(log(n))
comparisons of keys, where n is the
number of elements in the tree,
because lookups depend on the depth of
the tree, which is logarithmic in the
number of keys if the tree is
balanced. Hence in the worst case, a
BST takes O(m log n) time. Moreover,
in the worst case log(n) will approach
m. Also, the simple operations tries
use during lookup, such as array
indexing using a character, are fast
on real machines.
Tries can require less space when they contain a large number of short
strings, because the keys are not
stored explicitly and nodes are shared
between keys with common initial
subsequences.
Tries facilitate longest-prefix matching, helping to find the key
sharing the longest possible prefix of
characters all unique.
Possible alternative:
I've recently discovered SynBigTable (http://blog.synopse.info/post/2010/03/16/Synopse-Big-Table) which has a TSynBigTableString class for storing large amounts of data using a string index.
Very simple, single layer bigtable implementation, and it mainly uses disc storage, to consumes a lot less memory than expected when storing hundreds of thousands of records.
As simple as:
aId := UTF8String(Format('%s.%s', [name, surname]));
bigtable.Add(data, aId)
and
bigtable.Get(aId, data)
One catch, indexes must be unique, and the cost of update is a bit high (first delete, then re-insert)
TStringList stores an array of pointer to (string, TObject) records.
TList stores an array of pointers.
TStringBuilder cannot store a collection of strings. It is similar to .NET's StringBuilder and should only be used to concatenate (many) strings.
Resizing dynamic arrays is slow, so do not even consider it as an option.
I would use Delphi's generic TList<string> in all your scenarios. It stores an array of strings (not string pointers). It should have faster access in all cases due to no (un)boxing.
You may be able to find or implement a slightly better linked-list solution if you only want sequential access. See Delphi Algorithms and Data Structures.
Delphi promotes its TList and TList<>. The internal array implementation is highly optimized and I have never experienced performance/memory issues when using it. See Efficiency of TList and TStringList

Resources