Why do we use data structures? (when no dynamic allocation is needed) - memory

I'm pretty sure this is a silly newbie question but I didn't know it so I had to ask...
Why do we use data structures, like Linked List, Binary Search Tree, etc? (when no dynamic allocation is needed)
I mean: wouldn't it be faster if we kept a single variable for a single object? Wouldn't that speed up access time? Eg: BST possibly has to run through some pointers first before it gets to the actual data.
Except for when dynamic allocation is needed, is there a reason to use them?
Eg: using linked list/ BST / std::vector in a situation where a simple (non-dynamic) array could be used.

Each thing you are storing is kept in its own variable (or storage location). Data structures apply organization to your data. Imagine you had 10,000 things you were trying to track. You could store them in 10,000 separate variables, but then you'd always be limited to 10,000 things. If you wanted more, you'd have to modify your program and recompile it each time you increased the number. You might also have to modify the code that performs the calculations whenever the order of the items changes, for example because a new one is introduced in the middle.
Using data structures, from simple arrays to more complex trees, hash tables, or custom data structures, allows your code to be both more organized and more extensible. An array, which can either be created to hold the required number of elements or extended to hold more after it's first created, keeps you from having to rewrite your code each time the number of data items changes. Using an appropriate data structure also lets you design algorithms based on the relationships between the data elements rather than some fixed ordering, giving you more flexibility.
A simple analogy might help. You could, for example, organize all of your important papers by putting each of them into a separate filing cabinet. If you did that you'd have to memorize (i.e., hard-code) the cabinet in which each item can be found in order to use them effectively. Alternatively, you could store them all in the same filing cabinet (like a generic array). This is better in that they're all in one place, but still not optimal, since you have to search through them all each time you want to find one. Better yet would be to organize them by subject, putting like subjects in the same file folder (separate arrays, different structures). That way you can look for the file folder for the correct subject, then find the item you're looking for in it. Depending on your needs you can use different filing methods (data structures/algorithms) to better organize your information for its intended use.
I'll also note that there are times when it does make sense to use individual variables for each data item. Frequently there is a mixture of individual variables and more complex structures, with the appropriate method chosen based on how the particular item is used. For example, you might store the sum of a collection of integers in a variable while the integers themselves are stored in an array. A program would need to be pretty simple, though, before introducing data structures wouldn't be appropriate.
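As a tiny illustration of that mix (a sketch in Swift with made-up values, not tied to any particular language in the question), the individual values live in an array while the derived value lives in a single variable:

let readings = [3, 14, 15, 92, 65]          // the collection of integers
let total = readings.reduce(0, +)           // one scalar summarizing them
print("Sum of \(readings.count) readings is \(total)")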

Sorry, but you didn't just find a great new way of doing things ;) There are several huge problems with this approach.
How could this be done without requiring programmers to massively (and nontrivially) rewrite tons of code as soon as the number of allowed items changes? Even when you have to fix your data structure sizes at compile time (e.g. arrays in C), you can use a constant. Then changing a single constant and recompiling is sufficient to change that size (if the code was written with this in mind). With your approach, we'd have to retype hundreds or even thousands of lines every time some size changes. Not to mention that all this code would be incredibly hard to read, write, maintain and verify. The old truism "more lines of code = more space for bugs" is taken up to eleven in such a setting.
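For instance, a minimal sketch (in Swift rather than C, with an invented constant name) of how a single named constant keeps a size change to one line:

// One named constant controls the capacity; change it in one place
// and every piece of code that uses it follows along.
let maxItems = 10_000
var values = [Double](repeating: 0.0, count: maxItems)

for i in 0..<maxItems {
    values[i] = Double(i) * 0.5   // same code regardless of the chosen size
}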
Then there's the fact that the number is almost never set in stone. Even when it is a compile time constant, changes are still likely. Writing hundreds of lines of code for a minor (if it exists at all) performance gain is hardly ever worth it. This goes thrice if you'd have to do the same amount of work again every time you want to change something. Not to mention that it isn't possible at all once there is any remotely dynamic component in the size of the data structures. That is to say, it's very rarely possible.
Also consider the concept of implicit and succinct data structures. If you use a set of hard-coded variables instead of abstracting over the size, you've still got a data structure. You've merely made it implicit, unrolled the algorithms operating on it, and set its size in stone. Philosophically, you've changed nothing.
But surely it has a performance benefit? Well, possibly, although it will be tiny. And it isn't guaranteed to be there. You'd save some space on data, but code size would explode. And as everyone informed about inlining should know, small code size is very useful for performance, because it allows the code to stay in the cache. Also, argument passing would result in excessive copying unless you figured out a trick to derive the location of most variables from a few pointers. Needless to say, this would be nonportable, very tricky to get right even on a single platform, and liable to be broken by any change to the code or the compiler invocation.
Finally, note that a weaker form is sometimes done. The Wikipedia page on implicit and succinct data structures has some examples. On a smaller scale, some data structures store much data in one place, such that it can be accessed with less pointer chasing and is more likely to be in the cache (e.g. cache-aware and cache-oblivious data structures). It's just not viable for 99% of all code and taking it to the extreme adds only a tiny, if any, benefit.

The main benefit of data structures, in my opinion, is that you are grouping your data relationally. For instance, instead of having 10 separate variables of class MyClass, you can have a data structure that groups them all. This grouping allows certain operations to be performed because the items are structured together.
Not to mention, data structures can help enforce type safety, which is powerful and necessary in many cases.
And last but not least, what would you rather do?
string string1 = "string1";
string string2 = "string2";
string string3 = "string3";
string string4 = "string4";
string string5 = "string5";
Console.WriteLine(string1);
Console.WriteLine(string2);
Console.WriteLine(string3);
Console.WriteLine(string4);
Console.WriteLine(string5);
Or...
List<string> myStringList = new List<string>() { "string1", "string2", "string3", "string4", "string5" };
foreach (string s in myStringList)
Console.WriteLine(s);

Related

Using something other than a Swift array for mutable fixed-size thread-safe data passed to OpenGL buffer

I am trying to squeeze every bit of efficiency out of the application I am working on.
I have a couple of arrays that satisfy the following conditions:
1. They are NEVER appended to; I always calculate the index myself
2. They are allocated once and never change size
3. It would be nice if they were thread safe, as long as it doesn't cost performance
4. Some hold primitives like floats or unsigned ints; one of them holds a class
5. Most of these arrays are at some point passed into a glBuffer
6. They are never cleared, just overwritten
7. Some of the arrays' individual elements are changed entirely by =, others are changed by +=
I am currently using native Swift arrays and am allocating them like var arr = [GLfloat](count: 999, repeatedValue: 0). However, I have been reading a lot of documentation and it sounds like Swift arrays are much more abstract than a traditional C-style array. I am not even sure if they are allocated in one block or more like a linked list with bits and pieces thrown all over the place. I believe the code above causes them to be allocated in a contiguous block, but I'm not sure.
I worry that the abstract nature of Swift arrays is wasting a lot of precious processing time. As you can see from my conditions above, I don't need any of the fancy appending or safety features of Swift arrays. I just need them simple and fast.
My question is: In this scenario should I be using some other form of array? NSArray, somehow get a C-style array going, create my own data type?
I'm looking into thread safety: would a different array type that was more thread safe, such as NSArray, be any slower?
Note that your requirements are contradictory, particularly #2 and #7. You can't operate on them with += and also say they will never change size. "I always calculate the index myself" also doesn't make sense. What else would calculate it? The requirements for things you will hand to glBuffer are radically different than the requirements for things that will hold objects.
If you construct the Array the way you say, you'll get contiguous memory. If you want to be absolutely certain that you have contiguous memory, use a ContiguousArray (but in the vast majority of cases this will give you little to no benefit while costing you complexity; there appear to be some corner cases in the current compiler that give a small advantage to ContiguousArray, but you must benchmark before assuming that's true). It's not clear what kind of "abstractness" you have in mind, but there are no secrets about how Array works. All of stdlib is open source. Go look and see if it does things you want to avoid.
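If it helps, here is a rough sketch (Swift, with invented names) of allocating a fixed-size ContiguousArray once and handing its contiguous storage to something like glBufferData; the actual GL call is left commented out because it depends on your platform's bindings:

// A fixed-size buffer, allocated once up front (GLfloat is Float on Apple platforms).
var vertexData = ContiguousArray<Float>(repeating: 0, count: 999)
vertexData[0] = 1.0   // elements are updated in place by index

// Expose the contiguous storage as a base pointer plus byte count,
// which is the shape glBufferData expects.
vertexData.withUnsafeBufferPointer { buffer in
    let byteCount = buffer.count * MemoryLayout<Float>.stride
    // glBufferData(GLenum(GL_ARRAY_BUFFER), byteCount, buffer.baseAddress, GLenum(GL_STATIC_DRAW))
    _ = byteCount   // GL call omitted here; it depends on your bindings and bound buffer
}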
For certain kinds of operations, it is possible for other types of data structures to be faster. For instance, there are cases where a dispatch_data is better and cases where a regular Data would be better and cases where you should use a ManagedBuffer to gain more control. But in general, unless you deeply know what you're doing, you can easily make things dramatically worse. There is no "is always faster" data structure that works correctly for all the kinds of uses you describe. If there were, that would just be the implementation of Array.
None of this makes sense to pursue until you've built some code and started profiling it in optimized builds to understand what's going on. It is very likely that different uses would be optimized by different kinds of data structures.
It's very strange that you ask whether you should use NSArray, since that would be wildly (orders of magnitude) slower than Array for dealing with very large collections of numbers. You definitely need to experiment with these types a bit to get a sense of their characteristics. NSArray is brilliant and extremely fast for certain problems, but not for that one.
But again, write a little code. Profile it. Look at the generated assembler. See what's happening. Watch particularly for any undesired copying or retain counting. If you see that in a specific case, then you have something to think about changing data structures over. But there's no "use this to go fast." All the trade-offs to achieve that in the general case are already in Array.

Swift Struct vs Class: what is the allowed stack size? and refactoring a class to a struct

First, I understand the difference between value and reference types -this isn't that question. I am rewriting some of my code in Swift, and decided to also refactor some of the classes. Therefore, I thought I would see if some of the classes make sense as structs.
Memory: I have some model classes that hold very large arrays, that are constantly growing in size (unknown final size), and could exist for hours. First, are there any guidelines about a suggested or absolute size for a struct, since it lives on the stack?
Refactoring Use: Since I'm refactoring what right now is a mess with too much dependency, I wonder how I could improve on that. The views and view controllers are mostly easy; it's my model, and what it does, that has always left me wishing for better examples to follow.
WorkerManager: Singleton that holds one or two Workers at a time. One will always be recording new data from a sensor, and the other would be reviewing stored data. The view controllers get the Worker reference from the WorkerManager, and ask the Worker for the data to be displayed.
Worker: Does everything on a queue, to prevent memory access issues (C array pointers are constantly changing as they grow).
Listening: The listening Worker listens for new data, sends it to a Processor object (that it created) that cleans up the data and stores it in C arrays held by the Worker. Then, if there is valid data, the Worker tells the Analyzer (also owned by the Worker) to analyze the data and store it in other C arrays to be fed to views. Both the Processor and Analyzer need state to know what has happened in the past and what to process and analyze next. The pure raw data is stored in a separate Record NSManagedObject.
Reviewer: Takes a Record and uses the pure raw data to recreate all of the analyzed data so that it can be reviewed. (The analyzed data is massive, and I don't want to store it to disk.)
Now, my second question is: could/should Processor and Analyzer be replaced with structs? Or maybe protocols for the Worker? They aren't really "objects" in the normal sense, just convenient groups of related methods and the necessary state. The code is nearly a thousand lines for each, and I don't want to put it all in one class, or even in the same file.
I just don't have a good sense of how to remove all of my state, use pure functions for all of the complex mathematical operations that are performed on the arrays, and where to put them.
While the struct itself lives on the stack, the array data lives on the heap so that the array can grow in size dynamically. So even if you have an array with a million items in it and pass it somewhere, none of the items are copied until you change the new array, due to the copy-on-write implementation. This is described in detail in 2015 WWDC Session 414.
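A small sketch of what that copy-on-write behaviour looks like in practice (illustrative only):

let original = [Int](repeating: 0, count: 1_000_000)
var copy = original          // no elements are copied here; both arrays share storage
copy[0] = 42                 // the first mutation triggers the actual copy
print(original[0], copy[0])  // prints "0 42": the copy happened lazily, and only once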
As for the second question, I think that 2015 WWDC Session 414 again has the answer. The basic checks that Apple engineers recommend for choosing value types are:
Use a value type when:
Comparing instance data with == makes sense
You want copies to have independent state
The data will be used in code across multiple threads
Use a reference type (e.g. use a class) when:
Comparing instance identity with === makes sense
You want to create shared, mutable state
So from what you've described, I think reference types fit Processor and Analyzer much better. It doesn't seem that copies of Processor and Analyzer would be valid objects unless you had created new Processors and Analyzers explicitly. Would you not want the changes to these objects to be shared?
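A minimal sketch of the difference those guidelines describe, using hypothetical types rather than the asker's Processor and Analyzer:

struct Settings {             // value type: each copy is independent
    var threshold = 0.5
}

final class Counter {         // reference type: copies share one instance
    var value = 0
}

let a = Settings()
var b = a                     // b is a fresh copy
b.threshold = 0.9             // a is unaffected
print(a.threshold, b.threshold)   // 0.5 0.9

let x = Counter()
let y = x                     // y points at the same object as x
y.value = 10                  // shared, mutable state
print(x.value, y.value)           // 10 10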

HHVM statically typing lookup tables and keeping them fully cached in RAM

I'm doing scientific research, processing through millions of combinations of multi-megabyte arrays.
For you to be capable of answering this question you will need to have knowledge/experience of all of the following
how HHVM is able to cache data structures in RAM between requests
how to tell HHVM data structures will be constant
how to declare array index and value types
I need to process the entire arrays, so it's a lot of data to be loaded and processed. (millions of requests within minutes on a LAN). The faster I can complete requests the quicker I can complete my work. If HHVM has to do work loading this data on each request, it accounts for a significant fraction of the time to complete the request (sometimes more than half, it depends on the complexity of the analysis I'm doing at the time).
I have found a method that has allowed me to keep these data structures cached in RAM (no loading from files, interpreting code, pushing to the array hundreds of thousands of times for no reason, no pointless repetitive unserialize etc), and thus I have eliminated this massive measurable delay.
I have 3 questions regarding how I can make this even faster:
Is the way I'm doing it now creating a global scope penalty?
How can I declare my arrays as constant and tell HHVM what data types to expect?
If I declare my arrays as constant is it even necessary to declare the types for HHVM?
Instead of using nested arrays, if I use 3 separate data structures ImmVector, PackedArray, or define a class would it be faster?
Keep in mind that anything that prevents HHVM from caching the data structure in RAM between requests should be regarded as unacceptable.
Lookuptable35543.php
<?php
$data = [
["uuid (20 chars)", 5336, 7373],
["uuid (20 chars)", 5336, 7373],
#more lines as above
];
?>
Some of these files are many MB in size and there are a lot of them
Main.php
<?php
function main() {
    require '/path/to/Lookuptable35543.php';
    # (Do stuff with $data)
}
?>
This is working quite well: as Main.php gets thousands of requests in a short period of time, HHVM keeps Lookuptable.php's data structure in memory, avoiding pointless processing and IO, since it just sits in RAM ready for use. (I have more than enough RAM.)
Unfortunately, the only way I know to make HHVM hold the lookup table in RAM is to set $data in the global scope inside my lookup####.php file (and then require the lookup file into a function in the data-processing file, Main.php). This way HHVM doesn't bother reloading the file or re-executing the code to create $data, because it can see that $data can be determined at compile time and will never change during runtime. This works, but I don't know if there is a penalty from having $data exist in the lookup###.php file's global scope. (Or maybe it isn't global at all because it is required into Main.php's function?)
What if I return $data from a function inside Lookup.php and call that function from Main.php? Would HHVM keep the result of getData() in RAM?
Somehow I associate functions with unpredictability... but maybe HHVM is clever enough to know that the function's result can be determined at compile time and never changes?
I can't put the lookup table inside Main.php because I require different lookup tables based on the type of request.
Is there a way I can tell HHVM that my outer array will always have an integer index that never changes, and the values of the outer array will always be an array?
Perhaps I need to use ImmVector?
Then is there a way to tell HHVM that my inner array will always be a fixed length string followed by 2 integers, always, no extra elements, contents never changes?
I'd prefer not to use OO or create a class. How can I declare types, procedural style?
If a class is absolutely necessary can you please give example code suitable for my requirements above?
Will it be faster if I dont nest arrays?
I just realized I could have one array with integer index and values of fixed length string. Then a 2nd array with integer index and integer values, and a 3rd one with integer index and integer values.
If you're not familiar with this HHVM caching technique, please do not waste our mutual time suggesting a database, redis, APC, unserialize, etc. The fastest option is for HHVM to just keep my various $data variables in RAM. Even unserializing $data from a ramdisk file is slow, because the entire data structure must be parsed as a string and converted into a data structure in memory for every request. APC has the same problem as far as I know. I don't even want to have to copy $data. The lookup tables are immutable, read only. They must just stay fully structured in RAM. My current caching solution (at the top of this question) has already given me huge gains, but as per my 3 questions I think there may be more gains to be had.
In case you're wondering, I have measured the latency of various data loading and caching methods.
Now I basically want to keep the caching situation I have, but give the HHVM JIT maximum confidence about how to type my data, so it can save time not running type or even bound (array size) checks.
Edit
Ok so nobody has been able to give me any code examples yet, so I'm just trying stuff out.
Here's what I've found out so far.
const arrays don't work yet in HHVM. const foo = ['uuid1',43,43];
throws an error about HHVM only supporting constants with scalar values.
Vector with Array values: I don't know how it will perform yet... I expect it will be better than a normal array. This is valid HH code.
This is progress, because HHVM should be able to cache this in the same way, HHVM knows this whole structure is constant, and HHVM knows the indexes are all integers.
What I'm still not entirely happy about with this structure is this:
Consider this code
for ($n=0;$n<count($iv);++$n) if ($x > $iv[$n][1]) dosomething();
Will HHVM perform a type check on $iv[$n][1] on every loop iteration?
In my definition of $iv above, there is nothing that says the 2nd element of the inner array will be an integer.
How can I improve on this?
Can disabling the type checker be of any use? Does this only hide errors from the external type checker, or does it prevent HHVM from constantly doing type checks? (I'm thinking it's the first thing)
Perhaps if I could make my own user-defined type that would solve the problem?
<?hh
#I don't know what mechanisms for UDT's exist, so this code is made-up
CreateUDT foo = <string,int,int>;
$iv = ImmVector<foo> {
['uuid1',425,244],
['uuid2',658,836]
};
print_r($iv);
I found a reference to this at Hack Collections Literal Syntax Vector<Foo> unfortunately it might not be available to use yet.
I'm a software engineer at Facebook working on HHVM.
This entire question reeks of premature optimization to me. Have you done profiling and determined that loading this array is actually a bottleneck for your app? (Not just microbenchmarks, but how it actually affects the performance, latency, RPS, etc of realistic pageloads.) And also isolated from other effects, e.g., if this array is a cache or some sort of precomputed data, you need to isolate the win of precomputing the data from the actual time to load it by caching it in various different ways.
In general, HHVM is very good at dealing with arrays, since they are so hot in nearly every codepath -- and in particular at constant arrays like this one. To your questions about how to inform it of the shape and types of things in the arrays, HHVM can figure that all out for itself, and is very good at doing so on constant arrays composed entirely of constants. (And the ways it thinks about arrays aren't quite the ways you think about arrays, so it can probably do a better job anyway!) Basically, unless profiling says this is actually a hotspot -- which I'm pretty skeptical of -- I wouldn't worry too much about it. A couple general notes to be aware of:
Measure every performance diff. Don't prematurely optimize -- use profiling to guide. The developer productivity lost by premature optimizations getting in the way can be lethal.
Get things out of toplevel ("pseudomains") as much as possible. A function which returns a static or constant array should be just fine, and will in general help HHVM optimize code even better.
Avoid references as much as possible, especially in this array if you care about performance so much.
You should probably look into repo authoritative mode, which can help HHVM optimize lots of things even more -- but in particular for this case, the more aggressive inlining that repo auth mode can do might be a win.
Edit, aside:
because then the entire data structure must be parsed as a string and converted into a data structure in memory for every request. APC has the same problem as far as i know
This is exactly what I mean by premature optimization: you're rejecting APC without even trying it, even if it might be a cleaner way of doing what you want. It turns out that, in most cases, HHVM actually can optimize away the serialization/deserialization of storing arrays in APC, particularly if they are constant arrays that are never modified. As above, HHVM is very good at optimizing lots of common patterns. Just write code that's clean, profile it, and fix the hotspots.
Okay I've solved my first question.
I don't have any global scope issues. My require is being done from inside function main(), so it's as if the code from lookuptable####.php is being inserted into function main().
HHVM docs: "If the include occurs inside a function..."
Basically, if you were to open lookuptable####.php it looks like the code is in the global scope, but that's not the file being requested from HHVM. Main.php is the one being requested, so there is no code in the global scope.
I think I've answered my 2nd question, it's currently at the bottom of my question. I'm not 100% convinced, but I'm pretty happy to move ahead and test it.

TStringList, Dynamic Array or Linked List in Delphi?

I have a choice.
I have a number of already ordered strings that I need to store and access. It looks like I can choose between using:
A TStringList
A Dynamic Array of strings, and
A Linked List of strings (singly linked)
and Alan in his comment suggested I also add to the choices:
TList<string>
In what circumstances is each of these better than the others?
Which is best for small lists (under 10 items)?
Which is best for large lists (over 1000 items)?
Which is best for huge lists (over 1,000,000 items)?
Which is best to minimize memory use?
Which is best to minimize loading time to add extra items on the end?
Which is best to minimize access time for accessing the entire list from first to last?
On this basis (or any others), which data structure would be preferable?
For reference, I am using Delphi 2009.
Dimitry in a comment said:
Describe your task and data access pattern, then it will be possible to give you an exact answer
Okay. I've got a genealogy program with lots of data.
For each person I have a number of events and attributes. I am storing them as short text strings but there are many of them for each person, ranging from 0 to a few hundred. And I've got thousands of people. I don't need random access to them. I only need them associated as a number of strings in a known order attached to each person. This is my case of thousands of "small lists". They take time to load and use memory, and take time to access if I need them all (e.g. to export the entire generated report).
Then I have a few larger lists, e.g. all the names of the sections of my "virtual" treeview, which can have hundreds of thousands of names. Again I only need a list that I can access by index. These are stored separately from the treeview for efficiency, and the treeview retrieves them only as needed. This takes a while to load and is very expensive memory-wise for my program. But I don't have to worry about access time, because only a few are accessed at a time.
Hopefully this gives you an idea of what I'm trying to accomplish.
p.s. I've posted a lot of questions about optimizing Delphi here at StackOverflow. My program reads 25 MB files with 100,000 people and creates data structures and a report and treeview for them in 8 seconds but uses 175 MB of RAM to do so. I'm working to reduce that because I'm aiming to load files with several million people in 32-bit Windows.
I've just found some excellent suggestions for optimizing a TList at this StackOverflow question:
Is there a faster TList implementation?
Unless you have special needs, a TStringList is hard to beat because it provides the TStrings interface that many components can use directly. With TStringList.Sorted := True, binary search will be used which means that search will be very quick. You also get object mapping for free, each item can also be associated with a pointer, and you get all the existing methods for marshalling, stream interfaces, comma-text, delimited-text, and so on.
On the other hand, for special needs purposes, if you need to do many inserts and deletions, then something more approaching a linked list would be better. But then search becomes slower, and it is a rare collection of strings indeed that never needs searching. In such situations, some type of hash is often used where a hash is created out of, say, the first 2 bytes of a string (preallocate an array with length 65536, and the first 2 bytes of a string is converted directly into a hash index within that range), and then at that hash location, a linked list is stored with each item key consisting of the remaining bytes in the strings (to save space---the hash index already contains the first two bytes). Then, the initial hash lookup is O(1), and the subsequent insertions and deletions are linked-list-fast. This is a trade-off that can be manipulated, and the levers should be clear.
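A compact sketch of that bucketing idea (written in Swift purely for illustration, with invented names, and using an array per bucket instead of a linked list for brevity):

// 65,536 buckets indexed by the first two bytes of a string; each bucket
// holds only the remainders of the strings that landed there.
var buckets = [[Substring]](repeating: [], count: 65_536)

func bucketIndex(for s: String) -> Int {
    let bytes = Array(s.utf8.prefix(2))
    let hi = bytes.count > 0 ? Int(bytes[0]) : 0
    let lo = bytes.count > 1 ? Int(bytes[1]) : 0
    return (hi << 8) | lo
}

func insert(_ s: String) {
    buckets[bucketIndex(for: s)].append(s.dropFirst(2))   // store only the remaining bytes
}

func contains(_ s: String) -> Bool {
    // picking the bucket is O(1); only that one bucket is scanned afterwards
    return buckets[bucketIndex(for: s)].contains(s.dropFirst(2))
}

insert("Springfield")
print(contains("Springfield"))   // true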
A TStringList. Pros: extended functionality, allowing you to dynamically grow, sort, save, load, search, etc. Cons: with a large number of accesses to items by index, Strings[Index] introduces a noticeable performance loss (a few percent) compared to accessing an array, and there is memory overhead for each item cell.
A Dynamic Array of strings. Pros: combines the ability to grow dynamically, like a TStrings, with the fastest access by index and the least memory usage of the options. Cons: limited standard "string list" functionality.
A Linked List of strings (singly linked). Pros: linear speed of adding items to the end of the list. Cons: the slowest access by index and searching, limited standard "string list" functionality, memory overhead for the "next item" pointer, and speed overhead for the per-item memory allocation.
TList< string >. As above.
TStringBuilder. I don't have a good idea of how to use TStringBuilder as storage for multiple strings.
Actually, there are much more approaches:
linked list of dynamic arrays
hash tables
databases
binary trees
etc
The best approach will depend on the task.
Which is best for small lists (under 10 items)?
Any of them; maybe even a static array with a variable holding the total item count.
Which is best for large lists (over 1000 items)?
Which is best for huge lists (over 1,000,000 items)?
For large lists I would choose:
- a dynamic array, if I need a lot of access by index or searches for specific items
- a hash table, if I need to search by key
- a linked list of dynamic arrays, if I need many item appends and no access by index
Which is best to minimize memory use?
A dynamic array will eat the least memory. But the question is not about the overhead itself, but about the number of items at which this overhead becomes noticeable, and how to properly handle that number of items.
Which is best to minimize loading time to add extra items on the end?
A dynamic array may grow dynamically, but with a really large number of items the memory manager may not find a contiguous memory area. A linked list will work as long as there is memory for at least one cell, but at the cost of a memory allocation for each item. The mixed approach - a linked list of dynamic arrays - should work.
Which is best to minimize access time for accessing the entire list from first to last?
dynamic array.
On this basis (or any others), which data structure would be preferable?
For which task?
If your stated goal is to improve your program to the point that it can load genealogy files with millions of persons in it, then deciding between the four data structures in your question isn't really going to get you there.
Do the math - you are currently loading a 25 MB file with about 100,000 persons in it, which causes your application to consume 175 MB of memory. If you wish to load files with several million persons in them, you can estimate that, without drastic changes to your program, your memory needs will scale up by a similar factor. There's no way to do that in a 32 bit process while keeping everything in memory the way you currently do.
You basically have two options:
Not keeping everything in memory at once, instead using a database, or a file-based solution which you load data from when you need it. I remember you had other questions about this already, and probably decided against it, so I'll leave it at that.
Keep everything in memory, but in the most space-efficient way possible. As long as there is no 64 bit Delphi, this should still allow for a few million persons, depending on how much data there is for each person. Recompiling for 64 bit will do away with that limit as well.
If you go for the second option then you need to minimize memory consumption much more aggressively:
Use string interning. Every loaded data element in your program that contains the same data but is held in a different string instance is basically wasted memory. I understand that your program is a viewer, not an editor, so you can probably get away with only ever adding strings to your pool of interned strings. Doing string interning with millions of strings is still difficult; the "Optimizing Memory Consumption with String Pools" blog postings on the SmartInspect blog may give you some good ideas. These guys deal regularly with huge data files and had to make it work under the same constraints you are facing.
This should also connect this answer to your question - if you use string interning you would not need to keep lists of strings in your data structures, but lists of string pool indexes.
It may also be beneficial to use multiple string pools, like one for names, but a different one for locations like cities or countries. This should speed up insertion into the pools.
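A language-agnostic sketch of such a pool (shown in Swift here, with invented names), where data structures keep small indexes instead of whole strings:

// A simple string pool: each distinct string is stored once and
// referred to everywhere else by a compact Int32 index.
final class StringPool {
    private var indexByString: [String: Int32] = [:]
    private(set) var strings: [String] = []

    // Returns the pool index for s, adding it only if it hasn't been seen before.
    func intern(_ s: String) -> Int32 {
        if let existing = indexByString[s] { return existing }
        let index = Int32(strings.count)
        strings.append(s)
        indexByString[s] = index
        return index
    }

    func string(at index: Int32) -> String {
        return strings[Int(index)]
    }
}

let places = StringPool()
let first = places.intern("Springfield")
let second = places.intern("Springfield")     // same index; the text is stored once
print(first == second, places.string(at: first))   // true Springfield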
Use the string encoding that gives the smallest in-memory representation. Storing everything as a native Windows Unicode string will probably consume much more space than storing strings in UTF-8, unless you deal regularly with strings that contain mostly characters which need three or more bytes in the UTF-8 encoding.
Due to the necessary character set conversion your program will need more CPU cycles for displaying strings, but with that amount of data it's a worthy trade-off, as memory access will be the bottleneck, and smaller data size helps with decreasing memory access load.
One question: How do you query: do you match the strings or query on an ID or position in the list?
Best for small # strings:
Whatever makes your program easy to understand. Program readability is very important and you should only sacrifice it in real hotspots in your application for speed.
Best for memory (if that is the largest constraint) and load times:
Keep all strings in a single memory buffer (or memory-mapped file) and only keep pointers to the strings (or offsets). Whenever you need a string you can clip out a string using two pointers and return it as a Delphi string. This way you avoid the overhead of the string structure itself (refcount, length int, codepage int) and the memory manager structures for each string allocation.
This only works fine if the strings are static and don't change.
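A rough sketch of that layout (in Swift rather than Delphi, with invented names; a Delphi version would clip strings out of the buffer the same way):

// All string bytes live in one contiguous buffer; per string we keep only the
// offset where it starts (its end is simply the next offset).
let allBytes = Array("AliceBobCharlotte".utf8)
let offsets = [0, 5, 8, allBytes.count]          // start positions plus an end sentinel

func string(at i: Int) -> String {
    let slice = allBytes[offsets[i]..<offsets[i + 1]]
    return String(decoding: slice, as: UTF8.self) // rebuilt on demand, no per-string header
}

print(string(at: 1))   // "Bob"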
TList, TList<>, array of string and the solution above have a "list" overhead of one pointer per string. A linked list has an overhead of at least 2 pointers (singly linked list) or 3 pointers (doubly linked list). The linked-list solution does not have fast random access, but allows for O(1) resizes, where the other options need O(lg N) resizes (when growing by a factor) or O(N) resizes (when growing by a fixed amount).
What I would do:
If there are fewer than 1000 items and performance is not of the utmost importance: use TStringList or a dynamic array, whichever is easiest for you.
Else, if the data is static: use the trick above. This will give you O(lg N) query time, the least memory use, and very fast load times (just gulp it in or use a memory-mapped file).
All the structures mentioned in your question will fail with large amounts of data (1M+ strings) that need to be dynamically changed in code. At that point I would use a balanced binary tree or a hash table, depending on the type of queries I need to make.
From your description, I'm not entirely sure if it could fit in your design but one way you could improve on memory usage without suffering a huge performance penalty is by using a trie.
Advantages relative to binary search tree
The following are the main advantages of tries over binary search trees (BSTs):
Looking up keys is faster. Looking up a key of length m takes worst case O(m) time. A BST performs O(log(n)) comparisons of keys, where n is the number of elements in the tree, because lookups depend on the depth of the tree, which is logarithmic in the number of keys if the tree is balanced. Hence in the worst case, a BST takes O(m log n) time. Moreover, in the worst case log(n) will approach m. Also, the simple operations tries use during lookup, such as array indexing using a character, are fast on real machines.
Tries can require less space when they contain a large number of short strings, because the keys are not stored explicitly and nodes are shared between keys with common initial subsequences.
Tries facilitate longest-prefix matching, helping to find the key sharing the longest possible prefix of characters, all unique.
Possible alternative:
I've recently discovered SynBigTable (http://blog.synopse.info/post/2010/03/16/Synopse-Big-Table) which has a TSynBigTableString class for storing large amounts of data using a string index.
It's a very simple, single-layer bigtable implementation, and it mainly uses disk storage, so it consumes a lot less memory than expected when storing hundreds of thousands of records.
As simple as:
aId := UTF8String(Format('%s.%s', [name, surname]));
bigtable.Add(data, aId)
and
bigtable.Get(aId, data)
One catch, indexes must be unique, and the cost of update is a bit high (first delete, then re-insert)
TStringList stores an array of pointer to (string, TObject) records.
TList stores an array of pointers.
TStringBuilder cannot store a collection of strings. It is similar to .NET's StringBuilder and should only be used to concatenate (many) strings.
Resizing dynamic arrays is slow, so do not even consider it as an option.
I would use Delphi's generic TList<string> in all your scenarios. It stores an array of strings (not string pointers). It should have faster access in all cases due to no (un)boxing.
You may be able to find or implement a slightly better linked-list solution if you only want sequential access. See Delphi Algorithms and Data Structures.
Delphi promotes its TList and TList<>. The internal array implementation is highly optimized and I have never experienced performance/memory issues when using it. See Efficiency of TList and TStringList

Minimizing the memory footprint of an ADO.NET DataSet?

Given a legacy system that is making heavy use of DataSets and little or no possibility of replacing these with business objects or other, more efficient data structures:
Are there any techniques for reducing the memory footprint of a DataSet?
I am thinking about things like setting initial capacity (when known), removing restrictions, etc., but I have little experience with DataSets and do not know which specific options might be available to me or if any of them would matter at all.
Update:
I am aware of the long-term refactoring possibilities, but I am looking for quick fixes given a set of DataTable objects stored in a DataSet, i.e. which properties are known to affect memory overhead.
Due to the way data is stored internally, setting the initial capacity could be one method, as this would prevent the object from allocating an arbitrarily large amount of memory when adding just one more row.
This is unlikely to help you, but it can help greatly in some cases.
If you are storing a lot of strings that are the same in the dataset, e.g. names of towns, look at using only a single string object for each distinct string.
e.g.
// dataTable is an existing System.Data.DataTable with a string "Town" column.
// Map each distinct town name to a single shared string instance.
var towns = new Dictionary<string, string>();
foreach (DataRow row in dataTable.Rows)
{
    var town = (string)row["Town"];
    if (towns.ContainsKey(town))
    {
        row["Town"] = towns[town];   // reuse the shared instance
    }
    else
    {
        towns[town] = town;          // first occurrence of this value
    }
}
Then the GC can reclaim most of the duplicate strings; however, this only works if the dataset lives for a long time.
You may wish to do this in the rowCreated event, so that the duplicate string objects are not created in the first place.
If you're using VS2005+, you can instantiate DataTable objects rather than a whole DataSet. In 2003, if a DataTable is instantiated, it comes with a DataSet by default; in 2005 and later, you get just the DataTable.
Look at your Data Access layer for filling the DataSets or DataTables. It's most often the case that there is too much data coming through. Make your queries more specific.
Make sure the code you're using does not do goofy things like copy the DataSets when they're passed around. Make sure you're using .Select statements or DataViews to filter and sort, rather than making copies.
There aren't a whole lot of quick "optimizations" for DataSets. If you're having trouble with memory, use items 2 and 3. This would be the case regardless of what type of data transport object you'd use.
And get good at DataSets. If you're not familiar with them, you can do silly things, like with anything. Then you'll write articles about how they suck, which are really articles about how little you know about them. They're really quite useful and simple to maintain. A couple tips:
Use typed DataSets. They'll save you gobs of coding and they're typed, which helps with simple validation.
If you're using typed DSs, make sure you don't modify the generated code file. If you're using VS2005+, you can put any custom business object behavior in the partial class for the DS (not the .designer code file).
Use DataView and .Select wherever you find yourself looping through DataRow objects.
Look around for a good code generation tool and build a rational data access framework for filling and updating from the DSs. One of the issues is that sometimes, designers tie the design of the DS directly to tables in the db, making the design brittle to data structure changes. If you -must- do that, build or use a code generator to build your data access layer from the db, like CodeSmith. Start by looking at some of the CodeSmith templates for generating stored procs and data access classes.
Remember when talking to someone about "objects" vs. "DataSets", the object in this case is the DataRow, not the DataSet. And because of the partial classes you can put behavior on the "object", getting you 95% of the benefits of "objects" for those who love writing code.
You could try making your tables and rows implement interfaces in the code-behind files. Then, over time, change your code to make use of these interfaces rather than the tables/rows directly.
Once most of your code just uses the interfaces, you could use a code generator to create C# classes that implement those interfaces without the overhead of rows/tables.
However, it may be cheaper just to move to 64 bit and buy more RAM...
