Related
Is there a way to filter based on historical data?
For example: "Show me all objects who had "Attribute_X" == True on 01/01/2013"
As Steve stated, this would require an advanced DXL script.
I'm not sure about creating a filter on this, but identifying those objects you are looking for, I might be able to help. Having recently solved a similar task, I recommend to start with Tony Goodman's really excellent Smart History Viewer (this code could be used as DXL tutorial!) which has almost all the code you need. You just need to find and understand it.
Let me elaborate. Besides other nifty stuff, the history viewer basically does:
For all (selected) baselines, explicitly including un-baselined current version: gather all module changes and put them into a two-dimensional Skip list each, for module/object/session changes. Focus on the object changes.
There is an unused function printObjectHistory in the code which helps understanding the data structures. Have a look at the inner loop
for hist in skipHistory do
Inside this loop, consider only changes which happened before "01/01/2013" (check hist->HIST_DATE to obtain this information). The history viewer code already classified the detected changes, so you want to watch out for changes which contain the string "Modify Attribute: Attribute_X". Assign the new value to a buffer. Outside this loop, check if the buffer contains "True". If so, you this is one of the objects you wanted to find.
In the past I had to work with big files, somewhere about in the 0.1-3GB range. Not all the 'columns' were needed so it was ok to fit the remaining data in RAM.
Now I have to work with files in 1-20GB range, and they will probably grow as the time will pass. That is totally different because you cannot fit the data in RAM anymore.
My file contains several millions of 'entries' (I have found one with 30 mil entries). On entry consists in about 10 'columns': one string (50-1000 unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user only the top entries (1-30%) are relevant, the rest is low quality data.
So, I need some suggestions about in which direction to head out. I definitively don't want to put data in a DB because they are hard to install and configure for non computer savvy persons. I like to deliver a monolithic program.
Showing the data is not difficult at all. But sorting... without loading the data in RAM, on regular PCs (2-6GB RAM)... will kill some good hours.
I was looking a bit into MMF (memory mapped files) but this article from Danny Thorpe shows that it may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/
So, I was thinking about loading only the data from the column that has to be sorted in ram AND a pointer to the address (into the disk file) of the 'entry'. I sort the 'column' then I use the pointer to find the entry corresponding to each column cell and restore the entry. The 'restoration' will be written directly to disk so no additional RAM will be required.
PS: I am looking for a solution that will work both on Lazarus and Delphi because Lazarus (actually FPC) has 64 bit support for Mac. 64 bit means more RAM available = faster sorting.
I think a way to go is Mergesort, it's a great algorithm for sorting a
large amount of fixed records with limited memory.
General idea:
read N lines from the input file (a value that allows you to keep the lines in memory)
sort these lines and write the sorted lines to file 1
repeat with the next N lines to obtain file 2
...
you reach the end of the input file and you now have M files (each of which is sorted)
merge these files into a single file (you'll have to do this in steps as well)
You could also consider a solution based on an embedded database, e.g. Firebird embedded: it works well with Delphi/Windows and you only have to add some DLL in your program folder (I'm not sure about Lazarus/OSX).
If you only need a fraction of the whole data, scan the file sequentially and keep only the entries needed for display. F.I. lets say you need only 300 entries from 1 million. Scan the first first 300 entries in the file and sort them in memory. Then for each remaining entry check if it is lower than the lowest in memory and skip it. If it is higher as the lowest entry in memory, insert it into the correct place inside the 300 and throw away the lowest. This will make the second lowest the lowest. Repeat until end of file.
Really, there are no sorting algorithms that can make moving 30gb of randomly sorted data fast.
If you need to sort in multiple ways, the trick is not to move the data itself at all, but instead to create an index for each column that you need to sort.
I do it like that with files that are also tens of gigabytes long, and users can sort, scroll and search the data without noticing that it's a huge dataset they're working with.
Please finde here a class which sorts a file using a slightly optimized merge sort. I wrote that a couple of years ago for fun. It uses a skip list for sorting files in-memory.
Edit: The forum is german and you have to register (for free). It's safe but requires a bit of german knowledge.
If you cannot fit the data into main memory then you are into the realms of external sorting. Typically this involves external merge sort. Sort smaller chunks of the data in memory, one by one, and write back to disk. And then merge these chunks.
I'm pretty sure this is a silly newbie question but I didn't know it so I had to ask...
Why do we use data structures, like Linked List, Binary Search Tree, etc? (when no dynamic allocation is needed)
I mean: wouldn't it be faster if we kept a single variable for a single object? Wouldn't that speed up access time? Eg: BST possibly has to run through some pointers first before it gets to the actual data.
Except for when dynamic allocation is needed, is there a reason to use them?
Eg: using linked list/ BST / std::vector in a situation where a simple (non-dynamic) array could be used.
Each thing you are storing is being kept in it's own variable (or storage location). Data structures apply organization to your data. Imagine if you had 10,000 things you were trying to track. You could store them in 10,000 separate variables. If you did that, then you'd always be limited to 10,000 different things. If you wanted more, you'd have to modify your program and recompile it each time you wanted to increase the number. You might also have to modify the code to change the way in which the calculations are done if the order of the items changes because the new one is introduced in the middle.
Using data structures, from simple arrays to more complex trees, hash tables, or custom data structures, allows your code to both be more organized and extensible. Using an array, which can either be created to hold the required number of elements or extended to hold more after it's first created keeps you from having to rewrite your code each time the number of data items changes. Using an appropriate data structure allows you to design algorithms based on the relationships between the data elements rather than some fixed ordering, giving you more flexibility.
A simple analogy might help to understand. You could, for example, organize all of your important papers by putting each of them into separate filing cabinet. If you did that you'd have to memorize (i.e., hard-code) the cabinet in which each item can be found in order to use them effectively. Alternatively, you could store each in the same filing cabinet (like a generic array). This is better in that they're all in one place, but still not optimum, since you have to search through them all each time you want to find one. Better yet would be to organize them by subject, putting like subjects in the same file folder (separate arrays, different structures). That way you can look for the file folder for the correct subject, then find the item you're looking for in it. Depending on your needs you can use different filing methods (data structures/algorithms) to better organize your information for it's intended use.
I'll also note that there are times when it does make sense to use individual variables for each data item you are using. Frequently there is a mixture of individual variables and more complex structures, using the appropriate method depending on the use of the particular item. For example, you might store the sum of a collection of integers in a variable while the integers themselves are stored in an array. A program would need to be pretty simple though before the introduction of data structures wouldn't be appropriate.
Sorry, but you didn't just find a great new way of doing things ;) There are several huge problems with this approach.
How could this be done without requring programmers to massively (and nontrivially) rewrite tons of code as soon as the number of allowed items changes? Even when you have to fix your data structure sizes at compile time (e.g. arrays in C), you can use a constant. Then, changing a single constant and recompiling is sufficent for changes to that size (if the code was written with this in mind). With your approach, we'd have to type hundreds or even thousands of lines every time some size changes. Not to mention that all this code would be incredibly hard to read, write, maintain and verify. The old truism "more lines of code = more space for bugs" is taken up to eleven in such a setting.
Then there's the fact that the number is almost never set in stone. Even when it is a compile time constant, changes are still likely. Writing hundreds of lines of code for a minor (if it exists at all) performance gain is hardly ever worth it. This goes thrice if you'd have to do the same amount of work again every time you want to change something. Not to mention that it isn't possible at all once there is any remotely dynamic component in the size of the data structures. That is to say, it's very rarely possible.
Also consider the concept of implicit and succinct data structures. If you use a set of hard-coded variables instead of abstracting over the size, you still got a data structure. You merely made it implicit, unrolled the algorithms operating on it, and set its size in stone. Philosophically, you changed nothing.
But surely it has a performance benefit? Well, possible, although it will be tiny. But it isn't guaranteed to be there. You'd save some space on data, but code size would explode. And as everyone informed about inlining should know, small code sizes are very useful for performance to allow the code to be in the cache. Also, argument passing would result in excessive copying unless you'd figure out a trick to derive the location of most variables from a few pointers. Needless to say, this would be nonportable, very tricky to get right even on a single platform, and liable to being broken by any change to the code or the compiler invocation.
Finally, note that a weaker form is sometimes done. The Wikipedia page on implicit and succinct data structures has some examples. On a smaller scale, some data structures store much data in one place, such that it can be accessed with less pointer chasing and is more likely to be in the cache (e.g. cache-aware and cache-oblivious data structures). It's just not viable for 99% of all code and taking it to the extreme adds only a tiny, if any, benefit.
The main benefit to datastructures, in my opinion, is that you are relationally grouping them. For instance, instead of having 10 separate variables of class MyClass, you can have a datastructure that groups them all. This grouping allows for certain operations to be performed because they are structured together.
Not to mention, having datastructures can potentially enforce type security, which is powerful and necessary in many cases.
And last but not least, what would you rather do?
string string1 = "string1";
string string2 = "string2";
string string3 = "string3";
string string4 = "string4";
string string5 = "string5";
Console.WriteLine(string1);
Console.WriteLine(string2);
Console.WriteLine(string3);
Console.WriteLine(string4);
Console.WriteLine(string5);
Or...
List<string> myStringList = new List<string>() { "string1", "string2", "string3", "string4", "string5" };
foreach (string s in myStringList)
Console.WriteLine(s);
For example, in Fallout 3, a save game stores the state and location of every single object and NPC in the game, and only takes up a few MB's. How do they do that!?!?
And then, during game play, how is this data added/retrieved in/from memory such that it can be displayed to the player in real-time?
UPDATED: (I'm going to make you work for your answers :P)
Based on Kevin Crowell's answer...
So I guess you would have a rendering distance that would apply to objects and NPC's, and you would "SELECT" the objects and NPC's within the given range. However, what type of data store would you use in order to get these objects?
In other words, you would you have a gigantic array of every object in the game, and constantly update a smaller list that holds the visible objects to render?
Also, per Chaos' answer...
Would would happen if you eventually touched every object in the game? Would your save game get bigger and bigger? In the case of Fallout 3, I'm pretty sure there aren't "stages", where the past data could just be dropped. Everything is persisted when you leave/return to a location. So how do you think this specific case is implemented?
With all the big hardisks nowaday, even developers seem to forget how many bytes there are in a megabyte. So to answer the question in the title: games store large amounts of data by creating savegames that are several megabyte large.
To illustrate how big a megabyte is, it's 8 million bits. That is sufficient to encode 2^8000000 = 10^2666666 states. In comparison, there are only 10^80 atoms in the universe. Now in a (save)game there are multiple subsystems with distinct states; e.g. in a RPG each NPC has its own state. But how much of a state is there, really? Their position in a town might be saved as 16 bits (do you remember their exact position if they're walking around anyway?). Their mood/disposition/etc as another 8 bits, and that allows for more emotions then some people have.
When it comes to storing this kind of data in-game, the typical datastructure is a QuadTree. This is a datastructure that allows you to determine objects in a certain X-Y region in O(log N). In some cases, game developers find it easier to pre-partition the world in zones. This reduces the amount of run-time calculations. A good example was Doom. Its maps had visibility pre-calculated; for each point one could determine quickly to which zone it belonged, and for each zone the amount of visible objects was pre-calculated. This reduced the amount of objects that needed runtime visibility checks.
It can simply be mapping objects, or NPCs, to an X,Y,Z coordinate plane. That information that be stored cheaply.
During gameplay, all of those objects are still mapped to a coordinate system at all times. They just need to read in the save information and start from there.
I think you're overestimating the complexity of what's being stored/retrieved. You don't need to store the 3D models for the objects, or their textures, or any of the things that make up large parts of a game's size-on-disk.
First of all, as chaos mentioned, it's only necessary to store information about things that have been moved. Even then, you probably only need to store, for each of those, the new position and orientation (assuming there's not other variables involved, like "damaged"). So that's two vectors for each object, which will be around a grand total of 24 bytes per object. That means you can store the information for 40,000 objects per megabyte. That's an awful lot of objects to have moved around.
Restoring this data is no more complex than placing the objects in the first place. Every object has to have a default position/orientation defined for the game to put it somewhere, so all you're doing is replacing the default with the stored value in the save file. This is not complex, and doesn't require any significant additional processing.
In Fallout 3 in particular, the map is divided in a grid fashion. You can only see your current square and the ones immediately next. The type of data store is not really important - can be a SQLite database, can be a tree serialized to disk, or can be something else entirely.
...you would you have a
gigantic array of every object in the
game, and constantly update a smaller
list that holds the visible objects to
render?
Generally yes, but the "gigantic array" doesn't need to be in memory. And there are more lists - objects in current and adjacent grid square (you can be attacked from behind - not in visible list), the visible list, the timer list...
Would would happen if you eventually touched every object in the game? Would your save
game get bigger and bigger?
Could - if there is a default state table for everything, the save can contain only the differences. The save will then grow as you progress.
Everything is persisted when you leave/return to a location.
Nope. Items you drop outside of your house will eventually disappear. Bodies too, probably. Random monsters are respawned every once in a while. This is both convenient to game designers and consistent with the real world.
If you think about the information you need to save it's really not that much;
E.g.
Position
Orientation
Inventory
Health
Objective-state
There are lots more of course, many of which dependend on both the type of game and how the save structure is organized.
Some games like Resident Evil only allow saves when you enter a new zone meaning you don't have to store all the information for entities in both zones. When you "load" a save their attributes come from the disc.
As to how this is data is retrieved/modofied, I'm not quite sure I understand. It's just data in the consoles memory. When the player saves it's written to the save device, and when they load it's restored.
One major technique is differential saves: only saving state that's something other than its default. Compare and contrast "saving the state and location of every object in the game world" with "saving the state and location of every object in the game world that the player has moved or altered".
Echoing the other answers, the biggest savings comes from eliminating all unnecessary state data.
If you look at 8-bit side-scroller games, they will start discarding state as soon as things are offscreen, and oftentimes retain nothing, because their resources are too tight to keep around more than the minimum number of instances.
Doing it on the macro-level for a game like Fallout 3 is just a matter of increasing the scope of the problem. You start sectioning up the landscape by grid or other geometrical methods, and spawn/despawn stuff as the player moves from one section to the next. You ideally keep the size of each area small so that in-memory state is not high. You figure out the bare minimum of state needed to keep around NPC and item instances, and in the layout data you tag as much as possible to auto-respawn so that it doesn't need any state saved.
If you want to be pointed at a specific data structure, an example serialization format might be a linear stream indexed by a tree of pointers, where the organization of the tree corresponds to the map layout.
On a related note, game engines often employ Zip compression, to keep the size of all that content down and also make some operations faster.
Besides what everybody else said, i would like to add state doesn't necessarly imply just position and movement,but also properites for the respective state. Usually a Game Engine has a feature witch allows you to save the data of a certain class.
Say you have a Player class and you are well into the story, when you click save the possible data that can be stored is :
Where is the player located in the
level/map
What are his attributes :
health,mana,strenght,
intelligence,etc
What skills does he have.
What level is he.
Globally we can also have:
How many references (names that allow the engine to pick up an object from a list) to objects are being stored in that specific level,in other words when you load what objects should be loaded along with it.
Are we using physics, if so who uses it.
And many more. Fallout 3 has one type of save, another game will have another. It really depends on the genre and the engine in use.
Given a legacy system that is making heavy use of DataSets and little or no possibility of replacing these with business objects or other, more efficient data structures:
Are there any techniques for reducing the memory footprint of a DataSet?
I am thinking about things like setting initial capacity (when known), removing restrictions, etc., but I have little experience with DataSets and do not know which specific options might be available to me or if any of them would matter at all.
Update:
I am aware of the long-term refactoring possibilities, but I am looking for quick fixes given a set of DataTable objects stored in a DataSet, i.e. which properties are known to affect memory overhead.
Due to the way data is stored internally, setting the initial capacity could be one method, as this would prevent the object from allocating an arbitrarily large amount of memory when adding just one more row.
This is unluckily to help you, but it can help greatly in same cases.
If you are storing a lot of strings that are the same in the dataset, e.g. names of Towns, look at only using a single string object with each distinct string.
e.g.
Directory <string, string> towns = new Directory <string, string>();
foreach(var row in datatable)
{
if (towns.contains(row.town))
{
row.town = towns[row.town]
}
else
{
towns[row.town] = row.town;
}
}
Then the GC can reclaim most of the duplicate strings, however this only works if the datasets lives for along time.
You may wish to do this in the rowCreated event, so that all the duplicate string objects are not created in first place.
If you're using VS2005+, you can instantiate DataTable objects, rather than the whole DataSet. In 2003, if the DataTable is instantiated, it comes with DataSet by default. 2005 and after, you get just the DataTable.
Look at your Data Access layer for filling the DataSets or DataTables. It's most often the case that there is too much data coming through. Make your queries more specific.
Make sure the code you're using does not do goofy things like copy the DataSets when they're passed around. Make sure you're using .Select statements or DataViews to filter and sort, rather than making copies.
There aren't a whole lot of quick "optimizations" for DataSets. If you're having trouble with memory, use items 2 and 3. This would be the case regardless of what type of data transport object you'd use.
And get good at DataSets. If you're not familiar with them, you can do silly things, like with anything. Then you'll write articles about how they suck, which are really articles about how little you know about them. They're really quite useful and simple to maintain. A couple tips:
Use typed DataSets. They'll save you gobs of coding and they're typed, which helps with simple validation.
If you're using typed DSs, make sure you don't modify the generated code file. If you're using VS2005+, you can put any custom business object behavior in the partial class for the DS (not the .designer code file).
Use DataView and .Select wherever you find yourself looping through DataRow objects.
Look around for a good code generation tool and build a rational data access framework for filling and updating from the DSs. One of the issues is that sometimes, designers tie the design of the DS directly to tables in the db, making the design brittle to data structure changes. If you -must- do that, build or use a code generator to build your data access layer from the db, like CodeSmith. Start by looking at some of the CodeSmith templates for generating stored procs and data access classes.
Remember when talking to someone about "objects" vs. "DataSets", the object in this case is the DataRow, not the DataSet. And because of the partial classes you can put behavior on the "object", getting you 95% of the benefits of "objects" for those who love writing code.
You could try making your tables and rows implement interfaces in the code behind files. Then over time change your code to make use of these interfaces rather then the table/rows directly.
Once most of your code just uses the interfaces, you could use a code generate to create C# class that implement those interfaces without the overhead of rows/tables.
However it may be cheaper just to move to 64 bit and buy more ram...