How do video games efficiently store/retrieve large amounts of data? - storage

For example, in Fallout 3, a save game stores the state and location of every single object and NPC in the game, and only takes up a few MB. How do they do that!?!?
And then, during game play, how is this data added/retrieved in/from memory such that it can be displayed to the player in real-time?
UPDATED: (I'm going to make you work for your answers :P)
Based on Kevin Crowell's answer...
So I guess you would have a rendering distance that would apply to objects and NPCs, and you would "SELECT" the objects and NPCs within the given range. However, what type of data store would you use in order to get these objects?
In other words, would you have a gigantic array of every object in the game, and constantly update a smaller list that holds the visible objects to render?
Also, per Chaos' answer...
What would happen if you eventually touched every object in the game? Would your save game get bigger and bigger? In the case of Fallout 3, I'm pretty sure there aren't "stages", where the past data could just be dropped. Everything is persisted when you leave/return to a location. So how do you think this specific case is implemented?

With all the big hard disks nowadays, even developers seem to forget how many bytes there are in a megabyte. So to answer the question in the title: games store large amounts of data by creating savegames that are several megabytes in size.
To illustrate how big a megabyte is: it's 8 million bits. That is sufficient to encode 2^8000000 ≈ 10^2408240 states. In comparison, there are only about 10^80 atoms in the observable universe. Now in a (save)game there are multiple subsystems with distinct states; e.g. in an RPG each NPC has its own state. But how much of a state is there, really? Their position in a town might be saved as 16 bits (do you remember their exact position if they're walking around anyway?). Their mood/disposition/etc. as another 8 bits, and that allows for more emotions than some people have.
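The size of that state space and the per-NPC bit budget can be checked with a short sketch. The field layout (16-bit position, 8-bit mood) follows the example above; the function names are made up for illustration and are not from any real engine.

```python
import math

# Exponent of 10 in 2**8_000_000, i.e. how many decimal digits that number has.
BITS_PER_MEGABYTE = 8_000_000
exponent = BITS_PER_MEGABYTE * math.log10(2)  # roughly 2.4 million

def pack_npc(position, mood):
    """Pack a 16-bit position and an 8-bit mood into 3 bytes."""
    if not 0 <= position < 2**16 or not 0 <= mood < 2**8:
        raise ValueError("field out of range")
    return position.to_bytes(2, "little") + mood.to_bytes(1, "little")

def unpack_npc(blob):
    """Inverse of pack_npc: returns (position, mood)."""
    return int.from_bytes(blob[:2], "little"), blob[2]

# At 3 bytes per NPC, a (decimal) megabyte holds this many NPC records.
npcs_per_megabyte = 1_000_000 // 3
```

With this layout a megabyte has room for a few hundred thousand NPC records, which is far more than any town needs.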
When it comes to storing this kind of data in-game, the typical data structure is a QuadTree. This is a data structure that allows you to find the objects in a certain X-Y region in roughly O(log N) time. In some cases, game developers find it easier to pre-partition the world into zones. This reduces the amount of run-time calculation. A good example was Doom. Its maps had visibility pre-calculated; for each point one could quickly determine which zone it belonged to, and for each zone the visible objects were pre-calculated. This reduced the number of objects that needed runtime visibility checks.
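As a rough illustration of the QuadTree idea, here is a minimal point quadtree with a rectangular range query. It is only a sketch under simple assumptions (square world, points only, a made-up node capacity of 4), not how any particular game implements it.

```python
from dataclasses import dataclass, field

@dataclass
class QuadTree:
    x: float            # left edge of this node's square
    y: float            # bottom edge of this node's square
    size: float         # side length
    capacity: int = 4   # points a leaf holds before it splits (assumed value)
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def insert(self, px, py, obj):
        """Insert obj at (px, py); returns False if the point is outside this node."""
        if not (self.x <= px < self.x + self.size and self.y <= py < self.y + self.size):
            return False
        if not self.children:
            self.points.append((px, py, obj))
            if len(self.points) > self.capacity and self.size > 1e-6:
                self._split()
            return True
        for child in self.children:
            if child.insert(px, py, obj):
                return True
        self.points.append((px, py, obj))  # fallback for rounding at child edges
        return True

    def _split(self):
        half = self.size / 2
        self.children = [QuadTree(self.x + dx, self.y + dy, half, self.capacity)
                         for dx in (0, half) for dy in (0, half)]
        old, self.points = self.points, []
        for px, py, obj in old:
            self.insert(px, py, obj)  # now routed into the children

    def query(self, x0, y0, x1, y1):
        """Return all objects with x0 <= px <= x1 and y0 <= py <= y1."""
        if (x1 < self.x or x0 >= self.x + self.size or
                y1 < self.y or y0 >= self.y + self.size):
            return []  # query rectangle does not touch this node: prune the subtree
        found = [o for px, py, o in self.points if x0 <= px <= x1 and y0 <= py <= y1]
        for child in self.children:
            found.extend(child.query(x0, y0, x1, y1))
        return found
```

The pruning test at the top of query is what keeps the search from visiting the whole world: subtrees that do not overlap the requested region are skipped entirely.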

It can simply be mapping objects, or NPCs, to an X,Y,Z coordinate system. That information can be stored cheaply.
During gameplay, all of those objects are still mapped to a coordinate system at all times. They just need to read in the save information and start from there.

I think you're overestimating the complexity of what's being stored/retrieved. You don't need to store the 3D models for the objects, or their textures, or any of the things that make up large parts of a game's size-on-disk.
First of all, as chaos mentioned, it's only necessary to store information about things that have been moved. Even then, you probably only need to store, for each of those, the new position and orientation (assuming there are no other variables involved, like "damaged"). So that's two vectors for each object; with three 4-byte floats per vector, that comes to a grand total of 24 bytes per object. That means you can store the information for roughly 40,000 objects per megabyte. That's an awful lot of objects to have moved around.
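To make the 24-byte figure concrete, here is a small sketch using Python's struct module. It assumes the orientation is stored as three Euler angles and that 32-bit floats are precise enough; a real engine might well use quaternions or a different layout.

```python
import struct

# Position (x, y, z) followed by orientation (three angles, an assumption),
# each as a little-endian 32-bit float: 6 * 4 = 24 bytes.
RECORD = struct.Struct("<6f")

def save_object(position, orientation):
    """Serialize one moved object into a fixed-size binary record."""
    return RECORD.pack(*position, *orientation)

def load_object(blob):
    """Inverse of save_object: returns (position, orientation) tuples."""
    values = RECORD.unpack(blob)
    return values[:3], values[3:]

# How many such records fit into one (decimal) megabyte.
objects_per_megabyte = 1_000_000 // RECORD.size
```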
Restoring this data is no more complex than placing the objects in the first place. Every object has to have a default position/orientation defined for the game to put it somewhere, so all you're doing is replacing the default with the stored value in the save file. This is not complex, and doesn't require any significant additional processing.

In Fallout 3 in particular, the map is divided in a grid fashion. You can only see your current square and the ones immediately next. The type of data store is not really important - can be a SQLite database, can be a tree serialized to disk, or can be something else entirely.
"...would you have a gigantic array of every object in the game, and constantly update a smaller list that holds the visible objects to render?"
Generally yes, but the "gigantic array" doesn't need to be in memory. And there are more lists - objects in current and adjacent grid square (you can be attacked from behind - not in visible list), the visible list, the timer list...
"What would happen if you eventually touched every object in the game? Would your save game get bigger and bigger?"
Could - if there is a default state table for everything, the save can contain only the differences. The save will then grow as you progress.
Everything is persisted when you leave/return to a location.
Nope. Items you drop outside of your house will eventually disappear. Bodies too, probably. Random monsters are respawned every once in a while. This is both convenient to game designers and consistent with the real world.

If you think about the information you need to save it's really not that much;
E.g.
Position
Orientation
Inventory
Health
Objective-state
There are lots more of course, many of which depend on both the type of game and how the save structure is organized.
Some games like Resident Evil only allow saves when you enter a new zone meaning you don't have to store all the information for entities in both zones. When you "load" a save their attributes come from the disc.
As to how this data is retrieved/modified, I'm not quite sure I understand. It's just data in the console's memory. When the player saves it's written to the save device, and when they load it's restored.

One major technique is differential saves: only saving state that's something other than its default. Compare and contrast "saving the state and location of every object in the game world" with "saving the state and location of every object in the game world that the player has moved or altered".
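A minimal sketch of a differential save might look like this. The object names and the DEFAULTS table are invented for illustration; the point is only that the save file holds the differences and the defaults come from the game data.

```python
import copy
import json

# Hypothetical default world state, as shipped with the game data.
DEFAULTS = {
    "crate_01": {"pos": [10, 0, 4], "looted": False},
    "door_07": {"pos": [3, 0, 9], "open": False},
    "npc_guard": {"pos": [5, 0, 5], "hp": 100},
}

def make_save(world):
    """Serialize only the objects whose state differs from the defaults."""
    diff = {name: state for name, state in world.items()
            if DEFAULTS.get(name) != state}
    return json.dumps(diff, sort_keys=True)

def load_save(text):
    """Start from a copy of the defaults and overlay the saved differences."""
    world = copy.deepcopy(DEFAULTS)
    world.update(json.loads(text))
    return world
```

An untouched world saves as an empty object, and the file grows only as the player changes things.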

Echoing the other answers, the biggest savings comes from eliminating all unnecessary state data.
If you look at 8-bit side-scroller games, they will start discarding state as soon as things are offscreen, and oftentimes retain nothing, because their resources are too tight to keep around more than the minimum number of instances.
Doing it on the macro-level for a game like Fallout 3 is just a matter of increasing the scope of the problem. You start sectioning up the landscape by grid or other geometrical methods, and spawn/despawn stuff as the player moves from one section to the next. You ideally keep the size of each area small so that in-memory state is not high. You figure out the bare minimum of state needed to keep around NPC and item instances, and in the layout data you tag as much as possible to auto-respawn so that it doesn't need any state saved.
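The sectioning idea can be sketched as follows, assuming a square grid and a made-up cell size. Only the cell the player is in and its neighbours are kept active; moving across a cell border yields a set of cells to spawn and a set to despawn.

```python
CELL_SIZE = 64.0  # hypothetical cell edge length in world units

def cell_of(x, y):
    """Grid cell containing the world position (x, y)."""
    return int(x // CELL_SIZE), int(y // CELL_SIZE)

def active_cells(x, y, radius=1):
    """The player's cell plus all neighbours within `radius` cells."""
    cx, cy = cell_of(x, y)
    return {(cx + dx, cy + dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)}

def step(old_pos, new_pos):
    """Return (cells to load, cells to unload) for a player move."""
    before = active_cells(*old_pos)
    after = active_cells(*new_pos)
    return after - before, before - after
```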
If you want to be pointed at a specific data structure, an example serialization format might be a linear stream indexed by a tree of pointers, where the organization of the tree corresponds to the map layout.

On a related note, game engines often employ Zip compression, to keep the size of all that content down and also make some operations faster.
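As a small illustration, Python's zlib module implements DEFLATE, the compression method used by Zip. Save data tends to be repetitive, so it usually compresses well; the helper names here are just for the example.

```python
import zlib

def compress_save(raw):
    """Compress raw save bytes with DEFLATE at the highest compression level."""
    return zlib.compress(raw, 9)

def decompress_save(packed):
    """Restore the original save bytes."""
    return zlib.decompress(packed)
```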

Besides what everybody else said, I would like to add that state doesn't necessarily imply just position and movement, but also the properties for the respective state. Usually a game engine has a feature which allows you to save the data of a certain class.
Say you have a Player class and you are well into the story. When you click save, the data that may be stored includes:
Where is the player located in the level/map?
What are his attributes: health, mana, strength, intelligence, etc.?
What skills does he have?
What level is he?
Globally we can also have:
How many references (names that allow the engine to pick up an object from a list) to objects are being stored in that specific level; in other words, when you load, which objects should be loaded along with it.
Are we using physics, and if so, who uses it?
And many more. Fallout 3 has one type of save, another game will have another. It really depends on the genre and the engine in use.

Related

replicating trees between ACID RDB using CRDT

I'm interested in replicating "hierarchies" of data, say similar to addresses.
Area
District
Sector
Unit
but you may have different pieces of data associated with each layer, so you may know the area of sectors but not of units, and you may know the population of a unit; basically it's not a homogeneous tree.
I know little about replication of data beyond brushing against Brewer's theorem/CAP, and some naive intuition about what eventual consistency is.
I'm looking for SIMPLE mechanisms to replicate this data from an ACID RDB into other ACID RDBs. The system as a whole needs to eventually converge, and obviously each RDB will enforce its own locally consistent view, but any 2 nodes may not match at any given time (except 'eventually').
The simplest way to approach this is to simply store all the data in a single message from some designated leader and distribute it, like an overnight dump and load process, but that's too big.
So the next simplest thing (I thought) was: if something inside an area changes, I can export the complete set of data inside that area and load it into the nodes. That's still quite a coarse algorithm.
The next step was: if an 'object' at any level changed, send all the data in the path to that 'object', i.e. if something in a sector is amended, you would send the data associated with the sector, its parent the district, and its parent the area (with some sort of version stamp and, let's say, last update wins). What I wanted to do was to ensure that any replication 'update' was guaranteed to succeed (so it needs the whole path, which would potentially be created if it didn't exist).
Then I stumbled on CRDTs and thought... ah... I'm reinventing the wheel here, and the algorithms are allegedly easy in principle, but tricky to get correct in practice.
Are there standard, accepted patterns to do this sort of thing?
In my use case the hierarchies are quite shallow, and there is only a single designated leader (at this time), I'm quite attracted to state based CRDTs because then I can ignore ordering.
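For what it's worth, the "send the whole path, last update wins" idea can be sketched as a state-based merge like the one below. This is a toy illustration, not the SHELF algorithm or any standard library; it assumes paths are tuples such as ("Area", "District", "Sector"), and that every write carries a unique (version, node) stamp so that ties cannot occur.

```python
def merge(local, remote):
    """Join two replica states: per path, keep the entry with the larger stamp."""
    merged = dict(local)
    for path, entry in remote.items():
        if path not in merged or entry["stamp"] > merged[path]["stamp"]:
            merged[path] = entry
    return merged

def path_update(state, path, data, version, node):
    """Build the fragment to ship: the changed object plus all of its ancestors."""
    fragment = {}
    for depth in range(1, len(path) + 1):
        prefix = path[:depth]
        # Ancestors are sent as currently known, or as empty placeholders,
        # so the receiver can always create the full path.
        fragment[prefix] = state.get(prefix, {"stamp": (0, node), "data": {}})
    fragment[path] = {"stamp": (version, node), "data": data}
    return fragment
```

Because merge only ever keeps the larger stamp per path, applying the same fragments in a different order, or more than once, ends in the same state, which is the property that lets you ignore ordering.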
Simplicity is the key requirement.
Actually it appears I've reinvented (in a very crude naive way) the SHELF algorithm.
I'll write some code and see if I can get it to work, and try to understand whats going on.

Array of references that share an Arc

This one's kind of an open ended design question I'm afraid.
Anyway: I have a big two-dimensional array of stuff. This array is mutable, and is accessed by a bunch of threads. For now I've just been dealing with this as an Arc<Mutex<Vec<Vec<--owned stuff-->>>>, which has been fine.
The problem is that stuff is about to grow considerably in size, and I'll want to start holding references rather than complete structures. I could do this by inverting everything and going to Vec<Vec<Arc<Mutex<T>>>>, but I feel like that would be a ton of overhead, especially because each thread would need a complete copy of the grid rather than a single Arc/Mutex.
What I want to do is have this be an array of references, but somehow communicate that the items being referenced all live long enough according to a single top-level Arc or something similar. Is that possible?
As an aside, is Vec even the correct data type for this? For the grid in particular I really want a large, fixed-size block of memory that will live for the entire length of the program once it's initialized, and has a lot of reference locality (along either dimension.) Is there something else/more specialized I should be using?
EDIT: Giving some more specifics on my code (away from home so this is rough):
What I want:
Outer scope initializes a bunch of Ts and somehow collectively ensures they live long enough (that's the hard part)
Outer scope initializes a grid: Something<Vec<Vec<&T>>> that stores references to the Ts
Outer scope creates a bunch of threads and passes grid to them
Threads dive in and out of some sort of (probably RW) lock on grid, reading the Ts and changing the &Ts in the process.
What I have:
Outer thread creates a grid: Arc<RwLock<Vec<Vec<T>>>>
Arc::clone(&grid) handles are passed to individual threads
Read-heavy threads mostly share the lock and sometimes kick each other out for the writes.
The only problem with this is that the grid is storing actual Ts which might be problematically large. (Don't worry too much about the RwLock/thread exclusivity stuff, I think it's orthogonal to the question unless something about it jumps out at you.)
What I don't want to do:
Top level creates a bunch of Arc<Mutex<T>> for individual T
Top level creates a grid: Vec<Vec<Arc<Mutex<T>>>> and passes it to threads
The problem with that is that I worry about the size of Arc/Mutex on every grid element (I've been going up to 2000x2000 so far and may go larger). Also while the threads would lock each other out less (only if they're actually looking at the same square), they'd have to pick up and drop locks way more as they explore the array, and I think that would be worse than my current RwLock implementation.
Let me start off with your "aside" question, as I feel it's the one that can be answered:
As an aside, is Vec even the correct data type for this? For the grid in particular I really want a large, fixed-size block of memory that will live for the entire length of the program once it's initialized, and has a lot of reference locality (along either dimension.) Is there something else/more specialized I should be using?
The documentation of std::vec::Vec specifies that the layout is essentially a pointer with size information. That means that any Vec<Vec<T>> is a pointer to a densely packed array of pointers to densely packed arrays of Ts. So if block of memory means a contiguous block to you, then no, Vec<Vec<T>> cannot give you that. If that is part of your requirements, you'd have to deal with a data type (let's call it Grid) that is basically a (pointer, n_rows, n_columns) and define for yourself whether the layout should be row-first or column-first.
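The index arithmetic behind such a Grid is the same in any language. The sketch below uses Python rather than Rust purely to show the row-first layout over one flat buffer; the class and method names are illustrative, and it says nothing about ownership or locking.

```python
class Grid:
    """Row-major 2D grid backed by one flat list (a single block of slots)."""

    def __init__(self, n_rows, n_cols, fill=0):
        self.n_rows, self.n_cols = n_rows, n_cols
        self.cells = [fill] * (n_rows * n_cols)

    def _index(self, row, col):
        """Map (row, col) to a position in the flat buffer, with bounds checks."""
        if not (0 <= row < self.n_rows and 0 <= col < self.n_cols):
            raise IndexError((row, col))
        return row * self.n_cols + col

    def get(self, row, col):
        return self.cells[self._index(row, col)]

    def set(self, row, col, value):
        self.cells[self._index(row, col)] = value

    def row(self, row):
        """A row is one contiguous run of n_cols slots."""
        start = self._index(row, 0)
        return self.cells[start:start + self.n_cols]

    def column(self, col):
        """A column is every n_cols-th slot, so it is strided, not contiguous."""
        self._index(0, col)  # bounds check only
        return self.cells[col::self.n_cols]
```

With a row-first layout, walking along a row touches neighbouring slots while walking down a column jumps n_cols slots at a time, which is why the choice of layout matters for locality.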
The next part is that if you want different threads to mutate e.g. columns/rows of your grid at the same time, Arc<Mutex<Grid>> won't cut it, but you already figured that out. You should get clarity on whether you can split your problem such that each thread can only operate on rows OR columns. Remember that if any thread holds a &mut Row, no other thread may hold a &mut Column: there will be an overlapping element, and it will be very easy for you to create a data race. If you can assign a static range of rows to a thread (e.g. thread 1 processes rows 1-3, thread 2 processes rows 4-6, etc.), that should make your life considerably easier. To get into "row-wise" processing if it doesn't arise naturally from the problem, you might consider breaking it into e.g. a row-wise step, where all threads operate on rows only, and then a column-wise step, possibly repeating those.
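The static assignment of row ranges can be sketched like this. It is written in Python only to show the partitioning logic (the helper names are invented); in Rust, the analogous move would be to hand each thread a disjoint mutable slice of rows.

```python
from concurrent.futures import ThreadPoolExecutor

def row_ranges(n_rows, n_workers):
    """Split range(n_rows) into n_workers disjoint, contiguous ranges."""
    base, extra = divmod(n_rows, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        stop = start + base + (1 if worker < extra else 0)
        ranges.append(range(start, stop))
        start = stop
    return ranges

def process_rows(grid, rows):
    """Example row-wise step: each worker touches only its own rows."""
    for r in rows:
        grid[r] = [cell + 1 for cell in grid[r]]

def run_row_step(grid, n_workers=3):
    """Run one row-wise step with a fixed row range per worker thread."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(process_rows, grid, rows)
                   for rows in row_ranges(len(grid), n_workers)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
```

Since no two ranges share a row, the workers never write to the same element and need no per-cell locks during the step.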
Speculative starting point
I would suggest that your main thread holds the Grid struct, which will almost inevitably be implemented with some unsafe methods, e.g. get_row(usize), get_row_mut(usize) if you can split your problem into rows/columns, or get(usize, usize) and get_mut(usize, usize) if you can't. I cannot tell you what exactly these should return, but they might even be custom references to Grid, which:
can only be obtained when the usual borrowing rules are fulfilled (e.g. by blocking the thread until any other GridRefMut is dropped)
implement Drop such that you don't create a deadlock
Every thread holds an Arc<Grid>, and can draw cells/rows/columns for reading/mutating out of the grid as needed, while the grid itself keeps track of references being created and dropped.
The downside of this approach is that you basically implement a runtime borrow-checker yourself. It's tedious and probably error-prone. You should browse crates.io before you do that, but your problem sounds specific enough that you might not find a fitting solution, let alone one that's sufficiently documented.

track vehicle trajectory by using opencv

This question has bothered me for a long time. A basic vehicle counting program includes two steps: 1. recognize a vehicle; 2. track the vehicle by its features.
However, if vehicle #1 is found at time t, then at t+1 the program starts to track it, but #1 can also be found again by the recognition process. At t+2 the program will then be tracking two vehicles, although there is actually just one (#1) in the frame. How can I avoid detecting an already recognized vehicle a second time?
Thanks in advance!
If I understood correctly, you are concerned about detecting the object that you are already tracking (lack of detector/tracker communication). In that case you can either:
Pre-check - during detection exclude the areas, where you already track objects or
Post-check - discard detected objects, that are near tracked ones (if "selective" detection is not possible for your approach for some reason)
There are several possible implementations.
Mask. Create a binary mask, where areas near tracked objects are "marked" (e.g. ones near tracked objects and zeros everywhere else). Given such a mask, before detection in particular location you can quickly check if something is being tracked there, and abort detection (Pre-check approach) or remove detected object, if you stick with the Post-check approach.
Brute-force. Calculate distances between particular location and each of the tracked ones (you can also check overlapping area and other characteristics). You can then discard detections, that are too close and/or similar to already tracked objects.
Let's consider which way is better (and when).
Mask needs O(N) operations to add all tracked objects to the mask and O(M) operations to check all locations of interest. That's O(N + M) = O(max(N, M)), where N is number of tracked objects and M is number of checked locations (detected objects, for example). Which number (N or M) will be bigger depends on your application. Additional memory is also needed to hold the binary mask (usually it is not very important, but again, it depends on the application).
Brute-force needs O(N * M) operations (each of M locations is checked against N candidates). It doesn't need additional memory, and allows doing more complex logic during checks. For example, if object suddenly changes size/color/whatever within one frame - we should probably not track it (since it may be a completely different object occluding original one) and do something else instead.
To sum up:
Mask is asymptotically better when you have a lot of objects. It is almost essential if you do something like a sliding window search during detection, and can exclude some areas (since in this case you will likely have a large M). You will likely use it with Pre-check.
Brute-force is OK when you have few objects and need to do checks that involve different properties. It makes most sense to use it with Post-check.
If you happen to need something in between, you'll have to be more creative and either encode object properties in the mask somehow (to achieve constant look-up time) or use more complex data structures (to speed up the "Brute-force" search).
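Both variants can be sketched without any OpenCV calls, using plain Python lists. The box format (x, y, w, h) in integer pixels, the overlap threshold and the function names are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)

def post_check(detections, tracked, threshold=0.3):
    """Brute force, O(N * M): drop detections that overlap a tracked box."""
    return [d for d in detections
            if all(iou(d, t) < threshold for t in tracked)]

def build_mask(width, height, tracked):
    """Mark every pixel covered by a tracked box."""
    mask = [[False] * width for _ in range(height)]
    for x, y, w, h in tracked:
        for row in range(max(y, 0), min(y + h, height)):
            for col in range(max(x, 0), min(x + w, width)):
                mask[row][col] = True
    return mask

def pre_check(mask, x, y):
    """Constant-time test: True if nothing is being tracked at (x, y)."""
    return not mask[y][x]
```

In a sliding window detector, pre_check would be called before running the classifier at a location, while post_check filters the detector's output after the fact.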

Why do we use data structures? (when no dynamic allocation is needed)

I'm pretty sure this is a silly newbie question but I didn't know it so I had to ask...
Why do we use data structures, like Linked List, Binary Search Tree, etc? (when no dynamic allocation is needed)
I mean: wouldn't it be faster if we kept a single variable for a single object? Wouldn't that speed up access time? Eg: BST possibly has to run through some pointers first before it gets to the actual data.
Except for when dynamic allocation is needed, is there a reason to use them?
Eg: using linked list/ BST / std::vector in a situation where a simple (non-dynamic) array could be used.
Each thing you are storing is being kept in its own variable (or storage location). Data structures apply organization to your data. Imagine if you had 10,000 things you were trying to track. You could store them in 10,000 separate variables. If you did that, then you'd always be limited to 10,000 different things. If you wanted more, you'd have to modify your program and recompile it each time you wanted to increase the number. You might also have to modify the code to change the way in which the calculations are done if the order of the items changes because a new one is introduced in the middle.
Using data structures, from simple arrays to more complex trees, hash tables, or custom data structures, allows your code to be both more organized and more extensible. Using an array, which can either be created to hold the required number of elements or extended to hold more after it is first created, keeps you from having to rewrite your code each time the number of data items changes. Using an appropriate data structure allows you to design algorithms based on the relationships between the data elements rather than some fixed ordering, giving you more flexibility.
A simple analogy might help to understand. You could, for example, organize all of your important papers by putting each of them into a separate filing cabinet. If you did that you'd have to memorize (i.e., hard-code) the cabinet in which each item can be found in order to use them effectively. Alternatively, you could store them all in the same filing cabinet (like a generic array). This is better in that they're all in one place, but still not optimal, since you have to search through them all each time you want to find one. Better yet would be to organize them by subject, putting like subjects in the same file folder (separate arrays, different structures). That way you can look for the file folder for the correct subject, then find the item you're looking for in it. Depending on your needs you can use different filing methods (data structures/algorithms) to better organize your information for its intended use.
I'll also note that there are times when it does make sense to use individual variables for each data item you are using. Frequently there is a mixture of individual variables and more complex structures, using the appropriate method depending on the use of the particular item. For example, you might store the sum of a collection of integers in a variable while the integers themselves are stored in an array. A program would need to be pretty simple though before the introduction of data structures wouldn't be appropriate.
Sorry, but you didn't just find a great new way of doing things ;) There are several huge problems with this approach.
How could this be done without requiring programmers to massively (and nontrivially) rewrite tons of code as soon as the number of allowed items changes? Even when you have to fix your data structure sizes at compile time (e.g. arrays in C), you can use a constant. Then, changing a single constant and recompiling is sufficient for changes to that size (if the code was written with this in mind). With your approach, we'd have to type hundreds or even thousands of lines every time some size changes. Not to mention that all this code would be incredibly hard to read, write, maintain and verify. The old truism "more lines of code = more space for bugs" is taken up to eleven in such a setting.
Then there's the fact that the number is almost never set in stone. Even when it is a compile time constant, changes are still likely. Writing hundreds of lines of code for a minor (if it exists at all) performance gain is hardly ever worth it. This goes thrice if you'd have to do the same amount of work again every time you want to change something. Not to mention that it isn't possible at all once there is any remotely dynamic component in the size of the data structures. That is to say, it's very rarely possible.
Also consider the concept of implicit and succinct data structures. If you use a set of hard-coded variables instead of abstracting over the size, you still got a data structure. You merely made it implicit, unrolled the algorithms operating on it, and set its size in stone. Philosophically, you changed nothing.
But surely it has a performance benefit? Well, possibly, although it will be tiny. But it isn't guaranteed to be there. You'd save some space on data, but code size would explode. And as everyone informed about inlining should know, small code size is very useful for performance because it allows the code to stay in the cache. Also, argument passing would result in excessive copying unless you figured out a trick to derive the location of most variables from a few pointers. Needless to say, this would be nonportable, very tricky to get right even on a single platform, and liable to be broken by any change to the code or the compiler invocation.
Finally, note that a weaker form is sometimes done. The Wikipedia page on implicit and succinct data structures has some examples. On a smaller scale, some data structures store much data in one place, such that it can be accessed with less pointer chasing and is more likely to be in the cache (e.g. cache-aware and cache-oblivious data structures). It's just not viable for 99% of all code and taking it to the extreme adds only a tiny, if any, benefit.
The main benefit to datastructures, in my opinion, is that you are relationally grouping them. For instance, instead of having 10 separate variables of class MyClass, you can have a datastructure that groups them all. This grouping allows for certain operations to be performed because they are structured together.
Not to mention, having datastructures can potentially enforce type security, which is powerful and necessary in many cases.
And last but not least, what would you rather do?
string string1 = "string1";
string string2 = "string2";
string string3 = "string3";
string string4 = "string4";
string string5 = "string5";
Console.WriteLine(string1);
Console.WriteLine(string2);
Console.WriteLine(string3);
Console.WriteLine(string4);
Console.WriteLine(string5);
Or...
List<string> myStringList = new List<string>() { "string1", "string2", "string3", "string4", "string5" };
foreach (string s in myStringList)
    Console.WriteLine(s);

Game Terrain Database Model

I am developing a game for the web. The map of this game will be a minimum of 2000km by 2000km. I want to be able to encode elevation and terrain type at some level of granularity - 100m X 100m for example.
For a 2000km by 2000km map, storing this information in 100m x 100m buckets would mean 20000 by 20000 elements or a total of 400,000,000 records in a database.
Is there some other way of storing this type of information?
MORE INFORMATION
The map itself will not ever be displayed in its entirety. Units will be moved on the map in a turn based fashion and the players will get feedback on where they are located and what the local area looks like. Terrain will dictate speed and prohibition of movement.
I guess I am trying to say that the map will be used for the game and not necessarily for a graphical or display purposes.
It depends on how you want to generate your terrain.
For example, you could procedurally generate it all (using interpolation of a low resolution terrain/height map - stored as two "bitmaps" - with random interpolation seeded from the xy coords to ensure that terrain didn't morph), and use minimal storage.
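A rough sketch of that approach: bilinear interpolation of a tiny low-resolution height map, plus a small offset derived from a hash of the coordinates so that the same cell always gets the same value. The 3x3 map, the spacing and the jitter amplitude are invented for the example.

```python
import hashlib

COARSE = 100  # fine cells between two low-resolution samples (assumed)
LOW_RES = [[0, 50, 100],   # hypothetical low-resolution height map
           [20, 80, 120],
           [40, 60, 90]]

def jitter(x, y, amplitude=2.0):
    """Deterministic pseudo-random offset in [-amplitude, amplitude) seeded by (x, y)."""
    digest = hashlib.sha256(f"{x},{y}".encode()).digest()
    unit = int.from_bytes(digest[:4], "big") / 2**32  # in [0, 1)
    return (unit * 2 - 1) * amplitude

def elevation(x, y):
    """Bilinear interpolation of LOW_RES plus position-seeded jitter.

    Valid for 0 <= x, y <= COARSE * (len(LOW_RES) - 1).
    """
    gx, gy = x / COARSE, y / COARSE
    x0, y0 = int(gx), int(gy)
    x1 = min(x0 + 1, len(LOW_RES[0]) - 1)
    y1 = min(y0 + 1, len(LOW_RES) - 1)
    fx, fy = gx - x0, gy - y0
    top = LOW_RES[y0][x0] * (1 - fx) + LOW_RES[y0][x1] * fx
    bottom = LOW_RES[y1][x0] * (1 - fx) + LOW_RES[y1][x1] * fx
    return top * (1 - fy) + bottom * fy + jitter(x, y)
```

Nothing per-cell is stored here: only the low-resolution samples are kept, and every fine cell is recomputed on demand.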
If you wanted areas of terrain that were completely defined, you could store these separately and use them where appropriate, randomly generating the rest.
If you want completely defined terrain, then you're going to need to look into some kind of compression/streaming technique to only pull terrain you are currently interested in.
I would treat it differently, by separating terrain type and elevation.
Terrain type, I assume, does not change as rapidly as elevation - there are probably sectors of the same type of terrain that stretch over much longer than the lowest level of granularity. I would map those sectors into database records or some kind of hash table, depending on performance, memory and other requirements.
Elevation I would assume is semi-continuous, as it changes gradually for the most part. I would try to map the values into a set of continuous functions (different sets for parts that are not continuous, as in a sudden change in elevation). For any set of coordinates for which the terrain is the same elevation or can be described by a simple function, you just need to define the range this function covers. This should greatly reduce the amount of information you need to record to describe the elevation at each point in the terrain.
So basically I would break down the map into different sectors which consist of (x,y) ranges, once for terrain type and once for terrain elevation, and build a hash table for each which can return the appropriate value as needed.
If you want the kind of granularity that you are looking for, then there is no obvious way of doing it.
You could try a 2-dimensional wavelet transform, but that's pretty complex. Something like a Fourier transform would do quite nicely. Plus, you probably wouldn't go about storing the terrain with a one-record-per-piece-of-land way; it makes more sense to have some sort of database field which can store an encoded matrix.
I think the usual solution is to break your domain up into "tiles" of manageable sizes. You'll have to add a little bit of logic to load the appropriate tiles at any given time, but not too bad.
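The tile logic might look something like this sketch, where load_tile stands in for whatever actually reads a tile from disk or a database. The tile size, cache size and the loads list are assumptions for the example.

```python
from functools import lru_cache

TILE = 256   # cells per tile edge (assumed)
loads = []   # records which tiles were actually loaded, for illustration

@lru_cache(maxsize=64)
def load_tile(tx, ty):
    """Stand-in for reading one tile from storage; here it fabricates data."""
    loads.append((tx, ty))
    return [[(tx, ty)] * TILE for _ in range(TILE)]

def cell_at(x, y):
    """Look up one cell, loading (or reusing) the tile that contains it."""
    tile = load_tile(x // TILE, y // TILE)
    return tile[y % TILE][x % TILE]
```

Only tiles that are actually visited get loaded, and the cache drops the least recently used ones once it is full.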
You shouldn't need to access all that info at once: even if each 100m x 100m bucket occupied a single pixel on the screen, no screen I know of could show 20k x 20k pixels at once.
Also, I wouldn't use a database--look into height mapping--effectively using a black & white image whose pixel values represent heights.
Good luck!
That will be an awful lot of information no matter which way you look at it. 400,000,000 grid cells will take their toll.
I see two ways of going around this. Firstly, since it is a web-based game, you might be able to get a server with a decently sized HDD and store the 400M records on it just as you would normally. Or, more likely, create some sort of storage mechanism of your own for efficiency. Then you would only have to devise a way to access the data efficiently, which could be done by taking into account the fact that you are unlikely to need all of it at once. ;)
The other way would be some kind of compression. You have to be careful with this though. Most out-of-the-box compression algorithms won't allow you to decompress an arbitrary location in the stream. Perhaps your terrain data has some patterns in it you can use? I doubt it will be completely random. More likely I predict large areas with the same data. Perhaps those can be encoded as such?
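One simple way to exploit large areas with the same data while still allowing lookups at an arbitrary location is run-length encoding per row, with a binary search over the run starts. This is just a sketch of that idea; the function names are illustrative.

```python
from bisect import bisect_right

def rle_encode(row):
    """Compress a row into parallel lists: run start indices and run values."""
    starts, values = [], []
    for i, value in enumerate(row):
        if not values or values[-1] != value:
            starts.append(i)
            values.append(value)
    return starts, values

def rle_lookup(starts, values, i):
    """Value at index i, found by binary search without decoding the row."""
    return values[bisect_right(starts, i) - 1]
```

A row of 1,000 cells with one lake in it shrinks to three runs, and any single cell can still be read directly.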
