How does the size of a Realm file grow?
To start with: I have a Realm file with several properties, one of them being an array of 860 entries, where each array entry in turn consists of a couple of properties. One of those properties holds the name of the entry.
I observed the following:
If the name property says "Criteria_A1" (through "Criteria_A860"), the Realm file is 1.6 MB.
If the name property says "A1" (through "A860"), the Realm file is only 786 kB.
Why do the extra letters in the name property make the Realm file this much bigger?
A second observation:
If I add more objects (each again having an array with 860 entries), the file size ends up at 1.6 MB again, no matter how many objects I add (I guess until some critical value is reached again, where the size triples... or am I wrong?).
It almost seems to me that the 786 kB Realm file doubles in size as soon as something more is added (either a property with more letters or an additional object). Why does the Realm file double at a critical value instead of growing linearly as content is added?
Thanks for a clarification on this.
It's pretty well observed. :-) The Realm file starts out at about 4 kB and will double in size once it runs out of free space. It keeps doubling until 128 MB and then grows in constant 128 MB increments thereafter.
The reason to double the file instead of growing it linearly is purely performance. Doubling is a common strategy for dynamic data structures.
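As a rough illustration of that growth rule (a sketch in Python, using the thresholds stated above, which are only the current ones and may change), the allocated file size as a function of the space actually needed would look something like this:

def realm_file_size(needed_bytes):
    # Growth rule described above: start around 4 kB, double until 128 MB,
    # then grow in flat 128 MB increments. Purely illustrative numbers.
    size = 4 * 1024
    cap = 128 * 1024 * 1024
    while size < needed_bytes:
        size = size * 2 if size < cap else size + cap
    return size

for needed in (3_000, 700_000, 200_000_000):
    print(f"{needed:>12} bytes of data -> file of {realm_file_size(needed):>12} bytes")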
You can use the methods listed below to write a compacted copy that removes all free space from the file. This can be useful if you no longer add new data, want to ship a static database, or want to send the file over the network.
Realm.writeCopyToURL(_:encryptionKey:) in Swift
-[RLMRealm writeCopyToURL:encryptionKey:error:] in Objective-C
Realm.writeCopyTo() in Java
The thresholds and algorithm mentioned are the current ones, though, and may change in future versions.
Hope this clarifies?
Background:
I am working on 2D infinite world generation. It is tile-based, meaning my terrain is made entirely of squares. You can imagine it like 2D Minecraft (looking at the terrain from above).
I implemented a standard chunk system where the terrain gets chopped into small 8x8-tile areas that get loaded and deleted as the player moves around the world. Everything mentioned so far works perfectly smoothly without any hiccups or lag. I am using Lua and the Corona SDK.
The problem:
Since the player will be able to modify the terrain, I need a fast and efficient system for saving chunks to storage once the player loads a new chunk, and for loading those chunks back from storage if they have been loaded previously.
This is where the problem shows up. The game needs to read from and save to files quite often, which causes noticeable lag. Making chunks bigger is not an option.
Solutions I tried but all caused lag:
a) The first and obvious solution I implemented was to just create a text file for each chunk, with tile names as strings. It looked something like this: x12y10.txt, and inside the file I just dumped all tile names in the order they need to be placed on screen: "Grass Grass Water Sand Sand Sand Grass Grass...". That worked, but loading strings was slow, so I tried another solution: saving tiles as indices.
b) Saving tiles as their indices. I paired every tile with a number. Since numbers are shorter, they take less space and are faster to load. I gave each tile its own index: Grass -> id 1, Water -> id 2, Sand -> id 3 and so on. This way I only needed to save 1 or 2 chars instead of a full string per tile. My txt files now looked like this: "1 1 2 3 3 3 1 1...". This worked better but still caused lag. (A small sketch of this encoding appears below, after c).)
c) The next improvement I made was to how chunks are organized in storage. Instead of dumping all the chunks into a single folder, I made a folder for each x coordinate and put all chunks with that x value in there.
So instead of this:
Folder with all chunks: x0y0.txt, x0y1.txt, x0y2.txt, x1y0.txt, x1y1.txt, x1y2.txt
Inside folder with all chunks I had this:
Folder x0: x0y0.txt, x0y1.txt, x0y2.txt
Folder x1: x1y0.txt, x1y1.txt, x1y2.txt
I am not sure how much this helps for a small number of chunks, but I am pretty sure the improvement is there once there are thousands of chunks.
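For what it's worth, here is a minimal sketch of both ideas together (written in Python rather than Lua, purely as an illustration; the tile names and ids are the example ones from b), and the folder layout is the one from c)):

import os

TILE_IDS = {"Grass": 1, "Water": 2, "Sand": 3}          # example ids, as in b)
TILE_NAMES = {v: k for k, v in TILE_IDS.items()}

def chunk_path(base_dir, cx, cy):
    # One folder per x coordinate, as in c): <base>/x<cx>/x<cx>y<cy>.txt
    folder = os.path.join(base_dir, f"x{cx}")
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, f"x{cx}y{cy}.txt")

def save_chunk(base_dir, cx, cy, tiles):
    # tiles is a flat list of 64 tile names (8x8); store them as ids.
    ids = " ".join(str(TILE_IDS[name]) for name in tiles)
    with open(chunk_path(base_dir, cx, cy), "w") as f:
        f.write(ids)

def load_chunk(base_dir, cx, cy):
    with open(chunk_path(base_dir, cx, cy)) as f:
        return [TILE_NAMES[int(i)] for i in f.read().split()]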
Possible solutions?
I have some ideas for improvements, but I would like to hear your opinion on the solutions.
a) Saving terrain in binary files?
b) I have read about the Minecraft region format and really tried to understand how it works, but I did not get it, since there is little information about it. So if anyone knows it and could explain their system to me, I would be really grateful.
c) Another faster file format?
d) Is making/accessing many folders slow? Is there a better alternative?
I really feel like this is a CS-101 question, but I cannot google up any answer right away, so here's a quick summary.
All files are just sequences of bytes. If we're talking about reading and writing raw bytes, no format will make 64 bytes appear in memory faster than another.
A text file is a sequence of bytes with slight limitations on their values (well, the limitation only matters if you want standard text programs to display it). The string "11" from a text file (the bytes 00110001 00110001) won't be loaded any faster than two "unprintable" bytes such as 00000001 00000001 from a "binary" file.
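To make that concrete, here is a small sketch (Python; the tile ids are the hypothetical ones from the question) showing that a row of tile ids written as text and the same row packed as raw bytes are both just byte sequences; the "binary" version is merely shorter, not faster to read per byte:

import struct

tile_ids = [1, 1, 2, 3, 3, 3, 1, 1]                       # one 8-tile row

as_text = " ".join(map(str, tile_ids)).encode("ascii")    # b"1 1 2 3 3 3 1 1", 15 bytes
as_binary = struct.pack("8B", *tile_ids)                  # 8 raw bytes, one per tile

print(len(as_text), as_text)
print(len(as_binary), as_binary)
# Reading either back is one read() of a byte sequence; the "binary" file
# simply carries fewer bytes for the same information.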
Structuring directories at the very least reduces the number of nodes the system has to check when looking up the file you've requested to open. But the mechanisms underlying filesystems are very complex and affected by a lot of factors. The general expectation is that frequent reading, even of small files, will be slow. All files also carry some bookkeeping overhead (system info to keep them tracked and ordered), so small files have a lower ratio of useful to auxiliary data. I know of at least one 2D project with a mutable map that was making HDDs growl and grunt before they moved to bigger files years ago.
You don't have to make chunks bigger (that's a different thing), but you can write them into the same file.
Instead of a million files of 64 bytes each, you can have a single megabyte file (assuming you use a byte per chunk). A million chunks is a lot for a player to modify or walk around. If you unpack that data into tables it will take up more space, but you don't have to decode the whole string, only the currently needed bytes. Yes, modifying a megabyte string in Lua will cause the creation of another megabyte string, which is slow, but you don't have to do it every time, or you can split the string into smaller ones and modify those. And only do the writing when needed. I/O buffering may even happen without your intervention, but again it is usually most helpful for big files.
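A minimal sketch of that single-file idea (Python; the 64-byte record assumes one byte per tile for an 8x8 chunk, and the world width and file name are made up): every chunk lives at a fixed offset, so loading or saving one chunk touches only its own 64 bytes instead of the whole file.

CHUNK_TILES = 8 * 8          # 64 tiles per chunk, one byte per tile (assumption)
WORLD_W = 1024               # chunks per row of the world in this file (made-up value)

def record_offset(cx, cy):
    return (cy * WORLD_W + cx) * CHUNK_TILES

def read_chunk_record(f, cx, cy):
    f.seek(record_offset(cx, cy))
    return bytearray(f.read(CHUNK_TILES))      # the 64 tile ids of this chunk

def write_chunk_record(f, cx, cy, tiles):
    f.seek(record_offset(cx, cy))
    f.write(bytes(tiles))                      # overwrite just this chunk's 64 bytes

# Usage: open the world file once in binary read/write mode.
# with open("world.bin", "r+b") as f:
#     tiles = read_chunk_record(f, 12, 10)
#     tiles[0] = 2                             # e.g. turn the first tile into Water
#     write_chunk_record(f, 12, 10, tiles)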
Yes, there will be more than a byte of info per tile (although 2^8 possible states per tile is already a lot), but the system stays the same.
The same thing is done for textures, because loading data in one big scoop is faster than searching around for a tiny bit here and there. Indexing one long area of memory is also faster than chasing pointers around.
On top of that, you may try to read/write fewer bytes than you want to end up with in memory, for example by compressing the data.
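For example, with Python's standard zlib module (a sketch; whether compression pays off depends on how repetitive the tile data is):

import zlib

chunk = bytes([1, 1, 2, 3, 3, 3, 1, 1] * 8)        # 64 tile ids, fairly repetitive
packed = zlib.compress(chunk)                       # fewer bytes to read from / write to disk
restored = zlib.decompress(packed)

print(len(chunk), len(packed), restored == chunk)   # 64, something smaller, True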
In Minecraft, chunks are not stored unless they have been visited or modified; otherwise they are generated.
That would give you a system where only blocks which have been modified by the player need to be stored, with the unmodified areas being regenerated from the same random seed each time.
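A sketch of that idea (Python; the hash-based "generation" and the in-memory store of edits are stand-ins for a real world generator and a real save file): unmodified tiles are recomputed deterministically from the world seed and the coordinates, and only player edits are looked up.

import hashlib

TILES = ["Grass", "Water", "Sand"]       # example tile set
modified = {}                            # (cx, cy) -> {(tx, ty): tile}; player edits only

def generated_tile(seed, x, y):
    # Deterministic stand-in for world generation: the same seed and
    # coordinates always produce the same tile, so nothing needs storing.
    h = hashlib.sha256(f"{seed}:{x}:{y}".encode()).digest()
    return TILES[h[0] % len(TILES)]

def tile_at(seed, x, y):
    cx, cy = x // 8, y // 8
    edits = modified.get((cx, cy))
    if edits and (x % 8, y % 8) in edits:
        return edits[(x % 8, y % 8)]     # a tile the player changed; this is what gets saved
    return generated_tile(seed, x, y)    # everything else is regenerated on the fly

def modify_tile(x, y, tile):
    modified.setdefault((x // 8, y // 8), {})[(x % 8, y % 8)] = tile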
Creating a hierarchy of modifications: a chunk is an 8x8 block of tiles; create a super-chunk which is 8x8 chunks, and only look for a file if anything in that 8x8 super-chunk has been modified.
Possibly store all of the super-chunk in one file, which would limit the number of files (having more files does slow the system down, and also uses disk space inefficiently).
If you have any spare time and space, perhaps keep a cache of the chunks near the player, and pre-load the modified areas that are being approached. This would limit the visible lag.
I implemented the data Memory chip in the nand2tetris course, but I really don't understand some parts of my implementation:
CHIP Memory {
IN in[16], load, address[15];
OUT out[16];
PARTS:
// Route the load bit to the selected device: 00/01 -> RAM16K, 10 -> Screen, 11 -> Keyboard (never written).
DMux4Way(in=load, sel=address[13..14], a=RAM1, b=RAM2, c=scr, d=kbr);
// Selections 00 and 01 both mean RAM16K.
Or(a=RAM1, b=RAM2, out=RAM);
RAM16K(in=in, load=RAM, address=address[0..13], out=RAMout);
Screen(in=in, load=scr, address=address[0..12], out=ScreenOut);
Keyboard(out=KeyboardOut);
// Select which device's output appears on out, using the same two address bits.
Mux4Way16(a=RAMout, b=RAMout, c=ScreenOut, d=KeyboardOut, sel=address[13..14], out=out);
}
1. What is load responsible for here? I understand that if load is 0, the outputs of the DMux4Way will be 0 0 0 0 in any case. But I don't understand how it works after that, namely how this prevents data from being loaded into Memory.
2. I also don't understand why we feed Screen address[0..12] instead of address[0..14], the full address. In my opinion we should use the latter, because the Screen memory map comes after the RAM memory map, so if we want to address the Screen memory map we should use the range 16384-24575 in decimal, or 100000000000000-101111111111111 in binary. But how can we represent that range with just a 13-bit-wide bus (address[0..12])? It seems impossible.
Therefore, if we want to represent the Screen memory map, we should use the range given above, and that range is 15 bits wide (address[0..14]), not 13 bits (address[0..12]). So why does address[0..12] work while address[0..14] (the full address) does not?
I'm sorry to start with criticism, but the questions you ask suggest that you didn't do this exercise yourself, or didn't start the course from the beginning.
To answer your questions:
Ad.1.
You demultiplex a single bit (the load bit) to the correct memory part, while feeding the input data to all memory parts at the same time.
It's easier and neater than doing it the other way around, namely, to direct 16-bit input to the correct part (RAM16K, screen, or keyboard) while having a load bit that is connected and active at every register in all the parts.
To clarify: you have 2 possible destinations when writing data, RAM and Screen. The smallest demultiplexer that can decode the two select bits is a 4-way demultiplexer, and that's what you're using. When you write into memory, you need to provide 2 pieces of information at the same time: the data and the destination.
You could demultiplex the input data with a DMux4Way16 and, separately, the single load bit with a DMux4Way, but that would take 2 demultiplexers, and we can do better than that. That's what's done here: you route the data input to both RAM and Screen, and use only one demultiplexer, DMux4Way, to select one of the 2 possible destinations. Only the selected one will be loaded with new data; at the other, the data input is ignored. Knowing that, you need to study the A-instruction format: when bits 14 and 13 of the A-instruction (or the data residing in the A register) have the binary value 00 or 01, the destination is RAM. When bits 14 and 13 have the binary value 10, the Screen is the destination.
Once you notice that, you choose these 2 bits as sel for your demultiplexer. Selections 0 and 1 have the same meaning, so you can OR them together and feed the result as load to RAM16K. Selection 2 means the Screen will be loaded with a new value, so the load bit goes there. Selection 3 is never used for writing, so we don't care about it; output d of the demultiplexer is not connected anywhere. We make use of the demultiplexer's defining feature: the selected output carries the input value (here the load bit) and all other outputs yield 0. That means only one memory destination can ever be loaded.
Ad.2.
Screen is a separate device; it has nothing to do with the RAM, ROM or Keyboard devices here. You, and only you, give meaning to which bits mean what for this specific device. To answer your question: when you address some register in Screen, you address it in its own internal address space. In that internal address space the first address is 0, even though seen from the whole Memory it is 16384. It's your job to make this translation. Given the size of the Screen memory device, it is not necessary to use a 14-bit address bus; 13 bits is all you need. What would a 14th bit mean in this case? It wouldn't add any value. Also, you are the user, not the designer, of Screen; you only look at and follow its interface description.
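To make both answers concrete, here is a small sketch (in Python, not HDL, purely as an illustration) of what the wiring above effectively computes for a 15-bit address: bits 14..13 select the device, and the low bits form the device's internal address.

def decode(address):
    # Mimic the Memory chip's address decoding for a 15-bit address.
    top = (address >> 13) & 0b11        # bits 14..13, the DMux4Way / Mux4Way16 selector
    if top in (0b00, 0b01):             # addresses 0..16383
        return "RAM16K", address & 0x3FFF    # 14-bit internal address (address[0..13])
    if top == 0b10:                     # addresses 16384..24575
        return "Screen", address & 0x1FFF    # 13-bit internal address (address[0..12])
    return "Keyboard", 0                # 24576; the keyboard needs no internal address

print(decode(0))        # ('RAM16K', 0)
print(decode(16384))    # ('Screen', 0)     first Screen register
print(decode(24575))    # ('Screen', 8191)  last Screen register
print(decode(24576))    # ('Keyboard', 0)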
Hope this answers your questions; if not, I urge you to go back and study the previous hardware-related chapters of the course more carefully.
This is a beginner's question on Cassandra architecture.
I have a 3-node Cassandra cluster. The data directory is at $CASSANDRA_HOME/data/data. I loaded a huge data set, did a nodetool flush, and then ran nodetool tablestats on the table I loaded the data into. It says the total space occupied is around 50 GiB. I was curious and checked the size of the data directory with du $CASSANDRA_HOME/data/data on each of the nodes, and it shows only around 1-2 GB on each. How can the data directory be smaller than the space occupied by a single table? Am I missing something? My table is created with replication factor 1.
du reports the actual storage capacity used by the paths given to it. This is not always directly related to the size of the data stored in those paths.
Two main factors can make the output of du differ from other storage-usage figures you might get (e.g. from Cassandra).
du might report a smaller number than expected for two reasons: (a) it combines hard links, so if the paths given to it contain hard-linked files (I won't explain hard links here, but the term is a standard one for Unix-like operating systems and can be looked up easily), these are counted only once even though the files appear multiple times; (b) it is aware of sparse files, i.e. files which contain large (sometimes huge) areas of empty space (zero bytes). Many Unix-like file systems can store such files efficiently, depending on how they were created.
du might report a larger number than expected because file systems have some overhead: to store a file of n bytes, n + h bytes need to be stored, where h depends on the file system and its configuration. The most important factor is that file systems typically store files in a block structure. If a file's size isn't an exact multiple of the file system's block size, the last needed block is still allocated completely by this file, so some of its space is wasted. du will show the whole block as allocated because, in fact, it is.
So in your case Cassandra might report 50 GiB of occupied space, but a lot of it might be empty (never written-to) space. That can be stored in sparse files on the file system, which in fact only use about 2 GiB of real storage (which is what du shows).
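You can see the difference between a file's apparent size and its allocated size yourself. A minimal sketch (Python, Unix-like systems; st_blocks is counted in 512-byte units there), creating a sparse file and comparing the two numbers:

import os, tempfile

# Create a 100 MiB "sparse" file: seek far ahead and write a single byte.
path = os.path.join(tempfile.mkdtemp(), "sparse.bin")
with open(path, "wb") as f:
    f.seek(100 * 1024 * 1024 - 1)
    f.write(b"\0")

st = os.stat(path)
apparent = st.st_size               # what the file claims to contain
allocated = st.st_blocks * 512      # what the file system actually allocated (what du sees)
print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
# On file systems with sparse-file support, 'allocated' stays far below 'apparent'.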
@pozs: this is NOT a duplicate of the question you pointed to; that was the first place I looked. I couldn't care less about the difference between text and varchar. I'm asking about the physical space used on the storage medium, i.e. the server's hard drive.
I know that hard drives are split into blocks of bytes (a.k.a. chunks), and that if less than the full block is used, the remaining space is wasted as empty, unused space. What I'm curious about is that the text type itself uses a certain amount of storage. Can the space used be reduced, rather than just limiting the quantity of input? I could say text, limit => 1, and it may still use thousands upon thousands of bytes... this is what I'm asking about.
(Image: a photo of hard drive blocks; this is how I imagine the space used by the ActiveRecord text type.)
Here's the Wikipedia article on blocks (data storage): http://en.wikipedia.org/wiki/Block_(data_storage) As you can see, it says "Block storage is normally abstracted by a file system or database management system (DBMS)". What it does NOT say is HOW it is abstracted.
According to Igor's blog: "To my surprise, they determined that the average I/O size of our Postgres databases was much higher that 8KB block size, and up to 1MB." http://igorsf.wordpress.com/2010/11/01/things-to-check-when-configuring-new-postgres-storage-for-high-performance-and-availability/ While this is helpful to know, it doesn't tell me how ActiveRecord and PostgreSQL handle blocks by default.
According to concernedoftunbridgewells, "The database will allocate space in a table or index in some given block size. In the case of Postgres this is 8K." https://dba.stackexchange.com/questions/15510/understanding-block-sizes/15514#15514?newreg=fc10593601be479b8ed697d1bbd108ed So if 8K is used as a block, how high or low do I set the text type limit so the value fits within one 8K block, given that it may use more than just one block?
I know that the PostgreSQL block size setting can be changed. So I would like clarity on how ActiveRecord and PostgreSQL currently handle block sizes. I will accept a good answer on that.
A page contains more than one item (assuming there is space, obviously). If your row is less than 8K, other rows will be stored on the same page with it (I'm simplifying slightly: Postgres stores large column values separately anyway, in TOAST storage). Limiting the max length of a column doesn't interact with this.
My reading of the details on character types is that strings up to 126 bytes incur 3 bytes less overhead, but this happens on a value-by-value basis, independently of what the maximum length is.
The PostgreSQL docs have details on the exact on-disk format and on how PostgreSQL deals with large column values.
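If you want to see what a given text value actually costs on disk, you can ask PostgreSQL directly. A sketch using psycopg2 (the connection string, table and column names are made up; pg_column_size, pg_total_relation_size and pg_size_pretty are standard PostgreSQL functions):

import psycopg2

conn = psycopg2.connect("dbname=myapp_development")   # hypothetical database
cur = conn.cursor()

# Per-value storage cost of a text column, as PostgreSQL reports it.
cur.execute("SELECT length(body), pg_column_size(body) FROM posts LIMIT 5")
for chars, stored_bytes in cur.fetchall():
    print(f"{chars} chars stored in {stored_bytes} bytes")

# Total on-disk size of the table, including TOAST data and indexes.
cur.execute("SELECT pg_size_pretty(pg_total_relation_size('posts'))")
print(cur.fetchone()[0])

conn.close()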
IMO, the size taken by a text-type DB column mainly depends on the content stored in the column. The limit setting inside ActiveRecord just does a validation before the content is saved into the column, and doesn't have an impact on the actual storage.
In the past I had to work with big files, somewhere in the 0.1-3 GB range. Not all the 'columns' were needed, so it was OK to fit the remaining data in RAM.
Now I have to work with files in the 1-20 GB range, and they will probably grow as time passes. That is totally different, because you cannot fit the data in RAM anymore.
My file contains several million 'entries' (I have found one with 30 million entries). An entry consists of about 10 'columns': one string (50-1000 Unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user only the top entries (1-30%) are relevant; the rest is low-quality data.
So, I need some suggestions about which direction to head in. I definitely don't want to put the data in a DB, because databases are hard to install and configure for non-computer-savvy persons. I would like to deliver a monolithic program.
Showing the data is not difficult at all. But sorting... without loading the data in RAM, on regular PCs (2-6GB RAM)... will kill some good hours.
I was looking a bit into MMF (memory mapped files) but this article from Danny Thorpe shows that it may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/
So, I was thinking about loading into RAM only the data from the column that has to be sorted, plus a pointer to the address (within the disk file) of each 'entry'. I sort the 'column', then use the pointer to find the entry corresponding to each column cell and restore the entry. The 'restoration' is written directly to disk, so no additional RAM is required.
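A sketch of exactly that approach (in Python rather than Pascal, just to show the mechanics; it assumes one entry per line with tab-separated columns, and the file names and column index are placeholders): only the sort key and a byte offset per entry are kept in RAM, and the sorted output is written entry by entry straight from the original file.

def sort_by_column(in_path, out_path, col=0):
    index = []                               # (key, byte offset of the entry); nothing else in RAM
    with open(in_path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.split(b"\t")[col]     # the 'column' to sort by
            index.append((key, offset))

    index.sort()                             # sort only the keys + pointers

    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for _key, offset in index:
            src.seek(offset)
            dst.write(src.readline())        # 'restore' the full entry directly to disk

# sort_by_column("entries.txt", "entries_sorted.txt", col=0)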
PS: I am looking for a solution that will work in both Lazarus and Delphi, because Lazarus (actually FPC) has 64-bit support for the Mac. 64-bit means more RAM available = faster sorting.
I think a way to go is merge sort; it's a great algorithm for sorting a large amount of fixed records with limited memory.
General idea (a rough sketch follows the list):
read N lines from the input file (a value that allows you to keep the lines in memory)
sort these lines and write the sorted lines to file 1
repeat with the next N lines to obtain file 2
...
you reach the end of the input file and you now have M files (each of which is sorted)
merge these files into a single file (you'll have to do this in steps as well)
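A minimal sketch of those steps (in Python rather than Delphi/Lazarus, just to show the shape of it; line-based input, lexicographic order and the run size N are assumptions): sorted runs go to temporary files, which are then merged in a single k-way pass with heapq.merge, holding only one line per run in memory.

import heapq, os, tempfile
from itertools import islice

def external_sort(in_path, out_path, n_lines=100_000):
    run_paths = []
    # Pass 1: read N lines at a time, sort them in memory, write each sorted run to its own file.
    with open(in_path) as src:
        while True:
            run = list(islice(src, n_lines))
            if not run:
                break
            if not run[-1].endswith("\n"):
                run[-1] += "\n"              # keep lines well-formed for the merge step
            run.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(run)
            run_paths.append(path)

    # Pass 2: k-way merge of the sorted runs into the output file.
    runs = [open(p) for p in run_paths]
    with open(out_path, "w") as dst:
        dst.writelines(heapq.merge(*runs))
    for f in runs:
        f.close()
    for p in run_paths:
        os.remove(p)

# external_sort("entries.txt", "entries_sorted.txt")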
You could also consider a solution based on an embedded database, e.g. embedded Firebird: it works well with Delphi/Windows, and you only have to add some DLLs to your program folder (I'm not sure about Lazarus/OS X).
If you only need a fraction of the whole data, scan the file sequentially and keep only the entries needed for display. For instance, let's say you need only 300 entries out of 1 million. Read the first 300 entries in the file and sort them in memory. Then, for each remaining entry, check whether it is lower than the lowest entry in memory and, if so, skip it. If it is higher than the lowest entry in memory, insert it into the correct place among the 300 and throw away the lowest; this makes the second lowest the new lowest. Repeat until the end of the file.
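The same selection can be done with a min-heap of the N best entries seen so far, so each remaining entry costs at most one comparison plus an occasional O(log N) replacement (a sketch in Python; the key function and file name are assumptions):

import heapq

def top_n(path, n=300, key=lambda line: line):
    heap = []                                  # min-heap of (key, line); never holds more than n items
    with open(path) as f:
        for line in f:
            k = key(line)
            if len(heap) < n:
                heapq.heappush(heap, (k, line))
            elif k > heap[0][0]:               # better than the lowest entry currently kept
                heapq.heapreplace(heap, (k, line))   # throw away the lowest, keep this one
    return [line for _k, line in sorted(heap, reverse=True)]

# best = top_n("entries.txt", n=300, key=lambda line: float(line.split("\t")[1]))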
Really, there is no sorting algorithm that can make moving 30 GB of randomly ordered data fast.
If you need to sort in multiple ways, the trick is not to move the data itself at all, but instead to create an index for each column that you need to sort.
I do it like that with files that are also tens of gigabytes long, and users can sort, scroll and search the data without noticing that it's a huge dataset they're working with.
You can find here a class which sorts a file using a slightly optimized merge sort. I wrote it a couple of years ago for fun. It uses a skip list for sorting files in memory.
Edit: The forum is German and you have to register (for free). It's safe, but requires a bit of German knowledge.
If you cannot fit the data into main memory then you are in the realm of external sorting. Typically this involves an external merge sort: sort smaller chunks of the data in memory, one by one, write them back to disk, and then merge these chunks.