Are there any good DEFLATE-like compression algorithms that are additive and immutable?

I need to push more items onto a stack-like structure and compress them additively, so that adding new data yields the expected compression gain while each new chunk is still stored as compressed data, without altering any of the previously written chunks.
In other words, it must preserve the additive property over successive compressed chunks: if f() is the 'adding another chunk' function, then for every chunk x, applying f(x) must leave all earlier compressed output unchanged.

Sure. Deflate does this, if you just keep it running and terminate each chunk at a byte boundary, e.g. with Z_SYNC_FLUSH, so that it can be written to file up to and including that chunk.
So long as your chunks are large enough, you will get the same compression gain.
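For concreteness, here is a minimal sketch of that approach using Python's zlib bindings (the output file name and the example chunks are made up); each call compresses the next chunk as part of the same ongoing deflate stream and flushes to a byte boundary, so everything already written never has to change:

    import zlib

    # One long-lived deflate stream; each flushed piece is appended to the
    # file and never modified afterwards.
    compressor = zlib.compressobj(level=6)

    def append_chunk(f, chunk: bytes) -> None:
        """Compress chunk as part of the ongoing stream and flush to a byte
        boundary, so the file is valid up to and including this chunk."""
        data = compressor.compress(chunk)
        data += compressor.flush(zlib.Z_SYNC_FLUSH)  # byte-align, keep stream open
        f.write(data)

    with open("stack.z", "ab") as f:      # hypothetical output file
        append_chunk(f, b"first chunk of data")
        append_chunk(f, b"second chunk of data")

Because the deflate history is shared across chunks, later chunks can reference earlier data, which is why the overall compression gain is preserved as long as chunks are reasonably large.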

Related

Reading file as stream of strings in Dart: how many events will be emitted?

The standard way to open a file in Dart as a stream is to use file.openRead(), which returns a Stream<List<int>>.
The next standard step is to transform this stream with the utf8.decoder StreamTransformer, which yields a Stream<String>.
I noticed that with the files I've tried, this resulting stream only emits a single event with the whole file content as one string. But I feel this should not be the general case, since otherwise the API wouldn't need to return a stream of strings; a Future<String> would suffice.
Could you explain how I can observe the behavior where this stream emits more than one event? Does it depend on the file size, the disk I/O rate, or some buffer size?
It depends on the file size, the buffer size, and on how the file operations are implemented.
If you read a large file, you will very likely get multiple events of a limited size. The UTF-8 decoder decodes chunks eagerly, so you should get roughly the same number of chunks after decoding. It might carry a few bytes across chunk boundaries, but the rest of the bytes are decoded as soon as possible.
Checking on my local machine, the buffer size seems to be 65536 bytes. Reading a file larger than that gives me multiple chunks.
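The same behavior can be reproduced outside Dart. Here is a small Python sketch (the file name and chunk size are arbitrary) that reads a file in fixed-size chunks and feeds them through an incremental UTF-8 decoder, which, like Dart's utf8.decoder, decodes each chunk eagerly and only carries incomplete byte sequences over to the next chunk:

    import codecs

    CHUNK_SIZE = 64 * 1024                    # comparable to the 65536-byte buffer above
    decoder = codecs.getincrementaldecoder("utf-8")()

    events = 0
    with open("big_file.txt", "rb") as f:     # hypothetical large file
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            text = decoder.decode(chunk)      # eager decode; keeps partial sequences
            if text:
                events += 1                   # one "string event" per byte chunk
    decoder.decode(b"", final=True)           # flush; raises if file ends mid-character

    print(f"emitted {events} string events")

Any file larger than one read buffer should therefore produce more than one event.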

Storing and loading 2d infinite procedurally generated tile based world

Background:
I am working on 2d infinite world generation. It is tile based, meaning my terrain is made entirely of squares. You can imagine it like 2d Minecraft (looking at the terrain from above).
I implemented a standard chunk system where the terrain gets chopped into small 8x8-tile areas that get loaded and deleted as the player moves around the world. Everything mentioned so far works perfectly smoothly, without any hiccups or lag. I am using Lua and Corona SDK.
The problem:
Since the player will be able to modify the terrain, I need a fast and efficient system for saving chunks to storage once the player moves on to a new chunk, and for loading those chunks back if they have been visited previously.
This is where the problem arises. The game needs to read from and write to files quite often, which causes noticeable lag. Making chunks bigger is not an option.
Solutions I tried but all caused lag:
a) The first and obvious solution I implemented was to create a text file for each chunk with tile names as strings. It looked something like this: x12y10.txt, and inside the file I just dumped all tile names in the order they need to be placed on screen: "Grass Grass Water Sand Sand Sand Grass Grass...". That worked, but loading strings was slow, so I tried another solution: saving tiles as indexes.
b) Saving tiles as their indexes. I paired every tile with a number. Since numbers are shorter, they take less space and are faster to load. I gave each tile its own index: Grass -> id 1, Water -> id 2, Sand -> id 3 and so on. This way I only needed to save 1 or 2 characters instead of a full string per tile. My txt files now looked like this: "1 1 2 3 3 3 1 1...". This worked better but still caused lag.
c) The next improvement was in how chunks are organized on disk. Instead of dumping all the chunks into a single folder, I made a folder for each x coordinate and put all chunks with that x value in there.
So instead of this:
Folder with all chunks: x0y0.txt, x0y1.txt, x0y2.txt, x1y0.txt, x1y1.txt, x1y2.txt
Inside the folder with all chunks I now had this:
Folder x0: x0y0.txt, x0y1.txt, x0y2.txt
Folder x1: x1y0.txt, x1y1.txt, x1y2.txt
I am not sure how much this helps for a small number of chunks, but I am fairly sure the improvement is there for thousands of chunks.
Possible solutions?
I have some ideas for improvements, but I would like to hear your opinion on the solutions.
a) Saving terrain in binary files?
b) I have read about the Minecraft region format and really tried to understand how it works, but I did not get it, since there is little information about it. So if anyone knows it and could explain that system to me, I would be really grateful.
c) Another faster file format?
d) Is making/accessing many folders slow? Is there a better alternative?
I really feel like this is a CS-101 question, but I cannot google up an answer right away, so here is a quick summary.
All files are just sequences of bytes. If we're talking about reading and writing raw bytes, no format will make 64 bytes appear in memory faster than another.
A text file is a sequence of bytes with slight limitations on their values (well, the limitation only matters if you want standard text programs to display it). The string "11" (byte sequence 00110001 00110001) from a text file won't be loaded any faster than the non-printable byte sequence 00000001 00000001 from a "binary" file.
Structuring directories at the very least reduces the number of entries the system has to check when opening the file you've requested. But the mechanisms underlying filesystems are very complex and affected by a lot of factors. The general expectation is that frequent reads, even of small files, will be slow. All files also carry some bookkeeping overhead (system information to keep them tracked and ordered), so small files have a lower ratio of useful to auxiliary data. I know of at least one 2d project with a mutable map that was making HDDs growl and grunt before they moved to bigger files years ago.
You don't have to make chunks bigger (that is a different thing), but you can write them into the same file.
Instead of a million files of 64 bytes each, you can have a single file of a few tens of megabytes (at one byte per tile, each 8x8 chunk is 64 bytes). A million chunks is a lot for a player to modify or walk around. If you unpack that data into tables it will take up more space, but you don't have to decode the whole string, only the bytes you currently need. Yes, modifying a megabyte-sized string in Lua creates another such string, which is slow, but you don't have to do it every time, or you can split the string into smaller ones and modify those, and only write to disk when needed. I/O buffering may even happen without your intervention, but again, it is usually most helpful for big files.
Yes, there may be more than a byte of info per tile (though 2^8 possible states per tile is already a lot), but the system stays the same.
The same thing is done for textures, because loading data in one big scoop is faster than searching around for a tiny bit here and there. Indexing one long area of memory is also faster than chasing pointers around.
On top of that, you can try to read/write fewer bytes than you keep in memory, for example by compressing the data.
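As a rough sketch of the single-file idea in Python (the file name, the assumed world width, and the tile IDs are placeholders; the real code would be Lua/Corona): store one byte per tile, 64 bytes per 8x8 chunk, and compute each chunk's offset from its coordinates so that you only ever read or write the 64 bytes you need:

    CHUNK_TILES = 8 * 8       # 64 tiles -> 64 bytes per chunk (one byte per tile)
    WORLD_CHUNKS_X = 1024     # assumed fixed width of the stored region, in chunks

    def chunk_offset(cx: int, cy: int) -> int:
        """Byte offset of chunk (cx, cy) inside the single world file."""
        return (cy * WORLD_CHUNKS_X + cx) * CHUNK_TILES

    def read_chunk(f, cx: int, cy: int) -> bytes:
        f.seek(chunk_offset(cx, cy))
        return f.read(CHUNK_TILES)            # 64 tile IDs, e.g. 1=Grass, 2=Water, 3=Sand

    def write_chunk(f, cx: int, cy: int, tiles: bytes) -> None:
        assert len(tiles) == CHUNK_TILES
        f.seek(chunk_offset(cx, cy))
        f.write(tiles)

    # world.bin is assumed to exist, pre-allocated to cover the stored region.
    with open("world.bin", "r+b") as f:
        tiles = bytearray(read_chunk(f, 3, 7))
        tiles[0] = 2                          # player turns a Grass tile into Water
        write_chunk(f, 3, 7, bytes(tiles))

Random access by offset means a chunk load touches a fixed 64-byte region instead of opening a separate file per chunk.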
In Minecraft, chunks are not stored unless they have been visited or modified; otherwise they are generated.
That leaves you with a system where only chunks which have been modified by the player need to be stored, with the unmodified areas being re-generated from the same random seed each time.
Create a hierarchy of modifications: a chunk is an 8x8 block of tiles, so create a super-chunk which is 8x8 chunks, and only look for a file if anything in the 8x8 super-chunk has been modified.
Possibly store all of the super-chunk in one file, which would limit the number of files (adding more files does decrease the speed of the system, and also uses space on the system inefficiently).
If you have any spare time/space, perhaps keep a cache of the chunks near the player and pre-load the modified areas that are being approached. This would limit the visible lag.
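Here is a rough Python sketch of the 'regenerate unless modified' idea (the hash-based generator and the file layout are placeholders for whatever the real Lua/Corona code does): unmodified chunks come from a deterministic function of the world seed, and only chunks the player has edited are ever written to disk.

    import hashlib
    import os

    WORLD_SEED = 1337                     # assumed global world seed
    CHUNK_TILES = 8 * 8

    def generate_chunk(cx: int, cy: int) -> bytes:
        """Deterministically regenerate a chunk from the seed and its coordinates."""
        digest = hashlib.sha256(f"{WORLD_SEED}:{cx}:{cy}".encode()).digest()
        raw = (digest * 2)[:CHUNK_TILES]  # stretch the 32-byte digest to 64 bytes
        return bytes(b % 4 for b in raw)  # e.g. 0=Grass, 1=Water, 2=Sand, 3=Rock

    def load_chunk(cx: int, cy: int) -> bytes:
        """Return the saved chunk if the player modified it, else regenerate it."""
        path = f"modified/x{cx}y{cy}.bin" # only modified chunks ever get a file
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read(CHUNK_TILES)
        return generate_chunk(cx, cy)

    def save_chunk(cx: int, cy: int, tiles: bytes) -> None:
        os.makedirs("modified", exist_ok=True)
        with open(f"modified/x{cx}y{cy}.bin", "wb") as f:
            f.write(tiles)

A real generator would use its terrain-noise function rather than a hash, but the storage pattern is the same: disk only ever holds the diff against what the seed can reproduce.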

Control the tie-breaking choice for loading chunks in the scheduler?

I have some large files in a local binary format, which contains many 3D (or 4D) arrays as a series of 2D chunks. The order of the chunks in the files is random (could have chunk 17 of variable A, followed by chunk 6 of variable B, etc.). I don't have control over the file generation, I'm just using the results. Fortunately the files contain a table of contents, so I know where all the chunks are without having to read the entire file.
I have a simple interface to lazily load this data into dask, and re-construct the chunks as Array objects. This works fine - I can slice and dice the array, do calculations on them, and when I finally compute() the final result the chunks get loaded from file appropriately.
However, the order that the chunks are loaded is not optimal for these files. If I understand correctly, for tasks where there is no difference of cost (in terms of # of dependencies?), the local threaded scheduler will use the task keynames as a tie-breaker. This seems to cause the chunks to be loaded in their logical order within the Array. Unfortunately my files do not follow the logical order, so this results in many seeks through the data (e.g. seek halfway through the file to get chunk (0,0,0) of variable A, then go back near the beginning to get chunk (0,0,1) of variable A, etc.). What I would like to do is somehow control the order that these chunks get read, so they follow the order in the file.
I found a kludge that works for simple cases, by creating a callback function on the start_state. It scans through the tasks in the 'ready' state, looking for any references to these data chunks, then re-orders those tasks based on the order of the data on disk. Using this kludge, I was able to speed up my processing by a factor of 3. I'm guessing the OS is doing some kind of read-ahead when the file is being read sequentially, and the chunks are small enough that several get picked up in a single disk read. This kludge is sufficient for my current usage, however, it's ugly and brittle. It will probably work against dask's optimization algorithm for complex calculations. Is there a better way in dask to control which tasks win in a tie-breaker, in particular for loading chunks from disk? I.e., is there a way to tell dask, "all things being equal, here's the relative order I'd like you to process this group of chunks?"
Your assessment is correct. As of 2018-06-16 there is not currently any way to add in a final tie breaker. In the distributed scheduler (which works fine on a single machine) you can provide explicit priorities with the priority= keyword, but these take precedence over all other considerations.
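For completeness, here is a small sketch of the priority= route with the distributed scheduler on one machine (the loader function, file name, offsets, and sizes are all placeholders); note that, as said above, these priorities override the scheduler's other heuristics rather than acting only as a tie-breaker:

    import dask
    from distributed import Client

    @dask.delayed
    def load_chunk(path, offset, nbytes):
        """Placeholder for reading one on-disk chunk; returns raw bytes."""
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(nbytes)

    client = Client(processes=False)      # distributed scheduler, single machine

    # Chunks listed in *file* order, taken from the file's table of contents.
    chunks_in_file_order = [
        ("data.bin", 0, 4096),            # hypothetical offsets and sizes
        ("data.bin", 4096, 4096),
    ]

    # Give chunks that appear earlier in the file a higher priority, so reads
    # proceed roughly sequentially through the file.
    futures = [
        client.compute(load_chunk(path, off, n), priority=-i)
        for i, (path, off, n) in enumerate(chunks_in_file_order)
    ]
    results = client.gather(futures)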

Zip stream implementation with C in embedded devices

I have to embed a large text file in the limited internal memory of an MCU. The MCU will use the content of the text file for some purposes later.
The memory limitation does not allow me to embed the file content directly in my code (suppose I use a character array to store the file content), but if I compress the content of the file (using a light-weight algorithm like zip or gzip) then everything fits.
Suppose that the MCU uses a getBytes(i, len) function to read the content of my array (where i is the index of the first required byte and len is the length of the data to be read).
The problem is that once I compress the content and store it on the device (in my character array), I can't use the getBytes function anymore to fetch the target data. So if I can write a wrapper on top of the getBytes function to map the compressed content to the requested content, my problem will be solved.
I have no processing limitation on the MCU, but the amount of memory is limited, and as far as I know access to the content of a zip-compressed file is sequential. So is it possible to do this in an acceptable manner using C or C++ in such an environment?
It is definitely possible to do this in a simple and efficient manner.
However, it is better to use piecewise compression (at the expense of some compression ratio) instead of compressing/decompressing the entire file at once; otherwise you would need to hold the entire decompressed file in RAM.
With a small piece, the compression ratio of a strong algorithm will not be much different from that of a relatively weak one, so I recommend using a simple compression algorithm.
Disk compression algorithms are best suited for such purposes, as they are designed to compress/decompress independent blocks.
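As a sketch of that block-wise scheme, in Python rather than C (zlib stands in for whatever light-weight codec the MCU would actually use, and all names are made up): the text is compressed in fixed-size independent blocks, an offset table is kept alongside, and a get_bytes-style wrapper decompresses only the blocks that cover the requested range:

    import zlib

    BLOCK_SIZE = 1024   # uncompressed bytes per block; trades RAM for ratio

    def compress_blocks(data: bytes):
        """Compress data in independent BLOCK_SIZE pieces; return (blob, offsets)."""
        blob, offsets = b"", []
        for i in range(0, len(data), BLOCK_SIZE):
            offsets.append(len(blob))
            blob += zlib.compress(data[i:i + BLOCK_SIZE])
        offsets.append(len(blob))             # end sentinel
        return blob, offsets

    def get_bytes(blob, offsets, i, length):
        """Wrapper with getBytes(i, len) semantics over the compressed blob."""
        first, last = i // BLOCK_SIZE, (i + length - 1) // BLOCK_SIZE
        out = b"".join(zlib.decompress(blob[offsets[b]:offsets[b + 1]])
                       for b in range(first, last + 1))
        start = i - first * BLOCK_SIZE        # position inside the first block
        return out[start:start + length]

    # usage
    text = ("some large text file content " * 200).encode()
    blob, offsets = compress_blocks(text)
    assert get_bytes(blob, offsets, 5000, 40) == text[5000:5040]

Only one or two blocks ever need to be decompressed for a typical small request, which keeps the RAM requirement bounded regardless of the file size.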

Can HDF5 perform "value mapping"?

If I have a 32^3 array of 64-bit integers, but it contains only a dozen different values, can you tell HDF5 to use an "internal mapping" to save memory and/or disk space? What I mean is that the array would be accessed normally as 64-bit ints, but each value would internally be stored as a byte (?) index into a table of 64-bit ints, potentially saving about 7/8 of the memory and/or disk space. If this is possible, does it actually save memory, disk space, or both?
I don't believe that HDF5 provides this functionality right out of the box, but there is no reason why you couldn't implement routines to write your data to an HDF5 file and read it back again in the way that you seem to want. I suppose you could write your look-up table and your array into different datasets.
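A rough sketch of that two-dataset approach with h5py (the dataset names and the example values are invented): the dozen unique values go into a small lookup table and the 32^3 array is stored as one-byte indices, optionally with HDF5's built-in gzip filter on top:

    import numpy as np
    import h5py

    # Example 32^3 array of 64-bit ints drawn from only a dozen distinct values.
    rng = np.random.default_rng(0)
    palette = np.array([10**9 + k for k in range(12)], dtype=np.int64)
    values = rng.choice(palette, size=(32, 32, 32))

    # Factor it into a lookup table plus byte-sized indices.
    table, indices = np.unique(values, return_inverse=True)
    indices = indices.reshape(values.shape).astype(np.uint8)

    with h5py.File("mapped.h5", "w") as f:
        f.create_dataset("lookup", data=table)                        # 12 x int64
        f.create_dataset("indices", data=indices, compression="gzip") # 32^3 x uint8

    # Reading it back reconstructs the original 64-bit array.
    with h5py.File("mapped.h5", "r") as f:
        restored = f["lookup"][...][f["indices"][...]]
    assert np.array_equal(restored, values)

The index dataset is already 1/8 the size of the original before the gzip filter even runs.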
It's possible, though I have no evidence to suggest it, that HDF5's compression facility would compress your integer dataset enough for you to save a useful amount of space.
Then again, for the HDF5 files I work with (tens of GBs) I wouldn't bother devising my own encoding scheme to save the modest amount of space that a 32768-element array of 64-bit numbers might be able to dispense with. Sure, you could transform a dataset of 2097152 bits (256 KiB) into one of 131072 bits (16 KiB, using 4-bit indices for a dozen distinct values), but disk space (even RAM) just isn't that tight these days.
I'm beginning to form the impression that you are trying to use HDF5 on, perhaps, a smartphone :-)
