Using a zarr store as a fixed-size buffer

I'm trying to use a zarr store as a fixed-size buffer (i.e. new data is appended to the end, and the same amount of data is removed from the beginning when a certain size is reached).
The store is huge (20 TB), and contains a 2D matrix (positions over time).
Writing to zarr is handled by xarray.
However, I'm not sure whether zarr supports this.
I can think of two solutions:
create a new xarray object from the first, eliminating the older data. However, writing that to disk will either append ("a"), leaving the older data intact, or overwrite ("w"), in which case I'm afraid the whole thing is rewritten which would not be performant for 20 TB.
use zarr.core.Array.resize, but this does not seem to allow dropping data at the start
Maybe zarr does not support this and I have to think of another solution, or write my own store specifically aimed at this type of problem.
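One way to picture the "another solution" option is a circular layout: the array keeps a fixed length along the time axis and a write pointer wraps around, so the oldest entries are overwritten in place and nothing is shifted or rewritten. The sketch below is only an illustration under that assumption. A plain Python list stands in for the on-disk array, and `RingBuffer` and its methods are hypothetical names, not part of zarr or xarray. With a real store the write counter would have to be persisted alongside the data.

```python
class RingBuffer:
    """Fixed-capacity buffer that overwrites its oldest entries in place."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity  # stand-in for the fixed-size array
        self.total_written = 0          # number of items ever appended

    def append(self, item):
        # The physical slot wraps around once the buffer is full.
        self.slots[self.total_written % self.capacity] = item
        self.total_written += 1

    def __len__(self):
        return min(self.total_written, self.capacity)

    def __getitem__(self, logical_index):
        # Logical index 0 is always the oldest item still stored.
        if not 0 <= logical_index < len(self):
            raise IndexError(logical_index)
        oldest = max(0, self.total_written - self.capacity)
        return self.slots[(oldest + logical_index) % self.capacity]


buf = RingBuffer(capacity=4)
for t in range(6):          # write 6 time steps into 4 slots
    buf.append(f"step-{t}")
window = [buf[i] for i in range(len(buf))]
```

Readers then translate a logical time index into a physical slot, which is the price paid for never moving existing data.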


Reed-Solomon in file recovery

A piece of software I'm working on outputs quite a lot of files which are then stored on a server. During its runtime I've had one file go corrupt on me. These files are critical to the operation, so this cannot happen. I'm therefore trying to come up with a way of adding error correction to the files to prevent this from ever happening again.
I've read up on Reed-Solomon, which encodes k blocks of data plus m blocks of parity, and can then reconstruct up to m missing blocks. So what I'm thinking is taking the data stream, splitting it into these blocks, and then storing them in sequence on disk: first the data blocks, then the parity blocks. Repeat until the entire file is stored. k, m, and the block size are of course variables I'll have to investigate and play with.
However, it's my understanding that Reed-Solomon requires you to know which blocks are corrupt. How could I possibly know that? My thinking is I'd have to add some extra, simpler, error detection code to each of the blocks as I write them, otherwise I can't know if they're corrupted. Like CRC32 or something.
Have I understood this correctly, or is there a better way to accomplish this?
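To make the detection idea concrete, here is a minimal sketch of per-block CRC32 checksums, using `zlib.crc32` from the Python standard library. The block size, function names and sample data are assumptions for illustration only. The list of failing block indices is what would be handed to an erasure decoder as the known-bad positions. Note that CRC32 is not cryptographic and can, with small probability, miss heavier corruption.

```python
import zlib

BLOCK_SIZE = 8  # deliberately tiny for the demo


def split_with_checksums(data, block_size=BLOCK_SIZE):
    """Split data into blocks and pair each block with its CRC32."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [(block, zlib.crc32(block)) for block in blocks]


def find_corrupt_blocks(stored):
    """Return the indices of blocks whose CRC32 no longer matches."""
    return [i for i, (block, crc) in enumerate(stored) if zlib.crc32(block) != crc]


stored = split_with_checksums(b"positions over time, block by block")
# Simulate bit rot: flip one bit in block 2 without touching its checksum.
block, crc = stored[2]
stored[2] = (bytes([block[0] ^ 0x01]) + block[1:], crc)

erasures = find_corrupt_blocks(stored)
```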
This is a bit of an older question, but (in my mind) is always something that is useful and in some cases necessary. Bit rot will never be completely cured (hush ZFS community; ZFS only has control of what's on its filesystem while it's there), so we always have to come up with proactive prevention and recovery plans.
While it was designed to facilitate piracy (specifically storing and extracting multi-GB files in chunks on newsgroups where any chunk could go missing or be corrupted), "Parchives" are actually exactly what you're looking for (see the white paper, though don't implement that scheme directly as it has a bug and newer schemes are available), and they work in practice as follows:
The complete file is input in to the encoder
Blocks are processed and Reed-Solomon blocks are generated
.par files containing those blocks are output along side the original file
When integrity is checked (typically on the other end of a file transfer), the blocks are rechecked and any blocks that need to be used to reconstruct missing data are pulled from the .par files.
Things eventually settled in to "PAR2" (essentially a rewrite with additional features) with the following scheme:
Large file compressed with RAR and split in to chunks (typically around 100MB each as that was a "usually safe" max of usenet)
An "index" file is placed along side the file (for example bigfile.PAR2). This has no recovery chunks.
A series of par files totaling 10% of the original data size are along side in increasingly larger filesizes (bigfile.vol029+25.PAR2, bigfile.vol104+88.PAR2, etc)
The person on the other end then gets all the .rar files
An integrity check is run, and returns a count (in MB) of how much data needs recovery
.PAR2 files are downloaded in an amount equal to or greater than the need
Recovery is done and integrity verified
RAR is extracted, and the original file is successfully transferred
Now without a filesystem layer this system is still fairly trivial to implement using the Parchive tools, but it has two requirements:
That the files do not change (as any change to the file on-disk will invalidate the parity data (of course you could do this and add complexity with a copy-on-change writing scheme))
That you run both the file generation and integrity check/recovery when appropriate.
Since all the math and methods are both known and battle-tested, you can also roll your own to meet whatever needs you have (as a hook in to file read/write, spanning arbitrary path depths, storing recovery data on a separate drive, etc). For initial tips, refer to the pros: https://www.backblaze.com/blog/reed-solomon/
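For a feel of how recovery from parity works, here is a toy sketch. It is deliberately not Reed-Solomon: it uses a single XOR parity block (the scheme RAID 5 uses), which is about the simplest erasure code there is and can rebuild exactly one block whose position is known to be lost. Reed-Solomon extends the same idea to m parity blocks using finite-field arithmetic. The block contents and names are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data_blocks = [b"AAAA", b"BBBB", b"CCCC"]   # k = 3 data blocks
parity = xor_blocks(data_blocks)            # m = 1 parity block

# Block 1 is lost (its position is known, e.g. flagged by a checksum).
survivors = [data_blocks[0], data_blocks[2], parity]
recovered = xor_blocks(survivors)
```

XOR-ing the surviving blocks with the parity cancels everything except the missing block, which is why the position of the loss has to be known.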
Edit: The same research that led me to this question led me to a whole subset of already-done work that I was previously unaware of
https://crates.io/crates/solana-reed-solomon-erasure (as well as a bunch of other implementations in the Rust crate registry)
https://github.com/klauspost/reedsolomon (based on the BackBlaze code, and processes 1Gbps per core)
Etc. Look for "Reed-Solomon file recovery".

iOS SQLite or 1000 loose files?

Suppose I have 1000 records of variable size, ranging from around 256 bytes to a few K. I wonder is there any advantage of putting them into a sqlite database versus just reading/writing 1000 loose files on iOS? I don't need to do any operations other than access by a single key, which I can use as the filename. Seems like the file system would be the winner unless the number of records grows very large.
If your system were read-only, I would say that the file system is the clear winner: a simple binary file and perhaps a small index to know where each record starts would be all that you need. You could read the entire index into memory, and then grab your records from the file system as needed, for a performance that would be extremely tough to match for any RDBMS.
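As a rough sketch of that read-only layout (shown in Python rather than an iOS language, purely to illustrate the idea): all records are concatenated into one blob, and a small in-memory index maps each key to an offset and a length. The function names are hypothetical, and `io.BytesIO` stands in for a real file handle.

```python
import io


def pack_records(records):
    """Concatenate records into one blob and build a key -> (offset, length) index."""
    blob = io.BytesIO()
    index = {}
    for key, payload in records.items():
        index[key] = (blob.tell(), len(payload))
        blob.write(payload)
    return blob.getvalue(), index


def read_record(fileobj, index, key):
    """Seek to the record's offset and read exactly its length."""
    offset, length = index[key]
    fileobj.seek(offset)
    return fileobj.read(length)


blob, index = pack_records({"a": b"x" * 256, "b": b"hello", "c": b"y" * 3000})
data_file = io.BytesIO(blob)  # stand-in for a file opened in "rb" mode
record_b = read_record(data_file, index, "b")
```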
However, since you are planning on writing data back, I would suggest going with SQLite because of potential data integrity issues.
Performance concerns should not be underestimated, either: since your records are of variable size, writing the data back may prove to be difficult in cases when records need to expand. Moreover, since you are on a mobile platform, you would need to build something in to avoid data corruption when the program is killed unexpectedly in the middle of a write. SQLite takes care of this; your code would have to build something comparable to it, or risk data corruption problems.
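For comparison, a single-key store on top of SQLite is only a few lines. This is a hedged sketch using Python's standard `sqlite3` module rather than the iOS API; the table layout and the `put`/`get` helpers are assumptions made for the example. Each write runs inside a transaction, so it either completes or is rolled back, and a record that grows needs no special handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would be used on a device
conn.execute("CREATE TABLE records (key TEXT PRIMARY KEY, payload BLOB NOT NULL)")


def put(key, payload):
    # "with conn" wraps the statement in a transaction: it commits on
    # success and rolls back if an exception is raised.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO records (key, payload) VALUES (?, ?)",
            (key, payload),
        )


def get(key):
    row = conn.execute("SELECT payload FROM records WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


put("user-1", b"short")
put("user-1", b"a much longer replacement payload")  # the record grows
value = get("user-1")
missing = get("nope")
```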

What is the fastest way for reading huge files in Delphi?

My program needs to read chunks from a huge binary file with random access. I have got a list of offsets and lengths which may have several thousand entries. The user selects an entry and the program seeks to the offset and reads length bytes.
The program internally uses a TMemoryStream to store and process the chunks read from the file. Reading the data is done via a TFileStream like this:
FileStream.Position := Offset;
MemoryStream.CopyFrom(FileStream, Size);
This works fine but unfortunately it becomes increasingly slower as the files get larger. The file size starts at a few megabytes but frequently reaches several tens of gigabytes. The chunks read are around 100 kbytes in size.
The file's content is only read by my program. It is the only program accessing the file at the time. Also the files are stored locally so this is not a network issue.
I am using Delphi 2007 on a Windows XP box.
What can I do to speed up this file access?
edit:
The file access is slow for large files, regardless of which part of the file is being read.
The program usually does not read the file sequentially. The order of the chunks is user driven and cannot be predicted.
It is always slower to read a chunk from a large file than to read an equally large chunk from a small file.
I am talking about the performance for reading a chunk from the file, not about the overall time it takes to process a whole file. The latter would obviously take longer for larger files, but that's not the issue here.
I need to apologize to everybody: After I implemented file access using a memory mapped file as suggested it turned out that it did not make much of a difference. But it also turned out after I added some more timing code that it is not the file access that slows down the program. The file access takes actually nearly constant time regardless of the file size. Some part of the user interface (which I have yet to identify) seems to have a performance problem with large amounts of data and somehow I failed to see the difference when I first timed the processes.
I am sorry for being sloppy in identifying the bottleneck.
If you open the help topic for the CreateFile() WinAPI function, you will find interesting flags there such as FILE_FLAG_NO_BUFFERING and FILE_FLAG_RANDOM_ACCESS. You can play with them to gain some performance.
Next, copying the file data, even 100 KB in size, is an extra step which slows down operations. It is a good idea to use the CreateFileMapping and MapViewOfFile functions to get a ready-to-use pointer to the data. This way you avoid copying and also possibly get certain performance benefits (but you need to measure speed carefully).
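The memory-mapping idea can be tried quickly outside Delphi. The sketch below uses Python's `mmap` module, which wraps the operating system's file-mapping facility. The demo file, offset and length are invented for the example, and unlike a raw pointer, slicing the map in Python still copies the requested range.

```python
import mmap
import os
import tempfile

# Build a small demo file; the real file would be the huge binary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 16)   # 4096 bytes
    path = tmp.name

with open(path, "rb") as f:
    # Length 0 maps the whole file; pages are loaded lazily on access.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        offset, length = 1000, 10
        chunk = view[offset:offset + length]   # slicing copies only this range

os.remove(path)
```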
Maybe you can take this approach:
Sort the entries on max fileposition and then do the following:
Take the entries that only need the first X MB of the file (till a certain fileposition)
Read X MB from the file into a buffer (TMemoryStream)
Now read the entries from the buffer (maybe multithreaded)
Repeat this for all the entries.
In short: cache a part of the file and read all entries that fit into it (multithreaded), then cache the next part etc.
Maybe you can gain speed if you just take your original approach, but sort the entries on position.
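The simpler variant (keep the original seek-and-read approach, but visit the entries in order of position) could look roughly like this. It is a sketch in Python, with `io.BytesIO` standing in for the file stream and a hypothetical `read_sorted` helper. The results are handed back in the order the entries were requested.

```python
import io


def read_sorted(fileobj, entries):
    """Read (offset, length) entries in ascending offset order.

    Results are returned in the caller's original order.
    """
    order = sorted(range(len(entries)), key=lambda i: entries[i][0])
    results = [None] * len(entries)
    for i in order:
        offset, length = entries[i]
        fileobj.seek(offset)
        results[i] = fileobj.read(length)
    return results


source = io.BytesIO(b"0123456789abcdefghij")   # stand-in for the big file
entries = [(15, 3), (0, 2), (8, 4)]            # requested out of order
chunks = read_sorted(source, entries)
```

This only helps when several entries are known up front; for a single user-selected chunk there is nothing to sort.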
The stock TMemoryStream in Delphi is slow due to the way it allocates memory. The NexusDB company has TnxMemoryStream which is much more efficient. There might be some free ones out there that work better.
The stock Delphi TFileStream is also not the most efficient component. Way back in history Julian Bucknall published a component named BufferedFileStream in a magazine or somewhere that worked with file streams very efficiently.
Good luck.

How does shared memory vs message passing handle large data structures?

In looking at Go and Erlang's approach to concurrency, I noticed that they both rely on message passing.
This approach obviously alleviates the need for complex locks because there is no shared state.
However, consider the case of many clients wanting parallel read-only access to a single large data structure in memory -- like a suffix array.
My questions:
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
One thing to realise is that the Erlang concurrency model does NOT really specify that the data in messages must be copied between processes, it states that sending messages is the only way to communicate and that there is no shared state. As all data is immutable, which is fundamental, then an implementation may very well not copy the data but just send a reference to it. Or may use a combination of both methods. As always, there is no best solution and there are trade-offs to be made when choosing how to do it.
The BEAM uses copying, except for large binaries where it sends a reference.
Yes, shared state could be faster in this case. But only if you can forgo the locks, and this is only doable if it's absolutely read-only. If it's 'mostly read-only' then you need a lock (unless you manage to write lock-free structures; be warned that they're even trickier than locks), and then you'd be hard-pressed to make it perform as fast as a good message-passing architecture.
Yes, you could write a 'server process' to share it. With really lightweight processes, it's no more heavy than writing a small API to access the data. Think like an object (in OOP sense) that 'owns' the data. Splitting the data in chunks to enhance parallelism (called 'sharding' in DB circles) helps in big cases (or if the data is on slow storage).
Even if NUMA is getting mainstream, you still have more and more cores per NUMA cell. And a big difference is that a message can be passed between just two cores, while a lock has to be flushed from cache on ALL cores, limiting it to the inter-cell bus latency (even slower than RAM access). If anything, shared-state/locks is getting more and more unfeasible.
In short: get used to message passing and server processes, it's all the rage.
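A 'server process' that owns the data can be sketched as follows. Python threads and `queue.Queue` stand in here for Erlang processes or Go goroutines and channels, so this shows the structure only and says nothing about performance. The tiny suffix list and the `server`/`lookup` names are invented for the example.

```python
import queue
import threading


def server(text, requests):
    """Own the data; answer (index, reply_queue) requests until None arrives."""
    suffixes = sorted(text[i:] for i in range(len(text)))  # private to the server
    while True:
        message = requests.get()
        if message is None:
            break
        index, reply = message
        reply.put(suffixes[index])


requests = queue.Queue()
worker = threading.Thread(target=server, args=("banana", requests))
worker.start()


def lookup(index):
    """Client side: send a request and block until the reply arrives."""
    reply = queue.Queue(maxsize=1)
    requests.put((index, reply))
    return reply.get()


first = lookup(0)
last = lookup(5)
requests.put(None)   # shut the server down
worker.join()
```

Clients never touch the structure directly; they only see the replies, which is the 'object that owns the data' view described above.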
Edit: revisiting this answer, I want to add about a phrase found on Go's documentation:
share memory by communicating, don't communicate by sharing memory.
the idea is: when you have a block of memory shared between threads, the typical way to avoid concurrent access is to use a lock to arbitrate. The Go style is to pass a message with the reference, a thread only accesses the memory when receiving the message. It relies on some measure of programmer discipline; but results in very clean-looking code that can be easily proofread, so it's relatively easy to debug.
The advantage is that you don't have to copy big blocks of data on every message, and don't have to effectively flush down caches as on some lock implementations. It's still somewhat early to say if the style leads to higher performance designs or not (especially since the current Go runtime is somewhat naive on thread scheduling).
In Erlang, all values are immutable - so there's no need to copy a message when it's sent between processes, as it cannot be modified anyway.
In Go, message passing is by convention - there's nothing to prevent you sending someone a pointer over a channel, then modifying the data pointed to, only convention, so once again there's no need to copy the message.
Most modern processors use variants of the MESI protocol. Because of its Shared state, passing read-only data between different threads is very cheap. Modified shared data is very expensive though, because all other caches that store this cache line must invalidate it.
So if you have read-only data, it is very cheap to share it between threads instead of copying with messages. If you have read-mostly data, it can be expensive to share between threads, partly because of the need to synchronize access, and partly because writes destroy the cache friendly behavior of the shared data.
Immutable data structures can be beneficial here. Instead of changing the actual data structure, you simply make a new one that shares most of the old data, but with the things changed that you need changed. Sharing a single version of it is cheap, since all the data is immutable, but you can still update to a new version efficiently.
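A small sketch of that "new version that shares most of the old data" idea is a persistent binary search tree with path copying: an insert copies only the nodes on the path from the root to the change and reuses every other subtree. The `Node` type and helper functions below are made up for illustration.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key):
    """Return a new tree containing key; the old tree is left untouched."""
    if root is None:
        return Node(key)
    if key < root.key:
        return root._replace(left=insert(root.left, key))
    if key > root.key:
        return root._replace(right=insert(root.right, key))
    return root  # key already present: reuse the whole tree


def keys(root):
    """In-order traversal."""
    return [] if root is None else keys(root.left) + [root.key] + keys(root.right)


v1 = None
for k in (50, 30, 70, 20, 40):
    v1 = insert(v1, k)

v2 = insert(v1, 80)   # only the path 50 -> 70 -> 80 is copied
```

Readers holding `v1` keep seeing the old version, while `v2.left` is the very same object as `v1.left`, so nothing on that side was copied.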
What is a large data structure?
One person's large is another person's small.
Last week I talked to two people - one person was making embedded devices he used the word
"large" - I asked him what it meant - he said over 256 KBytes - later in the same week a
guy was talking about media distribution - he used the word "large" I asked him what he
meant - he thought for a bit and said "won't fit on one machine" say 20-100 TBytes
In Erlang terms "large" could mean "won't fit into RAM" - so with 4 GBytes of RAM
data structures > 100 MBytes might be considered large - copying a 500 MBytes data structure
might be a problem. Copying small data structures (say < 10 MBytes) is never a problem in Erlang.
Really large data structures (i.e. ones that won't fit on one machine) have to be
copied and "striped" over several machines.
So I guess you have the following:
Small data structures are no problem - since they are small data processing times are
fast, copying is fast and so on (just because they are small)
Big data structures are a problem - because they don't fit on one machine - so copying is essential.
Note that your questions are technically non-sensical because message passing can use shared state so I shall assume that you mean message passing with deep copying to avoid shared state (as Erlang currently does).
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
Using shared state will be a lot faster.
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Either approach can be used.
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
Copying is cache unfriendly and, therefore, destroys scalability on multicores because it worsens contention for the shared resource that is main memory.
Ultimately, Erlang-style message passing is designed for concurrent programming whereas your questions about throughput performance are really aimed at parallel programming. These are two quite different subjects and the overlap between them is tiny in practice. Specifically, latency is typically just as important as throughput in the context of concurrent programming and Erlang-style message passing is a great way to achieve desirable latency profiles (i.e. consistently low latencies). The problem with shared memory then is not so much synchronization among readers and writers but low-latency memory management.
One solution that has not been presented here is master-slave replication. If you have a large data-structure, you can replicate changes to it out to all slaves that perform the update on their copy.
This is especially interesting if one wants to scale to several machines that don't even have the possibility to share memory without very artificial setups (mmap of a block device that read/write from a remote computer's memory?)
A variant of it is to have a transaction manager that one asks nicely to update the replicated data structure, and it will make sure that it serves only one update request at a time. This is more the mnesia model for master-master replication of mnesia table data, which qualifies as a "large data structure".
The problem at the moment is indeed that the locking and cache-line coherency might be as expensive as copying a simpler data structure (e.g. a few hundred bytes).
Most of the time a cleverly written multi-threaded algorithm that tries to eliminate most of the locking will be faster, and a lot faster with modern lock-free data structures, especially when you have well designed cache systems like the chip-level multi-threading in Sun's Niagara.
If your system/problem is not easily broken down into a few simple data accesses then you have a problem. And not all problems can be solved by message passing. This is why there are still some Itanium based super computers sold, because they have terabytes of shared RAM and up to 128 CPUs working on the same shared memory. They are an order of magnitude more expensive than a mainstream x86 cluster with the same CPU power, but you don't need to break down your data.
Another reason not mentioned so far is that programs can become much easier to write and maintain when you use multi-threading. Message passing and the shared nothing approach makes it even more maintainable.
As an example, Erlang was never designed to make things faster but instead use a large number of threads to structure complex data and event flows.
I guess this was one of the main points in the design. In the web world of google you usually don't care about performance - as long as it can run in parallel in the cloud. And with message passing you ideally can just add more computers without changing the source code.
Usually message passing languages (this is especially easy in Erlang, since it has immutable variables) optimise away the actual data copying between the processes (of course local processes only: you'll want to think through your network distribution pattern carefully), so this isn't much of an issue.
The other concurrent paradigm is STM, software transactional memory. Clojure's refs are getting a lot of attention. Tim Bray has a good series exploring Erlang's and Clojure's concurrency mechanisms:
http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot-next
http://www.tbray.org/ongoing/When/200x/2009/12/01/Clojure-Theses

How many 'screens' of data could a game store before having to delete some?

Assuming I was making a Temporal-esque time travel game, and wanted to save the current state of the screen (location of player and enemies, whether or not destructible objects are destroyed, et cetera) every second to an array, how much data would I be able to store in this array before the game would start to lag considerably and I would have to either delete the array or save it to a file outside the game (i.e. a .bin)?
On a similar note, is it faster to constantly save every screen to a .bin, or to only do this when it is necessary (start saving when the array is halfway 'full', for example)?
And I know that the computer it is being run on matters, but assume it is being run on a reasonably modern computer (not old, but not a NASA supercomputer either), particularly because I have no way of telling exactly what the people who play the game will be using.
Depending on how you use the data afterwards, you could consider storing the changes between states instead of the actual states.
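Storing changes instead of full states might look like this in a minimal form. The state is assumed to be a flat dictionary of object properties, and the `diff`/`apply` helpers and sample frames are invented for the example. A real game would also keep a full snapshot every so often, so that rewinding does not require replaying every delta from the start.

```python
_MISSING = object()  # sentinel for keys that did not exist before


def diff(previous, current):
    """Return only the keys whose values changed, plus the keys that vanished."""
    changed = {k: v for k, v in current.items() if previous.get(k, _MISSING) != v}
    removed = [k for k in previous if k not in current]
    return {"changed": changed, "removed": removed}


def apply(state, delta):
    """Rebuild the next state from a previous state and a delta."""
    new_state = {k: v for k, v in state.items() if k not in delta["removed"]}
    new_state.update(delta["changed"])
    return new_state


frame0 = {"player": (10, 4), "enemy1": (3, 7), "crate": "intact"}
frame1 = {"player": (11, 4), "enemy1": (3, 7), "crate": "destroyed"}

delta = diff(frame0, frame1)
rebuilt = apply(frame0, delta)
```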
You should use a buffer to reduce the number of I/O-operations. Put data in main memory and write a larger amount of data to disk when needed.
It would depend on the number of objects you need to save and how much memory is taken up by each object.
Hypothetically, let's take a vastly oversimplified and naive example, and say that your game contains an average of 40 objects, each of which has 20 properties that take up two bytes of storage. That's 1600 bytes per second, if you're saving each second.
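Spelling that estimate out, and extending it to a memory budget, is plain arithmetic. The 100 MiB budget below is an arbitrary assumption chosen only to show the division; the other numbers are the ones from the example above.

```python
objects = 40
properties_per_object = 20
bytes_per_property = 2

bytes_per_snapshot = objects * properties_per_object * bytes_per_property
bytes_per_hour = bytes_per_snapshot * 3600          # one snapshot per second
budget_bytes = 100 * 1024 * 1024                    # assumed 100 MiB budget
seconds_in_budget = budget_bytes // bytes_per_snapshot
hours_in_budget = seconds_in_budget / 3600
```

With these numbers an hour of play costs under 6 MB, and the assumed budget covers roughly 18 hours, so at this scale memory is unlikely to be the first thing to cause lag.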
Well it is impossible to give you an answer that will definitely work for your scenario. You'll need to try a few things.
Assuming you are loading large images, sounds, or graphics from disk it may not be good to write to disk with high frequency due to contention. I say may because it really depends on the computer and everything that is going on. So how do you deal with this issue? One way is to run a background thread that watches a queue for items that need to be written to disk. The thread can monitor the queue for a certain number of items before writing to disk. The alternative is to wait for certain other events to happen in the game where I/O is happening and save it then. You may need to analyse the size of events that you are saving and try different options.
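The background-thread idea can be sketched like this: the game loop only enqueues snapshots, and a single writer thread flushes them to disk in batches. The batch size, the `STOP` sentinel and the `writer` function are assumptions for the example, and `io.BytesIO` stands in for a real file.

```python
import io
import queue
import threading

BATCH_SIZE = 3
STOP = object()  # sentinel telling the writer to flush and exit


def writer(pending, sink, write_sizes, batch_size=BATCH_SIZE):
    """Drain the queue, writing to the sink once per batch instead of per item."""
    batch = []
    while True:
        item = pending.get()
        if item is not STOP:
            batch.append(item)
        if batch and (item is STOP or len(batch) >= batch_size):
            sink.write(b"".join(batch))      # one I/O call per batch
            write_sizes.append(len(batch))
            batch = []
        if item is STOP:
            return


pending = queue.Queue()
sink = io.BytesIO()          # stand-in for a file opened in "ab" mode
write_sizes = []
thread = threading.Thread(target=writer, args=(pending, sink, write_sizes))
thread.start()

for second in range(7):      # the game loop enqueues one snapshot per second
    pending.put(f"state-{second};".encode())
pending.put(STOP)
thread.join()
```

Because only this one thread touches the file, the game loop never blocks on disk I/O and no two threads compete for the same resource.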
You would want to get an estimate as to how much data is saved per screen, then decide how much of someone's memory you want to use, and then just divide, as you will have huge variances. I am using a 64 bit OS so how much you can store on my machine is different than on a 32-bit machine.
You may want to decide what is actually needed. For example, if you just save the properties of each object into a JSON array you may save yourself some space. You also want to limit how much you write to disk: do the writing on a separate thread that only writes to this file and queue up the writes, so that you don't have two threads trying to access the same resource.
Saving the music, for example, may not be useful, as that will change anyway, I expect.
Be very judicious about what you will save, try it and see if you are saving enough.
