Reed-Solomon in file recovery - data-integrity

A piece of software I'm working on outputs quite a lot of files, which are then stored on a server. During its runtime I've had one file go corrupt on me. These files are critical to the operation, so this cannot happen, and I'm therefore trying to come up with a way of adding error correction to the files to prevent it from ever happening again.
I've read up on Reed-Solomon, which encodes k blocks of data plus m blocks of parity and can then reconstruct up to m missing blocks. So what I'm thinking is to take the data stream, split it into these blocks, and store them in sequence on disk: first the data blocks, then the parity blocks, repeating until the entire file is stored. k, m, and the block size are of course variables I'll have to investigate and play with.
However, it's my understanding that Reed-Solomon requires you to know which blocks are corrupt. How could I possibly know that? My thinking is that I'd have to add some extra, simpler error-detection code (CRC32 or something like it) to each block as I write it; otherwise I can't know whether it's corrupted.
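For concreteness, here is a rough sketch of the write path I have in mind (Python, standard library only; BLOCK_SIZE, K and M are placeholder values, and rs_parity is a hypothetical stand-in for whatever Reed-Solomon library I end up using):

import zlib

BLOCK_SIZE = 4096          # data bytes per block; a tunable, like k and m
K, M = 8, 2                # data blocks and parity blocks per group

def frame(block: bytes) -> bytes:
    """Prefix a fixed-size block with its CRC32 so corruption is detectable later."""
    return zlib.crc32(block).to_bytes(4, "little") + block

def write_group(out, data_blocks, rs_parity):
    """Write one group to disk: K data blocks, then M parity blocks, each CRC-framed.

    rs_parity(blocks, M) -> list of M parity blocks is a hypothetical stand-in for
    whatever Reed-Solomon library ends up doing the actual encoding. The real file
    length would live in a small header so the zero padding can be stripped on read.
    """
    padded = [b.ljust(BLOCK_SIZE, b"\x00") for b in data_blocks]
    for block in padded:
        out.write(frame(block))
    for parity in rs_parity(padded, M):
        out.write(frame(parity))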
Have I understood this correctly, or is there a better way to accomplish this?

This is a bit of an older question, but (in my mind) it is always something useful and in some cases necessary. Bit rot will never be completely cured (hush, ZFS community; ZFS only has control of what's on its filesystem while it's there), so we always have to come up with proactive prevention and recovery plans.
While it was designed to facilitate piracy (specifically, storing and extracting multi-GB files in chunks on newsgroups, where any chunk could go missing or be corrupted), "Parchives" are actually exactly what you're looking for (see the white paper, though don't implement that scheme directly, as it has a bug and newer schemes are available). They work in practice as follows:
The complete file is input into the encoder
Blocks are processed and Reed-Solomon blocks are generated
.par files containing those blocks are output alongside the original file
When integrity is checked (typically on the other end of a file transfer), the blocks are rechecked and any blocks that need to be used to reconstruct missing data are pulled from the .par files.
Things eventually settled into "PAR2" (essentially a rewrite with additional features), with the following scheme:
Large file compressed with RAR and split into chunks (typically around 100 MB each, as that was a "usually safe" maximum on Usenet)
An "index" file is placed alongside the file (for example bigfile.PAR2). This has no recovery chunks.
A series of .par2 files totaling 10% of the original data size sits alongside, in increasingly larger file sizes (bigfile.vol029+25.PAR2, bigfile.vol104+88.PAR2, etc.)
The person on the other end then gets all the .rar files
An integrity check is run and returns a count, in MB, of how much data needs recovery
.PAR2 files are downloaded in an amount equal to or greater than the need
Recovery is done and integrity verified
RAR is extracted, and the original file is successfully transferred
Now, without a filesystem layer, this system is still fairly trivial to implement using the Parchive tools, but it has two requirements:
That the files do not change (any change to a file on disk invalidates its parity data; you could allow changes, at the cost of extra complexity, with a copy-on-change writing scheme)
That you run both the file generation and integrity check/recovery when appropriate.
Since all the math and methods are both known and battle-tested, you can also roll your own to meet whatever needs you have (a hook into file read/write, spanning arbitrary path depths, storing recovery data on a separate drive, etc.). For initial tips, refer to the pros: https://www.backblaze.com/blog/reed-solomon/
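To make the detection half of the question concrete: if each block is written with a small checksum, the verify pass just recomputes it and treats any mismatching block index as an erasure for the decoder. A minimal sketch of that verify step (Python, standard library only, assuming the CRC32-framed fixed-size-block layout sketched in the question; the actual reconstruction call belongs to whichever Reed-Solomon library you pick):

import zlib

BLOCK_SIZE = 4096
K, M = 8, 2

def read_group(f):
    """Read one group of K data + M parity blocks; return (blocks, erasure_indices)."""
    blocks, erasures = [], []
    for i in range(K + M):
        crc = int.from_bytes(f.read(4), "little")
        block = f.read(BLOCK_SIZE)
        if len(block) != BLOCK_SIZE or zlib.crc32(block) != crc:
            erasures.append(i)     # the position is known, so RS can treat it as an erasure
        blocks.append(block)
    return blocks, erasures

# If len(erasures) <= M, hand `blocks` plus the erasure positions to the
# Reed-Solomon decoder to rebuild the damaged blocks; if more than M blocks
# in a group failed their CRC, that group is beyond repair.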
Edit: The same research that led me to this question led me to a whole subset of already-done work that I was previously unaware of:
https://crates.io/crates/solana-reed-solomon-erasure (as well as a bunch of other implementations in the Rust crate registry)
https://github.com/klauspost/reedsolomon (based on the BackBlaze code, and processes 1Gbps per core)
Etc. Search for "Reed-Solomon file recovery".

Related

What are the scenarios that make us compress data before we transfer it?

I am wondering why we need to compress files before we upload them to a server in some scenarios. As I understand it, as soon as the server receives the compressed files, it has to extract them before it can read their contents. That certainly consumes the server's computation power if multiple HTTP POSTs are sent from many client-side platforms.
Therefore, the only scenarios I can think of for sending compressed files are uploading backup files, settings files, and files that only serve as backups for the client-side platforms. Please give me more scenarios for uploading compressed data.
I think the following article gives a good explanation of the question: http://www.dataexpedition.com/support/notes/tn0014.html
Here's the content:
Compression Pros & Cons
Simply put, compression is a process which trades CPU cycles for bytes. But the trade isn't always a good one. Sometimes you can spend a lot of valuable CPU cycles for little or no gain.
In the context of network data transport, "Should I compress?" is a common question. But the answer can get complicated, depending on several factors. The most important thing to remember is that compression can actually make your data move much slower, so it should not be used without some consideration.
When Compression Is Good
Compression algorithms try to identify large repeating patterns in a data set and replace them with smaller patterns. Ideally, this shrinks the size of the data set. For the purposes of network transport, having less data to move means it should take less time to move it.
Documents and files which consist mostly of plain text or machine executable code tend to compress well. Examples include word processing documents, HTML files, some .exe files, and some database files.
Combining many small files into a single archive prior to network transfer can often result in faster speeds than transferring each file individually. This may be true even if the individual files themselves are not compressible. Many archiving utilities have options to pack files into an archive without compression, such as the "-0" option for "zip". ExpeDat will combine the contents of a folder into a single data stream when you enable Streaming Folders.
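(As an aside that is not part of the quoted article: in Python, the same "store, don't compress" packing can be done with the standard zipfile module. A small illustrative sketch; the folder and archive names are made up.)

import zipfile
from pathlib import Path

def pack_without_compression(src_dir, archive_path):
    # Equivalent of `zip -0`: one stream to transfer, no CPU spent on deflate.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in Path(src_dir).rglob("*"):
            if path.is_file():
                zf.write(path, arcname=str(path.relative_to(src_dir)))

# pack_without_compression("many_small_files", "bundle.zip")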
When Compression Is Bad
Many data types are not compressible, because the repeating patterns have already been removed. This includes most images, videos, songs, any data that is already compressed, or any data that has been encrypted.
Trying to compress data that is not compressible wastes CPU time. When you are trying to move data at high speeds, that CPU time may be critical to feeding the network. So by taking away processing time with worthless compression, you can actually end up moving your data much more slowly than if you had compression turned off.
If you are using a compression utility only for the purposes of combining many small files, check for options that disable compression. For example, the "zip" command has a "-0" option which packages files into an archive without spending time trying to compress them.
Inline versus Offline
Many transport mechanisms allow you to apply compression algorithms to data as it's being transferred. This is convenient because the compression and decompression occur seamlessly without the user having to perform extra steps. But it is also risky because any CPU time spent on compression is time NOT being spent on feeding data through the network. If the network is very fast, the CPU is very slow, or the compression algorithm is unable to scale, having inline compression turned on may cause your data to move more slowly than if you turn compression off. Inline compression can be slower than no compression even when the data is compressible!
If you are going to be transferring the same data set multiple times, it pays to compress it first using Zip or Tar-Gzip. Then you can transfer the compressed archive without taking CPU cycles away from the network processing. If you are planning to encrypt your data, make sure you compress it first, then encrypt second.
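(Another aside, not from the article: pre-compressing once, ahead of repeated transfers, is a couple of lines with Python's standard tarfile module; the paths below are made up.)

import tarfile

def compress_once(src_dir, archive_path):
    # Compress up front so repeated transfers (and any later encryption step)
    # work on an already-small archive instead of compressing inline each time.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir)

# compress_once("dataset_dir", "dataset.tar.gz")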
Hidden Compression
Devices in your network may be applying compression without you realizing it. This becomes evident if the "speed" of the network seems to change for different data types. If the network seems slow when you are transferring data that is already compressed, but fast when you are transferring uncompressed text files, then you can be pretty sure that something out there is making compression decisions for you.
Network compression devices can be helpful in that they take the compression burden away from the end-point CPUs. But they can also create very inconsistent results since they will not work for all destinations and data types. Network level compression can also run into the same CPU trade-offs discussed above, resulting in some files moving more slowly than they would if there was no compression.
If you are testing the speed of your network, try using data that is already compressed or encrypted to ensure consistent results.
Should I Turn On Inline Compression?
For compressed data, images, audio, video, or encrypted files: No.
For other types of data, test it both ways to see which is faster.
If the network is very fast (hundreds of megabits per second or faster), consider turning off inline compression and instead compress the data before you move it.
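A cheap way to make that "test it both ways" call is to compress a small sample of the data and look at the ratio. A rough sketch, not from the article, using Python's standard zlib; the sample size and the 0.9 threshold are arbitrary:

import zlib

def probably_compressible(path, sample_size=1_000_000, threshold=0.9):
    # Compress only a leading sample and check whether it shrinks meaningfully.
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if not sample:
        return False
    ratio = len(zlib.compress(sample, 6)) / len(sample)
    return ratio < threshold   # already-compressed media tends to land near 1.0

# probably_compressible("server.log") -> likely True; probably_compressible("movie.mp4") -> likely False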

iOS SQLite or 1000 loose files?

Suppose I have 1000 records of variable size, ranging from around 256 bytes to a few K. I wonder whether there is any advantage to putting them into an SQLite database versus just reading/writing 1000 loose files on iOS. I don't need to do any operations other than access by a single key, which I can use as the filename. It seems like the file system would be the winner unless the number of records grows very large.
If your system were read-only, I would say that the file system is the clear winner: a simple binary file and perhaps a small index to know where each record starts would be all that you need. You could read the entire index into memory, and then grab your records from the file system as needed, for a performance that would be extremely tough to match for any RDBMS.
However, since you are planning on writing data back, I would suggest going with SQLite because of potential data integrity issues.
Performance concerns should not be underestimated either: since your records are of variable size, writing the data back may prove difficult when records need to expand. Moreover, since you are on a mobile platform, you would need to build something in to avoid data corruption when the program is killed unexpectedly in the middle of a write. SQLite takes care of this; your code would have to build something comparable, or risk data corruption problems.
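To make the SQLite side concrete, the whole store can be a single key-to-blob table written inside transactions. The sketch below uses Python's sqlite3 module purely for illustration (on iOS you would go through the C API or a wrapper such as FMDB), and the file, table and column names are made up:

import sqlite3

db = sqlite3.connect("records.db")
db.execute("CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, value BLOB)")

def put(key, value):
    # `with db:` wraps the statement in a transaction, so a half-finished write
    # is rolled back if the app is killed -- the integrity point made above.
    with db:
        db.execute("INSERT OR REPLACE INTO records(key, value) VALUES (?, ?)", (key, value))

def get(key):
    row = db.execute("SELECT value FROM records WHERE key = ?", (key,)).fetchone()
    return row[0] if row is not None else None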

Can ImageMagick's commands like convert be trusted with untrusted input files?

I wonder if it is OK (apart from the fact that it is done a million times daily...) to feed ImageMagick's convert command with user-uploaded files?
Obviously, ImageMagick has a large attack surface, as it is capable of loading tons of formats and thus uses vast amounts of code to process input data.
Excessive CPU or memory consumption does no harm, as this can be kept under control by simple means (for example by ulimit). Arbitrary file access or running injected code with network access, however, must be prevented by convert. Can we expect this?
Are there procedures for the ImageMagick authors to follow that would provide a feasible amount of security against such exploits?

What is the fastest way for reading huge files in Delphi?

My program needs to read chunks from a huge binary file with random access. I have got a list of offsets and lengths which may have several thousand entries. The user selects an entry and the program seeks to the offset and reads length bytes.
The program internally uses a TMemoryStream to store and process the chunks read from the file. Reading the data is done via a TFileStream like this:
FileStream.Position := Offset;
MemoryStream.CopyFrom(FileStream, Size);
This works fine but unfortunately it becomes increasingly slower as the files get larger. The file size starts at a few megabytes but frequently reaches several tens of gigabytes. The chunks read are around 100 kbytes in size.
The file's content is only read by my program. It is the only program accessing the file at the time. Also the files are stored locally so this is not a network issue.
I am using Delphi 2007 on a Windows XP box.
What can I do to speed up this file access?
edit:
The file access is slow for large files, regardless of which part of the file is being read.
The program usually does not read the file sequentially. The order of the chunks is user driven and cannot be predicted.
It is always slower to read a chunk from a large file than to read an equally large chunk from a small file.
I am talking about the performance for reading a chunk from the file, not about the overall time it takes to process a whole file. The latter would obviously take longer for larger files, but that's not the issue here.
I need to apologize to everybody: after I implemented file access using a memory-mapped file as suggested, it turned out that it did not make much of a difference. But after I added some more timing code, it also turned out that it is not the file access that slows down the program. The file access takes nearly constant time regardless of the file size. Some part of the user interface (which I have yet to identify) seems to have a performance problem with large amounts of data, and somehow I failed to see the difference when I first timed the processes.
I am sorry for being sloppy in identifying the bottleneck.
If you open the help topic for the CreateFile() WinAPI function, you will find interesting flags there, such as FILE_FLAG_NO_BUFFERING and FILE_FLAG_RANDOM_ACCESS. You can play with them to gain some performance.
Next, copying the file data, even 100 KB in size, is an extra step which slows down operations. It is a good idea to use the CreateFileMapping and MapViewOfFile functions to get a ready-to-use pointer to the data. This way you avoid copying and also possibly get certain performance benefits (but you need to measure speed carefully).
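The calls above are Win32/Delphi, but the access pattern is easy to see in any language. Purely as an illustration, here is the same idea with Python's standard mmap module (the file name and offsets are made up):

import mmap

def read_chunks(path, entries):
    # entries: list of (offset, length). Map the file once; each "read" is then
    # just a slice of the mapping -- no explicit seek, no copy into an intermediate stream.
    results = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for offset, length in entries:
            results.append(mm[offset:offset + length])   # the OS pages in only what is touched
    return results

# read_chunks("huge.bin", [(1_000_000, 100_000), (250_000_000, 100_000)])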
Maybe you can take this approach:
Sort the entries on file position and then do the following:
Take the entries that only need the first X MB of the file (up to a certain file position)
Read X MB from the file into a buffer (TMemoryStream)
Now read the entries from the buffer (maybe multithreaded)
Repeat this for all the entries.
In short: cache a part of the file and read all entries that fit into it (multithreaded), then cache the next part, etc.
Maybe you can gain speed if you just take your original approach, but sort the entries on position.
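A minimal sketch of that simpler "just sort the entries" variant (Python only to show the shape of it; the windowed-buffer scheme from the list above layers on top of the same sort):

def read_chunks_sorted(path, entries):
    # entries: list of (offset, length). Sorting by offset turns thousands of
    # random seeks into one forward sweep, which the disk and OS read-ahead handle well.
    results = {}
    with open(path, "rb") as f:
        for offset, length in sorted(entries):
            f.seek(offset)
            results[(offset, length)] = f.read(length)
    return results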
The stock TMemoryStream in Delphi is slow due to the way it allocates memory. The NexusDB company has TnxMemoryStream which is much more efficient. There might be some free ones out there that work better.
The stock Delphi TFileStream is also not the most efficient component. Way back in history, Julian Bucknall published a component named BufferedFileStream in a magazine or somewhere that worked with file streams very efficiently.
Good luck.

OutOfMemoryException Processing Large File

We are loading a large flat file into BizTalk Server 2006 (Original release, not R2) - about 125 MB. We run a map against it and then take each row and make a call out to a stored procedure.
We receive the OutOfMemoryException during orchestration processing, the Windows Service restarts, uses full 2 GB memory, and crashes again.
The server is 32-bit and set to use the /3GB switch.
Also I've separated the flow into 3 hosts - one for receive, the other for orchestration, and the third for sends.
Anyone have any suggestions for getting this file to process without error?
Thanks,
Krip
If this is a flat file being sent through a map, you are converting it to XML, right? The increase in size could be huge. XML can easily add a factor of 5-10 times over a flat file, especially if you use descriptive or long XML tag names (which normally you would).
Something simple you could try is to rename the XML nodes to shorter names; depending on the number of records (it sounds like a lot), this might actually have a pretty significant impact on your memory footprint.
Perhaps a more enterprise approach would be to subdivide this, in a custom pipeline, into separate message packets that can be fed through the system in more manageable chunks (similar to what Chris suggests). Then the system's throttling and memory metrics could take over. Without knowing more about your data it is hard to say how best to do this, but with a 125 MB file I am guessing that you probably have a ton of repeating rows that do not need to be processed sequentially.
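BizTalk pipeline components are .NET, but the debatching idea itself is language-neutral. Purely as an illustration of the splitting step (the batch size and file name are arbitrary):

def debatch(path, rows_per_batch=5000):
    # Yield the big flat file as small batches of rows, so each batch can be
    # mapped and submitted on its own instead of building one huge in-memory message.
    batch = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            batch.append(line)
            if len(batch) == rows_per_batch:
                yield batch
                batch = []
    if batch:
        yield batch

# for i, rows in enumerate(debatch("bigflatfile.txt")): submit_batch(i, rows)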
Where does it crash? Does it make it past the Transform shape? Another suggestion to try is to run the transform in the Receive Port. For more efficient processing, you could even debatch the message and have multiple simultaneous orchestration instances calling the stored procs. This would definitely reduce the memory profile and increase performance.
