Efficient whole file CRC computation in the presence of small overwrites

I have a large file and maintain a crc32 checksum over its contents. If a fixed portion at the start or the end of the file changes, I can keep separate crc32 checksums for the static portion and the dynamic portion and use crc32_combine to efficiently calculate the new whole-file checksum. Mark Adler answered it beautifully here: CRC Calculation Of A Mostly Static Data Stream.
But if content in the middle of the file changes, and not always at a predefined offset (and length), is there a way to efficiently compute the whole-file checksum without reading the whole file?

Yes, so long as you know the before and after values of the bytes changed. And their location, of course.
Compute the exclusive-or of the before and after bytes. The result is zero where nothing changed and non-zero where something did. Then compute the raw CRC (zero initial value, no final exclusive-or) of that exclusive-or over the entire file, and exclusive-or the result with the original CRC to get the new CRC.
Presumably you will have a long sequence of zeros, and some non-zero values, and then another long sequence of zeros. You can ignore the initial long sequence and just start computing the CRC of the non-zero values. Then use the same trick in the link to apply the long sequence of zeros after that to the raw CRC.
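The steps above can be sketched in Python as follows. This is a minimal illustration, assuming the standard zlib CRC-32 and an overwrite that does not change the file length. The helper names (multmodp, x_pow_8n, patched_crc) are made up for this sketch; the multiplication routine follows the idea behind zlib's crc32_combine but is not zlib's actual code.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial used by zlib


def multmodp(a, b):
    """Multiply a by b modulo the CRC polynomial (reflected bit order).

    a must be non-zero, otherwise the loop never terminates.
    """
    m = 1 << 31
    p = 0
    while True:
        if a & m:
            p ^= b
            if (a & (m - 1)) == 0:
                break
        m >>= 1
        b = (b >> 1) ^ POLY if b & 1 else b >> 1
    return p


def x_pow_8n(n):
    """Return x^(8n) mod p, the operator that appends n zero bytes to a raw CRC."""
    result = 0x80000000  # x^0 in reflected representation
    base = 0x00800000    # x^8, i.e. one zero byte
    while n:
        if n & 1:
            result = multmodp(base, result)
        base = multmodp(base, base)
        n >>= 1
    return result


def patched_crc(old_crc, file_len, offset, before, after):
    """CRC-32 of the file after bytes `before` at `offset` were overwritten by `after`."""
    # Exclusive-or of old and new bytes; leading zeros of the delta are skipped entirely.
    delta = bytes(x ^ y for x, y in zip(before, after))
    # Raw CRC: zero initial register and no final exclusive-or.
    raw = zlib.crc32(delta, 0xFFFFFFFF) ^ 0xFFFFFFFF
    # Apply the long run of trailing zeros without touching the file.
    trailing = file_len - (offset + len(delta))
    return old_crc ^ multmodp(x_pow_8n(trailing), raw)
```

The cost depends on the size of the change and on the logarithm of the number of trailing bytes, not on the file size.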

Related

Are there any good DEFLATE-like compression algorithms that are additive and immutable?

I need to add more things onto a stacklike structure, but compress them additively such that addition of new data results in the expected compression gain, but new chunks are still stored as compressed data without altering any of the past data chunks.
So in other words, it must preserve the additive property over successive compressed chunks. If f() is the 'adding another chunk' function, I need the following to hold for all chunks 'x':
Sure. Deflate does this, if you just keep it running and terminate each chunk at a byte boundary, e.g. with Z_SYNC_FLUSH, so that it can be written to file up to and including that chunk.
So long as your chunks are large enough, you will get the same compression gain.
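A minimal sketch of that approach using Python's zlib module, which exposes Z_SYNC_FLUSH. The function names are hypothetical; the point is that each returned piece can be appended to storage and never rewritten, while the compressor keeps its history across chunks.

```python
import zlib


def compress_appended(chunks):
    """Compress chunks with one running deflate stream, ending each piece on a byte boundary."""
    comp = zlib.compressobj()
    return [comp.compress(c) + comp.flush(zlib.Z_SYNC_FLUSH) for c in chunks]


def decompress_prefix(pieces):
    """Decompress the first k stored pieces; earlier pieces are read exactly as written."""
    decomp = zlib.decompressobj()
    return b"".join(decomp.decompress(p) for p in pieces)
```

Any prefix of the stored pieces decompresses to the corresponding prefix of the original chunks, so new data can be pushed onto the stack without altering what is already there.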

Reading file as stream of strings in Dart: how many events will be emitted?

The standard way to open a file in Dart as a stream is to use file.openRead(), which returns a Stream<List<int>>.
The next standard step is to transform this stream with the utf8.decoder StreamTransformer, which returns a Stream<String>.
I noticed that with the files I've tried, the resulting stream emits only a single event, with the whole file content represented as one string. But I feel this should not be the general case, since otherwise the API wouldn't need to return a stream of strings; a Future<String> would suffice.
Could you explain how I can observe this stream emitting more than one event? Does this depend on the file size, the disk IO rate, or some buffer size?
It depends on file size and buffer size, and however the file operations are implemented.
If you read a large file, you will very likely get multiple events of a limited size. The UTF-8 decoder decodes chunks eagerly, so you should get roughly the same number of chunks after decoding. It might carry a few bytes across chunk boundaries, but the rest of the bytes are decoded as soon as possible.
Checking on my local machine, the buffer size seems to be 65536 bytes. Reading a file larger than that gives me multiple chunks.
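The chunk-boundary behavior is not specific to Dart. As an analogy only (this is Python's incremental decoder, not Dart's utf8.decoder), the sketch below decodes byte chunks eagerly and carries an incomplete multi-byte sequence over to the next chunk. The function name is made up for the illustration.

```python
import codecs


def decode_chunks(chunks):
    """Decode UTF-8 byte chunks one at a time, yielding one string per input chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    out = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises if a multi-byte sequence is still incomplete at the end.
    out.append(decoder.decode(b"", final=True))
    return out


data = "héllo wörld".encode("utf-8")
# Split inside the two-byte encoding of "é": the first chunk ends with a dangling lead byte.
pieces = decode_chunks([data[:2], data[2:]])
```

Here the first event contains only "h", and the held-back byte is emitted together with the second chunk.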

Reverse engineering checksum from ascii string?

I'm currently reverse engineering the serial protocol of a device I have.
I'm mostly there, but I can't figure out one part of the string.
Each string the machine returns always ends with !XXXX, where XXXX is a hex value that varies. From what I can find, this may be a CRC16?
However I can't figure out how to calculate the CRC myself to confirm it is correct.
Here's an example of 3 Responses.
U;0;!1F1B
U;1;!0E92
U;2;!3C09
The number can be replaced with a range of ascii characters. For example here's what I'll be using most often.
U;RYAN W;!FF0A
How do I calculate how the checksum is generated?
You need more examples with different lengths.
With reveng, you will want to reverse the CRC bytes, e.g. 1b1f, not 1f1b. It appears that the CRC is calculated over what is between the semicolons. With reveng I get that the polynomial is 0x1021, which is a very common 16-bit polynomial, and that the CRC is reflected.
% reveng -w 16 -s 301b1f 31920e 32093c 5259414e20570aff
width=16 poly=0x1021 init=0x1554 refin=true refout=true xorout=0x07f0 check=0xfa7e name=(none)
width=16 poly=0x1021 init=0xe54b refin=true refout=true xorout=0xffff check=0xfa7e name=(none)
With more examples, you will be able to determine the initial value of the CRC register and what the result is exclusive-or'ed with.
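To collect more examples quickly, the hex arguments used in the reveng command above can be generated from captured responses. The sketch below assumes the response layout shown in the question (payload between the semicolons, then ! and four hex digits); reveng_arg is a hypothetical helper name.

```python
def reveng_arg(response):
    """Turn a response like 'U;0;!1F1B' into a reveng argument like '301b1f'."""
    body, crc_hex = response.split("!")
    payload = body.split(";")[1]          # text between the semicolons
    crc = bytes.fromhex(crc_hex)
    # Payload bytes in hex, followed by the CRC with its two bytes swapped.
    return payload.encode("ascii").hex() + crc[::-1].hex()


responses = ["U;0;!1F1B", "U;1;!0E92", "U;2;!3C09", "U;RYAN W;!FF0A"]
args = [reveng_arg(r) for r in responses]
```

Joining args with spaces reproduces the argument list passed to reveng -w 16 -s above.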
There is a tool available to reverse-engineer CRC calculations: CRC RevEng http://reveng.sourceforge.net/
You can give it hex strings of the input and checksum and ask it what CRC algorithm matches the input. Here is the input for the first three strings (assuming the messages are U;0;, U;1; and U;2;):
$ reveng -w 16 -s 553b303b1f1b 553b313b0e92 553b323b3c09
width=16 poly=0xa097 init=0x63bc refin=false refout=false xorout=0x0000 check=0x6327 residue=0x0000 name=(none)
The checksum follows the input messages. Unfortunately this doesn't work if I try the RYAN W message. You'll probably want to try editing the input messages to see which part of the string is being input into the CRC.

Is there some byte combination that can be used as a separator of streams of Int16

I was given the task to specify a file format for internal use inside an application.
One of the intended requirements says:
The data section of the file should be made up of a series of streams of type Int16 values (short integers), delimited by a suitable combination of one or more bytes.
As I understand, Int16 can contain any single byte value, so I don't know how I could choose some sequence of bytes that is guaranteed not to appear incidentally inside a stream. Is there such a sequence?
(And also, if the answer is "no", what would be a good way to determine the position and size of each stream in the file?)
By "streams," I assume the request indicates that the length is unknown when the writing of the data begins.
Therefore, I'd suggest a "chunked" encoding, where each substream is parcelled out into variable-size pieces, with the length of each piece written at the beginning as a fixed size integer. An empty chunk signals the end of the substream. Normally, there would be a maximum length of a chunk to facilitate allocation of buffers for efficient reading and writing.
This is patterned after HTTP's "chunked" transfer encoding and a similar approach is used in many other formats, such as the indefinite length encoding supported by the basic encoding rules for ASN.1.
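A minimal sketch of such a chunked layout, assuming little-endian values, a 2-byte chunk length counted in Int16 elements, and a deliberately tiny maximum chunk size. None of these choices come from the requirement; they are picked only for illustration.

```python
import io
import struct

MAX_CHUNK = 4  # maximum Int16 values per chunk; tiny here so the example splits


def write_stream(out, values):
    """Write one stream as length-prefixed chunks, ending with an empty chunk."""
    for i in range(0, len(values), MAX_CHUNK):
        piece = values[i:i + MAX_CHUNK]
        out.write(struct.pack("<H", len(piece)))
        out.write(struct.pack(f"<{len(piece)}h", *piece))
    out.write(struct.pack("<H", 0))  # zero-length chunk marks end of stream


def read_stream(inp):
    """Read chunks until the empty chunk and return the stream's values."""
    values = []
    while True:
        (count,) = struct.unpack("<H", inp.read(2))
        if count == 0:
            return values
        values.extend(struct.unpack(f"<{count}h", inp.read(2 * count)))
```

Because the terminator is a length of zero rather than a data pattern, any Int16 value, including 0, can appear inside a stream.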
I would suggest prefixing each stream with a length field, rather than trying to use delimiters, for the reason you've already given (no suitable unique delimiter). E.g.:
<length>
<stream>
<length>
<stream>
<length>
<stream>
...
where <length> is, say, a 4 byte integer which defines the number of 16 bit elements in the following stream.
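As a sketch of this layout, assuming a little-endian unsigned 4-byte length and little-endian Int16 elements (the byte order is an assumption, not part of the suggestion above), with hypothetical function names:

```python
import io
import struct


def write_streams(out, streams):
    """Write each stream as a 4-byte element count followed by its Int16 values."""
    for values in streams:
        out.write(struct.pack("<I", len(values)))
        out.write(struct.pack(f"<{len(values)}h", *values))


def read_streams(inp):
    """Read length-prefixed streams until the input is exhausted."""
    streams = []
    while True:
        header = inp.read(4)
        if not header:
            return streams
        (count,) = struct.unpack("<I", header)
        streams.append(list(struct.unpack(f"<{count}h", inp.read(2 * count))))
```

A reader can also skip a stream without decoding it by seeking forward 2 * count bytes, which answers the question of how to find the position and size of each stream.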

Calculate checksum for a split file with boost crc

I wonder whether I can calculate two checksums, reading the first half of the file to get checksum A and then the rest of the file to get checksum B, and then combine A and B into a unique checksum (of greater length).
I'm trying to implement this with the boost::CRC library, but I don't know whether I'm using it correctly.
(1) Does the second parameter of process_bytes mean the total buffer length? (2) Is the result accumulated by the function across calls, so that I don't have to worry about the array? Or, when I call process_byte, does it only calculate the new checksum of the new single byte of the buffer array?
Frankly
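boost::crc_32_type result;  // from <boost/crc.hpp>
std::ifstream ifs( argv[i], std::ios_base::binary );
if ( ifs )
{
    char buffer[BUFFER_SIZE];
    do
    {
        ifs.read( buffer, BUFFER_SIZE );
        // Pass the number of bytes actually read, not a fixed size.
        result.process_bytes( buffer, ifs.gcount() );
    } while ( ifs );  // runs to end of file; stopping at the half-way point would need a byte counter
}
std::cout << result.checksum() << std::endl;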
Please refer to this page to see the boost::CRC example code:
http://www.boost.org/doc/libs/1_37_0/libs/crc/crc_example.cpp
I can't figure out what you're asking.
First off, a CRC is not a checksum. The "sum" in checksum means that the data is added. A CRC computes a polynomial remainder over a finite field, which is not a sum. This is important, since you seem to be asking about combining CRCs. The CRCs of two pieces cannot be added to get the CRC of the whole thing.
Second, the way to get the CRC of multiple pieces is to compute a single CRC over those pieces. That is what the example code does. result contains a single CRC that is updated with each piece that is run through it with process_bytes.
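The same running-CRC pattern can be illustrated with Python's zlib module, where the previous CRC is passed back in as the starting value. This is an analogy to Boost's process_bytes, not Boost code, and the function name is made up.

```python
import zlib


def crc32_of_pieces(pieces):
    """Compute one CRC-32 over a sequence of byte pieces, fed one at a time."""
    crc = 0
    for piece in pieces:
        crc = zlib.crc32(piece, crc)  # continue from the running value
    return crc
```

The result is the same as the CRC of the concatenated pieces, regardless of where the data is split.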
Third, it is possible to combine two CRCs, given the two CRCs and the length of the second piece, to get what would have been a single CRC of the two pieces concatenated. The operation is not trivial, but you can find it in zlib's crc32_combine() routine.
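Python's zlib module does not expose crc32_combine, so the sketch below shows a deliberately simple equivalent rather than zlib's algorithm. It takes time and memory proportional to the length of the second piece, which zlib's routine avoids, but it needs neither piece's data. The function name crc32_combine_slow is made up for this example.

```python
import zlib


def crc32_combine_slow(crc1, crc2, len2):
    """Combine CRC-32s of two pieces into the CRC-32 of their concatenation.

    crc1 and crc2 are the CRCs of the first and second piece,
    and len2 is the length in bytes of the second piece.
    """
    zeros = bytes(len2)
    # Extend the first CRC over len2 zero bytes, then fold in the second piece's
    # CRC with the contribution of an all-zero message of that length removed.
    return zlib.crc32(zeros, crc1) ^ crc2 ^ zlib.crc32(zeros)
```

For example, crc32_combine_slow(zlib.crc32(a), zlib.crc32(b), len(b)) equals zlib.crc32(a + b) for byte strings a and b.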
