Validating an HDF5 superblock checksum


I am having a problem writing a program which verifies the checksum in the superblock of an HDF5, Version 2 file. I am not using the HDF5 software, but I have a copy of H5_checksum_fletcher32 (from the HDF5 H5checksum.c) in my code.
I can assume that the file signature block is at position 0.
My logic is:
Let offset = the value of byte 9 of the file.
The superblock spans bytes 0 to (15+4*offset).
The last 4 bytes are the checksum as an unsigned int.
The checksum should equal H5_checksum_fletcher32 applied to bytes 0 to (11+4*offset).
I have applied this logic to several test files from NOAA that I believe to be reliable, but the checksum never matches the result of H5_checksum_fletcher32. The other values in the superblock appear to be correct. Can anyone see the flaw in my logic?

Your byte ranges look right; the flaw is the checksum algorithm. From the HDF5 file format specification:
"All checksums used in the format are computed with the Jenkins' lookup3 algorithm."
This is provided in H5checksum.c as H5_checksum_lookup3(), not H5_checksum_fletcher32().
Actually, it seems the correct routine to call is H5_checksum_metadata(), but this just calls H5_checksum_lookup3() using a macro.
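For anyone who wants to check a superblock without the HDF5 library, here is a minimal Python sketch of the same idea. It is a port of the little-endian lookup3 hash (Bob Jenkins' hashlittle), not code taken from H5checksum.c, and the function names lookup3 and superblock_v2_checksum_ok are made up for this example. It assumes the superblock starts at offset 0, that the stored checksum is little-endian, and that the initial value is 0, which to my understanding is what the library passes for metadata checksums.

```python
import struct

MASK = 0xFFFFFFFF

def _rot(x, k):
    """Rotate a 32-bit value left by k bits."""
    return ((x << k) | (x >> (32 - k))) & MASK

def _mix(a, b, c):
    """The lookup3 mix() step, applied after every full 12-byte block."""
    a = (a - c) & MASK; a ^= _rot(c, 4);  c = (c + b) & MASK
    b = (b - a) & MASK; b ^= _rot(a, 6);  a = (a + c) & MASK
    c = (c - b) & MASK; c ^= _rot(b, 8);  b = (b + a) & MASK
    a = (a - c) & MASK; a ^= _rot(c, 16); c = (c + b) & MASK
    b = (b - a) & MASK; b ^= _rot(a, 19); a = (a + c) & MASK
    c = (c - b) & MASK; c ^= _rot(b, 4);  b = (b + a) & MASK
    return a, b, c

def _final(a, b, c):
    """The lookup3 final() step, applied once to the last block."""
    c ^= b; c = (c - _rot(b, 14)) & MASK
    a ^= c; a = (a - _rot(c, 11)) & MASK
    b ^= a; b = (b - _rot(a, 25)) & MASK
    c ^= b; c = (c - _rot(b, 16)) & MASK
    a ^= c; a = (a - _rot(c, 4)) & MASK
    b ^= a; b = (b - _rot(a, 14)) & MASK
    c ^= b; c = (c - _rot(b, 24)) & MASK
    return c

def lookup3(data: bytes, initval: int = 0) -> int:
    """Jenkins lookup3 (hashlittle) over data, returning a 32-bit value."""
    a = b = c = (0xDEADBEEF + len(data) + initval) & MASK
    pos = 0
    # All blocks except the last one (the last block holds 1..12 bytes).
    while len(data) - pos > 12:
        k0, k1, k2 = struct.unpack_from("<3I", data, pos)
        a, b, c = _mix((a + k0) & MASK, (b + k1) & MASK, (c + k2) & MASK)
        pos += 12
    tail = data[pos:]
    if not tail:  # only happens for empty input
        return c
    # Zero padding is equivalent to the byte-wise switch in the C code.
    k0, k1, k2 = struct.unpack("<3I", tail.ljust(12, b"\0"))
    return _final((a + k0) & MASK, (b + k1) & MASK, (c + k2) & MASK)

def superblock_v2_checksum_ok(header: bytes) -> bool:
    """Check the checksum of a version 2 superblock located at offset 0."""
    if header[:8] != b"\x89HDF\r\n\x1a\n" or header[8] != 2:
        raise ValueError("no version 2 superblock at offset 0")
    size_of_offsets = header[9]
    body_len = 12 + 4 * size_of_offsets  # bytes 0 .. 11 + 4 * offset
    (stored,) = struct.unpack_from("<I", header, body_len)
    return lookup3(header[:body_len]) == stored
```

Reading the first 48 bytes of the file is enough when the size of offsets is 8.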

Related

Efficient whole file CRC computation in the presence of small overwrites

I have a large file and I maintain a CRC-32 checksum over its contents. If a fixed portion of the file changes, either at the start or at the end of the file, I can maintain CRC-32 checksums of the static portion and the dynamic portion and use crc32_combine to efficiently calculate the new whole-file checksum. Mark Adler answered that beautifully here: CRC Calculation Of A Mostly Static Data Stream.
But if content in the middle of the file changes, and not always at a predefined offset (and length), is there a way to efficiently compute the whole-file checksum without reading the whole file?

Yes, so long as you know the before and after values of the bytes changed. And their location, of course.
Compute the exclusive-or of the before and after. That is zeros where there are no changes, and non-zero where there are changes. Then compute the raw CRC (zero initial value, no final exclusive-or) of that exclusive-or for the entire file, and exclusive-or the result with the original CRC.
Presumably you will have a long sequence of zeros, and some non-zero values, and then another long sequence of zeros. You can ignore the initial long sequence and just start computing the CRC of the non-zero values. Then use the same trick in the link to apply the long sequence of zeros after that to the raw CRC.
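A minimal Python sketch of this, assuming the standard CRC-32 that zlib.crc32 computes. Python's zlib does not expose crc32_combine, so the "apply a long run of zeros" step is done here by multiplying by a power of x modulo the CRC polynomial, which is the same kind of arithmetic zlib uses internally for combining. The names multmodp, xpow and patched_crc32 are made up for this example.

```python
import zlib

POLY = 0xEDB88320  # CRC-32 polynomial, reflected representation

def multmodp(a: int, b: int) -> int:
    """Multiply polynomials a and b modulo POLY (bit 31 is the x^0 term)."""
    p = 0
    for i in range(31, -1, -1):
        if a & (1 << i):
            p ^= b
        # multiply b by x
        b = (b >> 1) ^ POLY if b & 1 else b >> 1
    return p

def xpow(n: int) -> int:
    """Return x^n modulo POLY using square-and-multiply."""
    result = 1 << 31  # the polynomial 1
    base = 1 << 30    # the polynomial x
    while n:
        if n & 1:
            result = multmodp(base, result)
        base = multmodp(base, base)
        n >>= 1
    return result

def patched_crc32(old_crc: int, file_len: int, offset: int,
                  old_bytes: bytes, new_bytes: bytes) -> int:
    """CRC-32 of the file after old_bytes at offset are replaced by new_bytes."""
    if len(old_bytes) != len(new_bytes):
        raise ValueError("the replacement must not change the file length")
    delta = bytes(x ^ y for x, y in zip(old_bytes, new_bytes))
    # Raw CRC of the delta: cancel the init/final-xor contribution, which
    # depends only on the length. Leading zeros leave a zero register alone,
    # so the bytes before `offset` can be ignored.
    raw = zlib.crc32(delta) ^ zlib.crc32(bytes(len(delta)))
    # Feed the trailing zeros through the raw CRC in O(log n) steps.
    trailing = file_len - offset - len(delta)
    raw = multmodp(xpow(8 * trailing), raw)
    return old_crc ^ raw
```

Only the changed bytes (before and after), their offset, the file length and the old CRC are needed; the rest of the file is never read.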

Reverse engineering checksum from ascii string?

I'm currently reverse engineering the serial protocol of a device I have.
I'm mostly there, but I can't figure out one part of the string.
Every string the machine returns ends with !XXXX, where XXXX is a hex value that varies. From what I can find this may be a CRC-16?
However, I can't figure out how to calculate the CRC myself to confirm it is correct.
Here's an example of 3 responses.
U;0;!1F1B
U;1;!0E92
U;2;!3C09
The number can be replaced with a range of ascii characters. For example here's what I'll be using most often.
U;RYAN W;!FF0A
How do I calculate how the checksum is generated?

You need more examples with different lengths.
With reveng, you will want to reverse the CRC bytes, e.g. 1b1f, not 1f1b. It appears that the CRC is calculated over what is between the semicolons. With reveng I get that the polynomial is 0x1021, which is a very common 16-bit polynomial, and that the CRC is reflected.
% reveng -w 16 -s 301b1f 31920e 32093c 5259414e20570aff
width=16 poly=0x1021 init=0x1554 refin=true refout=true xorout=0x07f0 check=0xfa7e name=(none)
width=16 poly=0x1021 init=0xe54b refin=true refout=true xorout=0xffff check=0xfa7e name=(none)
With more examples, you will be able to determine the initial value of the CRC register and what the result is exclusive-or'ed with.
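To check a candidate model against captured responses yourself, a small bit-by-bit CRC-16 is enough. The sketch below assumes the reveng parameter convention for reflected models (init is listed unreflected and gets bit-reversed before use, xorout is applied to the final register as is) and assumes, as above, that only the text between the semicolons is fed to the CRC. The function and variable names are made up for this example.

```python
def reflect16(value: int) -> int:
    """Reverse the bit order of a 16-bit value."""
    return int(f"{value:016b}"[::-1], 2)

def crc16_reflected(data: bytes, poly: int, init: int, xorout: int) -> int:
    """CRC-16 with refin=true and refout=true, processed bit by bit."""
    rpoly = reflect16(poly)   # 0x1021 becomes 0x8408
    crc = reflect16(init)
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ rpoly if crc & 1 else crc >> 1
    return crc ^ xorout

# (payload between the semicolons, CRC printed after '!') from the question
SAMPLES = [(b"0", 0x1F1B), (b"1", 0x0E92), (b"2", 0x3C09), (b"RYAN W", 0xFF0A)]

# The two candidate models reported by reveng above
CANDIDATES = [
    {"poly": 0x1021, "init": 0x1554, "xorout": 0x07F0},
    {"poly": 0x1021, "init": 0xE54B, "xorout": 0xFFFF},
]

def matching_samples(params: dict) -> list:
    """Return the payloads whose CRC the given model reproduces."""
    return [payload for payload, crc in SAMPLES
            if crc16_reflected(payload, **params) == crc]

if __name__ == "__main__":
    for params in CANDIDATES:
        print(params, matching_samples(params))
```

Any new response you capture can be appended to SAMPLES to see whether a candidate still holds.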

There is a tool available to reverse-engineer CRC calculations: CRC RevEng http://reveng.sourceforge.net/
You can give it hex strings of the input and checksum and ask it what CRC algorithm matches the input. Here is the input for the first three strings (assuming the messages are U;0;, U;1; and U;2;):
$ reveng -w 16 -s 553b303b1f1b 553b313b0e92 553b323b3c09
width=16 poly=0xa097 init=0x63bc refin=false refout=false xorout=0x0000 check=0x6327 residue=0x0000 name=(none)
The checksum follows the input messages. Unfortunately this doesn't work if I try the RYAN W message. You'll probably want to try editing the input messages to see which part of the string is being input into the CRC.

Jfif/jpeg parsing, bytes between streams

I'm parsing a JPEG/JFIF file. After the SOI (0xFFD8) I parse the different "streams", each starting with 0xFFXX (where XX is a hexadecimal number), until I find the EOI (0xFFD9). The structure of the different chunks is:
APP0 marker 2 Bytes
Length 2 Bytes
When I parse a chunk, I read until I reach the length written in the 2 bytes of the length field. After that I thought I would immediately find another marker, followed by a length for the next chunk. According to my parser that is not always true: there might be data between the chunks. I couldn't find out what that data is, or whether it is relevant to the image. Do you have any hints what this could be and how to interpret those bytes?
I'm lost and would be happy if somebody could point me in the correct direction. Thanks in advance

I've recently noticed this too. In my case it's an APP2 chunk which is the ICC profile which doesn't contain the length of the chunk.
In fact so far as I can see the length of the chunk needn't be the first 2 bytes (though it usually is).
In JFIF all 0xFF bytes are replaced with 0xFF 0x00 in the data section, so it should just be a matter of calculating the length from that. I just read until I hit another header, however I've noticed that sometimes (again in the ICC profile) there are byte sequences which don't make sense such as 0xFF 0x6D, so I may still be missing something.
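One common source of bytes that do not look like a marker segment is the entropy-coded image data that follows an SOS (0xFFDA) segment: it has no length field of its own and runs until the next real marker, with 0xFF 0x00 byte stuffing and optional restart markers (0xFFD0 to 0xFFD7) inside it. Also note that the 2-byte length is big-endian and counts its own two bytes. A minimal Python sketch of a segment walker under those rules; the function name list_jpeg_segments is made up for this example.

```python
import struct

def list_jpeg_segments(data: bytes) -> list:
    """Return (marker, offset, length) tuples for the segments of a JPEG."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    segments = []
    pos = 2
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:                  # fill byte before a marker
            pos += 1
            continue
        if marker == 0xD9:                  # EOI: end of image
            segments.append((marker, pos, 0))
            break
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:
            segments.append((marker, pos, 0))   # markers without a length
            pos += 2
            continue
        # Big-endian length that includes the two length bytes themselves.
        (length,) = struct.unpack_from(">H", data, pos + 2)
        segments.append((marker, pos, length))
        pos += 2 + length
        if marker == 0xDA:                  # SOS: entropy-coded data follows
            while pos + 1 < len(data):
                nxt = data[pos + 1]
                # Stop at a real marker; skip stuffed bytes and restarts.
                if data[pos] == 0xFF and nxt != 0x00 and not 0xD0 <= nxt <= 0xD7:
                    break
                pos += 1
    return segments
```

With a walker like this, the "data between the chunks" should show up as the gap between the end of an SOS segment and the next reported marker.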

Reading TIFF files

I need to read and interpret a binary file containing a TIFF image. I know there exist readers for doing this but I want to go the hard way. I found the TIFF format description and need to parse the binary file in small chunks. Assume I was able to read in memory the complete binary file. This means that I have a variable containing one long list of bytes.
I know via the format definition what the meaning is of the different groups of n bytes.
How can one define character variables with different lengths (sometimes 2, sometimes 3, sometimes 4 etc.) so that the variable address points to the right position in the image variable array?
In other words, assume my image is loaded into an array Image containing all bytes of the file.
I want to load the first 2 bytes into a string of length 2, by pointing the string's address at the first position in the Image array so that the first 2 bytes are automatically associated with that string. A second string of 4 bytes would have another meaning, so I would make its address point to the 3rd position of the Image array.
Is this feasible in C++? I remember that this was a normal way of doing dynamic memory allocation in Fortran 77 in a simulation code I analysed a long time ago.
Thanks in advance for the hints!
Regards,
Stefan

The C++ language is easily capable of processing TIFF files from a byte array. The idea you have in mind is basically correct, but there are a few problems with it. C strings are zero-terminated, and the strings which appear in TIFF files are not necessarily zero-terminated since their length is specified explicitly. It really is simpler to create a dedicated data structure to hold the TIFF-specific data fields and then parse the binary data into the structure. Your method will immediately run into trouble with the Motorola/Intel byte-order issue if your machine has the opposite endianness.
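To illustrate the "parse into a structure" approach and the byte-order issue, here is a minimal sketch. It is written in Python rather than C++ to keep it short; the same steps carry over to C++ with explicit byte assembly. It reads the 8-byte TIFF header ("II" for Intel/little-endian or "MM" for Motorola/big-endian, the magic number 42, and the offset of the first image file directory) and then the 12-byte directory entries. The names parse_tiff_header and read_ifd_entries are made up for this example.

```python
import struct

def parse_tiff_header(image: bytes):
    """Return (byte-order prefix for struct, offset of the first IFD)."""
    order = image[:2]
    if order == b"II":
        endian = "<"   # Intel: little-endian
    elif order == b"MM":
        endian = ">"   # Motorola: big-endian
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")
    magic, ifd_offset = struct.unpack_from(endian + "HI", image, 2)
    if magic != 42:
        raise ValueError("not a TIFF file: bad magic number")
    return endian, ifd_offset

def read_ifd_entries(image: bytes, endian: str, ifd_offset: int) -> list:
    """Return (tag, field type, count, raw 4-byte value/offset) per entry."""
    (count,) = struct.unpack_from(endian + "H", image, ifd_offset)
    entries = []
    for i in range(count):
        # Each directory entry is 12 bytes: tag, type, count, value/offset.
        entry = struct.unpack_from(endian + "HHI4s", image,
                                   ifd_offset + 2 + 12 * i)
        entries.append(entry)
    return entries
```

The last field of each entry is left as raw bytes because it holds either the value itself or an offset to it, depending on the field type and count.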

Save a CRC value in a file, without altering the actual CRC Checksum?

I am saving some objects, instances of my own classes, to a file (saving the stream data).
That is all fine, but I would like to be able to store the CRC checksum of that file in the file itself.
Then, whenever my application attempts to open a file, it can read the internally stored CRC value.
It then performs a check on the actual file: if the CRC of the file matches the internally stored CRC value I can process the file normally, otherwise I display an error message to say the file is not valid.
I need some advice on how to do this though, I thought I could do something like this:
Save the File from my Application.
Calculate the CRC of the Saved File.
Edit the Saved File storing the CRC Value.
Whenever a File is Opened, Check the CRC matches internal CRC Value.
The problem is that as soon as a single byte of data in the file is altered, the CRC checksum becomes completely different, as expected.

I'd generally prefer the approach where the CRC is excluded from the checking. But if that's not possible for some reason, there is a workaround:
You need to reserve 8 bytes, 4 for the CRC, and 4 for compensation data. First fill the reserved bytes with a certain dummy value (say 0x00). Then calculate the CRC into the first 4 bytes, and finally change the other 4 bytes so the CRC of the file stays the same.
For details on how to perform this calculation: Reversing CRC32
I actually used this in one of my projects:
I was designing a file format based on zip. The first file in the archive is stored uncompressed and serves as header file. This also means it is stored at a fixed offset in the file. So far pretty standard, and similar to for example ePub.
Now I decided to include a sha1 hash in the header, to give each file a unique content based Id and for integrity checking. Since the header and thus the sha1 hash is at a known offset in the file, masking it when hashing is trivial. So I put in a dummy hash and create the zip file, then hash the file and fill in the real hash.
But now there is a problem: Zip stores the CRC of all contained files. And not only in one place which would be easy to mask when sha1-hashing, but in a second place with variable offset near the end of the file. So I decided to go with CRC faking, so I get my strong hash, and zip gets its valid CRC32.
And since I was already faking the CRC for the final file, I decided faking it for the original header file wouldn't hurt either. Thus all files in this format now start with a header file that has the CRC 0xD1CE0DD5.

Simply put you need to exclude the bytes used to store the checksum from the checksum calculation.
Write the checksum as the last thing in the file. Calculate it based on the contents of the file apart from the checksum. When you come to read the file calculate the checksum based on the contents before the checksum. Or you could write the checksum as the first bytes of the file with random access. Just so long as you know where it is.
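A minimal sketch of the "checksum as the last thing in the file" variant, in Python for brevity. It assumes a CRC-32 as computed by zlib.crc32 and a little-endian 4-byte trailer; both are choices made for this example, and the function names are made up.

```python
import struct
import zlib

def append_crc(payload: bytes) -> bytes:
    """Return payload followed by its CRC-32 as 4 little-endian bytes."""
    return payload + struct.pack("<I", zlib.crc32(payload))

def verify_crc(blob: bytes) -> bool:
    """Check a blob produced by append_crc; the trailer itself is excluded."""
    if len(blob) < 4:
        return False
    payload = blob[:-4]
    (stored,) = struct.unpack("<I", blob[-4:])
    return zlib.crc32(payload) == stored
```

On save, write append_crc(data) to disk; on open, reject the file if verify_crc returns False and otherwise process everything except the last 4 bytes.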

Store the CRC as part of the file itself, but don't include the data for it in the CRC calculation. If you have some sort of fixed header zero out the CRC field before passing it to the CRC function. If not, just append it to the end of the file and pass everything but the last 4 bytes into the CRC function.
Alternatively, if the files are stored on an NTFS drive and you don't need to transfer them to another computer you can use NTFS Alternate Data Streams to store the CRCs. Basically you open the file with the ADS name separated from the filename by a colon (like C:\file.txt:CRC). Windows handles the difference internally, so you can use plain TFileStream functions to manipulate them.
Alternate data streams are stored separately from the standard file stream, so opening or modifying just C:\file.txt won't affect it.
So, the code would look like this:
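procedure UpdateCRC(const aFileName: string);
var
  FileStream, ADSStream: TStream;
  CRC: LongWord;
begin
  // Compute the CRC of the main (unnamed) data stream.
  // CrcOf is not defined in this snippet; substitute your own CRC routine.
  FileStream := TFileStream.Create(aFileName, fmOpenRead);
  try
    CRC := CrcOf(FileStream);
  finally
    FileStream.Free;
  end;
  // Store the CRC in the alternate data stream named "CRC".
  ADSStream := TFileStream.Create(aFileName + ':CRC', fmCreate);
  try
    ADSStream.WriteBuffer(CRC, SizeOf(CRC));
  finally
    ADSStream.Free;
  end;
end;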
If you need to find all of the alternate data streams attached to a file (there can be more than one), you can iterate over them using BackupRead. Internet Explorer uses ADSs to support the "This file has been downloaded from the Internet. Are you sure you want to open it?" prompt.

I would recommend storing the checksum in another file, maybe a .ini file. Or for a really weird idea, you could incorporate the checksum as part of the filename.
i.e. MyFile_checksum_digits_here.dat
