Why bytes of one word has opposite order in binary files? - binary-data

I was reading BMP file in hex editor while discovered something odd. Two first letters "BM" are written in order, however the next word(2B), which is means file size, is 36 30 in hex. Actual size is 0x3036. I've noticed that other numbers are stored the same way.
I'm also using MARS MIPS emulator which can display memory by words. String in.bmp is stored as b . n i / \0 p m.
Why data isn't stored continuously?

It depends not on the data itself but on how you store this data: per byte, per word (2 bytes, usually), or per long (4 bytes -- again, usually). As long as you store data per byte you don't see anything unusual; data appears "continuous". However, with longer units, you are subject to endianness.
It appears your emulator is assuming all words need to have their bytes reversed; and you can see in your example that this assumption is not always valid.
As for the BM "magic" signature: it's not meant to be read as a word value "BM", but rather as "first, a single byte B, then a single byte M". All next values are written in little-endian order, not only 'exchanging' your 36 and 30 but also the 2 zeroes 'before' (or 'after') (the larger values in the BMP header are of 4 bytes long type).

Related

JVM 64-bit different memory usages?

I've done some reading but I'm not entirely sure about one thing, for example how much memory would this use in JVM 64 bit(sorry if stupid question, but I'm a bit confused and don't know much about this):
MyObject[] myArray; - I know an array takes up 24 bytes, but how much will each element in this array take? is every element an object reference, meaning 8 byte per element? If not, how do I know how many bytes each element in this array needs?
Normally, that is when using heap sizes of less than 32 GB, the 64-bit JVM uses compressed oops which store object pointers as a 32-bit integer (scaled by three bits when used, since all objects are aligned to 8 bytes; see the link for details), so each element would actually only use 4 bytes.
If you use more than 32 GB of heap or otherwise turn off compressed oops, however, then each element will indeed use 8 bytes.
Also, I suspect that your statement on the array header being 24 bytes is wrong. To begin with, when compressing oops, the class reference in the header is also compressed, and the identity-hash-code and array length fields are 32-bit to begin with, so I suspect it is more likely to use 12 bytes. Even when using full-length oops, it should still only take 16 bytes. I can't find any hard source verifying either, however. In general, however, it should be said that Hotspot does not even use a fixed-size object header but one that varies in size depending on various circumstances of the object. This article describes some of those circumstances.
That is on the Hotspot JVM, at least. Since the JLS doesn't specify any primitive sizes, it could, theoretically, be anything on any given JVM, though 8 bytes are, of course, the most likely implementation choice.
Here is good information on how to calculate the memory usage of a Java array
For Example
let's consider a 10x10 int array. Firstly, the "outer" array has its 12-byte object header followed by space for the 10 elements. Those elements are object references to the 10 arrays making up the rows. That comes to 12+4*10=52 bytes, which must then be rounded up to the next multiple of 8, giving 56. Then, each of the 10 rows has its own 12-byte object header, 4*10=40 bytes for the actual row of ints, and again, 4 bytes of padding to bring the total for that row to a multiple of 8. So in total, that gives 11*56=616 bytes. That's a bit bigger than if you'd just counted on 10*10*4=400 bytes for the hundred "raw" ints themselves.

Jfif/jpeg parsing, bytes between streams

I'm parsing an Jpeg/JFIF file and I noticed that after the SOI (0xFF D8) I parse the different "streams" starting with 0xFFXX (where XX is a hexadecimal number) until I find the EOI (0XFFD9). Now the structure of the diffrent chunks is:
APP0 marker 2 Bytes
Length 2 Bytes
Now when I parse the a chunk I parse until i reach the length written in the 2 Bytes of the length field. After that I thought I would immediately find another Marker, followed by a length for the next chunk. According to my parser that is not always true, there might be data between the chunks. I couldn't find out what that data is, and if it is relevant to the image. Do you have any hints what this could be and how to interpret those bytes?
I'm lost and would be happy if somebody could point me in the correct direction. Thanks in advance
I've recently noticed this too. In my case it's an APP2 chunk which is the ICC profile which doesn't contain the length of the chunk.
In fact so far as I can see the length of the chunk needn't be the first 2 bytes (though it usually is).
In JFIF all 0xFF bytes are replaced with 0xFF 0x00 in the data section, so it should just be a matter of calculating the length from that. I just read until I hit another header, however I've noticed that sometimes (again in the ICC profile) there are byte sequences which don't make sense such as 0xFF 0x6D, so I may still be missing something.

How to Save Huffman code [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
need help on how to encode words using huffman code
Suppose I have the following Huffman coded symbols
A - 0
B - 10
C - 110
D - 111
and that you want to encode the sequence
A B A A C D A D B B
then I get the binary code like this:
01000110 11101111 010
(01000110) = 0x46
(11101111) = 0xEF
010=????
if there is no 010 in this code, I can save these byte into a file.
now how should I process this 010? Save it as 00000010? that doesn't work.
You'll need some header for your encoded data (a byte should be enough for this purpose, but you may need more, depending on what you actually need) and you can store how many padding bits you have in the last data byte. So in your example with 010 your header byte would contain 5 because you have 5 bits at the end of the last byte that you need to ignore.
Finding what values are stored in the bits that are actually useful is something that you need to handle bit by bit, since you can have overlapping codes where a code may be split between two bytes - so some bits may be at byte N and some bits may overlap onto byte N+1.
You can include an end byte that marks that you've finished, then perhaps another byte after that that says how many items were stored in the last data byte. Then you can sort out the remaining characters from the last byte with knowledge of how many bits are actually significant.

Reading TIFF files

I need to read and interpret a binary file containing a TIFF image. I know there exist readers for doing this but I want to go the hard way. I found the TIFF format description and need to parse the binary file in small chunks. Assume I was able to read in memory the complete binary file. This means that I have a variable containing one long list of bytes.
I know via the format definition what the meaning is of the different groups of n bytes.
How can one define character variables with different lengths (sometimes 2, sometimes 3, sometimes 4 etc.) so that the variable address points to the right position in the image variable array?
With other words, assume my image is loaded into an array Image containing all bytes of the file.
The first 2 bytes I want to load in a string with length 2 bytes so that I can just link the address pointer to the first position in the Image array and automatically the first 2 bytes are associated with the first character string. A second string of 4 bytes would have another meaning and so I make the address for the second string of 4 bytes point to the 3rd position of the Image array.
Is this feasible in C++? I remember that this was a normal way of working for dynamical memory allocation in Fortran 77 in a simulation code I analysed a long time ago.
Thanks in advance for the hints!
Regards,
Stefan
The C++ language is easily capable of processing TIFF files from a byte array. The idea you have in mind is basically correct, but there a few problems with it. C strings are zero-terminated and the strings which appear in TIFF files are not necessarily zero terminated since their length is specified explicitly. It really is simpler to create a dedicated data structure to hold the TIFF-specific data fields and then parse the binary data into the structure. Your method will immediately run into trouble with the Motorola/Intel byte issue if your machine has the opposite endian-ness.

Why are memory addresses incremented by 4 in MIPS?

If something is stored at 0x1001 0000 the next thing is stored at 0x1001 0004. And if I'm correct the memory pieces in a 32-bit architecture are 32 bits each. So would 0x1001 0002 point to the second half of the 32 bits?
First of all, memory addresses in MIPS architecture are not incremented by 4. MIPS uses byte addressing, so you can address any byte from memory (see e.g. lb and lbu to read a single byte, lh and lhu to read a half-word).
The fact is that if you read words which are 32 bits length (4 bytes, lw), then two consecutive words will be 4 bytes away from each other. In this case, you would add 4 to the address of the first word to get the address of the next word.
Beside this, if you read words you have to align them in multiples of 4, otherwise you will get an alignment exception.
In your example, if the first word is stored in 0x10010000 then the next word will be in 0x10010004 and of course the first half/second half would be in 0x1001000 and 0x1001002 (the ordering will depend on the endianness of the architecture).
You seem to have answered this one yourself! 32 bits make 4 bytes, so if you're e.g. pushing to a stack, where all elements are pushed as the same size, each next item will be 4 bytes ahead (or before) the next.

Resources