Jfif/jpeg parsing, bytes between streams - parsing

I'm parsing an Jpeg/JFIF file and I noticed that after the SOI (0xFF D8) I parse the different "streams" starting with 0xFFXX (where XX is a hexadecimal number) until I find the EOI (0XFFD9). Now the structure of the diffrent chunks is:
APP0 marker 2 Bytes
Length 2 Bytes
Now when I parse the a chunk I parse until i reach the length written in the 2 Bytes of the length field. After that I thought I would immediately find another Marker, followed by a length for the next chunk. According to my parser that is not always true, there might be data between the chunks. I couldn't find out what that data is, and if it is relevant to the image. Do you have any hints what this could be and how to interpret those bytes?
I'm lost and would be happy if somebody could point me in the correct direction. Thanks in advance

I've recently noticed this too. In my case it's an APP2 chunk which is the ICC profile which doesn't contain the length of the chunk.
In fact so far as I can see the length of the chunk needn't be the first 2 bytes (though it usually is).
In JFIF all 0xFF bytes are replaced with 0xFF 0x00 in the data section, so it should just be a matter of calculating the length from that. I just read until I hit another header, however I've noticed that sometimes (again in the ICC profile) there are byte sequences which don't make sense such as 0xFF 0x6D, so I may still be missing something.

Related

Is there some byte combination that can be used as a separator of streams of Int16

I was given the task to specify a file format for internal use inside an application.
One of the intended requirements says:
The data section of the file should be made up of a series of streams of type Int16 values (short integers), delimited by a suitable combination of one or more bytes.
As I understand, Int16 can contain any single byte value, so I don't know how I could choose some sequence of bytes that is guaranteed not to appear incidentally inside a stream. Is there such a sequence?
(And also, if the answer is "no", what would be a good way to determine the position and size of each stream in the file?)
By "streams," I assume the request indicates that the length is unknown when the writing of the data begins.
Therefore, I'd suggest a "chunked" encoding, where each substream is parcelled out into variable-size pieces, with the length of each piece written at the beginning as a fixed size integer. An empty chunk signals the end of the substream. Normally, there would be a maximum length of a chunk to facilitate allocation of buffers for efficient reading and writing.
This is patterned after HTTP's "chunked" transfer encoding and a similar approach is used in many other formats, such as the indefinite length encoding supported by the basic encoding rules for ASN.1.
I would suggest prefixing each stream with a length field, rather than trying to use delimiters, for the reason you've already given (no suitable unique delimiter). E.g.:
<length>
<stream>
<length>
<stream>
<length>
<stream>
...
where <length> is, say, a 4 byte integer which defines the number of 16 bit elements in the following stream.

Why bytes of one word has opposite order in binary files?

I was reading BMP file in hex editor while discovered something odd. Two first letters "BM" are written in order, however the next word(2B), which is means file size, is 36 30 in hex. Actual size is 0x3036. I've noticed that other numbers are stored the same way.
I'm also using MARS MIPS emulator which can display memory by words. String in.bmp is stored as b . n i / \0 p m.
Why data isn't stored continuously?
It depends not on the data itself but on how you store this data: per byte, per word (2 bytes, usually), or per long (4 bytes -- again, usually). As long as you store data per byte you don't see anything unusual; data appears "continuous". However, with longer units, you are subject to endianness.
It appears your emulator is assuming all words need to have their bytes reversed; and you can see in your example that this assumption is not always valid.
As for the BM "magic" signature: it's not meant to be read as a word value "BM", but rather as "first, a single byte B, then a single byte M". All next values are written in little-endian order, not only 'exchanging' your 36 and 30 but also the 2 zeroes 'before' (or 'after') (the larger values in the BMP header are of 4 bytes long type).

UTF8 Encoding and Network Streams

A client and server communicate with each other via TCP. The server and client send each other UTF-8 encoded messages.
When encoding UTF-8, the amount of bytes per character is variable. It could take one or more bytes to represent a single character.
Lets say that I am reading a UTF-8 encoded message on the network stream and it is a huge message. In my case it was about 145k bytes. To create a buffer of this size to read from the network stream could lead to an OutMemoryException since the byte array needs that amount of sequential memory.
It would be best then to read from the network stream in a while loop until the entire message is read, reading the pieces in to a smaller buffer (probably 4kb) and then decoding the string and concatenating.
What I am wondering is what happens when the very last byte of the read buffer is actually one of the bytes of a character which is represented by multiple bytes. When I decode the read buffer, that last byte and the beginning bytes of the next read would either be invalid or the wrong character. The quickest way to solve this in my mind would be to encode using a non variable encoding (like UTF-16), and then make your buffer a multiple of the amount of bytes in each character (with UTF-16 being a buffer using the power 2, UTF-32 the power of 4).
But UTF-8 seems to be a common encoding, which would leave me to believe this is a solved problem. Is there another way to solve my concern other than changing the encoding? Perhaps using a linked-list type object to store the bytes would be the way to handle this since it would not use sequential memory.
It is a solved problem. Woot woot!
http://mikehadlow.blogspot.com/2012/07/reading-utf-8-characters-from-infinite.html

Separating decimal value to least & most significant byte

I'm working on some 65802 code (don't ask :P) and I need to separate a 16-bit value into two 8-bit bytes to store it in memory. How would I go about this?
EDIT:
Also, how would I take two similar bytes and combine them into one 16-bit value?
EDIT:
To clarify, many of the solutions available on the internet are not possible with the programming language I'm using (a version of MS-BASIC). I can't take modulo, and I can't left or rightshift. I've figured out that I can put the two bytes together by multiplying the high byte by 256 and adding it to the low byte, but how would I reverse the process?

Reading TIFF files

I need to read and interpret a binary file containing a TIFF image. I know there exist readers for doing this but I want to go the hard way. I found the TIFF format description and need to parse the binary file in small chunks. Assume I was able to read in memory the complete binary file. This means that I have a variable containing one long list of bytes.
I know via the format definition what the meaning is of the different groups of n bytes.
How can one define character variables with different lengths (sometimes 2, sometimes 3, sometimes 4 etc.) so that the variable address points to the right position in the image variable array?
With other words, assume my image is loaded into an array Image containing all bytes of the file.
The first 2 bytes I want to load in a string with length 2 bytes so that I can just link the address pointer to the first position in the Image array and automatically the first 2 bytes are associated with the first character string. A second string of 4 bytes would have another meaning and so I make the address for the second string of 4 bytes point to the 3rd position of the Image array.
Is this feasible in C++? I remember that this was a normal way of working for dynamical memory allocation in Fortran 77 in a simulation code I analysed a long time ago.
Thanks in advance for the hints!
Regards,
Stefan
The C++ language is easily capable of processing TIFF files from a byte array. The idea you have in mind is basically correct, but there a few problems with it. C strings are zero-terminated and the strings which appear in TIFF files are not necessarily zero terminated since their length is specified explicitly. It really is simpler to create a dedicated data structure to hold the TIFF-specific data fields and then parse the binary data into the structure. Your method will immediately run into trouble with the Motorola/Intel byte issue if your machine has the opposite endian-ness.

Resources