Advantages of hex formats like SREC or Intel HEX over raw binary data

Can someone explain the benefit of using hex formats (e.g. Motorola S-Record or Intel HEX) over direct binary images, for things like firmware or memory dumps?
I understand that it is useful to have some meta information about the binary file, like the memory areas used, checksums for data integrity and so on…
However, the fact that the actual data size is doubled, because everything is stored in a hex-ASCII representation, confuses me.
Is portability the only reason for using a hex-ASCII representation, i.e. to prevent problems with systems that have a different byte endianness, or are there other benefits?
On this topic I found many tutorials about how to convert binary to hex and back, as well as the specifications of the various formats, but no information about the advantages and disadvantages.

There are several advantages of hex-ASCII formats over a binary image. Here are a few to start with.
Transport. An ASCII file can be transferred through most media. Binary files may not survive some transports intact.
Validity checks. A hex file has a checksum on each line (see the sketch after this list). A binary image may not have any checks at all.
Size. Data is written only to selected memory addresses in the hex file, so unused regions can simply be omitted. A binary image has limited addressing information and is likely to contain every memory location from the start address to the end of the file.
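For illustration, here is a minimal C++ sketch of the per-line checksum check mentioned above, assuming well-formed input: in an Intel HEX record, all decoded bytes, including the trailing checksum byte, must sum to zero modulo 256. The record string is the standard example from the format documentation.

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>

// Verify one Intel HEX record such as
// ":10010000214601360121470136007EFE09D2190140".
// The sum of all decoded bytes, including the trailing
// checksum byte, must be 0 modulo 256.
bool hexRecordValid(const std::string &line)
{
    if (line.empty() || line[0] != ':' || (line.size() - 1) % 2 != 0)
        return false;
    uint8_t sum = 0;
    for (size_t i = 1; i + 1 < line.size(); i += 2) {
        // Decode one byte from a pair of hex digits.
        sum += (uint8_t)strtoul(line.substr(i, 2).c_str(), nullptr, 16);
    }
    return sum == 0;   // uint8_t arithmetic already wraps mod 256
}

int main()
{
    printf("%s\n",
           hexRecordValid(":10010000214601360121470136007EFE09D2190140")
               ? "checksum OK" : "checksum BAD");
}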

Related

C/C++: are objects binary-compatible across platforms?

I would like to send some object data, in binary, between some MCUs. I treat it as a cross-platform problem. I would like to implement it like this:
// mcu A
// someObj declared and initialized
Send((uint8_t*)&someObj, sizeof(someObj));   // send the raw object bytes
// mcu B
SomeClass someObj;
Read((uint8_t*)&someObj, sizeof(someObj));   // read raw bytes back into an object
Is there any guarantee in C/C++ that such a thing is possible?
There is no guarantee that it works. If your data is composed only of chars, it will probably work regardless of platform.
Otherwise, you will encounter hardware and software problems.
Hardware problems include endianness and data alignment.
Endianness refers to the way multi-byte data types are arranged in memory. For instance, a 32-bit integer has 4 bytes, and some architectures store it by writing the least significant byte at the lowest address (little-endian, like the Pentium), while others store the most significant byte at the lowest address (big-endian). If endianness differs, bytes must be swapped to ensure compatibility. Note that some platforms (ARM and MIPS, among others) can use either endianness, but it is generally selected at boot time. Also, some machines use different endianness for integers and floats.
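As a small illustration (not part of the answer above), here is a C++ sketch of detecting the host byte order at run time and swapping a 32-bit value, the usual remedy when the two ends disagree:

#include <cstdint>
#include <cstdio>

// On a little-endian host the least significant byte of a
// multi-byte value comes first in memory.
bool isLittleEndian()
{
    uint32_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

// Reverse the byte order of a 32-bit value.
uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}

int main()
{
    printf("host is %s-endian\n", isLittleEndian() ? "little" : "big");
    printf("0x%08X -> 0x%08X\n", 0x11223344u, swap32(0x11223344u));
}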
Alignment refers to the constraint on many architectures that a 2^k-byte datum must reside at an address that is a multiple of 2^k. Some architectures, like the Pentium, do not have this constraint and can manipulate unaligned data, but a compiler may still lay out data in an aligned way to improve performance. As a side effect of alignment constraints, a given object may not have the same size on different architectures, and sizeof() applied to a struct is not guaranteed to return the same value.
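To make the sizeof() point concrete, a small sketch; the exact padding is ABI-dependent, so the printed size is typical rather than guaranteed:

#include <cstdint>
#include <cstdio>

// Padding inserted for alignment means sizeof(a struct) is not
// simply the sum of its members, and may differ between ABIs.
struct Sample {
    uint8_t  flag;   // 1 byte
    // 3 padding bytes are typically inserted here so that
    // 'value' lands on a 4-byte boundary
    uint32_t value;  // 4 bytes
};

int main()
{
    // On most 32/64-bit ABIs this prints 8, not 5; another
    // compiler or a packing pragma may yield a different size.
    printf("sizeof(Sample) = %zu\n", sizeof(Sample));
}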
Software problems are related to the nature of your data.
Obviously, if your data contains any kind of pointer, it is impossible to transfer it as-is across platforms.
If you have C++ objects with constructors/destructors, again you will run into problems when transferring binary data.
The process of converting data to allow safe transfer across platforms is frequently called serialization or pickling. Many languages (Java, JavaScript, Python, R) have native support for it. In C/C++ there is no language-level support for serialization, and custom serialization must be written, but frameworks like Boost or MFC provide serialization methods. You can also have a look at XDR (External Data Representation), a serialization standard supported by several libraries.
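For a flavour of what hand-written serialization looks like, here is a minimal sketch that writes each field in a fixed byte order (big-endian, as XDR uses) instead of copying raw object bytes; the Reading struct and the helper names are made up for the example:

#include <cstdint>
#include <cstdio>
#include <vector>

struct Reading {        // hypothetical payload
    uint16_t id;
    uint32_t value;
};

// Append a 16-bit value, most significant byte first.
void putU16(std::vector<uint8_t> &out, uint16_t v)
{
    out.push_back((uint8_t)(v >> 8));
    out.push_back((uint8_t)v);
}

// Append a 32-bit value, most significant byte first.
void putU32(std::vector<uint8_t> &out, uint32_t v)
{
    out.push_back((uint8_t)(v >> 24));
    out.push_back((uint8_t)(v >> 16));
    out.push_back((uint8_t)(v >> 8));
    out.push_back((uint8_t)v);
}

std::vector<uint8_t> serialize(const Reading &r)
{
    std::vector<uint8_t> out;
    putU16(out, r.id);      // always 2 bytes, MSB first
    putU32(out, r.value);   // always 4 bytes, MSB first
    return out;             // 6 bytes on every platform, no padding
}

int main()
{
    Reading r{7, 0x11223344};
    for (uint8_t b : serialize(r)) printf("%02X ", b);
    printf("\n");   // prints: 00 07 11 22 33 44
}

The receiving side reads the same six bytes back field by field, so padding, alignment and host endianness never enter the wire format.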

Safety of xz archive format

While looking for a good option to store large amounts of data (coming mostly from numerical computations) long-term, I arrived at the xz archive format (tar.xz). Its default LZMA compression provides significantly better archive sizes (for my type of data) compared to the more common tar.gz (both with reasonable compression options).
However, the first Google search on the safety of long-term usage of xz turned up the following web page (written by one of the developers of lzip), titled
Xz format inadequate for long-term archiving
listing several reasons, including:
xz being a container format as opposed to simple compressed data preceded by a necessary header
xz format fragmentation
unreasonable extensibility
poor header design and lack of field length protection
4-byte alignment and use of padding all over the place
inability to append trailing data to an already created archive
multiple issues with xz error detection
no options for data recovery
While some of the concerns seem a bit artificial, I wonder if there is any solid justification for not using xz as an archive format for long-term archiving.
What should I be concerned about if I choose xz as a file format?
(I guess access to an xz program itself should not be an issue even 30 years from now.)
A couple of notes:
The data stored are results of numerical computations, some of which have been published in various conferences and journals. And while storing results does not necessarily imply research reproducibility, it is an important component of it.
While using the more standard tar.gz or even plain zip might be the more obvious choice, the ability to cut about 30% off the archive size is very appealing to me.
If you read the page you linked carefully, you'll find things like this:
https://www.nongnu.org/lzip/xz_inadequate.html#misguided
"the xz format specification sets more strict requirements for the integrity of the padding than for the integrity of the payload. The specification does not guarantee that the integrity of the decompressed data will be verified, but it mandates that the decompression must be aborted as soon as a damaged padding byte is found."
What compressed format does any of the following?
Uses padding.
Protects padding with a CRC.
Aborts if the padding is corrupt.
Maybe the right question is, "is there any solid justification for using such a poorly designed format as xz for long-term archiving when properly designed formats exist?"
The IANA Time Zone Database, for example, is using gzip and lzip to distribute their tarballs, which are archived forever.
http://www.iana.org/time-zones

How to read or write huge Unicode files?

I need to read huge Unicode files into my program and convert them to ANSI for parsing; some files should then be stored again as Unicode, while others should be in the ANSI code page.
As I understand it, simple Read/Write doesn't support Unicode text, and for the biggest files (some maybe as big as 300 MB or even bigger) using twidestring.loadfromfile is out of the question, both because of memory usage and load time.
I have been wondering if loading in blocks could be a way forward, but as far as I know, that doesn't handle the Unicode BOM?
Any suggestions?
There is an excellent and very fast text reader in the German 'Delphi Forum'. It uses memory-mapped files.
You will probably be able to modify it to read Unicode text files. However, you might have to test for the BOM yourself.
In Delphi you can also use memory mapped files.
The primary benefit of memory mapping a file is increasing I/O performance, especially when used on large files.
...
A possible benefit of memory-mapped files is "lazy loading", thus using small amounts of RAM even for a very large file.
Memory-mapped file. (2013, February 26). In Wikipedia, The Free Encyclopedia. Retrieved 15:14, March 17, 2013, from http://en.wikipedia.org/w/index.php?title=Memory-mapped_file&oldid=540609840
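In Delphi the route would be CreateFileMapping/MapViewOfFile from the Windows API; as a language-neutral illustration of the same idea, here is a POSIX C++ sketch that maps a file and checks its BOM by hand. Pages are only faulted in as they are touched, which is the "lazy loading" mentioned above.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }

    // Map the whole file read-only; nothing is read from disk yet.
    const unsigned char *p = (const unsigned char *)
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // The BOM has to be tested by hand:
    // EF BB BF = UTF-8, FF FE = UTF-16 LE, FE FF = UTF-16 BE.
    if (st.st_size >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        printf("UTF-8 BOM\n");
    else if (st.st_size >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        printf("UTF-16 LE BOM\n");
    else if (st.st_size >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        printf("UTF-16 BE BOM\n");
    else
        printf("no BOM found\n");

    munmap((void *)p, st.st_size);
    close(fd);
}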

Matlab Parse Binary File

I am looking to speed up the reading of a data file which has been converted from binary to plaintext (it is my understanding that "binary" can mean a lot of different things; I do not know what type of binary file I have, just that it's a binary file). I looked into reading files quickly a while ago and was informed that reading/parsing a binary file is faster than text. So I would like to parse/read the binary file (the one that was converted to plaintext) in an effort to speed up the program.
I'm using Matlab for this project (I have a Matlab "program" that needs the data in the file). I guess I need some information on the different "types" of binary, but I really want information on how to read/parse said binary file (I know what I'm looking for in plaintext, so I imagine I'll need to convert that to binary, search the file, then pull the result out into plaintext). The file is a logfile, if that helps in any way.
Thanks.
There are several issues in what you are asking; above all, you need to know the format of the file you are reading. If you can say "at position xx, I can expect to find data yy", that's what you need to know. In your question/comments you talk about searching for strings. You can also do that (much as with a text file): "when I find xxxx in the file, give me the following data up to the nth character, or up to the next yyyy".
You want to look at the documentation for fread. The documentation contains snippets of code that will get you started, but as I (and others) said, you need to know the format of your binary files. You can use a hex editor to ascertain some information if you are desperate, but what should be quicker is the documentation for the program that outputs these files.
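Matlab's fread is modelled on the C standard library's fread, so here is a rough sketch of the idea in C++, with a completely hypothetical record layout and file name; substitute the real format once you know it:

#include <cstdint>
#include <cstdio>

// Hypothetical log record: a 4-byte timestamp followed by a
// 4-byte float sample, packed with no padding. The real layout
// must come from the documentation of the producing program.
#pragma pack(push, 1)
struct LogRecord {
    uint32_t timestamp;
    float    sample;
};
#pragma pack(pop)

int main()
{
    FILE *f = fopen("data.log", "rb");   // hypothetical file name
    if (!f) { perror("fopen"); return 1; }

    LogRecord rec;
    // Read fixed-size records one after another until end of file.
    while (fread(&rec, sizeof rec, 1, f) == 1)
        printf("t=%u value=%f\n", (unsigned)rec.timestamp, rec.sample);

    fclose(f);
}

In Matlab itself the equivalent would be fopen followed by fread(fid, count, precision), choosing a precision string such as 'uint32' or 'single' per field.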
Regarding different "binary files": there is least-significant-byte-first versus most-significant-byte-first storage (little-endian versus big-endian). You really don't need to know about that for this work. There are also other platform-dependent issues which I am almost certain you don't need to know about (unless you are moving the binary files between Mac, PC and Unix machines). If you read almost to the bottom of the fread documentation, there is a section entitled "Reading Files Created on Other Systems" which talks about these issues and how to deal with them.
Another comment I have to make: you say that "reading/parsing a binary file is faster than text". This is not true (and even if it is, odds are you won't notice the performance gain). In terms of development time, however, reading/parsing a text file will save you huge amounts of time.
The simple way to store data in a binary file is to use the 'save' command.
If you load from a saved variable, it should be significantly faster than loading from a text file.

Unicode Precomposition and Decomposition with Delphi

The Wikipedia entry for Subversion contains a paragraph about problems with different forms of Unicode encoding:
While Subversion stores filenames as Unicode, it does not specify if precomposition or decomposition is used for certain accented characters (such as é). Thus, files added in SVN clients running on some operating systems (such as OS X) use decomposition encoding, while clients running on other operating systems (such as Linux) use precomposition encoding, with the consequence that those accented characters do not display correctly if the local SVN client is not using the same encoding as the client used to add the files.
While this describes a specific problem with Subversion client implementations, I am not sure whether the underlying Unicode composition problem could also appear in regular Delphi applications. I guess the problem can only arise if Delphi applications are able to use both Unicode encoding forms (maybe in Delphi XE2). If so, what could Delphi developers do to avoid it?
There is a minor display issue in that many fonts used on Windows won't render the decomposed form in the ideal way, by using the combined glyph for both the letter and the diacritic. Instead it falls back to rendering the letter and then overlaying the standalone diacritical mark on top, which typically results in a less visually pleasing, potentially lopsided grapheme.
However, that is not the issue the Subversion bug referenced from Wikipedia is talking about. It's actually completely fine to check in filenames to SVN that contain composed or decomposed character sequences; SVN neither knows nor cares about composition, it just uses the Unicode code points as-is. As long as the backend filesystem leaves filenames in the same state as they were put in, all is fine.
Windows and Linux both have filesystems that are equally blind to composition. Mac OS X, unfortunately, does not. Both HFS+ and UFS filesystems perform ‘normalisation’ to decomposed form before storing an incoming filename, so the filename you get back won't necessarily be the same sequence of Unicode code points you put in.
It is this [IMO: insane] behaviour that confuses SVN, and many other programs, when run on OS X. It's particularly likely to bite because Apple happened to choose decomposed (NFD) as their normalisation form, whereas most of the rest of the world uses composed (NFC) characters.
(And it's not even real NFD, but an incompatible Apple-only variant. Joy.)
The best way to cope with this, if you can, is never to rely on the exact filename something is stored under. If you only ever read a file from a given name, that's fine, as it'll be normalised to match the filesystem at the time. But if you're reading a directory listing and trying to match the filenames you find there against what you expected the filename to be, which is what Subversion is doing, you're going to get mismatches.
To do a filename match reliably, you would have to detect that you're running on OS X and manually normalise both the filename and the string to some normal form (NFC or NFD) before doing the comparison. You shouldn't do this on other OSes, which treat the two forms as different.
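As an illustration of such a comparison, here is a minimal C++ sketch using ICU (one common choice for normalization in C/C++; link against its common library with -licuuc). A Delphi application would need an equivalent library or API call.

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <cstdio>

// Compare two filenames after normalizing both to NFC, so that
// precomposed "é" (U+00E9) and decomposed "e" + U+0301 compare equal.
bool sameFilename(const icu::UnicodeString &a, const icu::UnicodeString &b)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return false;
    icu::UnicodeString na = nfc->normalize(a, status);
    icu::UnicodeString nb = nfc->normalize(b, status);
    return U_SUCCESS(status) && na == nb;
}

int main()
{
    // "café" in NFC (U+00E9) versus NFD (e + combining acute U+0301).
    icu::UnicodeString composed  = icu::UnicodeString::fromUTF8("caf\xC3\xA9");
    icu::UnicodeString decomposed = icu::UnicodeString::fromUTF8("cafe\xCC\x81");
    printf("%s\n", sameFilename(composed, decomposed) ? "equal" : "different");
}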
AFAICT, both encodings should produce the same results when displayed, and both are valid Unicode, so I don't quite see the problem there. A display routine should be able to handle both, provided decomposition is catered for: the single code point é should display as-is, while the sequence e followed by a combining acute accent should display as é only when decomposition is supported.
The problem is not display, IMO, it is comparison, either for equality (which fails if the two strings use different encodings) or lexically, i.e. for sorting. That is why one should normalise to one form, as David says. That way there are no ambiguities anymore.
The same problem could arise in any application that deals with text. How to avoid it depends on what operations the application performs, and the question lacks specific details. Mostly, I think you'd solve such problems by normalizing the text, i.e. using a single preferred representation whenever you encounter ambiguity of encoding.
