Efficient and flexible binary data parsing

I have an external device that spits out UDP packets of binary data, and software running on an embedded system that needs to read this data stream, parse it and do something useful. The binary data gets logged to a file as well. I would like to write a parser that can take its input directly from either the UDP stream or a file, parse the data into a specific format, and then direct the output to either a file (e.g. a MATLAB dat file) or to another process that will do some real-time processing. Are there any resources that would help me with this, and what is the best way to go about it? I think it might make sense to use C++ streams, but I'm not familiar with creating custom output streams. Does this seem like a good approach, or is there a better way to go about it?
Thanks.

The beauty of binary data is that it generally has a fixed format.
A typical method of parsing it is to declare a structure that maps onto the received packets, and then use type casts to read the fields as structure elements.
The advantage is that this requires almost no parsing code.
You have to be careful about structure packing rules and endianness so that the structure maps exactly onto the packet layout. The C "offsetof" macro and "sizeof" operator are useful for emitting debug info to check that your structure really maps the way you think it does.
Packing rules can typically be altered either by directives (such as #pragma) or by command-line options. Endianness you are stuck with. If it differs from what your embedded system uses, declare all the fields as bytes, or use something like the "ntoh" family of functions (ntohs, ntohl) to do the byte swapping.
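As a rough illustration of the checks described above, here is a minimal C++ sketch. The packet layout (PacketHeader and its fields) is entirely hypothetical, and the wire format is assumed to be big-endian. Instead of casting the buffer pointer to a struct pointer, this variant reads each field byte by byte, which sidesteps alignment problems and works whatever the host byte order is; the static_assert lines are compile-time versions of the sizeof/offsetof checks.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical wire layout (multi-byte fields assumed big-endian):
//   offset 0: uint16 sequence
//   offset 2: uint8  type
//   offset 3: uint8  flags
//   offset 4: uint32 timestamp
struct PacketHeader {
    std::uint16_t sequence;
    std::uint8_t  type;
    std::uint8_t  flags;
    std::uint32_t timestamp;
};

// Fail the build if the compiler's packing differs from the wire layout.
static_assert(sizeof(PacketHeader) == 8, "unexpected padding in PacketHeader");
static_assert(offsetof(PacketHeader, type) == 2, "type is at the wrong offset");
static_assert(offsetof(PacketHeader, timestamp) == 4, "timestamp is at the wrong offset");

// Big-endian readers that do not depend on the host byte order.
inline std::uint16_t read_be16(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

inline std::uint32_t read_be32(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
            static_cast<std::uint32_t>(p[3]);
}

// Fills 'out' from a raw buffer; returns false if the buffer is too short.
inline bool parse_header(const unsigned char* buf, std::size_t len, PacketHeader& out) {
    if (len < sizeof(PacketHeader)) return false;
    out.sequence  = read_be16(buf + offsetof(PacketHeader, sequence));
    out.type      = buf[offsetof(PacketHeader, type)];
    out.flags     = buf[offsetof(PacketHeader, flags)];
    out.timestamp = read_be32(buf + offsetof(PacketHeader, timestamp));
    return true;
}
```

The same parse_header call can be fed a buffer received from a UDP socket or a block read from the log file, since it only sees bytes.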

The New Jersey Machine Code Toolkit is a scheme for decoding arbitrary binary patterns. It was originally designed for decoding instruction sets, but it ought to be just fine for decoding message formats. You provide a description of the binary format, and it synthesizes code to access the fields of that format (when valid). Thus you can refer to message fields using generated function calls rather than thinking about where each field is or how it is encoded.

Related

Matlab Parse Binary File

I am looking to speed up the reading of a data file which has been converted from binary (it is my understanding that "binary" can mean a lot of different things - I do not know what type of binary file I have, just that it's a binary file) to plaintext. I looked into reading files quickly a while ago, and was informed that reading/parsing a binary file is faster than text. So, I would like to parse/read the original binary file (rather than the plaintext conversion) in an effort to speed up the program.
I'm using Matlab for this project (I have a Matlab "program" that needs the data in the file). I guess I need some information on the different "types" of binary, but I really want information on how to read/parse said binary file (I know what I'm looking for in plaintext, so I imagine I'll need to convert that to binary, search the file, then pull the result out into plaintext). The file is a logfile, if that helps in any way.
Thanks.
There are several issues in what you are asking -- however, you need to know the format of the file you are reading. If you can say "At position xx, I can expect to find data yy", that's what you need to know. In your question/comments you talk about searching for strings. You can also do that (much like with a text file): "when I find xxxx in the file, give me the following data up to the nth character, or up to the next yyyy".
You want to look at the documentation for fread. In the documentation there are snippets of code that will get you started, but as I (and others) said you need to know the format of your binary files. You can use a hex editor to ascertain some information if you are desperate, but what should be quicker is the documentation for the program that outputs these files.
Regarding different "binary files": the main difference is byte order, i.e. whether multi-byte values are stored least significant byte first (little-endian) or last (big-endian). You really don't need to know about that for this work. There are also other platform-dependent issues which I am almost certain you don't need to know about (unless you are moving the binary files between Mac, PC and Unix machines). If you read to almost the bottom of the fread documentation, there is a section entitled "Reading Files Created on Other Systems" which talks about the issues and how to deal with them.
One more comment: you say that "reading/parsing a binary file is faster than text". This is not true (or even if it is, odds are you won't notice the performance gain). In terms of development time, however, reading/parsing a text file will save you huge amounts of time.
The simple way to store data in a binary file is to use the 'save' command.
If you load from a saved variable it should be significantly faster than if you load from a text file.

Convert erlang terms to string, or decode erlang binary

I have an erlang program which generates data. This data needs to be transferred via udp to a non-erlang program for further processing. I already have this part working - sending the data via udp and receiving it on the other non-erlang side.
Here's the problem. The data (erlang terms like tuples containing lists) doesn't seem to be able to go over "as is" (i.e. I can't just send arbitrary erlang terms). It apparently needs to be converted to either text or binary first. Converting to binary seems easy enough with a bif I found. The problem is, binary gobbledygook comes out the other side, and I don't know any easy way to decode it (the other side is non-erlang).
Barring someone telling me some easy way to decode binary gobbledygook on the other side, I'd like the data to be sent as a simplistic string representation of the terms - for instance a tuple like this:
{[1,2,3],[4,5,6]}
sent like this:
"{[1,2,3],[4,5,6]}"
I haven't seen any such bif, i.e. "convert_term_to_ascii/1" etc. I know I could scan it and send token representations of the terms, but I don't want to do that - decoding that on the other side is just a pain I don't want to deal with.
I know I'm not the first, second, or third person to have this problem. It has to be fairly common. How is it normally dealt with?
Can someone point me to some resource showing me how to either 1) convert binary gobbledygook to ascii (needed on the non-erlang side), or 2) straightforwardly convert terms to a string (needed on the erlang side)?
Or, tell me how I'm wrong and how I should really be doing this?
Thanks.
1) you can convert any term to string using
R= io_lib:format("~p",[yourtermhere]),
lists:flatten(R)
2) You might look at the Erlang external term format; a lot of other languages have libraries to encode and decode it. In Erlang you can encode any term with term_to_binary/1.
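To give an idea of what decoding the external term format on the non-Erlang side involves, here is a minimal C++ sketch. It is not a library API: TermReader and decode_term are hypothetical names, and only a handful of tags are handled (small integers, 32-bit integers, small tuples, byte lists and ordinary lists). Anything else, such as atoms, floats or binaries, throws. Note that a list of small integers like [1,2,3] is normally encoded with the "string" tag, which is why that case prints numbers rather than characters.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Tags from the Erlang external term format; only this subset is handled.
enum : std::uint8_t {
    kVersion = 131, kSmallInt = 97, kInteger = 98, kSmallTuple = 104,
    kNil = 106, kString = 107, kList = 108
};

// Hypothetical helper that renders a term_to_binary/1 payload as text.
class TermReader {
public:
    explicit TermReader(std::vector<std::uint8_t> data) : d_(std::move(data)) {}

    std::string decode() {
        if (byte() != kVersion) throw std::runtime_error("bad version byte");
        return term();
    }

private:
    std::uint8_t byte() {
        if (pos_ >= d_.size()) throw std::runtime_error("truncated input");
        return d_[pos_++];
    }
    std::uint32_t be(int n) {  // n-byte big-endian unsigned integer
        std::uint32_t v = 0;
        while (n-- > 0) v = (v << 8) | byte();
        return v;
    }
    // Decodes n consecutive terms and joins them with commas.
    std::string seq(std::uint32_t n, char open, char close) {
        std::string s(1, open);
        for (std::uint32_t i = 0; i < n; ++i) {
            if (i) s += ',';
            s += term();
        }
        return s + close;
    }
    std::string term() {
        switch (byte()) {
        case kSmallInt:   return std::to_string(byte());
        case kInteger:    return std::to_string(static_cast<std::int32_t>(be(4)));
        case kNil:        return "[]";
        case kSmallTuple: return seq(byte(), '{', '}');
        case kString: {   // list of bytes with a 2-byte length prefix
            const std::uint32_t n = be(2);
            std::string s = "[";
            for (std::uint32_t i = 0; i < n; ++i) {
                if (i) s += ',';
                s += std::to_string(byte());
            }
            return s + "]";
        }
        case kList: {     // 4-byte length, the elements, then the tail term
            std::string s = seq(be(4), '[', ']');
            const std::string tail = term();  // "[]" for a proper list
            if (tail != "[]") s.insert(s.size() - 1, "|" + tail);
            return s;
        }
        default: throw std::runtime_error("unsupported tag");
        }
    }

    std::vector<std::uint8_t> d_;
    std::size_t pos_ = 0;
};

inline std::string decode_term(const std::vector<std::uint8_t>& bytes) {
    return TermReader(bytes).decode();
}
```

Called on the bytes 131,104,2,107,0,3,1,2,3,107,0,3,4,5,6, which is what term_to_binary({[1,2,3],[4,5,6]}) should produce, decode_term returns the string "{[1,2,3],[4,5,6]}". For real use, an existing library that covers the whole format is the safer choice.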
I'd recommend converting the Erlang terms into JSON with one of the known libraries (I've heard good things about rfc4627). Converting the JSON back on any non-Erlang platform should then be a trivial task.

Binary Serialized File - Delphi

I am trying to deserialize an old file format that was serialized in Delphi using binary serialization. I know nothing about the structure of the file except some very high level records that are in it.
What steps would you take to solve this problem? Any tools etc?
A good hex editor, and use the gray matter to identify structures.
If you get a hint what kind of file it is, you can search for more specialized tools.
Running the Unix/Linux "file" command can be useful too (*). See Barry's comment below for how it works. It can be a quick check for common file types like DBF, ZIP etc. hidden behind a different extension.
(*) there are 3rd party builds for windows, but they might lag in versions. If you can do it on a recent *nix distro, it is advised to do so.
The serialization process simply loops over all published properties and streams their values to a file. If you do not know the exact classes that were streamed to the file, you will have a very hard time deserializing it, if it is possible at all.
A good hex editor comes first. If the file is read without buffering (e.g. read directly from a TFileStream), you could gain some information by using ProcMon from SysInternals. You can see exactly what data is read in what chunks, and thus determine more quickly where the boundaries are between the structures you have already identified.

What is a 'Stream', relating to cin and cout?

A tutorial is talking about cin and cout:
"Syntactically these streams are not used as functions: instead, data are written to streams or read from them using the operators <<, called the insertion operator and >>, called the extraction operator."
What is a 'stream'?
Consider a "Stream" as a physical hose, or pipe. At one end, someone may pour some water in. At the other end, it will come out. This is 'reading' and 'writing' to the stream.
A stream is just a place where data goes. It can be a 'socket stream' (over the internet), a 'file stream' (to a file), or perhaps a 'memory stream', which is just data written to a place in memory (RAM).
A "stream" is an object that represents a source of data, or a place where data can be written.
Examples include file handles and pipes - things that you can read data from or write data to.
An important property of streams is that they share a common interface, so the same code can write to either a file or a pipe (for instance) without needing to be rewritten.
You should look at streams as abstractions on underlying 'sources' or 'sinks' of data. A source is something you read data from, and a sink is something you write data to.
The concept of streams allows you to perform I/O on various forms of media, network connections, pipes between applications, files, etc.
The stream abstraction is very valuable to us as developers as it allows us to simplify input and output, and it gives us the flexibility to arrange and reconnect the sources and destinations of these streams.
A good analogy is that of a hose. You can send and receive data through hoses, and you can connect these hoses to various things.
By allowing programs to talk through hoses, we allow all sorts of programs to talk to each other, and we increase interoperability and utility vastly.
This is at the heart of the UNIX philosophy, and supports some very powerful programming idioms.
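The "common interface" point can be sketched in C++ as follows. The function names write_message and read_message and the one-byte length prefix are assumptions made up for this illustration. The functions only know about std::ostream and std::istream, so the same code works on an in-memory string stream, a file stream, or any other stream type.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Writes one length-prefixed message to any output stream.
// Sketch only: the payload is assumed to be shorter than 256 bytes.
inline void write_message(std::ostream& out, const std::string& payload) {
    const auto n = static_cast<std::uint8_t>(payload.size());
    out.put(static_cast<char>(n));
    out.write(payload.data(), n);
}

// Reads one message back from any input stream.
// Returns false at end of stream or if the payload is truncated.
inline bool read_message(std::istream& in, std::string& payload) {
    const int n = in.get();
    if (n == std::istream::traits_type::eof()) return false;
    payload.resize(static_cast<std::size_t>(n));
    in.read(&payload[0], n);
    return in.gcount() == n;
}
```

Passing a std::stringstream keeps everything in memory, while passing a std::ofstream or std::ifstream opened with std::ios::binary sends the same bytes to or from a file, without any change to the two functions.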

What are the differences or advantages of using a binary file vs XML with TClientDataSet?

Are there any differences or advantages to using a binary file versus an XML file with TClientDataSet?
Binary will be smaller and faster.
XML will be more portable and human readable.
The Binary file will be a little smaller.
The main advantage of the XML format is that you can pass it around via http(s) protocols.
Binary is smaller and faster, but only readable by TClientDataSets.
XML is larger and slower (both are not that bad, i.e. not by orders of magnitude bigger or slower).
XML is readable by people (not recommended in general, but it is doable), and software.
Therefore it is more portable (as Nick wrote).
TClientDataSets can load and save their own style of XML, or you can use the Delphi XML Mapper tool to read and write any kind of XML.
XSLT can for instance be used to transform those XML files into any kind of text, including other XML, HTML, CSV, fixed columns, etc.
In contrast to what Tim indicates, both binary and XML can be transferred through HTTP and HTTPS. However, sending XML is often preferred because it is easier to trace.
Without having tested it: I guess the binary format would be quite a lot faster when reading and writing. You'd better do your own benchmarks for that, though.
Another advantage of binary might be that it cannot be easily edited, which prevents people from mucking up the data outside the application.
When using Delphi 2009, we have noticed that if the file has an extension of .XML, it will not save in binary format over an existing dfXMLUTF8 format, even with a LoadFromFile, SaveToFile. Changing the file extension to something else (.DAT, for example) allows saving the file in dfBinary. Our experience is that the binary file, in addition to being somewhat more difficult for the end-user to manipulate (a plus!), is approximately 50% smaller than the dfXMLUTF8 format file.

Resources