Parsing binary data

Parsing binary data - parsing

I got interested in parser generators. But I don't have the theoretical background. I just read a few things on the internet.
Currently I'm trying to do something with ANTLR
So my questions:
I have a special format of my dataframes:
The first byte of a frame is a tag that describes the nature of the data
The second byte contains the length (number of bytes) of the data itself
Then follows the data itself
The data can contain dataframes itself, and dataframes can be listed one after the other
I hope my description is clear. My questions:
Can I create such a parser with ANTLR that reads the lengs of the frame and then knows when the frame ends?
In ANTLR can I load the different tags I use from a generated file?
Thank you!

I'm not 100% sure about this, but:
Parser generators like antlr require a grammar that is at least context-free
using length-fields in your data makes your grammar not context free (context-sensitive i think)
It is the latter point i'm not sure about - maybe you want to research some more on that.
You probably have to write a packet "parser" yourself (which then has to be a parser for your context-sensitive packet grammar)
Alternatively, you could drop the length field, and use something like s-expressions, JSON or xml; these would be parseable by something generated with antlr.

I think you will be better off to create a hand written binary parser instead of using ANTLR because ANTLR is primarily intended to read and make sense of a text file and not binary data. The lexer part is focused on tokenizing text so trying to make it read binary data instead would be an uphill battle.
It sounds as if your structure would need some kind of recursive way of reading the data although it could be done easier just having a tree structure and then fill it as you read your file.

Related

Parsing and pretty printing the same file format in Haskell

I was wondering, if there is a standard, canonical way in Haskell to write not only a parser for a specific file format, but also a writer.
In my case, I need to parse a data file for analysis. However, I also simulate data to be analyzed and save it in the same file format. I could now write a parser using Parsec or something equivalent and also write functions that perform the text output in the way that it is needed, but whenever I change my file format, I would have to change two functions in my code. Is there a better way to achieve this goal?
Thank you,
Dominik

The BNFC-meta package https://hackage.haskell.org/package/BNFC-meta-0.4.0.3
might be what you looking for
"Specifically, given a quasi-quoted LBNF grammar (as used by the BNF Converter) it generates (using Template Haskell) a LALR parser and pretty pretty printer for the language."
update: found this package that also seems to fulfill the objective (not tested yet) http://hackage.haskell.org/package/syntax

Generate a parser from programatically generated BNF

I've seen two approaches to parsing:
Use a parser generator like happy. This allows you to specify your language in BNF, and not worry about the intricacies of parsing. However, since it's a preprocessor you have to write your whole parse tree textually.
Use a parser directly like megaparsec. With this approach you have direct access to your code so you can generate your parser programatically, but you haven't got the convenience of happy's simple BNF specification with precedence annotations etc. Also it seems non trivial to print out a BNF tree for documentation from your parsing code unless this is considered during it's construction.
What I'd like to do is something like this:
Generate a data structure programatically that represents BNF.
Feed this through to a "happy like" parser generator to generate a parser.
Feed this through a pretty printer to generate actual BNF documentation.
The reason I want to do this is that the grammar I'm working on has grown quite large and has a lot of repetition, as a lot of it's constructs are similar to others but slightly different. It would improve maintenence effort if it could be generated programmatically instead of modifying happy BNF spec directly, but I'd rather not have to develop my own parser from scratch.
Any ideas about a good approach here. It would be great if I could just generate a data structure and force it into happy (as it presumably generates it's own internal structure after parsing the BNF feed to it) but happy doesn't seem to have a library interface.
I guess I could generate attonated BNF, and feed that through to happy, but it seems like a messy process of converting back and forth. A cleaner approach would be better. Perhaps even a BNF style extension to parsec or megaparsec?

The simplest thing to do would to make some data type representing the relevant grammar, and then convert it to a parser using some parser combinators as a (run-time) "compile" step. Unfortunately, most parser combinators are less efficient and/or less flexible (in some ways) than the parser generators, so this would be a bit of a lowest common denominator approach. That said, the grammar-combinators library may be useful, though it doesn't appear to be maintained.
There are libraries that can generate parsers at run-time. One I found just now is Grempa, which doesn't appear to be maintained but that may not be a problem. Another option (by the same person who made Grempa but maintained) is Earley which, due to the way Earley parsers are made, it makes sense to have an explicit grammar that gets processed into a parser. Earley parsing is certainly flexible, but may be overpowered for you (or maybe not).

VBScript Partial Parser

I am trying to create a VBScript parser. I was wondering what is the best way to go about it. I have researched and researched. The most popular way seems to be going for something like Gold Parser or ANTLR.
The feature I want to implement is to do dynamic checking of Syntax Errors in VBScript. I do not want to compile the entire VBS every time some text changes. How do I go about doing that? I tried to use Gold Parser, but i assume there is no incremental way of doing parsing through it, something like partial parse trees...Any ideas on how to implement a partial parse tree for such a scenario?
I have implemented VBscript Parsing via GOLD Parser. However it is still not a partial parser, parses the entire script after every text change. Is there a way to build such a thing.
thks

If you really want to do incremental parsing, consider this paper by Tim Wagner.
It is brilliant scheme to keep existing parse trees around, shuffling mixtures of string fragments at the points of editing and parse trees representing the parts of the source text that hasn't changed, and reintegrating the strings into the set of parse trees. It is done using an incremental GLR parser.
It isn't easy to implement; I did just the GLR part and never got around to the incremental part.
The GLR part was well worth the trouble.
There are lots of papers on incremental parsing. This is one of the really good ones.

I'd first look for an existing VBScript parser instead of writing your own, which is not a trivial task!
Theres a VBScript grammar in BNF format on this page: http://rosettacode.org/wiki/BNF_Grammar which you can translate into a ANTLR (or some other parser generator) grammar.
Before trying to do fancy things like re-parsing only a part of the source, I recommend you first create a parser that actually works.
Best of luck!

Using Haskell's Parsec to parse binary files?

Parsec is designed to parse textual information, but it occurs to me that Parsec could also be suitable to do binary file format parsing for complex formats that involve conditional segments, out-of-order segments, etc.
Is there an ability to do this or a similar, alternative package that does this? If not, what is the best way in Haskell to parse binary file formats?

The key tools for parsing binary files are:
Data.Binary
cereal
attoparsec
Binary is the most general solution, Cereal can be great for limited data sizes, and attoparsec is perfectly fine for e.g. packet parsing. All of these are aimed at very high performance, unlike Parsec. There are many examples on hackage as well.

You might be interested in AttoParsec, which was designed for this purpose, I think.

I've used Data Binary successfully.

It works fine, though you might want to use Parsec 3, Attoparsec, or Iteratees. Parsec's reliance on String as its intermediate representation may bloat your memory footprint quite a bit, whereas the others can be configured to use ByteStrings.
Iteratees are particularly attractive because it is easier to ensure they won't hold onto the beginning of your input and can be fed chunks of data incrementally a they come available. This prevents you from having to read the entire input into memory in advance and lets you avoid other nasty workarounds like lazy IO.

The best approach depends on the format of the binary file.
Many binary formats are designed to make parsing easy (unlike text formats that are primarily to be read by humans). So any union data type will be preceded by a discriminator that tells you what type to expect, all fields are either fixed length or preceded by a length field, and so on. For this kind of data I would recommend Data.Binary; typically you create a matching Haskell data type for each type in the file, and then make each of those types an instance of Binary. Define the "get" method for reading; it returns a "Get" monad action which is basically a very simple parser. You will also need to define a "put" method.
On the other hand if your binary data doesn't fit into this kind of world then you will need attoparsec. I've never used that, so I can't comment further, but this blog post is very positive.

Will ANTLR Help? Different Suggestion?

Before I dive into ANTLR (because it is apparently not for the faint of heart), I just want to make sure I have made the right decision regarding its usage.
I want to create a grammar that will parse in a text file with predefined tags so that I can populate values within my application. (The text file is generated by another application.) So, essentially, I want to be able to parse something like this:
Name: TheFileName
Values: 5 3 1 6 1 3
Other Values: 5 3 1 5 1
In my application, TheFileName is stored as a String, and both sets of values are stored to an array. (This is just a sample, the file is much more complicated.) Anyway, am I at least going down the right path with ANTLR? Any other suggestions?
Edit
The files are created by the user and they define the areas via tags. So, it might look something like this.
Name: <string>TheFileName</string>
Values: <array>5 3 1 6 1 3</array>
Important Value: <double>3.45</double>
Something along those lines.

The basic question is how is the file more complicated? Is it basically more of the same, with a tag, a colon and one or more values, or is the basic structure of the other lines more complex? If it's basically just more of the same, code to recognize and read the data is pretty trivial, and a parser generator isn't likely to gain much. If the other lines have substantially different structure, it'll depend primarily on how they differ.
Edit: Based on what you've added, I'd go one (tiny) step further, and format your file as XML. You can then use existing XML parsers (and such) to read the files, extract data, verify that they fit a specified format, etc.

It depends on what control you have over the format of the file you are parsing. If you have no control then a parser-generator such as ANTLR may be valuable. (We do this ourselves for FORTRAN output files over which we have no control). It's quite a bit of work but we have now mastered the basic ANTLR lexer/parser strategy and it's starting to work well.
If, however, you have some or complete control over the format then create it with as much markup as necessary. I would always create such a file in XML as there are so many tools for processing it (not only the parsing, but also XPath, databases, etc.) In general we use ANTLR to parse semi-structured information into XML.

If you don't need for the format to be custom-built, then you should look into using an existing format such as JSON or XML, for which there are parsers available.
Even if you do need a custom format, you may be better off designing one that is dirt simple so that you don't need a full-blown grammar to parse it. Designing your own scripting grammar from scratch and doing a good job of it is a lot of work.
Writing grammar parsers can also be really fun, so if you're curious then you should go for it. But I don't recommend carelessly mixing learning exercises with practical work code.

Well, if it's "much more complicated", then, yes, a parser generator would be helpful. But, since you don't show the actual format of your file, how could anybody know what might be the right tool for the job?

I use the free GOLD Parser Builder, which is incredibly easy to use, and can generate the parser itself in many different languages. There are samples for parsing such expressions also.

If the format of the file is up to the user can you even define a grammar for it?
Seems like you just want a lexer at best. Using ANTLR just for the lexer part is possible, but would seem like overkill.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart