Could lex/flex be used to parse binary format source files?

While learning the lex tool, I found that it helps to parse source files in text format, for example when building a new programming language. I also wish to use it to build a tool that analyses binary input streams, such as codecs/decoders.
Do lex/flex/yacc/bison support such requirements? Do they have special command-line options or syntax to enable this?
Thanks!

Flex (and the other lex implementations I'm familiar with) have no problem with non-ASCII characters, including the NUL character. You may have to use the 8bit option, although it is the default unless you request fast state tables.
However, most binary formats use length-prefixed variable-length fields, which cannot be expressed in a regular expression. Moreover, it is quite common for fixed-length fields to be context-dependent; you can build a state machine in flex using start conditions, but that's a lot of work and is likely to be a waste of your time and flex's features.
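To make the length-prefix point concrete, here is a minimal Python sketch (assuming a hypothetical 2-byte big-endian length prefix); the "read the length, then read that many bytes" step is exactly what a regular expression cannot express:

import io
import struct

def read_length_prefixed(stream):
    """Read one field: a 2-byte big-endian length followed by that
    many payload bytes (a common, but here assumed, binary layout)."""
    header = stream.read(2)
    if len(header) < 2:
        return None                      # end of stream
    (length,) = struct.unpack(">H", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise ValueError("truncated field")
    return payload

# Example: a field of length 3 containing b"abc"
print(read_length_prefixed(io.BytesIO(b"\x00\x03abc")))   # b'abc'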

Related

COBOL clone detection with ConQAT?

ConQAT's doc claims it can do clone detection on COBOL code, but I can't find any appropriate block in the list of Included blocks.
The only one that could be considered is StatementCloneAnalysis but it would get confused by the line numbers that precede each line:
016300******************************************************************0058
Interesting tool. I took a quick look and it seems to me that a simple fix might be to pre-process COBOL source to overwrite columns 1 through 6 with spaces and trim everything after column 72.
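A minimal sketch of that pre-processing step, assuming standard fixed-format COBOL (the file names are hypothetical):

def strip_fixed_format(line):
    """Blank out the sequence number area (columns 1-6) and drop
    everything after column 72; column 7 (the indicator area) is kept."""
    line = line.rstrip("\r\n").ljust(72)   # pad short lines so slicing is safe
    return "      " + line[6:72]           # 6 spaces replace columns 1-6

# Hypothetical usage:
with open("PROG1.cbl") as src, open("PROG1.stripped.cbl", "w") as dst:
    for line in src:
        dst.write(strip_fixed_format(line) + "\n")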
After poking around for a while I came across the NextToken scanner definition file for COBOL. It looks like it will "happily" pick up tokens from the sequence number area as well as after column 72. The tokenizer looks like it only deals with COBOL source code after it has gone through the library processing phase of a compile (i.e. after compiler directives such as COPY/REPLACE have been processed). COPY/REPLACE were specified as keywords but I really don't see how this tokenizer would deal with them properly - particularly where pseudo text is involved.
If working with an IBM COBOL compiler, you can specify the MDECK option on a compile to generate a suitable source file for analysis. I am not familiar with other vendors, so I cannot comment further on how to generate a post-text-manipulation source deck.
The level of clone detection ConQAT provides for COBOL appears to be very limited relative to other languages (e.g. Java). I suspect you will have to put in a lot of hours to get anything more than trivial clone detection out of it for COBOL programs. However, this could be a very useful project given the heavy use of cut/paste coding in typical COBOL programs (COBOL programmers often make a joke out of it: only one COBOL program has ever been written, the rest are just modified copies of it). I wish you well.
Given that ConQAT deals with COBOL badly, you might look at our CloneDR tool.
It has a version that works explicitly with IBM Enterprise COBOL, using a precise parser, and it handles all that sequence number nonsense correctly. (It will even read the COBOL code in its native EBCDIC, meaning a literal string containing an ASCII newline character doesn't break the parser).
[If your COBOL isn't IBM COBOL, this won't help you, but otherwise you won't "have to put a lot of hours to get anything"].
We think the AST-based detection technique detects better clones more accurately than ConQAT's token-based detection. The site explains why in detail, and shows sample COBOL clones detected by CloneDR.
Specific to the OP who appears to be working in Japan: as a bonus, CloneDR handles Japanese character sets because it is implemented on top of an underlying tool infrastructure that is Unicode and Shift-JIS enabled. We haven't had a lot of experience with Japanese COBOL so there might be a remaining glitch; see G literals with Japanese characters.

Are there any known parser combinator libraries in F# that can parse binary (not text) files?

I am familiar with some of the basics of fparsec but it seems to be geared towards text files or streams.
Are there any other F# libraries that can efficiently parse binary files? Or can fparsec be easily modified to work efficiently with binary streams?
You may be interested in pickler combinators. These are a bit like parser combinators, but are focused more on simpler binary formats (picklers allow you to produce binary data and unpicklers parse them). There is a quite readable article about the idea (PDF) by Andrew Kennedy (the author of units of measure).
I don't have much experience with these myself, but I just realized it may be relevant for you. The idea is used in the F# compiler for generating some binary resources (like quotations stored in resources), although I'm not sure whether the F# compiler implementation is any good (it is one of those things from the early days of the F# compiler).
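To give a rough, language-neutral feel for the pickler/unpickler idea, here is a tiny Python sketch; the combinator names are invented for illustration and are not FParsec's or Kennedy's actual API:

import struct

# A "pickler" here is just a pair of functions: pack(value) -> bytes and
# unpack(buffer, offset) -> (value, new_offset). Combinators build bigger
# pairs out of smaller ones.
u16 = (lambda v: struct.pack(">H", v),
       lambda buf, off: (struct.unpack_from(">H", buf, off)[0], off + 2))

def pair(p1, p2):
    """Combine two picklers into a pickler for a 2-tuple."""
    def pack(value):
        a, b = value
        return p1[0](a) + p2[0](b)
    def unpack(buf, off):
        a, off = p1[1](buf, off)
        b, off = p2[1](buf, off)
        return (a, b), off
    return pack, unpack

point = pair(u16, u16)
data = point[0]((3, 7))          # b'\x00\x03\x00\x07'
value, _ = point[1](data, 0)     # ((3, 7), ...)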
The problem with working with binary streams is not a parser problem per se, it's a lexing problem. The lexer is what turns the raw data into elements that the parser can handle.
Almost any parsing system has few problems letting you supply your own lexer, and if that's the case you could, ideally, readily write a compliant lexer that works on your binary stream.
The problem, however, is that most parsing and lexing systems today are themselves created from a higher-level tool, and that tool is most likely not designed to work with binary streams. That is, it's not practical for you to specify the tokens and grammar of the binary stream that can be used to create the subsequent parsers and lexer. Also, there is likely no support whatsoever for the higher-level concept of multi-byte binary numbers (shorts, longs, floats, etc.) that you are likely to encounter in a binary stream, nor for the generated parser to work well upon them if you actually need their values, again because these systems are mostly designed for text-based tokens, with the underlying runtime handling the details of converting that text into something the machine can use (such as sequences of ASCII numerals into actual binary integers).
All that said, you can probably still use the parsing section of the tool, since parsers work more on abstract tokens that are fed to them by the lexer. Once you create your grammar, at a symbolic level, you would need to redo the lexer to create the proper tokens from the binary stream to feed into the parser.
This is actually good, because the parser tends to be far more complicated than the basic lexer, so the toolkit would handle much of the "hard part" for you. But you would still need to deal with creating your own lexer and interfacing it properly to the generated parser. Not an insurmountable task, and if the grammar is of any real complexity, likely worth your effort in the long run.
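As a rough sketch of what such a hand-written binary lexer might look like (the token names and record layout are invented for illustration, not tied to any particular toolkit):

import struct
from collections import namedtuple

Token = namedtuple("Token", ["type", "value"])

def lex_records(stream):
    """Hypothetical binary lexer: each record is a 1-byte tag followed by
    a 4-byte little-endian integer. It emits tokens a parser can consume
    instead of characters."""
    while True:
        tag = stream.read(1)
        if not tag:
            return                       # end of stream
        body = stream.read(4)
        if len(body) < 4:
            raise ValueError("truncated record")
        (value,) = struct.unpack("<I", body)
        if tag == b"\x01":
            yield Token("LENGTH", value)
        elif tag == b"\x02":
            yield Token("OFFSET", value)
        else:
            yield Token("UNKNOWN", value)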
If it's all mostly simple, then you're likely just better off doing it yourself by hand. Off the top of my head, it's hard to imagine a difficult binary grammar, since the major selling point of a binary format is that it's much closer to the machine, which is in contradiction to the text that most parsers are designed to work with. But I don't know your use case.
But consider the case of a disassembler. That's a simple lexer that may be able to understand, at a high level, the different instruction types (such as instructions that take no operands, those that take a single byte as an argument, or a word), and feed that to a parser, which can then be used to convert the instructions into their mnemonics and operands in the normal assembler syntax, as well as handle the label references and such.
It's a contrived case, as a disassembler typically doesn't separate the lexing and parsing phases (it's usually not complicated enough to bother), but it's one way to look at the problem.
Addenda:
If you have enough information to convert the binary stream into text to feed to the engine, then you have enough information to skip the text and instead create the actual tokens that the parser would want to see from the lexer.
That said, what you could do is take your text format, use that as the basis for your parsing tool and grammar, and have it create the lexer and parser machines for you, and then, by hand, you can test your parser and its processing using "text tests".
But when you get around to reading the binary, rather than creating text to then be lexed and parsed, simply create the tokens that the lexer would create (these should be simple objects), and pump the parser directly. This will save you the lex step and save you some processing time.
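A rough sketch of that arrangement, with the token types and test format invented for illustration: the parser only sees (type, value) tokens, so the same parse function can be driven by a throwaway text lexer during testing and by the real binary lexer later.

from collections import namedtuple

Token = namedtuple("Token", ["type", "value"])   # same shape as the earlier sketch

def parse(tokens):
    """Expect a LENGTH token followed by that many VALUE tokens."""
    tokens = iter(tokens)
    length = next(tokens)
    assert length.type == "LENGTH"
    return [next(tokens).value for _ in range(length.value)]

def text_test_lexer(text):
    """Quick text-based lexer used only to exercise the parser."""
    for word in text.split():
        kind, _, value = word.partition(":")
        yield Token(kind, int(value))

# Same parser, two front ends:
print(parse(text_test_lexer("LENGTH:2 VALUE:10 VALUE:20")))   # [10, 20]
# parse(binary_lexer(open("data.bin", "rb")))  # hypothetical binary front end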

Software to identify patterns in text files

I work on some software that parses large text files and inserts data into a database. Every time we get a new client, we have to write new parsing code for their text files.
I'm looking for some software to help simplify analyzing the text files. It would be nice to have some software that could identify patterns in the file.
I'm also open to any general purpose parsing libraries (.NET) that may simplify the job. Or any other relevant software.
Thanks.
More Specific
I open a text file with some magic software that shows me repeating patterns that it has identified. Really I'm just looking for any tools that developers have used to help them parse files. If something has helped you do this, please tell me about it.
Well, likely not exactly what you are looking for, but clone detection might be the right kind of idea.
There are a variety of such detectors. Some work only on raw lines of text, and that might apply directly to you.
Some work only on the words ("tokens") that make up the text, for some definition of "token".
You'd have to define what you mean by tokens to such tools.
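For the simplest, line-based flavor mentioned above, a crude sketch looks roughly like this (the window size and normalization are arbitrary choices):

from collections import defaultdict

def find_line_clones(lines, window=5):
    """Hash every run of `window` consecutive, whitespace-normalized
    lines and report runs that occur more than once."""
    normalized = [" ".join(line.split()) for line in lines]
    seen = defaultdict(list)
    for i in range(len(normalized) - window + 1):
        key = "\n".join(normalized[i:i + window])
        seen[key].append(i)
    return {key: positions for key, positions in seen.items()
            if len(positions) > 1 and key.strip()}

# Hypothetical usage:
# clones = find_line_clones(open("client_feed.txt").read().splitlines())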
But you seem to want something that discovers the structure of the text and then looks for repeating blocks with some parametric variation. I think this is really hard to do unless you know roughly what that structure is in advance.
Our CloneDR does this for programming language source code, where the "known structure" is that of the programming language itself, as described specifically by the BNF grammar rules.
You probably don't want to do Java-biased duplicate detection on semi-structured text. But if you do know something about the structure of the documents, you could write that down as a grammar, and our CloneDR tool would then pick it up.

Will rewriting a multipurpose log file parser to use formal grammars improve maintainability?

TLDR: I built a multipurpose parser by hand, with different code for each format. Will it work better in the long run if I use one chunk of parser code and an ANTLR, PyParsing or similar grammar to specify each format?
Context:
My job involves lots of benchmark log files from ~50 different benchmarks. There are a few in XML, a few HTML, a few CSV and lots of proprietary stuff with no documented spec. To save me and my coworkers the time of entering this data by hand, I wrote a parsing tool that handles all of the formats we deal with regularly with a uniform interface. The design, though, is not so clean.
I wrote this thing in Python and created a Parser class. Each file format is handled as an implementation that provides its own code for the Parser's read() method. I like the idea of having only one definition of Parser that uses grammars to understand each format, but I've never done it before.
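For reference, the current hand-rolled shape is roughly this (class and format names here are hypothetical, not the actual code):

import csv

class Parser:
    """Uniform interface; each format supplies its own read()."""
    def read(self, path):
        raise NotImplementedError

class CsvBenchmarkParser(Parser):
    def read(self, path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

class VendorXLogParser(Parser):
    def read(self, path):
        results = {}
        with open(path) as f:
            for line in f:
                if ":" in line:                  # made-up proprietary layout
                    key, _, value = line.partition(":")
                    results[key.strip()] = value.strip()
        return results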
Is it worth my time, and will it be easier for other newbies to work with in the future once I finish refactoring?
I can't answer your question with 100% certainty, but I can give you an opinion.
I find that the choice to use a proper grammar vs hand-rolled regex "parsers" often comes down to how uniform the input is.
If the input is very uniform and you already know a language that deals with strings well, like Python or Perl, then I'd keep your existing code.
On the other hand, I find parser generators like ANTLR really shine when the input can have errors and inconsistencies in it. The reason is that the formal grammar allows you to focus on what should be matched in a certain context without having to worry about walking the input stream manually.
Furthermore, if the input stream has an error, I find it's often easier to deal with using ANTLR than with regexes, because if a couple of options are available, ANTLR has built-in functionality for choosing the correct path, including rollback via predicates.
Having said all that, there is a lot to be said for working code. I find that if I want to rewrite something, I try to make a good case for how the rewrite will benefit the user of the product.
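For a concrete feel of what the grammar-based version might look like, here is a minimal PyParsing sketch; the benchmark log line format is invented for illustration, not one of your actual formats:

from pyparsing import Word, alphanums, nums, Suppress, Combine, Optional

# Grammar for a made-up line like: "fft_radix2 1024 12.5 ms"
benchmark = Word(alphanums + "_")("name")
size = Word(nums)("size")
elapsed = Combine(Word(nums) + Optional("." + Word(nums)))("elapsed") + Suppress("ms")
result_line = benchmark + size + elapsed

parsed = result_line.parseString("fft_radix2 1024 12.5 ms")
print(parsed["name"], parsed["size"], parsed["elapsed"])   # fft_radix2 1024 12.5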

Is ANTLR an appropriate tool to serialize/deserialize a binary data format?

I need to read and write octet streams to send over various networks to communicate with smart electric meters. There is an ANSI standard, ANSI C12.19, that describes the binary data format. While the data format is not overly complex, the standard is very large (500+ pages) in that it describes many distinct types. The standard is fully described by an EBNF grammar. I am considering using ANTLR to read the EBNF grammar, or a modified version of it, and create C# classes that can read and write the octet stream.
Is this a good use of ANTLR?
If so, what do I need to do to be able to utilize ANTLR 3.1? From searching the newsgroup archives it seems like I need to implement a new stream that can read bytes instead of characters. Is that all or would I have to implement a Lexer derivative as well?
If ANTLR can help me read/parse the stream can it also help me write the stream?
Thanks.
dan finucane
You might take a look at Ragel. It is a state machine compiler/lexer that is useful for implementing on-the-wire protocols. I have read reports that it generates very fast code. If you don't need a parser and template engine, Ragel has less overhead than ANTLR. If you need a full-blown parser, AST, and nice template engine support, ANTLR might be a better choice.
This subject comes up from time to time on the ANTLR mailing list. The answer is usually no, because binary file formats are very regular and it's just not worth the overhead.
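To illustrate why the hand-rolled route is usually considered good enough for regular binary formats, here is a minimal Python sketch of a length-prefixed record reader/writer pair; the layout is invented for illustration and is not the actual C12.19 table format:

import struct

def read_record(stream):
    """Read a hypothetical record: 2-byte table id, 2-byte payload
    length (both big-endian), then the payload."""
    header = stream.read(4)
    if len(header) < 4:
        return None                      # end of stream
    table_id, length = struct.unpack(">HH", header)
    payload = stream.read(length)
    return {"table": table_id, "payload": payload}

def write_record(stream, table_id, payload):
    """Write the matching record back out."""
    stream.write(struct.pack(">HH", table_id, len(payload)) + payload)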
It seems to me that having a grammar gives you a tremendous leg up.
ANTLR 3.1 has StringTemplate and code generation features that are separate from the parsing/lexing, so you can decompose the problem that way.
Seems like a winner to me, worth trying.
