Are there any known parser combinator library's in F# that can parse binary (not text) files? - parsing

I am familiar with some of the basics of fparsec but it seems to be geared towards text files or streams.
Are there any other F# library's that can efficiently parse binary files? Or can fparsec be easily modified to work efficiently with binary streams?

You may be interested in pickler combinators. These are a bit like parser combinators, but are more focused at simpler binary formats (picklers allow you to produce binary data and unpicklers parse them). There is a quite readable article about the idea (PDF) by Andrew Kennedy (the author of units of measure).
I don't have much experience with these myself, but I just realized it may be relevant for you. The idea is used in the F# compiler for generating some binary resources (like quotations stored in resources). Although, I'm not sure if the F# compiler implementation is any good (it is one of those things from early days of the F# compiler).

The problem with working with binary streams is not a parser problem per se, it's a lexing problem. The lexer is what turns the raw data in to elements that the parse can handle.
Most any parsing system has few problems letting you supply your own lexer, and if that's the case you could, ideally, readily write a compliant lexer that works on your binary stream.
The problem, however, is that most parsing and lexing systems today are themselves created from a higher level tool. And THAT tool most likely is not designed to work with binary streams. That is, it's not practical for you specify the tokens and grammar of the binary stream that can be used to create the subsequent parsers and lexer. Also, there is likely no support whatsoever for the higher level concepts of multi byte binary numbers (shorts, longs, floats, etc.) that you are likely to encounter in a binary stream, nor for the generated parser to possibly work well upon them if you actually need to work on their actual value, again because the systems are mostly designed for text based tokens, and the underlying runtime handles the details of converting that text it something the machine can use (such as sequences of ascii numerals in to actual binary integers).
All that said, you can probably actually use the parsing section of the tool, since parsers work more on abstract tokens that are fed them by the lexer. Once you create your grammar, at a symbolic level, you would need to redo the lexer to create the problem tokens from the binary stream to feed in to the parser.
This is actually good, because the parser tends to be far more complicated than the basic lexer, so the toolkit would handle much of the "hard part" for you. But you would still need to deal with creating your own lexer and interfacing it properly to the generated parser. Not an insurmountable task, and if the grammar is of any real complexity, likely worth your effort in the long run.
If it's all mostly simple, then you're likely just better off doing it your self by hand. Of the top of my head, it's hard to imagine a difficult binary grammar, since the major selling point of a binary format is that it's much closer to the machine, which is in contradiction to the text that most parsers are designed to work with. But I don't know your use case.
But consider the case of a disassembler. That's a simple lexer that may be able to under stand at a high level the different instruction types (such as those operands that have no arguments, those that take a single byte as an argument, or a word), and feed that to a parser can then be used to convert the instructions in to their mnemonics and operands in the normal assembler syntax, as well as handle the label references and such.
It's a contrived case, as a disassembler typically doesn't separate the lexing and parsing phases, it's usually not complicated enough to bother, but it's one way to look at the problem.
Addenda:
If you have enough information to convert the binary stream in to text to feed to the engine, then you you have enough information to instead of creating text, you could create the actual tokens that the parser would want to see from the lexer.
That said, what you could do is take your text format, use that as the basis for your parsing tool and grammar, and have it create the lexer and parser machines for you, and then, by hand, you can test your parser and its processing using "text tests".
But when you get around to reading the binary, rather than creating text to then be lexed and parsed, simply create the tokens that the lexer would create (these should be simple objects), and pump the parser directly. This will save you the lex step and save you some processing time.

Related

abstract syntax tree for imperative languages

I am looking for an abstract syntax tree representation that can be used for common imperative languages (Java, C, python, ruby, etc). I would like this to be as close to source as possible (as opposed to something like LLVM). I found Rose online but it is only able to handle C and Fortran. Does this exist?
You won't find "one" universal AST that can represent many languages. People have been searching for 50 years.
The essential reason is that an AST node implicitly represents the precise language semantics of the operator it encodes, and different languages have different semantics for what are apparently the same operators.
For example, the "+" operator in modern Fortran will add integers, reals, complex values, and slices of arrays of such things. Java "+" will add integers, reals, and glue strings together. If I wrote "a+b" in "universal AST", how would you know which semantic effect the corresponding AST encoded?
What you can do is build a system in which the ASTs for different languages are represented uniformly, so that you can share tool infrastructure across many languages. This is done by many Program Transformation Systems (PTS), where you provide the grammar (or pick one from an available library), and the PTS parses and builds an AST using its uniform representation. Most PTS provide additional support to analyze and transform the code.
So, all you need is a PTS and some sweat to define a grammar. That's really not true; getting a grammar right for a real language is actually pretty hard. Worse, there's a lot to Life After Parsing because you need the meaning of symbols and additional inferences such as control and data flow analysis. So you need full front ends (e.g., parsing, name/type resolution, flow analysis, ...), or as much as you can get, if you don't want to be distracted for months before beginning your real work.
What this means in practice is you want to find a tool that handles the languages of interest to you, with mature front ends already available:
Rose (you already found this) handle C, C++ and Fortran. It has no built-in parsing capability of its own; its front ends are custom built. So it is apparantly hard to extend to other languages. But it has good flow analysis capabilities and provides means to transform the code via hand-write AST walks/smashes.
Clang handles C and C++. Clang also uses hand-built front ends. It can also transform code, again by hand-written AST walks/smashes, with a small amount of pattern matching support. As I understand it, you have to use the LLVM part of Clang to do flow analysis.
Our DMS Software Reengineering Toolkit has full front ends for C, C++, Java and COBOL, and full parsers for many more languages such as Python. DMS provides pattern-based analysis and source-to-source transformation. It operates directly from a grammar (see one for Oberon, Nicklaus Wirth's latest language). (I don't know of any tool that handles Ruby, which is famously hard to parse; I understand its grammar is ambiguous, and DMS is good at handling ambiguous grammars).

Writing a Parser for mixed Languages

I am trying to write a Parser that can analyze mixed Languages and generate an AST of it. I first tried to build it from scratch on my own in Java and failed, because this is quite a hard topic for a Parser-beginner. Then i googled and found http://www2.cs.tum.edu/projects/cup/examples.php and JFlex.
The question now is: What is the best way to do it?
For example i have a Codefile, that contains several Tags, JS Code, and some $CMS_SET(x,y)$ Code. Is the best way to solve this to define a grammar for all those things in CUP and let CUP generate a Parser based on my grammer that can analyze those mixed Language files and generate and AST Tree of it?
Thanks for all helpful answers. :)
EDIT: I need to do it in Java...
This topic is quite hard even for an expert in this area, which I consider myself to be; check my bio.
The first issue is to build individual parsers for each sublanguage. The first thing you will discover is that defining parsers for specific languages is actually hard; you can read the endless list of SO requests for "can I get a parser for X" or "how do I fix my parser for X". Mostly I think these requests end up not going anywhere; the parsing engines aren't very good, you have to twist the grammars and the parsers to make them work on real languages, there isn't any such thing as "pure HTML", the standards documents disagree and your customer always has some twist in his code that you're not prepared for. Finally there's the glitches related to character set encodings, variations in newline endings, and preprocessors to complicate the parsing parsing problem. The C++ preprocessor is a lot more complex than you might think, and you have to have this right. The easiest way to defeat this problem is to find some parser generator with the languages already predefined. ANTLR has a bunch for the now deprecated ANTLR3; there's no assurance these parsers are robust let alone compatible for your purposes.
CUP isn't a particularly helpful parser generator; none of the LL(x) or LALR(x) parser generators are really helpful, because no real langauge matches the categories of things they can parse. The consequence: an endless stream of requests (on SO!) for help "resolving my shift-reduce conflict", or "eliminating right recursion". The only parser generator IMHO that has stood the test of time is a GLR parser generator (I hear good things about GLL but that's pretty recent). We've done 40+ languages with one GLR parser generator, including production IBM COBOL, full C++14 and Java8.
You second problem will be building ASTs. You can hand code the AST building process, but that gets old fast when you have to change the grammars often and/or you have many grammars as you are effectively contemplating. This you can beat your way throught with sweat. (We chose to push the problem of building ASTs into the parser so we didn't have to put any energy this in building a grammar; to do this, your parser engine has to offer you this help and none of the mainstream ones do.).
Now you need to compose parsers. You need to have one invoke the other as the need comes up; of course your chosen parser isn't designed to do this so you'll have to twist it. The first hard part is provide a parser with clues that a sublanguage is coming up in the input stream, and for it to hand off that parsing to the sublangauge parser, and get it to pass a tree back to be incorporated in the parent parser's tree, presumably with some kind of marker so you can tell where the transitions between different sublangauges are in the tree. You can often do this by hacking one language's lexer, when it sees the clue, to invoke the other; but then what you do with the tree it returns? There's no way to give that tree to the current parser and say "integrate this". You get around this by modifying the parsing machinery in arcane ways.
But all of the above isn't where the problem is.
Parsing is inconvenient, but only a small part of what you need to analyze your programs in any interesting way; you need symbol tables, control and data flow analysis, maybe points-to analyses, and the engineering on these will swamp the work listed above. See my essay on "Life After Parsing" (google or via my bio) for a long discussion of what else you need.
In short, I think you are biting off an enormous task in just "parsing", and you haven't even told us what you intend to do with the result. You are welcome to start down this path, but very few people have succeeded; my team spent over a 50 man years of PhD level engineering to get where we are, and we are hardly done.
Java won't make the solution any easier or harder; the langauge in which you solve all of the above is irrelevant.

Text parsing library

A colleague of mine works on an universal text parsing library, based on C# lambdas. The core looks cool, but unfortunately to me he has hardcoded a grammar, specifical to his private task -- math expression evaluating. So, I will not use it as I had intended before I saw the API. And now I'm looking for another lib, that meets at least some of my requirements. It has to:
Be able to load a grammar from an external file -- say, XML, YML or JSON.
Return AST from grammar and parsed tree that is built from any text.
Work fast enough to load C# grammar then parse a large code file.
I'd prefer the library that has grammar format file simple enough for easy writing a grammar for math expressions, is open source and written in C# or C++.
Regards,
--
UPDATED: point 2 has been corrected.
You might check out Text Transformer which claims to be some kind of universal text processing language. I have no specific experience with it.
Building robust langauge front ends and usable processing tools is actually a lot of work.
If you want to process computer languages in a generic way, you might consider our DMS Software Reengineering Toolkit, a kind of generalized compiler technology for parsing, analyzing, transforming, and/or generating code (or any other kind of formal document).
DMS will accept arbitrary context free grammars for langauges, automatically builds an AST with no additional specification effort on your part, and is designed to handle not only large files but very large sets of files in a single computation. Normally people
that want to process code need pattern recognition, code analysis and code transformation capabilities; DMS has all of these built in. It also has a variety of predefined, mature grammars for a wide variety of computer langauges, well-known (C, C++, C#, COBOL, Java, JavaScript, ... ) and otherwise (Natural, EGL, Python, MATLAB, ...), and has been used to carry out massive automated analyses and transformations on programs in these various langauges.
DMS does not meet your open-source or C#/C++ implementation requirements. It is implemented as a set of domain-specific langauges for describing grammars, analyzers, transformations, prettyprinters, and scripting that allows parallel execution to enable complex analyses to run faster than single-threaded programs.

Will rewriting a multipurpose log file parser to use formal grammars improve maintainability?

TLDR: if I built a multipurpose parser by hand with different code for each format, will it work better in the long run using one chunk of parser code and an ANTLR, PyParsing or similar grammar to specify each format?
Context:
My job involves lots of benchmark log files from ~50 different benchmarks. There are a few in XML, a few HTML, a few CSV and lots of proprietary stuff with no documented spec. To save me and my coworkers the time of entering this data by hand, I wrote a parsing tool that handles all of the formats we deal with regularly with a uniform interface. The design, though, is not so clean.
I wrote this thing in Python and created a Parser class. Each file format is handled as an implementation that provides its own code for the Parser's read() method. I like the idea of having only one definition of Parser that uses grammars to understand each format, but I've never done it before.
Is it worth my time, and will it be easier for other newbies to work with in the future once I finish refactoring?
I can't answer your question with 100% certainty, but I can give you an opinion.
I find the choice to use a proper grammar vs hand rolled regex "parsers" often comes down to how uniform the input is.
If the input is very uniform and you already know a language that deals with strings well, like Python or Perl, then I'd keep your existing code.
On the other hand I find parser generators, like Antlr, really shine when the input can have errors and inconsistencies in it. The reason is that the formal grammar allows you to focus on what should be matched in a certain context without having to worry about walking the input stream manually.
Furthermore if the input stream has an error then I find it's often easier to deal with them using Antlr vs regexs. The reason being is that if a couple of options are available Antlr has built in functionality for hosing the correct path, including rollback via predicates.
Having said all that, there is alot to be said for working code. I find if I want to rewrite something then I try to make a good use case for how the rewrite will benefit the user of the product.

Most effective way to parse C-like definition strings?

I've got a set of function definitions written in a C-like language with some additional keywords that can be put before some arguments(the same way as "unsigned" or "register", for example) and I need to analyze these lines as well as some function stubs and generate actual C code from them.
Is that correct that Flex/Yacc are the most proper way to do it?
Will it be slower than writing a Shell or Python script using regexps(which may become big pain, as I suppose, if the number of additional keywords becomes bigger and their effects would be rather different) provided that I have zero experience with analysers/parsers(though I know how LALR does its job)?
Are there any good materials on Lex/Yacc that cover similar problems? All papers I could find use the same primitive example of a "toy" calculator.
Any help will be appreciated.
ANTLR is commonly used (as are Lex\Yacc).
ANTLR, ANother Tool for Language
Recognition, is a language tool that
provides a framework for constructing
recognizers, interpreters, compilers,
and translators from grammatical
descriptions containing actions in a
variety of target languages.
There is also the Lemon Parser, which features a less restrictive grammar. The down side is you're married to lemon, re-writing a parser's grammar to something else when you discover some limitation sucks. The up side is its really easy to use .. and self contained. You can drop it in tree and not worry about checking for the presence of others.
SQLite3 uses it, as do several other popular projects. I'm not saying use it because SQLite does, but perhaps give it a try if time permits.
That entirely depends on your definition of "effective". If you have all the time of the world, the fastest parser would be a hand-written pull parser. They take a long time to debug and develop but today, no parser generator beats hand-written code in terms of runtime performance.
If you want something that can parse valid C within a week or so, use a parser generator. The code will be fast enough and most parser generators come with a grammar for C already which you can use as a starting point (avoiding 90% of the common mistakes).
Note that regexps are not suitable for parsing recursive structures. This approach would both be slower than using a generator and more error prone than a hand-written pull parser.
actually, it depends how complex is your language and whether it's really close to C or not...
Still, you could use lex as a first step even for regular expression ....
I would go for lex + menhir and o'caml....
but any flex/yacc combination would be fine..
The main problem with regular bison (the gnu implementation of yacc) stems from the C typing.. you have to describe your whole tree (and all the manipulation functions)... Using o'caml would be really easier ...
For what you want to do, our DMS Software Reengineering Toolkit is likely a very effective solution.
DMS is designed specifically to support customer analyzers/code generators of the type you are discussing. It provides very strong facilities for defining arbitrary language parsers/analyzers (tested on 30+ real languages including several complete dialects of C, C++, Java, C#, and COBOL).
DMS automates the construction of ASTs (so you don't have to do anything but get the grammar right to have a usable AST), enables the construction of custom analyses of exactly the pattern-directed inspection you indicated, can construct new C-specific ASTs representing the code you want to generate, and spit them out as compilable C source text. The pre-existing definitions of C for DMS can likely be bent to cover your C-like language.

Resources