What is the difference between GNU bison and yacc?

I am writing a parser and I want it to be as portable as possible.
Right now I am using GNU bison to generate my parser, but I am not sure whether my code relies on bison extensions to yacc that are not fully portable.
So I would like to know which features GNU bison has that the original yacc is missing.
The reason I am worried is that I noticed my parser failed to compile on Windows using a bison port. I would sacrifice GNU bison features and stick with the original, standardized yacc if that would make my parser easier to port between different platforms.
So what are the differences between GNU bison and the original standard yacc? What features should I avoid when using GNU bison if I want my program to be as portable as possible?

Usually the way you distribute a bison-generated parser is to distribute the generated parser itself. That means that neither bison nor yacc even needs to be installed on the target machine, and it allows you to freely choose the version of bison you are comfortable with and use its features. (The bison input file will also be in the distribution, of course; including the bison output files simply means that bison does not need to be run to compile the code.)
If you want to verify that your parser description is compatible with yacc, you could try using the --yacc flag when you generate the parser. That will make bison attempt to mimic yacc, although it doesn't prevent you from asking for features which are clearly well out of yacc's scope, like %glr-parser or Java/C++ output. But frankly, I think you would be better off with the strategy outlined in the first paragraph.
If you want to get a list of bison features which are not in yacc, you could start by searching for the word yacc in the bison manual, or you could read the POSIX yacc documentation.
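For example, a minimal compatibility check might look like this (parser.y is a hypothetical file name; --yacc and -d are standard bison options):

```sh
# Run bison in POSIX yacc emulation mode; the generated files follow the
# traditional y.tab.c / y.tab.h naming.
bison --yacc -d parser.y
```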

Related

Is there an LL(*) parser that is not generated?

I need to change the grammar rules of my parser at runtime, and I would like to avoid regenerating the parser each time the rules change.
Is there a parser that does not use code generation?
You can use an Earley parser (meaning you would probably implement it yourself; it is not likely that there is a library lying around).
You will of course pay an overhead price for this. If your grammar and the source it parses are small, this is likely to be fine.
Otherwise you might reconsider; why is it that you don't want to rebuild the parser? Most parser generators run far faster than people can edit rules.
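To make the idea concrete, here is a minimal Earley recognizer sketch in Python (hypothetical function and variable names, no handling of empty productions): the grammar is plain data, so rules can be edited between calls without regenerating any code.

```python
def earley_recognize(grammar, start, tokens):
    # grammar: dict mapping a nonterminal to a list of alternatives,
    # each alternative a tuple of symbols (nonterminals or terminal tokens).
    # An Earley item is (head, body, dot, origin).
    chart = [set() for _ in range(len(tokens) + 1)]
    chart[0] = {(start, body, 0, 0) for body in grammar[start]}
    for i in range(len(tokens) + 1):
        worklist = list(chart[i])
        while worklist:
            head, body, dot, origin = worklist.pop()
            if dot < len(body):
                sym = body[dot]
                if sym in grammar:                                   # predict
                    for alt in grammar[sym]:
                        item = (sym, alt, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item); worklist.append(item)
                elif i < len(tokens) and tokens[i] == sym:           # scan
                    chart[i + 1].add((head, body, dot + 1, origin))
            else:                                                    # complete
                for h, b, d, o in list(chart[origin]):
                    if d < len(b) and b[d] == head:
                        item = (h, b, d + 1, o)
                        if item not in chart[i]:
                            chart[i].add(item); worklist.append(item)
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[-1])

# The rules are ordinary data and can be changed at runtime between calls.
grammar = {"S": [("S", "+", "n"), ("n",)]}
print(earley_recognize(grammar, "S", ["n", "+", "n"]))   # True
print(earley_recognize(grammar, "S", ["n", "+"]))        # False
```

A real implementation would also build parse trees and handle empty rules, but the point stands: nothing here is generated code.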
You could use a PEG (either hand-written or something like Boost.Spirit).
PEGs are not a strict superset of LL grammars, but are generally more expressive, as they have various extra features such as limited negation and following context testing.
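As a tiny illustration of that extra expressiveness, here is a hypothetical PEG-combinator sketch in Python showing negative lookahead (the "limited negation" mentioned above); none of these names come from a real library.

```python
import string

# Each "parser" is a function taking (text, pos) and returning the new
# position on success, or None on failure.
def lit(s):
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None

def seq(*parsers):
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def not_ahead(p):
    # Negative lookahead: succeeds without consuming input only if p fails here.
    return lambda text, pos: pos if p(text, pos) is None else None

letter = lambda text, pos: (pos + 1 if pos < len(text)
                            and text[pos] in string.ascii_letters else None)

# Match the keyword "if" only when it is not followed by another letter,
# i.e. test the following context without consuming it.
if_keyword = seq(lit("if"), not_ahead(letter))

print(if_keyword("if (x)", 0))   # 2    -> matched the keyword
print(if_keyword("iffy", 0))     # None -> rejected, "iffy" is an identifier
```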

Why some compilers prefer hand-crafted parser over parser generators?

According to Vala documentation: "Before 0.3.1, Vala's parser was the classic flex scanner and Bison LALR parser combination. But as of commit eba85a, the parser is a hand-crafted recursive descent parser."
My question is: Why?
The question could be addressed to any compiler that doesn't use a parser generator. What are the pros and cons of such a move from a parser generator to a hand-crafted parser? What are the disadvantages of using parser generators (Bison, ANTLR) for compilers?
As a side comment: I'm interested in Vala specifically because I like the idea of having a language with modern features and clean syntax that is compilable into a "native" and "unmanaged" high-level language (C in the case of Vala). I have found only Vala so far. I'm thinking of having fun by making Vala (or a similar language) compilable to C++ (backed by the Qt libraries). But since I don't want to invent a completely new language, I'm thinking of taking some existing grammar. Obviously, hand-crafted parsers don't have a written formal grammar I could reuse. Your comments on this idea are welcome (is the whole idea silly?).
I have written half a dozen hand-crafted parsers (in most cases recursive descent parsers, also known as top-down parsers) in my career, and I have seen parsers generated by parser generators, and I must admit I am biased against parser generators.
Here are some pros and cons for each approach.
Parser generators
Pros:
Quickly get a working parser (at least if you do not know how to hand code it).
Cons:
Generated code is hard to understand and debug.
Difficult to implement proper error handling. The generator will create a correct parser for syntactically correct code but will choke on incorrect code and in most cases will not be able to provide proper error messages.
A bug in the parser generator may halt your project. You need to fix the bug in somebody else's code (if the source code is available), wait for the author to fix it, or work around the bug (if that is possible at all).
Hand-crafted recursive descent parser
Pros:
The code is easy to understand. Recursive descent parsers usually have one function corresponding to each language construct, e.g. parseWhile to parse a 'while' statement, parseDeclaration to parse a declaration, and so on. This makes the parser easy to understand and debug.
It is easy to provide meaningful error messages, and to recover from errors and continue parsing in whatever way makes the most sense in a particular situation.
Cons:
It will take some time to hand-code the parser, especially if you do not have experience with this kind of work.
The parser may be somewhat slow. This applies to all recursive descent parsers, not just hand-written ones. With one function corresponding to each language construct, parsing even a simple numeric literal may take a dozen or more nested calls, starting from e.g. parseExpression and going through parseAddition, parseMultiplication and so on down to parseLiteral. Function calls are relatively inexpensive in a language like C, but they can still add up to a significant amount of time.
One way to speed up a recursive descent parser is to replace parts of it with a bottom-up sub-parser, which is often much faster. The natural candidates for such a sub-parser are expressions, which have almost uniform syntax (binary and unary operators) across several precedence levels. A bottom-up expression parser is usually also simple to hand-code: it is often just one loop that gets tokens from the lexer, a stack of values, and a lookup table of operator precedences for the operator tokens, as sketched below.
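A minimal sketch of such an expression sub-parser in Python (hypothetical names; a real parser would pull tokens from a lexer and handle parentheses, unary operators and errors):

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_expression(tokens):
    """tokens: e.g. ['1', '+', '2', '*', '3']; returns a nested-tuple AST."""
    values, ops = [], []

    def reduce_top():
        op = ops.pop()
        rhs, lhs = values.pop(), values.pop()
        values.append((op, lhs, rhs))

    for tok in tokens:
        if tok in PRECEDENCE:
            # Reduce while the operator on the stack binds at least as tightly
            # (">=" gives left associativity).
            while ops and PRECEDENCE[ops[-1]] >= PRECEDENCE[tok]:
                reduce_top()
            ops.append(tok)
        else:
            values.append(tok)   # operand token
    while ops:
        reduce_top()
    return values[0]

print(parse_expression(["1", "+", "2", "*", "3"]))   # ('+', '1', ('*', '2', '3'))
```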
LR(1) and LALR(1) parsers are really, really annoying for two reasons:
The parser generator isn't very good at producing useful error messages.
Certain kinds of ambiguity, like C-style if-else blocks, make writing the grammar very painful.
On the other hand, LL(1) grammars are much better on both counts. The structure of LL(1) grammars makes them very easy to encode as recursive descent parsers, so using a parser generator is not really a win.
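As a small, purely illustrative sketch (a hypothetical toy grammar, not Vala's), this is the kind of recursive descent structure meant here: one function per rule, with error messages and if/else handling written exactly where you want them.

```python
# Toy grammar:
#   stmt := 'if' expr 'then' stmt ('else' stmt)?  |  IDENT '=' expr
#   expr := IDENT | NUMBER
class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            # Hand-written parsers make precise error messages easy.
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def parse_stmt(self):
        if self.peek() == "if":
            self.expect("if"); cond = self.parse_expr()
            self.expect("then"); then = self.parse_stmt()
            # The dangling-else question is settled here by simply attaching
            # an 'else' to the nearest 'if'.
            other = None
            if self.peek() == "else":
                self.expect("else"); other = self.parse_stmt()
            return ("if", cond, then, other)
        name = self.peek(); self.pos += 1
        self.expect("=")
        return ("assign", name, self.parse_expr())

    def parse_expr(self):
        tok = self.peek(); self.pos += 1
        return ("atom", tok)

print(Parser(["if", "x", "then", "y", "=", "1", "else", "z", "=", "2"]).parse_stmt())
```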
Also, in the case of Vala, the parser and the compiler itself are presented as a library, so you can build a custom backend for the Vala compiler using the Vala compiler library and get all the parsing and type checking and such for free.
I know this isn't going to be definitive, and if your questions weren't specifically Vala-related I wouldn't bother, but since they are...
I wasn't too heavily involved with the project back then so I'm not that clear on some of the details, but the big reason I remember from when Vala switched was dogfooding. I'm not certain it was the primary motivation, but I do remember that it was a factor.
Maintainability was also an issue. That patch replaced a larger parser written in C/Bison/YACC (relatively few people have significant experience with the latter two) with a smaller parser in Vala (which pretty much anyone interested in working on valac probably knows and is comfortable with).
Better error reporting was also a goal, IIRC.
I don't know if it was a factor at all, but the hand-written parser is a recursive descent parser. I know ANTLR generates those, but ANTLR is written in Java, which is a pretty heavy dependency (yes, I know it's not a run-time dependency, but still).
As a side comment: I'm interested in Vala specifically because I like the idea of having a language with modern features and clean syntax that is compilable into a "native" and "unmanaged" high-level language (C in the case of Vala). I have found only Vala so far. I'm thinking of having fun by making Vala (or a similar language) compilable to C++ (backed by the Qt libraries). But since I don't want to invent a completely new language, I'm thinking of taking some existing grammar. Obviously, hand-crafted parsers don't have a written formal grammar I could reuse. Your comments on this idea are welcome (is the whole idea silly?).
A lot of Vala is really a reflection of decisions made by GObject, and things may or may not work the same way in C++/Qt. If your goal is to replace GObject/C in valac with Qt/C++, you're probably in for more work than you expect. If, however, you just want to make C++ and Qt libraries accessible from Vala, that should certainly be possible. In fact, Luca Bruno started working on that about a year ago (see the wip/cpp branch). It hasn't seen activity for a while due to lack of time, not technical issues.
According to the Vala documentation: "Before 0.3.1, Vala's parser was the classic flex scanner and Bison LALR parser combination. But as of commit eba85a, the parser is a hand-crafted recursive descent parser."
My question is: Why?
Here I'm specifically asking about Vala, although the question could be addressed to any compiler that doesn't use a parser generator. What are the pros and cons of such a move from a parser generator to a hand-crafted parser? What are the disadvantages of using parser generators (Bison, ANTLR) for compilers?
Perhaps the programmers spotted some avenues for optimisation that the parser generator didn't spot, and these avenues for optimisation required an entirely different parsing algorithm. Alternatively, perhaps the parser generator generated code in C89, and the programmers decided refactoring for C99 or C11 would improve legibility.
As a side comment: I'm interested in Vala specifically because I like the idea of having a language with modern features and clean syntax that is compilable into a "native" and "unmanaged" high-level language (C in the case of Vala).
Just a quick note: C isn't native. It has its roots in portability, as it was designed to abstract away those hardware/OS-specific details that caused programmers so much grief when porting. For example, think of the pain of using an entirely different fopen for each OS and/or filesystem; I don't just mean different in functionality, but also different in input and output expectations, e.g. different arguments and different return values. Similarly, C11 introduces portable threads; code that uses threads will be able to use the same C11-compliant code to target all OSes that implement threads.
I have found only Vala so far. I'm thinking of having fun by making Vala (or a similar language) compilable to C++ (backed by the Qt libraries). But since I don't want to invent a completely new language, I'm thinking of taking some existing grammar. Obviously, hand-crafted parsers don't have a written formal grammar I could reuse. Your comments on this idea are welcome (is the whole idea silly?).
It may be feasible to use the hand-crafted parser to produce C++ code with little effort, so I wouldn't toss this option aside so quickly; the old flex/bison parser may or may not be more useful. However, this isn't your only option. In any case, I'd be tempted to study the specification extensively.
It is curious that these authors went from bison to RD. Most people would go in the opposite direction.
The only real reason I can see to do what the Vala authors did would be better error recovery, or maybe their grammar isn't very clean.
I think you will find that most new languages start out with hand-written parsers as the authors get a feel for their own new language and figure out exactly what it is they want to do, and in some cases as the authors learn how to write compilers. C is a classic example, as is C++. Later in a language's evolution, a generated parser may be substituted. Compilers for existing standard languages, on the other hand, can be developed more rapidly via a parser generator, possibly even from an existing grammar: time to market is a critical business parameter for these projects.

ANTLR grammar for Scala?

I am trying to build a static analysis tool for a demo project. We are free to choose the language to analyze. I started off by writing a Java code analyzer using ANTLR. I now want to do the same for Scala code. However, I could not find an ANTLR grammar for Scala. Does it exist?
Is there any other machine readable form of Scala grammar?
I don't believe there is such a thing.
The thing is that for any language, but especially for a language like Scala, lexical analysis and syntactic analysis are the least interesting and most trivial parts of static analysis. In order to do anything even remotely interesting, you need to perform a significant amount of semantic analysis: desugaring, type inference, type checking, kind checking, macro expansion, overload resolution, implicit resolution, name binding. In short: you need to re-implement more or less the entire Scala compiler, modulo the actual code generation part. Remember that both Scala's macro system and Scala's type system are Turing-complete (in fact, the language of Scala's macros is Scala itself!): there can be significant compile-time and type-level computation going on that is impossible to analyze without actually performing macro expansion, type inference and type checking.
That is a massive task, and there are in fact only two projects that have successfully done it: one is the Scala compiler itself, the other is the IntelliJ IDEA Scala plugin.
And let's not even talk about compiler plugins, which are able to change the syntax and semantics of Scala in almost arbitrary ways.
But behold, there is hope: the Scala compiler itself provides an API called the Presentation Compiler, which is specifically designed for use by IDEs, code highlighters, and all kinds of static analysis tools. It gives you access to all the information the compiler has during compilation, just before the optimization and code generation phases. It is used by ScalaDoc, the Scala REPL, the Scala Eclipse Plugin, the NetBeans Scala Plugin, SimplyScala.Com, the ENSIME Plugin for Emacs, some static analysis tools, and many others.
You can find a Scala grammar for ANTLR at https://github.com/lrlucena/grammars-v4/tree/master/scala . It is based on the Scala Language Specification http://www.scala-lang.org/files/archive/spec/2.11/13-syntax-summary.html .
Is Appendix A of the Scala Language Reference useful for you? It is in EBNF format.
Scalastyle uses scalariform to do its parsing. With this, you get an AST of case classes. However, you only get the information that is in the file, so, for instance, you don't get inferred types.
If you don't need all of the extra information, then look at Scalariform. The Scalastyle code is fairly easy to understand; start with Checker.scala.

Language parsers

I need to parse C#, Ruby and Python source code to generate some reports. I need to get a list of the method names inside a class, and I need some other information, such as usage of global variables. Parsing with regular expressions could be a solution, but I expect a better (more systematic) solution using parsers, if that is easily possible.
What parsers are available for those languages?
For C#, I found http://csparser.codeplex.com/Wikipage , but for the others I found a bunch of parsers written in those languages, rather than parsers for the languages themselves.
It may be worth looking into the ANTLR parser generator.
You'll find, on the ANTLR site, grammars for all 3 languages you are interested in (although the Ruby grammar is only for a "simplified" version of the language).
The next difficulty may be to adapt these grammars for the particular target language you would like, i.e. the language in which the parsers themselves will be generated. ANTLR's grammar language is very expressive, allowing one to deal with various context-sensitive languages. This is done by inserting various snippets (in the target language) and/or semantic or syntactic predicates (also in the target language) amid the EBNF-like grammar; consequently the grammar is a bit messier and may need adapting when the target language is changed. The "native" target language of ANTLR is Java, but many other target languages are supported.
On the whole, ANTLR represents a bit of a setup/learning-curve effort, but since you need to deal with 3 languages, it may well be worth the investment, as this will allow you to have a uniform framework (over which you have "full" control), rather than trying to corral three possibly very distinct, and possibly more "locked down", parsers as you started doing.
All three languages are relatively sophisticated languages and although your goal is "merely" to identify methods within programs, you may be able to hack/simplify some of the grammars (or maybe simply "ignore" parts of them), only mapping the few parser-level rules of interest to your eventual goal.
Once these rules are identified, you can apply the same or similar actions, i.e. snippets (in the target language) which implement what you wish to accomplish when the parser encounters such rules (eg: store the method's signature for future reporting, start counting the number of lines... whatever).
A final suggestion:
As hinted in comments to the question, and depending on your goals, you may be able to reuse existing utility programs to perform directly, or indirectly these goals.
Also, because messing with parsers for these sophisticated languages may indeed be overkill for your possibly simple and possibly error-tolerant goals, the regular-expressions approach may fit the bill; the fact of the matter is that none of these languages is regular or even context-free, so success with regexes will depend heavily on your eventual goals and on the input data (the programs).
Yet another suggestion!
See Larry Lustig's answer! Introspection may simplify much of your task as well. The implication is that you'd need to a) write your logic within each of the underlying languages and b) integrate/load the programs to be inspected. It all depends, but again, it is a possible way out of the (let's be fair) relatively heavy investment in formal grammar tools.
For Python, the situation is trivial: there is a Python parser in the standard library, as well as a higher-level module (ast) for manipulating ASTs.
Also, Python has a somewhat simple grammar (at least if you use the trick to keep an indentation stack in your lexer and inject fake BEGIN and END tokens in your token stream, so that you can treat Python as a simple keyword delimited Algol-like language in your parser), so it is often used as an example grammar for parser generators, which means that you can find literally dozens of Python parsers for pretty much every single parser generator, programming language and platform out there. (E.g., here is a Haskell module implementing a Python lexer and parser.)
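For example, listing the method names inside each class (the report the question asks for) takes only a few lines with the standard-library ast module; the sample source below is of course made up:

```python
import ast

source = """
class Greeter:
    def hello(self): pass
    def bye(self): pass
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        methods = [n.name for n in node.body if isinstance(n, ast.FunctionDef)]
        print(node.name, methods)   # prints: Greeter ['hello', 'bye']
```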
For Ruby, there are quite a number of parsers available.
Ruby is incredibly hard to parse, so if you need full fidelity, you pretty much have to use the original YACC grammar file from the YARV Ruby implementation. (parse.y in the top-level source directory.) JRuby's parser is derived from that file, and it is the only one of the implementation parsers that has been explicitly designed to also be used by other clients and not just the interpreter itself. (For example, the Eclipse RDT plugin, the Eclipse DLTK/Ruby plugin, the NetBeans Ruby plugin and the jEdit Ruby syntax highlighting all use JRuby's parser.) To facilitate that, JRuby's parser has actually been repackaged as a separate project.
Of course, there are YACC clones for pretty much every language on the planet. However, be aware that YARV does not use a lex-generated scanner. It uses a hand-written scanner in C, and the YACC grammar also contains quite a few semantic actions written in C. Those parts will have to be re-implemented (as they were in JRuby).
The XRuby compiler is the only full Ruby implementation that does not use YARV's parse.y, it uses an ANTLRv3 grammar and an ANTLRv3 tree grammar that have been developed from scratch. ANTLR can generate parsers for a whole bunch of languages, including for example Java and C#. Its Ruby backend, however, is in dire need of some work.
RedParse is a Ruby parser written in Ruby, which claims to be able to parse all Ruby syntax correctly. It is used, for example, in the YARD Ruby documentation tool to, among other things, extract method names.
ruby_parser is another Ruby parser in Ruby. It is generated from parse.y via the racc parser generator that is part of Ruby's standard library.
YARV actually contains a parser library called ripper, which allows you to parse Ruby code. Unfortunately, it is completely undocumented, so you basically have to figure it out by reading blog posts. Except that, being undocumented, almost nobody else has figured it out and written a blog post yet, either.
However, for your purposes, you don't actually need a full-blown Ruby parser. You only need enough to extract method names and some other stuff.
RDoc, the Ruby documentation generator, contains a Ruby parser which can parse just enough Ruby to, well, extract method names and some other stuff.
Cardinal is a Ruby implementation for the Parrot Virtual Machine. It does not yet run all of Ruby, but its parser should be powerful enough to support all you need. (The parser is written in the Parrot Grammar Engine, so you will obviously have to run it in Parrot, by for example writing your reporting tool in Perl6.)
tinyrb is another Ruby implementation that does not run full Ruby but contains a better written parser than YARV. In this case, the parser uses Ian Piumarta's leg Parsing Expression Grammar parser generator.
For Ruby and Python, can't you simply introspect the class to learn the names of the methods? You'd have to write the same functionality in each language, but (at least in Python) there's hardly anything to it.
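In Python that introspection really is a one-liner with the standard inspect module (the Greeter class here is just a made-up example):

```python
import inspect

class Greeter:
    def hello(self): pass
    def bye(self): pass

# getmembers returns (name, member) pairs, sorted by name.
print([name for name, _ in inspect.getmembers(Greeter, inspect.isfunction)])
# ['bye', 'hello']
```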
The DMS Software Reengineering Toolkit has full, robust C# and Python parsers that automatically build complete ASTs. DMS offers facilities for walking the trees and collecting whatever data you might wish to collect.
Another poster's answer here suggests Ruby is really hard to parse. C++ is also famously hard to parse. DMS has been used to parse some 30 other languages, including full C++ in a number of dialects, so Ruby seems eminently doable. However, DMS doesn't have an off-the-shelf parser for Ruby.

ANTLR vs. Happy vs. other parser generators

I want to write a translator between two languages, and after some reading on the Internet I've decided to go with ANTLR. I had to learn it from scratch, but besides some trouble with eliminating left recursion everything went fine until now.
However, today some guy told me to check out Happy, a Haskell-based parser generator. I have no Haskell knowledge, so I could use some advice on whether Happy is indeed better than ANTLR and whether it is worth learning.
Specifically what concerns me is that my translator needs to support macro substitution, which I have no idea yet how to do in ANTLR. Maybe in Happy this is easier to do?
Or, if you think other parser generators are even better, I'd be glad to hear about them.
People keep believing that if they just get a parser, they've got it made when building language tools. That's just wrong. Parsers get you to the foothills of the Himalayas; then you need to start climbing seriously.
If you want industrial-strength support for building language translators, see our DMS Software Reengineering Toolkit. DMS provides:
Unicode-based lexers
full context-free parsers (left recursion? No problem! Arbitrary lookahead? No problem. Ambiguous grammars? No problem)
full front ends for C, C#, COBOL, Java, C++, JavaScript, ...
(including full preprocessors for C and C++)
automatic construction of ASTs
support for building symbol tables with arbitrary scoping rules
attribute grammar evaluation, to build analyzers that leverage the tree structure
support for control and data flow analysis (as well as realizations of this for full C, Java and COBOL),
source-to-source transformations using the syntax of the source AND the target language
AST to source code prettyprinting, to reproduce target language text
Regarding the OP's request to handle macros: our C, COBOL and C++ front ends handle their respective language preprocessing by a) the traditional method of full expansion or b) non-expansion (where practical) to enable post-parsing transformation of the macros themselves. While DMS as a foundation doesn't specifically implement macro processing, it can support the construction and transformation of same.
As an example of a translator built with DMS, see the discussion of converting JOVIAL to C for the B-2 bomber. This is 100% translation of more than 1 MSLOC of hard real-time code. [It may amuse you to know that we were never allowed to see the actual program being translated (top secret).] And yes, JOVIAL has a preprocessor, and yes, we translated most JOVIAL macros into equivalent C versions.
[Haskell is a cool programming language, but it doesn't do anything like this by itself. This isn't about what's expressible in the language. It's about figuring out what machinery is required to support the task of manipulating programs, and spending 100 man-years building it.]

Resources