Cross-platform parser development - What are the options? - parsing

I'm currently working on a project that makes use of a custom language with a simple context-free grammar.
Due to the project's characteristics the same language will have to be used on several platforms, especially mobile ones. Currently, I'm using my small hand-written Java parser (for the Android platform). Soon, I'll have to write basically the same parser for JavaScript and later possibly also for C# (Windows Phone) and Objective C (iOS). There is an additional chance that I'll also have to write it for PHP.
My question is: What options are there to simplify the parser development process? Do I really have to write basically the same parser for each platform or is there a less work-intensive way?
From a development process point of view the best alternative would enable me to write a grammar definition which would then automatically be compiled into a parser.
However, basically the only cross-platform parser generator I've found so far it the GOLD Parser which supports two of my target platforms (Java and C#). It would really be awesome if you could point me to other alternatives.
In case you don't know about other cross-platform compiler-compilers: Do you have hints how to structure the code towards future language extensibility?

I commend https://en.wikipedia.org/wiki/Comparison_of_parser_generators to your attention: if we restrict the domain to Java and C/C++, it suggests APG, GOLD, SableCC, and SLK (amongst others) as being cross-language enough for your stated goals. (I'm also requiring that the action code be separated from the grammar rather than inline, since the latter would defeat the purpose.) If you want JavaScript as well, it looks like your choices are APG (GPL-licensed) and WaxEye (MIT-licensed).
If your language is reasonably simple then I would say to just go with whichever you think will be easiest to integrate into your build environment(s) and has a reasonable match with how you think. Unless parsing time is a huge fraction of your application's total workload, parsing speed should not be an issue -- although table size and memory usage might matter in a mobile context. If your grammar is "simple enough," (i.e. not Perl, for instance) I would expect any of those tools to work.

Have a look in Antlr, I am using it for transforming java code and it is really great. Moreover you can find different grammars here.

REx parser generator supports the required targets, except for Objective C and PHP (code generators for those might be possible). It has not yet been published as open source, though, and there is no decent documentation, just sample grammars. But there are projects that are using it successfully, e.g. xqlint. Here is a paper describing the experience from that project.

Related

What front-end can I use with RPython to implement a language?

I've looked high and low for examples of implementing a language using the RPython toolchain, but the only one I've been able to find so far is this one in which the author writes a simple BF interpreter. Because the grammar is so simple, he doesn't need to use a parser/lexer generator. Is there a front-end out there that supports developing a language in RPython?
Thanks!
I'm not aware of any general lexer or parser generator targeting RPython specifically. Some with Python output may work, but I wouldn't bet on it. However, there's a set of parsing tools in rlib.parsing. It seems quite usable. OTOH, there's a warning in the documentation: It's reportedly still in development, experimental, and only used for the Prolog interpreter so far.
Alternatively, you can write the frontend by hand. Lexers can be annoying and unnatural, granted (you may be able to rip out the utility modules for DFAs used by the Python implementation). But parsers are a piece of cake if you know the right algorithms. I'm a huge fan of "Top Down Operator Precedence parsers" a.k.a. "Pratt parsers", which are reasonably simple (recursive descent) but make all expression parsing issues (nesting, precedence, associativity, etc.) a breeze. There's depressingly little information on them, but the few blog posts were sufficient for me:
One by Crockford (wouldn't recommend it though, it throws a whole lot of unrelated stuff into the parser and thus obscures it),
another one at effbot.org (uses Python),
and a third by a sadly even-less-famous guy who's developing a language himself, Robert Nystrom.
Alex Gaynor has ported David Beazley's excellent PLY to RPython. Its documentation is quite good, and he even gave a talk about using it to implement an interpreter at PyCon US 2013.

Can Xtext be used for parsing general purpose programming languages?

I'm currently developing a general-purpose agent-based programming language (its syntaxt will be somewhat inspired by Java, and we are also using object in this language).
Since the beginning of the project we were doubtful about the fact of using ANTLR or Xtext. At that time we found out that Xtext was implementing a subset of the feature of ANTLR. So we decided to use ANLTR for our language losing the possibility to have a full-fledged Eclipse editor for free for our language (such a nice features provided by Xtext).
However, as the best of my knowledge, this summer the Xtext project has done a big step forward. Quoting from the link:
What are the limitations of Xtext?
Sven: You can implement almost any kind of programming language or DSL
with Xtext. There is one exception, that is if you need to use so
called 'Semantic Predicates' which is a rather complicated thing I
don't think is worth being explained here. Very few languages really
need this concept. However the prominent example is C/C++. We want to
look into that topic for the next release.
And that is also reinforced in the Xtext documentation:
What is Xtext? No matter if you want to create a small textual domain-specific language (DSL) or you want to implement a full-blown
general purpose programming language. With Xtext you can create your
very own languages in a snap. Also if you already have an existing
language but it lacks decent tool support, you can use Xtext to create
a sophisticated Eclipse-based development environment providing
editing experience known from modern Java IDEs in a surprisingly short
amount of time. We call Xtext a language development framework.
If Xtext has got rid of its past limitations why is it still not possible to find a complex Xtext grammar for the best known programming languages (Java, C#, etc.)?
On the ANTLR website you can find tons of such grammar examples, for what concerns Xtext instead the only sample I was able to find is the one reported in the documentation. So maybe Xtext is still not mature to be used for implementing a general purpose programming language? I'm a bit worried about this... I would not start to re-write the grammar in Xtext for then to recognize that it was not suited for that.
I think nobody implemented Java or C++ because it is a lot of work (even with Xtext) and the existing tools and compilers are excellent.
However, you could have a look at Xbase and Xtend, which is the expression language we ship with Xtext. It is built with Xtext and is quite a good proof for what you can build with Xtext. We have done that in about 4 person months.
I did a couple of screencasts on Xtend:
http://blog.efftinge.de/2011/03/xtend-screencast-part-1-basics.html
http://blog.efftinge.de/2011/03/xtend-screencast-part-2-switch.html
http://blog.efftinge.de/2011/03/xtend-screencast-part-3-rich-strings-ie.html
Note, that you can simply embed Xbase expressions into your language.
I can't speak for what Xtext is or does well.
I can speak to the problem of developing robust tools for processing real languages, based on our experience with the DMS Software Reengineering Toolkit, which we imagine is a language manipulation framework.
First, parsing of real languages usually involves something messy in lexing and/or parsing, due to the historical ways these languages have evolved. Java is pretty clean. C# has context-dependent keywords and a rudimentary preprocessor sort of like C's. C has a full blown preprocessor. C++ is famously "hard to parse" due to ambiguities in the grammar and shenanigans with template syntax. COBOL is fairly ugly, doesn't have any reference grammars, and comes in a variety of dialects. PHP will turn you to stone if you look at it because it is so poorly defined. (DMS has parsers for all of these, used in anger on real applications).
Yet you can parse all of these with most of the available parsing technologies if you try hard enough, usually by abusing the lexer or the parser to achieve your goals (how the GNU guys abused Bison to parse C++ by tangling lexical analysis with symbol table lookup is a nice ugly case in point). But it takes a lot of effort to get the language details right, and the reference manuals are only close approximations of the truth with respect to what the compilers really accept.
If Xtext has a decent parsing engine, one can likely do this with Xtext. A brief perusal of the Xtext site sounds like the lexers and parsers are fairly decent. I didn't see anything about the "Semantic Predicate"s; we have them in DMS and they are lifesavers in some of the really dark corners of parsing. Even using the really good parsing technology (we use GLR parsers), it would be very hard to parse COBOL data declarations (extracting their nesting structure during the parse) without them.
You have an interesting problem in that your language isn't well defined yet. That will make your initial parsers somewhat messy, and you'll revise them a lot. Here's where strong parsing technology helps you: if you can revise your grammar easily you can focus on what you want your language to look like, rather than focusing on fighting the lexer and parser. The fact that you can change your language definition means in fact that if Xtext has some limitations, you can probably bend your language syntax to match without huge amounts of pain. ANTLR does have the proven ability to parse a language pretty much as you imagine it, modulo the usual amount of parser hacking.
What is never discussed is what else is needed to process a language for real. The first thing you need to be able to do is to construct ASTs, which ANTLR and YACC will help you do; I presume Xtext does also. You also need symbol tables, control and data flow analysis (both local and global), and machinery to transform your language into something else (presumably more executable). Doing just symbol tables you will find surprisingly hard; C++ has several hundred pages of "how to look up an identifier"; Java generics are a lot tougher to get right than you might expect. You might also want to prettyprint the AST back to source code, if you want to offer refactorings. (EDIT: Here both ANTLR and Xtext offer what amounts to text-template driven code generation).
Yet these are complex mechanisms that take as much time, if not more than building the parser. The reason DMS exists isn't because it can parse (we view this just as the ante in a poker game), but because all of this other stuff is very hard and we wanted to amortize the cost of doing it all (DMS has, we think, excellent support for all of these mechanisms but YMMV).
On reading the Xtext overview, it sounds like they have some support for symbol tables but it is unclear what kind of assumption is behind it (e.g., for C++ you have to support multiple inheritance and namespaces).
If you are already started down the ANTLR road and have something running, I'd be tempted to stay the course; I doubt if Xtext will offer you a lot of additional help. If you really really want Xtext's editor, then you can probably switch at the price of restructuring what grammar you have (this is a pretty typical price to pay when changing parsing paradigms). Expect most of your work to appear after you get the parser right, in an ad hoc way. I doubt you will find Xtext or ANTLR much different here.
I guess the most simple answer to your question is: Many general purpose languages can be implemented using Xtext. But since there is no general answer to which parser-capabilities a general purpose languages needs, there is no general answer to your questions.
However, I've got a few pointers:
With Xtext 2.0 (released this summer), Xtext supports syntactic predicates. This is one of the most requested features to handle ambiguous syntax without enabling antlr's backtracking.
You might want to look at the brand-new languages Xbase and Xtend, which are (judging based on their capabilities) general-purpose and which are developed using Xtext. Sven has some nice screen casts in his blog: http://blog.efftinge.de/
Regarding your question why we don't see Xtext-grammars for Java, C++, etc.:
With Xtext, a language is more than just a grammar, so just having a grammar that describes a language's syntax is a good starting point but usually not an artifact valuable enough for shipping. The reason is that with an Xtext-grammar you also define the AST's structure (Abstract Syntax Tree, and an Ecore Model in fact) including true cross references. Since this model is the main internal API of your language people usually spend a lot of thought designing it. Furthermore, to resolve cross references (aka linking) you need to implement scoping (as it is called in Xtext). Without a proper implementation of scoping you can either not have true cross references in your model or you'll get many lining errors.
A guess my point is that creating a grammar + designing the AST model + implementing scoping is just a little more effort that taking a grammar from some language-zoo and translating it to Xtext's syntax.

Source of parsers for programming languages?

I'm dusting off an old project of mine which calculates a number of simple metrics about large software projects. One of the metrics is the length of files/classes/methods. Currently my code "guesses" where class/method boundaries are based on a very crude algorithm (traverse the file, maintaining a "current depth" and adjusting it whenever you encounter unquoted brackets; when you return to the level a class or method began on, consider it exited). However, there are many problems with this procedure, and a "simple" way of detecting when your depth has changed is not always effective.
To make this give accurate results, I need to use the canonical way (in each language) of detecting function definitions, class definitions and depth changes. This amounts to writing a simple parser to generate parse trees containing at least these elements for every language I want my project to be applicable to.
Obviously parsers have been written for all these languages before, so it seems like I shouldn't have to duplicate that effort (even though writing parsers is fun). Is there some open-source project which collects ready-to-use parser libraries for a bunch of source languages? Or should I just be using ANTLR to make my own from scratch? (Note: I'd be delighted to port the project to another language to make use of a great existing resource, so if you know of one, it doesn't matter what language it's written in.)
If you want language-accurate parsing, especially in the face of language complications such as macros and preprocessor conditionals, you need full language parsers. These are actually quite a lot of work to construct, and most languages don't lend themselves nicely to the various kinds of parser generators around. Nor are most authors of a language parser interested in other langauges; they tend to choose some parser generator that isn't obviously a huge roadblock when they start, implement their parser for the specific purpose they intend, and move on.
Consequence: there are very few libraries of language definitions around that are defined using a single formalism or a shared foundation. The ANTLR crowd maintains one of the larger sets IMHO, although as far as I can tell most of those parsers are not-quite-production capable. There's always Bison, which has been around long enough so you'd expect a library of langauge definitions to be collected somewhere, but I've never seen one.
I've spent the last 15 years defining foundation machinery for program analysis and transformation, and building another such library, called the DMS Software Reengineering Toolkit. It has production quality parsers for C, C++, C#, Java, COBOL (IBM Enterprise version), JCL, PHP, Python, etc. Your opinion may of course vary from mine but these are used daily with DMS to carry out mass change tasks on large bodies of code.
I don't know of any others where the set of langauge definitions are mature and built on a single foundation... it may be that IBM's compilers are such a set, but IBM doesn't offer out the machinery or the language definitions.
If all you want to do is compute simple metrics, you might be able to live with just lexers and ad hoc nest-counting (as you've described). Even that's harder than it looks to make it work right in most cases (check out Python's, Perl's and PHP crazy string syntaxes). When all is said and done, even C is a surprising amount of work just to define an accurate lexer: we have several thousand lines of sophisticated regular expressions to cover all the strange lexemes you find in Microsoft and/or GNU C.
Because DMS has consistently-defined, mature parsers for many languages, it follows that DMS has consistently defined, mature lexers for the same langauges. We actually build a Source Code Search Engine (SCSE) that provides fast search across large bodies of codes in multiple languages that works by lexing the languages it encounters and indexing those lexemes for fast lookup. The SCSE just so happens to compute the kind of metrics you are discussing, too, as it indexes the code base, pretty much the way you describe, except that it has these langauage accurate lexers to use.
You might be interested in gcc-xml if you are parsing C++. Java CUP has grammars for the Java language.

ANTLR vs. Happy vs. other parser generators

I want to write a translator between two languages, and after some reading on the Internet I've decided to go with ANTLR. I had to learn it from scratch, but besides some trouble with eliminating left recursion everything went fine until now.
However, today some guy told me to check out Happy, a Haskell based parser generator. I have no Haskell knowledge, so I could use some advice, if Happy is indeed better than ANTLR and if it's worth learning it.
Specifically what concerns me is that my translator needs to support macro substitution, which I have no idea yet how to do in ANTLR. Maybe in Happy this is easier to do?
Or if think other parser generators are even better, I'd be glad to hear about them.
People keep believing that if they just get a parser, they've got it made
when building language tools. Thats just wrong. Parsers get you to the foothills
of the Himalayas then you need start climbing seriously.
If you want industrial-strength support for building language translators, see our
DMS Software Reengineering Toolkit. DMS provides
Unicode-based lexers
full context-free parsers (left recursion? No problem! Arbitrary lookahead? No problem. Ambiguous grammars? No problem)
full front ends for C, C#, COBOL, Java, C++, JavaScript, ...
(including full preprocessors for C and C++)
automatic construction of ASTs
support for building symbol tables with arbitrary scoping rules
attribute grammar evaluation, to build analyzers that leverage the tree structure
support for control and data flow analysis (as well realization of this for full C, Java and COBOL),
source-to-source transformations using the syntax of the source AND the target language
AST to source code prettyprinting, to reproduce target language text
Regarding the OP's request to handle macros: our C, COBOL and C++ front ends handle their respective language preprocessing by a) the traditional method of full expansion or b) non-expansion (where practical) to enable post-parsing transformation of the macros themselves. While DMS as a foundation doesn't specifically implement macro processing, it can support the construction and transformation of same.
As an example of a translator built with DMS, see the discussion of
converting
JOVIAL to C for the B-2 bomber. This is 100% translation for > 1 MSLOC of hard
real time code. [It may amuse you to know that we were never allowed to see the actual program being translated (top secret).]. And yes, JOVIAL has a preprocessor, and yes we translated most JOVIAL macros into equivalent C versions.
[Haskell is a cool programming language but it doesn't do anything like this by itself.
This isn't about what's expressible in the language. Its about figuring out what machinery is required to support the task of manipulating programs, and
spending 100 man-years building it.]

Most effective way to parse C-like definition strings?

I've got a set of function definitions written in a C-like language with some additional keywords that can be put before some arguments(the same way as "unsigned" or "register", for example) and I need to analyze these lines as well as some function stubs and generate actual C code from them.
Is that correct that Flex/Yacc are the most proper way to do it?
Will it be slower than writing a Shell or Python script using regexps(which may become big pain, as I suppose, if the number of additional keywords becomes bigger and their effects would be rather different) provided that I have zero experience with analysers/parsers(though I know how LALR does its job)?
Are there any good materials on Lex/Yacc that cover similar problems? All papers I could find use the same primitive example of a "toy" calculator.
Any help will be appreciated.
ANTLR is commonly used (as are Lex\Yacc).
ANTLR, ANother Tool for Language
Recognition, is a language tool that
provides a framework for constructing
recognizers, interpreters, compilers,
and translators from grammatical
descriptions containing actions in a
variety of target languages.
There is also the Lemon Parser, which features a less restrictive grammar. The down side is you're married to lemon, re-writing a parser's grammar to something else when you discover some limitation sucks. The up side is its really easy to use .. and self contained. You can drop it in tree and not worry about checking for the presence of others.
SQLite3 uses it, as do several other popular projects. I'm not saying use it because SQLite does, but perhaps give it a try if time permits.
That entirely depends on your definition of "effective". If you have all the time of the world, the fastest parser would be a hand-written pull parser. They take a long time to debug and develop but today, no parser generator beats hand-written code in terms of runtime performance.
If you want something that can parse valid C within a week or so, use a parser generator. The code will be fast enough and most parser generators come with a grammar for C already which you can use as a starting point (avoiding 90% of the common mistakes).
Note that regexps are not suitable for parsing recursive structures. This approach would both be slower than using a generator and more error prone than a hand-written pull parser.
actually, it depends how complex is your language and whether it's really close to C or not...
Still, you could use lex as a first step even for regular expression ....
I would go for lex + menhir and o'caml....
but any flex/yacc combination would be fine..
The main problem with regular bison (the gnu implementation of yacc) stems from the C typing.. you have to describe your whole tree (and all the manipulation functions)... Using o'caml would be really easier ...
For what you want to do, our DMS Software Reengineering Toolkit is likely a very effective solution.
DMS is designed specifically to support customer analyzers/code generators of the type you are discussing. It provides very strong facilities for defining arbitrary language parsers/analyzers (tested on 30+ real languages including several complete dialects of C, C++, Java, C#, and COBOL).
DMS automates the construction of ASTs (so you don't have to do anything but get the grammar right to have a usable AST), enables the construction of custom analyses of exactly the pattern-directed inspection you indicated, can construct new C-specific ASTs representing the code you want to generate, and spit them out as compilable C source text. The pre-existing definitions of C for DMS can likely be bent to cover your C-like language.

Resources