Source of parsers for programming languages? - parsing

I'm dusting off an old project of mine which calculates a number of simple metrics about large software projects. One of the metrics is the length of files/classes/methods. Currently my code "guesses" where class/method boundaries are based on a very crude algorithm (traverse the file, maintaining a "current depth" and adjusting it whenever you encounter unquoted brackets; when you return to the level a class or method began on, consider it exited). However, there are many problems with this procedure, and a "simple" way of detecting when your depth has changed is not always effective.
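For illustration, here is a minimal Python sketch of the kind of brace-depth heuristic just described (my own sketch, not the project's actual code; function and variable names are made up, and it deliberately shares the same blind spots, e.g. escape sequences and comments):

def block_spans(source):
    """Crude nesting heuristic: track '{'/'}' depth, treating anything
    inside single or double quotes as opaque, and report the line span
    of every brace-delimited block as a candidate class/method body."""
    spans, open_lines = [], []
    in_string, quote = False, ""
    for lineno, line in enumerate(source.splitlines(), start=1):
        for ch in line:
            if in_string:
                if ch == quote:
                    in_string = False
            elif ch in ('"', "'"):
                in_string, quote = True, ch
            elif ch == "{":
                open_lines.append(lineno)
            elif ch == "}" and open_lines:
                spans.append((open_lines.pop(), lineno))
    return spans   # e.g. [(3, 7), (1, 9)] -- inner blocks close first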
To make this give accurate results, I need to use the canonical way (in each language) of detecting function definitions, class definitions and depth changes. This amounts to writing a simple parser to generate parse trees containing at least these elements for every language I want my project to be applicable to.
Obviously parsers have been written for all these languages before, so it seems like I shouldn't have to duplicate that effort (even though writing parsers is fun). Is there some open-source project which collects ready-to-use parser libraries for a bunch of source languages? Or should I just be using ANTLR to make my own from scratch? (Note: I'd be delighted to port the project to another language to make use of a great existing resource, so if you know of one, it doesn't matter what language it's written in.)

If you want language-accurate parsing, especially in the face of language complications such as macros and preprocessor conditionals, you need full language parsers. These are actually quite a lot of work to construct, and most languages don't lend themselves nicely to the various kinds of parser generators around. Nor are most authors of a language parser interested in other languages; they tend to choose some parser generator that isn't obviously a huge roadblock when they start, implement their parser for the specific purpose they intend, and move on.
Consequence: there are very few libraries of language definitions around that are defined using a single formalism or a shared foundation. The ANTLR crowd maintains one of the larger sets, IMHO, although as far as I can tell most of those parsers are not quite production-capable. There's always Bison, which has been around long enough that you'd expect a library of language definitions to have been collected somewhere, but I've never seen one.
I've spent the last 15 years defining foundation machinery for program analysis and transformation, and building one such library, called the DMS Software Reengineering Toolkit. It has production-quality parsers for C, C++, C#, Java, COBOL (IBM Enterprise version), JCL, PHP, Python, etc. Your opinion may of course vary from mine, but these are used daily with DMS to carry out mass-change tasks on large bodies of code.
I don't know of any others where the set of language definitions is mature and built on a single foundation... it may be that IBM's compilers are such a set, but IBM doesn't offer the machinery or the language definitions.
If all you want to do is compute simple metrics, you might be able to live with just lexers and ad hoc nest-counting (as you've described). Even that is harder than it looks to get right in most cases (check out Python's, Perl's, and PHP's crazy string syntaxes). When all is said and done, even C is a surprising amount of work just to define an accurate lexer for: we have several thousand lines of sophisticated regular expressions to cover all the strange lexemes you find in Microsoft and/or GNU C.
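To make the "harder than it looks" point concrete, here is a rough, illustrative Python sketch (my own, not DMS's actual lexer definitions) of the kind of patterns a counting heuristic has to skip over before it can trust any brace it sees; real C lexers cover far more cases (wide and raw literals, trigraphs, line continuations, ...):

import re

# Illustrative-only patterns for a few C lexemes that can hide braces.
C_STRING  = r'"(?:\\.|[^"\\\n])*"'        # "..." with escape sequences
C_CHAR    = r"'(?:\\.|[^'\\\n])*'"        # '...' character constants
C_COMMENT = r'/\*.*?\*/|//[^\n]*'         # block and line comments

SKIP = re.compile("|".join([C_STRING, C_CHAR, C_COMMENT]), re.DOTALL)

code = 'printf("{ not a block }"); /* { nor this } */ { real_block(); }'
print(SKIP.sub(" ", code))   # only the genuine braces survive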
Because DMS has consistently defined, mature parsers for many languages, it follows that DMS has consistently defined, mature lexers for the same languages. We actually built a Source Code Search Engine (SCSE) that provides fast search across large bodies of code in multiple languages; it works by lexing the languages it encounters and indexing those lexemes for fast lookup. The SCSE happens to compute the kind of metrics you are discussing, too, as it indexes the code base, pretty much the way you describe, except that it has these language-accurate lexers to use.

You might be interested in gcc-xml if you are parsing C++. Java CUP has grammars for the Java language.

Related

What are common properties in an Abstract Syntax Tree (AST)?

I'm new to compiler design and have been watching a series of YouTube videos by Ravindrababu Ravula.
I am creating my own language for fun and I'm parsing it to an Abstract Syntax Tree (AST). My understanding is that these trees can be portable, provided they follow the same structure as other languages' trees.
How can I create an AST that will be portable?
Side notes:
My parser is currently written in JavaScript, but I might move it to C#.
I've been looking at SpiderMonkey's specs for guidance. Is that a good approach?
Portability (however defined) is not likely to be your primary goal in building an AST. Few (if any) compiler frameworks provide a clear interface which allows the use of an external AST, and particular AST structures tend to be badly-documented and subject to change without notice. (Even if they are well-documented, the complexity of a typical AST implementation is challenging.)
An AST is very tied to the syntactic details of a language, as well as to the particular parsing strategy being used. While it is useful to be able to repurpose ASTs for multiple tasks -- compiling, linting, pretty-printing, interactive editing, static analysis, etc. -- the conflicting demands of these different use cases tend to increase complexity. Particularly at the beginning stages of language development, you'll want to give yourself a lot of scope for rapid prototyping.
The most tempting reason for portable ASTs would be to use some other language as a target, thereby saving the cost of writing code-generation, etc. However, in practice it is usually easier to generate the textual representation of the other language from your own AST than to force your parser to use a foreign AST. Even better is to target a well-documented virtual machine (LLVM, .Net IL, JVM, etc.), which is often not much more work than generating, say, C code.
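As a toy illustration of that "emit the other language's text from your own AST" approach (my own sketch in Python; the node and method names are made up):

from dataclasses import dataclass

# A tiny AST for your *own* language; each node knows how to emit
# C-like text, instead of trying to adopt a foreign compiler's AST.
@dataclass
class Num:
    value: int
    def emit(self) -> str:
        return str(self.value)

@dataclass
class Add:
    left: "Num | Add"
    right: "Num | Add"
    def emit(self) -> str:
        return f"({self.left.emit()} + {self.right.emit()})"

@dataclass
class Assign:
    name: str
    expr: "Num | Add"
    def emit(self) -> str:
        return f"int {self.name} = {self.expr.emit()};"

print(Assign("x", Add(Num(1), Num(2))).emit())   # int x = (1 + 2);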
You might want to take a look at the LLVM Kaleidoscope tutorial (the second section covers ASTs, although implemented in C++). Also, you might find this question on a sister site interesting reading. And finally, if you are going to do your implementation in JavaScript, you should at least take a look at the jison parser generator, which takes a lot of the grunt work out of maintaining a parser and scanner (and thus allows for easier experimentation).

abstract syntax tree for imperative languages

I am looking for an abstract syntax tree representation that can be used for common imperative languages (Java, C, Python, Ruby, etc.). I would like this to be as close to source as possible (as opposed to something like LLVM). I found Rose online, but it is only able to handle C and Fortran. Does this exist?
You won't find "one" universal AST that can represent many languages. People have been searching for 50 years.
The essential reason is that an AST node implicitly represents the precise language semantics of the operator it encodes, and different languages have different semantics for what are apparently the same operators.
For example, the "+" operator in modern Fortran will add integers, reals, complex values, and slices of arrays of such things. Java "+" will add integers, reals, and glue strings together. If I wrote "a+b" in "universal AST", how would you know which semantic effect the corresponding AST encoded?
What you can do is build a system in which the ASTs for different languages are represented uniformly, so that you can share tool infrastructure across many languages. This is done by many Program Transformation Systems (PTS), where you provide the grammar (or pick one from an available library), and the PTS parses and builds an AST using its uniform representation. Most PTS provide additional support to analyze and transform the code.
So, all you need is a PTS and some sweat to define a grammar. That's really not true; getting a grammar right for a real language is actually pretty hard. Worse, there's a lot to Life After Parsing because you need the meaning of symbols and additional inferences such as control and data flow analysis. So you need full front ends (e.g., parsing, name/type resolution, flow analysis, ...), or as much as you can get, if you don't want to be distracted for months before beginning your real work.
What this means in practice is you want to find a tool that handles the languages of interest to you, with mature front ends already available:
Rose (you already found this) handles C, C++ and Fortran. It has no built-in parsing capability of its own; its front ends are custom built, so it is apparently hard to extend to other languages. But it has good flow-analysis capabilities and provides means to transform the code via hand-written AST walks/smashes.
Clang handles C and C++. Clang also uses hand-built front ends. It can also transform code, again by hand-written AST walks/smashes, with a small amount of pattern matching support. As I understand it, you have to use the LLVM part of Clang to do flow analysis.
Our DMS Software Reengineering Toolkit has full front ends for C, C++, Java and COBOL, and full parsers for many more languages such as Python. DMS provides pattern-based analysis and source-to-source transformation. It operates directly from a grammar (see one for Oberon, Niklaus Wirth's latest language). (I don't know of any tool that handles Ruby, which is famously hard to parse; I understand its grammar is ambiguous, and DMS is good at handling ambiguous grammars.)

Can Xtext be used for parsing general purpose programming languages?

I'm currently developing a general-purpose agent-based programming language (its syntax will be somewhat inspired by Java, and we also use objects in this language).
Since the beginning of the project we were unsure whether to use ANTLR or Xtext. At that time we found out that Xtext implemented a subset of ANTLR's features. So we decided to use ANTLR for our language, losing the possibility of having a full-fledged Eclipse editor for free (such a nice feature provided by Xtext).
However, to the best of my knowledge, this summer the Xtext project has taken a big step forward. Quoting from the link:
What are the limitations of Xtext?
Sven: You can implement almost any kind of programming language or DSL
with Xtext. There is one exception, that is if you need to use so
called 'Semantic Predicates' which is a rather complicated thing I
don't think is worth being explained here. Very few languages really
need this concept. However the prominent example is C/C++. We want to
look into that topic for the next release.
And that is also reinforced in the Xtext documentation:
What is Xtext? No matter if you want to create a small textual domain-specific language (DSL) or you want to implement a full-blown
general purpose programming language. With Xtext you can create your
very own languages in a snap. Also if you already have an existing
language but it lacks decent tool support, you can use Xtext to create
a sophisticated Eclipse-based development environment providing
editing experience known from modern Java IDEs in a surprisingly short
amount of time. We call Xtext a language development framework.
If Xtext has got rid of its past limitations, why is it still not possible to find a complex Xtext grammar for the best-known programming languages (Java, C#, etc.)?
On the ANTLR website you can find tons of such grammar examples, whereas for Xtext the only sample I was able to find is the one reported in the documentation. So maybe Xtext is still not mature enough to be used for implementing a general-purpose programming language? I'm a bit worried about this... I would not want to start rewriting the grammar in Xtext only to discover that it is not suited for the task.
I think nobody has implemented Java or C++ because it is a lot of work (even with Xtext) and the existing tools and compilers are excellent.
However, you could have a look at Xbase and Xtend, the expression language and programming language we ship with Xtext. They are built with Xtext and are quite a good proof of what you can build with it. We did that in about 4 person-months.
I did a couple of screencasts on Xtend:
http://blog.efftinge.de/2011/03/xtend-screencast-part-1-basics.html
http://blog.efftinge.de/2011/03/xtend-screencast-part-2-switch.html
http://blog.efftinge.de/2011/03/xtend-screencast-part-3-rich-strings-ie.html
Note that you can simply embed Xbase expressions into your language.
I can't speak for what Xtext is or does well.
I can speak to the problem of developing robust tools for processing real languages, based on our experience with the DMS Software Reengineering Toolkit, which we imagine is a language manipulation framework.
First, parsing of real languages usually involves something messy in lexing and/or parsing, due to the historical ways these languages have evolved. Java is pretty clean. C# has context-dependent keywords and a rudimentary preprocessor sort of like C's. C has a full blown preprocessor. C++ is famously "hard to parse" due to ambiguities in the grammar and shenanigans with template syntax. COBOL is fairly ugly, doesn't have any reference grammars, and comes in a variety of dialects. PHP will turn you to stone if you look at it because it is so poorly defined. (DMS has parsers for all of these, used in anger on real applications).
Yet you can parse all of these with most of the available parsing technologies if you try hard enough, usually by abusing the lexer or the parser to achieve your goals (how the GNU guys abused Bison to parse C++ by tangling lexical analysis with symbol table lookup is a nice ugly case in point). But it takes a lot of effort to get the language details right, and the reference manuals are only close approximations of the truth with respect to what the compilers really accept.
If Xtext has a decent parsing engine, one can likely do this with Xtext. A brief perusal of the Xtext site suggests the lexers and parsers are fairly decent. I didn't see anything about "Semantic Predicates"; we have them in DMS and they are lifesavers in some of the really dark corners of parsing. Even using really good parsing technology (we use GLR parsers), it would be very hard to parse COBOL data declarations (extracting their nesting structure during the parse) without them.
You have an interesting problem in that your language isn't well defined yet. That will make your initial parsers somewhat messy, and you'll revise them a lot. Here's where strong parsing technology helps you: if you can revise your grammar easily you can focus on what you want your language to look like, rather than focusing on fighting the lexer and parser. The fact that you can change your language definition means in fact that if Xtext has some limitations, you can probably bend your language syntax to match without huge amounts of pain. ANTLR does have the proven ability to parse a language pretty much as you imagine it, modulo the usual amount of parser hacking.
What is never discussed is what else is needed to process a language for real. The first thing you need to be able to do is construct ASTs, which ANTLR and YACC will help you do; I presume Xtext does also. You also need symbol tables, control and data flow analysis (both local and global), and machinery to transform your language into something else (presumably something more executable). You will find that just building symbol tables is surprisingly hard: C++ has several hundred pages of "how to look up an identifier"; Java generics are a lot tougher to get right than you might expect. You might also want to prettyprint the AST back to source code, if you want to offer refactorings. (EDIT: Here both ANTLR and Xtext offer what amounts to text-template driven code generation.)
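For a sense of the ante, here is a minimal nested-scope symbol table (an illustrative Python sketch of my own; the C++ lookup rules and Java generics mentioned above are vastly harder than this):

class Scope:
    # Minimal lexically nested symbol table: define in the current
    # scope, look up through the chain of enclosing scopes.
    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}

    def define(self, name, info):
        self.symbols[name] = info

    def lookup(self, name):
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        raise KeyError(f"undeclared identifier: {name}")

globals_ = Scope()
globals_.define("max", {"kind": "function"})
body = Scope(parent=globals_)
body.define("i", {"kind": "local", "type": "int"})
print(body.lookup("max"))   # found in the enclosing (global) scope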
Yet these are complex mechanisms that take as much time, if not more than building the parser. The reason DMS exists isn't because it can parse (we view this just as the ante in a poker game), but because all of this other stuff is very hard and we wanted to amortize the cost of doing it all (DMS has, we think, excellent support for all of these mechanisms but YMMV).
On reading the Xtext overview, it sounds like they have some support for symbol tables but it is unclear what kind of assumption is behind it (e.g., for C++ you have to support multiple inheritance and namespaces).
If you are already started down the ANTLR road and have something running, I'd be tempted to stay the course; I doubt if Xtext will offer you a lot of additional help. If you really really want Xtext's editor, then you can probably switch at the price of restructuring what grammar you have (this is a pretty typical price to pay when changing parsing paradigms). Expect most of your work to appear after you get the parser right, in an ad hoc way. I doubt you will find Xtext or ANTLR much different here.
I guess the simplest answer to your question is: many general-purpose languages can be implemented using Xtext. But since there is no general answer to which parser capabilities a general-purpose language needs, there is no general answer to your question.
However, I've got a few pointers:
With Xtext 2.0 (released this summer), Xtext supports syntactic predicates. This is one of the most-requested features for handling ambiguous syntax without enabling ANTLR's backtracking.
You might want to look at the brand-new languages Xbase and Xtend, which are (judging by their capabilities) general-purpose and which are developed using Xtext. Sven has some nice screencasts on his blog: http://blog.efftinge.de/
Regarding your question why we don't see Xtext-grammars for Java, C++, etc.:
With Xtext, a language is more than just a grammar, so just having a grammar that describes a language's syntax is a good starting point but usually not an artifact valuable enough for shipping. The reason is that with an Xtext grammar you also define the AST's structure (Abstract Syntax Tree, and in fact an Ecore model), including true cross-references. Since this model is the main internal API of your language, people usually spend a lot of thought designing it. Furthermore, to resolve cross-references (aka linking) you need to implement scoping (as it is called in Xtext). Without a proper implementation of scoping you either cannot have true cross-references in your model, or you'll get many linking errors.
I guess my point is that creating a grammar + designing the AST model + implementing scoping is quite a bit more effort than just taking a grammar from some language zoo and translating it to Xtext's syntax.

Text parsing library

A colleague of mine is working on a universal text parsing library, based on C# lambdas. The core looks cool, but unfortunately for me he has hardcoded a grammar specific to his private task: math expression evaluation. So I will not use it as I had intended before I saw the API. And now I'm looking for another library that meets at least some of my requirements. It has to:
Be able to load a grammar from an external file -- say, XML, YAML or JSON.
Return an AST from the grammar, and a parse tree built from any text.
Work fast enough to load a C# grammar and then parse a large code file.
I'd prefer a library whose grammar file format is simple enough to easily write a grammar for math expressions, and that is open source and written in C# or C++ (a rough illustration of the API shape I have in mind follows below).
UPDATED: point 2 has been corrected.
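As an illustrative aside (not one of the C#/C++ options asked for), the "grammar in, tree out" API shape looks roughly like this in Python's Lark library; in practice the grammar text would come from an external file, e.g. open("expr.lark").read(), but it is inlined here to stay self-contained:

from lark import Lark

grammar = r"""
    ?expr: expr "+" term   -> add
         | term
    ?term: term "*" atom   -> mul
         | atom
    ?atom: NUMBER          -> number
         | "(" expr ")"
    %import common.NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, start="expr")      # grammar loaded at runtime
tree = parser.parse("1 + 2 * (3 + 4)")    # parse tree for arbitrary text
print(tree.pretty())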
You might check out Text Transformer which claims to be some kind of universal text processing language. I have no specific experience with it.
Building robust language front ends and usable processing tools is actually a lot of work.
If you want to process computer languages in a generic way, you might consider our DMS Software Reengineering Toolkit, a kind of generalized compiler technology for parsing, analyzing, transforming, and/or generating code (or any other kind of formal document).
DMS will accept arbitrary context-free grammars for languages, automatically builds an AST with no additional specification effort on your part, and is designed to handle not only large files but very large sets of files in a single computation. Normally, people who want to process code need pattern recognition, code analysis and code transformation capabilities; DMS has all of these built in. It also has a variety of predefined, mature grammars for a wide variety of computer languages, well-known (C, C++, C#, COBOL, Java, JavaScript, ...) and otherwise (Natural, EGL, Python, MATLAB, ...), and has been used to carry out massive automated analyses and transformations on programs in these various languages.
DMS does not meet your open-source or C#/C++ implementation requirements. It is implemented as a set of domain-specific languages for describing grammars, analyzers, transformations, prettyprinters, and scripting, and it allows parallel execution to enable complex analyses to run faster than single-threaded programs.

Most effective way to parse C-like definition strings?

I've got a set of function definitions written in a C-like language with some additional keywords that can be put before some arguments (the same way as "unsigned" or "register", for example), and I need to analyze these lines as well as some function stubs and generate actual C code from them.
Is it correct that Flex/Yacc are the most appropriate way to do it?
Will it be slower than writing a shell or Python script using regexps (which, I suppose, may become a big pain if the number of additional keywords grows and their effects turn out to be rather different), given that I have zero experience with analyzers/parsers (though I know how LALR does its job)?
Are there any good materials on Lex/Yacc that cover similar problems? All papers I could find use the same primitive example of a "toy" calculator.
Any help will be appreciated.
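For orientation, a rough sketch of the Flex/Yacc style of attack, written with PLY (Python Lex-Yacc, which mirrors the Lex/Yacc workflow); this is illustrative only, and the token names and the "special" qualifier keyword are invented stand-ins for the real ones:

import ply.lex as lex
import ply.yacc as yacc

# Lexer: map the extra keywords ("special", etc.) onto a QUALIFIER token.
reserved = {"unsigned": "QUALIFIER", "register": "QUALIFIER",
            "special": "QUALIFIER", "int": "TYPE", "char": "TYPE"}
tokens = ["ID", "LPAREN", "RPAREN", "COMMA", "SEMI"] + sorted(set(reserved.values()))

t_LPAREN, t_RPAREN, t_COMMA, t_SEMI = r"\(", r"\)", r",", r";"
t_ignore = " \t\n"

def t_ID(t):
    r"[A-Za-z_]\w*"
    t.type = reserved.get(t.value, "ID")
    return t

def t_error(t):
    t.lexer.skip(1)

# Parser: one function definition line -> a small tuple you can
# later turn into generated C code.
def p_decl(p):
    "decl : TYPE ID LPAREN params RPAREN SEMI"
    p[0] = ("func", p[1], p[2], p[4])

def p_params(p):
    """params : param
              | params COMMA param"""
    p[0] = [p[1]] if len(p) == 2 else p[1] + [p[3]]

def p_param_base(p):
    "param : TYPE ID"
    p[0] = {"type": p[1], "name": p[2], "qualifiers": []}

def p_param_qualified(p):
    "param : QUALIFIER param"
    p[2]["qualifiers"].insert(0, p[1])
    p[0] = p[2]

def p_error(p):
    raise SyntaxError(p)

lexer, parser = lex.lex(), yacc.yacc()
print(parser.parse("int f(special int a, unsigned int b);"))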
ANTLR is commonly used (as are Lex/Yacc).
ANTLR, ANother Tool for Language Recognition, is a language tool that provides a framework for constructing recognizers, interpreters, compilers, and translators from grammatical descriptions containing actions in a variety of target languages.
There is also the Lemon parser, which features a less restrictive grammar. The downside is that you're married to Lemon: rewriting a parser's grammar for something else when you discover some limitation is painful. The upside is that it's really easy to use and self-contained; you can drop it into your tree and not worry about checking for the presence of other tools.
SQLite3 uses it, as do several other popular projects. I'm not saying use it because SQLite does, but perhaps give it a try if time permits.
That entirely depends on your definition of "effective". If you have all the time in the world, the fastest parser would be a hand-written pull parser. They take a long time to debug and develop, but today no parser generator beats hand-written code in terms of runtime performance.
If you want something that can parse valid C within a week or so, use a parser generator. The code will be fast enough and most parser generators come with a grammar for C already which you can use as a starting point (avoiding 90% of the common mistakes).
Note that regexps are not suitable for parsing recursive structures. This approach would both be slower than using a generator and more error prone than a hand-written pull parser.
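A small sketch of that last point (my own, in Python; parse_group is a made-up name): nesting requires recursion or a counter, which a hand-written recursive-descent (pull) parser expresses directly and a single regexp cannot.

def parse_group(tokens, pos=0):
    # Recursive-descent parse of nested parenthesised groups, e.g.
    # "(a(bc)d)" -> ['a', ['b', 'c'], 'd']. Arbitrarily deep nesting
    # like this is exactly what plain regular expressions cannot match.
    items = []
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "(":
            sub, pos = parse_group(tokens, pos + 1)
            items.append(sub)
        elif tok == ")":
            return items, pos + 1
        else:
            items.append(tok)
            pos += 1
    return items, pos

tokens = list("(a(bc)d)")
print(parse_group(tokens, 1)[0])   # ['a', ['b', 'c'], 'd']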
Actually, it depends on how complex your language is and whether it's really close to C or not...
Still, you could use lex as a first step, even just for the regular expressions...
I would go for lex + Menhir and OCaml,
but any flex/yacc combination would be fine.
The main problem with regular Bison (the GNU implementation of yacc) stems from the C typing: you have to describe your whole tree (and all the manipulation functions). Using OCaml would be much easier.
For what you want to do, our DMS Software Reengineering Toolkit is likely a very effective solution.
DMS is designed specifically to support custom analyzers/code generators of the type you are discussing. It provides very strong facilities for defining arbitrary language parsers/analyzers (tested on 30+ real languages, including several complete dialects of C, C++, Java, C#, and COBOL).
DMS automates the construction of ASTs (so you don't have to do anything but get the grammar right to have a usable AST), enables the construction of custom analyses of exactly the pattern-directed kind you indicated, can construct new C-specific ASTs representing the code you want to generate, and can spit them out as compilable C source text. The pre-existing definitions of C for DMS can likely be bent to cover your C-like language.
