How do languages test their parsing logic?

I can see that the OpenJDK project has a large number of regression tests for functionality in the language libraries, such as this. However, it is not obvious to me whether the project contains tests for the actual parsing of Java code itself. I have found a similar lack of coverage in the CPython repository, where the parser just seems to exist without explicit testing.
Is it common to just assume that the transformation of program text to AST objects is a 'given' in a language? Are there examples of languages with an explicit test harness for their parsers?
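For a concrete picture of what such a harness can look like, here is a minimal sketch (my own illustration, not taken from OpenJDK or CPython) using Python's standard ast module: feed the parser known-good and known-bad sources, and assert on the resulting trees and errors.

```python
# A minimal sketch (my own, not from OpenJDK or CPython) of an explicit
# parser test harness, using Python's standard ast module.
import ast
import unittest

class ParserRegressionTests(unittest.TestCase):
    def test_valid_source_produces_expected_tree(self):
        # "Golden tree" style: compare the parsed AST against a known-good
        # dump. (The exact dump format is stable within a Python version.)
        tree = ast.parse("x = 1 + 2")
        self.assertEqual(ast.dump(tree.body[0].targets[0]),
                         "Name(id='x', ctx=Store())")

    def test_invalid_source_is_rejected(self):
        # Negative cases matter just as much: bad input must fail cleanly
        # with a SyntaxError rather than crash or silently parse.
        with self.assertRaises(SyntaxError):
            ast.parse("x = = 1")

if __name__ == "__main__":
    unittest.main()
```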

Related

Programmatic access to fslex and fsyacc

The fslex and fsyacc tools currently require 2-stage compilation, generating files that are then compiled by fsc. It seems to me that these tools would be much easier to use if the source files were embedded resources, fed to fslex and fsyacc programmatically, and the generated code compiled on-the-fly using the CodeDom.
Is this feasible and, if so, what would be required to implement this?
Jon, this is a great question; in fact, one of the design goals I have for fsharp-tools (new lexer- and parser-generator implementations for F#) is for them to be embeddable, specifically to enable scenarios like this.
As of now, I haven't yet implemented the functionality in fsharplex that would let you do this easily, but don't let that deter you; I've written fsharplex (and the other tools in fsharp-tools) in a more-or-less purely functional style, so there shouldn't be any issues with global state or anything like that. It should be relatively straightforward to hack up the compiler code so you can build a regex AST using some combinators, run the compiler to get a compiled DFA, then emit IL for your state machine into a dynamic assembly (which you could then "bake" and execute).
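To make that pipeline concrete, here is a rough sketch of the combinators-to-state-machine idea -- in Python for illustration, and emphatically not fsharplex's actual code -- using Brzozowski derivatives; a real compiler would memoize the distinct derivatives as DFA states and then emit code or tables for them.

```python
# Illustrative sketch only (Python here, not fsharplex's F# code): build a
# regex AST from combinator-style constructors, then match via Brzozowski
# derivatives. Compiling to a DFA amounts to memoizing the distinct
# derivatives as states; that part is omitted.
from dataclasses import dataclass

class Re: pass

@dataclass(frozen=True)
class Empty(Re): pass      # matches no string at all

@dataclass(frozen=True)
class Eps(Re): pass        # matches only the empty string

@dataclass(frozen=True)
class Chr(Re):
    c: str                 # matches a single character

@dataclass(frozen=True)
class Seq(Re):
    a: Re; b: Re           # concatenation

@dataclass(frozen=True)
class Alt(Re):
    a: Re; b: Re           # alternation

@dataclass(frozen=True)
class Star(Re):
    a: Re                  # Kleene star

def nullable(r: Re) -> bool:
    """True if r matches the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Seq):
        return nullable(r.a) and nullable(r.b)
    if isinstance(r, Alt):
        return nullable(r.a) or nullable(r.b)
    return False           # Empty, Chr

def deriv(r: Re, c: str) -> Re:
    """The Brzozowski derivative of r with respect to character c."""
    if isinstance(r, Chr):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Seq):
        head = Seq(deriv(r.a, c), r.b)
        return Alt(head, deriv(r.b, c)) if nullable(r.a) else head
    if isinstance(r, Alt):
        return Alt(deriv(r.a, c), deriv(r.b, c))
    if isinstance(r, Star):
        return Seq(deriv(r.a, c), r)
    return Empty()         # Empty, Eps

def matches(r: Re, s: str) -> bool:
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# (a|b)*ab -- strings over {a, b} ending in "ab"
pattern = Seq(Star(Alt(Chr("a"), Chr("b"))), Seq(Chr("a"), Chr("b")))
assert matches(pattern, "abab")
assert not matches(pattern, "abba")
```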
fsharpyacc currently uses an approach where I've put the bulk of the compilation logic into a purely-functional library, Graham; the idea there is that the grammar analysis/manipulation and parser DFA compilation algorithms should be generic, reusable, and easy to test, so anyone else wanting to build language tools with F# will have a common framework on which to build them. Likewise, contributions/improvements to Graham can easily flow back to fsharpyacc. Eventually, I will modify fsharplex to use this same approach, which will allow you to embed the regex compiler in your own code simply by referencing the NuGet package (you'd just need to write the code to generate IL from the DFA).
fsharplex and fsharpyacc use MEF to allow various backends to be plugged in; for now, they're only targeting fslex and fsyacc for compatibility reasons, but I'd like to implement code-based backends (as opposed to the current table-based backends) to get better performance in the future.
Update -- I just re-read your question and noticed you want to embed the *.fsl and *.fsy files themselves and invoke the respective compilers at run-time. You could accomplish this by compiling the tools and referencing the assemblies from your own projects. IIRC, I exposed an entry point in both compilers so they could be called from outside code; the main entry points (e.g., what gets executed when you invoke the tools from a console) simply parse the command-line arguments then pass them into this "external" entry point.
There is one problem with directly embedding the *.fsl and *.fsy files, though: if you embed them and then run them through fsharplex and fsharpyacc at run-time, your user-defined actions (e.g., the code executed when a lexer or parser rule is matched) will still be specified as F# source code -- you'd need to decide how you want to compile them into executable code.
It should be feasible to provide a parser-combinator-like interface with a backend that uses expression trees (the LISP "eval" of F#) or something similar, for full integration with the language. Or a TypeProvider. There are many options. If table generation is an expensive computation, the result could be cached, for example in a disk cache.
I think nothing except a lack of time, dedication, and expertise prevents us from having tools with a (non-monadic) parser-combinator-like interface, yet an efficient compiled implementation.
Sometimes I get back to this pet project of mine, playing with an algebraic approach to optimizing regular expressions (and lexers) that are specified in source code using combinators and then compiled to a state machine. It still lacks a few key pieces for efficiency, but there it is:
https://github.com/toyvo/ocaml-regex-algebraic

Cross-platform parser development - What are the options?

I'm currently working on a project that makes use of a custom language with a simple context-free grammar.
Due to the project's characteristics, the same language will have to be used on several platforms, especially mobile ones. Currently, I'm using a small hand-written Java parser (for the Android platform). Soon, I'll have to write basically the same parser for JavaScript, and later possibly also for C# (Windows Phone) and Objective-C (iOS). There is an additional chance that I'll also have to write it for PHP.
My question is: What options are there to simplify the parser development process? Do I really have to write basically the same parser for each platform or is there a less work-intensive way?
From a development process point of view the best alternative would enable me to write a grammar definition which would then automatically be compiled into a parser.
However, basically the only cross-platform parser generator I've found so far is the GOLD Parser, which supports two of my target platforms (Java and C#). It would really be awesome if you could point me to other alternatives.
In case you don't know about other cross-platform compiler-compilers: do you have hints on how to structure the code for future language extensibility?
I commend https://en.wikipedia.org/wiki/Comparison_of_parser_generators to your attention: if we restrict the domain to Java and C/C++, it suggests APG, GOLD, SableCC, and SLK (amongst others) as being cross-language enough for your stated goals. (I'm also requiring that the action code be separated from the grammar rather than inline, since the latter would defeat the purpose.) If you want JavaScript as well, it looks like your choices are APG (GPL-licensed) and WaxEye (MIT-licensed).
If your language is reasonably simple, then I would say to just go with whichever you think will be easiest to integrate into your build environment(s) and best matches how you think. Unless parsing time is a huge fraction of your application's total workload, parsing speed should not be an issue -- although table size and memory usage might matter in a mobile context. If your grammar is "simple enough" (i.e., not Perl, for instance), I would expect any of those tools to work.
Have a look at ANTLR; I am using it for transforming Java code and it is really great. Moreover, you can find various grammars here.
The REx parser generator supports the required targets, except for Objective-C and PHP (code generators for those might be possible). It has not yet been published as open source, though, and there is no decent documentation, just sample grammars. But there are projects that are using it successfully, e.g. xqlint. Here is a paper describing the experience from that project.

ANTLR grammar for Scala?

I am trying to build a static analysis tool for a demo project. We are free to choose the language to analyze. I started off by writing a Java code analyzer using ANTLR. I now want to do the same for Scala code. However, I could not find the ANTLR grammar for Scala. Does it exist?
Is there any other machine readable form of Scala grammar?
I don't believe there is such a thing.
The thing is that for any language, but especially for a rich language like Scala, lexical analysis and syntactic analysis are the least interesting and most trivial parts of static analysis. In order to do anything even remotely interesting, you need to perform a significant amount of semantic analysis: desugaring, type inference, type checking, kind checking, macro expansion, overload resolution, implicit resolution, name binding. In short: you need to re-implement more or less the entire Scala compiler, modulo the actual code generation part. Remember that both Scala's macro system and Scala's type system are Turing-complete (in fact, Scala's macro language is Scala itself!): there could be significant compile-time and type-level computation going on that is impossible to analyze without actually performing macro expansion, type inference and type checking.
That is a massive task, and there are in fact only two projects that have successfully done it: one is the Scala compiler itself, the other is the IntelliJ IDEA Scala plugin.
And let's not even talk about compiler plugins, which are able to change the syntax and semantics of Scala in almost arbitrary ways.
But behold, there is hope: the Scala compiler itself provides an API called the Presentation Compiler, which is specifically designed for use by IDEs, code highlighters, and all kinds of static analysis tools. It gives you access to all the information the compiler has during compilation, just before the optimization and code generation phases. It is used by ScalaDoc, the Scala REPL, the Scala Eclipse Plugin, the NetBeans Scala Plugin, SimplyScala.Com, the ENSIME plugin for Emacs, some static analysis tools, and many others.
You can find a Scala grammar for ANTLR at https://github.com/lrlucena/grammars-v4/tree/master/scala . It is based on the Scala Language Specification http://www.scala-lang.org/files/archive/spec/2.11/13-syntax-summary.html .
Is Appendix A of the Scala Language Reference useful for you? It is in EBNF format.
Scalastyle uses Scalariform to do its parsing. With this, you get an AST of case classes. However, you only get the information that is in the file, so, for instance, you don't get inferred types.
If you don't need all of the extra information, then look at Scalariform. The Scalastyle code is fairly easy to understand, start with Checker.scala.

Text parsing library

A colleague of mine is working on a universal text parsing library, based on C# lambdas. The core looks cool, but unfortunately for me he has hardcoded a grammar specific to his own task -- evaluating math expressions. So I will not use it as I had intended to before I saw the API. And now I'm looking for another library that meets at least some of my requirements. It has to:
Be able to load a grammar from an external file -- say, XML, YML or JSON.
Return AST from grammar and parsed tree that is built from any text.
Work fast enough to load a C# grammar and then parse a large code file.
I'd prefer a library whose grammar file format is simple enough to make writing a grammar for math expressions easy, and that is open source and written in C# or C++.
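For illustration of the workflow in points 1 and 2 (though it fails the C#/C++ requirement, being Python), the Lark library accepts a grammar as plain text -- its own EBNF dialect rather than XML/YML/JSON, and it could just as well be read from an external file -- and returns a parse tree. The grammar below is my own toy example:

```python
# A toy illustration (in Python, so it does not meet the C#/C++ requirement)
# of the "grammar as external text, tree back" workflow using the Lark
# library. The grammar string could just as well be loaded from a file.
from lark import Lark

grammar = r"""
    ?expr: expr "+" term   -> add
         | expr "-" term   -> sub
         | term
    ?term: term "*" atom   -> mul
         | term "/" atom   -> div
         | atom
    ?atom: NUMBER          -> number
         | "(" expr ")"

    %import common.NUMBER
    %import common.WS_INLINE
    %ignore WS_INLINE
"""

parser = Lark(grammar, start="expr", parser="lalr")
tree = parser.parse("1 + 2 * (3 - 4)")
print(tree.pretty())   # the parse tree, ready to walk or transform
```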
You might check out TextTransformer, which claims to be a kind of universal text-processing language. I have no specific experience with it.
Building robust language front ends and usable processing tools is actually a lot of work.
If you want to process computer languages in a generic way, you might consider our DMS Software Reengineering Toolkit, a kind of generalized compiler technology for parsing, analyzing, transforming, and/or generating code (or any other kind of formal document).
DMS will accept arbitrary context-free grammars for languages, automatically builds an AST with no additional specification effort on your part, and is designed to handle not only large files but very large sets of files in a single computation. Normally, people who want to process code need pattern recognition, code analysis, and code transformation capabilities; DMS has all of these built in. It also has a variety of predefined, mature grammars for a wide variety of computer languages, both well-known (C, C++, C#, COBOL, Java, JavaScript, ...) and otherwise (Natural, EGL, Python, MATLAB, ...), and has been used to carry out massive automated analyses and transformations on programs in these various languages.
DMS does not meet your open-source or C#/C++ implementation requirements. It is implemented as a set of domain-specific languages for describing grammars, analyzers, transformations, prettyprinters, and scripting, and it supports parallel execution to let complex analyses run faster than single-threaded programs.

Language parsers

I need to parse C#, Ruby and Python source code to generate some reports. I need to get a list of method names inside a class, and I need some other info, such as usage of global variables. Just parsing with regular expressions could be a solution, but I expect a better (more systematic) solution using parsers, if that is easily possible.
What parsers are available for those languages?
For C#, I found http://csparser.codeplex.com/Wikipage , but for the others, I found a bunch of parsers written in those languages, not parsers of those languages.
It may be worth looking into the ANTLR parser generator.
You'll find, on the ANTLR site, grammars for all 3 languages you are interested in (although the Ruby grammar is only for a "simplified" version of the language).
The next difficulty may be to adapt these grammars for the particular target language you would like, i.e. the language in which the parsers themselves will be generated. ANTLR's grammar language is very expressive, allowing one to deal with various context-sensitive languages. This is done by inserting various snippets (in the target language) and/or semantic or syntactic predicates (also in the target language) amid the EBNF-like grammar; consequently, the grammar is a bit messier and may need adapting when the target language is changed. The "native" target language of ANTLR is Java, but many other target languages are supported.
On the whole, ANTLR represents a bit of a setup/learning-curve effort, but since you need to deal with 3 languages, it may well be worth the investment, as it will give you a uniform framework (over which you have "full" control), rather than trying to corral three possibly very distinct, and possibly more "locked down," parsers as you started doing.
All three languages are relatively sophisticated, and although your goal is "merely" to identify methods within programs, you may be able to hack/simplify some of the grammars (or maybe simply "ignore" parts of them), mapping only the few parser-level rules of interest to your eventual goal.
Once these rules are identified, you can apply the same or similar actions, i.e. snippets (in the target language) which implement what you wish to accomplish when the parser encounters such rules (e.g., store the method's signature for future reporting, start counting the number of lines... whatever).
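As a sketch of that pattern with ANTLR 4's Python target: the runtime classes below (FileStream, CommonTokenStream, ParseTreeWalker) are real ANTLR 4 runtime API, but JavaLexer, JavaParser, JavaParserListener, and the enterMethodDeclaration hook are hypothetical generated names that depend entirely on the grammar you generate from.

```python
# Sketch of an ANTLR listener collecting method names, assuming ANTLR 4's
# Python runtime. JavaLexer/JavaParser/JavaParserListener are hypothetical
# classes generated by ANTLR (antlr4 -Dlanguage=Python3 YourGrammar.g4);
# the enterMethodDeclaration hook and identifier() accessor depend entirely
# on the rule names in the grammar you use.
from antlr4 import CommonTokenStream, FileStream, ParseTreeWalker

from JavaLexer import JavaLexer                    # generated (hypothetical)
from JavaParser import JavaParser                  # generated (hypothetical)
from JavaParserListener import JavaParserListener  # generated (hypothetical)

class MethodCollector(JavaParserListener):
    def __init__(self):
        self.methods = []

    def enterMethodDeclaration(self, ctx):
        # Rule/accessor names vary by grammar; adjust to match yours.
        self.methods.append(ctx.identifier().getText())

tokens = CommonTokenStream(JavaLexer(FileStream("Example.java")))
tree = JavaParser(tokens).compilationUnit()
collector = MethodCollector()
ParseTreeWalker().walk(collector, tree)
print(collector.methods)
```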
A final suggestion:
As hinted in comments to the question, and depending on your goals, you may be able to reuse existing utility programs to achieve these goals, directly or indirectly.
Also, because messing with parsers for these sophisticated languages may indeed be somewhat overkill for your possibly simple and error-tolerant goals, the regular-expression approach may somehow fit the bill; the fact of the matter is that none of these languages is regular or even context-free, so success with regexes will depend heavily on your eventual goals and on the input data (programs).
Yet another suggestion!
See Larry Lustig's answer! Introspection may simplify much of your task as well. The implication is that you'd need to a) write your logic within each of the underlying languages and b) integrate/load the programs to be inspected. It all depends, but again, it is a possible way out of the (let's be fair) relatively heavy investment in formal grammar tools.
For Python, the situation is trivial: there is a Python parser in the standard library, as well as a higher-level module for manipulating ASTs.
Also, Python has a somewhat simple grammar (at least if you use the trick to keep an indentation stack in your lexer and inject fake BEGIN and END tokens in your token stream, so that you can treat Python as a simple keyword delimited Algol-like language in your parser), so it is often used as an example grammar for parser generators, which means that you can find literally dozens of Python parsers for pretty much every single parser generator, programming language and platform out there. (E.g., here is a Haskell module implementing a Python lexer and parser.)
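For instance, here is a minimal sketch of the reporting task from the question (method names per class) using nothing but the standard library:

```python
# Method names per class, using only Python's standard ast module.
import ast

source = """
class Greeter:
    greeting = "hello"

    def hello(self):
        return self.greeting

    def bye(self):
        return "bye"
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        methods = [item.name for item in node.body
                   if isinstance(item, ast.FunctionDef)]
        print(node.name, methods)   # Greeter ['hello', 'bye']
```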
For Ruby, there are quite a number of parsers available.
Ruby is incredibly hard to parse, so if you need full fidelity, you pretty much have to use the original YACC grammar file from the YARV Ruby implementation. (parse.y in the top-level source directory.) JRuby's parser is derived from that file, and it is the only one of the implementation parsers that has been explicitly designed to also be used by other clients and not just the interpreter itself. (For example, the Eclipse RDT plugin, the Eclipse DLTK/Ruby plugin, the NetBeans Ruby plugin and the jEdit Ruby syntax highlighting all use JRuby's parser.) To facilitate that, JRuby's parser has actually been repackaged as a separate project.
Of course, there are YACC clones for pretty much every language on the planet. However, be aware that YARV does not use a lex-generated scanner. It uses a hand-written scanner in C, and the YACC grammar also contains quite a few semantic actions written in C. Those parts will have to be re-implemented (as they were in JRuby).
The XRuby compiler is the only full Ruby implementation that does not use YARV's parse.y, it uses an ANTLRv3 grammar and an ANTLRv3 tree grammar that have been developed from scratch. ANTLR can generate parsers for a whole bunch of languages, including for example Java and C#. Its Ruby backend, however, is in dire need of some work.
RedParse is a Ruby parser written in Ruby, which claims to be able to parse all Ruby syntax correctly. It is used, for example, in the YARD Ruby documentation tool to, among other things, extract method names.
ruby_parser is another Ruby parser in Ruby. It is generated from parse.y via the racc parser generator that is part of Ruby's standard library.
YARV actually contains a parser library called ripper, which allows you to parse Ruby code. Unfortunately, it is completely undocumented, so you basically have to figure it out from blog posts. Except, of course, being undocumented, almost nobody else has figured it out and written a blog post yet, either.
However, for your purposes, you don't actually need a full-blown Ruby parser. You only need enough to extract method names and some other stuff.
RDoc, the Ruby documentation generator, contains a Ruby parser which can parse just enough Ruby to, well, extract method names and some other stuff.
Cardinal is a Ruby implementation for the Parrot Virtual Machine. It does not yet run all of Ruby, but its parser should be powerful enough to support all you need. (The parser is written in the Parrot Grammar Engine, so you will obviously have to run it in Parrot, by, for example, writing your reporting tool in Perl6.)
tinyrb is another Ruby implementation that does not run full Ruby but contains a better written parser than YARV. In this case, the parser uses Ian Piumarta's leg Parsing Expression Grammar parser generator.
For Ruby and Python, can't you simply introspect the class to learn the names of the methods? You'd have to write the same functionality in each language, but (at least in Python) there's hardly anything to it.
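For example, a minimal Python sketch of that approach using the standard inspect module:

```python
# Introspecting a class for its method names -- no parsing involved, but
# the code has to be imported (i.e., executed) first.
import inspect

class Greeter:
    def hello(self):
        return "hello"

    def bye(self):
        return "bye"

print([name for name, _ in inspect.getmembers(Greeter, inspect.isfunction)])
# ['bye', 'hello']  (getmembers returns members sorted by name)
```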
The DMS Software Reengineering Toolkit has full, robust C# and Python parsers that automatically build complete ASTs. DMS offers facilities for walking the trees and collecting whatever data you might wish to collect.
Another poster's answer here suggests Ruby is really hard to parse. C++ is also famously hard to parse, and DMS has been used to parse full C++ in a number of dialects, as well as some 30 other languages, so Ruby seems eminently doable. However, DMS doesn't have an off-the-shelf parser for Ruby.
