Programmatic access to fslex and fsyacc - F#

The fslex and fsyacc tools currently require two-stage compilation: they generate source files that are then compiled by fsc. It seems to me that these tools would be much easier to use if the source files were embedded resources, fed to fslex and fsyacc programmatically, and the generated code compiled on the fly using the CodeDom.
Is this feasible and, if so, what would be required to implement this?

Jon, this is a great question; in fact, one of the design goals I have for fsharp-tools (new lexer- and parser-generator implementations for F#) is for them to be embeddable, specifically to enable scenarios like this.
I haven't yet implemented the functionality that would let you do this easily in fsharplex, but don't let that deter you; I've written fsharplex (and the other tools in fsharp-tools) in a more-or-less purely functional style, so there shouldn't be any issues with global state or anything like that. It should be relatively straightforward to hack up the compiler code so you can build a regex AST using some combinators, run the compiler to get a compiled DFA, then emit IL for your state machine into a dynamic assembly (which you could then "bake" and execute).
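To make the combinator idea concrete, here is a minimal, self-contained sketch (not fsharplex's actual API) of a regex AST with a Brzozowski-derivative matcher. The step not shown is grouping the distinct derivatives into DFA states and emitting IL for the resulting state machine into a dynamic assembly:

```fsharp
// A toy regex AST; its constructors serve as the "combinators".
type Regex =
    | Empty                     // matches nothing
    | Eps                       // matches the empty string
    | Chr of char
    | Cat of Regex * Regex
    | Alt of Regex * Regex
    | Star of Regex

/// Does the regex accept the empty string?
let rec nullable = function
    | Empty | Chr _ -> false
    | Eps | Star _ -> true
    | Cat (r, s) -> nullable r && nullable s
    | Alt (r, s) -> nullable r || nullable s

/// Brzozowski derivative: the residual regex after consuming character c.
let rec deriv c = function
    | Empty | Eps -> Empty
    | Chr c' -> if c' = c then Eps else Empty
    | Alt (r, s) -> Alt (deriv c r, deriv c s)
    | Cat (r, s) ->
        let d = Cat (deriv c r, s)
        if nullable r then Alt (d, deriv c s) else d
    | Star r -> Cat (deriv c r, Star r)

let matches r (input : string) =
    nullable (Seq.fold (fun r c -> deriv c r) r input)

// matches (Cat (Star (Chr 'a'), Chr 'b')) "aaab"  -->  true
```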
fsharpyacc currently uses an approach where I've put the bulk of the compilation logic into a purely-functional library, Graham; the idea there is that the grammar analysis/manipulation and parser DFA compilation algorithms should be generic, reusable, and easy to test, so anyone else wanting to build language tools with F# will have a common framework on which to build them. Likewise, contributions/improvements to Graham can easily flow back to fsharpyacc. Eventually, I will modify fsharplex to use this same approach, which will allow you to embed the regex compiler in your own code simply by referencing the NuGet package (you'd just need to write the code to generate IL from the DFA).
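As a rough illustration of that design (these are invented types for the example, not Graham's actual API), a generic grammar representation might look like this, with terminals and nonterminals as type parameters so the analyses stay reusable:

```fsharp
// Hypothetical types, for illustration only -- not Graham's actual API.
type Symbol<'Nonterminal, 'Terminal> =
    | Terminal of 'Terminal
    | Nonterminal of 'Nonterminal

/// A context-free grammar: each nonterminal maps to its alternative
/// right-hand sides, each of which is a list of symbols.
type Grammar<'Nonterminal, 'Terminal when 'Nonterminal : comparison> = {
    Start : 'Nonterminal
    Productions : Map<'Nonterminal, Symbol<'Nonterminal, 'Terminal> list list>
}
```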
fsharplex and fsharpyacc use MEF to allow various backends to be plugged in; for now, they're only targeting fslex and fsyacc for compatibility reasons, but I'd like to implement code-based backends (as opposed to the current table-based backends) to get better performance in the future.
Update -- I just re-read your question and noticed you want to embed the *.fsl and *.fsy files themselves and invoke the respective compilers at run-time. You could accomplish this by compiling the tools and referencing the assemblies from your own projects. IIRC, I exposed an entry point in both compilers so they could be called from outside code; the main entry points (e.g., what gets executed when you invoke the tools from a console) simply parse the command-line arguments then pass them into this "external" entry point.
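The mechanics of the embedding itself are simple; a minimal sketch (the resource name "MyApp.Parser.fsy" is a placeholder for whatever logical name the file was embedded under):

```fsharp
open System.IO
open System.Reflection

/// Read a grammar that was embedded as a resource in this assembly.
let readEmbeddedGrammar () =
    let asm = Assembly.GetExecutingAssembly()
    // Returns null if the resource name doesn't match; check in real code.
    use stream = asm.GetManifestResourceStream "MyApp.Parser.fsy"
    use reader = new StreamReader(stream)
    reader.ReadToEnd()
```

The resulting text could then be handed to the "external" entry point mentioned above (its exact signature isn't shown here).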
There is one problem with directly embedding the *.fsl and *.fsy files though; if you embed them, then run them through fsharplex and fsharpyacc at run-time, your user-defined actions (e.g., the code executed when a lexer or parser rule is matched) will still be specified as F# source code -- you'd need to decide how you want to compile them into executable code.
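One way to close that loop, in the spirit of the CodeDom idea from the question, is to compile the emitted F# source in memory. A sketch, assuming the F# PowerPack's FSharpCodeProvider is referenced (generatedSource stands for whatever text the tools emitted):

```fsharp
open System.CodeDom.Compiler
open Microsoft.FSharp.Compiler.CodeDom   // F# PowerPack's CodeDom provider

/// Compile generated F# source to an in-memory assembly.
let compileInMemory (generatedSource : string) =
    use provider = new FSharpCodeProvider()
    let options = CompilerParameters(GenerateInMemory = true)
    options.ReferencedAssemblies.Add "FSharp.Core.dll" |> ignore
    let results = provider.CompileAssemblyFromSource(options, generatedSource)
    if results.Errors.HasErrors then
        failwithf "compilation failed: %A" results.Errors
    results.CompiledAssembly   // reflect over this to reach the lexer/parser
```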

It should be feasible to provide a parser-combinator-like interface with a backend that uses expression trees (the LISP "eval" of F#) or something similar, for full integration with the language. Or else a type provider. There are many options. If table generation is an expensive computation, the result could be cached, for example in a disk cache.
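A minimal sketch of that disk-cache idea (generateTables is a placeholder for the expensive table construction, and the hash-based key is just one possible scheme):

```fsharp
open System
open System.IO
open System.Security.Cryptography
open System.Text

/// Look up precomputed tables by a hash of the grammar text,
/// regenerating and storing them only on a cache miss.
let cachedTables (cacheDir : string) (generateTables : string -> byte[]) (grammar : string) =
    use sha = SHA256.Create()
    let hash = sha.ComputeHash(Encoding.UTF8.GetBytes grammar)
    // Make the base64 hash safe to use as a file name.
    let key = Convert.ToBase64String(hash).Replace('/', '_').Replace('+', '-')
    let path = Path.Combine(cacheDir, key + ".tables")
    if File.Exists path then
        File.ReadAllBytes path
    else
        let tables = generateTables grammar
        Directory.CreateDirectory cacheDir |> ignore
        File.WriteAllBytes(path, tables)
        tables
```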
I think nothing except a lack of time, dedication, and expertise prevents us from having tools with a (non-monadic) parser-combinator-like interface backed by an efficient compiled implementation.
Sometimes I get back to this pet project of mine, playing with an algebraic approach to optimizing regular expressions (and lexers) specified in source using combinators and then compiled to a state machine. It still lacks a few key pieces for efficiency, but here it is:
https://github.com/toyvo/ocaml-regex-algebraic

Related

What makes libadalang special?

I have been reading about libadalang and I am very impressed by it. However, I was wondering if this technique has already been used, and whether another language supports a library for syntactically and semantically analyzing its code. Is this a unique approach?
C and C++: libclang "The C Interface to Clang provides a relatively small API that exposes facilities for parsing source code into an abstract syntax tree (AST), loading already-parsed ASTs, traversing the AST, associating physical source locations with elements within the AST, and other facilities that support Clang-based development tools." (See libtooling for a C++ API)
Python: See the ast module in the Python Language Services section of the Python Library manual. (The other modules can be useful, as well.)
Javascript: The ongoing ESTree effort is attempting to standardize parsing services over different Javascript engines.
C# and Visual Basic: See the .NET Compiler Platform ("Roslyn"); a short sketch of its use follows this list.
I'm sure there are lots more; those ones just came off the top of my head.
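To give a flavour of such an API, here is a minimal F# sketch using Roslyn to parse a C# snippet and walk its syntax tree (requires the Microsoft.CodeAnalysis.CSharp NuGet package; the snippet itself is arbitrary):

```fsharp
open Microsoft.CodeAnalysis.CSharp

let tree = CSharpSyntaxTree.ParseText "class C { void M() { int x = 1; } }"
let root = tree.GetRoot()

// Walk every node in the tree, printing its node type and source text.
for node in root.DescendantNodes() do
    printfn "%s: %s" (node.GetType().Name) (node.ToString())
```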
For a practical and theoretical grounding, you should definitely (re)visit the classic textbook Structure and Interpretation of Computer Programs by Abelson & Sussman (1st edition 1985, 2nd edition 1996), which helped popularise the idea of metacircular interpretation -- that is, treating a computer program as a formal data structure which can be interpreted (or otherwise analysed) programmatically.
You can see libadalang as ASIS Mark II. AdaCore seems to be rethinking ASIS in a way that will support both what ASIS can already do and more lightweight operations, where the source doesn't have to compile before it can be analysed.
Hopefully the final API will be nicer than that of ASIS.
So no, it is not a unique approach. It has already been done for Ada. (But I'm not aware of similar libraries for other languages.)

What are common properties in an Abstract Syntax Tree (AST)?

I'm new to compiler design and have been watching a series of YouTube videos by Ravindrababu Ravula.
I am creating my own language for fun and I'm parsing it to an Abstract Syntax Tree (AST). My understanding is that these trees can be portable given they follow the same structure as other languages.
How can I create an AST that will be portable?
Side notes:
My parser is currently written in JavaScript, but I might move it to C#.
I've been looking at SpiderMonkey's specs for guidance. Is that a good approach?
Portability (however defined) is not likely to be your primary goal in building an AST. Few (if any) compiler frameworks provide a clear interface which allows the use of an external AST, and particular AST structures tend to be badly-documented and subject to change without notice. (Even if they are well-documented, the complexity of a typical AST implementation is challenging.)
An AST is closely tied to the syntactic details of a language, as well as to the particular parsing strategy being used. While it is useful to be able to repurpose ASTs for multiple tasks -- compiling, linting, pretty-printing, interactive editing, static analysis, etc. -- the conflicting demands of these different use cases tend to increase complexity. Particularly in the beginning stages of language development, you'll want to give yourself a lot of scope for rapid prototyping.
The most tempting reason for a portable AST would be to use some other language as a target, thereby saving the cost of writing code generation, etc. However, in practice it is usually easier to generate the textual representation of the other language from your own AST than to force your parser to use a foreign AST. Even better is to target a well-documented virtual machine (LLVM, .NET IL, JVM, etc.), which is often not much more work than generating, say, C code.
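As a toy illustration of generating the target language's text from your own AST (the AST shape here is invented for the example):

```fsharp
// A tiny expression AST, pretty-printed as C-like source text.
type Expr =
    | Num of int
    | Var of string
    | Add of Expr * Expr
    | Mul of Expr * Expr

let rec emit = function
    | Num n -> string n
    | Var v -> v
    | Add (a, b) -> sprintf "(%s + %s)" (emit a) (emit b)
    | Mul (a, b) -> sprintf "(%s * %s)" (emit a) (emit b)

// emit (Mul (Add (Var "x", Num 1), Var "y"))  -->  "((x + 1) * y)"
```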
You might want to take a look at the LLVM Kaleidoscope tutorial (the second section covers ASTs, although implemented in C++). Also, you might find this question on a sister site interesting reading. And finally, if you are going to do your implementation in JavaScript, you should at least take a look at the jison parser generator, which takes a lot of the grunt work out of maintaining a parser and scanner (and thus allows for easier experimentation).

Cross-platform parser development - What are the options?

I'm currently working on a project that makes use of a custom language with a simple context-free grammar.
Due to the project's characteristics the same language will have to be used on several platforms, especially mobile ones. Currently, I'm using my small hand-written Java parser (for the Android platform). Soon, I'll have to write basically the same parser for JavaScript, and later possibly also for C# (Windows Phone) and Objective-C (iOS). There is an additional chance that I'll also have to write it for PHP.
My question is: What options are there to simplify the parser development process? Do I really have to write basically the same parser for each platform or is there a less work-intensive way?
From a development process point of view the best alternative would enable me to write a grammar definition which would then automatically be compiled into a parser.
However, basically the only cross-platform parser generator I've found so far is the GOLD Parser, which supports two of my target platforms (Java and C#). It would really be awesome if you could point me to other alternatives.
In case you don't know about other cross-platform compiler-compilers: Do you have hints how to structure the code towards future language extensibility?
I commend https://en.wikipedia.org/wiki/Comparison_of_parser_generators to your attention: if we restrict the domain to Java and C/C++, it suggests APG, GOLD, SableCC, and SLK (amongst others) as being cross-language enough for your stated goals. (I'm also requiring that the action code be separated from the grammar rather than inline, since the latter would defeat the purpose.) If you want JavaScript as well, it looks like your choices are APG (GPL-licensed) and WaxEye (MIT-licensed).
If your language is reasonably simple, then I would say to just go with whichever you think will be easiest to integrate into your build environment(s) and has a reasonable match with how you think. Unless parsing time is a huge fraction of your application's total workload, parsing speed should not be an issue -- although table size and memory usage might matter in a mobile context. If your grammar is "simple enough" (i.e., not Perl, for instance), I would expect any of those tools to work.
Have a look at ANTLR; I am using it for transforming Java code and it is really great. Moreover, you can find various ready-made grammars here.
The REx parser generator supports the required targets except for Objective-C and PHP (code generators for those might be possible). It has not yet been published as open source, though, and there is no decent documentation, just sample grammars. But there are projects using it successfully, e.g. xqlint. Here is a paper describing the experience from that project.

What's the easiest way to build an F# compiler that runs on the JVM and generates Java bytecode?

The current F# Compiler is written in F#, is open source and runs on .Net and Mono, allowing it to execute on many platforms including Windows, Mac and Linux. F#'s Code Quotations mechanism has been used to compile F# to JavaScript in projects like WebSharper, Pit and FunScript. There also appears to be some interest in running F# code on the JVM.
I believe a version of the OCaml compiler was originally used to bootstrap the F# compiler.
If someone wanted to build an F# compiler that runs on the JVM would it be easier to:
Modify the existing F# compiler to emit Java bytecode and then compile the F# compiler with it?
Use a JVM-based ML compiler like Yeti to bootstrap a minimal F# compiler on the JVM?
Re-write the F# compiler from scratch in Java as the fjord project appears to be attempting?
Something else?
Another option that should probably be considered is to convert .NET CLR bytecode into JVM bytecode, like http://www.ikvm.net does in the reverse direction (JVM -> CLR bytecode). Although this approach has been considered and dismissed by the fjord owner.
Getting buy-in from the top with option 1), and having the F# compiler team support pluggable backends that could emit Java bytecode, sounds in theory like it would produce the most polished solution.
But if you look at other languages that have been ported to different platforms, this is rarely the case. Most of the time it's been a rewrite from scratch. That is likely because the original language team had no interest in supporting alternative platforms themselves, and because the original host implementation might not have been able to support multiple backends, or was already too slow for that to be a feasible option to start with.
My hunch is that the best route is a combination of rewriting from scratch and sharing as much code and automation as possible with the original implementation. For example, if the test suites could be reused for both implementations, that would take a lot of the burden off the JVM port and go a long way toward ensuring language parity.
If I really had to do this, I would probably start with the #1 approach - add JVM backend to the existing compiler. But I would also try to argue for a different target VM.
Quotations are not very relevant -- as an author of WebSharper I can assure you that while quotations can give you a nice F#-like language to program with, they are restrictive and not optimized. I imagine that for potential JVM F# users the bar would be a lot higher: full language compatibility and comparable performance. This is very hard.
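For readers unfamiliar with quotations, here is a minimal example of the raw material such translators work from -- F# code reified as an expression tree that can be pattern-matched:

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns

// <@@ ... @@> reifies the enclosed code as an untyped expression tree.
let expr : Expr = <@@ 1 + 2 @@>

match expr with
| Call (None, mi, [_; _]) -> printfn "a static call to %s" mi.Name  // "op_Addition"
| _ -> printfn "something else"
```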
Take tail calls, for example. In WebSharper we apply heuristics to optimize some local tail calls to loops in JavaScript, but that is not enough -- you cannot in general rely on TCO the way you can in ordinary F# libraries. This is OK for WebSharper, as our users do not expect to have full F#, but it will not be OK for a JVM F# port. I believe most JVM implementations do not do TCO, so it would have to be implemented with some indirection, introducing a performance hit.
The bytecode recompilation approach mentioned by @mythz sounds very attractive, as it allows porting more than just F# -- ideally it allows porting more .NET software to the JVM. I worked quite a bit with .NET bytecode analysis on an internal WebSharper 3.0 project -- we are looking at the option of compiling .NET bytecode instead of F# quotations to JavaScript. But there are huge challenges there:
A lot of code in the BCL is opaque (native), and you cannot decompile it.
The generics model is fairly complicated. I have implemented a JavaScript runtime that models class and method generics, instantiation, type generation, and basic reflection with some precision and reasonable performance. This was difficult enough in dynamic JavaScript with closures, and it seems quite difficult to do in a performant way on the JVM -- but maybe I just do not see a simple solution.
Value types create significant complications in the bytecode. I have yet to figure this one out for WebSharper 3.0. They cannot be ignored, either, as they are used extensively by many libraries you would want ported.
Similarly, basic reflection is used in many real-world .NET libraries -- and it is a nightmare to cross-compile, both because of the amount of native code involved and because of the need to properly support generics and value types.
Also, the bytecode approach does not remove the question of how to implement tail calls. AFAIK, Scala does not implement general tail calls (the compiler only optimizes direct self-recursion). They certainly have the talent and the funding to do it; the fact that they do not tells me a lot about how practical TCO on the JVM is. For our .NET-to-JavaScript port I will probably go a similar route: no TCO guarantees unless you specifically ask for trampolining, which will work but cost you an order of magnitude (or two) in performance.
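A minimal sketch of what such trampolining might look like in F# (invented for illustration; real implementations are more elaborate):

```fsharp
// Each step either finishes with a value or returns the next step as a
// thunk; the driver below runs on constant stack, with no reliance on
// tail-call support in the runtime.
type Bounce<'T> =
    | Done of 'T
    | More of (unit -> Bounce<'T>)

let run (bounce : Bounce<'T>) =
    let mutable current = bounce
    let mutable result = None
    while Option.isNone result do
        match current with
        | Done v -> result <- Some v
        | More k -> current <- k ()
    Option.get result

// Mutual recursion that could overflow the stack without TCO:
let rec isEven n = if n = 0 then Done true  else More (fun () -> isOdd  (n - 1))
and     isOdd  n = if n = 0 then Done false else More (fun () -> isEven (n - 1))

// run (isEven 1000000)  -->  true
```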
There is a project that compiles OCaml to the JVM, OCaml-Java: it's pretty complete and, in particular, can compile the OCaml compiler's sources (written in OCaml). I'm not sure which aspects of the F# language you're interested in, but if you're mainly looking to get a mature, strict, typed functional language onto the JVM, that may be a good option.
I suspect any approach would be a lot of work, but I think your first suggestion is the only one that would avoid introducing lots of additional incompatibilities and bugs. The compiler's pretty complex and there are a lot of corner cases around overload resolution, etc. (and the spec probably has gaps too), so it seems very unlikely that a new implementation would have consistently compatible semantics.
Modify the existing F# compiler to emit Java bytecode and then compile the F# compiler with it?
Use a JVM-based ML compiler like Yeti to bootstrap a minimal F# compiler on the JVM?
Porting the compiler shouldn't be that hard if it is written in F#.
I'd probably go the first way, because this is the only way one could hope to keep the new compiler in sync with the .NET F# compiler.
Re-write the F# compiler from scratch in Java as the fjord project appears to be attempting?
This is certainly the least elegant approach, IMHO.
Something else?
When the compiler is done, you'll have 90% of the work left to do.
For example -- I don't know much F#, but I assume it is easy to use any .NET library out there.
That means the basic problem is to port the .NET ecosystem, somehow.
I was looking for something along similar lines, though it was more like an F#-to-Akka translator/compiler. As far as F# -> JVM is concerned, I came across two not-quite-production-ready options:
1. F# -> Fjord -> JVM
2. F# -> FunScript -> Vert.X -> JVM

Creating a Haskell REPL within a Haskell application

I'm trying to embed a Haskell REPL within one of my Haskell applications. The idea would be that only a subset of the Haskell libraries would be loaded by default, plus my own set of functions, and the user would use those in order to interact with the environment.
To solve this problem, I know one way would be to create a (mini-)Haskell parser + evaluator and map my mini-Haskell parser's functions to actual Haskell functions, but I'm sure there is a better way to do this.
Is there a nice and clean way to build a REPL for Haskell using Haskell?
A few things that already exist:
GHCi, of course, both in the sense of being able to look at how it's implemented and in the sense of being able to use it directly (i.e., having your REPL just talk to GHCi via stdin/stdout).
The full GHC API, which lets you hook into GHC and let it do all the heavy lifting for you--loading files, chasing dependencies, parsing, type checking, etc.
hint, which is a wrapper around a subset of the GHC API, with a focus on interactive interpretation rather than compilation--which seems to fit what you want to do.
mueval, an evaluator with limits on loaded modules, resource use, etc. -- basically a "safe" interactive mode. It's what lambdabot uses, if you've ever been in the #haskell IRC channel.
All of the above are assuming that you don't want to deal with writing a Haskell interpreter yourself, which is probably the case.
