Implementing Macros in a Rascal language project

Any idea on how to implement macro syntax with Rascal, and also how to implement the typing and expansion (translation) of the macro syntax in Rascal? Any link to projects or repositories on this problem would also be appreciated.

Macros are definitions of code substitutions in syntax trees, which is definitely one of the main features of Rascal. Questions I would have before advising specific techniques:
adding macros to an existing language, or to a new language?
macros at refactoring time, at compile time, or at run time?
which would inform the question of whether to implement macros on concrete syntax trees or abstract syntax trees.
I would not say macros are a "problem" per se. The raw substitutions in syntax trees are trivial with Rascal. However, "hygienic macros" are more involved: here we have to consider the capturing of variables by the expanded macro bodies, and what we can do about this (renaming) to avoid it. The literature on how to make macros hygienic is plentiful. The complexity of hygienic macros depends on the type and name analysis (scoping) system of the base language that the macros are added to.
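To make the capture problem concrete, here is a deliberately naive sketch (in Python, since the point is language-independent; the swap macro, its textual expansion, and all names are invented for illustration - real expanders work on trees, not strings):
# A swap(a, b) "macro" whose body uses an internal temporary called tmp.
MACRO_BODY = "tmp = {a}; {a} = {b}; {b} = tmp"

def expand_swap(a: str, b: str) -> str:
    # Unhygienic expansion: plain substitution of the arguments.
    return MACRO_BODY.format(a=a, b=b)

print(expand_swap("x", "y"))    # tmp = x; x = y; y = tmp      -- fine
print(expand_swap("tmp", "y"))  # tmp = tmp; tmp = y; y = tmp  -- capture! the swap is broken

_counter = 0

def expand_swap_hygienic(a: str, b: str) -> str:
    # Hygienic expansion: rename the macro's internal variable to a
    # fresh name first, so it cannot collide with the caller's names.
    global _counter
    _counter += 1
    fresh = "tmp_{}".format(_counter)
    return MACRO_BODY.replace("tmp", fresh).format(a=a, b=b)

print(expand_swap_hygienic("tmp", "y"))  # tmp_1 = tmp; tmp = y; y = tmp_1 -- correct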
If you have a DSL that you want to translate in stages to the target code, that can also be called "macros", but you will not find that name in the documentation. Here is an example: https://github.com/usethesource/flybytes/blob/main/src/lang/flybytes/macros/ControlFlow.rsc where "macro" is used to rewrite an additional AST node to its semantics in the "core" language.
The basic mechanisms are:
pattern matching: detects what you want to expand; with macros this is often a single ADT constructor, but it can also be a more complex special case, like matching i+=1 to substitute it with i++.
substitution: at the location where the match was found, we create a new AST value in a simpler language but with the same semantics. This is done with AST expressions in Rascal, the => operator in visit and insert statements, and return and = in functions.
traversal: guiding the pattern matching and substitution without having to write too many boilerplate recursive functions.
Small example:
data Bool(loc src=|unknown:///|)
  = \and(Bool l, Bool r)
  | \or(Bool l, Bool r)
  | \true()
  | \false()
  | \not(Bool a)
  ;
I extend the language with a "macro":
data Bool = impl(Bool l, Bool r);
A first option is to rewrite the constructor immediately and always with an overloaded function:
Bool impl(Bool l, Bool r) = or(not(l), r);
However, we lose some information here for debugging purposes, so let's try to keep the information intact:
Bool impl(Bool l, Bool r, src=loc s) = or(not(l), r, src=s);
Sometimes we want to delay the expansion to a specific stage in the compiler. In particular, with the above "rewrite rule" a type-checker will no longer see the difference between ==> and ||, which sometimes creates usability issues with error messages.
In that case we wrap the expansion in a visit and stage it as a function:
Bool macroExpansion(Bool input) = visit(input) {
  case impl(Bool l, Bool r, src=loc s) => or(not(l), r, src=s)
  // add more rules here
};
It is also possible to encapsulate rewrite rules as reusable functions:
Bool expand1(impl(Bool l, Bool r, src=loc s)) = or(not(l), r, src=s);
Bool expand2(not(not(Bool b))) = b;
and then pass those around or apply them: (expand1 + expand2)(myBool)
So to wrap this up:
pattern matching is the key to macro expansion; patterns can be wrapped in functions or visit cases or both, and functions can be passed around and combined.
take care to do some "origin tracking" and forward the src fields to the right-hand sides of rewrite rules, otherwise the generated code does not know where it came from.

Related

What is the point of op_Quotation if it cannot be used?

According to the F# specification for operator overloading,
<# #> op_Quotation
<## ##> op_QuotationUntyped
are given along with many other operators. Unless I'm missing something, I don't believe that I can use these for custom types, so why are they listed?
I think you are right that there is no way of actually using those as custom operators. I suspect they are treated as operators in case this turns out to be useful, at some point in the future of the language, for some clever new feature.
The documentation really merely explains how the names of the operators get encoded. For non-special operator names, F# encodes them in a systematic way; for the ones listed on that page, it has a special, nicer name. Consider this type:
type X() =
    static member (<^><>) (a:int, b:int) = a + b
    static member (<# #>) (a:int, b:int) = a + b
If you look at the names of those members:
[ for m in typeof<X>.GetMembers() -> m.Name ]
You see that the first operator got compiled as op_LessHatGreaterLessGreater, while the second one as op_Quotation. So this is where the name mentioned in the table comes from - it is probably good that this is documented somewhere, but I think you're right that it is not particularly useful!
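As a toy illustration of that systematic encoding, here is a sketch (in Python, not F#) that reproduces the compiled name above; the character table is hypothetical and covers only the three characters this example exercises:
# Map operator characters to the names the compiler uses for them,
# as witnessed by the op_LessHatGreaterLessGreater example above.
CHAR_NAMES = {"<": "Less", "^": "Hat", ">": "Greater"}

def encode_operator(op: str) -> str:
    # "op_" followed by the name of each character in turn.
    return "op_" + "".join(CHAR_NAMES[c] for c in op)

print(encode_operator("<^><>"))  # op_LessHatGreaterLessGreater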

Starting a parser for the Scheme language

I am writing a basic parser for a Scheme interpreter and here are the definitions I have set up to define the various type of tokens:
# 1. Parens
Type: PAREN
Subtype: LEFT_PAREN
Value: '('

# 2. Operators (<=, =, +, ...)
Type: OPERATOR
Subtype: EQUALS
Value: '='
Arity: 2

# 3. Types (2.5, "Hello", #f, etc.)
Type: DATA
Subtype: NUMBER
Value: 2.4

# 4. Procedures, builtins, and such
Type: KEYWORD
Subtype: BUILTIN
Value: "set"
Arity: 2
PROCEDURE: ... // probably need a new class for this
Does the above seem like it's a good starting place? Are there some obvious things I'm missing here, or does this give me a "good-enough" foundation?
Your approach makes distinctions which really don't exist in the syntax of the language, and also makes decisions far too early. For example consider this program:
(let ((x 1))
  (with-assignment-notes
   (set! x 2)
   (set! x 3)
   x))
When I run this:
> (let ((x 1))
    (with-assignment-notes
     (set! x 2)
     (set! x 3)
     x))
setting x to 2
setting x to 3
3
In order for this to work, with-assignment-notes has to somehow redefine what (set! ...) means in its body. Here's a hacky and probably incorrect (Racket) implementation of that:
(define-syntax with-assignment-notes
  (syntax-rules (set!)
    [(_ form ...)
     (let-syntax ([rewrite/maybe
                   (syntax-rules (set!)
                     [(_ (set! var val))
                      (let ([r val])
                        (printf "setting ~A to ~A~%" 'var r)
                        (set! var r))]
                     [(_ thing)
                      thing])])
       (rewrite/maybe form) ...)]))
So the critical features of any parser for a Lisp-family language are:
it should not make any decision about the semantics of the language that it can avoid making;
the structure it constructs must be available to the language itself as first-class objects;
(and optionally) the parser should be modifiable from the language itself.
As examples:
it is probably inevitable that the parser needs to make decisions about what is and is not a number and what sort of number it is;
it would be nice if it had default handling for strings, but this should ideally be controllable by the user;
it should make no decision at all about what, say, (< x y) means, but rather should return a structure representing it for interpretation by the language.
The reason for the last, optional, requirement is that Lisp-family languages are used by people who are interested in using them for implementing languages. Allowing the reader to be altered from within the language makes that hugely easier, since you don't have to start from scratch each time you want to make a language which is a bit like the one you started with but not completely.
Parsing Lisp
The usual approach to parsing Lisp-family languages is to have machinery which will turn a sequence of characters into a sequence of s-expressions consisting of objects which are defined by the language itself, notably symbols and conses (but also numbers, strings &c). Once you have this structure you then walk over it to interpret it as a program: either evaluating it on the fly or compiling it. Critically, you can also write programs which manipulate this structure itself: macros.
In 'traditional' Lisps such as CL this process is explicit: there is a 'reader' which turns a sequence of characters into a sequence of s-expressions, and macros explicitly manipulate the list structure of these s-expressions, after which the evaluator/compiler processes them. So in a traditional Lisp (< x y) would be parsed as (a cons of a symbol < and (a cons of a symbol x and (a cons of a symbol y and the empty list object)), or (< . (x . (y . ()))), and this structure gets handed to the macro expander and hence to the evaluator or compiler.
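If you have never written one, the characters-to-s-expressions step is smaller than it sounds. Here is a toy version (a Python sketch, using Python lists where a Lisp would use conses, and with no support for dots, quotes or strings):
def tokenize(src: str) -> list:
    # Pad parens with spaces so that split() yields them as separate tokens.
    return src.replace("(", " ( ").replace(")", " ) ").split()

def read(tokens: list):
    tok = tokens.pop(0)
    if tok == "(":
        lst = []
        while tokens[0] != ")":
            lst.append(read(tokens))   # read sub-forms recursively
        tokens.pop(0)                  # discard the closing ")"
        return lst
    try:
        return int(tok)                # a number literal
    except ValueError:
        return tok                     # everything else is a symbol

print(read(tokenize("(< x y)")))       # ['<', 'x', 'y']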
In Scheme it is a little more subtle: macros are specified (portably, anyway) in terms of rules which turn a bit of syntax into another bit of syntax, and it's not (I think) explicit whether such objects are made of conses & symbols or not. But the structure which is available to syntax rules needs to be as rich as something made of conses and symbols, because syntax rules get to poke around inside it. If you want to write something like the following macro:
(define-syntax with-silly-escape
  (syntax-rules ()
    [(_ (escape) form ...)
     (call/cc (λ (c)
                (define (escape) (c 'escaped))
                form ...))]
    [(_ (escape val ...) form ...)
     (call/cc (λ (c)
                (define (escape) (c val ...))
                form ...))]))
then you need to be able to look into the structure of what came from the reader, and that structure needs to be as rich as something made of lists and conses.
A toy reader: reeder
Reeder is a little Lisp reader written in Common Lisp that I wrote a little while ago for reasons I forget (but perhaps to help me learn CL-PPCRE, which it uses). It is emphatically a toy, but it is also small enough and simple enough to understand: certainly it is much smaller and simpler than the standard CL reader, and it demonstrates one approach to solving this problem. It is driven by a table known as a reedtable which defines how parsing proceeds.
So, for instance:
> (with-input-from-string (in "(defun foo (x) x)")
    (reed :from in))
(defun foo (x) x)
Reeding
To read (reed) something using a reedtable:
look for the next interesting character, which is the next character not defined as whitespace in the table (reedtables have a configurable list of whitespace characters);
if that character is defined as a macro character in the table, call its function to read something;
otherwise call the table's token reader to read and interpret a token (a rough sketch of this dispatch loop follows).
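In outline, that dispatch loop might look like the following sketch (Python; the table layout and the names are mine, not reeder's actual API):
import io

def reed(stream, table):
    ch = stream.read(1)
    while ch and ch in table["whitespace"]:          # 1. skip whitespace
        ch = stream.read(1)
    if ch in table["macro_chars"]:                   # 2. macro character?
        return table["macro_chars"][ch](stream, ch, table)
    return table["read_token"](stream, ch, table)    # 3. plain token

def read_token(stream, first, table):
    # Minimal token reader: accumulate until whitespace or EOF.
    tok, ch = first, stream.read(1)
    while ch and ch not in table["whitespace"]:
        tok, ch = tok + ch, stream.read(1)
    return tok

table = {"whitespace": " \t\n", "macro_chars": {}, "read_token": read_token}
print(reed(io.StringIO("  hello world"), table))     # hello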
Reeding tokens
The token reader lives in the reedtable and is responsible for accumulating and interpreting a token:
it accumulates a token in ways known to itself (but the default one does this by just trundling along the string handling single (\) and multiple (|) escapes defined in the reedtable until it gets to something that is whitespace in the table);
at this point it has a string and it asks the reedtable to turn this string into something, which it does by means of token parsers.
There is a small kludge in the second step: as the token reader accumulates a token it keeps track of whether it is 'denatured' which means that there were escaped characters in it. It hands this information to the token parsers, which allows them, for instance, to interpret |1|, which is denatured, differently to 1, which is not.
Token parsers are also defined in the reedtable: there is a define-token-parser form to define them. They have priorities, so that the highest-priority one gets tried first, and they get to say whether they should be tried for denatured tokens. Some token parser should always apply: it's an error if none does.
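A sketch of that priority scheme (Python again; the names are invented, and reeder's real token parsers are driven by regexps, as the cymbal example below shows):
token_parsers = []   # kept sorted, highest priority first

def define_token_parser(name, priority, denatured, fn):
    token_parsers.append((priority, name, denatured, fn))
    token_parsers.sort(key=lambda p: -p[0])

def parse_token(text, is_denatured):
    for priority, name, handles_denatured, fn in token_parsers:
        if is_denatured and not handles_denatured:
            continue                   # this parser declines denatured tokens
        result = fn(text)
        if result is not None:         # None means "not my kind of token"
            return result
    raise SyntaxError("no token parser applies to " + repr(text))

define_token_parser("integer", 10, False,
                    lambda s: int(s) if s.lstrip("-").isdigit() else None)
define_token_parser("symbol", 0, True, lambda s: ("symbol", s))

print(parse_token("42", False))   # 42, via the integer parser
print(parse_token("1", True))     # ('symbol', '1'): denatured, so not a number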
The default reedtable has token parsers which can parse integers and rational numbers, and a fallback one which parses a symbol. Here is an example of how you would replace this fallback parser so that instead of returning symbols it returns objects called 'cymbals' which might be the representation of symbols in some embedded language:
Firstly we want a copy of the reedtable, and we need to remove the symbol parser from that copy (having previously checked its name using reedtable-token-parser-names).
(defvar *cymbal-reedtable* (copy-reedtable nil))
(remove-token-parser 'symbol *cymbal-reedtable*)
Now here's an implementation of cymbals:
(defvar *namespace* (make-hash-table :test #'equal))

(defstruct cymbal
  name)

(defgeneric ensure-cymbal (thing))

(defmethod ensure-cymbal ((thing string))
  (or (gethash thing *namespace*)
      (setf (gethash thing *namespace*)
            (make-cymbal :name thing))))

(defmethod ensure-cymbal ((thing cymbal))
  thing)
And finally here is the cymbal token parser:
(define-token-parser (cymbal 0 :denatured t :reedtable *cymbal-reedtable*)
    ((:sequence
      :start-anchor
      (:register (:greedy-repetition 0 nil :everything))
      :end-anchor)
     name)
  (ensure-cymbal name))
An example of this. Before modifying the reedtable:
> (with-input-from-string (in "(x y . z)")
    (reed :from in :reedtable *cymbal-reedtable*))
(x y . z)
After:
> (with-input-from-string (in "(x y . z)")
    (reed :from in :reedtable *cymbal-reedtable*))
(#S(cymbal :name "x") #S(cymbal :name "y") . #S(cymbal :name "z"))
Macro characters
If something isn't the start of a token then it's a macro character. Macro characters have associated functions and these functions get called to read one object, however they choose to do that. The default reedtable has two-and-a-half macro characters:
" reads a string, using the reedtable's single & multiple escape characters;
( reads a list or a cons.
) is defined to raise an exception, as it can only occur if there are unbalanced parens.
The string reader is pretty straightforward (it has a lot in common with the token reader although it's not the same code).
The list/cons reader is mildly fiddly: most of the fiddliness is dealing with consing dots, which it does by a slightly disgusting trick: it installs a secret token parser which will parse a consing dot as a special object if a dynamic variable is true, but which otherwise will raise an exception. The cons reader then binds this variable appropriately to make sure that consing dots are parsed only where they are allowed. Obviously the list/cons reader invokes the whole reader recursively in many places.
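The same trick rendered in Python (a module-level flag standing in for CL's dynamic variable; every name here is invented):
class Dot:
    # Marker object produced only for a legal consing dot.
    def __repr__(self):
        return "."

DOT = Dot()
_dot_allowed = False   # the "dynamic variable"

def parse_token(tok):
    if tok == ".":
        if not _dot_allowed:
            raise SyntaxError("consing dot outside a list")
        return DOT
    return tok

def read_list(tokens):
    # Bind the flag for the extent of this list and restore it on exit,
    # which is what dynamic binding gives you for free in CL.
    global _dot_allowed
    saved, _dot_allowed = _dot_allowed, True
    try:
        out = []
        while tokens and tokens[0] != ")":
            out.append(parse_token(tokens.pop(0)))
        return out
    finally:
        _dot_allowed = saved

print(read_list(["x", ".", "y", ")"]))   # ['x', ., 'y']
try:
    parse_token(".")                     # outside any list...
except SyntaxError as e:
    print(e)                             # ...so this is rejected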
And that's all the macro characters. So, for instance in the default setup, ' would read as a symbol (or a cymbal). But you can just install a macro character:
(defvar *qr-reedtable* (copy-reedtable nil))

(setf (reedtable-macro-character #\' *qr-reedtable*)
      (lambda (from quote table)
        (declare (ignore quote))
        (values `(quote ,(reed :from from :reedtable table))
                (inch from nil))))
And now 'x will read as (quote x) in *qr-reedtable*.
Similarly you could add a more complicated macro character on # to read objects depending on their next character, in the way CL does.
An example of the quote reader. Before:
> (with-input-from-string (in "'(x y . z)")
    (reed :from in :reedtable *qr-reedtable*))
\'
The object it has returned is a symbol whose name is "'", and it didn't read beyond that of course. After:
> (with-input-from-string (in "'(x y . z)")
    (reed :from in :reedtable *qr-reedtable*))
'(x y . z)
Other notes
Everything works one character ahead, so all of the various functions get the stream being read, the first character they should be interested in, and the reedtable, and they return both their value and the next character. This avoids endlessly unreading characters (and probably says something about what grammar class it can handle natively; obviously macro-character parsers can do whatever they like, so long as things are sane when they return).
It probably doesn't use anything which isn't moderately implementable in non-Lisp languages. Some notes:
Macros will cause pain in the usual way, but the only one is define-token-parser. I think the solution to that is the usual expand-the-macro-by-hand-and-write-that-code, but you could probably help a bit by having an install-or-replace-token-parser function which deals with the bookkeeping of keeping the list sorted, etc.
You'll need a language with dynamic variables to implement something like the cons reeder.
It uses CL-PPCRE's s-expression representation of regexps. I'm sure other languages have something like this (Perl does), because no-one wants to write stringy regexps: they must have died out decades ago.
It's a toy: it may be interesting to read but it's not suitable for any serious use. I found at least one bug while writing this: there will be many more.

How to use context-free grammars?

Could someone help me with using context-free grammars? Up until now I've used regular expressions to remove comments, block comments and empty lines from a string so that it can be used to count the PLOC (physical lines of code). This seems to be extremely slow, so I was looking for a different, more efficient method.
I saw the following post: What is the best way to ignore comments in a java file with Rascal?
I have no idea how to use this, and the documentation doesn't get me very far either. When I try to define the line used in the post, I immediately get an error.
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to setup such a context free grammar and then to actually extract the wanted data?
Kind regards,
Bob
First, this will help: the ~ is not part of Rascal's CFG notation; the negation of a character class is written like so: ![\n]. (You will also want a repetition, as in "//" ![\n]* "\n", so that the comment body can be longer than one character.)
Using a context-free grammar in Rascal takes three steps:
write it, like for example the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Use it to parse input, like so:
// This is the basic parse command, but be careful: it will not accept spaces and newlines before and after the top-nonterminal text:
Prog myParseTree = parse(#Prog, "example string");

// you can do the same directly to an input file:
Prog myParseTree = parse(#Prog, |home:///myProgram.func|);

// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[Prog], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;

// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Once you have the tree you can start using visit and / (deep match) to extract information from it, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func , and here are some common idioms for extracting information from a parse tree:
// produces the source location of each node in the tree:
myParseTree@\loc

// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }

// pattern match an if-then-else, bind the three expressions, and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }

// collect the locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree); it uses |unknown:///| for small sub-trees which have not been annotated, for efficiency's sake, like literals and character classes:
[ t@\loc ? |unknown:///| | /Tree t := myParseTree ]
That should give you a start; I'd go try out some stuff and look at more examples. Writing a grammar is a nice thing to do, but it does require some trial and error, much like writing a regex, only more so.
For the grammar you might be writing, which finds source-code comments but leaves the rest as "any character", you will need to use longest-match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is called an "Island Grammar" metaphorically, because you will write precise rules for the parts you want to recognize (the comments are "Islands") while leaving the rest as everything else (the rest is "Water"). See https://dl.acm.org/citation.cfm?id=837160

Erlang vs Elixir Macros

I have come across some Erlang code which I am trying to convert to Elixir, to help me learn both languages and understand the differences. Macros and metaprogramming in general are a topic I am still trying to get my head around, so hopefully you will understand my confusion.
The Erlang code
-define(p2(MAT, REP),
        p2(w = MAT ++ stm) -> m_rep(0, w, stm, REP)).
% where m_rep is a function already defined.
To me, it seems that in the above code there are two separate definitions of the p2 macro that map to a private function called m_rep. In Elixir, though, it seems that it is only possible to have one pattern-matching definition. Is it possible to have different ones in Elixir too?
These are not two definitions. The first line is the macro, the second line is the replacement. The confusing bit is that the macro has the same name as the function for which it is generating clauses. For example when using your macro like this:
?p2("a", "b");
?p2("c", "d").
the above will be expanded to:
p2(w = "a" ++ stm) -> m_rep(0, w, stm, "b");
p2(w = "c" ++ stm) -> m_rep(0, w, stm, "d").
You can use erlc -P to produce a .P file that will show you the effects of macro expansion on your code. Check out this slightly simpler, compilable example:
-module(macro).
-export([foo/1]).

-define(foo(X),
        foo(X) -> X).

?foo("bar");
?foo("baz");
?foo("qux").
Using erlc -P macro.erl you will get the following output to macro.P:
-file("macro.erl", 1).
-module(macro).
-export([foo/1]).
foo("bar") ->
"bar";
foo("baz") ->
"baz";
foo("qux") ->
"qux".
In Elixir you can define multiple function clauses using macros as well. It is more verbose, but I think it is also much clearer. The Elixir equivalent would be:
defmodule MyMacros do
  defmacro p2(mat, rep) do
    quote do
      def p2(w = unquote(mat) ++ stm) do
        m_rep(0, w, stm, unquote(rep))
      end
    end
  end
end
which you can use to define multiple function clauses, just like the erlang counterpart:
defmodule MyModule do
  require MyMacros

  MyMacros.p2('a', 'b')
  MyMacros.p2('c', 'd')
end
I can't help myself here. :-) If it's the macros you are after, then using LFE (Lisp Flavoured Erlang) gives you much better macro handling than either Erlang or Elixir. It is also compatible with both.
-define(p2(MAT, REP),
        p2(w = MAT ++ stm) -> m_rep(0, w, stm, REP)).
% where m_rep is a function already defined.
The code above has a number of issues.
There's no such thing as a macro with multiple clauses in Erlang. The above code doesn't define two separate definitions of the p2 macro that map to a private function called m_rep. What it does is define a 2-argument macro, which expands to a p2 function clause taking some parameters and calling m_rep. However, the parameter definition of the internal p2 function is incorrect:
it tries to use ++ with the second argument not being a list
it tries to assign a value to an atom (did you mean a capital W, a variable, instead of a small w, an atom?)
it tries the assignment in a place where an assignment is not allowed - in a function head.
Did you mean to test for equality (== instead of =) rather than do an assignment? If so, you have to use a guard.
Moreover, it seems to me you're trying to use w and stm as though they were variables and pass them to m_rep, but they're not! Variables in Erlang have to start with a capital letter. Variables in Elixir, on the other hand, do not. It might be you're confusing concepts from the two similar but still different languages.
My general advice would be to pick one language, learn it well, and only later, with that knowledge under your belt, try a different language. Pick Erlang if you're completely new to programming - it's simpler, and there are fewer things to learn upfront. Pick Elixir if you already know Ruby, or if you care more about the immediate marketability of your skills.
Please say more about your intention and I might be able to come up with code expressing it. The above snippet is too ambiguous.

How can I handle previously declared constants with a parser expression grammar?

Say I have the following, in a toy DSL:
int foo(int bar = 0);
With a tool such as rust-peg, I could define some simple parsing expression grammar (PEG) rules to match it (assume appropriate structs FnProto and Arg):
function -> FnProto
    = t:type " " n:name "(" v:arglist ");"
      { FnProto { return_type:t, name:n, args:v } }

arglist -> Vec<Arg>
    = arg ** ","

arg -> Arg
    = t:type " " n:name " = " z:integer
      { Arg { typename:t, name:n, value:z } }

type -> String
    = "int" { match_str.to_string() }

name -> String
    = [a-zA-Z_]+[a-zA-Z0-9_] { match_str.to_string() }

integer -> i64
    = "-"? [0-9]+ { match_str.parse().unwrap() }
In practice such simple rules are insufficient, but they will serve to illustrate my point.
Now consider the following situation, where the default value of bar is a constant defined previously in the same file:
int BAZ = 0xDEADBEEF;
int foo(int bar = BAZ);
Now the rule for parsing functions needs to accept not only integer literals as default argument values, but also any previously declared constants.
I could do one pass to parse constants and substitute the appropriate values in a second pass, but do I really have to resort to two passes? Is there some way I can refer to previously parsed data from within a rule?
You are confusing "parsing" (the recognition of a valid program, perhaps including capture of a representation of it [e.g., as an AST]) with semantic analysis and/or execution.
Your parser should define what is legal to say, syntactically, in the language. Nothing less, and nothing more. You might be able to write some programs that are semantic nonsense that the parser will not complain about.
Having parsed the text, you now need "other passes" over the parsed data (not the source text) to build classic compiler structures such as symbol tables, and to check that all uses of symbols are valid. To do those other passes you could arguably reparse the text, but by assumption you've already done that once. The standard solution here is to have the first parse build an abstract syntax tree (AST) representing the essential details of the program. Those "other passes" then operate by walking the AST rather than parsing the source text again.
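For concreteness, here is the shape of that two-pass approach in a sketch (Python rather than Rust, and the AST node types are stand-ins for whatever your grammar's actions build): pass 1 produces nodes in which a default value may still be an unresolved name, and pass 2 walks them with a symbol table.
from dataclasses import dataclass

@dataclass
class Const:
    name: str
    value: int

@dataclass
class Arg:
    typename: str
    name: str
    value: object            # an int literal, or a str naming a constant

def resolve(decls):
    symbols = {}                           # the classic symbol table
    for d in decls:
        if isinstance(d, Const):
            symbols[d.name] = d.value      # record constants as they appear
        elif isinstance(d, Arg) and isinstance(d.value, str):
            d.value = symbols[d.value]     # substitute; KeyError = undefined name
    return decls

# int BAZ = 0xDEADBEEF;  int foo(int bar = BAZ);
decls = [Const("BAZ", 0xDEADBEEF), Arg("int", "bar", "BAZ")]
print(resolve(decls))    # bar's default is now 3735928559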
This is all classic and taught in standard compiler classes and books. If you are serious about building a programming language, you will need this background.
