How to parse and translate DSL using Red or Rebol - parsing

I'm trying to see if I can use Red (or Rebol) to implement a simple DSL. I want to compile my DSL to source code for another language, perhaps Red or C# or both - rather than directly interpreting and executing it.
The DSL has only a couple of simple statements, plus an if/else statement.
Statements can be grouped into rules. A rule would get translated into a function definition, with each statement the equivalent statement in the target language.
The parse capability in Red/Rebol is great and lets me implement a parser very easily - in effect it's basically just the definition of the grammar itself.
However I haven't been able to find any examples of how to take the next steps, specifically handling an if statement and translating it to other source code.
Translating an if statement seems a good example of something minimal but still slightly tricky - because in Red having an else means you need to change the if to an either, rather than just an extra optional else.
Traditionally, during parsing I would build an abstract syntax tree, and then have functions to operate on the AST and generate the new source code. Should I be following this same approach or is there some other more idiomatic way in Red ?
I've experimented with using collect/keep in my parse rules to return a block of nested blocks, which in effect forms the AST. Another approach would be to save data into specific objects representing the different statements etc.
I'm still getting to grips with collect/keep, as to when a new block will be created and what will be kept. I'd also like to keep my parser rules as "clean" as possible, with as little other code intertwined in it. So I'm still not sure how best to add in Red code in round brackets in the parse rules. Adding code too early can cause the Red code to get executed, even if the rule eventually fails. Adding code too late means the code may not be executed in the order you expect, especially when dealing with multi-level statements like if, which can contain other statements.
So, specifically, any help on how to translate my example DSL to Red source code would be appreciated. Also any links to implementing DSLs like this in Red or Rebol would be great ! :)
Here are my parse rules :-
Red [
Purpose: example rules for parsing a simple language
]
SimpleLanguageParser: make object! [
Expr: [string! | integer! | block!]
Data: ['Person.AGE | 'Person.INCOME]
WriteMessageToLog: ['write 'message 'to 'log Expr]
SetData: ['set 'data Data '= Expr]
IfStatement: ['if Expr [any Statement] opt ['else [any Statement]] 'endif]
Statement: [WriteMessageToLog | SetData | IfStatement]
Rule: [
'rule word!
[any Statement]
'endrule
]
AnySimpLeLanguage: [Rule | [any Statement]]
]
SL: function [slInput] [
parse slInput SimpleLanguageParser/AnySimpleLanguage
]
An example of some source in the DSL :-
RULE TooYoung
IF [Person.Age < 15]
WRITE MESSAGE TO LOG "too young to earn an income"
SET DATA Person.Income = 0
ELSE
WRITE MESSAGE TO LOG "old enough"
ENDIF
ENDRULE
Translated to Red source code :-
TooYoung: function [] [
either Person.Age < 15 [
WriteMessageToLog "too young to earn an income"
Person.Income: 0
] [
WriteMessageToLog "old enough"
]
]
The data, ie Person.Age, Person.Income, and the function WriteMessageToLog are all things which would have been previously defined.
Note, for simplicity I've left Expr as block! etc, rather than defining Expr in any more detail in the DSL itself. Also, setting Person.Income in the function doesn't work as coded as it sets a local - but that's ok for now :)

Always nice to see someone digging language-oriented programming, keep it up and welcome to Red! ;)
Specifying correct grammar rules is the trickiest part of the job, and you've already nailed that. What's left is to intersperse your PEG (parsing expression grammar) with set, copy, collect/keep combo and paren! expressions in the right places, and then either create an AST from that or, in simplier cases, emit code directly.
Example
Here's a quickly baked (and definitely buggy!) example of how I'd tackled your task. Basically, it's slightly reworked code of yours, where matched patterns are setted, copyed or collected, and then bounded to specific words, which then just pasted into "template" (compose function inside emit-rule) to produce a Red code.
It's not the only way, I believe. #rebolek might come up with more industrial-strength solution, as he has experience with sophisticated parsers, which I'm lacking :P
Followup
As for if/else dilemma, I followed the approach proposed above -- instead of using opt I wrapped rule for else-branch into block and added an alternative match, which just sets false-block to none.
What to use for AST -- anything that allow to express a hierarchical structure, which is either a block! (though for performance gain you might want to use hash! or map!) or an object!. The advantage of object! is that it provides a context to be bound to, but here we're approaching a realm of so-called Bindology ("scoping" rules of Red language), which is another beast :)
Emitting C# code would be harder, but doable -- you'll need to assemble a string instead of Red code. I think, however, that, as a newcomer, you should stick with parsing directly at block-level (the way you done in your example), because it a lot easier and much expressive.
Another interesting (but much hairy) approach would be to re-define all words used in your DSL-block to work as you want.
Resources
We have a wiki entry about Red/Rebol dialects on github, which you might find if not useful, but interesting to read.
Also, two articles (this and this) in Red blog, I think you skimmed over first one already (if not, you should!).
Last, but not least, an exhaustive review of Parse principles and keywords (which has a couple of wrong parts in it though, so, caveat emptor). It's written for Rebol, but you should adapt examples to Red rather easily.
As a relative newcomer to the language, I do agree that there's a lack of examples and tutorials about DSL development, but we're working on that (at least in our heads) :)

Taking 9214's answer as a starting point, I've coded one possible solution. My approach has been :-
try to keep the parse rules as "clean" as possible
use collect and keep to return a block as the result, rather than trying to build a more complex AST
do some minimal translation in the keeps
the resulting block should be valid Red code
which uses predefined functions, where any more complex processing needs to happen
Most simple statements are easily translated to functions eg WRITE MESSAGE TO LOG becomes SL_WriteMessageToLog which can then do whatever it needs to do.
More complicated statements with structure, eg If/Else become functions which take block parameters which can then process the blocks as required.
For the If/Else complication, I've made this into two separate functions, SL_If and SL_Else. SL_If stores the result of the condition in a sequence, and SL_Else checks the latest result and removes it. This allows for nested If/Elses.
The presence of the final endrule can be checked for to ensure the input was correctly parsed. Once this is removed, we should have a valid function definition.
Here's the code :-
Red [
Purpose: example rules for parsing and translating a simple language
]
; some data
Person.AGE: 0
Person.INCOME: 0
; functions to implement some simple SL statements
SL_WriteMessageToLog: function [value] [
print value
]
SL_SetData: function [parmblock] [
field: parmblock/1
value: parmblock/2
if type? value = word! [
value: do value
]
print ["old value" field "=" do field]
set field value
print ["new value" field "=" do field]
]
; hold the If condition results, to be used to determine whether or not to do Else
IfConditionResults: []
SL_If: function [cond stats] [
cond_result: do cond
head insert IfConditionResults cond_result
if cond_result stats
]
SL_Else: function [stats] [
cond_result: first IfConditionResults
remove IfConditionResults
if not cond_result stats
]
; parsing rules
SimpleLanguageParser: make object! [
Expr: [logic! | string! | integer! | block!]
Data: ['Person.AGE | 'Person.INCOME]
WriteMessageToLog: ['write 'message 'to 'log set x Expr keep ('SL_WriteMessageToLog) keep (x)]
SetData: ['set 'data set d Data '= set x Expr keep ('SL_SetData) keep (reduce [d x])]
IfStatement: ['if keep ('SL_If) keep Expr collect [any Statement] opt ['else keep ('SL_Else) collect [any Statement]] 'endif]
Statement: [WriteMessageToLog | SetData | IfStatement]
Rule: [collect [
'rule set fname word! keep (to set-word! fname) keep ('does)
collect [any Statement]
keep 'endrule
]
]
AnySimpLeLanguage: [Rule | [any Statement]]
]
SL: function [slInput] [
parse slInput SimpleLanguageParser/Rule
]
For the example in the original post, the output is :-
TooYoung: does [
SL_If [Person.Age < 15] [
SL_WriteMessageToLog "too young to earn an income"
SL_SetData [Person.Income 0]
]
SL_Else [
SL_WriteMessageToLog "old enough"
]
]
ENDRULE
Thanks for your help to get this far.
Feedback on this approach and solution would be appreciated :)

Related

How to use context free grammars?

Could someone help me with using context free grammars. Up until now I've used regular expressions to remove comments, block comments and empty lines from a string so that it can be used to count the PLOC. This seems to be extremely slow so I was looking for a different more efficient method.
I saw the following post: What is the best way to ignore comments in a java file with Rascal?
I have no idea how to use this, the help doesn't get me far as well. When I try to define the line used in the post I immediately get an error.
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to setup such a context free grammar and then to actually extract the wanted data?
Kind regards,
Bob
First this will help: the ~ in Rascal CFG notation is not in the language, the negation of a character class is written like so: ![\n].
To use a context-free grammar in Rascal goes in three steps:
write it, like for example the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Use it to parse input, like so:
// This is the basic parse command, but be careful it will not accept spaces and newlines before and after the TopNonTerminal text:
Prog myParseTree = parse(#Prog, "example string");
// you can do the same directly to an input file:
Prog myParseTree = parse(#TopNonTerminal, |home:///myProgram.func|);
// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[TopNonTerminal], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;
// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Once you have the tree you can start using visit and / deepmatch to extract information from the tree, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func , but here are some common idioms as well to extract information from a parse tree:
// produces the source location of each node in the tree:
myParseTree#\loc
// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }
// pattern match an if-then-else and bind the three expressions and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }
// collect all locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree. It uses |unknown:///| for small sub-trees which have not been annotated for efficiency's sake, like literals and character classes:
[ t#\loc?|unknown:///| | /Tree t := myParseTree ]
That should give you a start. I'd go try out some stuff and look at more examples. Writing a grammar is a nice thing to do, but it does require some trial and error methods like writing a regex, but even more so.
For the grammar you might be writing, which finds source code comments but leaves the rest as "any character" you will need to use the longest match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is called an "Island Grammar" metaphorically, because you will write precise rules for the parts you want to recognize (the comments are "Islands") while leaving the rest as everything else (the rest is "Water"). See https://dl.acm.org/citation.cfm?id=837160

whitespace in flex patterns leads to "unrecognized rule"

The flex info manual provides allows whitespace in regular expressions using the "x" modifier in the (?r-s:pattern) form. It specifically offers a simple example (without whitespace)
(?:foo) same as (foo)
but the following program fails to compile with the error "unrecognized rule":
BAD (?:foo)
%%
{BAD} {}
I cannot find any form of (? that is acceptable as a rule pattern. Is the manual in error, or do I misunderstand?
The example in your question does not seem to reflect the question itself, since it shows neither the use of whitespace nor a x flag. So I'm going to assume that the pattern which is failing for you is something like
BAD (?x:two | lines |
of | words)
%%
{BAD} { }
And, indeed, that will not work. Although you can use extended format in a pattern, you can only use it in a definition if it doesn't contain a newline. The definition terminates at the last non-whitespace character on the definition line.
Anyway, definitions are overused. You could write the above as
%%
(?x:two | lines |
of | words ) { }
Which saves anyone reading your code from having to search for a definition.
I do understand that you might want to use a very long pattern in a rule, which is awkward, particularly if you want to use it twice. Regardless of the issue with newlines, this tends to run into problems with Flex's definition length limit (2047 characters). My approach has been to break the very long pattern into a series of definitions, and then define another symbol which concatenates the pieces.
Before v2.6, Flex did not chop whitespace off the end of the definition line, which also leads to mysterious "unrecognized rule" errors. The manual seems to still reflect the v2.5 behaviour:
The definition is taken to begin at the first non-whitespace character following the name and continuing to the end of the line.

What is perl experimental feature, postderef?

I see use experimental 'postderef' being used in Moxie here on line 8. I'm just confused at what it does. The man pages for experimental are pretty vague too,
allow the use of postfix dereferencing expressions, including in interpolating strings
Can anyone show what you would have to do without the pragma, and what the pragma makes easier or possible?
What is it
It's simple. It's syntactic sugar with ups and downs. The pragma is no longer needed as the feature is core in 5.24. But in order for the feature to be supported in between 5.20 and 5.24, it had to be enabled with: use experimental 'postderef'. In the provided example, in Moxie, it's used in one line which has $meta->mro->#*; without it you'd have to write #{$meta->mro}.
Synopsis
These are straight from D Foy's blog, along with Idiomatic Perl for comparison that I've written.
D Foy example Idiomatic Perl
$gimme_a_ref->()->#[0]->%* %{ $gimme_a_ref->()[0] }
$array_ref->#* #{ $array_ref }
get_hashref()->#{ qw(cat dog) } #{ get_hashref() }{ qw(cat dog) }
These examples totally provided by D Foy,
D Foy example Idiomatic Perl
$array_ref->[0][0]->#* #{ $array_ref->[0][0] }
$sub->&* &some_sub
Arguments-for
postderef allows chaining.
postderef_qq makes complex interpolation into scalar strings easier.
Arguments-against
not at all provided by D Foy
Loses sigil significance. Whereas before you knew what the "type" was by looking at the sigil on the left-most side. Now, you don't know until you read the whole chain. This seems to undermine any argument for the sigil, by forcing you to read the whole chain before you know what is expected. Perhaps the days of arguing that sigils are a good design decision are over? But, then again, perl6 is still all about them. Lack of consistency here.
Overloads -> to mean, as type. So now you have $type->[0][1]->#* to mean dereference as $type, and also coerce to type.
Slices do not have an similar syntax on primitives.
my #foo = qw/foo bar baz quz quuz quuuz/;
my $bar = \#foo;
# Idiomatic perl array-slices with inclusive-range slicing
say #$bar[2..4]; # From reference; returns bazquzquuz
say #foo[2..4]; # From primitive; returns bazquzquuz
# Whizbang thing which has exclusive-range slicing
say $bar->#[2,4]; # From reference; returns bazquz
# Nothing.
Sources
Brian D Foy in 2014..
Brian D Foy in 2016..

Does anyone have an efficient R3 function that mimics the behaviour of find/any in R2?

Rebol2 has an /ANY refinement on the FIND function that can do wildcard searches:
>> find/any "here is a string" "s?r"
== "string"
I use this extensively in tight loops that need to perform well. But the refinement was removed in Rebol3.
What's the most efficient way of doing this in Rebol3? (I'm guessing a parse solution of some sort.)
Here's a stab at handling the "*" case:
like: funct [
series [series!]
search [series!]
][
rule: copy []
remove-each s b: parse/all search "*" [empty? s]
foreach s b [
append rule reduce ['to s]
]
append rule [to end]
all [
parse series rule
find series first b
]
]
used as follows:
>> like "abcde" "b*d"
== "bcde"
I had edited your question for "clarity" and changed it to say 'was removed'. That made it sound like it was a deliberate decision. Yet it actually turns out it may just not have been implemented.
BUT if anyone asks me, I don't think it should be in the box...and not just because it's a lousy use of the word "ALL". Here's why:
You're looking for patterns in strings...so if you're constrained to using a string to specify that pattern you get into "meta" problems. Let's say I want to extract the word *Rebol* or ?Red?, now there has to be escaping and things get ugly all over again. Back to RegEx. :-/
So what you might actually want isn't a STRING! pattern like s?r but a BLOCK! pattern like ["s" ? "r"]. This would permit constructs like ["?" ? "?"] or [{?} ? {?}]. That's better than rehashing the string hackery that every other language uses.
And that's what PARSE does, albeit in a slightly-less-declarative way. It also uses words instead of symbols, as Rebol likes to do. [{?} skip {?}] is a match rule where skip is an instruction that moves the parse position past any single element of the parse series between the question marks. It could also do so if it were parsing a block as input, and would match [{?} 12-Dec-2012 {?}].
I don't know entirely what the behavior of /ALL would-or-should be with something like "ab??cd e?*f"... if it provided alternate pattern logic or what. I'm assuming the Rebol2 implementation is brief? So likely it only matches one pattern.
To set a baseline, here's a possibly-lame PARSE solution for the s?r intent:
>> parse "here is a string" [
some [ ; match rule repeatedly
to "s" ; advance to *before* "s"
pos: ; save position as potential match
skip ; now skip the "s"
[ ; [sub-rule]
skip ; ignore any single character (the "?")
"r" ; match the "r", and if we do...
return pos ; return the position we saved
| ; | (otherwise)
none ; no-op, keep trying to match
]
]
fail ; have PARSE return NONE
]
== "string"
If you wanted it to be s*r you would change the skip "r" return pos into a to "r" return pos.
On an efficiency note, I'll mention that it is indeed the case that characters are matched against characters faster than strings. So to #"s" and #"r" to end make a measurable difference in the speed when parsing strings in general. Beyond that, I'm sure others can do better.
The rule is certainly longer than "s?r". But it's not that long when comments are taken out:
[some [to #"s" pos: skip [skip #"r" return pos | none]] fail]
(Note: It does leak pos: as written. Is there a USE in PARSE, implemented or planned?)
Yet a nice thing about it is that it offers hook points at all the moments of decision, and without the escaping defects a naive string solution has. (I'm tempted to give my usual "Bad LEGO alligator vs. Good LEGO alligator" speech.)
But if you don't want to code in PARSE directly, it seems the real answer would be some kind of "Glob Expression"-to-PARSE compiler. It might be the best interpretation of glob Rebol would have, because you could do a one-off:
>> parse "here is a string" glob "s?r"
== "string"
Or if you are going to be doing the match often, cache the compiled expression. Also, let's imagine our block form uses words for literacy:
s?r-rule: glob ["s" one "r"]
pos-1: parse "here is a string" s?r-rule
pos-2: parse "reuse compiled RegEx string" s?r-rule
It might be interesting to see such a compiler for regex as well. These also might accept not only string input but also block input, so that both "s.r" and ["s" . "r"] were legal...and if you used the block form you wouldn't need escaping and could write ["." . "."] to match ".A."
Fairly interesting things would be possible. Given that in RegEx:
(abc|def)=\g{1}
matches abc=abc or def=def
but not abc=def or def=abc
Rebol could be modified to take either the string form or compile into a PARSE rule with a form like:
regex [("abc" | "def") "=" (1)]
Then you get a dialect variation that doesn't need escaping. Designing and writing such compilers is left as an exercise for the reader. :-)
I've broken this into two functions: one that creates a rule to match the given search value, and the other to perform the search. Separating the two allows you to reuse the same generated parse block where one search value is applied over multiple iterations:
expand-wildcards: use [literal][
literal: complement charset "*?"
func [
{Creates a PARSE rule matching VALUE expanding * (any characters) and ? (any one character)}
value [any-string!] "Value to expand"
/local part
][
collect [
parse value [
; empty search string FAIL
end (keep [return (none)])
|
; only wildcard return HEAD
some #"*" end (keep [to end])
|
; everything else...
some [
; single char matches
#"?" (keep 'skip)
|
; textual match
copy part some literal (keep part)
|
; indicates the use of THRU for the next string
some #"*"
; but first we're going to match single chars
any [#"?" (keep 'skip)]
; it's optional in case there's a "*?*" sequence
; in which case, we're going to ignore the first "*"
opt [
copy part some literal (
keep 'thru keep part
)
]
]
]
]
]
]
like: func [
{Finds a value in a series and returns the series at the start of it.}
series [any-string!] "Series to search"
value [any-string! block!] "Value to find"
/local skips result
][
; shortens the search a little where the search starts with a regular char
skips: switch/default first value [
#[none] #"*" #"?" ['skip]
][
reduce ['skip 'to first value]
]
any [
block? value
value: expand-wildcards value
]
parse series [
some [
; we have our match
result: value
; and return it
return (result)
|
; step through the string until we get a match
skips
]
; at the end of the string, no matches
fail
]
]
Splitting the function also gives you a base to optimize the two different concerns: finding the start and matching the value.
I went with PARSE as even though *? are seemingly simple rules, there is nothing quite as expressive and quick as PARSE to effectively implementing such a search.
It might yet as per #HostileFork to consider a dialect instead of strings with wildcards—indeed to the point where Regex is replaced by a compile-to-parse dialect, but is perhaps beyond the scope of the question.

REBOL path operator vs division ambiguity

I've started looking into REBOL, just for fun, and as a fan of programming languages, I really like seeing new ideas and even just alternative syntaxes. REBOL is definitely full of these. One thing I noticed is the use of '/' as the path operator which can be used similarly to the '.' operator in most object-oriented programming languages. I have not programmed in REBOL extensively, just looked at some examples and read some documentation, but it isn't clear to me why there's no ambiguity with the '/' operator.
x: 4
y: 2
result: x/y
In my example, this should be division, but it seems like it could just as easily be the path operator if x were an object or function refinement. How does REBOL handle the ambiguity? Is it just a matter of an overloaded operator and the type system so it doesn't know until runtime? Or is it something I'm missing in the grammar and there really is a difference?
UPDATE Found a good piece of example code:
sp: to-integer (100 * 2 * length? buf) / d/3 / 1024 / 1024
It appears that arithmetic division requires whitespace, while the path operator requires no whitespace. Is that it?
This question deserves an answer from the syntactic point of view. In Rebol, there is no "path operator", in fact. The x/y is a syntactic element called path. As opposed to that the standalone / (delimited by spaces) is not a path, it is a word (which is usually interpreted as the division operator). In Rebol you can examine syntactic elements like this:
length? code: [x/y x / y] ; == 4
type? first code ; == path!
type? second code
, etc.
The code guide says:
White-space is used in general for delimiting (for separating symbols).
This is especially important because words may contain characters such as + and -.
http://www.rebol.com/r3/docs/guide/code-syntax.html
One acquired skill of being a REBOler is to get the hang of inserting whitespace in expressions where other languages usually do not require it :)
Spaces are generally needed in Rebol, but there are exceptions here and there for "special" characters, such as those delimiting series. For instance:
[a b c] is the same as [ a b c ]
(a b c) is the same as ( a b c )
[a b c]def is the same as [a b c] def
Some fairly powerful tools for doing introspection of syntactic elements are type?, quote, and probe. The quote operator prevents the interpreter from giving behavior to things. So if you tried something like:
>> data: [x [y 10]]
>> type? data/x/y
>> probe data/x/y
The "live" nature of the code would dig through the path and give you an integer! of value 10. But if you use quote:
>> data: [x [y 10]]
>> type? quote data/x/y
>> probe quote data/x/y
Then you wind up with a path! whose value is simply data/x/y, it never gets evaluated.
In the internal representation, a PATH! is quite similar to a BLOCK! or a PAREN!. It just has this special distinctive lexical type, which allows it to be treated differently. Although you've noticed that it can behave like a "dot" by picking members out of an object or series, that is only how it is used by the DO dialect. You could invent your own ideas, let's say you make the "russell" command:
russell [
x: 10
y: 20
z: 30
x/y/z
(
print x
print y
print z
)
]
Imagine that in my fanciful example, this outputs 30, 10, 20...because what the russell function does is evaluate its block in such a way that a path is treated as an instruction to shift values. So x/y/z means x=>y, y=>z, and z=>x. Then any code in parentheses is run in the DO dialect. Assignments are treated normally.
When you want to make up a fun new riff on how to express yourself, Rebol takes care of a lot of the grunt work. So for example the parentheses are guaranteed to have matched up to get a paren!. You don't have to go looking for all that yourself, you just build your dialect up from the building blocks of all those different types...and hook into existing behaviors (such as the DO dialect for basics like math and general computation, and the mind-bending PARSE dialect for some rather amazing pattern matching muscle).
But speaking of "all those different types", there's yet another weirdo situation for slash that can create another type:
>> type? quote /foo
This is called a refinement!, and happens when you start a lexical element with a slash. You'll see it used in the DO dialect to call out optional parameter sets to a function. But once again, it's just another symbolic LEGO in the parts box. You can ascribe meaning to it in your own dialects that is completely different...
While I didn't find any written definitive clarification, I did also find that +,-,* and others are valid characters in a word, so clearly it requires a space.
x*y
Is a valid identifier
x * y
Performs multiplication. It looks like the path operator is just another case of this.

Resources