Creating a recursive LPeg pattern - lua

In a normal PEG (parsing expression grammar) this is a valid grammar:
values <- number (comma values)*
number <- [0-9]+
comma <- ','
However, if I try to write this using LPeg, the self-reference in that rule fails:
local lpeg = require'lpeg'
local comma = lpeg.P(',')
local number = lpeg.R('09')^1
local values = number * (comma * values)^-1
--> bad argument #2 to '?' (lpeg-pattern expected, got nil)
Although in this simple example I could rewrite the rule to not use recursion, I have some existing grammars that I'd prefer not to rewrite.
How can I write a self-referencing rule in LPeg?

Use a grammar.
With the use of Lua variables, it is possible to define patterns incrementally, with each new pattern using previously defined ones. However, this technique does not allow the definition of recursive patterns. For recursive patterns, we need real grammars.
LPeg represents grammars with tables, where each entry is a rule.
The call lpeg.V(v) creates a pattern that represents the nonterminal (or variable) with index v in a grammar. Because the grammar still does not exist when this function is evaluated, the result is an open reference to the respective rule.
A table is fixed when it is converted to a pattern (either by calling lpeg.P or by using it where a pattern is expected). Then every open reference created by lpeg.V(v) is corrected to refer to the rule indexed by v in the table.
When a table is fixed, the result is a pattern that matches its initial rule. The entry with index 1 in the table defines its initial rule. If that entry is a string, it is assumed to be the name of the initial rule. Otherwise, LPeg assumes that the entry 1 itself is the initial rule.
As an example, the following grammar matches strings of a's and b's that have the same number of a's and b's:
equalcount = lpeg.P{
  "S"; -- initial rule name
  S = "a" * lpeg.V"B" + "b" * lpeg.V"A" + "",
  A = "a" * lpeg.V"S" + "b" * lpeg.V"A" * lpeg.V"A",
  B = "b" * lpeg.V"S" + "a" * lpeg.V"B" * lpeg.V"B",
} * -1
It is equivalent to the following grammar in standard PEG notation:
S <- 'a' B / 'b' A / ''
A <- 'a' S / 'b' A A
B <- 'b' S / 'a' B B
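For instance, the fixed grammar can be used directly (a usage sketch; match returns the index one past the end of the match, or nil on failure):
assert(equalcount:match("abba") == 5)
assert(equalcount:match("aab") == nil) -- unequal counts
Applied to the grammar from the question, a named-rule version might look like this (a sketch of the same idea):
local values = lpeg.P{
  "values"; -- initial rule
  values = lpeg.V"number" * (lpeg.V"comma" * lpeg.V"values")^-1,
  number = lpeg.R("09")^1,
  comma  = lpeg.P",",
}
assert(values:match("1,10,20,301") == 12) -- one past the end of the input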

I know this is a late answer, but here is an idea for how to back-reference a rule:
local comma = lpeg.P(',')
local number = lpeg.R('09')^1
local values = lpeg.P{ lpeg.C(number) * (comma * lpeg.V(1))^-1 }
local t = { values:match('1,10,20,301') }
Basically, a primitive grammar is passed to lpeg.P (a grammar is just a glorified table) that references the first rule by number instead of by name, i.e. lpeg.V(1).
The sample just adds a simple lpeg.C capture on the number terminal and collects all these results in the local table t for further use. (Notice that no lpeg.Ct is used, which is not a big deal, but still... part of the sample, I guess.)
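For reference, t now holds the captured number strings in order:
for i, v in ipairs(t) do print(i, v) end
--> 1   1
--> 2   10
--> 3   20
--> 4   301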

Related

ANTLR grammar to recognize DIGIT keys and INTEGERS too

I'm trying to create an ANTLR grammar to parse sequences of keys that optionally have a repeat count. For example, (a b c r5) means "repeat keys a, b, and c five times."
I have the grammar working for KEYS : ('a'..'z'|'A'..'Z').
But when I try to add digit keys KEYS : ('a'..'z'|'A'..'Z'|'0'..'9') with an input expression like (a 5 r5), the parse fails on the middle 5 because it can't tell if the 5 is an INTEGER or a KEY. (Or so I think; the error messages, like "NoViableAltException", are difficult to interpret.)
I have tried these grammatical forms, which work ('r' means "repeatcount"):
repeat : '(' LETTERKEYS INTEGER ')' - works for a-zA-Z
repeat : '(' LETTERKEYS 'r' INTEGER ')'; - works for a-zA-Z
But I fail with
repeat : '(' LETTERSandDIGITKEYS INTEGER ')' - fails on '(a 5 r5)'
repeat : '(' LETTERSandDIGITKEYS 'r' INTEGER ')'; - fails on '(a 5 r5)'
Maybe the grammar can't do the recognition; maybe I need to recognize all the 5s the same way (as KEYS or DIGITS or INTEGERS) and, in the parse-tree visitor, interpret the middle DIGIT instances as keys and the last set of DIGITS as an INTEGER count?
Is it possible to define a grammar that allows me to repeat digit keys as well as letter keys so that expressions like (a 5 123 r5) will be recognized correctly? (That is, "repeat keys a,5,1,2,3 five times.") I'm not tied to that specific syntax, although it would be nice to use something similar.
Thank you.
the parse fails on the middle 5 because it can't tell if the 5 is an INTEGER or a KEY.
If you have defined the following rules:
INTEGER : [0-9]+;
KEY : [a-zA-Z0-9];
then a single digit, like 5 in your example, will always become an INTEGER token. Even if the parser is trying to match a KEY token, the 5 will become an INTEGER. There is nothing you can do about that: this is the way ANTLR's lexer works. The lexer works in the following way:
try to consume as many characters as possible (the longest match wins)
if two or more rules match the same characters (like INTEGER and KEY in the case of 5), let the rule defined first "win"
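To illustrate these two principles outside of ANTLR, here is a minimal lexer sketch in Lua (the language of the first question on this page); the rule table and names are hypothetical:
-- Lexer sketch: the longest match wins; on a tie, the rule defined first wins.
local rules = {
  { name = "INTEGER", pattern = "^%d+" },   -- defined first, so it keeps ties
  { name = "KEY",     pattern = "^[%a%d]" },
}
local function lex(input)
  local tokens, pos = {}, 1
  while pos <= #input do
    if input:find("^%s", pos) then
      pos = pos + 1 -- skip white space
    else
      local best, bestLen
      for _, r in ipairs(rules) do
        local s, e = input:find(r.pattern, pos)
        -- strict '>' means an earlier rule wins a tie in length
        if s and (not bestLen or e - s + 1 > bestLen) then
          best, bestLen = { type = r.name, text = input:sub(s, e) }, e - s + 1
        end
      end
      assert(best, "no rule matches at position " .. pos)
      tokens[#tokens + 1] = best
      pos = pos + bestLen
    end
  end
  return tokens
end
-- lex("a 5 r5") --> KEY "a", INTEGER "5", KEY "r", INTEGER "5"
-- The middle 5 can never become a KEY, exactly as described above.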
If you want a 5 to be an INTEGER, but sometimes a KEY, do something like this instead:
key : KEY | SINGLE_DIGIT | R;
integer : INTEGER | SINGLE_DIGIT;
repeat : R integer;
SINGLE_DIGIT : [0-9];
INTEGER : [0-9]+;
R : 'r';
KEY : [a-zA-Z];
and in your parser rules, you use key and integer instead of KEY and INTEGER.
You can split your grammar into two parts: one to be the lexer grammar, one to be the parser grammar. The lexer grammar splits the input characters into tokens. The parser grammar uses the string of tokens to parse and build a syntax tree. I work on Tunnel Grammar Studio (TGS), which can generate parsers from these two ABNF-like (RFC 5234) grammars:
key = 'a'-'z' / 'A'-'Z' / '0'-'9'
repeater = 'r' 1*('0'-'9')
That is the lexer grammar; it has two rules. Each character that is not consumed by the lexer grammar is converted to a token made from the character itself. This means that a is a key, r11 is a repeater, and ( for example is transferred to the parser as a ( token.
document = *ws repeat
repeat = '(' *ws *({key} *ws) [{repeater} *ws] ')' *ws
ws = ' ' / %x9 / %xA / %xD
This is the parser grammar; it has three rules. The document rule accepts (recognizes) white space first, then one repeat rule. The repeat rule starts with an opening scope, followed by any amount of white space. After that comes a list of keys, possibly separated by white space; after all the keys there is an optional repeater token followed by optional white space, the closing scope, and again optional white space. The white space is space, tab, carriage return, and line feed, in that order.
The runtime of this parsing is linear for both the lexer and the parser because both grammars are LL(1). In the TGS online laboratory you can run these grammars for the input (a 5 r5) and inspect the resulting generic parse tree.
If you want to have a more complex key, then you may use this:
key = 1*('a'-'z' / 'A'-'Z' / '0'-'9')
In this case, however, the key and repeater lexer rules will both recognize the sequence r7, but because the repeater rule is defined later, it takes precedence (i.e., it overwrites the previous rule). In other words, r7 will be a repeater token, and the parsing will still be linear. You will get a warning from TGS if your lexer rules overwrite one another.

Convert Arithmetic Expression Tree to string without unnecessary parenthesis?

Assume I build an abstract syntax tree of simple arithmetic operators, like Div(left,right), Add(left,right), Prod(left,right), Sum(left,right), Sub(left,right).
However, when I want to convert the AST to a string, I find it hard to remove the unnecessary parentheses.
Notice the output string should follow the normal math operator precedence.
Examples:
Prod(Prod(1,2),Prod(2,3)); let's denote this as ((1*2)*(2*3))
converted to a string, it should be 1*2*2*3
more examples:
(((2*3)*(3/5))-4) ==> 2*3*3/5 - 4
((2-3)*((3*7)/(1*5)))-4 ==> (2-3)*3*7/(1*5) - 4
(1/(2/3))/5 ==> 1/(2/3)/5
((1/2)/3)/5 ==> 1/2/3/5
((1-2)-3)-(4-6)+(1-3) ==> 1-2-3-(4-6)+1-3
I found the answer in this question.
Although that question is a little different from this one, the algorithm still applies.
The rule is: if a child of the node has lower precedence, then a pair of parentheses is needed.
Additionally, if the operator of a node is one of -, /, %, and the right operand has the same precedence as its parent node, it also needs parentheses.
I give the pseudo-code (Scala-like) below:
def toString(e: Expression, parentPrecedence: Int = -1): String = {
  e match {
    case Sub(left2, right2) =>
      val p = 10
      val left = toString(left2, p)
      val right = toString(right2, p + 1) // +1 !!
      val op = "-"
      lazy val s2 = left :: right :: Nil mkString op
      if (parentPrecedence > p)
        s"($s2)"
      else s"$s2"
    // the cases for modulus and divide are similar to Sub, except for p
    case Sum(left2, right2) =>
      val p = 10
      val left = toString(left2, p)
      val right = toString(right2, p) // no +1: addition is associative
      val op = "+"
      lazy val s2 = left :: right :: Nil mkString op
      if (parentPrecedence > p)
        s"($s2)"
      else s"$s2"
    // case Prod is similar to Sum
    ....
  }
}
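A runnable Lua version of the same algorithm (a sketch, in the language of the first question on this page; modulus is omitted for brevity, and the node shape { op = ..., left = ..., right = ... } is an assumption for illustration):
-- Precedence table; number literals never need parentheses.
local PREC = { ["+"] = 10, ["-"] = 10, ["*"] = 20, ["/"] = 20 }
-- Operators whose right operand needs parentheses even at equal
-- precedence (this is the "+1" in the pseudo-code above).
local RIGHT_STRICT = { ["-"] = true, ["/"] = true }
local function unparse(node, parentPrec)
  parentPrec = parentPrec or 0
  if type(node) == "number" then return tostring(node) end
  local p = PREC[node.op]
  local rightPrec = RIGHT_STRICT[node.op] and p + 1 or p
  local s = unparse(node.left, p) .. node.op .. unparse(node.right, rightPrec)
  if parentPrec > p then
    return "(" .. s .. ")"
  end
  return s
end
-- 1/(2/3)/5 from the examples above:
print(unparse({ op = "/",
                left  = { op = "/", left = 1,
                          right = { op = "/", left = 2, right = 3 } },
                right = 5 })) --> 1/(2/3)/5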
For a simple expression grammar, you can eliminate (most) redundant parentheses using operator precedence, essentially the same way that you parse the expression into an AST.
If you're looking at a node in an AST, all you need to do is to compare the precedence of the node's operator with the precedence of the operator for the argument, using the operator's associativity in the case that the precedences are equal. If the node's operator has higher precedence than an argument, the argument does not need to be surrounded by parentheses; otherwise it needs them. (The two arguments need to be examined independently.) If an argument is a literal or identifier, then of course no parentheses are necessary; this special case can easily be handled by making the precedence of such values infinite (or at least larger than any operator precedence).
However, your example includes another proposal for eliminating redundant parentheses, based on the mathematical associativity of the operator. Unfortunately, mathematical associativity is not always applicable in a computer program. If your expressions involve floating-point numbers, for example, a+(b+c) and (a+b)+c might have very different values:
(gdb) p (1000000000000000000000.0 + -1000000000000000000000.0) + 2
$1 = 2
(gdb) p 1000000000000000000000.0 + (-1000000000000000000000.0 + 2)
$2 = 0
For this reason, it's pretty common for compilers to avoid rearranging the order of application of multiplication and addition, at least for floating point arithmetic, and also for integer arithmetic in the case of languages which check for integer overflow.
But if you do really want to rearrange based on mathematical associativity, you'll need an additional check during the walk of the AST; before checking precedence, you'll want to check whether the node you're visiting and its left argument use the same operator, where that operator is known to be mathematically associative. (This assumes that only operators which group to the left are mathematically associative. In the unlikely case that you have a mathematically associative operator which groups to the right, you'll want to check the visited node and its right-hand argument.)
If that condition is met, you can rotate the root of the AST, turning (for example) PROD(PROD(a,b),□) into PROD(a,PROD(b,□)). That might lead to additional rotations in the case that a is also a PROD node.
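In Lua, such a rotation might look like this (a sketch; the ASSOCIATIVE table and the node shape are assumptions carried over from the earlier snippet):
-- Rotate left-leaning chains of a mathematically associative operator:
-- Op(Op(a,b), c) ==> Op(a, Op(b, c)).
local ASSOCIATIVE = { ["+"] = true, ["*"] = true }
local function rotate(node)
  if type(node) ~= "table" then return node end
  while ASSOCIATIVE[node.op]
        and type(node.left) == "table"
        and node.left.op == node.op do
    node = { op = node.op,
             left  = node.left.left,
             right = { op = node.op,
                       left  = node.left.right,
                       right = node.right } }
  end
  return node
end
The loop re-examines the new root, which handles the case where a is itself a PROD node; for a whole tree, the function would be applied recursively to every node.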

Does a priority declaration disambiguate between alternative lexicals?

In my previous question, there was a priority > declaration in the example. It turned out not to matter because the solution there did not actually invoke priority but rather avoided it by making the alternatives disjoint. In this question, I'm asking whether priority can be used to select one lexical production over another. In the example below, the language of the production WordInitialDigit is intentionally a subset of that of WordAny. The production Word looks like it should disambiguate between the two properly, but the resulting parse tree has an ambiguity node at the top. Is a priority declaration able to decide between different lexical reductions, or does it require there to be a basis of common lexical elements? Or something else?
The example is contrived (there are no actions in the grammar), but the situations it arises from are not. For example, I'd like to use something like this for error recovery, where I can recognize a natural boundary for a unit of syntax and write a production for it. This generic production would be the last element in a priority chain; if it reduces, it means that there was no valid parse. More generally, I need to be able to select lexical elements based on syntactic context. I had hoped, since Rascal is scannerless, that this would be seamless. Perhaps it is, though I don't see it at the moment.
I'm on the unstable branch, version 0.10.0.201807050853.
EDIT: This question is not about > for defining an expression grammar. The documentation for priority declarations talks mostly about expressions, but the very first sentence provides what looks like a perfectly clear definition:
Priority declarations define a partial ordering between the productions within a single non-terminal.
So the example has two productions, an ordering declared between them, and yet the parser is still generating an ambiguity node in the clear presence of a disambiguation rule. So to put a finer point on my question, it looks like I don't know which of two situations pertains. Either (1) if this isn't supposed to work, then there's a defect in the language definition as documented, a deficiency in error reporting of the compiler, and a language design decision that's somewhere between counter-intuitive and user-hostile. Or (2) if this is supposed to work, there's a defect in the compiler and/or parser (presumably because the focus was initially on expressions) and at some point the example will pass its tests.
module ssce

import analysis::grammars::Ambiguity;
import ParseTree;
import IO;
import String;

lexical WordChar = [0-9A-Za-z] ;
lexical Digit = [0-9] ;
lexical WordInitialDigit = Digit WordChar* !>> WordChar;
lexical WordAny = WordChar+ !>> WordChar;

syntax Word =
    WordInitialDigit
  > WordAny
  ;

test bool WordInitialDigit_0() = parseAccept( #Word, "4foo" );
test bool WordInitialDigit_1() = parseAccept( #WordInitialDigit, "4foo" );
test bool WordInitialDigit_2() = parseAccept( #WordAny, "4foo" );

bool verbose = false;

bool parseAccept( type[&T<:Tree] begin, str input )
{
  try
  {
    parse(begin, input, allowAmbiguity=false);
  }
  catch ParseError(loc _):
  {
    return false;
  }
  catch Ambiguity(loc l, str a, str b):
  {
    if (verbose)
    {
      println("[Ambiguity] #<a>, \"<b>\"");
      Tree tt = parse(begin, input, allowAmbiguity=true) ;
      iprintln(tt);
      list[Message] m = diagnose(tt) ;
      println( ToString(m) );
    }
    fail;
  }
  return true;
}

bool parseReject( type[&T<:Tree] begin, str input )
{
  try
  {
    parse(begin, input, allowAmbiguity=false);
  }
  catch ParseError(loc _):
  {
    return true;
  }
  return false;
}

str ToString( list[Message] msgs ) =
  ( ToString( msgs[0] ) | it + "\n" + ToString(m) | m <- msgs[1..] );

str ToString( Message msg)
{
  switch(msg)
  {
    case error(str s, loc _): return "error: " + s;
    case warning(str s, loc _): return "warning: " + s;
    case info(str s, loc _): return "info: " + s;
  }
  return "";
}
Excellent questions.
TL;DR:
the rule priority mechanism is not capable of an algorithmic ordering of a non-terminal's alternatives. Although some kind of partial order is involved in the additional grammatical constraints that a priority declaration generates, there is no "trying" one rule first, before the other. So it simply can't do that. The good news is that the priority mechanism has a formal semantics independent of any parsing algorithm; it's just defined in terms of context-free grammar rules and reduction traces.
using ambiguous rules for error recovery or "robust parsing" is a good idea. However, if there are too many such rules, the parser will eventually start showing quadratic or even cubic behavior, and tree building after parsing might have an even higher polynomial complexity. I believe the generated parser algorithm should have a (parameterized) mode for error recovery rather than expressing this at the grammar level.
Accepting ambiguity at parse time, and filtering/choosing trees after parsing is the recommended way to go.
All this talk of "ordering" in the documentation is misleading. Disambiguation is a minefield of confusing terminology. For now, I recommend this SLE paper, which has some definitions: https://homepages.cwi.nl/~jurgenv/papers/SLE2013-1.pdf
Details
priority mechanism not capable of choosing among alternatives
The use of the > operator and left, right generates a partial order between mutually recursive rules, such as found in expression languages, and is limited to specific item positions in each rule: namely the left-most and right-most recursive positions which overlap. Rules which are lower in the hierarchy are not allowed to be grammatically expanded as "children" of rules which are higher in the hierarchy. So in E "*" E, neither E may be expanded to E "+" E if E "*" E > E "+" E.
The additional constraints do not choose, for any E, which alternative to try first. No, they simply disallow certain expansions, assuming the other expansion is still valid and thus the ambiguity is solved.
The reason for the limitation to specific positions is that for these positions the parser generator can "prove" that they will generate ambiguity, and thus filtering one of the two alternatives by disallowing certain nestings will not result in additional parse errors. (Consider a rule for array indexing: E "[" E "]", which should not have additional constraints for the second E.) This is a so-called "syntax-safe" disambiguation mechanism.
All in all, it is a pretty weak mechanism algorithmically, specifically tailored for mutually recursive combinator/expression-like languages. The end goal of the mechanism is to make sure we have to use only one non-terminal for the entire expression language, with parse trees looking very much akin in shape to abstract syntax trees. Rascal inherited all these considerations from SDF, via SDF2, by the way.
Current implementations actually "factor" the grammar or the parse table in some fashion, invisibly, to get the same effect, as if somebody had factored the grammar completely; however, these implementations are under the hood very specific to the parsing algorithm in question. The GLR version is quite different from the GLL version, which again is quite different from the data-dependent version.
Post-parse filtering
Of course any tree, including ambiguous parse forests produced by the parser, can be manipulated by Rascal programs using pattern matching, visit, etc. You could write any algorithm to remove the trees you want. However, this requires the entire forest to be constructed first. It's possible and often fast enough, but there is a faster alternative.
Since the tree is built in a bottom-up fashion from the parse graph after parsing, we can also apply "rewrite rules" during the construction of the tree, and remove certain alternatives.
For example:
Tree amb({Tree a, *Tree others}) = amb(others) when weDoNotWant(a);
Tree amb({Tree a}) = a;
The first rule matches on the ambiguity cluster and removes all alternatives for which weDoNotWant holds. The second rule removes the cluster when only one alternative is left and lets that last tree "win".
If you want to choose among alternatives:
Tree amb({Tree a, Tree b, *Tree others}) = amb({a, *others}) when weFindPreferable(a, b);
If you don't want to use Tree but a more specific non-terminal like Statement that should also work.
This example module uses #prefer tags in syntax definitions to "prefer" rules which have been tagged over the other rules, as post-parse rewrite rules:
https://github.com/usethesource/rascal/blob/master/src/org/rascalmpl/library/lang/sdf2/filters/PreferAvoid.rsc
Hacking around with additional lexical constraints
Next to priority disambiguation and post-parse rewriting, we still have the lexical level disambiguation mechanisms in the toolkit:
NT \ Keywords - rejecting finite (keyword) languages from a non-terminal
CC << NT, NT >> CC, CC !<< NT, NT !>> CC - follow and precede restrictions (where CC stands for character class and NT for non-terminal)
Solving other kinds of ambiguity, apart from the operator precedence stuff, can be tried with these; in particular, if the length of the sub-sentences differs between the alternatives, !>> can do the "maximal munch" or "longest match" thing. So I was thinking out loud:
lexical C = A? B?;
where A is one lexical alternative and B is the other. With the proper !>> restrictions on A and !<< restrictions on B the grammar might be tricked into always wanting to put all characters in A, unless they don't fit into A as a language, in which case they would default to B.
The obvious/annoying advice
Think harder about an unambiguous and simpler grammar.
Sometimes this means to abstract and allow more sentences in the grammar, avoiding use of the grammar for "type checking" the tree. It's often better to over-approximate the syntax of the language and then use (static) semantic analysis (over simpler trees) to get what you want, rather than staring at a complex ambiguous grammar.
A typical example: C blocks with declarations only at the start are much harder to define unambiguously than C blocks where declarations are allowed everywhere. And for a C90 mode, all you have to do is flag declarations which are not at the start of a block.
This particular example
lexical WordChar = [0-9A-Za-z] ;
lexical Digit = [0-9] ;
lexical WordInitialDigit = Digit WordChar* !>> WordChar;
lexical WordAny = WordChar+ !>> WordChar;
syntax Word =
    WordInitialDigit
  | [0-9] !<< WordAny // this would help!
  ;
wrap up
Great question, thanks for the patience. Hope this helps!
The > disambiguation mechanism is for recursive definitions, like, for example, an expression grammar.
So it's to solve the following ambiguity:
syntax E
  = [0-9]+
  | E "+" E
  | E "-" E
  ;
The string 1 + 3 - 4 can be parsed as either 1 + (3 - 4) or (1 + 3) - 4.
The > gives an order to this grammar: which production should be at the top of the tree.
layout L = " "*;
syntax E
  = [0-9]+
  | E "+" E
  > E "-" E
  ;
this now only allows the (1 + 3) - 4 tree.
To finish this story, how about 1 + 1 + 1? That could be 1 + (1 + 1) or (1 + 1) + 1.
This is what we have left, right, and non-assoc for. They define how recursion in the same production should be handled.
syntax E
  = [0-9]+
  | left E "+" E
  > left E "-" E
  ;
will now enforce: (1 + 1) + 1.
When you take an operator precedence table, like for example this C operator precedence table, you can almost literally copy it.
Note that these two disambiguation features are not exactly opposite to each other. The first ambiguity could also have been solved by putting both productions in a left group, like this:
syntax E
  = [0-9]+
  | left (
      E "+" E
    | E "-" E
    )
  ;
As the left side of the tree is favored, 1 + 3 - 4 again comes out as (1 + 3) - 4 here; the difference shows up for a string like 1 - 3 + 4, where the priority-only grammar forces 1 - (3 + 4), while the left group yields the conventional (1 - 3) + 4. So it makes a difference, but it all depends on what you want.
More details can be found in the tutor pages on disambiguation.

Using midaction rules in Lemon to interpret "let" expression

I'm trying to write a "toy" interpreter using Flex + Lemon that supports a very basic "let" syntax where a variable X is temporarily bound to an expression. For example, "letx 3 + 4 in x + 8" should evaluate to 15.
In essence, what I'd "like" the rule to say is:
expr(E) ::= LETX expr(N) IN expr(O). {
  environment->X = N;
  E = O;
}
But that won't work since O is evaluated before the X = N assignment is made.
I understand that the usual solution for this would be a mid-rule action. Lemon doesn't explicitly support this, but I've read elsewhere that mid-rule actions would just be syntactic sugar in any event.
So I've tried to put together a mid-rule action that would do my assignment of X = N before interpreting O:
midruleaction ::= /* mid rule */. { environment->X = N; }
expr(E) ::= LETX expr(N) IN midruleaction expr(O). { E = O; }
But that won't work because there's no way for the midruleaction rule to access N, or at least none I can see in the lemon docs/examples.
I think I'm missing something here. I know I could build up a tree and then walk it in a second pass. And I might end up doing that, but I'd like to understand how to solve this more directly first.
Any suggestions?
It's really not a very scalable solution to evaluate immediately in a parser. See below.
It is true that mid-rule actions are (mostly) syntactic sugar. However, in most cases they are not syntactic sugar for "markers" (non-terminals with empty right-hand sides) but rather for non-terminals representing production prefixes. For example, you could write your letx rule like this:
expr(E) ::= letx_prefix IN expr(O). { E = O; }
letx_prefix ::= LETX expr(N). { environment->X = N; }
Or you could do this:
expr(E) ::= LETX assigned_expr IN expr(O). { E = O; }
assigned_expr ::= expr(N). { environment->X = N; }
The first one is the prefix desugaring; the second one is the one I'd use, because I feel that it separates concerns better. The important point is that the environment->X = N; action requires access to the semantic values of the prefix of the RHS, so it must be part of a prefix rule (which includes at least the symbols whose semantic values are required), rather than of a marker, which has access to no semantic values at all.
Having said all that, immediate evaluation during parsing is a very limited strategy. It cannot cope with a large range of constructs which require deferred evaluation, such as loops and function definitions. It cannot cleanly cope with constructs which may suppress evaluation, such as conditionals and short-circuit operators. (These can be handled using MRAs and a stateful environment which contains an evaluation-suppressed flag, but that's very ugly.)
Another problem is that syntactically incorrect expressions may be partially evaluated before the syntax error is discovered, and it may not be immediately obvious to the user which parts of the expression have and have not been evaluated.
On the whole, you're better off building an easily-evaluated AST during the parse, and evaluating the AST when the parse successfully completes.
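As an illustration of that approach, here is a sketch in Lua (the language of the first question on this page, not Lemon's C; the node shapes are assumptions) of evaluating a letx AST only after the parse has completed:
-- Evaluate an AST for "letx <value> in <body>"; x is bound only while
-- the body is evaluated, which is safe because parsing is already done.
local function eval(node, env)
  if type(node) == "number" then return node end
  if node.kind == "var" then return env[node.name] end
  if node.kind == "add" then return eval(node.left, env) + eval(node.right, env) end
  if node.kind == "letx" then
    local inner = setmetatable({ x = eval(node.value, env) }, { __index = env })
    return eval(node.body, inner)
  end
end
-- "letx 3 + 4 in x + 8" from the question:
local ast = { kind = "letx",
              value = { kind = "add", left = 3, right = 4 },
              body  = { kind = "add", left = { kind = "var", name = "x" }, right = 8 } }
print(eval(ast, {})) --> 15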

Binary trees, construct a tree based on preorder

Constructing a tree given its inorder is easy enough.
But let's say you are supposed to construct a tree based on its preorder (+ + y z + * x y z, for example).
It's easy to see that + is the root, and how to continue in the left subtree from there.
But.. how do you know when you are supposed to "switch" to the right subtree?
Usually, inorder is considered the difficult case.
For preorder, you'll just have a grammar like this.
expr ::= operator expr expr | var
An operator is followed by exactly two well-formed expressions. This can be parsed easily using recursion:
If you parse a tree and get a variable, return the variable.
If you parse a tree and get an operator, parse the two following trees as right/left subtrees.
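In Lua (matching the first question on this page), that recursion might look like this (a sketch; the token-list format and the OPS table are assumptions):
-- An operator node consumes the next two well-formed expressions;
-- anything else is a variable leaf. Returns the tree and the next position.
local OPS = { ["+"] = true, ["*"] = true }
local function parse(tokens, pos)
  pos = pos or 1
  local tok = tokens[pos]
  if OPS[tok] then
    local left, afterLeft = parse(tokens, pos + 1)
    local right, afterRight = parse(tokens, afterLeft)
    return { op = tok, left = left, right = right }, afterRight
  end
  return tok, pos + 1
end
-- "+ + y z + * x y z" from the question:
local tree = parse({ "+", "+", "y", "z", "+", "*", "x", "y", "z" })
The answer to "when do you switch" is the returned position: the right subtree starts exactly where the left subtree's parse stopped.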
