What does the syntax condition `!` (labelled "except") do? - rascal

The Rascal grammar contains a production for syntax rules that's not documented:
| except: Sym symbol "!" NonterminalLabel
It acts syntactically like a follow condition and is in a section commented as "conditional". I see that it's used in the Rascal grammar itself. The NonterminalLabel is for individual production rules (not a production with all its alternates). So what does this condition do?

if ‘E ! add’ occurs in a rule then it simply means E but the rule labeled "add" for E will be as-if not existing.
This restriction goes only one level deep, so recursive E's will have the "add" rule again.
For example:
syntax E
= id: Id
| app: E "(" {E!comma ","}* ")"
> left mul: E "*" E
> left add: E "+" E
> right comma: E "," E
;
The E in {E ","}* of the function application rule is restricted to not be the comma expression, in order to avoid syntactic ambiguity.
The benefit here is that you do not have to introduce yet another non-terminal which also logically represents expressions.
The pitfall is that the ! operator is a hard restriction and thus may make the accepted language actually smaller: use of ! may lead to unexpected parse errors if not used carefully.
Side note: The ! operator is the primitive constraint type used to model the priority (>) and associative (left, right) semantics as well, but those are transitive and checked for safety, etc. This on the other hand is a brutal removal of a definition and thus may change the accepted sentences of your language.

Related

Validating expressions in the parser

I am working on a SQL grammar where pretty much anything can be an expression, even places where you might not realize it. Here are a few examples:
-- using an expression on the list indexing
SELECT ([1,2,3])[(select 1) : (select 1 union select 1 limit 1)];
Of course this is an extreme example, but my point being, many places in SQL you can use an arbitrarily nested expression (even when it would seem "Oh that is probably just going to allow a number or string constant).
Because of this, I currently have one long rule for expressions that may reference itself, the following being a pared down example:
grammar DBParser;
options { caseInsensitive=true; }
statement:select_statement EOF;
select_statement
: 'SELECT' expr
'WHERE' expr // The WHERE clause should only allow a BoolExpr
;
expr
: expr '=' expr # EqualsExpr
| expr 'OR' expr # BoolExpr
| ATOM # ConstExpr
;
ATOM: [0-9]+ | '\'' [a-z]+ '\'';
WHITESPACE: [ \t\r\n] -> skip;
With sample input SELECT 1 WHERE 'abc' OR 1=2. However, one place I do want to limit what expressions are allowed is in the WHERE (and HAVING) clause, where the expression must be a boolean expression, in other words WHERE 1=1 is valid, but WHERE 'abc' is invalid. In practical terms what this means is the top node of the expression must be a BoolExpr. Is this something that I should modify in my parser rules, or should I be doing this validation downstream, for example in the semantic phase of validation? Doing it this way would probably be quite a bit simpler (even if the lexer rules are a bit lax), as there would be so much indirection and probably indirect left-recursion involved that it would become incredibly convoluted. What would be a good approach here?
Your intuition is correct that breaking this out would probably create indirect left recursion. Also, is it possible that an IDENTIFIER could represent a boolean value?
This is the point of #user207421's comment. You can't fully capture types (i.e. whether an expression is boolean or not) in the parser.
The parser's job (in the Lexer & Parser sense), put fairly simply, is to convert your input stream of characters into the a parse tree that you can work with. As long as it gives a parse tree that is the only possible way to interest the input (whether it is semantically valid or not), it has served its purpose. Once you have a parse tree then during semantic validation, you can consider the expression passed as a parameter to your where clause and determine whether or not it has a boolean value (this may even require consulting a symbol table to determine the type of an identifier). Just like your semantic validation of an OR expression will need to determine that both the lhs and rhs are, themselves, boolean expressions.
Also consider that even if you could torture the parser into catching some of your type exceptions, the error messages you produce from semantic validation are almost guaranteed to be more useful than the generated syntax errors. The parser only catches syntax errors, and it should probably feel a bit "odd" to consider a non-boolean expression to be a "syntax error".

Does a priority declaration disambiguate between alternative lexicals?

In my previous question, there was a priority > declaration in the example. It turned out not to matter because the solution there did not actually invoke priority but rather avoided it by making the alternatives disjoint. In this question, I'm asking whether priority can be used to select one lexical production over another. In the example below, the language of the production WordInitialDigit is intentionally a subset of that of WordAny. The production Word looks like it should disambiguate between the two properly, but the resulting parse tree has an ambiguity node at the top. Is a priority declaration able to decide between different lexical reductions, or does it require there to be a basis of common lexical elements? Or something else?
The example is contrived (there are no actions in the grammar), but the situations it arises from are not. For example, I'd like to use something like this for error recovery, where I can recognize a natural boundary for a unit of syntax and write a production for it. This generic production would be the last element in a priority chain; if it reduces, it means that there was no valid parse. More generally, I need to be able to select lexical elements based on syntactic context. I had hoped, since Rascal is scannerless, that this would be seamless. Perhaps it is, though I don't see it at the moment.
I'm on the unstable branch, version 0.10.0.201807050853.
EDIT: This question is not about > for defining an expression grammar. The documentation for priority declarations talks mostly about expressions, but the very first sentence provides what looks like a perfectly clear definition:
Priority declarations define a partial ordering between the productions within a single non-terminal.
So the example has two productions, an ordering declared between them, and yet the parser is still generating an ambiguity node in the clear presence of a disambiguation rule. So to put a finer point on my question, it looks like I don't know which of two situations pertains. Either (1) if this isn't supposed to work, then there's a defect in the language definition as documented, a deficiency in error reporting of the compiler, and a language design decision that's somewhere between counter-intuitive and user-hostile. Or (2) if this is supposed to work, there's a defect in the compiler and/or parser (presumably because the focus was initially on expressions) and at some point the example will pass its tests.
module ssce
import analysis::grammars::Ambiguity;
import ParseTree;
import IO;
import String;
lexical WordChar = [0-9A-Za-z] ;
lexical Digit = [0-9] ;
lexical WordInitialDigit = Digit WordChar* !>> WordChar;
lexical WordAny = WordChar+ !>> WordChar;
syntax Word =
WordInitialDigit
> WordAny
;
test bool WordInitialDigit_0() = parseAccept( #Word, "4foo" );
test bool WordInitialDigit_1() = parseAccept( #WordInitialDigit, "4foo" );
test bool WordInitialDigit_2() = parseAccept( #WordAny, "4foo" );
bool verbose = false;
bool parseAccept( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return false;
}
catch Ambiguity(loc l, str a, str b):
{
if (verbose)
{
println("[Ambiguity] #<a>, \"<b>\"");
Tree tt = parse(begin, input, allowAmbiguity=true) ;
iprintln(tt);
list[Message] m = diagnose(tt) ;
println( ToString(m) );
}
fail;
}
return true;
}
bool parseReject( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return true;
}
return false;
}
str ToString( list[Message] msgs ) =
( ToString( msgs[0] ) | it + "\n" + ToString(m) | m <- msgs[1..] );
str ToString( Message msg)
{
switch(msg)
{
case error(str s, loc _): return "error: " + s;
case warning(str s, loc _): return "warning: " + s;
case info(str s, loc _): return "info: " + s;
}
return "";
}
Excellent questions.
TL;DR:
the rule priority mechanism is not capable of an algorithmic ordering of a non-terminal's alternatives. Although some kind of partial order is involved in the additional grammatical constraints that a priority declaration generates, there is no "trying" one rule first, before the other. So it simply can't do that. The good news is that the priority mechanism has a formal semantics independent of any parsing algorithm, it's just defined in terms of context-free grammar rules and reduction traces.
using ambiguous rules for error recovery or "robust parsing", is a good idea. However, if there are too many such rules, the parser will eventually start showing quadratic or even cubic behavior, and tree building after parsing might even have higher polynomials. I believe the generated parser algorithm should have a (parameterized) mode for error recovery rather then expressing this at the grammar level.
Accepting ambiguity at parse time, and filtering/choosing trees after parsing is the recommended way to go.
All this talk of "ordering" in the documentation is misleading. Disambiguation is minefield of confusing terminology. For now, I recommend this SLE paper which has some definitions: https://homepages.cwi.nl/~jurgenv/papers/SLE2013-1.pdf
Details
priority mechanism not capable of choosing among alternatives
The use of the > operator and left, right generates a partial order between mutually recursive rules, such as found in expression languages, and limited to specific item positions in each rule: namely the left-most and right-most recursive positions which overlap. Rules which are lower in the hierarchy are not allowed to be grammatically expanded as "children" of rules which are higher in the hierarchy. So in E "*" E, neither E may be expaned to E "+" E if E "*" E > E "+" E.
The additional constraints do not choose for any E which alternative to try first. No they simply disallow certain expansions, assuming the other expansion is still valid and thus the ambiguity is solved.
The reason for the limitation at specific positions is that for these positions the parser generator can "prove" that they will generate ambiguity, and thus filtering one of the two alternatives by disallowing certain nestings will not result in additional parse errors. (consider a rule for array indexing: E "[" E "]" which should not have additional constraints for the second E. This is a so-called "syntax-safe" disambiguation mechanism.
All and all it is a pretty weak mechanism algorithmically, and specifically tailored for mutually recursive combinator/expression-like languages. The end-goal of the mechanism is to make sure we use have to use only 1 non-terminal for the entire expression language, and the parse trees looking very much akin in shape to abstract syntax trees. Rascal inherited all these considerations from SDF, via SDF2, by the way.
Current implementations actually "factor" the grammar or the parse table in some fashion invisibly to get the same effect, as-if somebody would have factored the grammar completely; however these implementations under-the-hood are very specific to the parsing algorithm in question. the GLR version is quite different from the GLL version, which again is quite different from the DataDependent version.
Post-parse filtering
Of course any tree, including ambiguous parse forests produced by the parser, can be manipulated by Rascal programs using pattern matching, visit, etc. You could write any algorithm to remove the trees you want. However, this requires the entire forest to be constructed first. It's possible and often fast enough, but there is a faster alternative.
Since the tree is built in a bottom-up fashion from the parse graph after parsing, we can also apply "rewrite rules" during the construction of the tree, and remove certain alternatives.
For example:
Tree amb({Tree a, *Tree others}) = amb(others) when weDoNotWant(a);
Tree amb({Tree a}) = a;
This first rule would match on the ambiguity cluster for all trees, and remove all alternatives which weDoNotWant. The second rule removes the cluster if only one alternative is left and let's the last tree "win".
If you want to choose among alternatives:
Tree amb({Tree a, Tree b, *Tree others}) = amb({a, others} when weFindPeferable(a, b);
If you don't want to use Tree but a more specific non-terminal like Statement that should also work.
This example module uses #prefer tags in syntax definitions to "prefer" rules which have been tagged over the other rules, as post-parse rewrite rules:
https://github.com/usethesource/rascal/blob/master/src/org/rascalmpl/library/lang/sdf2/filters/PreferAvoid.rsc
Hacking around with additional lexical constraints
Next to priority disambiguation and post-parse rewriting, we still have the lexical level disambiguation mechanisms in the toolkit:
`NT \ Keywords" - rejecting finite (keyword) languages from a non-terminals
CC << NT, NT >> CC, CC !<< NT, NT !>> CC follow and preceede restrictions (where CC stands for character-class and NT for non-terminal)
Solving other kinds of ambiguity apart from the operator precedence stuff can be tried with these, in particular if the length of different sub-sentences is shorter/longer between the different alternatives, !>> can do the "maximal munch" or "longest match" thing. So I was thinking out loud:
lexical C = A? B?;
where A is one lexical alternative and B is the other. With the proper !>> restrictions on A and !<< restrictions on B the grammar might be tricked into always wanting to put all characters in A, unless they don't fit into A as a language, in which case they would default to B.
The obvious/annoying advice
Think harder about an unambiguous and simpler grammar.
Sometimes this means to abstract and allow more sentences in the grammar, avoiding use of the grammar for "type checking" the tree. It's often better to over-approximate the syntax of the language and then use (static) semantic analysis (over simpler trees) to get what you want, rather then staring at a complex ambiguous grammar.
A typical example: C blocks with declarations only at the start are much harder to define unambiguously then C blocks where declarations are allowed everywhere. And for a C90 mode, all you have to do is flag declarations which are not at the start of a block.
This particular example
lexical WordChar = [0-9A-Za-z] ;
lexical Digit = [0-9] ;
lexical WordInitialDigit = Digit WordChar* !>> WordChar;
lexical WordAny = WordChar+ !>> WordChar;
syntax Word =
WordInitialDigit
| [0-9] !<< WordAny // this would help!
;
wrap up
Great question, thanks for the patience. Hope this helps!
The > disambiguation mechanism is for recursive definitions, like for example a expression grammar.
So it's to solve the following ambiguity:
syntax E
= [0-9]+
| E "+" E
| E "-" E
;
The string 1 + 3 - 4 can not be parsed as 1 + (3 - 4) or (1 + 3) - 4.
The > gives an order to this grammar, which production should be at the top of the tree.
layout L = " "*;
syntax E
= [0-9]+
| E "+" E
> E "-" E
;
this now only allows the (1 + 3) - 4 tree.
To finish this story, how about 1 + 1 + 1? That could be 1 + (1 + 1) or (1 + 1) + 1.
This is what we have left, right, and non-assoc for. They define how recursion in the same production should be handled.
syntax E
= [0-9]+
| left E "+" E
> left E "-" E
;
will now enforce: 1 + (1 + 1).
When you take an operator precendence table, like for example this c operator precedance table you can almost literally copy them.
note that these two disambiguation features are not exactly opposite to each other. the first ambiguitity could also have been solved by putting both productions in a left group like this:
syntax E
= [0-9]+
| left (
E "+" E
| E "-" E
)
;
As the left side of the tree is favored, you will now get a different tree 1 + (3 - 4). So it makes a difference, but it all depends on what you want.
More details can be found in the tutor pages on disambiguation

Bison: how to fix reduce/reduce conflict

Below is a a Bison grammar which illustrates my problem. The actual grammar that I'm using is more complicated.
%glr-parser
%%
s : e | p '=' s;
p : fp | p ',' fp;
fp : 'x';
e : te | e ';' te;
te : fe | te ',' fe;
fe : 'x';
Some examples of input would be:
x
x = x
x,x = x,x
x,x = x;x
x,x,x = x,x;x,x
x = x,x = x;x
What I'm after is for the x's on the left side of an '=' to be parsed differently than those on the right. However, the set of legal "expressions" which may appear on the right of an '='-sign is larger than those on the left (because of the ';').
Bison prints the message (input file was test.y):
test.y: conflicts: 1 reduce/reduce.
There must be some way around this problem. In C, you have a similar situation. The program below passes through gcc with no errors.
int main(void) {
int x;
int *px;
x;
*px;
*px = x = 1;
}
In this case, the 'px' and 'x' get treated differently depending on whether they appear to the left or right of an '='-sign.
You're using %glr-parser, so there's no need to "fix" the reduce/reduce conflict. Bison just tells you there is one, so that you know you grammar might be ambiguous, so you might need to add ambiguity resolution with %dprec or %merge directives. But in your case, the grammar is not ambiguous, so you don't need to do anything.
A conflict is NOT an error, its just an indication that your grammar is not LALR(1).
The reduce-reduce conflict in your grammar comes from the context:
... = ... x ,
At this point, the parser has to decide whether x is an fe or an fp, and it cannot know with one symbol lookahead. Indeed, it cannot know with any finite lookahead, you could have any number of repetitions of x , following that point without encountering a =, ; or the end of the input, any of which would reveal the answer.
This is not quite the same as the C issue, which can be resolved with single symbol lookahead. However, the C example is a classic illustration of why SLR(1) grammars are less powerful than LALR(1) grammars -- it's used for that purpose in the dragon book -- and a similarly problematic grammar is an example of the difference between LALR(1) and LR(1); it can be found in the bison manual (here):
def: param_spec return_spec ',';
param_spec: type | name_list ':' type;
return_spec: type | name ':' type;
type: "id";
name: "id";
name_list: name | name ',' name_list;
(The bison manual explains how to resolve this issue for LALR(1) grammars, although using a GLR grammar is always a possibility.)
The key to resolving such conflicts without using a GLR grammar is to avoid forcing the parser to make premature decisions.
For example, it is traditional to distinguish syntactically between lvalues and rvalues, and some languages continue to do so. C and C++ do not, however; and this turns out to be an extremely powerful feature in C++ because it allows the definition of functions which can act as lvalues.
In C, I think it's just to simplify the grammar a bit: the C grammar allows the result of any unary operator to appear on the left hand side of an assignment operator, but unary operators are actually a mix of lvalues (*v, v[expr]) and rvalues (sizeof v, f(expr)). The grammar could have distinguished between the two kinds of unary operators, but it could not resolve the actual restriction, which is that only modifiable lvalues may appears on the left side of an assignment operator.
C++ allows an arbitrary expression to appear on the left-hand side of an assignment operator (although some need to be parenthesized); consequently, the following is totally legal:
(predicate(x) ? *some_pointer : some_variable) = 42;
In your case, you could resolve the conflict syntactically by replacing te with p, since both non-terminals produce the same set of derivations. That's probably not the general solution, unless it is really the case in your full grammar that left-side expressions are a strict subset of right-side expressions. In a full grammar, you might end up with three types of expression (left-only, right-only, common), which could considerably complicated the grammar, and leaving the resolution for semantic analysis might prove to be easier (and even, as in the case of C++, surprisingly useful).

Producing Expressions from This Grammar with Recursive Descent

I've got a simple grammar. Actually, the grammar I'm using is more complex, but this is the smallest subset that illustrates my question.
Expr ::= Value Suffix
| "(" Expr ")" Suffix
Suffix ::= "->" Expr
| "<-" Expr
| Expr
| epsilon
Value matches identifiers, strings, numbers, et cetera. The Suffix rule is there to eliminate left-recursion. This matches expressions such as:
a -> b (c -> (d) (e))
That is, a graph where a goes to both b and the result of (c -> (d) (e)), and c goes to d and e. I'm trying to produce an abstract syntax tree for these expressions, but I'm running into difficulty because all of the operators can accept any number of operands on each side. I'd rather keep the logic for producing the AST within the recursive descent parsing methods, since it avoids having to duplicate the logic of extracting an expression. My current strategy is as follows:
If a Value appears, push it to the output.
If a From or To appears:
Output a separator.
Get the next Expr.
Create a Link node.
Pop the first set of operands from output into the Link until a separator appears.
Erase the separator discovered.
Pop the second set of operands into the Link until a separator.
Push the Link to the output.
If I run this through without obeying steps 2.3–2.7, I get a list of values and separators. For the expression quoted above, a -> b (c -> (d) (e)), the output should be:
A sep_1 B sep_2 C sep_3 D E
Applying the To rule would then yield:
A sep_1 B sep_2 (link from C to {D, E})
And subsequently:
(link from A to {B, (link from C to {D, E})})
The important thing to note is that sep_2, crucial to delimit the left-hand operands of the second ->, does not appear, so the parser believes that the expression was actually written:
a -> (b c -> (d) (e))
In order to solve this with my current strategy, I would need a way to produce a separator between adjacent expressions, but only if the current expression is a From or To expression enclosed in parentheses. If that's possible, then I'm just not seeing it and the answer ought to be simple. If there's a better way to go about this, however, then please let me know!
I haven't tried to analyze it in detail, but: "From or To expression enclosed in parentheses" starts to sound a lot like "context dependent", which recursive descent can't handle directly. To avoid context dependence you'll probably need a separate production for a From or To in parentheses vs. a From or To without the parens.
Edit: Though it may be too late to do any good, if my understanding of what you want to match is correct, I think I'd write it more like this:
Graph :=
| List Sep Graph
;
Sep := "->"
| "<-"
;
List :=
| Value List
;
Value := Number
| Identifier
| String
| '(' Graph ')'
;
It's hard to be certain, but I think this should at least be close to matching (only) the inputs you want, and should make it reasonably easy to generate an AST that reflects the input correctly.

Bison Shift/Reduce conflict for simple grammar

I'm building a parser for a language I've designed, in which type names start with an upper case letter and variable names start with a lower case letter, such that the lexer can tell the difference and provide different tokens. Also, the string 'this' is recognised by the lexer (it's an OOP language) and passed as a separate token. Finally, data members can only be accessed on the 'this' object, so I built the grammar as so:
%token TYPENAME
%token VARNAME
%token THIS
%%
start:
Expression
;
Expression:
THIS
| THIS '.' VARNAME
| Expression '.' TYPENAME
;
%%
The first rule of Expression allows the user to pass 'this' around as a value (for example, returning it from a method or passing to a method call). The second is for accessing data on 'this'. The third rule is for calling methods, however I've removed the brackets and parameters since they are irrelevant to the problem. The originally grammar was clearly much larger than this, however this is the smallest part that generates the same error (1 Shift/Reduce conflict) - I isolated it into its own parser file and verified this, so the error has nothing to do with any other symbols.
As far as I can see, the grammar given here is unambiguous and so should not produce any errors. If you remove any of the three rules or change the second rule to
Expression '.' VARNAME
there is no conflict. In any case, I probably need someone to state the obvious of why this conflict occurs and how to resolve it.
The problem is that the grammar can only look one ahead. Therefore when you see a THIS then a ., are you in line 2(Expression: THIS '.' VARNAME) or line 3 (Expression: Expression '.' TYPENAME, via a reduction according to line 1).
The grammar could reduce THIS. to Expression. and then look for a TYPENAME or shift it to THIS. and look for a VARNAME, but it has to decide when it gets to the ..
I try to avoid y.output but sometimes it does help. I looked at the file it produced and saw.
state 1
2 Expression: THIS. [$end, '.']
3 | THIS . '.' VARNAME
'.' shift, and go to state 4
'.' [reduce using rule 2 (Expression)]
$default reduce using rule 2 (Expression)
Basically it is saying it sees '.' and can reduce or it can shift. Reduce makes me anrgu sometimes because they are hard to fine. The shift is rule 3 and is obvious (but the output doesnt mention the rule #). The reduce where it see's '.' in this case is the line
| Expression '.' TYPENAME
When it goes to Expression it looks at the next letter (the '.') and goes in. Now it sees THIS | so when it gets to the end of that statement it expects '.' when it leaves or an error. However it sees THIS '.' while its between this and '.' (hence the dot in the out file) and it CAN reduce a rule so there is a path conflict. I believe you can use %glr-parser to allow it to try both but the more conflicts you have the more likely you'll either get unexpected output or an ambiguity error. I had ambiguity errors in the past. They are annoying to deal with especially if you dont remember what rule caused or affected them. it is recommended to avoid conflicts.
I highly recommend this book before attempting to use bison.
I cant think of a 'great' solution but this gives no conflicts
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
| THIS //trick is moving this AWAY so it doesnt reduce
rval:
THIS '.' VARNAME
Alternative you can make it reduce later by adding more to the rule so it doesnt reduce as soon or by adding a token after or before to make it clear which path to take or fails (remember, it must know BEFORE reducing ANY path)
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
rval:
THIS '#'
| THIS '.' VARNAME
%%
-edit- note if i want to do func param and type varname i cant because type according to the lexer func is a Var (which is A-Za-z09_) as well as type. param and varname are both var's as well so this will cause me a reduce/reduce conflict. You cant write this as what they are, only what they look like. So keep that in mind when writing. You'll have to write a token to differentiate the two or write it as one of the two but write additional logic in code (the part that is in { } on the right side of the rules) to check if it is a funcname or a type and handle both those case.

Resources