parsing maxscript - problem with newlines - parsing

I am trying to create parser for MAXScript language using their official grammar description of the language. I use flex and bison to create the lexer and parser.
However, I have run into following problem. In traditional languages (e.g. C) statements are separated by a special token (; in C). But in MAXScript expressions inside a compound expression can be separated either by ; or newline. There are other languages that use whitespace characters in their parsers, like Python. But Python is much more strict about the placement of the newline, and following program in Python is invalid:
# compile error
def
foo(x):
print(x)
# compile error
def bar
(x):
foo(x)
However in MAXScript following program is valid:
fn
foo x =
( // parenthesis start the compound expression
a = 3 + 2; // the semicolon is optional
print x
)
fn bar
x =
foo x
And you can even write things like this:
for
x
in
#(1,2,3,4)
do
format "%," x
Which will evaluate fine and print 1,2,3,4, to the output. So newlines can be inserted into many places with no special meaning.
However if you insert one more newline in the program like this:
for
x
in
#(1,2,3,4)
do
format "%,"
x
You will get a runtime error as format function expects to have more than one parameter passed.
Here is part of the bison input file that I have:
expr:
simple_expr
| if_expr
| while_loop
| do_loop
| for_loop
| expr_seq
expr_seq:
"(" expr_semicolon_list ")"
expr_semicolon_list:
expr
| expr TK_SEMICOLON expr_semicolon_list
| expr TK_EOL expr_semicolon_list
if_expr:
"if" expr "then" expr "else" expr
| "if" expr "then" expr
| "if" expr "do" expr
// etc.
This will parse only programs which use newline only as expression separator and will not expect newlines to be scattered in other places in the program.
My question is: Is there some way to tell bison to treat a token as an optional token? For bison it would mean this:
If you find newline token and you can shift with it or reduce, then do so.
Otherwise just discard the newline token and continue parsing.
Because if there is no way to do this, the only other solution I can think of is modifying the bison grammar file so that it expects those newlines everywhere. And bump the precedence of the rule where newline acts as an expression separator. Like this:
%precedence EXPR_SEPARATOR // high precedence
%%
// w = sequence of whitespace tokens
w: %empty // either nothing
| TK_EOL w // or newline followed by other whitespace tokens
expr:
w simple_expr w
| w if_expr w
| w while_loop w
| w do_loop w
| w for_loop w
| w expr_seq w
expr_seq:
w "(" w expr_semicolon_list w ")" w
expr_semicolon_list:
expr
| expr w TK_SEMICOLON w expr_semicolon_list
| expr TK_EOL w expr_semicolon_list %prec EXPR_SEPARATOR
if_expr:
w "if" w expr w "then" w expr w "else" w expr w
| w "if" w expr w "then" w expr w
| w "if" w expr w "do" w expr w
// etc.
However this looks very ugly and error-prone, and I would like to avoid such solution if possible.

My question is: Is there some way to tell bison to treat a token as an optional token?
No, there isn't. (See below for a longer explanation with diagrams.)
Still, the workaround is not quite as ugly as you think, although it's not without its problems.
In order to simplify things, I'm going to assume that the lexer can be convinced to produce only a single '\n' token regardless of how many consecutive newlines appear in the program text, including the case where there are comments scattered among the blank lines. That could be achieved with a complex regular expression, but a simpler way to do it is to use a start condition to suppress \n tokens until a regular token is encountered. The lexer's initial start condition should be the one which suppresses newline tokens, so that blank lines at the beginning of the program text won't confuse anything.
Now, the key insight is that we don't have to insert "maybe a newline" markers all over the grammar, since every newline must appear right after some real token. And that means that we can just add one non-terminal for every terminal:
tok_id: ID | ID '\n'
tok_if: "if" | "if" '\n'
tok_then: "then" | "then" '\n'
tok_else: "else" | "else" '\n'
tok_do: "do" | "do" '\n'
tok_semi: ';' | ';' '\n'
tok_dot: '.' | '.' '\n'
tok_plus: '+' | '+' '\n'
tok_dash: '-' | '-' '\n'
tok_star: '*' | '*' '\n'
tok_slash: '/' | '/' '\n'
tok_caret: '^' | '^' '\n'
tok_open: '(' | '(' '\n'
tok_close: ')' | ')' '\n'
tok_openb: '[' | '[' '\n'
tok_closeb: ']' | ']' '\n'
/* Etc. */
Now, it's just a question of replacing the use of a terminal with the corresponding non-terminal defined above. (No w non-terminal is required.) Once we do that, bison will report a number of shift-reduce conflicts in the non-terminal definitions just added; any terminal which can appear at the end of an expression will instigate a conflict, since the newline could be absorbed either by the terminal's non-terminal wrapper or by the expr_semicolon_list production. We want the newline to be part of expr_semicolon_list, so we need to add precedence declarations starting with newline, so that it is lower precedence than any other token.
That will most likely work for your grammar, but it is not 100% certain. The problem with precedence-based solutions is that they can have the effect of hiding real shift-reduce conflict issues. So I'd recommend running bison on the grammar and verifying that all the shift-reduce conflicts appear where expected (in the wrapper productions) before adding the precedence declarations.
Why token fallback is not as simple as it looks
In theory, it would be possible to implement a feature similar to the one you suggest. [Note 1]
But it's non-trivial, because of the way the LALR parser construction algorithm combines states. The result is that the parser might not "know" that the lookahead token cannot be shifted until it has done one or more reductions. So by the time it figures out that the lookahead token is not valid, it has already performed reductions which would have to be undone in order to continue the parse without the lookahead token.
Most parser generators compound the problem by removing error actions corresponding to a lookahead token if the default action in the state for that token is a reduction. The effect is again to delay detection of the error until after one or more futile reductions, but it has the benefit of significantly reducing the size of the transition table (since default entries don't need to be stored explicitly). Since the delayed error will be detected before any more input is consumed, the delay is generally considered acceptable. (Bison has an option to prevent this optimisation, however.)
As a practical illustration, here's a very simple expression grammar with only two operators:
prog: expr '\n' | prog expr '\n'
expr: prod | expr '+' prod
prod: term | prod '*' term
term: ID | '(' expr ')'
That leads to this state diagram [Note 2]:
Let's suppose that we wanted to ignore newlines pythonically, allowing the input
(
a + b
)
That means that the parser must ignore the newline after the b, since the input might be
(
a + b
* c
)
(Which is fine in Python but not, if I understand correctly, in MAXScript.)
Of course, the newline would be recognised as a statement separator if the input were not parenthesized:
a + b
Looking at the state diagram, we can see that the parser will end up in State 15 after the b is read, whether or not the expression is parenthesized. In that state, a newline is marked as a valid lookahead for the reduction action, so the reduction action will be performed, presumably creating an AST node for the sum. Only after this reduction will the parser notice that there is no action for the newline. If it now discards the newline character, it's too late; there is now no way to reduce b * c in order to make it an operand of the sum.
Bison does allow you to request a Canonical LR parser, which does not combine states. As a result, the state machine is much, much bigger; so much so that Canonical-LR is still considered impractical for non-toy grammars. In the simple two-operator expression grammar above, asking for a Canonical LR parser only increases the state count from 16 to 26, as shown here:
In the Canonical LR parser, there are two different states for the reduction term: term '+' prod. State 16 applies at the top-level, and thus the lookahead includes newline but not ) Inside parentheses the parser will instead reach state 26, where ) is a valid lookahead but newline is not. So, at least in some grammars, using a Canonical LR parser could make the prediction more precise. But features which are dependent on the use of a mammoth parsing automaton are not particularly practical.
One alternative would be for the parser to react to the newline by first simulating the reduction actions to see if a shift would eventually succeed. If you request Lookahead Correction (%define parse.lac full), bison will insert code to do precisely this. This code can create significant overhead, but many people request it anyway because it makes verbose error messages more accurate. So it would certainly be possible to repurpose this code to do token fallback handling, but no-one has actually done so, as far as I know.
Notes:
A similar question which comes up from time to time is whether you can tell bison to cause a token to be reclassified to a fallback token if there is no possibility to shift the token. (That would be useful for parsing languages like SQL which have a lot of non-reserved keywords.)
I generated the state graphs using Bison's -g option:
bison -o ex.tab.c --report=all -g ex.y
dot -Tpng -oex.png ex.dot
To produce the Canonical LR, I defined lr.type to be canonical-lr:
bison -o ex_canon.c --report=all -g -Dlr.type=canonical-lr ex.y
dot -Tpng -oex_canon.png ex_canon.dot

Related

Solve shift/reduce conflict across rules

I'm trying to learn bison by writing a simple math parser and evaluator. I'm currently implementing variables. A variable can be part of a expression however I'd like do something different when one enters only a single variable name as input, which by itself is also a valid expression and hence the shift reduce conflict. I've reduced the language to this:
%token <double> NUM
%token <const char*> VAR
%nterm <double> exp
%left '+'
%precedence TWO
%precedence ONE
%%
input:
%empty
| input line
;
line:
'\n'
| VAR '\n' %prec ONE
| exp '\n' %prec TWO
;
exp:
NUM
| VAR %prec TWO
| exp '+' exp { $$ = $1 + $3; }
;
%%
As you can see, I've tried solving this by adding the ONE and TWO precedences manually to some rules, however it doesn't seem to work, I always get the exact same conflict. The goal is to prefer the line: VAR '\n' rule for a line consisting of nothing but a variable name, otherwise parse it as expression.
For reference, the conflicting state:
State 4
4 line: VAR . '\n'
7 exp: VAR . ['+', '\n']
'\n' shift, and go to state 8
'\n' [reduce using rule 7 (exp)]
$default reduce using rule 7 (exp)
Precedence comparisons are always, without exception, between a production and a token. (At least, on Yacc/Bison). So you can be sure that if your precedence level list does not contain a real token, it will have no effect whatsoever.
For the same reason, you cannot resolve reduce-reduce conflicts with precedence. That doesn't matter in this case, since it's a shift-reduce conflict, but all the same it's useful to know.
To be even more specific, the precedence comparison is between a reduction (using the precedence of the production to be reduced) and that of the incoming lookahead token. In this case, the lookahead token is \n and the reduction is exp: VAR. The precedence level of that production is the precedence of VAR, since that is the last terminal symbol in the production. So if you want the shift to win out over the reduction, you need to declare your precedences so that the shift is higher:
%precedence VAR
%precedence '\n'
No pseudotokens (or %prec modifiers) are needed.
This will not change the parse, because Bison always prefers shift if there are no applicable precedence rules. But it will suppress the warning.

Bison subscript expression unexpected error

With the following grammar:
program: /*empty*/ | stmt program;
stmt: var_decl | assignment;
var_decl: type ID '=' expr ';';
assignment: expr '=' expr ';';
type: ID | ID '[' NUMBER ']';
expr: ID | NUMBER | subscript_expr;
subscript_expr: expr '[' expr ']';
I'd expect the following to be valid:
array[5] = 0;
That's just an assignment with a subscript_expr on the left-hand-side. However the generated parser gives an error for that statement:
syntax error, unexpected '=', expecting ID
Generating the parser also warns that there's 1 shift/reduce conflict. Removing subscript_expr makes it go away.
Why does this happen and how can I get it to parse array[5] = 0; as an assignment with a subscript_expr?
I'm using Bison 2.3.
The following two statements are both valid in your language:
x [ 3 ] = 42;
x [ 3 ] y = 42;
The first is an assignment of an element of the array variable x, while the second is a declaration and initialization of the array variable y whose elements are of type x.
But from the parser's viewpoint, x and y are both just IDs; it has no way of knowing that x is a variable in the first case and a type in the second case. All it can do is notice that the two statements match the productions assignment and var_decl, respectively.
Unfortunately, it cannot do that until it sees the token after the ]. If that token is an ID, then the statement must be a var_decl; otherwise, it's an assignment (assuming the statement is valid, of course).
But in order to parse the statement as an assignment, the parser must be able to produce
expr '=' expr
which in this case is the result of expr: subsciprt_expr, which in turn is subscript_expr: expr[expr]`.
So the set of reductions for the first statement will be as follows: (Note: I didn't write the shifts; rather, I mark the progress of the parse by putting a • at the end of each reduction. To get to the next step, just shift the • until you reach the end of the handle.)
ID • [ NUMBER ] = NUMBER ; expr: ID
expr [ NUMBER • ] = NUMBER ; expr: NUMBER
expr [ expr ] • = NUMBER ; subscript_expr: expr '[' expr ']'
subscript_expr • = NUMBER ; expr: subscript_expr
expr = NUMBER • ; expr: NUMBER
expr = expr ; • assignment: expr '=' expr ';'
assignment
The second statement must be parsed as follows:
ID [ NUMBER ] • ID = NUMBER ; type: ID '[' NUMBER ']'
type ID = NUMBER • ; expr: NUMBER
type ID = expr ; • var_decl: type ID '=' expr ';'
var_decl
That's a shift/reduce conflict, because the crucial decision must be made immediately after the first ID. In the first statement, we need to reduce the identifier to an expr. In the second statement, we must continue shifting until we are ready to reduce a type.
Of course, this problem wouldn't exist if we could lexically distinguish type IDs from variable name IDs, but that may not be possible (or, if possible, it may not be desirable because it requires feedback from the parser to the lexer).
As written, the shift/reduce prediction can be made with fixed lookahead, since the fourth token after the ID will determine the possibilities. That makes the grammar LALR(4), but that doesn't help much since bison only implements LALR(1) parsers. In any case, it is likely that a less simplified grammar will not be fixed-lookahead, for example if constant expressions are allowed for array sizes, or if arrays can have multiple dimensions.
Even so, the grammar is not ambiguous, so it is possible to use a GLR parser to parse it. Bison does implement GLR parsers; it is only necessary to insert
%glr-parser
into the prologue. (The shift/reduce warning will still be produced, but the parser will correctly identify both kinds of statement.)
It's worth noting that C doesn't have this particular parsing problem precisely because it puts the array size after the name of the variable being declared. I don't believe this was done to avoid parsing problems (although who knows?) but rather because it was believed that it is more natural to write declarations the way variables are used. Hence, we write int a[3] and char *p, because in the program we will dereference using a[i] and *p.
It is possible to write an LALR(1) grammar for this syntax, but it's a bit annoying. The key is to delay the reduction of the syntax ID [ NUMBER ] until we know for sure which production it will be the start of. That means we need to include the production expr: ID '[' NUMBER ']'. That will result in a larger number of shift/reduce warnings (since it makes the grammar ambiguous), but since bison always prefers to shift, it should produce a correct parser.
Adding %glr-parser solves this.

yacc shift-reduce for ambiguous lambda syntax

I'm writing a grammar for a toy language in Yacc (the one packaged with Go) and I have an expected shift-reduce conflict due to the following pseudo-issue. I have to distilled the problem grammar down to the following.
start:
stmt_list
expr:
INT | IDENT | lambda | '(' expr ')' { $$ = $2 }
lambda:
'(' params ')' '{' stmt_list '}'
params:
expr | params ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
A lambda function looks something like this:
map((v) { v * 2 }, collection)
My parser emits:
conflicts: 1 shift/reduce
Given the input:
(a)
It correctly parses an expr by the '(' expr ')' rule. However given an input of:
(a) { a }
(Which would be a lambda for the identity function, returning its input). I get:
syntax error: unexpected '{'
This is because when (a) is read, the parser is choosing to reduce it as '(' expr ')', rather than consider it to be '(' params ')'. Given this conflict is a shift-reduce and not a reduce-reduce, I'm assuming this is solvable. I just don't know how to structure the grammar to support this syntax.
EDIT | It's ugly, but I'm considering defining a token so that the lexer can recognize the ')' '{' sequence and send it through as a single token to resolve this.
EDIT 2 | Actually, better still, I'll make lambdas require syntax like ->(a, b) { a * b} in the grammar, but have the lexer emit the -> rather than it being in the actual source code.
Your analysis is indeed correct; although the grammar is not ambiguous, it is impossible for the parser to decide with the input reduced to ( <expr> and with lookahead ) whether or not the expr should be reduced to params before shifting the ) or whether the ) should be shifted as part of a lambda. If the next token were visible, the decision could be made, so the grammar LR(2), which is outside of the competence of go/yacc.
If you were using bison, you could easily solve this problem by requesting a GLR parser, but I don't believe that go/yacc provides that feature.
There is an LR(1) grammar for the language (there is always an LR(1) grammar corresponding to any LR(k) grammar for any value of k) but it is rather annoying to write by hand. The essential idea of the LR(k) to LR(1) transformation is to shift the reduction decisions k-1 tokens forward by accumulating k-1 tokens of context into each production. So in the case that k is 2, each production P: N → α will be replaced with productions TNU → Tα U for each T in FIRST(α) and each U in FOLLOW(N). [See Note 1] That leads to a considerable blow-up of non-terminals in any non-trivial grammar.
Rather than pursuing that idea, let me propose two much simpler solutions, both of which you seem to be quite close to.
First, in the grammar you present, the issue really is simply the need for a two-token lookahead when the two tokens are ){. That could easily be detected in the lexer, and leads to a solution which is still hacky but a simpler hack: Return ){ as a single token. You need to deal with intervening whitespace, etc., but it doesn't require retaining any context in the lexer. This has the added bonus that you don't need to define params as a list of exprs; they can just be a list of IDENT (if that's relevant; a comment suggests that it isn't).
The alternative, which I think is a bit cleaner, is to extend the solution you already seem to be proposing: accept a little too much and reject the errors in a semantic action. In this case, you might do something like:
start:
stmt_list
expr:
INT
| IDENT
| lambda
| '(' expr_list ')'
{ // If $2 has more than one expr, report error
$$ = $2
}
lambda:
'(' expr_list ')' '{' stmt_list '}'
{ // If anything in expr_list is not a valid param, report error
$$ = make_lambda($2, $4)
}
expr_list:
expr | expr_list ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
Notes
That's only an outline; the complete algorithm includes the mechanism to recover the original parse tree. If k is greater than 2 then T and U are strings the the FIRSTk-1 and FOLLOWk-1 sets.
If it really is a shift-reduce conflict, and you want only the shift behavior, your parser generator may give you a way to prefer a shift vs. a reduce. This is classically how the conflict for grammar rules for "if-then-stmt" and "if-then-stmt-else-stmt" is resolved, when the if statement can also be a statement.
See http://www.gnu.org/software/bison/manual/html_node/Shift_002fReduce.html
You can get this effect two ways:
a) Count on the accidental behavior of the parsing engine.
If an LALR parser handles shifts first, and then reductions if there are no shifts, then you'll get this "prefer shift" for free. All the parser generator has to do is built the parse tables anyway, even if there is a detected conflict.
b) Enforce the accidental behavior. Design (or a get a) parser generator to accept "prefer shift on token T". Then one can supress the ambiguity. One still have to implement the parsing engine as in a) but that's pretty easy.
I think this is easier/cleaner than abusing the lexer to make strange tokens (and that doesn't always work anyway).
Obviously, you could make a preference for reductions to turn it the other way. With some extra hacking, you could make shift-vs-reduce specific the state in which the conflict occured; you can even make it specific to the pair of conflicting rules but now the parsing engine needs to keep preference data around for nonterminals. That still isn't hard. Finally, you could add a predicate for each nonterminal which is called when a shift-reduce conflict is about to occur, and it have it provide a decision.
The point is you don't have to accept "pure" LALR parsing; you can bend it easily in a variety of ways, if you are willing to modify the parser generator/engine a little bit. This gives a really good reason to understand how these tools work; then you can abuse them to your benefit.

ANTLR4 - Syntax error on '#' (alternative rule label)

I have made a grammar that will be used with ANTLR4 with the following definition for expressions:
// Expressions
Expr : Integer # Expr_Integer
| Float # Expr_Float
| Double # Expr_Double
| String # Expr_String
| Variable # Expr_Variable
| FuncCall # Expr_FuncCall
| Expr Op_Infix Expr # Expr_Infix
| Op_Prefix Expr # Expr_Prefix
| Expr Op_Postfix # Expr_Postfix
| Expr 'is' Id # Expr_Is
| 'this' # Expr_This
| Expr '?' Expr ':' Expr # Expr_Ternary
| '(' Expr ')' # Expr_Bracketed
;
I added the labels so that I could easily differentiate between the different expression types when analysing the generated syntax tree. However, ANTLR4 throws the following error for every single one of the above lines (excluding the one with the comment):
error(50): Ash.g4:88:19: syntax error: '#' came as a complete surprise to me while looking for lexer rule element
Line 88 is the final rule alternative ( '(' Expr ')' )
I have look through the documentation and various online examples and my syntax seems correct.
What could be causing the error to be thrown?
In Antlr, rules beginning with an uppercase letter are lexer rules, and those beginning with an lowercase letter are parser rules. Antlr uses these definitions a lot to define what you can and cannot do. Usually, the lexer is faster to proccess but less powerful than the parser.
In your case, Expr should definitely be a parser rule, as basically every other rule you have referenced there. Changing it to expr should match the expected behavior.
As a rule of thumb, lexer rules are to be used only when there is no context, it doesn't matter what is next to the generated token. Things like numeric constants, string constants, identifiers and such.

Reduce/reduce conflict in grammar

Let's imagine I want to be able to parse values like this (each line is a separate example):
x
(x)
((((x))))
x = x
(((x))) = x
(x) = ((x))
I've written this YACC grammar:
%%
Line: Binding | Expr
Binding: Pattern '=' Expr
Expr: Id | '(' Expr ')'
Pattern: Id | '(' Pattern ')'
Id: 'x'
But I get a reduce/reduce conflict:
$ bison example.y
example.y: warning: 1 reduce/reduce conflict [-Wconflicts-rr]
Any hint as to how to solve it? I am using GNU bison 3.0.2
Reduce/reduce conflicts often mean there is a fundamental problem in the grammar.
The first step in resolving is to get the output file (bison -v example.y produces example.output). Bison 2.3 says (in part):
state 7
4 Expr: Id .
6 Pattern: Id .
'=' reduce using rule 6 (Pattern)
')' reduce using rule 4 (Expr)
')' [reduce using rule 6 (Pattern)]
$default reduce using rule 4 (Expr)
The conflict is clear; after the grammar reads an x (and reduces that to an Id) and a ), it doesn't know whether to reduce the expression as an Expr or as a Pattern. That presents a problem.
I think you should rewrite the grammar without one of Expr and Pattern:
%%
Line: Binding | Expr
Binding: Expr '=' Expr
Expr: Id | '(' Expr ')'
Id: 'x'
Your grammar is not LR(k) for any k. So you either need to fix the grammar or use a GLR parser.
Suppose the input starts with:
(((((((((((((x
Up to here, there is no problem, because every character has been shifted onto the parser stack.
But now what? At the next step, x must be reduced and the lookahead is ). If there is an = somewhere in the future, x is a Pattern. Otherwise, it is an Expr.
You can fix the grammar by:
getting rid of Pattern and changing Binding to Expr | Expr '=' Expr;
getting rid of all the definitions of Expr and replacing them with Expr: Pattern
The second alternative is probably better in the long run, because it is likely that in the full grammar which you are imagining (or developing), Pattern is a subset of Expr, rather than being identical to Expr. Factoring Expr into a unit production for Pattern and the non-Pattern alternatives will allow you to parse the grammar with an LALR(1) parser (if the rest of the grammar conforms).
Or you can use a GLR grammar, as noted above.

Resources