I'm new to ANTLR and am trying to understand how to do some things with it. I need it to throw an error when a statement is missing something, like a semicolon or a closing bracket. These are called negative test cases in the problem set I'm working through.
For example, the below code returns true, which is correct.
val program = """
1 + 2;
"""
recognize(program)
However, this code also returns true, despite it missing the semicolon at the end. It should return false ([PARSER error at line=1]: missing ';' at '').
val program = """
1 + 2
""".trimIndent()
recognize(program)
The grammar is as follows:
program: (expression ';')* | EOF;
expression: INT PLUS INT | OPENBRAC INT PLUS INT CLOSEBRAC | QUOTE IDENT QUOTE PLUS QUOTE IDENT QUOTE;
IDENT: [A-Za-z0-9]+;
INT: [-][0-9]+ | ('0'..'9')+;
PLUS: '+';
OPENBRAC: '(';
CLOSEBRAC: ')';
QUOTE: '"';
program: (expression ';')* | EOF;
This means a program is either zero or more instances of expression ';', with whatever else is in the input stream simply left unread, or it is just the end of input. Since (expression ';')* can already match the empty input by itself, the | EOF alternative is redundant.
What you want is program: (expression ';')* EOF, which means that a program consists of zero or more instances of expression ';', followed by the end of input, meaning there must be nothing left in the input afterwards.
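Applied to the grammar in the question, only the start rule changes; a minimal sketch:
program: (expression ';')* EOF;
All the other rules stay exactly as they are. With the EOF anchor in place, 1 + 2 without the trailing semicolon can no longer be matched all the way to the end of the input, so recognize gets the missing ';' error the question expects instead of silently ignoring the leftover tokens.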
I am implementing a parser for a subset of Java using Java CUP.
The grammar is like
vardecl ::= type ID
type ::= ID | INT | FLOAT | ...
exp ::= ID | exp LBRACKET exp RBRACKET | ...
stmt ::= ID ASSIGN exp SEMI
This works fine, but when I add
stmt ::= ID ASSIGN exp SEMI
|ID LBRACKET exp RBRACKET ASSIGN exp SEMI
CUP won't work; the warnings are:
Warning : *** Shift/Reduce conflict found in state #122
between exp ::= identifier (*)
and statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI
under symbol LBRACKET
Resolved in favor of shifting.
Warning : *** Reduce/Reduce conflict found in state #42
between type ::= identifier (*)
and exp ::= identifier (*)
under symbols: {}
Resolved in favor of the first production.
Warning : *** Shift/Reduce conflict found in state #42
between type ::= identifier (*)
and statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI
under symbol LBRACKET
Resolved in favor of shifting.
Warning : *** Shift/Reduce conflict found in state #42
between exp ::= identifier (*)
and statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI
under symbol LBRACKET
Resolved in favor of shifting.
I think there are two problems:
1. type ::= ID and exp ::= ID: when the parser sees an ID, it wants to reduce it, but it doesn't know which to reduce to, type or exp.
2. stmt ::= ID LBRACKET exp RBRACKET ASSIGN exp SEMI is for assignment to an element of an array, such as arr[key] = value;
exp ::= exp LBRACKET exp RBRACKET is for an expression that gets an element from an array, such as arr[key]
So in the case arr[key], when the parser sees arr, it knows that it is an ID, but it doesn't know whether it should shift or reduce it to exp.
However, I have no idea how to fix this; please give me some advice if you have any, thanks a lot.
Your analysis is correct. The grammar is LR(2) because declarations cannot be identified until the ] token is seen, and that token is two tokens after the ID which could be a type.
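For example (hypothetical inputs, assuming the elided type alternatives include an array type along the lines of type ::= ID LBRACKET RBRACKET):
a [ ] x        vardecl: the array type a[] followed by the variable name x
a [ i ] = x ;  stmt: assignment to element i of the array a
Both begin with ID LBRACKET, so one token of lookahead after the ID cannot tell a type from an expression; the distinguishing token is the one after the [.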
One simple solution is to hack the lexer to return [] as a single token when the brackets appear as consecutive tokens. (The lexer should probably allow whitespace between the brackets, too, so it's not quite trivial but it's not complicated.) If a [ is not immediately followed by a ], the lexer will return it as an ordinary [. That makes it easy for the parser to distinguish between assignment to an array (which will have a [ token) and declaration of an array (which will have a [] token).
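If the lexer is a JFlex specification (the usual companion to CUP), the hack might look roughly like this sketch; the symbol(...) helper and the token name LRBRACK are assumptions here, not names taken from the question:
/* A "[" followed only by whitespace and then "]" is returned as one LRBRACK
   ("[]") token; JFlex prefers the longest match, so any other "[" falls
   through to the ordinary LBRACKET rule. */
"[" [ \t\r\n]* "]"    { return symbol(sym.LRBRACK); }
"["                   { return symbol(sym.LBRACKET); }
"]"                   { return symbol(sym.RBRACKET); }
The declaration rules in the CUP grammar then use LRBRACK (for instance an array type rule such as type ::= type LRBRACK), while the indexing and assignment rules keep LBRACKET and RBRACKET, so one token of lookahead is enough to tell a declaration from an assignment.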
It's also possible to rewrite the grammar, but that's a real nuisance.
The second problem -- array indexing assignment versus array indexed expressions. Normally programming languages allow assignment of the form:
exp [ exp ] = exp
and not just ID [ exp ]. Making this change will delay the need to reduce until late enough for the parser to identify the correct reduction. Depending on the language, it's possible that this syntax is not semantically meaningful but checking that is in the realm of type checking (semantics) not syntax. If there is some syntax of that form which is meaningful, however, there is no obvious reason to prohibit it.
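In CUP terms, the change is roughly the following sketch, reusing the token names from the question:
stmt ::= ID ASSIGN exp SEMI
       | exp LBRACKET exp RBRACKET ASSIGN exp SEMI
       ;
Combined with the [] lexer trick above, the parser can reduce the ID to exp when the lookahead is LBRACKET, and the choice between the indexing expression and the array assignment is not needed until the token after the RBRACKET: an ASSIGN there means it was an assignment target.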
Some parser generators implement GLR parsers. A GLR parser would have no problem with this grammar because it is not ambiguous. But CUP isn't such a generator.
I'm working on a GLR-parser in GNU bison and I have the following problem:
the language I'm trying to parse allows boolean expressions including relations (<, >, <=, ...) and boolean composition (and, or, not). Now the problem is that the language also allows multiple arithmetic expressions on the right side of a relation... and they are composed using the same AND token that is used for boolean composition! This is very dumb language design, but I can't change it.
So you can have a > b and c, which is supposed to be equivalent to (a > b) and (a > c), and you can also have a > b and c > d, which is supposed to be equivalent to (a > b) and (c > d).
The S/R conflict this causes is already obvious in this example: after reading a > b with lookahead and, you could either reduce the a > b to a boolean expression and wait for another boolean expression, or you could shift the and and wait for another arithmetic expression.
My grammar currently looks like this:
booleanexpression
: relation
| booleanexpression TOK_AND booleanexpression
...
;
relation
: arithmeticexpression TOK_GT maxtree
...
;
maxtree
: arithmeticexpression
| maxtree TOK_AND maxtree
...
;
The language is clearly not LR(k) for any k, since the S/R conflict can't be resolved using any constant k-lookahead, because the arithmeticexpression in between can have arbitrarily many tokens. Because of that, I turned GLR-parsing on.
But when I try to parse a > b and c with this, I can see in my debug outputs, that the parser behaves like this:
it reads the a and at lookahead > it reduces the a to an arithmeticexpression
it reads the b and at lookahead and it reduces the b to an arithmeticexpression and then already to a maxtree
it reduces the a > b to a relation
it reads the c and reduces it to an arithmeticexpression
then nothing happens! The remaining and c tokens are apparently discarded - the debug outputs don't show any action for them. Not even an error message appears. The corresponding if-statement doesn't exist in my AST (I still get an AST because I have error recovery).
I would think that, after reading the b, there should be 2 stacks. But then the b shouldn't be reduced. Or at least it should give me some error message ("language is ambiguous" would be okay and I have seen that message before - I don't see why it wouldn't apply here). Can anyone make sense of this?
From looking at the grammar for a while, you can tell that the main question here is whether after the next arithmeticexpression there comes
another relation token (then you should reduce)
another boolean composition (then you should shift)
a token outside of the boolean/arithmetic -expression syntax (like THEN) which would terminate the expression and you should also shift
Can you think of a different grammar that captures the situation in a better / more deterministic way? How would you approach the problem? I'm currently thinking about making the grammar more right-to-left, like
booleanexpression : relation AND booleanexpression
maxtree : arithmeticexpression AND maxtree
etc.
I think that would make bison prefer shifting and only reduce on the right first. Maybe by using different non-terminals it would allow a quasi-"lookahead" behind the arithmeticexpression...
Side note: GnuCOBOL handles this problem by just collecting all the tokens, pushing them on an intermediate stack and manually building the expression from there. That discourages me, but I cling to the hope that they did it this way because bison didn't support GLR-parsing when they started...
EDIT:
a small reproducible example
%{
#include <stdio.h>
int yylex ();
void yyerror(const char* msg);
%}
%glr-parser
%left '&'
%left '>'
%%
input: %empty | input bool '\n' {printf("\n");};
arith : 'a' | 'b' | 'c';
maxtree : arith { printf("[maxtree : arith] "); }
| maxtree '&' maxtree { printf("[maxtree : maxtree & maxtree] "); } ;
rel : arith '>' maxtree { printf("[rel : arith > maxtree] "); } ;
bool : rel { printf("[bool : rel] "); }
| bool '&' bool { printf("[bool : bool & bool] "); } ;
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex () {
int c;
while ((c = getchar ()) == ' ' || c == '\t');
return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
return yyparse();
}
This one, strangely, does print the error message "syntax error" on input a>b&c.
Being able to simplify grammars by using precedence declarations is really handy (sometimes) [Note 1] but it doesn't play well with using GLR parsers because it can lead to early rejection of an unambiguous parse.
The idea behind precedence declarations is that they resolve ambiguities (or, more accurately, shift/reduce conflicts) using a simple one-token lookahead and a configured precedence between the possible reduction and the possible shift. If a grammar has no shift/reduce conflict, the precedence declarations won't be used, but if they are used they will be used to suppress either the shift or the reduce, depending on the (static) precedence relationship.
A Bison-generated GLR parser does not actually resolve ambiguity, but it allows possibly incorrect parses to continue to be developed until the ambiguity is resolved by the grammar. Unlike the use of precedence, this is a delayed resolution; a bit slower but a lot more powerful. (GLR parsers can produce a "parse forest" containing all possible parses. But Bison doesn't implement this feature, since it expects to be parsing programming languages and unlike human languages, programming languages cannot be ambiguous.)
In your language, it is impossible to resolve the non-determinism of the shift/reduce conflict statically, as you note yourself in the question. Your grammar is simply not LR(1), much less operator precedence, and GLR parsing is therefore a practical solution. But you have to allow GLR to do its work. Prematurely eliminating one of the plausible parses with a precedence comparison will prevent the GLR algorithm from considering it later. This will be particularly serious if you manage to eliminate the only parse which could have been correct.
In your grammar, it is impossible to define a precedence relationship between the rel productions and the & symbol, because no precedence relationship exists. In some sentences, the rel reduction needs to win; in other sentences, the shift should win. Since the grammar is not ambiguous, GLR will eventually figure out which is which, as long as both the shift and the reduce are allowed to proceed.
In your full language, both boolean and arithmetic expressions have something akin to operator precedence, but only within their respective domains. An operator precedence parser (and, equivalently, yacc/bison's precedence declarations) works by erasing the difference between different non-terminals; it cannot handle a grammar like yours in which some operator has different precedences in different domains (or between different domains).
Fortunately, this particular use of precedence declarations is only a shortcut; it does not give any additional power to the grammar and can easily and mechanically be implemented by creating new non-terminals, one for each precedence level. The alternative grammar will not be ambiguous. The classic example, which you can find in pretty well any textbook or tutorial which includes parsing arithmetic expressions, is the expr/term/factor grammar. Here I've also provided the precedence grammar for comparison:
Without precedence declarations:
%%
expr : term
     | expr '+' term
     | expr '-' term
term : factor
     | term '*' factor
     | term '/' factor
factor: ID
      | '(' expr ')'
With precedence declarations:
%left '+' '-'
%left '*' '/'
%%
expr: expr '+' expr
    | expr '-' expr
    | expr '*' expr
    | expr '/' expr
    | ID
    | '(' expr ')'
In your minimal example, there are already enough non-terminals that no new ones need to be invented, so I've just rewritten it according to the above model.
I've left the actions as I wrote them, in case the style is useful to you. Note that this style leaks memory like a sieve, but that's ok for quick tests:
%code top {
#define _GNU_SOURCE 1
}
%{
#include <ctype.h>
#include <stdio.h>
#include <string.h>
int yylex(void);
void yyerror(const char* msg);
%}
%define api.value.type { char* }
%glr-parser
%token ID
%%
input : %empty
| input bool '\n' { puts($2); }
arith : ID
maxtree : arith
| maxtree '&' arith { asprintf(&$$, "[maxtree& %s %s]", $1, $3); }
rel : arith '>' maxtree { asprintf(&$$, "[COMP %s %s]", $1, $3); }
bool : rel
| bool '&' rel { asprintf(&$$, "[AND %s %s]", $1, $3); }
%%
void yyerror(const char* msg) { printf("%s\n", msg); }
int yylex(void) {
int c;
while ((c = getchar ()) == ' ' || c == '\t');
if (isalpha(c)) {
*(yylval = strdup(" ")) = c;
return ID;
}
else return c == EOF ? 0 : c;
}
int main (int argc, char** argv) {
#if YYDEBUG
if (argc > 1 && strncmp(argv[1], "-d", 2) == 0) yydebug = 1;
#endif
return yyparse();
}
Here's a sample run. Note the warning from bison about a shift/reduce conflict. If there had been no such warning, the GLR parser would probably be unnecessary, since a grammar without conflicts is deterministic. (On the other hand, since bison's GLR implementation optimises for determinism, there is not too much cost for using a GLR parser on a deterministic language.)
$ bison -t -o glr_prec.c glr_prec.y
glr_prec.y: warning: 1 shift/reduce conflict [-Wconflicts-sr]
$ gcc -Wall -o glr_prec glr_prec.c
$ ./glr_prec
a>b
[COMP a b]
a>b & c
[COMP a [maxtree& b c]]
a>b & c>d
[AND [COMP a b] [COMP c d]]
a>b & c & c>d
[AND [COMP a [maxtree& b c]] [COMP c d]]
a>b & c>d & e
[AND [COMP a b] [COMP c [maxtree& d e]]]
$
Notes
Although precedence declarations are handy when you understand what's actually going on, there is a huge tendency for people to just cargo-cult them from some other grammar they found on the internet, and not infrequently a grammar which was also cargo-culted from somewhere else. When the precedence declarations don't work as expected, the next step is to randomly modify them in the hopes of finding a configuration which works. Sometimes that succeeds, often leaving behind unnecessary detritus which will go on to be cargo-culted again.
So, although there are circumstances in which precedence declarations really simplify grammars and the unambiguous implementation would be quite a lot more complicated (such as dangling-else resolution in languages which have many different compound statement types), I've still found myself recommending against their use.
In a recent answer to a different question, I wrote what I hope is a good explanation of the precedence algorithm (and if it isn't, please let me know how it falls short).
Welcome to the wonderful world of COBOL. I could be wrong, but you may have a few additional problems here. An expression such as A > B AND C in COBOL is ambiguous until you know how C was declared. Consider the following program:
IDENTIFICATION DIVISION.
PROGRAM-ID. EXAMPLE.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 A PIC 9 VALUE 2.
01 B PIC 9 VALUE 1.
01 W PIC 9 VALUE 3.
88 C VALUE 3.
PROCEDURE DIVISION.
IF A > B AND C
DISPLAY 'A > B AND 88 LEVEL C is TRUE because W = ' W
ELSE
DISPLAY 'A not > B or 88 LEVEL C is not TRUE'
END-IF
DISPLAY 'A: ' A ' B: ' B ' W:' W
GOBACK
.
Output from this program is:
A > B AND 88 LEVEL C is TRUE because W = 3
A: 2 B: 1 W: 3
In essence the expression: A > B AND C is equivalent to: A > B AND W = 3. Had C been defined in a manner similar to A and B, the semantics would have been: A > B AND A > C, which in this case, is FALSE.
The code mentioned above works well, but I had never gotten it to work in my real project, even though I couldn't see a difference between my real project and this code.
This drove me crazy, but I just found another problem in my code, which prevented this method from working:
I had an (admittedly cargo-culted) %skeleton "lalr1.cc" in my prologue, which disabled the GLR parsing again!
I needed to replace this with
%skeleton "glr.cc"
I've been trying to tackle a seemingly simple shift/reduce conflict to no avail. Naturally, the parser works fine if I just ignore the conflict, but I'd feel much safer if I reorganized my rules. Here, I've simplified a relatively complex grammar to the single conflict:
statement_list
: statement_list statement
|
;
statement
: lvalue '=' expression
| function
;
lvalue
: IDENTIFIER
| '(' expression ')'
;
expression
: lvalue
| function
;
function
: IDENTIFIER '(' ')'
;
With the verbose option in yacc, I get this output file describing the state with the mentioned conflict:
state 2
lvalue -> IDENTIFIER . (rule 5)
function -> IDENTIFIER . '(' ')' (rule 9)
'(' shift, and go to state 7
'(' [reduce using rule 5 (lvalue)]
$default reduce using rule 5 (lvalue)
Thank you for any assistance.
The problem is that this requires 2-token lookahead to know when it has reached the end of a statement. If you have input of the form:
ID = ID ( ID ) = ID
after the parser shifts the second ID (the lookahead is '('), it doesn't know whether that's the end of the first statement (with the '(' beginning a second statement) or whether this is a function call. So it shifts (continuing to parse a function), which is the wrong thing to do with the example input above.
If you extend function to allow an argument inside the parentheses and expression to allow actual expressions, things become worse, as the lookahead required is unbounded -- the parser needs to get all the way to the second = to determine that this is not a function call.
The basic problem here is that there's no helper punctuation to aid the parser in finding the end of a statement. Since text that is the beginning of a valid statement can also appear in the middle of a valid statement, finding statement boundaries is hard.
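Just to make that concrete, here is a sketch (not something proposed in the question) of how an explicit statement terminator would remove the need for extra lookahead:
statement
    : lvalue '=' expression ';'
    | function ';'
    ;
With the ';' required after every statement, the parser sees either ';' (end of statement) or '(' (a function call) after the second ID in the earlier example, so a single token of lookahead is again enough and the conflict disappears.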
I'm trying to parse a grammar in ocamlyacc (pretty much the same as regular yacc) which supports function application with no operators (like in OCaml or Haskell), and the normal assortment of binary and unary operators. I'm getting a reduce/reduce conflict with the '-' operator, which can be used both for subtraction and negation. Here is a sample of the grammar I'm using:
%token <int> INT
%token <string> ID
%token MINUS
%start expr
%type <expr> expr
%nonassoc INT ID
%left MINUS
%left APPLY
%%
expr: INT
{ ExprInt $1 }
| ID
{ ExprId $1 }
| expr MINUS expr
{ ExprSub($1, $3) }
| MINUS expr
{ ExprNeg $2 }
| expr expr %prec APPLY
{ ExprApply($1, $2) };
The problem is that when you get an expression like "a - b" the parser doesn't know whether this should be reduced as "a (-b)" (negation of b, followed by application) or "a - b" (subtraction). The subtraction reduction is correct. How do I resolve the conflict in favor of that rule?
Unfortunately, the only answer I can come up with means increasing the complexity of the grammar.
split expr into simple_expr and expr_with_prefix
allow only simple_expr or (expr_with_prefix) in an APPLY
The first step turns your reduce/reduce conflict into a shift/reduce conflict, but the parentheses resolve that.
You're going to have the same problem with 'a b c': is it a(b(c)) or (a(b))(c)? You'll also need to break off applied_expression and require (applied_expression) in the grammar.
I think this will do it, but I'm not sure:
expr := INT
| parenthesized_expr
| expr MINUS expr
parenthesized_expr := ( expr )
| ( applied_expr )
| ( expr_with_prefix )
applied_expr := expr expr
expr_with_prefix := MINUS expr
Well, the simplest answer is to just ignore it and let the default reduce/reduce resolution handle it -- reduce by the rule that appears first in the grammar. In this case, that means reducing expr MINUS expr in preference to MINUS expr, which is exactly what you want. After seeing a-b, you want to parse it as a binary minus, rather than a unary minus and then an apply.