I'm trying to write a SableCC specification file for a version of MiniPython (with postfix/prefix increment and decrement operators). Some productions naturally need to use identifiers, but I get these conflicts:
shift/reduce conflict in state [stack: TPrint TIdentifier *] on TPlusPlus in {
[ PMultiplication = TIdentifier * ] followed by TPlusPlus (reduce),
[ PPostfix = TIdentifier * TPlusPlus ] (shift)
}
shift/reduce conflict in state [stack: TPrint TIdentifier *] on TMinusMinus in {
[ PMultiplication = TIdentifier * ] followed by TMinusMinus (reduce),
[ PPostfix = TIdentifier * TMinusMinus ] (shift)
}
shift/reduce conflict in state [stack: TPrint TIdentifier *] on TLPar in {
[ PFunctionCall = TIdentifier * TLPar PArglist TRPar ] (shift),
[ PFunctionCall = TIdentifier * TLPar TRPar ] (shift),
[ PMultiplication = TIdentifier * ] followed by TLPar (reduce)
}
shift/reduce conflict in state [stack: TPrint TIdentifier *] on TLBr in {
[ PExpression = TIdentifier * TLBr PExpression TRBr ] (shift),
[ PMultiplication = TIdentifier * ] followed by TLBr (reduce),
[ PPostfix = TIdentifier * TLBr PExpression TRBr TMinusMinus ] (shift),
[ PPostfix = TIdentifier * TLBr PExpression TRBr TPlusPlus ] (shift)
}
java.lang.RuntimeException:
I started from a given BNF of the language and arrived at this.
Here is the grammar file:
Productions
goal = {prgrm}program* ;
program = {func}function | {stmt}statement;
function = {func}def identifier l_par argument? r_par semi statement ;
argument = {arg} identifier assign_value? subsequent_arguments* ;
assign_value = {assign} eq value ;
subsequent_arguments = {more_args} comma identifier assign_value? ;
statement = {case1}tab* if comparison semi statement
| {case2}tab* while comparison semi statement
| {case3}tab* for [iterator]:identifier in [collection]:identifier semi statement
| {case4}tab* return expression
| {case5}tab* print expression more_expressions
| {simple_equals}tab* identifier eq expression
| {add_equals}tab* identifier add_eq expression
| {minus_equals}tab* identifier sub_eq expression
| {div_equals}tab* identifier div_eq expression
| {case7}tab* identifier l_br [exp1]:expression r_br eq [exp2]:expression
| {case8}tab* function_call;
comparison = {less_than} comparison less relation
| {greater_than} comparison great relation
| {rel} relation;
relation = {relational_value} relational_value
| {logic_not_equals} relation logic_neq relational_value
| {logic_equals} relation logic_equals relational_value;
relational_value = {expression_value} expression_value
| {true} true
| {false} false;
expression = {case1} arithmetic_expression
| {case2} prefix
| {case4} identifier l_br expression r_br
| {case9} l_br more_values r_br;
more_expressions = {more_exp} expression subsequent_expressions*;
subsequent_expressions = {more_exp} comma expression;
arithmetic_expression = {plus} arithmetic_expression plus multiplication
| {minus} arithmetic_expression minus multiplication
| {multiplication} multiplication ;
multiplication = {expression_value} expression_value
| {div} multiplication div expression_value
| {mult} multiplication mult expression_value;
expression_value = {exp} l_par expression r_par
| {function_call} function_call
| {value} value
| {identifier} identifier ;
prefix = {pre_increment} plus_plus prepost_operand
| {pre_decrement} minus_minus prepost_operand
| {postfix} postfix;
postfix = {post_increment} prepost_operand plus_plus
| {post_decrement} prepost_operand minus_minus;
prepost_operand = {value} identifier l_br expression r_br
| {identifier} identifier;
function_call = {args} identifier l_par arglist? r_par;
arglist = {arglist} more_expressions ;
value = {number} number
| {string} string ;
more_values = {more_values} value subsequent_values* ;
subsequent_values = comma value ;
number = {int} numeral
| {float} float_numeral ;
where identifier is of course a token, and the problematic productions in which it appears are function_call, prepost_operand and expression_value.
As an experiment, I removed prefix/postfix and prepost_operand to see whether the conflicts would at least change, but that just leaves the last two conflicts.
Is there any way I can resolve these conflicts without changing the grammar much, or have I gone down a completely wrong path?
The problem is the production whose right-hand side is:
print expression more_expressions
more_expressions matches a list of expressions (so it should probably be called expression_list to be less confusing). Two consecutive expressions with no separator between them are ambiguous: if you could have two expressions in a row, would 1+1+1 be 1+1 followed by +1, or 1 followed by +1+1? What you want is just
print more_expressions
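To make the ambiguity concrete, here is a minimal OCaml sketch. It assumes a toy subset of the grammar (only identifiers and ++), and the names tok, is_expr, split_at and count_splits are invented for this illustration. It counts how many ways a token sequence can be read as two expressions placed side by side, as the original print rule requires.

```ocaml
(* Toy tokens for a hypothetical subset of the grammar: identifiers and "++". *)
type tok = Id of string | PlusPlus

(* A toy "expression" is an identifier, a postfix increment, or a prefix
   increment. This is an assumption made for the demo, not the full grammar. *)
let is_expr = function
  | [Id _] | [Id _; PlusPlus] | [PlusPlus; Id _] -> true
  | _ -> false

(* Split a list into its first n elements and the remainder. *)
let rec split_at n l =
  match n, l with
  | 0, _ | _, [] -> ([], l)
  | n, x :: xs ->
    let (a, b) = split_at (n - 1) xs in
    (x :: a, b)

(* Count the ways [toks] can be read as two consecutive expressions. *)
let count_splits toks =
  let n = List.length toks in
  let rec go i acc =
    if i >= n then acc
    else
      let (a, b) = split_at i toks in
      go (i + 1) (if is_expr a && is_expr b then acc + 1 else acc)
  in
  go 1 0

let () =
  (* "x ++ y" can be read as (x++)(y) or as (x)(++y). *)
  Printf.printf "%d\n" (count_splits [Id "x"; PlusPlus; Id "y"])
```

A count above one means the parser cannot know, after reading the identifier, whether to reduce it or to keep shifting. That is the kind of shift/reduce conflict reported on TPlusPlus.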
I am translating the rules of my grammar into an AST.
Is it necessary to use the "and" operator in defining our AST?
For instance, I have translated my grammar thus far like so:
type program =
| Decls of typ * identifier * decls_prime
type typ =
| INT
| BOOL
| VOID
type identifier = string
(* decls_prime = vdecl decls | fdecl decls *)
type declsprime =
| Vdecl of variabledeclaration * decls
| Fdecl of functiondeclaration * decls
(*“lparen” formals_opt “rparen” “LBRACE” vdecl_list stmt_list “RBRACE”*)
type functiondeclaration =
| Fdecl of variabledeclarationlist * stmtlist
(*formals_opt = formal_list | epsilon *)
type FormalsOpt =
|FormalsOpt of formallist
(* typ “ID” formal_list_prime *)
type formalList =
| FormalList of typ * identifier * formallistprime
type formallistprime =
| FormalListPrime of formalList
type variabledeclarationlist =
| VdeclList of variabledeclaration * variabledeclarationlist
(*stmt stmt_list | epsilon*)
type stmtlist =
| StmtList of stmt * stmtlist
| StmtlistNil
(* stmt = “RETURN” stmt_prime| expr SEMI |“LBRACE” stmt_list RBRACE| IF LPAREN expr RPAREN stmt stmt_prime_prime| FOR LPAREN expr_opt SEMI expr SEMI expr_opt RPAREN stmt| WHILE LPAREN expr RPAREN stmt*)
type Stmt
| Return of stmtprime
| Expression of expr
| StmtList of stmtlist
| IF of expr * stmt * stmtprimeprime
| FOR of expropt * expr * expropt * stmt
| WHILE of expr * stmt
(*stmt_prime = SEMI| expr SEMI*)
type stmtprime
| SEMI
| Expression of expr
(*NOELSE | ELSE stmt*)
type stmtprimeprime
| NOELSE
| ELSE of stmt
(* Expr_opt = expr | epsilon *)
type expropt =
| Expression of expr
| ExprNil
type Expr
type ExprPrime
(* Actuals_opt = actuals_list | epsilon *)
type ActualsOpt=
| ActualsList of actualslist
| ActualsNil
type ActualsList =
| ActualsList of expr * actualslistprime
(*actualslistprime = COMMA expr actuals_list_prime | epsilon*)
type actualslistprime =
| ActualsListPrime of expr * actualslistprime
| ALPNil
But it looks as though this example from Illinois uses a slightly different structure:
type program = Program of (class_decl list)
and class_decl = Class of id * id * (var_decl list) * (method_decl list)
and method_decl = Method....
Is it necessary to use "and" when defining my AST? And moreover, is it wrong for me to use a StmtList type rather than (stmt list) even though I call the AST StmtList method correctly in my parser?
You only need and when your definitions are mutually recursive. That is, if a statement could contain an expression and an expression could in turn contain a statement, then Expr and Stmt would have to be connected with an and. If your code compiles without and, you don't need the and.
PS: This is unrelated to your question, but I think it would make a lot more sense to use the built-in list and option types than to define your own versions for specific types (such as stmtlist, expropt etc.). stmtprime is another such case: you could just define Return as Return of expr option and get rid of the stmtprime type. The same goes for stmtprimeprime.
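A minimal sketch of both points, under the assumption of a heavily trimmed AST. The constructor Block_expr is hypothetical and exists only to make expr refer back to stmt, and the counting functions are invented for the demo. Without Block_expr, the two types could be written as separate type definitions with no and.

```ocaml
(* [expr] mentions [stmt] and [stmt] mentions [expr], so the two definitions
   must be joined with [and]. *)
type expr =
  | Var of string
  | Int of int
  | Block_expr of stmt list          (* hypothetical: an expression holding statements *)
and stmt =
  | Return of expr option            (* replaces stmtprime *)
  | Expression of expr
  | Block of stmt list               (* replaces stmtlist *)
  | If of expr * stmt * stmt option  (* replaces stmtprimeprime *)

(* Functions over mutually recursive types are themselves mutually
   recursive, so they are joined with [and] as well. *)
let rec count_returns_stmt = function
  | Return r ->
    1 + (match r with Some e -> count_returns_expr e | None -> 0)
  | Expression e -> count_returns_expr e
  | Block ss -> List.fold_left (fun n s -> n + count_returns_stmt s) 0 ss
  | If (c, t, e) ->
    count_returns_expr c + count_returns_stmt t
    + (match e with Some s -> count_returns_stmt s | None -> 0)
and count_returns_expr = function
  | Var _ | Int _ -> 0
  | Block_expr ss -> List.fold_left (fun n s -> n + count_returns_stmt s) 0 ss

let () =
  let program = Block [Return None; If (Var "x", Return (Some (Int 1)), None)] in
  Printf.printf "%d\n" (count_returns_stmt program)
```

Using list and option also means the standard library functions (List.fold_left here) work on the AST directly, which hand-written list types would not allow.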
I'm supposed to write a .grammar file for MiniPython using Sablecc. I'm getting these shift/reduce conflicts:
shift/reduce conflict in state [stack: TIf PTopower *] on TMult in {
[ PMltp = * TMult PTopower Mltp ] (shift)
[ PMltp = * ] followed by TMult (reduce)
}
shift/reduce conflict in state [stack: TIf PTopower *] on TDiv in {
[ PMltp = * TDiv PTopower Mltp ] (shift)
[ PMltp = * ] followed by TDiv (reduce)
}
Some of the tokens are:
id = letter (letter | digit)*;
digit = ['0' .. '9'];
letter = ['a' .. 'z']|['A' .. 'Z'];
pow = '**';
mult = '*';
div = '/';
plus = '+';
minus = '-';
assert = 'assert';
l_par = '(';
r_par = ')';
l_bra = '[';
r_bra = ']';
Part of my .grammar file is this:
expression = multiplication exprsn;
exprsn = {addition} plus multiplication exprsn
| {subtraction} minus multiplication exprsn
| {empty};
topower = something tpwr;
tpwr = {topower} pow something tpwr
| {empty};
multiplication = topower mltp;
mltp = {multiplication} mult topower mltp
| {division} div topower mltp
| {empty};
something = {id} id
| {parexp} l_par expression r_par
| {fcall} functioncall
| {value} value
| {list} id l_bra expression r_bra
| {other} l_bra value comval* r_bra
| {assert} assert expression comexpr?;
comexpr = comma expression;
This is the grammar after I tried to eliminate left recursion. I noticed that if I remove the assert rule from the something production, I get no conflicts. Also, removing the {empty} rules from exprsn, tpwr and mltp rules gives me no conflicts but I don't think this is the correct way to resolve this.
Any tips would be really appreciated.
UPDATE: Here is the whole grammar, before removing left recursion, as requested:
Package minipython;
Helpers
digit = ['0' .. '9'];
letter = ['a' .. 'z']|['A' .. 'Z'];
cr = 13;
lf = 10;
all = [0..127];
eol = lf | cr | cr lf ;
not_eol = [all - [cr + lf]];
Tokens
tab = 9;
plus = '+';
dot = '.';
pow = '**';
minus = '-';
mult = '*';
div = '/';
eq = '=';
minuseq = '-=';
diveq = '/=';
exclam = '!';
def = 'def';
equal = '==';
nequal = '!=';
l_par = '(';
r_par = ')';
l_bra = '[';
r_bra = ']';
comma= ',';
qmark = '?';
gqmark = ';';
assert = 'assert';
if = 'if';
while = 'while';
for = 'for';
in = 'in';
print = 'print';
return = 'return';
importkn = 'import';
as = 'as';
from = 'from';
less = '<';
great = '>';
true = 'true';
semi = ':';
false = 'false';
quote = '"';
blank = (' ' | lf | cr);
line_comment = '#' not_eol* eol;
number = digit+ | (digit+ '.' digit+);
id = letter (letter | digit)*;
string = '"'not_eol* '"';
cstring = ''' letter ''';
Ignored Tokens
blank, line_comment;
Productions
program = commands*;
commands = {stmt} statement
| {func} function;
function = def id l_par argument? r_par semi statement;
argument = id eqval? ceidv*;
eqval = eq value;
ceidv = comma id eqval?;
statement = {if} tab* if comparison semi statement
| {while} tab* while comparison semi statement
| {for} tab* for [id1]:id in [id2]:id semi statement
| {return} tab* return expression
| {print} tab* print expression comexpr*
| {assign} tab* id eq expression
| {minassign} tab* id minuseq expression
| {divassign} tab* id diveq expression
| {list} tab* id l_bra [ex1]:expression r_bra eq [ex2]:expression
| {fcall} tab* functioncall
| {import} import;
comexpr = comma expression;
expression = {multiplication} multiplication
| {addition} expression plus multiplication
| {subtraction} expression minus multiplication;
topower = {smth} something
| {power} topower pow something;
something = {id} id
| {parexp} l_par expression r_par
| {fcall} functioncall
| {value} value
| {list} id l_bra expression r_bra
| {assert} assert expression comexpr?
| {other} l_bra value comval* r_bra;
comval = comma value;
multiplication = {power} topower
| {multiplication} multiplication mult topower
| {division} multiplication div topower;
import = {import} importkn module asid? comod*
| {from} from module importkn id asid? comid*;
asid = as id;
comod = comma module asid?;
comid = comma id asid?;
module = idot* id;
idot = id dot;
comparison = {true} true
| {false} false
| {greater} [ex1]:expression great [ex2]:expression
| {lesser} [ex1]:expression less [ex2]:expression
| {equals} [ex1]:expression equal [ex2]:expression
| {nequals} [ex1]:expression nequal [ex2]:expression;
functioncall = id l_par arglist? r_par;
arglist = expression comexpr*;
value = {fcall} id dot functioncall
| {numb} number
| {str} string
| {cstr} cstring;
The shift/reduce conflict now is:
shift/reduce conflict in state [stack: TIf PTopower *] on TPow in {
[ PMultiplication = PTopower * ] followed by TPow (reduce),
[ PTopower = PTopower * TPow PSomething ] (shift)
}
(Note: this answer has been drawn from the original grammar, not from the attempt to remove left-recursion, which has additional issues. There is no need to remove left-recursion from a grammar being provided to an LALR(1) parser generator like SableCC.)
Indeed, the basic problem is the production:
something = {assert} assert expression comexpr?
This production is curious, partly because the name of the non-terminal ("something") provides no hint whatsoever as to what it is, but mostly because one would normally expect assert expression to be a statement, not part of an expression. And something is clearly derived from expression:
expression = multiplication
multiplication = topower
topower = something
But the assert production ends with an expression. That leads to an ambiguity, since
assert 4 + 3
could be parsed as: (some steps omitted for succinctness):
expression = expression plus multiplication
| | |
V | |
something | |
| | |
V | |
assert expression | |
| | | |
| V V V
assert 4 + 3
Or, more naturally, as:
expression = something
|
V
assert expression
| |
| V
| expression plus multiplication
| | | |
| V V V
assert 4 + 3
The first parse seems unlikely because assert doesn't (as far as I would guess) actually return a value. (Although the second one would be more natural if the operator were a comparison rather than an addition.)
Without seeing the definition of the language you're trying to parse, I can't really provide a concrete suggestion for how to fix this, but my inclination would be to make assert a statement, and rename something to something more descriptive ("term" is common, although I usually use "atom").
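The two parses drawn above can be written down as values of a small OCaml type, which makes the ambiguity easy to check. This is only a sketch: the type expr and the function tokens are invented for the illustration and model just numbers, addition and assert.

```ocaml
(* A toy AST covering only what the example "assert 4 + 3" needs. *)
type expr =
  | Num of int
  | Add of expr * expr
  | Assert of expr

(* First parse: the assert covers only 4, and 3 is added to its result. *)
let parse1 = Add (Assert (Num 4), Num 3)

(* Second parse: the assert covers the whole sum. *)
let parse2 = Assert (Add (Num 4, Num 3))

(* Flatten a tree back into the token sequence it was derived from. *)
let rec tokens = function
  | Num n -> [string_of_int n]
  | Add (a, b) -> tokens a @ ["+"] @ tokens b
  | Assert e -> "assert" :: tokens e

let () =
  print_endline (String.concat " " (tokens parse1));
  print_endline (String.concat " " (tokens parse2));
  Printf.printf "same tokens: %b, same tree: %b\n"
    (tokens parse1 = tokens parse2) (parse1 = parse2)
```

Two different trees with the same token sequence is the definition of an ambiguous grammar, so no amount of lookahead can remove the conflict while assert remains an expression that ends in an expression.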
Well, I'm writing my first parser, in OCaml, and I immediately somehow managed to make one with an infinite loop.
Of particular note, I'm trying to lex identifiers according to the rules of the Scheme specification (I have no idea what I'm doing, obviously) — and there's some language in there about identifiers requiring that they are followed by a delimiter. My approach, right now, is to have a delimited_identifier regex that includes one of the delimiter characters, that should not be consumed by the main lexer … and then once that's been matched, the reading of that lexeme is reverted by Sedlexing.rollback (well, my wrapper thereof), before being passed to a sublexer that only eats the actual identifier, hopefully leaving the delimiter in the buffer to be eaten as a different lexeme by the parent lexer.
I'm using Menhir and Sedlex, mostly synthesizing the examples from @smolkaj's ocaml-parsing example-repo and RWO's parsing chapter; here's the simplest reduction of my current parser and lexer:
%token LPAR RPAR LVEC APOS TICK COMMA COMMA_AT DQUO SEMI EOF
%token <string> IDENTIFIER
(* %token <bool> BOOL *)
(* %token <int> NUM10 *)
(* %token <string> STREL *)
%start <Parser.AST.t> program
%%
program:
| p = list(expression); EOF { p }
;
expression:
| i = IDENTIFIER { Parser.AST.Atom i }
%%
… and …
(** Regular expressions *)
let newline = [%sedlex.regexp? '\r' | '\n' | "\r\n" ]
let whitespace = [%sedlex.regexp? ' ' | newline ]
let delimiter = [%sedlex.regexp? eof | whitespace | '(' | ')' | '"' | ';' ]
let digit = [%sedlex.regexp? '0'..'9']
let letter = [%sedlex.regexp? 'A'..'Z' | 'a'..'z']
let special_initial = [%sedlex.regexp?
'!' | '$' | '%' | '&' | '*' | '/' | ':' | '<' | '=' | '>' | '?' | '^' | '_' | '~' ]
let initial = [%sedlex.regexp? letter | special_initial ]
let special_subsequent = [%sedlex.regexp? '+' | '-' | '.' | '#' ]
let subsequent = [%sedlex.regexp? initial | digit | special_subsequent ]
let peculiar_identifier = [%sedlex.regexp? '+' | '-' | "..." ]
let identifier = [%sedlex.regexp? initial, Star subsequent | peculiar_identifier ]
let delimited_identifier = [%sedlex.regexp? identifier, delimiter ]
(** Swallow whitespace and comments. *)
let rec swallow_atmosphere buf =
match%sedlex buf with
| Plus whitespace -> swallow_atmosphere buf
| ";" -> swallow_comment buf
| _ -> ()
and swallow_comment buf =
match%sedlex buf with
| newline -> swallow_atmosphere buf
| any -> swallow_comment buf
| _ -> assert false
(** Return the next token. *)
let rec token buf =
swallow_atmosphere buf;
match%sedlex buf with
| eof -> EOF
| delimited_identifier ->
Sedlexing.rollback buf;
identifier buf
| '(' -> LPAR
| ')' -> RPAR
| _ -> illegal buf (Char.chr (next buf))
and identifier buf =
match%sedlex buf with
| _ -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)
(Yes, it's basically a no-op / the simplest thing possible rn. I'm trying to learn! :x)
Unfortunately, this combination results in an infinite loop in the parsing automaton:
State 0:
Lookahead token is now IDENTIFIER (1-1)
Shifting (IDENTIFIER) to state 1
State 1:
Lookahead token is now IDENTIFIER (1-1)
Reducing production expression -> IDENTIFIER
State 5:
Shifting (IDENTIFIER) to state 1
State 1:
Lookahead token is now IDENTIFIER (1-1)
Reducing production expression -> IDENTIFIER
State 5:
Shifting (IDENTIFIER) to state 1
State 1:
...
I'm new to parsing and lexing and all this; any advice would be welcome. I'm sure it's just a newbie mistake, but …
Thanks!
As said before, implementing too much logic inside the lexer is a bad idea.
However, the infinite loop does not come from the rollback but from your definition of identifier:
identifier buf =
match%sedlex buf with
| _ -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)
Within this definition, _ matches the shortest possible word in the language consisting of all possible characters. In other words, _ always matches the empty word ε without consuming any of its input, sending the parser into an infinite loop.
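The effect can be reproduced without Sedlex. The sketch below is a hand-rolled stand-in, not the Sedlex API: the record buf, the functions broken_identifier, identifier, skip_spaces and tokens_with, and the limit parameter are all assumptions made for the demo. It assumes input made only of letters, digits and spaces, and it caps the number of tokens so that the looping case terminates.

```ocaml
(* A hand-rolled lexer buffer: the source text plus a cursor. *)
type buf = { src : string; mutable pos : int }

type token = IDENTIFIER of string | EOF

let is_ident_char = function
  | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' -> true
  | _ -> false

(* Mimics a rule whose only case is the catch-all: it matches the empty
   word, so the cursor never moves. *)
let broken_identifier (_ : buf) = IDENTIFIER ""

(* Consumes the identifier characters, so the cursor advances. *)
let identifier b =
  let start = b.pos in
  while b.pos < String.length b.src && is_ident_char b.src.[b.pos] do
    b.pos <- b.pos + 1
  done;
  IDENTIFIER (String.sub b.src start (b.pos - start))

let skip_spaces b =
  while b.pos < String.length b.src && b.src.[b.pos] = ' ' do
    b.pos <- b.pos + 1
  done

(* Pull tokens with [rule] until end of input, or stop after [limit]
   tokens so that a rule that never advances cannot loop forever. *)
let tokens_with rule ~limit src =
  let b = { src; pos = 0 } in
  let rec go n acc =
    skip_spaces b;
    if n = limit then List.rev acc
    else if b.pos >= String.length b.src then List.rev (EOF :: acc)
    else go (n + 1) (rule b :: acc)
  in
  go 0 []

let () =
  let show = function
    | IDENTIFIER s -> Printf.sprintf "IDENTIFIER(%S)" s
    | EOF -> "EOF"
  in
  let run rule =
    print_endline
      (String.concat " " (List.map show (tokens_with rule ~limit:5 "foo bar")))
  in
  run identifier;
  run broken_identifier
```

With identifier the token stream ends in EOF after two identifiers. With broken_identifier the cap is hit with five empty identifiers and EOF is never reached, which mirrors the repeated IDENTIFIER (1-1) lookahead in the trace above.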
I'm trying to extend the grammar of the Tiny Language to treat assignment as expression. Thus it would be valid to write
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
a = 1 = 2; // invalid
Assignment differs from other operators in two respects: it is right-associative (not a big deal), and its left-hand side has to be a variable. So I changed the grammar like this:
statement: assignmentExpr | functionCall ...;
assignmentExpr: Identifier indexes? '=' expression;
expression: assignmentExpr | condExpr;
It doesn't work, because it contains a non-LL(*) decision. I also tried this variant:
assignmentExpr: Identifier indexes? '=' (expression | condExpr);
but I got the same error. I am interested in
This specific question
Given a grammar with a non-LL(*) decision, how to find the two paths that cause the problem
How to fix it
I think you can change your grammar like this to achieve the same, without using syntactic predicates:
statement: Expr ';' | functionCall ';'...;
Expr: Identifier indexes? '=' Expr | condExpr ;
condExpr: .... and so on;
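The same rule shape can be tried out as a hand-written recursive-descent parser. This is a sketch in OCaml rather than ANTLR, with assumptions stated up front: indexes are left out, condExpr is reduced to addition over atoms, and the names tok, ast, expr, add, atom and parse are invented for the example.

```ocaml
type tok = Id of string | Num of int | Eq | Plus | LPar | RPar

type ast =
  | Var of string
  | Lit of int
  | Assign of string * ast
  | Add of ast * ast

(* expr := Id '=' expr | add
   Two tokens of lookahead (Id followed by '=') pick the assignment branch;
   recursing on the right-hand side makes '=' right-associative. *)
let rec expr = function
  | Id x :: Eq :: rest ->
    let (rhs, rest) = expr rest in
    (Assign (x, rhs), rest)
  | toks -> add toks

(* add := atom ('+' atom)*   (left-associative) *)
and add toks =
  let (lhs, rest) = atom toks in
  let rec loop lhs = function
    | Plus :: rest ->
      let (rhs, rest) = atom rest in
      loop (Add (lhs, rhs)) rest
    | rest -> (lhs, rest)
  in
  loop lhs rest

(* atom := Id | Num | '(' expr ')' *)
and atom = function
  | Id x :: rest -> (Var x, rest)
  | Num n :: rest -> (Lit n, rest)
  | LPar :: rest ->
    (match expr rest with
     | (e, RPar :: rest) -> (e, rest)
     | _ -> failwith "expected ')'")
  | _ -> failwith "expected an atom"

(* Accept only if every token was consumed. *)
let parse toks =
  match expr toks with
  | (e, []) -> Some e
  | _ -> None
  | exception Failure _ -> None
```

Under these assumptions, a = b = 1 parses as Assign ("a", Assign ("b", Lit 1)), while a = 1 = 2 is rejected, because after a = 1 the remaining = 2 cannot start with an identifier on the left of the =.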
I altered Bart's example with this idea in mind:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: stat+ EOF -> ^(ROOT stat+)
;
stat
: expr ';'
;
expr
: Id Assign expr -> ^(Assign Id expr)
| add
;
add
: mult (('+' | '-')^ mult)*
;
mult
: atom (('*' | '/')^ atom)*
;
atom
: Id
| Num
| '('! expr ')' !
;
Assign : '=' ;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
And for the input:
a=b=4;
a = 2 * (b = 1);
you get the following parse tree:
The key here is that you need to "assure" the parser that inside an expression, there is something ahead that satisfies the expression. This can be done using a syntactic predicate (the ( ... )=> parts in the add and mult rules).
A quick demo:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
ASSIGN;
}
parse
: stat* EOF -> ^(ROOT stat*)
;
stat
: expr ';' -> expr
;
expr
: add
;
add
: mult ((('+' | '-') mult)=> ('+' | '-')^ mult)*
;
mult
: atom ((('*' | '/') atom)=> ('*' | '/')^ atom)*
;
atom
: (Id -> Id) ('=' expr -> ^(ASSIGN Id expr))?
| Num
| '(' expr ')' -> expr
;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse the input:
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
into the following AST:
The following grammar works, but also gives a warning:
test.g
grammar test;
options {
language = Java;
output = AST;
ASTLabelType = CommonTree;
}
program
: expr ';'!
;
term: ID | INT
;
assign
: term ('='^ expr)?
;
add : assign (('+' | '-')^ assign)*
;
expr: add
;
// T O K E N S
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
WS :
( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
DOT : '.' ;
fragment
LETTER : ('a'..'z'|'A'..'Z') ;
fragment
DIGIT : '0'..'9' ;
Warning
[15:08:20] warning(200): C:\Users\Charles\Desktop\test.g:21:34:
Decision can match input such as "'+'..'-'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Again, it does produce a tree the way I want:
Input: 0 + a = 1 + b = 2 + 3;
ANTLR produces | ... but I think it
this tree: | gives the warning
| because it _could_
+ | also be parsed this
/ \ | way:
0 = |
/ \ | +
a + | / \
/ \ | + 3
1 = | / \
/ \ | + =
b + | / \ / \
/ \ | 0 = b 2
2 3 | / \
| a 1
How can I explicitly tell ANTLR that I want it to create the AST on the left, thus making my intent clear and silencing the warning?
Charles wrote:
How can I explicitly tell ANTLR that I want it to create the AST on the left, thus making my intent clear and silencing the warning?
You shouldn't create two separate rules for assign and add. As your rules are now, assign has precedence over add, which you don't want: they should have equal precedence by looking at your desired AST. So, you need to wrap all operators +, - and = in one rule:
program
: expr ';'!
;
expr
: term (('+' | '-' | '=')^ expr)*
;
But now the grammar is still ambiguous. You'll need to "help" the parser to look beyond this ambiguity to assure there really is operator expr ahead when parsing (('+' | '-' | '=') expr)*. This can be done using a syntactic predicate, which looks like this:
(look_ahead_rule(s)_in_here)=> rule(s)_to_actually_parse
(the ( ... )=> is the predicate syntax)
A little demo:
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
program
: expr ';'!
;
expr
: term ((op expr)=> op^ expr)*
;
op
: '+'
| '-'
| '='
;
term
: ID
| INT
;
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
WS : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z'|'A'..'Z');
fragment DIGIT : '0'..'9';
which can be tested with the class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "0 + a = 1 + b = 2 + 3;";
testLexer lexer = new testLexer(new ANTLRStringStream(source));
testParser parser = new testParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.program().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
And the output of the Main class corresponds to the following AST:
which is created without any warnings from ANTLR.
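The tree shape that results from giving +, - and = equal precedence in one right-recursive rule can be checked with a small sketch. It is written in OCaml rather than ANTLR, and the names tok, ast, term, expr and show are invented for the example; it assumes the input is a well-formed alternation of terms and operators.

```ocaml
type tok = Id of string | Int of int | Op of char

type ast = Leaf of string | Node of char * ast * ast

(* term := ID | INT *)
let term = function
  | Id x -> Leaf x
  | Int n -> Leaf (string_of_int n)
  | Op _ -> failwith "expected a term"

(* expr := term (op expr)?
   All operators share one level, and the right operand greedily takes
   everything that follows, so the tree nests to the right. *)
let rec expr = function
  | t :: Op o :: rest ->
    let (rhs, rest) = expr rest in
    (Node (o, term t, rhs), rest)
  | t :: rest -> (term t, rest)
  | [] -> failwith "expected a term"

(* Print a tree with explicit parentheses. *)
let rec show = function
  | Leaf s -> s
  | Node (o, l, r) -> Printf.sprintf "(%s %c %s)" (show l) o (show r)

let () =
  (* 0 + a = 1 + b = 2 + 3 *)
  let toks =
    [Int 0; Op '+'; Id "a"; Op '='; Int 1; Op '+';
     Id "b"; Op '='; Int 2; Op '+'; Int 3]
  in
  print_endline (show (fst (expr toks)))
```

For the input from the question this prints (0 + (a = (1 + (b = (2 + 3))))), which is the left-hand tree in the question's diagram.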