Why are these Bison rules useless?

I am trying to create a simple compiler using flex and bison. However, Bison reports every rule I have written as useless: "14 useless nonterminals and 66 useless rules". What makes a rule useless, and is there a way to fix it?
File: Decl_Class'*' 'EOF'
Decl_Class: CLASS IDENT '(' EXTENDS IDENT ')' '?' {Field'*'}
Field: Variable
|Constructor
|Method
Variable: Modifier'*' Expr_type Decl_variables
Decl_variables: Decl_variable
|Decl_variable ',' Decl_variables
Decl_variable: IDENT
|IDENT '=' Expr
Constructor: Modifier'*' IDENT '(' Expr_type | VOID ')' IDENT '(' Params '?' ')' {Instructions'*'}
Method: Modifier'*' '(' Expr_type | VOID ')' IDENT '(' Params '?' ')' {Instructions'*'}
Modifier: IDENT
Expr: IDENT
Params: '(' Expr_type ')' IDENT | '(' Expr_type ')' IDENT ',' Params
Expr_type: BOOLEAN | INT | DOUBLE | IDENT | INTEGER | REAL | TRU | FALS
| THIS | NULLVAL
| '(' Expr ')'
| Access|Access '=' Expr|Access '(' L_Expr '?' ')'
| NEW IDENT '(' L_Expr '?' ')'
| '+''+'Expr | '-''-'Expr | Expr'+''+' | Expr'-''-'
| '!'Expr | '-'Expr | '+'Expr
| Expr Operator Expr
| '(' Expr_type ')' Expr_type
Operator: "==" | "!=" | "<" | "<=" | ">" | ">=" | "+" | "-" | "*" | "/" | "%" | "&&" | "||"
Access: IDENT | Expr '.' IDENT
L_Expr: Expr | Expr ',' L_Expr
Instruction: ';'
| Expr';'
| Expr_type Decl_variables';'
| IF '(' Expr ')' Instruction
| IF '(' Expr ')' Instruction ELSE Instruction
| WHILE '(' Expr ')' Instruction
| FOR '(' L_Expr '?' ';' Expr '?' ';' L_Expr '?'')' Instruction
| FOR '(' Expr ')' Decl_variables ';' Expr '?' ';' L_Expr '?'')' Instruction { Instructions'*' }
| RETURN Expr '?'';'

Decl_Class: CLASS IDENT '(' EXTENDS IDENT ')' '?' {Field'*'}
Here the part inside the {} is a code action (which will produce a syntax error when the generated parser is compiled), not a reference to the Field non-terminal. So Field is never actually used, and neither are the non-terminals it references. That is what makes them useless: they can never be reached from the start symbol.
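A minimal sketch of the difference, assuming the braces were meant to be literal '{' and '}' tokens in the input. The body of field is a placeholder invented for illustration, not part of the original grammar:

```yacc
%token CLASS IDENT
%%
/* With unquoted braces, Bison reads { field } as a semantic action
   (code copied into the generated parser), so the rule would not
   reference the non-terminal field at all:

       decl_class: CLASS IDENT { field } ;

   Quoting the braces turns them into tokens and makes field reachable. */
decl_class: CLASS IDENT '{' field '}' ;

field: IDENT ';' ;   /* placeholder body, for illustration only */
```

With the quoted form, Bison should no longer report field as useless, because the start symbol now leads to it.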
PS: At various places in your grammar you're using '*' and '?' in a way that suggests the intention may be to match zero or more or zero or one items respectively. Be aware that all '*' and '?' do is to match a token with the given value. There is no syntactic shortcut to repeat something or make it optional in bison - you'll need to define separate non-terminals for that.
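As a rough sketch of what those separate non-terminals can look like, the fragment below spells out Decl_Class'*' and the optional EXTENDS clause by hand. The names class_list and opt_extends are hypothetical helpers, and the class body is reduced to empty braces to keep the example short:

```yacc
%token CLASS EXTENDS IDENT
%%
file: class_list ;

/* "zero or more classes": an empty alternative plus left recursion */
class_list: /* empty */
          | class_list decl_class
          ;

decl_class: CLASS IDENT opt_extends '{' '}' ;

/* "zero or one EXTENDS clause": an empty alternative plus the clause */
opt_extends: /* empty */
           | EXTENDS IDENT
           ;
```

Left recursion is the usual choice for lists in Bison because it keeps the parser stack shallow, although right recursion would also work.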
PPS: In most (all?) languages that have ++ and -- operators, each is a single token, not two consecutive '+' or '-' tokens (so - -x is double negation, and only --x, with no space between the -s, is a decrement). Your rules for the increment and decrement operators are unusual in that regard.

Related

Why doesn't Bison accept this grammar file?

When I use the command bison -d -o parser.java parser.y to generate a parser from my grammar file parser.y, Bison produces the following error:
:8.8-10: syntax error, unexpected string, expecting char or identifier or type
Here is the file parser.y:
%{
import java.util.*;
import java.io.*;
%}
%start PROGRAM
%token number identifier function break call if else let read return while write
%token "(" ")" "{" "}" ";" "=" "+" "-" "*" "/" "%" "<" ">" " <= " " >= " "==" "!=" "&" "|" "~" "!"
%left "+" "-"
%left "*" "/" "%"
%left "&" "|"
%nonassoc "!"
%type <Node> PROGRAM FUNCTION PARAMLIST BLOCK STATEMENT IF ELSE EXPR
%type <String> identifier
%type <Integer> number
%union {
Node node;
String identifier;
int number;
}
%%
PROGRAM:
| PROGRAM FUNCTION
| BLOCK
;
FUNCTION:
function identifier '(' PARAMLIST ')' BLOCK
;
PARAMLIST:
identifier
| identifier ',' PARAMLIST
|
;
BLOCK:
'{' STATEMENT '}'
;
STATEMENT:
BREAK
| CALL ';'
| IF
| LET
| READ
| RETURN
| WHILE
| WRITE
;
BREAK:
break ';'
;
CALL:
call identifier '(' ARGLIST ')'
;
ARGLIST:
EXPR
| EXPR ',' ARGLIST
|
;
IF:
if EXPR BLOCK ELSE
;
ELSE:
else BLOCK
|
;
LET:
let identifier '=' EXPR ';'
| let identifier '=' CALL ';'
;
READ:
read identifier ';'
;
RETURN:
return EXPR ';'
;
WHILE:
while EXPR BLOCK
;
WRITE:
write EXPR ';'
;
EXPR:
number
| identifier
| '(' EXPR ')'
| '!' EXPR
| '~' EXPR
| EXPR '+' EXPR
| EXPR '-' EXPR
| EXPR '*' EXPR
| EXPR '/' EXPR
| EXPR '%' EXPR
| EXPR '&' EXPR
| EXPR '|' EXPR
| EXPR '<' EXPR
| EXPR '>' EXPR
| EXPR "<=" EXPR
| EXPR ">=" EXPR
| EXPR "==" EXPR
| EXPR "!=" EXPR
;
%%
int yyerror(String s) {
System.err.println("error: " + s);
}
Bison doesn't allow you to declare quoted token names (such as "(") with the %token declaration. It knows they are tokens; they cannot be anything else.
You use the %token declaration to declare symbolic names for tokens, which you will find useful when writing your lexer. In the declaration, the symbolic name comes first, optionally followed by the double-quoted alias. You can repeat that as often as you like. For example, you could write:
%token TK_LE "<=" TK_GE ">="
You can then use either the symbolic name or the alias in your grammar, but using the alias makes your grammar more readable. Also, Bison uses the alias when constructing error messages, which is a good thing since "expecting TK_SEMIC" is not a great way to communicate with a user that a ";" was required.
Keep in mind that a single-quoted single character token, such as '(', is not the same token as the double-quoted alias. In your grammar, you use '(' but attempt to declare "(". Had you succeeded in declaring "(", you would have gotten an "unused token" warning. Since '(' doesn't require a symbolic name, you can just remove the declaration. You will only need them for multicharacter tokens like "<=". (Note that spaces are significant inside quotes. " <= " is not the same as "<=".)
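A small sketch of that declaration style. TK_LE and TK_GE come from the example above, while NUMBER and the comparison rule are hypothetical names chosen only to make the fragment complete:

```yacc
%token NUMBER
/* Symbolic name first, then its optional double-quoted alias. */
%token TK_LE "<=" TK_GE ">="
%%
comparison: NUMBER "<=" NUMBER   /* "<=" is the alias of TK_LE */
          | NUMBER ">=" NUMBER   /* ">=" is the alias of TK_GE */
          | NUMBER '<' NUMBER    /* single-character tokens need no declaration */
          | NUMBER '>' NUMBER
          ;
```

The scanner would return the symbolic names (TK_LE, TK_GE) for the two-character operators and the character itself for '<' and '>'.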
Symbolic token names are used as Java values, so their names cannot conflict with variables or Java keywords. You cannot, for example, use break as a symbolic token name. Trying to do so will cause compilation errors.
For this reason, it's customary to write token names in ALL_CAPS and non-terminals in lower case. Non-terminal names are not used in the generated code, so you can use whatever names you wish.
You reverse this convention, which will cause a variety of errors when you compile the generated parser, and which is hard to read for those of us accustomed to the standard style.
A couple of other notes:
The bison Java interface does not use a %union declaration. The %type declarations are sufficient.
You are missing precedence declarations for many operators, particularly comparison operators. That will lead to a large number of parser conflicts. Make sure you write the precedence levels in the correct order.
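One possible set of precedence declarations, written with the conventional naming (upper-case tokens, lower-case non-terminals). The ordering is an assumption about the intended language, with bitwise operators binding loosest and unary operators tightest, and the TK_ names are hypothetical:

```yacc
%token NUMBER IDENTIFIER
%token TK_LE "<=" TK_GE ">=" TK_EQ "==" TK_NE "!="

/* Lowest precedence first, highest precedence last. */
%left '|'
%left '&'
%nonassoc '<' '>' "<=" ">=" "==" "!="
%left '+' '-'
%left '*' '/' '%'
%right '!' '~'
%%
expr: NUMBER
    | IDENTIFIER
    | '(' expr ')'
    | '!' expr
    | '~' expr
    | expr '+' expr
    | expr '-' expr
    | expr '*' expr
    | expr '/' expr
    | expr '%' expr
    | expr '&' expr
    | expr '|' expr
    | expr '<' expr
    | expr '>' expr
    | expr "<=" expr
    | expr ">=" expr
    | expr "==" expr
    | expr "!=" expr
    ;
```

Because every operator token now has a precedence level, Bison can resolve each shift/reduce decision in expr without reporting a conflict. Declaring the comparisons with %nonassoc makes a chain such as a < b < c a syntax error, which may or may not be what you want.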

Does antlr automatically factor top-level alternates?

I have written the following two grammars, one grouping the arithmetic expressions (where possible) and another that doesn't:
grammar NoPrefix;
root: (expr ';')* EOF;
expr
: '(' expr ')'
| expr '*' expr
| expr '/' expr
| expr '+' expr
| expr '-' expr
| Atom
;
Atom: [a-z]+ | [0-9]+ | '\'' Atom '\'';
WHITESPACE: [ \t\r\n] -> skip;
grammar YesPrefix;
root: (expr ';')* EOF;
expr
: '(' expr ')'
| expr ('*'|'/') expr
| expr ('+'|'-') expr
| Atom
;
Atom:[a-z]+ | [0-9]+ | '\'' Atom '\'';
WHITESPACE: [ \t\r\n] -> skip;
It seems that these two have almost identical runtimes, build sizes, etc. Does antlr automatically convert the two forms of alternatives to the same output, for example:
expr: expr '*' expr | expr '/' expr <==> expr: expr ('*'|'/') expr;
No. How would ANTLR know that you wanted * and / to have the same binding precedence, different from + and -? You need to be explicit about that.

Yacc grammar expressions, conflicts

Can someone identify where the grammar conflict is in this expression production?
expr '+' expr
|
expr '-' expr
|
expr '*' expr
|
expr '/' expr
|
expr '(' ')'
|
T_IDENTIFIER
|
T_STRING_LITERAL
|
T_INTEGER_LITERAL
|
T_FLOAT_LITERAL
I'm trying to implement function calls taking an expr as the operand, so for example, the following would be valid grammar:
1()
1.5()
"STRING"()
fn()

ANTLR4 parser rules with other parser rules as arguments (meta-rules)

I would like to be able to write a "meta-rule" in ANTLR4 that takes a rule as an input argument and performs a set modification to that rule. Here's an example grammar:
grammar G;
WS: [ \t\n\r] + -> skip;
CHAR: [a-z];
term: (CHAR)+;
sum: term ('+' term)+;
pterm: '(' term ')' | '(' pterm ')';
psum: '(' sum ')' | '(' psum ')';
expr: term | sum | pterm | psum;
The rules for pterm and psum perform the same action on term and sum, enclosing them in possibly nested parentheses. I would like to be able to replace the last three lines above with something like the following:
enclose[rule]: '(' rule ')' | '(' enclose(rule) ')';
expr: term | sum | enclose(term) | enclose(sum);
Is there a way to construct a meta-rule like this?
The short answer is no.
It is better to refactor the grammar around the structurally significant terms:
expr: sum | LPAREN expr RPAREN ; // a bare sum, or any expr wrapped in parentheses
sum : term ('+' term)* ; // changed to Kleene star
term: CHAR+ ;
LPAREN : '(' ;
RPAREN : ')' ;
CHAR : [a-z] ;
WS : [ \t\n\r]+ -> skip ;
The sum rule will consume all terms, so the expr rule only needs to handle sums.

How to fix YACC shift/reduce conflicts from post-increment operator?

I'm writing a grammar in YACC (actually Bison), and I'm having a shift/reduce problem. It results from including the postfix increment and decrement operators. Here is a trimmed down version of the grammar:
%token NUMBER ID INC DEC
%left '+' '-'
%left '*' '/'
%right PREINC
%left POSTINC
%%
expr: NUMBER
| ID
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| INC expr %prec PREINC
| DEC expr %prec PREINC
| expr INC %prec POSTINC
| expr DEC %prec POSTINC
| '(' expr ')'
;
%%
Bison tells me there are 12 shift/reduce conflicts, but if I comment out the lines for the postfix increment and decrement, it works fine. Does anyone know how to fix this conflict? At this point, I'm considering moving to an LL(k) parser generator, which makes it much easier, but LALR grammars have always seemed much more natural to write. I'm also considering GLR, but I don't know of any good C/C++ GLR parser generators.
Bison can generate a GLR parser if you put %glr-parser in the declarations section.
Try this:
%token NUMBER ID INC DEC
%left '+' '-'
%left '*' '/'
%nonassoc INC DEC /* the scanner returns INC for "++" and DEC for "--" */
%left '('
%%
expr: NUMBER
| ID
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| INC expr
| DEC expr
| expr INC
| expr DEC
| '(' expr ')'
;
%%
The key is to declare postfix operators as non-associative. Otherwise you would be able to write
++var++--
The parentheses also need to be given a precedence to minimize shift/reduce warnings.
I prefer to define more non-terminals. Then you shouldn't need %left, %right, or %prec at all.
simple_expr: NUMBER
| INC simple_expr
| DEC simple_expr
| '(' expr ')'
;
term: simple_expr
| term '*' simple_expr
| term '/' simple_expr
;
expr: term
| expr '+' term
| expr '-' term
;
Play around with this approach.
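The grammar above leaves out ID and the postfix forms from the question. One way to extend the same layering, sketched here with hypothetical non-terminal names (unary, postfix, primary) that are not from the answer, is to give each precedence level its own non-terminal:

```yacc
%token NUMBER ID INC DEC
%%
expr: term
    | expr '+' term
    | expr '-' term
    ;

term: unary
    | term '*' unary
    | term '/' unary
    ;

/* Prefix operators sit one level below postfix, so ++a++ groups as ++(a++). */
unary: postfix
     | INC unary
     | DEC unary
     ;

postfix: primary
       | postfix INC
       | postfix DEC
       ;

primary: NUMBER
       | ID
       | '(' expr ')'
       ;
```

This version needs no precedence declarations because the grammar itself is unambiguous. The cost is a longer grammar and extra unit reductions such as expr: term.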
The basic problem is that you don't have a precedence for the INC and DEC tokens, so Bison doesn't know how to resolve ambiguities involving a lookahead of INC or DEC. If you add
%right INC DEC
at the end of the precedence list (you want unaries to be higher precedence and postfix higher than prefix), it will fix it, and you can even get rid of all the PREINC/POSTINC stuff, as it's irrelevant.
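Applied to the trimmed-down grammar from the question, that suggestion might look like the following sketch, with the PREINC/POSTINC pseudo-tokens and the %prec annotations dropped:

```yacc
%token NUMBER ID
%left '+' '-'
%left '*' '/'
%right INC DEC      /* declared last, so highest precedence */
%%
expr: NUMBER
    | ID
    | expr '+' expr
    | expr '-' expr
    | expr '*' expr
    | expr '/' expr
    | INC expr      /* rule precedence comes from INC, its last terminal */
    | DEC expr
    | expr INC
    | expr DEC
    | '(' expr ')'
    ;
```

Because INC and DEC are right-associative here, an input such as ++a++ is shifted past the operand first and groups as ++(a++), which matches the "postfix higher than prefix" intent.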
Pre-increment and post-increment operators are non-associative, so declare them with %nonassoc in the precedence section, and use %prec in the rules to give these operators a high precedence.

Resources