Removing direct left recursion in JavaCC - parsing

I have the following in a JavaCC file:
void condition() : {}
{
expression() comp_op() expression()
| condition() (<AND> | <OR>) condition()
}
where <AND> is "&&" and <OR> is "||". This is causing problems due to the fact that it is direct left-recursion. How can I fix this?

A condition is essentially 1 or more of expression comp_op expression separated by an AND or an OR. You could do the following
condition --> simpleCondition ( (<AND> | <OR>) simpleCondition )*
simpleCondition --> expression comp_op expression

Related

Bison - Productions for longest matching expression

I am using Bison, together with Flex, to try and parse a simple grammar that was provided to me. In this grammar (almost) everything is considered an expression and has some kind of value; there are no statements. What's more, the EBNF definition of the grammar comes with certain ambiguities:
expression OP expression where op may be '+', '-' '&' etc. This can easily be solved using bison's associativity operators and setting %left, %right and %nonassoc according to common language standards.
IF expression THEN expression [ELSE expression] as well as DO expression WHILE expression, for which ignoring the common case dangling else problem I want the following behavior:
In if-then-else as well as while expressions, the embedded expressions are taken to be as long as possible (allowed by the grammar). E.g 5 + if cond_expr then then_expr else 10 + 12 is equivalent to 5 + (if cond_expr then then_expr else (10 + 12)) and not 5 + (if cond_expr then then_expr else 10) + 12
Given that everything in the language is considered an expression, I cannot find a way to re-write the production rules in a form that does not cause conflicts. One thing I tried, drawing inspiration from the dangling else example in the bison manual was:
expression: long_expression
| short_expression
;
long_expression: short_expression
| IF long_expression THEN long_expression
| IF long_expression long_expression ELSE long_expression
| WHILE long_expression DO long_expression
;
short_expression: short_expression '+' short_expression
| short_expression '-' short_expression
...
;
However this does not seem to work and I cannot figure out how I could tweak it into working. Note that I (assume I) have resolved the dangling ELSE problem using nonassoc for ELSE and THEN and the above construct as suggested in some book, but I am not sure this is even valid in the case where there are not statements but only expressions. Note as well as that associativity has been set for all other operators such as +, - etc. Any solutions or hints or examples that resolve this?
----------- EDIT: MINIMAL EXAMPLE ---------------
I tried to include all productions with tokens that have specific associativity, including some extra productions to show a bit of the grammar. Notice that I did not actually use my idea explained above. Notice as well that I have included a single binary and unary operator just to make the code a bit shorter, the rules for all operators are of the same form. Bison with -Wall flag finds no conflicts with these declarations (but I am pretty sure they are not 100% correct).
%token<int> INT32 LET IF WHILE INTEGER OBJECTID TYPEID NEW
%right <str> THEN ELSE STR
%right '^' UMINUS NOT ISNULL ASSIGN DO IN
%left '+' '-'
%left '*' '/'
%left <str> AND '.'
%nonassoc '<' '='
%nonassoc <str> LOWEREQ
%type<ast_expr> expression
%type ...
exprlist: expression { ; }
| exprlist ';' expression { ; };
block: '{' exprlist '}' { ; };
args: %empty { ; }
| expression { ; }
| args ',' expression { ; };
expression: IF expression THEN expression { ; }
| IF expression THEN expression ELSE expression { ; }
| WHILE expression DO expression { ; }
| LET OBJECTID ':' type IN expression { ; }
| NOT expression { /* UNARY OPERATORS */ ; }
| expression '=' expression { /* BINARY OPERATORS */ ; }
| OBJECTID '(' args ')' { ; }
| expression '.' OBJECTID '(' args ')' { ; }
| NEW TYPEID { ; }
| STR { ; }
| INTEGER { ; }
| '(' ')' { ; }
| '(' expression ')' { ; }
| block { ; }
;
The following associativity declarations resolved all shift/reduce conflicts and produced the expected output (in all tests I could think of at least):
...
%right <str> THEN ELSE
%right DO IN
%right ASSIGN
%left <str> AND
%right NOT
%nonassoc '<' '=' LOWEREQ
%left '+' '-'
%left '*' '/'
%right UMINUS ISNULL
%right '^'
%left '.'
...
%%
...
expression: IF expression THEN expression
| IF expression THEN expression ELSE expression
| WHILE expression DO expression
| LET OBJECTID ':' type IN expression
| LET OBJECTID ':' type ASSIGN expression IN expression
| OBJECTID ASSIGN expression
...
| '-' expression %prec UMINUS
| expression '=' expression
...
| expression LOWEREQ expression
| OBJECTID '(' args ')'
...
...
Notice that the order of declaration of associativity and precedence rules for all symbols matters! I have not included all the production rules but if-else-then, while-do, let in, unary and binary operands are the ones that produced conflicts or wrong results with different associativity declarations.

JavaCC: treat white space like <OR>

I'm trying to build a simple grammar for Search Engine query.
I've got this so far -
options {
STATIC=false;
MULTI=true;
VISITOR=true;
}
PARSER_BEGIN(SearchParser)
package com.syncplicity.searchservice.infrastructure.parser;
public class SearchParser {}
PARSER_END(SearchParser)
SKIP :
{
" "
| "\t"
| "\n"
| "\r"
}
<*> TOKEN : {
<#_TERM_CHAR: ~[ " ", "\t", "\n", "\r", "!", "(", ")", "\"", "\\", "/" ] >
| <#_QUOTED_CHAR: ~["\""] >
| <#_WHITESPACE: ( " " | "\t" | "\n" | "\r" | "\u3000") >
}
TOKEN :
{
<AND: "AND">
| <OR: "OR">
| <NOT: ("NOT" | "!")>
| <LBRACKET: "(">
| <RBRACKET: ")">
| <TERM: (<_TERM_CHAR>)+ >
| <QUOTED: "\"" (<_QUOTED_CHAR>)+ "\"">
}
/** Main production. */
ASTQuery query() #Query: {}
{
subQuery()
( <AND> subQuery() #LogicalAnd
| <OR> subQuery() #LogicalOr
| <NOT> subQuery() #LogicalNot
)*
{
return jjtThis;
}
}
void subQuery() #void: {}
{
<LBRACKET> query() <RBRACKET> | term() | quoted()
}
void term() #Term:
{
Token t;
}
{
(
t=<TERM>
)
{
jjtThis.value = t.image;
}
}
void quoted() #Quoted:
{
Token t;
}
{
(
t=<QUOTED>
)
{
jjtThis.value = t.image;
}
}
Looks like it works as I wanted to, e.g it can handle AND, OR, NOT/!, single terms and quoted text.
However I can't force it to handle whitespaces between terms as OR operator. E.g hello world should be treated as hello OR world
I've tried all obvious solutions, like <OR: ("OR" | " ")>, removing " " from SKIP, etc. But it still doesn't work.
Perhaps you don't want whitespace treated as an OR, perhaps you want the OR keyword to be optional. In that case you can use a grammar like this
query --> subquery (<AND> subquery | (<OR>)? subquery | <NOT> subquery)*
However this grammar treat NOT as an infix operator. Also it doesn't reflect precedence. Usually NOT has precedence over AND and AND over OR. Also your main production should look for an EOF. For that you can try
query --> query0 <EOF>
query0 --> query1 ((<OR>)? query1)*
query1 --> query2 (<AND> query2)*
query2 --> <NOT> query2 | subquery
subquery --> <LBRACKET> query0 <RBRACKET> | <TERM> | <QUOTED>
Ok. Suppose you actually do want to require that any missing ORs be replaced by at least one space. Or to put it another way, if there is one or more white spaces where an OR would be permitted, then that white space is considered to be an OR.
As in my other solution, I'll treat NOT as a unary operator and give NOT precedence over AND and AND precedence over either sort of OR.
Change
SKIP : { " " | "\t" | "\n" | "\r" }
to
TOKEN : {<WS : " " | "\t" | "\n" | "\r" > }
Now use a grammar like this
query() --> query0() ows() <EOF>
query0() --> query1()
( LOOKAHEAD( ows() <OR> | ws() (<NOT> | <LBRACKET> | <TERM> | <QUOTED>) )
( ows() (<OR>)?
query1()
)*
query1() --> query2() (LOOKAHEAD(ows() <AND>) ows() <AND> query2())*
query2() --> ows() (<NOT> query2() | subquery())
subquery() --> <LBRACKET> query0() ows() <RBRACKET> | <TERM> | <QUOTED>
ows() --> (<WS>)*
ws() --> (<WS>)+

antlr4: actions 'causing' left-recursion errors

The grammar below is failing to generate a parser with this error:
error(119): Sable.g4::: The following sets of rules are mutually left-recursive [expression]
1 error(s)
The error disappears when I comment out the embedded actions, but comes back when I make them effective.
Why are the actions bringing about such left-recursion that antlr4 cannot handle it?
grammar Sable;
options {}
#header {
package org.sable.parser;
import org.sable.parser.internal.*;
}
sourceFile : statement* EOF;
statement : expressionStatement (SEMICOLON | NEWLINE);
expressionStatement : expression;
expression:
{System.out.println("w");} expression '+' expression {System.out.println("tf?");}
| IDENTIFIER
;
SEMICOLON : ';';
RPARENTH : '(';
LPARENTH : ')';
NEWLINE:
('\u000D' '\u000A')
| '\u000A'
;
WS : WhiteSpaceNotNewline -> channel(HIDDEN);
IDENTIFIER:
(IdentifierHead IdentifierCharacter*)
| ('`'(IdentifierHead IdentifierCharacter*)'`')
;
fragment DecimalDigit :'0'..'9';
fragment IdentifierHead:
'a'..'z'
| 'A'..'Z'
;
fragment IdentifierCharacter:
DecimalDigit
| IdentifierHead
;
// Non-newline whitespaces are defined apart because they carry meaning in
// certain contexts, e.g. within space-aware operators.
fragment WhiteSpaceNotNewline : [\u0020\u000C\u0009u000B\u000C];
UPDATE: the following workaround kind of solves this specific situation, but not the case when init/after-like actions are needed on a lesser scope - per choice, not per rule.
expression
#init {
enable(Token.HIDDEN_CHANNEL);
}
#after {
disable(Token.HIDDEN_CHANNEL);
}:
expression '+' expression
| IDENTIFIER
;

Where is this shift/reduce conflict coming from in Bison?

I am trying to get the hang of parsing by defining a very simple language in Jison (a javascript parser). It accepts the same / very similar syntax to bison.
Here is my grammar:
%token INT TRUE FALSE WHILE DO IF THEN ELSE LOCATION ASSIGN EOF DEREF
%left "+"
%left ">="
/* Define Start Production */
%start Program
/* Define Grammar Productions */
%%
Program
: Statement EOF
;
Statement
: Expression
| WHILE BoolExpression DO Statement
| LOCATION ASSIGN IntExpression
;
Expression
: IntExpression
| BoolExpression
;
IntExpression
: INT IntExpressionRest
| IF BoolExpression THEN Statement ELSE Statement
| DEREF LOCATION
;
IntExpressionRest
: /* epsilon */
| "+" IntExpression
;
BoolExpression
: TRUE
| FALSE
| IntExpression ">=" IntExpression
;
%%
I am getting one shift/reduce conflict. The output of Jison is here:
Conflict in grammar: multiple actions possible when lookahead token is >= in state 6
- reduce by rule: Expression -> IntExpression
- shift token (then go to state 17)
States with conflicts:
State 6
Expression -> IntExpression . #lookaheads= EOF >= THEN DO ELSE
BoolExpression -> IntExpression .>= IntExpression #lookaheads= EOF DO THEN ELSE >=
Your shift reduce conflict is detected because the >= is in the follow set of the Expression non-terminal. This is basically caused by the fact that a Statement can be an Expression and IntExpression can end with a statement. Consider the following input IF c THEN S1 ELSE S2 >= 42, if you had parentheses to disambiguate then this could be interpreted both as (IF c THEN S1 ELSE S2) >= 42 and IF c THEN S1 ELSE (S2 >= 42). Since the shift is preferred over the reduce, the latter will be chosen.
Your issue comes from
IF BoolExpression THEN Statement ELSE Statement
If the Statement after THEN contains an IF, how do you know if the ELSE belongs to the first or second IF? See here for more information: http://www.gnu.org/software/bison/manual/html_node/Shift_002fReduce.html
The only 100% non-ambiguous fix is to require some sort of delimiter around your if/else statements (most languages use the brackets "{" and "}"). Ex,
IF BoolExpression THEN '{' Statement '}' ELSE '{' Statement '}'

Problems with LL(1) grammar

I have a 26 rule grammar for a sub-grammar of Mini Java. This grammar is supposed to be non-object-oriented. Anyway, I've been trying to left-factor it and remove left-recursion. However I test it with JFLAP, though, it tells me it is not LL(1). I have followed every step of the algorithm in the Aho-Sethi book.
Could you please give me some tips?
Goal ::= MainClass $
MainClass ::= class <IDENTIFIER> { MethodDeclarations public static void main ( ) {
VarDeclarations Statements } }
VarDeclarations ::= VarDeclaration VarDeclarations | e
VarDeclaration ::= Type <IDENTIFIER> ;
MethodDeclarations ::= MethodDeclaration MethodDeclarations | e
MethodDeclaration ::= public static Type <IDENTIFIER> ( Parameters ) {
VarDeclarations Statements return GenExpression ; }
Parameters ::= Type <IDENTIFIER> Parameter | e
Parameter ::= , Type <IDENTIFIER> Parameter | e
Type ::= boolean | int
Statements ::= Statement Statements | e
Statement ::= { Statements }
| if ( GenExpression ) Statement else Statement
| while ( GenExpression ) Statement
| System.out.println ( GenExpression ) ;
| <IDENTIFIER> = GenExpression ;
GenExpression ::= Expression | RelExpression
Expression ::= Term ExpressionRest
ExpressionRest ::= e | + Term ExpressionRest | - Term ExpressionRest
Term ::= Factor TermRest
TermRest ::= e | * Factor TermRest
Factor ::= ( Expression )
| true
| false
| <INTEGER-LITERAL>
| <IDENTIFIER> ArgumentList
ArgumentList ::= e | ( Arguments )
RelExpression ::= RelTerm RelExpressionRest
RelExpressionRest ::= e | && RelTerm RelExpressionEnd
RelExpressionEnd ::= e | RelExpressionRest
RelTerm ::= Term RelTermRest
RelTermRest ::= == Expression | < Expression | ExpressionRest RelTermEnding
RelTermEnding ::= == Expression | < Expression
Arguments ::= Expression Argument | RelExpression Argument | e
Argument ::= , GenExpression Argument | e
Each <IDENTIFIER> is a valid Java identifier, and <INTEGER-LITERAL> is a simple integer. Each e production stands for an epsilon production, and the $ in the first rule is the end-of-file marker.
I think I spotted two problems (there might be more):
Problem #1
In MainClass you have
MethodDeclarations public static void main
And a MethodDeclaration is
public static Type | e
That's not LL(1) since when the parser sees "public" it cannot tell if it's a MethodDeclaration or the "public static void main" method.
Problem #2
Arguments ::= Expression Argument | RelExpression Argument | e
Both Expression:
Expression ::= Term ExpressionRest
... and RelExpression:
RelExpression ::= RelTerm RelExpressionRest
RelTerm ::= Term RelTermRest
... start with "Term" so that's not LL(1) either.
I'd just go for LL(k) or LL(*) because they allow you to write much more maintainable grammars.
Is there anything to prevent IDENTIFIER being the same as one of your reserved words? if not then your grammar would be ambiguous. I don't see anything else though.
If all else fails, I'd remove all but the last line of the grammar, and test that. If that passes I'd add each line one at a time until I found the problem line.

Resources