Shift/reduce conflict when using bison

I am writing a SQL parser with flex and bison, and I have run into a shift/reduce conflict, but I can't tell where the problem is. Consider these three queries:
1. select id,length(name) from t;
2. select length(name),id from t;
3. select length('hello') len;
Query 1 is parsed successfully, but query 2 is not. It seems to me that this is because of a conflict between the rules for query 3 and query 2.
The productions are as follows. The alternative SELECT no_table_attr_list SEMICOLON is the one that supports query 3.
select: select_clause SEMICOLON| SELECT no_table_attr_list SEMICOLON;
no_table_attr_list: no_table_attr |no_table_attr_list COMMA no_table_attr ;
no_table_attr:func LBRACE value RBRACE ID|func LBRACE value COMMA value RBRACE ID ;
select_clause:
select_begin select_attr FROM ID alias_ID rel_list where|select_begin select_attr FROM ID alias_ID INNER JOIN ID alias_ID ON condition condition_list join_list where;
select_begin: SELECT;
select_attr: func_with_param alias_ID attr_list |STAR attr_list| ID alias_ID attr_list
| ID DOT ID alias_ID attr_list | agg alias_ID attr_list;
func: LENGTH_T | ROUND_T| DATE_FORMAT_T ;
func_with_param: func LBRACE ID RBRACE| func LBRACE ID DOT ID RBRACE;
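One way to see where the conflict comes from is to compare what the parser can do right after it has read SELECT. The following Python sketch is not part of the original grammar. It hard-codes the token sets read off the rules above (the tokens behind agg and the value rule are not shown in the question, so they are left out as an assumption) and reports where a shift and a reduce compete.

```python
# Token sets read off the grammar in the question. Tokens derived from `agg`
# are unknown here and omitted (assumption).
FUNC = {"LENGTH_T", "ROUND_T", "DATE_FORMAT_T"}

# Item  select: SELECT . no_table_attr_list SEMICOLON
# The parser can SHIFT any token that starts `func`.
shift_tokens = set(FUNC)

# Item  select_begin: SELECT .
# The parser can REDUCE when the next token can start `select_attr`.
reduce_lookaheads = FUNC | {"STAR", "ID"}

# Tokens on which both actions are possible: the shift/reduce conflict.
conflict = shift_tokens & reduce_lookaheads
print(sorted(conflict))


def bison_default_action(next_token):
    """Mimic Bison's default: an unresolved shift/reduce conflict becomes a shift."""
    if next_token in shift_tokens:
        return "shift (commits to no_table_attr_list)"
    if next_token in reduce_lookaheads:
        return "reduce select_begin (commits to select_clause)"
    return "error"


print(bison_default_action("ID"))        # first token after SELECT in query 1
print(bison_default_action("LENGTH_T"))  # first token after SELECT in query 2
```

Under these assumptions, query 1 starts with ID, so only the reduce applies and the select_clause path is taken. Query 2 starts with LENGTH_T, where both actions are possible. Bison's default is to shift, which commits to no_table_attr, and that rule expects value inside the parentheses. If value does not derive ID, then name is a syntax error at that point.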

Related

Why tiger(Modern Compiler Implementation) use `fundecs` in chapter 4 instead of `fundec`?

I'm following the tiger book to write a compiler.
In chapter 3, based on the github's code and my understanding, I filled in the following rules for the dec:
decs:
%empty
| decs dec
;
dec:
tydec
| vardec
| fundec
;
tydec:
TYPE ID '=' ty
;
vardec:
VAR ID ASSIGN exp
| VAR ID ':' ID ASSIGN exp
;
fundec:
FUNCTION ID '(' tyfields ')' '=' exp
| FUNCTION ID '(' tyfields ')' ':' ID '=' exp
;
However, in chap 4, the book provided the following functions for ast:
A_fundecList A_FundecList(A_fundec head, A_fundecList tail);
A_nametyList A_NametyList(A_namety head, A_nametyList tail);
This led most of the code I found to adjust the decs rules as follows:
decs:
%empty
| decs dec
;
dec:
tydecs
| vardec
| fundecs
;
tydecs:
tydec
| tydec tydecs
;
tydec:
TYPE ID '=' ty
;
vardec:
VAR ID ASSIGN exp
| VAR ID ':' ID ASSIGN exp
;
fundecs:
fundec
| fundec fundecs {$$ = A_FundecList($1, $2);}
;
fundec:
FUNCTION ID '(' tyfields ')' '=' exp
| FUNCTION ID '(' tyfields ')' ':' ID '=' exp
;
The list nonterminals fundecs and tydecs were added to the production rules.
I do not understand why, since this will obviously create a conflict: decs is already a list that can contain fundecs and tydecs, so a sequence of fundec items, for example, can be reduced either as several entries of decs or as a single fundecs list.
So I would like to ask: what is the reason for adding a conflicting grammar to the parser?
Thanks a lot!!!

How to parse decimal values correctly?

I'm using ANTLR with Presto grammar in order to parse SQL queries.
I'm having an issue with parsing a decimal number. I've the following definitions:
number
: decimalValue #decimalLiteral
| DOUBLE_VALUE #doubleLiteral
| INTEGER_VALUE #integerLiteral
;
decimalValue
: INTEGER_VALUE '.' INTEGER_VALUE?
| '.' INTEGER_VALUE
;
DOUBLE_VALUE
: DIGIT+ ('.' DIGIT*)? EXPONENT
| '.' DIGIT+ EXPONENT
;
IDENTIFIER
// : (LETTER | '_' | DIGIT) (LETTER | DIGIT | '_' | '#' | ':' | '.')*
: (LETTER | DIGIT | '_' | '#' | ':' | '-' )+
;
This works ok for most cases. However, it has an issue with parsing decimal values.
select x/(0.3-0.2)
from table1
It fails to parse. The reason is that the lexer treats "3-0" as an IDENTIFIER token.
When I change the query to be something like:
select x/(0.3 - 0.2)
from table1
it works.
Any ideas on how I can handle the original query (without, of course, causing a regression)?
Thanks,
Nir.
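The behaviour described above follows from longest-match lexing: at each position the lexer takes the rule that consumes the most characters, and only on a tie does the earlier rule win. Below is a minimal Python sketch of that policy. It is not ANTLR and not the Presto lexer; the three rules are simplified stand-ins for the definitions quoted in the question, and SYMBOL is a hypothetical rule standing in for the literal tokens such as '.' and '-'.

```python
import re

# Simplified stand-ins for the quoted lexer rules (assumption: not the full grammar).
# Order only matters for ties: the earlier rule wins when matches have equal length.
RULES = [
    ("SYMBOL", re.compile(r"[./()\-]")),
    ("INTEGER_VALUE", re.compile(r"[0-9]+")),
    ("IDENTIFIER", re.compile(r"[A-Za-z0-9_#:\-]+")),
]


def tokenize(text):
    """Longest-match tokenizer: at each position keep the rule with the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer matches replace the current best; ties keep the earlier rule.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens


print(tokenize("0.3-0.2"))
print(tokenize("0.3 - 0.2"))
```

In the first call, "3-0" comes out as a single IDENTIFIER because that rule accepts digits and '-' and so produces the longest match. In the second call the spaces split the input, so "3", "-" and "0" become separate tokens. The sketch only reproduces the symptom. Any fix would have to keep IDENTIFIER from swallowing a digit followed by '-', which would need to be checked against the identifiers the grammar is meant to accept.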

Context-Free-Grammar for assignment statements in ANTLR

I'm writing an ANTLR lexer/parser for a context-free grammar.
This is what I have now:
statement
: assignment_statement
;
assignment_statement
: IDENTIFIER '=' expression ';'
;
term
: IDENT
| '(' expression ')'
| INTEGER
| STRING_LITERAL
| CHAR_LITERAL
| IDENT '(' actualParameters ')'
;
negation
: 'not'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/' | 'mod') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
So my assignment statement is identified by the form
IDENTIFIER = expression;
However, an assignment statement should also handle the case where the right-hand side is a function call (the value assigned is the call's return value). For example,
items = getItems();
What grammar rule should I add for this? I thought of adding a function call to the "expression" rule, but I wasn't sure whether a function call should be regarded as an expression.
Thanks
This grammar looks fine to me. I am assuming that IDENT and IDENTIFIER are the same and that you have additional productions for the remaining terminals.
This production seems to define a function call.
| IDENT '(' actualParameters ')'
You also need a production for the actual parameters, something like this (the list may be empty):
actualParameters : ( expression ( ',' expression )* )? ;
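To illustrate that a function call fits naturally as one alternative of term, here is a small hand-written recursive-descent sketch in Python. It is not ANTLR output. It covers only a reduced subset of the grammar (identifiers, integers, parentheses, '+' and '-', and calls), and the names lex and Parser are made up for this example.

```python
import re

TOKEN_RE = re.compile(r"\s*([A-Za-z][A-Za-z0-9]*|[0-9]+|[=;(),+\-])")


def lex(text):
    """Split the input into identifier, integer and punctuation tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def ident(self):
        tok = self.eat()
        if not tok[0].isalpha():
            raise SyntaxError(f"expected identifier, got {tok!r}")
        return tok

    def assignment(self):
        # assignment_statement : IDENT '=' expression ';'
        name = self.ident()
        self.eat("=")
        value = self.expression()
        self.eat(";")
        return ("assign", name, value)

    def expression(self):
        # expression : term (('+' | '-') term)*   (reduced from the full rule chain)
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.eat()
            node = (op, node, self.term())
        return node

    def term(self):
        tok = self.peek()
        if tok == "(":
            self.eat("(")
            node = self.expression()
            self.eat(")")
            return node
        if tok is not None and tok.isdigit():
            return ("int", int(self.eat()))
        name = self.ident()
        if self.peek() == "(":
            # term : IDENT '(' actualParameters ')'
            self.eat("(")
            args = self.actual_parameters()
            self.eat(")")
            return ("call", name, args)
        return ("ident", name)

    def actual_parameters(self):
        # actualParameters : ( expression ( ',' expression )* )?
        args = []
        if self.peek() != ")":
            args.append(self.expression())
            while self.peek() == ",":
                self.eat(",")
                args.append(self.expression())
        return args


print(Parser(lex("items = getItems();")).assignment())
```

Because the call is just another term, it can appear anywhere an expression is allowed, including the right-hand side of an assignment, with no change to the assignment rule itself.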

Bison: Conflicts: 1 shift/reduce error

I'm trying to build a parser with bison and have narrowed all my errors down to one difficult one.
Here's the debug output of bison with the state where the error lies:
state 120
12 statement_list: statement_list . SEMICOLON statement
24 if_statement: IF conditional THEN statement_lists ELSE statement_list .
SEMICOLON shift, and go to state 50
SEMICOLON [reduce using rule 24 (if_statement)]
$default reduce using rule 24 (if_statement)
Here are the translation rules in the parser.y source
%%
program : ID COLON block ENDP ID POINT
;
block : CODE statement_list
| DECLARATIONS declaration_block CODE statement_list
;
declaration_block : id_list OF TYPE type SEMICOLON
| declaration_block id_list OF TYPE type SEMICOLON
;
id_list : ID
| ID COMMA id_list
;
type : CHARACTER
| INTEGER
| REAL
;
statement_list : statement
| statement_list SEMICOLON statement
;
statement_lists : statement
| statement_list SEMICOLON statement
;
statement : assignment_statement
| if_statement
| do_statement
| while_statement
| for_statement
| write_statement
| read_statement
;
assignment_statement : expression OUTPUTTO ID
;
if_statement : IF conditional THEN statement_lists ENDIF
| IF conditional THEN statement_lists ELSE statement_list
;
do_statement : DO statement_list WHILE conditional ENDDO
;
while_statement : WHILE conditional DO statement_list ENDWHILE
;
for_statement : FOR ID IS expression BY expressions TO expression DO statement_list ENDFOR
;
write_statement : WRITE BRA output_list KET
| NEWLINE
;
read_statement : READ BRA ID KET
;
output_list : value
| value COMMA output_list
;
condition : expression comparator expression
;
conditional : condition
| NOT conditional
| condition AND conditional
| condition OR conditional
;
comparator : ASSIGNMENT
| BETWEEN
| LT
| GT
| LESSEQUAL
| GREATEREQUAL
;
expression : term
| term PLUS expression
| term MINUS expression
;
expressions : term
| term PLUS expressions
| term MINUS expressions
;
term : value
| value MULTIPLY term
| value DIVIDE term
;
value : ID
| constant
| BRA expression KET
;
constant : number_constant
| CHARCONST
;
number_constant : NUMBER
| MINUS NUMBER
| NUMBER POINT NUMBER
| MINUS NUMBER POINT NUMBER
;
%%
When I remove the if_statement rule there are no errors, so I've narrowed it down considerably, but still can't solve the error.
Thanks for any help.
Consider this statement: if condition then s1 else s2; s3
There are two interpretations:
if condition then
    s1;
else
    s2;
s3;
The other one is:
if condition then
    s1;
else
    s2;
    s3;
In the first one, the statement list is composed of an if statement followed by s3, while in the second the whole input is a single if statement whose else branch contains both s2 and s3. That's where the ambiguity comes from. Bison prefers shift over reduce when a shift/reduce conflict exists, so in the above case the parser will shift the SEMICOLON and make s3 part of the else branch.
Since you have an ENDIF in your if-then statement, consider introducing an ENDIF in your if-then-else statement as well; that resolves the conflict.
I think you are missing ENDIF in the IF-THEN-ELSE-ENDIF rule.

removing ambiguity

I have the following bison grammar (as part of a more complex grammar):
classDeclaration : CLASS ID EXTENDS ID LBRACE variableDeclarationList methodDeclarationList RBRACE
;
variableDeclarationList : variableDeclarationList variableDeclaration
| /* empty */
;
variableDeclaration : type ID SEMICOLON
;
type : NATTYPE | ID
;
methodDeclarationList : methodDeclarationList methodDeclaration
| /* empty */
;
methodDeclaration : type ID LPAREN parameterDeclarationList RPAREN variableExpressionBlock
;
which is supposed to describe class declarations which look like this:
class foo extends object
{
nat number;
nat divide(nat aNumber)
{
0;
}
}
or this:
class foo extends object
{
nat divide(nat aNumber)
{
0;
}
}
or this:
class foo extends object
{
}
Problem is that there is ambiguity where variable declarations end and method declarations begin (2 shift/reduce conflicts). For example, the method declaration looks like a variable declaration until it sees the parenthesis.
How can I rewrite this grammar to eliminate this ambiguity?
To clarify: the class body can be empty, the only constraint is that variable declarations come before method declarations if there are any.
This isn't an ambiguity, it's a lookahead problem. The problem is that you need 3 tokens of lookahead (up to the SEMICOLON or LPAREN of the next declaration) for the parser to figure out where the end of the variableDeclarationList is, as it needs to reduce an empty methodDeclarationList before it starts parsing more methodDeclarations.
The way to fix this is to remove the need for an empty reduction at the start of a method declaration list:
methodDeclarationList : nonEmptyMethodDeclarationList | /*empty */ ;
nonEmptyMethodDeclarationList : nonEmptyMethodDeclarationList methodDeclaration
| methodDeclaration
;
With this, the parser does not need to reduce an empty methodDeclarationList UNLESS there are no methods at all, and in that case only one token of lookahead is needed to see the RBRACE.
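The lookahead point can be made concrete with a short Python sketch. It is not how Bison works internally; it only shows that, given the token names from the grammar, a variable declaration and a method declaration share their first two tokens and differ only at the third. The function name declaration_kind is made up for this example.

```python
def declaration_kind(tokens, start):
    """Classify the declaration starting at tokens[start] by peeking at its third token."""
    # tokens[start]     is the type (NATTYPE or ID)
    # tokens[start + 1] is the declared name (ID)
    # tokens[start + 2] is the first token that tells the two forms apart
    third = tokens[start + 2]
    if third == "SEMICOLON":
        return "variable"
    if third == "LPAREN":
        return "method"
    raise SyntaxError(f"unexpected token {third!r}")


# Token stream for:  nat number;  nat divide( ...
tokens = ["NATTYPE", "ID", "SEMICOLON", "NATTYPE", "ID", "LPAREN"]

print(declaration_kind(tokens, 0))
print(declaration_kind(tokens, 3))
print(tokens[0:2] == tokens[3:5])  # the first two tokens are identical in both forms
```

A parser with a single token of lookahead sees only NATTYPE or ID at the point where the original grammar forces it to decide whether the variable list has ended, which is why the rewrite that avoids the early empty reduction helps.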
Another way to do it is to not even have empty rules, and instead use multiple options, one with the nonterminal and one without.
classDeclaration : CLASS ID EXTENDS ID LBRACE RBRACE
| CLASS ID EXTENDS ID LBRACE methodDeclarationList RBRACE
| CLASS ID EXTENDS ID LBRACE variableDeclarationList RBRACE
| CLASS ID EXTENDS ID LBRACE variableDeclarationList methodDeclarationList RBRACE
;
variableDeclarationList : variableDeclaration
| variableDeclarationList variableDeclaration
;
variableDeclaration : type ID SEMICOLON
;
type : NATTYPE
| ID
;
methodDeclarationList : methodDeclaration
| methodDeclarationList methodDeclaration
;
methodDeclaration : type ID LPAREN RPAREN variableExpressionBlock
| type ID LPAREN parameterList RPAREN variableExpressionBlock
;
I am not familiar with bison, but have you tried making a rule for the common prefix of both rules? The "type ID" sequence is present in both the variable and method patterns.
So if you had say:
typedId : type ID
;
and then
variableDeclaration : typedId SEMICOLON
;
methodDeclaration : typedId LPAREN parameterDeclarationList RPAREN variableExpressionBlock
;
This way the variable rule and the method rule would not be considered until the common prefix has already been pushed, and the next token would be unambiguous.
I have not played with this sort of thing in years, but I hope this helps.
This slightly modified grammar works:
%token CLASS EXTENDS ID LBRACE RBRACE SEMICOLON NATTYPE LPAREN RPAREN DIGIT COMMA
%%
classDeclaration : CLASS ID EXTENDS ID LBRACE declarationList RBRACE
;
declarationList : /* Empty */
| declarationList declaration
;
declaration : variableDeclaration
| methodDeclaration
;
variableDeclaration : parameterDeclaration SEMICOLON
;
type : NATTYPE | ID
;
methodDeclaration : parameterDeclaration LPAREN parameterDeclarationList RPAREN
variableExpressionBlock
;
variableExpressionBlock : LBRACE DIGIT RBRACE
;
parameterDeclarationList : /* empty */
| parameterDeclarationList COMMA parameterDeclaration
;
parameterDeclaration : type ID
;
You probably want to rename the non-terminal 'parameterDeclaration' to something like 'singleVariableDeclaration', but by avoiding having two possibly empty rules in a row (the original 'variableDeclarationList' and 'methodDeclarationList'), you avoid the ambiguity.
This does allow, syntactically, methods interleaved with variables in the class's declarationList. If that isn't acceptable for some reason, consider making that a semantic error rather than a syntactic error. If it must be a syntax error, then someone is going to have to do some thinking; I vote to make you do the thinking.
If you insist on at least one method declaration, then the grammar is unambiguous:
methodDeclarationList : methodDeclarationList methodDeclaration
| methodDeclaration /* empty */
;
If you try the same with a variable declaration list, the grammar still has two S/R conflicts.
One possibility, not to be completely ignored, is to use the Bison feature, %expect 2, to indicate that 2 shift/reduce conflicts are expected.
%expect 2
%token CLASS EXTENDS ID LBRACE RBRACE SEMICOLON NATTYPE LPAREN RPAREN DIGIT COMMA
%%
classDeclaration : CLASS ID EXTENDS ID LBRACE variableDeclarationList methodDeclarationList RBRACE
;
variableDeclarationList : variableDeclarationList variableDeclaration
| /* empty */
;
variableDeclaration : singleVariableDeclaration SEMICOLON
;
type : NATTYPE | ID
;
methodDeclarationList : methodDeclarationList methodDeclaration
| /* empty */
;
methodDeclaration : singleVariableDeclaration LPAREN parameterDeclarationList RPAREN variableExpressionBlock
;
variableExpressionBlock : LBRACE DIGIT RBRACE
;
parameterDeclarationList : /* empty */
| parameterDeclarationList COMMA parameterDeclaration
;
parameterDeclaration : singleVariableDeclaration
;
singleVariableDeclaration : type ID
;

Resources