Grammar ambiguity: why? (problem is: "(a)" vs "(a-z)")

So I am trying to implement a pretty simple grammar for one-line statements:
# Grammar
c : Character c [a-z0-9-]
(v) : Vowel (= [a,e,u,i,o])
(c) : Consonant
(?) : Any character (incl. number)
(l) : Any alpha char (= [a-z])
(n) : Any integer (= [0-9])
(c1-c2) : Range from char c1 to char c2
(c1,c2,c3) : List including chars c1, c2 and c3
Examples:
h(v)(c)no(l)(l)jj-k(n)
h(v)(c)no(l)(l)(a)(a)(n)
h(e-g)allo
h(e,f,g)allo
h(x,y,z)uul
h(x,y,z)(x,y,z)(x,y,z)(x,y,z)uul
I am using the Happy parser generator (http://www.haskell.org/happy/) but for some reason there seems to be some ambiguity problem.
The error message is: "shift/reduce conflicts: 1"
I think the ambiguity is with these two lines:
| lBracket char rBracket { (\c -> case c of
'v' -> TVowel
'c' -> TConsonant
'l' -> TLetter
'n' -> TNumber) $2 }
| lBracket char hyphen char rBracket { TRange $2 $4 }
An example case is: "(a)" vs "(a-z)"
The lexer would give the following for the two cases:
(a) : [CLBracket, CChar 'a', CRBracket]
(a-z) : [CLBracket, CChar 'a', CHyphen, CChar 'z', CRBracket]
What I don't understand is how this can be ambiguous with an LL[2] parser.
In case it helps here is the entire Happy grammar definition:
{
module XHappyParser where
import Data.Char
import Prelude hiding (lex)
import XLexer
import XString
}
%name parse
%tokentype { Character }
%error { parseError }
%token
lBracket { CLBracket }
rBracket { CRBracket }
hyphen { CHyphen }
question { CQuestion }
comma { CComma }
char { CChar $$ }
%%
xstring : tokens { XString (reverse $1) }
tokens : token { [$1] }
| tokens token { $2 : $1 }
token : char { TLiteral $1 }
| hyphen { TLiteral '-' }
| lBracket char rBracket { (\c -> case c of
'v' -> TVowel
'c' -> TConsonant
'l' -> TLetter
'n' -> TNumber) $2 }
| lBracket question rBracket { TAny }
| lBracket char hyphen char rBracket { TRange $2 $4 }
| lBracket listitems rBracket { TList $2 }
listitems : char { [$1] }
| listitems comma char { $1 ++ [$3] }
{
parseError :: [Character] -> a
parseError _ = error "parse error"
}
Thank you!

Here's the ambiguity:
token : [...]
| lBracket char rBracket
| [...]
| lBracket listitems rBracket
listitems : char
| [...]
Your parser could accept (v) as both XString [TVowel] and XString [TList ['v']], not to mention the missing characters in that case expression.
One possible way of solving it would be to modify your grammar so lists are at least two items, or have some different notation for vowels, consonants, etc.

The problem seems to be:
| lBracket char rBracket
...
| lBracket listitems rBracket
or in cleaner syntax:
(c)
Can be a TVowel, TConsonant, TLetter, TNumber (as you know) or a singleton TList.
As the Happy manual says, shift/reduce conflicts usually aren't an issue. You can use precedence to force the behavior or remove the warning if you'd like.

Related

I am trying to make a Parser with ocamlyacc for a Language, but what type should I put?

I have the following code, with more rules after it (such as expr: INT {} | BOOL {} etc.), but I don't know what type I should write as the type of this parser. I have a calculator example that works with int, where the type is int, but my program also has float, char, string, etc. Thanks.
%{
dont know what to write here
%}
%token <int> INT
%token <float> FLOAT
%token <char> CHAR
%token <bool> BOOL
%token <string> IDENT
%token PLUS Div Bigger Smaller MINUS TIMES
%token TYPE
%token DEF DD
%token Equals Atribuicao SoE BoE And Or
%token IF ELSE BEGIN END WHILE RETURN PV SEQ TO BY OF
%token RP LP LB RB
%token EOL
%left Bigger Smaller SoE BoE Equals Atribuicao Or And
%left PLUS MINUS
%left TIMES Div
%nonassoc UMINUS OF
%start main
%type <> main /* what should be in here ? */
main:
| expr EOL { $1 }
expr:
INT { }
| BOOL { }
| FLOAT { }
| CHAR { }
| expr OF expr { }
| BEGIN expr END { }
| RETURN expr PV { $2 }
| LP expr RP { $2 }
| LB expr RB { $2 }
| expr PLUS expr { }
| expr MINUS expr { }
| expr TIMES expr { }
%%
let main() = begin
Printf.printf "Hello yo\n" ;
end;;

Judging from your grammar, the return type should be something like expression, because it is expressions that you do parse. How you define that type depends on the semantics you want to implement. I guess that you will need a variant type that can at least hold atoms of type int, bool, float and char. So you can start with
type expression =
| Int of int
| Bool of bool
| Float of float
| Char of char
and see where it takes you.
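To show the shape of such a variant type outside OCaml, here is a rough Python analogue. It is only an illustration of the idea: the Plus case and the describe helper are hypothetical additions (one possible way to represent expr PLUS expr), not something taken from the question.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Int:
    value: int


@dataclass
class Bool:
    value: bool


@dataclass
class Float:
    value: float


@dataclass
class Char:
    value: str


@dataclass
class Plus:  # hypothetical extra case for `expr PLUS expr`
    left: "Expression"
    right: "Expression"


# One type that every grammar action can return, whatever atom it parsed.
Expression = Union[Int, Bool, Float, Char, Plus]


def describe(e: Expression) -> str:
    """Render a tree, dispatching on the variant like an OCaml match would."""
    if isinstance(e, Plus):
        return f"({describe(e.left)} + {describe(e.right)})"
    return f"{type(e).__name__}({e.value!r})"


tree = Plus(Int(1), Float(2.5))
print(describe(tree))
```

Each semantic action then builds one of these values, so the start symbol has a single type even though the atoms differ.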

Does YACC (LALR) parse left-recursive grammars?

I was working on a simple C parser project.
This problem occurred when I was writing the grammar for the if-else construct.
The grammar that I have written is as follows:
iexp: IF OP exp CP block eixp END{
printf("Valid if-else ladder\n");
};
eixp:
|ELSE iexp
|ELSE block;
exp: id
|NUMBER
|exp EQU exp
|exp LESS exp
|exp GRT exp
|OP exp CP;
block: statement
|iexp
|OCP statement CCP
|OCP iexp CCP;
statement: id ASSIGN NUMBER SEMICOL;
id: VAR;
where the lex part looks something like this:
"if" {return IF;}
"else" {return ELSE;}
[0-9]+ {return NUMBER;}
">" {return GRT;}
"<" {return LESS;}
"==" {return EQU;}
"{" {return OCP;}
"}" {return CCP;}
"(" {return OP;}
")" {return CP;}
"$" {return END;}
";" {return SEMICOL;}
"=" {return ASSIGN;}
[a-zA-Z]+ {return VAR;}
. {;}
The output I get is:
yacc: 9 shift/reduce conflicts, 1 reduce/reduce conflict.
When I eliminate the left recursion in the exp derivation, the conflicts vanish, but why is that so?
The revised grammar after eliminating left recursion was:
exp: id
|NUMBER
|id EQU exp
|id LESS exp
|id GRT exp
|OP exp CP;
I was able to successfully parse the grammar for the evaluation of arithmetic expressions. Is it that %right and %left made it work?
%token ID
%left '+' '-'
%left '*' '/'
%right NEGATIVE
%%
S:E {
printf("\nexpression : %s\nResult=%d\n", buf, $$);
buf[0] = '\0';
};
E: E '+' E {
printf("+");
($$ = $1 + $3);
} |
E '-' E {
printf("-");
($$ = $1 - $3);
} |
E '*' E {
printf("*");
($$ = $1 * $3);
} |
E '/' E {
printf("/");
($$ = $1 / $3);
} |
'(' E ')' {
($$ = $2);
} |
ID {
/*do nothing done by lex*/
};

Yacc derivations failing to be recognized

This is a class project of sorts, and I've worked out 99% of all kinks, but now I'm stuck. The grammar is for MiniJava.
I have the following lex file which works as intended:
%{
#include "y.tab.h"
%}
delim [ \t\n]
ws {delim}+
comment ("/*".*"*/")|("//".*\n)
id [a-zA-Z]([a-zA-Z0-9_])*
int_literal [0-9]*
op ("&&"|"<"|"+"|"-"|"*")
class "class"
public "public"
static "static"
void "void"
main "main"
string "String"
extends "extends"
return "return"
boolean "boolean"
if "if"
new "new"
else "else"
while "while"
length "length"
int "int"
true "true"
false "false"
this "this"
println "System.out.println"
lbrace "{"
rbrace "}"
lbracket "["
rbracket "]"
semicolon ";"
lparen "("
rparen ")"
comma ","
equals "="
dot "."
exclamation "!"
%%
{ws} { /* Do nothing! */ }
{comment} { /* Do nothing! */ }
{println} { return PRINTLN; } /* Before {period} to give this precedence */
{op} { return OP; }
{int_literal} { return INTEGER_LITERAL; }
{class} { return CLASS; }
{public} { return PUBLIC; }
{static} { return STATIC; }
{void} { return VOID; }
{main} { return MAIN; }
{string} { return STRING; }
{extends} { return EXTENDS; }
{return} { return RETURN; }
{boolean} { return BOOLEAN; }
{if} { return IF; }
{new} { return NEW; }
{else} { return ELSE; }
{while} { return WHILE; }
{length} { return LENGTH; }
{int} { return INT; }
{true} { return TRUE; }
{false} { return FALSE; }
{this} { return THIS; }
{lbrace} { return LBRACE; }
{rbrace} { return RBRACE; }
{lbracket} { return LBRACKET; }
{rbracket} { return RBRACKET; }
{semicolon} { return SEMICOLON; }
{lparen} { return LPAREN; }
{rparen} { return RPAREN; }
{comma} { return COMMA; }
{equals} { return EQUALS; }
{dot} { return DOT; }
{exclamation} { return EXCLAMATION; }
{id} { return ID; }
%%
int main(void) {
yyparse();
exit(0);
}
int yywrap(void) {
return 0;
}
int yyerror(void) {
printf("Parse error. Sorry bro.\n");
exit(1);
}
And the yacc file:
%token PRINTLN
%token INTEGER_LITERAL
%token OP
%token CLASS
%token PUBLIC
%token STATIC
%token VOID
%token MAIN
%token STRING
%token EXTENDS
%token RETURN
%token BOOLEAN
%token IF
%token NEW
%token ELSE
%token WHILE
%token LENGTH
%token INT
%token TRUE
%token FALSE
%token THIS
%token LBRACE
%token RBRACE
%token LBRACKET
%token RBRACKET
%token SEMICOLON
%token LPAREN
%token RPAREN
%token COMMA
%token EQUALS
%token DOT
%token EXCLAMATION
%token ID
%%
Program: MainClass ClassDeclList
MainClass: CLASS ID LBRACE PUBLIC STATIC VOID MAIN LPAREN STRING LBRACKET RBRACKET ID RPAREN LBRACE Statement RBRACE RBRACE
ClassDeclList: ClassDecl ClassDeclList
|
ClassDecl: CLASS ID LBRACE VarDeclList MethodDeclList RBRACE
| CLASS ID EXTENDS ID LBRACE VarDeclList MethodDeclList RBRACE
VarDeclList: VarDecl VarDeclList
|
VarDecl: Type ID SEMICOLON
MethodDeclList: MethodDecl MethodDeclList
|
MethodDecl: PUBLIC Type ID LPAREN FormalList RPAREN LBRACE VarDeclList StatementList RETURN Exp SEMICOLON RBRACE
FormalList: Type ID FormalRestList
|
FormalRestList: FormalRest FormalRestList
|
FormalRest: COMMA Type ID
Type: INT LBRACKET RBRACKET
| BOOLEAN
| INT
| ID
StatementList: Statement StatementList
|
Statement: LBRACE StatementList RBRACE
| IF LPAREN Exp RPAREN Statement ELSE Statement
| WHILE LPAREN Exp RPAREN Statement
| PRINTLN LPAREN Exp RPAREN SEMICOLON
| ID EQUALS Exp SEMICOLON
| ID LBRACKET Exp RBRACKET EQUALS Exp SEMICOLON
Exp: Exp OP Exp
| Exp LBRACKET Exp RBRACKET
| Exp DOT LENGTH
| Exp DOT ID LPAREN ExpList RPAREN
| INTEGER_LITERAL
| TRUE
| FALSE
| ID
| THIS
| NEW INT LBRACKET Exp RBRACKET
| NEW ID LPAREN RPAREN
| EXCLAMATION Exp
| LPAREN Exp RPAREN
ExpList: Exp ExpRestList
|
ExpRestList: ExpRest ExpRestList
|
ExpRest: COMMA Exp
%%
The derivations that are not working are the following two:
Statement:
| ID EQUALS Exp SEMICOLON
| ID LBRACKET Exp RBRACKET EQUALS Exp SEMICOLON
If I only lex the file and get the token stream, the tokens match the pattern perfectly. Here's an example input and output:
num1 = id1;
num2[0] = id2;
gives:
ID
EQUALS
ID
SEMICOLON
ID
LBRACKET
INTEGER_LITERAL
RBRACKET
EQUALS
ID
SEMICOLON
What I don't understand is how this token stream matches the grammar exactly, and yet yyerror is being called. I've been trying to figure this out for hours, and I've finally given up. I'd appreciate any insight into what's causing the problem.
For a full example, you can run the following input through the parser:
class Minimal {
public static void main (String[] a) {
// Infinite loop
while (true) {
/* Completely useless // (embedded comment) statements */
if ((!false && true)) {
if ((new Maximal().calculateValue(id1, id2) * 2) < 5) {
System.out.println(new int[11].length < 10);
}
else { System.out.println(0); }
}
else { System.out.println(false); }
}
}
}
class Maximal {
public int calculateValue(int[] id1, int id2) {
int[] num1; int num2;
num1 = id1;
num2[0] = id2;
return (num1[0] * num2) - (num1[0] + num2);
}
}
It should parse correctly, but it is tripping up on num1 = id1; and num2[0] = id2;.
PS - I know that this is semantically-incorrect MiniJava, but syntactically, it should be fine :)

There is nothing wrong with your definitions of Statement. The reason they trigger the error is that they start with ID.
To start with, when bison processes your input, it reports:
minijava.y: conflicts: 8 shift/reduce
Shift/reduce conflicts are not always a problem, but you can't just ignore them. You need to know what causes them and whether the default behaviour will be correct or not. (The default behaviour is to prefer shift over reduce.)
Six of the shift/reduce conflicts come from the fact that:
Exp: Exp OP Exp
which is inherently ambiguous. You'll need to fix that by using actual operators instead of OP and inserting precedence rules (or specific productions). That has nothing to do with the immediate problem, and since it doesn't (for now) matter whether the first Exp or the second one gets priority, the default resolution will be fine.
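To see why a single production of the form Exp OP Exp is ambiguous, it helps to count the parse trees it allows. The following Python sketch is only an illustration (count_trees and trees are made-up names, and the operands are placeholder strings): for three operands there are already two trees, one grouping to the left and one to the right, which is exactly the choice that precedence and associativity declarations settle.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_trees(n_operands):
    """Number of parse trees for `Exp: Exp OP Exp | atom` over n operands."""
    if n_operands == 1:
        return 1
    # Pick which OP is the root: k operands go left, the rest go right.
    return sum(count_trees(k) * count_trees(n_operands - k)
               for k in range(1, n_operands))


def trees(operands):
    """Enumerate the trees themselves as nested tuples."""
    if len(operands) == 1:
        return [operands[0]]
    out = []
    for k in range(1, len(operands)):
        for left in trees(operands[:k]):
            for right in trees(operands[k:]):
                out.append((left, "OP", right))
    return out


print(count_trees(3))  # 2
print(count_trees(4))  # 5
for t in trees(["a", "b", "c"]):
    print(t)
```

With no declarations, the parser resolves each such conflict by shifting, which in effect always picks the right-grouping tree.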
The other two come from the following production:
VarDeclList: VarDecl VarDeclList
| %empty
Here, VarDecl might start with ID (in the case of a classname used as a type).
VarDeclList is being produced from MethodDecl:
MethodDecl: ... VarDeclList StatementList ...
Now, let's say we're parsing the input; we've just parsed:
int num2;
and we're looking at the next token, which is num1 (from num1 = id1). int num2; is certainly a VarDecl, so it will match VarDecl in
VarDeclList: VarDecl VarDeclList
In this context, VarDeclList could be empty, or it could start with another declaration. If it's empty, we need to reduce it right away (because we won't get another chance: non-terminals need to be reduced no later than when their right-hand sides are complete). If it's not empty, we can simply shift the first token. But we need to make that decision based on the current lookahead token, which is an ID.
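Unfortunately, that doesn't help us. Both VarDeclList and StatementList could start with ID, so both reduce and shift are feasible. Consequently, bison shifts. Having shifted, the parser is committed to treating num1 as the type name at the start of another VarDecl, so the = that follows is a syntax error.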
Now, let's suppose that VarDeclList used left-recursion instead of right-recursion. (Left recursion is almost always better in LR grammars.):
VarDeclList: VarDeclList VarDecl
| %empty
Now, when we reach the end of a VarDecl, we have only one option: reduce the VarDeclList. And then we'll be in the following state:
MethodDecl: ... VarDeclList · StatementList
VarDeclList: VarDeclList · VarDecl
Now, we see the ID lookahead, and we don't know whether it starts a StatementList or a VarDecl. But it doesn't matter because we don't need to reduce either of those non-terminals; we can wait to see what comes next before committing to one or the other.
Note that there is a small semantic difference between left- and right-recursion in this case. Clearly, the syntax trees are different:
      VDL              VDL
     /   \            /   \
   VDL   Decl      Decl   VDL
  /   \                  /   \
VDL   Decl            Decl   VDL
 |                            |
 λ                            λ
However, in practice the most likely actions are going to be:
VarDeclList: %empty { $$ = newVarDeclList(); }
| VarDeclList VarDecl { $$ = $1; appendVarDecl($$, $2); }
which works just fine.
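The difference between committing on the ID and waiting one more token can be imitated with a toy model. This Python sketch is a hand-written simplification, not bison's tables: the function names, the commit_early flag and the token spelling are assumptions, and only the declaration form Type ID SEMICOLON and the statement form ID EQUALS ID SEMICOLON are modeled (the int[] type is left out).

```python
class ParseError(Exception):
    pass


def parse_method_body(tokens, commit_early):
    """Split a token list into declarations and statements.

    commit_early=True imitates the right-recursive list with the default shift:
    an ID seen after a declaration is assumed to start another VarDecl.
    commit_early=False imitates the left-recursive list: the ID is accepted and
    the choice is made on the token that follows it.
    """
    decls, stmts, i, in_decls = [], [], 0, True
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if in_decls and tok in ("INT", "BOOLEAN"):
            is_decl = True
        elif in_decls and tok == "ID":
            is_decl = True if commit_early else nxt == "ID"
        else:
            is_decl = False
        if is_decl:
            # VarDecl: Type ID SEMICOLON
            if tokens[i + 1:i + 3] != ["ID", "SEMICOLON"]:
                raise ParseError(f"unexpected {nxt} after {tok}")
            decls.append(tokens[i:i + 3])
            i += 3
        else:
            in_decls = False
            # Statement: ID EQUALS ID SEMICOLON (the only form modeled here)
            if tokens[i:i + 4] != ["ID", "EQUALS", "ID", "SEMICOLON"]:
                raise ParseError(f"unexpected {tok}")
            stmts.append(tokens[i:i + 4])
            i += 4
    return decls, stmts


def outcome(tokens, commit_early):
    try:
        decls, stmts = parse_method_body(tokens, commit_early)
        return f"ok: {len(decls)} decl(s), {len(stmts)} statement(s)"
    except ParseError as err:
        return f"syntax error: {err}"


# Tokens for:  int num2; num1 = id1;
TOKENS = ["INT", "ID", "SEMICOLON", "ID", "EQUALS", "ID", "SEMICOLON"]
print(outcome(TOKENS, commit_early=True))
print(outcome(TOKENS, commit_early=False))
```

The early-commit variant fails on the EQUALS, which mirrors the error reported in the question, while the deferred variant splits the input into one declaration and one statement.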
By the way:
1) While flex allows you to use definitions in order to simplify the regular expressions, it does not require you to use them, and nowhere is it written (to my knowledge) that it is best practice to use definitions. I use definitions sparingly, usually only when I'm going to write two regular expressions with the same component, or occasionally when the regular expression is really complicated and I want to break it down into pieces. However, there is absolutely no need to clutter your flex file with:
begin "begin"
...
%%
...
{begin} { return BEGIN; }
rather than the simpler and more readable
"begin" { return BEGIN; }
2) Along the same lines, bison helpfully allows you to write single-character tokens as single-quoted literals: '('. This has a number of advantages, starting with the fact that it provides a more readable view of the grammar. Also, you don't need to declare those tokens, or think up a good name for them. Moreover, since the value of the token is the character itself, your flex file can also be simplified. Instead of
"+" { return PLUS; }
"-" { return MINUS; }
"(" { return LPAREN; }
...
you can just write:
[-+*/(){}[\]!] { return yytext[0]; }
In fact, I usually recommend not even using that; just use a catch-all flex rule at the end:
. { return yytext[0]; }
That will pass all otherwise unmatched characters as single-character tokens to bison; if the token is not known to bison, it will issue a syntax error. So all the error-handling is centralized in bison, instead of being split between the two files, and you save a lot of typing (and whoever is reading your code saves a lot of reading.)
3) It's not necessary to put "System.out.println" before ".". They can never be confused, because they don't start with the same character. The only time order matters is if two patterns will maximally match the same string at the same point (which is why the ID pattern needs to come after all the individual keywords).
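The matching rule in point 3 (longest match first, rule order only as a tie-breaker) can be imitated in a few lines. This is a sketch in Python using the re module, not flex itself; make_lexer and the rule lists are made up for illustration, and the patterns are a small subset of the ones in the question.

```python
import re


def make_lexer(rules):
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]

    def lex(text):
        pos, out = 0, []
        while pos < len(text):
            best_name, best_len = None, 0
            for name, rx in compiled:
                m = rx.match(text, pos)
                # A strictly longer match wins; on a tie the earlier rule is kept.
                if m and m.end() - pos > best_len:
                    best_name, best_len = name, m.end() - pos
            if best_name is None:
                raise ValueError(f"no rule matches at position {pos}")
            if best_name != "WS":
                out.append((best_name, text[pos:pos + best_len]))
            pos += best_len
        return out

    return lex


# DOT is deliberately listed before PRINTLN: the order does not matter here.
KEYWORDS_FIRST = [
    ("WS", r"[ \t\n]+"),
    ("DOT", r"\."),
    ("PRINTLN", r"System\.out\.println"),
    ("IF", r"if"),
    ("ID", r"[a-zA-Z][a-zA-Z0-9_]*"),
]
# Same rules with ID moved ahead of IF: now the tie on "if" goes to ID.
ID_FIRST = [KEYWORDS_FIRST[0], KEYWORDS_FIRST[1], KEYWORDS_FIRST[2],
            KEYWORDS_FIRST[4], KEYWORDS_FIRST[3]]

good = make_lexer(KEYWORDS_FIRST)("System.out.println if iffy x.y")
bad = make_lexer(ID_FIRST)("if")
print(good)
print(bad)
```

System.out.println wins over ID because it is the longer match, iffy stays an ID for the same reason, and only the exact keyword if depends on the order of the rules.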

Yacc and Lex error in parsing expressions which use binary operators

I am new to Lex and Yacc and I am trying to create a parser for a simple language which allows for basic arithmetic and equality expressions. Though I have some of it working, I am encountering errors when trying to parse expressions involving binary operations. Here is my .y file:
%{
#include <stdlib.h>
#include <stdio.h>
%}
%token NUMBER
%token HOME
%token PU
%token PD
%token FD
%token BK
%token RT
%token LT
%left '+' '-'
%left '=' '<' '>'
%nonassoc UMINUS
%%
S : statement S { printf("S -> stmt S\n"); }
| { printf("S -> \n"); }
;
statement : HOME { printf("stmt -> HOME\n"); }
| PD { printf("stmt -> PD\n"); }
| PU { printf("stmt -> PU\n"); }
| FD expression { printf("stmt -> FD expr\n"); }
| BK expression { printf("stmt -> BK expr\n"); }
| RT expression { printf("stmt -> RT expr\n"); }
| LT expression { printf("stmt -> LT expr\n"); }
;
expression : expression '+' expression { printf("expr -> expr + expr\n"); }
| expression '-' expression { printf("expr -> expr - expr\n"); }
| expression '>' expression { printf("expr -> expr > expr\n"); }
| expression '<' expression { printf("expr -> expr < expr\n"); }
| expression '=' expression { printf("expr -> expr = expr\n"); }
| '(' expression ')' { printf("expr -> (expr)\n"); }
| '-' expression %prec UMINUS { printf("expr -> -expr\n"); }
| NUMBER { printf("expr -> number\n"); }
;
%%
int yyerror(char *s)
{
fprintf (stderr, "%s\n", s);
return 0;
}
int main()
{
yyparse();
}
And here is my .l file for Lex:
%{
#include "testYacc.h"
%}
number [0-9]+
%%
[ ] { /* skip blanks */ }
{number} { sscanf(yytext, "%d", &yylval); return NUMBER; }
home { return HOME; }
pu { return PU; }
pd { return PD; }
fd { return FD; }
bk { return BK; }
rt { return RT; }
lt { return LT; }
%%
When I try to enter an arithmetic expression on the command-line for evaluation, it results in the following error:
home
stmt -> HOME
pu
stmt -> PU
fd 10
expr -> number
fd 10
stmt -> FD expr
expr -> number
fd (10 + 10)
stmt -> FD expr
(expr -> number
+stmt -> FD expr
S ->
S -> stmt S
S -> stmt S
S -> stmt S
S -> stmt S
S -> stmt S
syntax error

Your lexer lacks rules to match and return tokens such as '+' and '(', so if there are any in your input, it will just echo them and discard them. This is what happens when you enter fd (10 + 10) -- the lexer returns the tokens FD NUMBER NUMBER while (, + and ) get echoed to stdout. The parser then gives a syntax error.
You want to add a rule to return these single character tokens. The easiest is to just add a single rule to your .l file at the end:
. { return *yytext; }
which matches any single character.
Note that this does NOT match a \n (newline), so newlines in your input will still be echoed and ignored. You might want to add them (and tabs and carriage returns) to your skip blanks rule:
[ \t\r\n] { /* skip blanks */ }
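The effect of the missing rule can be reproduced with a small model of the lexer. This Python sketch is an imitation under stated assumptions, not lex itself: the lex function, its catch_all flag and the single combined regular expression are made up for illustration, and unmatched characters are collected in a list instead of being written to stdout.

```python
import re

KEYWORDS = {"home": "HOME", "pu": "PU", "pd": "PD", "fd": "FD",
            "bk": "BK", "rt": "RT", "lt": "LT"}
# The rules from the .l file: a blank, a number, or one of the keywords.
RULE = re.compile(r"(?P<blank>[ ])|(?P<number>[0-9]+)|(?P<kw>home|pu|pd|fd|bk|rt|lt)")


def lex(text, catch_all):
    """Return (tokens handed to the parser, characters echoed by default)."""
    tokens, echoed, pos = [], [], 0
    while pos < len(text):
        m = RULE.match(text, pos)
        if m:
            if m.lastgroup == "number":
                tokens.append("NUMBER")
            elif m.lastgroup == "kw":
                tokens.append(KEYWORDS[m.group()])
            pos = m.end()  # blanks are skipped without producing a token
        elif catch_all and text[pos] != "\n":
            # Models the rule `. { return *yytext; }`, which never matches \n.
            tokens.append(text[pos])
            pos += 1
        else:
            # Default lex behaviour: echo the unmatched character and move on.
            echoed.append(text[pos])
            pos += 1
    return tokens, echoed


without_rule = lex("fd (10 + 10)", catch_all=False)
with_rule = lex("fd (10 + 10)", catch_all=True)
print(without_rule)
print(with_rule)
```

Without the catch-all rule the parser only ever sees FD NUMBER NUMBER; with it, the parentheses and the plus sign arrive as single-character tokens that match the quoted literals in the grammar.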

Fsyacc: an item with the same key has been added

I'm starting to play with Fslex/Fsyacc. When trying to generate the parser using this input
Parser.fsy:
%{
open Ast
%}
// The start token becomes a parser function in the compiled code:
%start start
// These are the terminal tokens of the grammar along with the types of
// the data carried by each token:
%token <System.Int32> INT
%token <System.String> STRING
%token <System.String> ID
%token PLUS MINUS ASTER SLASH LT LT EQ GTE GT
%token LPAREN RPAREN LCURLY RCURLY LBRACKET RBRACKET COMMA
%token ARRAY IF THEN ELSE WHILE FOR TO DO LET IN END OF BREAK NIL FUNCTION VAR TYPE IMPORT PRIMITIVE
%token EOF
// This is the type of the data produced by a successful reduction of the 'start'
// symbol:
%type <Ast.Program> start
%%
// These are the rules of the grammar along with the F# code of the
// actions executed as rules are reduced. In this case the actions
// produce data using F# data construction terms.
start: Prog { Program($1) }
Prog:
| Expr EOF { $1 }
Expr:
// literals
| NIL { Ast.Nil ($1) }
| INT { Ast.Integer($1) }
| STRING { Ast.Str($1) }
// arrays and records
| ID LBRACKET Expr RBRACKET OF Expr { Ast.Array ($1, $3, $6) }
| ID LCURLY AssignmentList RCURLY { Ast.Record ($1, $3) }
AssignmentList:
| Assignment { [$1] }
| Assignment COMMA AssignmentList {$1 :: $3 }
Assignment:
| ID EQ Expr { Ast.Assignment ($1,$3) }
Ast.fs
namespace Ast
open System
type Integer =
| Integer of Int32
and Str =
| Str of string
and Nil =
| None
and Id =
| Id of string
and Array =
| Array of Id * Expr * Expr
and Record =
| Record of Id * (Assignment list)
and Assignment =
| Assignment of Id * Expr
and Expr =
| Nil
| Integer
| Str
| Array
| Record
and Program =
| Program of Expr
Fsyacc reports the following error: "FSYACC: error FSY000: An item with the same key has already been added."
I believe the problem is in the production for AssignmentList, but can't find a way around...
Any tips will be appreciated

Hate answering my own questions, but the problem was here (line 15 of parser input file)
%token PLUS MINUS ASTER SLASH LT LT EQ GTE GT
Note the double definition (should have been LTE)
My vote goes for finding a way to improve the output of the Fslex/Fsyacc executables/msbuild tasks
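Until the tool reports this more helpfully, a quick pre-check of the %token lines catches the mistake. The following is a small Python sketch, not part of Fsyacc: the function name duplicate_tokens and the sample text are made up for illustration, and it only looks at lines that start with %token.

```python
import re
from collections import Counter


def duplicate_tokens(grammar_text):
    """Return the token names declared more than once on %token lines."""
    names = []
    for line in grammar_text.splitlines():
        line = line.strip()
        if not line.startswith("%token"):
            continue
        rest = line[len("%token"):]
        rest = re.sub(r"<[^>]*>", " ", rest)  # drop an optional <Type> annotation
        names.extend(rest.split())
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


SAMPLE = """\
%token <System.Int32> INT
%token <System.String> ID
%token PLUS MINUS ASTER SLASH LT LT EQ GTE GT
%token EOF
"""

print(duplicate_tokens(SAMPLE))
```

Run against the declarations from the question, it points straight at LT.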
