I am trying to parse a file like this (simplified from my actual use case, but it is fine as a starting point):
#Book{key2,
Author="Some2VALUE" ,
Title="VALUE2"
}
The lexer is:
[A-Za-z"][^\\\" \n\(\),=\{\}#~_]* { yylval.sval = strdup(yytext); return KEY; }
#[A-Za-z][A-Za-z]+ {yylval.sval = strdup(yytext + 1); return ENTRYTYPE;}
[ \t\n] ; /* ignore whitespace */
[{}=,] { return *yytext; }
. { fprintf(stderr, "Unrecognized character %c in input\n", *yytext); }
And then parsing this with:
%union
{
char *sval;
};
%token <sval> ENTRYTYPE
%type <sval> VALUE
%token <sval> KEY
%start Input
%%
Input: Entry
| Input Entry ; /* input is zero or more entries */
Entry:
ENTRYTYPE '{' KEY ','{
b_entry.type = $1;
b_entry.id = $3;
b_entry.table = g_hash_table_new_full(g_str_hash, g_str_equal, free, free);}
KeyVals '}' {
parse_entry(&b_entry);
g_hash_table_destroy(b_entry.table);
free(b_entry.type); free(b_entry.id);
b_entry.table = NULL;
b_entry.type = b_entry.id = NULL;}
;
KeyVals:
/* empty */
| KeyVals KeyVal ; /* zero or more keyvals */
VALUE:
/*empty*/
| KEY
| VALUE KEY
;
KeyVal:
/*empty*/
KEY '=' VALUE ',' { g_hash_table_replace(b_entry.table, $1, $3); }
| KEY '=' VALUE { g_hash_table_replace(b_entry.table, $1, $3); }
| error '\n' {yyerrok;}
;
There are a few problems, so I need to generalize both the lexer and parser:
1) It cannot read a value containing spaces: if the RHS is Author="Some Value", it only captures "Some, because the space is not handled. I don't know how to fix this.
2) If I enclose the RHS in {} rather than "", it gives a syntax error.
I am looking for help with these two situations.
The main issue is that your tokens are not appropriate. You should try to recognize the tokens of your example as follows:
#Book ENTRYTYPE
{ '{'
key2 KEY
, ','
Author KEY
= '='
"Some2VALUE" VALUE
, ','
Title KEY
= '='
"VALUE2" VALUE
} '}'
The VALUE token could for example be defined as follows:
%x value
%%
"\"" {BEGIN(value);}
<value>"\"" {BEGIN(INITIAL); return VALUE;}
<value>"\\\"" { /* escaped " */ }
<value>[^"] { /* Non-escaped char */ }
Or in a single expression as
"\""([^"]|("\\\""))*"\""
This is assuming that only " needs to be escaped with a \. I'm not sure how BibTeX defines how to escape a ", if possible at all.
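If the token's semantic value should be the contents rather than the raw text, the <value> actions also need to accumulate the characters with the escapes collapsed. The unescaping itself can be sketched in plain C; the helper name is illustrative, and it assumes (as above) that only " is escaped with a backslash:

```c
#include <stdlib.h>
#include <string.h>

/* Given a matched token including its surrounding double quotes,
 * e.g. the C string  "\"Some \\\"quoted\\\" value\"",  return a freshly
 * allocated copy of the contents with the \" escapes collapsed.
 * Hypothetical helper, not part of the original lexer. */
char *unquote_value(const char *text) {
    size_t len = strlen(text);          /* always >= 2: both quotes present */
    char *out = malloc(len);            /* result is never longer than input */
    char *p = out;
    /* skip the opening quote, stop before the closing quote */
    for (size_t i = 1; i + 1 < len; i++) {
        if (text[i] == '\\' && text[i + 1] == '"') {
            *p++ = '"';                 /* collapse the escape */
            i++;                        /* consume the escaped quote */
        } else {
            *p++ = text[i];
        }
    }
    *p = '\0';
    return out;
}
```

An action could then do `yylval.sval = unquote_value(yytext);` when returning VALUE.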
I am trying to parse logical BNF statements and apply parentheses to them.
For example:
I am trying to parse the statement a=>b<=>c&d as ((a)=>(b))<=>((c)&(d)), and similar statements as well.
Problem: some statements work while others do not. The example above does not; the output is printed as ((c)&(d))<=>((c)&(d)), as if the second expr were overriding the first.
What works: other simple examples such as a<=>b and a|(b&c) are fine.
I think I have made some basic error in my code which I cannot figure out.
Here is my code
lex file
letters [a-zA-Z]
identifier {letters}+
operator (?:<=>|=>|\||&|!)
separator [\(\)]
%%
{identifier} {
yylval.s = strdup(yytext);
return IDENTIFIER; }
{operator} { return *yytext; }
{separator} { return *yytext; }
[\n] { return *yytext; }
%%
yacc file
%start program
%union {char* s;}
%type <s> program expr IDENTIFIER
%token IDENTIFIER
%left '<=>'
%left '=>'
%left '|'
%left '&'
%right '!'
%left '(' ')'
%%
program : expr '\n'
{
cout<<$$;
exit(0);
}
;
expr : IDENTIFIER {
cout<<" Atom ";
cout<<$1<<endl;
string s1 = string($1);
cout<<$$<<endl;
}
| expr '<=>' expr {
cout<<"Inside <=>\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"<=>"+"(" + s2 +")";
$$ = (char * )s3.c_str();
cout<<s3<<endl;
}
| expr '=>' expr {
cout<<"Inside =>\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"=>"+"(" + s2 +")";
$$ = (char *)s3.c_str();
cout<<$$<<endl;
}
| expr '|' expr {
cout<<"Inside |\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"|"+"(" + s2 +")";
$$ = (char *)s3.c_str();
cout<<$$<<endl;
}
| expr '&' expr {
cout<<"Inside &\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"&"+"(" + s2 +")";
$$ = (char *)s3.c_str();
cout<<$$<<endl;
}
| '!' expr {
cout<<"Inside !\n";
string s1 = string($2);
cout<<s1<<endl;
string s2 = "!" + s1;
$$ = (char *)s2.c_str();
cout<<$$<<endl;
}
| '(' expr ')' { $$ = $2; cout<<"INSIDE BRACKETS"; }
;
%%
Please let me know the mistake I have made.
Thank you
The basic problem is that you save the pointer returned by string::c_str() on the yacc value stack, but once the action finishes and the string object is destroyed, that pointer is no longer valid.
To fix this you need either to stop using std::string, or to change your %union to { std::string *s; } (instead of char *). In either case you have to watch for memory leaks. If you are using Linux (glibc), the former is pretty easy. Your actions would become something like:
| expr '<=>' expr {
cout<<"Inside <=>\n";
asprintf(&$$, "(%s)<=>(%s)", $1, $3);
cout<<$$<<endl;
free($1);
free($3);
}
For the latter, the action would look like this:
| expr '<=>' expr {
cout<<"Inside <=>\n";
$$ = new string("(" + *$1 + ")" + "<=>" + "(" + *$3 + ")");
cout<<*$$<<endl;
delete $1;
delete $3;
}
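Note that asprintf is a GNU/BSD extension, not standard C. Where it isn't available, the same malloc-and-format pattern can be sketched portably with snprintf; the helper name here is illustrative:

```c
#include <stdio.h>
#include <stdlib.h>

/* Portable stand-in for the asprintf-based action: build "(lhs)op(rhs)"
 * into a freshly malloc'd buffer that the parser value stack can own
 * until a later action frees it. */
char *wrap_binary(const char *lhs, const char *op, const char *rhs) {
    /* first pass: measure; snprintf with size 0 returns the needed length */
    int n = snprintf(NULL, 0, "(%s)%s(%s)", lhs, op, rhs);
    char *out = malloc((size_t)n + 1);
    snprintf(out, (size_t)n + 1, "(%s)%s(%s)", lhs, op, rhs);
    return out;
}
```

An action would then read `$$ = wrap_binary($1, "<=>", $3); free($1); free($3);`.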
While parsing the C-like example code below, I have the following issue: some tokens, such as identifiers, seem to be ignored by the grammar, causing a syntax error for no apparent reason.
Parser code :
%{
#include <stdio.h>
#include <stdlib.h>
int yylex();
void yyerror (char const *);
%}
%token T_MAINCLASS T_ID T_PUBLIC T_STATIC T_VOID T_MAIN T_PRINTLN T_INT T_FLOAT T_FOR T_WHILE T_IF T_ELSE T_EQUAL T_SMALLER T_BIGGER T_NOTEQUAL T_NUM T_STRING
%left '(' ')'
%left '+' '-'
%left '*' '/'
%left '{' '}'
%left ';' ','
%left '<' '>'
%%
PROGRAM : T_MAINCLASS T_ID '{' T_PUBLIC T_STATIC T_VOID T_MAIN '(' ')' COMP_STMT '}'
;
COMP_STMT : '{' STMT_LIST '}'
;
STMT_LIST : /* nothing */
| STMT_LIST STMT
;
STMT : ASSIGN_STMT
| FOR_STMT
| WHILE_STMT
| IF_STMT
| COMP_STMT
| DECLARATION
| NULL_STMT
| T_PRINTLN '(' EXPR ')' ';'
;
DECLARATION : TYPE ID_LIST ';'
;
TYPE : T_INT
| T_FLOAT
;
ID_LIST : T_ID ',' ID_LIST
|
;
NULL_STMT : ';'
;
ASSIGN_STMT : ASSIGN_EXPR ';'
;
ASSIGN_EXPR : T_ID '=' EXPR
;
EXPR : ASSIGN_EXPR
| RVAL
;
FOR_STMT : T_FOR '(' OPASSIGN_EXPR ';' OPBOOL_EXPR ';' OPASSIGN_EXPR ')' STMT
;
OPASSIGN_EXPR : /* nothing */
| ASSIGN_EXPR
;
OPBOOL_EXPR : /* nothing */
| BOOL_EXPR
;
WHILE_STMT : T_WHILE '(' BOOL_EXPR ')' STMT
;
IF_STMT : T_IF '(' BOOL_EXPR ')' STMT ELSE_PART
;
ELSE_PART : /* nothing */
| T_ELSE STMT
;
BOOL_EXPR : EXPR C_OP EXPR
;
C_OP : T_EQUAL | '<' | '>' | T_SMALLER | T_BIGGER | T_NOTEQUAL
;
RVAL : RVAL '+' TERM
| RVAL '-' TERM
| TERM
;
TERM : TERM '*' FACTOR
| TERM '/' FACTOR
| FACTOR
;
FACTOR : '(' EXPR ')'
| T_ID
| T_NUM
;
%%
void yyerror (const char * msg)
{
fprintf(stderr, "C-like : %s\n", msg);
exit(1);
}
int main ()
{
if(!yyparse()){
printf("Compiled !!!\n");
}
}
Part of Lexical Scanner code :
{Empty}+ { printf("EMPTY ") ; /* nothing */ }
"mainclass" { printf("MAINCLASS ") ; return T_MAINCLASS ; }
"public" { printf("PUBLIC ") ; return T_PUBLIC; }
"static" { printf("STATIC ") ; return T_STATIC ; }
"void" { printf("VOID ") ; return T_VOID ; }
"main" { printf("MAIN ") ; return T_MAIN ; }
"println" { printf("PRINTLN ") ; return T_PRINTLN ; }
"int" { printf("INT ") ; return T_INT ; }
"float" { printf("FLOAT ") ; return T_FLOAT ; }
"for" { printf("FOR ") ; return T_FOR ; }
"while" { printf("WHILE ") ; return T_WHILE ; }
"if" { printf("IF ") ; return T_IF ; }
"else" { printf("ELSE ") ; return T_ELSE ; }
"==" { printf("EQUAL ") ; return T_EQUAL ; }
"<=" { printf("SMALLER ") ; return T_SMALLER ; }
">=" { printf("BIGGER ") ; return T_BIGGER ; }
"!=" { printf("NOTEQUAL ") ; return T_NOTEQUAL ; }
{id} { printf("ID ") ; return T_ID ; }
{num} { printf("NUM ") ; return T_NUM ; }
{string} { printf("STRING ") ; return T_STRING ; }
{punct} { printf("PUNCT ") ; return yytext[0] ; }
<<EOF>> { printf("EOF ") ; return T_EOF; }
. { yyerror("lexical error"); exit(1); }
Example :
mainclass Example {
public static void main ( )
{
int c;
float x, sum, mo;
c=0;
x=3.5;
sum=0.0;
while (c<5)
{
sum=sum+x;
c=c+1;
x=x+1.5;
}
mo=sum/5;
println (mo);
}
}
Running all of this produced the following output:
C-like : syntax error
MAINCLASS EMPTY ID
It seems the ID is in the wrong position, although the grammar has:
PROGRAM : T_MAINCLASS T_ID '{' T_PUBLIC T_STATIC T_VOID T_MAIN '(' ')' COMP_STMT '}'
Based on the "solution" proposed in OP's self answer, it's pretty clear that the original problem was that the generated header used to compile the scanner was not the same as the header generated by bison/yacc from the parser specification.
The generated header defines all the token types as small integers; for the scanner to communicate with the parser, it must identify each token with the correct token type. So the parser generator (bison/yacc) produces a header based on the parser specification (the .y file), and that header must be #included into the generated scanner so that scanner actions can use symbolic token type names.
If the scanner was compiled with a header file generated from some previous version of the parser specification, it is quite possible that the token numbers no longer correspond with what the parser is expecting.
The easiest way to avoid this problem is to use a build system like make, which will automatically recompile the scanner if necessary.
The easiest way to detect this problem is to use bison's built-in trace facility. Enabling tracing requires only a couple of lines of code, and saves you from having to scatter printf statements throughout your scanner and parser. The bison trace will show you exactly what is going on, so not only is it less work than adding printfs, it is also more precise. In particular, it reports every token which is passed to the parser (and, with a little more effort, you can get it to report the semantic values of those tokens as well). So if the parser is getting the wrong token code, you'll see that right away.
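Concretely, enabling the trace could look like the following sketch. The %define spelling assumes a reasonably recent Bison; older versions use %debug, or the -t command-line option with #define YYDEBUG 1:

```
/* In the .y declarations section: */
%define parse.trace

/* In main(), before calling yyparse(): */
int main(void) {
    extern int yydebug;   /* defined by the generated parser */
    yydebug = 1;          /* report every token read, shift, and reduce */
    return yyparse();
}
```

With that in place, the trace on this input would show immediately which token code the scanner handed over when the syntax error fired.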
After many potentially helpful changes, the parser finally worked after changing the order of these tokens.
From
%token T_MAINCLASS T_ID T_PUBLIC T_STATIC T_VOID T_MAIN T_PRINTLN T_INT T_FLOAT T_FOR T_WHILE T_IF T_ELSE T_EQUAL T_SMALLER T_BIGGER T_NOTEQUAL T_NUM T_STRING
TO
%token T_MAINCLASS T_PUBLIC T_STATIC T_VOID T_MAIN T_PRINTLN T_INT T_FLOAT T_FOR T_WHILE T_IF T_EQUAL T_ID T_NUM T_SMALLER T_BIGGER T_NOTEQUAL T_ELSE T_STRING
It looked like the token being read was else, but the lexer had actually returned an ID. Somehow this reordering was the solution.
This is not homework, but it is from a book.
I'm given a following bison spec file:
%{
#include <stdio.h>
#include <ctype.h>
int yylex();
int yyerror();
%}
%token NUMBER
%%
command : exp { printf("%d\n", $1); }
; /* allows printing of the result */
exp : exp '+' term { $$ = $1 + $3; }
| exp '-' term { $$ = $1 - $3; }
| term { $$ = $1; }
;
term : term '*' factor { $$ = $1 * $3; }
| factor { $$ = $1; }
;
factor : NUMBER { $$ = $1; }
| '(' exp ')' { $$ = $2; }
;
%%
int main() {
return yyparse();
}
int yylex() {
int c;
/* eliminate blanks*/
while((c = getchar()) == ' ');
if (isdigit(c)) {
ungetc(c, stdin);
scanf("%d", &yylval);
return (NUMBER);
}
/* makes the parse stop */
if (c == '\n') return 0;
return (c);
}
int yyerror(char * s) {
fprintf(stderr, "%s\n", s);
return 0;
} /* allows for printing of an error message */
The task is to do the following:
Rewrite the spec to add the following useful error messages:
"missing right parenthesis," generated by the string (2+3
"missing left parenthesis," generated by the string 2+3)
"missing operator," generated by the string 2 3
"missing operand," generated by the string (2+)
The simplest solution that I was able to come up with is to do the following:
half_exp : exp '+' { $$ = $1; }
| exp '-' { $$ = $1; }
| exp '*' { $$ = $1; }
;
factor : NUMBER { $$ = $1; }
| '(' exp '\n' { yyerror("missing right parenthesis"); }
| exp ')' { yyerror("missing left parenthesis"); }
| '(' exp '\n' { yyerror("missing left parenthesis"); }
| '(' exp ')' { $$ = $2; }
| '(' half_exp ')' { yyerror("missing operand"); exit(0); }
;
exp : exp '+' term { $$ = $1 + $3; }
| exp '-' term { $$ = $1 - $3; }
| term { $$ = $1; }
| exp exp { yyerror("missing operator"); }
;
These changes work; however, they lead to a lot of conflicts.
Here is my question.
Is there a way to rewrite this grammar in such a way so that it wouldn't generate conflicts?
Any help is appreciated.
Yes it is possible:
command : exp { printf("%d\n", $1); }
; /* allows printing of the result */
exp: exp '+' exp {
// code
}
| exp '-' exp {
// code
}
| exp '*' exp {
// code
}
| exp '/' exp {
// code
}
|'(' exp ')' {
// code
}
Bison allows ambiguous grammars.
I don't see how you can rewrite the grammar to avoid conflicts. You have just missed the point of terms, factors, etc.: you use them when you want a left-recursive context-free grammar.
From this grammar:
E -> E+T
|T
T -> T*F
|F
F -> (E)
|num
After eliminating the left recursion, you get:
E -> TE' { num , ( }
E' -> +TE' { + }
| eps { ) , EOI }
T -> FT' { ( , num }
T' -> *FT' { * }
|eps { + , ) , EOI }
F -> (E) { ( }
|num { num }
The sets shown alongside the rules indicate which input token must come next in order to use that rule (the predict sets). Of course, this is just an example for simple arithmetic expressions such as 2*(3+4)*5+(3*3*3+4+5*6).
If you want to learn more about this topic, I suggest reading about left recursion in context-free grammars. There are some great books covering it, including how to compute these sets.
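To make the connection concrete, a left-recursion-free grammar like the one above maps directly onto a recursive-descent parser: each nonterminal becomes a function, and the E'/T' tails become loops. A minimal evaluator sketch in C (evaluating instead of building a tree is my simplification; names are illustrative):

```c
#include <ctype.h>

/* Recursive-descent evaluator for:
 *   E -> T E',  E' -> + T E' | eps
 *   T -> F T',  T' -> * F T' | eps
 *   F -> ( E ) | num
 * No whitespace handling or error recovery. */
static const char *p;           /* cursor into the input string */

static int expr(void);

static int factor(void) {
    if (*p == '(') {            /* F -> ( E ) */
        p++;
        int v = expr();
        if (*p == ')') p++;
        return v;
    }
    int v = 0;                  /* F -> num */
    while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
    return v;
}

static int term(void) {         /* T -> F T' : a loop over '*' */
    int v = factor();
    while (*p == '*') { p++; v *= factor(); }
    return v;
}

static int expr(void) {         /* E -> T E' : a loop over '+' */
    int v = term();
    while (*p == '+') { p++; v += term(); }
    return v;
}

int eval(const char *s) { p = s; return expr(); }
```

Note that the loops make + and * left-associative in the result, which is what the original left-recursive grammar specified.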
But as I said above, all of this can be avoided, because Bison allows ambiguous grammars.
This is a class project of sorts, and I've worked out 99% of all kinks, but now I'm stuck. The grammar is for MiniJava.
I have the following lex file which works as intended:
%{
#include "y.tab.h"
%}
delim [ \t\n]
ws {delim}+
comment ("/*".*"*/")|("//".*\n)
id [a-zA-Z]([a-zA-Z0-9_])*
int_literal [0-9]*
op ("&&"|"<"|"+"|"-"|"*")
class "class"
public "public"
static "static"
void "void"
main "main"
string "String"
extends "extends"
return "return"
boolean "boolean"
if "if"
new "new"
else "else"
while "while"
length "length"
int "int"
true "true"
false "false"
this "this"
println "System.out.println"
lbrace "{"
rbrace "}"
lbracket "["
rbracket "]"
semicolon ";"
lparen "("
rparen ")"
comma ","
equals "="
dot "."
exclamation "!"
%%
{ws} { /* Do nothing! */ }
{comment} { /* Do nothing! */ }
{println} { return PRINTLN; } /* Before {period} to give this precedence */
{op} { return OP; }
{int_literal} { return INTEGER_LITERAL; }
{class} { return CLASS; }
{public} { return PUBLIC; }
{static} { return STATIC; }
{void} { return VOID; }
{main} { return MAIN; }
{string} { return STRING; }
{extends} { return EXTENDS; }
{return} { return RETURN; }
{boolean} { return BOOLEAN; }
{if} { return IF; }
{new} { return NEW; }
{else} { return ELSE; }
{while} { return WHILE; }
{length} { return LENGTH; }
{int} { return INT; }
{true} { return TRUE; }
{false} { return FALSE; }
{this} { return THIS; }
{lbrace} { return LBRACE; }
{rbrace} { return RBRACE; }
{lbracket} { return LBRACKET; }
{rbracket} { return RBRACKET; }
{semicolon} { return SEMICOLON; }
{lparen} { return LPAREN; }
{rparen} { return RPAREN; }
{comma} { return COMMA; }
{equals} { return EQUALS; }
{dot} { return DOT; }
{exclamation} { return EXCLAMATION; }
{id} { return ID; }
%%
int main(void) {
yyparse();
exit(0);
}
int yywrap(void) {
return 0;
}
int yyerror(void) {
printf("Parse error. Sorry bro.\n");
exit(1);
}
And the yacc file:
%token PRINTLN
%token INTEGER_LITERAL
%token OP
%token CLASS
%token PUBLIC
%token STATIC
%token VOID
%token MAIN
%token STRING
%token EXTENDS
%token RETURN
%token BOOLEAN
%token IF
%token NEW
%token ELSE
%token WHILE
%token LENGTH
%token INT
%token TRUE
%token FALSE
%token THIS
%token LBRACE
%token RBRACE
%token LBRACKET
%token RBRACKET
%token SEMICOLON
%token LPAREN
%token RPAREN
%token COMMA
%token EQUALS
%token DOT
%token EXCLAMATION
%token ID
%%
Program: MainClass ClassDeclList
MainClass: CLASS ID LBRACE PUBLIC STATIC VOID MAIN LPAREN STRING LBRACKET RBRACKET ID RPAREN LBRACE Statement RBRACE RBRACE
ClassDeclList: ClassDecl ClassDeclList
|
ClassDecl: CLASS ID LBRACE VarDeclList MethodDeclList RBRACE
| CLASS ID EXTENDS ID LBRACE VarDeclList MethodDeclList RBRACE
VarDeclList: VarDecl VarDeclList
|
VarDecl: Type ID SEMICOLON
MethodDeclList: MethodDecl MethodDeclList
|
MethodDecl: PUBLIC Type ID LPAREN FormalList RPAREN LBRACE VarDeclList StatementList RETURN Exp SEMICOLON RBRACE
FormalList: Type ID FormalRestList
|
FormalRestList: FormalRest FormalRestList
|
FormalRest: COMMA Type ID
Type: INT LBRACKET RBRACKET
| BOOLEAN
| INT
| ID
StatementList: Statement StatementList
|
Statement: LBRACE StatementList RBRACE
| IF LPAREN Exp RPAREN Statement ELSE Statement
| WHILE LPAREN Exp RPAREN Statement
| PRINTLN LPAREN Exp RPAREN SEMICOLON
| ID EQUALS Exp SEMICOLON
| ID LBRACKET Exp RBRACKET EQUALS Exp SEMICOLON
Exp: Exp OP Exp
| Exp LBRACKET Exp RBRACKET
| Exp DOT LENGTH
| Exp DOT ID LPAREN ExpList RPAREN
| INTEGER_LITERAL
| TRUE
| FALSE
| ID
| THIS
| NEW INT LBRACKET Exp RBRACKET
| NEW ID LPAREN RPAREN
| EXCLAMATION Exp
| LPAREN Exp RPAREN
ExpList: Exp ExpRestList
|
ExpRestList: ExpRest ExpRestList
|
ExpRest: COMMA Exp
%%
The derivations that are not working are the following two:
Statement:
| ID EQUALS Exp SEMICOLON
| ID LBRACKET Exp RBRACKET EQUALS Exp SEMICOLON
If I only lex the file and get the token stream, the tokens match the pattern perfectly. Here's an example input and output:
num1 = id1;
num2[0] = id2;
gives:
ID
EQUALS
ID
SEMICOLON
ID
LBRACKET
INTEGER_LITERAL
RBRACKET
EQUALS
ID
SEMICOLON
What I don't understand is how this token stream matches the grammar exactly, and yet yyerror is being called. I've been trying to figure this out for hours, and I've finally given up. I'd appreciate any insight into what's causing the problem.
For a full example, you can run the following input through the parser:
class Minimal {
public static void main (String[] a) {
// Infinite loop
while (true) {
/* Completely useless // (embedded comment) statements */
if ((!false && true)) {
if ((new Maximal().calculateValue(id1, id2) * 2) < 5) {
System.out.println(new int[11].length < 10);
}
else { System.out.println(0); }
}
else { System.out.println(false); }
}
}
}
class Maximal {
public int calculateValue(int[] id1, int id2) {
int[] num1; int num2;
num1 = id1;
num2[0] = id2;
return (num1[0] * num2) - (num1[0] + num2);
}
}
It should parse correctly, but it is tripping up on num1 = id1; and num2[0] = id2;.
PS - I know that this is semantically-incorrect MiniJava, but syntactically, it should be fine :)
There is nothing wrong with your definitions of Statement. The reason they trigger the error is that they start with ID.
To start with, when bison processes your input, it reports:
minijava.y: conflicts: 8 shift/reduce
Shift/reduce conflicts are not always a problem, but you can't just ignore them. You need to know what causes them and whether the default behaviour will be correct or not. (The default behaviour is to prefer shift over reduce.)
Six of the shift/reduce conflicts come from the fact that:
Exp: Exp OP Exp
which is inherently ambiguous. You'll need to fix that by using actual operators instead of OP and inserting precedence rules (or specific productions). That has nothing to do with the immediate problem, and since it doesn't (for now) matter whether the first Exp or the second one gets priority, the default resolution will be fine.
The other ones come from the following production:
VarDeclList: VarDecl VarDeclList
| %empty
Here, VarDecl might start with ID (in the case of a classname used as a type).
VarDeclList is being produced from MethodDecl:
MethodDecl: ... VarDeclList StatementList ...
Now, let's say we're parsing the input; we've just parsed:
int num2;
and we're looking at the next token, which is num1 (from num1 = id1). int num2; is certainly a VarDecl, so it will match VarDecl in
VarDeclList: VarDecl VarDeclList
In this context, VarDeclList could be empty, or it could start with another declaration. If it's empty, we need to reduce it right away (because we won't get another chance: non-terminals need to be reduced no later than when their right-hand sides are complete). If it's not empty, we can simply shift the first token. But we need to make that decision based on the current lookahead token, which is an ID.
Unfortunately, that doesn't help us. Both VarDeclList and StatementList could start with ID, so both reduce and shift are feasible. Consequently, bison shifts.
Now, let's suppose that VarDeclList used left-recursion instead of right-recursion. (Left recursion is almost always better in LR grammars.):
VarDeclList: VarDeclList VarDecl
| %empty
Now, when we reach the end of a VarDecl, we have only one option: reduce the VarDeclList. And then we'll be in the following state:
MethodDecl: ... VarDeclList · StatementList
VarDeclList: VarDeclList · VarDecl
Now, we see the ID lookhead, and we don't know whether it starts a StatementList or a VarDecl. But it doesn't matter because we don't need to reduce either of those non-terminals; we can wait to see what comes next before committing to one or the other.
Note that there is a small semantic difference between left- and right-recursion in this case. Clearly, the syntax trees are different:
VDL VDL
/ \ / \
VDL Decl Decl VDL
/ \ / \
VDL Decl Decl VDL
| |
λ λ
However, in practice the most likely actions are going to be:
VarDeclList: %empty { $$ = newVarDeclList(); }
| VarDeclList VarDecl { $$ = $1; appendVarDecl($$, $2); }
which works just fine.
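For illustration, toy versions of the newVarDeclList/appendVarDecl helpers assumed by those actions could look like this growable array; the real project may use any list type, and these names are just the ones the actions above invented:

```c
#include <stdlib.h>

/* Hypothetical list type backing the VarDeclList actions above.
 * Left recursion appends each VarDecl as it is reduced, so items
 * end up in source order. */
typedef struct {
    const char **items;
    size_t len, cap;
} VarDeclList;

VarDeclList *newVarDeclList(void) {
    VarDeclList *l = malloc(sizeof *l);
    l->items = NULL;
    l->len = 0;
    l->cap = 0;
    return l;
}

void appendVarDecl(VarDeclList *l, const char *decl) {
    if (l->len == l->cap) {                 /* grow geometrically */
        l->cap = l->cap ? l->cap * 2 : 4;
        l->items = realloc(l->items, l->cap * sizeof *l->items);
    }
    l->items[l->len++] = decl;
}
```

Because each reduction appends to the tail, the semantic difference between the two tree shapes disappears: the list reads left to right either way.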
By the way:
1) While flex allows you to use definitions in order to simplify the regular expressions, it does not require you to use them, and nowhere is it written (to my knowledge) that it is best practice to use definitions. I use definitions sparingly, usually only when I'm going to write two regular expressions with the same component, or occasionally when the regular expression is really complicated and I want to break it down into pieces. However, there is absolutely no need to clutter your flex file with:
begin "begin"
...
%%
...
{begin} { return BEGIN; }
rather than the simpler and more readable
"begin" { return BEGIN; }
2) Along the same lines, bison helpfully allows you to write single-character tokens as single-quoted literals: '('. This has a number of advantages, starting with the fact that it provides a more readable view of the grammar. Also, you don't need to declare those tokens, or think up a good name for them. Moreover, since the value of the token is the character itself, your flex file can also be simplified. Instead of
"+" { return PLUS; }
"-" { return MINUS; }
"(" { return LPAREN; }
...
you can just write:
[-+*/(){}[\]!] { return yytext[0]; }
In fact, I usually recommend not even using that; just use a catch-all flex rule at the end:
. { return yytext[0]; }
That will pass all otherwise unmatched characters as single-character tokens to bison; if the token is not known to bison, it will issue a syntax error. So all the error-handling is centralized in bison, instead of being split between the two files, and you save a lot of typing (and whoever is reading your code saves a lot of reading.)
3) It's not necessary to put "System.out.println" before ".". They can never be confused, because they don't start with the same character. The only time order matters is if two patterns will maximally match the same string at the same point (which is why the ID pattern needs to come after all the individual keywords).
I am writing a parser for Delphi's dfm files. The lexer looks like this:
EXP ([Ee][-+]?[0-9]+)
%%
("#"([0-9]{1,5}|"$"[0-9a-fA-F]{1,6})|"'"([^']|'')*"'")+ {
return tkStringLiteral; }
"object" { return tkObjectBegin; }
"end" { return tkObjectEnd; }
"true" { /*yyval.boolean = true;*/ return tkBoolean; }
"false" { /*yyval.boolean = false;*/ return tkBoolean; }
"+" | "." | "(" | ")" | "[" | "]" | "{" | "}" | "<" | ">" | "=" | "," |
":" { return yytext[0]; }
[+-]?[0-9]{1,10} { /*yyval.integer = atoi(yytext);*/ return tkInteger; }
[0-9A-F]+ { return tkHexValue; }
[+-]?[0-9]+"."[0-9]+{EXP}? { /*yyval.real = atof(yytext);*/ return tkReal; }
[a-zA-Z_][0-9A-Z_]* { return tkIdentifier; }
"$"[0-9A-F]+ { /* yyval.integer = atoi(yytext);*/ return tkHexNumber; }
[ \t\r\n] { /* ignore whitespace */ }
. { std::cerr << boost::format("Mystery character %c\n") % *yytext; }
<<EOF>> { yyterminate(); }
%%
and the bison grammar looks like
%token tkInteger
%token tkReal
%token tkIdentifier
%token tkHexValue
%token tkHexNumber
%token tkObjectBegin
%token tkObjectEnd
%token tkBoolean
%token tkStringLiteral
%%
object:
tkObjectBegin tkIdentifier ':' tkIdentifier
property_assignment_list tkObjectEnd
;
property_assignment_list:
property_assignment
| property_assignment_list property_assignment
;
property_assignment:
property '=' value
| object
;
property:
tkIdentifier
| property '.' tkIdentifier
;
value:
atomic_value
| set
| binary_data
| strings
| collection
;
atomic_value:
tkInteger
| tkReal
| tkIdentifier
| tkBoolean
| tkHexNumber
| long_string
;
long_string:
tkStringLiteral
| long_string '+' tkStringLiteral
;
atomic_value_list:
atomic_value
| atomic_value_list ',' atomic_value
;
set:
'[' ']'
| '[' atomic_value_list ']'
;
binary_data:
'{' '}'
| '{' hexa_lines '}'
;
hexa_lines:
tkHexValue
| hexa_lines tkHexValue
;
strings:
'(' ')'
| '(' string_list ')'
;
string_list:
tkStringLiteral
| string_list tkStringLiteral
;
collection:
'<' '>'
| '<' collection_item_list '>'
;
collection_item_list:
collection_item
| collection_item_list collection_item
;
collection_item:
tkIdentifier property_assignment_list tkObjectEnd
;
%%
void yyerror(const char *s, ...) {...}
The problem with this grammar occurs while parsing the binary data. Binary data in dfm files is nothing but a sequence of hexadecimal characters, never spanning more than 80 characters per line. An example:
Picture.Data = {
055449636F6E0000010001002020000001000800A80800001600000028000000
2000000040000000010008000000000000000000000000000000000000000000
...
FF00000000000000000000000000000000000000000000000000000000000000
00000000FF000000FF000000FF00000000000000000000000000000000000000
00000000}
As you can see, this element lacks any markers, so the hex strings clash with other token types. In the example above, the first line returns the proper token, tkHexValue. The second, however, returns a tkInteger token, and the third a tkIdentifier token. So when parsing comes, it fails with a syntax error, because binary data is composed only of tkHexValue tokens.
My first workaround was to require integers to have a maximum length (which helped with all but the last line of the binary data). The second was to move the tkHexValue rule above tkIdentifier, but that means I would no longer have identifiers like F0.
I was wondering if there is any way to fix this grammar?
OK, I solved this one. I needed to define a start condition so that tkHexValue is only returned while reading binary data. In the preamble of the lexer I added
%x BINARY
and modify the following rules
"{" {BEGIN BINARY; return yytext[0];}
<BINARY>"}" {BEGIN INITIAL; return yytext[0];}
<BINARY>[ \t\r\n] { /* ignore whitespace */ }
And that was all!
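Once the lexer hands over clean tkHexValue tokens, the parser (or a later pass) still has to turn each hex line into bytes. That step can be sketched in plain C; the function names are illustrative, not part of the original code:

```c
#include <stddef.h>

/* Map one hex digit to its value, or -1 if it is not an uppercase
 * hex digit (the dfm format uses 0-9A-F, matching the lexer rule). */
static int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode one tkHexValue line (e.g. "055449...") into bytes.
 * Returns the number of bytes written, or -1 on a malformed digit,
 * an odd-length line, or insufficient output space. */
int decode_hex_line(const char *text, unsigned char *out, size_t outsz) {
    size_t n = 0;
    for (; text[0] && text[1]; text += 2) {
        int hi = hex_digit(text[0]), lo = hex_digit(text[1]);
        if (hi < 0 || lo < 0 || n == outsz) return -1;
        out[n++] = (unsigned char)(hi << 4 | lo);
    }
    return text[0] ? -1 : (int)n;   /* a trailing lone digit is an error */
}
```

A hexa_lines action could call this per tkHexValue token and append the bytes to a growing buffer.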