This is a class project of sorts, and I've worked out 99% of all kinks, but now I'm stuck. The grammar is for MiniJava.
I have the following lex file which works as intended:
%{
#include "y.tab.h"
%}
delim [ \t\n]
ws {delim}+
comment ("/*".*"*/")|("//".*\n)
id [a-zA-Z]([a-zA-Z0-9_])*
int_literal [0-9]*
op ("&&"|"<"|"+"|"-"|"*")
class "class"
public "public"
static "static"
void "void"
main "main"
string "String"
extends "extends"
return "return"
boolean "boolean"
if "if"
new "new"
else "else"
while "while"
length "length"
int "int"
true "true"
false "false"
this "this"
println "System.out.println"
lbrace "{"
rbrace "}"
lbracket "["
rbracket "]"
semicolon ";"
lparen "("
rparen ")"
comma ","
equals "="
dot "."
exclamation "!"
%%
{ws} { /* Do nothing! */ }
{comment} { /* Do nothing! */ }
{println} { return PRINTLN; } /* Before {period} to give this pre
cedence */
{op} { return OP; }
{int_literal} { return INTEGER_LITERAL; }
{class} { return CLASS; }
{public} { return PUBLIC; }
{static} { return STATIC; }
{void} { return VOID; }
{main} { return MAIN; }
{string} { return STRING; }
{extends} { return EXTENDS; }
{return} { return RETURN; }
{boolean} { return BOOLEAN; }
{if} { return IF; }
{new} { return NEW; }
{else} { return ELSE; }
{while} { return WHILE; }
{length} { return LENGTH; }
{int} { return INT; }
{true} { return TRUE; }
{false} { return FALSE; }
{this} { return THIS; }
{lbrace} { return LBRACE; }
{rbrace} { return RBRACE; }
{lbracket} { return LBRACKET; }
{rbracket} { return RBRACKET; }
{semicolon} { return SEMICOLON; }
{lparen} { return LPAREN; }
{rparen} { return RPAREN; }
{comma} { return COMMA; }
{equals} { return EQUALS; }
{dot} { return DOT; }
{exclamation} { return EXCLAMATION; }
{id} { return ID; }
%%
int main(void) {
yyparse();
exit(0);
}
int yywrap(void) {
return 0;
}
int yyerror(void) {
printf("Parse error. Sorry bro.\n");
exit(1);
}
And the yacc file:
%token PRINTLN
%token INTEGER_LITERAL
%token OP
%token CLASS
%token PUBLIC
%token STATIC
%token VOID
%token MAIN
%token STRING
%token EXTENDS
%token RETURN
%token BOOLEAN
%token IF
%token NEW
%token ELSE
%token WHILE
%token LENGTH
%token INT
%token TRUE
%token FALSE
%token THIS
%token LBRACE
%token RBRACE
%token LBRACKET
%token RBRACKET
%token SEMICOLON
%token LPAREN
%token RPAREN
%token COMMA
%token EQUALS
%token DOT
%token EXCLAMATION
%token ID
%%
Program: MainClass ClassDeclList
MainClass: CLASS ID LBRACE PUBLIC STATIC VOID MAIN LPAREN STRING LB
RACKET RBRACKET ID RPAREN LBRACE Statement RBRACE RBRACE
ClassDeclList: ClassDecl ClassDeclList
|
ClassDecl: CLASS ID LBRACE VarDeclList MethodDeclList RBRACE
| CLASS ID EXTENDS ID LBRACE VarDeclList MethodDeclList RB
RACE
VarDeclList: VarDecl VarDeclList
|
VarDecl: Type ID SEMICOLON
MethodDeclList: MethodDecl MethodDeclList
|
MethodDecl: PUBLIC Type ID LPAREN FormalList RPAREN LBRACE VarDeclLi
st StatementList RETURN Exp SEMICOLON RBRACE
FormalList: Type ID FormalRestList
|
FormalRestList: FormalRest FormalRestList
|
FormalRest: COMMA Type ID
Type: INT LBRACKET RBRACKET
| BOOLEAN
| INT
| ID
StatementList: Statement StatementList
|
Statement: LBRACE StatementList RBRACE
| IF LPAREN Exp RPAREN Statement ELSE Statement
| WHILE LPAREN Exp RPAREN Statement
| PRINTLN LPAREN Exp RPAREN SEMICOLON
| ID EQUALS Exp SEMICOLON
| ID LBRACKET Exp RBRACKET EQUALS Exp SEMICOLON
Exp: Exp OP Exp
| Exp LBRACKET Exp RBRACKET
| Exp DOT LENGTH
| Exp DOT ID LPAREN ExpList RPAREN
| INTEGER_LITERAL
| TRUE
| FALSE
| ID
| THIS
| NEW INT LBRACKET Exp RBRACKET
| NEW ID LPAREN RPAREN
| EXCLAMATION Exp
| LPAREN Exp RPAREN
ExpList: Exp ExpRestList
|
ExpRestList: ExpRest ExpRestList
|
ExpRest: COMMA Exp
%%
The derivations that are not working are the following two:
Statement:
| ID EQUALS Exp SEMICOLON
| ID LBRACKET Exp RBRACKET EQUALS Exp SEMICOLON
If I only lex the file and get the token stream, the tokens match the pattern perfectly. Here's an example input and output:
num1 = id1;
num2[0] = id2;
gives:
ID
EQUALS
ID
SEMICOLON
ID
LBRACKET
INTEGER_LITERAL
RBRACKET
EQUALS
ID
SEMICOLON
What I don't understand is how this token stream matches the grammar exactly, and yet yyerror is being called. I've been trying to figure this out for hours, and I've finally given up. I'd appreciate any insight into what's causing the problem.
For a full example, you can run the following input through the parser:
class Minimal {
public static void main (String[] a) {
// Infinite loop
while (true) {
/* Completely useless // (embedded comment) stat
ements */
if ((!false && true)) {
if ((new Maximal().calculateValue(id1, i
d2) * 2) < 5) {
System.out.println(new int[11].l
ength < 10);
}
else { System.out.println(0); }
}
else { System.out.println(false); }
}
}
}
class Maximal {
public int calculateValue(int[] id1, int id2) {
int[] num1; int num2;
num1 = id1;
num2[0] = id2;
return (num1[0] * num2) - (num1[0] + num2);
}
}
It should parse correctly, but it is tripping up on num1 = id1; and num2[0] = id2;.
PS - I know that this is semantically-incorrect MiniJava, but syntactically, it should be fine :)
There is nothing wrong with your definitions of Statement. The reason they trigger the error is that they start with ID.
To start with, when bison processes your input, it reports:
minijava.y: conflicts: 8 shift/reduce
Shift/reduce conflicts are not always a problem, but you can't just ignore them. You need to know what causes them and whether the default behaviour will be correct or not. (The default behaviour is to prefer shift over reduce.)
Six of the shift/reduce conflicts come from the fact that:
Exp: Exp OP Exp
which is inherently ambiguous. You'll need to fix that by using actual operators instead of OP and inserting precedence rules (or specific productions). That has nothing to do with the immediate problem, and since it doesn't (for now) matter whether the first Exp or the second one gets priority, the default resolution will be fine.
The other ones come from the following production:
VarDeclList: VarDecl VarDeclList
| %empty
Here, VarDecl might start with ID (in the case of a classname used as a type).
VarDeclList is being produced from MethodDecl:
MethodDecl: ... VarDeclList StatementList ...
Now, let's say we're parsing the input; we've just parsed:
int num2;
and we're looking at the next token, which is num1 (from num1 = id1). int num2; is certainly a VarDecl, so it will match VarDecl in
VarDeclList: VarDecl VarDeclList
In this context, VarDeclList could be empty, or it could start with another declaration. If it's empty, we need to reduce it right away (because we won't get another chance: non-terminals need to be reduced no later than when their right-hand sides are complete). If it's not empty, we can simply shift the first token. But we need to make that decision based on the current lookahead token, which is an ID.
Unfortunately, that doesn't help us. Both VarDeclList and StatementList could start with ID, so both reduce and shift are feasible. Consequently, bison shifts.
Now, let's suppose that VarDeclList used left-recursion instead of right-recursion. (Left recursion is almost always better in LR grammars.):
VarDeclList: VarDeclList VarDecl
| %empty
Now, when we reach the end of a VarDecl, we have only one option: reduce the VarDeclList. And then we'll be in the following state:
MethodDecl: ... VarDeclList · StatementList
VarDeclList: VarDeclList · VarDecl
Now, we see the ID lookhead, and we don't know whether it starts a StatementList or a VarDecl. But it doesn't matter because we don't need to reduce either of those non-terminals; we can wait to see what comes next before committing to one or the other.
Note that there is a small semantic difference between left- and right-recursion in this case. Clearly, the syntax trees are different:
VDL VDL
/ \ / \
VDL Decl Decl VDL
/ \ / \
VDL Decl Decl VDL
| |
λ λ
However, in practice the most likely actions are going to be:
VarDeclList: %empty { $$ = newVarDeclList(); }
| VarDeclList VarDecl { $$ = $1; appendVarDecl($$, $2); }
which works just fine.
By the way:
1) While flex allows you to use definitions in order to simplify the regular expressions, it does not require you to use them, and nowhere is it written (to my knowledge) that it is best practice to use definitions. I use definitions sparingly, usually only when I'm going to write two regular expressions with the same component, or occasionally when the regular expression is really complicated and I want to break it down into pieces. However, there is absolutely no need to clutter your flex file with:
begin "begin"
...
%%
...
{begin} { return BEGIN; }
rather than the simpler and more readable
"begin" { return BEGIN; }
2) Along the same lines, bison helpfully allows you to write single-character tokens as single-quoted literals: '('. This has a number of advantages, starting with the fact that it provides a more readable view of the grammar. Also, you don't need to declare those tokens, or think up a good name for them. Moreover, since the value of the token is the character itself, your flex file can also be simplified. Instead of
"+" { return PLUS; }
"-" { return MINUS; }
"(" { return LPAREN; }
...
you can just write:
[-+*/(){}[\]!] { return yytext[0]; }
In fact, I usually recommend not even using that; just use a catch-all flex rule at the end:
. { return yytext[0]; }
That will pass all otherwise unmatched characters as single-character tokens to bison; if the token is not known to bison, it will issue a syntax error. So all the error-handling is centralized in bison, instead of being split between the two files, and you save a lot of typing (and whoever is reading your code saves a lot of reading.)
3) It's not necessary to put "System.out.println" before ".". They can never be confused, because they don't start with the same character. The only time order matters is if two patterns will maximally match the same string at the same point (which is why the ID pattern needs to come after all the individual keywords).
Related
I have the following code also have more after like expr: int {} | BOOL {} etc but i dont know what is the type that i should write in type of this parser, i have a calculator example that works with int and the type is int , but in my program i have float char string etc .. Thanks
%{
dont know what to write here
%}
%token <int> INT
%token <float> FLOAT
%token <char> CHAR
%token <bool> BOOL
%token <string> IDENT
%token PLUS Div Bigger Smaller MINUS TIMES
%token TYPE
%token DEF DD
%token Equals Atribuicao SoE BoE And Or
%token IF ELSE BEGIN END WHILE RETURN PV SEQ TO BY OF
%token RP LP LB RB
%token EOL
%left Bigger Smaller SoE BoE Equals Atribuicao Or And
%left PLUS MINUS
%left TIMES Div
%nonassoc UMINUS OF
%start main
%type <> main /* what should be in here ? */
main:
| expr EOL { $1 }
expr:
INT { }
| BOOL { }
| FLOAT { }
| CHAR { }
| expr OF expr { }
| BEGIN expr END { }
| RETURN expr PV { $2 }
| LP expr RP { $2 }
| LB expr RB { $2 }
| expr PLUS expr { }
| expr MINUS expr { }
| expr TIMES expr { }
%%
let main() = begin
Printf.printf "Hello yo\n" ;
end;;
Judging from your grammar, the return type should be something like expression, because it is expressions that you do parse. How you define that type depends on the semantics you want to implement. I guess that you will need a variant type that can at least hold atoms of type int, bool, float and char. So you can start with
type expression =
| Int of int
| Bool of bool
| Float of float
| Char of char
and see where it takes you.
I'm trying to parse some bits and pieces of Verilog - I'm primarily interested in extracting module definitions and instantiations.
In verilog a module is defined like:
module foo ( ... ) endmodule;
And a module is instantiated in one of two different possible ways:
foo fooinst ( ... );
foo #( ...list of params... ) fooinst ( .... );
At this point I'm only interested in finding the name of the defined or instantiated module; 'foo' in both cases above.
Given this menhir grammar (verParser.mly):
%{
type expr = Module of expr
| ModInst of expr
| Ident of string
| Int of int
| Lparen
| Rparen
| Junk
| ExprList of expr list
%}
%token <string> INT
%token <string> IDENT
%token LPAREN RPAREN MODULE TICK OTHER HASH EOF
%start expr2
%type <expr> mod_expr
%type <expr> expr1
%type <expr list> expr2
%%
mod_expr:
| MODULE IDENT LPAREN { Module ( Ident $2) }
| IDENT IDENT LPAREN { ModInst ( Ident $1) }
| IDENT HASH LPAREN { ModInst ( Ident $1) };
junk:
| LPAREN { }
| RPAREN { }
| HASH { }
| INT { };
expr1:
| junk* mod_expr junk* { $2 } ;
expr2:
| expr1* EOF { $1 };
When I try this out in the menhir interpretter it works fine extracting the module instantion:
MODULE IDENT LPAREN
ACCEPT
[expr2:
[list(expr1):
[expr1:
[list(junk):]
[mod_expr: MODULE IDENT LPAREN]
[list(junk):]
]
[list(expr1):]
]
EOF
]
It works fine for the single module instantiation:
IDENT IDENT LPAREN
ACCEPT
[expr2:
[list(expr1):
[expr1:
[list(junk):]
[mod_expr: IDENT IDENT LPAREN]
[list(junk):]
]
[list(expr1):]
]
EOF
]
But of course, if there is an IDENT that appears prior to any of these it will REJECT:
IDENT MODULE IDENT LPAREN IDENT IDENT LPAREN
REJECT
... and of course there will be identifiers in an actual verilog file prior to these defs.
I'm trying not to have to fully specify a Verilog grammar, instead I want to build the grammar up slowly and incrementally to eventually parse more and more of the language.
If I add IDENT to the junk rule, that fixes the problem above, but then the module instantiation rule doesn't work because now the junk rule is capturing the IDENT.
Is it possible to create a very permissive rule that will bypass stuff I don't want to match, or is it generally required that you must create a complete grammar to actually do something like this?
Is it possible to create a rule that would let me match:
MODULE IDENT LPAREN stuff* RPAREN ENDMODULE
where "stuff*" initially matches everything but RPAREN?
Something like :
stuff:
| !RPAREN { } ;
I've used PEG parsers in the past which would allow constructs like that.
I've decided that PEG is a better fit for a permissive, non-exhaustive grammar. Took a look at peg/leg and was able to very quickly put together a leg grammar that does what I need to do:
start = ( comment | mod_match | char)
line = < (( '\n' '\r'* ) | ( '\r' '\n'* )) > { lines++; chars += yyleng; }
module_decl = module modnm:ident lparen ( !rparen . )* rparen { chars += yyleng; printf("Module decl: <%s>\n",yytext);}
module_inst = modinstname:ident ident lparen { chars += yyleng; printf("Module Inst: <%s>\n",yytext);}
|modinstname:ident hash lparen { chars += yyleng; printf("Module Inst: <%s>\n",yytext);}
mod_match = ( module_decl | module_inst )
module = 'module' ws { modules++; chars +=yyleng; printf("Module: <%s>\n", yytext); }
endmodule = 'endmodule' ws { endmodules++; chars +=yyleng; printf("EndModule: <%s>\n", yytext); }
kwd = (module|endmodule)
ident = !kwd<[a-zA-z][a-zA-Z0-9_]+>- { words++; chars += yyleng; printf("Ident: <%s>\n", yytext); }
char = . { chars++; }
lparen = '(' -
rparen = ')' -
hash = '#'
- = ( space | comment )*
ws = space+
space = ' ' | '\t' | EOL
comment = '//' ( !EOL .)* EOL
| '/*' ( !'*/' .)* '*/'
EOF = !.
EOL = '\r\n' | '\n' | '\r'
Aurochs is possibly also an option, but I have concerns about speed and memory usage of an Aurochs generated parser. peg/leg produce a parser in C which should be quite speedy.
So I am trying to implement a pretty simple grammar for one-line statements:
# Grammar
c : Character c [a-z0-9-]
(v) : Vowel (= [a,e,u,i,o])
(c) : Consonant
(?) : Any character (incl. number)
(l) : Any alpha char (= [a-z])
(n) : Any integer (= [0-9])
(c1-c2) : Range from char c1 to char c2
(c1,c2,c3) : List including chars c1, c2 and c3
Examples:
h(v)(c)no(l)(l)jj-k(n)
h(v)(c)no(l)(l)(a)(a)(n)
h(e-g)allo
h(e,f,g)allo
h(x,y,z)uul
h(x,y,z)(x,y,z)(x,y,z)(x,y,z)uul
I am using the Happy parser generator (http://www.haskell.org/happy/) but for some reason there seems to be some ambiguity problem.
The error message is: "shift/reduce conflicts: 1"
I think the ambiguity is with these two lines:
| lBracket char rBracket { (\c -> case c of
'v' -> TVowel
'c' -> TConsonant
'l' -> TLetter
'n' -> TNumber) $2 }
| lBracket char hyphen char rBracket { TRange $2 $4 }
An example case is: "(a)" vs "(a-z)"
The lexer would give the following for the two cases:
(a) : [CLBracket, CChar 'a', CRBracket]
(a-z) : [CLBracket, CChar 'a', CHyphen, CChar 'z', CRBracket]
What I don't understand is how this can be ambiguous with an LL[2] parser.
In case it helps here is the entire Happy grammar definition:
{
module XHappyParser where
import Data.Char
import Prelude hiding (lex)
import XLexer
import XString
}
%name parse
%tokentype { Character }
%error { parseError }
%token
lBracket { CLBracket }
rBracket { CRBracket }
hyphen { CHyphen }
question { CQuestion }
comma { CComma }
char { CChar $$ }
%%
xstring : tokens { XString (reverse $1) }
tokens : token { [$1] }
| tokens token { $2 : $1 }
token : char { TLiteral $1 }
| hyphen { TLiteral '-' }
| lBracket char rBracket { (\c -> case c of
'v' -> TVowel
'c' -> TConsonant
'l' -> TLetter
'n' -> TNumber) $2 }
| lBracket question rBracket { TAny }
| lBracket char hyphen char rBracket { TRange $2 $4 }
| lBracket listitems rBracket { TList $2 }
listitems : char { [$1] }
| listitems comma char { $1 ++ [$3] }
{
parseError :: [Character] -> a
parseError _ = error "parse error"
}
Thank you!
Here's the ambiguity:
token : [...]
| lBracket char rBracket
| [...]
| lBracket listitems rBracket
listitems : char
| [...]
Your parser could accept (v) as both TString [TVowel] and TString [TList ['v']], not to mention the missing characters in that case expression.
One possible way of solving it would be to modify your grammar so lists are at least two items, or have some different notation for vowels, consonants, etc.
The problem seems to be:
| lBracket char rBracket
...
| lBracket listitems rBracket
or in cleaner syntax:
(c)
Can be a TVowel, TConsonant, TLetter, TNumber (as you know) or a singleton TList.
As the happy manual says, shift reduce usually isn't an issue. You can us precedence to force behavior/remove the warning if you'd like.
I'm starting to play with Fslex/Fsyacc. When trying to generate the parser using this input
Parser.fsy:
%{
open Ast
%}
// The start token becomes a parser function in the compiled code:
%start start
// These are the terminal tokens of the grammar along with the types of
// the data carried by each token:
%token <System.Int32> INT
%token <System.String> STRING
%token <System.String> ID
%token PLUS MINUS ASTER SLASH LT LT EQ GTE GT
%token LPAREN RPAREN LCURLY RCURLY LBRACKET RBRACKET COMMA
%token ARRAY IF THEN ELSE WHILE FOR TO DO LET IN END OF BREAK NIL FUNCTION VAR TYPE IMPORT PRIMITIVE
%token EOF
// This is the type of the data produced by a successful reduction of the 'start'
// symbol:
%type <Ast.Program> start
%%
// These are the rules of the grammar along with the F# code of the
// actions executed as rules are reduced. In this case the actions
// produce data using F# data construction terms.
start: Prog { Program($1) }
Prog:
| Expr EOF { $1 }
Expr:
// literals
| NIL { Ast.Nil ($1) }
| INT { Ast.Integer($1) }
| STRING { Ast.Str($1) }
// arrays and records
| ID LBRACKET Expr RBRACKET OF Expr { Ast.Array ($1, $3, $6) }
| ID LCURLY AssignmentList RCURLY { Ast.Record ($1, $3) }
AssignmentList:
| Assignment { [$1] }
| Assignment COMMA AssignmentList {$1 :: $3 }
Assignment:
| ID EQ Expr { Ast.Assignment ($1,$3) }
Ast.fs
namespace Ast
open System
type Integer =
| Integer of Int32
and Str =
| Str of string
and Nil =
| None
and Id =
| Id of string
and Array =
| Array of Id * Expr * Expr
and Record =
| Record of Id * (Assignment list)
and Assignment =
| Assignment of Id * Expr
and Expr =
| Nil
| Integer
| Str
| Array
| Record
and Program =
| Program of Expr
Fsyacc reports the following error: "FSYACC: error FSY000: An item with the same key has already been added."
I believe the problem is in the production for AssignmentList, but can't find a way around...
Any tips will be appreciated
Hate answering my own questions, but the problem was here (line 15 of parser input file)
%token PLUS MINUS ASTER SLASH LT LT EQ GTE GT
Note the double definition (should have been LTE)
My vote goes for finding a way to improve the output of the Fslex/Fsyacc executables/msbuild tasks
I am writing a parser for delphi's dfm's files. The lexer looks like this:
EXP ([Ee][-+]?[0-9]+)
%%
("#"([0-9]{1,5}|"$"[0-9a-fA-F]{1,6})|"'"([^']|'')*"'")+ {
return tkStringLiteral; }
"object" { return tkObjectBegin; }
"end" { return tkObjectEnd; }
"true" { /*yyval.boolean = true;*/ return tkBoolean; }
"false" { /*yyval.boolean = false;*/ return tkBoolean; }
"+" | "." | "(" | ")" | "[" | "]" | "{" | "}" | "<" | ">" | "=" | "," |
":" { return yytext[0]; }
[+-]?[0-9]{1,10} { /*yyval.integer = atoi(yytext);*/ return tkInteger; }
[0-9A-F]+ { return tkHexValue; }
[+-]?[0-9]+"."[0-9]+{EXP}? { /*yyval.real = atof(yytext);*/ return tkReal; }
[a-zA-Z_][0-9A-Z_]* { return tkIdentifier; }
"$"[0-9A-F]+ { /* yyval.integer = atoi(yytext);*/ return tkHexNumber; }
[ \t\r\n] { /* ignore whitespace */ }
. { std::cerr << boost::format("Mystery character %c\n") % *yytext; }
<<EOF>> { yyterminate(); }
%%
and the bison grammar looks like
%token tkInteger
%token tkReal
%token tkIdentifier
%token tkHexValue
%token tkHexNumber
%token tkObjectBegin
%token tkObjectEnd
%token tkBoolean
%token tkStringLiteral
%%object:
tkObjectBegin tkIdentifier ':' tkIdentifier
property_assignment_list tkObjectEnd
;
property_assignment_list:
property_assignment
| property_assignment_list property_assignment
;
property_assignment:
property '=' value
| object
;
property:
tkIdentifier
| property '.' tkIdentifier
;
value:
atomic_value
| set
| binary_data
| strings
| collection
;
atomic_value:
tkInteger
| tkReal
| tkIdentifier
| tkBoolean
| tkHexNumber
| long_string
;
long_string:
tkStringLiteral
| long_string '+' tkStringLiteral
;
atomic_value_list:
atomic_value
| atomic_value_list ',' atomic_value
;
set:
'[' ']'
| '[' atomic_value_list ']'
;
binary_data:
'{' '}'
| '{' hexa_lines '}'
;
hexa_lines:
tkHexValue
| hexa_lines tkHexValue
;
strings:
'(' ')'
| '(' string_list ')'
;
string_list:
tkStringLiteral
| string_list tkStringLiteral
;
collection:
'<' '>'
| '<' collection_item_list '>'
;
collection_item_list:
collection_item
| collection_item_list collection_item
;
collection_item:
tkIdentifier property_assignment_list tkObjectEnd
;
%%
void yyerror(const char *s, ...) {...}
The problem with this grammar occurs while parsing the binary data. Binary data in the dfm's files is nothing
but a sequence of hexadecimal characters which never spans more than 80 characters per line. An example of
it is:
Picture.Data = {
055449636F6E0000010001002020000001000800A80800001600000028000000
2000000040000000010008000000000000000000000000000000000000000000
...
FF00000000000000000000000000000000000000000000000000000000000000
00000000FF000000FF000000FF00000000000000000000000000000000000000
00000000}
As you can see, this element lacks any markers, so the strings clashes with other elements. In the example
above the first line is returns the proper token tkHexValue. The second however returns a tkInteger token
and the third a tkIdentifier token. So when the parsing comes, it fails with an syntax error because
binary data is composed only of tkHexValue tokens.
My first workaround was to require integers to have a maximum length (which helped in all but the last line
of the binary data). And the second was to move the tkHexValue token above the tkIdentifier but it means
that now I will not have identifiers like F0
I was wondering if there is any way to fix this grammar?
Ok, I solved this one. I needed to define a state so tkHexValue is only returned while reading binary data. In the preamble part of the lexer I added
%x BINARY
and modify the following rules
"{" {BEGIN BINARY; return yytext[0];}
<BINARY>"}" {BEGIN INITIAL; return yytext[0];}
<BINARY>[ \t\r\n] { /* ignore whitespace */ }
And that was all!