Flex: Unrecognized rule error - flex-lexer

I am getting an "unrecognized rule" error in Flex. I have read some articles but did not find a solution to my problem. I have tried making some changes to my code, but nothing seems to work (sometimes the changes even made it worse). I am posting my code below in the hope that a solution can be found.
My flex code:
%{
#include <stdio.h>
%}
VAR_DEFINER "var"
VAR_NAME [a-zA-Z][a-zA-Z0-9_]*
VAR_TYPE "real" | "boolean" | "integer" | "char"
%%
{VAR_DEFINER} {printf("A keyword: %s\n", yytext);}
{VAR_NAME} | ","{VAR_NAME} {printf("A variable name: %s\n", yytext);}
":" {printf("A colon\n");}
{VAR_TYPE}";""\n" {printf("The variable type is: %s\n", yytext);}
"\n"{VAR_DEFINER} {printf("Error: The keyword 'var' is defined once at the beginning.\n");}
[ \t\n]+ /* eat up whitespace */
. {printf("Unrecognized character: %s\n", yytext);}
%%
main(argc, argv)
int argc;
char** argv;
{
++argv, --argc;
if (argc > 0)
yyin = fopen(argv[0],"r");
else
yyin = stdin;
yylex();
}

As you wrote in your own answer to your question, you can fix the errors by being careful with whitespace.
But the underlying problem is that you are trying to let the scanner do work that is better done by the parser. If you want to parse things like var x boolean, then that shouldn't be a single token, discovered by the scanner. The usual, and most often much better, approach is to let the scanner discover three separate tokens (var, x and boolean), and then let the parser group them into a variable declaration.
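A minimal sketch of that split, with illustrative token names (VAR, IDENTIFIER and BOOLEAN would be defined in the Bison-generated header): the scanner only returns tokens,

var                   { return VAR; }
boolean               { return BOOLEAN; }
[a-zA-Z][a-zA-Z0-9_]* { return IDENTIFIER; }
[ \t\n]+              { /* skip whitespace */ }

and a Bison rule groups them into a declaration:

declaration: VAR IDENTIFIER BOOLEAN ;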

I found the answer on my own. I would like to post it to help anyone else who may have a similar problem, just in case.
My mistake was that I left unquoted whitespace between the terms of expressions and between the variable types in the definitions section. For example, I had written VAR_TYPE "real" | "boolean" | "integer" | "char" instead of VAR_TYPE "real"|"boolean"|"integer"|"char" (without whitespace).
So watch out for every kind of bracket, and for whitespace!
I hope this helps!
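For reference, here is a sketch of the corrected lines, assuming the only change needed is removing those unquoted spaces:

VAR_TYPE "real"|"boolean"|"integer"|"char"

and, in the rules section:

{VAR_NAME}|","{VAR_NAME} {printf("A variable name: %s\n", yytext);}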

Related

Strange lexing issue keywords vs identifiers regex matching

I've been struggling to understand some behavior of flex.
I started defining a small toy-like example program which will tokenize into keywords and strings.
One definition of the regex performs as expected but another behaves quite differently, contrary to my expectation.
It has been a few years since I've played with this stuff so hopefully someone can point me in the right direction.
I modified the token regular expression to get it to work but I'd really like to understand why my original choice behaved differently.
The first example is the non-working code:
%{
#include <iostream>
using namespace std;
%}
%option noyywrap
%%
[ \t\n] {cout << "ws" << endl;};
buzz {cout << "kw" << endl;};
[^\n]+ {cout << "str" << endl;};
%%
int main(){
yylex();
}
The second example is the modified version which does behave properly.
%{
#include <iostream>
using namespace std;
%}
%option noyywrap
%%
[ \t\n] {cout << "ws" << endl;};
buzz {cout << "kw" << endl;};
[a-zA-Z]+ {cout << "str" << endl;};
%%
int main(){
yylex();
}
In the code, buzz is supposed to be a keyword, and anything following should be just read as a string.
For the first example, buzz gets consumed along with the remaining word as a "str".
In the second example, buzz is properly recognized and the remaining word becomes the "str".
I understand that the third rule in both cases is also a valid definition for a token containing the characters b-u-z-z. Each of these four letters is in [^\n]+, as well as [a-zA-Z]+. So why on earth is the behavior different?
Example inputs would be:
buzz lightyear
buzz aldren
Thanks!
Flex (as well as most other lexer generators) works according to the maximum munch rule. That rule says that if multiple patterns can match the current input, the one that produces the longest match is chosen. If multiple patterns produce a match of the same length, the one that appears first in the .l file is chosen.
So in your working version the patterns buzz and [a-zA-Z]+ both match buzz, and buzz is chosen because it appears first in the file (if you switched the two lines, str would be printed instead). In your non-working version, buzz still matches only buzz, but [^\n]+ matches buzz lightyear and buzz aldren respectively, which is the longer match, so it wins according to the maximum munch rule.
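Here is a tiny self-contained scanner that demonstrates both halves of the rule (the patterns and messages are made up for illustration):

%{
#include <stdio.h>
%}
%option noyywrap
%%
foo       { puts("tie: earlier rule wins"); }
foo[a-z]* { puts("longest match wins"); }
[ \t\n]+  { /* ignore whitespace */ }
.         { /* ignore anything else */ }
%%
int main(void) { return yylex(); }

On the input foo both of the first two patterns match three characters, so the earlier rule is chosen; on the input foobar the second pattern matches six characters, so it wins regardless of rule order.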

How to use "literal string tokens" in Bison

I am learning Flex/Bison. The manual of Bison says:
A literal string token is written like a C string constant; for example, "<=" is a literal string token. A literal string token doesn’t need to be declared unless you need to specify its semantic value data type.
But I do not figure how to use it and I do not find an example.
I have the following code for testing:
example.l
%option noyywrap nodefault
%{
#include "example.tab.h"
%}
%%
[ \t\n] {;}
[0-9] { return NUMBER; }
. { return yytext[0]; }
%%
example.y
%{
#include <stdio.h>
#define YYSTYPE char *
%}
%token NUMBER
%%
start: %empty | start tokens
tokens:
NUMBER "<=" NUMBER { printf("<="); }
| NUMBER "=>" NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
%%
main(int argc, char **argv) {
yyparse();
}
yyerror(char *s) {
fprintf(stderr, "error: %s\n", s);
}
Makefile
#!/usr/bin/make
# by RAM
all: example
example.tab.c example.tab.h: example.y
bison -d $<
lex.yy.c: example.l example.tab.h
flex $<
example: lex.yy.c example.tab.c
cc -o $@ example.tab.c lex.yy.c -lfl
clean:
rm -fr example.tab.c example.tab.h lex.yy.c example
And when I run it:
$ ./example
3<4
<
6>9
>
6=>9
error: syntax error
Any idea?
UPDATE: I want to clarify that I know alternative ways to solve it, but I want to use literal string tokens.
One Alternative: using multiple "literal character tokens":
tokens:
NUMBER '<' '=' NUMBER { printf("<="); }
| NUMBER '=' '>' NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
When I run it:
$ ./example
3<=9
<=
Another alternative:
In example.l:
"<=" { return LE; }
"=>" { return GE; }
In example.y:
...
%token NUMBER
%token LE "<="
%token GE "=>"
%%
start: %empty | start tokens
tokens:
NUMBER "<=" NUMBER { printf("<="); }
| NUMBER "=>" NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
...
When I run it:
$ ./example
3<=4
<=
But the manual says:
A literal string token doesn’t need to be declared unless you need to specify its semantic value data type.
The quoted manual paragraph is correct, but you need to read the next paragraph, too:
You can associate the literal string token with a symbolic name as an alias, using the %token declaration (see Token Declarations). If you don’t do that, the lexical analyzer has to retrieve the token number for the literal string token from the yytname table.
So you don't need to declare the literal string token, but you still need to arrange for the lexer to send the correct token number, and if you don't declare an associated token name, the only way to find the correct value is to search for the code in the yytname table.
In short, your last example, where you define LE and GE as aliases, is by far the most common approach. Splitting the tokens into individual characters is not a good idea; it might create shift-reduce conflicts, and it will definitely allow invalid inputs, such as whitespace between the two characters.
If you want to try the yytname solution, there is sample code in the bison manual. But be aware that this code discovers bison's internal token number, which is not the number that needs to be returned from the scanner. There is no easy, portable, documented way to get the external token number; the easy but undocumented way is to look it up in yytoknum, but since that array is undocumented and conditional on a preprocessor macro, there is no guarantee it will work. Note also that these tables are declared static, so the function(s) that rely on them must be included in the bison input file. (Of course, those functions can have external linkage so that they can be called from the lexer; but you can't use yytname directly in the lexer.)
I haven't used flex/bison for a while, but two things:
As far as I remember, . only matches a single character. yytext is a pointer to a null-terminated string (char *), so yytext[0] is a char, which means you can't match strings this way. You probably need to change it to return yytext. Otherwise . will create a token per character, and you'd probably have to write NUMBER '<' '=' NUMBER.

Flex: How to define a term to be the first one at the beginning of a line (exclusively)

I need some help regarding a problem I face in my flex code.
My task: To write a flex code which recognizes the declaration part of a programming language, described below.
Let PL be a programming language. Its variable definition part is described as follows:
It must begin with the keyword "var". After this keyword we write one or more variable names, separated by commas ",", then a colon ":", then the variable type (say real, boolean, integer or char in my example), followed by a semicolon ";". After that, more variables may be declared on new lines (variable names separated by commas ",", followed by a colon ":", the variable type and a semicolon ";"), but the "var" keyword must not appear again at the beginning of a new line (the "var" keyword is written only once!).
E.g.
var number_of_attendants, sum: integer;
ticket_price: real;
symbols: char;
Concretely, I do not know how to specify that every declaration part must start with the 'var' keyword. At the moment, if I begin a declaration part by directly declaring a variable, say x (without having written "var" at the beginning of the line), no error occurs, which is not what I want.
My current flex code below:
%{
#include <stdio.h>
%}
VAR_DEFINER "var"
VAR_NAME [a-zA-Z][a-zA-Z0-9_]*
VAR_TYPE "real"|"boolean"|"integer"|"char"
SUBEXPRESSION [{VAR_NAME}[","{VAR_NAME}]*":"[ \t\n]*{VAR_TYPE}";"]+
EXPRESSION {VAR_DEFINER}{SUBEXPRESSION}
%%
^{EXPRESSION} {
printf("This is not a well-syntaxed expression!\n");
return 0;
}
{EXPRESSION} printf("This is a well-syntaxed expression!\n");
";"[ \t\n]*{VAR_DEFINER} {
printf("The keyword 'var' is defined once at the beginning of a new line. You can not use it again\n");
return 0;
}
{VAR_DEFINER} printf("A keyword: %s\n", yytext);
^{VAR_DEFINER} printf("Each and every declaration part must start with the 'var' keyword.\n");
{VAR_TYPE}";" printf("The variable type is: %s\n", yytext);
{VAR_NAME} printf("A variable name: %s\n", yytext);
","/[ \t\n]*{VAR_NAME} /* eat up commas */
":"/[ \t\n]*{VAR_TYPE}";" /* eat up single colon */
[ \t\n]+ /* eat up whitespace */
. {
printf("Unrecognized character: %s\n", yytext);
return 0;
}
%%
main(argc, argv)
int argc;
char** argv;
{
++argv, --argc;
if (argc > 0)
yyin = fopen(argv[0],"r");
else
yyin = stdin;
yylex();
}
I hope I have made it as clear as possible.
I am looking forward to reading your answers!
You seem to be trying to do too much in the scanner. Do you really have to do everything in Flex? In other words, is this an exercise to learn advanced use of Flex, or is it a problem that may be solved using more appropriate tools?
I've read that the first Fortran compiler took 18 staff-years to create, back in the 1950's. Today, "a substantial compiler can be implemented even as a student project in a one-semester compiler design course", as the Dragon Book from 1986 says. One of the main reasons for this increased efficiency is that we have learned how to divide the compiler into modules that can be constructed separately. The two first such parts, or phases, of a typical compiler is the scanner and the parser.
The scanner, or lexical analyzer, can be generated by Flex from a specification file, or constructed otherwise. Its job is to read the input, which consists of a sequence of characters, and split it into a sequence of tokens. A token is the smallest meaningful part of the input language, such as a semicolon, the keyword var, the identifier number_of_attendants, or the operator <=. You should not use the scanner to do more than that.
Here is how I would write a simplified Flex specification for your tokens:
[ \t\n] { /* Ignore all whitespace */ }
var { return VAR; }
real { return REAL; }
boolean { return BOOLEAN; }
integer { return INTEGER; }
char { return CHAR; }
[a-zA-Z][a-zA-Z0-9_]* { return VAR_NAME; }
. { return yytext[0]; }
The sequence of tokens is then passed on to the parser, or syntactical analyzer. The parser compares the token sequence with the grammar for the language. For example, the input var number_of_attendants, sum : integer; consists of the keyword var, a comma-separated list of variables, a colon, a data type keyword, and a semicolon. If I understand what your input is supposed to look like, perhaps this grammar would be correct:
program : VAR typedecls ;
typedecls : typedecl | typedecls typedecl ;
typedecl : varlist ':' var_type ';' ;
varlist : VAR_NAME | varlist ',' VAR_NAME ;
var_type : REAL | BOOLEAN | INTEGER | CHAR ;
This grammar happens to be written in a format that Bison, a parser-generator that often is used together with Flex, can understand.
If you separate your solution into a lexical part, using Flex, and a grammar part, using Bison, your life is likely to be much simpler and happier.
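As a sketch of how the two halves connect (using the token names above; the Flex rules would also need to #include the header that bison -d generates so the token macros are defined), a minimal Bison file could look like this:

%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "error: %s\n", s); }
%}
%token VAR REAL BOOLEAN INTEGER CHAR VAR_NAME
%%
program : VAR typedecls ;
typedecls : typedecl | typedecls typedecl ;
typedecl : varlist ':' var_type ';' ;
varlist : VAR_NAME | varlist ',' VAR_NAME ;
var_type : REAL | BOOLEAN | INTEGER | CHAR ;
%%
int main(void) { return yyparse(); }

Run bison -d on this file, run flex on the scanner rules, compile the two generated C files together, and you get a checker that accepts your example input and rejects a declaration part that does not start with var.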

Bison grammar warnings

I am writing a parser with Bison and I am getting the following warnings.
fol.y:42 parser name defined to default :"parse"
fol.y:61: warning: type clash ('' 'pred') on default action
I have been searching Google for a way to get rid of them, but have come up mostly empty-handed on what they mean (much less how to fix them), since every post I found about them involves a compilation error and the warnings themselves aren't addressed. Could someone tell me what they mean and how to fix them? The relevant code is below. Line 61 is the last semicolon. I cut out the rest of the grammar since it is incredibly verbose.
%union {
char* var;
char* name;
char* pred;
}
%token <var> VARIABLE
%token <name> NAME
%token <pred> PRED
%%
fol:
declines clauses {cout << "Done parsing with file" << endl;}
;
declines:
declines decline
|decline
;
decline:
PRED decs
;
The first message is likely just a warning that you didn't include %start parse in the grammar specification.
The second means that somewhere you have a rule that is supposed to return a value, but you haven't properly specified which type of value it returns. PRED returns the pred member of your union; the problem might be that you have not created %type entries for decline and declines. If you have a union, you have to specify the type for most, if not all, rules, or at least for rules that don't have an explicit action (and therefore fall back on the default $$ = $1; action).
I'm not convinced that the problem is in the line you specify, and because we don't have a complete, minimal reproduction of your problem, we can't investigate and validate that for you. The specification for decs may be relevant (I'm not convinced it is, but it might be).
You may get more information from the output of bison -v, which is the y.output file (or something similar).
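For example, if decline and declines are meant to carry the same string value as PRED, declaring their types would remove the clash (a sketch; whether these non-terminals should have the pred type at all is a guess, since the rest of the grammar is not shown):

%token <pred> PRED
%type <pred> decline declines

Alternatively, giving the rule an explicit action instead of relying on the default $$ = $1; should make that particular warning go away.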
Finally found it.
To fix this:
fol.y:42 parser name defined to default :"parse"
Add %name parse before %token
Eg:
%name parse
%token NUM
(From: https://bdhacker.wordpress.com/2012/05/05/flex-bison-in-ubuntu/#comment-2669)

Bison: How to ignore a token if it doesn't fit into a rule

I'm writing a program that handles comments as well as a few other things. If a comment is in a specific place, then my program does something.
Flex passes a token upon finding a comment, and Bison then looks to see if that token fits into a particular rule. If it does, then it takes an action associated with that rule.
Here's the thing: the input I'm receiving might actually have comments in the wrong places. In this case, I just want to ignore the comment rather than flagging an error.
My question:
How can I use a token if it fits into a rule, but ignore it if it doesn't? Can I make a token "optional"?
(Note: The only way I can think of doing this right now is scattering the comment token in every possible place in every possible rule. There MUST be a better solution than this. Maybe some rule involving the root?)
One solution may be to use bison's error recovery (see the Bison manual).
To summarize, bison defines the terminal token error to represent an error (say, a comment token returned in the wrong place). That way, you can (for example) close parentheses or braces after the wayward comment is found. However, this method will probably discard a certain amount of parsing, because I don't think bison can "undo" reductions. ("Flagging" the error, as with printing a message to stderr, is not related to this: you can have an error without printing an error--it depends on how you define yyerror.)
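As a sketch of what that looks like in a grammar (the names here are invented, not taken from your code):

block: '(' items ')'
     | '(' error ')'   { yyerrok; }
     ;

When something unexpected (such as a wayward comment token) shows up inside the parentheses, the parser discards input until it can shift the closing ')', at the cost of throwing away whatever partial parse was in progress inside.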
You may instead want to wrap each terminal in a special nonterminal:
term_wrap: comment TERM
This effectively does what you're scared to do (put in a comment in every single rule), but it does it in fewer places.
To force myself to eat my own dog food, I made up a silly language for myself. The only syntax is print <number> please, but if there's (at least) one comment (##) between the number and the please, it prints the number in hexadecimal, instead.
Like this:
print 1 please
1
## print 2 please
2
print ## 3 please
3
print 4 ## please
0x4
print 5 ## ## please
0x5
print 6 please ##
6
My lexer:
%{
#include <stdio.h>
#include <stdlib.h>
#include "y.tab.h"
%}
%%
print return PRINT;
[[:digit:]]+ yylval = atoi(yytext); return NUMBER;
please return PLEASE;
## return COMMENT;
[[:space:]]+ /* ignore */
. /* ditto */
and the parser:
%debug
%error-verbose
%verbose
%locations
%{
#include <stdio.h>
#include <string.h>
void yyerror(const char *str) {
fprintf(stderr, "error: %s\n", str);
}
int yywrap() {
return 1;
}
extern int yydebug;
int main(void) {
yydebug = 0;
yyparse();
}
%}
%token PRINT NUMBER COMMENT PLEASE
%%
commands: /* empty */
|
commands command
;
command: print number comment please {
if ($3) {
printf("%#x", $2);
} else {
printf("%d", $2);
}
printf("\n");
}
;
print: comment PRINT
;
number: comment NUMBER {
$$ = $2;
}
;
please: comment PLEASE
;
comment: /* empty */ {
$$ = 0;
}
|
comment COMMENT {
$$ = 1;
}
;
So, as you can see, not exactly rocket science, but it does the trick. There's a shift/reduce conflict in there, because of the empty string matching comment in multiple places. Also, there's no rule to fit comments in between the final please and EOF. But overall, I think it's a good example.
Treat comments as whitespace at the lexer level.
But keep two separate rules, one for whitespace and one for comments, both returning the same token ID.
The rule for comments (+ optional whitespace) keeps track of the comment in a dedicated structure.
The rule for whitespace resets the structure.
When you enter that “specific place”, check whether the last whitespace was a comment, or trigger an error.
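One way to read that suggestion, sketched on the lexer side and reusing the ## comments from the example above (the WS token and the last_ws_was_comment flag are my own names; the grammar would have to declare WS, accept it between tokens, and read the flag in the action at that specific place):

%{
#include <stdlib.h>
#include "y.tab.h"
int last_ws_was_comment = 0;  /* the "dedicated structure", reduced to a flag */
%}
%%
"##"         { last_ws_was_comment = 1; return WS; /* comment: remember it, report it as whitespace */ }
[[:space:]]+ { last_ws_was_comment = 0; return WS; /* plain whitespace: reset the flag */ }
print        { return PRINT; }
[[:digit:]]+ { yylval = atoi(yytext); return NUMBER; }
please       { return PLEASE; }
.            { return yytext[0]; }
%%

The grammar then mentions WS wherever blanks may legally appear, and the action at the "specific place" checks last_ws_was_comment to decide whether to do the special handling or report an error.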
