why my lexical analyzer can not recognize numbers and ids and operators - flex-lexer

my lexical analyzer in flex can not recognize numbers and ids and operators ,only keywords were recognized where is my mistake? this is my code:
%{
#include<stdio.h>
%}
Nums [0-9]
LowerCase [a-z]
UpperCase [A-Z]
Letters LowerCase|UpperCase|[_]
Id {Letters}({Letters}|{Nums})*
operators +|-|\|*
%%
"if" {printf("if keyword founded \n");}
"then" {printf("then keyword founded \n");}
"else" {printf("else keyword founded \n");}
Operators {printf(" operator founded \n");}
Id {printf(" id founded ");}
%%
int main (void)
{ yylex(); return(0);}
int yywrap(void)
{ return 1;}

The pattern Operators is equivalent to "Operators", so it only matches that single word. If you meant to expand the macro by that name, the syntax is {Operators}. (Actually, {operators} since you seem to have inconsistently spelled the macro name in all lower-case.)
If you do that, flex will complain because of the syntax error in that macro. (Syntax errors in macros aren't detected unless the macro is expanded. That's just one of the problems with using macros.)
You have different problems with your other macros. For example, Nums doesn't appear in any rule at all.
My suggestion would be to use fewer (or no) macros and more character classes. Eg.:
[[:alpha:]_][[:alnum:]_]* { /* Action for identifier. */ }
[[:digit:]]+ { /* Action for number. */ }
[-+*/] { /* Action for operator. */ }
Please read the Patterns section in the flex manual for a full description of the pattern syntax, including the named character class expressions used in the first two patterns above.

To use a named definition, it ust be enclosed in {}. So your Letters rule should be
Letters {LowerCase}|{UpperCase}|[_]
... as it is, it matches the literal inputs LowerCase and UpperCase. Similarly in your rules, you want
{Operators} ...
{Id} ...
as what you have will match the literal input strings Operators and Id

Related

What is the difference between word and word in quotations in FLEX [duplicate]

I am writing a simple scanner in flex. I want my scanner to print out "integer type seen" when it sees the keyword "int". Is there any difference between the following two ways?
1st way:
%%
int printf("integer type seen");
%%
2nd way:
%%
"int" printf("integer type seen");
%%
So, is there a difference between writing if or "if"? Also, for example when we see a == operator, we print something. Is there a difference between writing == or "==" in the flex file?
There's no difference in these specific cases -- the quotes(") just tell lex to NOT interpret any special characters (eg, for regular expressions) in the quoted string, but if there are no special characters involved, they don't matter:
[a-z] printf("matched a single letter\n");
"[a-z]" printf("matched the 5-character string '[a-z]'\n");
0* printf("matched zero or more zero characters\n");
"0*" printf("matched a zero followed by an asterisk\n");
Characters that are special and mean something different outside of quotes include . * + ? | ^ $ < > [ ] ( ) { } /. Some of those only have special meaning if they appear at certain places, but its generally clearer to quote them regardless of where they appear if you want to match the literal characters.

Unrecognized rule in Flex lexer

In the process of making an XML parser :
As the title suggests I have documented the rules as shown in my code below , but flex seems to miss a specific one.
Error : Cmd Error Img
The line in question is :
{boolean} {yylval.booleanval = strdup(yytext); if(err==1){printf("\t\t\t\t\t\t");}; return BOOLEAN;}```
When clearly declared flex seems to disregard it, where for the other rules no such problem arises.
Flex Code :
%option noyywrap
%option yylineno
string [_a-zA-Z][_a-zA-Z0-9]*
digit [0-9]
integer {digit}+
boolean "True" | "False"
text ({string}| )*
%%
. {printf("%s",yytext);}
{boolean} {yylval.booleanval = strdup(yytext); if(err==1){printf("\t\t\t\t\t\t");}; return BOOLEAN;}
{integer} {return INT;}
{string} {return STRING;}
%%
Rereading the question, I think there is a terminology problem. The rule is
{boolean} {yylval.booleanval = strdup(yytext); if(err==1){printf("\t\t\t\t\t\t");}; return BOOLEAN;}
Like all rule, that rule consists of *pattern" and an action. The pattern {boolean} consists only of a macro expansion. Once the macro is expanded, the line can no longer be recognised as a rule because of stray whitespace in the macro's definition, as I explained in the original answer below:
As indicated by the error message, the problem is the pattern in line 22 of your flex file, which contains a macro expansion of boolean:
boolean "True" | "False"
Flex patterns may not contain unquoted whitespace, whether entered directly or through a macro.
If you insist on using a macro, it could be:
boolean True|False
Although nothing prevents you from inserting the pattern directly in the rule:
True|False {yylval.booleanval = strdup(yytext); if(err==1){printf("\t\t\t\t\t\t");}; return BOOLEAN;}

How to use "literal string tokens" in Bison

I am learning Flex/Bison. The manual of Bison says:
A literal string token is written like a C string constant; for
example, "<=" is a literal string token. A literal string token
doesn’t need to be declared unless you need to specify its semantic
value data type
But I do not figure how to use it and I do not find an example.
I have the following code for testing:
example.l
%option noyywrap nodefault
%{
#include "example.tab.h"
%}
%%
[ \t\n] {;}
[0-9] { return NUMBER; }
. { return yytext[0]; }
%%
example.y
%{
#include <stdio.h>
#define YYSTYPE char *
%}
%token NUMBER
%%
start: %empty | start tokens
tokens:
NUMBER "<=" NUMBER { printf("<="); }
| NUMBER "=>" NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
%%
main(int argc, char **argv) {
yyparse();
}
yyerror(char *s) {
fprintf(stderr, "error: %s\n", s);
}
Makefile
#!/usr/bin/make
# by RAM
all: example
example.tab.c example.tab.h: example.y
bison -d $<
lex.yy.c: example.l example.tab.h
flex $<
example: lex.yy.c example.tab.c
cc -o $# example.tab.c lex.yy.c -lfl
clean:
rm -fr example.tab.c example.tab.h lex.yy.c example
And when I run it:
$ ./example
3<4
<
6>9
>
6=>9
error: syntax error
Any idea?
UPDATE: I want to clarify that I know alternative ways to solve it, but I want to use literal string tokens.
One Alternative: using multiple "literal character tokens":
tokens:
NUMBER '<' '=' NUMBER { printf("<="); }
| NUMBER '=' '>' NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
When I run it:
$ ./example
3<=9
<=
Other alternative:
In example.l:
"<=" { return LE; }
"=>" { return GE; }
In example.y:
...
%token NUMBER
%token LE "<="
%token GE "=>"
%%
start: %empty | start tokens
tokens:
NUMBER "<=" NUMBER { printf("<="); }
| NUMBER "=>" NUMBER { printf("=>\n"); }
| NUMBER '>' NUMBER { printf(">\n"); }
| NUMBER '<' NUMBER { printf("<\n"); }
...
When I run it:
$ ./example
3<=4
<=
But the manual says:
A literal string token doesn’t need to be declared unless you need to
specify its semantic value data type
The quoted manual paragraph is correct, but you need to read the next paragraph, too:
You can associate the literal string token with a symbolic name as an alias, using the %token declaration (see Token Declarations). If you don’t do that, the lexical analyzer has to retrieve the token number for the literal string token from the yytname table.
So you don't need to declare the literal string token, but you still need to arrange for the lexer to send the correct token number, and if you don't declare an associated token name, the only way to find the correct value is to search for the code in the yytname table.
In short, your last example where you define LE and GE as aliases, is by far the most common approach. Separating the tokens into individual characters is not a good idea; it might create shift-reduce conflicts and it will definitely allow invalid inputs, such as putting whitespace between the characters.
If you want to try the yytname solution, there is sample code in the bison manual. But please be aware that this code discovers bison's internal token number, which is not the number which needs to be returned from the scanner. There is no way to get the external token number which is easy, portable and documented; the easy and undocumented way is to look the token number up in yytoknum but since that array is not documented and conditional on a preprocessor macro, there is no guarantee that it will work. Note also that these tables are declared static so the function(s) which rely on them must be included in the bison input file. (Of course, these functions can have external linkage so that they can be called from the lexer. But you can't just use yytname directly in the lexer.)
I havent used flex/bison for a while but two things:
. as far as I remember only matches a single character. yytext is a pointer to a null terminated string char* so yytext[0] is a char which means that you can't match strings this way. You probably need to change it to return yytext. Otherwise . will probably create a token PER character and you'd probably have to write NUMBER '<' '=' NUMBER.

Flex: How to define a term to be the first one at the beginning of a line(exclusively)

I need some help regarding a problem I face in my flex code.
My task: To write a flex code which recognizes the declaration part of a programming language, described below.
Let a programming language PL. Its variable definition part is described as follows:
At the beginning we have to start with the keyword "var". After writing this keyword we have to write the variable names(one or more) separated by commas ",". Then a colon ":" is inserted and after that we must write the variable type(say real, boolean, integer or char in my example) followed by a semicolon ";". After doing the previous steps there is the potentiality to declare into a new line new variables(variable names separated by commas "," followed by colon ":" followed by variable type followed by a semicolon ";"), but we must not use the "var" keyword again at the beginning of the new line( the "var" keyword is written once!!!)
E.g.
var number_of_attendants, sum: integer;
ticket_price: real;
symbols: char;
Concretely, I do not know how to make it possible to define that each and every declaration part must start only with the 'var' keyword. Until now, if I would begin a declaration part directly declaring a variable, say x (without having written "var" at the beginning of the line), then no error would occur(unwanted state).
My current flex code below:
%{
#include <stdio.h>
%}
VAR_DEFINER "var"
VAR_NAME [a-zA-Z][a-zA-Z0-9_]*
VAR_TYPE "real"|"boolean"|"integer"|"char"
SUBEXPRESSION [{VAR_NAME}[","{VAR_NAME}]*":"[ \t\n]*{VAR_TYPE}";"]+
EXPRESSION {VAR_DEFINER}{SUBEXPRESSION}
%%
^{EXPRESSION} {
printf("This is not a well-syntaxed expression!\n");
return 0;
}
{EXPRESSION} printf("This is a well-syntaxed expression!\n");
";"[ \t\n]*{VAR_DEFINER} {
printf("The keyword 'var' is defined once at the beginning of a new line. You can not use it again\n");
return 0;
}
{VAR_DEFINER} printf("A keyword: %s\n", yytext);
^{VAR_DEFINER} printf("Each and every declaration part must start with the 'var' keyword.\n");
{VAR_TYPE}";" printf("The variable type is: %s\n", yytext);
{VAR_NAME} printf("A variable name: %s\n", yytext);
","/[ \t\n]*{VAR_NAME} /* eat up commas */
":"/[ \t\n]*{VAR_TYPE}";" /* eat up single colon */
[ \t\n]+ /* eat up whitespace */
. {
printf("Unrecognized character: %s\n", yytext);
return 0;
}
%%
main(argc, argv)
int argc;
char** argv;
{
++argv, --argc;
if (argc > 0)
yyin = fopen(argv[0],"r");
else
yyin = stdin;
yylex();
}
I hope to have made it as much as possible clear.
I am looking forward to reading your answers!
You seem to be trying to do too much in the scanner. Do you really have to do everything in Flex? In other words, is this an exercise to learn advanced use of Flex, or is it a problem that may be solved using more appropriate tools?
I've read that the first Fortran compiler took 18 staff-years to create, back in the 1950's. Today, "a substantial compiler can be implemented even as a student project in a one-semester compiler design course", as the Dragon Book from 1986 says. One of the main reasons for this increased efficiency is that we have learned how to divide the compiler into modules that can be constructed separately. The two first such parts, or phases, of a typical compiler is the scanner and the parser.
The scanner, or lexical analyzer, can be generated by Flex from a specification file, or constructed otherwise. Its job is to read the input, which consists of a sequence of characters, and split it into a sequence of tokens. A token is the smallest meaningful part of the input language, such as a semicolon, the keyword var, the identifier number_of_attendants, or the operator <=. You should not use the scanner to do more than that.
Here is how I woould write a simplified Flex specification for your tokens:
[ \t\n] { /* Ignore all whitespace */ }
var { return VAR; }
real { return REAL; }
boolean { return BOOLEAN; }
integer { return INTEGER; }
char { return CHAR; }
[a-zA-Z][a-zA-Z0-9_]* { return VAR_NAME; }
. { return yytext[0]; }
The sequence of tokens is then passed on to the parser, or syntactical analyzer. The parser compares the token sequence with the grammar for the language. For example, the input var number_of_attendants, sum : integer; consists of the keyword var, a comma-separated list of variables, a colon, a data type keyword, and a semicolon. If I understand what your input is supposed to look like, perhaps this grammar would be correct:
program : VAR typedecls ;
typedecls : typedecl | typedecls typedecl ;
typedecl : varlist ':' var_type ';' ;
varlist : VAR_NAME | varlist ',' VAR_NAME ;
var_type : REAL | BOOLEAN | INTEGER | CHAR ;
This grammar happens to be written in a format that Bison, a parser-generator that often is used together with Flex, can understand.
If you separate your solution into a lexical part, using Flex, and a grammar part, using Bison, your life is likely to be much simpler and happier.

Valid regular expression for identifier using flex

I'm trying to make a regular expression that will only work when a valid identifier name is given, using flex (the name cannot start with a number). I'm using this code :
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
%}
%%
"if" { printf("IF "); }
[a-zA-Z_][a-zA-Z_0-9]* { printf("%s ", yytext); }
%%
int main() {
yylex();
}
but it is not working. how to make sure that flex accepts only a valid identifier?
When I provide the input:
if
abc
9abc
I see the following output:
IF
abc
9abc
but I expected:
IF
abc
(nothing)
Your patterns do not match all possible inputs.
In such cases, (f)lex adds a default catch-all rule, of the form
.|\n { ECHO; }
In other words, any character not recognized by your patterns will simply be printed on stdout. That will be the case with the newline characters in your input, as well as with the digit 9. After the 9 is recognized by the default rule, the remaining input will again be recognized by your identifier rule.
So you probably wanted something like this:
%option warn nodefault
%%
[[:space:]]+ ; /* Ignore whitespace */
"if" { /* TODO: Handle an "if" token */ }
[[:alpha:]_][[:alnum:]_]* { /* TODO: Handle an identifier token */ }
. { /* TODO: Handle an error */ }
Instead of printing information to stdout in an action as a debugging or learning aid, I strongly suggest you use the -T (or --trace) option when you are building your scanner. That will automatically output debugging information in a consistent and complete manner; it would have told you that the default rule was being matched, for example.
Notes:
%option nodefault tells flex not to insert a default rule. I recommend always using it, because it will keep you out of trouble. The warn option ensures that a warning is issued in this case; I think that warn is default flex behaviour but the manual suggests using it and it cannot hurt.
It's good style to use standard character class expressions. Inside a character class ([…]), [:xxx:] matches anything for which the standard library function isxxx would return true. So [[:space:]]+ matches one or more whitespace characters, including space, tab, and newline (and some others), [[:alpha:]_] matches any letter or an underscore, and [[:alnum:]_]* matches any number (including 0) of letters, digits, or underscores. See the Patterns section of the manual.

Resources