Strange lexing issue: keywords vs. identifiers regex matching - flex-lexer

I've been struggling to understand some behavior of flex.
I started defining a small toy-like example program which will tokenize into keywords and strings.
One definition of the regex performs as expected, but the other behaves quite differently from what I expected.
It has been a few years since I've played with this stuff so hopefully someone can point me in the right direction.
I modified the token regular expression to get it to work but I'd really like to understand why my original choice behaved differently.
This first example is the non-working code:
%{
#include <iostream>
using namespace std;
%}
%option noyywrap
%%
[ \t\n] {cout << "ws" << endl;};
buzz {cout << "kw" << endl;};
[^\n]+ {cout << "str" << endl;};
%%
int main(){
yylex();
}
The second example is the modified version, which does behave properly:
%{
#include <iostream>
using namespace std;
%}
%option noyywrap
%%
[ \t\n] {cout << "ws" << endl;};
buzz {cout << "kw" << endl;};
[a-zA-Z]+ {cout << "str" << endl;};
%%
int main(){
yylex();
}
In the code, buzz is supposed to be a keyword, and anything following it should just be read as a string.
In the first example, buzz gets consumed along with the remaining word as a single "str".
In the second example, buzz is properly recognized and the remaining word becomes the "str".
I understand that the third rule in both cases is also a valid definition for a token containing the characters b-u-z-z. Each of these four letters is in [^\n]+, as well as [a-zA-Z]+. So why on earth is the behavior different?
Example inputs would be:
buzz lightyear
buzz aldren
Thanks!

Flex (as well as most other lexer generators) works according to the maximum munch rule. That rule says that if multiple patterns can match the current input, the one that produces the longest match is chosen. If multiple patterns produce a match of the same length, the one that appears first in the .l file is chosen.
So in your working version the patterns buzz and [a-zA-Z]+ both match buzz, and the buzz rule is chosen because it appears first in the file (if you switched the two lines, str would be printed instead). In your non-working version buzz still only matches buzz, but [^\n]+ matches buzz lightyear and buzz aldren respectively, which is the longer match. Thus it wins according to the maximum munch rule.
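For instance (assuming the working version is saved as toy.l; the file name is only for illustration), compiling it and feeding it the first sample line prints one label per token, which makes the rule selection visible:
flex toy.l
g++ lex.yy.c -o toy
echo "buzz lightyear" | ./toy
kw
ws
str
ws
The kw line comes from buzz, the str line from lightyear, and the two ws lines from the space and the trailing newline.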

Related

Eliminate characters between numbers in Lex code

How can I eliminate characters between two or more integer numbers in lex code?
Ex: 12bd35
Output: 12 35
Lex builds lexical analyzers, which are intended to split the input into separate tokens. Once you recognize a token, you can ignore it, which is somewhat similar to "eliminating characters". But you always need to recognize them.
So you might start with the following minimalist scanner:
%option noinput nounput noyywrap
%%
[[:digit:]]+ { ECHO; fputc(' ', yyout); } /* print numbers. */
[^[:digit:]]+ ; /* ignore everything else. */
And then modify it to fit your actual need.
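For completeness, a minimal main to drive it might look like this (just a sketch; it assumes the two rules above are followed by a closing %% and that input arrives on stdin):
int main(void) {
    yylex();    /* scan stdin until EOF */
    return 0;
}
With the input 12bd35, the digit rule echoes 12 and 35 each followed by a space, while bd and the trailing newline fall to the second rule and are discarded, so the output is 12 35 (with a trailing space).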

Flex lexical analyzer not behaving as expected

I'm trying to use Flex to match basic patterns and print something.
%%
^[^qA-Z]*q[a-pr-z0-9]*4\n {printf("userid1, userid2 \n"); return 1;}
%%
int yywrap(void){return 1;}
int main( int argc, char **argv )
{
++argv, --argc; /* skip over program name */
if ( argc > 0 )
yyin = fopen( argv[0], "r" );
else
yyin = stdin;
while (yylex());
}
Resolved dumb question
I don't know what you are trying to do, so I'll focus on the immediate issue, which is your last pattern:
^[^qA-Z]*q[a-pr-z0-9]*4[a-pr-z0-9]*4[a-pr-z0-9]*\n
That pattern starts by matching [^qA-Z]*, which is any number of anything which is not a q nor a capital letter (A-Z). Then it matches a q.
Here it's worth considering all the things which are not a q nor a capital letter (A-Z). Obviously, that includes lower-case letters such as s (other than q). It also includes digits. And it includes any other character: punctuation, whitespace, even control characters. In particular, it includes a newline character.
So when you type
10s10<newline>
That certainly could be the start of the last pattern. The scanner hasn't yet seen a q so it doesn't know whether the pattern will eventually match, but it hasn't yet failed. So it keeps on reading more characters, including more newlines.
When you eventually type a q, the scanner can continue with the rest of the pattern. Depending on what you type next, it might or might not be able to continue. If, as seems likely, your input eventually fails to match the pattern, the lexer will fall back to the longest successful match, which is the first pattern. At that point, it will perform the first action.
Negative character classes need to be used with a bit of caution. It's easy to fall into the trap of thinking that "not ..." only includes "reasonable" input. But it includes everything. Often, as in this case, you'll want to at least exclude newlines.
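For example (just an illustration applied to the pattern quoted above), adding \n to the negated character class keeps a tentative match from spanning line boundaries:
^[^qA-Z\n]*q[a-pr-z0-9]*4[a-pr-z0-9]*4[a-pr-z0-9]*\n {printf("userid1, userid2 \n"); return 1;}
With that change, a line containing no q can no longer be absorbed as the start of this pattern, so the scanner stops considering it at the newline instead of silently buffering several lines of input.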

Lex parsing to determine

This is an extension to a previous question. I'm trying to parse a .txt file and determine whether each line is valid or invalid according to my rules. The text files will contain an assortment of random strings, hex, integers and decimals separated by a single space, such as:
5 -0xA98F 0XA98H text hello 2.3 -12 0xabc
I'm trying to identify valid hex, integers and decimals and get output like so:
5 valid
-0xA98F valid
0xA98H invalid
text invalid
hello invalid
2.3 valid
-12 valid
0xabc invalid
My current code, however, displays this:
5 valid
-0xa98f valid
0xA98 valid <--- issue 1: just removes the H
2.3 valid <--- ignores text and hello
-12 valid
0xabc invalid
Here is the code I currently have:
%{
#include <iostream>
using namespace std;
%}
Decimal [+-]?[0-9]+\.[0-9]+?
Integers [+-]?[0-9]+
Hex [-]?[0][xX][0-9A-F]+
%%
[ \t\n] ;
{Decimal} {cout << yytext << "Valid" << endl; }
{Integers} { cout << yytext << "Valid" << endl; }
{Hex} {cout << yytext << " Valid" << endl;}
. ;
%%
main() {
FILE *myfile= fopen("something.txt", "r");
if (!myfile) {
cout << "Error" << endl;
return -1;
}
yyin = myfile;
yylex();
fclose(yyin);
}
The key to using flex for problems like this is understanding the "maximal-munch" rule. The rule is simple: Flex always picks the action corresponding to the pattern which matches the longest string (starting with the current input point; flex never "searches" for a match.) If more than one pattern matches the same longest substring, then the first pattern in the flex description is chosen. That means that the order of rules is important.
This is described at more length in the Flex manual section on How the Input is Matched.
So let's suppose that you are interested in matching complete words, where "words" are non-empty sequences of arbitrary non-whitespace characters separated by whitespace. (So, for example, the line 3, 4 and 5. would contain only one valid string.)
It's easy to identify the four possibilities:
Decimal integers
Decimal floating point
Hexadecimal integers
Any other word.
We also need to ignore whitespace, other than recognizing it as a word separator.
If we put the rules in that order, we can be confident that the correct rule will be chosen for each line, because of the maximal munch rule.
So here's the entire flex file (except for the definition of main):
%option noinput nounput noyywrap nodefault
%%
[[:space:]]+ { /* Ignore whitespace */ }
[+-]?[[:digit:]]+ { printf("%s valid\n", yytext); /* Decimal integer */ }
[+-]?[[:digit:]]+"."[[:digit:]]* {
printf("%s valid\n", yytext); /* Decimal point */ }
[+-]?"."[[:digit:]]+ { printf("%s valid\n", yytext); /* Decimal point */ }
[+-]?0[xX][[:xdigit:]]+ { printf("%s valid\n", yytext); /* Hexadecimal integer */ }
[^[:space:]]+ { printf("%s invalid\n", yytext); /* Any word not matched by above rules */ }
Notes
I've used ordinary printf statements here. You're free to use C++ streams, of course; I simply prefer to use either stdio.h or iostreams, but not both. It might be considered cleaner to #include <stdio.h>, but in fact Flex already does that because it needs it for its own purposes.
The %option statement tells flex that you don't need yywrap (which means you don't need to provide one or link with -lfl), that you don't use input or unput (which means you can compile with -Wall without getting unused function warnings) and that you don't expect flex to need to insert a default rule (which saves you from embarrassing errors, because flex will warn you if there is anything which might not match any rule.)
I used [[:xdigit:]]+ in the hexadecimal pattern, which allows both upper- and lower-case hex digits. If that's not desired, you could replace it with [0-9A-F] as in your original code, but your examples seem to indicate that your original code was not correct. Of course, you could write out the POSIX character classes, but I find them more readable. See the Flex manual section on Patterns for a complete list.
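For reference, a minimal main to go with this scanner might look like the following (a sketch only; it assumes the rules above are followed by a closing %% and, like the code in the question, reads something.txt):
int main(void) {
    FILE *myfile = fopen("something.txt", "r");
    if (!myfile) {
        fprintf(stderr, "could not open something.txt\n");
        return 1;
    }
    yyin = myfile;
    yylex();
    fclose(myfile);
    return 0;
}
No extra #include is needed here, because the flex-generated file already pulls in stdio.h, and yyin is declared by flex.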

Flex: Unrecognized rule error

I'm getting an "unrecognized rule" error in Flex. I have read some articles but did not find a solution to my problem. I have tried making some changes to my code, but nothing seems to make it work (sometimes these changes made it even worse). I'm posting my code below hoping a solution can be found.
My flex code:
%{
#include <stdio.h>
%}
VAR_DEFINER "var"
VAR_NAME [a-zA-Z][a-zA-Z0-9_]*
VAR_TYPE "real" | "boolean" | "integer" | "char"
%%
{VAR_DEFINER} {printf("A keyword: %s\n", yytext);}
{VAR_NAME} | ","{VAR_NAME} {printf("A variable name: %s\n", yytext);}
":" {printf("A colon\n");}
{VAR_TYPE}";""\n" {printf("The variable type is: %s\n", yytext);}
"\n"{VAR_DEFINER} {printf("Error: The keyword 'var' is defined once at the beginning.\n");}
[ \t\n]+ /* eat up whitespace */
. {printf("Unrecognized character: %s\n", yytext);}
%%
main(argc, argv)
int argc;
char** argv;
{
++argv, --argc;
if (argc > 0)
yyin = fopen(argv[0],"r");
else
yyin = stdin;
yylex();
}
As you wrote in your own answer to your question, you can fix the errors by being careful with whitespace.
But the underlying problem is that you are trying to let the scanner do work that is better done by the parser. If you want to parse things like var x boolean, then that shouldn't be a single token, discovered by the scanner. The usual, and most often much better, approach is to let the scanner discover three separate tokens (var, x and boolean), and then let the parser group them into a variable declaration.
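As a rough sketch of what the scanner half of that might look like (the token codes below are made-up placeholders for illustration; in a real setup they would normally come from a parser-generated header, e.g. one produced by bison):
%{
/* Placeholder token codes, for illustration only; a real parser would define these. */
enum { VAR = 258, TYPE, NAME };
%}
%option noyywrap
%%
"var"                              { return VAR; }
"real"|"boolean"|"integer"|"char"  { return TYPE; }
[a-zA-Z][a-zA-Z0-9_]*              { return NAME; }
[,:;]                              { return yytext[0]; }
[ \t\n]+                           { /* skip whitespace */ }
.                                  { printf("Unrecognized character: %s\n", yytext); }
%%
The parser is then the place that decides whether a sequence of those tokens forms a valid declaration; the scanner no longer needs to know anything about declaration structure.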
I found the answer on my own. I would like to post it to help anyone else who may have a similar problem, just in case.
My fault was that I left unquoted whitespace between the terms of expressions and between the variable types in the declaration part. For example, I had written VAR_TYPE "real" | "boolean" | "integer" | "char", instead of VAR_TYPE "real"|"boolean"|"integer"|"char" (without whitespace).
So, mind all kinds of brackets and the whitespaces!!!
I hope to have helped!

rule exclusion in flex

I am trying to write a flex file which recognizes (-! comment !-) as one token called comment. The following is my file:
%{
#include <stdio.h>
void showToken(char* name);
void error();
void enter();
int lineNum=1;
%}
%option yylineno
%option noyywrap
whitespace ([\t ])
enter ([\n])
startcomment (\(\-\!)
endcomment (\!\-\))
comment (^\!\-\))
%%
{startcomment}{comment}*{endcomment} showToken("COMMENT");
{enter} enter();
{whitespace}
. error();
%%
void showToken(char* name){
printf("%d %s %s %d% \n",lineNum,name, yytext);
}
void enter(){
lineNum++;
}
void error(){
printf("%d error %s \n",lineNum,yytext);
}
But it fails for a simple (-! comment !-) input: this file does recognize the (-! and !-) delimiters but fails to recognize my comment rule. I did try replacing it with comment (^{endcomment}), but that did not work either. Any suggestions?
You seem to think that ^ means the following pattern should not match, but it actually means "match at the start of a line". Inside a character class, ^ does mean "everything except the listed characters", but outside a character class its meaning is completely different.
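For example (illustrative patterns only, not taken from the question):
^abc      matches abc, but only when it appears at the start of a line
[^abc]    matches any single character that is not a, b or c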
As for an alternative: your problem is similar to matching a C comment, /* comment */. The following expression matches C comments:
"/*"([^*]|"*"+[^/*])*"*"+"/"
Alternatively, and more intuitively (if you like), you can use a sub-automaton (a start condition):
%x comment
%%
"/*" { BEGIN(comment); }
<comment>(.|"\n") { /* Skip */ }
<comment>"*/" { BEGIN(INITIAL); }
%%
I'll leave it as an exercise to apply this to your comment style. Having !-) as the closing of your comment makes the first solution a bit more complicated.
Note that in general the second solution is preferred because it does not require a big buffer. The first solution will create a buffer containing the complete comment (which can be big), whereas the buffer requirement for the second solution is at most two characters.
The easiest way to maintain line numbers is to use %option yylineno, as flex will then keep track of the line number in the variable int yylineno. Alternatively, you can count the number of newlines in yytext. In the second solution you can split the second rule, making a separate case for "\n", and count line numbers there.
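For what it's worth, here is a rough sketch of how the sub-automaton version might be adapted to the (-! ... !-) style, reusing the showToken and enter helpers from the question (untested; note that when showToken is called, yytext only contains the closing delimiter, not the whole comment body):
%x COMMENT
%%
"(-!"             { BEGIN(COMMENT); }
<COMMENT>"!-)"    { BEGIN(INITIAL); showToken("COMMENT"); }
<COMMENT>\n       { enter(); }
<COMMENT>.        { /* skip one character of the comment body */ }
%%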
