rule exclusion in flex

I am trying to write a flex file which recognizes (-! comment !-) as one token called comment. The following is my file:
%{
#include <stdio.h>
void showToken(char* name);
void error();
void enter();
int lineNum=1;
%}
%option yylineno
%option noyywrap
whitespace ([\t ])
enter ([\n])
startcomment (\(\-\!)
endcomment (\!\-\))
comment (^\!\-\))
%%
{startcomment}{comment}*{endcomment} showToken("COMMENT");
{enter} enter();
{whitespace}
. error();
%%
void showToken(char* name){
printf("%d %s %s %d% \n",lineNum,name, yytext);
}
void enter(){
lineNum++;
}
void error(){
printf("%d error %s \n",lineNum,yytext);
}
but it fails for a simple (-! comment !-) input: this file does recognize the (-! and !-), but it fails to match my comment rule. I also tried replacing it with comment (^{endcomment}), but that did not work either. Any suggestions?

You seem to think that ^ means the following pattern should not match, but it actually anchors the pattern to the start of a line. Inside a character class ([^...]) it does mean "anything except these characters", but outside a character class its meaning is entirely different.
As for an alternative: your problem is similar to matching a C comment, /* comment */. The following expression matches a C comment:
"/*"([^*]|"*"+[^/*])*"*"+"/"
Alternatively, and more intuitively (if you like), you can use a sub-automaton (a start condition):
%x comment
%%
"/*" { BEGIN(comment); }
<comment>(.|"\n") { /* Skip */ }
<comment>"*/" { BEGIN(INITIAL); }
%%
I'll leave it as an exercise to apply this to your comment style (a rough sketch is given at the end of this answer). Having !-) as the closing delimiter of your comment makes the first solution a bit more complicated.
Note that in general the second solution is preferred because it does not require a large buffer. The first solution builds a buffer containing the complete comment (which can be big), whereas the buffer requirement for the second solution is at most two characters.
The easiest way to maintain line numbers is to use %option yylineno; flex then keeps track of the line number in the variable int yylineno. Alternatively, you can count the number of newlines in yytext. In the second solution you can split the second rule, making a separate case for "\n", and count line numbers there.
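For reference, here is a minimal, untested sketch of the start-condition approach adapted to your (-! ... !-) delimiters (the printf stands in for your showToken; note that yytext only holds the closing !-) at that point, so if you need the full comment text you would have to collect it yourself, for example with yymore() or a buffer):
%{
#include <stdio.h>
%}
%option noyywrap yylineno
%x COMMENT
%%
"(-!"              { BEGIN(COMMENT); }
<COMMENT>"!-)"     { BEGIN(INITIAL); printf("%d COMMENT\n", yylineno); }
<COMMENT>(.|"\n")  { /* skip the comment body */ }
.|"\n"             { /* your other rules go here */ }
%%
int main(void) { return yylex(); }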

Related

Eliminate characters between numbers in Lex code

How can I eliminate characters between two or more integer numbers in lex code?
Ex: 12bd35
Output: 12 35
Lex builds lexical analyzers, which are intended to split the input into separate tokens. Once you recognize a token, you can ignore it, which is somewhat similar to "eliminating characters". But you always need to recognize the tokens first.
So you might start with the following minimalist scanner:
%option noinput nounput noyywrap
%%
[[:digit:]]+ { ECHO; fputc(' ', yyout); } /* print numbers, each followed by a space. */
[^[:digit:]]+ ; /* ignore everything else. */
And then modify it to fit your actual need.
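For a quick test of the scanner above (a sketch; the file name strip.l is just an assumption), you can append a trivial driver:
%%
int main(void) {
    return yylex();   /* read stdin, write the stripped result to stdout */
}
and then build and run it like this:
flex strip.l
cc lex.yy.c -o strip     # no -lfl needed, thanks to %option noyywrap
echo "12bd35" | ./strip  # should print: 12 35 (with a trailing space)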

Lex parsing to determine

This is an extension to a previous question. I'm trying to parse a .txt file and determine whether each line is valid or invalid according to my rules. The text files will contain an assortment of random strings, hex, integers and decimals separated by a single space, such as:
5 -0xA98F 0XA98H text hello 2.3 -12 0xabc
I'm trying to identify valid hex, integers and decimals and get an output like so.
5 valid
-0xA98F valid
0xA98H invalid
text invalid
hello invalid
2.3 valid
-12 valid
0xabc invalid
My current code, however, displays this:
5 valid
-0xa98f valid
0xA98 valid <--- issue 1: it just removes the H
2.3 valid <--- ignores text and hello
-12 valid
0xabc invalid
Here is the code I currently have:
%{
#include <iostream>
using namespace std;
%}
Decimal [+-]?[0-9]+\.[0-9]+?
Integers [+-]?[0-9]+
Hex [-]?[0][xX][0-9A-F]+
%%
[ \t\n] ;
{Decimal} {cout << yytext << "Valid" << endl; }
{Integers} { cout << yytext << "Valid" << endl; }
{Hex} {cout << yytext << " Valid" << endl;}
. ;
%%
int main() {
FILE *myfile= fopen("something.txt", "r");
if (!myfile) {
cout << "Error" << endl;
return -1;
}
yyin = myfile;
yylex();
fclose(yyin);
}
The key to using flex for problems like this is understanding the "maximal-munch" rule. The rule is simple: Flex always picks the action corresponding to the pattern which matches the longest string (starting at the current input point; flex never "searches" for a match). If more than one pattern matches the same longest string, the first such pattern in the flex description is chosen. That means that the order of rules is important. (It also explains your output: for 0xA98H, your Hex pattern matched the longest prefix, 0xA98, and the leftover H was then silently consumed by your . rule, just as text and hello were consumed character by character.)
This is described at more length in the Flex manual section on How the Input is Matched.
So let's suppose that you are interested in matching complete words, where "words" are non-empty sequences of arbitrary non-whitespace characters separated by whitespace. (So, for example, the line "3, 4 and 5." splits into the four words "3,", "4", "and" and "5.", and each word is judged valid or invalid as a whole.)
It's easy to identify the four possibilities:
Decimal integers
Decimal floating point
Hexadecimal integers
Any other word.
We also need to ignore whitespace, other than recognizing it as a word separator.
If we put the rules in that order, we can be confident that the correct rule will be chosen for each line, because of the maximal munch rule.
So here's the entire flex file (except for the definition of main):
%option noinput nounput noyywrap nodefault
%%
[[:space:]]+ { /* Ignore whitespace */ }
[+-]?[[:digit:]]+ { printf("%s valid\n", yytext); /* Decimal integer */ }
[+-]?[[:digit:]]+"."[[:digit:]]* {
printf("%s valid\n", yytext); /* Decimal point */ }
[+-]?"."[[:digit:]]+ { printf("%s valid\n", yytext); /* Decimal point */ }
[+-]?0[xX][[:xdigit:]]+ { printf("%s valid\n", yytext); /* Hexadecimal integer */ }
[^[:space:]]+ { printf("%s invalid\n", yytext); /* Any word not matched by above rules */ }
Notes
I've used ordinary printf calls here. You're free to use C++ streams, of course, but I prefer to use either stdio.h or iostreams, not both. It might be considered cleaner to #include <stdio.h> explicitly, but in fact flex already does that because it needs it for its own purposes.
The %option statement tells flex that you don't need yywrap (which means you don't need to provide one or link with -lfl), that you don't use input or unput (which means you can compile with -Wall without getting unused function warnings) and that you don't expect flex to need to insert a default rule (which saves you from embarrassing errors, because flex will warn you if there is anything which might not match any rule.)
I used [[:xdigit:]]+ in the hexadecimal pattern, which allows both upper- and lower-case hex digits. If that's not desired, you could replace it with [0-9A-F]+ as in your original code; your expected output (0xabc invalid) suggests that uppercase-only is what you intend. Of course, you could spell the character classes out as explicit ranges instead of using the POSIX classes, but I find the named classes more readable. See the Flex manual section on Patterns for a complete list.
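For completeness, here is a possible trailer for that file (a sketch only, reusing your file-opening logic; the file name something.txt is taken from the question):
%%
int main(void) {
    FILE *myfile = fopen("something.txt", "r");
    if (!myfile) {
        fprintf(stderr, "Error opening something.txt\n");
        return 1;
    }
    yyin = myfile;    /* scan the file instead of stdin */
    yylex();
    fclose(myfile);
    return 0;
}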

Valid regular expression for identifier using flex

I'm trying to make a regular expression that will only work when a valid identifier name is given, using flex (the name cannot start with a number). I'm using this code:
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
%}
%%
"if" { printf("IF "); }
[a-zA-Z_][a-zA-Z_0-9]* { printf("%s ", yytext); }
%%
int main() {
yylex();
}
but it is not working. How can I make sure that flex accepts only a valid identifier?
When I provide the input:
if
abc
9abc
I see the following output:
IF
abc
9abc
but I expected:
IF
abc
(nothing)
Your patterns do not match all possible inputs.
In such cases, (f)lex adds a default catch-all rule, of the form
.|\n { ECHO; }
In other words, any character not recognized by your patterns will simply be printed on stdout. That will be the case with the newline characters in your input, as well as with the digit 9. After the 9 is recognized by the default rule, the remaining input will again be recognized by your identifier rule.
So you probably wanted something like this:
%option warn nodefault
%%
[[:space:]]+ ; /* Ignore whitespace */
"if" { /* TODO: Handle an "if" token */ }
[[:alpha:]_][[:alnum:]_]* { /* TODO: Handle an identifier token */ }
. { /* TODO: Handle an error */ }
Instead of printing information to stdout in an action as a debugging or learning aid, I strongly suggest you build your scanner with the -d (or --debug) option. The generated scanner will then automatically report, in a consistent and complete manner, which rule matched each token; it would have told you that the default rule was being matched, for example.
Notes:
%option nodefault tells flex not to insert a default rule. I recommend always using it, because it will keep you out of trouble. The warn option ensures that a warning is issued in this case; I think that warn is default flex behaviour but the manual suggests using it and it cannot hurt.
It's good style to use standard character class expressions. Inside a character class ([…]), [:xxx:] matches anything for which the standard library function isxxx would return true. So [[:space:]]+ matches one or more whitespace characters, including space, tab, and newline (and some others), [[:alpha:]_] matches any letter or an underscore, and [[:alnum:]_]* matches any number (including 0) of letters, digits, or underscores. See the Patterns section of the manual.
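As a concrete sketch (with the TODO actions replaced by simple printouts, and noyywrap added so the scanner links without -lfl), the whole file might look like this:
%option warn nodefault noyywrap
%%
[[:space:]]+              ;  /* ignore whitespace */
"if"                      { printf("IF "); }
[[:alpha:]_][[:alnum:]_]* { printf("ID(%s) ", yytext); }
.                         { fprintf(stderr, "unexpected character '%c'\n", *yytext); }
%%
int main(void) { return yylex(); }
With the input 9abc, this reports the stray 9 on stderr and then scans abc as an identifier; if you want the whole word rejected instead, the word-at-a-time approach from the previous answer (a [^[:space:]]+ fallback) is the way to go.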

How to make lex/flex recognize tokens not separated by whitespace?

I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens. That is, tokens not separated by whitespace. E.g.: the string 39if is supposed to be recognized as the number 39 and the keyword if. Simultaneously, the lexer must also exit(1) when it encounters invalid input.
A simplified version of the code I have:
%{
#include <stdio.h>
%}
%option main warn debug
%%
if |
then |
else printf("keyword: %s\n", yytext);
[[:digit:]]+ printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]* printf("identifier: %s\n", yytext);
[[:space:]]+ // skip whitespace
[[:^space:]]+ { printf("ERROR: %s\n", yytext); exit(1); }
%%
When I run this (or my complete version), and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:
number: 39
keyword: if
(I.e. the same as if I entered 39 if as the input.)
Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, and flex will prefer it. That said, I have no idea how to resolve this situation. It seems unfeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.
UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".
You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the scanner currently is would be the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways -- look in the manual for the ways to specify your own I/O functions, for example -- but the all-around simplest, IMHO, is to use a start condition:
%x LEXING_ERROR
%%
// all your rules; the following *must* be at the end
. { BEGIN(LEXING_ERROR); yyless(0); }
<LEXING_ERROR>.+ { fprintf(stderr,
"Invalid character '%c' found at line %d,"
" just before '%s'\n",
*yytext, yylineno, yytext+1);
exit(1);
}
Note: Make sure that you've ignored whitespace in your rules. The pattern .+ matches one or more non-newline characters, in other words everything up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(n) returns all but the first n characters of the current match to the input, so yyless(0) pushes the offending character back to be rescanned; the <LEXING_ERROR> rule then starts at that character and produces (hopefully) a semi-reasonable error message. (It won't really be reasonable if your input is multibyte, or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex which gets more context, and maybe even limits the number of forward characters read. Lots of options here.)
Look up start conditions in the flex manual for more information about %x and BEGIN.
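Tying this back to your scanner (a sketch, with the error reporting simplified back to an exit call; the start-condition version above can be dropped in instead): if the greedy [[:^space:]]+ rule is replaced by a single-character fallback, 39if is tokenized as the number 39 followed by the keyword if, because at each point the longest match wins:
%{
#include <stdio.h>
#include <stdlib.h>
%}
%option main warn
%%
if                      |
then                    |
else                    { printf("keyword: %s\n", yytext); }
[[:digit:]]+            { printf("number: %s\n", yytext); }
[[:alpha:]][[:alnum:]]* { printf("identifier: %s\n", yytext); }
[[:space:]]+            { /* skip whitespace */ }
.                       { fprintf(stderr, "ERROR: unexpected '%c'\n", *yytext); exit(1); }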

Bison: How to ignore a token if it doesn't fit into a rule

I'm writing a program that handles comments as well as a few other things. If a comment is in a specific place, then my program does something.
Flex passes a token upon finding a comment, and Bison then looks to see if that token fits into a particular rule. If it does, then it takes an action associated with that rule.
Here's the thing: the input I'm receiving might actually have comments in the wrong places. In this case, I just want to ignore the comment rather than flagging an error.
My question:
How can I use a token if it fits into a rule, but ignore it if it doesn't? Can I make a token "optional"?
(Note: The only way I can think of doing this right now is scattering the comment token in every possible place in every possible rule. There MUST be a better solution than this. Maybe some rule involving the root?)
One solution may be to use bison's error recovery (see the Bison manual).
To summarize, bison defines the terminal token error to represent an error (say, a comment token returned in the wrong place). That way, you can (for example) close parentheses or braces after the wayward comment is found. However, this method will probably discard a certain amount of parsing, because I don't think bison can "undo" reductions. ("Flagging" the error, as with printing a message to stderr, is not related to this: you can have an error without printing an error--it depends on how you define yyerror.)
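As a generic fragment (the nonterminal names stmt and expr here are hypothetical, not taken from your grammar), error recovery in a bison rule usually looks something like this:
stmt
    : expr ';'       { /* normal handling */ }
    | error ';'      { yyerrok;  /* discard input up to the next ';', then resume; yyerrok re-enables normal error reporting */ }
    ;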
You may instead want to wrap each terminal in a special nonterminal:
term_wrap: comment TERM
This effectively does what you're scared to do (put the comment token into every single rule), but it does so in fewer places.
To force myself to eat my own dog food, I made up a silly language for myself. The only syntax is print <number> please, but if there's (at least) one comment (##) between the number and the please, it prints the number in hexadecimal, instead.
Like this:
print 1 please
1
## print 2 please
2
print ## 3 please
3
print 4 ## please
0x4
print 5 ## ## please
0x5
print 6 please ##
6
My lexer:
%{
#include <stdio.h>
#include <stdlib.h>
#include "y.tab.h"
%}
%%
print return PRINT;
[[:digit:]]+ yylval = atoi(yytext); return NUMBER;
please return PLEASE;
## return COMMENT;
[[:space:]]+ /* ignore */
. /* ditto */
and the parser:
%debug
%error-verbose
%verbose
%locations
%{
#include <stdio.h>
#include <string.h>
void yyerror(const char *str) {
fprintf(stderr, "error: %s\n", str);
}
int yywrap() {
return 1;
}
extern int yydebug;
int main(void) {
yydebug = 0;
yyparse();
}
%}
%token PRINT NUMBER COMMENT PLEASE
%%
commands: /* empty */
|
commands command
;
command: print number comment please {
if ($3) {
printf("%#x", $2);
} else {
printf("%d", $2);
}
printf("\n");
}
;
print: comment PRINT
;
number: comment NUMBER {
$$ = $2;
}
;
please: comment PLEASE
;
comment: /* empty */ {
$$ = 0;
}
|
comment COMMENT {
$$ = 1;
}
;
So, as you can see, not exactly rocket science, but it does the trick. There's a shift/reduce conflict in there, because of the empty string matching comment in multiple places. Also, there's no rule to fit comments in between the final please and EOF. But overall, I think it's a good example.
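If you want to try it out, the two files can be built roughly like this (the file names lexer.l and parser.y are assumptions; bison's -y option produces the y.tab.c / y.tab.h names that the lexer includes):
flex lexer.l
bison -y -d parser.y
cc -o print y.tab.c lex.yy.c
echo "print 4 ## please" | ./print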
Treat comments as whitespace at the lexer level.
But keep two separate rules, one for whitespace and one for comments, both returning the same token ID.
The rule for comments (+ optional whitespace) keeps track of the comment in a dedicated structure.
The rule for whitespace resets the structure.
When you enter that “specific place”, look if the last whitespace was a comment or trigger an error.
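A rough sketch of that idea (everything here is hypothetical: WS is the shared token ID, assumed to be defined in y.tab.h, and the "dedicated structure" is reduced to a single flag):
%{
#include "y.tab.h"            /* assumed to define the shared WS token */
int last_ws_was_comment = 0;  /* the "dedicated structure", reduced to a flag */
%}
%%
"##"[[:blank:]]*    { last_ws_was_comment = 1; return WS; }   /* comment, plus optional trailing whitespace */
[[:space:]]+        { last_ws_was_comment = 0; return WS; }   /* plain whitespace resets it */
In the action for that "specific place" in the grammar, you would then test last_ws_was_comment instead of expecting a separate COMMENT token.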
