I'm trying to identify hex numbers in a parsed text file. Everything is about 99% accurate, but I keep having an issue with one particular instance: 0xa98h. Whenever the scanner finds this line it outputs 0xa98 instead of ignoring the token altogether, even though it is not valid. I've tried many variations of this code and have yet to find a way to exclude it.
[-]?[0][x|X][0-9A-F]+ {cout << yytext << " Number" << endl; }
The pattern for hex numbers does not consider digits 'a' ... 'f'. Try this:
[-]?[0][xX][0-9a-fA-F]+ {cout << yytext << " Number" << endl; }
Further observations:
The vertical bar in [x|X] is probably wrong: inside a character class, | is a literal character, not alternation, so input like 0|a98 would also be matched as a number.
The 'h' at end of sample is not matched. (This may or may not be intended.)
An alternative approach could be this (test-hex.l):
%{
#include <iostream>
using namespace std;
%}
%option caseless
%%
[-]?[0][x][0-9a-f]+ {cout << yytext << " Number" << endl; }
%%
int main(int argc, char **argv) { return yylex(); }
int yywrap() { return 1; }
Compiled and tested with flex and gcc on cygwin:
$ flex -V
flex 2.6.3
$ flex -otest-hex.cc test-hex.l ; g++ -o test-hex test-hex.cc
$ echo '0xa98h' | ./test-hex
0xa98 Number
h
There is no pattern matching h. It is printed because lex/flex generates a default rule that echoes everything that is not matched to standard output.
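If you would rather drop (or flag) unmatched input instead of echoing it, a sketch like the following should work: %option nodefault makes flex warn at build time if any input can fall through, and an explicit catch-all rule decides what happens at run time.

```lex
%option nodefault
%%
[-]?0[xX][0-9a-fA-F]+ { cout << yytext << " Number" << endl; }
.|\n                  { /* swallow anything else instead of echoing it */ }
```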
Related
I've been struggling to understand some behavior of flex.
I started defining a small toy-like example program which will tokenize into keywords and strings.
One definition of the regex performs as expected but another behaves quite differently, contrary to my expectation.
It has been a few years since I've played with this stuff so hopefully someone can point me in the right direction.
I modified the token regular expression to get it to work but I'd really like to understand why my original choice behaved differently.
This first example is the non-working code
%{
#include <iostream>
using namespace std;
%}
%option noyywrap
%%
[ \t\n] {cout << "ws" << endl;};
buzz {cout << "kw" << endl;};
[^\n]+ {cout << "str" << endl;};
%%
int main(){
yylex();
}
The second example is the modified version which does behave properly.
%{
#include <iostream>
using namespace std;
%}
%option noyywrap
%%
[ \t\n] {cout << "ws" << endl;};
buzz {cout << "kw" << endl;};
[a-zA-Z]+ {cout << "str" << endl;};
%%
int main(){
yylex();
}
In the code, buzz is supposed to be a keyword, and anything following should be just read as a string.
For the first example, buzz gets consumed up along with the remaining word as a "str".
In the second example, buzz is properly recognized and the remaining word becomes the "str".
I understand that the third rule in both cases is also a valid definition for a token containing the characters b-u-z-z. Each of these four letters is in [^\n]+, as well as [a-zA-Z]+. So why on earth is the behavior different?
example inputs would be:
buzz lightyear
buzz aldren
Thanks!
Flex (as well as most other lexer generators) works according to the maximum munch rule. That rule says that if multiple patterns can match on the current input, the one that produces the longest match is chosen. If multiple patterns produce a match of the same size, the one that appears first in the .l file is chosen.
So in your working solution the patterns buzz and [a-zA-Z]+ both match buzz, so buzz is chosen because it appears first in the file (if you switched the two lines, str would be printed instead). In your non-working solution buzz still would only match buzz, but [^\n]+ matches buzz lightyear and buzz aldren respectively, which is the longer match. Thus it wins according to the maximum munch rule.
This is an extension to a previous question. I'm trying to parse a .txt file and determine whether each line is valid or invalid according to my rules. The text files will contain an assortment of random strings, hex numbers, integers and decimals separated by a single space, such as:
5 -0xA98F 0XA98H text hello 2.3 -12 0xabc
I'm trying to identify valid hex, integers and decimals and get an output like so.
5 valid
-0xA98F valid
0xA98H invalid
text invalid
hello invalid
2.3 valid
-12 valid
0xabc invalid
My current code, however, outputs this:
5 valid
-0xa98f valid
0xA98 valid <--- issue 1: just removes the H
2.3 valid <--- issue 2: silently skips text and hello
-12 valid
0xabc invalid
Here is the code I currently have:
%{
#include <iostream>
using namespace std;
%}
Decimal [+-]?[0-9]+\.[0-9]+?
Integers [+-]?[0-9]+
Hex [-]?[0][xX][0-9A-F]+
%%
[ \t\n] ;
{Decimal} {cout << yytext << "Valid" << endl; }
{Integers} { cout << yytext << "Valid" << endl; }
{Hex} {cout << yytext << " Valid" << endl;}
. ;
%%
int main() {
FILE *myfile= fopen("something.txt", "r");
if (!myfile) {
cout << "Error" << endl;
return -1;
}
yyin = myfile;
yylex();
fclose(yyin);
}
The key to using flex for problems like this is understanding the "maximal-munch" rule. The rule is simple: Flex always picks the action corresponding to the pattern which matches the longest string (starting with the current input point; flex never "searches" for a match.) If more than one pattern matches the same longest substring, then the first pattern in the flex description is chosen. That means that the order of rules is important.
This is described at more length in the Flex manual section on How the Input is Matched.
So let's suppose that you are interested in matching complete words, where "words" are non-empty sequences of arbitrary non-whitespace characters separated by whitespace. (So, for example, the line 3, 4 and 5. would contain only one valid string.)
It's easy to identify the four possibilities:
Decimal integers
Decimal floating point
Hexadecimal integers
Anything other word.
We also need to ignore whitespace, other than recognizing it as a word separator.
If we put the rules in that order, we can be confident that the correct rule will be chosen for each line, because of the maximal munch rule.
So here's the entire flex file (except for the definition of main):
%option noinput nounput noyywrap nodefault
%%
[[:space:]]+ { /* Ignore whitespace */ }
[+-]?[[:digit:]]+ { printf("%s valid\n", yytext); /* Decimal integer */ }
[+-]?[[:digit:]]+"."[[:digit:]]* {
printf("%s valid\n", yytext); /* Decimal point */ }
[+-]?"."[[:digit:]]+ { printf("%s valid\n", yytext); /* Decimal point */ }
[+-]?0[xX][[:xdigit:]]+ { printf("%s valid\n", yytext); /* Hexadecimal integer */ }
[^[:space:]]+ { printf("%s invalid\n", yytext); /* Any word not matched by above rules */ }
Notes
I've used ordinary printf statements here. You're free to use C++ streams, of course, but I prefer to use either stdio.h or iostreams, but not both. It might be considered cleaner to #include <stdio.h>, but in fact Flex already does that because it needs it for its own purposes.
The %option statement tells flex that you don't need yywrap (which means you don't need to provide one or link with -lfl), that you don't use input or unput (which means you can compile with -Wall without getting unused function warnings) and that you don't expect flex to need to insert a default rule (which saves you from embarrassing errors, because flex will warn you if there is anything which might not match any rule.)
I used [[:xdigit:]]+ in the hexadecimal pattern, which allows both upper and lower-case hex digits. If that's not desired, you could replace it with [0-9A-F] as in your original code, but your examples seem to indicate that your original code was not correct. Of course, you could write out the posix character classes, but I find them more readable. See the Flex manual section on Patterns for a complete list.
I get an "unrecognized rule" error in Flex. I have read some articles but I did not find any solution to my problem. I have tried making some changes to my code, but nothing seems to make it work (sometimes these changes even made it worse). I post my code below hoping a solution can be found.
My flex code:
%{
#include <stdio.h>
%}
VAR_DEFINER "var"
VAR_NAME [a-zA-Z][a-zA-Z0-9_]*
VAR_TYPE "real" | "boolean" | "integer" | "char"
%%
{VAR_DEFINER} {printf("A keyword: %s\n", yytext);}
{VAR_NAME} | ","{VAR_NAME} {printf("A variable name: %s\n", yytext);}
":" {printf("A colon\n");}
{VAR_TYPE}";""\n" {printf("The variable type is: %s\n", yytext);}
"\n"{VAR_DEFINER} {printf("Error: The keyword 'var' is defined once at the beginning.\n");}
[ \t\n]+ /* eat up whitespace */
. {printf("Unrecognized character: %s\n", yytext);}
%%
main(argc, argv)
int argc;
char** argv;
{
++argv, --argc;
if (argc > 0)
yyin = fopen(argv[0],"r");
else
yyin = stdin;
yylex();
}
As you wrote in your own answer to your question, you can fix the errors by being careful with whitespace.
But the underlying problem is that you are trying to let the scanner do work that is better done by the parser. If you want to parse things like var x boolean, then that shouldn't be a single token, discovered by the scanner. The usual, and most often much better, approach is to let the scanner discover three separate tokens (var, x and boolean), and then let the parser group them into a variable declaration.
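A minimal sketch of that split (the token names here are hypothetical, not from the question):

```lex
%%
"var"                              { return VAR;  }
"real"|"boolean"|"integer"|"char"  { return TYPE; }
[a-zA-Z][a-zA-Z0-9_]*              { return NAME; }
[ \t\n]+                           { /* skip whitespace */ }
%%
```

The parser side would then group the three tokens with a rule along the lines of `declaration: VAR NAME TYPE ';'`, which also makes the "var appears only once" check a grammar concern instead of a scanner concern.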
I found the answer on my own. I would like to post it to help anyone else who may have a similar problem, just in case.
My mistake was that I left unquoted whitespace between the terms of expressions and between the alternatives in the definitions section. For example, I had written VAR_TYPE "real" | "boolean" | "integer" | "char" instead of VAR_TYPE "real"|"boolean"|"integer"|"char" (without whitespace).
So, mind all kinds of brackets and the whitespace!!!
I hope to have helped!
My parser recognizes the grammar and indicates the correct error line using yylineno. I want to print the symbol which caused the error.
int yyerror(string s)
{
extern int yylineno; // defined and maintained in lex.yy.c
extern char *yytext; // defined and maintained in lex.yy.c
cerr << "error: " << s << " -> " << yytext << " # line " << yylineno << endl;
//exit(1);
}
I get this error when I write something not acceptable by the grammar:
error: syntax error -> Segmentation fault
Am I not supposed to use yytext? If not, what variable contains the symbol that caused the syntax error?
Thanks
Depending on the version of lex you are using, yytext may be an array or may be a pointer. Since it is defined in a different compilation unit, if it is an array and you declare it as a pointer, you won't see any error messages from the compiler or linker (linkers generally don't do type checking). Instead the code will treat the first several characters of the array as a pointer, try to dereference it, and probably crash.
If you are using flex, you can add a %pointer declaration to the first section of your .l file to ensure that it is a pointer and not an array
Are you using lex or flex? If you're using lex, yytext is a char[], not a char*.
EDIT If you aren't using flex you should be, it is superior in every way and has been from the moment of its appearance nearly 30 years ago. lex was obsoleted on that day.
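To make the point concrete, the extern declaration must agree with the actual definition; both variants are shown below as a sketch (which one is correct depends on your lex/flex and on %array/%pointer):

```c
/* flex default (and with %pointer): yytext is a pointer */
extern char *yytext;

/* AT&T lex, or flex with %array: yytext is an array */
extern char yytext[];
```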
I'm writing a program that handles comments as well as a few other things. If a comment is in a specific place, then my program does something.
Flex passes a token upon finding a comment, and Bison then looks to see if that token fits into a particular rule. If it does, then it takes an action associated with that rule.
Here's the thing: the input I'm receiving might actually have comments in the wrong places. In this case, I just want to ignore the comment rather than flagging an error.
My question:
How can I use a token if it fits into a rule, but ignore it if it doesn't? Can I make a token "optional"?
(Note: The only way I can think of doing this right now is scattering the comment token in every possible place in every possible rule. There MUST be a better solution than this. Maybe some rule involving the root?)
One solution may be to use bison's error recovery (see the Bison manual).
To summarize, bison defines the terminal token error to represent an error (say, a comment token returned in the wrong place). That way, you can (for example) close parentheses or braces after the wayward comment is found. However, this method will probably discard a certain amount of parsing, because I don't think bison can "undo" reductions. ("Flagging" the error, as with printing a message to stderr, is not related to this: you can have an error without printing an error--it depends on how you define yyerror.)
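As a sketch, an error production might look like this (the rule names are invented for illustration); yyerrok tells bison to leave error-recovery mode once the synchronizing ';' has been consumed:

```
statement: expr ';'
         | error ';'   { yyerrok; /* discard tokens up to the next ';' */ }
         ;
```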
You may instead want to wrap each terminal in a special nonterminal:
term_wrap: comment TERM
This effectively does what you're scared to do (put in a comment in every single rule), but it does it in fewer places.
To force myself to eat my own dog food, I made up a silly language for myself. The only syntax is print <number> please, but if there's (at least) one comment (##) between the number and the please, it prints the number in hexadecimal, instead.
Like this:
print 1 please
1
## print 2 please
2
print ## 3 please
3
print 4 ## please
0x4
print 5 ## ## please
0x5
print 6 please ##
6
My lexer:
%{
#include <stdio.h>
#include <stdlib.h>
#include "y.tab.h"
%}
%%
print return PRINT;
[[:digit:]]+ yylval = atoi(yytext); return NUMBER;
please return PLEASE;
## return COMMENT;
[[:space:]]+ /* ignore */
. /* ditto */
and the parser:
%debug
%error-verbose
%verbose
%locations
%{
#include <stdio.h>
#include <string.h>
void yyerror(const char *str) {
fprintf(stderr, "error: %s\n", str);
}
int yywrap() {
return 1;
}
extern int yydebug;
int main(void) {
yydebug = 0;
yyparse();
}
%}
%token PRINT NUMBER COMMENT PLEASE
%%
commands: /* empty */
|
commands command
;
command: print number comment please {
if ($3) {
printf("%#x", $2);
} else {
printf("%d", $2);
}
printf("\n");
}
;
print: comment PRINT
;
number: comment NUMBER {
$$ = $2;
}
;
please: comment PLEASE
;
comment: /* empty */ {
$$ = 0;
}
|
comment COMMENT {
$$ = 1;
}
;
So, as you can see, not exactly rocket science, but it does the trick. There's a shift/reduce conflict in there, because of the empty string matching comment in multiple places. Also, there's no rule to fit comments in between the final please and EOF. But overall, I think it's a good example.
Treat comments as whitespace at the lexer level.
But keep two separate rules, one for whitespace and one for comments, both returning the same token ID.
The rule for comments (+ optional whitespace) keeps track of the comment in a dedicated structure.
The rule for whitespace resets the structure.
When you enter that “specific place”, look if the last whitespace was a comment or trigger an error.
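A sketch of that lexer-level approach (the flag and token name are invented here; WS is the shared token ID returned by both rules). Because the comment rule also consumes any trailing whitespace, the flag survives until the next real token:

```lex
%{
/* Flag the parser consults when it reaches the "specific place". */
int last_ws_was_comment = 0;
%}
%%
"##"[ \t\n]*  { last_ws_was_comment = 1; return WS; }  /* comment (+ whitespace) */
[ \t\n]+      { last_ws_was_comment = 0; return WS; }  /* plain whitespace resets */
```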