Eliminate characters between numbers in Lex code - flex-lexer

How can I eliminate characters between two or more integer numbers in lex code?
Ex:12bd35
output:12 35

Lex builds lexical analyzers, which are intended to split the input into separate tokens. Once you recognize a token, you can ignore it, which is somewhat similar to "eliminating characters". But you always need to recognise them.
So you might start with the following minimalist scanner:
%option noinput nounput noyywrap
%%
[[:digit:]]+ { ECHO; fputc(' ', yyout); } /* print numbers. */
[^[:digit:]]+ ; /* ignore everything else. */
And then modify it to fit your actual need.
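For readers who want to see what those two rules do without running flex, here is a rough plain-C equivalent. It is only a sketch: `keep_numbers` is a hypothetical helper name, not anything flex provides, and it writes to a caller-supplied buffer instead of `yyout`.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of flex): copies every run of digits from
 * `in` to `out`, each followed by one space, and skips everything else.
 * Output is truncated if `out` is too small. Returns the length written. */
size_t keep_numbers(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return 0;
    while (*in != '\0') {
        if (isdigit((unsigned char)*in)) {
            /* Like [[:digit:]]+ : consume the longest run of digits. */
            while (isdigit((unsigned char)*in)) {
                if (n + 1 < outsize)
                    out[n++] = *in;
                in++;
            }
            /* Like fputc(' ', yyout) in the scanner's action. */
            if (n + 1 < outsize)
                out[n++] = ' ';
        } else {
            /* Like [^[:digit:]]+ with an empty action: ignore it. */
            in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Like the scanner, this leaves a space after the last number, so the input 12bd35 produces "12 35 " with a trailing space.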

Related

flex scanner push-back overflow with automata

I am having a hard time with this problem.
"Write a flex code which recognizes a chain with alphabet {0,1}, with at least 5 chars, and in every 5 consecutive chars there will be at least 3 1's"
I thought I had solved it, but I am new to flex, and I am getting this error: "flex scanner push-back overflow".
Here's my code:
%{
#define ACCEPT 1
#define DONT 2
%}
delim [ \t\n\r]
ws {delim}+
comb01 00111|{comb06}1
comb02 01011|{comb07}1
comb03 01101|{comb08}1
comb04 01110|({comb01}|{comb09})0
comb05 01111|({comb01}|{comb09})1
comb06 10011|{comb10}1
comb07 10101|{comb11}1
comb08 10110|({comb02}|{comb12})0
comb09 10111|({comb02}|{comb12})1
comb10 11001|{comb13}1
comb11 11010|({comb03}|{comb14})0
comb12 11011|({comb03}|{comb14})1
comb13 11100|({comb04}|{comb15})0
comb14 11101|({comb04}|{comb15})1
comb15 11110|({comb05}|{comb16})0
comb16 11111|({comb05}|{comb16})1
accept {comb01}|{comb02}|{comb03}|{comb04}|{comb05}|{comb06}|{comb07}|{comb08}|{comb09}|{comb10}|{comb11}|{comb12}|{comb13}|{comb14}|{comb15}|{comb16}
string [^ \t\n\r]+
%%
{ws} { ;}
{accept} {return ACCEPT;}
{string} {return DONT;}
%%
int main(void) {
    int i;
    while ((i = yylex()) != 0)
        switch (i) {
        case ACCEPT:
            printf("%-20s: ACCEPT\n", yytext);
            break;
        case DONT:
            printf("%-20s: Reject\n", yytext);
            break;
        }
    return 0;
}
Flex definitions are macros, and flex implements them that way: when it sees {defn} in a pattern, it replaces it with whatever defn was defined as (in parentheses, usually, to avoid operator precedence issues). It doesn't expand the macros in the macro definition, so the macro substitution might contain more definition references which in turn need to be substituted.
Since macro substitution is unconditional, it is not possible to use recursive macros, including macros which are indirectly recursive. Which yours are. Flex doesn't check for this condition, unlike the C preprocessor; it just continues substituting in an endless loop until it runs out of space.
(Flex is implemented using itself; it does the macro substitution using unput. unput will not resize the input buffer, so "runs out of space" here means that the input buffer of flex's own internal scanner became full of macro substitutions.)
The strategy you are using would work fine as a context-free grammar. But that's not flex. Flex is about regular expressions. The pattern you want to match can be described by a regular expression -- the "grammar" you wrote with flex macros is a regular grammar -- but it is not a regular expression and flex won't make one out of it for you, unfortunately. That's your job.
I don't think it's going to be a very pretty regular expression. In fact, I think it's likely to be enormous. But I didn't try working it out.
There are flex tricks you could use to avoid constructing the regular expression. For example, you could build your state machine out of flex start conditions and then scan one character at a time, where each character scanned does a state transition or throws an error. (Use yymore() if you want to return the entire string scanned at the end.)
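To make the state-machine idea concrete, here is a minimal sketch in plain C rather than in flex start conditions. The state is just the last five characters seen, kept as five bits; in a flex version each reachable window would be its own start condition. The function name `accepts` is hypothetical.

```c
#include <stddef.h>

/* Returns 1 if s consists only of '0' and '1', is at least 5 characters
 * long, and every window of 5 consecutive characters contains at least
 * three '1's. Returns 0 otherwise. */
int accepts(const char *s)
{
    unsigned window = 0;  /* the last (up to) 5 characters, one bit each */
    size_t len = 0;

    for (; *s != '\0'; s++) {
        if (*s != '0' && *s != '1')
            return 0;                      /* not in the alphabet */

        /* State transition: shift in the new character, keep 5 bits. */
        window = ((window << 1) | (unsigned)(*s - '0')) & 0x1Fu;
        len++;

        if (len >= 5) {
            int ones = 0;
            for (unsigned w = window; w != 0; w >>= 1)
                ones += (int)(w & 1u);
            if (ones < 3)
                return 0;                  /* this window has too few 1s */
        }
    }
    return len >= 5;
}
```

With this sketch, 00111 and 10101 are accepted, while 111000 is rejected because its last window, 11000, contains only two 1s.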

Flex scanning, differentiating between string (with single spaces) and padding (more than one space)

I am having trouble getting flex to scan lines that look something like this:
DESCRIPTION This is the device description
I would like the line to be scanned such that DESCRIPTION is one token and "This is the device description" is the other.
I have been playing endlessly with my rules but cannot seem to get it to work.
From the documentation I think I want to implement a rule using
`r/s'
an r but only if it is followed by an s
where spaces are only accepted if they are followed by something that is not whitespace. I have no idea how to write this rule with flex's syntax. In my mind the rule should be something like
[a-zA-Z](" "/[a-zA-Z0-9]|[a-zA-Z0-9])* return IDENTIFIER;
But this is invalid.
I can get the rules to chop the lines up into individual words, but I cannot get them to differentiate between a single space and more than one space. Halp.
This is not really a good match for flex, since the recognition of tokens is context-dependent. You can achieve context-dependent scanning using start conditions but excessive use of start conditions is often an indication that some other scanning mechanism would be better.
Regardless of how you do it, the key is figuring out exactly how to decide on the token division. Consider the following four lines, for example:
DEVICE This is the device
MODE This is the mode
DESCRIPTION This is the device description
UNDOCUMENTED FIELD
Of course, it is possible that the corner cases represented by the third and fourth lines never show up in any of your inputs.
If the first token cannot include whitespace, then the problem is relatively simple, although you still need a start condition (and I'm going to assume you read the documentation linked above):
%x WHITE WORDS
%%
  /* Possibly should be [[:alpha:]] instead of [[:upper:]] */
[[:upper:]]+   { /* copy yytext */; BEGIN(WHITE); return KEYWORD; }
  /* Handle other possible line beginnings */
<WHITE>\n      { /* Blank descriptive text */; BEGIN(INITIAL); }
<WHITE>[ \t]+  { BEGIN(WORDS); }
<WHITE>.       { /* Something not correct in this line */; ... }
<WORDS>.+      { /* copy yytext */; BEGIN(INITIAL); return DESCRIPTION; }
<WORDS>\n      { BEGIN(INITIAL); }
If there might be whitespace in the first token but never two spaces in a row, you could replace the first pattern above with:
[[:alpha:]]+( [[:alpha:]]+)*
which will match any sequence of words (consisting only of letters) where there is exactly one space between successive words. Like the original pattern above, this will end on the first non-alphabetic character found. That error will be detected by the rules in <WHITE>, because any non-whitespace character encountered when that start condition becomes active will be handled by the start condition's default rule (the <WHITE>. rule).
My opinion is that you are using the wrong horse here. lex (flex) should only be used for lexical analysis, and yacc (or bison) for syntactic analysis. Deciding that a single space is not a separator but several spaces are is not an appropriate job for a lexer.
My opinion is that lex should only report words and padding, and that yacc should later recombine words that are not separated by padding elements.
The lex part would be as simple as:
[[:alnum:]_]+ {
    // printf("WORD: >%s<\n", yytext); // for debugging
    return WORD;
}
[[:blank:]]{2,} {
    // printf("PADDING: >%s<\n", yytext);
    return PADDING;
}
and the yacc part would contain:
elt: PADDING
   | ident
   ;
ident: WORD
     | ident WORD
     ;
Actions are omitted here because they depend too much on your actual processing.
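Whichever tool does the work, the token division itself comes down to one rule: a single blank joins words, and a run of two or more blanks separates fields. A minimal C sketch of that rule, assuming blanks are spaces or tabs; `split_at_padding` is a hypothetical helper name and not part of flex or yacc:

```c
#include <stddef.h>

static int is_blank(char c) { return c == ' ' || c == '\t'; }

/* Looks for the first run of two or more blanks in `line`.
 * If found: stores the length of the text before the run in *first_len
 * and returns a pointer to the first character after the run.
 * If not found: stores the length of the whole line (up to '\n' or the
 * end of the string) in *first_len and returns NULL. */
const char *split_at_padding(const char *line, size_t *first_len)
{
    size_t i = 0;

    while (line[i] != '\0' && line[i] != '\n') {
        /* line[i + 1] is safe to read: line[i] is not the terminator. */
        if (is_blank(line[i]) && is_blank(line[i + 1])) {
            size_t j = i;
            while (is_blank(line[j]))
                j++;                 /* skip the whole padding run */
            *first_len = i;
            return line + j;
        }
        i++;
    }
    *first_len = i;
    return NULL;
}
```

A line such as UNDOCUMENTED FIELD, with only a single space, comes back as one field with no description. That is the same corner case the first answer points out.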

Lex parsing to determine

This is an extension to a previous question. I'm trying to parse a .txt file and determine if each line is valid or invalid depending on my rules. The text files will contain an assortment of random strings, hex, integers and decimals separated by a single space such as:
5 -0xA98F 0XA98H text hello 2.3 -12 0xabc
I'm trying to identify valid hex, integers and decimals and get an output like so.
5 valid
-0xA98F valid
0xA98H invalid
text invalid
hello invalid
2.3 valid
-12 valid
0xabc invalid
My current code however displays like so:
5 valid
-0xa98f valid
0xA98 valid <--- issue 1 just removes the H
2.3 valid <--- ignores text and hello
-12 valid
0xabc invalid
Here is the code I currently have:
%{
#include <iostream>
using namespace std;
%}
Decimal [+-]?[0-9]+\.[0-9]+?
Integers [+-]?[0-9]+
Hex [-]?[0][xX][0-9A-F]+
%%
[ \t\n] ;
{Decimal} {cout << yytext << "Valid" << endl; }
{Integers} { cout << yytext << "Valid" << endl; }
{Hex} {cout << yytext << " Valid" << endl;}
. ;
%%
int main() {
    FILE *myfile = fopen("something.txt", "r");
    if (!myfile) {
        cout << "Error" << endl;
        return -1;
    }
    yyin = myfile;
    yylex();
    fclose(yyin);
    return 0;
}
The key to using flex for problems like this is understanding the "maximal-munch" rule. The rule is simple: Flex always picks the action corresponding to the pattern which matches the longest string (starting with the current input point; flex never "searches" for a match.) If more than one pattern matches the same longest substring, then the first pattern in the flex description is chosen. That means that the order of rules is important.
This is described at more length in the Flex manual section on How the Input is Matched.
So let's suppose that you are interested in matching complete words, where "words" are non-empty sequences of arbitrary non-whitespace characters separated by whitespace. (So, for example, in the line 3, 4 and 5. the first word is "3," including the comma, so it is not a valid number.)
It's easy to identify the four possibilities:
1. Decimal integers
2. Decimal floating point
3. Hexadecimal integers
4. Any other word
We also need to ignore whitespace, other than recognizing it as a word separator.
If we put the rules in that order, we can be confident that the correct rule will be chosen for each line, because of the maximal munch rule.
So here's the entire flex file (except for the definition of main):
%option noinput nounput noyywrap nodefault
%%
[[:space:]]+ { /* Ignore whitespace */ }
[+-]?[[:digit:]]+ { printf("%s valid\n", yytext); /* Decimal integer */ }
[+-]?[[:digit:]]+"."[[:digit:]]* {
printf("%s valid\n", yytext); /* Decimal point */ }
[+-]?"."[[:digit:]]+ { printf("%s valid\n", yytext); /* Decimal point */ }
[+-]?0[xX][[:xdigit:]]+ { printf("%s valid\n", yytext); /* Hexadecimal integer */ }
[^[:space:]]+ { printf("%s invalid\n", yytext); /* Any word not matched by above rules */ }
Notes
I've used ordinary printf statements here. You're free to use C++ streams, of course, but I prefer to use either stdio.h or iostreams, but not both. It might be considered cleaner to #include <stdio.h>, but in fact Flex already does that because it needs it for its own purposes.
The %option statement tells flex that you don't need yywrap (which means you don't need to provide one or link with -lfl), that you don't use input or unput (which means you can compile with -Wall without getting unused function warnings) and that you don't expect flex to need to insert a default rule (which saves you from embarrassing errors, because flex will warn you if there is anything which might not match any rule.)
I used [[:xdigit:]]+ in the hexadecimal pattern, which allows both upper and lower-case hex digits. If that's not desired, you could replace it with [0-9A-F] as in your original code, but your examples seem to indicate that your original code was not correct. Of course, you could write out the posix character classes, but I find them more readable. See the Flex manual section on Patterns for a complete list.
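The interaction between maximal munch and rule order can be mimicked in plain C. In the sketch below, each helper returns how many characters its pattern would match at the start of a word. The catch-all pattern always matches the whole word, so a numeric rule can only win (by being listed first) when it also matches the whole word. All function names are hypothetical, and the sketch assumes the sign in the hexadecimal pattern is optional, like the sign in the other numeric patterns.

```c
#include <ctype.h>
#include <string.h>

static size_t sign_len(const char *s) { return (*s == '+' || *s == '-') ? 1 : 0; }

static size_t digits(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

static size_t xdigits(const char *s)
{
    size_t n = 0;
    while (isxdigit((unsigned char)s[n])) n++;
    return n;
}

/* [+-]?[[:digit:]]+ */
static size_t match_int(const char *s)
{
    size_t i = sign_len(s), d = digits(s + i);
    return d ? i + d : 0;
}

/* [+-]?[[:digit:]]+"."[[:digit:]]*  or  [+-]?"."[[:digit:]]+ */
static size_t match_decimal(const char *s)
{
    size_t i = sign_len(s), d = digits(s + i);
    if (s[i + d] != '.')
        return 0;
    size_t f = digits(s + i + d + 1);
    if (d == 0 && f == 0)
        return 0;               /* a lone "." is not a number */
    return i + d + 1 + f;
}

/* [+-]?0[xX][[:xdigit:]]+  (optional sign is an assumption, see above) */
static size_t match_hex(const char *s)
{
    size_t i = sign_len(s);
    if (s[i] != '0' || (s[i + 1] != 'x' && s[i + 1] != 'X'))
        return 0;
    size_t h = xdigits(s + i + 2);
    return h ? i + 2 + h : 0;
}

/* `word` must contain no whitespace. A numeric rule beats the catch-all
 * only if its match is as long as the catch-all's, i.e. the whole word. */
int is_valid_number(const char *word)
{
    size_t all = strlen(word);
    return all > 0 && (match_int(word) == all ||
                       match_decimal(word) == all ||
                       match_hex(word) == all);
}
```

For 0xA98H the hexadecimal helper matches only the first five characters while the catch-all matches all six, so the word is reported as invalid instead of being split into 0xA98 and H.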

Valid regular expression for identifier using flex

I'm trying to make a regular expression that will only work when a valid identifier name is given, using flex (the name cannot start with a number). I'm using this code :
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
%}
%%
"if" { printf("IF "); }
[a-zA-Z_][a-zA-Z_0-9]* { printf("%s ", yytext); }
%%
int main() {
yylex();
}
but it is not working. How can I make sure that flex accepts only a valid identifier?
When I provide the input:
if
abc
9abc
I see the following output:
IF
abc
9abc
but I expected:
IF
abc
(nothing)
Your patterns do not match all possible inputs.
In such cases, (f)lex adds a default catch-all rule, of the form
.|\n { ECHO; }
In other words, any character not recognized by your patterns will simply be printed on stdout. That will be the case with the newline characters in your input, as well as with the digit 9. After the 9 is recognized by the default rule, the remaining input will again be recognized by your identifier rule.
So you probably wanted something like this:
%option warn nodefault
%%
[[:space:]]+ ; /* Ignore whitespace */
"if" { /* TODO: Handle an "if" token */ }
[[:alpha:]_][[:alnum:]_]* { /* TODO: Handle an identifier token */ }
. { /* TODO: Handle an error */ }
Instead of printing information to stdout in an action as a debugging or learning aid, I strongly suggest you use the -d (or --debug) option when you are building your scanner. That will automatically output debugging information in a consistent and complete manner; it would have told you that the default rule was being matched, for example.
Notes:
%option nodefault tells flex not to insert a default rule. I recommend always using it, because it will keep you out of trouble. The warn option ensures that a warning is issued in this case; I think that warn is default flex behaviour but the manual suggests using it and it cannot hurt.
It's good style to use standard character class expressions. Inside a character class ([…]), [:xxx:] matches anything for which the standard library function isxxx would return true. So [[:space:]]+ matches one or more whitespace characters, including space, tab, and newline (and some others), [[:alpha:]_] matches any letter or an underscore, and [[:alnum:]_]* matches any number (including 0) of letters, digits, or underscores. See the Patterns section of the manual.
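Because the character classes correspond to the isxxx functions, the identifier pattern can be written out directly in C. This is only an illustrative sketch (`is_identifier` is a hypothetical name), and unlike the scanner it tests a whole string rather than finding a token at the current input position:

```c
#include <ctype.h>

/* Mirrors the pattern [[:alpha:]_][[:alnum:]_]* applied to a whole string:
 * returns 1 if s is a non-empty identifier, 0 otherwise. */
int is_identifier(const char *s)
{
    /* [[:alpha:]_] : first character must be a letter or underscore. */
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return 0;

    /* [[:alnum:]_]* : any number of letters, digits, or underscores. */
    for (s++; *s != '\0'; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return 0;

    return 1;
}
```

Note that "if" also passes this check. In the scanner it is reported as a keyword only because the "if" rule comes before the identifier rule and both match the same two characters.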

How to make lex/flex recognize tokens not separated by whitespace?

I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens. That is, tokens not separated by whitespace. E.g.: the string 39if is supposed to be recognized as the number 39 and the keyword if. Simultaneously, the lexer must also exit(1) when it encounters invalid input.
A simplified version of the code I have:
%{
#include <stdio.h>
%}
%option main warn debug
%%
if |
then |
else printf("keyword: %s\n", yytext);
[[:digit:]]+ printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]* printf("identifier: %s\n", yytext);
[[:space:]]+ // skip whitespace
[[:^space:]]+ { printf("ERROR: %s\n", yytext); exit(1); }
%%
When I run this (or my complete version), and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:
number: 39
keyword: if
(I.e. the same as if I entered 39 if as the input.)
Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, and flex will prefer it. That said, I have no idea how to resolve this situation. It seems unfeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.
UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".
You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the parsing is at is to use the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways -- look in the manual for the ways to specify your own i/o functions, for example, -- but the all around simplest IMHO is to use a start condition:
%x LEXING_ERROR
%%
  /* all your rules; the following *must* be at the end */
.                 { BEGIN(LEXING_ERROR); yyless(0); }
<LEXING_ERROR>.+  { fprintf(stderr,
                            "Invalid character '%c' found at line %d,"
                            " just before '%s'\n",
                            *yytext, yylineno, yytext+1);
                    exit(1);
                  }
Note: Make sure that you've ignored whitespace in your rules. The pattern .+ matches any number but at least one non-newline character, or in other words up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(n) returns all but the first n characters of the current match to the input, so yyless(0) in the . rule pushes the offending character back; it is then rescanned in the LEXING_ERROR start condition, producing (hopefully) a semi-reasonable error message. yylineno is only kept up to date if you use %option yylineno. (The message won't really be reasonable if your input is multibyte, or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex which gets more context, and maybe even limits the number of forward characters read. Lots of options here.)
Look up start conditions in the flex manual for more information about %x and BEGIN.
