flex scanner push-back overflow with automata - flex-lexer

I am having a hard time with this problem.
"Write a flex code which recognizes a chain with alphabet {0,1}, with at least 5 char's, and to every consecutive 5 char's there will bee at least 3 1's"
I thought I have solved, but I am new using flex, so I am getting this "flex scanner push-back overflow".
here's my code
%{
#define ACCEPT 1
#define DONT 2
%}
delim [ \t\n\r]
ws {delim}+
comb01 00111|{comb06}1
comb02 01011|{comb07}1
comb03 01101|{comb08}1
comb04 01110|({comb01}|{comb09})0
comb05 01111|({comb01}|{comb09})1
comb06 10011|{comb10}1
comb07 10101|{comb11}1
comb08 10110|({comb02}|{comb12})0
comb09 10111|({comb02}|{comb12})1
comb10 11001|{comb13}1
comb11 11010|({comb03}|{comb14})0
comb12 11011|({comb03}|{comb14})1
comb13 11100|({comb04}|{comb15})0
comb14 11101|({comb04}|{comb15})1
comb15 11110|({comb05}|{comb16})0
comb16 11111|({comb05}|{comb16})1
accept {comb01}|{comb02}|{comb03}|{comb04}|{comb05}|{comb06}|{comb07}|{comb08}|{comb09}|{comb10}|{comb11}|{comb12}|{comb13}|{comb14}|{comb15}|{comb16}
string [^ \t\n\r]+
%%
{ws} { ;}
{accept} {return ACCEPT;}
{string} {return DONT;}
%%
void main () {
int i;
while (i = yylex ())
switch (i) {
case ACCEPT:
printf ("%-20s: ACCEPT\n", yytext);
break;
case DONT:
printf ("%-20s: Reject\n", yytext);
break;
}
}

Flex definitions are macros, and flex implements them that way: when it sees {defn} in a pattern, it replaces it with whatever defn was defined as (in parentheses, usually, to avoid operator precedence issues). It doesn't expand the macros in the macro definition, so the macro substitution might contain more definition references which in turn need to be substituted.
Since macro substitution is unconditional, it is not possible to use recursive macros, including macros which are indirectly recursive. Which yours are. Flex doesn't check for this condition, unlike the C preprocessor; it just continues substituting in an endless loop until it runs out of space.
(Flex is implemented using itself; it does the macro substitution using unput. unput will not resize the input buffer, so "runs out of space" here means that flex's internal flex's input buffer became full of macro substitutions.)
The strategy you are using would work fine as a context-free grammar. But that's not flex. Flex is about regular expressions. The pattern you want to match can be described by a regular expression -- the "grammar" you wrote with flex macros is a regular grammar -- but it is not a regular expression and flex won't make one out of it for you, unfortunately. That's your job.
I don't think it's going to be a very pretty regular expression. In fact, I think it's likely to be enormous. But I didn't try working it out..
There are flex tricks you could use to avoid constructing the regular expression. For example, you could build your state machine out of flex start conditions and then scan one character at a time, where each character scanned does a state transition or throws an error. (Use more() if you want to return the entire string scanned at the end.)

Related

What is the difference between word and word in quotations in FLEX [duplicate]

I am writing a simple scanner in flex. I want my scanner to print out "integer type seen" when it sees the keyword "int". Is there any difference between the following two ways?
1st way:
%%
int printf("integer type seen");
%%
2nd way:
%%
"int" printf("integer type seen");
%%
So, is there a difference between writing if or "if"? Also, for example when we see a == operator, we print something. Is there a difference between writing == or "==" in the flex file?
There's no difference in these specific cases -- the quotes(") just tell lex to NOT interpret any special characters (eg, for regular expressions) in the quoted string, but if there are no special characters involved, they don't matter:
[a-z] printf("matched a single letter\n");
"[a-z]" printf("matched the 5-character string '[a-z]'\n");
0* printf("matched zero or more zero characters\n");
"0*" printf("matched a zero followed by an asterisk\n");
Characters that are special and mean something different outside of quotes include . * + ? | ^ $ < > [ ] ( ) { } /. Some of those only have special meaning if they appear at certain places, but its generally clearer to quote them regardless of where they appear if you want to match the literal characters.

Eliminate characters between numbers in Lex code

How can I eliminate characters between two or more integer numbers in lex code?
Ex:12bd35
output:12 35
Lex builds lexical analyzers, which are intended to split the input into separate tokens. Once you recognize a token, you can ignore it, which is somewhat similar to "eliminating characters". But you always need to recognise them.
So you might start with the following minimalist scanner:
%option noinput nounput noyywrap
%%
[[:digit:]]+ { ECHO; fputc(' ', yyout); } /* print numbers.
[^[:digit:]]+ ; /* ignore everything else. */
And then modify it to fit your actual need.

Lex won't recognize double operators- !=, :=, <<, etc. Can I give a Lex expression precedence?

Trying to parse operators (+, -, =, <<, !=), using states like
%{
%}
OP ["+"|";"|":"|","|"*"|"/"|"="|"("|")"|"{"|"}"|"*"|"#"|"$"|
"<"|">"|"&"|"|"|"!"|]
DOUBOP [":="|".."|"<<"|">>"|"<>"|"<="|">="|"=>"|"**"|"!="|"{:"|"}:"|"\-"]
and later on
{DOUBOP} { printf("%s (operator)\n", yytext); }
{OP} { printf("%s (operator)\n", yytext); }
but Lex is identifying operators like "<<" as "<" and "<". I thought since it was in double quotes this would work, but I see that's not the case.
Is there anyway I can give a regular expression precedence, ie have lex check for a double operator first, and then a single operator?
Thanks in advance.
[...] is a character class, not an eccentric type of parenthesis. If you want to parenthesize a sub-expression in a pattern, use ordinary parentheses. In this case, however, parentheses are not necessary. (Indeed, most of the quotes aren't necessary either, but they don't hurt and some of them would be useful.)
"==" recognises the two character-sequence consisting of two equal signs. "=="|"++" recognizes either two equal signs or two plus signs.
By contrast, ["=="] recognises a single character, which could be either a quote or an equals sign. Since a character class is a set, the fact that each of those appears twice is irrelevant (although I think it would save a lot of grief if flex issued a warning). Similarly, ["=="|"<<"] recognises a single character if it is a quote, an equals sign, a vertical bar or a less than sign.
Flex pattern syntax is documented in the flex manual. It differs in a few ways from regexes in other systems, so it's worth reading the short document. However, character classes are mostly the same in all regex syntaxes in common use, especially the use of square brackets to delimit the set.
An easier way is to put all single characters together, and run the * command on the end up curly braces.
i.e.
OP ["+"|";"|":"|","|"*"|"/"|"="|"("|")"|"{"|"}"|"*"|"#"|"$"|
"<"|">"|"&"|"|"|"!"|]*

How to make lex/flex recognize tokens not separated by whitespace?

I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens. That is, tokens not separated by whitespace. E.g.: the string 39if is supposed to be recognized as the number 39 and the keyword if. Simultaneously, the lexer must also exit(1) when it encounters invalid input.
A simplified version of the code I have:
%{
#include <stdio.h>
%}
%option main warn debug
%%
if |
then |
else printf("keyword: %s\n", yytext);
[[:digit:]]+ printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]* printf("identifier: %s\n", yytext);
[[:space:]]+ // skip whitespace
[[:^space:]]+ { printf("ERROR: %s\n", yytext); exit(1); }
%%
When I run this (or my complete version), and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:
number: 39
keyword: if
(I.e. the same as if I entered 39 if as the input.)
Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, and flex will prefer it. That said, I have no idea how to resolve this situation. It seems unfeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.
UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".
You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the parsing is at is to use the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways -- look in the manual for the ways to specify your own i/o functions, for example, -- but the all around simplest IMHO is to use a start condition:
%x LEXING_ERROR
%%
// all your rules; the following *must* be at the end
. { BEGIN(LEXING_ERROR); yyless(1); }
<LEXING_ERROR>.+ { fprintf(stderr,
"Invalid character '%c' found at line %d,"
" just before '%s'\n",
*yytext, yylineno, yytext+1);
exit(1);
}
Note: Make sure that you've ignored whitespace in your rules. The pattern .+ matches any number but at least one non-newline character, or in other words up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(n) backs up the read pointer by n characters, so after the . rule matches, it will rescan that character producing (hopefully) a semi-reasonable error message. (It won't really be reasonable if your input is multibyte, or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex which gets more context, and maybe even limits the number of forward characters read. Lots of options here.)
Look up start conditions in the flex manual for more info about %x and BEGIN

Create a Print Function

I'm learning Bison and at this time the only thing that I do was the rpcalc example, but now I want to implement a print function(like printf of C), but I don't know how to do this and I'm planning to have a syntax like this print ("Something here");, but I don't know how to build the print function and I don't know how to create that ; as a end of line. Thanks for your help.
You first need to ask yourself:
What are the [sub-]parts of my 'print ("something");' syntax ?
Once you identify these parts, "simply" describe them in the form of grammar syntax rules, along with applicable production rules. And then let Bison generate the parser for you; that's about it.
To put you on your way:
The semi-column is probably a element you will use to separate statemements (such a one "call" to print from another).
'print' itself is probably a keyword, or preferably a native function name of your language.
The print statement appears to take a literal string as [one of] its arguments. a literal string starts and ends with a double quote (and probably allow for escaped quotes within itself)
etc.
The bolded and italic expressions above are some of the entities (the 'symbols' in parser lingo) you'll likely need to define in the syntax for your language. For that you'll use Bison grammar rules, such as
stmt : print_stmt ';' | input_stmt ';'| some_other_stmt ';' ;
prnt_stmt : print '(' args ')'
{ printf( $3 ); }
;
args : arg ',' args;
...
Since the question asked about the semi-column, maybe some confusion was from the different uses thereof; see for example above how the ';' belong to your language's syntax whereby the ; (no quotes) at the end of each grammar rule are part of Bison's language.
Note: this is of course a simplistic implementation, aimed at showing the essential. Also the Bison syntax may be a tat off (been there / done it, but a long while back ;-) I then "met" ANTLR never to return to Bison, although I do see how its lightweight and fully self contained nature can make it appropriate in some cases)

Resources