Concatenating to EOF in flex

I have the following line:
<INITIAL><<EOF>> {return 0;}
and I need to ignore the last EOL - \n or \r\n before the EOF.
I can't figure out how to concatenate it to EOF so that it forms a valid regular expression. I've tried:
<INITIAL>((\n)|(\r\n))*<<EOF>> {return 0;}
but it says it's an "unrecognized rule".

<<EOF>> is not really a pattern symbol, since it cannot be part of a pattern. Logically, the EOF marker is not a character; the <<EOF>> pseudo-pattern is the only flex pattern which can be matched by an empty string.
There is no flex pattern symbol which represents end of input and thus it is not possible to express a pattern "followed by EOF".
So you need to work from a different perspective: detect a pattern which is not followed by EOF.
If a pattern is not followed by EOF, it must be followed by at least one character. That we can write using the trailing context operator. Once we've matched those instances of the pattern, any remaining match for the pattern can only be used if that match is followed by EOF, because of the longest match rule:
\r?\n/(.|\n) { /* A new line NOT followed by EOF */ }
\r?\n { /* A new line followed by EOF */ }
We needed to use .|\n in the trailing context because . doesn't match \n. The parentheses are unnecessary because of the precedence of the trailing context operator.
Forcing the detection of trailing context after a newline will make interactive use of this scanner annoying, since if a newline token is returned by the first rule, it will not actually be returned until another line is read.
By the way, there is no need for
<INITIAL><<EOF>> {return 0;}
That is the flex default behaviour on end-of-file, and you only need an <<EOF>> rule if you need to do something prior to returning 0.
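The two-rule idea can also be tried outside flex to see which line breaks each rule would claim. The C code below is only a sketch of the logic, not what flex generates: count_line_breaks and struct eol_counts are hypothetical names, and a NUL-terminated string stands in for the input file, with the terminator playing the role of EOF. Like the trailing context /(.|\n), it looks one character past each line break.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: how many line breaks each rule would match. */
struct eol_counts {
    size_t inner;  /* \r?\n followed by at least one more character */
    size_t last;   /* \r?\n immediately before end of input (0 or 1) */
};

/* Classify every \n or \r\n in a NUL-terminated string.
 * The end of the string stands in for EOF. */
static struct eol_counts count_line_breaks(const char *s)
{
    struct eol_counts c = {0, 0};
    assert(s != NULL);
    for (size_t i = 0; s[i] != '\0'; i++) {
        size_t len = 0;
        if (s[i] == '\n')
            len = 1;
        else if (s[i] == '\r' && s[i + 1] == '\n')
            len = 2;
        if (len == 0)
            continue;            /* ordinary character */
        if (s[i + len] != '\0')
            c.inner++;           /* first rule: more input follows */
        else
            c.last++;            /* second rule: nothing follows */
        i += len - 1;            /* skip the rest of the line break */
    }
    return c;
}
```

For the input "a\nb\n" this reports one inner line break and one final line break, which is the split the two flex rules are meant to produce.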


Flex expression required for validating certain expression based upon the first three characters only

For my parser, for the purpose of this question, any line starting with a single lowercase letter among a set of lowercase letters, followed by the character '=' followed by any other character is a valid line. So, the following are valid lines (all starting from first column):
b=50 70
q=20 Hello There
Any other line is not valid. I need to match the complement: how do I write a flex expression that matches the invalid lines? My confusion arises from ^, which means both start of line and complement of a character class.
I thought ^[abq][=].+ would match the acceptable lines, so merely complementing it with ^ would do. But ^ at the start of an expression always means match at the start of the line. I made a few other attempts, but those did not work either. Though not relevant, the expression is used as the first step to discard invalid SDP lines. See here for details from the relevant SDP RFC, if it matters.
The simplest approach is to always match entire lines (or use different start conditions to lexically analyse the rest of valid lines). Although flex does not have a negation operator (the [^…] negative character class is not an operator), in this case the expressions are pretty simple and can be expressed easily enough. Note that it doesn't matter that the various "invalid line" patterns are not disjoint, since it doesn't matter which one matches a particular invalid line. So here are three patterns which I believe collectively match all invalid lines
[^abqz\n].* { /* Starts with the wrong letter */ }
.[^=\n].* { /* Second character not = */ }
.$ { /* Only one character in line */ }
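As a cross-check on the patterns, the validity test from the question can be written directly in C. This is a sketch under assumptions: is_valid_line is a hypothetical helper, the accepted letters are the a, b, q, z used in the patterns above, and the line is passed without its trailing newline. It follows the question's .+ in treating a line such as b= (nothing after the =) as invalid, a case the three patterns as written do not appear to cover.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if `line` (no trailing newline) starts with one accepted
 * lowercase letter, then '=', then at least one more character.
 * The accepted set "abqz" is an assumption copied from the patterns above. */
static bool is_valid_line(const char *line)
{
    assert(line != NULL);
    if (line[0] == '\0' || strchr("abqz", line[0]) == NULL)
        return false;            /* empty, or starts with the wrong letter */
    if (line[1] != '=')
        return false;            /* one-character line, or second char not '=' */
    return line[2] != '\0';      /* something must follow the '=' */
}
```

Any line for which this returns false should be matched by one of the "invalid line" rules.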

Flex regular expression for comments

I'm trying to learn flex and having trouble with a regular expression to catch comments.
Assuming a comment begins with // and runs to the end of the line, I would like the program to recognize the entire comment and set yytext equal to it.
So far ["//".*$] is not cutting the mustard.
Thank you
Putting your text in square brackets creates a character class matching any one character from among those between the brackets. Quotation marks are not special inside a character class, so the quotes themselves become members of the class. You want something along these lines:
/* definitions (for more readable rules) */
/* The \057 are octal escapes for the '/' character, for clarity: */
CMNT_START \057\057
/* rules */
{CMNT_START}.*$ /* yytext automatically contains the matched text*/;
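For comparison, here is roughly what that rule matches, written as plain C. It is a sketch for illustration only: line_comment_length is a hypothetical helper, and it assumes the scan position is already at the start of a possible comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the length of the comment starting at `s`, or 0 if `s` does not
 * begin with "//". The comment runs up to, but not including, the newline,
 * just as . in the flex pattern never matches a newline. */
static size_t line_comment_length(const char *s)
{
    assert(s != NULL);
    if (strncmp(s, "//", 2) != 0)
        return 0;
    return strcspn(s, "\n");
}
```

The returned length corresponds to yyleng in the flex rule, and the bytes from s up to that length correspond to yytext.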

How to make lex/flex recognize tokens not separated by whitespace?

I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens. That is, tokens not separated by whitespace. E.g.: the string 39if is supposed to be recognized as the number 39 and the keyword if. Simultaneously, the lexer must also exit(1) when it encounters invalid input.
A simplified version of the code I have:
%{
#include <stdio.h>
#include <stdlib.h>
%}
%option main warn debug
%%
if |
then |
else printf("keyword: %s\n", yytext);
[[:digit:]]+ printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]* printf("identifier: %s\n", yytext);
[[:space:]]+ // skip whitespace
[[:^space:]]+ { printf("ERROR: %s\n", yytext); exit(1); }
When I run this (or my complete version), and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:
number: 39
keyword: if
(I.e. the same as if I entered 39 if as the input.)
Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, and flex will prefer it. That said, I have no idea how to resolve this situation. It seems unfeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.
UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".
You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the parsing is at is to use the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways -- look in the manual for the ways to specify your own i/o functions, for example, -- but the all around simplest IMHO is to use a start condition:
%x LEXING_ERROR
%%
// all your rules; the following *must* be at the end
. { BEGIN(LEXING_ERROR); yyless(0); }
<LEXING_ERROR>.+ { fprintf(stderr,
"Invalid character '%c' found at line %d,"
" just before '%s'\n",
*yytext, yylineno, yytext+1);
exit(1); }
Note: Make sure that you've ignored whitespace in your rules. The pattern .+ matches one or more non-newline characters, in other words everything up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(n) returns all but the first n characters of the current match to the input, so yyless(0) in the . rule puts the offending character back; it is then rescanned in the LEXING_ERROR state, producing (hopefully) a semi-reasonable error message. For yylineno to be kept up to date, the scanner also needs %option yylineno. (The message won't really be reasonable if your input is multibyte, or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex which gets more context, and maybe even limits the number of forward characters read. Lots of options here.)
Look up start conditions in the flex manual for more info about %x and BEGIN
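To see why a one-character fallback cures the 39if problem, here is a small hand-written scanner in C that applies the same rules in the same spirit. It is a sketch, not what flex generates: scan, enum tok and the err_pos out-parameter are hypothetical names chosen for this illustration. Because the fallback branch consumes nothing longer than a single character, any valid token that starts at the same position always wins.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum tok { TOK_KEYWORD, TOK_NUMBER, TOK_IDENT };

/* Scan `in`, storing up to `max` token kinds in `out`.
 * Returns the number of tokens found, or -1 on an invalid character,
 * in which case *err_pos is the offset of that character. */
static int scan(const char *in, enum tok *out, int max, size_t *err_pos)
{
    int n = 0;
    size_t i = 0;
    assert(in != NULL && out != NULL && err_pos != NULL);
    while (in[i] != '\0') {
        unsigned char c = (unsigned char)in[i];
        size_t start = i;
        if (isspace(c)) {                 /* skip whitespace */
            i++;
            continue;
        }
        if (isdigit(c)) {                 /* [[:digit:]]+ */
            while (isdigit((unsigned char)in[i]))
                i++;
            if (n < max)
                out[n] = TOK_NUMBER;
        } else if (isalpha(c)) {          /* [[:alpha:]][[:alnum:]]* */
            while (isalnum((unsigned char)in[i]))
                i++;
            size_t len = i - start;
            int kw = (len == 2 && memcmp(in + start, "if", 2) == 0) ||
                     (len == 4 && (memcmp(in + start, "then", 4) == 0 ||
                                   memcmp(in + start, "else", 4) == 0));
            if (n < max)
                out[n] = kw ? TOK_KEYWORD : TOK_IDENT;
        } else {                          /* fallback: one unmatched character */
            *err_pos = i;
            return -1;
        }
        n++;
    }
    return n;
}
```

With the input 39if, the digit branch stops at the first non-digit and the identifier branch then recognizes if as a keyword, so two tokens come back. On an invalid character, err_pos gives the caller enough to print the offending character and the text after it.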

Parsing optional semicolon at statement end

I was writing a parser to parse C-like grammars.
First, it could now parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
The newline is processed by a lexer that I wrote myself (the code is simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
The instruction :PASS here is equivalent to returning nothing in LEX: it drops the currently matched text and moves on to the next token, just as is usually done with whitespace.
Because of this, I can't just simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
end_of_stmt: ';'
| '\n'
So I chose to change the lexer's state dynamically by the parser correspondingly.
Like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
And add a lexer rule that can match new line with the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
This means that when the lexer is in the :STATEMENT_END state, it first increases the line number as usual, then resets the state to the initial one, and then returns a semicolon in place of the newline.
It's strange that it doesn't actually work with following code:
a = 1
b = 2
I debugged it and found that the lexer does not return a ';' as expected when it scans the newline after the number 1; the state-specific rule is never executed.
The code that sets the new state runs only after the lexer has already scanned the newline and returned nothing. That is, things happen in the following order:
1. scan a, = and 1
2. scan the newline and skip it, so the next token read is b
3. the inserted code ({ state = :STATEMENT_END }) is executed
4. an error is raised: unexpected b
This is what I expect:
1. scan a, = and 1
2. find that it matches the rule expr, so reduce into stmt
3. execute the inserted code to set the new lexer state
4. scan the newline and return a ; according to the rule for the new state
5. continue to scan and parse the following line
After some investigation I found that this is probably because YACC uses LALR(1): the parser reads one token ahead first. When the lexer scans the newline, the state is not set yet, so it cannot return the correct token.
My question is: how to make it work as expected? I have no idea on this.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
-c;
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
(c);
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parentheses, so you end up needing an additional state variable to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or defer returning the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis nesting level (+1 for (, -1 for )) and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (an identifier, a constant, a closing parenthesis, or a postfix-only operator), return the extra ;
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (it's now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.

flex usage of (?r-s:pattern)

I am trying to use the regular expression (?r-s:pattern) as mentioned in the Flex manual.
Following code works only when i input small letter 'a' and not the caps 'A'
[(?i:a)] { printf("color"); }
\n { printf("NEWLINE\n"); return EOL;}
. { printf("Mystery character %s\n", yytext); }
Mystery character A
Reverse is also true i.e. if i change the line (?i:a) to (?i:A) it only considers 'A' as valid input and not 'a'.
If I remove the square brackets i.e. [] it gives error as
"ex1.lex", line 2: unrecognized rule
If I enclose the "(?i:a)" then it compiles but after executing it always goes to last rule i.e. "Mystery character..."
Please let me know how to use it properly.
I guess I am late.. :) Anyway, which flex version are you using, I have version 2.5.35 installed and correctly recognizes above pattern. Perhaps you're using old version!!!
Now regarding enclosing the pattern in [] brackets: it appears to work because, under the [] character class rule, flex will match any one of the individual characters (, ?, i, :, a or ). That's why a gets recognized and not A (because it is not in the list).
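The character class behaviour is easy to reproduce in C. In this sketch, in_class is a hypothetical helper that treats its first argument as the list of characters between the brackets; it is not a flex API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Emulates a simple bracket expression: true if `c` is one of the literal
 * characters listed in `members` (no ranges, no negation). */
static bool in_class(const char *members, char c)
{
    assert(members != NULL);
    return c != '\0' && strchr(members, c) != NULL;
}
```

Calling in_class("(?i:a)", 'a') is true and in_class("(?i:a)", 'A') is false, which matches the behaviour reported in the question; in_class("(?i:a)", '?') is also true, showing that the punctuation has become part of the class.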
The way I read the manual, the rule without the square brackets should perform the case-insensitive matching you're looking for--I can't explain why you get an error at compile time. But you can achieve the same behavior in one of two ways. One, you can enumerate the upper and lower case characters in the character class:
[Aa] { printf("color"); }
Two, you can specify the case-insensitive scanner option, either on the command line as -i or --case-insensitive or in your .l file:
%option case-insensitive
[a] {printf("color"); }
