what does Exclamation mark mean? - flex-lexer

I am reading some flex file for a dialect of datalog, the original file is ol_lexer.lex there is a code snippet in the action part:
<INITIAL>%%.* ; // Ignore %% comments
<INITIAL>^#!.* ; // Ignore '#' directives
I know the second line for matching the preprocessing directives such as
#define PI 3.14
but I don't know what's meaning of the mark "!" here, or why the second pattern needs the exclamation mark ?

The second line ignores lines that start with #! (so it would not match #define ... as that doesn't have a ! after the #). The ! doesn't have any special meaning here - it just matches an exclamation mark.
I assume the purpose of this rule is to allow shebang line (such as #!/usr/bin/env myinterpreter).

Related

Flex confusing to transform string character by character

I want to use flex to transform a string based on simple rules. I have rules like the first character stays the same and the second and third characters might change. Like if the second character was a letter, it becomes the number listed in the rules below. If the third is a digit, it becomes a certain letter.
%%
/*^[a-z] {char *yycopy = strdup( yytext ); unput(yycopy[0]);}*/
[ajs] {putchar('1');}
[bkt] {putchar('2');}
[clu] {putchar('3');}
[dmv] {putchar('4');}
[1] {putchar('j');}
[2] {putchar('k');}
[3] {putchar('l');}
[4] {putchar('m');} /*more number rules till 9*/
%%
int yywrap(void){return 1;}
int main( int argc, char **argv )
{
++argv, --argc; /* skip over program name */
if ( argc > 0 )
yyin = fopen( argv[0], "r" );
else
yyin = stdin;
while (yylex());
}
If there are different rules for characters in different positions within the string, how can I use start conditions to change a particular character (i.e. the rules for the second and third character are different).
You switch start condition by using the BEGIN action. Flex never automatically changes start condition, so you when you need to return to the initial start condition (called INITIAL), you have to do so explicitly (BEGIN(INITIAL)).
You need to declare start condition names in the (f)lex prologue, usually with the %x command. (%s is also possible but with different semantics. See the Flex manual for details.)
You indicate that a start condition applies to a rule by starting the rule with a start condition name in angle brackets. You can put more than one start condition inside the angle brackets; separate them with commas and don't use spaces. Don't put a space after the angle brackets either; they are part of the pattern and (f)lex patterns cannot include unquoted space characters.
BEGIN is a macro and it does not require parentheses around the start condition name, but I suggest always using them anyway, so you don't have to worry about what the macro expands to. Start condition names are small integers (either enum constants or preprocessor macros) but nothing guarantees their value, so don't make assumptions.
That's about it. So you could implement your astro numerological codifier with:
%x SECOND THIRD REST
%%
[a-z] ECHO; BEGIN(SECOND);
<SECOND>[ajs] putchar('1'); BEGIN(THIRD);
/* More SECOND rules */
<THIRD>1 putchar('j'); BEGIN(REST);
/* More THIRD rules */
<*>.*\n? ECHO; BEGIN(INITIAL);
(I deliberately did not add any <REST> rules beacause the fallback at the end covers it. I also deliberately left out the anchor in the first rule because my rules guarantee that the INITIAL start condition is 9nly in force at the beginning of a line. See the last rule. The last rule specifies an optional newline in case the file does not end with a newline, which occasionally happens although it's technically invalid.)

End of line lex

I am writing an interpreter for assembly using lex and yacc. The problem is that I need to parse a word that will strictly be at the end of the file. I've read that there is an anchor $, which can help. However it doesn't work as I expected. I've wrote this in my lex file:
ABC$ {printf("QWERTY\n");}
The input file is:
ABC
without spaces or any other invisible symbols. So I expect the outputput to be QWERTY, however what I get is:
ABC
which I guess means that the program couldn't parse it. Then I thought, that $ might be a regular symbol in lex, so I changed the input file into this:
ABC$
So, if $ isn't a special symbol, then it will be parsed as a normal symbol, and the output will be QWERTY. This doesn't happen, the output is:
ABC$
The question is whether $ in lex is a normal symbol or special one.
In (f)lex, $ matches zero characters followed by a newline character.
That's different from many regex libraries where $ will match at the end of input. So if your file does not have a newline at the end, as your question indicates (assuming you consider newline to be an invisible character), it won't be matched.
As #sepp2k suggests in a comment, the pattern also won't be matched if the input file happens to use Windows line endings (which consist of the sequence \r\n), unless the generated flex file was compiled for Windows. So if you created the file on Windows and run the flex-generated scanner in a Unix environment, the \r will also cause the pattern to fail to match. In that case, you can use (f)lex's trailing context operator:
ABC/\r?\n { puts("Matched ABC at the end of a line"); }
See the flex documentation for patterns for a full description of the trailing context operator. (Search for "trailing context" on that page; it's roughly halfway down.) $ is exactly equivalent to /\n.
That still won't match ABC at the very end of the file. Matching strings at the very end of the file is a bit tricky, but it can be done with two patterns if it's ok to recognise the string other than at the end of the file, triggering a different action:
ABC/. { /* Do nothing. This ABC is not at the end of a line or the file */ }
ABC { puts("ABC recognised at the end of a line"); }
That works because the first pattern will match as long as there is some non-newline character following ABC. (. matches any character other than a newline. See the above link for details.) If you also need to work with Windows line endings, you'll need to modify the trailing context in the first pattern.

Flex regular expression for comments

I'm trying to learn flex and having trouble with a regular expression to catch comments.
Assuming a comment begins with // and runs to the end of the line, I would like the program to recognize the entire comment and set yytext equal to it.
So far ["//".*$] is not cutting the mustard.
Thank you
Putting your text in square brackets creates a character class matching any one character from among those between the brackets. Also, quotation marks are not special in Flex's regex syntax. You want something along these lines:
/* definitions (for more readable rules) */
/* The \134 are octal escapes for the '/' character, for clarity: */
CMNT_START \134\134
%%
/* rules */
{CMNT_START}.*$ /* yytext automatically contains the matched text*/;

How to make lex/flex recognize tokens not separated by whitespace?

I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens. That is, tokens not separated by whitespace. E.g.: the string 39if is supposed to be recognized as the number 39 and the keyword if. Simultaneously, the lexer must also exit(1) when it encounters invalid input.
A simplified version of the code I have:
%{
#include <stdio.h>
%}
%option main warn debug
%%
if |
then |
else printf("keyword: %s\n", yytext);
[[:digit:]]+ printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]* printf("identifier: %s\n", yytext);
[[:space:]]+ // skip whitespace
[[:^space:]]+ { printf("ERROR: %s\n", yytext); exit(1); }
%%
When I run this (or my complete version), and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:
number: 39
keyword: if
(I.e. the same as if I entered 39 if as the input.)
Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, and flex will prefer it. That said, I have no idea how to resolve this situation. It seems unfeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.
UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".
You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the parsing is at is to use the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways -- look in the manual for the ways to specify your own i/o functions, for example, -- but the all around simplest IMHO is to use a start condition:
%x LEXING_ERROR
%%
// all your rules; the following *must* be at the end
. { BEGIN(LEXING_ERROR); yyless(1); }
<LEXING_ERROR>.+ { fprintf(stderr,
"Invalid character '%c' found at line %d,"
" just before '%s'\n",
*yytext, yylineno, yytext+1);
exit(1);
}
Note: Make sure that you've ignored whitespace in your rules. The pattern .+ matches any number but at least one non-newline character, or in other words up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(n) backs up the read pointer by n characters, so after the . rule matches, it will rescan that character producing (hopefully) a semi-reasonable error message. (It won't really be reasonable if your input is multibyte, or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex which gets more context, and maybe even limits the number of forward characters read. Lots of options here.)
Look up start conditions in the flex manual for more info about %x and BEGIN

flex usage of (?r-s:pattern)

I am trying to use the regular expression (?r-s:pattern) as mentioned in the Flex manual.
Following code works only when i input small letter 'a' and not the caps 'A'
%%
[(?i:a)] { printf("color"); }
\n { printf("NEWLINE\n"); return EOL;}
. { printf("Mystery character %s\n", yytext); }
%%
OUTPUT
a
colorNEWLINE
A
Mystery character A
NEWLINE
Reverse is also true i.e. if i change the line (?i:a) to (?i:A) it only considers 'A' as valid input and not 'a'.
If I remove the square brackets i.e. [] it gives error as
"ex1.lex", line 2: unrecognized rule
If I enclose the "(?i:a)" then it compiles but after executing it always goes to last rule i.e. "Mystery character..."
Please let me know how to use it properly.
I guess I am late.. :) Anyway, which flex version are you using, I have version 2.5.35 installed and correctly recognizes above pattern. Perhaps you're using old version!!!
Now regarding the enclosing with [] brackets. It works because as per [] regex rule it will try to match any of individual (, ?, i, :, a or ). Thats why a gets recognized and not A (because it is not in the list).
The way I read the manual, the rule without the square brackets should perform the case-insensitive matching you're looking for--I can't explain why you get an error at compile time. But you can achieve the same behavior in one of two ways. One, you can enumerate the upper and lower case characters in the character class:
%%
[Aa] { printf("color"); }
%%
Two, you can specify the case-insensitive scanner option, either on the command line as -i or --case-insensitive or in your .l file:
%%
%option case-insensitive
[a] {printf("color"); }
%%

Resources