flex regex use {-} with trailing context

I'm dealing with nested comments and would like to skip as many irrelevant characters as possible at a time. But the pattern <COMMENT>[^\n]{-}\//\*{-}\*/\/ is illegal. Any suggestions?
Update:
Given the following case:
/**comment* text/**comment text**/comment*/
After capturing the first /*, the scanner enters the COMMENT condition. Now I want to eat up as many characters as possible (anything other than /* or */) with one match. Since flex chooses the longest match, I don't know how to match isolated *s (a * not followed by a /) and isolated /s. And \//\* is not a character class, so we can't compute the difference between it and another class.

%%
"/\*" { printf("OPEN_COMMENT [%s]\n", yytext); }
"\*/" { printf("CLOSE_COMMENT [%s]\n", yytext); }
[^*/]+ { printf("TEXT [%s]\n", yytext); }
"\*" { printf("TEXT [%s]\n", yytext); }
"/" { printf("TEXT [%s]\n", yytext); }
%%
Here the longest-match rule helps us. The idea is to match free-standing single * and / symbols separately, while all other text is consumed in bulk (so performance does not degrade).
Result for the above example:
OPEN_COMMENT [/*]
TEXT [*]
TEXT [comment]
TEXT [*]
TEXT [ text]
OPEN_COMMENT [/*]
TEXT [*]
TEXT [comment text]
TEXT [*]
CLOSE_COMMENT [*/]
TEXT [comment]
CLOSE_COMMENT [*/]
TEXT [
]

Related

Match rule only at the beginning of a file

Problem
I'm writing a sort of scripting-language interpreter.
I would like it to be able to handle (ignore) things like a shebang, a UTF BOM, or other such things that can appear at the beginning of a file.
The problem is that I cannot be sure that my growing grammar won't at some point have a rule that could match one of those things. (It's unlikely but you don't get reliable programs by just ignoring unlikely problems.)
Therefore, I would like to do it properly and ignore those things only if they are at the beginning of a file.
Let's focus on a shebang in the example.
I've written some simple grammar that illustrates the problems I'm facing.
Lexer:
%%
#!.+ { printf("shebang: \"%s\"\n", yytext + 2); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
Input file:
#!my-program
# some multiline
thingy #
aaa bbb
ccc#!not a shebang#ddd
eee
Expected output:
shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa
id: bbb
id: ccc
thingy: "#!not a shebang#"
id: ddd
id: eee
Actual output:
thingy: "#!my-program
#"
id: some
id: multiline
id: thingy
thingy: "#
aaa bbb
ccc#"
error: '!'
id: not
id: a
id: shebang
error: '#'
id: ddd
id: eee
My (bad?) solution
I figured that this is a good case to use start conditions.
I managed to use them to write a lexer that does work, however, it's rather ugly:
%s MAIN
%%
<INITIAL>#!.+ { printf("shebang: \"%s\"\n", yytext + 2); BEGIN(MAIN); }
<INITIAL>""/(?s:.) { BEGIN(MAIN); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
<MAIN>#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
Notice that I had to specify the start condition MAIN before the rule #[^#]*#.
It's because it would otherwise collide with the shebang rule #!.+.
Unfortunately, the INITIAL start condition is inclusive, which means I had to specifically exclude from it any rule that would cause problems. I have to remember about it every time I write a new rule (AKA I'll forget about it).
Is there some way to make the INITIAL exclusive or choose a different start condition to be the default?

Here's a simpler solution, assuming you're using Flex (as per your flex-lexer tag):
%option noinput nounput noyywrap nodefault yylineno
%{
#define YY_USER_INIT BEGIN(STARTUP);
%}
%x STARTUP
%%
<STARTUP>#!.* { BEGIN(INITIAL); printf("Shebang: \"%s\"\n", yytext+2); }
<STARTUP>.|\n { BEGIN(INITIAL); yyless(0); }
  /* Rest is INITIAL */
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
Test:
rici$ flex -o shebang.c shebang.l
rici$ gcc -Wall -o shebang shebang.c -lfl
rici$ ./shebang <<"EOF"
> #!my-program
> # some multiline
> thingy #
> aaa bbb
> ccc#!not a shebang#ddd
> eee
> EOF
Shebang: "my-program"
thingy: "# some multiline
thingy #"
id: aaa
id: bbb
id: ccc
thingy: "#!not a shebang#"
id: ddd
id: eee
Notes:
1. The %option line:
   - prevents "Unused function" warnings;
   - removes the need for yywrap;
   - shows an error if there's some possible input which doesn't match any pattern;
   - counts input lines in the global yylineno.
2. The macro YY_USER_INIT is executed precisely once, when the scanner starts up. It executes before any of Flex's initialization code; fortunately, Flex's initialization code does not change the start condition if it's already been set.
3. yyless(0) causes the current token to be rescanned. (The argument doesn't have to be 0; it truncates the current token to that length and efficiently puts the rest back into the input stream.)
4. The library -lfl includes yywrap() (although in this case, it's not used), and a simple main() definition rather similar to the one in your example.
(1) and (2) are Flex extensions. (3) and (4) should be available in any lex which conforms to Posix, with the exception that the Posix lex library is linked with -ll.
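For comparison, the "only at the very start of the input" idea can also be sketched outside the scanner, as a pre-pass over a buffer. This is a rough illustration in plain C, not part of the answer's lexer: `skip_prologue` is a hypothetical name, and the UTF-8 BOM handling is an assumption added because the question mentions it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset at which normal scanning should
 * begin. It looks only at the very start of the buffer, which is what the
 * exclusive STARTUP start condition achieves inside the scanner. */
size_t skip_prologue(const char *buf, size_t len)
{
    size_t pos = 0;

    /* Optional UTF-8 byte order mark: EF BB BF (an added assumption). */
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        pos = 3;

    /* Optional shebang: "#!" up to, but not including, the newline,
     * which is the text the pattern #!.* would match. */
    if (len - pos >= 2 && buf[pos] == '#' && buf[pos + 1] == '!') {
        while (pos < len && buf[pos] != '\n')
            pos++;
    }
    return pos;
}
```

A "#!" anywhere else in the buffer is never examined, so it cannot be mistaken for a shebang.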

There is an indirect way to select a different start condition.
The start condition is an integer variable (e.g. INITIAL is 0).
You can get its current value using a macro YY_START and if it equals INITIAL, change it to another value, effectively replacing it.
%x BEGINNING
%s MAIN
%%
%{
if (YY_START == INITIAL)
BEGIN(BEGINNING);
%}
<BEGINNING>#!.+ { printf("shebang: \"%s\"\n", yytext + 2); BEGIN(MAIN); }
<BEGINNING>""/(?s:.) { BEGIN(MAIN); }
[[:alnum:]_]+ { printf("id: %s\n", yytext); return 1; }
#[^#]*# { printf("thingy: \"%s\"\n", yytext); return 2; }
[[:space:]] ;
. { printf("error: '%c'\n", yytext[0]); }
%%
int main() { while (yylex()); return 0; }
int yywrap() { return 1; }
The disadvantages of this solution are:
The code block will execute every time you call yylex (not only the first time, when it's actually needed). The overhead is small enough to be ignored, though.
Flex will still generate the whole state machine as if INITIAL was used. I have no idea if this creates a lot of code or if it would matter in a big parser.

How to add local variables to yylex function in flex lexer?

I was writing a lexer file that matches simple custom delimited strings of the form xyz$this is stringxyz. This is nearly how I did it:
%{
char delim[16];
uint8_t dlen;
%}
%%
.*$ {
dlen = yyleng-1;
strncpy(delim, yytext, dlen);
BEGIN(STRING);
}
<STRING>. {
if(yyleng >= dlen) {
if(strncmp(delim, yytext[yyleng-dlen], dlen) == 0) {
BEGIN(INITIAL);
return STR;
}
}
yymore();
}
%%
Now I want to convert this to a reentrant lexer, but I don't know how to make delim and dlen local variables inside yylex without modifying the generated lexer. How should I do this?
I'd rather not store these in yyextra, because these variables need not persist across multiple calls to yylex. Hence I would prefer an answer that shows how to declare them as local variables.

In the (f)lex file, any indented lines between the %% and the first rule are copied verbatim into yylex() prior to the first statement, precisely to allow you to declare and initialize local variables.
This behaviour is guaranteed by the Posix specification; it is not a flex extension: (emphasis added)
Any such input (beginning with a <blank> or within "%{" and "%}" delimiter lines) appearing at the beginning of the Rules section before any rules are specified shall be written to lex.yy.c after the declarations of variables for the yylex() function and before the first line of code in yylex(). Thus, user variables local to yylex() can be declared here, as well as application code to execute upon entry to yylex().
A similar statement is in the Flex manual section 5.2, Format of the Rules Section
The strategy you propose will work, certainly, but it's not very efficient. You might want to consider using input() to read characters one at a time, although that's not terribly efficient either. In any event, delim is unnecessary:
%x STRING
%%
  int dlen;   /* indented, so it is copied into yylex() as a local variable */
[^$\n]{1,16}\$ {
  dlen = yyleng-1;
  yymore();
  BEGIN(STRING);
}
<STRING>. {
  if(yyleng > dlen * 2) {
    if(memcmp(yytext, yytext + yyleng - dlen, dlen) == 0) {
      /* Remove the delimiter from the reported value of yytext. */
      yytext += dlen + 1;
      yyleng -= 2 * dlen + 1;
      yytext[yyleng] = 0;
      BEGIN(INITIAL);   /* leave the STRING condition before returning */
      return STR;
    }
  }
  yymore();
}
%%
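The comparison in the action above can be checked outside flex. The following plain C sketch applies the same idea to a whole buffer: the closing delimiter is found by comparing the start of the token with its current tail. `delimited_content` is a hypothetical name, and unlike the flex pattern it does not enforce the 16-character limit or exclude newlines.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the rule above. For input of the form
 * <delim>$<content><delim>, stores a pointer to the content in *content and
 * returns its length; returns -1 if there is no '$', the delimiter is
 * empty, or the closing delimiter never appears. */
long delimited_content(const char *s, size_t len, const char **content)
{
    const char *dollar = memchr(s, '$', len);
    if (dollar == NULL || dollar == s)
        return -1;

    size_t dlen = (size_t)(dollar - s);   /* same as yyleng-1 in the first rule */
    size_t body = dlen + 1;               /* offset of the first content byte */

    /* Grow the candidate token one character at a time, like <STRING>. */
    for (size_t end = body + dlen; end <= len; end++) {
        if (memcmp(s, s + end - dlen, dlen) == 0) {
            *content = s + body;
            return (long)(end - dlen - body);
        }
    }
    return -1;
}
```

As in the flex version, no copy of the delimiter is kept; the start of the token serves as the reference.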

How to achieve capturing groups in flex lex?

I want to match a string which starts with a '#', then matches everything until it reaches the character that follows the '#'. This could be achieved using capturing groups like this: #(.)[^(?1)]*(?1) (EDIT: this regex is also erroneous). This matches #$foo$, does not match #%bar&, and matches the first 6 characters of #"foo"bar.
But since flex lex does not support capturing groups, what is the workaround here?

As you say, (f)lex does not support capturing groups, and it certainly doesn't support backreferences.
So there is no simple workaround, but there are workarounds. Here are a few possibilities:
You can read the input one character at a time using the input() function, until you find the matching character (but you have to create your own buffer to store the characters, because characters read by input() are not added to the current token). This is not the most efficient because reading one character at a time is a bit clunky, but it's the only interface that (f)lex offers. (The following snippet assumes you have some kind of expandable stringBuilder; if you are using C++, this would just be replaced with a std::string.)
#. { StringBuilder sb = string_builder_new();
     int delim = yytext[1];
     for (;;) {
       int next = input();
       if (next == delim) break;
       if (next == EOF) { /* Signal error */; break; }
       /* Pass the builder to the (hypothetical) helper functions */
       string_builder_addchar(sb, next);
     }
     yylval = string_builder_release(sb);
     return DELIMITED_STRING;
   }
Even less efficiently, but perhaps more conveniently, you can get (f)lex to accumulate the characters in yytext using yymore(), matching one character at a time in a start condition:
%x DELIMITED
%%
  int delim;
#. { delim = yytext[1]; BEGIN(DELIMITED); }
<DELIMITED>.|\n { if (yytext[0] == delim) {
yylval = strdup(yytext);
BEGIN(INITIAL);
return DELIMITED_STRING;
}
yymore();
}
<DELIMITED><<EOF>> { /* Signal unterminated string error */ }
The most efficient solution (in (f)lex) is to just write one rule for each possible delimiter. While that's a lot of rules, they could be easily generated with a small script in whatever scripting language you prefer. And, actually, there are not that many rules, particularly if you don't allow alphabetic and non-printing characters to be delimiters. This has the additional advantage that if you want Perl-like parenthetic delimiters (#(Hello) instead of #(Hello(), you can just modify the individual pattern to suit (as I've done below). [Note 1] Since all the actions are the same, it might be easier to use a macro for the action, making it easier to modify.
  /* Ordinary punctuation */
#:[^:]*: { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#![^!]*! { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#\.[^.]*\. { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
  /* Matched pairs */
#<[^>]*> { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#\[[^]]*] { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
  /* Trap errors */
# { /* Report unmatched or invalid delimiter error */ }
If I were writing a script to generate these rules, I would use hexadecimal escapes for all the delimiter characters rather than trying to figure out which ones needed escapes.
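As a rough sketch of such a generator, here written in C rather than a scripting language, the function below formats one rule for an ordinary (non-paired) delimiter and writes every occurrence of the delimiter as a `\xHH` escape. `format_delimiter_rule` is a hypothetical name; the action text is copied from the rules above.

```c
#include <stdio.h>

/* Hypothetical generator helper: writes one flex rule for the delimiter
 * `delim` into `out`, using a hexadecimal escape for every occurrence of
 * the delimiter so that no character needs special-case quoting.
 * Returns what snprintf returns (the length the rule needs). */
int format_delimiter_rule(char *out, size_t size, unsigned char delim)
{
    return snprintf(out, size,
        "#\\x%02x[^\\x%02x]*\\x%02x  "
        "{ yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }",
        delim, delim, delim);
}
```

Calling it for ':' produces a rule whose pattern is `#\x3a[^\x3a]*\x3a`; looping over the permitted delimiter characters and printing each result would yield the whole rule set.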
Notes:
Perl requires nested balanced parentheses in constructs like that. But you can't do that with regular expressions; if you wanted to reproduce Perl behaviour, you'd need to use some variation on one of the other suggestions. I'll try to revisit this answer later to address that feature.

Why won't my JavaCC lexer/parser accept this input?

I am creating a lexer/parser which should accept strings that belong to an infinite set of languages.
One such string is "a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>".
The set of languages is defined as follows:
Base language, L0
A string from L0 consists of several blocks separated by space characters.
At least one block must be present.
A block is an odd-length sequence of lowercase letters (a-z).
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.
Example of string belonging to L0:
zyx abcba m xyzvv
There is one space character between zyx and abcba, there are three spaces
between abcba and m, and only one between m and xyzvv. No other space characters are present in the string.
Language L1
A string from L1 consists of several blocks separated by space characters.
At least one block must be present.
There are two kinds of blocks. A block of the first kind must be
an even-length sequence of uppercase letters (A-Z). A block of the
second kind must have the shape <2U>. . .</2U>, where . . . stands
for any string from L0.
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.
Example of string belonging to L1:
YZ <2U>abc zzz</2U> ABBA <2U>kkkkk</2U> KM
Note that five spaces separate YZ and <2U>abc zzz</2U>, and three spaces divide abc from zzz. Otherwise single spaces are used as separators. There is no space in front of YZ and no space follows KM.
Language L2
A string from L2 consists of several blocks separated by space characters.
At least one block must be present.
There are two kinds of blocks. A block of the first kind must be
an odd-length sequence of lowercase letters (a-z). A block of the
second kind must have the shape <2L>. . .</2L>, where . . . stands
for any string from L1.
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.
Example of string belonging to L2:
abc <2L>AA ZZ <2U>a bcd</2U></2L> z <2L><2U>abcde</2U></2L>
Single spaces are used as separators inside the sentence given above, but any other odd number of spaces would also lead to a valid L2 sentence.
Languages L{2k + 1}, k > 0
A string from L{2k + 1} consists of several blocks separated by space characters. At least one block must be present.
There are two kinds of blocks. A block of the first kind must be
an even-length sequence of uppercase letters (A-Z). A block of the
second kind must have the shape <2U>. . .</2U>, where . . . stands
for any string from L{2k}.
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.
Languages L{2k + 2}, k > 0
A string from L{2k + 2} consists of several blocks separated by space
characters. At least one block must be present.
There are two kinds of blocks. A block of the first kind must be
an odd-length sequence of lowercase letters (a-z). A block of the
second kind must have the shape <2L>. . .</2L>, where . . . stands
for any string from L{2k + 1}.
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.
The code for my lexer/parser is as follows:
PARSER_BEGIN(Assignment)
/** A parser which determines if user's input belongs to any one of the set of acceptable languages. */
public class Assignment {
public static void main(String[] args) {
try {
Assignment parser = new Assignment(System.in);
parser.Start();
System.out.println("YES"); // If the user's input belongs to any of the set of acceptable languages, then print YES.
} catch (ParseException e) {
System.out.println("NO"); // If the user's input does not belong to any of the set of acceptable languages, then print NO.
}
}
}
PARSER_END(Assignment)
//** A token which matches any lowercase letter from the English alphabet. */
TOKEN :
{
< #L_CASE_LETTER: ["a"-"z"] >
}
//* A token which matches any uppercase letter from the English alphabet. */
TOKEN:
{
< #U_CASE_LETTER: ["A"-"Z"] >
}
//** A token which matches an odd number of lowercase letters from the English alphabet. */
TOKEN:
{
< ODD_L_CASE_LETTER: <L_CASE_LETTER>(<L_CASE_LETTER><L_CASE_LETTER>)* >
}
//** A token which matches an even number of uppercase letters from the English alphabet. */
TOKEN:
{
< EVEN_U_CASE_LETTERS: (<U_CASE_LETTER><U_CASE_LETTER>)+ >
}
//* A token which matches the string "<2U>" . */
TOKEN:
{
< OPEN_UPPER: "<2U>" >
}
//* A token which matches the string "</2U>". */
TOKEN:
{
< CLOSE_UPPER: "</2U>" >
}
//* A token which matches the string "<2L>". */
TOKEN:
{
< OPEN_LOWER: "<2L>" >
}
//* A token which matches the string "</2L>". */
TOKEN:
{
< CLOSE_LOWER: "</2L>" >
}
//* A token which matches an odd number of white spaces. */
TOKEN :
{
< ODD_WHITE_SPACE: " "(" "" ")* >
}
//* A token which matches an EOL character. */
TOKEN:
{
< EOL: "\n" | "\r" | "\r\n" >
}
/** This production matches strings which belong to the base language L^0. */
void Start() :
{}
{
LOOKAHEAD(3)
<ODD_L_CASE_LETTER> (<ODD_WHITE_SPACE> <ODD_L_CASE_LETTER>)* <EOL> <EOF>
|
NextLanguage()
|
LOOKAHEAD(3)
NextLanguageTwo()
|
EvenLanguage()
}
/** This production matches strings which belong to language L^1. */
void NextLanguage():
{}
{
(<OPEN_UPPER> (PseudoStart()) <CLOSE_UPPER>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())* <EOL> <EOF>
|
(<EVEN_U_CASE_LETTERS>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())* <EOL> <EOF>
}
/** This production matches either an even number of uppercase letters, or a string from L^0, encased within the tags <2U> and </2U>. */
void UpperOrPseudoStart() :
{}
{
<EVEN_U_CASE_LETTERS>
|
<OPEN_UPPER> (PseudoStart()) <CLOSE_UPPER>
}
/** This production matches strings from L^0, in a similar way to Start(); however, the strings that it matches do not have EOL or EOF characters after them. */
void PseudoStart() :
{}
{
<ODD_L_CASE_LETTER> (<ODD_WHITE_SPACE> <ODD_L_CASE_LETTER>)*
}
/** This production matches strings which belong to language L^2. */
void NextLanguageTwo() :
{}
{
(<ODD_L_CASE_LETTER>)+ (<ODD_WHITE_SPACE> LowerOrPseudoNextLanguage())* <EOL> <EOF>
|
(<OPEN_LOWER> PseudoNextLanguage() <CLOSE_LOWER>)+ (<ODD_WHITE_SPACE> LowerOrPseudoNextLanguage())* <EOL> <EOF>
}
/** This production matches either an odd number of lowercase letters, or a string from L^1, encased within the tags <2L> and </2L>. */
void LowerOrPseudoNextLanguage() :
{}
{
<ODD_L_CASE_LETTER>
|
<OPEN_LOWER> PseudoNextLanguage() <CLOSE_LOWER>
}
/** This production matches strings from L^1, in a similar way to NextLanguage(); however, the strings that it matches do not have EOL or EOF characters after them. */
void PseudoNextLanguage() :
{}
{
(<OPEN_UPPER> (PseudoStart()) <CLOSE_UPPER>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())*
|
(<EVEN_U_CASE_LETTERS>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())*
}
/** This production matches strings which belong to any of the languages L^{2k + 2}, where k > 0 (the infinite set of even languages). */
void EvenLanguage() :
{}
{
(<ODD_L_CASE_LETTER>)+ (<ODD_WHITE_SPACE> EvenLanguageAuxiliary())* <EOL> <EOF>
|
(CommonPattern())+ (<ODD_WHITE_SPACE> EvenLanguageAuxiliary())* <EOL> <EOF>
}
/** This production is an auxiliary production that helps when parsing strings from any of the even set of languages. */
void EvenLanguageAuxiliary() :
{}
{
CommonPattern()
|
<ODD_L_CASE_LETTER>
}
void CommonPattern() :
{}
{
<OPEN_LOWER> <EVEN_U_CASE_LETTERS> <ODD_WHITE_SPACE> <OPEN_UPPER> <ODD_L_CASE_LETTER> (<ODD_WHITE_SPACE> CommonPattern())+ <CLOSE_UPPER> <CLOSE_LOWER>
}
Several times now, I have inputted the string "a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>".
However, each time, NO is printed out on the terminal.
I have looked through my code carefully several times, checking the order in which I think the input string should be parsed; but, I haven't been able to find any errors in my logic or reasons why the string isn't being accepted.
Could I have some suggestions as to why it isn't being accepted, please?

The following steps helped to solve the problem.
Run the following code:
javacc -debug_parser Assignment.jj
javac Assignment*.java
Then, run the lexer/parser (by typing java Assignment) and then input the string:
"a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>"
The resulting trace of parser actions shows that the production NextLanguageTwo() is called on this string, rather than the desired EvenLanguage() production.
Tracing through NextLanguageTwo() shows that it matches the first eight tokens in the input string.
So, using a lookahead of 9, although inefficient, causes the input string to be accepted. That is, modify the Start() production by changing the second lookahead value (just above the call to NextLanguageTwo()) from 3 to 9.

Are any of your inputs being accepted? I copied your code over to my computer and found that for any correct input (as far as I can tell from the definition of your language), it always outputs 'NO'.

Reducing insane flex lexer expansion?

I have written a flex lexer to handle the text in BYOND's .dmi file format. The contents inside are (key, value) pairs delimited by '='. Valid keys are all essentially keywords (such as "width"), and invalid keys are not errors: they are just ignored.
Interestingly, the current state of BYOND's .dmi parser uses everything prior to the '=' as its keyword, and simply ignores any excess junk. This means "\twidth123" is recognized as "width".
The crux of my problem is in allowing for this irregularity. In doing so my generated lexer expands from ~40-50KB to ~13-14MB. For reference, I present the following contrived example:
%option c++ noyywrap
fill [^=#\n]*
%%
{fill}version{fill} { return 0; }
{fill}width{fill} { return 0; }
{fill}height{fill} { return 0; }
{fill}state{fill} { return 0; }
{fill}dirs{fill} { return 0; }
{fill}frames{fill} { return 0; }
{fill}delay{fill} { return 0; }
{fill}loop{fill} { return 0; }
{fill}rewind{fill} { return 0; }
{fill}movement{fill} { return 0; }
{fill}hotspot{fill} { return 0; }
%%
fill is the definition used to merge the keywords with "anything before the =". Running flex on the above yields a ~13MB lex.yy.cc on my computer. Simply removing the Kleene star (*) from the fill definition yields a 45KB lex.yy.cc file; however, this obviously makes the lexer incorrect.
Are there any tricks, flex options, or lexer hacks to avoid this insane expansion? The only things I can think of are:
Disallow "width123" to represent "width", which is undesirable as then technically-correct files could not be parsed.
Make one rule that is simply [^=\n]+ to return some identifier token, and pick out the keyword in the parser. This seems suboptimal to me as well, particularly because different keywords have different value types and it seems most natural to be able to handle "'width' '=' INT" and "'version' '=' FLOAT" in the parser instead of "ID '=' VALUE" followed by picking out the keyword in the identifier, making sure the value is of the right type, etc.
I could make the rule {fill}(width|height|version|...){fill}, which does indeed keep the generated file small. However, while regular expression parsers tend to produce "captures," flex just gives me yytext and re-parsing that for a keyword to produce the desired token seems to be very undesirable in terms of algorithmic complexity.

Make fill a separate rule of its own that does nothing, and remove it from all the other rules, and separate its definition from whitespace for clarity:
whitespace [ \t\f]
fill [^#=\n]
%%
{whitespace}+ ;
{fill}+ ;
I would probably also avoid building the keywords into the lexer and just use an identifier [a-zA-Z]+ rule that does a table lookup. And finally add a rule to catch the =:
. return yytext[0];
to let the parser handle all special characters.
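The table lookup mentioned above might look like the following minimal sketch. The token names (`KW_WIDTH` and so on), the table contents, and `lookup_keyword` are assumptions made for illustration; in a real scanner the token values would come from the parser's header.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; a real project would take these from the
 * parser's generated header. */
enum { KW_NONE = 0, KW_VERSION, KW_WIDTH, KW_HEIGHT, KW_STATE, KW_DIRS };

struct keyword_entry {
    const char *name;
    int token;
};

static const struct keyword_entry keyword_table[] = {
    { "version", KW_VERSION },
    { "width",   KW_WIDTH   },
    { "height",  KW_HEIGHT  },
    { "state",   KW_STATE   },
    { "dirs",    KW_DIRS    },
};

/* Returns the token for an exact keyword match, or KW_NONE for anything
 * else (unknown identifiers are ignored, as the question requires). */
int lookup_keyword(const char *text)
{
    for (size_t i = 0; i < sizeof keyword_table / sizeof keyword_table[0]; i++) {
        if (strcmp(text, keyword_table[i].name) == 0)
            return keyword_table[i].token;
    }
    return KW_NONE;
}
```

The action for the identifier rule would then call `lookup_keyword(yytext)` and return the result when it is not `KW_NONE`. A linear scan is adequate for a dozen keywords.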

This is not really a problem flex is "good at", but it can be solved if it is precisely defined. In particular, it is important to know which of the keywords should be returned if the random string of letters before the = contains more than one keyword. For example, suppose the input is:
garbage_widtheight_moregarbage = 42
Now, is that setting the width or the height?
Remember that flex scanners will choose the rule with longest match, and of rules with equally long matches, the first one in the lexical description.
So the model presented in the OP:
fill [^=#\n]*
%%
{fill}width{fill} { return 0; }
{fill}height{fill} { return 0; }
/* SNIP */
will always prefer width to height, because the matches will be the same length (both terminate at the last character before the =), and the width pattern comes first in the file. If the rules were written in the opposite order, height would be preferred.
On the other hand, if you removed the second {fill}:
{fill}width { return 0; }
{fill}height { return 0; }
then the last keyword in the input (in this case, height) will be preferred, because that one has the longer match.
The most likely requirement, however, is that the first keyword be recognized, so neither of the preceding will work. In order to match the first keyword, it is necessary to first match the shortest possible sequence of {fill}. And since flex does not implement non-greedy repetition, that can only be done with a character-by-character span.
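To make the "first keyword wins" requirement concrete, here is a minimal sketch in plain C of a character-by-character span: it tries every keyword at each position from left to right and stops at the first hit. `first_keyword` and the three-entry list are hypothetical and only illustrate the matching order.

```c
#include <stddef.h>
#include <string.h>

static const char *const key_names[] = { "version", "width", "height" };

/* Hypothetical helper: returns the keyword that starts earliest in `key`
 * (the text before the '='), or NULL if none occurs. Advancing one
 * character at a time is what gives the shortest possible leading fill. */
const char *first_keyword(const char *key)
{
    for (const char *p = key; *p != '\0'; p++) {
        for (size_t i = 0; i < sizeof key_names / sizeof key_names[0]; i++) {
            size_t n = strlen(key_names[i]);
            if (strncmp(p, key_names[i], n) == 0)
                return key_names[i];
        }
    }
    return NULL;
}
```

For the input garbage_widtheight_moregarbage this picks width, because width begins before height does.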
Here's an example, using start conditions. Note that we hold onto the keyword token until we actually find the =, in case the = is not found.
/* INITIAL: beginning of a line
* FIND_EQUAL: keyword recognized, looking for the =
* VALUE: = recognized, lexing the right-hand side
* NEXT_LINE: find the next line and continue the scan
*/
%x FIND_EQUAL VALUE NEXT_LINE
%%
  int keyword;
[#=].*    /* Skip comments and lines with no recognizable keyword */
version   { keyword = KW_VERSION; BEGIN(FIND_EQUAL); }
width     { keyword = KW_WIDTH; BEGIN(FIND_EQUAL); }
height    { keyword = KW_HEIGHT; BEGIN(FIND_EQUAL); }
  /* etc. */
.|\n      /* Skip any other single character, or newline */
<FIND_EQUAL>{
[^=#\n]*"=" { BEGIN(VALUE); return keyword; }
"#".* { BEGIN(INITIAL); }
\n { BEGIN(INITIAL); }
}
<VALUE>{
"#".* { BEGIN(INITIAL); }
\n { BEGIN(INITIAL); }
[[:blank:]]+ ; /* Ignore space and tab characters */
[[:digit:]]+ { yylval.ival = atoi(yytext);
BEGIN(NEXT_LINE); return INTEGER;
}
[[:digit:]]+"."[[:digit:]]*|"."[[:digit:]]+ {
yylval.fval = atof(yytext);
BEGIN(NEXT_LINE); return FLOAT;
}
\"([^"]|\\.)*\" { char* s = malloc(yyleng - 1);
yylval.sval = s;
/* Remove quotes and escape characters */
yytext[yyleng - 1] = '\0';
do {
if (*++yytext == '\\') ++yytext;
*s++ = *yytext;
} while (*yytext);
BEGIN(NEXT_LINE); return STRING;
}
  /* Other possible value token types */
. BEGIN(NEXT_LINE); /* bad character in value */
}
<NEXT_LINE>.*\n? BEGIN(INITIAL);
In the escape-removal code, you might want to translate things like \n. And you might also want to avoid string values with physical newlines. And a bunch of etceteras. It's only intended as a model.
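As one possible shape for that translation step, the sketch below copies a string while converting a small set of escapes. `unescape` is a hypothetical helper, and the choice of `\n` and `\t` as the translated escapes is an assumption; which escapes .dmi files actually use is not stated here.

```c
#include <stddef.h>

/* Hypothetical helper: copies `src` to `dst`, translating \n and \t and
 * dropping the backslash in front of any other character (so \" becomes "
 * and \\ becomes \). `dst` needs room for strlen(src) + 1 bytes.
 * Returns the length of the result. */
size_t unescape(char *dst, const char *src)
{
    char *out = dst;

    while (*src != '\0') {
        char c = *src++;
        if (c == '\\' && *src != '\0') {
            c = *src++;
            switch (c) {
            case 'n': c = '\n'; break;
            case 't': c = '\t'; break;
            default:  break;        /* keep the escaped character itself */
            }
        }
        *out++ = c;                 /* a trailing lone backslash is kept */
    }
    *out = '\0';
    return (size_t)(out - dst);
}
```

In the STRING action this would replace the do/while loop, applied to yytext after the surrounding quotes have been stripped.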
