Reducing insane flex lexer expansion? - flex-lexer

I have written a flex lexer to handle the text in BYOND's .dmi file format. The contents inside are (key, value) pairs delimited by '='. Valid keys are all essentially keywords (such as "width"), and invalid keys are not errors: they are just ignored.
Interestingly, the current state of BYOND's .dmi parser uses everything prior to the '=' as its keyword, and simply ignores any excess junk. This means "\twidth123" is recognized as "width".
The crux of my problem is in allowing for this irregularity. In doing so my generated lexer expands from ~40-50KB to ~13-14MB. For reference, I present the following contrived example:
%option c++ noyywrap
fill [^=#\n]*
%%
{fill}version{fill} { return 0; }
{fill}width{fill} { return 0; }
{fill}height{fill} { return 0; }
{fill}state{fill} { return 0; }
{fill}dirs{fill} { return 0; }
{fill}frames{fill} { return 0; }
{fill}delay{fill} { return 0; }
{fill}loop{fill} { return 0; }
{fill}rewind{fill} { return 0; }
{fill}movement{fill} { return 0; }
{fill}hotspot{fill} { return 0; }
%%
fill is the rule that is used to merge the keywords with "anything before the =". Running flex on the above yields a ~13MB lex.yy.cc on my computer. Simply removing the kleene star (*) in the fill rule yields a 45KB lex.yy.cc file; however, obviously, this then makes the lexer incorrect.
Are there any tricks, flex options, or lexer hacks to avoid this insane expansion? The only things I can think of are:
Disallow "width123" to represent "width", which is undesirable as then technically-correct files could not be parsed.
Make one rule that is simply [^=\n]+ to return some identifier token, and pick out the keyword in the parser. This seems suboptimal to me as well, particularly because different keywords have different value types and it seems most natural to be able to handle "'width' '=' INT" and "'version' '=' FLOAT" in the parser instead of "ID '=' VALUE" followed by picking out the keyword in the identifier, making sure the value is of the right type, etc.
I could make the rule {fill}(width|height|version|...){fill}, which does indeed keep the generated file small. However, while regular expression parsers tend to produce "captures," flex just gives me yytext and re-parsing that for a keyword to produce the desired token seems to be very undesirable in terms of algorithmic complexity.

Make fill a separate rule of its own that does nothing, and remove it from all the other rules, and separate its definition from whitespace for clarity:
whitespace [ \t\f]
fill [^#=\n]
%%
{whitespace}+ ;
{fill}+ ;
I would probably also avoid building the keywords into the lexer and just use an identifier [a-zA-Z]+ rule that does a table lookup. And finally add a rule to catch the =:
. return yytext[0];
to let the parser handle all special characters.

This is not really a problem flex is "good at", but it can be solved if it is precisely defined. In particular, it is important to know which of the keywords should be returned if the random string of letters before the = contains more than one keyword. For example, suppose the input is:
garbage_widtheight_moregarbage = 42
Now, is that setting the width or the height?
Remember that flex scanners will choose the rule with longest match, and of rules with equally long matches, the first one in the lexical description.
So the model presented in the OP:
fill [^=#\n]*
%%
{fill}width{fill} { return 0; }
{fill}height{fill} { return 0; }
/* SNIP */
will always prefer width to height, because the matches will be the same length (both terminate at the last character before the =), and the width pattern comes first in the file. If the rules were written in the opposite order, height would be preferred.
On the other hand, if you removed the second {fill}:
{fill}width{fill} { return 0; }
{fill}height{fill} { return 0; }
then the last keyword in the input (in this case, height) will be preferred, because that one has the longer match.
The most likely requirement, however, is that the first keyword be recognized, so neither of the preceding will work. In order to match the first keyword, it is necessary to first match the shortest possible sequence of {fill}. And since flex does not implement non-greedy repetition, that can only be done with a character-by-character span.
Here's an example, using start conditions. Note that we hold onto the keyword token until we actually find the =, in case the = is not found.
/* INITIAL: beginning of a line
* FIND_EQUAL: keyword recognized, looking for the =
* VALUE: = recognized, lexing the right-hand side
* NEXT_LINE: find the next line and continue the scan
*/
%x FIND_EQUAL VALUE
%%
int keyword;
"[#=]".* /* Skip comments and lines with no recognizable keyword */
version { keyword = KW_VERSION; BEGIN(FIND_EQUAL); }
width { keyword = KW_WIDTH; BEGIN(FIND_EQUAL); }
height { keyword = KW_HEIGHT; BEGIN(FIND_EQUAL); }
/* etc. */
.|\n /* Skip any other single character, or newline */
<FIND_EQUAL>{
[^=#\n]*"=" { BEGIN(VALUE); return keyword; }
"#".* { BEGIN(INITIAL); }
\n { BEGIN(INITIAL); }
}
<VALUE>{
"#".* { BEGIN(INITIAL); }
\n { BEGIN(INITIAL); }
[[:blank:]]+ ; /* Ignore space and tab characters */
[[:digit:]]+ { yylval.ival = atoi(yytext);
BEGIN(NEXT_LINE); return INTEGER;
}
[[:digit:]]+"."[[:digit:]]*|"."[[:digit:]]+ {
yylval.fval = atod(yytext);
BEGIN(NEXT_LINE); return FLOAT;
}
\"([^"]|\\.)*\" { char* s = malloc(yyleng - 1);
yylval.sval = s;
/* Remove quotes and escape characters */
yytext[yyleng - 1] = '\0';
do {
if (*++yytext == '\\') ++yytext;
*s++ = *yytext;
} while (*yytext);
BEGIN(NEXT_LINE); return STRING;
}
/* Other possible value token types */
. BEGIN(NEXT_LINE); /* bad character in value */
}
<NEXT_LINE>.*\n? BEGIN(INITIAL);
In the escape-removal code, you might want to translate things like \n. And you might also want to avoid string values with physical newlines. And a bunch of etceteras. It's only intended as a model.

Related

How to add local variables to yylex function in flex lexer?

I was writing a lexer file that matches simple custom delimited strings of the form xyz$this is stringxyz. This is nearly how I did it:
%{
char delim[16];
uint8_t dlen;
%}
%%
.*$ {
dlen = yyleng-1;
strncpy(delim, yytext, dlen);
BEGIN(STRING);
}
<STRING>. {
if(yyleng >= dlen) {
if(strncmp(delim, yytext[yyleng-dlen], dlen) == 0) {
BEGIN(INITIAL);
return STR;
}
}
yymore();
}
%%
Now I wanted to convert this to reentrant lexer. But I don't know how to make delim and dlen as local variables inside yylex apart from modifying generated lexer. Someone please help me how should I do this.
I don't recommend to store these in yyextra because, these variables need not persist across multiple calls to yylex. Hence I would prefer an answer that guides me towards declaring these as local variables.
In the (f)lex file, any indented lines between the %% and the first rule are copied verbatim into yylex() prior to the first statement, precisely to allow you to declare and initialize local variables.
This behaviour is guaranteed by the Posix specification; it is not a flex extension: (emphasis added)
Any such input (beginning with a <blank>or within "%{" and "%}" delimiter lines) appearing at the beginning of the Rules section before any rules are specified shall be written to lex.yy.c after the declarations of variables for the yylex() function and before the first line of code in yylex(). Thus, user variables local to yylex() can be declared here, as well as application code to execute upon entry to yylex().
A similar statement is in the Flex manual section 5.2, Format of the Rules Section
The strategy you propose will work, certainly, but it's not very efficient. You might want to consider using input() to read characters one at a time, although that's not terribly efficient either. In any event, delim is unnecessary:
%%
int dlen;
[^$\n]{1,16}\$ {
dlen = yyleng-1;
yymore();
BEGIN(STRING);
}
<STRING>. {
if(yyleng > dlen * 2) {
if(memcmp(yytext, yytext + yyleng - dlen, dlen) == 0) {
/* Remove the delimiter from the reported value of yytext. */
yytext += dlen + 1;
yyleng -= 2 * dlen + 1;
yytext[yyleng] = 0;
return STR;
}
}
yymore();
}
%%

How to achieve capturing groups in flex lex?

I wanted to match for a string which starts with a '#', then matches everything until it matches the character that follows '#'. This can be achieved using capturing groups like this: #(.)[^(?1)]*(?1)(EDIT this regex is also erroneous). This matches #$foo$, does not match #%bar&, matches first 6 characters of #"foo"bar.
But since flex lex does not support capturing groups, what is the workaround here?
As you say, (f)lex does not support capturing groups, and it certainly doesn't support backreferences.
So there is no simple workaround, but there are workarounds. Here are a few possibilities:
You can read the input one character at a time using the input() function, until you find the matching character (but you have to create your own buffer to store the characters, because characters read by input() are not added to the current token). This is not the most efficient because reading one character at a time is a bit clunky, but it's the only interface that (f)lex offers. (The following snippet assumes you have some kind of expandable stringBuilder; if you are using C++, this would just be replaced with a std::string.)
#. { StringBuilder sb = string_builder_new();
int delim = yytext[1];
for (;;) {
int next = input();
if (next == delim) break;
if (next == EOF ) { /* Signal error */; break; }
string_builder_addchar(next);
}
yylval = string_builder_release();
return DELIMITED_STRING;
}
Even less efficiently, but perhaps more conveniently, you can get (f)lex to accumulate the characters in yytext using yymore(), matching one character at a time in a start condition:
%x DELIMITED
%%
int delim;
#. { delim = yytext[1]; BEGIN(DELIMITED); }
<DELIMITED>.|\n { if (yytext[0] == delim) {
yylval = strdup(yytext);
BEGIN(INITIAL);
return DELIMITED_STRING;
}
yymore();
}
<DELIMITED><<EOF>> { /* Signal unterminated string error */ }
The most efficient solution (in (f)lex) is to just write one rule for each possible delimiter. While that's a lot of rules, they could be easily generated with a small script in whatever scripting language you prefer. And, actually, there are not that many rules, particularly if you don't allow alphabetic and non-printing characters to be delimiters. This has the additional advantage that if you want Perl-like parenthetic delimiters (#(Hello) instead of #(Hello(), you can just modify the individual pattern to suit (as I've done below). [Note 1] Since all the actions are the same; it might be easier to use a macro for the action, making it easier to modify.
/* Ordinary punctuation */
#:[^:]*: { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#:[^:]*: { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#![^!]*! { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#\.[^.]*\. { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
/* Matched pairs */
#<[^>]*> { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
#\[[^]]*] { yylval = strndup(yytext + 2, yyleng - 3); return DELIMITED_STRING; }
/* Trap errors */
# { /* Report unmatched or invalid delimiter error */ }
If I were writing a script to generate these rules, I would use hexadecimal escapes for all the delimiter characters rather than trying to figure out which ones needed escapes.
Notes:
Perl requires nested balanced parentheses in constructs like that. But you can't do that with regular expressions; if you wanted to reproduce Perl behaviour, you'd need to use some variation on one of the other suggestions. I'll try to revisit this answer later to address that feature.

(f)lex the difference between PRINTA$ and PRINT A$

I am parsing BASIC:
530 FOR I=1 TO 9:C(I,1)=0:C(I,2)=0:NEXT I
The patterns that are used in this case are:
FOR { return TOK_FOR; }
TO { return TOK_TO; }
NEXT { return TOK_NEXT; }
(many lines later...)
[A-Za-z_#][A-Za-z0-9_]*[\$%\!#]? {
yylval.s = g_string_new(yytext);
return IDENTIFIER;
}
(many lines later...)
[ \t\r\l] { /* eat non-string whitespace */ }
The problem occurs when the spaces are removed, which was common in the era of 8k RAM. So the line that is actually found in Super Star Trek is:
530 FORI=1TO9:C(I,1)=0:C(I,2)=0:NEXTI
Now I know why this is happening: "FORI" is longer than "FOR", it's a valid IDENTIFIER in my pattern, so it matches IDENTIFIER.
The original rule in MS BASIC was that variable names could be only two characters, so there was no * so the match would fail. But this version is also supporting GW BASIC and Atari BASIC, which allow variables with long names. So "FORI" is a legal variable name in my scanner, so that matches as it is the longest hit.
Now when I look at the manual, and the only similar example deliberately returns an error. It seems what I need is "match the ID, but only if it's not the same as defined %token", is there such a thing?
It's easy to recognise keywords even if they have an identifier concatenated. What's tricky is deciding in what contexts you should apply that technique.
Here's a simple pattern for recognising keywords, using trailing context:
tail [[:alnum:]]*[$%!#]?
%%
FOR/{tail} { return TOK_FOR; }
TO/{tail} { return TOK_TO; }
NEXT/{tail} { return TOK_NEXT; }
/* etc. */
[[:alpha:]]{tail} { /* Handle an ID */ }
Effectively, that just extends the keyword match without extending the matched token.
But I doubt the problem is so simple. How should FORFORTO be tokenised, for example?

Does each call to `yylex()` generate a token or all the tokens for the input?

I am trying to understand how flex works under the hood.
In the following first example, it seems that main() calls yylex() only once, and yylex() generates all the tokens for the entire input.
In the second example, it seems that main() calls yylex() once per token generated, and yylex() generates a token per call.
Does each call to yylex() generate a token or all the tokens for the input?
Why is yylex() called different number of times in the two examples?
I heard that yylex() is like a coroutine, and each call to it will resume with the rest of the input left from last call and generate a token. In that sense, how does the first example calls yylex() just once and generate all the tokens in the input?
/* just like Unix wc */
%{
int chars = 0;
int words = 0;
int lines = 0;
%}
%%
[a-zA-Z]+ { words++; chars += strlen(yytext); }
\n { chars++; lines++; }
. { chars++; }
%%
main(int argc, char **argv)
{
yylex();
printf("%8d%8d%8d\n", lines, words, chars);
}
$ ./a.out
The boy stood on the burning deck
shelling peanuts by the peck
^D
2 12 63
$
and
/* recognize tokens for the calculator and print them out */
%{
enum yytokentype {
NUMBER = 258,
ADD = 259,
SUB = 260,
MUL = 261,
DIV = 262,
ABS = 263,
EOL = 264
};
int yylval;
%}
%%
"+" { return ADD; }
"-" { return SUB; }
"*" { return MUL; }
"/" { return DIV; }
"|" { return ABS; }
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
\n { return EOL; }
[ \t] { /* ignore whitespace */ }
. { printf("Mystery character %c\n", *yytext); }
%%
main(int argc, char **argv)
{
int tok;
while(tok = yylex()) {
printf("%d", tok);
if(tok == NUMBER) printf(" = %d\n", yylval);
else printf("\n");
}
}
$ ./a.out
a / 34 + |45
Mystery character a
262
258 = 34
259
263
258 = 45
264
^D
$
Flex doesn't decide when the scanner will return (except for the default EOF rule). The scanner which it builds performs lexical actions in a loop until some action returns. So it is entirely up to you how you want to structure your scanner.
However, the classic yyparse/yylex processing model consists of the parser calling yylex() every time it needs a new token. So it expects yylex() to return immediately once it finds a token.
In your first code example, there is no parser and the scanner action is limited to printing out the token. While the example is perfectly correct, relying on the scanner loop to repeatedly execute actions, I'd prefer the second model even if you don't (yet) intend to add a parser, because it will make it easier to decouple token handling from token generation.
That doesn't mean that every lexical action will contain a return statement, though. Some lexical patterns correspond to non-tokens (comments and whitespace, for example), and the corresponding action will most likely do nothing (other than possibly recording input position) so that the scanner will continue to search fir a token to return.
(F)lex scanners are not easy to make into coroutines, so if a coroutine is really required (for example, to incrementally parse a asynchronous input), then another tool might be preferred.
Bison does offer the possiblity to generate a "push parser" in which the scanner calls the parser every time it finds a token, rather than returning to the parser. But neither the "push" nor the traditional "pull" model have anything to do with coroutines, IMHO, and the use of the word to describe parser/scanner interaction strikes me as imprecise and unuseful (although I have a lot of respect for the author you might be quoting.)

How to get such pattern matching of regular expression in lex

Hi I want to check a specific pattern in regular expression but I'm failed to do that. Input should be like
noun wordname:wordmeaning
I'm successful getting noun and wordname but couldn't design a pattern for word meaning. My code is :
int state;
char *meaning;
char *wordd;
^verb { state=VERB; }
^adj { state = ADJ; }
^adv { state = ADV; }
^noun { state = NOUN; }
^prep { state = PREP; }
^pron { state = PRON; }
^conj { state = CONJ; }
//my try but failed
[:\a-z] {
meaning=yytext;
printf(" Meaning is getting detected %s", meaning);
}
[a-zA-Z]+ {
word=yytext;
}
Example input:
noun john:This is a name
Now word should be equal to john and meaning should be equal to This is a name.
Agreeing that lex states (also known as start conditions) are the way to go (odd, but there are no useful tutorials).
Briefly:
your application can be organized as states, using one for "noun", one for "john" and one for the definition (after the colon).
at the top of the lex file, declare the states, e.g.,
%s TYPE NAME VALUE
the capitals are not necessary, but since you are defining constants, that is a good convention.
next to the patterns, put those state names in < > brackets to tell lex that the patterns are used only in those states. You can list more than one state, comma-separated, when it matters. But your lex file probably does not need that.
one state is predefined: INITIAL.
your program switches states using the BEGIN() macro, in actions, e.g.,
{ BEGIN(TYPE); }
if your input is well-formed, it's simple: as each "type" is recognized, it begins the NAME state.
in the NAME state, your lexer looks for whatever you think a name should be, e.g.,
<NAME>[[:alpha:]][[:alnum:]]+ { my_name = strdup(yytext); }
the name ends with a colon, so
<NAME>":" { BEGIN(VALUE); }
the value is then everything until the end of the line, e.g.,
<VALUE>.* { my_value = strdup(yytext); BEGIN(INITIAL); }
whether you switch to INITIAL or TYPE depends on what other things you might add to your lexer (such as ignoring comment lines and whitespace).
Further reading:
Start conditions (flex documentation)
Introduction to Flex

Resources