Conditionals for flex - flex-lexer

Is it possible to place conditional statements around rules in flex? I need this in order to match a specific rule only when some condition is true. Something like this:
%option c++
%option noyywrap
%%
/*if (condition) {*/
[0-9] {}
/*}*/
[0-9]{2} {}
. /* eat up any unmatched character */
%%
int main()
{
    FlexLexer *lexer = new yyFlexLexer();
    lexer->yylex();
    delete lexer;
    return 0;
}
Or is it possible to modify the final generated C++ code so that it matches only some specific regex rules?
UPDATE:
Using start conditions doesn't seem to help. What I want is to be able to match, or not match, a specific regex depending on some external variable (like isNthRegexActive).
For example, if I have 4 regex rules and the 1st and 2nd are not active, the program should check only the other 2, and it should check every active rule rather than stopping at the first match (maybe using REJECT).
Example for 4 rules:
/* Declared at the top */
isActive[0] = false;
isActive[1] = false;
isActive[2] = true;
isActive[3] = true;
%%
[0-9]{4} { printf("1st: %s\n", yytext); REJECT;}
[0-3]{2}[0-3]{2} { printf("2nd: %s\n", yytext); REJECT; }
[1-3]{2}[0-3]{2} { printf("3rd: %s\n", yytext); REJECT; }
[1-2]{2}[1-2]{2} { printf("4th: %s\n", yytext); REJECT; }
.
%%
For the input: 1212 the result should be:
3rd: 1212
4th: 1212

Don't use REJECT unless you absolutely have no alternative. It massively slows down the lexical scan, artificially limits the size of the token buffer, and makes your lexical analysis very hard to reason about.
You might be thinking that a scanner generated by (f)lex tests each regular expression one at a time in order to select the best one for a given match. It does not work that way; that would be much too slow. What it does is, effectively, check all the regular expressions in parallel by using a precompiled deterministic state machine represented as a lookup table.
The state machine does exactly one transition for each input byte, using a simple O(1) lookup into the transition table. (There are different table compression techniques which trade off table size against the constant in the O(1) lookup, but that doesn't change the basic logic.) What all that means is that you can use as many regular expressions as you wish; the time to do the lexical analysis does not depend on the number or complexity of the regular expressions. (Except for caching effects: if your transition table is really big, you might start running into cache misses during the transition lookups. In such cases, you might prefer a compression algorithm which compresses more.)
In most cases, you can use start conditions to achieve conditional matching, although you might need a lot of start conditions if there are more than a few interacting conditions. Since the scanner can only have one active start condition, you'll need to generate a different start condition for each legal combination of the conditions you want to consider. That's usually most easily achieved through automatic generation of your scanner rules, but it can certainly be done by hand if there aren't too many.
It's hard to provide a more concrete suggestion without knowing what kinds of conditions you need to check.
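For instance, with two hypothetical boolean conditions A and B, a sketch of that approach (plain C for brevity; every name here is invented for the illustration) expands them into four exclusive start conditions and lets whatever code tracks the external flags pick one:
%{
#include <stdio.h>
%}
%option noyywrap
/* One exclusive start condition per combination of the two conditions. */
%x NONE ONLY_A ONLY_B BOTH
%%
 /* This rule is active whenever condition A holds. */
<ONLY_A,BOTH>[0-9]+   { printf("number: %s\n", yytext); }
 /* This rule is active whenever condition B holds. */
<ONLY_B,BOTH>[a-z]+   { printf("word: %s\n", yytext); }
 /* Discard anything the currently active rules don't match. */
<*>.|\n               { }
%%
/* Call this before scanning, and again whenever the external flags change. */
void set_conditions(int a_active, int b_active)
{
    if (a_active && b_active) BEGIN(BOTH);
    else if (a_active)        BEGIN(ONLY_A);
    else if (b_active)        BEGIN(ONLY_B);
    else                      BEGIN(NONE);
}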

Related

Is my lexer doing too much -- is it doing the work of the parser?

My input consists of a series of names, each on a new line. Each name consists of a firstname, optional middle initial, and lastname. The name fields are separated by tabs. Here is a sample input:
Sally M. Smith
Tom V. Jones
John Doe
Below are the rules for my Flex lexer. It works fine but I am concerned that my lexer is doing too much: it is determining that a token is a firstname or a middle initial or a lastname. Should that determination be done in the parser, not the lexer? Am I abusing the Flex state capability? What I am seeking is a critique of my lexer. I am just a beginner, how would a parsing expert create lexer rules for this input?
<INITIAL>{
[a-zA-Z]+ { yylval.strval = strdup(yytext); return(FIRSTNAME); }
\t { BEGIN MI_STATE; }
. { BEGIN JUNK_STATE; }
}
<MI_STATE>{
[A-Z]\. { yylval.strval = strdup(yytext); return(MI); }
\t { BEGIN LASTNAME_STATE; }
. { BEGIN JUNK_STATE; }
}
<LASTNAME_STATE>{
[a-zA-Z]+ { yylval.strval = strdup(yytext); return(LASTNAME); }
\n { BEGIN INITIAL; return EOL; }
. { BEGIN JUNK_STATE; }
}
<JUNK_STATE>. { printf("JUNK: %s\n", yytext); }
You can use lexer states as you do in this question. But it's better to use them as a means to conditionally activate rules. For example, think of handling multi-line comments or here documents or (for us silverbacks) embedded SQL.
In your question, there's no lexical difference between a given name and a family name -- they both are matched by [a-zA-Z]+, as would be middle names, if you were to extend your lexer.
Short answer: yes, lex NAME tokens and let the parser determine whether you have three NAME tokens on a line.
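A minimal sketch of that split (the token names, the trailing-dot tweak for initials, and the bison fragment are all just illustrative; it reuses the yylval.strval field from the question):
[a-zA-Z]+\.?   { yylval.strval = strdup(yytext); return NAME; }  /* "Sally", "M.", "Smith" all lex as NAME */
\t             { /* field separator: nothing to return */ }
\n             { return EOL; }
.              { return JUNK; }
and on the parser side, something like:
line : NAME NAME NAME EOL   /* first, middle initial, last */
     | NAME NAME EOL        /* first, last */
     ;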
Yes; your lexer is parsing. The main evidence is that it's implementing identical rules in different start states. Two rules have exactly the same pattern.
The purpose of start states in the context of lexing is to modify the lexical grammar in order to shield the parser from certain differences. It works with the parser. For instance, say you had some document language in which $ shifts into math expression mode, which has different tokenizing rules. The lexer still just returns tokens in math mode; it doesn't try to parse the math expressions. It is the parser which will determine that, if the brackets are balanced, then another $ can shift out of math mode.
In your code the rules for returning a last name and first name are identical; you have used the start state to handle phrase-structure syntax: the fact that the last name comes later than the first name.
Another bit of telltale evidence that the lexer is parsing is that the lexer itself is handling all of the start condition changes. In our $...$ math mode example, we might have the lexer shift into a start state when it sees the $ symbol. However, if the lexer also recognizes the end of math mode, then that is evidence it is parsing the math mode expression. The end can only be recognized by following the nested phrase structure grammar of math mode expressions. The way you would handle that would be for the lexer to expose a function lex_end_math_mode(). When the parser processes and reduces the entire math mode expression, it calls this function to tell the lexer to switch back to the lexical syntax outside of math mode. The math-mode-terminating dollar sign would likely also appear as a token visible to the parser, though the leading one might not. So that is to say, the parser parses math_mode_expr : expr '$': a math mode expression followed by a required dollar sign to end math mode. The action for that rule would include the call to lex_end_math_mode, so the lexer returns to the tokenization rules outside of math mode for scanning the next token after the closing $.
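Sketched in flex terms (the start-condition and token names are invented; only the shape of the hand-off matters):
%x MATH
%%
"$"             { BEGIN(MATH); }   /* enter math mode; the opening $ need not reach the parser */
<MATH>[0-9]+    { return NUMBER; }
<MATH>[-+*/()]  { return *yytext; }
<MATH>"$"       { return '$'; }    /* just a token; the lexer does NOT leave MATH here */
.|\n            { /* ordinary document text */ }
%%
/* Called by the parser in the action for math_mode_expr : expr '$' */
void lex_end_math_mode(void) { BEGIN(INITIAL); }
One caveat with this arrangement: an LALR parser may have already read one token of lookahead before it performs the reduction that calls lex_end_math_mode(), and that lookahead token will still have been scanned under the MATH rules.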
There is no right or wrong answer because it's all parsing. Every grammar that is divided into tokens and phrase structure rules could be expressed by a unified grammar which includes the rules for the token structure.
Why we often use a design which separates lexical scanning and parsing is that:
Unifying the lexical and phrase structure grammar into one will turn LL(1) into LL(k). A recursive-descent parser then needs to look k symbols ahead to make parsing decisions. For instance if you're parsing C with this holistic approach, you need to treat int as a reserved keyword. That requires four symbols of lookahead: you have to recognize i, n, t, and then if the next symbol indicates that the token has ended, you treat that as the keyword, otherwise an identifier.
Performance: lexical scanning uses efficient techniques tailored to that task, which take advantage of the restriction that the lexical grammar is regular.
Correspondence to spec: if you have a language whose specification is described in terms of a lexical grammar separate from a phrase structure grammar, then if you implement it that way, features of your code are more readily traceable to features of requirement spec. You may be able to write unit tests which separately show that the lexing and parsing obeys the spec.
Schooling: people who went through a CS program that included a course on compiler construction had separate lexing and parsing drilled into their heads, and whenever it comes up in their subsequent career, they just lean on that wisdom. They are never confronted with situations in which they recognize it as not being a good approach, and don't question it.
Whatever works in your individual situations with whatever you're parsing overrules the theory. If it's convenient for you to recognize some phrase-like fragments in the lexer, and you're able to convince yourself that it's a clean approach, then by all means do it that way.

How to prevent Flex from ignoring previous analysis?

I recently started using Lex. As a simple way to explain the problem I encountered, suppose that I'm trying to build a lexical analyser with Flex that prints all the letters and also all the bigrams in a given text. That seems very easy and simple, but once I implemented it, I realised that it shows bigrams first and only shows letters when they are single. For example, for the following text
QQQZ ,JQR
The result is
Bigram QQ
Bigram QZ
Bigram JQ
Letter R
Done
This is my lex code
%{
#include <stdio.h>
%}
%option noyywrap
letter [A-Za-z]
Separ  [ \t\n]
%%
{letter}     { printf(" Letter %c\n", yytext[0]); }
{letter}{2}  { printf(" Bigram %s\n", yytext); }
%%
int main(void)
{
    yylex();
    printf("Done");
    return 0;
}
My question is: how can I carry out the two analyses separately, knowing that my actual problem isn't as simple as this example?
Lexical analysers divide the source text into separate tokens. If your problem looks like that, then (f)lex is an appropriate tool. If your problem does not look like that, then (f)lex is probably not the correct tool.
Doing two simultaneous analyses of text is not really a use case for (f)lex. One possibility would be to use two separate reentrant lexical analysers, arranging to feed them the same inputs. However, that will be a lot of work for a problem which could easily be solved in a few lines of C.
Since you say that your problem is different from the simple problem in your question, I did not bother to either write the simple C code or the rather more complicated code to generate and run two independent lexical analysers, since it is impossible to know whether either of those solutions is at all relevant.
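For what it's worth, the plain-C version of the toy letter/bigram problem might look roughly like this (a sketch only; it reports overlapping bigrams and resets at separators):
#include <ctype.h>
#include <stdio.h>

int main(void)
{
    int prev = 0, c;
    while ((c = getchar()) != EOF) {
        if (isalpha(c)) {
            printf(" Letter %c\n", c);
            if (prev)
                printf(" Bigram %c%c\n", prev, c);
            prev = c;
        } else {
            prev = 0;   /* a separator breaks the bigram chain */
        }
    }
    printf("Done");
    return 0;
}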
If your problem really is matching two (or more) different lexemes from the same starting position, you could use one of two strategies, both quite ugly (IMHO):
I'm assuming the existence of handler functions:
void handle_letter(char ch);
void handle_bigram(char* s); /* Expects NUL-terminated string */
void handle_trigram(char* s); /* Expects NUL-terminated string */
For historical reasons, lex implements the REJECT action, which causes the current match to be discarded. The idea was to let you process a match, and then reject it in order to process a shorter (or alternate) match. With flex, the use of REJECT is highly discouraged because it is extremely inefficient and also prevents the lexer from resizing the input buffer, which arbitrarily limits the length of a recognisable token. However, in this particular use case it is quite simple:
[[:alpha:]][[:alpha:]][[:alpha:]] handle_trigram(yytext); REJECT;
[[:alpha:]][[:alpha:]] handle_bigram(yytext); REJECT;
[[:alpha:]] handle_letter(*yytext);
If you want to try this solution, I recommend using flex's debug facility (flex -d ...) in order to see what is going on.
See debugging options and REJECT documentation.
The solution I would actually recommend, although the code is a bit clunkier, is to use yyless() to reprocess part of the recognised token. This is quite a bit more efficient than REJECT; yyless() just changes a single pointer, so it has no impact on speed. Without REJECT, we have to know all the lexeme handlers which will be needed, but that's not very difficult. A complication is the interface for handle_bigram, which requires a NUL-terminated string. If your handler didn't impose this requirement, the code would be simpler.
[[:alpha:]][[:alpha:]][[:alpha:]]  { handle_trigram(yytext);
                                     char tmp = yytext[2];
                                     yytext[2] = 0;
                                     handle_bigram(yytext);
                                     yytext[2] = tmp;
                                     handle_letter(yytext[0]);
                                     yyless(1);
                                   }
[[:alpha:]][[:alpha:]]             { handle_bigram(yytext);
                                     handle_letter(yytext[0]);
                                     yyless(1);
                                   }
[[:alpha:]] handle_letter(*yytext);
See yyless() documentation

How does Flex distinguish between A, AB, and ABC?

I made this experiment with Flex to see whether, if I enter ABC, it will see all of A, AB, and ABC, or only ABC, or only the first match in the list of expressions.
%{
#include <stdio.h>
%}
%%
A puts("got A");
AB puts("got AB");
ABC puts("got ABC");
%%
int main(int argc, char **argv)
{
    yylex();
    return 0;
}
When I enter ABC after compiling and running the program, it responds with "got ABC", which really surprises me, since I thought lex doesn't keep track of visited text and only finds the first match; but actually, it seems to find the longest match.
What strategy does Flex use to respond to A if and only if there is no longer match?
The fact that (F)lex uses the maximal-munch principle should hardly be surprising, since it is well documented in the Flex manual:
When the generated scanner is run, it analyzes its input looking for strings which match any of its patterns. If it finds more than one match, it takes the one matching the most text…. If it finds two or more matches of the same length, the rule listed first in the flex input file is chosen.
(First paragraph of the section "How the input is matched")
The precise algorithm is exceedingly simple: every time a token is requested, flex scans the text, moving through the DFA. Every time it hits an accepting state, it records the current text position. When no more transitions are possible, it returns to the last recorded accept position, and that becomes the end of the token. With the question's rules and the input ABC followed by a newline, for example, the scanner records accepts after A, AB, and ABC; the newline admits no further transition, so it falls back to the last recorded accept and ABC becomes the token.
The consequence is that (F)lex can scan the same text multiple times, although it only scans once for each token.
A set of lexical rules which require excessive back-tracking will slow down the lexical scan. This is discussed in the Flex manual section Performance Considerations, along with some strategies to avoid the issue. However, except in pathological cases, the overhead from back-tracking is not noticeable.

Flex Lexer REGEX Optimiser

Is there a Flex REGEX optimiser? There is something similar as a perl module:
http://search.cpan.org/~rsavage/Regexp-Assemble-0.37/
but unfortunately it doesn't support lex regex syntax. Pretty much what I want is a tool to optimise the regex
TOMA|TOMOV
to
TOM(A|OV)
Thank you all in advance.
There is really no need to do that. Flex compiles the regexes into a single deterministic finite state machine (FSM), and aside from minor cache effects with very large scanner definitions, there is no performance penalty for alternation or repetition operators.
Flex does not minimize the FSM, but that would only reduce the size of the FSM tables, not the speed of lexical analysis (aside from the aforementioned cache effects, if applicable).
Even though the FSM is not minimized, the NFA-to-DFA conversion process will perform the particular transformation you suggest. As a consequence, the following two rules produce exactly the same lexical analyzer:
TOMA|TOMOV { /* do something */ }
TOM(A|OV) { /* do something */ }
Although there is no real performance penalty, you should attempt to avoid the following, which unnecessarily duplicates the action code:
TOMA { /* do something */ }
TOMOV { /* do the same thing */ }
You might also find the discussion in this question useful.

How do you write a lexer parser where identifiers may begin with keywords?

Suppose you have a language where identifiers might begin with keywords. For example, suppose "case" is a keyword, but "caser" is a valid identifier. Suppose also that the lexer rules can only handle regular expressions. Then it seems that I can't place keyword rules ahead of the identifier rule in the lexer, because this would parse "caser" as "case" followed by "r". I also can't place keyword lexing rules after the identifier rule, since the identifier rule would match the keywords, and the keyword rules would never trigger.
So, instead, I could make a keyword_or_identifier rule in the lexer, and have the parser determine if a keyword_or_identifier is a keyword or an identifier. Is this what is normally done?
I realize that "use a different lexer that has lookahead" is an answer (kind of), but I'm also interested in how this is done in a traditional DFA-based lexer, since my current lexer seems to work that way.
Most lexers, starting with the original lex, match alternatives as follows:
Use the longest match.
If there are two or more alternatives which tie for the longest match, use the first one in the lexer definition.
This allows the following style:
"case" { return CASE; }
[[:alpha:]][[:alnum:]]* { return ID; }
If the input pattern is caser, then the second alternative will be used because it's the longest match. If the input pattern is case r, then the first alternative will be used because both of them match case, and by rule (2) above, the first one wins.
Although this may seem a bit arbitrary, it's consistent with the DFA approach, mostly. First of all, a DFA doesn't stop the first time it reaches an accepting state. If it did, then patterns like [[:alpha:]][[:alnum:]]* would be useless, because they enter an accepting state on the first character (assuming it's alphabetic). Instead, DFA-based lexers continue until there are no possible transitions from the current state, and then they back up until the last accepting state. (See below.)
A given DFA state may be accepting because of two different rules, but that's not a problem either; only the first accepting rule is recorded.
To be fair, this is slightly different from the mathematical model of a DFA, which has a transition for every symbol in every state (although many of them may be transitions to a "sink" state), and which matches a complete input depending on whether or not the automaton is in an accepting state when the last symbol of the input is read. The lexer model is slightly different, but can easily be formalized as well.
The only difficulty in the theoretical model is "back up to the last accepting state". In practice, this is generally done by remembering the state and input position every time an accepting state is reached. This does mean that it may be necessary to rewind the input stream, possibly by an arbitrary amount.
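In rough pseudo-C, that inner loop looks something like this (all names invented; the real flex skeleton is more elaborate and also handles the case where no accepting state was ever reached):
int state = start_state;
int last_accept_rule = -1;
const char *cur = token_start, *last_accept_pos = token_start;

for (;;) {
    int next = next_state(state, (unsigned char)*cur);   /* one table lookup per input byte */
    if (next == NO_TRANSITION)
        break;                              /* dead end: stop scanning forward */
    state = next;
    cur++;
    if (rule_accepted_in(state) >= 0) {     /* passed through an accepting state */
        last_accept_rule = rule_accepted_in(state);
        last_accept_pos  = cur;             /* remember how far that match reached */
    }
}
/* "Back up": the token ends at the last recorded accepting position;
   anything scanned beyond it is rewound and rescanned for the next token. */
token_end    = last_accept_pos;
matched_rule = last_accept_rule;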
Most languages do not require backup very often, and very few require indefinite backup. Some lexer generators can generate faster code if there are no backup states. (flex will do this if you use -Cf or -CF.)
One common case which leads to indefinite backup is failing to provide an appropriate error return for string literals:
["][^"\n]*["] { return STRING; }
/* ... */
. { return INVALID; }
Here, the first pattern will match a string literal starting with " if there is a matching " on the same line. (I omitted \-escapes for simplicity.) If the string literal is unterminated, the last pattern will match, but the input will need to be rewound to the ". In most cases, it's pointless trying to continue lexical analysis by ignoring an unmatched "; it would make more sense to just ignore the entire remainder of the line. So not only is backing up inefficient, it also is likely to lead to an explosion of false error messages. A better solution might be:
["][^"\n]*["] { return STRING; }
["][^"\n]* { return INVALID_STRING; }
Here, the second alternative can only succeed if the string is unterminated, because if the string is terminated, the first alternative will match one more character. Consequently, it doesn't even matter in which order these alternatives appear, although everyone I know would put them in the same order I did.
