Flex Lexer REGEX Optimiser - flex-lexer

Is there a Flex REGEX optimiser? There is something similar available as a Perl module:
http://search.cpan.org/~rsavage/Regexp-Assemble-0.37/
but unfortunately it doesn't support lex regex syntax. Pretty much what I want is a tool which can optimise the regex
TOMA|TOMOV
to
TOM(A|OV)
Thank you all in advance.

There is really no need to do that. Flex compiles the regexes into a single deterministic finite state machine (FSM), and aside from minor cache effects with very large scanner definitions, there is no performance penalty for alternation or repetition operators.
Flex does not minimize the FSM, but that would only reduce the size of the FSM tables, not the speed of lexical analysis (aside from the aforementioned cache effects, if applicable).
Even though the FSM is not minimized, the NFA-to-DFA conversion process will perform the particular transformation you suggest. As a consequence, the following two rules produce exactly the same lexical analyzer:
TOMA|TOMOV { /* do something */ }
TOM(A|OV) { /* do something */ }
Although there is no real performance penalty, you should attempt to avoid the following, which unnecessarily duplicates the action code:
TOMA { /* do something */ }
TOMOV { /* do the same thing */ }
You might also find the discussion in this question useful.

Related

Conditionals for flex

Is it possible to place conditional statements for rules in flex? I need this in order to match a specific rule only when some condition is true. Something like this:
%option c++
%option noyywrap
%%
/*if (condition) {*/
[0-9] {}
/*}*/
[0-9]{2} {}
. /* eat up any unmatched character */
%%
void yylex(void);
int main()
{
    FlexLexer *lexer = new yyFlexLexer();
    lexer->yylex();
    delete lexer;
}
Or is it possible to modify the final C++ generated code in order to match only some specific regex rules?
UPDATE:
Using start conditions doesn't seem to help. What I want is, depending on some external variable (like isNthRegexActive), to be able to enable or disable matching of a specific regex.
For example, if I have 4 regex rules and the 1st and 2nd are not active, the program should only check the other 2, and it should always check all of the active rules (don't stop at the first match - maybe use REJECT).
Example for 4 rules:
/* Declared at the top */
isActive[0] = false;
isActive[1] = false;
isActive[2] = true;
isActive[3] = true;
%%
[0-9]{4} { printf("1st: %s\n", yytext); REJECT;}
[0-3]{2}[0-3]{2} { printf("2nd: %s\n", yytext); REJECT; }
[1-3]{2}[0-3]{2} { printf("3rd: %s\n", yytext); REJECT; }
[1-2]{2}[1-2]{2} { printf("4th: %s\n", yytext); REJECT; }
.
%%
For the input: 1212 the result should be:
3rd: 1212
4th: 1212
Don't use REJECT unless you absolutely have no alternative. It massively slows down the lexical scan, artificially limits the size of the token buffer, and makes your lexical analysis very hard to reason about.
You might be thinking that a scanner generated by (f)lex tests each regular expression one at a time in order to select the best one for a given match. It does not work that way; that would be much too slow. What it does is, effectively, check all the regular expressions in parallel by using a precompiled deterministic state machine represented as a lookup table.
The state machine does exactly one transition for each input byte, using a simple O(1) lookup into the transition table. (There are different table compression techniques which trade off table size against the constant in the O(1) lookup, but that doesn't change the basic logic.) What that all means is that you can use as many regular expressions as you wish; the time to do the lexical analysis does not depend on the number or complexity of the regular expressions. (Except for caching effects: if your transition table is really big, you might start running into cache misses during the transition lookups. In such cases, you might prefer a compression algorithm which compresses more.)
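To make that concrete, here is a schematic in C of the per-byte lookup. It is only an illustration: flex's real tables are compressed and generated automatically, and the longest-match bookkeeping flex also performs is omitted. The hand-built table corresponds to the pattern [0-9]{4} from the question:
#include <stdio.h>

/* Schematic of table-driven matching: one constant-time table lookup per
 * input byte.  The hand-built DFA recognises [0-9]{4}; flex would build
 * (and compress) such a table for all the rules at once. */

enum { NSTATES = 6, ACCEPT = 4, DEAD = 5 };

/* transition[state][symbol], where symbol 0 = digit, 1 = anything else */
static const int transition[NSTATES][2] = {
    {1, DEAD},    /* 0: start                   */
    {2, DEAD},    /* 1: one digit               */
    {3, DEAD},    /* 2: two digits              */
    {4, DEAD},    /* 3: three digits            */
    {DEAD, DEAD}, /* 4: four digits (accepting) */
    {DEAD, DEAD}, /* 5: dead state              */
};

static int matches(const char *s) {
    int state = 0;
    for (; *s; s++)
        state = transition[state][(*s >= '0' && *s <= '9') ? 0 : 1];
    return state == ACCEPT;
}

int main(void) {
    printf("1212 -> %s\n", matches("1212") ? "match" : "no match");
    printf("12x2 -> %s\n", matches("12x2") ? "match" : "no match");
    return 0;
}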
In most cases, you can use start conditions to achieve conditional matching, although you might need a lot of start conditions if there are more than a few interacting conditions. Since the scanner can only have one active start condition, you'll need to generate a different start condition for each legal combination of the conditions you want to consider. That's usually most easily achieved through automatic generation of your scanner rules, but it can certainly be done by hand if there aren't too many.
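Purely as an illustration of the one-start-condition-per-combination idea (the flag names, flag values and rule bodies below are invented, and only two of the four example rules are shown), a sketch might look like:
%{
#include <stdio.h>
/* Sketch only: one exclusive start condition per combination of two
 * hypothetical boolean flags ("rule 1 active" = A, "rule 2 active" = B). */
%}
%option noyywrap
%x A0B0 A0B1 A1B0 A1B1
%%
<A1B0,A1B1>[0-9]{4}           { printf("1st: %s\n", yytext); }
<A0B1,A1B1>[0-3]{2}[0-3]{2}   { printf("2nd: %s\n", yytext); }
<*>.|\n                       { /* ignore everything else */ }
%%
int main(void)
{
    /* Pick the start condition once, from the external flags, before scanning. */
    int rule1_active = 0, rule2_active = 1;   /* illustrative values */
    if (rule1_active)
        BEGIN(rule2_active ? A1B1 : A1B0);
    else
        BEGIN(rule2_active ? A0B1 : A0B0);
    return yylex();
}
Note that this only addresses enabling and disabling rules; it does not reproduce the fire-every-matching-rule behaviour of the REJECT example above.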
It's hard to provide a more concrete suggestion without knowing what kinds of conditions you need to check.

How to prevent Flex from ignoring previous analysis?

I recently started using Lex. As a simple way to explain the problem I encountered, suppose that I'm trying to write a lexical analyser with Flex that prints all the letters and also all the bigrams in a given text. That seems very easy and simple, but once I implemented it, I realised that it shows bigrams first and only shows letters when they are single. For example, for the following text
QQQZ ,JQR
The result is
Bigram QQ
Bigram QZ
Bigram JQ
Letter R
Done
This is my lex code
%{
%}
letter [A-Za-z]
Separ [ \t\n]
%%
{letter}    { printf(" Letter %c\n", yytext[0]); }
{letter}{2} { printf(" Bigram %s\n", yytext); }
%%
int main()
{
    yylex();
    printf("Done");
    return 0;
}
My question is: how can I realise the two analyses separately, knowing that my actual problem isn't as simple as this example?
Lexical analysers divide the source text into separate tokens. If your problem looks like that, then (f)lex is an appropriate tool. If your problem does not look like that, then (f)lex is probably not the correct tool.
Doing two simultaneous analyses of text is not really a use case for (f)lex. One possibility would be to use two separate reentrant lexical analysers, arranging to feed them the same inputs. However, that will be a lot of work for a problem which could easily be solved in a few lines of C.
Since you say that your problem is different from the simple problem in your question, I did not bother to either write the simple C code or the rather more complicated code to generate and run two independent lexical analysers, since it is impossible to know whether either of those solutions is at all relevant.
If your problem really is matching two (or more) different lexemes from the same starting position, you could use one of two strategies, both quite ugly (IMHO):
I'm assuming the existence of handler functions:
void handle_letter(char ch);
void handle_bigram(char* s); /* Expects NUL-terminated string */
void handle_trigram(char* s); /* Expects NUL-terminated string */
For historical reasons, lex implements the REJECT action, which causes the current match to be discarded. The idea was to let you process a match, and then reject it in order to process a shorter (or alternate) match. With flex, the use of REJECT is highly discouraged because it is extremely inefficient and also prevents the lexer from resizing the input buffer, which arbitrarily limits the length of a recognisable token. However, in this particular use case it is quite simple:
[[:alpha:]][[:alpha:]][[:alpha:]] handle_trigram(yytext); REJECT;
[[:alpha:]][[:alpha:]] handle_bigram(yytext); REJECT;
[[:alpha:]] handle_letter(*yytext);
If you want to try this solution, I recommend using flex's debug facility (flex -d ...) in order to see what is going on.
See debugging options and REJECT documentation.
The solution I would actually recommend, although the code is a bit clunkier, is to use yyless() to reprocess part of the recognised token. This is quite a bit more efficient than REJECT; yyless() just changes a single pointer, so it has no impact on speed. Without REJECT, we have to know all the lexeme handlers which will be needed, but that's not very difficult. A complication is the interface for handle_bigram, which requires a NUL-terminated string. If your handler didn't impose this requirement, the code would be simpler.
[[:alpha:]][[:alpha:]][[:alpha:]]  { handle_trigram(yytext);
                                     char tmp = yytext[2];
                                     yytext[2] = 0;
                                     handle_bigram(yytext);
                                     yytext[2] = tmp;
                                     handle_letter(yytext[0]);
                                     yyless(1);
                                   }
[[:alpha:]][[:alpha:]]             { handle_bigram(yytext);
                                     handle_letter(yytext[0]);
                                     yyless(1);
                                   }
[[:alpha:]]                        handle_letter(*yytext);
See yyless() documentation

Which part of parsing, exactly, should be done by the lexical analyser?

Does there exist a formal definition of the purpose, or at least a clear best practice of usage, of lexical analysis (a lexer) during/before parsing?
I know that the purpose of a lexer is to transform a stream of characters to a stream of tokens, but can't it happen that in some (context-free) languages the intended notion of a "token" could nonetheless depend on the context and "tokens" could be hard to identify without complete parsing?
There seems to be nothing obviously wrong with having a lexer that transforms every input character into a token and lets the parser do the rest. But would it be acceptable to have a lexer that differentiates, for example, between a "unary minus" and a usual binary minus, instead of leaving this to the parser?
Are there any precise rules to follow when deciding what shall be done by the lexer and what shall be left to the parser?
Does there exist a formal definition of the purpose [of a lexical analyzer]?
No. Lexical analyzers are part of the world of practical programming, for which formal models are useful but not definitive. A program which purports to do something should do that thing, of course, but "lexically analyze my programming language" is not a sufficiently precise requirements statement.
… or a clear best practice of usage
As above, the lexical analyzer should do what it purports to do. It should also not attempt to do anything else. Code duplication should be avoided. Ideally, the code should be verifiable.
These best practices motivate the use of a mature and well-documented scanner framework whose input language doubles as a description of the lexical grammar being analyzed. However, practical considerations based on the idiosyncrasies of particular programming languages normally result in deviations from this ideal.
There seems to be nothing obviously wrong with having a lexer that transforms every input character into a token…
In that case, the lexical analyzer would be redundant; the parser could simply use the input stream as is. This is called "scannerless parsing", and it has its advocates. I'm not one of them, so I won't enter into a discussion of pros and cons. If you're interested, you could start with the Wikipedia article and follow its links. If this style fits your problem domain, go for it.
can't it happen that in some (context-free) languages the intended notion of a "token" could nonetheless depend on the context?
Sure. A classic example is found in EcmaScript regular expression "literals", which need to be lexically analyzed with a completely different scanner. EcmaScript 6 also defines string template literals, which require a separate scanning environment. This could motivate scannerless processing, but it can also be implemented with an LR(1) parser with lexical feedback, in which the reduce action of particular marker non-terminals causes a switch to a different scanner.
But would it be acceptable to have a lexer that differentiates, for example, between a "unary minus" and a usual binary minus, instead of leaving this to the parser?
Anything is acceptable if it works, but that particular example strikes me as not particularly useful. LR (and even LL) expression parsers do not require any aid from the lexical scanner to show the context of a minus sign. (Naïve operator precedence grammars do require such assistance, but a more carefully thought out op-prec architecture wouldn't. However, the existence of LALR parser generators more or less obviates the need for op-prec parsers.)
Generally speaking, for the lexer to be able to identify syntactic context, it needs to duplicate the analysis being done by the parser, thus violating one of the basic best practices of code development ("don't duplicate functionality"). Nonetheless, it can occasionally be useful, so I wouldn't go so far as to advocate an absolute ban. For example, many parsers for yacc/bison-like production rules compensate for the fact that a naïve grammar is LALR(2) by specially marking ID tokens which are immediately followed by a colon.
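As an illustration of that last trick, the scanner can hand the parser the extra lookahead it needs with (f)lex trailing context; a rough sketch (ID and ID_COLON are invented token names):
[[:alpha:]_][[:alnum:]_.]*/[[:space:]]*":"   { return ID_COLON;  /* a name being defined */ }
[[:alpha:]_][[:alnum:]_.]*                   { return ID;        /* an ordinary identifier */ }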
Another example, again drawn from EcmaScript, is efficient handling of automatic semicolon insertion (ASI), which can be done using a lookup table whose keys are 2-tuples of consecutive tokens. Similarly, Python's whitespace-aware syntax is conveniently handled by assistance from the lexical scanner, which must be able to understand when indentation is relevant (not inside parentheses or braces, for example).

Good practice to parse data in a custom format

I'm writing a program that takes in input a straight play in a custom format and then performs some analysis on it (like number of lines and words for each character). It's just for fun, and a pretext for learning cool stuff.
The first step in that process is writing a parser for that format. It goes:
####Play
###Act I
##Scene 1
CHARACTER 1. Line 1, he's saying some stuff.
#Comment, stage direction
CHARACTER 2, doing some stuff. Line 2, she's saying some stuff too.
It's quite a simple format. I read extensively about basic parser stuff like CFG, so I am now ready to get some work done.
I have written my grammar in EBNF and started playing with flex/bison, but it raises some questions:
Is flex/bison too much for such a simple parser? Should I just write it myself as described here: Is there an alternative for flex/bison that is usable on 8-bit embedded systems?
What is good practice regarding the respective tasks of the tokenizer and the parser itself? There is never a single solution, and for such a simple language they often overlap. This is especially true for flex/bison, where flex can perform some intense stuff with regex matching. For example, should "#" be a token? Should "####" be a token too? Should I create types that carry semantic information so I can directly identify, for example, a character? Or should I just process it with flex the simplest way and then let the grammar defined in bison decide what is what?
With flex/bison, does it make sense to perform the analysis while parsing or is it more elegant to parse first, then operate on the file again with some other tool?
This got me really confused. I am looking for an elegant, perhaps simple solution. Any guidelines?
By the way, about the programming language, I don't care much. For now I am using C because of flex/bison, but feel free to advise me on anything more practical as long as it is a widely used language.
It's very difficult to answer those questions without knowing what your parsing expectations are. That is, an example of a few lines of text does not provide a clear understanding of what the intended parse is; what the lexical and syntactic units are; what relationships you would like to extract; and so on.
However, a rough guess might be that you intend to produce a nested parse, where ##{i} (that is, two or more # marks) indicates the nesting level (inversely), with i≥1, since a single # is not structural. That violates one principle of language design ("don't make the user count things which the computer could count more accurately"), which might suggest a structure more like:
#play {
  #act {
    #scene {
      #location: Elsinore. A platform before the castle.
      #direction: FRANCISCO at his post. Enter to him BERNARDO
      BERNARDO: Who's there?
      FRANCISCO: Nay, answer me: stand, and unfold yourself.
      BERNARDO: Long live the king!
      FRANCISCO: Bernardo?
or even something XML-like. But that would be a different language :)
The problem with parsing either of these with a classic scanner/parser combination is that the lexical structure is inconsistent; the first token on a line is special, but most of the file consists of unparsed text. That will almost inevitably lead to spreading syntactic information between the scanner and the parser, because the scanner needs to know the syntactic context in order to decide whether or not it is scanning raw text.
You might be able to avoid that issue. For example, you might require that a continuation line start with whitespace, so that every line not otherwise marked with #'s starts with the name of a character. That would be more reliable than recognizing a dialogue line just because it starts with the name of a character and a period, since it is quite possible for a character's name to be used in dialogue, even at the end of a sentence (which consequently might be the first word in a continuation line.)
If you do intend for dialogue lines to be distinguished by the fact that they start with a character name and some punctuation then you will definitely have to give the scanner access to the character list (as a sort of symbol table), which is a well-known but not particularly respected hack.
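Purely as an illustration of that line-oriented view (assuming the whitespace-continuation convention; the printfs stand in for returning tokens, with semantic values, to a parser), a sketch of such a scanner might be:
%{
#include <stdio.h>
/* Sketch only: classify each line of the play format.  Splitting a speech
 * line into character name and dialogue is left to the parser (or to a
 * further rule), and no symbol table of character names is consulted.
 * The continuation rule must precede the speech rule, because flex breaks
 * ties in match length in favour of the earlier rule. */
%}
%option noyywrap
%%
^"####"[^\n]*   { printf("PLAY:    %s\n", yytext + 4); }
^"###"[^\n]*    { printf("ACT:     %s\n", yytext + 3); }
^"##"[^\n]*     { printf("SCENE:   %s\n", yytext + 2); }
^"#"[^\n]*      { printf("COMMENT: %s\n", yytext + 1); }
^[ \t]+[^\n]*   { printf("CONT:    %s\n", yytext); }
^[^#\n][^\n]*   { printf("SPEECH:  %s\n", yytext); }
\n              { /* end of line */ }
%%
int main(void)
{
    return yylex();
}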
Consider the above a reflection about your second question ("What are the roles of the scanner and the parser?"), which does not qualify as an answer but hopefully is at least food for thought. As to your other questions, and recognizing that all of this is opinionated:
Is flex/bison too much for such a simple parser? Should I just write it myself...
The fact that flex and bison are (potentially) more powerful than necessary to parse a particular language is a red herring. C is more powerful than necessary to write a factorial function -- you could easily do it in assembler -- but writing a factorial function is a good exercise in learning C. Similarly, if you want to learn how to write parsers, it's a good idea to start with a simple language; obviously, that's not going to exercise every option in the parser/scanner generators, but it will get you started. The question really is whether the language you're designing is appropriate for this style of parsing, not whether it is too simple.
With flex/bison, does it make sense to perform the analysis while parsing or is it more elegant to parse first, then operate on the file again with some other tool?
Either can be elegant, or disastrous; elegance has more to do with how you structure your thinking about the problem at hand. Having said that, it is often better to build a semantic structure (commonly referred to as an AST -- abstract syntax tree) during the parse phase and then analyse that structure using other functions.
Rescanning the input file is very unlikely to be either elegant or effective.
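As a very small illustration of that approach for the play example (the structure and field names are invented; in practice the parser's actions would build the list):
#include <stdio.h>
#include <string.h>

/* Hypothetical AST for the play format: a flat list of speech lines is
 * already enough for per-character statistics; acts and scenes could be
 * added as enclosing nodes later. */
struct speech {
    const char *character;   /* who is speaking */
    const char *text;        /* what they say   */
    struct speech *next;
};

/* Analysis pass, run after parsing: count the lines spoken by one character. */
static int count_lines(const struct speech *list, const char *who) {
    int n = 0;
    for (; list; list = list->next)
        if (strcmp(list->character, who) == 0)
            n++;
    return n;
}

int main(void) {
    /* Built by hand here just to show the shape of the analysis. */
    struct speech s2 = {"FRANCISCO", "Nay, answer me: stand, and unfold yourself.", NULL};
    struct speech s1 = {"BERNARDO", "Who's there?", &s2};
    printf("BERNARDO speaks %d line(s)\n", count_lines(&s1, "BERNARDO"));
    return 0;
}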

Refactoring of decision trees using automatic learning

The problem is the following:
I developed an expression evaluation engine that provides an XPath-like language to the user so he can build the expressions. These expressions are then parsed and stored as an expression tree. There are many kinds of expressions, including logical (and/or/not), relational (=, !=, >, <, >=, <=), arithmetic (+, -, *, /) and if/then/else expressions.
Besides these operations, the expression can have constants (numbers, strings, dates, etc) and also access to external information by using a syntax similar to XPath to navigate in a tree of Java objects.
Given the above, we can build expressions like:
/some/value and /some/other/value
/some/value or /some/other/value
if (<evaluate some expression>) then
    <evaluate some other expression>
else
    <do something else>
Since the then-part and the else-part of the if-then-else expressions are expressions themselves, and everything is considered to be an expression, then anything can appear there, including other if-then-else's, allowing the user to build large decision trees by nesting if-then-else's.
As these expressions are built manually and prone to human error, I decided to build an automatic learning process capable of optimizing these expression trees based on the analysis of common external data. For example: in the first expression above (/some/value and /some/other/value), if the result of /some/other/value is false most of the times, we can rearrange the tree so this branch will be the left branch to take advantage of short-circuit evaluation (the right side of the AND is not evaluated since the left side already determined the result).
Another possible optimization is to rearrange nested if-then-else expressions (decision trees) so the most frequent path taken, based on the most common external data used, will be executed sooner in the future, avoiding unnecessary evaluation of some branches most of the times.
Do you have any ideas on what would be the best or recommended approach/algorithm to use to perform this automatic refactoring of these expression trees?
I think what you are describing are compiler optimizations, which is a huge subject covering everything from
inline expansion
dead-code elimination
constant propagation
loop transformation
Basically you have a lot of rewrite rules that are guaranteed to preserve the functionality of the code/xpath.
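As a sketch of the statistics-driven reordering described in the question (the node layout and counter fields are assumptions, and it presumes the operands have no side effects, so swapping them preserves the result):
#include <stdbool.h>

/* Each node records how often it was evaluated and how often it came out
 * false, so an AND node can put its "most often false" operand on the left
 * (and an OR node its "most often true" operand) to profit from
 * short-circuit evaluation. */
enum op { OP_AND, OP_OR, OP_LEAF /* ... */ };

struct expr {
    enum op op;
    struct expr *left, *right;
    long evaluations;     /* how many times this subtree was evaluated    */
    long false_results;   /* how many of those evaluations returned false */
};

static double false_rate(const struct expr *e) {
    return e->evaluations ? (double)e->false_results / e->evaluations : 0.0;
}

/* Recursively reorder commutative operators based on the collected statistics. */
static void reorder(struct expr *e) {
    if (!e || e->op == OP_LEAF)
        return;
    reorder(e->left);
    reorder(e->right);
    bool swap = (e->op == OP_AND && false_rate(e->right) > false_rate(e->left))
             || (e->op == OP_OR  && false_rate(e->right) < false_rate(e->left));
    if (swap) {
        struct expr *tmp = e->left;
        e->left = e->right;
        e->right = tmp;
    }
}

int main(void) {
    struct expr val1 = {OP_LEAF, 0, 0, 100, 10};   /* false 10% of the time */
    struct expr val2 = {OP_LEAF, 0, 0, 100, 90};   /* false 90% of the time */
    struct expr conj = {OP_AND, &val1, &val2, 100, 91};
    reorder(&conj);
    /* conj.left is now val2, the operand most likely to short-circuit the AND. */
    return conj.left == &val2 ? 0 : 1;
}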
As for rearranging the nested if-then-else expressions, I don't think you need to resort to machine learning.
One (I think optimal) approach would be to use Huffman coding of your paths.
Take each path as a letter and encode those letters with Huffman coding; the result is a so-called Huffman tree. This tree will require the fewest evaluations when run on a (large enough) sample with the same distribution as the data the Huffman tree was built from.
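A minimal sketch of building such a Huffman tree from observed path frequencies (the paths and counts are invented; in the real engine each leaf would carry the branch to execute, and the interior nodes would become the nested if-then-else tests):
#include <stdio.h>
#include <stdlib.h>

struct hnode {
    long weight;                  /* observed frequency of this subtree */
    const char *path;             /* non-NULL for leaves                */
    struct hnode *left, *right;   /* NULL for leaves                    */
};

static struct hnode *make(long w, const char *p, struct hnode *l, struct hnode *r) {
    struct hnode *n = malloc(sizeof *n);
    n->weight = w; n->path = p; n->left = l; n->right = r;
    return n;
}

/* Repeatedly merge the two lightest nodes; a priority queue would be used
 * for anything bigger than a handful of paths. */
static struct hnode *huffman(struct hnode **nodes, int n) {
    while (n > 1) {
        int a = 0, b = 1;
        if (nodes[b]->weight < nodes[a]->weight) { int t = a; a = b; b = t; }
        for (int i = 2; i < n; i++) {
            if (nodes[i]->weight < nodes[a]->weight)      { b = a; a = i; }
            else if (nodes[i]->weight < nodes[b]->weight) { b = i; }
        }
        struct hnode *merged = make(nodes[a]->weight + nodes[b]->weight,
                                    NULL, nodes[a], nodes[b]);
        nodes[a] = merged;          /* merged node takes a's slot */
        nodes[b] = nodes[n - 1];    /* last node fills b's slot   */
        n--;
    }
    return nodes[0];
}

static void dump(const struct hnode *t, int depth) {
    if (t->path)
        printf("%*s%s (weight %ld, depth %d)\n", 2 * depth, "", t->path, t->weight, depth);
    else {
        dump(t->left, depth + 1);
        dump(t->right, depth + 1);
    }
}

int main(void) {
    struct hnode *paths[4] = {
        make(5,  "path A", NULL, NULL),
        make(50, "path B", NULL, NULL),
        make(20, "path C", NULL, NULL),
        make(25, "path D", NULL, NULL),
    };
    dump(huffman(paths, 4), 0);   /* the most frequent path ends up shallowest */
    return 0;
}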
If you have restrictions on the "evaluate some expression" expressions, or if they have different computational costs, etc., you probably need another approach.
And remember, as always when it comes to optimization you should be careful and only do things that really matter.
