how to return something when there is not match in flex(lexer) - flex-lexer

I am using flex(the lexer) to do some lexical analysis.
What I need is:
If none of the rules are matched, then a value is returned to indicate such thing has happened.
This is like the default syntax in the switch control flow structure in many programming language.
Is there a way to do such kind of stuff?
EDIT 1:
Reference from the official doc
If no match is found, then the default rule is executed:
the next character in the input is considered matched and copied to the standard output.
But how can I change the default rule?

In acacia-lex it is done in the following way:
Lexer has run method:
#Override
public void run() {
Token token;
while ((token = this.findNext()).isFound()) {
System.out.println("LEXER RES = " + token.toString());
}
}
When nothing is found, there is no default rule. Lexer method run just completed its job.
To continue lexing, at the end of tokens specification is needed token "DOT" -> ".". So if no other tokens match, DOT will match and Lexer run will continue its job.

The default rule only applies if no other rule matches. So you can simply insert your own rule which matches any single character as the last rule:
.|\n { /* Your default action. */ }
It must go at the end because (F)lex will give priority to earlier rules in the file which have the same match. You need to explicitly mention \n (unless you are certain that some other rule will match it) because in (F)lex, . matches any character except a newline.
If you are using Flex, and you don't want the default rule to ever be used, it is advisable to put
%option nodefault
into your prologue. That will suppress the default rule and produce a warning if there is some input which might not match any rule. (If you ignore the warning, a runtime error will be produced for such input.)

Related

Handling new lines in Flex/Bison

I am trying to make a C-like language using Flex/Bison. My problem is that I can't find a proper way to handle new lines. I have to ignore all new lines so I don't returnt them as a token to Bison because that would make the grammar rules so difficult to make but I am asked in some rules to make a mandatory change of line. For example:
Program "identifier" -> mandatory change of line
Function "identifier"("parameters") -> mandatory change of line
If I return \n as a token to flex then i have to put new lines in all of my grammar rules and that's surely not practical. I tried to make a variable work like a switch or something but it didn't quite work.
Any help or suggestion?
If the required newline is simply aesthetic -- that is, if it isn't required in order to avoid an ambiguity -- then the easiest way to enforce it is often just to track token locations (which is something that bison and flex can help you with) so that you can check in your reduction action that two consecutive tokens were not on the same line:
func_defn: "function" IDENT '(' opt_arg_list ')' body "end" {
if (#5.last_line == #6.first_line) {
yyerror("Body of function must start on a new line");
/* YYABORT; */ /* If you want to kill the parse at this point. */
}
// ...
}
Bison doesn't require any declarations or options in order to use locations; it will insert location support if it notices that you use #N in any action (which is how you refer to the location of a token). However, it is sometimes useful to insert a %locations declaration to force location support. Normally no other change is necessary to your grammar.
You do have to insert a little bit of code in your lexer in order to report the location values to the parser. Locations are communicated through a global variable called yylloc, whose value is of type YYLTYPE. By default, YYLTYPE is a struct with four int members: first_line, first_column, last_line, last_column. (See the Bison manual for more details.) These fields need to be set in your lexer for every token. Fortunately, flex allows you to define the macro YY_USER_ACTION, which contains code executed just before every action (even empty actions), which you can use to populate yylloc. Here's one which will work for many simple lexical analysers; you can put it in the code block at the top of your flex file.
/* Simple YY_USER_ACTION. Will not work if any action includes
* yyless(), yymore(), input() or REJECT.
*/
#define YY_USER_ACTION \
yylloc.first_line = yylloc.last_line; \
yylloc.first_column = yylloc.last_column; \
if (yylloc.last_line == yylineno) \
yylloc.last_column += yyleng; \
else { \
yylloc.last_line = yylineno; \
yylloc.last_column = yytext + yyleng - strrchr(yytext, '\n'); \
}
If the simple location check described above isn't sufficient for your use case, then you can do it through what's called "lexical feedback": a mechanism where the parser not only collects information from the lexical scanner, but also communicates back to the lexer when some kind of lexical change is needed.
Lexical feedback is usually discouraged because it can be fragile. It's always important to remember that the parser and the scanner are not necessarily synchronised. The parser often (but not always) needs to know the next token after the current production, so the lexical state when a production's action is being executed might be the lexical state after the next token, rather than the state after the last token in the production. But it might not; many parser generators, including Bison, try to execute an action immediately if they can figure out that the same action will be executed regardless of the next token. Unfortunately, that's not always predictable. In the case of Bison, for example, changing the parsing algorithm from the default LALR(1) to Canonical LR(1) or to GLR can also change a particular reduction action from immediate to deferred.
So if you're going to try to communicate with the scanner, you should try to do so in a way that will work whether or not the scanner has already been asked for the lookahead token. One way to do this is to put the code which communicates with the scanner in a Mid-Rule Action one token earlier than the token which you want to influence. [Note 1]
In order to make newlines "mostly optional", we need to tell the lexer when it should return a newline instead of ignoring it. One way to do this is to export a function which the lexer can call. We put the definition of that function into the generated parser and its declaration into the generated header file:
/* Anything in code requires and code provides sections is also
* copied into the generated header. So we can use it to declare
* exported functions.
*/
%code requires {
#include <stdbool.h>
bool need_nl(void);
}
%%
// ...
/* See [Note 2], below. */
/* Program directive. */
prog_decl: "program" { need_nl_flag = true; } IDENT '\n'
/* Function definition */
func_defn: "function" IDENT
'(' opt_arg_list { need_nl_flag = true; } ')' '\n'
body
"end"
// ...
%%
static bool need_nl_flag = false;
/* The scanner should call this function when it sees a newline.
* If the function returns true, the newline should be returned as a token.
* The function resets the value of the flag, so it must not be called for any
* other purpose. (This interface allows us to set the flag in a parser action
* without having to worry about clearing it later.)
*/
bool need_nl(void) {
bool temp = need_nl_flag;
need_nl_flag = false;
return temp;
}
// ...
Then we just need a small adjustment to the scanner in order to call that function. This uses Flex's set difference operator {-} to make a character class containing all whitespace other than a newline. Because we put that rule first, the second rule will only be used for whitespace including at least one newline character. Note that we only return one newline token for any sequence of blank lines.
([[:space:]]{-}[\n])+ { /* ignore whitespace */ }
[[:space:]]+ { if (need_nl()) return '\n'; }
Notes
That's not something you can do without thought, either: it might also be an error to change the scanner configuration too soon. In the action, you can check whether or not the lookahead token has already been read by looking at the value of yychar. If yychar is YYEMPTY, then no lookahead token has been read. If it is YYEOF, then an attempt was made to read a lookahead token but the end of input was encountered. Otherwise, the lookahead token has already been read.
It might seem tempting to use two actions, one before the token prior to the one you want to affect, and one just before that token. The first action could execute only if yychar is not YYEMPTY, indicating that the lookahead token has already been read and the scanner is about to read the token you want to change, while the second action will only execute if yychar at that point is YYEMPTY. But it's entirely possible that for a particular parse both of those conditions are true, or that neither is true.
Bison does have one configuration which you can use to make the lookahead decision completely predictable. If you set %define lr.default-reduction accepting, then Bison will always attempt to read a lookahead symbol, and you can be sure that placing the action one token early will work. Unless you are using the parser interactively, there is no real cost for enabling this option. But it won't work with old Bison versions or with other parser generators such as byacc.
For this grammar, we could have put the mid-rule actions just before the '\n' tokens rather than one token earlier (as long as the parser is never converted to a GLR or Canonical-LR parser). That's because in both rules, the MRA will go in between two tokens, and (presumably) there are no other rules which might apply up to the first of these tokens. Under those circumstances Bison can certainly know that the MRA can be reduced without examining the lookahead token to see if it is \n: either the next token is a newline and the reduction was required, or the next token is not a newline, which will be a syntax error. Since Bison does not guarantee to detect syntax errors before reduction actions are run, it can reduce the MRA action before knowing whether the parse will succeed.
There is a pattern called trailing context you cant try : https://people.cs.aau.dk/~marius/sw/flex/Flex-Regular-Expressions.html
"identifier"/[\n]
"function-identifier"/[\n]

(F) Lex, how do I match negation?

Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:
~('\'|'"'|'$'|NEWLINE)
Which means match anything that is not one of the rules inside the parenthesis. Now, I know in flex I can negate character rules (ex: [^ab] , but some of the rules I want to negate could be more complicated than a single character so I don't think I could use character rules for that. For example I may need to negate the sequence '"""' for multiline strings but I'm not sure what the way to do it in flex would be.
(TL;DR: Skip down to the bottom for a practical answer.)
The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.
The """ case, at least, is not too difficult.
First, let's be clear about what we are trying to match.
Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".
So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:
"""This string might contain " or ""."""
If we start after the initial """ and look for the longest string which doesn't contain """, we will find:
This string might contain " or "".""
whereas what we wanted was:
This string might contain " or "".
So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)
It's (relatively) easy to produce a state diagram for that:
(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)
Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:
([^"]|\"([^"]|\"[^"]))*
This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:
([^E]|E([^N]|N[^D]))* <--- DON'T USE THIS
but that regular expression will match the string
ENENDstuff which shouldn't have been matched
The real state diagram we're looking for is
and one way of writing that as a regular expression is:
([^E]|E(E|NE)*([^EN]|N[^ED]))
Again, I produced that by tracing all the ways to end up in state 0:
[^E] stays in state 0
E in state 1:
(E|NE)*: stay in state 1
[^EN]: back to state 0
N[^ED]:back to state 0 via state 2
This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).
A practical and scalable solution
Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize python triple-quoted strings:
%x TRIPLEQ
start \"\"\"
end \"\"\"
%%
{start} { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }
<TRIPLEQ>.|\n { /* Append the next token to yytext instead of
* replacing yytext with the next token
*/
yymore();
/* No return yet, flex continues */
}
<TRIPLEQ>{end} { /* We've found the end of the string, but
* we need to get rid of the terminating """
*/
yylval.str = malloc(yyleng - 2);
memcpy(yylval.str, yytext, yyleng - 3);
yylval.str[yyleng - 3] = 0;
return STRING;
}
This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.
This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions
start "<![CDATA["
end "]]>"
(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)

How to make lex/flex recognize tokens not separated by whitespace?

I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens. That is, tokens not separated by whitespace. E.g.: the string 39if is supposed to be recognized as the number 39 and the keyword if. Simultaneously, the lexer must also exit(1) when it encounters invalid input.
A simplified version of the code I have:
%{
#include <stdio.h>
%}
%option main warn debug
%%
if |
then |
else printf("keyword: %s\n", yytext);
[[:digit:]]+ printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]* printf("identifier: %s\n", yytext);
[[:space:]]+ // skip whitespace
[[:^space:]]+ { printf("ERROR: %s\n", yytext); exit(1); }
%%
When I run this (or my complete version), and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:
number: 39
keyword: if
(I.e. the same as if I entered 39 if as the input.)
Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, and flex will prefer it. That said, I have no idea how to resolve this situation. It seems unfeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.
UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".
You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the parsing is at is to use the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways -- look in the manual for the ways to specify your own i/o functions, for example, -- but the all around simplest IMHO is to use a start condition:
%x LEXING_ERROR
%%
// all your rules; the following *must* be at the end
. { BEGIN(LEXING_ERROR); yyless(1); }
<LEXING_ERROR>.+ { fprintf(stderr,
"Invalid character '%c' found at line %d,"
" just before '%s'\n",
*yytext, yylineno, yytext+1);
exit(1);
}
Note: Make sure that you've ignored whitespace in your rules. The pattern .+ matches any number but at least one non-newline character, or in other words up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(n) backs up the read pointer by n characters, so after the . rule matches, it will rescan that character producing (hopefully) a semi-reasonable error message. (It won't really be reasonable if your input is multibyte, or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex which gets more context, and maybe even limits the number of forward characters read. Lots of options here.)
Look up start conditions in the flex manual for more info about %x and BEGIN

Parsing optional semicolon at statement end

I was writing a parser to parse C-like grammars.
First, it could now parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
Where the new line is processed by the lexer that written by myself(the code are simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
the instruction :PASS here is equivalent to return nothing in LEX, which drop current matched text and skip to the next rule, just like what is usually done with whitespaces.
Because of this, I can't just simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
| '\n'
;
So I chose to change the lexer's state dynamically by the parser correspondingly.
Like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
And add a lexer rule that can match new line with the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
Which means when the lexer is under :STATEMENT_END state. it will first increase the line number as usual, and then set the state into initial one, and then pretend itself is a semicolon.
It's strange that it doesn't actually work with following code:
a = 1
b = 2
I debugged it and got it is not actually get a ';' as expect when scanned the newline after the number 1, and the state specified rule is not really executed.
And the code to set the new state is executed after it already scanned the new line and returned nothing, that means, these works is done as following order:
scan a, = and 1
scan newline and skip, so get the next value b
the inserted code({ state = :STATEMENT_END }) is executed
raising error -- unexpected b here
This is what I expect:
scan a, = and 1
found that it matches the rule expr, so reduce into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ; according the new state matching rule
continue to scan & parse the following line
After introspection I found that might caused as YACC uses LALR(1), this parser will read forward for one token first. When it scans to there, the state is not set yet, so it cannot get a correct token.
My question is: how to make it work as expected? I have no idea on this.
Thanks.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
-c;
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
(c);
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parenthesis, so you end up needing an additional state var to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or defer returning the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis seen (+1 for (, -1 for )), and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (ID or constant or ) or postfix-only operator), return the extra ;
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (its now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.

How to go backward to a certain position in Flex?

For example, my lexer recognizes a function call pattern:
//i.e. hello(...), foo(...), bar(...)
FUNCALL [a-zA-Z0-9]*[-_]*[a-zA-Z0-9]+[-_*][a-zA-Z0-9]*\(.\)
Now that flex recognizes the pattern, but it goes passed the last character in the pattern (i.e. after stored foo(...) inside yytext, the lexer will point to the next character after foo(...))
How can I reset the lexer pointer back to the beginning of the function pattern? i.e. after recognizing foo(..), I want to the lexer to point to the start of foo(..), so I can start tokenizing it.
I need to do this because for each regex pattern, only one token can be returned for each pattern. i.e. after matching foo(...), I can only return either foo or ( or ) with return statement but not all.
Flex has a trailing context pattern match (manual excerpt below) Read and understand the limitations before you use this.
`r/s'
an `r' but only if it is followed by an `s'. The text matched by
`s' is included when determining whether this rule is the longest
match, but is then returned to the input before the action is
executed. So the action only sees the text matched by `r'. This
type of pattern is called "trailing context". (There are some
combinations of `r/s' that flex cannot match correctly. *Note
Limitations::, regarding dangerous trailing context.)
Presumably something like this:
FUNCALL [a-zA-Z0-9]*[-_]*[a-zA-Z0-9]+[-_*][a-zA-Z0-9]*/\(.\)
You may find that it makes more sense to change your parser so you don't need to do this.

Resources