Handling new lines in Flex/Bison - token

I am trying to make a C-like language using Flex/Bison. My problem is that I can't find a proper way to handle new lines. I have to ignore all new lines so I don't returnt them as a token to Bison because that would make the grammar rules so difficult to make but I am asked in some rules to make a mandatory change of line. For example:
Program "identifier" -> mandatory change of line
Function "identifier"("parameters") -> mandatory change of line
If I return \n as a token to flex then i have to put new lines in all of my grammar rules and that's surely not practical. I tried to make a variable work like a switch or something but it didn't quite work.
Any help or suggestion?

If the required newline is simply aesthetic -- that is, if it isn't required in order to avoid an ambiguity -- then the easiest way to enforce it is often just to track token locations (which is something that bison and flex can help you with) so that you can check in your reduction action that two consecutive tokens were not on the same line:
func_defn: "function" IDENT '(' opt_arg_list ')' body "end" {
if (#5.last_line == #6.first_line) {
yyerror("Body of function must start on a new line");
/* YYABORT; */ /* If you want to kill the parse at this point. */
}
// ...
}
Bison doesn't require any declarations or options in order to use locations; it will insert location support if it notices that you use #N in any action (which is how you refer to the location of a token). However, it is sometimes useful to insert a %locations declaration to force location support. Normally no other change is necessary to your grammar.
You do have to insert a little bit of code in your lexer in order to report the location values to the parser. Locations are communicated through a global variable called yylloc, whose value is of type YYLTYPE. By default, YYLTYPE is a struct with four int members: first_line, first_column, last_line, last_column. (See the Bison manual for more details.) These fields need to be set in your lexer for every token. Fortunately, flex allows you to define the macro YY_USER_ACTION, which contains code executed just before every action (even empty actions), which you can use to populate yylloc. Here's one which will work for many simple lexical analysers; you can put it in the code block at the top of your flex file.
/* Simple YY_USER_ACTION. Will not work if any action includes
* yyless(), yymore(), input() or REJECT.
*/
#define YY_USER_ACTION \
yylloc.first_line = yylloc.last_line; \
yylloc.first_column = yylloc.last_column; \
if (yylloc.last_line == yylineno) \
yylloc.last_column += yyleng; \
else { \
yylloc.last_line = yylineno; \
yylloc.last_column = yytext + yyleng - strrchr(yytext, '\n'); \
}
If the simple location check described above isn't sufficient for your use case, then you can do it through what's called "lexical feedback": a mechanism where the parser not only collects information from the lexical scanner, but also communicates back to the lexer when some kind of lexical change is needed.
Lexical feedback is usually discouraged because it can be fragile. It's always important to remember that the parser and the scanner are not necessarily synchronised. The parser often (but not always) needs to know the next token after the current production, so the lexical state when a production's action is being executed might be the lexical state after the next token, rather than the state after the last token in the production. But it might not; many parser generators, including Bison, try to execute an action immediately if they can figure out that the same action will be executed regardless of the next token. Unfortunately, that's not always predictable. In the case of Bison, for example, changing the parsing algorithm from the default LALR(1) to Canonical LR(1) or to GLR can also change a particular reduction action from immediate to deferred.
So if you're going to try to communicate with the scanner, you should try to do so in a way that will work whether or not the scanner has already been asked for the lookahead token. One way to do this is to put the code which communicates with the scanner in a Mid-Rule Action one token earlier than the token which you want to influence. [Note 1]
In order to make newlines "mostly optional", we need to tell the lexer when it should return a newline instead of ignoring it. One way to do this is to export a function which the lexer can call. We put the definition of that function into the generated parser and its declaration into the generated header file:
/* Anything in code requires and code provides sections is also
* copied into the generated header. So we can use it to declare
* exported functions.
*/
%code requires {
#include <stdbool.h>
bool need_nl(void);
}
%%
// ...
/* See [Note 2], below. */
/* Program directive. */
prog_decl: "program" { need_nl_flag = true; } IDENT '\n'
/* Function definition */
func_defn: "function" IDENT
'(' opt_arg_list { need_nl_flag = true; } ')' '\n'
body
"end"
// ...
%%
static bool need_nl_flag = false;
/* The scanner should call this function when it sees a newline.
* If the function returns true, the newline should be returned as a token.
* The function resets the value of the flag, so it must not be called for any
* other purpose. (This interface allows us to set the flag in a parser action
* without having to worry about clearing it later.)
*/
bool need_nl(void) {
bool temp = need_nl_flag;
need_nl_flag = false;
return temp;
}
// ...
Then we just need a small adjustment to the scanner in order to call that function. This uses Flex's set difference operator {-} to make a character class containing all whitespace other than a newline. Because we put that rule first, the second rule will only be used for whitespace including at least one newline character. Note that we only return one newline token for any sequence of blank lines.
([[:space:]]{-}[\n])+ { /* ignore whitespace */ }
[[:space:]]+ { if (need_nl()) return '\n'; }
Notes
That's not something you can do without thought, either: it might also be an error to change the scanner configuration too soon. In the action, you can check whether or not the lookahead token has already been read by looking at the value of yychar. If yychar is YYEMPTY, then no lookahead token has been read. If it is YYEOF, then an attempt was made to read a lookahead token but the end of input was encountered. Otherwise, the lookahead token has already been read.
It might seem tempting to use two actions, one before the token prior to the one you want to affect, and one just before that token. The first action could execute only if yychar is not YYEMPTY, indicating that the lookahead token has already been read and the scanner is about to read the token you want to change, while the second action will only execute if yychar at that point is YYEMPTY. But it's entirely possible that for a particular parse both of those conditions are true, or that neither is true.
Bison does have one configuration which you can use to make the lookahead decision completely predictable. If you set %define lr.default-reduction accepting, then Bison will always attempt to read a lookahead symbol, and you can be sure that placing the action one token early will work. Unless you are using the parser interactively, there is no real cost for enabling this option. But it won't work with old Bison versions or with other parser generators such as byacc.
For this grammar, we could have put the mid-rule actions just before the '\n' tokens rather than one token earlier (as long as the parser is never converted to a GLR or Canonical-LR parser). That's because in both rules, the MRA will go in between two tokens, and (presumably) there are no other rules which might apply up to the first of these tokens. Under those circumstances Bison can certainly know that the MRA can be reduced without examining the lookahead token to see if it is \n: either the next token is a newline and the reduction was required, or the next token is not a newline, which will be a syntax error. Since Bison does not guarantee to detect syntax errors before reduction actions are run, it can reduce the MRA action before knowing whether the parse will succeed.

There is a pattern called trailing context you cant try : https://people.cs.aau.dk/~marius/sw/flex/Flex-Regular-Expressions.html
"identifier"/[\n]
"function-identifier"/[\n]

Related

Grammar conflict with same prefix

Here's my grammar to the for statements:
FOR x>0 {
//somthing
}
// or
FOR x = 0; x > 0; x++ {
//somthing
}
it has the same prefix FOR, and I'd want to print the for_begin label after InitExpression,
however the codes right after FOR will become useless because of confliction.
ForStmt
: FOR {
printf("for_begin_%d:\n", n);
} Expression {
printf("ifeq for_exit_%d\n", n);
} ForBlock
| FOR ForClause ForBlock
;
ForClause
: InitExpression ';' {
printf("for_begin_%d:\n", n);
} Expression ';' Expression { printf("ifeq for_exit_%d\n", n); }
;
I had tried to change it to something like:
ForStart
: FOR
| FOR InitExpression
;
or use a flag to mention where to print the for_begin label,
but also fail to resolve the conflict.
How to make it not conflict?
How can the parser know which alternative of the FOR statement it sees?
While it's possible that an InitExpression has identifiable form, such as an assignment statement, which could not be used in a conditional expression. That strikes me as too restrictive for practical purposes -- there are many things you might do to initialise a loop other than a direct assignment -- but leaving that aside, it means that the earliest the InitExpression can be definitively identified is when the assignment operator is seen. If lvalues in your language can only be simple identifiers, that would make it the second lookahead token after the FOR, but in most useful language lvalues can be much more complicated than just simple identifiers, and so it's likely that the InitExpression cannot be definitively identified with finite lookahead.
But it's more likely that the only significant difference between the two forms is that the expression in the first form is followed by a block (which I suppose cannot start with a semicolon) and the first expression in the second form is followed by a semicolon. So the parser knows what it is parsing at the end of the first expression and no earlier.
Normally, that would not cause a problem. Were it not for the MidRule Action which inserts a label, the parser does not have to make a reduction decision until it reaches the end of the first expression, at which point it needs to decide whether to reduce the first expression as an InitExpression or an Expression. But at that point, the lookahead token as either a semicolon or the first token of a block, so the lookahead token can guide the decision.
But the Mid-Rule Action makes that impossible. The Mid-Rule Action must either be reduced or not before shifting the token which immediately follows the FOR token, and -- as your examples show -- the lookahead token could be the same (i) in both cases.
Fundamentally, the issue is that you want to build a one-pass compiler rather than just parsing the input into an AST and then walking the AST to generate assembler code (possibly after doing some other traverses over the AST in order to perform other analyses and allow for code optimisation). The one-pass code generator depends on Mid-Rule Actions, and Mid-Rule Actions in turn can easily generate unresolvable parsing conflicts. This issue is so notorious that there is a chapter in the bison manual dedicated to it, which is well worth reading.
So there is no good solution. But in this case, there is a simple solution, because the action you want to take is just to insert a label, and inserting a label which happens never to be used is not in any way going to affect the code which will ultimately be executed. So you might as well insert a label immediately after the FOR statement, whether you will need it or not, and then insert another label after the InitExpression if it turns out that there was such a thing. You don't need to actually know which label to use until you reach the end of the conditional expression, which is much later.
As explained in the Bison manual chapter I already linked to, this cannot be done using Mid-Rule Actions, because Bison doesn't attempt to compare Mid-Rule Actions with each other. Even if two actions happen to be identical, Bison will still need to decide which one to execute, thereby generating a conflict. So instead of using an MRA, you need to house the action in a marker non-terminal -- a non-terminal with an empty right-hand side, used only to trigger an action.
That would make the grammar look something like this:
ForLabel
: %empty { $$ = n; printf("for_begin_%d:\n", n++); }
ForStmt
: FOR
ForLabel[label]
Expression { printf("ifeq for_exit_%d\n", label); }
ForBlock { printf("jmp for_begin_%d\n", label);
printf("for_exit_%d:\n", label); }
| FOR
ForLabel
InitExpress ';'
ForLabel[label]
Expression ';'
Expression { printf("ifeq for_exit_%d\n", label); }
ForBlock { printf("jmp for_begin_%d\n", label);
printf("for_exit_%d:\n", label); }
;
([label] gives a name to a semantic value, which avoids having to use a rather mysterious and possibly incorrect $2 or $6. See Named References in the handy Bison manual.)

Is it legitimate for a tokenizer to have a stack?

I have designed a new language for that I want to write a reasonable lexer and parser.
For the sake of brevity, I have reduced this language to a minimum so that my questions are still open.
The language has implicit and explicit strings, arrays and objects. An implicit string is just a sequence of characters that does not contain <, {, [ or ]. An explicit string looks like <salt<text>salt> where salt is an arbitrary identifier (i.e. [a-zA-Z][a-zA-Z0-9]*) and text is an arbitrary sequence of characters that does not contain the salt.
An array starts with [, followed by objects and/or strings and ends with ].
All characters within an array that don't belong to an array, object or explicit string do belong to an implicit string and the length of each implicit string is maximal and greater than 0.
An object starts with { and ends with } and consists of properties. A property starts with an identifier, followed by a colon, then optional whitespaces and then either an explicit string, array or object.
So the string [ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ] represents an array with 6 items: " name:test ", "<html>[]</html>", " ", { name: "test" }, "bla" and " " (the object is notated in json).
As one can see, this language is not context free due to the explicit string (that I don't want to miss). However, the syntax tree is nonambiguous.
So my question is: Is a property a token that may be returned by a tokenizer? Or should the tokenizer return T_identifier, T_colon when he reads an object property?
The real language allows even prefixes in the identifier of a property, e.g. ns/name:<a<test>a> where ns is the prefix for a namespace.
Should the tokenizer return T_property_prefix("ns"), T_property_prefix_separator, T_property_name("name"), T_property_colon or just T_property("ns/name") or even T_identifier("ns"), T_slash, T_identifier("name"), T_colon?
If the tokenizer should recognize properties (which would be useful for syntax highlighters), he must have a stack, because name: is not a property if it is in an array. To decide whether bla: in [{foo:{bar:[test:baz]} bla:{}}] is a property or just an implicit string, the tokenizer must track when he enters and leave an object or array.
Thus, the tokenizer would not be a finite state machine any more.
Or does it make sense to have two tokenizers - the first, which separates whitespaces from alpha-numerical character sequences and special characters like : or [, the second, which uses the first to build more semantical tokens? The parser could then operate on top of the second tokenizer.
Anyways, the tokenizer must have an infinite lookahead to see when an explicit string ends. Or should the detection of the end of an explicit string happen inside the parser?
Or should I use a parser generator for my undertaking? Since my language is not context free, I don't think there is an appropriate parser generator.
Thanks in advance for your answers!
flex can be requested to provide a context stack, and many flex scanners use this feature. So, while it may not fit with a purist view of how a scanner scans, it is a perfectly acceptable and supported feature. See this chapter of the flex manual for details on how to have different lexical contexts (called "start conditions"); at the very end is a brief description of the context stack. (Don't miss the sentence which notes that you need %option stack to enable the stack.) [See Note 1]
Slightly trickier is the requirement to match strings with variable end markers. flex does not have any variable match feature, but it does allow you to read one character at a time from the scanner input, using the function input(). That's sufficient for your language (at least as described).
Here's a rough outline of a possible scanner:
%option stack
%x SC_OBJECT
%%
/* initial/array context only */
[^][{<]+ yylval = strdup(yytext); return STRING;
/* object context only */
<SC_OBJECT>{
[}] yy_pop_state(); return '}';
[[:alpha:]][[:alnum:]]* yylval = strdup(yytext); return ID;
[:/] return yytext[0];
[[:space:]]+ /* Ignore whitespace */
}
/* either context */
<*>{
[][] return yytext[0]; /* char class with [] */
[{] yy_push_state(SC_OBJECT); return '{';
"<"[[:alpha:]][[:alnum:]]*"<" {
/* We need to save a copy of the salt because yytext could
* be invalidated by input().
*/
char* salt = strdup(yytext);
char* saltend = salt + yyleng;
char* match = salt;
/* The string accumulator code is *not* intended
* to be a model for how to write string accumulators.
*/
yylval = NULL;
size_t length = 0;
/* change salt to what we're looking for */
*salt = *(saltend - 1) = '>';
while (match != saltend) {
int ch = input();
if (ch == EOF) {
yyerror("Unexpected EOF");
/* free the temps and do something */
}
if (ch == *match) ++match;
else if (ch == '>') match = salt + 1;
else match = salt;
/* Don't do this in real code */
yylval = realloc(yylval, ++length);
yylval[length - 1] = ch;
}
/* Get rid of the terminator */
yylval[length - yyleng] = 0;
free(salt);
return STRING;
}
. yyerror("Invalid character in object");
}
I didn't test that thoroughly, but here is what it looks like with your example input:
[ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ]
Token: [
Token: STRING: -- name:test --
Token: STRING: --<html>[]</html>--
Token: STRING: -- --
Token: {
Token: ID: --name--
Token: :
Token: STRING: --test--
Token: }
Token: STRING: --bla--
Token: STRING: -- --
Token: ]
Notes
In your case, unless you wanted to avoid having a parser, you don't actually need a stack since the only thing that needs to be pushed onto the stack is an object context, and a stack with only one possible value can be replaced with a counter.
Consequently, you could just remove the %option stack and define a counter at the top of the scan. Instead of pushing the start condition, you increment the counter and set the start condition; instead of popping, you decrement the counter and reset the start condition if it drops to 0.
%%
/* Indented text before the first rule is inserted at the top of yylex */
int object_count = 0;
<*>[{] ++object_count; BEGIN(SC_OBJECT); return '{';
<SC_OBJECT[}] if (!--object_count) BEGIN(INITIAL); return '}'
Reading the input one character at a time is not the most efficient. Since in your case, a string terminate must start with >, it would probably be better to define a separate "explicit string" context, in which you recognized [^>]+ and [>]. The second of these would do the character-at-a-time match, as with the above code, but would terminate instead of looping if it found a non-matching character other than >. However, the simple code presented may turn out to be fast enough, and anyway it was just intended to be good enough to do a test run.
I think the traditional way to parse your language would be to have the tokenizer return T_identifier("ns"), T_slash, T_identifier("name"), T_colon for ns/name:
Anyway, I can see three reasonable ways you could implement support for your language:
Use lex/flex and yacc/bison. The tokenizers generated by lex/flex do not have stack so you should be using T_identifier and not T_context_specific_type. I didn't try the approach so I can't give a definite comment on whether your language could be parsed by lex/flex and yacc/bison. So, my comment is try it to see if it works. You may find information about the lexer hack useful: http://en.wikipedia.org/wiki/The_lexer_hack
Implement a hand-built recursive descent parser. Note that this can be easily built without separate lexer/parser stages. So, if the lexemes depend on context it is easy to handle when using this approach.
Implement your own parser generator which turns lexemes on and off based on the context of the parser. So, the lexer and the parser would be integrated together using this approach.
I once worked for a major network security vendor where deep packet inspection was performed by using approach (3), i.e. we had a custom parser generator. The reason for this is that approach (1) doesn't work for two reasons: firstly, data can't be fed to lex/flex and yacc/bison incrementally, and secondly, HTTP can't be parsed by using lex/flex and yacc/bison because the meaning of the string "HTTP" depends on its location, i.e. it could be a header value or the protocol specifier. The approach (2) didn't work because data can't be fed incrementally to recursive descent parsers.
I should add that if you want to have meaningful error messages, a recursive descent parser approach is heavily recommended. My understanding is that the current version of gcc uses a hand-built recursive descent parser.

Jison Lexer - Detect Certain Keyword as an Identifier at Certain Times

"end" { return 'END'; }
...
0[xX][0-9a-fA-F]+ { return 'NUMBER'; }
[A-Za-z_$][A-Za-z0-9_$]* { return 'IDENT'; }
...
Call
: IDENT ArgumentList
{{ $$ = ['CallExpr', $1, $2]; }}
| IDENT
{{ $$ = ['CallExprNoArgs', $1]; }}
;
CallArray
: CallElement
{{ $$ = ['CallArray', $1]; }}
;
CallElement
: CallElement "." Call
{{ $$ = ['CallElement', $1, $3]; }}
| Call
;
Hello! So, in my grammar I want "res.end();" to not detect end as a keyword, but as an ident. I've been thinking for a while about this one but couldn't solve it. Does anyone have any ideas? Thank you!
edit: It's a C-like programming language.
There's not quite enough information in the question to justify the assumptions I'm making here, so this answer may be inexact.
Let's suppose we have a somewhat Lua-like language in which a.b is syntactic sugar for a["b"]. Furthermore, since the . must be followed by a lexical identifier -- in other words, it is never followed by a syntactic keyword -- we'd like to inhibit keyword recognition in this context.
That's a pretty simple rule. It's simple enough that the lexer could implement it without any semantic information at all; all that it says is that the token which follows a . must be an identifier. In this context, keywords should be treated as identifiers, and anything else other than an identifier is an error.
We can do this with start conditions. Specifically, we define a start condition which is only used after a . token:
%x selector
%%
/* White space and comment rules need to explicitly include
* the selector condition
*/
<INITIAL,selector>\s+ ;
/* Other rules, including keywords, are unmodified */
"end" return "END";
/* The dot rule triggers a new start condition */
"." this.begin("selector"); return ".";
/* Outside of the start condition, identifiers don't change state. */
[A-Za-z_]\w* yylval = yytext; return "ID";
/* Only identifiers are valid in this start condition, and if found
* the start condition is changed back. Anything else is an error.
*/
<selector>[A-Za-z_]\w* yylval = yytext; this.popState(); return "ID";
<selector>. parse_error("Expecting identifier");
Modify your parser, so it always knows what it is expecting to read next (that will be some set of tokens, you can compute this using the notion of First(x) for x being any nonterminal).
When lexing, have the lexer ask the parser what set of tokens it expects next.
Your keywork reconizer for 'end' asks the parser, and it either ways "expecting 'end'" at which pointer the lexer simply hands on the 'end' lexeme, or it says "expecting ID" at which point it hands the parser an ID with name text "end".
This may or may not be convenient to get your parser to do. But you need something like this.
We use a GLR parser; our parser accepts multiple tokens in the same place. Our solution is to generate both the 'end' keyword and and the identifier with text "end" and shove them both into the GLR parser. It can handle local ambiguity; the multiple parses caused by this proceed until the parser with the wrong assumption encounters a syntax error, and then it just vanishes, by fiat. The last standing parser is the one with the right set of assumptions. This scheme is somewhat like the first one, just that we hand the parser the choices and it decides rather than making the lexer decide.
You might be able to send your parser a "two-interpretation" lexeme, e.g., a keyword-in-context lexeme, which in essence claims it it both a keyword and/or an identifier. With a single token lookahead internally, the parser can likely decide easily and restamp the lexeme. Not as general as the GLR solution, but probably works in a lot of cases.

Parsing optional semicolon at statement end

I was writing a parser to parse C-like grammars.
First, it could now parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
Where the new line is processed by the lexer that written by myself(the code are simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
the instruction :PASS here is equivalent to return nothing in LEX, which drop current matched text and skip to the next rule, just like what is usually done with whitespaces.
Because of this, I can't just simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
| '\n'
;
So I chose to change the lexer's state dynamically by the parser correspondingly.
Like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
And add a lexer rule that can match new line with the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
Which means when the lexer is under :STATEMENT_END state. it will first increase the line number as usual, and then set the state into initial one, and then pretend itself is a semicolon.
It's strange that it doesn't actually work with following code:
a = 1
b = 2
I debugged it and got it is not actually get a ';' as expect when scanned the newline after the number 1, and the state specified rule is not really executed.
And the code to set the new state is executed after it already scanned the new line and returned nothing, that means, these works is done as following order:
scan a, = and 1
scan newline and skip, so get the next value b
the inserted code({ state = :STATEMENT_END }) is executed
raising error -- unexpected b here
This is what I expect:
scan a, = and 1
found that it matches the rule expr, so reduce into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ; according the new state matching rule
continue to scan & parse the following line
After introspection I found that might caused as YACC uses LALR(1), this parser will read forward for one token first. When it scans to there, the state is not set yet, so it cannot get a correct token.
My question is: how to make it work as expected? I have no idea on this.
Thanks.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
-c;
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
(c);
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parenthesis, so you end up needing an additional state var to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or defer returning the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis seen (+1 for (, -1 for )), and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (ID or constant or ) or postfix-only operator), return the extra ;
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (its now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.

how to return something when there is not match in flex(lexer)

I am using flex(the lexer) to do some lexical analysis.
What I need is:
If none of the rules are matched, then a value is returned to indicate such thing has happened.
This is like the default syntax in the switch control flow structure in many programming language.
Is there a way to do such kind of stuff?
EDIT 1:
Reference from the official doc
If no match is found, then the default rule is executed:
the next character in the input is considered matched and copied to the standard output.
But how can I change the default rule?
In acacia-lex it is done in the following way:
Lexer has run method:
#Override
public void run() {
Token token;
while ((token = this.findNext()).isFound()) {
System.out.println("LEXER RES = " + token.toString());
}
}
When nothing is found, there is no default rule. Lexer method run just completed its job.
To continue lexing, at the end of tokens specification is needed token "DOT" -> ".". So if no other tokens match, DOT will match and Lexer run will continue its job.
The default rule only applies if no other rule matches. So you can simply insert your own rule which matches any single character as the last rule:
.|\n { /* Your default action. */ }
It must go at the end because (F)lex will give priority to earlier rules in the file which have the same match. You need to explicitly mention \n (unless you are certain that some other rule will match it) because in (F)lex, . matches any character except a newline.
If you are using Flex, and you don't want the default rule to ever be used, it is advisable to put
%option nodefault
into your prologue. That will suppress the default rule and produce a warning if there is some input which might not match any rule. (If you ignore the warning, a runtime error will be produced for such input.)

Resources