GNU Flex: How does yyunput work?

I have a problem understanding flex's yyunput behavior. I want to put back some characters.
For example, my scanner found CALL{space}{cc}:
cc N?Z|N?C|P[OE]?|M
%%
CALL{blank}{cc} {BEGIN CON; return yy::ez80asm_parser::make_CALL(loc);}
CALL{mmode}{blank}{cc} {BEGIN CON; return yy::ez80asm_parser::make_CALL(loc);}
CALL {BEGIN ARG; return yy::ez80asm_parser::make_CALL(loc);}
and I want to give back the {cc} so it will be scanned next time.
What do the two arguments of yyunput have to be? I couldn't find any helpful information about that function.
Any hints are welcome.
Jürgen

You can't "give back the {cc}" because the regular expression doesn't have pieces. (Flex does not do captures, either, so it wouldn't help to put parentheses around it.)
If you just want to rescan part of a token, it is much better to use yyless than unput, since yyless mostly just changes a pointer. With a single call to yyless you can return as many characters as you like, so you only need to know how many characters to return. (More precisely, you tell it how many characters you want to keep in yytext; the remainder are returned and yytext is truncated accordingly.)
For reference, unput is a macro whose single argument is a single character which will be pushed onto the beginning of the unconsumed input, overwriting yytext as it goes. (In the C++ API, it calls the internal member function ::yyunput, supplying it an additional necessary argument. Don't call this function directly.)
If you need to push several characters onto the input, you need to unput them one at a time, starting with the last one. Since unput destroys the value of yytext, you need to make sure that you've already copied it if you need it before calling unput.
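For instance, here is a minimal sketch of an action body that pushes the last n characters of the current token back onto the input with unput (the buffer size and the value of n are illustrative; in flex, yyless(yyleng - n) would be the simpler way to achieve the same effect):
{
    int n = 2;                  /* number of characters to push back */
    char saved[8];              /* assumed large enough for this example */
    memcpy(saved, yytext + yyleng - n, n);  /* copy before unput clobbers yytext */
    for (int i = n - 1; i >= 0; --i)
        unput(saved[i]);        /* push back in reverse order */
}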
In your case, I think neither of these is appropriate. What you probably want to do is to not include the {cc} pattern in the match in the first place, which you can do with flex's trailing context operator, /. (That assumes that you don't need to include the characters matched by {cc} in the semantic value you will be returning; in the example provided, yytext does not appear to be part of the semantic value, so the assumption should be safe.) To do so, you might write something like:
CALL{mmode}?{blank}/{cc} {BEGIN CON; return yy::ez80asm_parser::make_CALL(loc);}
CALL {BEGIN ARG; return yy::ez80asm_parser::make_CALL(loc);}
(Note: I combined your first two patterns into a single one since they seem to have the same action, but if you actually need the characters matched by {mmode} you might not want to do that.)
If that doesn't work, for whatever reason, use yyless. You'll need to know how many characters you want to return to the input, so I imagine you would end up with something like:
CALL{mmode}?{blank}{cc}  { BEGIN CON;
                           int to_keep = yyleng - 1;
                           switch (yytext[to_keep]) {
                             case 'C': case 'Z':
                               if (yytext[to_keep - 1] == 'N') --to_keep;
                               break;
                             case 'E': case 'O': --to_keep; break;
                             case 'P': case 'M': break;
                             default: assert(false); /* internal error */
                           }
                           yyless(to_keep);
                           return yy::ez80asm_parser::make_CALL(loc);
                         }
For details on the trailing context operator, see the Flex manual section on patterns (search for the word "trailing"; there is an important note towards the end as well) as well as the first paragraph of the following chapter on matching. yyless and unput are both documented in the chapter on actions, which includes examples of their usage.

Related

Grammar conflict with same prefix

Here's my grammar for the for statements:
FOR x>0 {
    // something
}
// or
FOR x = 0; x > 0; x++ {
    // something
}
Both forms have the same prefix FOR, and I want to print the for_begin label after the InitExpression; however, the code right after FOR becomes useless because of the conflict.
ForStmt
    : FOR {
          printf("for_begin_%d:\n", n);
      } Expression {
          printf("ifeq for_exit_%d\n", n);
      } ForBlock
    | FOR ForClause ForBlock
    ;
ForClause
    : InitExpression ';' {
          printf("for_begin_%d:\n", n);
      } Expression ';' Expression { printf("ifeq for_exit_%d\n", n); }
    ;
I tried to change it to something like:
ForStart
    : FOR
    | FOR InitExpression
    ;
or to use a flag recording where to print the for_begin label, but that also failed to resolve the conflict.
How can I make it not conflict?
How can the parser know which alternative of the FOR statement it sees?
It's possible that an InitExpression has an identifiable form, such as an assignment statement, which could not be used in a conditional expression. That strikes me as too restrictive for practical purposes -- there are many things you might do to initialise a loop other than a direct assignment -- but leaving that aside, it means that the earliest the InitExpression can be definitively identified is when the assignment operator is seen. If lvalues in your language can only be simple identifiers, that would make it the second lookahead token after the FOR, but in most useful languages lvalues can be much more complicated than simple identifiers, and so it's likely that the InitExpression cannot be definitively identified with finite lookahead.
But it's more likely that the only significant difference between the two forms is that the expression in the first form is followed by a block (which I suppose cannot start with a semicolon) and the first expression in the second form is followed by a semicolon. So the parser knows what it is parsing at the end of the first expression and no earlier.
Normally, that would not cause a problem. Were it not for the Mid-Rule Action which inserts a label, the parser would not have to make a reduction decision until it reached the end of the first expression, at which point it needs to decide whether to reduce the first expression as an InitExpression or an Expression. But at that point, the lookahead token is either a semicolon or the first token of a block, so the lookahead token can guide the decision.
But the Mid-Rule Action makes that impossible. The parser must decide whether or not to execute the Mid-Rule Action before shifting the token which immediately follows the FOR token, and -- as your examples show -- the lookahead token could be the same in both cases (in your examples, the identifier x).
Fundamentally, the issue is that you want to build a one-pass compiler rather than just parsing the input into an AST and then walking the AST to generate assembler code (possibly after doing some other traversals of the AST in order to perform other analyses and allow for code optimisation). The one-pass code generator depends on Mid-Rule Actions, and Mid-Rule Actions in turn can easily generate unresolvable parsing conflicts. This issue is so notorious that there is a chapter in the Bison manual dedicated to it, which is well worth reading.
So there is no general solution. But in this case, there is a simple one, because the action you want to take is just to insert a label, and inserting a label which happens never to be used does not in any way affect the code which will ultimately be executed. So you might as well insert a label immediately after the FOR token, whether you will need it or not, and then insert another label after the InitExpression if it turns out that there was one. You don't need to know which label to use until you reach the end of the conditional expression, which is much later.
As explained in the Bison manual chapter I already linked to, this cannot be done using Mid-Rule Actions, because Bison doesn't attempt to compare Mid-Rule Actions with each other. Even if two actions happen to be identical, Bison will still need to decide which one to execute, thereby generating a conflict. So instead of using an MRA, you need to house the action in a marker non-terminal -- a non-terminal with an empty right-hand side, used only to trigger an action.
That would make the grammar look something like this:
ForLabel
    : %empty  { $$ = n; printf("for_begin_%d:\n", n++); }
    ;

ForStmt
    : FOR
      ForLabel[label]
      Expression          { printf("ifeq for_exit_%d\n", $label); }
      ForBlock            { printf("jmp for_begin_%d\n", $label);
                            printf("for_exit_%d:\n", $label); }
    | FOR
      ForLabel
      InitExpression ';'
      ForLabel[label]
      Expression ';'
      Expression          { printf("ifeq for_exit_%d\n", $label); }
      ForBlock            { printf("jmp for_begin_%d\n", $label);
                            printf("for_exit_%d:\n", $label); }
    ;
([label] gives a name to a semantic value, which avoids having to use a rather mysterious and possibly incorrect $2 or $6. See Named References in the handy Bison manual.)
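(One more detail the sketch glosses over: ForLabel needs a declared semantic-value type, and the counter n must be defined in the prologue. Assuming a classic %union-style grammar -- the member name here is illustrative -- that might look like:)
%union { int ival; }
%type <ival> ForLabel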

Why do we need both a lookahead symbol and a read-ahead symbol in a compiler?

I was reading some common concepts regarding parsing in compilers and came across the lookahead symbol and the read-ahead symbol. I searched and read about them, but I am stuck on why we need both of them. I would be grateful for any suggestions.
Lookahead symbol: when the node being considered in the parse tree is for a terminal, and the terminal matches the lookahead symbol, then we advance in both the parse and the input.
Read-ahead symbol: the lexical analyzer may need to read some characters before it can decide on the token to be returned.
One of these is about parsing and refers to the next token to be produced by the lexical scanner. The other one, which is less formal, is about lexical analysis and refers to the next character in the input stream. It should be clear which is which.
Note that while most parsers only require a single lookahead token, it is not uncommon for lexical analysis to have to backtrack, which is equivalent to examining several unconsumed input characters.
I hope I got your question right.
Consider C.
It has several punctuators that begin the same way:
+, ++, +=
-, --, -=, ->
<, <=, <<, <<=
...
In order to figure out which one it is when you see the first + or - or <, you need to look ahead one character in the input (and then maybe one more for <<=).
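As a minimal hand-rolled sketch (the token names and the function are invented for illustration), this is what that one character of read-ahead looks like in C:
#include <stdio.h>

enum tok { TOK_PLUS, TOK_INC, TOK_PLUS_ASSIGN };

/* Called after a '+' has been consumed. One extra character is read;
 * if it does not extend the token, it is pushed back with ungetc() --
 * that pushed-back character is the lexer's read-ahead. */
enum tok scan_plus(FILE *in) {
    int c = getc(in);
    if (c == '+') return TOK_INC;          /* ++ */
    if (c == '=') return TOK_PLUS_ASSIGN;  /* += */
    ungetc(c, in);    /* not part of this token: unread it */
    return TOK_PLUS;  /* plain + */
}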
A similar thing can happen at a higher level:
{
    ident1 ident2;
    ident3;
    ident4:;
}
Here ident1, ident3 and ident4 can each begin a declaration, an expression or a label, and you can't tell which one immediately. You can consult your existing declarations to see whether ident1 or ident3 is already known (as a type or a variable/function/enumeration), but it's still ambiguous, because a colon may follow, and if it does, it's a label: it's permitted to use the same identifier for both a label and a type/variable/function/enumeration (those two name spaces do not intersect), e.g.:
{
    typedef int ident1;
    ident1 ident2;  // same as int ident2
    int ident3 = 0;
    ident3;         // unused expression of value 0
    ident1:;        // unused label
    ident2:;        // unused label
    ident3:;        // unused label
}
So, you may very well need to look ahead a character or a token (or "unread" one) to deal with situations like these.

What's the purpose of this gen_server.erl code?

unregister_name({local,Name}) ->
    _ = (catch unregister(Name));
unregister_name({global,Name}) ->
    _ = global:unregister_name(Name);
unregister_name({via, Mod, Name}) ->
    _ = Mod:unregister_name(Name);
unregister_name(Pid) when is_pid(Pid) ->
    Pid.
This is from gen_server.erl. If _ always matches and the match always evaluates to the right hand side expression, what are the _ = expression() lines doing here?
Typically _ = ... matches are used to quiet dialyzer warnings about unmatched function return values when its -Wunmatched_returns option is used. As the documentation explains:
-Wunmatched_returns
    Include warnings for function calls which ignore a structured return value or do not match against one of many possible return value(s).
By explicitly matching the return value against the _ "don't care" variable, you can use this useful dialyzer option without having to see warnings for return values you don't care about.
In Erlang, the last expression of a function is its return value, so someone might be tempted to check what global:unregister_name/1 or Mod:unregister_name(Name) returns and to pattern match on that.
The _ = expression() doesn't do anything in particular, but it hints that the return value should be ignored (for example, because it is not documented and might be subject to change). However, in the last clause, Pid is returned explicitly. This means that you can pattern match like this:
case unregister_name(Something) of
    Pid when is_pid(Pid) -> foo();
    _ -> bar()
end.
To sum up: those lines aren't doing anything there, but when someone else is reading the source code, they show the original programmer's intent.
Unfortunately, this particular function is not exported and is never used in a pattern match in the original module, so I don't have an example to back this up :)
And I'll note that I've since come across this:
The Power of Ten – Rules for Developing Safety Critical Code
Gerard J. Holzmann
NASA/JPL Laboratory for Reliable Software, Pasadena, CA 91109
[...]
Rule: The return value of non-void functions must be checked by each calling function, and the validity of parameters must be checked inside each function.

Rationale: This is possibly the most frequently violated rule, and therefore somewhat more suspect as a general rule. In its strictest form, this rule means that even the return value of printf statements and file close statements must be checked. One can make a case, though, that if the response to an error would rightfully be no different than the response to success, there is little point in explicitly checking a return value. This is often the case with calls to printf and close. In cases like these, it can be acceptable to explicitly cast the function return value to (void) – thereby indicating that the programmer explicitly and not accidentally decides to ignore a return value. In more dubious cases, a comment should be present to explain why a return value is irrelevant. In most cases, though, the return value of a function should not be ignored, especially if error return values must be propagated up the function call chain. Standard libraries famously violate this rule with potentially grave consequences. See, for instance, what happens if you accidentally execute strlen(0), or strcat(s1, s2, -1) with the standard C string library – it is not pretty. By keeping the general rule, we make sure that exceptions must be justified, with mechanical checkers flagging violations. Often, it will be easier to comply with the rule than to explain why noncompliance might be acceptable.

(F)lex, how do I match negation?

Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:
~('\'|'"'|'$'|NEWLINE)
Which means: match anything that is not one of the rules inside the parentheses. Now, I know that in flex I can negate character classes (e.g. [^ab]), but some of the rules I want to negate could be more complicated than a single character, so I don't think I can use character classes for that. For example, I may need to negate the sequence '"""' for multiline strings, but I'm not sure how to do that in flex.
(TL;DR: Skip down to the bottom for a practical answer.)
The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.
The """ case, at least, is not too difficult.
First, let's be clear about what we are trying to match.
Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".
So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:
"""This string might contain " or ""."""
If we start after the initial """ and look for the longest string which doesn't contain """, we will find:
This string might contain " or "".""
whereas what we wanted was:
This string might contain " or "".
So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)
It's (relatively) easy to produce a state diagram for that: three states, where state 0 means no partial terminator has been seen, and states 1 and 2 mean that one or two consecutive quotes have just been seen. Only state 0 is accepting.
(Note that the only difference between this and the state diagram for "any string which does not contain """" is that in that diagram all three states would be accepting, while here states 1 and 2 are not.)
Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:
([^"]|\"([^"]|\"[^"]))*
This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:
([^E]|E([^N]|N[^D]))* <--- DON'T USE THIS
but that regular expression will match the string
ENENDstuff which shouldn't have been matched
The state diagram we are really looking for is the analogous three-state machine keyed to E, N and D, and one way of writing it as a regular expression is:
([^E]|E(E|NE)*([^EN]|N[^ED]))*
Again, I produced that by tracing all the ways to end up in state 0:
[^E] stays in state 0
E goes to state 1:
  (E|NE)*: stay in state 1
  [^EN]: back to state 0
  N[^ED]: back to state 0 via state 2
This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).
A practical and scalable solution
Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize python triple-quoted strings:
%x TRIPLEQ
start   \"\"\"
end     \"\"\"
%%

{start}         { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }

<TRIPLEQ>.|\n   { /* Append the next token to yytext instead of
                   * replacing yytext with the next token
                   */
                  yymore();
                  /* No return yet, flex continues */
                }

<TRIPLEQ>{end}  { /* We've found the end of the string, but
                   * we need to get rid of the terminating """
                   */
                  yylval.str = malloc(yyleng - 2);
                  memcpy(yylval.str, yytext, yyleng - 3);
                  yylval.str[yyleng - 3] = 0;
                  BEGIN( INITIAL );  /* back to the normal start condition */
                  return STRING;
                }
This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.
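With that optimization applied, the middle rule would read (same action, just longer matches):
<TRIPLEQ>[^"]+|\"|\n   { yymore(); /* still no return; flex continues */ }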
This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions
start "<![CDATA["
end "]]>"
(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)

Lua source code manipulation: get innermost function() location for a given line

I've got a file with syntactically correct Lua 5.1 source code.
I've got a position (line and character offset) inside that file.
I need to get the byte offset of the closing parenthesis of the parameter list of the innermost function() whose body contains that position (or to figure out that the position belongs to the main chunk of the file).
I.e.:
local function foo()
                   ^ result
    print("bar")
      ^ input
end

local foo = function()
                     ^ result
    print("bar")
      ^ input
end

local foo = function()
    return function()
                    ^ result
        print("bar")
          ^ input
    end
end
...And so on.
How do I do that robustly?
EDIT: My original answer did not take into account the "innermost" requirement. I've since taken that into account.
To make things "robust," there are a few considerations.
First of all, it's important that you skip over string and comment contents, to avoid incorrect output in situations like:
foo = function()
    print(" function() ")
    -- function()
    print("bar")
       ^ input
end
This can be somewhat difficult, considering Lua's nested string and comment syntax. Consider, for example, a situation where the input begins in a nested string or comment:
foo = function()
    print([[
        bar = function()
            print("baz")
               ^ input
        end
    ]])
end
Consequently, if you want a completely robust system, it is not acceptable to only parse backwards until you hit the end of a function parameter list, because you may not have parsed backwards far enough to reach a [[ which would invalidate your match. It is therefore necessary to parse the entire file up to your position (unless you're okay with incorrect matches in these weird situations; if this is an editor plugin, these "incorrect" results may actually be desirable, because they would allow you to edit Lua code which is stored in string-literal form inside other Lua code using the same plugin).
Because the particular syntax that you're trying to match doesn't have any kind of "nesting", a full-blown parser isn't needed. You will need to maintain a stack, however, to keep track of scope. With that in mind, all you need to do is step through the source file character-by-character from the beginning, applying the following logic:
Every time a " or ' is encountered, ignore the characters up to the closing " or '. Be careful to handle escapes like \" and \\
Every time a -- is encountered, ignore the characters up to the closing newline for the comment. Be careful to only do this if the comment is not a multiline comment.
Every time a multiline string opening symbol is encountered (such as [[, [=[, etc.), or a multiline comment symbol is encountered (such as --[[ or --[=[, etc.), ignore the characters up until the closing square brackets with the proper number of matching equals signs between them (a small helper for recognizing these openers is sketched after this list).
When a word boundary is encountered, check to see if the characters after it could begin a block which ends with an end (for example, if, while, for, function, etc. DO NOT include repeat). If so, push the position on the scope stack. A "word boundary" in this case is any character which could not be used in a Lua identifier (this is to prevent matches in cases like abcfunction()). The beginning of the file is also considered a word boundary.
If a word boundary is encountered and it is followed by end, pop the top element of the stack. If the stack has no elements, complain about a syntax error.
When you finally step forward and reach your "input" position, pop elements from the stack until you find a function scope. Step forward from that position to the next ), ignoring )'s in comments (which could theoretically be found in an argument list if it spans multiple lines or contains inline --[[ ]] comments). That position is your result.
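As a concrete sketch of the long-bracket detection needed in step 3 (a hypothetical helper, written here in C since the question doesn't fix an implementation language):
/* If s points at a Lua long-bracket opener such as "[[", "[=[", "[==[",
 * return its level (the number of '='s); otherwise return -1. The closer
 * must contain the same number of '='s: "]]", "]=]", "]==]", and so on.
 * For comments, check for a leading "--" before calling this. */
int long_bracket_level(const char *s) {
    if (s[0] != '[') return -1;
    int i = 1;
    while (s[i] == '=') i++;
    return s[i] == '[' ? i - 1 : -1;
}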
This should handle every case, including situations where the function syntactic sugar is used, like
function foo()
    print("bar")
end
which you did not include in your example but which I imagine you still want to match.
