Can anyone provide a clear explanation and some simple examples that show this error, apparently related to match-time capture (Cmt)?
I don't understand the only mention that I can find, which is at
http://lua-users.org/lists/lua-l/2013-06/msg00086.html
Thanks
So this question is a bit old, but it's one of the first search results. There's not a lot on the Internet about this, and it may not be super obvious what's wrong.
The error message is a bit misleading, but what's happening - in formal PEG terms, at least as I understand them - is that a repetition operator has been applied to a parsing expression that can consume no input.
In other words, LPeg has detected a loop that can match the empty string, and such a loop will never complete. There's a pure Lua implementation called LuLPeg which lacks this particular check; if you execute such a grammar with it, you can easily enter an infinite loop.
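The smallest reproduction I can think of is a bare repetition over an empty pattern (a sketch; the exact behaviour may vary across LPeg versions):

local lpeg = require "lpeg"

-- P"" always succeeds without consuming input, so repeating it can
-- never make progress; LPeg rejects this when the pattern is built:
local ok, err = pcall(function() return lpeg.P("")^0 end)
print(ok, err)  --> false   ... loop body may accept empty string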
I'm tinkering with a little toy BASIC-ish language, and had the above issue with the following:
grammar = P{ "input",
  input = V"block"^0 * -1,
  block = V"expression"^0,
  -- (define expression here)
}
The idea is that the root input is an optional block of code, and a block is zero or more expressions. This is pretty simplified, of course; I'm leaving out whitespace handling and that sort of thing. But what happens when you call grammar:match("")?
remaining string: "", at input. See if it matches a block.
remaining string: "", at block. See if it matches an expression.
remaining string: "", at expression. For the sake of time, let's say it doesn't match.
remaining string: "", at block. Rule concludes with zero expressions, no input consumed.
remaining string: "", at input. One block consumed, check if more blocks match.
remaining string: "", at block. See if it matches an expression.
And so on. Since V"block" matches the empty string, input can find an infinite number of blocks to fulfil the rule V"block"^0. In this case, there are two good solutions: set input to at most one block, or require block to be at least one expression and use ^0 wherever a block may appear. So either:
grammar = P{ "input", -- blocks can be empty, input contains one (empty or otherwise) block
  input = V"block" * -1,
  block = V"expression"^0,
  -- (define expression here)
}
Or:
grammar = P{ "input", -- blocks must be at least one expression, root can have zero or more
  input = V"block"^0 * -1,
  block = V"expression"^1,
  -- (define expression here)
}
In the first case, an empty string will match a block, which fulfills the input rule. In the second, an empty string will fail block, fulfilling the input rule with zero matching blocks.
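To make that concrete, here's a runnable sketch of the first variant, with expression filled in as a plain digit run (my stand-in for a real expression rule):

local lpeg = require "lpeg"
local P, V, R = lpeg.P, lpeg.V, lpeg.R

local grammar = P{ "input",
  input = V"block" * -1,   -- exactly one (possibly empty) block
  block = V"expression"^0,
  expression = R"09"^1,    -- stand-in: one or more digits
}

print(grammar:match(""))    --> 1 (the empty block matches)
print(grammar:match("42"))  --> 3 (position just past the match)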
I haven't needed to use Cmt yet, but I believe what happened was old versions of LPeg assumed the function would either fail or consume input, even if the rule inside the Cmt call could match an empty string. More recent releases don't have this assumption.
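For what it's worth, here's a sketch of what I understand the Cmt case to look like; that newer versions reject it at pattern-construction time is my assumption:

local lpeg = require "lpeg"
local P, Cmt = lpeg.P, lpeg.Cmt

-- The pattern inside Cmt (here P"") can succeed without consuming
-- input. Newer LPeg versions conservatively reject the repetition,
-- even if the match-time function would always advance or fail:
local ok, err = pcall(function()
  return Cmt(P"", function(subject, pos) return pos end)^0
end)
print(ok, err)  --> false   ... loop body may accept empty string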
Here's my grammar for the for statements:
FOR x>0 {
//something
}
// or
FOR x = 0; x > 0; x++ {
//something
}
Both forms share the prefix FOR, and I want to print the for_begin label after InitExpression;
however, the action code right after FOR becomes useless because of the conflict.
ForStmt
  : FOR {
      printf("for_begin_%d:\n", n);
    } Expression {
      printf("ifeq for_exit_%d\n", n);
    } ForBlock
  | FOR ForClause ForBlock
  ;

ForClause
  : InitExpression ';' {
      printf("for_begin_%d:\n", n);
    } Expression ';' Expression { printf("ifeq for_exit_%d\n", n); }
  ;
I tried changing it to something like:
ForStart
: FOR
| FOR InitExpression
;
or using a flag to record where to print the for_begin label,
but neither resolved the conflict.
How can I make it not conflict?
How can the parser know which alternative of the FOR statement it sees?
It's possible that an InitExpression has an identifiable form, such as an assignment statement, which could not be used in a conditional expression. That strikes me as too restrictive for practical purposes -- there are many things you might do to initialise a loop other than a direct assignment -- but leaving that aside, it means that the earliest the InitExpression can be definitively identified is when the assignment operator is seen. If lvalues in your language can only be simple identifiers, that would make it the second lookahead token after the FOR, but in most useful languages lvalues can be much more complicated than simple identifiers, so it's likely that the InitExpression cannot be definitively identified with finite lookahead.
But it's more likely that the only significant difference between the two forms is that the expression in the first form is followed by a block (which I suppose cannot start with a semicolon) and the first expression in the second form is followed by a semicolon. So the parser knows what it is parsing at the end of the first expression and no earlier.
Normally, that would not cause a problem. Were it not for the mid-rule action which inserts a label, the parser would not have to make a reduction decision until it reaches the end of the first expression, at which point it must decide whether to reduce the first expression as an InitExpression or an Expression. At that point, the lookahead token is either a semicolon or the first token of a block, so the lookahead token can guide the decision.
But the mid-rule action makes that impossible. The mid-rule action must either be reduced or not before shifting the token which immediately follows FOR, and the lookahead token could be the same in both cases (x, in your examples).
Fundamentally, the issue is that you want to build a one-pass compiler rather than just parsing the input into an AST and then walking the AST to generate assembler code (possibly after doing some other traversals over the AST in order to perform other analyses and allow for code optimisation). The one-pass code generator depends on mid-rule actions, and mid-rule actions in turn can easily generate unresolvable parsing conflicts. This issue is so notorious that there is a chapter in the bison manual dedicated to it, which is well worth reading.
So there is no good solution. But in this case, there is a simple solution, because the action you want to take is just to insert a label, and inserting a label which happens never to be used is not in any way going to affect the code which will ultimately be executed. So you might as well insert a label immediately after the FOR statement, whether you will need it or not, and then insert another label after the InitExpression if it turns out that there was such a thing. You don't need to actually know which label to use until you reach the end of the conditional expression, which is much later.
As explained in the Bison manual chapter mentioned above, this cannot be done using mid-rule actions, because Bison doesn't attempt to compare mid-rule actions with each other. Even if two actions happen to be identical, Bison will still need to decide which one to execute, thereby generating a conflict. So instead of using a mid-rule action, you need to house the action in a marker non-terminal -- a non-terminal with an empty right-hand side, used only to trigger an action.
That would make the grammar look something like this:
ForLabel
  : %empty { $$ = n; printf("for_begin_%d:\n", n++); }
  ;

ForStmt
  : FOR
    ForLabel[label]
    Expression { printf("ifeq for_exit_%d\n", $label); }
    ForBlock   { printf("jmp for_begin_%d\n", $label);
                 printf("for_exit_%d:\n", $label); }
  | FOR
    ForLabel
    InitExpression ';'
    ForLabel[label]
    Expression ';'
    Expression { printf("ifeq for_exit_%d\n", $label); }
    ForBlock   { printf("jmp for_begin_%d\n", $label);
                 printf("for_exit_%d:\n", $label); }
  ;
([label] gives a name to a semantic value, which avoids having to use a rather mysterious and possibly incorrect $2 or $6. See Named References in the handy Bison manual.)
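For a rough idea of the intended output, here is what the first form would emit for FOR x>0 { ... } (assuming Expression and ForBlock emit their own code in place; the instruction names just follow the printf strings above):

for_begin_0:
    ... code for the condition x > 0 ...
    ifeq for_exit_0
    ... code for the block ...
    jmp for_begin_0
for_exit_0: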
Groovy supports / as a division operator:
groovy> 1 / 2
===> 0.5
It supports / as a string delimiter, which can even be multiline:
groovy> x = /foo/
===> foo
groovy:000> x = /foo
groovy:001> bar/
===> foo
bar
Given this, why can't I evaluate a slashy-string literal in groovysh?
groovy:000> /foo/
groovy:001>
clearly groovysh thinks this is unterminated for some reason.
How does Groovy avoid getting confused between division and strings? What does this code mean:
groovy> f / 2
Is this a function call f(/2 .../) where / is beginning a multiline slashy-string, or f divided by 2?
How does Groovy distinguish division from strings?
I'm not entirely sure how Groovy does it, but I'll describe how I'd do it, and I'd be very surprised if Groovy didn't work in a similar way.
Most parsing algorithms I've heard of (Shunting-yard, Pratt, etc.) recognize two distinct kinds of tokens:
Those that expect to be preceded by an expression (infix operators, postfix operators, closing parentheses, etc). If one of these is not preceded by an expression, it's a syntax error.
Those that do not expect to be preceded by an expression (prefix operators, opening parentheses, identifiers, literals, etc). If one of these is preceded by an expression, it's a syntax error.
To make things easier, from this point onward I'm going to refer to the former kind of token as an operator and the latter as a non-operator.
Now, the interesting thing about this distinction is that it's made not based on what the token actually is, but rather on the immediate context, particularly the preceding tokens. Because of this, the same token can be interpreted very differently depending on its position in the code, and whether the parser classifies it as an operator or a non-operator. For example, the '-' token, if in an operator position, denotes a subtraction, but the same token in a non-operator position is a negation. There is no issue deciding whether a '-' is a subtraction operator or not, because you can tell based on its context.
The same is, in general, true for the '/' character in Groovy. If preceded by an expression, it's interpreted as an operator, which means it's a division. Otherwise, it's a non-operator, which makes it a string literal. So you can generally tell whether a '/' is a division or not by looking at the token that immediately precedes it (see the sketch after this list):
The '/' is a division if it follows an identifier, literal, postfix operator, closing parenthesis, or other token that denotes the end of an expression.
The '/' begins a string if it follows a prefix operator, infix operator, opening parenthesis, or other such token, or if it begins a line.
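A toy sketch of that classification, in Groovy itself (this is not Groovy's real lexer; the token names are invented for illustration):

def endsExpression = { tok ->
    tok.type in ['IDENT', 'NUMBER', 'STRING', 'RPAREN', 'POSTFIX_OP']
}

// prev is the previous significant token, or null at the start of input
def classifySlash = { prev ->
    (prev != null && endsExpression(prev)) ? 'DIVISION' : 'SLASHY_STRING'
}

assert classifySlash([type: 'NUMBER']) == 'DIVISION'       // 1 / 2
assert classifySlash([type: 'ASSIGN']) == 'SLASHY_STRING'  // x = /foo/
assert classifySlash(null) == 'SLASHY_STRING'              // /foo/ alone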
Of course, it isn't quite so simple in practice. Groovy is designed to be flexible in the face of various styles and uses, and therefore things like semicolons or parentheses are often optional. This can make parsing somewhat ambiguous at times. For example, say our parser comes across the following line:
println / foo
This is most likely an attempt to print a multiline string: foo is the beginning of a string being passed to println as an argument, with the optional parentheses around the argument list left out. Of course, to a simple parser it looks like a division. I expect the full Groovy parser can tell the difference by reading ahead to the following lines to see which interpretation does not give an error, but for something like groovysh that is impossible (being a REPL, it doesn't yet have access to the following lines), so it's forced to just guess.
Why can't I evaluate a slashy-string literal in groovysh?
As before, I don't know the exact reason, but I do know that because groovysh is a repl, it's bound to have more trouble with the more ambiguous rules. Even so, a simple single-line slashy-string is pretty unambiguous, so I believe something else may be going on here. Here is the result of me playing with various forms in groovysh:
> /foo          - unexpected char: '/' @ line 2, column 1.
> /foo/         - awaits further input
> /foo/bar      - unexpected char: '/' @ line 2, column 1.
> /foo/bar/     - awaits further input
> /foo/ + 'bar' - unexpected char: '/' @ line 2, column 1.
> 'foo' + /bar/ - evaluates to 'foobar'
And with whitespace before the opening slash (cf. the notes further below):
>  /foo/        - evaluates to 'foo'
>  /foo         - awaits further input
>  /foo/bar     - Unknown property: bar
It appears that something strange happens when a '/' character is the first character in a line. The pattern it appears to follow (as far as I can tell) is this:
A slash as the first character of a line begins a strange parsing mode.
In this mode, every line that ends with a slash followed by nothing but whitespace causes the repl to await further lines.
On the first line that ends with something other than a slash (or whitespace following a slash), the error unexpected char: '/' @ line 2, column 1. is printed.
I've also noticed a couple of interesting points regarding this:
Both forward slashes (/) and backslashes (\) appear to count, and seem to be completely interchangeable, in this special mode.
This does not appear to happen at all in groovyConsole or in actual Groovy files.
Putting any whitespace before the opening slash character causes groovysh to interpret it correctly, but only if the opening slash is a forward slash, not a backslash.
So, I personally expect that this is just a quirk of groovysh, either a bug or some under-documented feature I haven't heard about.
Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:
~('\'|'"'|'$'|NEWLINE)
Which means: match anything that is not one of the alternatives inside the parentheses. Now, I know that in flex I can negate character classes (e.g., [^ab]), but some of the rules I want to negate could be more complicated than a single character, so I don't think I can use character classes for that. For example, I may need to negate the sequence '"""' for multiline strings, but I'm not sure what the way to do it in flex would be.
(TL;DR: Skip down to the bottom for a practical answer.)
The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.
The """ case, at least, is not too difficult.
First, let's be clear about what we are trying to match.
Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".
So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:
"""This string might contain " or ""."""
If we start after the initial """ and look for the longest string which doesn't contain """, we will find:
This string might contain " or "".""
whereas what we wanted was:
This string might contain " or "".
So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)
It's (relatively) easy to produce a state diagram for that:
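Rendered as a transition table (standing in for the original diagram image), it looks roughly like this, with state 0 the only accepting state:

state 0 (accepting): on [^"] go to state 0; on " go to state 1
state 1 (one "):     on [^"] go to state 0; on " go to state 2
state 2 (two "):     on [^"] go to state 0; on " no match (that would complete """)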
(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)
Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:
([^"]|\"([^"]|\"[^"]))*
This model will work for any delimiter that is just one repeated character, but it's a little more complicated when the delimiter is not a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:
([^E]|E([^N]|N[^D]))* <--- DON'T USE THIS
but that regular expression will match the string
ENENDstuff which shouldn't have been matched
The real state diagram we're looking for, shown here as a transition table (the original image isn't reproduced), is:
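state 0 (accepting): on [^E] go to state 0; on E go to state 1
state 1 (seen E):    on E stay in state 1; on N go to state 2; on [^EN] go to state 0
state 2 (seen EN):   on E go to state 1; on [^ED] go to state 0; on D no match (that would complete END)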
and one way of writing that as a regular expression is:
([^E]|E(E|NE)*([^EN]|N[^ED]))*
Again, I produced that by tracing all the ways to end up in state 0:
[^E] stays in state 0
E goes to state 1:
(E|NE)*: stay in state 1
[^EN]: back to state 0
N[^ED]: back to state 0 via state 2
This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).
A practical and scalable solution
Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize python triple-quoted strings:
%x TRIPLEQ
start \"\"\"
end \"\"\"
%%
{start} { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }
<TRIPLEQ>.|\n { /* Append the next token to yytext instead of
* replacing yytext with the next token
*/
yymore();
/* No return yet, flex continues */
}
<TRIPLEQ>{end} { /* We've found the end of the string, but
* we need to get rid of the terminating """
*/
yylval.str = malloc(yyleng - 2);
memcpy(yylval.str, yytext, yyleng - 3);
yylval.str[yyleng - 3] = 0;
return STRING;
}
This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.
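In case it's useful, that optimized version would look something like this (the action is unchanged):

<TRIPLEQ>[^"]+|\"|\n   { yymore(); }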
This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions
start "<![CDATA["
end "]]>"
(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)
I was writing a parser for C-like grammars.
So far, it can parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
where newlines are processed by a lexer that I wrote myself (the code is simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
The instruction :PASS here is equivalent to returning nothing in LEX; it drops the currently matched text and skips to the next rule, just like what is usually done with whitespace.
Because of this, I can't simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
| '\n'
;
So I chose to have the parser change the lexer's state dynamically, like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
And I added a lexer rule that matches a newline in the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
This means that when the lexer is in the :STATEMENT_END state, it will first increase the line number as usual, then reset the state to the initial one, and then pretend to be a semicolon.
Strangely, it doesn't actually work with the following code:
a = 1
b = 2
I debugged it and found that the lexer does not actually return a ';' as expected when it scans the newline after the number 1, and the state-specific rule is never executed.
The code that sets the new state runs only after the lexer has already scanned the newline and returned nothing. That is, the work is done in the following order:
scan a, = and 1
scan newline and skip, so get the next value b
the inserted code ({ state = :STATEMENT_END }) is executed
an error is raised -- unexpected b here
This is what I expect:
scan a, = and 1
find that it matches the rule expr, so reduce it into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ';' according to the new state's matching rule
continue to scan & parse the following line
After some digging, I found the likely cause: YACC uses LALR(1), so the parser reads one token ahead. By the time the mid-rule action runs, the lexer has already scanned the newline with the state not yet set, so it cannot produce the correct token.
My question is: how can I make this work as expected? I'm out of ideas.
Thanks.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
-c;
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
(c);
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parentheses, so you end up needing an additional state variable to record the paren nesting depth, and only setting the insert-semi-before-newline state when that depth is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or deferring the return of the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) performed by the parser. In this specific case, an easy approach would be to have the lexer track the parentheses seen (+1 for (, -1 for )) and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (an identifier, a constant, a ), or a postfix-only operator), return the extra ';'.
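Here is a sketch of that lexer-only approach in the question's rule() DSL (the token names, the bookkeeping variables, and the exact set of expression-ending tokens are my assumptions):

ENDS_EXPR = [:IDENT, :NUMBER, :STRING, :RPAREN]

rule(/\(/) { paren_depth += 1; last_token = :LPAREN; return :LPAREN }
rule(/\)/) { paren_depth -= 1; last_token = :RPAREN; return :RPAREN }

rule(/\r\n|\r|\n/) {
  increase_lineno()
  if paren_depth == 0 && ENDS_EXPR.include?(last_token)
    last_token = :SEMI
    return ';'   # insert the implicit statement terminator
  end
  return :PASS   # otherwise treat the newline as whitespace
}

(Every other rule would also need to record last_token before returning.)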
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (it's now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or by using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
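A minimal sketch of the grammar shape the NEWLINE-token approach implies (the nonterminal names are mine):

stmt     : expr stmt_end ;
stmt_end : ';' | NEWLINE ;
opt_nl   : /* empty */ | opt_nl NEWLINE ;

where opt_nl would then be sprinkled between tokens wherever a line break should be allowed to continue a statement -- which is exactly what exposes the ambiguity to the parser.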
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.
For example, my lexer recognizes a function call pattern:
//i.e. hello(...), foo(...), bar(...)
FUNCALL [a-zA-Z0-9]*[-_]*[a-zA-Z0-9]+[-_*][a-zA-Z0-9]*\(.\)
Flex recognizes the pattern, but it moves past the last character of the match (i.e., after storing foo(...) in yytext, the lexer points at the next character after foo(...)).
How can I reset the lexer's pointer back to the beginning of the function pattern? I.e., after recognizing foo(..), I want the lexer to point at the start of foo(..), so I can start tokenizing it.
I need to do this because only one token can be returned for each pattern: after matching foo(...), I can return either foo or ( or ) with a return statement, but not all of them.
Flex has a trailing context pattern match (manual excerpt below). Read and understand the limitations before you use this.
`r/s'
an `r' but only if it is followed by an `s'. The text matched by
`s' is included when determining whether this rule is the longest
match, but is then returned to the input before the action is
executed. So the action only sees the text matched by `r'. This
type of pattern is called "trailing context". (There are some
combinations of `r/s' that flex cannot match correctly. *Note
Limitations::, regarding dangerous trailing context.)
Presumably something like this:
FUNCALL [a-zA-Z0-9]*[-_]*[a-zA-Z0-9]+[-_*][a-zA-Z0-9]*/\(.\)
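Another option, if you want to keep just the function name and rescan the rest, is yyless(n), which returns all but the first n characters of the match to the input. A sketch (FUNC_NAME is an invented token, and strcspn() needs #include <string.h> in the definitions section):

{FUNCALL}  {
             /* strcspn() finds the offset of the first '(' in yytext;
                yyless() pushes everything from there back onto the
                input, so '(' and what follows are scanned again */
             yyless(strcspn(yytext, "("));
             return FUNC_NAME;
           }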
You may find that it makes more sense to change your parser so you don't need to do this.