From what I was told using fail is not recommended and it will later be removed.
What should be properly used instead of fail in the following Parsers/Trifecta example?
parserNaturalNoLeadZero :: Parser Integer
parserNaturalNoLeadZero = do
digits <- some digit
if length digits > 1 && head digits == '0'
then fail "Leading Zeros"
else return $ read digits
As the documentation tells you, a new MonadFail class is being introduced to fulfill that role.
But, for stuff like parsers, the sensible choice is usually empty, which has been around for much longer.
Parsec:
unexpected
fail
empty
Trifecta:
unexpected
fail
empty
The only difference is the error message they produce.
Use unexpected on an unexpected token. unexpected "token" will result in an error message like "unexpected: 'token'".
Annotate parsers with the high-level constructs they represent using (<?>).
This is normally used at the end of a set alternatives where we want to return an error message in terms of a higher level construct rather than returning all possible characters.
parseExpr = ... <?> "expression"
parseId = ... <?> "identifier"
parseTy = ... <?> "type"
empty doesn't produce any error message. It can still be useful to backtrack and let another branch succeed or take care of reporting a meaningful error.
Use fail for other kinds of errors, libraries can't assume much about what goes into it so they'll probably treat its argument as a raw message.
Related
Grammar:
rule: (a b)? a c ;
Input:
a d
Question: Which error message correct at position 2 for given input?
1. expected "b", "c".
2. expected "c".
P.S.
I write parser and I have choice (dilemma) take into account that "b" expected at position or not take.
The #1 error (expected "b", "c") want to say that input "a b" expected but because it optional it may not expected but possible.
I don't know possible is the same as expected or not?
Which error message better and correct #1 or #2?
Thanks for answers.
P.S.
In first case I define marker of testing as limit of position.
if(_inputPos > testing) {
_failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}
Limit moved in optional expressions:
OPTIONAL_EXPRESSION:
testing = _inputPos;
The "b" expression move _inputPos above the testing pos and add failure at _inputPos.
In second case I can define marker of testing as boolean flag.
if(!testing) {
_failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}
The "b" expression in this case not add failure because it tested (inner for optional expression).
What you think what is better and correct?
Testing defined as specific position and if expression above this position (_inputPos > testing) it add failure (even it inside optional expression).
Testing defined as flag and if this flag set that the failures not takes into account. After executing optional expression it restore (not reset!) previous value of testing (true or false).
Also failures not takes into account if rule not fails. They only reported if parsing fails.
P.S.
Changes at 06 Jan 2014
This question raised because it related to two different problems.
First problem:
Parsing expression grammar (PEG) describe only three atomic items of input:
terminal symbol
nonterminal symbol
empty string
This grammar does not provide such operation as lexical preprocessing an thus it does not provide such element as the token.
Second problem:
What is a grammar? Are two grammars can be considred equal if they accept the same input but produce different result?
Assume we have two grammar:
Grammar 1
rule <- type? identifier
Grammar 2
rule <- type identifier / identifier
They both accept the same input but produce (in PEG) different result.
Grammar 1 results:
{type : type, identifier : identifier}
{type : null, identifier : identifier}
Grammar 2 results:
{type : type, identifier : identifier}
{identifier : identifier}
Quetions:
Both grammar equal?
It is painless to do optimization of grammars?
My answer on both questions is negative. No equal, Not painless.
But you may ask. "But why this happens?".
I can answer to you. "Because this is not a problem. This is a feature".
In PEG parser expression ALWAYS consists from these parts.
ORDERED_CHOICE => SEQUENCE => EXPRESSION
And this explanation is the my answer on question "But why this happens?".
Another problem.
PEG parser does not recognize WHITESPACES because it does not have tokens and tokens separators.
Now look at this grammar (in short):
program <- WHITESPACE expr EOF
expr <- ruleX
ruleX <- 'X' WHITESPACE
WHITESPACE < ' '?
EOF <- ! .
All PEG grammar desribed in this manner.
First WHITESPACE at begin and other WHITESPACE (often) at the end of rule.
In this case in PEG optional WHITESPACE must be assumed as expected.
But WHITESPACE not means only space. It may be more complex [\t\n\r] and even comments.
But the main rule of error messages is the following.
If not possible to display all expected elements (or not possible to display at least one from all set of expected elements) in this case is more correct do not display anything.
More precisely required to display "unexpected" error mesage.
How you in PEG will display expected WHITESPACE?
Parser error: expected WHITESPACE
Parser error: expected ' ', '\t', '\n' , 'r'
What about start charcters of comments? They also may be part of WHITESPACE in some grammars.
In this case optional WHITESPACE will be reject all other potential expected elements because not possible correctly to display WHITESPACE in error message because WHITESPACE is too complex to display.
Is this good or bad?
I think this is not bad and required some tricks to hide this nature of PEG parsers.
And in my PEG parser I not assume that the inner expression at first position of optional (optional & zero_or_more) expression must be treated as expected.
But all other inner (except at the first position) must treated as expected.
Example 1:
List<int list; // type? ident
Here "List<int" is a "type". But missing ">" is not at the first position in optional "type?".
This failure take into account and report as "expected '>'"
This is because we not skip "type" but enter into "type" and after really optional "List" we move position from first to next real "expected" (that already outside of testing position) element.
"List" was in "testing" position.
If inner expression (inside optional expression) "fits in the limitation" not continue at next position then it not assumed as the expected input.
From this assumption has been asked main question.
You must just take into account that we are talking about PEG parsers and their error messages.
Here is your grammar:
What is clear here is that after the first a there are two possible inputs: b or c. Your error message should not prioritize one over the other.
The basic idea to produce an error message for an invalid input is to find the most far place you failed (if your grammar where d | (a b)? a c, d wouldn't be part of the error) and determine what are all possible inputs that could make you advance and say "expected '...' but got '...'". There are other approaches to try to recover the parser and force it to continue. If there is only one possible expected token, let's temporarily insert it into the token stream and continue as if it where there since ever. This would lead to better error detection as you can find errors beyond the point where the parser first stopped.
PEG-based parser generators usually provide limited error reporting on invalid inputs. From what I read, the parse dialect of rebol is inspired by PEG grammars extended with regular expressions.
For example, typing the following in JavaScript:
d8> function () {}
gives the following error, because no identifier was provided in declaring a global function:
(d8):1: SyntaxError: Unexpected token (
function () {}
^
The parser is able to pinpoint exactly the position during parsing where an expected token is missing. The character position of the expected token is used to position the arrow in the error message.
Does the parse dialect in rebol provides built-in facilities to report the line and column errors on invalid inputs?
Otherwise, are there examples out there of custom rolled out parse rules that provide such error reporting?
I've done very advanced Rebol parsers which manage live and mission-critical TCP servers, and doing proper error reporting was a requirement. So this is important!
Probably one of the most unique aspects of Rebol's PARSE is that you can include direct evaluation within the rules. So you can set variables to track the parse position, or the error messages, etc. (It's very easy because the nature of Rebol is that mixing code and data as the same thing is a core idea.)
So here's the way I did it. Before each match rule is attempted, I save the parse position into "here" (by writing here:) and then also save an error into a variable using code execution (by putting (error: {some error string}) in parentheses so that the parse dialect runs it). If the match rule succeeds, we don't need to use the error or position...and we just go on to the next rule. But if it fails we will have the last state we set to report after the failure.
Thus the pattern in the parse dialect is simply:
; use PARSE dialect handling of "set-word!" instances to save parse
; position into variable named "here"
here:
; escape out of the parse dialect using parentheses, and into the DO
; dialect to run arbitrary code. Here we run code that saves an error
; message string into a variable named "error"
(error: "<some error message relating to rule that follows>")
; back into the PARSE dialect again, express whatever your rule is,
; and if it fails then we will have the above to use in error reporting
what: (ever your) [rule | {is}]
That's basically what you need to do. Here is an example for phone numbers:
digit: charset "012345689"
phone-number-rule: [
here:
(error: "invalid area code")
["514" | "800" | "888" | "916" "877"]
here:
(error: "expecting dash")
"-"
here:
(error: "expecting 3 digits")
3 digit
here:
(error: "expecting dash")
"-"
here:
(error: "expecting 4 digits")
4 digit
(error: none)
]
Then you can see it in action. Notice that we set error to none if we reach the end of the parse rules. PARSE will return false if there is still more input to process, so if we notice there is no error set but PARSE returns false anyway... we failed because there was too much extra input:
input: "800-22r2-3333"
if not parse input phone-number-rule [
if none? error [
error: "too much data for phone number"
]
]
either error [
column: length? copy/part input here newline
print rejoin ["error at position:" space column]
print error
print input
print rejoin [head insert/dup "" space column "^^"}
print newline
][
print {all good}
]
The above will print the following:
error at position: 4
expecting 3 digits
800-22r2-3333
^
Obviously, you could do much more potent stuff, since whatever you put in parens will be evaluated just like normal Rebol source code. It's really flexible. I even have parsers which update progress bars while loading huge datasets... :-)
Here is a simple example of finding the position during parsing a string which could be used to do what you ask.
Let us say that our code is only valid if it contains a and b characters, anything else would be illegal input.
code-rule: [
some [
"a" |
"b"
]
[ end | mark: (print [ "Failed at position" index? mark ]) ]
]
Let's check that with some valid code
>> parse "aaaabbabb" code-rule
== true
Now we can try again with some invalid input
>> parse "aaaabbXabb" code-rule
Failed at position 7
== false
This is a rather simplified example language, but it should be easy to extend to more a complex example.
I'm implementing a PEG parser generator in Python, and I've had success so far, except with the "cut" feature, of which whomever knows Prolog must know about.
The idea is that after a cut (!) symbol has been parsed, then no alternative options should be attempted at the same level.
expre = '(' ! list ')' | atom.
Means that after the ( is seen, the parsing must succeed, or fail without trying the second option.
I'm using Python's (very efficient) exception system to force backtracking, so I tried having a special FailedCut exception that would abort the enclosing choice, but that didn't work.
Any pointers to how this functionality is implemented in other parser generators would be helpful.
Maybe the problem I've had has been lack of locality. The code generated for the left part of the rule would be something like:
cut_seen = False
try:
self.token('(')
cut_seen = True
self.call('list')
self.token(')')
except FailedParse as e:
if cut_seen:
raise FailedCut(e)
raise
Then the code generated for the choice (|) operator will skip the following choices if it catches a FailedCut. What I mean by lack of locality is that the choice catching the FailedCut may be deep up in calls, thus having an effect too-difficult to discern.
Instead of making the code generated for sequences try to inform enclosing choices of cuts, I could make the code generated for choices beware of them. That would make the scope of cuts very local, unlike Prolog's, but good enough for what I want in a PEG parser, which is to commit to an option after a certain token sequence has been seen, so the error reporting is refers to that location in the source, instead of to another location where some other option might have been available.
It just occurred to me that if the code generated for a rule/predicate catches FailedCut and translates it into a normal FailedParse exception, then the cuts will have the right scope.
In reference to #false's question, here's a complete example of what I want to work:
start = expre ;
expre = named | term ;
named = word ':' ! term;
term = word ;
In that grammar, word can be reached through named or term, but I would like the parser to commit to the named branch after it has seen the :.
The Solution
To be fair, I've published my work so far at https://bitbucket.org/apalala/grako/.
In the final solution, sequences are enclosed with this context manager:
#contextmanager
def _sequence(self):
self._push_cut()
try:
yield
except FailedParse as e:
if self._cut():
self.error(e, FailedCut)
else:
raise
finally:
self._pop_cut()
And options in a choice function are enclosed with this:
#contextmanager
def _option(self):
p = self._pos
try:
self._push_ast()
try:
yield
ast = self.ast
finally:
self._pop_ast()
self.ast.update(ast)
except FailedCut as e:
self._goto(p)
raise e.nested
except FailedParse:
self._goto(p)
Which forces an exit out of the choice instead of a return to try the next option.
The cuts themselves are implemented thus:
def _cut(self):
self._cut_stack[-1] = True
The full source code may be found on Bitbucket.
In a Prolog with ISO Prolog's exception handling (catch/3 and throw/1), a cut could be implemented as:
cut. % Simply succeeds
cut :-
throw(cut). % on backtracking throws an exception
This would require to catch that exception at appropriate places. For example, each goal (that is non-terminal) of a user defined predicate could now be wrapped with:
catchcut(Goal) :-
catch(Goal,cut,fail).
This is not the most efficient way to implement cut since it does not free resources upon success of !, but it might be sufficient for your purposes. Also, this method now might interfere with user-defined uses of catch/3. But you probably do not want to emulate the entire Prolog language in any case.
Also, consider to use Prolog's dcg-grammars directly. There is a lot of fine print that is not evident when implementing this in another language.
The solution proposed at the end of my question worked:
cut_seen = False
try:
self.token('(')
cut_seen = True
self.call('list')
self.token(')')
except FailedParse as e:
if cut_seen:
raise FailedCut(e)
raise
Then, any time a choice or optional is evaluated, the code looks like this:
p = self.pos
try:
# code for the expression
except FailedCut:
raise
except FailedParse:
self.goto(p)
Edit
The actual solution required keeping a "cut stack". The source code is int Bitbucket.
Just read it.
I'd suggested a deep cut_seen (like with modifying parser's state) and a save and restore state with local variables. This uses the thread's stack as "cut_seen stack".
But you have another solution, and I'm pretty sure you're fine already.
BTW: nice compiler – it's just the opposite of what I'm doing with pyPEG so I can learn alot ;-)
I was writing a parser to parse C-like grammars.
First, it could now parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
Where the new line is processed by the lexer that written by myself(the code are simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
the instruction :PASS here is equivalent to return nothing in LEX, which drop current matched text and skip to the next rule, just like what is usually done with whitespaces.
Because of this, I can't just simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
| '\n'
;
So I chose to change the lexer's state dynamically by the parser correspondingly.
Like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
And add a lexer rule that can match new line with the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
Which means when the lexer is under :STATEMENT_END state. it will first increase the line number as usual, and then set the state into initial one, and then pretend itself is a semicolon.
It's strange that it doesn't actually work with following code:
a = 1
b = 2
I debugged it and got it is not actually get a ';' as expect when scanned the newline after the number 1, and the state specified rule is not really executed.
And the code to set the new state is executed after it already scanned the new line and returned nothing, that means, these works is done as following order:
scan a, = and 1
scan newline and skip, so get the next value b
the inserted code({ state = :STATEMENT_END }) is executed
raising error -- unexpected b here
This is what I expect:
scan a, = and 1
found that it matches the rule expr, so reduce into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ; according the new state matching rule
continue to scan & parse the following line
After introspection I found that might caused as YACC uses LALR(1), this parser will read forward for one token first. When it scans to there, the state is not set yet, so it cannot get a correct token.
My question is: how to make it work as expected? I have no idea on this.
Thanks.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
-c;
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
(c);
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parenthesis, so you end up needing an additional state var to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or defer returning the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis seen (+1 for (, -1 for )), and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (ID or constant or ) or postfix-only operator), return the extra ;
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (its now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.
What is the prefered way to raise errors (ParseError) in Parsec? I got some code inside a parser that performs a check and if the check fails a ParseError should be returned (i.e. Left ParseError when running parse).
You can use Text.ParserCombinators.Parsec.Prim.unexpected and Control.Monad.fail for this. Both take a String argument signifying the error message and will return (in this case) a value of type GenParser tok st a.
For more, see Text.ParserCombinators.Parsec.Error, specifically Message. There you can read which function to use in which case (though both signify a parse error, they are semantically slightly different).