May someone help me understand the following behavior:
parseAll(parseIf, "If bla blablaa") should fail with "is expected". Instead I always get "string matching regex 'is\b' expected but 'b' found".
I guess it has something to do with whitespaces because " If bla is blablaa" (notice the whitespaces at the beginning) results in the same behavior. I tried it with StandardTokenParsers and everything worked fine. But STP unfortunately doesn't support regex.
Follow-up question: How would I have to alter RegexParsers so it uses a sequence of Strings instead of a sequence of chars? That would make error reporting a lot easier.
lazy val parseIf = roleGiverIf ~ giverRole
lazy val roleGiverIf =
kwIf ~> identifier | failure("""A rule must begin with if""")
lazy val giverRole =
kwIs ~> identifier | failure("""is expected""")
lazy val keyword =
kwIf | kwAnd | kwThen | kwOf | kwIs | kwFrom | kwTo
lazy val identifier =
not(keyword) ~ roleEntityLiteral
// ...
def roleEntityLiteral: Parser[String] =
"""([^"\p{Cntrl}\\]|\\[\\/bfnrt]|\\u[a-fA-F0-9]{4})\S*""".r
def kwIf: Parser[String] = "If\\b".r
def kwIs: Parser[String] = "is\\b".r
// ...
parseAll(parseIf, "If bla blablaa") match {
  case Success(result, _) => println(result)
  case Failure(msg, _) => println("Failure: " + msg)
  case Error(msg, _) => println("Error: " + msg)
}
This problem is very weird. When you call | and both sides are failures, the failure that happened furthest into the input is selected, with ties going to the right-hand side.
When you try to parse directly with giverRole, it produces the result you expect. If you add a successful match before the failure, though, it produces the result you are seeing.
The reason is rather subtle -- I only found it out by sprinkling log statements on all parsers. To understand it, you must understand how RegexParsers skips whitespace. Specifically, whitespace is skipped on accept. Because failure doesn't call accept, it doesn't skip whitespace.
While the failure of kwIs happens on b, because the space is skipped, the failure of failure happens on the space just before that b. Here:
If bla blablaa
       ^ kwIs fails here
      ^ failure fails here
Therefore, the error message on kwIs gets precedence by the rule I mentioned.
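To make the interaction concrete, here is a minimal language-neutral sketch in Python (not the Scala library itself). The helper names regex, failure and alt are hypothetical stand-ins for the implicit regex parser, failure and |. The sketch assumes that a regex parser skips whitespace before matching, that failure reports at the current offset, and that when both alternatives fail the right-hand one is kept unless the left-hand one got strictly further.

```python
import re

# A result is (ok, value_or_message, offset); for a failure, offset is where it was reported.

def regex(pattern):
    compiled = re.compile(pattern)
    def run(text, pos):
        # Assumption: like RegexParsers, skip leading whitespace before matching.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        m = compiled.match(text, pos)
        if m:
            return True, m.group(), m.end()
        found = text[pos] if pos < len(text) else "end of input"
        return False, f"regex '{pattern}' expected but '{found}' found", pos
    return run

def failure(message):
    # Fails at the current offset; no whitespace is skipped.
    return lambda text, pos: (False, message, pos)

def alt(left, right):
    def run(text, pos):
        first = left(text, pos)
        if first[0]:
            return first
        second = right(text, pos)
        if second[0]:
            return second
        # Both failed: keep the left failure only if it got strictly further.
        return first if second[2] < first[2] else second
    return run

giver_role = alt(regex(r"is\b"), failure("is expected"))

# Offset 0 of "blablaa": parsing directly, both fail at the same place.
# Offset 6 of "If bla blablaa": the space that follows "bla".
for text, start in [("blablaa", 0), ("If bla blablaa", 6)]:
    ok, message, offset = giver_role(text, start)
    print(offset, message)
# 0 is expected
# 7 regex 'is\b' expected but 'b' found
```

The only difference between the two runs is the leading space: skipping it moves the regex failure one character further than the plain failure, so the regex message wins.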
You can get around this problem by making the parser skip the spaces without matching anything. It is important that this pattern always match, or you'll get an even more confusing error message. Here's a suggestion I think works:
"\\b|$".r ~ failure("is expected")
Another solution is to use acceptIf or acceptMatch instead of using the implicit regex accept, in which case you can provide a tailored error message.
Related
What approach would allow me to get the most out of lexical error reporting?
For a simple example I would like to write a grammar for the following text
(white space is ignored and string constants cannot have a \" in them for simplicity):
myvariable = 2
myvariable = "hello world"
Group myvariablegroup {
myvariable = 3
anothervariable = 4
}
Catching errors with a lexer
How can you maximize the error reporting potential of a lexer?
After reading this post: Where should I draw the line between lexer and parser?
I understood that the lexer should match as much as it can with regards to the parser grammar but what about lexical error reporting strategies?
What are the ordinary strategies for catching lexing errors?
I am imagining a grammar which would have the following "error" tokens:
GROUP_OPEN: 'Group' WS ID WS '{';
EMPTY_GROUP: 'Group' WS ID WS '{' WS '}';
EQUALS: '=';
STRING_CONSTANT: '"~["]+"';
GROUP_CLOSE: '}';
GROUP_ERROR: 'Group' .; // the . character is an invalid token
// you probably meant '{'
GROUP_ERROR2: .'roup' ; // Did you mean 'group'?
STRING_CONSTANT_ERROR: '"' .+; // Unterminated string constant
ID: [a-z][a-z0-9]+;
WS: [ \n\r\t]* -> skip();
SINGLE_TOKEN_ERRORS: .+?;
There are clearly some problems with your approach:
You are skipping WS (which is good), but yet you're using it in your other rules. But you're in the lexer, which leads us to...
Your groups are being recognized by the lexer. I don't think you want them to become a single token. Your groups belong in the parser.
Your grammar, as written, will create specific token types for things ending in roup, so croup for instance may never match an ID. That's not good.
STRING_CONSTANT_ERROR is much too broad. It's able to gobble up the entire rest of the input. See my UNTERMINATED_STRING below.
I'm not quite sure what happens with SINGLE_TOKEN_ERRORS... See below for an alternative.
Now, here are some examples of error tokens I use, and this works very well for error reporting:
UNTERMINATED_STRING
: '"' ('\\' ["\\] | ~["\\\r\n])*
;
UNTERMINATED_COMMENT_INLINE
: '/*' ('*' ~'/' | ~'*')*? EOF -> channel(HIDDEN)
;
// This should be the LAST lexer rule in your grammar
UNKNOWN_CHAR
: .
;
Note that these unterminated tokens represent single atomic values, they don't span logical structures.
Also, UNKNOWN_CHAR will be a single char no matter what. If you define it as .+?, it will still match exactly one char, since it tries to match as few chars as possible, and that minimum is one char.
Non-greedy quantifiers make sense when something follows them. For instance, in the expression .+? '#', the .+? will be forced to consume characters until it encounters a # sign. If the .+? expression is alone, it won't have to consume more than a single character to match, and is therefore equivalent to a single . (dot).
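The same effect can be observed with an ordinary regex engine. The sketch below uses Python's re module rather than ANTLR, on the assumption that lazy quantifiers behave analogously in both; the sample input is made up for illustration.

```python
import re

# Alone, a lazy quantifier stops as soon as it can: after one character.
alone = re.match(r".+?", "abc#def").group()

# Followed by '#', it has to keep consuming until the '#' can match.
followed = re.match(r".+?#", "abc#def").group()

print(repr(alone), repr(followed))  # 'a' 'abc#'
```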
I use the following code in the lexer (.NET ANTLR):
partial class MyLexer
{
public override IToken Emit()
{
CommonToken token;
RecognitionException ex;
switch (Type)
{
case UNTERMINATED_STRING:
Type = STRING;
token = (CommonToken)base.Emit();
ex = new UnterminatedTokenException(this, (ICharStream)InputStream, token);
ErrorListenerDispatch.SyntaxError(this, UNTERMINATED_STRING, Line, Column, "Unterminated string: " + GetTokenTextForDisplay(token), ex);
return token;
case UNTERMINATED_COMMENT_INLINE:
Type = COMMENT_INLINE;
token = (CommonToken)base.Emit();
ex = new UnterminatedTokenException(this, (ICharStream)InputStream, token);
ErrorListenerDispatch.SyntaxError(this, UNTERMINATED_COMMENT_INLINE, Line, Column, "Unterminated comment: " + GetTokenTextForDisplay(token), ex);
return token;
default:
return base.Emit();
}
}
// ...
}
Notice that when the lexer encounters a bad token type, it explicitly changes it to a valid token, so the parser can actually make sense of it.
Now, it is the job of the parser to identify bad structure. ANTLR is smart enough to perform single-token deletion and single-token insertion while trying to resynchronize itself with an invalid input. This is also the reason why I'm letting UNKNOWN_CHAR slip through to the parser, so it can discard it with an error message.
Just take the errors it generates and alter them in order to present something nicer to the user.
So, just make your groups into a parser rule.
An example:
Consider the following input:
Group ,ygroup {
Here, the , is clearly a typo (user pressed , instead of m).
If you use UNKNOWN_CHAR: .; you will get the following tokens:
Group of type GROUP
, of type UNKNOWN_CHAR
ygroup of type ID
{ of type '{'
The parser will be able to figure out the UNKNOWN_CHAR token needs to be deleted and will correctly match a group (defined as GROUP ID '{' ...).
ANTLR will insert so-called error nodes at the points where it finds unexpected tokens (in this case between GROUP and ID). These nodes are then ignored for the purposes of parsing, but you can retrieve them with your visitors/listeners to handle them (you can use a visitor's VisitErrorNode method for instance).
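The idea of single-token deletion can be sketched in a few lines of Python. This is a simplified illustration, not ANTLR's actual recovery algorithm: the function match_group_header, the token type name LBRACE and the message wording are all assumptions made up for the example.

```python
def match_group_header(tokens):
    """Match GROUP ID '{' over (type, text) pairs, tolerating one stray token per step."""
    expected = ["GROUP", "ID", "LBRACE"]
    matched, errors = [], []
    i = 0
    for want in expected:
        if i < len(tokens) and tokens[i][0] == want:
            matched.append(tokens[i])
            i += 1
        elif i + 1 < len(tokens) and tokens[i + 1][0] == want:
            # Single-token deletion: report the offending token, then skip it.
            errors.append(f"extraneous input {tokens[i][1]!r} expecting {want}")
            matched.append(tokens[i + 1])
            i += 2
        else:
            found = tokens[i][1] if i < len(tokens) else "<EOF>"
            errors.append(f"mismatched input {found!r} expecting {want}")
            return None, errors
    return matched, errors

# Token stream for the input: Group ,ygroup {
tokens = [("GROUP", "Group"), ("UNKNOWN_CHAR", ","), ("ID", "ygroup"), ("LBRACE", "{")]
matched, errors = match_group_header(tokens)
print([t[0] for t in matched])  # ['GROUP', 'ID', 'LBRACE']
print(errors)                   # ["extraneous input ',' expecting ID"]
```

Because the stray character is its own token, the structure around it is still recognized and the error stays local to that one character.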
PEG-based parser generators usually provide limited error reporting on invalid inputs. From what I read, the parse dialect of rebol is inspired by PEG grammars extended with regular expressions.
For example, typing the following in JavaScript:
d8> function () {}
gives the following error, because no identifier was provided in declaring a global function:
(d8):1: SyntaxError: Unexpected token (
function () {}
^
The parser is able to pinpoint exactly the position during parsing where an expected token is missing. The character position of the expected token is used to position the arrow in the error message.
Does the parse dialect in Rebol provide built-in facilities to report the line and column of errors on invalid inputs?
Otherwise, are there examples out there of custom-rolled parse rules that provide such error reporting?
I've done very advanced Rebol parsers which manage live and mission-critical TCP servers, and doing proper error reporting was a requirement. So this is important!
Probably one of the most unique aspects of Rebol's PARSE is that you can include direct evaluation within the rules. So you can set variables to track the parse position, or the error messages, etc. (It's very easy because the nature of Rebol is that mixing code and data as the same thing is a core idea.)
So here's the way I did it. Before each match rule is attempted, I save the parse position into "here" (by writing here:) and then also save an error into a variable using code execution (by putting (error: {some error string}) in parentheses so that the parse dialect runs it). If the match rule succeeds, we don't need to use the error or position...and we just go on to the next rule. But if it fails we will have the last state we set to report after the failure.
Thus the pattern in the parse dialect is simply:
; use PARSE dialect handling of "set-word!" instances to save parse
; position into variable named "here"
here:
; escape out of the parse dialect using parentheses, and into the DO
; dialect to run arbitrary code. Here we run code that saves an error
; message string into a variable named "error"
(error: "<some error message relating to rule that follows>")
; back into the PARSE dialect again, express whatever your rule is,
; and if it fails then we will have the above to use in error reporting
what: (ever your) [rule | {is}]
That's basically what you need to do. Here is an example for phone numbers:
digit: charset "0123456789"
phone-number-rule: [
here:
(error: "invalid area code")
["514" | "800" | "888" | "916" | "877"]
here:
(error: "expecting dash")
"-"
here:
(error: "expecting 3 digits")
3 digit
here:
(error: "expecting dash")
"-"
here:
(error: "expecting 4 digits")
4 digit
(error: none)
]
Then you can see it in action. Notice that we set error to none if we reach the end of the parse rules. PARSE will return false if there is still more input to process, so if we notice there is no error set but PARSE returns false anyway... we failed because there was too much extra input:
input: "800-22r2-3333"
if not parse input phone-number-rule [
if none? error [
error: "too much data for phone number"
]
]
either error [
column: length? copy/part input here newline
print rejoin ["error at position:" space column]
print error
print input
print rejoin [head insert/dup "" space column "^^"]
print newline
][
print {all good}
]
The above will print the following:
error at position: 4
expecting 3 digits
800-22r2-3333
    ^
Obviously, you could do much more potent stuff, since whatever you put in parens will be evaluated just like normal Rebol source code. It's really flexible. I even have parsers which update progress bars while loading huge datasets... :-)
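For readers who don't know Rebol, the same bookkeeping can be sketched in Python. This is only an illustration of the pattern (record the position and the pending error message before each step), not a translation of PARSE; the function parse_phone and its helpers are hypothetical names.

```python
def parse_phone(text):
    """Return (error, position); error is None when the whole input is valid."""
    def literal(s):
        return lambda rest: len(s) if rest.startswith(s) else None
    def digits(n):
        ok = lambda rest: len(rest) >= n and all(c in "0123456789" for c in rest[:n])
        return lambda rest: n if ok(rest) else None
    def one_of(options):
        return lambda rest: next((len(o) for o in options if rest.startswith(o)), None)

    # Each step pairs the error to report with a matcher that returns
    # the number of characters consumed, or None on failure.
    steps = [
        ("invalid area code", one_of(("514", "800", "888", "916", "877"))),
        ("expecting dash", literal("-")),
        ("expecting 3 digits", digits(3)),
        ("expecting dash", literal("-")),
        ("expecting 4 digits", digits(4)),
    ]
    here = 0  # plays the role of the saved parse position
    for error, matcher in steps:
        consumed = matcher(text[here:])
        if consumed is None:
            return error, here
        here += consumed
    if here < len(text):
        return "too much data for phone number", here
    return None, here

source = "800-22r2-3333"
error, column = parse_phone(source)
if error:
    print("error at position:", column)
    print(error)
    print(source)
    print(" " * column + "^")
```

Since the error message is fixed before each step runs, the failure report needs no knowledge of which matcher failed or why.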
Here is a simple example of finding the position during parsing a string which could be used to do what you ask.
Let us say that our code is only valid if it contains a and b characters, anything else would be illegal input.
code-rule: [
some [
"a" |
"b"
]
[ end | mark: (print [ "Failed at position" index? mark ]) ]
]
Let's check that with some valid code
>> parse "aaaabbabb" code-rule
== true
Now we can try again with some invalid input
>> parse "aaaabbXabb" code-rule
Failed at position 7
== false
This is a rather simplified example language, but it should be easy to extend to a more complex example.
I'm trying to get the following to work. I have strings that are inside parentheses. The strings can contain any characters, and hence the string that I want to parse can also contain parentheses. I think the regex currently also matches the last parenthesis, which is supposed to be matched by <~ ")", and thus the parsing fails. What am I missing here?
private def parser: Parser[Any] = a ~ b ~ c ^^ {
<do stuff here>
}
private def a: Parser[String] = "\"[^\"]*\"".r | "[^(),>]*".r
private def b: Parser[String] = opt("(" ~> ".*".r <~ ")") ^^ {
case Some(y) => y.trim
case None => ""
}
private def c: Parser[String] = rep(".#" ~> "[^>.]*".r) ^^ (new String(_).trim)
This is supposed to parse following kind of strings:
test0
test1.#attr
"test2"
"test3".#attr
test4..
test5..#attr
"test6..".#attr
"test7.#attr".#attr
test8(icl>uw)
test9(icl>uw).#attr
"test10..().#"(icl>uw).#attr
test11(icl>uw(agt>uw2,obj>uw3),icl>uw4(agt>uw5))
test12(icl>uw1(agt>uw2,obj>uw3),icl>uw4).#attr1.#attr2
test13(agt>thing,obj>role>effect)
So the "a" parser parses the string until open parentheses or .#attr part. "b" parser parses the characters inside optional parentheses. "c" parses the optional .#attrs.
Currently I get similar error on all test strings containing parentheses part:
11:07:44.662 [main] DEBUG - Parsed: test8()
11:07:44.667 [main] ERROR - FAILURE parsing: test8(icl>uw) -- `)' expected but `i' found
So I assume that the parser parsed the first part correctly, but failed when it saw the parentheses part.
The right solution to parse nested structures is to use recursion, for example in the following fashion:
import scala.annotation.tailrec

val parser = "regex".r // placeholder; the real pattern needs exactly one capturing group

@tailrec
def extract(string: String, foundTokens: List[String] = List.empty): List[String] = {
  parser.findFirstMatchIn(string) match {
    case Some(parser(matchedValue)) => extract(matchedValue, matchedValue :: foundTokens)
    case None => foundTokens
  }
}
At each call, you prepend the matched token to the list of results and run the function again on the matched value. When nothing more matches, you return the accumulated tokens.
If multiple matches are possible inside each subtoken, then you should look for a procedure like this one:
def extract(string:String):Iterator[String]={
parser.findAllIn(string).flatMap{
item => extract(item)
}
}
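As an illustration of why recursion fits nested parentheses, here is a minimal hand-written sketch in Python. It is not the regex-based approach above: it scans characters directly, and it deliberately ignores quoted headwords and .# attributes. The function name parse_node and the (label, children) result shape are assumptions made for the example.

```python
def parse_node(text, pos=0):
    """Parse `label` optionally followed by `(child,child,...)`.

    Returns ((label, children), next_pos).
    """
    start = pos
    while pos < len(text) and text[pos] not in "(),":
        pos += 1
    label = text[start:pos]
    children = []
    if pos < len(text) and text[pos] == "(":
        pos += 1  # consume '('
        while True:
            # The recursive call handles any depth of nesting.
            child, pos = parse_node(text, pos)
            children.append(child)
            if pos < len(text) and text[pos] == ",":
                pos += 1  # consume ',' and parse the next sibling
                continue
            break
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"')' expected at offset {pos}")
        pos += 1  # consume the ')' that closes this group
    return (label, children), pos

source = "test11(icl>uw(agt>uw2,obj>uw3),icl>uw4(agt>uw5))"
tree, end = parse_node(source)
print(tree)
```

Each closing parenthesis is consumed by the call that consumed its opening one, which is exactly what a single greedy regex such as .* cannot guarantee.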
I'm trying to write a parser for a certain language as part of my research. Currently I have problems getting the following code to work in a way I want:
private def _uw: Parser[UW] = _headword ~ _modifiers ~ _attributes ^^ {
case hw ~ mods ~ attrs => new UW(hw, mods, attrs)
}
private def _headword[String] = "\".*\"".r | "[^(),]*".r
private def _modifiers: Parser[List[UWModifier]] = opt("(" ~> repsep(_modifier, ",") <~ ")") ^^ {
case Some(mods) => mods
case None => List[UWModifier]()
}
private def _modifier: Parser[UWModifier] = ("[^><]*".r ^^ (RelTypes.toRelType(_))) ~ "[><]".r ~ _uw ^^ {
case (rel: RelType) ~ x ~ (uw: UW) => new UWModifier(rel, uw)
}
private def _attributes: Parser[List[UWAttribute]] = rep(_attribute) ^^ {
case Nil => List[UWAttribute]()
case attrs => attrs
}
private def _attribute: Parser[UWAttribute] = ".#" ~> "[^>.]*".r ^^ (new UWAttribute(_))
The above code contains just one part of the language, and to spare time and space, I won't go much into details about the whole language. _uw method is supposed to parse a string that consists of three parts, although just the first part must exist in the string.
_uw should be able to parse these test strings correctly:
test0
test1.#attr
"test2"
"test3".#attr
test4..
test5..#attr
"test6..".#attr
"test7.#attr".#attr
test8(urel>uw)
test9(urel>uw).#attr
"test10..().#"(urel>uw).#attr
test11(urel1>uw1(urel2>uw2,urel3>uw3),urel4>uw4).#attr1.#attr2
So if the headword starts and ends with ", everything inside the double quotes is considered to be part of the headword. All words starting with .#, if they are not inside the double quotes, are attributes of the headword.
E.g. in test5, the parser should parse test5. as headword, and attr as an attribute. Just .# is omitted, and all dots before that should be contained in the headword.
So, after headword there CAN be attributes and/or modifiers. The order is strict, so attributes always come after modifiers. If there are attributes but no modifiers, everything until .# is considered as part of the headword.
The main problem is "[^#(]*".r. I've tried all kind of creative alternatives, such as "(^[\\w\\.]*)((\\.\\#)|$)".r, but nothing seems to work. How does lookahead or lookbehind even affect parser combinators? I'm not an expert on parsing or regex, so all help is welcome!
I don't think "[^#(]*".r has anything to do with your problem. I see this:
private def _headword[String] = "\".*\"".r | "[^(),]*".r
which is the first thing in _uw (and, by the way, using underscores in names in Scala is not recommended), so when it tries to parse test5..#attr, the second regexp will match all of it!
scala> "[^(),]*".r findFirstIn "test5..#attr"
res0: Option[String] = Some(test5..#attr)
So there will be nothing left for the remaining parsers. The first regex in _headword is also problematic, because .* will accept quotes, which means that something like this becomes valid:
"test6 with a " inside of it..".#attr
As for look-ahead and look-behind, they don't affect parser combinators at all. Either the regex matches, or it doesn't -- that's all the parser combinators care about.
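A look-ahead only changes how much of the input the regex itself matches. The sketch below uses Python's re.match, on the assumption that it is comparable to the prefix matching a regex parser performs; the guarded pattern is a hypothetical alternative for illustration, not something taken from the question.

```python
import re

text = "test5..#attr"

# The original headword pattern swallows the attribute as well.
plain = re.match(r"[^(),]*", text).group()

# With a negative look-ahead, matching stops right before the first ".#".
guarded = re.match(r"(?:(?!\.#)[^(),])*", text).group()

print(repr(plain))    # 'test5..#attr'
print(repr(guarded))  # 'test5.'
```

From the combinator's point of view nothing special happened: the second regex simply matched a shorter prefix, leaving .#attr for the parsers that follow.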
Sorry if it's a novice question - I want to parse something defined by
Exp ::= Mandatory_Part Optional_Part0 Optional_Part1
I thought I could do this:
proc::Parser String
proc = do {
;str<-parserMandatoryPart
;str0<-optional(parserOptionalPart0) --(1)
;str1<-optional(parserOptionalPart1) --(2)
;return str++str0++str1
}
I want to get str0/str1 if optional parts are present, otherwise, str0/str1 would be "".
But (1) and (2) won't work since optional() doesn't allow extracting result from its parameters, in this case, parserOptionalPart0/parserOptionalPart1.
Now, what would be the proper way to do it?
Many thanks!
Billy R
The function you're looking for is option (or optionMaybe, if you would rather get a Maybe back). optionMaybe returns Nothing if the parser fails without consuming input, and returns the result wrapped in Just if it succeeds.
From the docs:
option x p tries to apply parser p. If p fails without consuming input, it returns the value x, otherwise the value returned by p.
So you could do:
proc :: Parser String
proc = do
str <- parserMandatoryPart
str0 <- option "" parserOptionalPart0
str1 <- option "" parserOptionalPart1
return (str++str0++str1)
Watch out for the "without consuming input" part. You may need to wrap either or both optional parsers with try.
I've also adjusted your code style to be more standard, and fixed an error on the last line. return isn't a keyword; it's an ordinary function. So return a ++ b is (return a) ++ b, i.e. almost never what you want.
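The "without consuming input" caveat can be modelled with a small Python sketch. It is not Parsec: string, option and attempt are hypothetical stand-ins for string, option and try, and the sketch assumes that a literal parser consumes characters one by one, so a partial match counts as consumed input.

```python
# A result is (ok, value, offset). On failure, an offset beyond the
# starting position means input was consumed before failing.

def string(s):
    def run(text, pos):
        i = 0
        while i < len(s) and pos + i < len(text) and text[pos + i] == s[i]:
            i += 1
        if i == len(s):
            return True, s, pos + i
        return False, None, pos + i
    return run

def option(default, parser):
    def run(text, pos):
        ok, value, new_pos = parser(text, pos)
        if ok:
            return True, value, new_pos
        if new_pos == pos:
            return True, default, pos  # failed without consuming: use the default
        return False, None, new_pos    # failed after consuming: the failure propagates
    return run

def attempt(parser):
    # Stand-in for try: on failure, pretend nothing was consumed.
    def run(text, pos):
        ok, value, new_pos = parser(text, pos)
        return (ok, value, new_pos) if ok else (False, None, pos)
    return run

optional_part = option("", string("(x)"))
print(optional_part("abc", 0))  # (True, '', 0)
print(optional_part("(y)", 0))  # (False, None, 1)
print(option("", attempt(string("(x)")))("(y)", 0))  # (True, '', 0)
```

The second call fails because the opening parenthesis was already consumed when the mismatch was found; wrapping the parser in attempt restores the position so the default applies.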