Scala parser failure handling, dangling commas - parsing

Getting started with Scala parser combinations, before moving on need to grasp failure/error handling better (note: still getting into Scala as well)
Want to parse strings like "a = b, c = d" into a list of tuples but flag the user when dangling commas are found.
Thought about matching off failure ("a = b, ") when matching comma separated property assignments:
def commaList[T](inner: Parser[T]): Parser[List[T]] =
rep1sep(inner, ",") | rep1sep(inner, ",") ~> opt(",") ~> failure("Dangling comma")
def propertyAssignment: Parser[(String, String)] = ident ~ "=" ~ ident ^^ {
case id ~ "=" ~ prop => (id, prop)
}
And call the parser with:
p.parseAll(p.commaList(p.propertyAssignment), "name = John , ")
which results in a Failure, no surprise but with:
string matching regex `\p{javaJavaIdentifierStart}\p{javaJavaIdentifierPart}*' expected but end of source found
The commList function succeeds on the first property assignment and starts repeating given the comma but the next "ident" fails on the fact that the next character is the end of the source data. Thought I could catch that 2nd alternative in the commList would match:
rep1sep(inner, ",") ~> opt(",") ~> failure("Dangling comma")
Nix. Ideas?

Scalaz to the rescue :-)
When you are working with warnings, it is not a good idea to exit your parser with a failure. You can easily combine the parser with the Scalaz writer monad. With this monads you can add messages to the partial result during the parser run. These messages could be infos, warnings or errors. After the parser finishes, you can then validate the result, if it can be used or if it contains critical problems. With such a separate vaildator step you get usual much better error messages. For example you could accept arbitrary characters at the end of the string, but issue an error when they are found (e.g. "Garbage found after last statement"). The error message can be much more helpful for the user than the cryptic default one you get in the example below ("string matching regex `\z' expected [...]").
Here is an example based on the code in your question:
scala> :paste
// Entering paste mode (ctrl-D to finish)
import util.parsing.combinator.RegexParsers
import scalaz._, Scalaz._
object DemoParser extends RegexParsers {
type Warning = String
case class Equation(left : String, right : String)
type PWriter = Writer[Vector[Warning], List[Equation]]
val emptyList : List[Equation] = Nil
def rep1sep2[T](p : => Parser[T], q : => Parser[Any]): Parser[List[T]] =
p ~ rep(q ~> p) ^^ {case x~y => x::y}
def name : Parser[String] = """\w+""".r
def equation : Parser[Equation] = name ~ "=" ~ name ^^ { case n ~ _ ~ v => Equation(n,v) }
def commaList : Parser[PWriter] = rep1sep(equation, ",") ^^ (_.set(Vector()))
def danglingComma : Parser[PWriter] = opt(",") ^^ (
_ map (_ => emptyList.set(Vector("Warning: Dangling comma")))
getOrElse(emptyList.set(Vector("getOrElse(emptyList.set(Vector(""))))
def danglingList : Parser[PWriter] = commaList ~ danglingComma ^^ {
case l1 ~ l2 => (l1.over ++ l2.over).set(l1.written ++ l2.written) }
def apply(input: String): PWriter = parseAll(danglingList, input) match {
case Success(result, _) => result
case failure : NoSuccess => emptyList.set(Vector(failure.msg))
}
}
// Exiting paste mode, now interpreting.
import util.parsing.combinator.RegexParsers
import scalaz._
import Scalaz._
defined module DemoParser
scala> DemoParser("a=1, b=2")
res2: DemoParser.PWriter = (Vector(),List(Equation(a,1), Equation(b,2)))
scala> DemoParser("a=1, b=2,")
res3: DemoParser.PWriter = (Vector(Warning: Dangling comma),List(Equation(a,1), Equation(b,2)))
scala> DemoParser("a=1, b=2, ")
res4: DemoParser.PWriter = (Vector(Warning: Dangling comma),List(Equation(a,1), Equation(b,2)))
scala> DemoParser("a=1, b=2, ;")
res5: DemoParser.PWriter = (Vector(string matching regex `\z' expected but `;' found),List())
scala>
As you can see, it handles the error cases fine. If you want to extend the example, add case classes for different kinds of errors and include the current parser positions in the messages.
Btw. the problem with the white spaces is handled by the RegexParsers class. If you want to change the handling of white spaces, just override the field whiteSpace.

Your parser isn't expecting the trailing whitespace at the end of "name = John , ".
You could use a regex to optionally parse "," followed by any amount of whitespace:
def commaList[T](inner: Parser[T]): Parser[List[T]] =
rep1sep(inner, ",") <~ opt(",\\s*".r ~> failure("Dangling comma"))
Note that you can avoid using alternatives (|) here, by making the failure part of the optional parser. If the optional part consumes some input and then fails, then the whole parser fails.

Related

Using regex in StandardTokenParsers

I'm trying to use regex in my StandardTokenParsers based parser. For that, I've subclassed StdLexical as follows:
class CustomLexical extends StdLexical{
def regex(r: Regex): Parser[String] = new Parser[String] {
def apply(in:Input) = r.findPrefixMatchOf(in.source.subSequence(in.offset, in.source.length)) match {
case Some(matched) => Success(in.source.subSequence(in.offset, in.offset + matched.end).toString,
in.drop(matched.end))
case None => Failure("string matching regex `" + r + "' expected but " + in.first + " found", in)
}
}
override def token: Parser[Token] =
( regex("[a-zA-Z]:\\\\[\\w\\\\?]* | /[\\w/]*".r) ^^ { StringLit(_) }
| identChar ~ rep( identChar | digit ) ^^ { case first ~ rest => processIdent(first :: rest mkString "") }
| ...
But I'm a little confused on how I would define a Parser that takes advantage of this. I have a parser defined as:
def mTargetFolder: Parser[String] = "TargetFolder" ~> "=" ~> mFilePath
which should be used to identify valid file paths. I tried then:
def mFilePath: Parser[String] = "[a-zA-Z]:\\\\[\\w\\\\?]* | /[\\w/]*".r
But this is obviously not right. I get an error:
scala: type mismatch;
found : scala.util.matching.Regex
required: McfpDSL.this.Parser[String]
def mFilePath: Parser[String] = "[a-zA-Z]:\\\\[\\w\\\\?]* | /[\\w/]*".r
^
What is the proper way of using the extension made on my StdLexical subclass?
If you really want to use token based parsing, and reuse StdLexical, I would advise to update the syntax for "TargetFolder" so that the value after the equal sign is a proper string literal. Or in other words, make it so the path should be enclosed with quotes. From that point you don't need to extends StdLexical anymore.
Then comes the problem of converting a regexp to a parser. Scala already has RegexParsers for this (which implicitly converts a regexp to a Parser[String]), but unfortunately that's not what you want here because it works on streams of Char (type Elem = Char in RegexParsers) while you are working on a sttream of tokens.
So we will indeed have to define our own conversion from Regex to Parser[String] (but at the syntactic level rather than lexical level, or in other words in the token parser).
import scala.util.parsing.combinator.syntactical._
import scala.util.matching.Regex
import scala.util.parsing.input._
object MyParser extends StandardTokenParsers {
import lexical.StringLit
def regexStringLit(r: Regex): Parser[String] = acceptMatch(
"string literal matching regex " + r,
{ case StringLit( s ) if r.unapplySeq(s).isDefined => s }
)
lexical.delimiters += "="
lexical.reserved += "TargetFolder"
lazy val mTargetFolder: Parser[String] = "TargetFolder" ~> "=" ~> mFilePath
lazy val mFilePath: Parser[String] = regexStringLit("([a-zA-Z]:\\\\[\\w\\\\?]*)|(/[\\w/]*)".r)
def parseTargetFolder( s: String ) = { mTargetFolder( new lexical.Scanner( s ) ) }
}
Example:
scala> MyParser.parseTargetFolder("""TargetFolder = "c:\Dir1\Dir2" """)
res12: MyParser.ParseResult[String] = [1.31] parsed: c:\Dir1\Dir2
scala> MyParser.parseTargetFolder("""TargetFolder = "/Dir1/Dir2" """)
res13: MyParser.ParseResult[String] = [1.29] parsed: /Dir1/Dir2
scala> MyParser.parseTargetFolder("""TargetFolder = "Hello world" """)
res14: MyParser.ParseResult[String] =
[1.16] failure: identifier matching regex ([a-zA-Z]:\\[\w\\?]*)|(/[\w/]*) expected
TargetFolder = "Hello world"
^
Note that also fixed your "target folder" regexp here, you had missing parens around the two alternative, plus unneeded spaces.
Just call your function regex when you want to get a Parser[String] from a Regex:
def p: Parser[String] = regex("".r)
Or make regex implicit to let the compiler call it automatically for you:
implicit def regex(r: Regex): Parser[String] = ...
// =>
def p: Parser[String] = "".r

How to parse this structure: "name[arg,arg]" with scala combinator parsers?

I have several strings like these:
name[arg,arg,arg]
name[arg,arg]
name[arg]
name
I wanted to parse it with scala combinator parsers, and this is the best that I managed to get:
object TaskDepParser extends JavaTokenParsers {
def name: Parser[String] = "[^\\[\\],]+".r
def expr: Parser[(String, Option[List[String]])] =
name ^^ { a => (a, None) } |
name ~ "[" ~ repsep(name, ",") ~ "]" ^^ { case name~_~args~_ => (name, Some(args)) }
}
It works on name, but fails to work on name[arg] - says string matching regex\z' expected but [' found. Is it possible to fix it?
#TonyK has already given the answer in his comment. But I wanna suggest that Scala parser combinators can already parse optional values:
object TaskDepParser extends JavaTokenParsers {
def name: Parser[String] = """[^\[\],]+""".r
def expr: Parser[(String, Option[List[String]])] =
name ~ opt("[" ~> repsep(name, ",") <~ "]") ^^ { case name ~ args => (name, args) }
}
With ~> and <~ it is possible to keep only left or right result to avoid unnecessary patter matching in ^^. Furthermore I would use triple quotes for strings to avoid lots of escaping.
I think it might work if you flip it around...Name is getting sucked up by the first rule, and then you get a failure on input.

How to parse embeded keywords without white-space in scala

I'm trying to split input by some keywords without delimiter like white-space.
object MyParser extends JavaTokenParsers {
def expr = (text | keyword).+
def text = ".+".r ^^ ("'"+_+"'")
def keyword = "ID".r ^^ ("["+_+"]")
}
val p = MyParser
p.parse(p.expr, "fooIDbar") match {
case p.Success(r, _) => r foreach print
case x => println(x.toString)
}
This outputs as below.
>> 'hogeIDfuga'
But I really want to do is like this.
>> 'hoge'[ID]'fuga'
It seems text engulfs all the characters. I tried to express [text does't contain keyword], but I could't. How to express it in regex or scala parser? or any other solutions?
I have seen some posts 1 2, but they don't work in my case.
First, keyword is a constant word so you don't need a regex, a plain string is enough.
Second, a text is some string that doesn't contain a keyword, not any string. Try this:
import util.parsing.combinator._
object MyParser extends JavaTokenParsers {
def expr = (text | keyword).+
def text = """((?!ID).)+""".r ^^ ("'"+_+"'")
def keyword = "ID" ^^ ("["+_+"]")
}
val p = MyParser
p.parse(p.expr, "fooIDbar") match {
case p.Success(r, _) => r foreach print
case x => println(x.toString)
}
As for the trick of writing a regex that not matching something, read this.

Scala Parser - Message Length

I'm toying with Scala's Parser library. I am trying to write a parser for a format where a length is specified followed by a message of that length. For example:
x.parseAll(x.message, "5helloworld") // result: "hello", remaining: "world"
I'm not sure how to do this using combinators. My mind first goes to:
def message = length ~ body
But obviously body depends on length, and I don't know how to do that :p
Instead you could just define a message Parser as a single Parser (not combination of Parsers) and I think that is doable (although I haven't looked if a single Parser can pull several elem?).
Anyways, I'm a scala noob, I just find this awesome :)
You should use into for that, or its abbreviation, >>:
scala> object T extends RegexParsers {
| def length: Parser[String] = """\d+""".r
| def message: Parser[String] = length >> { length => """\w{%d}""".format(length.toInt).r }
| }
defined module T
scala> T.parseAll(T.message, "5helloworld")
res0: T.ParseResult[String] =
[1.7] failure: string matching regex `\z' expected but `w' found
5helloworld
^
scala> T.parse(T.message, "5helloworld")
res1: T.ParseResult[String] = [1.7] parsed: hello
Be careful with precedence when using it. If you add an "~ remainder" after the function above, for instance, Scala will interpret it as length >> ({ length => ...} ~ remainder) instead of (length >> { length => ...}) ~ remainder.
This does not sound like a context free language, so you will need to use flatMap :
def message = length.flatMap(l => bodyOfLength(n))
where length is of type Parser[Int] and bodyOfLength(n) would be based on repN, such as
def bodyWithLength(n: Int) : Parser[String]
= repN(n, elem("any", _ => true)) ^^ {_.mkString}
I wouldn´t use pasrer combinators for this purpose. But if you have to or the problem becomes more complex you could try this:
def times(x :Long,what:String) : Parser[Any] = x match {
case 1 => what;
case x => what~times(x-1,what);
}
Don´t use parseAll if you want something remained, use parse.
You could parse length, store the result in a mutable field x(I know ugly, but useful here) and parse body x times, then you get the String parsed and the rest remains in the parser.

Having some simple problems with Scala combinator parsers

First, the code:
package com.digitaldoodles.markup
import scala.util.parsing.combinator.{Parsers, RegexParsers}
import com.digitaldoodles.rex._
class MarkupParser extends RegexParsers {
val stopTokens = (Lit("{{") | "}}" | ";;" | ",,").lookahead
val name: Parser[String] = """[##!$]?[a-zA-Z][a-zA-Z0-9]*""".r
val content: Parser[String] = (patterns.CharAny ** 0 & stopTokens).regex
val function: Parser[Any] = name ~ repsep(content, "::") <~ ";;"
val block1: Parser[Any] = "{{" ~> function
val block2: Parser[Any] = "{{" ~> function <~ "}}"
val lst: Parser[Any] = repsep("[a-z]", ",")
}
object ParseExpr extends MarkupParser {
def main(args: Array[String]) {
println("Content regex is ", (patterns.CharAny ** 0 & stopTokens).regex)
println(parseAll(block1, "{{#name 3:4:foo;;"))
println(parseAll(block2, "{{#name 3:4:foo;; stuff}}"))
println(parseAll(lst, "a,b,c"))
}
}
then, the run results:
[info] == run ==
[info] Running com.digitaldoodles.markup.ParseExpr
(Content regex is ,(?:[\s\S]{0,})(?=(?:(?:\{\{|\}\})|;;)|\,\,))
[1.18] parsed: (#name~List(3:4:foo))
[1.24] failure: `;;' expected but `}' found
{{#name 3:4:foo;; stuff}}
^
[1.1] failure: string matching regex `\z' expected but `a' found
a,b,c
^
I use a custom library to assemble some of my regexes, so I've printed out the "content" regex; its supposed to be basically any text up to but not including certain token patterns, enforced using a positive lookahead assertion.
Finally, the problems:
1) The first run on "block1" succeeds, but shouldn't, because the separator in the "repsep" function is "::", yet ":" are parsed as separators.
2) The run on "block2" fails, presumably because the lookahead clause isn't working--but I can't figure out why this should be. The lookahead clause was already exercised in the "repsep" on the run on "block1" and seemed to work there, so why should it fail on block 2?
3) The simple repsep exercise on "lst" fails because internally, the parser engine seems to be looking for a boundary--is this something I need to work around somehow?
Thanks,
Ken
1) No, "::" are not parsed as separators. If it did, the output would be (#name~List(3, 4, foo)).
2) It happens because "}}" is also a delimiter, so it takes the longest match it can -- the one that includes ";;" as well. If you make the preceding expression non-eager, it will then fail at "s" on "stuff", which I presume is what you expected.
3) You passed a literal, not a regex. Modify "[a-z]" to "[a-z]".r and it will work.

Resources