sbt Parser fails with OutOfMemoryError - parsing

I have an sbt plugin that contains a Parser similar to this:
package sbtpin
import sbt.complete._
import DefaultParsers._
object InputParser {
private lazy val dotParser: Parser[Char] = '.'
private lazy val objectId = identifier(Letter, IDChar | dotParser)
private lazy val addCommand1 = "add" ~> Space.+ ~> objectId ~ (Space.+ ~> NotSpace.+).? map(p => AddCommand1(p._1, p._2))
private lazy val addCommand2 = "add -n" ~> Space.+ ~> objectId ~ (Space.+ ~> NotSpace.+).? map(p => AddCommand1(p._1, p._2))
private lazy val addCommand3 = "add -l" ~> Space.+ ~> objectId ~ (Space.+ ~> NotSpace.+).? map(p => AddCommand1(p._1, p._2))
lazy val parser: Parser[Command] = Space.* ~> (addCommand1 | addCommand2 | addCommand3)
}
When trying to run tests with this parser, it fails with "java.lang.OutOfMemoryError: Java heap space"
at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:46)
at scala.collection.mutable.StringBuilder.<init>(StringBuilder.scala:51)
at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:290)
at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
at sbt.complete.ParserMain$$anon$3$$anonfun$string$1.apply(Parser.scala:313)
at sbt.complete.ParserMain$$anon$3$$anonfun$string$1.apply(Parser.scala:313)
at sbt.complete.Parser$Value.map(Parser.scala:161)
at sbt.complete.MapParser.resultEmpty$lzycompute(Parser.scala:704)
at sbt.complete.MapParser.resultEmpty(Parser.scala:704)
at sbt.complete.Repeat.derive(Parser.scala:834)
at sbt.complete.HomParser.derive(Parser.scala:632)
at sbt.complete.HomParser.derive(Parser.scala:632)
at sbt.complete.HomParser.derive(Parser.scala:632)
... (many more identical HomParser.derive frames) ...
Compiling also takes a long time, which is unexpected.

I realized that I was using Space.+, Space.* etc., when Space, OptSpace etc. would be sufficient. This is because Space and OptSpace (which are defined in DefaultParsers) already match multiple characters.
I changed the code to the following, which works fine:
package sbtpin
import sbt.complete._
import DefaultParsers._
object InputParser {
private lazy val dotParser: Parser[Char] = '.'
private lazy val objectId = identifier(Letter, IDChar | dotParser)
private lazy val addCommand1 = "add" ~> Space ~> objectId ~ (Space ~> NotSpace).? map(p => AddCommand1(p._1, p._2))
private lazy val addCommand2 = "add -n" ~> Space ~> objectId ~ (Space ~> NotSpace).? map(p => AddCommand1(p._1, p._2))
private lazy val addCommand3 = "add -l" ~> Space ~> objectId ~ (Space ~> NotSpace).? map(p => AddCommand1(p._1, p._2))
lazy val parser: Parser[Command] = OptSpace ~> (addCommand1 | addCommand2 | addCommand3)
}
This is how Space, OptSpace, etc. are defined in sbt:
/** Matches a single character that is not a whitespace character. */
lazy val NotSpaceClass = charClass(!_.isWhitespace, "non-whitespace character")
/** Matches a single whitespace character, as determined by Char.isWhitespace.*/
lazy val SpaceClass = charClass(_.isWhitespace, "whitespace character")
/** Matches a non-empty String consisting of non-whitespace characters. */
lazy val NotSpace = NotSpaceClass.+.string
/** Matches a possibly empty String consisting of non-whitespace characters. */
lazy val OptNotSpace = NotSpaceClass.*.string
/** Matches a non-empty String consisting of whitespace characters.
* The suggested tab completion is a single, constant space character.*/
lazy val Space = SpaceClass.+.examples(" ")
/** Matches a possibly empty String consisting of whitespace characters.
* The suggested tab completion is a single, constant space character.*/
lazy val OptSpace = SpaceClass.*.examples(" ")
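These definitions show why the original parser blew up: Space is already SpaceClass.+, so Space.+ is a repetition of a repetition. A run of n whitespace characters can then be split into non-empty chunks in 2^(n-1) different ways, and sbt's derivative-based parser can end up tracking all of those ambiguous parses. A self-contained sketch (plain Scala, no sbt dependency) counts the ambiguity:

```scala
// Number of ways a repetition-of-repetitions like Space.+ can split a run
// of n whitespace characters into non-empty chunks: pick the length k of
// the first chunk, then split the remaining n - k characters recursively.
def splits(n: Int): Long =
  if (n == 0) 1L else (1 to n).map(k => splits(n - k)).sum

assert(splits(1) == 1)
assert(splits(4) == 8)        // 2^(4-1) ambiguous parses of four spaces
assert(splits(20) == 524288)  // 2^19: the ambiguity doubles per character
```

With plain Space there is exactly one way to consume a whitespace run, which is why the rewritten parser behaves.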

Related

Scala parser failure handling, dangling commas

I'm getting started with Scala parser combinators; before moving on I need to grasp failure/error handling better (note: I'm still getting into Scala as well).
I want to parse strings like "a = b, c = d" into a list of tuples, but flag the user when a dangling comma is found.
I thought about matching on the failure ("a = b, ") when matching comma-separated property assignments:
def commaList[T](inner: Parser[T]): Parser[List[T]] =
rep1sep(inner, ",") | rep1sep(inner, ",") ~> opt(",") ~> failure("Dangling comma")
def propertyAssignment: Parser[(String, String)] = ident ~ "=" ~ ident ^^ {
case id ~ "=" ~ prop => (id, prop)
}
And call the parser with:
p.parseAll(p.commaList(p.propertyAssignment), "name = John , ")
which results in a Failure, no surprise but with:
string matching regex `\p{javaJavaIdentifierStart}\p{javaJavaIdentifierPart}*' expected but end of source found
The commaList function succeeds on the first property assignment and starts repeating given the comma, but the next "ident" fails because the next character is the end of the source data. I thought the second alternative in commaList would catch that and match:
rep1sep(inner, ",") ~> opt(",") ~> failure("Dangling comma")
Nix. Ideas?
Scalaz to the rescue :-)
When you are working with warnings, it is not a good idea to exit your parser with a failure. You can easily combine the parser with the Scalaz Writer monad. With this monad you can add messages to the partial result during the parser run. These messages could be infos, warnings or errors. After the parser finishes, you can validate the result to see whether it can be used or whether it contains critical problems. With such a separate validator step you usually get much better error messages. For example, you could accept arbitrary characters at the end of the string, but issue an error when they are found (e.g. "Garbage found after last statement"). That error message can be much more helpful for the user than the cryptic default one you get in the example below ("string matching regex `\z' expected [...]").
Here is an example based on the code in your question:
scala> :paste
// Entering paste mode (ctrl-D to finish)
import util.parsing.combinator.RegexParsers
import scalaz._, Scalaz._
object DemoParser extends RegexParsers {
type Warning = String
case class Equation(left : String, right : String)
type PWriter = Writer[Vector[Warning], List[Equation]]
val emptyList : List[Equation] = Nil
def rep1sep2[T](p : => Parser[T], q : => Parser[Any]): Parser[List[T]] =
p ~ rep(q ~> p) ^^ {case x~y => x::y}
def name : Parser[String] = """\w+""".r
def equation : Parser[Equation] = name ~ "=" ~ name ^^ { case n ~ _ ~ v => Equation(n,v) }
def commaList : Parser[PWriter] = rep1sep(equation, ",") ^^ (_.set(Vector()))
def danglingComma : Parser[PWriter] = opt(",") ^^ (
_ map (_ => emptyList.set(Vector("Warning: Dangling comma")))
getOrElse(emptyList.set(Vector())))
def danglingList : Parser[PWriter] = commaList ~ danglingComma ^^ {
case l1 ~ l2 => (l1.over ++ l2.over).set(l1.written ++ l2.written) }
def apply(input: String): PWriter = parseAll(danglingList, input) match {
case Success(result, _) => result
case failure : NoSuccess => emptyList.set(Vector(failure.msg))
}
}
// Exiting paste mode, now interpreting.
import util.parsing.combinator.RegexParsers
import scalaz._
import Scalaz._
defined module DemoParser
scala> DemoParser("a=1, b=2")
res2: DemoParser.PWriter = (Vector(),List(Equation(a,1), Equation(b,2)))
scala> DemoParser("a=1, b=2,")
res3: DemoParser.PWriter = (Vector(Warning: Dangling comma),List(Equation(a,1), Equation(b,2)))
scala> DemoParser("a=1, b=2, ")
res4: DemoParser.PWriter = (Vector(Warning: Dangling comma),List(Equation(a,1), Equation(b,2)))
scala> DemoParser("a=1, b=2, ;")
res5: DemoParser.PWriter = (Vector(string matching regex `\z' expected but `;' found),List())
scala>
As you can see, it handles the error cases fine. If you want to extend the example, add case classes for the different kinds of errors and include the current parser positions in the messages.
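If you would rather not pull in scalaz, the same Writer idea can be sketched in a few lines of plain Scala. Logged below is a hypothetical stand-in for Writer[Vector[Warning], A], not part of the answer's code:

```scala
// Minimal stand-in for the Writer monad: a value plus accumulated warnings.
case class Logged[A](value: A, warnings: Vector[String]) {
  def map[B](f: A => B): Logged[B] = Logged(f(value), warnings)
  def flatMap[B](f: A => Logged[B]): Logged[B] = {
    val next = f(value)
    Logged(next.value, warnings ++ next.warnings)
  }
}

// Parsing "a=1, b=2," would then produce the equations *and* a warning,
// instead of failing outright.
val result = for {
  eqs <- Logged(List("a" -> "1", "b" -> "2"), Vector.empty)
  _   <- Logged((), Vector("Warning: Dangling comma"))
} yield eqs

assert(result.value == List("a" -> "1", "b" -> "2"))
assert(result.warnings == Vector("Warning: Dangling comma"))
```

The validation step the answer describes is then just a check on `result.warnings` after parsing.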
By the way, the handling of whitespace is done by the RegexParsers class. If you want to change it, just override the field whiteSpace.
Your parser isn't expecting the trailing whitespace at the end of "name = John , ".
You could use a regex to optionally parse "," followed by any amount of whitespace:
def commaList[T](inner: Parser[T]): Parser[List[T]] =
rep1sep(inner, ",") <~ opt(",\\s*".r ~> failure("Dangling comma"))
Note that you can avoid using alternatives (|) here, by making the failure part of the optional parser. If the optional part consumes some input and then fails, then the whole parser fails.

How to parse set of properties in unspecified order?

I have a Regex parser that processes a custom property file. In my file, I have the following structure:
...
[NodeA]
propA=val1
propB=val2
propC=val3
[NodeB]
...
I defined a parser that processes NodeA as follows:
lazy val parserA: Parser[String] = "propA" ~> "=" ~> mPropA
lazy val parserB: Parser[String] =
...
lazy val nodeA: Parser[NodeA] = "[" ~> "NodeA" ~> "]" ~> parserA ~> parserB ~> parserB ^^ {
case iPropA ~ iPropB ~ iPropC => new NodeA(iPropA, iPropB, iPropC)
}
This works fine as it stands. The problem is if NodeA comes with a different property order, in which case I get a parsing error. For example:
[NodeA]
propC=val3
propA=val1
propB=val2
Is there any way to define my parser such that it accepts an unspecified ordering of NodeA's properties?
I still have the feeling I'm not fully understanding your problem, but what about:
import scala.util.parsing.combinator.JavaTokenParsers
object Test extends App with JavaTokenParsers {
case class Prop(name: String, value: String)
case class Node(name: String, propA: Prop, propB: Prop, propC: Prop)
lazy val prop = (ident <~ "=") ~ ident ^^ {
case p ~ v => (p, v)
}
lazy val node = "[" ~> ident <~ "]"
lazy val props = repN(3, prop) ^^ {
_.sorted map Prop.tupled
}
lazy val nodes = rep(node ~ props) ^^ {
_ map { case node ~ List(a, b, c) => Node(node, a, b, c) }
}
val in =
"""[NodeA]
propA=val1
propB=val2
propC=val3
[NodeB]
propC=val3
propA=val1
propB=val2"""
println(parseAll(nodes, in))
}
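If changing the grammar feels heavy, the order-independence can also be handled after parsing: collect the key=value pairs into a Map and look each property up by name. NodeA and parseProps below are illustrative names, not part of the question's code:

```scala
// Collect properties into a Map so their textual order no longer matters.
case class NodeA(propA: String, propB: String, propC: String)

def parseProps(lines: Seq[String]): Map[String, String] =
  lines.map { line =>
    val Array(k, v) = line.split("=", 2)
    k.trim -> v.trim
  }.toMap

val props = parseProps(Seq("propC=val3", "propA=val1", "propB=val2"))
val node  = NodeA(props("propA"), props("propB"), props("propC"))
assert(node == NodeA("val1", "val2", "val3"))
```

A Map-based lookup also makes missing or duplicated properties easy to detect and report explicitly.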

Using regex in StandardTokenParsers

I'm trying to use regex in my StandardTokenParsers based parser. For that, I've subclassed StdLexical as follows:
class CustomLexical extends StdLexical{
def regex(r: Regex): Parser[String] = new Parser[String] {
def apply(in:Input) = r.findPrefixMatchOf(in.source.subSequence(in.offset, in.source.length)) match {
case Some(matched) => Success(in.source.subSequence(in.offset, in.offset + matched.end).toString,
in.drop(matched.end))
case None => Failure("string matching regex `" + r + "' expected but " + in.first + " found", in)
}
}
override def token: Parser[Token] =
( regex("[a-zA-Z]:\\\\[\\w\\\\?]* | /[\\w/]*".r) ^^ { StringLit(_) }
| identChar ~ rep( identChar | digit ) ^^ { case first ~ rest => processIdent(first :: rest mkString "") }
| ...
But I'm a little confused on how I would define a Parser that takes advantage of this. I have a parser defined as:
def mTargetFolder: Parser[String] = "TargetFolder" ~> "=" ~> mFilePath
which should be used to identify valid file paths. I tried then:
def mFilePath: Parser[String] = "[a-zA-Z]:\\\\[\\w\\\\?]* | /[\\w/]*".r
But this is obviously not right. I get an error:
scala: type mismatch;
found : scala.util.matching.Regex
required: McfpDSL.this.Parser[String]
def mFilePath: Parser[String] = "[a-zA-Z]:\\\\[\\w\\\\?]* | /[\\w/]*".r
^
What is the proper way of using the extension made on my StdLexical subclass?
If you really want to use token based parsing, and reuse StdLexical, I would advise to update the syntax for "TargetFolder" so that the value after the equal sign is a proper string literal. Or in other words, make it so the path should be enclosed with quotes. From that point you don't need to extends StdLexical anymore.
Then comes the problem of converting a regex to a parser. Scala already has RegexParsers for this (which implicitly converts a regex to a Parser[String]), but unfortunately that's not what you want here, because it works on streams of Char (type Elem = Char in RegexParsers) while you are working on a stream of tokens.
So we will indeed have to define our own conversion from Regex to Parser[String] (but at the syntactic level rather than lexical level, or in other words in the token parser).
import scala.util.parsing.combinator.syntactical._
import scala.util.matching.Regex
import scala.util.parsing.input._
object MyParser extends StandardTokenParsers {
import lexical.StringLit
def regexStringLit(r: Regex): Parser[String] = acceptMatch(
"string literal matching regex " + r,
{ case StringLit( s ) if r.unapplySeq(s).isDefined => s }
)
lexical.delimiters += "="
lexical.reserved += "TargetFolder"
lazy val mTargetFolder: Parser[String] = "TargetFolder" ~> "=" ~> mFilePath
lazy val mFilePath: Parser[String] = regexStringLit("([a-zA-Z]:\\\\[\\w\\\\?]*)|(/[\\w/]*)".r)
def parseTargetFolder( s: String ) = { mTargetFolder( new lexical.Scanner( s ) ) }
}
Example:
scala> MyParser.parseTargetFolder("""TargetFolder = "c:\Dir1\Dir2" """)
res12: MyParser.ParseResult[String] = [1.31] parsed: c:\Dir1\Dir2
scala> MyParser.parseTargetFolder("""TargetFolder = "/Dir1/Dir2" """)
res13: MyParser.ParseResult[String] = [1.29] parsed: /Dir1/Dir2
scala> MyParser.parseTargetFolder("""TargetFolder = "Hello world" """)
res14: MyParser.ParseResult[String] =
[1.16] failure: identifier matching regex ([a-zA-Z]:\\[\w\\?]*)|(/[\w/]*) expected
TargetFolder = "Hello world"
^
Note that I also fixed your "TargetFolder" regexp here: you were missing parens around the two alternatives, and it had unneeded spaces.
Just call your function regex when you want to get a Parser[String] from a Regex:
def p: Parser[String] = regex("".r)
Or make regex implicit to let the compiler call it automatically for you:
implicit def regex(r: Regex): Parser[String] = ...
// =>
def p: Parser[String] = "".r
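To see what the implicit buys you, here is a sketch with a stand-in Parser type (hypothetical, not the real StdLexical one): once regex is implicit, writing a Regex where a Parser[String] is expected makes the compiler insert the conversion for you.

```scala
import scala.language.implicitConversions
import scala.util.matching.Regex

// Stand-in for the real Parser type, just to show the conversion firing.
case class Parser[A](run: String => Option[A])

implicit def regex(r: Regex): Parser[String] =
  Parser(in => r.findPrefixOf(in))

// The compiler rewrites the right-hand side to regex("[a-zA-Z]+".r).
val p: Parser[String] = "[a-zA-Z]+".r
assert(p.run("abc123") == Some("abc"))
assert(p.run("123") == None)
```

The same mechanism is what RegexParsers uses internally for its Regex-to-Parser conversion.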

Scala: Parsing matching token

I'm playing around with a toy HTML parser, to help familiarize myself with Scala's parsing combinators library:
import scala.util.parsing.combinator._
sealed abstract class Node
case class TextNode(val contents : String) extends Node
case class Element(
val tag : String,
val attributes : Map[String,Option[String]],
val children : Seq[Node]
) extends Node
object HTML extends RegexParsers {
val node: Parser[Node] = text | element
val text: Parser[TextNode] = """[^<]+""".r ^^ TextNode
val label: Parser[String] = """(\w[:\w]*)""".r
val value : Parser[String] = """("[^"]*"|\w+)""".r
val attribute : Parser[(String,Option[String])] = label ~ (
"=" ~> value ^^ Some[String] | "" ^^ { case _ => None }
) ^^ { case (k ~ v) => k -> v }
val element: Parser[Element] = (
("<" ~> label ~ rep(whiteSpace ~> attribute) <~ ">" )
~ rep(node) ~
("</" ~> label <~ ">")
) ^^ {
case (tag ~ attributes ~ children ~ close) => Element(tag, Map(attributes : _*), children)
}
}
What I'm realizing I want is some way to make sure my opening and closing tags match.
I think to do that, I need some sort of flatMap combinator of type Parser[A] => (A => Parser[B]) => Parser[B],
so I can use the opening tag to construct the parser for the closing tag. But I don't see anything matching that signature in the library.
What's the proper way to do this?
You can write a method that takes a tag name and returns a parser for a closing tag with that name:
object HTML extends RegexParsers {
lazy val node: Parser[Node] = text | element
val text: Parser[TextNode] = """[^<]+""".r ^^ TextNode
val label: Parser[String] = """(\w[:\w]*)""".r
val value : Parser[String] = """("[^"]*"|\w+)""".r
val attribute : Parser[(String, Option[String])] = label ~ (
"=" ~> value ^^ Some[String] | "" ^^ { case _ => None }
) ^^ { case (k ~ v) => k -> v }
val openTag: Parser[String ~ Seq[(String, Option[String])]] =
"<" ~> label ~ rep(whiteSpace ~> attribute) <~ ">"
def closeTag(name: String): Parser[String] = "</" ~> name <~ ">"
val element: Parser[Element] = openTag.flatMap {
case (tag ~ attrs) =>
rep(node) <~ closeTag(tag) ^^
(children => Element(tag, attrs.toMap, children))
}
}
Note that you also need to make node lazy. Now you get a nice clean error message for unmatched tags:
scala> HTML.parse(HTML.element, "<a></b>")
res0: HTML.ParseResult[Element] =
[1.6] failure: `a' expected but `b' found
<a></b>
^
I've been a little more verbose than necessary for the sake of clarity. If you want concision you can skip the openTag and closeTag methods and write element like this, for example:
val element = "<" ~> label ~ rep(whiteSpace ~> attribute) <~ ">" >> {
case (tag ~ attrs) =>
rep(node) <~ ("</" ~> tag <~ ">") ^^
(children => Element(tag, attrs.toMap, children))
}
I'm sure more concise versions would be possible, but in my opinion even this is edging toward unreadability.
There is a flatMap on Parser, and also an equivalent method named into and an operator >>, which might be more convenient aliases (flatMap is still needed when used in for comprehensions). It is indeed a valid way to do what you're looking for.
Alternatively, you can check that the tags match with ^?.
You are looking in the wrong place. It's a normal mistake, though. You want a method Parser[A] => (A => Parser[B]) => Parser[B], but you looked at the docs of Parsers, not Parser.
Look at the docs of Parser instead: there's a flatMap, also known as into or >>.
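To make the flatMap dependency concrete without the full combinator library, here is a toy parser (a hypothetical sketch, not scala-parser-combinators) where the parsed open-tag name builds the parser for the matching close tag:

```scala
// Toy parser: consume a prefix of the input, return the result plus the rest.
case class Parser[A](run: String => Option[(A, String)]) {
  // flatMap lets the result of this parser choose the *next* parser.
  def flatMap[B](f: A => Parser[B]): Parser[B] =
    Parser(in => run(in).flatMap { case (a, rest) => f(a).run(rest) })
  def map[B](f: A => B): Parser[B] =
    flatMap(a => Parser(in => Some((f(a), in))))
}

def lit(s: String): Parser[String] =
  Parser(in => if (in.startsWith(s)) Some((s, in.drop(s.length))) else None)

def tagName: Parser[String] = Parser { in =>
  val n = in.takeWhile(_.isLetter)
  if (n.nonEmpty) Some((n, in.drop(n.length))) else None
}

// The open-tag result feeds into the close-tag parser via flatMap.
val matched: Parser[String] = for {
  _   <- lit("<")
  tag <- tagName
  _   <- lit(">")
  _   <- lit("</")
  _   <- lit(tag)   // must be the *same* tag that was opened
  _   <- lit(">")
} yield tag

assert(matched.run("<a></a>").isDefined)
assert(matched.run("<a></b>").isEmpty)
```

The real library's flatMap (alias into, >>) has exactly this shape, just over its own Input/ParseResult types.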

Having some simple problems with Scala combinator parsers

First, the code:
package com.digitaldoodles.markup
import scala.util.parsing.combinator.{Parsers, RegexParsers}
import com.digitaldoodles.rex._
class MarkupParser extends RegexParsers {
val stopTokens = (Lit("{{") | "}}" | ";;" | ",,").lookahead
val name: Parser[String] = """[##!$]?[a-zA-Z][a-zA-Z0-9]*""".r
val content: Parser[String] = (patterns.CharAny ** 0 & stopTokens).regex
val function: Parser[Any] = name ~ repsep(content, "::") <~ ";;"
val block1: Parser[Any] = "{{" ~> function
val block2: Parser[Any] = "{{" ~> function <~ "}}"
val lst: Parser[Any] = repsep("[a-z]", ",")
}
object ParseExpr extends MarkupParser {
def main(args: Array[String]) {
println("Content regex is ", (patterns.CharAny ** 0 & stopTokens).regex)
println(parseAll(block1, "{{#name 3:4:foo;;"))
println(parseAll(block2, "{{#name 3:4:foo;; stuff}}"))
println(parseAll(lst, "a,b,c"))
}
}
then, the run results:
[info] == run ==
[info] Running com.digitaldoodles.markup.ParseExpr
(Content regex is ,(?:[\s\S]{0,})(?=(?:(?:\{\{|\}\})|;;)|\,\,))
[1.18] parsed: (#name~List(3:4:foo))
[1.24] failure: `;;' expected but `}' found
{{#name 3:4:foo;; stuff}}
^
[1.1] failure: string matching regex `\z' expected but `a' found
a,b,c
^
I use a custom library to assemble some of my regexes, so I've printed out the "content" regex; it's supposed to match any text up to but not including certain token patterns, enforced using a positive lookahead assertion.
Finally, the problems:
1) The first run, on "block1", succeeds but shouldn't, because the separator in the "repsep" function is "::", yet ":" seems to be parsed as a separator.
2) The run on "block2" fails, presumably because the lookahead clause isn't working, but I can't figure out why that should be. The lookahead clause was already exercised by the "repsep" on the "block1" run and seemed to work there, so why should it fail on "block2"?
3) The simple repsep exercise on "lst" fails because internally, the parser engine seems to be looking for a boundary--is this something I need to work around somehow?
Thanks,
Ken
1) No, ":" is not being parsed as a separator. If it were, the output would be (#name~List(3, 4, foo)).
2) It happens because "}}" is also a delimiter, so it takes the longest match it can -- the one that includes ";;" as well. If you make the preceding expression non-eager, it will then fail at "s" on "stuff", which I presume is what you expected.
3) You passed a literal, not a regex. Modify "[a-z]" to "[a-z]".r and it will work.
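Point 3 is worth spelling out: inside RegexParsers, a bare String is implicitly lifted to a parser for that exact text, so "[a-z]" looks for the five characters [ a - z ] verbatim; only .r turns it into a regex. The difference can be illustrated with java.util.regex directly (no combinator library needed):

```scala
import java.util.regex.Pattern

// A literal parser corresponds to quoting the string;
// a regex parser interprets it as a pattern.
val literal = Pattern.quote("[a-z]")  // matches the text "[a-z]" itself
val pattern = "[a-z]"                 // matches any single lowercase letter

assert(!"a".matches(literal))
assert("a".matches(pattern))
assert("[a-z]".matches(literal))
```

So repsep("[a-z]".r, ",") parses "a,b,c", while repsep("[a-z]", ",") only parses "[a-z],[a-z],...".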
