Having some simple problems with Scala combinator parsers - parsing

First, the code:
package com.digitaldoodles.markup
import scala.util.parsing.combinator.{Parsers, RegexParsers}
import com.digitaldoodles.rex._
class MarkupParser extends RegexParsers {
val stopTokens = (Lit("{{") | "}}" | ";;" | ",,").lookahead
val name: Parser[String] = """[##!$]?[a-zA-Z][a-zA-Z0-9]*""".r
val content: Parser[String] = (patterns.CharAny ** 0 & stopTokens).regex
val function: Parser[Any] = name ~ repsep(content, "::") <~ ";;"
val block1: Parser[Any] = "{{" ~> function
val block2: Parser[Any] = "{{" ~> function <~ "}}"
val lst: Parser[Any] = repsep("[a-z]", ",")
}
object ParseExpr extends MarkupParser {
def main(args: Array[String]) {
println("Content regex is ", (patterns.CharAny ** 0 & stopTokens).regex)
println(parseAll(block1, "{{#name 3:4:foo;;"))
println(parseAll(block2, "{{#name 3:4:foo;; stuff}}"))
println(parseAll(lst, "a,b,c"))
}
}
then, the run results:
[info] == run ==
[info] Running com.digitaldoodles.markup.ParseExpr
(Content regex is ,(?:[\s\S]{0,})(?=(?:(?:\{\{|\}\})|;;)|\,\,))
[1.18] parsed: (#name~List(3:4:foo))
[1.24] failure: `;;' expected but `}' found
{{#name 3:4:foo;; stuff}}
^
[1.1] failure: string matching regex `\z' expected but `a' found
a,b,c
^
I use a custom library to assemble some of my regexes, so I've printed out the "content" regex; it's supposed to match basically any text up to but not including certain token patterns, enforced using a positive lookahead assertion.
Finally, the problems:
1) The first run on "block1" succeeds, but shouldn't, because the separator in the "repsep" function is "::", yet ":" are parsed as separators.
2) The run on "block2" fails, presumably because the lookahead clause isn't working--but I can't figure out why this should be. The lookahead clause was already exercised in the "repsep" on the run on "block1" and seemed to work there, so why should it fail on block 2?
3) The simple repsep exercise on "lst" fails because internally, the parser engine seems to be looking for a boundary--is this something I need to work around somehow?
Thanks,
Ken

1) No, ":" is not parsed as a separator. If it were, the output would be (#name~List(3, 4, foo)). The whole of "3:4:foo" is matched as a single "content" element.
2) It happens because "}}" is also a delimiter, so it takes the longest match it can -- the one that includes ";;" as well. If you make the preceding expression non-eager, it will then fail at "s" on "stuff", which I presume is what you expected.
3) You passed a literal, not a regex. Modify "[a-z]" to "[a-z]".r and it will work.
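To make points 2 and 3 concrete, here is a minimal sketch. It assumes the scala-parser-combinators library is on the classpath, and it swaps the asker's custom regex builder for hand-written patterns (the names AnswerDemo, greedy and reluctant are made up for the illustration, and only two of the four stop tokens are kept for brevity).

```scala
import scala.util.parsing.combinator.RegexParsers

object AnswerDemo extends RegexParsers {
  // Point 3: "[a-z]".r is a regex parser. The bare string "[a-z]" would
  // be a literal parser that only accepts the five characters [a-z].
  val lst: Parser[List[String]] = repsep("[a-z]".r, ",")

  // Point 2: the same lookahead, once with a greedy and once with a
  // reluctant (non-eager) quantifier.
  val greedy    = """(?s).*(?=\}\}|;;)""".r
  val reluctant = """(?s).*?(?=\}\}|;;)""".r
}
```

With these definitions, AnswerDemo.parseAll(AnswerDemo.lst, "a,b,c") succeeds with List(a, b, c). On the input "3:4:foo;; stuff}}", greedy.findPrefixOf returns Some("3:4:foo;; stuff"), because the greedy version runs up to the last stop token, while reluctant.findPrefixOf returns Some("3:4:foo"). The greedy result is why the original parser then finds "}" where it expects ";;".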

Related

Scala parser failure handling, dangling commas

Getting started with Scala parser combinators; before moving on I need to grasp failure/error handling better (note: I'm still getting into Scala as well).
Want to parse strings like "a = b, c = d" into a list of tuples but flag the user when dangling commas are found.
Thought about matching off failure ("a = b, ") when matching comma separated property assignments:
def commaList[T](inner: Parser[T]): Parser[List[T]] =
rep1sep(inner, ",") | rep1sep(inner, ",") ~> opt(",") ~> failure("Dangling comma")
def propertyAssignment: Parser[(String, String)] = ident ~ "=" ~ ident ^^ {
case id ~ "=" ~ prop => (id, prop)
}
And call the parser with:
p.parseAll(p.commaList(p.propertyAssignment), "name = John , ")
which results in a Failure, no surprise but with:
string matching regex `\p{javaJavaIdentifierStart}\p{javaJavaIdentifierPart}*' expected but end of source found
The commaList function succeeds on the first property assignment and starts repeating at the comma, but the next "ident" fails because the next character is the end of the source data. I thought the second alternative in commaList would catch that:
rep1sep(inner, ",") ~> opt(",") ~> failure("Dangling comma")
Nix. Ideas?
Scalaz to the rescue :-)
When you are working with warnings, it is not a good idea to exit your parser with a failure. You can easily combine the parser with the Scalaz Writer monad. With this monad you can add messages to the partial result during the parser run. These messages could be infos, warnings or errors. After the parser finishes, you can validate the result to decide whether it can be used or whether it contains critical problems. With such a separate validator step you usually get much better error messages. For example, you could accept arbitrary characters at the end of the string, but issue an error when they are found (e.g. "Garbage found after last statement"). That error message can be much more helpful for the user than the cryptic default one you get in the example below ("string matching regex `\z' expected [...]").
Here is an example based on the code in your question:
scala> :paste
// Entering paste mode (ctrl-D to finish)
import util.parsing.combinator.RegexParsers
import scalaz._, Scalaz._
object DemoParser extends RegexParsers {
type Warning = String
case class Equation(left : String, right : String)
type PWriter = Writer[Vector[Warning], List[Equation]]
val emptyList : List[Equation] = Nil
def rep1sep2[T](p : => Parser[T], q : => Parser[Any]): Parser[List[T]] =
p ~ rep(q ~> p) ^^ {case x~y => x::y}
def name : Parser[String] = """\w+""".r
def equation : Parser[Equation] = name ~ "=" ~ name ^^ { case n ~ _ ~ v => Equation(n,v) }
def commaList : Parser[PWriter] = rep1sep(equation, ",") ^^ (_.set(Vector()))
def danglingComma : Parser[PWriter] = opt(",") ^^ (
_ map (_ => emptyList.set(Vector("Warning: Dangling comma")))
getOrElse(emptyList.set(Vector(""))))
def danglingList : Parser[PWriter] = commaList ~ danglingComma ^^ {
case l1 ~ l2 => (l1.over ++ l2.over).set(l1.written ++ l2.written) }
def apply(input: String): PWriter = parseAll(danglingList, input) match {
case Success(result, _) => result
case failure : NoSuccess => emptyList.set(Vector(failure.msg))
}
}
// Exiting paste mode, now interpreting.
import util.parsing.combinator.RegexParsers
import scalaz._
import Scalaz._
defined module DemoParser
scala> DemoParser("a=1, b=2")
res2: DemoParser.PWriter = (Vector(),List(Equation(a,1), Equation(b,2)))
scala> DemoParser("a=1, b=2,")
res3: DemoParser.PWriter = (Vector(Warning: Dangling comma),List(Equation(a,1), Equation(b,2)))
scala> DemoParser("a=1, b=2, ")
res4: DemoParser.PWriter = (Vector(Warning: Dangling comma),List(Equation(a,1), Equation(b,2)))
scala> DemoParser("a=1, b=2, ;")
res5: DemoParser.PWriter = (Vector(string matching regex `\z' expected but `;' found),List())
scala>
As you can see, it handles the error cases fine. If you want to extend the example, add case classes for different kinds of errors and include the current parser positions in the messages.
Btw. the problem with the white spaces is handled by the RegexParsers class. If you want to change the handling of white spaces, just override the field whiteSpace.
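As a rough illustration of overriding whiteSpace, the sketch below skips only spaces and tabs, so that newlines become significant. It assumes scala-parser-combinators is on the classpath; LineParser, word, line and lines are hypothetical names, not part of the example above.

```scala
import scala.util.parsing.combinator.RegexParsers

object LineParser extends RegexParsers {
  // Default is """\s+""".r, which would also swallow newlines.
  override protected val whiteSpace = """[ \t]+""".r

  def word: Parser[String] = """\w+""".r
  // One line is one or more words separated by spaces or tabs.
  def line: Parser[List[String]] = rep1(word)
  // Lines are separated by an explicit newline literal.
  def lines: Parser[List[List[String]]] = repsep(line, "\n")
}
```

Here LineParser.parseAll(LineParser.lines, "a b\nc d") yields List(List(a, b), List(c, d)); with the default whiteSpace, the newline would be skipped and all four words would land in a single line.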
Your parser isn't expecting the trailing whitespace at the end of "name = John , ".
You could use a regex to optionally parse "," followed by any amount of whitespace:
def commaList[T](inner: Parser[T]): Parser[List[T]] =
rep1sep(inner, ",") <~ opt(",\\s*".r ~> failure("Dangling comma"))
Note that you can avoid using alternatives (|) here, by making the failure part of the optional parser. If the optional part consumes some input and then fails, then the whole parser fails.

How to embed Scala code inside a specially defined syntax?

I don't know if this info is relevant to the question, but I am learning Scala parser combinators.
Using some examples (in this master thesis) I was able to write a simple functional (in the sense that it is non imperative) programming language.
Is there a way to improve my parser/evaluator such that it could allow/evaluate input like this:
<%
import scala.<some package / classes>
import weka.<some package / classes>
%>
some DSL code (lambda calculus)
<%
System.out.println("asdasd");
J48 j48 = new J48();
%>
as input written in the guest language (DSL)?
Should I use reflection or something similar* to evaluate such input?
Is there some source code recommendation to study (may be groovy sources?)?
Maybe this is something similar: runtime compilation, but I am not sure this is the best alternative.
EDIT
Complete answer given below with "{" and "}". Maybe "{{" would be better.
It is the question as to what the meaning of such import statements should be.
Perhaps you start first with allowing references to java methods in your language (the Lambda Calculus, I guess?).
For example:
java.lang.System.out.println "foo"
If you have that, you can then add resolution of unqualified names like
println "foo"
But here comes the first problem: println exists in System.out and System.err, or, to be more correct: it is a method of PrintStream, and both System.err and System.out are PrintStreams.
Hence you would need some notion of Objects, Classes, Types, and so on to do it right.
I managed to run Scala code embedded in my interpreted DSL.
Inserting DSL vars into the Scala code and recovering the return value comes as a bonus. :)
Minimal relevant code from parsing and interpreting until performing embedded Scala code run-time execution (Main Parser AST and Interpreter):
object Main extends App {
val ast = Parser1 parse "some dsl code here"
Interpreter eval ast
}
object Parser1 extends RegexParsers with ImplicitConversions {
import AST._
val separator = ";"
def parse(input: String): Expr = parseAll(program, input).get
type P[+T] = Parser[T]
def program = rep1sep(expr, separator) <~ separator ^^ Sequence
def expr: Parser[Expr] = (assign /*more calls here*/)
def scalacode: P[Expr] = "{" ~> rep(scala_text) <~ "}" ^^ {case l => Scalacode(l.flatten)}
def scala_text = text_no_braces ~ "$" ~ ident ~ text_no_braces ^^ {case a ~ b ~ c ~ d => List(a, b + c, d)}
//more rules here
def assign = ident ~ ("=" ~> atomic_expr) ^^ Assign
//more rules here
def atomic_expr = (
ident ^^ Var
//more calls here
| "(" ~> expr <~ ")"
| scalacode
| failure("expression expected")
)
def text_no_braces = """[a-zA-Z0-9\"\'\+\-\_!##%\&\(\)\[\]\/\?\:;\.\>\<\,\|= \*\\\n]*""".r //| fail("Scala code expected")
def ident = """[a-zA-Z]+[a-zA-Z0-9]*""".r
}
object AST {
sealed abstract class Expr
// more classes here
case class Scalacode(items: List[String]) extends Expr
case class Literal(v: Any) extends Expr
case class Var(name: String) extends Expr
}
object Interpreter {
import AST._
val env = collection.immutable.Map[VarName, VarValue]()
def run(code: String) = {
val code2 = "val res_1 = (" + code + ")"
interpret.interpret(code2)
val res = interpret.valueOfTerm("res_1")
if (res == None) Literal() else Literal(res.get)
}
class Context(private var env: Environment = initEnv) {
def eval(e: Expr): Any = e match {
case Scalacode(l: List[String]) => {
val r = l map {
x =>
if (x.startsWith("$")) {
eval(Var(x.drop(1)))
} else {
x
}
}
eval(run(r.mkString))
}
case Assign(id, expr) => env += (id -> eval(expr))
//more pattern matching here
case Literal(v) => v
case Var(id) => {
env getOrElse(id, sys.error("Undefined " + id))
}
}
}
}

How to parse this structure: "name[arg,arg]" with scala combinator parsers?

I have several strings like these:
name[arg,arg,arg]
name[arg,arg]
name[arg]
name
I wanted to parse it with scala combinator parsers, and this is the best that I managed to get:
object TaskDepParser extends JavaTokenParsers {
def name: Parser[String] = "[^\\[\\],]+".r
def expr: Parser[(String, Option[List[String]])] =
name ^^ { a => (a, None) } |
name ~ "[" ~ repsep(name, ",") ~ "]" ^^ { case name~_~args~_ => (name, Some(args)) }
}
It works on name, but fails to work on name[arg] - says string matching regex `\z' expected but `[' found. Is it possible to fix it?
@TonyK has already given the answer in his comment. But I want to point out that Scala parser combinators can already parse optional values:
object TaskDepParser extends JavaTokenParsers {
def name: Parser[String] = """[^\[\],]+""".r
def expr: Parser[(String, Option[List[String]])] =
name ~ opt("[" ~> repsep(name, ",") <~ "]") ^^ { case name ~ args => (name, args) }
}
With ~> and <~ it is possible to keep only the left or right result, which avoids unnecessary pattern matching in ^^. Furthermore, I would use triple quotes for strings to avoid lots of escaping.
I think it might work if you flip it around...Name is getting sucked up by the first rule, and then you get a failure on input.
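A sketch of what "flipping it around" could look like, assuming scala-parser-combinators is on the classpath (FlippedTaskDepParser is a made-up name). The bracketed alternative is tried first, and the bare name is only the fallback:

```scala
import scala.util.parsing.combinator.JavaTokenParsers

object FlippedTaskDepParser extends JavaTokenParsers {
  def name: Parser[String] = """[^\[\],]+""".r

  def expr: Parser[(String, Option[List[String]])] =
    // Longer alternative first: name followed by a bracketed argument list.
    name ~ ("[" ~> repsep(name, ",") <~ "]") ^^ { case n ~ args => (n, Some(args)) } |
    // Fallback: a bare name with no arguments.
    name ^^ { n => (n, None) }
}
```

With this ordering, parseAll(expr, "name[arg,arg]") gives (name,Some(List(arg, arg))) and parseAll(expr, "name") gives (name,None). The opt-based version above is shorter and avoids parsing name twice, so the flipped form is mainly useful for seeing why the original order failed.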

Scala Parser - Message Length

I'm toying with Scala's Parser library. I am trying to write a parser for a format where a length is specified followed by a message of that length. For example:
x.parseAll(x.message, "5helloworld") // result: "hello", remaining: "world"
I'm not sure how to do this using combinators. My mind first goes to:
def message = length ~ body
But obviously body depends on length, and I don't know how to do that :p
Instead you could just define a message Parser as a single Parser (not combination of Parsers) and I think that is doable (although I haven't looked if a single Parser can pull several elem?).
Anyways, I'm a scala noob, I just find this awesome :)
You should use into for that, or its abbreviation, >>:
scala> object T extends RegexParsers {
| def length: Parser[String] = """\d+""".r
| def message: Parser[String] = length >> { length => """\w{%d}""".format(length.toInt).r }
| }
defined module T
scala> T.parseAll(T.message, "5helloworld")
res0: T.ParseResult[String] =
[1.7] failure: string matching regex `\z' expected but `w' found
5helloworld
^
scala> T.parse(T.message, "5helloworld")
res1: T.ParseResult[String] = [1.7] parsed: hello
Be careful with precedence when using it. If you add an "~ remainder" after the function above, for instance, Scala will interpret it as length >> ({ length => ...} ~ remainder) instead of (length >> { length => ...}) ~ remainder.
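To illustrate the precedence point, here is a small sketch with the parentheses written out. It assumes scala-parser-combinators is on the classpath; LengthPrefixed, body and remainder are hypothetical names chosen for the example.

```scala
import scala.util.parsing.combinator.RegexParsers

object LengthPrefixed extends RegexParsers {
  def length: Parser[Int] = """\d+""".r ^^ (_.toInt)

  // Builds a parser for exactly n word characters.
  def body(n: Int): Parser[String] = ("""\w{""" + n + "}").r

  // Whatever is left on the line after the body.
  def remainder: Parser[String] = """.*""".r

  // The parentheses around (length >> body) are required: without them
  // ~ binds tighter than >> and the expression means something else.
  def message: Parser[(String, String)] =
    (length >> body) ~ remainder ^^ { case b ~ r => (b, r) }
}
```

LengthPrefixed.parseAll(LengthPrefixed.message, "5helloworld") then succeeds with (hello,world), so parseAll can be used because the remainder is consumed explicitly.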
This does not sound like a context free language, so you will need to use flatMap :
def message = length.flatMap(n => bodyWithLength(n))
where length is of type Parser[Int] and bodyWithLength(n) would be based on repN, such as
def bodyWithLength(n: Int): Parser[String] =
  repN(n, elem("any", _ => true)) ^^ {_.mkString}
I wouldn't use parser combinators for this purpose. But if you have to, or if the problem becomes more complex, you could try this:
def times(x :Long,what:String) : Parser[Any] = x match {
case 1 => what;
case x => what~times(x-1,what);
}
Don't use parseAll if you want some input to remain; use parse.
You could parse the length, store the result in a mutable field x (I know, ugly, but useful here) and parse the body x times. Then you get the parsed String, and the rest remains in the parser.

Scala: Using StandardTokenParser for parsing hexadecimal numbers

I am using Scala's combinator parsers by extending scala.util.parsing.combinator.syntactical.StandardTokenParsers. This class provides the following methods:
def ident : Parser[String] for parsing identifiers and
def numericLit : Parser[String] for parsing a number (decimal I suppose)
I am using scala.util.parsing.combinator.lexical.Scanners from scala.util.parsing.combinator.lexical.StdLexical for lexing.
My requirement is to parse a hexadecimal number (without the 0x prefix) which can be of any length. Basically a grammar like: ([0-9]|[a-f])+
I tried integrating Regex parser but there are type issues there. Other ways to extend the definition of lexer delimiter and grammar rules lead to token not found!
As I thought the problem can be solved by extending the behavior of Lexer and not the Parser. The standard lexer takes only decimal digits, so I created a new lexer:
class MyLexer extends StdLexical {
override type Elem = Char
override def digit = ( super.digit | hexDigit )
lazy val hexDigits = Set[Char]() ++ "0123456789abcdefABCDEF".toArray
lazy val hexDigit = elem("hex digit", hexDigits.contains(_))
}
And my parser (which has to be a StandardTokenParsers) can be extended as follows:
object ParseAST extends StandardTokenParsers{
override val lexical:MyLexer = new MyLexer()
lexical.delimiters += ( "(" , ")" , "," , "#")
...
}
The construction of the "number" from digits is taken care by StdLexical class:
class StdLexical {
...
def token: Parser[Token] =
...
| digit~rep(digit)^^{case first ~ rest => NumericLit(first :: rest mkString "")}
}
Since StdLexical gives just the parsed number as a String it is not a problem for me, as I am not interested in numeric value either.
You can use the RegexParsers with an action associated to the token in question.
import scala.util.parsing.combinator._
object HexParser extends RegexParsers {
val hexNum: Parser[Int] = """[0-9a-f]+""".r ^^
{ case s:String => Integer.parseInt(s,16) }
def seq: Parser[Any] = repsep(hexNum, ",")
}
This defines a parser that reads comma-separated hex numbers with no leading 0x, and each number is actually returned as an Int.
val result = HexParser.parse(HexParser.seq, "1, 2, f, 10, 1a2b34d")
scala> println(result)
[1.21] parsed: List(1, 2, 15, 16, 27439949)
Note that there is no way to distinguish decimal notation numbers. Also, I'm using Integer.parseInt, which is limited to the size of an Int. To handle any length you may have to write your own parser and use BigInteger or arrays.
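For the arbitrary-length case, one possible sketch swaps Integer.parseInt for Scala's BigInt, which accepts a radix. This assumes scala-parser-combinators is on the classpath; BigHexParser is a made-up name, and accepting upper-case digits is an addition not in the parser above.

```scala
import scala.util.parsing.combinator.RegexParsers

object BigHexParser extends RegexParsers {
  // BigInt(s, 16) parses a hex string of any length.
  val hexNum: Parser[BigInt] = """[0-9a-fA-F]+""".r ^^ (s => BigInt(s, 16))

  def seq: Parser[List[BigInt]] = repsep(hexNum, ",")
}
```

For example, BigHexParser.parseAll(BigHexParser.seq, "ff, 1a2b34d, ffffffffffffffffffff") parses three numbers, the last of which is 2^80 - 1 and would not fit in an Int or a Long.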

Resources