I'm toying with Scala's Parser library. I am trying to write a parser for a format where a length is specified followed by a message of that length. For example:
x.parseAll(x.message, "5helloworld") // result: "hello", remaining: "world"
I'm not sure how to do this using combinators. My mind first goes to:
def message = length ~ body
But obviously body depends on length, and I don't know how to do that :p
Instead you could just define a message Parser as a single Parser (not combination of Parsers) and I think that is doable (although I haven't looked if a single Parser can pull several elem?).
Anyways, I'm a scala noob, I just find this awesome :)
You should use into for that, or its abbreviation, >>:
scala> object T extends RegexParsers {
| def length: Parser[String] = """\d+""".r
| def message: Parser[String] = length >> { length => """\w{%d}""".format(length.toInt).r }
| }
defined module T
scala> T.parseAll(T.message, "5helloworld")
res0: T.ParseResult[String] =
[1.7] failure: string matching regex `\z' expected but `w' found
5helloworld
^
scala> T.parse(T.message, "5helloworld")
res1: T.ParseResult[String] = [1.7] parsed: hello
Be careful with precedence when using it. If you add an "~ remainder" after the function above, for instance, Scala will interpret it as length >> ({ length => ...} ~ remainder) instead of (length >> { length => ...}) ~ remainder.
This does not sound like a context free language, so you will need to use flatMap :
def message = length.flatMap(l => bodyOfLength(n))
where length is of type Parser[Int] and bodyOfLength(n) would be based on repN, such as
def bodyWithLength(n: Int) : Parser[String]
= repN(n, elem("any", _ => true)) ^^ {_.mkString}
I wouldn´t use pasrer combinators for this purpose. But if you have to or the problem becomes more complex you could try this:
def times(x :Long,what:String) : Parser[Any] = x match {
case 1 => what;
case x => what~times(x-1,what);
}
Don´t use parseAll if you want something remained, use parse.
You could parse length, store the result in a mutable field x(I know ugly, but useful here) and parse body x times, then you get the String parsed and the rest remains in the parser.
Related
Getting started with Scala parser combinations, before moving on need to grasp failure/error handling better (note: still getting into Scala as well)
Want to parse strings like "a = b, c = d" into a list of tuples but flag the user when dangling commas are found.
Thought about matching off failure ("a = b, ") when matching comma separated property assignments:
def commaList[T](inner: Parser[T]): Parser[List[T]] =
rep1sep(inner, ",") | rep1sep(inner, ",") ~> opt(",") ~> failure("Dangling comma")
def propertyAssignment: Parser[(String, String)] = ident ~ "=" ~ ident ^^ {
case id ~ "=" ~ prop => (id, prop)
}
And call the parser with:
p.parseAll(p.commaList(p.propertyAssignment), "name = John , ")
which results in a Failure, no surprise but with:
string matching regex `\p{javaJavaIdentifierStart}\p{javaJavaIdentifierPart}*' expected but end of source found
The commList function succeeds on the first property assignment and starts repeating given the comma but the next "ident" fails on the fact that the next character is the end of the source data. Thought I could catch that 2nd alternative in the commList would match:
rep1sep(inner, ",") ~> opt(",") ~> failure("Dangling comma")
Nix. Ideas?
Scalaz to the rescue :-)
When you are working with warnings, it is not a good idea to exit your parser with a failure. You can easily combine the parser with the Scalaz writer monad. With this monads you can add messages to the partial result during the parser run. These messages could be infos, warnings or errors. After the parser finishes, you can then validate the result, if it can be used or if it contains critical problems. With such a separate vaildator step you get usual much better error messages. For example you could accept arbitrary characters at the end of the string, but issue an error when they are found (e.g. "Garbage found after last statement"). The error message can be much more helpful for the user than the cryptic default one you get in the example below ("string matching regex `\z' expected [...]").
Here is an example based on the code in your question:
scala> :paste
// Entering paste mode (ctrl-D to finish)
import util.parsing.combinator.RegexParsers
import scalaz._, Scalaz._
object DemoParser extends RegexParsers {
type Warning = String
case class Equation(left : String, right : String)
type PWriter = Writer[Vector[Warning], List[Equation]]
val emptyList : List[Equation] = Nil
def rep1sep2[T](p : => Parser[T], q : => Parser[Any]): Parser[List[T]] =
p ~ rep(q ~> p) ^^ {case x~y => x::y}
def name : Parser[String] = """\w+""".r
def equation : Parser[Equation] = name ~ "=" ~ name ^^ { case n ~ _ ~ v => Equation(n,v) }
def commaList : Parser[PWriter] = rep1sep(equation, ",") ^^ (_.set(Vector()))
def danglingComma : Parser[PWriter] = opt(",") ^^ (
_ map (_ => emptyList.set(Vector("Warning: Dangling comma")))
getOrElse(emptyList.set(Vector("getOrElse(emptyList.set(Vector(""))))
def danglingList : Parser[PWriter] = commaList ~ danglingComma ^^ {
case l1 ~ l2 => (l1.over ++ l2.over).set(l1.written ++ l2.written) }
def apply(input: String): PWriter = parseAll(danglingList, input) match {
case Success(result, _) => result
case failure : NoSuccess => emptyList.set(Vector(failure.msg))
}
}
// Exiting paste mode, now interpreting.
import util.parsing.combinator.RegexParsers
import scalaz._
import Scalaz._
defined module DemoParser
scala> DemoParser("a=1, b=2")
res2: DemoParser.PWriter = (Vector(),List(Equation(a,1), Equation(b,2)))
scala> DemoParser("a=1, b=2,")
res3: DemoParser.PWriter = (Vector(Warning: Dangling comma),List(Equation(a,1), Equation(b,2)))
scala> DemoParser("a=1, b=2, ")
res4: DemoParser.PWriter = (Vector(Warning: Dangling comma),List(Equation(a,1), Equation(b,2)))
scala> DemoParser("a=1, b=2, ;")
res5: DemoParser.PWriter = (Vector(string matching regex `\z' expected but `;' found),List())
scala>
As you can see, it handles the error cases fine. If you want to extend the example, add case classes for different kinds of errors and include the current parser positions in the messages.
Btw. the problem with the white spaces is handled by the RegexParsers class. If you want to change the handling of white spaces, just override the field whiteSpace.
Your parser isn't expecting the trailing whitespace at the end of "name = John , ".
You could use a regex to optionally parse "," followed by any amount of whitespace:
def commaList[T](inner: Parser[T]): Parser[List[T]] =
rep1sep(inner, ",") <~ opt(",\\s*".r ~> failure("Dangling comma"))
Note that you can avoid using alternatives (|) here, by making the failure part of the optional parser. If the optional part consumes some input and then fails, then the whole parser fails.
I have several strings like these:
name[arg,arg,arg]
name[arg,arg]
name[arg]
name
I wanted to parse it with scala combinator parsers, and this is the best that I managed to get:
object TaskDepParser extends JavaTokenParsers {
def name: Parser[String] = "[^\\[\\],]+".r
def expr: Parser[(String, Option[List[String]])] =
name ^^ { a => (a, None) } |
name ~ "[" ~ repsep(name, ",") ~ "]" ^^ { case name~_~args~_ => (name, Some(args)) }
}
It works on name, but fails to work on name[arg] - says string matching regex\z' expected but [' found. Is it possible to fix it?
#TonyK has already given the answer in his comment. But I wanna suggest that Scala parser combinators can already parse optional values:
object TaskDepParser extends JavaTokenParsers {
def name: Parser[String] = """[^\[\],]+""".r
def expr: Parser[(String, Option[List[String]])] =
name ~ opt("[" ~> repsep(name, ",") <~ "]") ^^ { case name ~ args => (name, args) }
}
With ~> and <~ it is possible to keep only left or right result to avoid unnecessary patter matching in ^^. Furthermore I would use triple quotes for strings to avoid lots of escaping.
I think it might work if you flip it around...Name is getting sucked up by the first rule, and then you get a failure on input.
I'm trying to split input by some keywords without delimiter like white-space.
object MyParser extends JavaTokenParsers {
def expr = (text | keyword).+
def text = ".+".r ^^ ("'"+_+"'")
def keyword = "ID".r ^^ ("["+_+"]")
}
val p = MyParser
p.parse(p.expr, "fooIDbar") match {
case p.Success(r, _) => r foreach print
case x => println(x.toString)
}
This outputs as below.
>> 'hogeIDfuga'
But I really want to do is like this.
>> 'hoge'[ID]'fuga'
It seems text engulfs all the characters. I tried to express [text does't contain keyword], but I could't. How to express it in regex or scala parser? or any other solutions?
I have seen some posts 1 2, but they don't work in my case.
First, keyword is a constant word so you don't need a regex, a plain string is enough.
Second, a text is some string that doesn't contain a keyword, not any string. Try this:
import util.parsing.combinator._
object MyParser extends JavaTokenParsers {
def expr = (text | keyword).+
def text = """((?!ID).)+""".r ^^ ("'"+_+"'")
def keyword = "ID" ^^ ("["+_+"]")
}
val p = MyParser
p.parse(p.expr, "fooIDbar") match {
case p.Success(r, _) => r foreach print
case x => println(x.toString)
}
As for the trick of writing a regex that not matching something, read this.
First, the code:
package com.digitaldoodles.markup
import scala.util.parsing.combinator.{Parsers, RegexParsers}
import com.digitaldoodles.rex._
class MarkupParser extends RegexParsers {
val stopTokens = (Lit("{{") | "}}" | ";;" | ",,").lookahead
val name: Parser[String] = """[##!$]?[a-zA-Z][a-zA-Z0-9]*""".r
val content: Parser[String] = (patterns.CharAny ** 0 & stopTokens).regex
val function: Parser[Any] = name ~ repsep(content, "::") <~ ";;"
val block1: Parser[Any] = "{{" ~> function
val block2: Parser[Any] = "{{" ~> function <~ "}}"
val lst: Parser[Any] = repsep("[a-z]", ",")
}
object ParseExpr extends MarkupParser {
def main(args: Array[String]) {
println("Content regex is ", (patterns.CharAny ** 0 & stopTokens).regex)
println(parseAll(block1, "{{#name 3:4:foo;;"))
println(parseAll(block2, "{{#name 3:4:foo;; stuff}}"))
println(parseAll(lst, "a,b,c"))
}
}
then, the run results:
[info] == run ==
[info] Running com.digitaldoodles.markup.ParseExpr
(Content regex is ,(?:[\s\S]{0,})(?=(?:(?:\{\{|\}\})|;;)|\,\,))
[1.18] parsed: (#name~List(3:4:foo))
[1.24] failure: `;;' expected but `}' found
{{#name 3:4:foo;; stuff}}
^
[1.1] failure: string matching regex `\z' expected but `a' found
a,b,c
^
I use a custom library to assemble some of my regexes, so I've printed out the "content" regex; its supposed to be basically any text up to but not including certain token patterns, enforced using a positive lookahead assertion.
Finally, the problems:
1) The first run on "block1" succeeds, but shouldn't, because the separator in the "repsep" function is "::", yet ":" are parsed as separators.
2) The run on "block2" fails, presumably because the lookahead clause isn't working--but I can't figure out why this should be. The lookahead clause was already exercised in the "repsep" on the run on "block1" and seemed to work there, so why should it fail on block 2?
3) The simple repsep exercise on "lst" fails because internally, the parser engine seems to be looking for a boundary--is this something I need to work around somehow?
Thanks,
Ken
1) No, "::" are not parsed as separators. If it did, the output would be (#name~List(3, 4, foo)).
2) It happens because "}}" is also a delimiter, so it takes the longest match it can -- the one that includes ";;" as well. If you make the preceding expression non-eager, it will then fail at "s" on "stuff", which I presume is what you expected.
3) You passed a literal, not a regex. Modify "[a-z]" to "[a-z]".r and it will work.
I am using Scala combinatorial parser by extending scala.util.parsing.combinator.syntactical.StandardTokenParser. This class provides following methods
def ident : Parser[String] for parsing identifiers and
def numericLit : Parser[String] for parsing a number (decimal I suppose)
I am using scala.util.parsing.combinator.lexical.Scannersfrom scala.util.parsing.combinator.lexical.StdLexicalfor lexing.
My requirement is to parse a hexadecimal number (without the 0x prefix) which can be of any length. Basically a grammar like: ([0-9]|[a-f])+
I tried integrating Regex parser but there are type issues there. Other ways to extend the definition of lexer delimiter and grammar rules lead to token not found!
As I thought the problem can be solved by extending the behavior of Lexer and not the Parser. The standard lexer takes only decimal digits, so I created a new lexer:
class MyLexer extends StdLexical {
override type Elem = Char
override def digit = ( super.digit | hexDigit )
lazy val hexDigits = Set[Char]() ++ "0123456789abcdefABCDEF".toArray
lazy val hexDigit = elem("hex digit", hexDigits.contains(_))
}
And my parser (which has to be a StandardTokenParser) can be extended as follows:
object ParseAST extends StandardTokenParsers{
override val lexical:MyLexer = new MyLexer()
lexical.delimiters += ( "(" , ")" , "," , "#")
...
}
The construction of the "number" from digits is taken care by StdLexical class:
class StdLexical {
...
def token: Parser[Token] =
...
| digit~rep(digit)^^{case first ~ rest => NumericLit(first :: rest mkString "")}
}
Since StdLexical gives just the parsed number as a String it is not a problem for me, as I am not interested in numeric value either.
You can use the RegexParsers with an action associated to the token in question.
import scala.util.parsing.combinator._
object HexParser extends RegexParsers {
val hexNum: Parser[Int] = """[0-9a-f]+""".r ^^
{ case s:String => Integer.parseInt(s,16) }
def seq: Parser[Any] = repsep(hexNum, ",")
}
This will define a parser that reads comma separated hex number with no prior 0x. And it will actually return a Int.
val result = HexParser.parse(HexParser.seq, "1, 2, f, 10, 1a2b34d")
scala> println(result)
[1.21] parsed: List(1, 2, 15, 16, 27439949)
Not there is no way to distinguish decimal notation numbers. Also I'm using the Integer.parseInt, this is limited to the size of your Int. To get any length you may have to make your own parser and use BigInteger or arrays.