Parser with dependent tokens - parsing

Parsers for input tasks are powerful and incredibly easy to build.
But now I have a use case that I don't know how to express.
I would like to create a parser like this:
token(Space ~> language) ~ token(Space ~> number).+
where language can be one of these 3 values:
English
French
Spanish
and number can be one of these 3 values:
one, two, three for English
un, deux, trois for French
uno, dos, tres for Spanish.
Our parser can be easily written as:
token(Space ~> StringBasic.examples(FixedSetExamples("English", "French", "Spanish"))) ~
token(Space ~> number).+
but I don't know how to write number, because it depends on the value of language.
Example inputs:
English one one
Spanish dos
French trois deux
I think this must be possible because, for example, the arguments of a command or an input task depend on the command type. I've studied the source code of SBT, but it is hard to understand.
More info:
Related documentation.
Repeating dependent parsers with Scala/SBT parser combinators

You can use flatMap and create a second parser that is based on the first one:
val languageNumbersParser = {
  import complete.DefaultParsers._
  val languages: Parser[String] = literal("English") | "French" | "Spanish"
  val engNums: Parser[String] = literal("one") | "two" | "three"
  val freNums: Parser[String] = literal("un") | "deux" | "trois"
  val spaNums: Parser[String] = literal("uno") | "dos" | "tres"
  // flatMap picks the number parser from the language that was just parsed;
  // success(...) re-emits the language so it stays in the result.
  token(Space ~> languages) flatMap {
    case "English" => success("English") ~ token(Space ~> engNums).+
    case "French" => success("French") ~ token(Space ~> freNums).+
    case "Spanish" => success("Spanish") ~ token(Space ~> spaNums).+
  }
}
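The three near-identical cases can also be driven by a lookup table. The following is a minimal sketch of that idea, written against scala-parser-combinators' RegexParsers rather than sbt's complete.DefaultParsers, so it has no tab completion and skips whitespace implicitly instead of through Space. The names LanguageNumbers, numbersFor, and oneOf are illustrative and do not come from sbt; >> is that library's name for flatMap.

```scala
import scala.util.parsing.combinator.RegexParsers

object LanguageNumbers extends RegexParsers {
  // Lookup table: each language selects the words its numbers may use.
  val numbersFor: Map[String, List[String]] = Map(
    "English" -> List("one", "two", "three"),
    "French"  -> List("un", "deux", "trois"),
    "Spanish" -> List("uno", "dos", "tres")
  )

  // Turn a non-empty list of words into an alternation of literals.
  def oneOf(words: List[String]): Parser[String] =
    words.map(w => literal(w)).reduce((p, q) => p | q)

  def language: Parser[String] = oneOf(numbersFor.keys.toList)

  // The number parser is built from the language that was just parsed.
  def languageNumbers: Parser[(String, List[String])] =
    language >> { lang =>
      rep1(oneOf(numbersFor(lang))) ^^ (nums => (lang, nums))
    }
}
```

With this sketch, parseAll(languageNumbers, "Spanish dos") succeeds, while "English uno" is rejected because uno is not in the English table.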

Related

Parse grammar alternating and repeating

I was able to add support to my parser's grammar for alternating characters (e.g. ababa or baba) by following along with this question.
I'm now looking to extend that by allowing repeats of characters.
For example, I'd like to be able to support abaaabab and aababaaa as well. In my particular case, only the a is allowed to repeat but a solution that allows for repeating b's would also be useful.
Given the rules from the other question:
expr ::= A | B
A ::= "a" B | "a"
B ::= "b" A | "b"
... I tried extending it to support repeats, like so:
expr ::= A | B
# support 1 or more "a"
A_one_or_more = A_one_or_more "a" | "a"
A ::= A_one_or_more B | A_one_or_more
B ::= "b" A | "b"
... but that grammar is ambiguous. Is it possible for this to be made unambiguous, and if so could anyone help me disambiguate it?
I'm using the lemon parser which is an LALR(1) parser.
The point of parsing, in general, is to parse; that is, determine the syntactic structure of an input. That's significantly different from simply verifying that an input belongs to a language.
For example, the language which consists of arbitrary repetitions of a and b can be described with the regular expression (a|b)*, which can be written in BNF as
S ::= /* empty */ | S a | S b
But that probably does not capture the syntactic structure you are trying to define. On the other hand, since you don't specify that structure, it is hard to know.
Here are a couple more possibilities, which build different parse trees. The first:
S ::= E | S E
E ::= A b | E b
A ::= a | A a
and the second:
S ::= E | S E
E ::= A B
A ::= a | A a
B ::= b | B b
When writing a grammar to parse a language, it is useful to start by drawing your proposed parse trees. Usually, you can write the grammar directly from the form of the trees, which shows that a formal grammar is primarily a documentation tool, since it clearly describes the language in a way that informal descriptions cannot. Using a parser generator to turn that grammar into a parser ensures that the parser implements the described language. Or, at least, that is the goal.
Here is a nice tool for checking your grammar online: http://smlweb.cpsc.ucalgary.ca/start.html. It actually accepts the grammar you provided as a valid LALR(1) grammar.
A different LALR(1) grammar that allows repeating a's would be:
S ::= "a" S | "a" | "b" A | "b"
A ::= "a" S | "a" .
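Whichever grammar you settle on, it helps to have an independent oracle for test strings. As I read the question, the target language is every non-empty string of a's and b's in which b never occurs twice in a row. A minimal Scala sketch of that check (the names AlternatingCheck and accepts are illustrative, not part of lemon or of the grammars above):

```scala
object AlternatingCheck {
  // Accepts non-empty strings over {a, b} in which no two b's are adjacent.
  // Repeated a's are allowed anywhere.
  def accepts(s: String): Boolean =
    s.nonEmpty &&
      s.forall(c => c == 'a' || c == 'b') &&
      !s.contains("bb")
}
```

Running the same test strings through both this check and the generated parser is a cheap way to spot a grammar that is too strict or too loose.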

Scala Parser Combinator compilation issue: failing to match multiple variables in case

I am learning how to use the Scala Parser Combinators, which by the way are lovely to work with.
Unfortunately I am getting a compilation error. I have read, and recreated the worked examples from: http://www.artima.com/pins1ed/combinator-parsing.html <-- from Chapter 31 of Programming in Scala, First Edition, and a few other blogs.
I've reduced my code to a much simpler version to demonstrate my problem. I am working on a parser that would parse the following samples:
if a then x else y
if a if b then x else y else z
with a little extra: the conditions can have an optional "/1,2,3" syntax
if a/1 then x else y
if a/2,3 if b/3,4 then x else y else z
So I have ended up with the following code:
def ifThenElse: Parser[Any] =
  "if" ~> condition ~ inputList ~ yes ~> "else" ~ no
def condition: Parser[Any] = ident
def inputList: Parser[Any] = opt("/" ~> repsep(input, ","))
def input: Parser[Any] = ident
def yes: Parser[Any] = "then" ~> result | ifThenElse
def no: Parser[Any] = result | ifThenElse
def result: Parser[Any] = ident
Now I want to add some transformations. I now get a compilation error on the second ~ in the case:
def ifThenElse: Parser[Any] =
"if" ~> condition ~ inputList ~ yes ~> "else" ~ no ^^ {
case c ~ i ~ y ~ n => null
^ constructor cannot be instantiated to expected type; found : SmallestFailure.this.~[a,b] required: String
When I change the code to
"if" ~> condition ~ inputList ~ yes ~> "else" ~ no ^^ {
case c ~ i => println("c: " + c + ", i: " + i)
I expected it not to compile, but it did. I thought I would need a variable for each clause. When executed (using parseAll) parsing "if a then b else c" produces "c: else, i: c". So it seems like c and i are the tail of the string.
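I don't know if it is significant, but none of the example tutorials seem to have an example with more than two variables being matched, and this one matches four.
You do not have to match the "else". The ~ and ~> operators have the same precedence and associate to the left, so the ~> in front of "else" discards everything parsed before it and leaves only "else" ~ no as the result. That is why the two-variable pattern compiled and printed "c: else, i: c". Parenthesize so that only the "else" is dropped:
def ifThenElse: Parser[Any] =
  "if" ~> condition ~ inputList ~ (yes <~ "else") ~ no ^^ {
    case c ~ i ~ y ~ n => null
  }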
Works as expected.
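A minimal runnable sketch of both variants, assuming scala-parser-combinators' JavaTokenParsers is on the classpath. The object name IfThenElseDemo, the broken parser, and the string rendering are illustrative. One deliberate change from the question: input uses wholeNumber instead of ident so that the numeric "/1,2,3" suffix can be parsed.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

object IfThenElseDemo extends JavaTokenParsers {
  // Fixed version: only the "else" keyword is dropped.
  def ifThenElse: Parser[String] =
    "if" ~> condition ~ inputList ~ (yes <~ "else") ~ no ^^ {
      case c ~ i ~ y ~ n => s"if($c, $i, $y, $n)"
    }

  // Original shape: ~> throws away everything to its left,
  // so the result is just "else" ~ no.
  def broken: Parser[String ~ String] =
    "if" ~> condition ~ inputList ~ yes ~> "else" ~ no

  def condition: Parser[String] = ident
  def inputList: Parser[List[String]] =
    opt("/" ~> repsep(input, ",")) ^^ (_.getOrElse(Nil))
  def input: Parser[String] = wholeNumber // assumption: numeric inputs
  def yes: Parser[String] = "then" ~> result | ifThenElse
  def no: Parser[String] = result | ifThenElse
  def result: Parser[String] = ident
}
```

With the parentheses in place the four-variable pattern type-checks, because the parser's result really is a nest of four values.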

Non-left-recursive PEG grammar for an "expression"

It's either a simple identifier (like cow), something surrounded by brackets ((...)), something that looks like a method call (...(...)), or something that looks like a member access (thing.member):
def expr = identifier |
  "(" ~> expr <~ ")" |
  expr ~ ("(" ~> expr <~ ")") |
  expr ~ "." ~ identifier
It's given in Scala Parser Combinator syntax, but it should be pretty straightforward to understand. It's similar to how expressions end up looking in many programming languages (hence the name expr). However, as it stands, it is left-recursive and causes my nice PEG parser to explode.
I have not succeeded in factoring out the left-recursion while still maintaining correctness for cases like (cow.head).moo(dog.run(fast)). How can I refactor this, or would I need to shift to some parser-generator that can tolerate left recursive grammars?
The trick is to have multiple rules where the first element of each rule is the next rule instead of being a recursive call to the same rule, and the rest of the rule is optional and repeating. For example the following would work for your example:
def expr = method_call
def method_call = member_access ~ ( "(" ~> expr <~ ")" ).*
def member_access = atomic_expression ~ ( "." ~> identifier).*
def atomic_expression = identifier |
  "(" ~> expr <~ ")"

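To make the layering concrete, here is a minimal sketch of the same grammar that also folds the repeated suffixes into a left-nested tree. It assumes scala-parser-combinators' JavaTokenParsers, whose ident stands in for identifier; the Expr, Id, Member, and Call types are hypothetical names introduced for illustration.

```scala
import scala.util.parsing.combinator.JavaTokenParsers

sealed trait Expr
case class Id(name: String) extends Expr
case class Member(target: Expr, name: String) extends Expr
case class Call(target: Expr, arg: Expr) extends Expr

object ExprParser extends JavaTokenParsers {
  def expr: Parser[Expr] = methodCall

  // A member access followed by zero or more "(arg)" suffixes.
  def methodCall: Parser[Expr] =
    memberAccess ~ rep("(" ~> expr <~ ")") ^^ {
      case target ~ args => args.foldLeft(target)((t, a) => Call(t, a))
    }

  // An atom followed by zero or more ".name" suffixes.
  def memberAccess: Parser[Expr] =
    atomicExpression ~ rep("." ~> ident) ^^ {
      case target ~ names => names.foldLeft(target)((t, n) => Member(t, n))
    }

  def atomicExpression: Parser[Expr] =
    ident ^^ (Id(_)) | "(" ~> expr <~ ")"
}
```

Under these assumptions (cow.head).moo(dog.run(fast)) parses to a Call whose target is Member(Member(Id(cow), head), moo). Note that the layering only permits calls after all member accesses, so an input such as a(b).c is rejected unless the call is parenthesized, as in (a(b)).c.
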
Scala combinator parser, what does >> mean?

I am a little bit confused about ">>" in Scala. Daniel said in "Scala parser combinators parsing xml?" that it could be used to parameterize the parser based on the result from the previous parser. Could someone give me an example or a hint? I already read the scaladoc but still do not understand it.
Thanks.
As I said, it serves to parameterize a parser, but let's walk through an example to make it clear.
Let's start with a simple parser that parses a number followed by a word:
def numberAndWord = number ~ word
def number = "\\d+".r
def word = "\\w+".r
Under RegexParsers, this will parse stuff like "3 fruits".
Now, let's say you also want a list of what these "n things" are. For example, "3 fruits: banana, apple, orange". Let's try to parse that to see how it goes.
First, how do I parse "N" things? As it happens, there's a repN method:
def threeThings = repN(3, word)
That will parse "banana apple orange", but not "banana, apple, orange". I need a separator. There's repsep that provides that, but that won't let me specify how many repetitions I want. So, let's provide the separator ourselves:
def threeThings = word ~ repN(2, "," ~> word)
OK, that works. We can write the whole example now, for three things, like this:
def listOfThings = "3" ~ word ~ ":" ~ threeThings
def word = "\\w+".r
def threeThings = word ~ repN(2, "," ~> word)
That kind of works, except that I'm fixing "N" at 3. I want to let the user specify how many. And that's where >>, also known as into (and, yes, it is flatMap for Parser), comes in. First, let's change threeThings:
def things(n: Int) = n match {
  case 1 => word ^^ (List(_))
  case x if x > 1 => word ~ repN(x - 1, "," ~> word) ^^ { case w ~ l => w :: l }
  case x => err("Invalid repetitions: "+x)
}
This is slightly more complicated than you might have expected, because I'm forcing it to return Parser[List[String]]. But how do I pass a parameter to things? I mean, this won't work:
def listOfThings = number ~ word ~ ":" ~ things(/* what do I put here?*/)
But we can rewrite that like this:
def listOfThings = (number ~ word <~ ":") >> {
  case n ~ what => things(n.toInt)
}
That is almost good enough, except that I now lost n and what: it only returns "List(banana, apple, orange)", not how many there ought to be, and what they are. I can do that like this:
def listOfThings = (number ~ word <~ ":") >> {
  case n ~ what => things(n.toInt) ^^ { list => new ~(n.toInt, new ~(what, list)) }
}
def number = "\\d+".r
def word = "\\w+".r
def things(n: Int) = n match {
  case 1 => word ^^ (List(_))
  case x if x > 1 => word ~ repN(x - 1, "," ~> word) ^^ { case w ~ l => w :: l }
  case x => err("Invalid repetitions: "+x)
}
Just a final comment. You might have asked yourself, "What do you mean, flatMap? Isn't that a monad/for-comprehension thingy?" Why, yes, and yes! :-) Here's another way of writing listOfThings:
def listOfThings = for {
  nOfWhat <- number ~ word <~ ":"
  n ~ what = nOfWhat
  list <- things(n.toInt)
} yield new ~(n.toInt, new ~(what, list))
I'm not doing n ~ what <- number ~ word <~ ":" because that uses filter or withFilter in Scala, which is not implemented by Parsers. But here's yet another way of writing it, which doesn't have the exact same semantics but produces the same results:
def listOfThings = for {
  n <- number
  what <- word
  _ <- ":" : Parser[String]
  list <- things(n.toInt)
} yield new ~(n.toInt, new ~(what, list))
This might even lead one to think that the claim that "monads are everywhere" might have something to it. :-)
The method >> takes a function that is given the result of the parser and uses it to construct a new parser. As stated, this can be used to parameterize a parser on the result of a previous parser.
Example
The following parser parses a line with n + 1 integer values. The first value n states the number of values to follow. This first integer is parsed and then the result of this parse is used to construct a parser that parses n further integers.
Parser definition
The following line assumes that you can parse an integer with parseInt: Parser[Int]. It first parses an integer value n and then uses >> to parse n additional integers, which form the result of the parser. So the initial n is not returned by the parser (though it is the size of the returned list).
def intLine: Parser[Seq[Int]] = parseInt >> (n => repN(n,parseInt))
Valid inputs
1 42
3 1 2 3
0
Invalid inputs
0 1
1
3 42 42
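For completeness, a self-contained sketch of intLine, assuming scala-parser-combinators' RegexParsers. The answer leaves parseInt unspecified, so the regex-based definition here is an assumption, and IntLineDemo and check are illustrative names.

```scala
import scala.util.parsing.combinator.RegexParsers

object IntLineDemo extends RegexParsers {
  // Assumed definition: one or more digits, converted to Int.
  def parseInt: Parser[Int] = """\d+""".r ^^ (_.toInt)

  // Parse the count n, then build a parser for exactly n more integers.
  def intLine: Parser[Seq[Int]] = parseInt >> (n => repN(n, parseInt))

  // parseAll requires the whole input to be consumed.
  def check(line: String): Boolean = parseAll(intLine, line).successful
}
```

Here check returns true for the valid inputs listed above and false for the invalid ones. For example, 0 1 fails because repN(0, parseInt) consumes nothing and parseAll then finds leftover input.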

A grammar expression for representing comma-delimited lists

Based on my experience, formal grammars typically express comma-delimited lists in a form similar to this:
foo_list -> foo ("," foo)*
What alternatives are there to avoid mentioning foo twice? Although this contrived example may seem innocent enough, I am encountering non-trivial expressions instead of foo. For example:
foo_list -> ( ( bar | baz | cat ) ) ( "," ( bar | baz | cat ) )*
I remember a (proprietary) parser generator that I once worked with, which would have this production written as
foo_list ::= <* bar | baz | cat ; "," *>
Yes, exactly like that. The actual metacharacters above are disputable, but I deem the general approach acceptable.
When writing another parser generator, I considered something alike for a while, but dropped it in favor of keeping the model simple.
A syntax diagram can, of course, represent it nicely without the unwanted repetition.
During my experimentation, this syntax showed some potential:
foo_list -> ( bar | baz | cat ) ("," ...)*
The ... token refers to the preceding expression (in this case, ( bar | baz | cat )).
This is not a perfect solution, but I am putting it out there for discussion.
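In a parser-combinator setting the repetition disappears once the list shape is a function of the element parser. A minimal sketch, assuming scala-parser-combinators' RegexParsers, whose rep1sep(p, sep) parses p followed by zero or more occurrences of sep p; the names FooListDemo, item, and fooList are illustrative.

```scala
import scala.util.parsing.combinator.RegexParsers

object FooListDemo extends RegexParsers {
  // The element expression is written exactly once.
  def item: Parser[String] = literal("bar") | "baz" | "cat"

  // Equivalent to: item ("," item)*
  def fooList: Parser[List[String]] = rep1sep(item, ",")
}
```

This is the same idea as the <* ... ; "," *> notation: the separator and the element are two arguments of a single list construct, so the element is never restated.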

Resources