EBNF to Scala parser combinator - parsing

I have the following EBNF that I want to parse:
PostfixExp -> PrimaryExp ( "[" Exp "]"
                         | "." id "(" ExpList ")"
                         | "." length )*
And this is what I got:
def postfixExp: Parser[Expression] = (
  primaryExp ~ rep(
      "[" ~ expression ~ "]"
    | "." ~ ident ~ "(" ~ repsep(expression, ",") ~ ")"
    | "." ~ "length") ^^ {
    case primary ~ list => list.foldLeft(primary)((prim, post) =>
      post match {
        case "[" ~ length ~ "]" =>
          ElementExpression(prim, length.asInstanceOf[Expression])
        case "." ~ function ~ "(" ~ arguments ~ ")" =>
          CallMethodExpression(prim, function.asInstanceOf[String],
            arguments.asInstanceOf[List[Expression]])
        case _ => LengthExpression(prim)
      }
    )
  })
But I would like to know if there is a better way, preferably without having to resort to casting (asInstanceOf).

I would do it like this:
type E = Expression

// flatten2 is assumed here as a helper that adapts an (A, B) => C function
// to the A ~ B pairs produced by ~; it is not in the standard library.
def postfixExp = primaryExp ~ rep(
    "[" ~> expr <~ "]" ^^ { e => ElementExpression(_: E, e) }
  | "." ~ "length" ^^^ LengthExpression
  | "." ~> ident ~ ("(" ~> repsep(expr, ",") <~ ")") ^^ flatten2 { (f, args) =>
      CallMethodExpression(_: E, f, args)
    }
) ^^ flatten2 { (e, ls) => collapse(ls)(e) }

def expr: Parser[E] = ...

def collapse(ls: List[E => E])(e: E) = {
  ls.foldLeft(e) { (e, f) => f(e) }
}
I've shortened expression to expr and added the type alias E, both for brevity.
The trick that I'm using here to avoid the ugly case analysis is to return a function value from within the inner production. This function takes an Expression (which will be the primary) and then returns a new Expression based on the first. This unifies the two cases of dot-dispatch and bracketed expressions. Finally, the collapse method is used to merge the linear List of function values into a proper AST, starting with the specified primary expression.
Note that LengthExpression is just returned as a value (using ^^^) from its respective production. This works because the companion objects of case classes (assuming that LengthExpression is indeed a case class) extend the corresponding function type, delegating to the constructor. Thus, the function represented by LengthExpression takes a single Expression and returns a new instance of LengthExpression, precisely satisfying our needs for the higher-order tree construction.
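For reference, here is a self-contained sketch of the same idea that compiles against scala-parser-combinators on its own. The toy AST and rule names are invented for the example, and flatten2 is replaced by an explicit pattern match so no extra helper is needed:

import scala.util.parsing.combinator.RegexParsers

object PostfixDemo extends RegexParsers {
  sealed trait Expression
  case class Ident(name: String) extends Expression
  case class ElementExpression(target: Expression, index: Expression) extends Expression
  case class LengthExpression(target: Expression) extends Expression
  case class CallMethodExpression(target: Expression, method: String,
                                  args: List[Expression]) extends Expression

  def ident: Parser[String] = """[a-zA-Z]\w*""".r

  def primaryExp: Parser[Expression] = ident ^^ Ident

  // Each postfix production yields an Expression => Expression that
  // wraps whatever expression precedes it.
  def postfix: Parser[Expression => Expression] = (
      "[" ~> expr <~ "]" ^^ { e => ElementExpression(_: Expression, e) }
    | "." ~ "length" ^^^ LengthExpression
    | "." ~> ident ~ ("(" ~> repsep(expr, ",") <~ ")") ^^ {
        case f ~ args => CallMethodExpression(_: Expression, f, args)
      }
  )

  // Fold the list of postfix functions over the primary to build the AST.
  def postfixExp: Parser[Expression] =
    primaryExp ~ rep(postfix) ^^ { case e ~ fs => fs.foldLeft(e)((acc, f) => f(acc)) }

  def expr: Parser[Expression] = postfixExp
}

Parsing "a.foo(b, c)[d].length" with PostfixDemo.parseAll(PostfixDemo.postfixExp, ...) then builds the tree left to right: the method call wraps the identifier, the bracket expression wraps the call, and the length access wraps the whole thing.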

Related

Building a Solidity parser in Rust, hitting an `expression can not fail` and `recursion` error

I am building a Solidity parser with Rust. I am using the Pest Parser crate and am setting up my grammar.pest file to be very similar to the Solidity repo's Lexer/Parser.
I am hitting two errors. The first is an error saying:
|
41 | callArgumentList = {LParen ~ ((expression ~ (Comma ~ expression)*)? | LBrace ~ (namedArgument ~ (Comma ~ namedArgument)*)? ~ RBrace) ~ RParen
| ^-----------------------------------^
|
= expression cannot fail; following choices cannot be reached
This error is showing for this rule defined in my grammar.pest.
callArgumentList = {LParen ~ ((expression ~ (Comma ~ expression)*)? | LBrace ~ (namedArgument ~ (Comma ~ namedArgument)*)? ~ RBrace) ~ RParen}
The above syntax is trying to replicate this rule in the Solidity official parser.
I believe my syntax matches the Solidity repo's syntax for this rule, but I am not sure what is going on here.
Secondly, I am getting a recursion error.
--> 131:6
|
131 | (expression ~ LBrack ~ expression? ~ RBrack)
| ^--------^
|
= rule expression is left-recursive (expression -> expression); pest::prec_climber might be useful in this case
This second error comes from my grammar for an expression, which has to be recursive; it mirrors the corresponding rule in the Solidity repo. Below is my grammar.pest implementation.
expression = {
(expression ~ LBrack ~ expression? ~ RBrack)
| (expression ~ LBrack ~ expression? ~ Colon ~ expression? ~ RBrack)
| (expression ~ Period ~ (identifier | Address))
| (expression ~ LBrack ~ (namedArgument ~ (Comma ~ namedArgument)*)? ~ RBrack)
| (expression ~ callArgumentList)
| (Payable ~ callArgumentList)
| (Type ~ LParen ~ typeName ~ RParen)
| ((Inc | Dec | Not | BitNot | Delete | Sub) ~ expression)
| (expression ~ (Inc | Dec))
| (expression ~ Exp ~ expression)
| (expression ~ (Mul | Div | Mod) ~ expression)
| (expression ~ (Add | Sub) ~ expression)
| (expression ~ (Shl | Sar | Shr) ~ expression)
| (expression ~ BitAnd ~ expression)
| (expression ~ BitXor ~ expression)
| (expression ~ BitOr ~ expression)
| (expression ~ (LessThan | GreaterThan | LessThanOrEqual | GreaterThanOrEqual) ~ expression)
| (expression ~ (Equal | NotEqual) ~ expression)
| (expression ~ And ~ expression)
| (expression ~ Or ~ expression)
| (expression ~ Conditional ~ expression ~ Colon ~ expression)
| (expression ~ assignOp ~ expression)
| (New ~ typeName)
| (tupleExpression)
| (inlineArrayExpression)
}
Can someone help me figure out how to fix these two errors?
The easiest fix for direct left recursion is to do the usual refactoring for operator precedence by inserting rules and ordering according to precedence. For example (with many alts removed to simplify):
expression = { array }
array = { arraycolon ~ ( LBrack ~ expression? ~ RBrack ) * }
arraycolon = { qualified ~ ( LBrack ~ expression? ~ Colon ~ expression? ~ RBrack ) * }
qualified = { arraycomma ~ ( Period ~ identifier ) * }
arraycomma = { multexpr ~ ( LBrack ~ (namedArgument ~ (Comma ~ namedArgument)*)? ~ RBrack ) * }
multexpr = { addexpr ~ ( (Mul | Div | Mod) ~ expression ) * }
addexpr = { andexpr ~ ( (Add | Sub) ~ expression ) * }
andexpr = { orexpr ~ ( And ~ expression ) * }
orexpr = { identifier ~ ( Or ~ expression ) * }
namedArgument = { identifier ~ Colon ~ expression }
LBrack = #{ "[" }
RBrack = #{ "]" }
Colon = #{ ":" }
Period = #{ "." }
Comma = #{ "," }
Mul = #{ "*" }
Div = #{ "/" }
Mod = #{ "%" }
Add = #{ "+" }
Sub = #{ "-" }
And = #{ "&&" }
Or = #{ "||" }
identifier = { (ASCII_ALPHANUMERIC | "_" | "-")+ }
WHITESPACE = _{ " " | "\t" | NEWLINE }
COMMENT = _{ "#" ~ (!NEWLINE ~ ANY)* }
You can also do it using the usual refactoring that is in the Dragon book. Or, you can even use the native Pest feature to specify operator precedence.
The other problem is caused because the ?-operator is in the wrong place in ( expression ~ ( Comma ~ expression ) * )?. A similar problem is discussed here. The solution is a refactoring that removes the ?-operator, then reintroduces it in the correct place:
x : l (a? | b) r;
==> (eliminate ?-operator, useless parentheses)
x : l ( | a | b) r;
==> (add ?-operator)
x : l ( a | b )? r ;
Using this and regrouping, a possible solution is:
callArgumentList = { LParen ~ ( ( expression ~ ( Comma ~ expression ) * ) | ( LBrace ~ ( namedArgument ~ ( Comma ~ namedArgument ) * ) ? ~ RBrace ) )? ~ RParen }
The linked Solidity grammar is for Antlr, not Pest, and it uses a different parsing methodology. So there's no reason to expect the grammar to work unmodified in Pest.
Evidently, Antlr has different limitations. Unlike Pest, Antlr allows some left recursion by translating it into semantic predicates, a feature which doesn't appear to exist in Pest. Instead, Pest offers precedence climbing, which can be used to parse left-associative operators in algebraic expressions. That feature doesn't seem to be fully documented, but there are some examples which can be used as a model.
I'm not sure how Antlr handles the callArgumentList production, but since Antlr can backtrack, there are various possibilities. PEG grammars don't backtrack after success, and ? and * expressions always succeed because they can successfully match nothing. This is explained in the Pest book. So the error message is correct; the second alternative will never be used because the first one will always succeed, possibly by matching nothing. The intention is that the entire argument list be optional, so the optionality operator is misplaced. It should be (reformatted to avoid horizontal scrolling)
callArgumentList =
{
LParen ~
(
expression ~ (Comma ~ expression)*
|
LBrace ~ (namedArgument ~ (Comma ~ namedArgument)*)? ~ RBrace
)? ~
RParen
}
The difference is subtle -- all I did was to move the first ? up one level.
I don't know if that will actually work, though, because it's possible that the braced mapping can match expression. In that case, it might be necessary to put the alternatives in the opposite order to get a correct parse. (And that's not guaranteed to work, either. It really depends on the rest of the grammar.)
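As an aside, the same trap can be reproduced with Scala's parser combinators (the library used elsewhere on this page), because opt and rep also succeed on empty input. A minimal sketch with made-up rules:

import scala.util.parsing.combinator.RegexParsers

object DeadAltDemo extends RegexParsers {
  def num  = """[0-9]+""".r
  def word = """[a-z]+""".r

  // opt(num) succeeds on any input (possibly matching nothing),
  // so the word alternative is unreachable:
  def bad: Parser[Any] = opt(num) | word

  // Lifting the optionality above the alternation fixes it,
  // mirroring the grammar fix above:
  def good: Parser[Any] = opt(num | word)
}

Running DeadAltDemo.parse(DeadAltDemo.bad, "abc") returns None without consuming anything, while good returns Some("abc").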

Parser Combinators - Simple grammar

I am trying to use parser combinators in Scala on a simple grammar that I have copied from a book. When I run the following code it stops immediately after the first token has been parsed with the error
[1.3] failure: string matching regex '\z' expected but '+' found
I can see why things go wrong. The first token is an expression, and therefore it is the only thing that needs to be parsed according to the grammar. However, I do not know a good way to fix it.
object SimpleParser extends RegexParsers
{
  def Name = """[a-zA-Z]+""".r
  def Int = """[0-9]+""".r

  def Main: Parser[Any] = Expr

  def Expr: Parser[Any] =
    (
        Term
      | Term <~ "+" ~> Expr
      | Term <~ "-" ~> Expr
    )

  def Term: Parser[Any] =
    (
        Factor
      | Factor <~ "*" ~> Term
    )

  def Factor: Parser[Any] =
    (
        Name
      | Int
      | "-" ~> Int
      | "(" ~> Expr <~ ")"
      | "let" ~> Name <~ "=" ~> Expr <~ "in" ~> Expr <~ "end"
    )

  def main(args: Array[String])
  {
    var input = "2 + 2"
    println(input)
    println(parseAll(Main, input))
  }
}
Factor <~ "*" ~> Term means Factor.<~("*" ~> Term), so the whole right part is dropped.
Use Factor ~ "*" ~ Term ^^ { case f ~ _ ~ t => ??? } or rep1sep:
scala> :paste
// Entering paste mode (ctrl-D to finish)
import scala.util.parsing.combinator.RegexParsers
object SimpleParser extends RegexParsers
{
def Name = """[a-zA-Z]+""".r
def Int = """[0-9]+""".r
def Main:Parser[Any] = Expr
def Expr:Parser[Any] = rep1sep(Term, "+" | "-")
def Term:Parser[Any] = rep1sep(Factor, "*")
def Factor:Parser[Any] =
  (
    "let" ~> Name ~ "=" ~ Expr ~ "in" ~ Expr <~ "end" ^^ { case n ~ _ ~ e1 ~ _ ~ e2 => (n, e1, e2) }
  | Int
  | "-" ~> Int
  | "(" ~> Expr <~ ")"
  | Name
  )
}
SimpleParser.parseAll(SimpleParser.Main, "2 + 2")
// Exiting paste mode, now interpreting.
import scala.util.parsing.combinator.RegexParsers
defined module SimpleParser
res1: SimpleParser.ParseResult[Any] = [1.6] parsed: List(List(2), List(2))
The second alternative of def Term:Parser[Any] = Factor | Factor <~ "*" ~> Term is useless: the first alternative, Factor, succeeds (leaving the rest as unconsumed input) on any input that the second alternative could parse, so the second alternative is never tried.
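If the goal is a real AST rather than nested lists, chainl1 is a natural next step: it parses a left-associative chain of operands and operators and folds it for you. A minimal sketch, with an invented Ast type (not from the original question):

import scala.util.parsing.combinator.RegexParsers

object AstParser extends RegexParsers {
  sealed trait Ast
  case class Num(value: Int) extends Ast
  case class BinOp(op: String, left: Ast, right: Ast) extends Ast

  def num: Parser[Ast] = """[0-9]+""".r ^^ { s => Num(s.toInt) }

  // chainl1 applies the operator parser between operands and folds
  // left-associatively, so "2 - 3 - 4" groups as (2 - 3) - 4.
  def addOp: Parser[(Ast, Ast) => Ast] =
    ("+" | "-") ^^ { op => (l: Ast, r: Ast) => BinOp(op, l, r) }
  def mulOp: Parser[(Ast, Ast) => Ast] =
    ("*" | "/") ^^ { op => (l: Ast, r: Ast) => BinOp(op, l, r) }

  def factor: Parser[Ast] = num | "(" ~> expr <~ ")"
  def term: Parser[Ast]   = chainl1(factor, mulOp)
  def expr: Parser[Ast]   = chainl1(term, addOp)
}

// AstParser.parseAll(AstParser.expr, "2 + 2 * 3")
// => parsed: BinOp(+,Num(2),BinOp(*,Num(2),Num(3)))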

Scala Parse floating point numbers with StandardTokenParsers

This is a grammar for a System of first order ODEs:
system ::= equation { equation }
equation ::= variable "=" (arithExpr | param) "\n"
variable ::= algebraicVar | stateVar
algebraicVar ::= identifier
stateVar ::= algebraicVar'
arithExpr ::= term { "+" term | "-" term }
term ::= factor { "*" factor | "/" factor }
factor ::= algebraicVar
| powerExpr
| floatingPointNumber
| functionCall
| "(" arithExpr ")"
powerExpr ::= arithExpr {"^" arithExpr}
Notes:
An identifier should be a valid Scala Identifier.
A stateVar is an algebraicVar followed by one apostrophe (x' denotes the first derivative of x --with respect to time--)
I haven't coded anything for a functionCall but I mean something like Cos[Omega]
This is what I have already
package tests

import scala.util.parsing.combinator.lexical.StdLexical
import scala.util.parsing.combinator.syntactical.StandardTokenParsers
import scala.util.parsing.combinator._
import scala.util.parsing.combinator.JavaTokenParsers
import token._

object Parser1 extends StandardTokenParsers {
  lexical.delimiters ++= List("(", ")", "=", "+", "-", "*", "/", "\n")
  lexical.reserved ++= List(
    "Log", "Ln", "Exp",
    "Sin", "Cos", "Tan",
    "Cot", "Sec", "Csc",
    "Sqrt", "Param", "'")

  def system: Parser[Any] = repsep(equation, "\n")
  def equation: Parser[Any] = variable ~ "=" ~ ("Param" | arithExpr)
  def variable: Parser[Any] = stateVar | algebraicVar
  def algebraicVar: Parser[Any] = ident
  def stateVar: Parser[Any] = algebraicVar ~ "\'"
  def arithExpr: Parser[Any] = term ~ rep("+" ~ term | "-" ~ term)
  def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor)
  def factor: Parser[Any] = algebraicVar | floatingPointNumber | "(" ~ arithExpr ~ ")"
  def powerExpr: Parser[Any] = arithExpr ~ rep("^" ~ arithExpr)

  def main(args: Array[String]) {
    val code = "x1 = 2.5 * x2"
    equation(new lexical.Scanner(code)) match {
      case Success(msg, _) => println(msg)
      case Failure(msg, _) => println(msg)
      case Error(msg, _) => println(msg)
    }
  }
}
However this line doesn't work:
def factor: Parser[Any] = algebraicVar | floatingPointNumber | "(" ~ arithExpr ~ ")"
Because I haven't defined what a floatingPointNumber is. First I tried to mix in JavaTokenParsers, but then I get conflicting definitions. The reason I'm trying to use StandardTokenParsers instead of JavaTokenParsers is to be able to use a set of predefined keywords with
lexical.reserved ++= List(
"Log", "Ln", "Exp",
"Sin", "Cos", "Tan",
"Cot", "Sec", "Csc",
"Sqrt", "Param", "'")
I asked this on the Scala-user mailing list (https://groups.google.com/forum/?fromgroups#!topic/scala-user/KXlfGauGR9Q) but I haven't received enough replies. Thanks a lot for helping.
Given that mixing in JavaTokenParsers doesn't work, you might try mixing in RegexParsers instead and copying just the definition of floatingPointNumber from the source for JavaTokenParsers.
That definition, at least in this version, is simply a regex:
def floatingPointNumber: Parser[String] =
"""-?(\d+(\.\d*)?|\d*\.\d+)([eE][+-]?\d+)?[fFdD]?""".r

LPeg grammar oddity

Part of a Lua application of mine is a search bar, and I'm trying to make it understand boolean expressions. I'm using LPeg, but the current grammar gives a strange result:
> re, yajl = require're', require'yajl'
> querypattern = re.compile[=[
QUERY <- ( EXPR / TERM )? S? !. -> {}
EXPR <- S? TERM ( (S OPERATOR)? S TERM )+ -> {}
TERM <- KEYWORD / ( "(" S? EXPR S? ")" ) -> {}
KEYWORD <- ( WORD {":"} )? ( WORD / STRING )
WORD <- {[A-Za-z][A-Za-z0-9]*}
OPERATOR <- {("AND" / "XOR" / "NOR" / "OR")}
STRING <- ('"' {[^"]*} '"' / "'" {[^']*} "'") -> {}
S <- %s+
]=]
> = yajl.to_string(lpeg.match(querypattern, "bar foo"))
"bar"
> = yajl.to_string(lpeg.match(querypattern, "name:bar AND foo"))
> = yajl.to_string(lpeg.match(querypattern, "name:bar AND foo"))
"name"
> = yajl.to_string(lpeg.match(querypattern, "name:'bar' AND foo"))
"name"
> = yajl.to_string(lpeg.match(querypattern, "bar AND (name:foo OR place:here)"))
"bar"
It only parses the first token, and I cannot figure out why it does this. As far as I know, a partial match is impossible because of the !. at the end of the starting non-terminal. How can I fix this?
The match is getting the entire string, but the captures are wrong. Note that '->' has a higher precedence than concatenation, so you probably need parentheses around things like this:
EXPR <- S? ( TERM ( (S OPERATOR)? S TERM )+ ) -> {}

Parsing Scheme using Scala parser combinators

I'm writing a small scheme interpreter in Scala and I'm running into problems parsing lists in Scheme. My code parses lists that contain multiple numbers, identifiers, and booleans, but it chokes if I try to parse a list containing multiple strings or lists. What am I missing?
Here's my parser:
class SchemeParsers extends RegexParsers {

  // Scheme boolean #t and #f translate to Scala's true and false
  def bool: Parser[Boolean] =
    ("#t" | "#f") ^^ { case "#t" => true; case "#f" => false }

  // A Scheme identifier allows alphanumeric chars, some symbols, and
  // can't start with a digit
  def id: Parser[String] =
    """[a-zA-Z=*+/<>!\?][a-zA-Z0-9=*+/<>!\?]*""".r ^^ { case s => s }

  // This interpreter only accepts numbers as integers
  def num: Parser[Int] = """-?\d+""".r ^^ { case s => s toInt }

  // A string can have any character except ", and is wrapped in "
  def str: Parser[String] = '"' ~> """[^""]*""".r <~ '"' ^^ { case s => s }

  // A Scheme list is a series of expressions wrapped in ()
  def list: Parser[List[Any]] =
    '(' ~> rep(expr) <~ ')' ^^ { s: List[Any] => s }

  // A Scheme expression contains any of the other constructions
  def expr: Parser[Any] = id | str | num | bool | list ^^ { case s => s }
}
As was correctly pointed out by @Gabe, you left some white space unhandled:
scala> object SchemeParsers extends RegexParsers {
|
| private def space = regex("[ \\n]*".r)
|
| // Scheme boolean #t and #f translate to Scala's true and false
| private def bool : Parser[Boolean] =
| ("#t" | "#f") ^^ {case "#t" => true; case "#f" => false}
|
| // A Scheme identifier allows alphanumeric chars, some symbols, and
| // can't start with a digit
| private def id : Parser[String] =
| """[a-zA-Z=*+/<>!\?][a-zA-Z0-9=*+/<>!\?]*""".r
|
| // This interpreter only accepts numbers as integers
| private def num : Parser[Int] = """-?\d+""".r ^^ {case s => s toInt}
|
| // A string can have any character except ", and is wrapped in "
| private def str : Parser[String] = '"' ~> """[^""]*""".r <~ '"' <~ space ^^ {case s => s}
|
| // A Scheme list is a series of expressions wrapped in ()
| private def list : Parser[List[Any]] =
| '(' ~> space ~> rep(expr) <~ ')' <~ space ^^ {s: List[Any] => s}
|
| // A Scheme expression contains any of the other constructions
| private def expr : Parser[Any] = id | str | num | bool | list ^^ {case s => s}
|
| def parseExpr(str: String) = parse(expr, str)
| }
defined module SchemeParsers
scala> SchemeParsers.parseExpr("""(("a" "b") ("a" "b"))""")
res12: SchemeParsers.ParseResult[Any] = [1.22] parsed: List(List(a, b), List(a, b))
scala> SchemeParsers.parseExpr("""("a" "b" "c")""")
res13: SchemeParsers.ParseResult[Any] = [1.14] parsed: List(a, b, c)
scala> SchemeParsers.parseExpr("""((1) (1 2) (1 2 3))""")
res14: SchemeParsers.ParseResult[Any] = [1.20] parsed: List(List(1), List(1, 2), List(1, 2, 3))
The only problem with the code is your usage of characters instead of strings. Below, I removed the redundant ^^ { case s => s } and replaced all characters with strings. I'll further discuss this issue below.
class SchemeParsers extends RegexParsers {

  // Scheme boolean #t and #f translate to Scala's true and false
  def bool: Parser[Boolean] =
    ("#t" | "#f") ^^ { case "#t" => true; case "#f" => false }

  // A Scheme identifier allows alphanumeric chars, some symbols, and
  // can't start with a digit
  def id: Parser[String] =
    """[a-zA-Z=*+/<>!\?][a-zA-Z0-9=*+/<>!\?]*""".r ^^ { case s => s }

  // This interpreter only accepts numbers as integers
  def num: Parser[Int] = """-?\d+""".r ^^ { case s => s toInt }

  // A string can have any character except ", and is wrapped in "
  def str: Parser[String] = "\"" ~> """[^""]*""".r <~ "\""

  // A Scheme list is a series of expressions wrapped in ()
  def list: Parser[List[Any]] =
    "(" ~> rep(expr) <~ ")" ^^ { s: List[Any] => s }

  // A Scheme expression contains any of the other constructions
  def expr: Parser[Any] = id | str | num | bool | list
}
All Parsers subclasses get an implicit accept for their Elem type. So, if the basic element is a Char, as in RegexParsers, there's an implicit accept action for characters, which is what happens here for the symbols (, ) and ", which are characters in your code.
What RegexParsers does automatically is skip white space (defined as protected val whiteSpace = """\s+""".r, so you could override that) at the beginning of any String or Regex parser. It also takes care of moving the positioning cursor past the white space for error messages. The implicit Char parsers get no such treatment, which is why your lists stopped parsing as soon as white space appeared between elements.
One consequence of this that you seem not to have realized is that " a string beginning with a space" will have its prefix space removed from the parsed output, which is very unlikely to be something you want. :-)
Also, since \s includes new lines, a new line will be acceptable before any identifier, which may or may not be what you want.
You may disable space skipping in your parser as a whole by overriding skipWhitespace. On the other hand, the default skipWhitespace tests whiteSpace's length, so you could potentially turn skipping on and off just by manipulating the value of whiteSpace throughout the parsing process.
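Both knobs look like this in practice (a minimal sketch; the object names are mine):

import scala.util.parsing.combinator.RegexParsers

object NoNewlineParsers extends RegexParsers {
  // Skip spaces and tabs, but leave newlines visible to the grammar.
  override protected val whiteSpace = """[ \t]+""".r
}

object ExactParsers extends RegexParsers {
  // Turn off whitespace skipping entirely.
  override def skipWhitespace = false
}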
