Scala parsers: printing the input in the output of the log utility

I am using the log parser utility to trace the parsing.
Scala code:
import scala.util.parsing.combinator.JavaTokenParsers
object Arith extends JavaTokenParsers with App {
  def expr: Parser[Any] = log(term ~ rep("+" ~ term | "-" ~ term))("expr")
  def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor)
  def factor: Parser[Any] = floatingPointNumber | "(" ~ expr ~ ")"

  println(parseAll(expr, "2 * (3 + 7)"))
}
Output:
trying expr at scala.util.parsing.input.CharSequenceReader@13a317a
trying expr at scala.util.parsing.input.CharSequenceReader@14c1103
expr --> [1.11] parsed: ((3~List())~List((+~(7~List()))))
expr --> [1.12] parsed: ((2~List((*~(((~((3~List())~List((+~(7~List())))))~)))))~List())
[1.12] parsed: ((2~List((*~(((~((3~List())~List((+~(7~List())))))~)))))~List())
The input is printed as scala.util.parsing.input.CharSequenceReader@13a317a. Is there a way to print a string representation of the input instead, like "2 * (3 + 7)"?

Overriding log solved my problem.
Scala Snippet:
import scala.util.parsing.combinator.JavaTokenParsers
object Arith extends JavaTokenParsers with App {
  override def log[T](p: => Parser[T])(name: String): Parser[T] = Parser { in =>
    // print the remaining input rather than the reader's default toString
    def prt(x: Input) = x.source.toString.drop(x.offset)
    println("trying " + name + " at " + prt(in))
    val r = p(in)
    println(name + " --> " + r + " next: " + prt(r.next))
    r
  }

  def expr: Parser[Any] = log(term ~ rep("+" ~ term | "-" ~ term))("expr")
  def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor)
  def factor: Parser[Any] = floatingPointNumber | "(" ~ expr ~ ")"

  println(parseAll(expr, "(3+7)*2"))
}
Output:
trying expr at (3+7)*2
trying expr at 3+7)*2
expr --> [1.5] parsed: ((3~List())~List((+~(7~List())))) next: )*2
expr --> [1.8] parsed: (((((~((3~List())~List((+~(7~List())))))~))~List((*~2)))~List()) next:
[1.8] parsed: (((((~((3~List())~List((+~(7~List())))))~))~List((*~2)))~List())

Parser Combinators - Simple grammar

I am trying to use parser combinators in Scala on a simple grammar that I have copied from a book. When I run the following code, it stops immediately after the first token has been parsed, with the error
[1.3] failure: string matching regex '\z' expected but '+' found
I can see why things go wrong: the first token is an expression, and therefore it is the only thing that needs to be parsed according to the grammar. However, I do not know a good way to fix it.
import scala.util.parsing.combinator.RegexParsers

object SimpleParser extends RegexParsers {
  def Name = """[a-zA-Z]+""".r
  def Int = """[0-9]+""".r

  def Main: Parser[Any] = Expr

  def Expr: Parser[Any] =
    ( Term
    | Term <~ "+" ~> Expr
    | Term <~ "-" ~> Expr
    )

  def Term: Parser[Any] =
    ( Factor
    | Factor <~ "*" ~> Term
    )

  def Factor: Parser[Any] =
    ( Name
    | Int
    | "-" ~> Int
    | "(" ~> Expr <~ ")"
    | "let" ~> Name <~ "=" ~> Expr <~ "in" ~> Expr <~ "end"
    )

  def main(args: Array[String]) {
    val input = "2 + 2"
    println(input)
    println(parseAll(Main, input))
  }
}
Factor <~ "*" ~> Term means Factor.<~("*" ~> Term), so the whole right part is dropped.
Use Factor ~ "*" ~ Term ^^ { case f ~ _ ~ t => ??? } or rep1sep:
scala> :paste
// Entering paste mode (ctrl-D to finish)
import scala.util.parsing.combinator.RegexParsers
object SimpleParser extends RegexParsers {
  def Name = """[a-zA-Z]+""".r
  def Int = """[0-9]+""".r
  def Main: Parser[Any] = Expr
  def Expr: Parser[Any] = rep1sep(Term, "+" | "-")
  def Term: Parser[Any] = rep1sep(Factor, "*")
  def Factor: Parser[Any] =
    ( "let" ~> Name ~ "=" ~ Expr ~ "in" ~ Expr <~ "end" ^^ { case n ~ _ ~ e1 ~ _ ~ e2 => (n, e1, e2) }
    | Int
    | "-" ~> Int
    | "(" ~> Expr <~ ")"
    | Name
    )
}
SimpleParser.parseAll(SimpleParser.Main, "2 + 2")
// Exiting paste mode, now interpreting.
import scala.util.parsing.combinator.RegexParsers
defined module SimpleParser
res1: SimpleParser.ParseResult[Any] = [1.6] parsed: List(List(2), List(2))
The second alternative of def Term: Parser[Any] = Factor | Factor <~ "*" ~> Term is useless: the first alternative, Factor, already succeeds (with non-empty remaining input) on any Input that the second alternative, Factor <~ "*" ~> Term, could parse, so the second one is never reached.
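If you prefer explicit alternatives over rep1sep, putting the longer alternative first also avoids the problem, because | tries alternatives in order and commits to the first one that succeeds. A minimal sketch (the object name ReorderedParser is made up here, and Factor is reduced to integers for brevity):

```scala
import scala.util.parsing.combinator.RegexParsers

object ReorderedParser extends RegexParsers {
  def Int = """[0-9]+""".r
  // Try the longer "Factor * Term" alternative first; a bare Factor is
  // only the fallback, so input like "2*3" no longer stops after "2".
  def Term: Parser[Any] =
    ( Factor ~ "*" ~ Term ^^ { case f ~ _ ~ t => (f, t) }
    | Factor
    )
  def Factor: Parser[Any] = Int
}
```

With this ordering, ReorderedParser.parseAll(ReorderedParser.Term, "2*3") consumes the whole input, whereas with Factor first, parseAll commits to the bare Factor and then fails at the '*'.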

Parse != in Scala

I need to parse statements of the form
var1!=var2
var1==var2
and so on. I have the following construct:
lazy val Line: Parser[Any] =
  (Expr ~ "!=" ~ Expr) ^^ { e => SMT("(not (= " + e._1._1 + " " + e._2 + "))") } |
  (Expr ~ "==" ~ Expr) ^^ { e => SMT("(" + e._1._2 + " " + e._1._1 + " " + e._2 + ")") }
The second part, for "==", works just fine, returning (== var1 var2), but the first part just does not parse: whatever I try in place of "!=", neither "!= ", " !=", nor " != " is recognized.
Of course I can replace the "!=" before I hand it to the parser, but is there a more elegant way?
Have a look at this minimal example (Scala 2.9.2):
import scala.util.parsing.combinator.syntactical._
import scala.util.parsing.combinator._
sealed trait ASTNode
case class Eq(v1: String, v2: String) extends ASTNode
case class Not(n: ASTNode) extends ASTNode
object MyParser extends StandardTokenParsers {
  lexical.delimiters += ("==", "!=")

  lazy val line = (
      (ident ~ ("==" ~> ident)) ^^ { case e1 ~ e2 => Eq(e1, e2) }
    | (ident ~ ("!=" ~> ident)) ^^ { case e1 ~ e2 => Not(Eq(e1, e2)) }
  )

  def main(code: String) = {
    val tokens = new lexical.Scanner(code)
    line(tokens) match {
      case Success(tree, _) => println(tree)
      case e: NoSuccess     => Console.err.println(e)
    }
  }
}
MyParser.main("x == y")
MyParser.main("x != y")

Scala Parse floating point numbers with StandardTokenParsers

This is a grammar for a system of first-order ODEs:
system ::= equation { equation }
equation ::= variable "=" (arithExpr | param) "\n"
variable ::= algebraicVar | stateVar
algebraicVar ::= identifier
stateVar ::= algebraicVar'
arithExpr ::= term { "+" term | "-" term }
term ::= factor { "*" factor | "/" factor }
factor ::= algebraicVar
| powerExpr
| floatingPointNumber
| functionCall
| "(" arithExpr ")"
powerExpr ::= arithExpr {"^" arithExpr}
Notes:
An identifier should be a valid Scala Identifier.
A stateVar is an algebraicVar followed by one apostrophe (x' denotes the first derivative of x with respect to time).
I haven't coded anything for a functionCall but I mean something like Cos[Omega]
This is what I have already
package tests
import scala.util.parsing.combinator.lexical.StdLexical
import scala.util.parsing.combinator.syntactical.StandardTokenParsers
import scala.util.parsing.combinator._
import scala.util.parsing.combinator.JavaTokenParsers
import token._
object Parser1 extends StandardTokenParsers {
  lexical.delimiters ++= List("(", ")", "=", "+", "-", "*", "/", "\n")
  lexical.reserved ++= List(
    "Log", "Ln", "Exp",
    "Sin", "Cos", "Tan",
    "Cot", "Sec", "Csc",
    "Sqrt", "Param", "'")

  def system: Parser[Any] = repsep(equation, "\n")
  def equation: Parser[Any] = variable ~ "=" ~ ("Param" | arithExpr)
  def variable: Parser[Any] = stateVar | algebraicVar
  def algebraicVar: Parser[Any] = ident
  def stateVar: Parser[Any] = algebraicVar ~ "'"
  def arithExpr: Parser[Any] = term ~ rep("+" ~ term | "-" ~ term)
  def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor)
  def factor: Parser[Any] = algebraicVar | floatingPointNumber | "(" ~ arithExpr ~ ")"
  def powerExpr: Parser[Any] = arithExpr ~ rep("^" ~ arithExpr)

  def main(args: Array[String]) {
    val code = "x1 = 2.5 * x2"
    equation(new lexical.Scanner(code)) match {
      case Success(msg, _) => println(msg)
      case Failure(msg, _) => println(msg)
      case Error(msg, _)   => println(msg)
    }
  }
}
However this line doesn't work:
def factor: Parser[Any] = algebraicVar | floatingPointNumber | "(" ~ arithExpr ~ ")"
Because I haven't defined what a floatingPointNumber is. First I tried to mix in JavaTokenParsers, but then I get conflicting definitions. The reason I'm trying to use StandardTokenParsers instead of JavaTokenParsers is to be able to use a set of predefined keywords with
lexical.reserved ++= List(
"Log", "Ln", "Exp",
"Sin", "Cos", "Tan",
"Cot", "Sec", "Csc",
"Sqrt", "Param", "'")
I asked this on the Scala-user mailing list (https://groups.google.com/forum/?fromgroups#!topic/scala-user/KXlfGauGR9Q) but I haven't received enough replies. Thanks a lot for helping.
Given that mixing in JavaTokenParsers doesn't work, you might try mixing in RegexParsers instead and copying just the definition of floatingPointNumber from the source for JavaTokenParsers.
That definition, at least in this version, is simply a regex:
def floatingPointNumber: Parser[String] =
"""-?(\d+(\.\d*)?|\d*\.\d+)([eE][+-]?\d+)?[fFdD]?""".r
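A minimal sketch of that approach, assuming the whole parser is switched to RegexParsers (the object name Parser2 and the cut-down grammar here are illustrative, not the asker's full grammar; keywords then have to be handled as ordinary string literals rather than via lexical.reserved):

```scala
import scala.util.parsing.combinator.RegexParsers

object Parser2 extends RegexParsers {
  // Copied verbatim from the JavaTokenParsers source
  def floatingPointNumber: Parser[String] =
    """-?(\d+(\.\d*)?|\d*\.\d+)([eE][+-]?\d+)?[fFdD]?""".r

  def ident: Parser[String] = """[a-zA-Z_]\w*""".r
  def equation: Parser[Any] = ident ~ "=" ~ ("Param" | arithExpr)
  def arithExpr: Parser[Any] = term ~ rep("+" ~ term | "-" ~ term)
  def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor)
  def factor: Parser[Any] = floatingPointNumber | ident | "(" ~> arithExpr <~ ")"
}
```

With this, Parser2.parseAll(Parser2.equation, "x1 = 2.5 * x2") succeeds, since RegexParsers skips whitespace before each literal and regex automatically.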

Expressions in a CoCo to ANTLR translator

I'm parsing CoCo/R grammars in a utility to automate CoCo -> ANTLR translation. The core ANTLR grammar is:
rule '=' expression '.' ;
expression
: term ('|' term)*
-> ^( OR_EXPR term term* )
;
term
: (factor (factor)*)? ;
factor
: symbol
| '(' expression ')'
-> ^( GROUPED_EXPR expression )
| '[' expression']'
-> ^( OPTIONAL_EXPR expression)
| '{' expression '}'
-> ^( SEQUENCE_EXPR expression)
;
symbol
: IF_ACTION
| ID (ATTRIBUTES)?
| STRINGLITERAL
;
My problem is with constructions such as these:
CS = { ExternAliasDirective }
{ UsingDirective }
EOF .
CS results in an AST with an OR_EXPR node although no '|' character actually appears. I'm sure this is due to the definition of expression, but I cannot see any other way to write the rules.
I did experiment with this to resolve the ambiguity.
// explicitly test for the presence of an '|' character
expression
@init { bool ored = false; }
  : term { ored = (input.LT(1).Type == OR); } (OR term)*
  -> {ored}? ^(OR_EXPR term term*)
  -> ^(LIST term term*)
  ;
It works but the hack reinforces my conviction that something fundamental is wrong.
Any tips much appreciated.
Your rule:
expression
: term ('|' term)*
-> ^( OR_EXPR term term* )
;
always causes the rewrite rule to create a tree with a root of type OR_EXPR. You can create "sub rewrite rules" like this:
expression
: (term -> REWRITE_RULE_X) ('|' term -> ^(REWRITE_RULE_Y))*
;
And to resolve the ambiguity in your grammar, it's easiest to enable global backtracking which can be done in the options { ... } section of your grammar.
A quick demo:
grammar CocoR;
options {
output=AST;
backtrack=true;
}
tokens {
RULE;
GROUP;
SEQUENCE;
OPTIONAL;
OR;
ATOMS;
}
parse
: rule EOF -> rule
;
rule
: ID '=' expr* '.' -> ^(RULE ID expr*)
;
expr
: (a=atoms -> $a) ('|' b=atoms -> ^(OR $expr $b))*
;
atoms
: atom+ -> ^(ATOMS atom+)
;
atom
: ID
| '(' expr ')' -> ^(GROUP expr)
| '{' expr '}' -> ^(SEQUENCE expr)
| '[' expr ']' -> ^(OPTIONAL expr)
;
ID
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')*
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
with input:
CS = { ExternAliasDirective }
{ UsingDirective }
EOF .
produces the AST:
and the input:
foo = a | b ({c} | d [e f]) .
produces:
The class to test this:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
/*
String source =
"CS = { ExternAliasDirective } \n" +
"{ UsingDirective } \n" +
"EOF . ";
*/
String source = "foo = a | b ({c} | d [e f]) .";
ANTLRStringStream in = new ANTLRStringStream(source);
CocoRLexer lexer = new CocoRLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
CocoRParser parser = new CocoRParser(tokens);
CocoRParser.parse_return returnValue = parser.parse();
CommonTree tree = (CommonTree)returnValue.getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and with the output this class produces, I used the following website to create the AST-images: http://graph.gafol.net/
HTH
EDIT
To account for epsilon (empty string) in your OR expressions, you might try something (quickly tested!) like this:
expr
: (a=atoms -> $a) ( ( '|' b=atoms -> ^(OR $expr $b)
| '|' -> ^(OR $expr NOTHING)
)
)*
;
which parses the source:
foo = a | b | .
into the following AST:
The production for expression explicitly says that it can only return an OR_EXPR node. You can try something like:
expression
:
term
|
term ('|' term)+
-> ^( OR_EXPR term term* )
;
Further down, you could use:
term
: factor*;

Parsing Scheme using Scala parser combinators

I'm writing a small scheme interpreter in Scala and I'm running into problems parsing lists in Scheme. My code parses lists that contain multiple numbers, identifiers, and booleans, but it chokes if I try to parse a list containing multiple strings or lists. What am I missing?
Here's my parser:
import scala.util.parsing.combinator.RegexParsers

class SchemeParsers extends RegexParsers {

  // Scheme boolean #t and #f translate to Scala's true and false
  def bool: Parser[Boolean] =
    ("#t" | "#f") ^^ { case "#t" => true; case "#f" => false }

  // A Scheme identifier allows alphanumeric chars, some symbols, and
  // can't start with a digit
  def id: Parser[String] =
    """[a-zA-Z=*+/<>!\?][a-zA-Z0-9=*+/<>!\?]*""".r ^^ { case s => s }

  // This interpreter only accepts numbers as integers
  def num: Parser[Int] = """-?\d+""".r ^^ { case s => s.toInt }

  // A string can have any character except ", and is wrapped in "
  def str: Parser[String] = '"' ~> """[^""]*""".r <~ '"' ^^ { case s => s }

  // A Scheme list is a series of expressions wrapped in ()
  def list: Parser[List[Any]] =
    '(' ~> rep(expr) <~ ')' ^^ { s: List[Any] => s }

  // A Scheme expression contains any of the other constructions
  def expr: Parser[Any] = id | str | num | bool | list ^^ { case s => s }
}
As @Gabe correctly pointed out, you left some whitespace unhandled:
scala> object SchemeParsers extends RegexParsers {
|
| private def space = regex("[ \\n]*".r)
|
| // Scheme boolean #t and #f translate to Scala's true and false
| private def bool : Parser[Boolean] =
| ("#t" | "#f") ^^ {case "#t" => true; case "#f" => false}
|
| // A Scheme identifier allows alphanumeric chars, some symbols, and
| // can't start with a digit
| private def id : Parser[String] =
| """[a-zA-Z=*+/<>!\?][a-zA-Z0-9=*+/<>!\?]*""".r
|
| // This interpreter only accepts numbers as integers
| private def num : Parser[Int] = """-?\d+""".r ^^ {case s => s toInt}
|
| // A string can have any character except ", and is wrapped in "
| private def str : Parser[String] = '"' ~> """[^""]*""".r <~ '"' <~ space ^^ {case s => s}
|
| // A Scheme list is a series of expressions wrapped in ()
| private def list : Parser[List[Any]] =
| '(' ~> space ~> rep(expr) <~ ')' <~ space ^^ {s: List[Any] => s}
|
| // A Scheme expression contains any of the other constructions
| private def expr : Parser[Any] = id | str | num | bool | list ^^ {case s => s}
|
| def parseExpr(str: String) = parse(expr, str)
| }
defined module SchemeParsers
scala> SchemeParsers.parseExpr("""(("a" "b") ("a" "b"))""")
res12: SchemeParsers.ParseResult[Any] = [1.22] parsed: List(List(a, b), List(a, b))
scala> SchemeParsers.parseExpr("""("a" "b" "c")""")
res13: SchemeParsers.ParseResult[Any] = [1.14] parsed: List(a, b, c)
scala> SchemeParsers.parseExpr("""((1) (1 2) (1 2 3))""")
res14: SchemeParsers.ParseResult[Any] = [1.20] parsed: List(List(1), List(1, 2), List(1, 2, 3))
The only problem with the code is your usage of characters instead of strings. Below, I removed the redundant ^^ { case s => s } and replaced all characters with strings. I'll further discuss this issue below.
class SchemeParsers extends RegexParsers {

  // Scheme boolean #t and #f translate to Scala's true and false
  def bool: Parser[Boolean] =
    ("#t" | "#f") ^^ { case "#t" => true; case "#f" => false }

  // A Scheme identifier allows alphanumeric chars, some symbols, and
  // can't start with a digit
  def id: Parser[String] =
    """[a-zA-Z=*+/<>!\?][a-zA-Z0-9=*+/<>!\?]*""".r

  // This interpreter only accepts numbers as integers
  def num: Parser[Int] = """-?\d+""".r ^^ { case s => s.toInt }

  // A string can have any character except ", and is wrapped in "
  def str: Parser[String] = "\"" ~> """[^""]*""".r <~ "\""

  // A Scheme list is a series of expressions wrapped in ()
  def list: Parser[List[Any]] =
    "(" ~> rep(expr) <~ ")" ^^ { s: List[Any] => s }

  // A Scheme expression contains any of the other constructions
  def expr: Parser[Any] = id | str | num | bool | list
}
All Parsers have an implicit accept for their Elem type. So, if the basic element is a Char, as it is in RegexParsers, there's an implicit accept action for characters, and that is what happens here for the symbols (, ) and ", which are character literals in your code.
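The practical difference is that the character parsers produced by accept do not go through RegexParsers' whitespace handling, while String literals do. A tiny sketch (the object name CharVsString is made up here):

```scala
import scala.util.parsing.combinator.RegexParsers

object CharVsString extends RegexParsers {
  // '(' is lifted by the implicit accept for Elem (= Char): no whitespace skipping
  def charParen: Parser[Char] = '('
  // "(" goes through the String-literal conversion: leading whitespace is skipped
  def strParen: Parser[String] = "("
}
```

Here CharVsString.parse(CharVsString.charParen, "  (") fails at the first blank, while CharVsString.parse(CharVsString.strParen, "  (") succeeds.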
What RegexParsers does automatically is skip whitespace (defined as protected val whiteSpace = """\s+""".r, so you can override that) at the beginning of any String or Regex literal. It also takes care of moving the position cursor past the whitespace for error messages.
One consequence of this that you seem not to have realized is that " a string beginning with a space" will have its prefix space removed from the parsed output, which is very unlikely to be something you want. :-)
Also, since \s includes new lines, a new line will be acceptable before any identifier, which may or may not be what you want.
You may disable space skipping for the parser as a whole by overriding skipWhitespace. On the other hand, the default skipWhitespace tests whiteSpace's length, so you could potentially turn skipping on and off just by manipulating the value of whiteSpace throughout the parsing process.
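For example, a sketch of the override approach (the object name NoNewlineSkipper is made up): spaces and tabs are still skipped automatically, but newlines become ordinary tokens that the grammar must consume explicitly.

```scala
import scala.util.parsing.combinator.RegexParsers

object NoNewlineSkipper extends RegexParsers {
  // Skip only spaces and tabs; newlines are left for the rules to handle
  override val whiteSpace = """[ \t]+""".r

  def word: Parser[String] = """[a-z]+""".r
  // A line is a run of words terminated by an explicit newline token
  def line: Parser[List[String]] = rep(word) <~ "\n"
}
```

With the default whiteSpace, the "\n" literal would never be reachable, because the newline would already have been skipped.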
