Building expression parser with Dart petitparser, getting stuck on node visitor - dart

I've got more of my expression parser working (Dart PetitParser, using ExpressionBuilder to build the AST data structure). It appears to generate accurate ASTs for floats, parens, power, multiply, divide, add, subtract, and unary negative in front of both numbers and expressions. (The nodes are either literal strings or objects that carry a precedence and a List payload that gets walked and concatenated.)
I'm stuck now on visiting the nodes. I have clean access to the top node (thanks to Lukas), but I'm stuck on deciding whether or not to add a paren. For example, in 20+30*40, we don't need parens around 30*40, and the parse tree correctly has the node for this closer to the root so I'll hit it first during traversal. However, I don't seem to have enough data when looking at the 30*40 node to determine if it needs parens before going on to the 20+.. A very similar case would be (20+30)*40, which gets parsed correctly with 20+30 closer to the root, so once again, when visiting the 20+30 node I need to add parens before going on to *40.
This has to be a solved problem, but I never went to compiler school, so I know just enough about ASTs to be dangerous. What "a ha" am I missing?
// rip-common.dart:
import 'package:petitparser/petitparser.dart';
// import 'package:petitparser/debug.dart';
class Node {
int precedence;
List<dynamic> args;
Node([this.precedence = 0, this.args = const []]) {
// nodeList.add(this);
}
@override
String toString() => 'Node($precedence $args)';
String visit([int fromPrecedence = -1]) {
print('=== visiting $this ===');
var buf = StringBuffer();
var parens = (precedence > 0) &&
(fromPrecedence > 0) &&
(precedence < fromPrecedence);
print('<$fromPrecedence $precedence $parens>');
// for debugging:
var curlyOpen = '';
var curlyClose = '';
buf.write(parens ? '(' : curlyOpen);
for (var arg in args) {
if (arg is Node) {
buf.write(arg.visit(precedence));
} else if (arg is String) {
buf.write(arg);
} else {
print('not Node or String: $arg');
buf.write('$arg');
}
}
buf.write(parens ? ')' : curlyClose);
print('$buf for buf');
return '$buf';
}
}
class RIPParser {
Parser _make_parser() {
final builder = ExpressionBuilder();
var number = char('-').optional() &
digit().plus() &
(char('.') & digit().plus()).optional();
// precedence 5
builder.group()
..primitive(number.flatten().map((a) => Node(0, [a])))
..wrapper(char('('), char(')'), (l, a, r) => Node(0, [a]));
// negation is a prefix operator
// precedence 4
builder.group()..prefix(char('-').trim(), (op, a) => Node(4, [op, a]));
// power is right-associative
// precedence 3
builder.group()..right(char('^').trim(), (a, op, b) => Node(3, [a, op, b]));
// multiplication and addition are left-associative
// precedence 2
builder.group()
..left(char('*').trim(), (a, op, b) => Node(2, [a, op, b]))
..left(char('/').trim(), (a, op, b) => Node(2, [a, op, b]));
// precedence 1
builder.group()
..left(char('+').trim(), (a, op, b) => Node(1, [a, op, b]))
..left(char('-').trim(), (a, op, b) => Node(1, [a, op, b]));
final parser = builder.build().end();
return parser;
}
Result _result(String input) {
var parser = _make_parser(); // eventually cache
var result = parser.parse(input);
return result;
}
String parse(String input) {
var result = _result(input);
if (result.isFailure) {
return result.message;
} else {
print('result.value = ${result.value}');
return '$result';
}
}
String visit(String input) {
var result = _result(input);
var top_node = result.value; // result.isFailure ...
return top_node.visit();
}
}
// rip_cmd_example.dart
import 'dart:io';
import 'package:rip_common/rip_common.dart';
void main() {
print('start');
String input;
while (true) {
input = stdin.readLineSync();
if (input.isEmpty) {
break;
}
print(RIPParser().parse(input));
print(RIPParser().visit(input));
}
;
print('done');
}

As you've observed, the ExpressionBuilder already assembles the tree in the right precedence order based on the operator groups you've specified.
This also applies to the wrapping parens node created here: ..wrapper(char('('), char(')'), (l, a, r) => Node(0, [a])). If I emit parens only for this node, using the test var parens = precedence == 0 && args.length == 1 && args[0] is Node;, visiting gives back the input string for your example expressions.
Unless I am missing something, there should be no reason for you to track the precedence manually. I would also recommend that you create different node classes for the different operators: ValueNode, ParensNode, NegNode, PowNode, MulNode, ... A bit verbose, but much easier to understand what is going on, if each of them can just visit (print, evaluate, optimize, ...) itself.
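To illustrate the "one class per node kind" suggestion, here is a minimal sketch in Python rather than Dart. The class and method names (ValueNode, ParensNode, NegNode, BinaryNode, to_source, evaluate) are assumptions made for illustration, and a single BinaryNode stands in for PowNode, MulNode and friends to keep it short. Because the parens survive as their own node, printing needs no precedence bookkeeping:

```python
from dataclasses import dataclass
import operator

# Hypothetical operator table; "^" is treated as exponentiation.
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "^": operator.pow,
}


@dataclass
class ValueNode:
    text: str

    def to_source(self) -> str:
        return self.text

    def evaluate(self) -> float:
        return float(self.text)


@dataclass
class ParensNode:
    inner: object

    def to_source(self) -> str:
        # The only place where parentheses are ever written.
        return "(" + self.inner.to_source() + ")"

    def evaluate(self) -> float:
        return self.inner.evaluate()


@dataclass
class NegNode:
    operand: object

    def to_source(self) -> str:
        return "-" + self.operand.to_source()

    def evaluate(self) -> float:
        return -self.operand.evaluate()


@dataclass
class BinaryNode:
    left: object
    op: str
    right: object

    def to_source(self) -> str:
        return self.left.to_source() + self.op + self.right.to_source()

    def evaluate(self) -> float:
        return OPERATIONS[self.op](self.left.evaluate(), self.right.evaluate())


# Tree a parser could plausibly produce for "(20+30)*40".
tree = BinaryNode(
    ParensNode(BinaryNode(ValueNode("20"), "+", ValueNode("30"))),
    "*",
    ValueNode("40"),
)
print(tree.to_source())
print(tree.evaluate())
```

Each node only knows how to print and evaluate itself, so adding another operation (for example an optimizer pass) means adding one method per class rather than growing a single visit function.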

Related

Get children count of a tree node

The docs for Node only mention the following methods:
Equal, GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual, NotEqual, Slice, Subscription
It does mention how to access a child by index using Subscription, but how can I find out how many children a node has so that I can iterate over them?
Here is my use case:
Exp parsed = parse(#Exp, "2+(4+3)*48");
println("the number of root children is: " + size(parsed));
But it yields an error, as size() seems to work only with a List.
There are several ways to do this, each with its own trade-offs. Here are a few:
import ParseTree;
int getChildrenCount1(Tree parsed) {
return (0 | it + 1 | _ <- parsed.args);
}
getChildrenCount1 iterates over the raw children of a parse tree node. This includes whitespace and comment nodes (layout) and keywords (literals). You might want to filter for those, or compensate by division.
On the other hand, this seems a bit indirect. We could also just directly ask for the length of the children list:
import List;
import ParseTree;
int getChildrenCount2(Tree parsed) {
return size(parsed.args) / 2 + 1; // here we divide by two assuming every other node is a layout node
}
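The "divide by two and add one" step can be checked with a small model. This is a sketch in plain Python, not Rascal: the list below is an assumed stand-in for parsed.args in which a layout node sits between every pair of neighbouring children, so n non-layout children produce 2n - 1 entries. Note that literals such as "+" are still counted.

```python
def count_non_layout(args):
    # With n real children interleaved with n - 1 layout nodes,
    # len(args) == 2n - 1, so integer division by two plus one gives n.
    return len(args) // 2 + 1


# Assumed shape of the children of a node for `Exp "+" Exp`.
args = ["Exp", "layout", "+", "layout", "Exp"]

print(count_non_layout(args))
print(len([a for a in args if a != "layout"]))  # same count by filtering
```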
There is also the way of meta-data. Every parse tree node has a declarative description of the production directly there which can be queried and explored:
import ParseTree;
import List;
// immediately match on the meta-structure of a parse node:
int getChildrenCount3(appl(Production prod, list[Tree] args)) {
return size(prod.symbols);
}
This length of symbols should be the same as the length of args.
// To filter for "meaningful" children in a declarative way:
int getChildrenCount4(appl(prod(_, list[Symbol] symbols, _), list[Tree] args)) {
return (0 | it + 1 | sort(_) <- symbols);
}
The sort pattern filters for context-free non-terminals as declared with syntax rules. Lexical children would match lex, layout children would match layouts, and literals would match lit.
Without all that pattern matching:
int getChildrenCount5(Tree tree) {
return (0 | it + 1 | s <- tree.prod.symbols, isInteresting(s));
}
bool isInteresting(Symbol s) = s is sort || s is lex;
So far this seems to work, but it is awful:
int getChildrenCount(Tree parsed) {
int infinity = 1000;
for (int i <- [0..infinity]) {
try parsed[i];
catch: return i;
}
return infinity;
}
void main() {
Exp parsed = parse(#Exp, "132+(4+3)*48");
println("the number of root children is: ");
println(getChildrenCount(parsed));
}

Abstract Syntax Tree for Source Code including Expressions

I am building a new simple programming language (just to learn how compilers work in my free time).
I have already built a lexer which can tokenize my source code into lexemes.
However, I am now stuck on how to form an Abstract Syntax Tree from the tokens, where the source code might contain an expression (with operator precedence).
For simplicity, I shall include only 4 basic operators: +, -, /, and * in addition to brackets (). Operator precedence will follow BODMAS rule.
I realize I might be able to convert the expression from infix to prefix/postfix, form the tree and substitute it.
However, I am not sure if that is possible. Even if it is possible, I am not sure how efficient it might be or how difficult it might be to implement.
Is there some trivial way to form the tree in-place without having to convert to prefix/postfix first?
I came across the Shunting Yard algorithm which seems to do this. However, I found it to be quite a complicated algorithm. Is there something simpler, or should I go ahead with implementing the Shunting Yard algorithm?
Currently, the following program is tokenized by my lexer as follows:
I am demonstrating using a Java program for syntax familiarity.
Source Program:
public class Hello
{
public static void main(String[] args)
{
int a = 5;
int b = 6;
int c = 7;
int r = a + b * c;
System.out.println(r);
}
}
Lexer output:
public
class
Hello
{
public
static
void
main
(
String
[
]
args
)
{
int
a
=
5
;
int
b
=
6
;
int
c
=
7
;
int
r
=
a
+
b
*
c
;
System
.
out
.
println
(
r
)
;
}
}
// I know this might look ugly that I use a global variable ret to return parsed subtrees
// but please bear with it, I got used to this for various performance/usability reasons
var ret, tokens
function get_precedence(op) {
// this is an essential part, cannot parse an expression without the precedence checker
if (op == '*' || op == '/' || op == '%') return 14
if (op == '+' || op == '-') return 13
if (op == '<=' || op == '>=' || op == '<' || op == '>') return 11
if (op == '==' || op == '!=') return 10
if (op == '^') return 8
if (op == '&&') return 6
if (op == '||') return 5
return 0
}
function parse_primary(pos) {
// in the real language primary is almost everything that can be on the sides of +
// but here we only handle numbers detected with the JavaScript 'typeof' keyword
if (typeof tokens[pos] == 'number') {
ret = {
type: 'number',
value: tokens[pos],
}
return pos + 1
}
else {
return undefined
}
}
function parse_operator(pos) {
// let's just reuse the function we already wrote instead of creating another huge 'if'
if (get_precedence(tokens[pos]) != 0) {
ret = {
type: 'operator',
operator: tokens[pos],
}
return pos + 1
}
else {
return undefined
}
}
function parse_expr(pos) {
var stack = [], code = [], n, op, next, precedence
pos = parse_primary(pos)
if (pos == undefined) {
// error, an expression can only start with a primary
return undefined
}
code.push(ret) // operands go straight to the output; only operators belong on the stack
while (true) {
n = pos
pos = parse_operator(pos)
if (pos == undefined) break
op = ret
pos = parse_primary(pos)
if (pos == undefined) break
next = ret
precedence = get_precedence(op.operator)
while (stack.length > 0 && get_precedence(stack[stack.length - 1].operator) >= precedence) {
code.push(stack.pop())
}
stack.push(op)
code.push(next)
}
while(stack.length > 0) {
code.push(stack.pop())
}
if (code.length == 1) ret = code[0]
else ret = {
type: 'expr',
stack: code,
}
return n
}
function main() {
tokens = [1, '+', 2, '*', 3]
var pos = parse_expr(0)
if (pos) {
console.log('parsed expression AST')
console.log(ret)
}
else {
console.log('unable to parse anything')
}
}
main()
Here is a bare-bones implementation of shunting-yard expression parsing, written in JavaScript. It is about as minimal and simple as you can get. Tokenizing is left out for brevity; you give the parser the array of tokens (you call them lexemes).
The actual Shunting Yard is the parse_expr function. This is the "classic" implementation that uses a stack, which is my preference; some people prefer functional recursion.
Functions that parse various syntax elements are usually called "parselets". Here we have three of them: one for expressions, the others for primaries and operators. If a parselet detects the corresponding syntax construct at position pos, it returns the next position right after the construct, and the construct itself, in AST form, is returned via the global variable ret. If the parselet does not find what it expects, it returns undefined.
It is now simple to add support for parenthesized grouping: just extend parse_primary with if (parse_group())... else if (parse_number())... etc. Over time your parse_primary will grow quite large as it takes on more constructs: prefix operators, function calls, and so on.
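Since the question asked for a tree rather than a postfix list, here is a minimal sketch in Python of the same idea taken one step further. It assumes only left-associative binary operators and already-tokenized input; the names PRECEDENCE, to_rpn and rpn_to_ast are made up for this example, and the precedence numbers are borrowed from get_precedence above.

```python
# Hypothetical precedence table, limited to the arithmetic operators.
PRECEDENCE = {"*": 14, "/": 14, "%": 14, "+": 13, "-": 13}


def to_rpn(tokens):
    """Shunting yard: turn infix tokens into a postfix (RPN) list."""
    output, operators = [], []
    for token in tokens:
        if token in PRECEDENCE:
            # Pop operators that bind at least as tightly (left-associative).
            while operators and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]:
                output.append(operators.pop())
            operators.append(token)
        else:
            output.append(token)  # operands go straight to the output
    while operators:
        output.append(operators.pop())
    return output


def rpn_to_ast(rpn):
    """Fold a postfix list into nested (operator, left, right) tuples."""
    stack = []
    for token in rpn:
        if isinstance(token, str) and token in PRECEDENCE:
            right = stack.pop()
            left = stack.pop()
            stack.append((token, left, right))
        else:
            stack.append(token)
    return stack[0]


rpn = to_rpn([1, "+", 2, "*", 3])
print(rpn)
print(rpn_to_ast(rpn))
```

The two passes could be merged by keeping a stack of subtrees instead of an output list, but separating them makes it easier to see that the postfix form already encodes the precedence.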

What is the sequence combinator in Chomp?

I'm attempting to parse a subset of JSON that only contains a single, non-nested object with string only values that may contain escape sequences. E.g.
{
"A KEY": "SOME VALUE",
"Another key": "Escape sequences \n \r \\ \/ \f \t \u263A"
}
Using the Chomp parser combinator in Rust. I have it parsing this structure ignoring escape sequences but am having trouble working out how to handle the escape sequences. Looking at other quoted string parsers that use combinators such as:
Arc JSON parser
PHP parser-combinator
Paka
They each use a sequence combinator, what is the equivalent in Chomp?
Chomp is based on Attoparsec and Parsec, so for parsing escaped strings I would use the scan parser to obtain the slice between the " characters while keeping any escaped " characters.
The sequence combinator is just the ParseResult::bind method, used to chain the match of the " character and the escaped string itself so that it will be able to parse "foo\\"bar" and not just foo\\"bar. You get this for free when you use the parse! macro, each ; is implicitly converted into a bind call to chain the parsers together.
The linked parsers use a many and or combinator and allocate a vector for the resulting characters. Paka does not seem to do any transformation on the resulting array, and PHP is using a regex with a callback to unescape the string.
This is code translated from Attoparsec's Aeson benchmark for parsing a JSON-string while not unescaping any escaped characters.
#[macro_use]
extern crate chomp;
use chomp::*;
use chomp::buffer::IntoStream;
use chomp::buffer::Stream;
pub fn json_string(i: Input<u8>) -> U8Result<&[u8]> {
parse!{i;
token(b'"');
let escaped_str = scan(false, |s, c| if s { Some(false) }
else if c == b'"' { None }
else { Some(c == b'\\') });
token(b'"');
ret escaped_str
}
}
#[test]
fn test_it() {
let r = "\"foo\\\"bar\\tbaz\"".as_bytes().into_stream().parse(json_string);
assert_eq!(r, Ok(&b"foo\\\"bar\\tbaz"[..]));
}
The parser above is not equivalent: it yields a slice of u8 borrowed from the source buffer/slice. If you want an owned Vec of the data, you should preferably use [T]::to_vec or String::from_utf8 instead of building a parser using many and or, since that will not be as fast as scan and the result is the same.
If you want to parse UTF-8 and escape sequences, you can filter the resulting slice and then call String::from_utf8 on the Vec (Rust strings must be valid UTF-8; using a string that contains invalid UTF-8 can result in undefined behaviour). If performance is an issue, you should most likely build that into the parser.
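The state machine passed to scan may be easier to follow outside the macro. The following is a rough Python model of it, not Chomp code: the function name scan_quoted is an assumption, and it works on a str instead of a byte slice. The single boolean mirrors the closure's state, "the previous character was a backslash".

```python
def scan_quoted(text):
    """Return the raw contents between the opening and closing quote."""
    if not text.startswith('"'):
        raise ValueError("expected opening quote")
    escaped = False  # True when the previous character was an unescaped backslash
    for index in range(1, len(text)):
        char = text[index]
        if escaped:
            escaped = False  # this character is consumed as part of an escape
        elif char == '"':
            return text[1:index]  # unescaped quote ends the string
        else:
            escaped = char == "\\"
    raise ValueError("unterminated string")


print(scan_quoted('"foo\\"bar\\tbaz" trailing'))
```

As in the Rust version, nothing is unescaped here; the result still contains the backslashes and would need a separate pass to turn escape sequences into characters.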
The documentation states (emphasis mine):
Using parsers is almost entirely done using the parse! macro, which enables us to do three distinct things:
Sequence parsers over the remaining input
Store intermediate results into datatypes
Return a datatype at the end, which may be the result of any arbitrary computation over the intermediate results.
It then provides this example of parsing a sequence of two numbers followed by a constant string:
fn f(i: Input<u8>) -> U8Result<(u8, u8, u8)> {
parse!{i;
let a = digit();
let b = digit();
string(b"missiles");
ret (a, b, a + b)
}
}
fn digit(i: Input<u8>) -> U8Result<u8> {
satisfy(i, |c| b'0' <= c && c <= b'9').map(|c| c - b'0')
}
There is also ParseResult::bind and ParseResult::then which are documented to sequentially compose a result with a second action.
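To show what sequencing through bind amounts to, here is a small Python model of the two-digits-then-"missiles" example. It is a sketch of the general technique, not of Chomp's API: a parser is assumed to be a function from (text, position) to either (value, next_position) or None, and the helper names bind, ret, string and digit are defined here only for illustration.

```python
def digit(text, pos):
    # Succeeds on a single ASCII digit and yields its numeric value.
    if pos < len(text) and text[pos] in "0123456789":
        return int(text[pos]), pos + 1
    return None


def string(expected):
    def parser(text, pos):
        if text.startswith(expected, pos):
            return expected, pos + len(expected)
        return None
    return parser


def ret(value):
    # Consumes nothing and yields a fixed value.
    return lambda text, pos: (value, pos)


def bind(parser, make_next):
    # Run `parser`; on success feed its value to `make_next` to get the
    # parser for the remaining input. A failure short-circuits the chain.
    def bound(text, pos):
        result = parser(text, pos)
        if result is None:
            return None
        value, next_pos = result
        return make_next(value)(text, next_pos)
    return bound


two_digits_then_missiles = bind(digit, lambda a:
    bind(digit, lambda b:
    bind(string("missiles"), lambda _:
    ret((a, b, a + b)))))

print(two_digits_then_missiles("12missiles", 0))
```

The nested lambdas are what a macro like parse! hides: each statement becomes one bind, and each let introduces the name that the rest of the chain can see.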
Because I'm always interested in parsing, I went ahead and played with it a bit to see how it would look. I'm not happy with the deep indenting that would happen with the nested or calls, but there's probably something better that can be done. This is just one possible solution:
#[macro_use]
extern crate chomp;
use chomp::*;
use chomp::ascii::is_alpha;
use chomp::buffer::{Source, Stream, ParseError};
use std::str;
use std::iter::FromIterator;
#[derive(Debug)]
pub enum StringPart<'a> {
String(&'a [u8]),
Newline,
Slash,
}
impl<'a> StringPart<'a> {
fn from_bytes(s: &[u8]) -> StringPart {
match s {
br#"\\"# => StringPart::Slash,
br#"\n"# => StringPart::Newline,
s => StringPart::String(s),
}
}
}
impl<'a> FromIterator<StringPart<'a>> for String {
fn from_iter<I>(iterator: I) -> Self
where I: IntoIterator<Item = StringPart<'a>>
{
let mut s = String::new();
for part in iterator {
match part {
StringPart::String(p) => s.push_str(str::from_utf8(p).unwrap()),
StringPart::Newline => s.push('\n'),
StringPart::Slash => s.push('\\'),
}
}
s
}
}
fn json_string_part(i: Input<u8>) -> U8Result<StringPart> {
or(i,
|i| parse!{i; take_while1(is_alpha)},
|i| or(i,
|i| parse!{i; string(br"\\")},
|i| parse!{i; string(br"\n")}),
).map(StringPart::from_bytes)
}
fn json_string(i: Input<u8>) -> U8Result<String> {
many1(i, json_string_part)
}
fn main() {
let input = br#"\\stuff\n"#;
let mut i = Source::new(input as &[u8]);
println!("Input has {} bytes", input.len());
loop {
match i.parse(json_string) {
Ok(x) => {
println!("Result has {} bytes", x.len());
println!("{:?}", x);
},
Err(ParseError::Retry) => {}, // Needed to refill buffer when necessary
Err(ParseError::EndOfInput) => break,
Err(e) => { panic!("{:?}", e); }
}
}
}

How to handle multiple optionals in grammar

I would like to know how I can handle multiple optionals without concrete pattern matching for each possible permutation.
Below is a simplified example of the problem I am facing:
lexical Int = [0-9]+;
syntax Bool = "True" | "False";
syntax Period = "Day" | "Month" | "Quarter" | "Year";
layout Standard = [\ \t\n\f\r]*;
syntax Optionals = Int? i Bool? b Period? p;
str printOptionals(Optionals opt){
str res = "";
if(!isEmpty("<opt.i>")) { // opt has i is always true (same for opt.i?)
res += printInt(opt.i);
}
if(!isEmpty("<opt.b>")){
res += printBool(opt.b);
}
if(!isEmpty("<opt.p>")) {
res += printPeriod(opt.p);
}
return res;
}
str printInt(Int i) = "<i>";
str printBool(Bool b) = "<b>";
str printPeriod(Period p) = "<p>";
However this gives the error message:
The called signature: printInt(opt(lex("Int"))), does not match the declared signature: str printInt(sort("Int"));
How do I get rid of the opt part when I know it is there?
I'm not sure how ideal this is, but you could do this for now:
if (/Int i := opt.i) {
res += printInt(i);
}
This will extract the Int from within opt.i if it is there, but the match will fail if Int was not provided as one of the options.
The current master on github has the following feature to deal with optionals: they can be iterated over.
For example:
if (Int i <- opt.i) {
res += printInt(i);
}
The <- will produce false immediately if the optional value is absent, and otherwise loop once through and bind the value which is present to the pattern.
An untyped solution is to project out the element from the parse tree:
rascal>opt.i.args[0];
Tree: `1`
Tree: appl(prod(lex("Int"),[iter(\char-class([range(48,57)]))],{}),[appl(regular(iter(\char-class([range(48,57)]))),[char(49)])[@loc=|file://-|(0,1,<1,0>,<1,1>)]])[@loc=|file://-|(0,1,<1,0>,<1,1>)]
However, then to transfer this back to an Int you'd have to pattern match, like so:
rascal>if (Int i := opt.i.args[0]) { printInt(i); }
str: "1"
One could write a generic cast function to help out here:
rascal>&T cast(type[&T] t, value v) { if (&T a := v) return a; throw "cast exception"; }
ok
rascal>printInt(cast(#Int, opt.i.args[0]))
str: "1"
Still, I believe Rascal is missing a feature here. Something like this would be a good feature request:
rascal>Int j = opt.i.value;
rascal>opt.i has value
bool: true

Parsing values contained inside nested brackets

I'm just fooling about and strangely found it a bit tricky to parse nested brackets in a simple recursive function.
For example, if the program's purpose is to look up user details, it may go from {{name surname} age} to {Bob Builder age} and then to Bob Builder 20.
Here is a mini-program for summing totals in curly brackets that demonstrates the concept.
// Parses string recursively by eliminating brackets
def parse(s: String): String = {
if (!s.contains("{")) s
else {
parse(resolvePair(s))
}
}
// Sums one pair and returns the string, starting at deepest nested pair
// e.g.
// {2+10} lollies and {3+{4+5}} peanuts
// should return:
// {2+10} lollies and {3+9} peanuts
def resolvePair(s: String): String = {
??? // Replace the deepest nested pair with its sumString result
}
// Sums values in a string, returning the result as a string
// e.g. sumString("3+8") returns "11"
def sumString(s: String): String = {
val v = s.split("\\+")
v.foldLeft(0)(_.toInt + _.toInt).toString
}
// Should return "12 lollies and 12 peanuts"
parse("{2+10} lollies and {3+{4+5}} peanuts")
Any ideas to a clean bit of code that could replace the ??? would be great. It's mostly out of curiosity that I'm searching for an elegant solution to this problem.
Parser combinators can handle this kind of situation:
import scala.util.parsing.combinator.RegexParsers
object BraceParser extends RegexParsers {
override def skipWhitespace = false
def number = """\d+""".r ^^ { _.toInt }
def sum: Parser[Int] = "{" ~> (number | sum) ~ "+" ~ (number | sum) <~ "}" ^^ {
case x ~ "+" ~ y => x + y
}
def text = """[^{}]+""".r
def chunk = sum ^^ {_.toString } | text
def chunks = rep1(chunk) ^^ {_.mkString} | ""
def apply(input: String): String = parseAll(chunks, input) match {
case Success(result, _) => result
case failure: NoSuccess => scala.sys.error(failure.msg)
}
}
Then:
BraceParser("{2+10} lollies and {3+{4+5}} peanuts")
//> res0: String = 12 lollies and 12 peanuts
There is some investment before getting comfortable with parser combinators but I think it is really worth it.
To help you decipher the syntax above:
regular expressions and strings have implicit conversions that create primitive parsers with string results; they have type Parser[String].
the ^^ operator applies a function to the parsed elements
it can convert a Parser[String] into a Parser[Int] by doing ^^ {_.toInt}
Parser is a monad and Parser[T].^^(f) is equivalent to Parser[T].map(f)
the ~, ~> and <~ operators require the inputs to appear in a certain sequence
the ~> and <~ operators drop one side of the input from the result
case a ~ b lets you pattern match on the results
Parser is a monad and (p ~ q) ^^ { case a ~ b => f(a, b) } is equivalent to for (a <- p; b <- q) yield (f(a, b))
(p <~ q) ^^ f is equivalent to for (a <- p; _ <- q) yield f(a)
rep1 is a repetition of one or more elements
| tries to match the input with the parser on its left and, if that fails, tries the parser on the right
How about
def resolvePair(s: String): String = {
val open = s.lastIndexOf('{')
val close = s.indexOf('}', open)
if((open >= 0) && (close > open)) {
val (a,b) = s.splitAt(open+1)
val (c,d) = b.splitAt(close-open-1)
resolvePair(a.dropRight(1)+sumString(c).toString+d.drop(1))
} else
s
}
I know it's ugly but I think it works fine.
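For comparison, the same "resolve the innermost pair first" idea can be sketched with a regular expression. This is Python rather than Scala, and the names INNERMOST, sum_string and resolve are made up for the example. A brace pair whose contents contain no further braces is by definition innermost, so each pass replaces every such pair and the loop stops once nothing changes.

```python
import re

# Matches a brace pair that contains no other braces, i.e. an innermost pair.
INNERMOST = re.compile(r"\{([^{}]*)\}")


def sum_string(expression):
    """sum_string("3+8") returns "11"."""
    return str(sum(int(part) for part in expression.split("+")))


def resolve(text):
    while True:
        replaced = INNERMOST.sub(lambda match: sum_string(match.group(1)), text)
        if replaced == text:
            return replaced  # no brace pairs left to resolve
        text = replaced


print(resolve("{2+10} lollies and {3+{4+5}} peanuts"))
```

Like the splitAt version, this assumes every brace pair contains a well-formed sum; anything else inside the braces makes int() raise.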

Resources