Abstract Syntax Tree for Source Code including Expressions - parsing

I am building a new simple programming language (just to learn how compilers work in my free time).
I have already built a lexer which can tokenize my source code into lexemes.
However, I am now stuck on how to form an Abstract Syntax Tree from those tokens when the source code contains an expression (with operator precedence).
For simplicity, I shall include only 4 basic operators: +, -, /, and *, in addition to brackets (). Operator precedence will follow the BODMAS rule.
I realize I might be able to convert the expression from infix to prefix/postfix, build the tree from that, and substitute it back in. However, I am not sure whether that would work, how efficient it would be, or how difficult it would be to implement.
Is there some trivial way to form the tree in-place without having to convert to prefix/postfix first?
I came across the Shunting Yard algorithm, which seems to do this. However, I found it to be quite complicated. Is there something simpler, or should I go ahead and implement the Shunting Yard algorithm?
Here is how my lexer currently tokenizes a sample program. I am demonstrating with a Java program for syntax familiarity.
Source Program:
public class Hello
{
    public static void main(String[] args)
    {
        int a = 5;
        int b = 6;
        int c = 7;
        int r = a + b * c;
        System.out.println(r);
    }
}
Lexer output:
public
class
Hello
{
public
static
void
main
(
String
[
]
args
)
{
int
a
=
5
;
int
b
=
6
;
int
c
=
7
;
int
r
=
a
+
b
*
c
;
System
.
out
.
println
(
r
)
;
}
}

// I know it might look ugly that I use a global variable 'ret' to return parsed
// subtrees, but please bear with it; I got used to this for various
// performance/usability reasons
var ret, tokens
function get_precedence(op) {
    // this is an essential part, cannot parse an expression without the precedence checker
    if (op == '*' || op == '/' || op == '%') return 14
    if (op == '+' || op == '-') return 13
    if (op == '<=' || op == '>=' || op == '<' || op == '>') return 11
    if (op == '==' || op == '!=') return 10
    if (op == '^') return 8
    if (op == '&&') return 6
    if (op == '||') return 5
    return 0
}
function parse_primary(pos) {
    // in the real language a primary is almost everything that can be on the sides of +
    // but here we only handle numbers, detected with the JavaScript 'typeof' operator
    if (typeof tokens[pos] == 'number') {
        ret = {
            type: 'number',
            value: tokens[pos],
        }
        return pos + 1
    }
    else {
        return undefined
    }
}
function parse_operator(pos) {
    // let's just reuse the function we already wrote instead of creating another huge 'if'
    if (get_precedence(tokens[pos]) != 0) {
        ret = {
            type: 'operator',
            operator: tokens[pos],
        }
        return pos + 1
    }
    else {
        return undefined
    }
}
function parse_expr(pos) {
    var stack = [], code = [], n, op, next, precedence
    pos = parse_primary(pos)
    if (pos == undefined) {
        // error, an expression can only start with a primary
        return undefined
    }
    // the first primary goes straight to the output; only operators live on the stack
    code.push(ret)
    while (true) {
        n = pos
        pos = parse_operator(pos)
        if (pos == undefined) break
        op = ret
        pos = parse_primary(pos)
        if (pos == undefined) break
        next = ret
        precedence = get_precedence(op.operator)
        while (stack.length > 0 && get_precedence(stack[stack.length - 1].operator) >= precedence) {
            code.push(stack.pop())
        }
        stack.push(op)
        code.push(next)
    }
    while (stack.length > 0) {
        code.push(stack.pop())
    }
    if (code.length == 1) ret = code[0]
    else ret = {
        type: 'expr',
        stack: code,
    }
    return n
}
function main() {
    tokens = [1, '+', 2, '*', 3]
    var pos = parse_expr(0)
    if (pos) {
        console.log('parsed expression AST')
        console.log(ret)
    }
    else {
        console.log('unable to parse anything')
    }
}
main()
Here is a bare-bones implementation of shunting-yard expression parsing, written in JavaScript. This is about as minimalistic and simple as you can get. Tokenizing is left out for brevity; you give the parser the array of tokens (you call them lexemes).
The actual Shunting Yard is the parse_expr function. This is the "classic" implementation that uses an explicit stack, which is my preference; some people prefer functional recursion.
Functions that parse various syntax elements are usually called "parselets". Here we have three of them: one for expressions, and the others for primaries and operators. If a parselet detects the corresponding syntax construct at position pos, it returns the next position right after the construct, and the construct itself in AST form is passed back via the global variable ret. If a parselet does not find what it expects, it returns undefined.
It is now trivially simple to add support for grouping with parens: just extend parse_primary with if (parse_group()) ... else if (parse_number()) ... etc., as sketched below. Over time your parse_primary will grow quite large as it comes to support various things: prefix operators, function calls, and so on.
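For example, a parse_group parselet could follow the same conventions (position in, next position out, result via ret). This is a minimal sketch of my own, not part of the answer's code, assuming the same globals as above:

function parse_group(pos) {
    // a group is '(' expr ')'; on success ret already holds the inner
    // expression's AST, so the group needs no extra node of its own
    if (tokens[pos] != '(') return undefined
    pos = parse_expr(pos + 1)
    if (pos == undefined) return undefined
    if (tokens[pos] != ')') return undefined
    return pos + 1
}

parse_primary would then try parse_group first and fall back to the number case; nested groups work automatically, since parse_group calls back into parse_expr.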

Related

Building expression parser with Dart petitparser, getting stuck on node visitor

I've got more of my expression parser working (Dart PetitParser, getting at the AST data structure created with ExpressionBuilder). It appears to be generating accurate ASTs for floats, parens, power, multiply, divide, add, subtract, and unary negative in front of both numbers and expressions. (The nodes are either literal strings, or an object that has a precedence with a List payload that gets walked and concatenated.)
I'm stuck now on visiting the nodes. I have clean access to the top node (thanks to Lukas), but I'm stuck on deciding whether or not to add a paren. For example, in 20+30*40, we don't need parens around 30*40, and the parse tree correctly has the node for this closer to the root so I'll hit it first during traversal. However, I don't seem to have enough data when looking at the 30*40 node to determine if it needs parens before going on to the 20+.. A very similar case would be (20+30)*40, which gets parsed correctly with 20+30 closer to the root, so once again, when visiting the 20+30 node I need to add parens before going on to *40.
This has to be a solved problem, but I never went to compiler school, so I know just enough about ASTs to be dangerous. What "a ha" am I missing?
// rip-common.dart:
import 'package:petitparser/petitparser.dart';
// import 'package:petitparser/debug.dart';

class Node {
  int precedence;
  List<dynamic> args;

  Node([this.precedence = 0, this.args = const []]) {
    // nodeList.add(this);
  }

  @override
  String toString() => 'Node($precedence $args)';

  String visit([int fromPrecedence = -1]) {
    print('=== visiting $this ===');
    var buf = StringBuffer();
    var parens = (precedence > 0) &&
        (fromPrecedence > 0) &&
        (precedence < fromPrecedence);
    print('<$fromPrecedence $precedence $parens>');
    // for debugging:
    var curlyOpen = '';
    var curlyClose = '';
    buf.write(parens ? '(' : curlyOpen);
    for (var arg in args) {
      if (arg is Node) {
        buf.write(arg.visit(precedence));
      } else if (arg is String) {
        buf.write(arg);
      } else {
        print('not Node or String: $arg');
        buf.write('$arg');
      }
    }
    buf.write(parens ? ')' : curlyClose);
    print('$buf for buf');
    return '$buf';
  }
}

class RIPParser {
  Parser _make_parser() {
    final builder = ExpressionBuilder();
    var number = char('-').optional() &
        digit().plus() &
        (char('.') & digit().plus()).optional();
    // precedence 5
    builder.group()
      ..primitive(number.flatten().map((a) => Node(0, [a])))
      ..wrapper(char('('), char(')'), (l, a, r) => Node(0, [a]));
    // negation is a prefix operator
    // precedence 4
    builder.group()..prefix(char('-').trim(), (op, a) => Node(4, [op, a]));
    // power is right-associative
    // precedence 3
    builder.group()..right(char('^').trim(), (a, op, b) => Node(3, [a, op, b]));
    // multiplication and division are left-associative
    // precedence 2
    builder.group()
      ..left(char('*').trim(), (a, op, b) => Node(2, [a, op, b]))
      ..left(char('/').trim(), (a, op, b) => Node(2, [a, op, b]));
    // precedence 1
    builder.group()
      ..left(char('+').trim(), (a, op, b) => Node(1, [a, op, b]))
      ..left(char('-').trim(), (a, op, b) => Node(1, [a, op, b]));
    final parser = builder.build().end();
    return parser;
  }

  Result _result(String input) {
    var parser = _make_parser(); // eventually cache
    var result = parser.parse(input);
    return result;
  }

  String parse(String input) {
    var result = _result(input);
    if (result.isFailure) {
      return result.message;
    } else {
      print('result.value = ${result.value}');
      return '$result';
    }
  }

  String visit(String input) {
    var result = _result(input);
    var top_node = result.value; // result.isFailure ...
    return top_node.visit();
  }
}
// rip_cmd_example.dart
import 'dart:io';
import 'package:rip_common/rip_common.dart';

void main() {
  print('start');
  String input;
  while (true) {
    input = stdin.readLineSync();
    if (input.isEmpty) {
      break;
    }
    print(RIPParser().parse(input));
    print(RIPParser().visit(input));
  }
  print('done');
}
As you've observed, the ExpressionBuilder already assembles the tree in the right precedence order based on the operator groups you've specified.
This also happens for the wrapping parens node created here: ..wrapper(char('('), char(')'), (l, a, r) => Node(0, [a])). If I test for this node, I get back the input string for your example expressions: var parens = precedence == 0 && args.length == 1 && args[0] is Node;.
Unless I am missing something, there should be no reason for you to track the precedence manually. I would also recommend that you create different node classes for the different operators: ValueNode, ParensNode, NegNode, PowNode, MulNode, ... A bit verbose, but much easier to understand what is going on, if each of them can just visit (print, evaluate, optimize, ...) itself.
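To illustrate that idea outside Dart, here is a minimal JavaScript sketch (my own code, not PetitParser's) of per-operator node classes that decide about parens themselves; a child is wrapped only when its precedence is lower than its parent's, and associativity is ignored for brevity:

class ValueNode {
    constructor(value) { this.precedence = 99; this.value = value }
    print() { return String(this.value) }
}

class BinNode {
    constructor(op, precedence, left, right) {
        this.op = op
        this.precedence = precedence
        this.left = left
        this.right = right
    }
    // wrap a child in parens only when it binds more loosely than this node
    wrap(child) {
        var s = child.print()
        return child.precedence < this.precedence ? '(' + s + ')' : s
    }
    print() { return this.wrap(this.left) + this.op + this.wrap(this.right) }
}

var num = (v) => new ValueNode(v)
var add = (l, r) => new BinNode('+', 1, l, r)
var mul = (l, r) => new BinNode('*', 2, l, r)

console.log(mul(add(num(20), num(30)), num(40)).print()) // (20+30)*40
console.log(add(num(20), mul(num(30), num(40))).print()) // 20+30*40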

Grammar and top down parsing

For a while I've been trying to learn how LL parsers work and, if I understand this correctly, when writing a top-down recursive descent parser by hand I should create a function for every non-terminal symbol. So for this example:
S -> AB
A -> aA|ε
B -> bg|dDe
D -> ab|cd
I'd have to create a function for each S, A, B and D like this:
Token B()
{
    if (Lookahead(1) == 'b')
    {
        Eat('b');
        Eat('g');
    }
    else if (Lookahead(1) == 'd')
    {
        Eat('d');
        D();
        Eat('e');
    }
    else
    {
        Error();
    }
    return B_TOKEN;
}
But then I try to do the same thing with the following grammar, which I created to generate the same language as the regular expression (a|b|c)*a:
S -> Ma
M -> aM|bM|cM|ε
That gives me the following functions:
Token S()
{
    char Ch = Lookahead(1);
    if (Ch == 'a' || Ch == 'b' || Ch == 'c')
    {
        M();
        Eat('a');
    }
    else
    {
        Error();
    }
    return S_TOKEN;
}

Token M()
{
    char Ch = Lookahead(1);
    if (Ch == 'a' || Ch == 'b' || Ch == 'c')
    {
        Eat(Ch);
        M();
    }
    return M_TOKEN;
}
And it doesn't seem to be right, because for input 'bbcaa' M will consume everything, and after that S won't find the last 'a' and will report an error, even though the grammar accepts the string. It feels like M is missing the ε case, or maybe it is handled in the wrong way, but I'm not sure how to deal with it. Could anyone please help?
The behaviour of a top-down predictive parser is exactly as you note in your question. In other words, your second grammar is not suitable for top-down parsing (with one-token lookahead). Many grammars are not; that includes any grammar in which it is not possible to predict which production to use based on a finite lookahead.
In this case, if you were to look ahead two tokens instead of one, you could resolve the conflict: M should predict the ε production on the lookahead a END (where END marks the end of the input), and the aM production on all other two-token lookaheads where the first token is a.
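A minimal sketch of that two-token-lookahead parser in JavaScript (the helper names input, pos, peek, and eat are my own scaffolding, not from the question):

var input, pos

function peek(k) { return input[pos + k - 1] }   // k-token lookahead, 1-based

function eat(ch) {
    if (peek(1) != ch) throw new Error("expected '" + ch + "' at " + pos)
    pos++
}

function M() {
    var ch = peek(1)
    if (ch == 'b' || ch == 'c') { eat(ch); M() }
    else if (ch == 'a' && peek(2) != undefined) { eat('a'); M() }
    // otherwise predict the empty production: leave the final 'a' for S
}

function S() { M(); eat('a') }

input = 'bbcaa'; pos = 0
S()
if (pos != input.length) throw new Error('trailing input')
console.log('accepted')   // M stops before the last 'a', so S can consume it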

Predictive parser for EBNF productions

I am trying to write a recursive descent parser without backtracking for a kind of EBNF like this:
<a> ::= b [c] | d
where
<a> = non-terminal
lower-case-string = identifier
[term-in-brackets] = term-in-brackets is optional
a|b is the usual mutually exclusive choice between a and b.
For now, I care only about the right hand side.
Following the example at http://en.wikipedia.org/wiki/Recursive_descent_parser, I eventually ended up with the following procedures (the corresponding rule, in GNU Bison syntax, is in the comment above each procedure):
/* expression: term optional_or_term */
void expression()
{
    term();
    if (sym == OR_SYM)
        optional_or_term();
}

/* optional_or_term: // empty
                   | OR_SYM term optional_or_term
*/
void optional_or_term()
{
    while (sym == OR_SYM)
    {
        getsym();
        term();
    }
}

/* term: factor | factor term */
void term()
{
    factor();
    if (sym == EOF_SYM || sym == RIGHT_SQUAREB_SYM)
    {
        ;
    }
    else if (sym == IDENTIFIER_SYM || sym == LEFT_SQUAREB_SYM)
        term();
    else if (sym == OR_SYM)
        optional_or_term();
    else
    {
        error("term: syntax error");
        getsym();
    }
}

/*
factor: IDENTIFIER_SYM
      | LEFT_SQUAREB_SYM expression RIGHT_SQUAREB_SYM
*/
void factor()
{
    if (accept(IDENTIFIER_SYM))
    {
        ;
    }
    else if (accept(LEFT_SQUAREB_SYM))
    {
        expression();
        expect(RIGHT_SQUAREB_SYM);
    }
    else
    {
        error("factor: syntax error");
        getsym();
    }
}
It seems to be working, but my expectation was that each procedure would correspond closely with the corresponding rule. You will notice that term() does not.
My question is: did the grammar need more transformation before the procedures were written?
I don't think your problem is the absence of operators for concatenation. I think it is not using Kleene star (and plus) for lists of things. The Kleene star lets you actually code a loop inside a procedure that implements the grammar rule.
I would have written your grammar as:
expression = term (OR_SYM term)*;
term = factor+;
factor = IDENTIFIER_SYM | LEFT_SQUAREB_SYM expression RIGHT_SQUAREB_SYM ;
(This is a pretty classic grammar for a grammar).
The parser code then looks like:
boolean expression()
{
    if (term())
    {   loop
        {   if (OR_SYM())
            {   if (term())
                    {}
                else syntax_error();
            }
            else return true;
        }
    }
    else return false;
}

boolean term()
{
    if (factor())
    {   loop
        {   if (factor())
                {}
            else return true;
        }
    }
    else return false;
}

boolean factor()
{
    if (IDENTIFIER_SYM())
        return true;
    else
    {   if (LEFT_SQUAREB_SYM())
        {   if (expression())
            {   if (RIGHT_SQUAREB_SYM())
                    return true;
                else syntax_error();
            }
            else syntax_error();
        }
        else return false;
    }
}
I tried to generate this in an absolutely mechanical way, and you can do pretty well like this. I did a lot of this earlier in my career. A runnable rendering of the same functions is sketched below.
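To make that mechanical translation concrete, here is a runnable JavaScript rendering of the three functions (the token scaffolding and single-letter identifiers are my own assumptions):

var toks, idx, sym
function getsym() { sym = toks[idx++] }

function expression() {                // expression = term (OR_SYM term)*
    if (!term()) return false
    while (sym == '|') {
        getsym()
        if (!term()) throw new Error("syntax error: term expected after '|'")
    }
    return true
}

function term() {                      // term = factor+
    if (!factor()) return false
    while (factor()) { }               // greedily consume further factors
    return true
}

function factor() {                    // factor = IDENTIFIER | '[' expression ']'
    if (typeof sym == 'string' && /^[a-z]$/.test(sym)) {
        getsym()
        return true
    }
    if (sym == '[') {
        getsym()
        if (!expression()) throw new Error('syntax error: expression expected')
        if (sym != ']') throw new Error("syntax error: ']' expected")
        getsym()
        return true
    }
    return false
}

toks = 'b[c]|d'.split(''); idx = 0; getsym()
console.log(expression() && sym == undefined)  // true: the whole input parses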
What you're not going to get is 150 working rules per day. First, for a big language, it is hard to get the grammar right; you'll be tweaking it repeatedly to get a grammar that works in the abstract, and then you have to adjust the code you wrote. Next, you'll discover that writing the lexer has its troubles too; just try writing a lexer for Java. Finally, you'll discover that parser rules aren't the whole game or even the biggest part of your effort; you need a lot more machinery to process real code. I call this "Life After Parsing"; see my bio for more information.
If you want to get 150 working rules per day, switch to a GLR parser and stop coding parsers manually. That won't address the other issues, but it does make you incredibly productive at getting out a usable grammar. This is what I do now. Without exception. (Our DMS Software Reengineering Toolkit uses this, and we parse a lot of things that people claim are hard.)

How to handle multiple optionals in grammar

I would like to know how I can handle multiple optionals without concrete pattern matching for each possible permutation.
Below is a simplified example of the problem I am facing:
lexical Int = [0-9]+;
syntax Bool = "True" | "False";
syntax Period = "Day" | "Month" | "Quarter" | "Year";
layout Standard = [\ \t\n\f\r]*;
syntax Optionals = Int? i Bool? b Period? p;

str printOptionals(Optionals opt){
    str res = "";
    if (!isEmpty("<opt.i>")) { // "opt has i" is always true (same for opt.i?)
        res += printInt(opt.i);
    }
    if (!isEmpty("<opt.b>")) {
        res += printBool(opt.b);
    }
    if (!isEmpty("<opt.p>")) {
        res += printPeriod(opt.p);
    }
    return res;
}

str printInt(Int i) = "<i>";
str printBool(Bool b) = "<b>";
str printPeriod(Period p) = "<p>";
However this gives the error message:
The called signature: printInt(opt(lex("Int"))), does not match the declared signature: str printInt(sort("Int"));
How do I get rid of the opt part when I know it is there?
I'm not sure how ideal this is, but you could do this for now:
if (/Int i := opt.i) {
res += printInt(i);
}
This will extract the Int from within opt.i if it is there, but the match will fail if Int was not provided as one of the options.
The current master on github has the following feature to deal with optionals: they can be iterated over.
For example:
if (Int i <- opt.i) {
res += printInt(i);
}
The <- will produce false immediately if the optional value is absent, and otherwise loop once through and bind the value which is present to the pattern.
An untyped solution is to project out the element from the parse tree:
rascal>opt.i.args[0];
Tree: `1`
Tree: appl(prod(lex("Int"),[iter(\char-class([range(48,57)]))],{}),[appl(regular(iter(\char-class([range(48,57)]))),[char(49)])[#loc=|file://-|(0,1,<1,0>,<1,1>)]])[#loc=|file://-|(0,1,<1,0>,<1,1>)]
However, then to transfer this back to an Int you'd have to pattern match, like so:
rascal>if (Int i := opt.i.args[0]) { printInt(i); }
str: "1"
One could write a generic cast function to help out here:
rascal>&T cast(type[&T] t, value v) { if (&T a := v) return a; throw "cast exception"; }
ok
rascal>printInt(cast(#Int, opt.i.args[0]))
str: "1"
Still, I believe Rascal is missing a feature here. Something like this would be a good feature request:
rascal>Int j = opt.i.value;
rascal>opt.i has value
bool: true

Recursive descent parser implementation

I am looking to write some pseudo-code for a recursive descent parser. Now, I have no experience with this type of coding. I have read some examples online, but they only work on grammars for mathematical expressions. Here is the grammar I am basing the parser on.
S -> if E then S | if E then S else S | begin S L | print E
L -> end | ; S L
E -> i
I must write methods S(), L() and E() and return some error messages, but the tutorials I have found online have not helped a lot. Can anyone point me in the right direction and give me some examples?
I would like to write it in C# or Java syntax, since those are easier for me to relate to.
Update
public void S() {
    if (currentToken == "if") {
        getNextToken();
        E();
        if (currentToken == "then") {
            getNextToken();
            S();
            if (currentToken == "else") {
                getNextToken();
                S();
            }
            return;
        } else {
            throw new IllegalTokenException("Procedure S() expected a 'then' token " + "but received: " + currentToken);
        }
    } else if (currentToken == "begin") {
        getNextToken();
        S();
        L();
        return;
    } else if (currentToken == "print") {
        getNextToken();
        E();
        return;
    } else {
        throw new IllegalTokenException("Procedure S() expected an 'if' or 'begin' or 'print' token " + "but received: " + currentToken);
    }
}
public void L() {
    if (currentToken == "end") {
        getNextToken();
        return;
    } else if (currentToken == ";") {
        getNextToken();
        S();
        L();
        return;
    } else {
        throw new IllegalTokenException("Procedure L() expected an 'end' or ';' token " + "but received: " + currentToken);
    }
}

public void E() {
    if (currentToken == "i") {
        getNextToken();
        return;
    } else {
        throw new IllegalTokenException("Procedure E() expected an 'i' token " + "but received: " + currentToken);
    }
}
Basically, in recursive descent parsing each non-terminal in the grammar is translated into a procedure. Inside each procedure you check whether the current token matches what you would expect to see on the right-hand side of the non-terminal symbol corresponding to the procedure; if it does, continue applying the production, and if it doesn't, you have encountered an error and must take some action.
So in your case, as you mentioned above, you will have procedures S(), L(), and E(). I will give an example of how L() could be implemented, and then you can try to do S() and E() on your own.
It is also important to note that you will need some other program to tokenize the input for you; then you can just ask that program for the next token from your input.
/**
 * This procedure corresponds to the L non-terminal
 * L -> 'end'
 * L -> ';' S L
 */
public void L()
{
    if (currentToken == 'end')
    {
        // we found an 'end' token, get the next token in the input stream
        // Notice, there are no other non-terminals or terminals after
        // 'end' in this production so all we do now is return
        // Note: here we return to the procedure that called L()
        getNextToken();
        return;
    }
    else if (currentToken == ';')
    {
        // we found a ';', get the next token in the input stream
        getNextToken();
        // Notice that there are two more non-terminal symbols after ';'
        // in this production, S and L; all we have to do is call those
        // procedures in the order they appear in the production, then return
        S();
        L();
        return;
    }
    else
    {
        // The currentToken is not either an 'end' token or a ';' token
        // This is an error, raise some exception
        throw new IllegalTokenException(
            "Procedure L() expected an 'end' or ';' token " +
            "but received: " + currentToken);
    }
}
Now you try S() and E(), and post back.
Also, as Kristopher points out, your grammar has something called a dangling else, meaning that you have two productions that start with the same thing up to a point:
S -> if E then S
S -> if E then S else S
This raises the question: if your parser sees an 'if' token, which production should it choose to process the input? The answer is that it has no idea which one to choose, because unlike humans the compiler can't look arbitrarily far ahead into the input stream to search for an 'else' token. This is a simple problem to fix by applying a rule known as left-factoring, very similar to how you would factor an algebra problem.
All you have to do is create a new non-terminal symbol S' (S-prime) whose right-hand side holds the pieces of the productions that aren't common, so your S productions now become:
S -> if E then S S'
S' -> else S
S' -> e
(e is used here to denote the empty string, which basically means there is no input seen)
This is not the easiest grammar to start with because you have an unlimited amount of lookahead on your first production rule:
S -> if E then S | if E then S else S | begin S L | print E
consider
if 5 then begin begin begin begin ...
When do we determine this stupid else?
also, consider
if 5 then if 4 then if 3 then if 2 then print 2 else ...
Now, was that else supposed to bind to the if 5 then fragment? If not, that's actually cool, but be explicit.
You can rewrite your grammar (possibly, depending on else rule) equivalently as:
S -> if E then S (else S)? | begin S L | print E
L -> end | ; S L
E -> i
Which may or may not be what you want. But the pseudocode sort of jumps out from this.
define S() {
    if (peek()=="if") {
        consume("if")
        E()
        consume("then")
        S()
        if (peek()=="else") {
            consume("else")
            S()
        }
    } else if (peek()=="begin") {
        consume("begin")
        S()
        L()
    } else if (peek()=="print") {
        consume("print")
        E()
    } else {
        throw error()
    }
}

define L() {
    if (peek()=="end") {
        consume("end")
    } else if (peek()==";") {
        consume(";")
        S()
        L()
    } else {
        throw error()
    }
}

define E() {
    consume_token_i()
}
For each alternate, I created an if statement that looked at the unique prefix. The final else on any match attempt is always an error. I consume keywords and call the functions corresponding to production rules as I encounter them.
Translating from pseudocode to real code isn't too complicated, but it isn't trivial. Those peeks and consumes probably don't actually operate on strings; it's far easier to operate on tokens. And simply walking a sentence and consuming it isn't quite the same as parsing it. You'll want to do something as you consume the pieces, probably building up a parse tree (which means these functions probably return something). And throwing an error might be correct at a high level, but you'd want to put some meaningful information into the error. Also, things get more complex if you require more lookahead.
I would recommend Language Implementation Patterns by Terence Parr (the guy who wrote ANTLR, a recursive descent parser generator) when looking at these kinds of problems. The Dragon Book (Aho et al., recommended in a comment) is good too; it is still probably the dominant textbook in compiler courses.
I taught (really, just helped with) the parsing section of a PL class last semester, and I really recommend looking at the parsing slides from our page: here. Basically, for recursive descent parsing, you ask yourself the following question:
I've parsed a little bit of a non-terminal, and now I'm at a point where I can make a choice as to what I'm supposed to parse next. What I see next decides which production I'm in.
Your grammar, by the way, exhibits a very common ambiguity called the "dangling else," which has been around since the Algol days. In shift-reduce parsers (typically produced by parser generators) it generates a shift/reduce conflict, where you typically choose to shift instead of reduce, giving you the common "maximal munch" principle. (So, if you see "if (b) then if (b2) S1 else S2", you read it as "if (b) then { if (b2) { s1; } else { s2; } }".)
Let's throw this out of your grammar, and work with a slightly simpler grammar:
T -> A + T
   | A - T
   | A
A -> NUM * A
   | NUM / A
   | NUM
We're also going to assume that NUM is something you get from the lexer (i.e., it's just a token). This grammar is LL(1); that is, you can parse it with a recursive descent parser implemented using the naive algorithm. The algorithm works like this:
parse_T() {
    a = parse_A();
    if (token() == '+') {           // token() peeks at the current token
        next_token();               // advance to the next token in the stream
        t = parse_T();
        return T_node_plus_constructor(a, t);
    }
    else if (token() == '-') {
        next_token();               // advance to the next token in the stream
        t = parse_T();
        return T_node_minus_constructor(a, t);
    } else {
        return T_node_a_constructor(a);
    }
}

parse_A() {
    num = token();                  // gets the current token from the token stream
    assert(token_type(num) == NUM);
    next_token();                   // advance to the next token in the stream
    if (token() == '*') {
        next_token();               // consume the '*'
        a = parse_A();
        return A_node_times_constructor(num, a);
    }
    else if (token() == '/') {
        next_token();               // consume the '/'
        a = parse_A();
        return A_node_div_constructor(num, a);
    } else {
        return A_node_num_constructor(num);
    }
}
Do you see the pattern here? For each non-terminal in your grammar: parse the first thing, then see what you have to look at to decide what you should parse next. Continue until you're done!
Also, please note that this type of parsing typically won't produce what you want for arithmetic. Recursive descent parsers (unless you use a little trick with loops or tail recursion) won't produce leftmost derivations. In particular, you can't write left-recursive rules like "a -> a - a", where left associativity is really necessary! This is why people typically use fancier parser-generator tools. But the recursive descent trick is still worth knowing about and playing around with.
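For completeness, here is a hedged JavaScript sketch of that trick applied to the grammar above: keep the two precedence levels, but build left-associative trees with loops instead of right recursion (the token scaffolding is my own assumption):

var toks, idx
function peek() { return toks[idx] }
function advance() { return toks[idx++] }

function parse_A() {                   // A level: NUM ('*' NUM | '/' NUM)*
    var node = { type: 'num', value: advance() }
    while (peek() == '*' || peek() == '/') {
        var op = advance()
        node = { type: op, left: node, right: { type: 'num', value: advance() } }
    }
    return node
}

function parse_T() {                   // T level: A ('+' A | '-' A)*
    var node = parse_A()
    while (peek() == '+' || peek() == '-') {
        var op = advance()
        node = { type: op, left: node, right: parse_A() }
    }
    return node
}

toks = [1, '-', 2, '-', 3]; idx = 0
console.log(JSON.stringify(parse_T()))
// the '-' nodes nest to the left: (1 - 2) - 3, not 1 - (2 - 3)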
