As a side project and learning experiment, I'm writing my own programming language without any premade tools such as LLVM.
I've already written my own recursive descent parser, but I'm having a problem trying to think of the logistics of parsing a statement like this:
x()()[0]()
I can't think of a good way to make a parse tree/AST out of this. I've tried reading the grammars of other programming languages (notably Python and C#), but I just can't figure out how they do it.
How would I write something to parse an expression like the one above?
From an AST perspective, it probably helps to imagine that you have some sort of "function call" node, with one child for the expression that evaluates to the function being called and one child per argument expression. For example, the code fn()()[0]() might look like this:
                 +--------+
                 |  call  |
                 +--------+
       function /          \ args
               /          (null)
              /
      +-----------+
      | selection |
      +-----------+
      array /     \ index
           /        0
          /
     +--------+
     |  call  |
     +--------+
   function /  \ args
           /  (null)
          /
    +--------+
    |  call  |
    +--------+
  function /  \ args
        fn    (null)
In terms of how to parse something like this, I'd recommend treating the function call as a postfix operator like array selection (arr[index]) or member selection (object.field). A CFG fragment for this might look like this:
Expr --> Expr ( ArgList )
       | /* other expression types */
From a recursive descent perspective, after you've parsed an expression, you'd do a lookahead to see if there's an open parenthesis token after it. If so, that means that whatever you just read should be treated as the function component of a function call expression, and what you're about to read is the argument list.
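Here is a rough sketch of that loop in F#, purely as an illustration; all the token and node names below are invented, and argument parsing is stripped down to empty argument lists so the postfix idea stays visible:

type Token =
    | TName of string
    | TInt of int
    | LParen | RParen
    | LBracket | RBracket

type Expr =
    | Name of string
    | IntLit of int
    | Call of Expr * Expr list        // callee expression, argument expressions
    | Selection of Expr * Expr        // array expression, index expression

// Parse a name or an integer literal.
let rec parsePrimary tokens =
    match tokens with
    | TName n :: rest -> Name n, rest
    | TInt i :: rest -> IntLit i, rest
    | _ -> failwith "Expected a primary expression"

// After a primary expression, keep looking ahead: every '(' wraps what we have
// so far in a call node, and every '[' wraps it in a selection node.
and parsePostfix expr tokens =
    match tokens with
    | LParen :: RParen :: rest ->
        parsePostfix (Call(expr, [])) rest    // only empty argument lists, for brevity
    | LBracket :: rest ->
        (match parseExpr rest with
         | index, RBracket :: rest' -> parsePostfix (Selection(expr, index)) rest'
         | _ -> failwith "Expected ']' after the index expression")
    | _ -> expr, tokens

and parseExpr tokens =
    let primary, rest = parsePrimary tokens
    parsePostfix primary rest

// fn()()[0]() as a token list:
let ast, _ =
    parseExpr [TName "fn"; LParen; RParen; LParen; RParen;
               LBracket; TInt 0; RBracket; LParen; RParen]
// ast = Call (Selection (Call (Call (Name "fn", []), []), IntLit 0), [])

The important part is that parsePostfix loops on itself, so any number of trailing (...) and [...] suffixes keep wrapping the expression parsed so far, which is exactly the nesting shown in the tree above.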
I looked at how to build a tree from given data in F#, and at https://citizen428.net/blog/learning-fsharp-binary-search-tree/
Basically, what I am attempting to do is implement a function for building an extremely simple AST, using discriminated unions (DUs) to represent the tree.
I want to use tokens/symbols to build the tree. I think these could also be represented by DUs. I am struggling to implement the insert function.
Let's just say we use the following to represent the tree. The basic idea is that for addition and subtraction of integers I'll only need a binary tree. An Expression can be either an operator or a constant. This might be the wrong way of implementing the tree, but I'm not sure.
type Tree =
    | Node of Tree * Expression * Tree
    | Empty
and Expression =
    | Operator // could be a token or another type
    | Constant of int
And let's use the following for representing tokens. There's probably a smarter way of doing this. This is just an example.
type Token =
    | Integer
    | Add
    | Subtract
How should I implement the insert function? I've written the function below and tried different ways of inserting elements.
let rec insert tree element =
    match element, tree with
    // use Empty to initialize
    | x, Empty -> Node(Empty, x, Empty)
    | x, Node(Empty, y, Empty) when (*x is something here*) -> Node((*something*))
    | _, _ -> failwith "Missing case"
If you have any advice, or maybe a link, I would appreciate it.
Thinking about the problem in terms of tree insertion is not very helpful, because what you really want to do is parse a sequence of tokens, and a plain tree insert does not capture that. You instead need to construct the tree (expression) in a way that follows the structure of the input.
For example, say I have:
let input = [Integer 1; Add; Integer 2; Subtract; Integer 1;]
Say I want to parse this sequence of tokens to get a representation of 1 + (2 - 1) (which groups the operators the wrong way around, but it makes the idea easier to explain).
My approach would be to define a recursive Expression type rather than using a general tree:
type Token =
    | Integer of int
    | Add
    | Subtract

type Operator =
    | AddOp | SubtractOp

type Expression =
    | Binary of Operator * Expression * Expression
    | Constant of int
To parse a sequence of tokens, you can write something like:
let rec parse input =
    match input with
    | Integer i :: Add :: rest ->
        Binary(AddOp, Constant i, parse rest)
    | Integer i :: Subtract :: rest ->
        Binary(SubtractOp, Constant i, parse rest)
    | Integer i :: [] ->
        Constant i
    | _ -> failwith "Unexpected token"
This looks for lists starting with Integer i; Add; ..., or the same with Subtract, and constructs the tree recursively. Using the above input, you get:
> parse input;;
val it : Expression =
  Binary (AddOp, Constant 1,
          Binary (SubtractOp, Constant 2, Constant 1))
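As a side note that goes beyond the answer above: if you later want the grouping to come out left-associative instead, i.e. (1 + 2) - 1, one option is to carry the tree built so far as an accumulator and fold the remaining tokens into it, reusing the same types (this is just a sketch, not something the answer requires):

let parseLeft input =
    // acc is the expression built so far; each operator/integer pair extends it on the left.
    let rec loop acc rest =
        match rest with
        | Add :: Integer i :: rest' -> loop (Binary(AddOp, acc, Constant i)) rest'
        | Subtract :: Integer i :: rest' -> loop (Binary(SubtractOp, acc, Constant i)) rest'
        | [] -> acc
        | _ -> failwith "Unexpected token"
    match input with
    | Integer i :: rest -> loop (Constant i) rest
    | _ -> failwith "Expected an integer"

// parseLeft input;;
// val it : Expression = Binary (SubtractOp, Binary (AddOp, Constant 1, Constant 2), Constant 1)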
I am coding an expression parser and a visualization of it, meaning that every step of the recursive descent parse, or of the AST construction, will be visualized, like a tiny version of VisuAlgo.
// Expression grammar
Goal   -> Expr
Expr   -> Expr + Term
        | Expr - Term
        | Term
Term   -> Term * Factor
        | Term / Factor
        | Factor
Factor -> (Expr)
        | num
        | name
So I am wondering what data structure can easily be used to store every step of constructing the AST, and how to implement the visualization of each step. I have searched some similar questions and have implemented a recursive descent parser before, but I just can't figure this out. I would appreciate it if anyone can help.
This SO answer shows how to parse and build a tree as you parse.
You could easily just print the tree, or tree fragments, at each point where they are created.
To print the tree, just walk it recursively and print the nodes with indentation equal to depth of recursion.
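For example, a minimal sketch of that recursive printing in F# (the node type here is invented for illustration, not taken from the linked answer):

type ParseTree =
    | Leaf of string                          // num or name
    | Op of string * ParseTree * ParseTree    // operator with left and right subtrees

let rec printTree depth node =
    let indent = String.replicate depth "  "
    match node with
    | Leaf text ->
        printfn "%s%s" indent text
    | Op (op, left, right) ->
        printfn "%s%s" indent op
        printTree (depth + 1) left
        printTree (depth + 1) right

// Printing (1 + 2) * x:
printTree 0 (Op ("*", Op ("+", Leaf "1", Leaf "2"), Leaf "x"))
// *
//   +
//     1
//     2
//   x

Calling something like printTree after each reduction in your parser gives you a textual snapshot of the partially built AST at that step, which is the raw material for any fancier visualization.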
An excerpt of my ANTLR v4 grammar looks like this:
expression:
    | expression BINARY_OPERATOR expression
    | unaryExpression
    | nularExpression
    ;
unaryExpression:
    ID expression
    ;

nularExpression:
    ID
    | NUMBER
    | STRING
    ;
My goal is to match the language without knowing all the necessary keywords and therefore I'm simply matching keywords as IDs.
However, there are binary operators that take an argument on both sides of the keyword (e.g. <arg1> keyword <arg2>) and therefore need "special treatment". As you can see, I already included this "special treatment" in the expression rule.
The actual problem now consists of the fact that some of these binary operators can be used as unary operators (=normal keywords) as well meaning that the left argument does not have to be specified.
The above grammar can't handle this case, because every time I tried to implement it I ended up with every binary operator being consumed as a unary operator.
Example:
Let's assume count is a binary operator.
Possible syntaxes are <arg1> count <arg2> and count <arg>
All my attempts to implement the above-mentioned case ended up grouping myArgument count otherArgument like (myArgument (count (otherArgument))) instead of (myArgument) count (otherArgument).
My brain tells me that the solution to this problem is to tell the parser to always take two arguments for a binary operator, and if that fails, to try to consume the binary operator as a unary one.
Does anybody know how to accomplish this?
How about something like this:
lower_precedence_expression
    : ID higher_precedence_expression
    | higher_precedence_expression
    ;

higher_precedence_expression
    : higher_precedence_expression ID lower_precedence_expression
    | ID
    | NUMBER
    | STRING
    ;
?
So, I was wondering what makes a parser like:
line : expression EOF;
expression : m_expression (PLUS m_expression)?;
m_expression: basic (TIMES basic)?;
basic : NUMBER | VARIABLE | (OPENING expression CLOSING) | expression;
left recursive and invalid, while a parser like
line : expression EOF;
expression : m_expression (PLUS m_expression)?;
m_expression: basic (TIMES basic)?;
basic : NUMBER | VARIABLE | (OPENING expression CLOSING);
is valid and works even though the definition of 'basic' still refers to 'expression'. In particular, I'd like to be able to parse expressions in the form of
a+b+c
without introducing operations acting on more than two operands.
In the first grammar, line calls expression, which calls m_expression, which calls basic, which calls expression again... that is indirectly left recursive, and it is invalid in both ANTLR v3 and v4. The definition of left recursion is that you can get back to the same rule without consuming a token. In the second grammar, the OPENING token sits in front of expression, so a token is consumed before the recursion.
I am currently learning how to create a simple expression language using Irony. I'm having a little bit of trouble figuring out the best way to define function signatures, and determining whose responsibility it is to validate the input to those functions.
So far, I have a simple grammar that defines the basic elements of my language. This includes a handful of binary operators, parentheses, numbers, identifiers, and function calls. The BNF for my grammar looks something like this:
<expression> ::= <number> | <parenexp> | <binexp> | <fncall> | <identifier>
<parenexp> ::= ( <expression> )
<fncall> ::= <identifier> ( <argumentlist> )
<binexp> ::= <expression> <binop> <expression>
<binop> ::= + | - | * | / | %
... the rest of the grammar definition
Using the Irony parser, I am able to validate the syntax of various input strings to make sure they conform to this grammar:
x + y / z * AVG(a + b, p) -> Valid Syntax
x +/ AVG(x -> Invalid Syntax
All that is well and good, but now I want to go a step further and define the available functions, along with the number of parameters that each function requires. So for example, I want to have a function FOO that accepts one parameter and BAR that accepts two parameters:
FOO(a + b) * BAR(x + y, p + q) -> Valid
FOO(a + b, 13) -> Invalid
When the second statement is parsed, I'd like to be able to output an error message that is aware of the expected input for this function:
Too many arguments specified for function 'FOO'
I don't actually need to evaluate any of these statements, only validate the syntax of the statements and determine if they are valid expressions or not.
How exactly should I be doing this? I know that technically I could simply add the functions to the grammar like so:
<foofncall> ::= FOO( <expression> )
<barfncall> ::= BAR( <expression>, <expression> )
But something about this doesn't feel quite right. To me it seems like the grammar should only define a generic function call, and not every function available to the language.
How is this typically accomplished in other languages?
What are the components called that should handle the responsibilities of analyzing the basic syntax of the language grammar versus the more specific elements like function definitions? Should both responsibilities be handled by the same component?
While you can do typechecking directly in the grammar so that it's enforced in the parser, it's generally a bad idea to do so. Instead, the parser should just parse the basic syntax, and separate typechecking code should be used for typechecking.
In the normal case of a compiler, the parser just produces an abstract syntax tree or some equivalent representation of the program. Then, a typechecking pass is run over the AST to ensure that all types match appropriately: that functions are given the right number of arguments and that those arguments have the right type, and that variables have the right type for what is assigned to them and how they are used.
Besides being generally simpler, this usually allows you to give better error messages -- instead of just 'Invalid', you can say 'too many arguments to FOO' or what have you.
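As a sketch of what such a pass can look like (the types below are invented for illustration and are not Irony's API), you walk the AST produced by the generic <fncall> rule and compare each call against a table of known arities; written here in F#, but the same shape works in whatever language hosts your parser:

// Made-up AST shape for the expression language, for illustration only.
type Expr =
    | Number of float
    | Identifier of string
    | Binary of string * Expr * Expr      // operator, left operand, right operand
    | FnCall of string * Expr list        // function name, arguments

// Known functions and the number of parameters each accepts.
let arities = Map.ofList [ "FOO", 1; "BAR", 2 ]

// Walk the tree after parsing and collect error messages; the grammar itself
// only knows about a generic function call.
let rec check expr =
    match expr with
    | Number _ | Identifier _ -> []
    | Binary (_, left, right) -> check left @ check right
    | FnCall (name, args) ->
        let argErrors = List.collect check args
        match Map.tryFind name arities with
        | Some n when List.length args > n ->
            sprintf "Too many arguments specified for function '%s'" name :: argErrors
        | Some n when List.length args < n ->
            sprintf "Too few arguments specified for function '%s'" name :: argErrors
        | Some _ -> argErrors
        | None -> sprintf "Unknown function '%s'" name :: argErrors

// check (FnCall ("FOO", [Identifier "a"; Number 13.0]))
//   returns ["Too many arguments specified for function 'FOO'"]

The point is only that the arity table and the tree walk live outside the grammar, in a separate checking component.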