When writing a parser, I want to remember the location of each lexeme found, so that I can report useful error messages to the programmer, as in “if-less else on line 23” or “unexpected character on line 45, character 6” or “variable not defined” or something similar. But once I have built the syntax tree, I will transform it in several ways, optimizing it or expanding some kind of macros. The transformations produce or rearrange lexemes which do not have a meaningful location.
Therefore it seems that the type representing the syntax tree should come in two flavors: one with locations decorating lexemes and one without locations. Ideally we would like to work with a purely abstract syntax tree, as defined in the OCaml book:
# type unr_op = UMINUS | NOT ;;
# type bin_op = PLUS | MINUS | MULT | DIV | MOD
              | EQUAL | LESS | LESSEQ | GREAT | GREATEQ | DIFF
              | AND | OR ;;
# type expression =
    ExpInt of int
  | ExpVar of string
  | ExpStr of string
  | ExpUnr of unr_op * expression
  | ExpBin of expression * bin_op * expression ;;
# type command =
    Rem of string
  | Goto of int
  | Print of expression
  | Input of string
  | If of expression * int
  | Let of string * expression ;;
# type line = { num : int ; cmd : command } ;;
# type program = line list ;;
We should be allowed to totally forget about locations when working on that tree and have special functions to map an expression back to its location (for instance), that we could use in case of emergency.
What is the best way to define such a type in OCaml or to handle lexeme positions?
The best way is to work always with AST nodes fully annotated with the locations. For example:
type expression = {
  expr_desc : expr_desc;
  expr_loc : Lexing.position * Lexing.position; (* start and end *)
}
and expr_desc =
    ExpInt of int
  | ExpVar of string
  | ExpStr of string
  | ExpUnr of unr_op * expression
  | ExpBin of expression * bin_op * expression
Your idea of keeping the AST free of locations and writing a function to retrieve the missing locations is not a good one, I believe. Such a function would have to look nodes up by physical (pointer) equality or something similar, which does not really scale.
I strongly recommend looking through the OCaml compiler's parser.mly, which is a full-scale example of an AST with locations.
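As a rough sketch of how transformations can coexist with located nodes: a rewritten node can reuse the location of the node it replaces, and a node created from nothing can carry a placeholder location. The names `loc`, `no_loc`, `mk` and `fold` below are illustrative assumptions, not taken from the compiler sources, and the operator type is cut down for brevity.

```ocaml
(* Reduced operator type, assumed for this sketch only. *)
type bin_op = PLUS | MULT

type loc = Lexing.position * Lexing.position (* start and end *)

type expression = { expr_desc : expr_desc; expr_loc : loc }
and expr_desc =
  | ExpInt of int
  | ExpVar of string
  | ExpBin of expression * bin_op * expression

(* Hypothetical placeholder for nodes that have no source position. *)
let no_loc : loc = (Lexing.dummy_pos, Lexing.dummy_pos)

(* Hypothetical helper: build a node, defaulting to the placeholder. *)
let mk ?(loc = no_loc) expr_desc = { expr_desc; expr_loc = loc }

(* Constant folding: a folded node keeps the location of the
   expression it replaces, so later errors still point at the source. *)
let rec fold e =
  match e.expr_desc with
  | ExpBin (l, op, r) ->
    let l = fold l and r = fold r in
    (match l.expr_desc, op, r.expr_desc with
     | ExpInt a, PLUS, ExpInt b -> mk ~loc:e.expr_loc (ExpInt (a + b))
     | ExpInt a, MULT, ExpInt b -> mk ~loc:e.expr_loc (ExpInt (a * b))
     | _ -> { e with expr_desc = ExpBin (l, op, r) })
  | ExpInt _ | ExpVar _ -> e
```

Code that does not care about locations can match on `expr_desc` alone and ignore `expr_loc`.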
I created a discriminated union which has three possible options:
type Tool =
    | Hammer
    | Screwdriver
    | Nail
I would like to match a single character to one tool option. I wrote this function:
let getTool (letter: char) =
    match letter with
    | H -> Tool.Hammer
    | S -> Tool.Screwdriver
    | N -> Tool.Nail
Visual Studio Code now warns me that only the first rule will be matched and that the other rules never will be.
Can somebody please explain this behaviour and maybe provide an alternative?
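That's not how characters are denoted in F#. What you wrote are variable names, not characters. A bare identifier in a pattern matches any value and binds it to that name, so the first rule catches every input and the remaining rules can never be reached.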
To denote a character, use single quotes:
let getTool (letter: char) =
    match letter with
    | 'H' -> Tool.Hammer
    | 'S' -> Tool.Screwdriver
    | 'N' -> Tool.Nail
Apart from the character syntax (inside single quotes - see Fyodor's response), you should handle the case when the letter is not H, S or N, either using the option type or throwing an exception (less functional but enough for an exercise):
type Tool =
    | Hammer
    | Screwdriver
    | Nail

module Tool =
    let ofLetter (letter: char) =
        match letter with
        | 'H' -> Hammer
        | 'S' -> Screwdriver
        | 'N' -> Nail
        | _ -> invalidArg (nameof letter) $"Unsupported letter '{letter}'"
Usage:
> Tool.ofLetter 'S';;
val it : Tool = Screwdriver
> Tool.ofLetter 'C';;
System.ArgumentException: Unsupported letter 'C' (Parameter 'letter')
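The option-based alternative mentioned above could look roughly like this. The sketch is written in OCaml, where the syntax is nearly identical to F#; the name `tool_of_letter` is an assumption for illustration.

```ocaml
type tool = Hammer | Screwdriver | Nail

(* Returns None for unsupported letters instead of raising,
   so the caller decides how to handle the failure. *)
let tool_of_letter (letter : char) : tool option =
  match letter with
  | 'H' -> Some Hammer
  | 'S' -> Some Screwdriver
  | 'N' -> Some Nail
  | _ -> None
```

The caller then has to match on `Some`/`None`, which makes the failure case visible in the types.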
I looked at how to make a tree from a given data with F# and https://citizen428.net/blog/learning-fsharp-binary-search-tree/
Basically, what I am attempting to do is implement a function that builds an extremely simple AST, using discriminated unions (DU) to represent the tree.
I want to use tokens/symbols to build the tree. I think these could also be represented by a DU. I am struggling to implement the insert function.
Let's just say we use the following to represent the tree. The basic idea is that for addition and subtraction of integers I'll only need a binary tree. An Expression can be either an operator or a constant. This might be the wrong way of implementing the tree, but I'm not sure.
type Tree =
    | Node of Tree * Expression * Tree
    | Empty
and Expression =
    | Operator //could be a token or another type
    | Constant of int
And let's use the following for representing tokens. There's probably a smarter way of doing this. This is just an example.
type Token =
    | Integer
    | Add
    | Subtract
How should I implement the insert function? I've written the function below and tried different ways of inserting elements.
let rec insert tree element =
    match element, tree with
    //use Empty to initialize
    | x, Empty -> Node(Empty, x, Empty)
    | x, Node(Empty,y,Empty) when (*x is something here*) -> Node((*something*))
    | _, _ -> failwith "Missing case"
If you have any advice or a link, I would appreciate it.
I don't think it helps to think about this problem in terms of tree insertion, because what you really want to do is parse a sequence of tokens. Instead of inserting into a generic tree, you need to construct the tree (the expression) in a way that follows the structure of the input.
For example, say I have:
let input = [Integer 1; Add; Integer 2; Subtract; Integer 1;]
Say I want to parse this sequence of tokens to get a representation of 1 + (2 - 1). The parentheses are on the wrong side here, since these operators normally associate to the left, but it makes the idea easier to explain.
My approach would be to define a recursive Expression type rather than using a general tree:
type Token =
    | Integer of int
    | Add
    | Subtract
type Operator =
    | AddOp | SubtractOp
type Expression =
    | Binary of Operator * Expression * Expression
    | Constant of int
To parse a sequence of tokens, you can write something like:
let rec parse input =
    match input with
    | Integer i::Add::rest ->
        Binary(AddOp, Constant i, parse rest)
    | Integer i::Subtract::rest ->
        Binary(SubtractOp, Constant i, parse rest)
    | Integer i::[] ->
        Constant i
    | _ -> failwith "Unexpected token"
This looks for lists starting with Integer i; Add; ... or similar with subtract and constructs a tree recursively. Using the above input, you get:
> parse input;;
val it : Expression =
  Binary (AddOp, Constant 1,
          Binary (SubtractOp, Constant 2, Constant 1))
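If the conventional left-associative grouping, (1 + 2) - 1, is wanted instead, one possible variation is to carry the expression built so far in an accumulator. This is only a sketch, written in OCaml rather than F# (the two are very close here); the helper `loop` is an assumed name.

```ocaml
type token = Integer of int | Add | Subtract
type operator = AddOp | SubtractOp
type expression =
  | Binary of operator * expression * expression
  | Constant of int

let parse input =
  (* [left] is the expression parsed so far; each operator wraps it,
     so the tree nests to the left. *)
  let rec loop left = function
    | [] -> left
    | Add :: Integer i :: rest ->
      loop (Binary (AddOp, left, Constant i)) rest
    | Subtract :: Integer i :: rest ->
      loop (Binary (SubtractOp, left, Constant i)) rest
    | _ -> failwith "Unexpected token"
  in
  match input with
  | Integer i :: rest -> loop (Constant i) rest
  | _ -> failwith "Unexpected token"
```

With the same input as above, this produces `Binary (SubtractOp, Binary (AddOp, Constant 1, Constant 2), Constant 1)`.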
I have attempted to translate my grammar into an AST.
Can an AST type be recursive? For instance, I have a production eprime -> PLUS t eprime | MINUS t eprime | epsilon. Is it correct to translate that to:
type eprime =
  | Add of t eprime
  | Minus of t eprime
  | Eempty
Yes, an AST type can be recursive and often is. However the correct syntax would be Add of t * eprime. Without the * the t would be seen as a type argument to eprime, which doesn't take any.
PS: You don't have to (and probably shouldn't) model your AST after your grammar as closely as you do. It is perfectly okay to have "left recursion" in the AST, even if you've removed it from your grammar. Similarly you don't have to encode operator precedence in your AST types the same way you do in the grammar, so for example having Add and Mult in the same type is no problem. With that in mind the usual definition of an AST for expressions looks more like this:
type exp =
  | Add of exp * exp
  | Sub of exp * exp
  | Mult of exp * exp
  | Div of exp * exp
  | FunctionCall of ident * exp list
  | Var of ident
  | Const of value
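To illustrate the point about left recursion, here is a sketch of how the tail produced by an `eprime`-style rule could be folded into a left-nested expression tree. The types are reduced to integers and two operators for brevity, and the constructor names `Plus`, `Minus` and the function `build` are assumptions for this example.

```ocaml
(* Simplified AST: left-nested trees are fine here. *)
type exp =
  | Add of exp * exp
  | Sub of exp * exp
  | Const of int

(* Mirrors the grammar rule: eprime -> PLUS t eprime | MINUS t eprime | epsilon *)
type eprime =
  | Plus of exp * eprime
  | Minus of exp * eprime
  | Empty

(* Fold the right-recursive tail into a left-associative tree. *)
let rec build left = function
  | Empty -> left
  | Plus (t, rest) -> build (Add (left, t)) rest
  | Minus (t, rest) -> build (Sub (left, t)) rest
```

For example, `build (Const 1) (Minus (Const 2, Plus (Const 3, Empty)))` yields the tree for (1 - 2) + 3.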
The short answer is yes. This is more or less exactly how you define a tree-shaped data structure.
A syntactically correct definition looks more like this:
type eprime =
  | Add of t * eprime
  | Minus of t * eprime
  | Empty
If you assume t is int (for simplicity), you can create a value of this type like this:
# Add (3, Add (4, Empty));;
- : eprime = Add (3, Add (4, Empty))
How can I improve my parser grammar so that, instead of creating an AST that contains a couple of decFunc rules for my test code, it creates only one and sum becomes the second root? I have tried to solve this in several different ways, but I always get a left-recursion error.
This is my test code:
f :: [Int] -> [Int] -> [Int]
f x y = zipWith (sum) x y
sum :: [Int] -> [Int]
sum a = foldr(+) a
This image shows the tree with the two decFunc nodes:
http://postimg.org/image/w5goph9b7/
This is my grammar:
prog : stat+;
stat : decFunc | impFunc ;
decFunc : ID '::' formalType ( ARROW formalType )* NL impFunc
;
anotherFunc : ID+;
formalType : 'Int' | '[' formalType ']' ;
impFunc : ID+ '=' hr NL
;
hr : 'map' '(' ID* ')' ID*
| 'zipWith' '(' ('*' |'/' |'+' |'-') ')' ID+ | 'zipWith' '(' anotherFunc ')' ID+
| 'foldr' '(' ('*' |'/' |'+' |'-') ')' ID+
| hr op=('*'| '/' | '.&.' | 'xor' ) hr | DIGIT
| 'shiftL' hr hr | 'shiftR' hr hr
| hr op=('+'| '-') hr | DIGIT
| '(' hr ')'
| ID '(' ID* ')'
| ID
;
Your test input contains two instances of content that will match the decFunc rule. The generated parse-tree shows exactly that: two sub-trees, each having a decFunc as the root.
Antlr v4 will not produce a true AST where f and sum are the roots of separate sub-trees.
Is there anything I can do with the grammar to make both f and sum roots? – Jonny Magnam
Not directly in an Antlr v4 grammar. You could:
switch to Antlr v3, or another parser tool, and define the generated AST as you wish.
walk the Antlr v4 parse-tree and create a separate AST of your desired form.
just use the parse-tree directly with the realization that it is informationally equivalent to a classically defined AST and the implementation provides a number of practical benefits.
Specifically, the standard academic AST is mutable, meaning that every (or all but the first) visitor is custom, not generated, and that any change in the underlying grammar or an interim structure of the AST will require reconsideration and likely changes to every subsequent visitor and their implemented logic.
The Antlr v4 parse-tree is essentially immutable, allowing decorations to be accumulated against tree nodes without loss of relational integrity. Visitors all use a common base structure, greatly reducing brittleness due to grammar changes and effects of prior executed visitors. As a practical matter, tree-walks are easily constructed, fast, and mutually independent except where expressly desired. They can achieve a greater separation of concerns in design and easier code maintenance in practice.
Choose the right tool for the whole job, in whatever way you define it.
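As a language-neutral illustration of the second option (walking the parse-tree and building a separate AST), here is a small OCaml sketch. The `parse_tree` type is a hypothetical stand-in for a generic parse tree, not Antlr's actual API, and the `func` record is just one assumed shape for an AST in which each function name is the root of its own node.

```ocaml
(* Hypothetical generic parse tree: rule nodes with children, or token leaves. *)
type parse_tree =
  | Rule of string * parse_tree list
  | Token of string

(* Hypothetical AST node: one record per function, keyed by its name. *)
type func = { name : string; types : string list; body : string list }

(* Collect the token texts under a subtree, left to right. *)
let rec tokens = function
  | Token t -> [ t ]
  | Rule (_, children) -> List.concat_map tokens children

(* Turn one decFunc subtree into a func node. *)
let to_func = function
  | Rule ("decFunc", Token name :: children) ->
    let types =
      List.filter_map
        (function
          | Rule ("formalType", ts) ->
            Some (String.concat "" (List.concat_map tokens ts))
          | _ -> None)
        children
    in
    let body =
      List.concat_map
        (function
          | Rule ("impFunc", ts) -> List.concat_map tokens ts
          | _ -> [])
        children
    in
    Some { name; types; body }
  | _ -> None

(* Walk prog -> stat -> decFunc and keep only the function nodes. *)
let to_ast = function
  | Rule ("prog", stats) ->
    List.filter_map
      (function Rule ("stat", [ d ]) -> to_func d | _ -> None)
      stats
  | _ -> []
```

The result is a flat list of function nodes, so `f` and `sum` each sit at the top of their own subtree.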
So I have been reading a bit on lexers, parsers, interpreters and even compiling.
For a language I'm trying to implement I settled on a recursive descent parser. Since the original grammar of the language had left recursion, I had to slightly rewrite it.
Here's a simplified version of the grammar I had (note that it's not in any standard grammar format, but somewhat pseudo; it's how I found it in the documentation):
expr:
-----
expr + expr
expr - expr
expr * expr
expr / expr
( expr )
integer
identifier
To get rid of the left-recursion, I turned it into this (note the addition of the NOT operator):
expr:
-----
expr_term {+ expr}
expr_term {- expr}
expr_term {* expr}
expr_term {/ expr}
expr_term:
----------
! expr_term
( expr )
integer
identifier
I then go through my tokens using the following subroutines (simplified, pseudo-code-ish):
public string Expression()
{
    string term = ExpressionTerm();
    if (term != null)
    {
        while (PeekToken() == OperatorToken)
        {
            term += ReadToken() + Expression();
        }
    }
    return term;
}
public string ExpressionTerm()
{
    //PeekToken and ReadToken accordingly, otherwise return null
}
This works! The result after calling Expression is always equal to the input it was given.
This makes me wonder: if I created AST nodes rather than a string in these subroutines, and evaluated the AST using an infix evaluator (which also keeps in mind associativity and precedence of operators, etcetera), wouldn't I get the same result?
And if I do, then why are there so many topics covering "fixing left recursion, keeping in mind associativity and what not" when it's actually "dead simple" to solve, or even a non-problem as it seems? Or is it really the structure of the resulting AST people are concerned about (rather than what it evaluates to)? Could anyone shed some light on this? I might be getting it all wrong as well, haha!
The shape of the AST is important, since a+(b*3) is not usually the same as (a+b)*3 and one might reasonably expect the parser to indicate which of those a+b*3 means.
Normally, the AST will actually delete parentheses. (A parse tree wouldn't, but an AST is expected to abstract away syntactic noise.) So the AST for a+(b*3) should look something like:
        Sum
         |
     +---+---+
     |       |
    Var     Prod
     |       |
     a   +---+---+
         |       |
        Var    Const
         |       |
         b       3
If your language obeys the usual mathematical notation conventions, so will the AST for a+b*3.
An "infix evaluator" -- or what I imagine you're referring to -- is just another parser. So, yes, if you are happy to parse later, you don't have to parse now.
By the way, showing that you can put tokens back together in the order that you read them doesn't actually demonstrate much about the parser functioning. You could do that much more simply by just echoing the tokenizer's output.
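To make the point about tree shape concrete, here is a minimal OCaml sketch of such an AST together with a direct evaluator. The constructor names follow the diagram above; the environment representation (an association list) is an assumption for the example.

```ocaml
type ast =
  | Sum of ast * ast
  | Prod of ast * ast
  | Var of string
  | Const of int

(* Evaluate by plain structural recursion: no precedence logic is
   needed, because the tree shape already encodes the grouping. *)
let rec eval env = function
  | Sum (l, r) -> eval env l + eval env r
  | Prod (l, r) -> eval env l * eval env r
  | Var x -> List.assoc x env
  | Const n -> n

(* a+(b*3) and (a+b)*3 are different trees over the same tokens. *)
let t1 = Sum (Var "a", Prod (Var "b", Const 3))
let t2 = Prod (Sum (Var "a", Var "b"), Const 3)
```

With a = 1 and b = 2, `eval` gives 7 for `t1` and 9 for `t2`, even though both trees contain the same leaves in the same order.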
The standard and easiest way to deal with expressions, mathematical or other, is with a rule hierarchy that reflects the intended associations and operator precedence:
expre = sum
sum = addend '+' sum | addend
addend = term '*' addend | term
term = '(' expre ')' | '-' integer | '+' integer | integer
Such grammars let the parse trees or abstract trees be evaluated directly. You can expand the rule hierarchy to include power and bitwise operators, or make it part of the hierarchy for logical expressions with "and", "or", and comparisons.
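A recursive descent evaluator that follows this rule hierarchy one function per rule might look like the following OCaml sketch. The `token` type and the convention of returning the value together with the remaining tokens are assumptions made for the example.

```ocaml
type token = Int of int | Plus | Minus | Star | LParen | RParen

(* Each function returns (value, remaining tokens). *)

(* expre = sum *)
let rec expre toks = sum toks

(* sum = addend '+' sum | addend *)
and sum toks =
  let left, rest = addend toks in
  match rest with
  | Plus :: rest' ->
    let right, rest'' = sum rest' in
    (left + right, rest'')
  | _ -> (left, rest)

(* addend = term '*' addend | term *)
and addend toks =
  let left, rest = term toks in
  match rest with
  | Star :: rest' ->
    let right, rest'' = addend rest' in
    (left * right, rest'')
  | _ -> (left, rest)

(* term = '(' expre ')' | '-' integer | '+' integer | integer *)
and term = function
  | LParen :: rest ->
    (match expre rest with
     | v, RParen :: rest' -> (v, rest')
     | _ -> failwith "expected ')'")
  | Minus :: Int n :: rest -> (-n, rest)
  | Plus :: Int n :: rest -> (n, rest)
  | Int n :: rest -> (n, rest)
  | _ -> failwith "unexpected token"
```

Because `addend` sits below `sum` in the hierarchy, multiplication binds tighter than addition without any separate precedence table.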