How to write a recursive descent parser from scratch? - parsing

As a purely academic exercise, I'm writing a recursive descent parser from scratch -- without using ANTLR or lex/yacc.
I'm writing a simple function which converts math expressions into their equivalent AST. I have the following:
// grammar
type expr =
| Lit of float
| Add of expr * expr
| Mul of expr * expr
| Div of expr * expr
| Sub of expr * expr
// tokens
type tokens =
| Num of float
| LParen | RParen
| XPlus | XStar | XMinus | XSlash
let tokenize (input : string) =
Regex.Matches(input.Replace(" ", ""), "\d+|[+/*\-()]")
|> Seq.cast<Match>
|> Seq.map (fun x -> x.Value)
|> Seq.map (function
| "+" -> XPlus
| "-" -> XMinus
| "/" -> XSlash
| "*" -> XStar
| "(" -> LParen
| ")" -> RParen
| num -> Num(float num))
|> Seq.to_list
So, tokenize "10 * (4 + 5) - 1" returns the following token stream:
[Num 10.0; XStar; LParen; Num 4.0; XPlus; Num 5.0; RParen; XMinus; Num 1.0]
At this point, I'd like to map the token stream to its AST with respect to operator precedence:
Sub(
Mul(
Lit 10.0
,Add(Lit 4.0, Lit 5.0)
)
,Lit 1.0
)
However, I'm drawing a blank. I've never written a parser from scratch, and I don't know even in principle how to begin.
How do I convert a token stream its representative AST?

Do you know about language grammars?
Assuming yes, you have a grammar with rules along the lines
...
addTerm := mulTerm addOp addTerm
| mulTerm
addOp := XPlus | XMinus
mulTerm := litOrParen mulOp mulTerm
| litOrParen
...
which ends up turning into code like (writing code in browser, never compiled)
let rec AddTerm() =
let mulTerm = MulTerm() // will parse next mul term (error if fails to parse)
match TryAddOp with // peek ahead in token stream to try parse
| None -> mulTerm // next token was not prefix for addOp rule, stop here
| Some(ao) -> // did parse an addOp
let rhsMulTerm = MulTerm()
match ao with
| XPlus -> Add(mulTerm, rhsMulTerm)
| XMinus -> Sub(mulTerm, rhsMulTerm)
and TryAddOp() =
let next = tokens.Peek()
match next with
| XPlus | XMinus ->
tokens.ConsumeNext()
Some(next)
| _ -> None
...
Hopefully you see the basic idea. This assumes a global mutable token stream that allows both 'peek at next token' and 'consume next token'.

If I remember from college classes the idea was to build expression trees like:
<program> --> <expression> <op> <expression> | <expression>
<expression> --> (<expression>) | <constant>
<op> --> * | - | + | /
<constant> --> <constant><constant> | [0-9]
then once you have construction your tree completely so you get something like:
exp
exp op exp
5 + and so on
then you run your completed tree through another program that recursively descents into the tree calculating expressions until you have an answer. If your parser doesn't understand the tree, you have a syntax error. Hope that helps.

Related

How do I preform 'lookahead' in an OCaml lexer / how do I rollback a lexeme?

Well, I'm writing my first parser, in OCaml, and I immediately somehow managed to make one with an infinite-loop.
Of particular note, I'm trying to lex identifiers according to the rules of the Scheme specification (I have no idea what I'm doing, obviously) — and there's some language in there about identifiers requiring that they are followed by a delimiter. My approach, right now, is to have a delimited_identifier regex that includes one of the delimiter characters, that should not be consumed by the main lexer … and then once that's been matched, the reading of that lexeme is reverted by Sedlexing.rollback (well, my wrapper thereof), before being passed to a sublexer that only eats the actual identifier, hopefully leaving the delimiter in the buffer to be eaten as a different lexeme by the parent lexer.
I'm using Menhir and Sedlex, mostly synthesizing the examples from #smolkaj's ocaml-parsing example-repo and RWO's parsing chapter; here's the simplest reduction of my current parser and lexer:
%token LPAR RPAR LVEC APOS TICK COMMA COMMA_AT DQUO SEMI EOF
%token <string> IDENTIFIER
(* %token <bool> BOOL *)
(* %token <int> NUM10 *)
(* %token <string> STREL *)
%start <Parser.AST.t> program
%%
program:
| p = list(expression); EOF { p }
;
expression:
| i = IDENTIFIER { Parser.AST.Atom i }
%%
… and …
(** Regular expressions *)
let newline = [%sedlex.regexp? '\r' | '\n' | "\r\n" ]
let whitespace = [%sedlex.regexp? ' ' | newline ]
let delimiter = [%sedlex.regexp? eof | whitespace | '(' | ')' | '"' | ';' ]
let digit = [%sedlex.regexp? '0'..'9']
let letter = [%sedlex.regexp? 'A'..'Z' | 'a'..'z']
let special_initial = [%sedlex.regexp?
'!' | '$' | '%' | '&' | '*' | '/' | ':' | '<' | '=' | '>' | '?' | '^' | '_' | '~' ]
let initial = [%sedlex.regexp? letter | special_initial ]
let special_subsequent = [%sedlex.regexp? '+' | '-' | '.' | '#' ]
let subsequent = [%sedlex.regexp? initial | digit | special_subsequent ]
let peculiar_identifier = [%sedlex.regexp? '+' | '-' | "..." ]
let identifier = [%sedlex.regexp? initial, Star subsequent | peculiar_identifier ]
let delimited_identifier = [%sedlex.regexp? identifier, delimiter ]
(** Swallow whitespace and comments. *)
let rec swallow_atmosphere buf =
match%sedlex buf with
| Plus whitespace -> swallow_atmosphere buf
| ";" -> swallow_comment buf
| _ -> ()
and swallow_comment buf =
match%sedlex buf with
| newline -> swallow_atmosphere buf
| any -> swallow_comment buf
| _ -> assert false
(** Return the next token. *)
let rec token buf =
swallow_atmosphere buf;
match%sedlex buf with
| eof -> EOF
| delimited_identifier ->
Sedlexing.rollback buf;
identifier buf
| '(' -> LPAR
| ')' -> RPAR
| _ -> illegal buf (Char.chr (next buf))
and identifier buf =
match%sedlex buf with
| _ -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)
(Yes, it's basically a no-op / the simplest thing possible rn. I'm trying to learn! :x)
Unfortunately, this combination results in an infinite loop in the parsing automaton:
State 0:
Lookahead token is now IDENTIFIER (1-1)
Shifting (IDENTIFIER) to state 1
State 1:
Lookahead token is now IDENTIFIER (1-1)
Reducing production expression -> IDENTIFIER
State 5:
Shifting (IDENTIFIER) to state 1
State 1:
Lookahead token is now IDENTIFIER (1-1)
Reducing production expression -> IDENTIFIER
State 5:
Shifting (IDENTIFIER) to state 1
State 1:
...
I'm new to parsing and lexing and all this; any advice would be welcome. I'm sure it's just a newbie mistake, but …
Thanks!
As said before, implementing too much logic inside the lexer is a bad idea.
However, the infinite loop does not come from the rollback but from your definition of identifier:
identifier buf =
match%sedlex buf with
| _ -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)
within this definition _ matches the shortest possible words in the language consisting of all possible characters. In other words, _ always matches the empty word μ without consuming any part of its input, sending the parser into an infinite loop.

How to fix this bug in a math expression evaluator

I've written a typical evaluator for simple math expressions (arithmetic with some custom functions) in F#. While it seems to be working correctly, some expressions don't evaluate as expected, for example, these work fine:
eval "5+2" --> 7
eval "sqrt(25)^2" --> 25
eval "1/(sqrt(4))" --> 0.5
eval "1/(2^2+2)" --> 1/6 ~ 0.1666...
but these don't:
eval "1/(sqrt(4)+2)" --> evaluates to 1/sqrt(6) ~ 0.408...
eval "1/(sqrt 4 + 2)" --> will also evaluate to 1/sqrt(6)
eval "1/(-1+3)" --> evaluates to 1/(-4) ~ -0.25
the code works as follows, tokenization (string as input) -> to rev-polish-notation (RPN) -> evalRpn
I thought that the problem seems to occur somewhere with the unary functions (functions accepting one operator), these are the sqrt function and the negation (-) function. I don't really see what's going wrong in my code. Can someone maybe point out what I am missing here?
this is my implementation in F#
open System.Collections
open System.Collections.Generic
open System.Text.RegularExpressions
type Token =
| Num of float
| Plus
| Minus
| Star
| Hat
| Sqrt
| Slash
| Negative
| RParen
| LParen
let hasAny (list: Stack<'T>) =
list.Count <> 0
let tokenize (input:string) =
let tokens = new Stack<Token>()
let push tok = tokens.Push tok
let regex = new Regex(#"[0-9]+(\.+\d*)?|\+|\-|\*|\/|\^|\)|\(|pi|e|sqrt")
for x in regex.Matches(input.ToLower()) do
match x.Value with
| "+" -> push Plus
| "*" -> push Star
| "/" -> push Slash
| ")" -> push LParen
| "(" -> push RParen
| "^" -> push Hat
| "sqrt" -> push Sqrt
| "pi" -> push (Num System.Math.PI)
| "e" -> push (Num System.Math.E)
| "-" ->
if tokens |> hasAny then
match tokens.Peek() with
| LParen -> push Minus
| Num v -> push Minus
| _ -> push Negative
else
push Negative
| value -> push (Num (float value))
tokens.ToArray() |> Array.rev |> Array.toList
let isUnary = function
| Negative | Sqrt -> true
| _ -> false
let prec = function
| Hat -> 3
| Star | Slash -> 2
| Plus | Minus -> 1
| _ -> 0
let toRPN src =
let output = new ResizeArray<Token>()
let stack = new Stack<Token>()
let rec loop = function
| Num v::tokens ->
output.Add(Num v)
loop tokens
| RParen::tokens ->
stack.Push RParen
loop tokens
| LParen::tokens ->
while stack.Peek() <> RParen do
output.Add(stack.Pop())
stack.Pop() |> ignore // pop the "("
loop tokens
| op::tokens when op |> isUnary ->
stack.Push op
loop tokens
| op::tokens ->
if stack |> hasAny then
if prec(stack.Peek()) >= prec op then
output.Add(stack.Pop())
stack.Push op
loop tokens
| [] ->
output.AddRange(stack.ToArray())
output
(loop src).ToArray()
let (#) op tok =
match tok with
| Num v ->
match op with
| Sqrt -> Num (sqrt v)
| Negative -> Num (v * -1.0)
| _ -> failwith "input error"
| _ -> failwith "input error"
let (##) op toks =
match toks with
| Num v,Num u ->
match op with
| Plus -> Num(v + u)
| Minus -> Num(v - u)
| Star -> Num(v * u)
| Slash -> Num(u / v)
| Hat -> Num(u ** v)
| _ -> failwith "input error"
| _ -> failwith "inpur error"
let evalRPN src =
let stack = new Stack<Token>()
let rec loop = function
| Num v::tokens ->
stack.Push(Num v)
loop tokens
| op::tokens when op |> isUnary ->
let result = op # stack.Pop()
stack.Push result
loop tokens
| op::tokens ->
let result = op ## (stack.Pop(),stack.Pop())
stack.Push result
loop tokens
| [] -> stack
if loop src |> hasAny then
match stack.Pop() with
| Num v -> v
| _ -> failwith "input error"
else failwith "input error"
let eval input =
input |> (tokenize >> toRPN >> Array.toList >> evalRPN)
Before answering your specific question, did you notice you have another bug? Try eval "2-4" you get 2.0 instead of -2.0.
That's probably because along these lines:
match op with
| Plus -> Num(v + u)
| Minus -> Num(v - u)
| Star -> Num(v * u)
| Slash -> Num(u / v)
| Hat -> Num(u ** v)
u and v are swapped, in commutative operations you don't notice the difference, so just revert them to u -v.
Now regarding the bug you mentioned, the cause seems obvious to me, by looking at your code you missed the precedence of those unary operations:
let prec = function
| Hat -> 3
| Star | Slash -> 2
| Plus | Minus -> 1
| _ -> 0
I tried adding them this way:
let prec = function
| Negative -> 5
| Sqrt -> 4
| Hat -> 3
| Star | Slash -> 2
| Plus | Minus -> 1
| _ -> 0
And now it seems to be fine.
Edit: meh, seems I was late, Gustavo posted the answer while I was wondering about the parentheses. Oh well.
Unary operators have the wrong precedence. Add the primary case | a when isUnary a -> 4 to prec.
The names of LParen and RParen are consistently swapped throughout the code. ( maps to RParen and ) to LParen!
It runs all tests from the question properly for me, given the appropriate precedence, but I haven't checked the code for correctness.

Incomplete match with AND patterns

I've defined an expression tree structure in F# as follows:
type Num = int
type Name = string
type Expr =
| Con of Num
| Var of Name
| Add of Expr * Expr
| Sub of Expr * Expr
| Mult of Expr * Expr
| Div of Expr * Expr
| Pow of Expr * Expr
| Neg of Expr
I wanted to be able to pretty-print the expression tree so I did the following:
let (|Unary|Binary|Terminal|) expr =
match expr with
| Add(x, y) -> Binary(x, y)
| Sub(x, y) -> Binary(x, y)
| Mult(x, y) -> Binary(x, y)
| Div(x, y) -> Binary(x, y)
| Pow(x, y) -> Binary(x, y)
| Neg(x) -> Unary(x)
| Con(x) -> Terminal(box x)
| Var(x) -> Terminal(box x)
let operator expr =
match expr with
| Add(_) -> "+"
| Sub(_) | Neg(_) -> "-"
| Mult(_) -> "*"
| Div(_) -> "/"
| Pow(_) -> "**"
| _ -> failwith "There is no operator for the given expression."
let rec format expr =
match expr with
| Unary(x) -> sprintf "%s(%s)" (operator expr) (format x)
| Binary(x, y) -> sprintf "(%s %s %s)" (format x) (operator expr) (format y)
| Terminal(x) -> string x
However, I don't really like the failwith approach for the operator function since it's not compile-time safe. So I rewrote it as an active pattern:
let (|Operator|_|) expr =
match expr with
| Add(_) -> Some "+"
| Sub(_) | Neg(_) -> Some "-"
| Mult(_) -> Some "*"
| Div(_) -> Some "/"
| Pow(_) -> Some "**"
| _ -> None
Now I can rewrite my format function beautifully as follows:
let rec format expr =
match expr with
| Unary(x) & Operator(op) -> sprintf "%s(%s)" op (format x)
| Binary(x, y) & Operator(op) -> sprintf "(%s %s %s)" (format x) op (format y)
| Terminal(x) -> string x
I assumed, since F# is magic, that this would just work. Unfortunately, the compiler then warns me about incomplete pattern matches, because it can't see that anything that matches Unary(x) will also match Operator(op) and anything that matches Binary(x, y) will also match Operator(op). And I consider warnings like that to be as bad as compiler errors.
So my questions are: Is there a specific reason why this doesn't work (like have I left some magical annotation off somewhere or is there something that I'm just not seeing)? Is there a simple workaround I could use to get the type of safety I want? And is there an inherent problem with this type of compile-time checking, or is it something that F# might add in some future release?
If you code the destinction between ground terms and complex terms into the type system, you can avoid the runtime check and make them be complete pattern matches.
type Num = int
type Name = string
type GroundTerm =
| Con of Num
| Var of Name
type ComplexTerm =
| Add of Term * Term
| Sub of Term * Term
| Mult of Term * Term
| Div of Term * Term
| Pow of Term * Term
| Neg of Term
and Term =
| GroundTerm of GroundTerm
| ComplexTerm of ComplexTerm
let (|Operator|) ct =
match ct with
| Add(_) -> "+"
| Sub(_) | Neg(_) -> "-"
| Mult(_) -> "*"
| Div(_) -> "/"
| Pow(_) -> "**"
let (|Unary|Binary|) ct =
match ct with
| Add(x, y) -> Binary(x, y)
| Sub(x, y) -> Binary(x, y)
| Mult(x, y) -> Binary(x, y)
| Div(x, y) -> Binary(x, y)
| Pow(x, y) -> Binary(x, y)
| Neg(x) -> Unary(x)
let (|Terminal|) gt =
match gt with
| Con x -> Terminal(string x)
| Var x -> Terminal(string x)
let rec format expr =
match expr with
| ComplexTerm ct ->
match ct with
| Unary(x) & Operator(op) -> sprintf "%s(%s)" op (format x)
| Binary(x, y) & Operator(op) -> sprintf "(%s %s %s)" (format x) op (format y)
| GroundTerm gt ->
match gt with
| Terminal(x) -> x
also, imo, you should avoid boxing if you want to be type-safe. If you really want both cases, make two pattern. Or, as done here, just make a projection to the type you need later on. This way you avoid the boxing and instead you return what you need for printing.
I think you can make operator a normal function rather than an active pattern. Because operator is just a function which gives you an operator string for an expr, where as unary, binary and terminal are expression types and hence it make sense to pattern match on them.
let operator expr =
match expr with
| Add(_) -> "+"
| Sub(_) | Neg(_) -> "-"
| Mult(_) -> "*"
| Div(_) -> "/"
| Pow(_) -> "**"
| Var(_) | Con(_) -> ""
let rec format expr =
match expr with
| Unary(x) -> sprintf "%s(%s)" (operator expr) (format x)
| Binary(x, y) -> sprintf "(%s %s %s)" (format x) (operator expr) (format y)
| Terminal(x) -> string x
I find the best solution is to restructure your original type defintion:
type UnOp = Neg
type BinOp = Add | Sub | Mul | Div | Pow
type Expr =
| Int of int
| UnOp of UnOp * Expr
| BinOp of BinOp * Expr * Expr
All sorts of functions can then be written over the UnOp and BinOp types including selecting operators. You may even want to split BinOp into arithmetic and comparison operators in the future.
For example, I used this approach in the (non-free) article "Language-oriented programming: The Term-level Interpreter
" (2008) in the F# Journal.

Haskell parser to AST data type, assignment

I have been searching around the interwebs for a couple of days, trying to get an answer to my questions and i'm finally admitting defeat.
I have been given a grammar:
Dig ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Int ::= Dig | Dig Int
Var ::= a | b | ... z | A | B | C | ... | Z
Expr ::= Int | - Expr | + Expr Expr | * Expr Expr | Var | let Var = Expr in Expr
And i have been told to parse, evaluate and print expressions using this grammar
where the operators * + - has their normal meaning
The specific task is to write a function parse :: String -> AST
that takes a string as input and returns an abstract syntax tree when the input is in the correct format (which i can asume it is).
I am told that i might need a suitable data type and that data type might need to derive from some other classes.
Following an example output
data AST = Leaf Int | Sum AST AST | Min AST | ...
Further more, i should consider writing a function
tokens::String -> [String]
to split the input string into a list of tokens
Parsing should be accomplished with
ast::[String] -> (AST,[String])
where the input is a list of tokens and it outputs an AST, and to parse sub-expressions i should simply use the ast function recursively.
I should also make a printExpr method to print the result so that
printE: AST -> String
printE(parse "* 5 5") yields either "5*5" or "(5*5)"
and also a function to evaluate the expression
evali :: AST -> Int
I would just like to be pointed in the right direction of where i might start. I have little knowledge of Haskell and FP in general and trying to solve this task i made some string handling function out of Java which made me realize that i'm way off track.
So a little pointer in the right direction, and maybe an explantion to 'how' the AST should look like
Third day in a row and still no running code, i really appreciate any attempt to help me find a solution!
Thanks in advance!
Edit
I might have been unclear:
I'm wondering how i should go about from having read and tokenized an input string to making an AST.
Parsing tokens into an Abstract Syntax Tree
OK, let's take your grammar
Dig ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Int ::= Dig | Dig Int
Var ::= a | b | ... z | A | B | C | ... | Z
Expr ::= Int | - Expr | + Expr Expr | * Expr Expr | Var | let Var = Expr in Expr
This is a nice easy grammar, because you can tell from the first token what sort of epression it will be.
(If there was something more complicated, like + coming between numbers, or - being used for subtraction as
well as negation, you'd need the list-of-successes trick, explained in
Functional Parsers.)
Let's have some sample raw input:
rawinput = "- 6 + 45 let x = - 5 in * x x"
Which I understand from the grammar represents "(- 6 (+ 45 (let x=-5 in (* x x))))",
and I'll assume you tokenised it as
tokenised_input' = ["-","6","+","4","5","let","x","=","-","5","in","*","x","x"]
which fits the grammar, but you might well have got
tokenised_input = ["-","6","+","45","let","x","=","-","5","in","*","x","x"]
which fits your sample AST better. I think it's good practice to name your AST after bits of your grammar,
so I'm going to go ahead and replace
data AST = Leaf Int | Sum AST AST | Min AST | ...
with
data Expr = E_Int Int | E_Neg Expr | E_Sum Expr Expr | E_Prod Expr Expr | E_Var Char
| E_Let {letvar::Char,letequal:: Expr,letin::Expr}
deriving Show
I've named the bits of an E_Let to make it clearer what they represent.
Writing a parsing function
You could use isDigit by adding import Data.Char (isDigit) to help out:
expr :: [String] -> (Expr,[String])
expr [] = error "unexpected end of input"
expr (s:ss) | all isDigit s = (E_Int (read s),ss)
| s == "-" = let (e,ss') = expr ss in (E_Neg e,ss')
| s == "+" = (E_Sum e e',ss'') where
(e,ss') = expr ss
(e',ss'') = expr ss'
-- more cases
Yikes! Too many let clauses obscuring the meaning,
and we'll be writing the same code for E_Prod and very much worse for E_Let.
Let's get this sorted out!
The standard way of dealing with this is to write some combinators;
instead of tiresomely threading the input [String]s through our definition, define ways to
mess with the output of parsers (map) and combine
multiple parsers into one (lift).
Clean up the code 1: map
First we should define pmap, our own equivalent of the map function so we can do pmap E_Neg (expr1 ss)
instead of let (e,ss') = expr1 ss in (E_Neg e,ss')
pmap :: (a -> b) -> ([String] -> (a,[String])) -> ([String] -> (b,[String]))
nonono, I can't even read that! We need a type synonym:
type Parser a = [String] -> (a,[String])
pmap :: (a -> b) -> Parser a -> Parser b
pmap f p = \ss -> let (a,ss') = p ss
in (f a,ss')
But really this would be better if I did
data Parser a = Par [String] -> (a,[String])
so I could do
instance Functor Parser where
fmap f (Par p) = Par (pmap f p)
I'll leave that for you to figure out if you fancy.
Clean up the code 2: combining two parsers
We also need to deal with the situation when we have two parsers to run,
and we want to combine their results using a function. This is called lifting the function to parsers.
liftP2 :: (a -> b -> c) -> Parser a -> Parser b -> Parser c
liftP2 f p1 p2 = \ss0 -> let
(a,ss1) = p1 ss0
(b,ss2) = p2 ss1
in (f a b,ss2)
or maybe even three parsers:
liftP3 :: (a -> b -> c -> d) -> Parser a -> Parser b -> Parser c -> Parser d
I'll let you think how to do that.
In the let statement you'll need liftP5 to parse the sections of a let statement,
lifting a function that ignores the "=" and "in". You could make
equals_ :: Parser ()
equals_ [] = error "equals_: expected = but got end of input"
equals_ ("=":ss) = ((),ss)
equals_ (s:ss) = error $ "equals_: expected = but got "++s
and a couple more to help out with this.
Actually, pmap could also be called liftP1, but map is the traditional name for that sort of thing.
Rewritten with the nice combinators
Now we're ready to clean up expr:
expr :: [String] -> (Expr,[String])
expr [] = error "unexpected end of input"
expr (s:ss) | all isDigit s = (E_Int (read s),ss)
| s == "-" = pmap E_Neg expr ss
| s == "+" = liftP2 E_Sum expr expr ss
-- more cases
That'd all work fine. Really, it's OK. But liftP5 is going to be a bit long, and feels messy.
Taking the cleanup further - the ultra-nice Applicative way
Applicative Functors is the way to go.
Remember I suggested refactoring as
data Parser a = Par [String] -> (a,[String])
so you could make it an instance of Functor? Perhaps you don't want to,
because all you've gained is a new name fmap for the perfectly working pmap and
you have to deal with all those Par constructors cluttering up your code.
Perhaps this will make you reconsider, though; we can import Control.Applicative,
then using the data declaration, we can
define <*>, which sort-of means then and use <$> instead of pmap, with *> meaning
<*>-but-forget-the-result-of-the-left-hand-side so you would write
expr (s:ss) | s == "let" = E_Let <$> var *> equals_ <*> expr <*> in_ *> expr
Which looks a lot like your grammar definition, so it's easy to write code that works first time.
This is how I like to write Parsers. In fact, it's how I like to write an awful lot of things.
You'd only have to define fmap, <*> and pure, all simple ones, and no long repetative liftP3, liftP4 etc.
Read up about Applicative Functors. They're great.
Applicative in Learn You a Haskell for Great Good!
Applicative in Haskell wikibook
Hints for making Parser applicative: pure doesn't change the list.
<*> is like liftP2, but the function doesn't come from outside, it comes as the output from p1.
To make a start with Haskell itself, I'd recommend Learn You a Haskell for Great Good!, a very well-written and entertaining guide. Real World Haskell is another oft-recommended starting point.
Edit: A more fundamental introduction to parsing is Functional Parsers. I wanted How to replace a failure by a list of successes by PHilip Wadler. Sadly, it doesn't seem to be available online.
To make a start with parsing in Haskell, I think you should first read monadic parsing in Haskell, then maybe this wikibook example, but also then the parsec guide.
Your grammar
Dig ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Int ::= Dig | Dig Int
Var ::= a | b | ... z | A | B | C | ... | Z
Expr ::= Int | - Expr | + Expr Expr | * Expr Expr | Var | let Var = Expr in Expr
suggests a few abstract data types:
data Dig = Dig_0 | Dig_1 | Dig_2 | Dig_3 | Dig_4 | Dig_5 | Dig_6 | Dig_7 | Dig_8 | Dig_9
data Integ = I_Dig Dig | I_DigInt Dig Integ
data Var = Var_a | Var_b | ... Var_z | Var_A | Var_B | Var_C | ... | Var_Z
data Expr = Expr_I Integ
| Expr_Neg Expr
| Expr_Plus Expr Expr
| Expr_Times Expr Expr Var
| Expr_Var Var
| Expr_let Var Expr Expr
This is inherently a recursively defined syntax tree, no need to make another one.
Sorry about the clunky Dig_ and Integ_ stuff - they have to start with an uppercase.
(Personally I'd want to be converting the Integs to Ints straight away, so would have done newtype Integ = Integ Int, and would probably have done newtype Var = Var Char but that might not suit you.)
Once you've done with the basic ones - dig and var, and neg_, plus_, in_ etcI'd go with the Applicative interface to build them up, so for example your parser expr for Expr would be something like
expr = Expr_I <$> integ
<|> Expr_Neg <$> neg_ *> expr
<|> Expr_Plus <$> plus_ *> expr <*> expr
<|> Expr_Times <$> times_ *> expr <*> expr
<|> Expr_Var <$> var
<|> Expr_let <$> let_ *> var <*> equals_ *> expr <*> in_ *> expr
So nearly all the time, your Haskell code is clean and closely resembles the grammar you were given.
OK, so it seems you're trying to build lots and lots of stuff, and you're not really sure exactly where it's all going. I would suggest that getting the definition for AST right, and then trying to implement evali would be a good start.
The grammer you've listed is interesting... You seem to want to input * 5 5, but output 5*5, whic is an odd choice. Is that really supposed to be a unary minus, not binary? Simlarly, * Expr Expr Var looks like perhaps you might have meant to type * Expr Expr | Var...
Anyway, making some assumptions about what you meant to say, your AST is going to look something like this:
data AST = Leaf Int | Sum AST AST | Minus AST | Var String | Let String AST AST
Now, let us try to do printE. It takes an AST and gives us a string. By the definition above, the AST has to be one of five possible things. You just need to figure out what to print for each one!
printE :: AST -> String
printE (Leaf x ) = show x
printE (Sum x y) = printE x ++ " + " ++ printE y
printE (Minus x ) = "-" ++ printE x
...
show turns an Int into a String. ++ joins two strings together. I'll let you work out the rest of the function. (The tricky thing is if you want it to print brackets to correctly show the order of subexpressions... Since your grammer doesn't mention brackets, I guess no.)
Now, how about evali? Well, this is going to be a similar deal. If the AST is a Leaf x, then x is an Int, and you just return that. If you have, say, Minus x, then x isn't an integer, it's an AST, so you need to turn it into an integer with evali. The function looks something like
evali :: AST -> Int
evali (Leaf x ) = x
evali (Sum x y) = (evali x) + (evali y)
evali (Minus x ) = 0 - (evali x)
...
That works great so far. But wait! It looks like you're supposed to be able to use Let to define new variables, and refer to them later with Var. Well, in that case, you need to store those variables somewhere. And that's going to make the function more complicated.
My recommendation would be to use Data.Map to store a list of variable names and their corresponding values. You'll need to add the variable map to a type signature. You can do it this way:
evali :: AST -> Int
evali ast = evaluate Data.Map.empty ast
evaluate :: Map String Int -> AST -> Int
evaluate m ast =
case ast of
...same as before...
Let var ast1 ast2 -> evaluate (Data.Map.insert var (evaluate m ast1)) ast2
Var var -> m ! var
So evali now just calls evaluate with an empty variable map. When evaluate sees Let, it adds the variable to the map. And when it sees Var, it looks up the name in the map.
As for parsing a string into an AST in the first place, that is an entire other answer again...

Parsing grammars using OCaml

I have a task to write a (toy) parser for a (toy) grammar using OCaml and not sure how to start (and proceed with) this problem.
Here's a sample Awk grammar:
type ('nonterm, 'term) symbol = N of 'nonterm | T of 'term;;
type awksub_nonterminals = Expr | Term | Lvalue | Incrop | Binop | Num;;
let awksub_grammar =
(Expr,
function
| Expr ->
[[N Term; N Binop; N Expr];
[N Term]]
| Term ->
[[N Num];
[N Lvalue];
[N Incrop; N Lvalue];
[N Lvalue; N Incrop];
[T"("; N Expr; T")"]]
| Lvalue ->
[[T"$"; N Expr]]
| Incrop ->
[[T"++"];
[T"--"]]
| Binop ->
[[T"+"];
[T"-"]]
| Num ->
[[T"0"]; [T"1"]; [T"2"]; [T"3"]; [T"4"];
[T"5"]; [T"6"]; [T"7"]; [T"8"]; [T"9"]]);;
And here's some fragments to parse:
let frag1 = ["4"; "+"; "3"];;
let frag2 = ["9"; "+"; "$"; "1"; "+"];;
What I'm looking for is a rulelist that is the result of the parsing a fragment, such as this one for frag1 ["4"; "+"; "3"]:
[(Expr, [N Term; N Binop; N Expr]);
(Term, [N Num]);
(Num, [T "3"]);
(Binop, [T "+"]);
(Expr, [N Term]);
(Term, [N Num]);
(Num, [T "4"])]
The restriction is to not use any OCaml libraries other than List... :/
Here is a rough sketch - straightforwardly descend into the grammar and try each branch in order. Possible optimization : tail recursion for single non-terminal in a branch.
exception Backtrack
let parse l =
let rules = snd awksub_grammar in
let rec descend gram l =
let rec loop = function
| [] -> raise Backtrack
| x::xs -> try attempt x l with Backtrack -> loop xs
in
loop (rules gram)
and attempt branch (path,tokens) =
match branch, tokens with
| T x :: branch' , h::tokens' when h = x ->
attempt branch' ((T x :: path),tokens')
| N n :: branch' , _ ->
let (path',tokens) = descend n ((N n :: path),tokens) in
attempt branch' (path', tokens)
| [], _ -> path,tokens
| _, _ -> raise Backtrack
in
let (path,tail) = descend (fst awksub_grammar) ([],l) in
tail, List.rev path
Ok, so the first think you should do is write a lexical analyser. That's the
function that takes the ‘raw’ input, like ["3"; "-"; "("; "4"; "+"; "2"; ")"],
and splits it into a list of tokens (that is, representations of terminal symbols).
You can define a token to be
type token =
| TokInt of int (* an integer *)
| TokBinOp of binop (* a binary operator *)
| TokOParen (* an opening parenthesis *)
| TokCParen (* a closing parenthesis *)
and binop = Plus | Minus
The type of the lexer function would be string list -> token list and the ouput of
lexer ["3"; "-"; "("; "4"; "+"; "2"; ")"]
would be something like
[ TokInt 3; TokBinOp Minus; TokOParen; TokInt 4;
TBinOp Plus; TokInt 2; TokCParen ]
This will make the job of writing the parser easier, because you won't have to
worry about recognising what is a integer, what is an operator, etc.
This is a first, not too difficult step because the tokens are already separated.
All the lexer has to do is identify them.
When this is done, you can write a more realistic lexical analyser, of type string -> token list, that takes a actual raw input, such as "3-(4+2)" and turns it into a token list.
I'm not sure if you specifically require the derivation tree, or if this is a just a first step in parsing. I'm assuming the latter.
You could start by defining the structure of the resulting abstract syntax tree by defining types. It could be something like this:
type expr =
| Operation of term * binop * term
| Term of term
and term =
| Num of num
| Lvalue of expr
| Incrop of incrop * expression
and incrop = Incr | Decr
and binop = Plus | Minus
and num = int
Then I'd implement a recursive descent parser. Of course it would be much nicer if you could use streams combined with the preprocessor camlp4of...
By the way, there's a small example about arithmetic expressions in the OCaml documentation here.

Resources