How do I left-factor and eliminate left recursion?

My production rules are as follows:
S → id = Exp
S → id (Arglist)
Arglist → Arglist , Exp
Arglist → Exp
Exp → id (Arglist)
Exp → id
This is my first attempt:
S -> id S'
S' -> ϵ | = EXP | (Arglist)
Arglist -> Arglist'
Arglist' -> ϵ | ,Exp Arglist'
Exp -> id Exp'
Exp' -> ϵ | (Arglist)
My problem is with the Arglist production rule; I think my version of it is wrong.

You just need to change Arglist to being right-recursive, which will recognise the same language (with a slightly different parse tree):
Arglist → Exp , Arglist
Arglist → Exp
Then left-factor:
Arglist → Exp Arglist'
Arglist' → ε | , Exp Arglist'
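The left-factored rules map almost one-to-one onto a recursive-descent parser. Below is a minimal sketch, assuming the input has already been split into a list of string tokens; the names Exp, Ident, Call, Parser and isIdent are hypothetical and not part of the original grammar.

```haskell
import Data.Char (isAlpha)

-- Hypothetical AST for illustration only.
data Exp = Ident String | Call String [Exp]
  deriving (Show, Eq)

-- A parser consumes a prefix of the tokens and returns the rest, or fails.
type Parser a = [String] -> Maybe (a, [String])

-- Assumption: an identifier is a non-empty run of letters.
isIdent :: String -> Bool
isIdent s = not (null s) && all isAlpha s

-- Exp -> id Exp'        Exp' -> ε | ( Arglist )
parseExp :: Parser Exp
parseExp (t : "(" : rest) | isIdent t = do
  (args, rest') <- parseArglist rest
  case rest' of
    ")" : rest'' -> Just (Call t args, rest'')
    _            -> Nothing
parseExp (t : rest) | isIdent t = Just (Ident t, rest)
parseExp _ = Nothing

-- Arglist -> Exp Arglist'
parseArglist :: Parser [Exp]
parseArglist ts = do
  (e, rest)   <- parseExp ts
  (es, rest') <- parseArglist' rest
  Just (e : es, rest')

-- Arglist' -> ε | , Exp Arglist'
parseArglist' :: Parser [Exp]
parseArglist' ("," : ts) = do
  (e, rest)   <- parseExp ts
  (es, rest') <- parseArglist' rest
  Just (e : es, rest')
parseArglist' ts = Just ([], ts)   -- the ε alternative: consume nothing
```

Because the arguments are collected into a list, the change from left to right recursion does not show up in the result here.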

How do parentheses work with custom data types?

Currently, I am working on a problem of parsing and showing expressions in Haskell.
type Name = String
data Expr = Val Integer
          | Var Name
          | Expr :+: Expr
          | Expr :-: Expr
          | Expr :*: Expr
          | Expr :/: Expr
          | Expr :%: Expr
This is the code of my data type Expr, and this is how I define the show function:
instance Show Expr where
    show (Val x) = show x
    show (Var y) = y
    show (p :+: q) = par (show p ++ "+" ++ show q)
    show (p :-: q) = par (show p ++ "-" ++ show q)
    show (p :/: q) = par (show p ++ "/" ++ show q)
    show (p :*: q) = par (show p ++ "*" ++ show q)
    show (p :%: q) = par (show p ++ "%" ++ show q)
par :: String -> String
par s = "(" ++ s ++ ")"
Later I tried to transform string input into an expression, but I ran into the following problem: I don't understand how parentheses in the second case are implemented in Haskell.
*Main> Val 2 :*:Val 2 :+: Val 3
((2*2)+3)
*Main> Val 2 :*:(Val 2 :+: Val 3)
(2*(2+3))
Because of that, I am a bit confused about how I should transform parentheses from my string into the expression. Currently I am using the following function for parsing, but for now it just ignores parentheses, which is not the intended behavior:
toExpr :: String -> Expr
toExpr str = f (lexer str) (Val 0)
  where
    f [] expr = expr
    f (c:cs) expr
      | isAlpha (head c) = f cs (Var c)
      | isDigit (head c) = f cs (Val (read c))
      | c == "+" = (expr :+: f cs (Val 0))
      | c == "-" = (expr :-: f cs (Val 0))
      | c == "/" = (expr :/: f cs (Val 0))
      | c == "*" = (expr :*: f cs (Val 0))
      | c == "%" = (expr :%: f cs (Val 0))
      | otherwise = f cs expr
Edit: few grammar mistakes
I don't understand how parentheses in the second case are implemented in Haskell.
The brackets just give precedence to a certain part of the expression. The problem is not with the parentheses you render. I think the problem is that you did not assign a precedence to your operators. Unless you declare otherwise, Haskell gives every operator the same default fixity (infixl 9), so without brackets they are all parsed left to right. This means that x ⊕ y ⊗ z is parsed as (x ⊕ y) ⊗ z.
You can define the precedence of your :+:, :*:, etc. operators with infixl:
infixl 7 :*:, :/:, :%:
infixl 5 :+:, :-:
type Name = String
data Expr = Val Integer
          | Var Name
          | Expr :+: Expr
          | Expr :-: Expr
          | Expr :*: Expr
          | Expr :/: Expr
          | Expr :%: Expr
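A small self-contained check of what the fixity declarations change. This is only a sketch: it assumes a cut-down Expr with just Val, :+: and :*:, and the name grouped is made up for the example.

```haskell
-- Multiplication binds tighter than addition, as in the declarations above.
infixl 7 :*:
infixl 5 :+:

-- Cut-down version of the Expr type, for illustration only.
data Expr = Val Integer
          | Expr :+: Expr
          | Expr :*: Expr
  deriving (Show, Eq)

-- With the fixities above, the unparenthesised expression groups the
-- multiplication first. Without them both operators would be infixl 9
-- and the same source text would mean (Val 1 :+: Val 2) :*: Val 3.
grouped :: Bool
grouped = (Val 1 :+: Val 2 :*: Val 3) == (Val 1 :+: (Val 2 :*: Val 3))
```

Evaluating grouped in GHCi gives True.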
As for your parser (toExpr), you will need a proper parsing mechanism, such as an LALR parser, that keeps intermediate results on a stack so that the operators are applied in the right order.
This was my final parser, which gave me the result I needed. To get the result I wanted, I added a proper grammar and wrote a parser according to that grammar.
Thanks, everyone for the help.
{-
parser for the following grammar:
E -> T E'
E' -> + T E' | - T E' | <empty string>
T -> F T'
T' -> * F T' | / F T' | % F T' | <empty string>
F -> (E) | <integer> | <identifier>
-}
import Data.Char (isAlpha, isDigit)

parseExpr :: String -> (Expr,[String])
parseExpr tokens = parseE (lexer tokens)
parseE :: [String] -> (Expr,[String])
parseE tokens = parseE' acc rest where (acc,rest) = parseT tokens
parseE' :: Expr -> [String] -> (Expr,[String])
parseE' accepted ("+":tokens) = let (acc,rest) = parseT tokens in parseE' (accepted :+: acc) rest
parseE' accepted ("-":tokens) = let (acc,rest) = parseT tokens in parseE' (accepted :-: acc) rest
parseE' accepted tokens = (accepted,tokens)
parseT :: [String] -> (Expr,[String])
parseT tokens = let (acc,rest) = parseF tokens in parseT' acc rest
parseT' :: Expr -> [String] -> (Expr,[String])
parseT' accepted ("*":tokens) = let (acc,rest) = parseF tokens in parseT' (accepted :*: acc) rest
parseT' accepted ("/":tokens) = let (acc,rest) = parseF tokens in parseT' (accepted :/: acc) rest
parseT' accepted ("%":tokens) = let (acc,rest) = parseF tokens in parseT' (accepted :%: acc) rest
parseT' accepted tokens = (accepted,tokens)
parseF :: [String] -> (Expr,[String])
-- F -> (E): parse the inner expression, then require the closing parenthesis
parseF ("(":tokens) = case parseE tokens of
    (e, ")":rest) -> (e, rest)
    _             -> error "expected ')'"
parseF (t:tokens)
    | isAlpha (head t) = (Var t,tokens)
    | isDigit (head t) = (Val (read t),tokens)
    | otherwise = error ("unexpected token: " ++ t)
parseF [] = error "unexpected end of input"
lexer :: String -> [String]
lexer [] = []
lexer (c:cs)
    | elem c " \t\n" = lexer cs
    | elem c "=+-*/%()" = [c]:(lexer cs)
    | isAlpha c = (c:takeWhile isAlpha cs):lexer(dropWhile isAlpha cs)
    | isDigit c = (c:takeWhile isDigit cs):lexer(dropWhile isDigit cs)
    | otherwise = error ("unexpected character: " ++ [c])

Calculating first and follow set of grammar

Below is the grammar that I am using for a calculator language, along with my attempt at finding its first and follow sets.
I would love help figuring out what I am doing wrong when computing these sets, because I feel like I am not doing them correctly at all (at least for the follow sets).
Grammar
program → stmt_list $$$
stmt_list → stmt stmt_list | ε
stmt → id = expr | input id | print expr
expr → term term_tail
term_tail → add_op term term_tail | ε
term → factor fact_tail
fact_tail → mult_op fact fact_tail | ε
factor → ( expr ) | number | id
add_op → + | -
mult_op → * | / | // | %
First set
first(p) = {id, input, print}
first(stmt_list) = {id, input, print, e}
first(s) = {id, input, print}
first(expr) = {(, id, number}
first(term_tail) = {+, -, e}
first(term) = {(, id, number}
first(fact_tail) = {, /, //, %, e}
first(factor) = {(, id, number}
first(add_op) = {+, -}
first(mult_op) = {, /, //, %}
Follow Set
follow(p) = {$}
follow(stmt_list) = {$}
follow(stmt) = {id, input, print}
follow(expr) = {(, id, number, ), input, print, , /, //, %}
follow(term_tail) = {), (, id, number, print, input}
follow(term) = {+, -}
follow(factor) = {, /, //, %}
follow(add_op) = {}
follow(mult_op) = {}
follow(fact_tail) = {*, /, //, %, +, -}
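You have some mistakes in the first sets as well.
first(p) = {id, input, print, $}
It does not contain e: stmt_list can derive e, but in program → stmt_list $$$ it is always followed by the end marker, so the end marker is what gets added to first(p).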
* is missing in the next two:
first(fact_tail) = {*, /, //, %, e}
first(mult_op) = {*, /, //, %}
You wrote
fact_tail → mult_op fact fact_tail | ε
I am assuming you actually mean
fact_tail → mult_op factor fact_tail | ε
Follow
follow(stmt) = {id, input, print, $}
If you look at
stmt_list → stmt stmt_list | ε
then stmt is followed by first(stmt_list), which includes e. When stmt_list derives e, stmt is followed by whatever follows stmt_list, which is $.
follow(expr) = {(, id, number, ), input, print, , /, //, %}
I don't know how you got this. follow(expr) is follow(stmt) plus ):
follow(expr) = {id, ), input, print, $}
follow(term_tail) is equal to follow(expr)
follow(term) = {+, -, ), id, input, print, $}
follow(fact_tail) is equal to follow(term)
follow(factor) = first(fact_tail) without e, plus follow(term), because fact_tail can derive e
follow(add_op) = first(term)
follow(mult_op) = first(factor)
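Hand-computed sets are easy to get wrong, so it can help to cross-check them mechanically. The following is a rough sketch of the usual fixed-point computation of FIRST and FOLLOW for this grammar. It assumes fact_tail uses factor, writes the end marker as "$$", and represents ε by the string "ε"; all Haskell names (Grammar, firstOf, firstSets, followSets and so on) are made up for the example.

```haskell
import Data.List (tails)
import qualified Data.Map as M
import qualified Data.Set as S

type Sym = String
type Grammar = [(Sym, [Sym])]      -- (lhs, rhs); an empty rhs is an ε-production
type Sets = M.Map Sym (S.Set Sym)

eps :: Sym
eps = "ε"

grammar :: Grammar
grammar =
  [ ("program", ["stmt_list", "$$"])
  , ("stmt_list", ["stmt", "stmt_list"]), ("stmt_list", [])
  , ("stmt", ["id", "=", "expr"]), ("stmt", ["input", "id"]), ("stmt", ["print", "expr"])
  , ("expr", ["term", "term_tail"])
  , ("term_tail", ["add_op", "term", "term_tail"]), ("term_tail", [])
  , ("term", ["factor", "fact_tail"])
  , ("fact_tail", ["mult_op", "factor", "fact_tail"]), ("fact_tail", [])
  , ("factor", ["(", "expr", ")"]), ("factor", ["number"]), ("factor", ["id"])
  , ("add_op", ["+"]), ("add_op", ["-"])
  , ("mult_op", ["*"]), ("mult_op", ["/"]), ("mult_op", ["//"]), ("mult_op", ["%"])
  ]

nonterminals :: S.Set Sym
nonterminals = S.fromList (map fst grammar)

-- Apply a step repeatedly until the result stops changing.
fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint f x = let x' = f x in if x' == x then x else fixpoint f x'

-- FIRST of a string of symbols, given the FIRST sets known so far.
firstOf :: Sets -> [Sym] -> S.Set Sym
firstOf _ [] = S.singleton eps
firstOf fs (s : rest)
  | s `S.notMember` nonterminals = S.singleton s
  | eps `S.member` fs1 = S.delete eps fs1 `S.union` firstOf fs rest
  | otherwise = fs1
  where fs1 = M.findWithDefault S.empty s fs

firstSets :: Sets
firstSets = fixpoint step M.empty
  where
    step fs = foldl (\m (lhs, rhs) -> M.insertWith S.union lhs (firstOf fs rhs) m) fs grammar

followSets :: Sets
followSets = fixpoint step M.empty
  where
    step fo = foldl addProd fo grammar
    -- Visit every symbol of the right-hand side together with what comes after it.
    addProd m (lhs, rhs) = foldl (addSym lhs) m (suffixes rhs)
    addSym lhs m (s, rest)
      | s `S.member` nonterminals = M.insertWith S.union s (contrib m lhs rest) m
      | otherwise = m
    -- FIRST of the remainder; if the remainder can vanish, FOLLOW(lhs) is added too.
    contrib m lhs rest
      | eps `S.member` f = S.delete eps f `S.union` M.findWithDefault S.empty lhs m
      | otherwise = f
      where f = firstOf firstSets rest
    suffixes xs = [(s, rest) | (s : rest) <- tails xs]
```

Loading this in GHCi and evaluating, for example, M.lookup "factor" followSets is a quick way to compare against a set worked out by hand.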

Transform grammar into LL(1)

I have the following grammar:
START -> STM $
STM -> VAR = EXPR
STM -> EXPR
EXPR -> VAR
VAR -> id
VAR -> * EXPR
With these first and follow sets:
      | First set | Follow set
------+-----------+-----------
START | id, *     | $
STM   | id, *     | $
EXPR  | id, *     | $, =
VAR   | id, *     | $, =
I've created the parsing table that follows:
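      | = | id               | *                | $
------+---+------------------+------------------+---
START |   | START → STM $    | START → STM $    |
STM   |   | STM → VAR = EXPR | STM → VAR = EXPR |
      |   | STM → EXPR       | STM → EXPR       |
EXPR  |   | EXPR → VAR       | EXPR → VAR       |
VAR   |   | VAR → id         | VAR → id         |
      |   | VAR → * EXPR     | VAR → * EXPR     |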
From here I can see that this is not LL(1).
How can I modify this grammar so that it becomes LL(1)?
If you think about what sorts of strings can be generated by this particular grammar, it's all the strings of one of the following forms:
***....**id
***....**id = ***...**id
With this in mind, you can design an LL(1) grammar for this language by essentially building a new grammar for the language from scratch. Here's one way to do this:
Start → Statement $
Statement → StarredID OptExpr
StarredID → * StarredID | id
OptExpr → ε | = StarredID
Here, the FIRST and FOLLOW sets are given as follows:
FIRST(Start) = {*, id}
FIRST(Statement) = {*, id}
FIRST(StarredID) = {*, id}
FIRST(OptExpr) = {ε, *, id}
FOLLOW(Statement) = {$}
FOLLOW(StarredID) = {=, $}
FOLLOW(OptExpr) = {$}
The parse table is then shown here:
               | *                 | id                | =           | $
---------------+-------------------+-------------------+-------------+-----------
Start          | Statement $       | Statement $       |             |
Statement      | StarredID OptExpr | StarredID OptExpr |             |
StarredID      | * StarredID       | id                |             |
OptExpr        |                   |                   | = StarredID | ε
So this grammar is LL(1).
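Since every table cell holds at most one production, the grammar can be turned into a predictive recognizer with one function per nonterminal and one token of lookahead. A minimal sketch, assuming the tokens arrive as the strings "*", "=", "id" and "$"; the function names are invented for the example.

```haskell
type Tokens = [String]

-- Start -> Statement $
start :: Tokens -> Bool
start ts = case statement ts of
  Just ["$"] -> True
  _          -> False

-- Statement -> StarredID OptExpr
statement :: Tokens -> Maybe Tokens
statement ts = starredID ts >>= optExpr

-- StarredID -> * StarredID | id      (chosen by looking at the next token)
starredID :: Tokens -> Maybe Tokens
starredID ("*" : ts)  = starredID ts
starredID ("id" : ts) = Just ts
starredID _           = Nothing

-- OptExpr -> = StarredID on "=", ε on "$"; every other cell of the row is empty
optExpr :: Tokens -> Maybe Tokens
optExpr ("=" : ts)     = starredID ts
optExpr ts@("$" : _)   = Just ts
optExpr _              = Nothing
```

For instance, start ["*", "*", "id", "=", "*", "id", "$"] is True, while start ["id", "=", "$"] is False.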

Incomplete match with AND patterns

I've defined an expression tree structure in F# as follows:
type Num = int
type Name = string
type Expr =
    | Con of Num
    | Var of Name
    | Add of Expr * Expr
    | Sub of Expr * Expr
    | Mult of Expr * Expr
    | Div of Expr * Expr
    | Pow of Expr * Expr
    | Neg of Expr
I wanted to be able to pretty-print the expression tree so I did the following:
let (|Unary|Binary|Terminal|) expr =
    match expr with
    | Add(x, y) -> Binary(x, y)
    | Sub(x, y) -> Binary(x, y)
    | Mult(x, y) -> Binary(x, y)
    | Div(x, y) -> Binary(x, y)
    | Pow(x, y) -> Binary(x, y)
    | Neg(x) -> Unary(x)
    | Con(x) -> Terminal(box x)
    | Var(x) -> Terminal(box x)
let operator expr =
    match expr with
    | Add(_) -> "+"
    | Sub(_) | Neg(_) -> "-"
    | Mult(_) -> "*"
    | Div(_) -> "/"
    | Pow(_) -> "**"
    | _ -> failwith "There is no operator for the given expression."
let rec format expr =
    match expr with
    | Unary(x) -> sprintf "%s(%s)" (operator expr) (format x)
    | Binary(x, y) -> sprintf "(%s %s %s)" (format x) (operator expr) (format y)
    | Terminal(x) -> string x
However, I don't really like the failwith approach for the operator function since it's not compile-time safe. So I rewrote it as an active pattern:
let (|Operator|_|) expr =
    match expr with
    | Add(_) -> Some "+"
    | Sub(_) | Neg(_) -> Some "-"
    | Mult(_) -> Some "*"
    | Div(_) -> Some "/"
    | Pow(_) -> Some "**"
    | _ -> None
Now I can rewrite my format function beautifully as follows:
let rec format expr =
    match expr with
    | Unary(x) & Operator(op) -> sprintf "%s(%s)" op (format x)
    | Binary(x, y) & Operator(op) -> sprintf "(%s %s %s)" (format x) op (format y)
    | Terminal(x) -> string x
I assumed, since F# is magic, that this would just work. Unfortunately, the compiler then warns me about incomplete pattern matches, because it can't see that anything that matches Unary(x) will also match Operator(op) and anything that matches Binary(x, y) will also match Operator(op). And I consider warnings like that to be as bad as compiler errors.
So my questions are: Is there a specific reason why this doesn't work (like have I left some magical annotation off somewhere or is there something that I'm just not seeing)? Is there a simple workaround I could use to get the type of safety I want? And is there an inherent problem with this type of compile-time checking, or is it something that F# might add in some future release?
If you encode the distinction between ground terms and complex terms in the type system, you can avoid the runtime check and make the pattern matches complete.
type Num = int
type Name = string
type GroundTerm =
    | Con of Num
    | Var of Name
type ComplexTerm =
    | Add of Term * Term
    | Sub of Term * Term
    | Mult of Term * Term
    | Div of Term * Term
    | Pow of Term * Term
    | Neg of Term
and Term =
    | GroundTerm of GroundTerm
    | ComplexTerm of ComplexTerm
let (|Operator|) ct =
    match ct with
    | Add(_) -> "+"
    | Sub(_) | Neg(_) -> "-"
    | Mult(_) -> "*"
    | Div(_) -> "/"
    | Pow(_) -> "**"
let (|Unary|Binary|) ct =
    match ct with
    | Add(x, y) -> Binary(x, y)
    | Sub(x, y) -> Binary(x, y)
    | Mult(x, y) -> Binary(x, y)
    | Div(x, y) -> Binary(x, y)
    | Pow(x, y) -> Binary(x, y)
    | Neg(x) -> Unary(x)
let (|Terminal|) gt =
    match gt with
    | Con x -> Terminal(string x)
    | Var x -> Terminal(string x)
let rec format expr =
    match expr with
    | ComplexTerm ct ->
        match ct with
        | Unary(x) & Operator(op) -> sprintf "%s(%s)" op (format x)
        | Binary(x, y) & Operator(op) -> sprintf "(%s %s %s)" (format x) op (format y)
    | GroundTerm gt ->
        match gt with
        | Terminal(x) -> x
Also, in my opinion, you should avoid boxing if you want to be type-safe. If you really want both cases, make two patterns. Or, as done here, just project to the type you need later on. This way you avoid the boxing and return exactly what you need for printing.
I think you can make operator a normal function rather than an active pattern. operator is just a function that gives you the operator string for an expr, whereas Unary, Binary and Terminal are kinds of expression, so it makes sense to pattern match on them.
let operator expr =
    match expr with
    | Add(_) -> "+"
    | Sub(_) | Neg(_) -> "-"
    | Mult(_) -> "*"
    | Div(_) -> "/"
    | Pow(_) -> "**"
    | Var(_) | Con(_) -> ""
let rec format expr =
    match expr with
    | Unary(x) -> sprintf "%s(%s)" (operator expr) (format x)
    | Binary(x, y) -> sprintf "(%s %s %s)" (format x) (operator expr) (format y)
    | Terminal(x) -> string x
I find the best solution is to restructure your original type definition:
type UnOp = Neg
type BinOp = Add | Sub | Mul | Div | Pow
type Expr =
    | Int of int
    | UnOp of UnOp * Expr
    | BinOp of BinOp * Expr * Expr
All sorts of functions can then be written over the UnOp and BinOp types including selecting operators. You may even want to split BinOp into arithmetic and comparison operators in the future.
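To illustrate why this shape removes the incomplete-match problem, here is the same restructuring sketched in Haskell rather than F#. The helper names unOpSymbol and binOpSymbol are made up for the example. Every match below is total, so no fallback case or runtime failure is needed.

```haskell
data UnOp = Neg
  deriving (Show, Eq)

data BinOp = Add | Sub | Mul | Div | Pow
  deriving (Show, Eq)

-- Constructor names mirror the F# definition above.
data Expr
  = Int Int
  | UnOp UnOp Expr
  | BinOp BinOp Expr Expr
  deriving (Show, Eq)

-- Operator selection is a total function on each operator type.
unOpSymbol :: UnOp -> String
unOpSymbol Neg = "-"

binOpSymbol :: BinOp -> String
binOpSymbol Add = "+"
binOpSymbol Sub = "-"
binOpSymbol Mul = "*"
binOpSymbol Div = "/"
binOpSymbol Pow = "**"

format :: Expr -> String
format (Int n)        = show n
format (UnOp op x)    = unOpSymbol op ++ "(" ++ format x ++ ")"
format (BinOp op x y) = "(" ++ format x ++ " " ++ binOpSymbol op ++ " " ++ format y ++ ")"
```

For example, format (BinOp Add (Int 1) (UnOp Neg (Int 2))) yields "(1 + -(2))".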
For example, I used this approach in the (non-free) article "Language-oriented programming: The Term-level Interpreter" (2008) in the F# Journal.

How to write a recursive descent parser from scratch?

As a purely academic exercise, I'm writing a recursive descent parser from scratch -- without using ANTLR or lex/yacc.
I'm writing a simple function which converts math expressions into their equivalent AST. I have the following:
open System.Text.RegularExpressions

// grammar
type expr =
    | Lit of float
    | Add of expr * expr
    | Mul of expr * expr
    | Div of expr * expr
    | Sub of expr * expr
// tokens
type tokens =
    | Num of float
    | LParen | RParen
    | XPlus | XStar | XMinus | XSlash
let tokenize (input : string) =
    // a verbatim string keeps the backslashes in the regex intact
    Regex.Matches(input.Replace(" ", ""), @"\d+|[+/*\-()]")
    |> Seq.cast<Match>
    |> Seq.map (fun x -> x.Value)
    |> Seq.map (function
        | "+" -> XPlus
        | "-" -> XMinus
        | "/" -> XSlash
        | "*" -> XStar
        | "(" -> LParen
        | ")" -> RParen
        | num -> Num(float num))
    |> Seq.toList
So, tokenize "10 * (4 + 5) - 1" returns the following token stream:
[Num 10.0; XStar; LParen; Num 4.0; XPlus; Num 5.0; RParen; XMinus; Num 1.0]
At this point, I'd like to map the token stream to its AST with respect to operator precedence:
Sub(
    Mul(
        Lit 10.0
        ,Add(Lit 4.0, Lit 5.0)
    )
    ,Lit 1.0
)
However, I'm drawing a blank. I've never written a parser from scratch, and I don't know even in principle how to begin.
How do I convert a token stream to its representative AST?
Do you know about language grammars?
Assuming yes, you have a grammar with rules along the lines
...
addTerm := mulTerm addOp addTerm
         | mulTerm
addOp   := XPlus | XMinus
mulTerm := litOrParen mulOp mulTerm
         | litOrParen
...
which ends up turning into code like (writing code in browser, never compiled)
let rec AddTerm() =
    let mulTerm = MulTerm() // will parse next mul term (error if fails to parse)
    match TryAddOp() with // peek ahead in token stream to try parse
    | None -> mulTerm // next token was not prefix for addOp rule, stop here
    | Some(ao) -> // did parse an addOp
        let rhs = AddTerm() // rest of the addTerm rule, so chains like a + b + c are consumed
        match ao with
        | XPlus -> Add(mulTerm, rhs)
        | XMinus -> Sub(mulTerm, rhs)
        | _ -> failwith "unreachable: TryAddOp only returns XPlus or XMinus"
and TryAddOp() =
    let next = tokens.Peek()
    match next with
    | XPlus | XMinus ->
        tokens.ConsumeNext()
        Some(next)
    | _ -> None
...
Hopefully you see the basic idea. This assumes a global mutable token stream that allows both 'peek at next token' and 'consume next token'.
If I remember from college classes the idea was to build expression trees like:
<program> --> <expression> <op> <expression> | <expression>
<expression> --> (<expression>) | <constant>
<op> --> * | - | + | /
<constant> --> <constant><constant> | [0-9]
Then, once you have constructed your tree completely, you get something like:
       exp
  exp  op  exp
   5   +   and so on
Then you run your completed tree through another program that recursively descends into the tree, calculating expressions until you have an answer. If your parser doesn't understand the tree, you have a syntax error. Hope that helps.
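The tree-walking step described above is short once the AST exists. A minimal sketch in Haskell, assuming an Expr type that mirrors the expr type from the question; the names eval and example are invented here.

```haskell
-- Mirrors the question's expr type; the Haskell spelling is an assumption.
data Expr
  = Lit Double
  | Add Expr Expr
  | Sub Expr Expr
  | Mul Expr Expr
  | Div Expr Expr
  deriving (Show, Eq)

-- Recursively descend into the tree, evaluating the children first.
eval :: Expr -> Double
eval (Lit x)   = x
eval (Add a b) = eval a + eval b
eval (Sub a b) = eval a - eval b
eval (Mul a b) = eval a * eval b
eval (Div a b) = eval a / eval b

-- The AST the question expects for "10 * (4 + 5) - 1".
example :: Expr
example = Sub (Mul (Lit 10) (Add (Lit 4) (Lit 5))) (Lit 1)
```

Here eval example evaluates to 89.0.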
