I am trying to create a parser for a given language
The input language contains conditional statements if ... then ...
else and if ... then, separated by a symbol ; (semicolon). Condition
statements contain identifiers, comparison signs <,>, =, hexadecimal
numbers, sign assignments (:=). Consider the sequence as hexadecimal
numbers digits and symbols a, b, c, d, e, f starting with a digit (for
example, 89, 45ac, 0abc )
I ended up with this version of the grammar. Also removed left recursion and ambiguity
S -> if E then S S' | I := H
S' -> else S S'' | ; S | $
S'' -> ; S | $
I -> p | q
E -> HE'
E' -> >HE' | <HE' | =HE'
H -> DH'
H' -> LH' | DH' | EPS
L -> a | b | c | d | e | f
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Will it be possible to create an LL(1) parser according to this grammar. I am confused by the rules with two non-terminals (for example HE')
Related
I must be missing something fundamental about recursively defined nonterminals giving issue, but all I want to do is to recognize something like a regular expression, where a series of numbers followed by a series of letters.
from nltk import CFG
import nltk
grammar = CFG.fromstring("""
S -> N L
N -> N | '1' | '2' | '3'
L -> L | 'A' | 'B' | 'C'
""")
from nltk.parse import BottomUpChartParser
parser = nltk.ChartParser(grammar)
sentence = '1 2 1 3 A C B C'.split()
for t in parser.parse(sentence):
print(t)
the above code returns an empty parse and nothing is printed
The grammar would need to be something like:
from nltk import CFG, ChartParser
grammar = CFG.fromstring("""
S -> N L
N -> N N | '1' | '2' | '3'
L -> L L | 'A' | 'B' | 'C'
""")
parser = ChartParser(grammar)
sentence = '1 2 1 3 A C B C'.split()
for t in parser.parse(sentence):
print(t)
break
<script src="https://cdn.jsdelivr.net/gh/pysnippet/pysnippet#latest/snippet.min.js"></script>
Using N -> N N as an example: the first N could be "eaten up" and transformed into a 1 when parsing the sentence, leaving the next N to go on and produce another N -> N N.
But this will result in a lot of possible parses, for something more efficient you probably want something like this:
from nltk import CFG, ChartParser
grammar = CFG.fromstring("""
S -> N L
N -> '1' N | '2' N | '3' N | '1' | '2' | '3'
L -> 'A' L | 'B' L | 'C' L | 'A' | 'B' | 'C'
""")
parser = ChartParser(grammar)
sentence = '1 2 1 3 A C B C'.split()
for t in parser.parse(sentence):
print(t)
break
<script src="https://cdn.jsdelivr.net/gh/pysnippet/pysnippet#latest/snippet.min.js"></script>
Regular Version. The language from the question: "one or more numbers followed by one or more letters" or (1,2,3)+(A,B,C)+ is a regular language, so we can represent it with a regular grammar:
from nltk import CFG, ChartParser
grammar = CFG.fromstring("""
S -> N
N -> '1' N | '2' N | '3' N | '1' | '2' | '3' | L
L -> 'A' L | 'B' L | 'C' L | 'A' | 'B' | 'C'
""")
parser = ChartParser(grammar)
sentence = '1 2 1 3 A C B C'.split()
for t in parser.parse(sentence):
print(t)
break
<script src="https://cdn.jsdelivr.net/gh/pysnippet/pysnippet#latest/snippet.min.js"></script>
Try all three out and see what the parses look like on different inputs!
How to remove ambiguity in following grammar?
E -> E * F | F + E | F
F -> F - F | id
First, we need to find the ambiguity.
Consider the rules for E without F; change F to f and consider it a terminal symbol. Then the grammar
E -> E * f
E -> f + E
E -> f
is ambiguous. Consider f + f * f:
E E
| |
+-------+--+ +-+-+
| | | | | |
E * f f + E
+-+-+ |
| | | +-+-+
f + E E * f
| |
f f
We can resolve this ambiguity by forcing * or + to take precedence. Typically, * takes precedence in the order of operations, but this is totally arbitrary.
E -> f + E | A
A -> A * f | f
Now, the string f + f * f has just one parsing:
E
|
+-+-+
| | |
f + E
|
A
|
+-+-+
A * f
|
f
Now, consider our original grammar which uses F instead of f:
E -> F + E | A
A -> A * F | F
F -> F - F | id
Is this ambiguous? It is. Consider the string id - id - id.
E E
| |
A A
| |
F F
| |
+-----+----+----+ +----+----+----+
| | | | | |
F - F F - F
| | | |
+-+-+ id id +-+-+
F - F F - F
| | | |
id id id id
The ambiguity here is that - can be left-associative or right-associative. We can choose the same convention as for +:
E -> F + E | A
A -> A * F | F
F -> id - F | id
Now, we have only one parsing:
E
|
A
|
F
|
+----+----+----+
| | |
id - F
|
+--+-+
| | |
id - F
|
id
Now, is this grammar ambiguous? It is not.
s will have #(+) +s in it, and we always need to use production E -> F + E exactly #(+) times and then production E -> A once.
s will have #(*) *s in it, and we always need to use production A -> A * F exactly #(*) times and then production E -> F once.
s will have #(-) -s in it, and we always need to use production F -> id - F exactly #(-) times and the production F -> id once.
That s has exactly #(+) +s, #(*) *s and #(-) -s can be taken for granted (the numbers can be zero if not present in s). That E -> A, A -> F and F -> id have to be used exactly once can be shown as follows:
If E -> A is never used, any string derived will still have E, a nonterminal, in it, and so will not be a string in the language (nothing is generated without taking E -> A at least once). Also, every string that can be generated before using E -> A has at most one E in it (you start with one E, and the only other production keeps one E) so it is never possible to use E -> A more than once. So E -> A is used exactly once for all derived strings. The demonstration works the same way for A -> F and F -> id.
That E -> F + E, A -> A * F and F -> id - F are used exactly #(+), #(*) and #(-) times, respectively, is apparent from the fact that these are the only productions that introduce their respective symbols and each introduces one instance.
If you consider the sub-grammars of our resulting grammars, we can prove they are unambiguous as follows:
F -> id - F | id
This is an unambiguous grammar for (id - )*id. The only derivation of (id - )^kid is to use F -> id - F k times and then use F -> id exactly once.
A -> A * F | F
We have already seen that F is unambiguous for the language it recognizes. By the same argument, this is an unambiguous grammar for the language F( * F)*. The derivation of F( * F)^k will require the use of A -> A * F exactly k times and then the use of A -> F. Because the language generated from F is unambiguous and because the language for A unambiguously separates instances of F using *, a symbol not generated by F, the grammar
A -> A * F | F
F -> id - F | id
Is also unambiguous. To complete the argument, apply the same logic to the grammar generating (F + )*A from the start symbol E.
To remove an ambiguity means that you must choose one of all possible ambiguities. This grammar is as simple as it can be, for a mathematical expression.
To make the multiplication with a higher priority than the addition and the subtraction (where the last two have the same priority, but are traditionally computed from left to right) you do that (in ABNF like syntax):
expression = addition
addition = multiplication *(("+" / "-") multiplication)
multiplication = identifier *("*" identifier)
identifier = 'a'-'z'
The idea is as follows:
first create your lowest grammar rule: the identifier
continue with the highest priority operation, in your case multiplication: *
create a rule that has this on its right hand side: X *(P X), where X is the previous rule you have created, and P is your operation sign.
if you have more than one operation with the same priority they must be in a group: (P1 / P2 / ...)
continue to do the last two operations until there are no more operations to add.
add your main rule that uses the latest one.
Then for input like: a+b+c*d+e you get this tree:
More advanced tools will get you a tree that has more than two nodes. That means that all multiplications in one addition will be in a list that you can iterate from any direction.
This grammar is easy to upgrade, and to add parentheses you can do that:
expression = addition
addition = multiplication *(("+" / "-") multiplication)
multiplication = primary *("*" primary)
primary = identifier / "(" expression ")"
identifier = 'a'-'z'
Then for input (a+b)*c you will get this tree:
If you want to add a division, you can modify the multiplication rule like that:
multiplication = primary *(("*" / "/") primary)
These are all detailed trees, there are trees with less details as well, often called abstract syntax trees.
I need to know how to make a grammar for expressions to create a parser and ast. I have four levels of precedence:
1. ** !
2. * / % &
3. + - ^ |
4. <= >= < >
I have made this:
Exps -> RExp4
RExp4 -> OpEq Exps | RExp3
RExp3 -> OpAd Exps | RExp2
RExp2 -> OpMul Exps | RExp1
RExp1 -> OpExp Exps | Exp
Exp -> Val | '('Exps')'
OpExp -> ** | !
OpMul -> * | / | % | &
OpAd -> + | - | ^
OpEq -> <= | >= | < | >
Val -> Id | Int
I'm not sure if this will work because when I make the tree in a paper I do not get the correct form for expressions like:
x*7+7
Basically because I make the sum first. My gramma have to be LL(1) and recursive by the right because my compiler will be top-down.
Thanks for help and sorry for my english
Edit
Sorry, I was wrong, I mean parser top-down.The problem I see on paper is the next. I have the next expression "7*5+5" for example and the order that my BNF follow is the next:
Take 7 with Exp in Exps, that follow to Val
Go to RExp and continue to RExp3.
Take '*' and then return to Exps.
The tree that I see on paper is the next:
*
/ \
3 +
/ \
7 5
The tree that I should have is the next:
+
/ \
* 5
/ \
7 5
Lets say I have this grammar
E -> T+Ex | F
T -> T*Fy | w
F -> E | z | ε
Now I need to make it LL(1). I've been following the steps but the solution I came up with doesn't seem quite right.
Fist lets eliminate ε-productions
E -> T+Ex | F | T+x
T -> T*Fy | w | T*y
F -> E | z
Now we'll eliminate cycles
E -> T+Ex | T+x | z
T -> T*Fy | w | T*y
F -> T+Ex | T+x | z
No we'll eliminate immediate left recursion
E -> T+Ex | T+x | z
T -> wT'
T' -> *FyT' | *yT' | ε
F -> T+Ex | T+x | z
Finally we'll replace certain RHS productions where T occurred
E -> wT'+Ex | wT'+x | z
T -> wT'
T' -> *FyT' | *yT' | ε
F -> wT'+Ex | wT'+x | z
Now this doesn't seem to be LL(1) to me since the parse table generated by this will have multiple entries for several of the terminals. What do I seem to be missing?
I figured out how to do it, so out last step we have this configuration
E -> wT'+Ex | wT'+x | z
T -> wT'
T' -> *FyT' | *yT' | ε
F -> wT'+Ex | wT'+x | z
We now need to perform left-factorization to remove productions of the form
S -> aB | aC
And make them into the proper form
S -> aA
A -> B | C
Using this we can transform our grammar into
E -> wT'+E' | z
E' -> Ex | x
T -> wT'
T' -> *T'' | ε
T'' -> FyT' | yT'
F -> wT'+F' | z
F' -> Ex | x
Since F and F' are the same as E and E' we can just remove this production so we are left with.
E -> wT'+E' | z
E' -> Ex | x
T -> wT'
T' -> *T'' | ε
T'' -> EyT' | yT'
I've written a typical evaluator for simple math expressions (arithmetic with some custom functions) in F#. While it seems to be working correctly, some expressions don't evaluate as expected, for example, these work fine:
eval "5+2" --> 7
eval "sqrt(25)^2" --> 25
eval "1/(sqrt(4))" --> 0.5
eval "1/(2^2+2)" --> 1/6 ~ 0.1666...
but these don't:
eval "1/(sqrt(4)+2)" --> evaluates to 1/sqrt(6) ~ 0.408...
eval "1/(sqrt 4 + 2)" --> will also evaluate to 1/sqrt(6)
eval "1/(-1+3)" --> evaluates to 1/(-4) ~ -0.25
the code works as follows, tokenization (string as input) -> to rev-polish-notation (RPN) -> evalRpn
I thought that the problem seems to occur somewhere with the unary functions (functions accepting one operator), these are the sqrt function and the negation (-) function. I don't really see what's going wrong in my code. Can someone maybe point out what I am missing here?
this is my implementation in F#
open System.Collections
open System.Collections.Generic
open System.Text.RegularExpressions
type Token =
| Num of float
| Plus
| Minus
| Star
| Hat
| Sqrt
| Slash
| Negative
| RParen
| LParen
let hasAny (list: Stack<'T>) =
list.Count <> 0
let tokenize (input:string) =
let tokens = new Stack<Token>()
let push tok = tokens.Push tok
let regex = new Regex(#"[0-9]+(\.+\d*)?|\+|\-|\*|\/|\^|\)|\(|pi|e|sqrt")
for x in regex.Matches(input.ToLower()) do
match x.Value with
| "+" -> push Plus
| "*" -> push Star
| "/" -> push Slash
| ")" -> push LParen
| "(" -> push RParen
| "^" -> push Hat
| "sqrt" -> push Sqrt
| "pi" -> push (Num System.Math.PI)
| "e" -> push (Num System.Math.E)
| "-" ->
if tokens |> hasAny then
match tokens.Peek() with
| LParen -> push Minus
| Num v -> push Minus
| _ -> push Negative
else
push Negative
| value -> push (Num (float value))
tokens.ToArray() |> Array.rev |> Array.toList
let isUnary = function
| Negative | Sqrt -> true
| _ -> false
let prec = function
| Hat -> 3
| Star | Slash -> 2
| Plus | Minus -> 1
| _ -> 0
let toRPN src =
let output = new ResizeArray<Token>()
let stack = new Stack<Token>()
let rec loop = function
| Num v::tokens ->
output.Add(Num v)
loop tokens
| RParen::tokens ->
stack.Push RParen
loop tokens
| LParen::tokens ->
while stack.Peek() <> RParen do
output.Add(stack.Pop())
stack.Pop() |> ignore // pop the "("
loop tokens
| op::tokens when op |> isUnary ->
stack.Push op
loop tokens
| op::tokens ->
if stack |> hasAny then
if prec(stack.Peek()) >= prec op then
output.Add(stack.Pop())
stack.Push op
loop tokens
| [] ->
output.AddRange(stack.ToArray())
output
(loop src).ToArray()
let (#) op tok =
match tok with
| Num v ->
match op with
| Sqrt -> Num (sqrt v)
| Negative -> Num (v * -1.0)
| _ -> failwith "input error"
| _ -> failwith "input error"
let (##) op toks =
match toks with
| Num v,Num u ->
match op with
| Plus -> Num(v + u)
| Minus -> Num(v - u)
| Star -> Num(v * u)
| Slash -> Num(u / v)
| Hat -> Num(u ** v)
| _ -> failwith "input error"
| _ -> failwith "inpur error"
let evalRPN src =
let stack = new Stack<Token>()
let rec loop = function
| Num v::tokens ->
stack.Push(Num v)
loop tokens
| op::tokens when op |> isUnary ->
let result = op # stack.Pop()
stack.Push result
loop tokens
| op::tokens ->
let result = op ## (stack.Pop(),stack.Pop())
stack.Push result
loop tokens
| [] -> stack
if loop src |> hasAny then
match stack.Pop() with
| Num v -> v
| _ -> failwith "input error"
else failwith "input error"
let eval input =
input |> (tokenize >> toRPN >> Array.toList >> evalRPN)
Before answering your specific question, did you notice you have another bug? Try eval "2-4" you get 2.0 instead of -2.0.
That's probably because along these lines:
match op with
| Plus -> Num(v + u)
| Minus -> Num(v - u)
| Star -> Num(v * u)
| Slash -> Num(u / v)
| Hat -> Num(u ** v)
u and v are swapped, in commutative operations you don't notice the difference, so just revert them to u -v.
Now regarding the bug you mentioned, the cause seems obvious to me, by looking at your code you missed the precedence of those unary operations:
let prec = function
| Hat -> 3
| Star | Slash -> 2
| Plus | Minus -> 1
| _ -> 0
I tried adding them this way:
let prec = function
| Negative -> 5
| Sqrt -> 4
| Hat -> 3
| Star | Slash -> 2
| Plus | Minus -> 1
| _ -> 0
And now it seems to be fine.
Edit: meh, seems I was late, Gustavo posted the answer while I was wondering about the parentheses. Oh well.
Unary operators have the wrong precedence. Add the primary case | a when isUnary a -> 4 to prec.
The names of LParen and RParen are consistently swapped throughout the code. ( maps to RParen and ) to LParen!
It runs all tests from the question properly for me, given the appropriate precedence, but I haven't checked the code for correctness.