I'm having trouble calculating the lookaheads when building LR(1) item sets. I have tried lecture notes from several sites, but I'm still stuck.
My example is
S -> E + S | E
E -> num | ( S )
The item set is
I0:
S’ -> . S $
S -> . E + S $
S -> . E $
E -> . num +,$
E -> . ( S ) +,$
I1:
S ->E .+ S $
S ->E . $
The first item in set I0
S’ -> . S $
is initialization.
The second item in set I0
S -> . E + S $
means there is nothing on stack, we expect to read E+S, then reduce iff the token after E+S is $.
The third item in set I0
S -> . E $
means that we expect to read E and reduce iff the token after E is $.
Then I am confused about the fourth item in set I0,
E -> . num +,$
I have no idea why the + and $ tokens are there.
Could anyone explain the following rule in plain English, please?
For each configuration [A –> u•Bv, a] in I, for each production B –> w in G', and for
each terminal b in First(va) such that [B –> •w, b] is not in I: add [B –> •w, b] to I.
Thanks!!!
I think I figured it out.
I am using this algorithm:
for set I0:
Begin with [S' -> .S, $]
Match [A -> α.Bβ, a]
Then add in [B -> .γ, b]
where b is any terminal in FIRST(βa)
for set I1...In
Compute GOTO(I0,X)
Add in X productions and LOOKAHEAD token
In the example
S -> E + S
S -> E
E -> num
E -> ( S )
Firstly,
S’ -> . S $
we try to match it to [A -> α.Bβ, a], That is
A =S', α = ε, B = S , β = ε , a = $ and
FIRST(βa) = {$}
Add in [B -> .γ, b], which are
S -> . E + S $ ...1
S -> . E $ ...2
in I0.
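Then we need to add the productions for E, because E follows the dot in items 1 and 2.
In this case, the items matching [A -> α.Bβ, a] are 1 and 2. For item 1, β = + S and a = $, so FIRST(βa) = { + }. For item 2, β = ε and a = $, so FIRST(βa) = { $ }.
Thus the lookaheads for E are { + , $ }, and we have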
E -> . num +,$
E -> . ( S ) +,$
Now, we compute GOTO(I0, X)
For X = E
we move the dot one position to the right. The dot now sits before the terminal + or at the end of the production, so no new productions need to be added. We just carry over the lookahead $ from
S -> . E + S $
S -> . E $
which gives us I1
S ->E .+ S $
S ->E . $
and so on...
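The closure and GOTO steps walked through above can be sketched in Python. This is only a minimal illustration, assuming the grammar has no epsilon productions (so FIRST(βa) is just FIRST of the first symbol of βa). The item representation `(lhs, rhs, dot, lookahead)` and the names `first`, `closure` and `goto` are my own choices for this sketch.

```python
# Grammar from the question, augmented with S' -> S.
GRAMMAR = {
    "S'": [("S",)],
    "S": [("E", "+", "S"), ("E",)],
    "E": [("num",), ("(", "S", ")")],
}

def first(sym, seen=frozenset()):
    """FIRST of one symbol; valid here because no rule derives epsilon."""
    if sym not in GRAMMAR:          # terminal, including the end marker "$"
        return {sym}
    if sym in seen:                 # guard against left recursion
        return set()
    out = set()
    for rhs in GRAMMAR[sym]:
        out |= first(rhs[0], seen | {sym})
    return out

def closure(items):
    """Items are tuples (lhs, rhs, dot, lookahead)."""
    items = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot, a = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            B = rhs[dot]
            beta_a = rhs[dot + 1:] + (a,)      # the string βa
            for b in first(beta_a[0]):
                for gamma in GRAMMAR[B]:
                    item = (B, gamma, 0, b)
                    if item not in items:
                        items.add(item)
                        work.append(item)
    return frozenset(items)

def goto(items, X):
    """Move the dot over X, keep each lookahead, then take the closure."""
    moved = {(l, r, d + 1, a) for (l, r, d, a) in items
             if d < len(r) and r[d] == X}
    return closure(moved)

I0 = closure({("S'", ("S",), 0, "$")})
I1 = goto(I0, "E")

for lhs, rhs, dot, a in sorted(I0):
    print(lhs, "->", " ".join(rhs[:dot] + (".",) + rhs[dot:]), " ", a)
```

In this sketch each item carries a single lookahead, so `E -> . num +,$` shows up as two items, one with `+` and one with `$`.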
So, is this the correct and efficient way to build LR(1) item sets?
For
E -> . num +,$
E -> . ( S ) +,$
the +,$ indicate that, in this state, only these tokens can follow the num or the closing parenthesis being parsed. Think about it: the grammar does not allow adjacent nums or ()s, so at the top level an E must either be at the end of the sentence or be followed by a +. (Inside parentheses an E can also be followed by a ), but those items belong to other item sets with their own lookaheads.)
As for the translation request, the quoted rule is a fancy way of saying how to calculate the set of tokens that can follow the nonterminal after the dot. The +,$ above are an example: in I0 they are the only legal tokens that can follow num and ).
So I have this grammar I'm trying to build an LR(1) table for
E' -> E
E -> E + E
E -> E * E
E -> ( E )
E -> a
So far, this is my table
I'm trying to resolve the conflicts here. I thought about changing the grammar to postfix instead of infix, but I'm not sure I can do that. Any ideas?
Here is your grammar, with precedence:
E' -> E
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( E )
F -> a
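Don't forget the extra E -> T and T -> F: without them, no derivation could ever terminate and the grammar would be useless.
Note: This will not work with LR(0), because you'll get a conflict: the state containing E -> T . and T -> T . * F needs a lookahead token to choose between reducing and shifting.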
How to remove ambiguity in following grammar?
E -> E * F | F + E | F
F -> F - F | id
First, we need to find the ambiguity.
Consider the rules for E without F; change F to f and consider it a terminal symbol. Then the grammar
E -> E * f
E -> f + E
E -> f
is ambiguous. Consider f + f * f:
        E                    E
        |                    |
    +---+---+            +---+---+
    |   |   |            |   |   |
    E   *   f            f   +   E
    |                            |
  +-+-+                        +-+-+
  | | |                        | | |
  f + E                        E * f
      |                        |
      f                        f
We can resolve this ambiguity by forcing * or + to take precedence. Typically, * takes precedence in the order of operations, but this is totally arbitrary.
E -> f + E | A
A -> A * f | f
Now, the string f + f * f has just one parsing:
  E
  |
+-+-+
| | |
f + E
    |
    A
    |
  +-+-+
  | | |
  A * f
  |
  f
Now, consider our original grammar which uses F instead of f:
E -> F + E | A
A -> A * F | F
F -> F - F | id
Is this ambiguous? It is. Consider the string id - id - id.
          E                          E
          |                          |
          A                          A
          |                          |
          F                          F
          |                          |
     +----+----+                +----+----+
     |    |    |                |    |    |
     F    -    F                F    -    F
     |         |                |         |
   +-+-+       id               id      +-+-+
   | | |                                | | |
   F - F                                F - F
   |   |                                |   |
   id  id                               id  id
The ambiguity here is that - can be left-associative or right-associative. We can choose the same convention as for +:
E -> F + E | A
A -> A * F | F
F -> id - F | id
Now, we have only one parsing:
E
|
A
|
F
|
+----+----+
|    |    |
id   -    F
          |
       +--+--+
       |  |  |
       id -  F
             |
             id
Now, is this grammar ambiguous? It is not. Let s be any string in the language.
s will have #(+) +s in it, and we always need to use production E -> F + E exactly #(+) times and then production E -> A once.
s will have #(*) *s in it, and we always need to use production A -> A * F exactly #(*) times and then production A -> F once.
s will have #(-) -s in it, and we always need to use production F -> id - F exactly #(-) times and then production F -> id once.
That s has exactly #(+) +s, #(*) *s and #(-) -s can be taken for granted (the numbers can be zero if not present in s). That E -> A, A -> F and F -> id have to be used exactly once can be shown as follows:
If E -> A is never used, any string derived will still have E, a nonterminal, in it, and so will not be a string in the language (nothing is generated without taking E -> A at least once). Also, every string that can be generated before using E -> A has at most one E in it (you start with one E, and the only other production keeps one E) so it is never possible to use E -> A more than once. So E -> A is used exactly once for all derived strings. The demonstration works the same way for A -> F and F -> id.
That E -> F + E, A -> A * F and F -> id - F are used exactly #(+), #(*) and #(-) times, respectively, is apparent from the fact that these are the only productions that introduce their respective symbols and each introduces one instance.
If you consider the sub-grammars of our resulting grammar, we can prove they are unambiguous as follows:
F -> id - F | id
This is an unambiguous grammar for (id - )*id. The only derivation of (id - )^k id is to use F -> id - F k times and then use F -> id exactly once.
A -> A * F | F
We have already seen that F is unambiguous for the language it recognizes. By the same argument, this is an unambiguous grammar for the language F( * F)*. The derivation of F( * F)^k will require the use of A -> A * F exactly k times and then the use of A -> F. Because the language generated from F is unambiguous and because the language for A unambiguously separates instances of F using *, a symbol not generated by F, the grammar
A -> A * F | F
F -> id - F | id
is also unambiguous. To complete the argument, apply the same logic to the grammar generating (F + )*A from the start symbol E.
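The parse counts claimed above can be checked mechanically on the example strings. The sketch below counts parse trees by trying every way to split a token span across the symbols of a production. It assumes a grammar with no epsilon productions and no cyclic unit rules (true for every grammar in this answer); the function name `count_parses` and the dictionary encoding are made up for this illustration. It only confirms the counts for the strings tried and is not a proof of unambiguity.

```python
from functools import lru_cache

def count_parses(grammar, start, tokens):
    """Number of distinct parse trees deriving `tokens` from `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def sym(symbol, i, j):
        # Trees for `symbol` covering tokens[i:j].
        if symbol not in grammar:                      # terminal
            return 1 if j == i + 1 and tokens[i] == symbol else 0
        return sum(seq(tuple(rhs), i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def seq(rhs, i, j):
        # Ways for the symbol sequence `rhs` to cover tokens[i:j].
        if not rhs:
            return 1 if i == j else 0
        head, rest = rhs[0], rhs[1:]
        if not rest:
            return sym(head, i, j)
        # Every symbol covers at least one token (no epsilon rules).
        return sum(sym(head, i, m) * seq(rest, m, j)
                   for m in range(i + 1, j - len(rest) + 1))

    return sym(start, 0, len(tokens))

with_f = {"E": [("E", "*", "f"), ("f", "+", "E"), ("f",)]}
layered_f = {"E": [("f", "+", "E"), ("A",)],
             "A": [("A", "*", "f"), ("f",)]}
with_F = {"E": [("F", "+", "E"), ("A",)],
          "A": [("A", "*", "F"), ("F",)],
          "F": [("F", "-", "F"), ("id",)]}
final = {"E": [("F", "+", "E"), ("A",)],
         "A": [("A", "*", "F"), ("F",)],
         "F": [("id", "-", "F"), ("id",)]}

print(count_parses(with_f, "E", "f + f * f".split()))       # 2
print(count_parses(layered_f, "E", "f + f * f".split()))    # 1
print(count_parses(with_F, "E", "id - id - id".split()))    # 2
print(count_parses(final, "E", "id - id - id".split()))     # 1
```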
To remove an ambiguity, you must choose one of the possible interpretations. The grammar below is about as simple as a grammar for mathematical expressions can be.
To give multiplication a higher priority than addition and subtraction (the last two have the same priority and are traditionally computed from left to right), you can write (in ABNF-like syntax):
expression = addition
addition = multiplication *(("+" / "-") multiplication)
multiplication = identifier *("*" identifier)
identifier = 'a'-'z'
The idea is as follows:
first create your lowest grammar rule: the identifier
continue with the highest priority operation, in your case multiplication: *
create a rule that has this on its right hand side: X *(P X), where X is the previous rule you have created, and P is your operation sign.
if you have more than one operation with the same priority they must be in a group: (P1 / P2 / ...)
continue to do the last two operations until there are no more operations to add.
add your main rule that uses the latest one.
Then for input like: a+b+c*d+e you get this tree:
More advanced tools will give you a tree whose nodes can have more than two children. That means that all multiplications in one addition will be in a list that you can iterate from any direction.
This grammar is easy to extend. To add parentheses, you can do this:
expression = addition
addition = multiplication *(("+" / "-") multiplication)
multiplication = primary *("*" primary)
primary = identifier / "(" expression ")"
identifier = 'a'-'z'
Then for input (a+b)*c you will get this tree:
If you want to add division, you can modify the multiplication rule like this:
multiplication = primary *(("*" / "/") primary)
These are all detailed trees. There are also trees with less detail, often called abstract syntax trees.
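The layered rules translate almost directly into a recursive-descent parser with one function per rule. The sketch below follows the parenthesized version of the grammar. It assumes single-character tokens, and the nested-tuple tree shape `(operator, left, right)` is my own choice for illustration, not the output format of any particular tool.

```python
def parse(text):
    """Parse an expression into nested tuples (op, left, right)."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    # primary = identifier / "(" expression ")"
    def primary():
        tok = peek()
        if tok == "(":
            take()
            node = addition()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            take()
            return node
        if tok is not None and "a" <= tok <= "z":
            return take()
        raise SyntaxError(f"unexpected token {tok!r}")

    # multiplication = primary *("*" primary)
    def multiplication():
        node = primary()
        while peek() == "*":
            op = take()
            node = (op, node, primary())
        return node

    # addition = multiplication *(("+" / "-") multiplication)
    def addition():
        node = multiplication()
        while peek() in ("+", "-"):
            op = take()
            node = (op, node, multiplication())   # left to right
        return node

    tree = addition()                             # expression = addition
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

print(parse("a+b+c*d+e"))
print(parse("(a+b)*c"))
```

Because each loop folds the new operand into the tree built so far, operators of equal priority group from left to right, and `*` binds tighter simply because `multiplication` is called from inside `addition`.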
In my current compilers course, I've understood how to find the first and follow sets of a grammar, and so far all of the grammars I have dealt with have contained epsilon. Now I am being asked to find the first and follow sets of a grammar without epsilon, and to determine whether it is LR(0) and SLR. Not having epsilon has thrown me off, so I don't know if I've done it correctly. I would appreciate any comments on whether I am on the right track with the first and follow sets, and how to begin determining if it is LR(0)
Consider the following grammar describing Lisp arithmetic:
S -> E // S is start symbol, E is expression
E -> (FL) // F is math function, L is a list
L -> LI | I // I is an item in a list
I -> n | E // an item is a number n or an expression E
F -> + | - | *
FIRST:
FIRST(S)= FIRST(E) = {(}
FIRST(L)= FIRST(I) = {n,(}
FIRST(F) = {+, -, *}
FOLLOW:
FOLLOW(S) = {$}
FOLLOW(E) = FOLLOW(L) = {), n, $}
FOLLOW(I) = {),$}
FOLLOW(F) = {),$}
The FIRST sets are right, but the FOLLOW sets are incorrect.
The FOLLOW(S) = {$} is right, though technically this is for the augmented grammar S' -> S$ .
E appears at the end of the right side of S -> E and I -> E, so everything in FOLLOW(S) and FOLLOW(I) is also in FOLLOW(E): FOLLOW(E) = FOLLOW(S) ∪ FOLLOW(I) .
L appears on the right hand side of L -> LI, which gives FOLLOW(L) ⊇ FIRST(I) , and E -> (FL), which gives FOLLOW(L) ⊇ {)} .
I appears on the right side of L -> LI | I , which gives FOLLOW(I) = FOLLOW(L) .
F appears on the right side in E -> (FL) , which gives FOLLOW(F) = FIRST(L)
Solving for these gives:
FOLLOW(F) = {n, (}
FOLLOW(L) = FIRST(I) ∪ {)} = {n, (, )}
FOLLOW(I) = {n, (, )}
FOLLOW(E) = {$} ∪ {n, (, )} = {n, (, ), $}
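These constraints can also be solved by iterating until nothing changes. The sketch below does that for this grammar. It assumes there are no epsilon productions (as is the case here), so FIRST of a right-hand side is just FIRST of its first symbol; the function names and the dictionary encoding of the grammar are my own.

```python
GRAMMAR = {
    "S": [["E"]],
    "E": [["(", "F", "L", ")"]],
    "L": [["L", "I"], ["I"]],
    "I": [["n"], ["E"]],
    "F": [["+"], ["-"], ["*"]],
}

def first_sets(grammar):
    """FIRST sets for a grammar without epsilon productions."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                head = rhs[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def follow_sets(grammar, start):
    """FOLLOW sets, with "$" as the end marker after the start symbol."""
    first = first_sets(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for lhs, rules in grammar.items():
            for rhs in rules:
                for k, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue                  # terminals have no FOLLOW
                    if k + 1 < len(rhs):          # something follows sym
                        nxt = rhs[k + 1]
                        new = first[nxt] if nxt in grammar else {nxt}
                    else:                         # sym ends the rule
                        new = follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

FOLLOW = follow_sets(GRAMMAR, "S")
for nt in GRAMMAR:
    print(nt, sorted(FOLLOW[nt]))
```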
I'm given the following grammar :
S -> A a A b | B b B a
A -> epsilon
B -> epsilon
I know it's obviously LL(1), but I'm having trouble constructing the parsing table. I followed the algorithm word by word to find the FIRST and FOLLOW sets of each nonterminal; correct me if I'm wrong:
First(S) = {a,b}
First(A) = First(B) = epsilon
Follow(S) = {$}
Follow(A) = {a,b}
Follow(B) = {a,b}
When I construct the parsing table according to the algorithm, I get a conflict under the $ symbol. What am I doing wrong?
a b $
A A-> epsilon
B B-> epsilon
S S -> AaAb
S -> BbBa
Is it OK if I get 2 productions under $? Or am I constructing the parsing table wrong? Please help, I'm new to the compiler course.
There is a tiny mistake. The algorithm, from the Dragon Book, is as follows:
for each rule (S -> A):
    for each terminal a in First(A):
        add (S -> A) to M[S, a]
    if First(A) contains empty:
        for each terminal b in Follow(S):
            add (S -> A) to M[S, b]
Let's take them one by one.
S -> AaAb. Here, First(AaAb) = {a}. So add S -> AaAb to M[S, a].
S -> BbBa. Here, First(BbBa) = {b}. So add S -> BbBa to M[S, b].
A -> epsilon. Here, Follow(A) = {a, b}. So add A -> epsilon to M[A, a] and M[A, b].
B -> epsilon. Here, Follow(B) = {a, b}. So add B -> epsilon to M[B, a] and M[B, b].
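The same construction can be sketched in Python. The FIRST and FOLLOW sets are copied from the question rather than computed, and the names `first_of_string` and `build_table` are made up for this illustration.

```python
EPS = "epsilon"

# Rules of the grammar; an empty tuple stands for an epsilon right-hand side.
RULES = [
    ("S", ("A", "a", "A", "b")),
    ("S", ("B", "b", "B", "a")),
    ("A", ()),
    ("B", ()),
]

# Sets as given in the question.
FIRST = {"S": {"a", "b"}, "A": {EPS}, "B": {EPS}}
FOLLOW = {"S": {"$"}, "A": {"a", "b"}, "B": {"a", "b"}}

def first_of_string(symbols):
    """FIRST of a whole right-hand side, skipping over nullable symbols."""
    out = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})   # a terminal's FIRST is itself
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)                            # every symbol was nullable
    return out

def build_table(rules):
    """Map (nonterminal, terminal) to the list of applicable right-hand sides."""
    table = {}
    for lhs, rhs in rules:
        f = first_of_string(rhs)
        targets = f - {EPS}
        if EPS in f:                        # rule can vanish: use FOLLOW(lhs)
            targets |= FOLLOW[lhs]
        for terminal in targets:
            table.setdefault((lhs, terminal), []).append(rhs)
    return table

TABLE = build_table(RULES)
for (nonterminal, terminal), rhss in sorted(TABLE.items()):
    print(nonterminal, terminal, rhss)
```

Every cell ends up with exactly one production, and nothing lands in the $ column, because FIRST is taken over the whole right-hand side AaAb or BbBa rather than over S.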
For this project I'm parsing in two stages. The first stage handles include/ifdef/define directives and chunks the input up into [Span] items which define their start/end points in the original inputs along with the body text. This stream is then parsed by the second stage into my AST for subsequent processing.
Each element of the AST carries its source position, and any semantic error caught after parsing prints the correct error position regardless of include depth. This part is crucial since it comes after the stage that has the problem.
The problem is that, given a parse error in the second stage from an included file, it reports a bogus error with a location at the top-level rule in the input. A parse error in the initial file works fine. The presence of any directives will divide even the initial file into multiple chunks, so it's not a 'single chunk' vs. 'multiple chunks' issue.
Given that the AST is getting the locations correct, I'm stumped as to how Megaparsec is reporting bad info when parse errors are encountered.
I've included my Stream instance and (set|get)(Position|Input) code since these seem like the relevant bits. I feel like there must be some bit of Megaparsec housekeeping that I'm not doing, or that my Stream instance is invalid for some reason.
data Span = Span
  { spanStart :: SourcePos
  , spanEnd   :: SourcePos
  , spanBody  :: T.Text
  } deriving (Eq, Ord, Show)

instance Stream [Span] where
  type Token [Span] = Span
  type Tokens [Span] = [Span]
  tokenToChunk Proxy = pure
  tokensToChunk Proxy = id
  chunkToTokens Proxy = id
  chunkLength Proxy = foldl1 (+) . map (T.length . spanBody)
  chunkEmpty Proxy = all ((== 0) . T.length . spanBody)
  positionAt1 Proxy pos (Span start _ _) = trace ("pos1" ++ show start) start
  positionAtN Proxy pos [] = pos
  positionAtN Proxy _ (Span start _ _:_) = trace ("posN" ++ show start) start
  advance1 Proxy _ _ (Span _ end _) = end
  advanceN Proxy _ pos [] = pos
  advanceN Proxy _ _ ts = let Span _ end _ = last ts in end
  take1_ [] = Nothing
  take1_ s = case takeN_ 1 s of
    Nothing -> Nothing
    Just (sp, s') -> Just (head sp, s')
  takeN_ _ [] = Nothing
  takeN_ n s@(t:ts)
    | s == [] = Nothing
    | n <= 0 = Just ([t {spanEnd = spanStart t, spanBody = ""}], s)
    | n < (T.length . spanBody) t =
        let (l, r) = T.splitAt n (spanBody t)
            sL = spanStart t
            eL = foldl (defaultAdvance1 (mkPos 3)) sL (T.unpack (T.tail l))
            sR = defaultAdvance1 (mkPos 3) eL (T.last l)
            eR = spanEnd t
            l' = [Span sL eL l]
            r' = (Span sR eR r):ts
        in Just (trace (show n) l', r')
    | n == (T.length . spanBody) t = Just ([t], ts)
    | otherwise = case takeN_ (n - T.length (spanBody t)) ts of
        Nothing -> Just ([t], [])
        Just (t', ts') -> Just (t:t', ts')
  takeWhile_ p s = fromJust $ takeN_ (go 0 s) s
    where go n s = case take1_ s of
            Nothing -> n
            Just (c, s') -> if p c
                              then go (n + 1) s'
                              else n
Find include and swap to it:
"include" -> do
  file <- between dquote dquote (many (alphaNumChar <|> char '.' <|> char '/' <|> char '_'))
  s <- liftIO (Data.Text.IO.readFile file)
  p <- getPosition
  i <- getInput
  pushPosition p
  stack %= (:) (p, i)
  setPosition (initialPos file)
  setInput s
And if we reach the end of input pop stack and continue:
parseStream' :: StreamParser [Span]
parseStream' = concat <$> many p
  where p = do
          b <- tick <|> block
          end <- option False (True <$ hidden eof)
          h <- use stack
          when (end && (h /= [])) $ do
            popPosition
            setInput (h ^?! ix 0 . _2)
            stack %= tail
          return b