Real-world LR(k > 1) grammars? - parsing

Making artificial LR(k) grammars for k > 1 is easy:
Input: A1 B x
Input: A2 B y (introduce reduce-reduce conflict for terminal a)
A1 : a
A2 : a
B : b b b ... b (terminal b occurs k-1 times)
However, are there any real-world non-LR(1) computer languages that are LR(k > 1)-parsable?
Or are non-LR(1) languages also not LR(k) either?

If a language has an LR(k) grammar, then it has an LR(1) grammar which can be generated mechanically from the LR(k) grammar; furthermore, the original parse tree can be recreated from the LR(1) parse tree. A proof of this theorem appears in Section 6.7 of Parsing Theory Vol. II, Sippu & Soisalon-Soininen; they attribute it to "On the complete covering problem for LR(k) grammars" (1976) by MD Mickunas, in JACM 23:17-30.
So there's no language parseable as LR(k) for k>1 which is not also parseable as LR(1), and consequently there really isn't such as thing as an LR(k) language for values of k other than 0 and 1.
However, there are languages for which an LR(2) grammar is much easier. One classic example is the Posix grammar for yacc, which does not require productions to be terminated with a ;. This is unambiguous, because a production must start with an identifier followed by a colon. Written that way, the grammar is clearly LR(2); by the above theorem, an LR(1) grammar exists but it's nowhere near as clean.

Related

Method to calculate predict set of a grammar production in recursive descent parser

I understand first and follow but I am totally lost on the predict sets. can someone explain to me how to go about finding a predict set of a production in a grammar using the first and follow sets? I have not provided a grammar because this is for a homework assignment and I want to know how to do it not how to do it for this specific grammar.
Intuitively, the predict set for a production A → α [Note 1] is the set of terminal symbols which might be the next symbol to be read if that production is to be predicted. (That implies that the production's non-terminal (A) has already been predicted, and the parser must now decide which of the non-terminal's productions to predict.)
Obviously, that includes all the terminal symbols which might be the first symbol of the right-hand side. But what if the right-hand side might derive ε, the empty string? In that case, the next symbol in the input will be the first symbol which comes after the predicted non-terminal, A; in other words, it will be a member of FOLLOW(A). So the predict set contains the terminals which might start the right-hand side α, plus all the symbols in FOLLOW(A) if α could derive the empty string. [Note 2]
More formally, PREDICT(A → α) is:
FIRST(α) if ε ∉ FIRST(α)
(FIRST(α) ∪ FOLLOW(A)) - {ε} if ε ∈ FIRST(α)
Remember that we compute FIRST on a sentential form by "looking through" epsilons:
FIRST(aβ) is
FIRST(a) if ε ∉ FIRST(a)
(FIRST(a) - {ε}) ∪ FIRST(β) if ε ∈ FIRST(a)
Consequently, FIRST of a right hand side only include ε if every symbol in the right-hand side is nullable.
Notes:
I use the common convention that capital letters (A...) refer to non-terminals, lower-case letters (a...) refer to grammar symbols (terminals or non-terminals) and Greek letters (α...) refer to possibly empty sequences of grammar symbols.
Aside from the first step when the start symbol is predicted, the current prediction always contains more than one symbol. So if A is the next non-terminal to expand and we see that it is nullable (i.e., it could derive nothing), we don't really need to lookup FOLLOW(A) because we could just look at the predict stack and see what we've predicted will follow A. In some cases, this might allow us to avoid a conflict with one of the other alternatives for A.
However, it is normal to use FOLLOW(A), regardless. Always using FOLLOW(A) is usually referred to as the "Strong LL" (SLL) algorithm. Although it seems like computing the FIRST set of the known prediction stack is more powerful than using a precomputed FOLLOW set, it does not actually improve the power of LL parsing at all; every non-LL grammar can be converted to an SLL grammar.

Finding an equivalent LR grammar for the same number of "a" and "b" grammar?

I can't seem to find an equivalent LR grammar for:
S → aSbS | bSaS | ε
which I think recognize strings with the same number of 'a' than 'b'.
What would be a workaround for this? Is it possible to find and LR grammar for this?
Thanks in advance!
EDIT:
I have found what I think is an equivalent grammar but I haven't been able to prove it.
I think I need to prove that the original grammar generates the language above, and then prove that language is generated for the following equivalent grammar. But I am not sure how to do it. How should I do it?
S → aBS | bAS | ε
B → b | aBB
A → a | bAA
Thanks in advance...
PS: I have already proven that this new grammar is LL(1), SLR(1), LR(1) and LALR(1).
Unless a grammar is directly related to another grammar -- for example through standard transformations such as normalization, null-production eliminate, and so on -- proving that two grammars derivee the same language is very difficult without knowing what the language is. It is usually easier to prove (independently) that each grammar derives the language.
The first grammar you provide:
S → aSbS | bSaS | ε
does in fact derive the language of all strings over the alphabet {a, b}* where the number of as is the same as the number of bs. We can prove that in two parts: first, that every sentence recognized by the grammar has that property, and second that every sentence which has that property can be derived by that grammar. Both proofs proceed by induction.
For the forward proof, we proceed by induction on the number of derivations. Suppose we have some derivation S → α → β → … → ω where all the greek letters represent sequences of non-terminals and terminals.
If the length of the derivation is exactly zero, so that it starts and ends with S, then there are no terminals in any derived sentence so its clear that every derived sentence has the same number of as and bs. (Base step)
Now for the induction step. Suppose that every derivation of length i is known to end with a derived sentence which has the same number of as and bs. We want to prove from that premise that every derivation of length i+1 ends with a sentence which has the same number of as and bs. But that is also clear: each of the three possible production steps preserves parity.
Now, let's look at the opposite direction: every sentence with the same number of as and bs can be derived from that grammar. We'll do this by induction on the length of the string. Our induction premise will be that if it is the case that for every j ≤ i, every sentence with exactly j as and j bs has a derivation from S, then every sentence with exactly i+1 as and i+1 bs. (Here we are only considering sentences consisting only of terminals.)
Consider such a ssentence. It either starts with an a or a b. Suppose that it starts with an a: then there is at least one b in the sentence such that the prefix ending with that b has the same number of each terminal. (Think of the string as a walk along a square grid: every a moves diagonally up and right one unit, and every b moves diagonally down and right. Since the endpoint is at exactly the same height as the beginning point and there are no wormholes in the graph, once we ascend we must sooner or later descend back to the starting height, which is a prefix ending b.) So the interior of that prefix (everything except the a at the beginning and the b at the end) is balanced, as is the remainder of the string. Both of those are shorter, so by the induction hypothesis they can be derived from S. Making those substitutions, we get aSbS, which can be derived from S. An identical argument applies to strings starting with b. Again, the base step is trivial.
So that's basically the proof procedure you'll need to adapt for your grammar.
Good luck.
By the way, this sort of question can also be posed on cs.stackexchange.com or math.stackexchange.com, where the MathJax is available. MathJax makes writing out mathematical proofs much less tedious, so you may well find that you'll get more readable answers there.

How to determine the k value of an LL(k) grammar

Suppose I'm given the grammar
Z-> X
X-> Y
-> b Y a
Y-> c
-> c a
The grammar is LL(K) What is the K value?
All I know is its not LL(1) since there is a predict set conflict on Y and LL(1) grammar predict set must be disjoint.
Ok, so luckily this question was not on my exam.
As I mentioned, the predict set conflict means its not LL(1) next you just have to observe the minimum number of look ahead need to determine a production value.
In this case two.

how to parse Context-sensitive grammar?

CSG is similar to CFG but the reduce symbol is multiple.
So, can I just use CFG parser to parse CSG with reducing production to multiple terminals or non-terminals?
Like
1. S → a bc
2. S → a S B c
3. c B → W B
4. W B → W X
5. W X → B X
6. B X → B c
7. b B → b b
When we meet W X, can we just reduce W X to W B?
When we meet W B, can we just reduce W B to c B?
So if CSG parser is based on CFG parser, it's not hard to write, is it true?
But when I checked wiki, it said to parse CSG, we should use linear bounded automaton.
What is linear bounded automaton?
Context sensitive grammars are non-deterministic. So you can not assume that a reduction will take place, just because the RHS happens to be visible at some point in a derivation.
LBAs (linear-bounded automata) are also non-deterministic, so they are not really a practical algorithm. (You can simulate one with backtracking, but there is no convenient bound on the amount of time it might take to perform a parse.) The fact that they are acceptors for CSGs is interesting for parsing theory but not really for parsing practice.
Just as with CFGs, there are different classes of CSGs. Some restricted subclasses of CSGs are easier to parse (CFGs are one subclass, for example), but I don't believe there has been much investigation into practical uses; in practice, CSGs are hard to write, and there is no obvious analog of a parse tree which can be constructed from a derivation.
For more reading, you could start with the wikipedia entry on LBAs and continue by following its references. Good luck.

Directly identifying grammar as SLR(1), LR(0), LR(1) etc [duplicate]

Is there a simple way to determine whether a grammar is LL(1), LR(0), SLR(1)... just from looking on the grammar without doing any complex analysis?
For instance: To decide whether a BNF Grammar is LL(1) you have to calculate First and Follow sets - which can be time consuming in some cases.
Has anybody got an idea how to do this faster?
Any help would really be appreciated!
First off, a bit of pedantry. You cannot determine whether a language is LL(1) from inspecting a grammar for it, you can only make statements about the grammar itself. It is perfectly possible to write non-LL(1) grammars for languages for which an LL(1) grammar exists.
With that out of the way:
You could write a parser for the grammar and have a program calculate first and follow sets and other properties for you. After all, that's the big advantage of BNF grammars, they are machine comprehensible.
Inspect the grammar and look for violations of the constraints of various grammar types. For instance: LL(1) allows for right but not left recursion, thus, a grammar that contains left recursion is not LL(1). (For other grammar properties you're going to have to spend some quality time with the definitions, because I can't remember anything else off the top of my head right now :).
In answer to your main question: For a very simple grammar, it may be possible to determine whether it is LL(1) without constructing FIRST and FOLLOW sets, e.g.
A → A + A | a
is not LL(1), while
A → a | b
is.
But when you get more complex than that, you'll need to do some analysis.
A → B | a
B → A + A
This is not LL(1), but it may not be immediately obvious
The grammar rules for arithmetic quickly get very complex:
expr → term { '+' term }
term → factor { '*' factor }
factor → number | '(' expr ')'
This grammar handles only multiplication and addition, and already it's not immediately clear whether the grammar is LL(1). It's still possible to evaluate it by looking through the grammar, but as the grammar grows it becomes less feasable. If we're defining a grammar for an entire programming language, it's almost certainly going to take some complex analysis.
That said, there are a few obvious telltale signs that the grammar is not LL(1) — like the A → A + A above — and if you can find any of these in your grammar, you'll know it needs to be rewritten if you're writing a recursive descent parser. But there's no shortcut to verify that the grammar is LL(1).
One aspect, "is the language/grammar ambiguous", is a known undecidable question like the Post correspondence and halting problems.
Straight from the book "Compilers: Principles, Techniques, & Tools" by Aho, et. al.
Page 223:
A grammar G is LL(1) if and only if whenever A -> alpha | beta are two distinct productions of G, the following conditions hold:
For no terminal "a" do both alpha and beta derive strings beginning with "a"
At most one of alpha and beta can derive the empty string
If beta can reach the empty transition via zero or more transitions, then alpha does not derive any string beginning with a terminal in FOLLOW(A). Likewise, if alpha can reach the empty transition via zero or more transitions, then beta does not derive any string beginning with a terminal in FOLLOW(A)
Essentially this is a matter of verifying the grammar passes the Pairwise Disjointness Test and also does not involve Left Recursion. Or more succinctly a grammar G that is left-recursive or ambiguous cannot be LL(1).
Check whether the grammar is ambiguous or not. If it is, then the grammar is not LL(1) because no ambiguous grammar is LL(1).
ya there are shortcuts for ll(1) grammar
1) if A->B1|B2|.......|Bn
then first(B1)intersection first(B2)intersection .first(Bn)=empty set then it is ll(1) grammar
2) if A->B1|epsilon
then B1 intersection follow(A)is empty set
3) if G is any grammar such that every non terminal derives only one production then the grammar is LL(1)
p0 S' → E
p1 E → id
p2 E → id ( E )
p3 E → E + id
Construct the LR(0) DFA, the FOLLOW set for E and the SLR action/goto tables.
Is this an LR(0) grammar? Prove your answer.
Using the SLR tables, show the steps (shifts, reductions, accept) of an LR parser parsing: id ( id + id )

Resources