Parsing table size (bottom-up) - parsing

I've seen a comparison between the sizes of parsing tables constructed for an ambiguous and an unambiguous grammar of the same language. The table created for the ambiguous grammar was significantly smaller. The parser used was SLR(1).
Is it always true that the parsing tables of a bottom-up parser for an ambiguous grammar are smaller than the parsing tables for a corresponding unambiguous grammar? This assumes, of course, that conflicts are resolved correctly.
I have done some research, but I cannot find any proof of or answer to this question.

It is not always the case. Consider the classic grammars for the language of balanced parentheses.
The unambiguous one has 5 states in the SLR(1) automaton.
S -> '(' S ')' S | \epsilon
At the same time, the ambiguous grammar has 6 states in the SLR(1) automaton.
S -> S S | '(' S ')' | \epsilon
Thus the table size for the ambiguous grammar is greater than the table size for the unambiguous grammar.
The same is true about two grammars for the a+ language: S -> a S | a and S -> S S | S S S | a.
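The state counts above can be checked mechanically, since the SLR(1) automaton is the LR(0) item-set automaton. The following is a minimal Python sketch, not the tool used for the original counts. The names `closure`, `goto` and `lr0_states` are illustrative, and it assumes the grammars are used as written, without an added augmented production S' -> S. The absolute counts depend on that convention.

```python
# A grammar maps each nonterminal to a list of right-hand sides (tuples of symbols).
# An LR(0) item is (lhs, rhs, dot_position).

def closure(items, grammar):
    """Add X -> . gamma for every nonterminal X that appears right after a dot."""
    result = set(items)
    work = list(items)
    while work:
        _, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for prod in grammar[rhs[dot]]:
                item = (rhs[dot], prod, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


def goto(state, symbol, grammar):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {(l, r, d + 1) for (l, r, d) in state if d < len(r) and r[d] == symbol}
    return closure(moved, grammar) if moved else frozenset()


def lr0_states(grammar, start):
    """Return the set of reachable LR(0) item sets (assumption: no augmentation)."""
    initial = closure({(start, rhs, 0) for rhs in grammar[start]}, grammar)
    states = {initial}
    work = [initial]
    while work:
        state = work.pop()
        for symbol in {r[d] for (_, r, d) in state if d < len(r)}:
            nxt = goto(state, symbol, grammar)
            if nxt and nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return states


# The four grammars from the answer; () stands for the empty right-hand side.
parens_unambiguous = {"S": [("(", "S", ")", "S"), ()]}
parens_ambiguous = {"S": [("S", "S"), ("(", "S", ")"), ()]}
aplus_unambiguous = {"S": [("a", "S"), ("a",)]}
aplus_ambiguous = {"S": [("S", "S"), ("S", "S", "S"), ("a",)]}

for name, g in [("parens, unambiguous", parens_unambiguous),
                ("parens, ambiguous", parens_ambiguous),
                ("a+, unambiguous", aplus_unambiguous),
                ("a+, ambiguous", aplus_ambiguous)]:
    print(name, len(lr0_states(g, "S")))
```

The number of table rows equals the number of item sets, so comparing `len(lr0_states(...))` for two grammars of the same language compares their table heights directly.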

Related

ANTLR: Why is this grammar rule for tuples not LL(1)?

I have the following grammar rules defined to cover tuples of the form (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2"
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
Also based on the discussion I found here I am further confused: Amend JSON - based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.
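To see why the left-factored rule needs only one token of lookahead, here is a hand-written recursive-descent sketch in Python that follows `tuple : '(' idents ')'` and `idents : IDENT ( COMMA ( idents )? )?`. It is an illustration under assumptions, not ANTLR output: the tokenizer, the token names `LP`, `RP` and `EOF`, and the function names are all made up for the example.

```python
import re

# Hypothetical tokenizer for the example; IDENT follows the lexer rule in the question.
TOKEN = re.compile(
    r"\s*(?:(?P<IDENT>[A-Za-z_][A-Za-z_0-9]*)|(?P<COMMA>,)|(?P<LP>\()|(?P<RP>\)))"
)


def tokenize(text):
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    tokens.append(("EOF", ""))
    return tokens


def parse_tuple(text):
    """Parse a tuple and return its identifiers; every decision peeks at one token."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos][0]

    def expect(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, got {peek()}")
        value = tokens[pos][1]
        pos += 1
        return value

    def idents():
        # idents : IDENT ( COMMA ( idents )? )? ;
        names = [expect("IDENT")]
        if peek() == "COMMA":        # outer optional part: decided by COMMA vs ')'
            expect("COMMA")
            if peek() == "IDENT":    # inner optional part: decided by IDENT vs ')'
                names.extend(idents())
        return names

    expect("LP")
    names = idents()
    expect("RP")
    expect("EOF")
    return names


print(parse_tuple("(a, b,)"))
```

After a COMMA has been consumed, the next token alone says whether another identifier follows or the tuple is about to close, which is the decision the original `(COMMA IDENT)* (COMMA)?` form could not make with k=1.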

Need an example of an LR(1) grammar that is not LR(0)

Could anyone give me a worked example of an LR(1) grammar which is not an LR(0) grammar? I was trying to find out why an LR(1) parser is more efficient and powerful, so I tried an example grammar and found it was not LR(0): there was a conflict in the parsing table. Then I tried LR(1), which did not help either...
A very simple example grammar (augmented):
S->A
A->aBed | aEef
B->m
E->m
I need a detailed analysis.
Could anyone explain with examples? I am getting confused here.
For example:
S -> Aa | Bb
A->c
B->c
In order to decide if a c is an A or a B, you need to know the following symbol.
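The role of the lookahead can be sketched with a tiny hand-coded shift-reduce recognizer for this grammar, written in Python. This is only an illustration, not a generated LR(1) table: the function name `parse`, the `SHIFTS` set and the end marker `$` are assumptions made for the example.

```python
# Grammar: S -> A a | B b,  A -> c,  B -> c
# Allowed shifts, keyed by (stack contents, lookahead token).
SHIFTS = {((), "c"), (("A",), "a"), (("B",), "b")}


def parse(tokens):
    """Return the list of reductions performed, or raise SyntaxError."""
    tokens = list(tokens) + ["$"]   # "$" marks end of input
    stack, trace, i = [], [], 0
    while True:
        lookahead = tokens[i]
        if stack == ["c"]:
            # Both A -> c and B -> c are complete here. An LR(0) parser would
            # have a reduce-reduce conflict; the lookahead token resolves it.
            if lookahead == "a":
                lhs = "A"
            elif lookahead == "b":
                lhs = "B"
            else:
                raise SyntaxError(f"unexpected {lookahead!r} after 'c'")
            stack[-1:] = [lhs]
            trace.append(f"{lhs} -> c")
        elif stack in (["A", "a"], ["B", "b"]) and lookahead == "$":
            trace.append(f"S -> {stack[0]} {stack[1]}")
            return trace
        elif (tuple(stack), lookahead) in SHIFTS:
            stack.append(lookahead)
            i += 1
        else:
            raise SyntaxError(f"unexpected {lookahead!r}")


print(parse(["c", "a"]))
print(parse(["c", "b"]))
```

The only place the lookahead is consulted for a reduction is the state where the stack holds `c`, which is exactly the state that makes the grammar fail the LR(0) test.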
In real life, you most commonly need LR(1) for epsilon productions:
OPTIONAL_A -> ε | A
MULTI_A -> ε | MULTI_A A
... where ε matches only the empty string. In order to reduce an epsilon production, you always need to see past it.

Grammar for arithmetic expressions without alternatives?

How can I write an unambiguous grammar for arithmetic expressions such as a+(b+c)*d?
E.g.
E -> E + T | T
T -> T * F | F
F -> ( E ) | i
but WITHOUT alternatives: in my case, without |T, |F and |i.
This should be possible by adding more rules to the grammar, but I'm having a hard time figuring out how...
NOTE: this is for university, so it may not be a good real-world grammar :)
What you're asking for is impossible. If you do not have alternative productions in your grammar, then it is not possible for there to be any decisions about which productions to use. As a result, your grammar will either generate no strings, or will generate a single string. Grammars with these properties are called LL(0) grammars and are not at all practical.
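The argument can be made concrete with a short Python sketch. With exactly one production per nonterminal, expansion is forced at every step, so the grammar yields either one string or none. The function name `only_sentence` and the two sample grammars are assumptions made for the illustration.

```python
from typing import Dict, FrozenSet, List, Optional


def only_sentence(grammar: Dict[str, List[str]], symbol: str,
                  active: FrozenSet[str] = frozenset()) -> Optional[List[str]]:
    """Expand `symbol` when every nonterminal has exactly one right-hand side.

    Returns the single derivable token list, or None if the expansion
    can never terminate (the nonterminal depends on itself).
    """
    if symbol not in grammar:
        return [symbol]                     # terminal: nothing to choose
    if symbol in active:
        return None                         # recursion with no alternative to escape it
    result: List[str] = []
    for part in grammar[symbol]:
        expanded = only_sentence(grammar, part, active | {symbol})
        if expanded is None:
            return None
        result.extend(expanded)
    return result


# The question's grammar with the |T, |F and |i alternatives removed:
recursive = {"E": ["E", "+", "T"], "T": ["T", "*", "F"], "F": ["(", "E", ")"]}
# A hypothetical non-recursive grammar, still one production per nonterminal:
flat = {"E": ["T", "+", "T"], "T": ["F", "*", "F"], "F": ["i"]}

print(only_sentence(recursive, "E"))        # no finite string at all
print(" ".join(only_sentence(flat, "E")))   # exactly one string
```

Removing the alternatives from the expression grammar leaves every nonterminal recursive with no way out, so it generates nothing. Removing the recursion instead leaves a grammar for a single fixed expression.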
Hope this helps!

Nondeterministic, unambiguous grammar?

According to Wikipedia's GLR description, GLR parsers "handle nondeterministic and ambiguous grammars."
I can visualize an ambiguous grammar, like the dangling else problem, but what's a nondeterministic CF grammar which isn't ambiguous?
Pretty much any non-LR(k) grammar is non-deterministic, but not necessarily ambiguous. The obvious example is when you have some arbitrarily large construct that can be parsed two ways, and which one is correct depends on something AFTER the large construct, e.g.:
S ::= A x | B y
A ::= A a | a
B ::= B a | a
However, such non-deterministic grammars can often be reworked so as to be deterministic, if the two ways of parsing the large construct can be combined. For the grammar above, S ::= A x | A y is a deterministic way of parsing the same language.
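That the reworked grammar describes the same language can be spot-checked by brute force. The Python sketch below enumerates every sentence up to a length bound for both grammars and compares the sets. It is a bounded check, not a proof. The helper name `language` is made up for the example, and the length pruning is only valid because neither grammar has empty productions.

```python
def language(grammar, start, max_len):
    """All terminal strings of length <= max_len derivable from `start`.

    Assumption: no empty right-hand sides, so a sentential form never
    shrinks and forms longer than max_len can be discarded.
    """
    seen = {(start,)}
    work = [(start,)]
    words = set()
    while work:
        form = work.pop()
        # Position of the leftmost nonterminal, if any.
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            words.add("".join(form))
            continue
        for rhs in grammar[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                work.append(new)
    return words


original = {
    "S": [("A", "x"), ("B", "y")],
    "A": [("A", "a"), ("a",)],
    "B": [("B", "a"), ("a",)],
}
reworked = {
    "S": [("A", "x"), ("A", "y")],
    "A": [("A", "a"), ("a",)],
}

print(sorted(language(original, "S", 4)))
print(language(original, "S", 6) == language(reworked, "S", 6))
```

In the reworked grammar the parser can reduce every `a` into A without knowing whether x or y will follow, so the decision that needed unbounded lookahead disappears.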
More interesting are LANGUAGES that are inherently non-deterministic, that is, languages for which there is no deterministic grammar. For that, there needs to be something inside the arbitrarily large construct that must match what comes after, e.g.:
S ::= X x | Y y
X ::= a X a | x
Y ::= a Y a | y

Example for LL(1) Grammar which is NOT LALR?

I am learning about parsers in my Theory of Compilation course.
I need to find an example of a grammar which is LL(1) but not LALR.
I know one should exist. Please help me think of the simplest example for this problem.
Some googling brings up this example for a non-LALR(1) grammar, which is LL(1):
S ::= '(' X
    | E ']'
    | F ')'
X ::= E ')'
    | F ']'
E ::= A
F ::= A
A ::= ε
The LALR(1) construction fails, because there is a reduce-reduce conflict between E and F. In the set of LR(0) states, there is a state made up of
E ::= A . ;
F ::= A . ;
which is needed for both S and X contexts. The LALR(1) lookahead sets for these items thus mix up tokens originating from the S and X productions. This is different for LR(1), where there are different states for these cases.
With LL(1), decisions are made by looking at FIRST sets of the alternatives, where ')' and ']' always occur in different alternatives.
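The LL(1) half of the claim can be checked with a small Python sketch that computes nullable, FIRST and FOLLOW and then compares the predict sets of each nonterminal's alternatives. This is an illustrative implementation of the textbook test, not the output of any parser generator. The function name `ll1_conflicts` and the end marker `$` are assumptions made for the example.

```python
def ll1_conflicts(grammar, start):
    """Return (nonterminal, token, rhs1, rhs2) tuples where two alternatives
    of the same nonterminal are predicted by the same lookahead token."""
    nullable = set()
    first = {n: set() for n in grammar}
    follow = {n: set() for n in grammar}
    follow[start].add("$")

    def first_of(seq):
        """FIRST of a symbol sequence, and whether the sequence can derive ε."""
        out = set()
        for sym in seq:
            if sym not in grammar:          # terminal
                out.add(sym)
                return out, False
            out |= first[sym]
            if sym not in nullable:
                return out, False
        return out, True

    changed = True
    while changed:                          # iterate to a fixed point
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                f, eps = first_of(rhs)
                if not f <= first[lhs]:
                    first[lhs] |= f
                    changed = True
                if eps and lhs not in nullable:
                    nullable.add(lhs)
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym in grammar:
                        f, eps = first_of(rhs[i + 1:])
                        new = f | (follow[lhs] if eps else set())
                        if not new <= follow[sym]:
                            follow[sym] |= new
                            changed = True

    conflicts = []
    for lhs, alternatives in grammar.items():
        chosen = {}
        for rhs in alternatives:
            f, eps = first_of(rhs)
            for tok in f | (follow[lhs] if eps else set()):
                if tok in chosen:
                    conflicts.append((lhs, tok, chosen[tok], rhs))
                chosen[tok] = rhs
    return conflicts


example = {
    "S": [("(", "X"), ("E", "]"), ("F", ")")],
    "X": [("E", ")"), ("F", "]")],
    "E": [("A",)],
    "F": [("A",)],
    "A": [()],                              # A ::= ε
}
print(ll1_conflicts(example, "S"))          # an empty list means LL(1)
```

Because E and F derive only the empty string, the alternatives of S and X are predicted by the token that follows them, and those tokens never overlap within one nonterminal.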
From the Dragon book (Second Edition, p. 242):
The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive or LL methods. For a grammar to be LR(k), we must be able to recognize the occurrence of the right side of a production in a right-sentential form, with k input symbols of lookahead. This requirement is far less stringent than that for LL(k) grammars where we must be able to recognize the use of a production seeing only the first k symbols of what the right side derives. Thus, it should not be surprising that LR grammars can describe more languages than LL grammars.
