Invalid grammar for a top-down recursive parser - parsing

I am trying to create grammar for a naive top-down recursive parser. As I understand the basic idea is to write a list of functions (top-down) that correspond to the productions in the grammar. Each function can call other functions (recursive).
The rules for a list include any number of numbers, but they must be separated by commas.
Here's an example of grammar I came up with:
LIST ::= NUM | LIST "," NUM
NUM ::= [0-9]+
Apparently this is incorrect, so my question is: why is this grammar not able to be parsed by a naive top-down recursive descent parser? What would be an example of a valid solution?

The issue is that for a LL(1) recursive decent parser such as this:
For any i and j (where j ≠ i) there is no symbol that can start both an instance of Wi and an instance of Wj.
This is because otherwise the parser will have errors knowing what path to take.
The correct solution can be obtained by left-factoring, it would be:
LIST ::= NUM REST
REST ::= "" | "," NUM
NUM ::= [0-9]+

Related

ANTLR does not automatically do lookahead matching?

I'm currently writing a simple grammar that requires operator precedence and mixed associativities in one expression. An example expression would be a -> b ?> C ?> D -> e, which should be parsed as (a -> (((b ?> C) ?> D) -> e). That is, the ?> operator is a high-precedence left-associative operator wheras the -> operator is a lower-precedence right-associative operator.
I'm prototyping the grammar in ANTLR 3.5.1 (via ANTLRWorks 1.5.2) and find that it can't handle the following grammar:
prog : expr EOF;
expr : term '->' expr
| term;
term : ID rest;
rest : '?>' ID rest
| ;
It produces rule expr has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2 error.
The term and rest productions work fine in isolation when I tested it , so I assumed this happened because the parser is getting confused by expr. To get around that, I did the following refactor:
prog : expr EOF;
expr : term exprRest;
exprRest
: '->' expr
| ;
term : ID rest;
rest : DU ID rest
| ;
This works fine. However, because of this refactor I now need to check for empty exprRest nodes in the output parse tree, which is non-ideal. Is there a way to make ANTLR work around the ambiguity in the initial declaration of expr? I would of assumed that the generated parser would fully match term and then do a lookahead search for "->" and either continue parsing or return the lone term. What am I missing?
As stated, the problem is in this rule:
expr : term '->' expr
| term;
The problematic part is the term which is common to both alternatives.
LL(1) grammar doesn't allow this at all (unless term only matches zero tokens - but such rules would be pointless), because it cannot decide which alternative to use with only being able to see one token ahead (that's the 1 in LL(1)).
LL(k) grammar would only allow this if the term rule could match at most k - 1 tokens.
LL(*) grammar which ANTLR 3.5 uses does some tricks that allows it to handle rules that match any number of tokens (ANTLR author calls this "variable look-ahead").
However, one thing that these tricks cannot handle is if the rule is recursive, i.e. if it or any rules it invokes reference itself in any way (direct or indirect) - and that is exactly what your term rule does:
term : ID rest;
rest : '?>' ID rest
| ;
- the rule rest, referenced from term, recursively references itself. Thus, the error message
rule expr has non-LL(*) decision due to recursive rule invocations ...
The way to solve this limitation of LL grammars is called left-factoring:
expr : term
( '->' expr )?
;
What I did here is said "match term first" (since you want to match it in both alternatives, there's no point in deciding which one to match it in), then decide whether to match '->' expr (this can be decided just by looking at the very next token - if it's ->, use it - so this is even LL(1) decision).
This is very similar to what you came to as well, but the parse tree should look very much like you intended with the original grammar.

Building AST in OCaml

I'm using OCaml to build a recursive descent parser for a subset of Scheme. Here's the grammar:
S -> a|b|c|(T)
T -> ST | Epsilon
So say I have:
type expr =
Num of int | String of string | Tuple of expr * expr
Pseudocode
These functions have to return expr type to build the AST
parseS lr =
if head matches '(' then
parseL lr
else
match tokens a, b, or c
Using First Set of S which are tokens and '(':
parseL lr =
if head matches '(' or the tokens then
Tuple (parseS lr, parseL lr)
else
match Epsilon
My question is "How do I return for the Epsilon part since I just can't return ()?". An OCaml function requires same return type and even if I leave blank for Epsilon part, OCaml still assumes unit type.
As far as I can see, your AST doesn't match your grammar.
You can solve the problem by having a specifically empty node in your AST type to represent the Epsilon in your grammar.
Or, you can change your grammar to factor out the Epsilon.
Here's an equivalent grammar with no Epsilon:
S -> a|b|c|()|(T)
T -> S | S T
Maybe instead of creating parser-functions manually it is better to use existent approaches: for example, LALR(1) ocamlyacc or camlp4 based LL(k) parsers ?

Is this BNF grammar LL(1)?

Can someone please confirm for me if the following BNF grammar is LL(1):
S ::= A B B A
A ::= a
A ::=
B ::= b
B ::=
where S is the start symbol and non terminals A and B can derive to epsilon. I know if there are 2 or more productions in a single cell in the parse table, then the grammar isn't LL(1). But if a cell already contains epsilon, can we safely replace it with the new production when constructing the parse table?
This grammar is ambiguous, and thus not LL(1), nor LL(k) for any k.
Take a single a or b as input, and see that it can be matched by either of the A or B references from S. Thus there are two different parse trees, proving that the grammar is ambiguous.

BNF grammar for left-associative operators

I have the following EBNF grammar for simple arithmetic expressions with left-associative operators:
expression:
term {+ term}
term:
factor {* factor}
factor:
number
( expression )
How can I convert this into a BNF grammar without changing the operator associativity? The following BNF grammar does not work for me, because now the operators have become right-associative:
expression:
term
term + expression
term:
factor
factor * term
factor:
number
( expression )
Wikipedia says:
Several solutions are:
rewrite the grammar to be left recursive, or
rewrite the grammar with more nonterminals to force the correct precedence/associativity, or
if using YACC or Bison, there are operator declarations, %left, %right and %nonassoc, which tell the parser generator which associativity to force.
But it does not say how to rewrite the grammar, and I don't use any parsing tools like YACC or Bison, just simple recursive descent. Is what I'm asking for even possible?
expression
: term
| expression + term;
Just that simple. You will, of course, need an LR parser of some description to recognize a left-recursive grammar. Or, if recursive descent, recognizing such grammars is possible, but not as simple as right-associative ones. You must roll a small recursive ascent parser to match such.
Expression ParseExpr() {
Expression term = ParseTerm();
while(next_token_is_plus()) {
consume_token();
Term next = ParseTerm();
term = PlusExpression(term, next);
}
return term;
}
This pseudocode should recognize a left-recursive grammar in that style.
What Puppy suggests can also be expressed by the following grammar:
expression: term opt_add
opt_add: '+' term opt_add
| /* empty */
term: factor opt_mul
opt_mul: '*' factor opt_mul
| /* emtpty */
factor: number
| '(' expression ')

Example for LL(1) Grammar which is NOT LALR?

I am learning now about parsers on my Theory Of Compilation course.
I need to find an example for grammar which is in LL(1) but not in LALR.
I know it should be exist. please help me think of the most simple example to this problem.
Some googling brings up this example for a non-LALR(1) grammar, which is LL(1):
S ::= '(' X
| E ']'
| F ')'
X ::= E ')'
| F ']'
E ::= A
F ::= A
A ::= ε
The LALR(1) construction fails, because there is a reduce-reduce conflict between E and F. In the set of LR(0) states, there is a state made up of
E ::= A . ;
F ::= A . ;
which is needed for both S and X contexts. The LALR(1) lookahead sets for these items thus mix up tokens originating from the S and X productions. This is different for LR(1), where there are different states for these cases.
With LL(1), decisions are made by looking at FIRST sets of the alternatives, where ')' and ']' always occur in different alternatives.
From the Dragon book (Second Edition, p. 242):
The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive or LL methods. For a grammar to be LR(k), we must be able to recognize the occurrence of the right side of a production in a right-sentential form, with k input symbols of lookahead. This requirement is far less stringent than that for LL(k) grammars where we must be able to recognize the use of a production seeing only the first k symbols of what the right side derives. Thus, it should not be surprising that LR grammars can describe more languages than LL grammars.

Resources