Which grammars can be parsed using recursive descent without backtracking? - parsing

According to "Recursive descent parser" on Wikipedia, recursive descent without backtracking (a.k.a. predictive parsing) is only possible for LL(k) grammars.
Elsewhere, I have read that the implementation of Lua uses such a parser. However, the language is not LL(k). In fact, Lua is inherently ambiguous: does a = f(g)(h)[i] = 1 mean a = f(g); (h)[i] = 1 or a = f; (g)(h)[i] = 1? This ambiguity is resolved by greediness in the parser (so the above is parsed as the erroneous a = f(g)(h)[i]; = 1).
This example seems to show that predictive parsers can handle grammars which are not LL(k). Is it true they can, in fact, handle a superset of LL(k)? If so, is there a way to find out whether a given grammar is in this superset?
In other words, if I am designing a language which I would like to parse using a predictive parser, do I need to restrict the language to LL(k)? Or is there a looser restriction I can apply?

TL;DR
For a suitable definition of a recursive descent parser, it is absolutely correct that only LL(k) languages can be parsed by recursive descent.
Lua can be parsed with a recursive descent parser precisely because the language is LL(k); that is, an LL(k) grammar exists for Lua. [Note 1]
1. An LL(k) language may have non-LL(k) grammars.
A language is LL(k) if there is an LL(k) grammar which recognizes the language. That doesn't mean that every grammar which recognizes the language is LL(k); there might be any number of non-LL(k) grammars which recognize the language. So the fact that some grammar is not LL(k) says absolutely nothing about the language itself.
2. Many practical programming languages are described with an ambiguous grammar.
In formal language theory, a language is inherently ambiguous only if every grammar for the language is ambiguous. It is probably safe to say that no practical programming language is inherently ambiguous, since practical programming languages are deterministically parsed (somehow). [Note 2].
Because writing a strictly non-ambiguous grammar can be tedious, it is pretty common for the language documentation to provide an ambiguous grammar, along with textual material which indicates how the ambiguities are to be resolved.
For example, many languages (including Lua) are documented with a grammar which does not explicitly include operator precedence, allowing a simple rule for expressions:
exp ::= exp Binop exp | Unop exp | term
That rule is clearly ambiguous, but given a list of operators, their relative precedences and an indication of whether each operator is left- or right-associative, the rule can be mechanically expanded into an unambiguous expression grammar. Indeed, many parser generators allow the user to provide the precedence declarations separately, and perform the mechanical expansion in the course of producing the parser. The resulting parser, it should be noted, is a parser for the disambiguated grammar so the ambiguity of the original grammar does not imply that the parsing algorithm is capable of dealing with ambiguous grammars.
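The mechanical expansion mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `expand_precedence`, the generated rule names `exp0`, `exp1`, ... and the input format (precedence levels listed from lowest to highest, each either "left" or "right" associative) are all invented for the example and are not taken from any particular parser generator. Unary operators are left out.

```python
def expand_precedence(levels, atom="term"):
    """Expand an operator table into layered, unambiguous grammar rules.

    levels: list of (assoc, [operators]) pairs, lowest precedence first
            (hypothetical input format chosen for this sketch).
    Returns one rule string per precedence level.
    """
    names = [f"exp{i}" for i in range(len(levels))] + [atom]
    rules = []
    for i, (assoc, ops) in enumerate(levels):
        this_level, next_level = names[i], names[i + 1]
        alternatives = [next_level]
        for op in ops:
            if assoc == "left":
                # Left-associative: recurse on the left operand.
                alternatives.append(f"{this_level} '{op}' {next_level}")
            else:
                # Right-associative: recurse on the right operand.
                alternatives.append(f"{next_level} '{op}' {this_level}")
        rules.append(f"{this_level} ::= " + " | ".join(alternatives))
    return rules


for rule in expand_precedence([("left", ["+", "-"]), ("right", ["^"])]):
    print(rule)
```

Each precedence level becomes its own non-terminal, which is exactly the "layered" shape that hand-written unambiguous expression grammars have.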
Another common example of ambiguous reference grammars which can be mechanically disambiguated is the "dangling else" ambiguity found in languages like C (but not in Lua). The grammar:
if-statement ::= "if" '(' exp ')' stmt
| "if" '(' exp ')' stmt "else" stmt
is certainly ambiguous; the intention is that the parse be "greedy". Again, the ambiguity is not inherent. There is a mechanical transformation which produces an unambiguous grammar, something like the following:
matched-statement ::= matched-if-stmt | other-statement
statement ::= matched-if-stmt | unmatched-if-stmt
matched-if-stmt ::= "if" '(' exp ')' matched-statement "else" matched-statement
unmatched-if-stmt ::= "if" '(' exp ')' statement
| "if" '(' exp ')' matched-statement "else" unmatched-if-stmt
It is quite common for parser generators to implicitly perform this transformation. (For an LR parser generator, the transformation is actually implemented by deleting reduce actions if they conflict with a shift action. This is simpler than transforming the grammar, but it has exactly the same effect.)
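In a hand-written recursive descent parser, the same greedy resolution falls out naturally: after parsing the body of an if, the procedure consumes an else if one is there. The following Python sketch is a deliberately simplified illustration; the token format (a flat list in which a condition is a single token and any non-"if" token is a complete statement) and the name `parse_stmt` are assumptions made for the example.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement greedily; return (tree, next_position)."""
    if pos < len(tokens) and tokens[pos] == "if":
        condition = tokens[pos + 1]
        then_branch, pos = parse_stmt(tokens, pos + 2)
        # Greedy choice: an "else" always attaches to the nearest "if".
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
            return ("if", condition, then_branch, else_branch), pos
        return ("if", condition, then_branch), pos
    # Any other token stands for a complete simple statement.
    return tokens[pos], pos + 1


tree, _ = parse_stmt(["if", "c1", "if", "c2", "s1", "else", "s2"])
print(tree)
```

The else ends up inside the inner if, which is the parse the unambiguous grammar above also selects.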
So Lua (and other programming languages) are not inherently ambiguous, and therefore they can be parsed with parsing algorithms which require unambiguous deterministic grammars. Indeed, it might even be a little surprising that there are languages for which every possible grammar is ambiguous. As is pointed out in the Wikipedia article cited above, the existence of such languages was proven by Rohit Parikh in 1961; a simple example of an inherently ambiguous context-free language is
$\{a^n b^m c^m d^n \mid n, m \ge 0\} \cup \{a^n b^n c^m d^m \mid n, m \ge 0\}$.
3. Greedy LL(1) parsing of Lua assignment and function call statements
As with the dangling else construction above, the disambiguation of Lua statement sequences is performed by only allowing the greedy parse. Intuitively, the procedure is straightforward; it is based on forbidding two consecutive statements (without an intervening semicolon) where the second one starts with a token which might continue the first one.
In practice, it is not really necessary to perform this transformation; it can be done implicitly during the construction of the parser. So I'm not going to bother to generate a complete Lua grammar here. But I trust that the small subset of the Lua grammar here is sufficient to illustrate how the transformation can work.
The following subset (largely based on the reference grammar) exhibits precisely the ambiguity indicated in the OP:
program ::= statement-list
statement-list ::= Ø
| statement-list statement
statement ::= assignment | function-call | block | ';'
block ::= "do" statement-list "end"
assignment ::= var '=' exp
exp ::= prefixexp [Note 3]
prefixexp ::= var | '(' exp ')' | function-call
var ::= Name | prefixexp '[' exp ']'
function-call ::= prefixexp '(' exp ')'
(Note: I'm using Ø to represent the empty string, rather than ε, λ, or %empty.)
The Lua grammar, as written, is left-recursive, so it is clearly not LL(k) (independent of the ambiguity). Removing the left-recursion can be done mechanically; I've done enough of it here to demonstrate that the subset is LL(1). Unfortunately, the transformed grammar does not preserve the structure of the parse tree, which is a classic problem with LL(k) grammars. It is usually simple to reconstruct the correct parse tree during a recursive descent parse and I'm not going to go into the details.
It is simple to provide an LL(1) version of exp, but the result eliminates the distinction between var (which can be assigned to) and function-call (which cannot):
exp ::= term exp-postfix
exp-postfix ::= Ø
| '[' exp ']' exp-postfix
| '(' exp ')' exp-postfix
term ::= Name | '(' exp ')'
But now we need to recreate the distinction in order to be able to parse both assignment statements and function calls. That's straightforward (but does not promote understanding of the syntax, IMHO):
a-or-fc-statement ::= term a-postfix
a-postfix ::= '=' exp
| ac-postfix
c-postfix ::= Ø
| ac-postfix
ac-postfix ::= '(' exp ')' c-postfix
| '[' exp ']' a-postfix
In order to make the greedy parse unambiguous, we need to ban (from the grammar) any occurrence of S1 S2 where S1 ends with an exp and S2 starts with a '('. In effect, we need to distinguish different types of statement, depending on whether or not the statement starts with a (, and independently, whether or not the statement ends with an exp. (In practice, there are only three types because there are no statements which start with a ( and do not end with an exp. [Note 4])
statement-list ::= Ø
| s1 statement-list
| s2 s2-postfix
| s3 s2-postfix
s2-postfix ::= Ø
| s1 statement-list
| s2 s2-postfix
s1 ::= block | ';'
s2 ::= Name a-postfix
s3 ::= '(' exp ')' a-postfix
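To see the greedy behaviour concretely, here is a minimal Python sketch of a hand-written recursive descent parser for the small subset used in this section. It is not Lua's actual parser; the class and method names, the pre-split token list and the tuple-shaped parse tree are all assumptions made for illustration. Postfixes are consumed in a loop for as long as one is possible, which is the greedy rule.

```python
class ParseError(Exception):
    pass


class Parser:
    """Hypothetical greedy recursive descent parser for the toy subset."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise ParseError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def term(self):
        tok = self.peek()
        if tok == "(":
            self.expect("(")
            inner = self.exp()
            self.expect(")")
            return ("paren", inner)
        if tok is None or not tok.isalnum():
            raise ParseError(f"unexpected {tok!r}")
        self.pos += 1
        return tok

    def exp(self):
        node = self.term()
        # Greedy: keep consuming postfixes while the lookahead allows one.
        while self.peek() in ("(", "["):
            if self.peek() == "(":
                self.expect("(")
                node = ("call", node, self.exp())
                self.expect(")")
            else:
                self.expect("[")
                node = ("index", node, self.exp())
                self.expect("]")
        return node

    def statement(self):
        target = self.exp()
        is_call = isinstance(target, tuple) and target[0] == "call"
        if self.peek() == "=":
            if is_call:
                raise ParseError("cannot assign to a function call")
            self.expect("=")
            return ("assign", target, self.exp())
        if is_call:
            return target
        raise ParseError("expected assignment or function call")

    def program(self):
        statements = []
        while self.peek() is not None:
            if self.peek() == ";":
                self.pos += 1
            else:
                statements.append(self.statement())
        return statements
```

Feeding it the tokens of `a = f(g)(h)[i] = 1` parses `a = f(g)(h)[i]` as one statement and then fails on the stray `=`, while `a = f(g); (h)[i] = 1` parses as two assignments.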
4. What is recursive descent parsing, and how can it be modified to incorporate disambiguation?
In the most common usage, a predictive recursive descent parser is an implementation of the LL(k) algorithm in which each non-terminal is mapped to a procedure. Each non-terminal procedure starts by using a table of possible lookahead sequences of length k to decide which alternative production for that non-terminal to use, and then simply "executes" the production symbol by symbol: terminal symbols cause the next input symbol to be discarded if it matches or an error to be reported if it doesn't match; non-terminal symbols cause the non-terminal procedure to be called.
The tables of lookahead sequences can be constructed using FIRST_k and FOLLOW_k sets. (A production A→ω is mapped to a sequence α of terminals if α ∈ FIRST_k(ω FOLLOW_k(A)).) [Note 5]
With this definition of recursive descent parsing, a recursive descent parser can handle precisely and solely LL(k) languages. [Note 6]
However, the alignment of LL(k) and recursive descent parsers ignores an important aspect of a recursive descent parser, which is that it is, first and foremost, a program normally written in some Turing-complete programming language. If that program is allowed to deviate slightly from the rigid structure described above, it could parse a much larger set of languages, even languages which are not context-free. (See, for example, the C context-sensitivity referenced in Note 2.)
In particular, it is very easy to add a "default" rule to a table mapping lookaheads to productions. This is a very tempting optimization because it considerably reduces the size of the lookahead table. Commonly, the default rule is used for non-terminals whose alternatives include an empty right-hand side, which in the case of an LL(1) grammar would be mapped to any symbol in the FOLLOW set for the non-terminal. In that implementation, the lookahead table only includes lookaheads from the FIRST set, and the parser automatically produces an empty right-hand side, corresponding to an immediate return, for any other symbol. (As with the similar optimisation in LR(k) parsers, this optimization can delay recognition of errors but they are still recognized before an additional token is read.)
An LL(1) parser cannot include a nullable non-terminal whose FIRST and FOLLOW sets contain a common element. However, if the recursive descent parser uses the "default rule" optimization, that conflict will never be noticed during the construction of the parser. In effect, ignoring the conflict allows the construction of a "greedy" parser from (certain) non-deterministic grammars.
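The FIRST/FOLLOW condition can be checked mechanically. The sketch below uses the textbook fixed-point computation of nullable, FIRST and FOLLOW for k = 1 and reports every nullable non-terminal whose two sets overlap. The grammar encoding (a dict from non-terminal to a list of right-hand sides, each a list of symbols), the function names, the "$" end marker and the tiny example grammar are assumptions made for this illustration.

```python
def first_follow(grammar, start):
    """Compute (nullable, FIRST, FOLLOW) for k = 1 by fixed-point iteration."""
    nullable = set()
    first = {nt: set() for nt in grammar}
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")  # "$" is an assumed end-of-input marker
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                # FIRST and nullable: scan left to right.
                all_nullable = True
                for sym in rhs:
                    sym_first = first[sym] if sym in grammar else {sym}
                    if not sym_first <= first[nt]:
                        first[nt] |= sym_first
                        changed = True
                    if sym not in nullable:
                        all_nullable = False
                        break
                if all_nullable and nt not in nullable:
                    nullable.add(nt)
                    changed = True
                # FOLLOW: scan right to left, tracking what can come next.
                trailer = set(follow[nt])
                for sym in reversed(rhs):
                    if sym in grammar:
                        if not trailer <= follow[sym]:
                            follow[sym] |= trailer
                            changed = True
                        if sym in nullable:
                            trailer = trailer | first[sym]
                        else:
                            trailer = set(first[sym])
                    else:
                        trailer = {sym}
    return nullable, first, follow


def nullable_conflicts(grammar, start):
    """Nullable non-terminals whose FIRST and FOLLOW sets intersect."""
    nullable, first, follow = first_follow(grammar, start)
    return {nt: first[nt] & follow[nt]
            for nt in nullable if first[nt] & follow[nt]}


# Toy grammar: statements with no separator, calls as postfixes.
GRAMMAR = {
    "stmts": [[], ["stmt", "stmts"]],
    "stmt": [["term", "=", "exp"]],
    "exp": [["term", "post"]],
    "post": [[], ["(", "exp", ")", "post"]],
    "term": [["Name"], ["(", "exp", ")"]],
}
print(nullable_conflicts(GRAMMAR, "stmts"))
```

For this toy grammar the only conflict is on `post` with lookahead `(`, which is the point at which a parser using the default rule silently takes the greedy choice.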
That's enormously convenient, because as we have seen above producing unambiguous greedy grammars is a lot of work and does not lead to anything even vaguely resembling a clear exposition of the language. But the modified recursive parsing algorithm is not more powerful; it simply parses an equivalent SLL(k) grammar (without actually constructing that grammar).
I do not intend to provide a complete proof of the above assertion, but a first step is to observe that any non-terminal can be rewritten as a disjunction of new non-terminals, each with a single distinct FIRST token, and possibly a new non-terminal with an empty right-hand side. It is then "only" necessary to remove non-terminals from the FOLLOW set of nullable non-terminals by creating new disjunctions.
Notes
1. Here, I'm talking about the grammar which operates on a tokenized stream, in which comments have been removed and other constructs (such as strings delimited by "long brackets") reduced to a single token. Without this transformation, the language would not be LL(k) (since comments -- which can be arbitrarily long -- interfere with visibility of the lookahead token). This allows me to also sidestep the question of how long brackets can be recognised with an LL(k) grammar, which is not particularly relevant to this question.
2. There are programming languages which cannot be deterministically parsed by a context-free grammar. The most notorious example is probably Perl, but there is also the well-known C construct (x)*y which can only be parsed deterministically using information about the symbol x -- whether it is a variable name or a type alias -- and the difficulties of correctly parsing C++ expressions involving templates. (See, for example, the questions Why can't C++ be parsed with a LR(1) parser? and Is C++ context-free or context-sensitive?)
3. For simplicity, I've removed the various literal constants (strings, numbers, booleans, etc.) as well as table constructors and function definitions. These tokens cannot be the target of a function-call, which means that an expression ending with one of these tokens cannot be extended with a parenthesized expression. Removing them simplifies the illustration of disambiguation; the procedure is still possible with the full grammar, but it is even more tedious.
4. With the full grammar, we will need to also consider expressions which cannot be extended with a (, so there will be four distinct options.
5. There are deterministic LL(k) grammars which fail to produce unambiguous parsing tables using this algorithm, which Sippu & Soisalon-Soininen call the Strong LL(k) algorithm. It is possible to augment the algorithm using an additional parsing state, similar to the state in an LR(k) parser. This might be convenient for particular grammars but it does not change the definition of LL(k) languages. As Sippu & Soisalon-Soininen demonstrate, it is possible to mechanically derive from any LL(k) grammar an SLL(k) grammar which produces exactly the same language. (See Theorem 8.47 in Volume 2).
6. The recursive descent algorithm is a precise implementation of the canonical stack-based LL(k) parser, where the parser stack is implicitly constructed during the execution of the parser using the combination of the current continuation and the stack of activation records.

Related

Simplifying grammar via operator precedence

I am trying to parse C. I have been consulting some context-free C grammars and I have observed that they usually model expressions by using "chained" production rules. For example, [here][1] something like this is done to model logical or and logical and expressions:
<logical-or-expression> ::= <logical-and-expression>
| <logical-or-expression> || <logical-and-expression>
<logical-and-expression> ::= <inclusive-or-expression>
| <logical-and-expression> && <inclusive-or-expression>
I say the expressions are chained because they follow this structure:
expression with operator(N) ::= expression with operator(N+1)
| (expression with operator(N)) operator(N) (expression with operator(N+1))
where N is the precedence of the operator.
I understand that the objective is to disambiguate the language and introduce precedence and associativity rules in a purely syntactic manner.
Is there any reason to model expressions like this in an actual parser with operator precedence support? My initial idea was to implement them simply as:
constant_expression ::= expression1 binary_op expression2
where binary_op is any binary operation and then disambiguate by setting the precedence of all the operators. For example:
logical_expr ::= simple_expr | logical_expr && logical_expr | logical_expr || logical_expr
and then set the precedence of && operator higher than ||. I think this tactic would give a much simpler grammar, as it would eliminate the necessity of a different rule for every level of precedence but I am reluctant to use it because all the implementations I have seen use the former strategy, even in cases where the parser had precedence support.
Thanks in advance.
[1]: https://cs.wmich.edu/~gupta/teaching/cs4850/sumII06/The%20syntax%20of%20C%20in%20Backus-Naur%20form.htm
Many LR-style parsers can handle operator precedence rules using some mechanism external to the grammar itself in part because it allows you to skip this “layering” approach to writing CFGs. If you have a parser generator that supports this, it’s fine to write an ambiguous grammar and then add those external rules in to get the precedence and associativity right.
As a note - parsers for CFGs and BNF rules usually are insensitive to the order in which rules are written, so listing the operators from highest-precedence to lowest-precedence alone isn’t sufficient. (PEG parsers, on the other hand, do represent ordered choices). Also, due to how most parser generators work (having code to execute associated with each production, and using the terminals in a production to determine operator precedence), it’s often easier to have separate rules, one per binary operator, than it is to have one rule of the form “Expr Operator Expr.” But otherwise the basic approach is sound.
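For a hand-written parser, one common way to get the same effect as external precedence declarations is precedence climbing, where a single procedure handles every binary operator and consults a table. The sketch below is an illustration only: the `PREC` table, the function name and the flat token list are assumptions, operands are single tokens, and every operator is treated as left-associative.

```python
# Hypothetical precedence table: higher number binds tighter.
PREC = {"||": 1, "&&": 2}


def parse_expr(tokens, pos=0, min_prec=1):
    """Precedence climbing; returns (tree, next_position)."""
    left = tokens[pos]  # operands are single tokens in this sketch
    pos += 1
    while (pos < len(tokens) and tokens[pos] in PREC
           and PREC[tokens[pos]] >= min_prec):
        op = tokens[pos]
        # Left-associative: the right operand may only contain
        # operators that bind strictly tighter than op.
        right, pos = parse_expr(tokens, pos + 1, PREC[op] + 1)
        left = (op, left, right)
    return left, pos


print(parse_expr(["a", "||", "b", "&&", "c"])[0])
print(parse_expr(["a", "&&", "b", "||", "c"])[0])
```

Adding an operator then means adding a table entry rather than a new grammar rule, which is the simplification the question is after.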

Grammar for recursive descent parsing

Is there an easy way to tell whether a simple grammar is suitable for recursive descent? Is eliminating left recursion and left factoring the grammar enough to achieve this?
Not necessarily.
To build a recursive descent parser (without backtracking), you need to eliminate or resolve all predict conflicts. So one definitive test is to see if the grammar is LL(1); LL(1) grammars have no predict conflicts, by definition. Left-factoring and left-recursion elimination are necessary for this task, but they might not be sufficient, since a predict conflict might be hiding behind two competing non-terminals:
list ::= item list'
list' ::= ε
| ';' item list'
item ::= expr1
| expr2
expr1 ::= ID '+' ID
expr2 ::= ID '(' list ')'
The problem with the above (or, at least, one problem) is that when the parser expects an item and sees an ID, it can't know which of expr1 and expr2 to try. (That's a predict conflict: Both non-terminals could be predicted.) In this particular case, it's pretty easy to see how to eliminate that conflict, but it's not really left-factoring since it starts by combining two non-terminals. (And in the full grammar this might be excerpted from, combining the two non-terminals might be much more difficult.)
In the general case, there is no algorithm which can turn an arbitrary grammar into an LL(1) grammar, or even to be able to say whether the language recognised by that grammar has an LL(1) grammar as well. (However, it's easy to tell whether the grammar itself is LL(1).) So there's always going to be some art and/or experimentation involved.
I think it's worth adding that you don't really need to eliminate left-recursion in a practical recursive descent parser, since you can usually turn it into a while-loop instead of recursion. For example, leaving aside the question of the two expr types above, the original grammar in an extended BNF with repetition operators might be something like
list ::= item (';' item)*
Which translates into something like:
def parse_list():
    parse_item()
    while peek(';'):
        match(';')
        parse_item()
(Error checking and AST building omitted.)
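A self-contained variant of the same idea, which also resolves the expr1/expr2 predict conflict by looking at the token after the ID, might look like the following. It is a sketch under assumptions: the tokens arrive as a pre-split list of strings, and the helper `token_at` and the tuple-shaped results are invented for the example.

```python
def token_at(tokens, pos):
    """Return the token at pos, or None past the end of input."""
    return tokens[pos] if pos < len(tokens) else None


def parse_list(tokens, pos=0):
    """list ::= item (';' item)*   -> (items, next_position)"""
    item, pos = parse_item(tokens, pos)
    items = [item]
    while token_at(tokens, pos) == ";":
        item, pos = parse_item(tokens, pos + 1)
        items.append(item)
    return items, pos


def parse_item(tokens, pos):
    """item ::= ID '+' ID | ID '(' list ')'   (merged, decided after the ID)"""
    name = token_at(tokens, pos)
    if name is None or not name.isidentifier():
        raise SyntaxError(f"expected ID at position {pos}")
    op = token_at(tokens, pos + 1)
    if op == "+":
        rhs = token_at(tokens, pos + 2)
        if rhs is None or not rhs.isidentifier():
            raise SyntaxError(f"expected ID at position {pos + 2}")
        return ("add", name, rhs), pos + 3
    if op == "(":
        args, pos = parse_list(tokens, pos + 2)
        if token_at(tokens, pos) != ")":
            raise SyntaxError(f"expected ')' at position {pos}")
        return ("call", name, args), pos + 1
    raise SyntaxError(f"expected '+' or '(' at position {pos + 1}")


items, end = parse_list("f ( a + b ; c + d ) ; x + y".split())
print(items)
```

The decision between the two alternatives is postponed until after the shared ID has been read, which is what combining the two non-terminals amounts to.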

bison/yacc - limits of precedence settings

So I've been trying to parse a Haskell-like language grammar with bison. I'll omit the standard problems with grammars and unary minus (like, what is (-5) from -5 and \x->x-5 or if a-b is a-(b) or apply a (-b) which itself can still be apply a \x->x-b, haha.) and go straight to the thing that surprised me.
To simplify the whole thing to the point where it matters, consider the following situation:
expression: '(' expression ')'
| expression expression /* lambda application */
| '\\' IDENTIFIER "->" expression /* lambda abstraction */
| expression '+' expression /* some operators to play with */
| expression '*' expression
| IDENTIFIER | CONSTANT /* | ..... */
;
I solved all shift/reduce conflicts with '+' and '*' with %left and %right
precedence macros, but I somehow failed to find any good solution how to set
precedence for the expression expression lambda application. I tried
%precedence and %left and %prec marker as shown for example here
http://www.gnu.org/software/bison/manual/html_node/Non-Operators.html#Non-Operators,
but it looks like bison is completely ignoring any precedence setting on this
rule. At least all combinations I tried failed. Documentation on exactly this
topic is pretty sparse, whole thing looks like suited only for handling the
"classic" expr. OPER expr. case.
Question: Am I doing something wrong, or is this impossible in Bison? If not,
is it just unsupported or is there some theoretical justification why not?
Remark: Of course there's an easy workaround to force left-folding and
precedence that would look schematically like
expression: expression_without_lambda_application
| expression expression_without_lambda_application
;
expression_without_lambda_application: /* ..operators.. */
| '(' expression ')'
;
...but that's not as neat as it could be, right? :]
Thanks!
It's easiest to understand how bison precedence works if you understand how LR parsing works, since it's based on a simple modification of the LR algorithm. (Here, I'm just combining SLR, LALR and LR grammars, because the basic algorithm is the same.)
An LR(1) machine has two possible classes of action:
1. Reduce the right-hand side of the production which ends just before the lookahead token (and consequently is at the top of the stack).
2. Shift the lookahead token.
In an LR(1) grammar, the decision can always be made on the basis of the machine state and the lookahead token. But certain common constructs -- notably infix expressions -- apparently require grammars which appear more complicated than they need to be, and which require more unit reductions than should be necessary.
In an era in which LR parsing was new, and most practitioners were used to some sort of operator precedence grammar (see below for definition), and in which cycles were a lot more expensive than they are now so that the extra unit reductions seemed annoying, the modification of the LR algorithm to use standard precedence techniques was attractive.
The modification -- which is based on a classic algorithm for parsing operator precedence grammars -- involves assigning a precedence value (an integer) to every right-hand side (i.e. every production) and to every terminal. Then, when constructing the LR machine, if a given state and lookahead can trigger either a shift or a reduce action, the conflict is resolved by comparing the precedence of the possible reduction with the precedence of the lookahead token. If the reduction has a higher precedence, it wins; otherwise the machine shifts.
Note that reduction precedences are never compared with each other, and neither are token precedences. They can actually come from different domains. Furthermore, for a simple expression grammar, intuitively the comparison is with the operator "at the top of the stack"; this is actually accomplished by using the right-most terminal in a production to assign the precedence of the production. To handle left vs. right associativity, we don't actually use the same precedence value for a production as for a terminal. Left-associative productions are given a precedence slightly higher than the terminal's precedence, and right-associative productions are given a precedence slightly lower. This could be done by making the terminal precedences multiples of 3 and the reduction precedences a value one greater or less than the terminal. (Actually in practice the comparison is > rather than ≥ so it's possible to use even numbers for terminals, but that's an implementation detail.)
As it turns out, languages are not always quite so simple. So sometimes -- the case of unary operators is a classic example -- it's useful to explicitly provide a reduction precedence which is different from the default. (Another case is where the precedence is more related to the first terminal than the last, in the case where there are more than one.)
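The comparison described above can be modelled in a few lines. The function below is only a sketch of the decision rule, not bison's code: the name `resolve_conflict`, the integer precedence levels and the string results are assumptions for illustration. It follows the rule as commonly documented: a higher rule precedence reduces, a higher token precedence shifts, and a tie is settled by associativity.

```python
def resolve_conflict(rule_prec, token_prec, token_assoc):
    """Decide a shift/reduce conflict from precedence levels.

    rule_prec:   precedence level of the production to reduce (or None)
    token_prec:  precedence level of the lookahead token (or None)
    token_assoc: "left", "right" or "nonassoc" for the lookahead's level
    """
    if rule_prec is None or token_prec is None:
        # No precedence on one side: the conflict stays unresolved
        # (bison then defaults to shifting and reports the conflict).
        return "unresolved"
    if rule_prec > token_prec:
        return "reduce"
    if rule_prec < token_prec:
        return "shift"
    # Same level: associativity breaks the tie.
    return {"left": "reduce", "right": "shift", "nonassoc": "error"}[token_assoc]


# Hypothetical levels: '+' = 1 (left), '*' = 2 (left), '^' = 3 (right).
print(resolve_conflict(1, 2, "left"))   # stack: e + e, lookahead '*'
print(resolve_conflict(2, 1, "left"))   # stack: e * e, lookahead '+'
print(resolve_conflict(3, 3, "right"))  # stack: e ^ e, lookahead '^'
```

Note that the function only ever compares one rule with one token, which is why a production with no terminal at the conflict point (such as application by juxtaposition) gives it nothing natural to compare against.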
Editorial note:
Really, this is all a hack. It's a good hack, and it can be useful. But like all hacks, it can be pushed too far. Intricate tricks with precedence which require a full understanding of the algorithm and a detailed analysis of the grammar are not, IMHO, elegant. They are confusing. The whole point of using a context-free-grammar formalism and a parser generator is to simplify the presentation of the grammar and make it easier to verify. /Editorial note.
An operator precedence grammar is an operator grammar which can be bottom-up parsed using only precedence relations (using an algorithm such as the classic "shunting-yard" algorithm). An operator grammar is a grammar in which no right-hand side has two consecutive non-terminals. And the production:
expression: expression expression
cannot be expressed in an operator grammar.
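The operator-grammar condition as stated here is easy to test mechanically. The check below is a small sketch using an assumed encoding (a dict from non-terminal to a list of right-hand sides, each a list of symbols); the function name is invented for the example, and it tests only the "no two consecutive non-terminals" condition given above.

```python
def is_operator_grammar(grammar):
    """True if no right-hand side has two adjacent non-terminals."""
    for productions in grammar.values():
        for rhs in productions:
            for left, right in zip(rhs, rhs[1:]):
                if left in grammar and right in grammar:
                    return False
    return True


infix_only = {
    "expression": [
        ["(", "expression", ")"],
        ["expression", "+", "expression"],
        ["expression", "*", "expression"],
        ["IDENTIFIER"],
    ],
}
with_application = {
    "expression": infix_only["expression"] + [["expression", "expression"]],
}
print(is_operator_grammar(infix_only), is_operator_grammar(with_application))
```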
In that production, the shift/reduce conflict comes in the middle, just before where the operator would be if there were an operator. In that case, one would want to compare the precedence of whichever reduction gave rise to the first expression with the invisible operator which separates the expressions.
In some circumstances (and this requires careful grammar analysis, and is consequently very fragile), it's possible to distinguish between terminals which could start an expression and terminals which could be operators. In that case, it would be possible to use the precedence of the terminals in the FIRST set of expression as the comparators in the precedence comparison. Since those terminals will never be used as the comparators in an operator production, no additional ambiguity is created.
Of course, that fails as soon as it is possible for a terminal to be either an infix or a prefix operator, such as unary minus. So it's probably only of theoretical interest in most languages.
In summary, I personally think that the solution of explicitly defining non-application expressions is clear, elegant and consistent with the theory of LR parsing, while any attempt to use precedence relations will turn out to be far less easy to understand and verify.
But, if you insist, here is a grammar which will work in this particular case (without unary operators), based on assigning precedence values to the tokens which might start an expression:
%token IDENTIFIER CONSTANT APPLY
%left '(' ')' '\\' IDENTIFIER CONSTANT APPLY
%left '+'
%left '*'
%%
expression: '(' expression ')'
| expression expression %prec APPLY
| '\\' IDENTIFIER "->" expression
| expression '+' expression
| expression '*' expression
| IDENTIFIER | CONSTANT
;

LR(1) grammar: how to tell? examples for/against?

I'm currently having a look at GNU Bison to parse program code (or actually to extend a program that uses Bison for doing that). I understand that Bison can only (or: best) handle LR(1) grammars, i.e. a special form of context-free grammars; and I also (believe I) understand the rules of context-free and LR(1) grammars.
However, somehow I'm lacking a good understanding of the notion of an LR(1) grammar. Assume SQL, for instance. SQL incorporates - I believe - a context-free grammar. But is it also an LR(1) grammar? How could I tell? And if not, what would violate the LR(1) rules?
LR(1) means that you can choose proper rule to reduce by knowing all tokens that will be reduced plus one token after them. There are no problems with AND in boolean queries and in BETWEEN operation. The following grammar, for example is LL(1), and thus is LR(1) too:
expr ::= and_expr | between_expr | variable
and_expr ::= expr "and" expr
between_expr ::= "between" expr "and" expr
variable ::= x
I believe that the whole SQL grammar is even simpler than LR(1). Probably LR(0) or even LL(n).
Some of my customers created SQL and DB2 parsers using my LALR(1) parser generator and used them successfully for many years. The grammars they sent me are LALR(1) (except for the shift-reduce conflicts which are resolved the way you would want). For the purists -- not LALR(1), but work fine in practice, no GLR or LR(1) needed. You don't even need the more powerful LR(1), AFAIK.
I think the best way to figure this out is to find an SQL grammar and a good LALR/LR(1) parser generator and see if you get a conflict report. As I remember an SQL grammar (a little out of date) that is LALR(1), is available in this download: http://lrstar.tech/downloads.html
LRSTAR is an LR(1) parser generator that will give you a conflict report. It's also LR(*) if you cannot resolve the conflicts.

Postfix and right-associative operators in LR(0) parsers

Is it possible to construct an LR(0) parser that could parse a language with both prefix and postfix operators? For example, if I had a grammar with the + (addition) and ! (factorial) operators with the usual precedence then 1+3! should be 1 + 3! = 1 + 6 = 7, but surely if the parser were LR(0) then when it had 1+3 on the stack it would reduce rather than shift?
Also, do right-associative operators pose a problem? For example, 2^3^4 should be 2^(3^4) but again, when the parser has 2^3 on the stack how would it know whether to reduce or shift?
If this isn't possible is there still a way to use an LR(0) parser, possibly by altering the grammar to add brackets in the appropriate places?
LR(0) parsers have a weakness in that they can only parse prefix-free languages, languages where no string in the language is a prefix of any other. This generally makes it a bit tricky to parse expressions like these, since something like 5 is a prefix of 5!. This also explains why it's hard to get right-associative operators - given a production like
S → F | F ^ S
the parser will have a shift/reduce conflict after seeing an F because it can't tell whether to extend it or to reduce again. This is related to the prefix-free property mentioned earlier.
This weakness of LR(0) is one of the reasons why people don't use it much in practice. SLR(1) and LALR(1) parsers can usually parse these grammars because they have a token of lookahead that lets them decide whether to shift or reduce. In the above case, the parsers wouldn't encounter shift/reduce conflicts because when deciding whether to reduce an F or shift a ^, they can see to shift the ^ because there's no correct string where a ^ should appear after an S.
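The prefix-free property is easy to test on a finite sample of sentences, which can be a quick sanity check before attempting an LR(0) grammar. The helper below is a hypothetical illustration (the name and the use of plain strings as sentences are assumptions); it relies on the fact that, in sorted order, a string that is a prefix of any other string is also a prefix of its immediate successor.

```python
def is_prefix_free(sentences):
    """True if no sentence in the sample is a proper prefix of another."""
    words = sorted(set(sentences))
    return not any(later.startswith(earlier)
                   for earlier, later in zip(words, words[1:]))


print(is_prefix_free(["1+3", "1+3!"]))  # "1+3" is a prefix of "1+3!"
print(is_prefix_free(["1+3;", "1+3!;"]))
```

The second call shows the usual workaround: once every sentence ends with an explicit terminator, no sentence can be a prefix of another.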

Resources