From the Yacc introduction by Stephen C. Johnson:
With right recursive rules, such as
seq : item
| item seq
;
the parser would be a bit bigger, and the items would be seen, and
reduced, from right to left. More seriously, an internal stack in the
parser would be in danger of overflowing if a very long sequence were
read. Thus, the user should use left recursion wherever reasonable.
I know Yacc generates LR parsers, so I tried parsing some simple right recursive grammars by hand, and I couldn't see the problem. Can anyone give an example to demonstrate these problems?
The parser size is not a serious issue (most of the time).
The runtime stack size can be an issue. The trouble is that with the right recursive rule, nothing on the stack can be reduced until the parser reaches the end of the sequence, whereas with the left recursive rule, each time the parser encounters an item it can reduce, keeping the stack shallow.
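For contrast, the left-recursive version of the same rule (the standard rewrite) reduces after every item:
seq : item
| seq item
;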
Classically, the stack for tokens was fixed and limited in size. Consequently, a right recursive rule such as one to handle:
IF <cond> THEN
<stmt-list>
ELIF <cond> THEN
<stmt-list>
ELIF <cond> THEN
<stmt-list>
ELSE
<stmt-list>
ENDIF
would limit the length of the chain of ELIF clauses that the parser could accept.
Hypothetical grammar:
if_stmt: if_clause opt_elif_clause_list opt_else_clause ENDIF
;
if_clause: IF condition THEN stmt_list
;
opt_elif_clause_list: /* Nothing */
| elif_clause opt_elif_clause_list /* RR */
;
elif_clause: ELIF condition THEN stmt_list
;
opt_else_clause: /* Nothing */
| ELSE stmt_list
;
stmt_list: stmt
| stmt_list stmt /* LR */
;
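The stack growth comes from the /* RR */ rule. A left-recursive rewrite (a sketch; only that one rule changes) lets the parser reduce each elif_clause as soon as it is complete:
opt_elif_clause_list: /* Nothing */
| opt_elif_clause_list elif_clause /* LR */
;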
I seem to remember testing this a long time ago (a decade or more ago), and the Yacc I was using at the time, with a language grammar similar to the one above, stopped after about 300 ELIF clauses. I think it stopped under control, recognizing that it had run out of space, rather than crashing oblivious to the space exhaustion.
I'm not at all sure why he says the right recursive parser will be bigger -- in general it will require one fewer state in its state machine, which should, if anything, make it smaller. The real issue is that the right recursive grammar requires stack space proportional to the length of the sequence, while the left recursive grammar gets by with constant space: O(n) versus O(1).
Now O(n) vs O(1) may sound like a big deal, but depending on what you are doing, it may be unimportant. In particular, if you're reading your entire input into memory to process it, that's O(n) space that completely overwhelms the O(n) vs O(1) distinction for right vs left recursion. If you're using a particularly old version of yacc that still has a fixed parser stack, it may be a problem, but recent versions of yacc (Berkeley yacc, bison) automatically expand their parse stack on demand, so the only limit is available memory.
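For reference, bison-generated C parsers expose this limit through two macros, YYINITDEPTH and YYMAXDEPTH, and grow the stack automatically between the two (a sketch -- the default values shown are the ones I recall from bison's documentation):
%{
/* bison grows the parser stack from YYINITDEPTH entries up to
   YYMAXDEPTH, then reports "memory exhausted" */
#define YYINITDEPTH 200 /* initial stack depth (the default) */
#define YYMAXDEPTH 100000 /* growth limit (default 10000) */
%}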
I am trying to build my first parser. Unfortunately, I am not familiar with the theory of grammars, and now I wonder whether it is
plainly forbidden,
just a bad idea, or
kind of OK
to have circular dependencies in my grammar. My intuition raises a yellow flag, but since I am not familiar with the theory of parsers, I am not sure.
Assuming my lexer is well-defined and its tokens are the ones one would expect from their names, I have the following grammar:
list_content : value
| list_content COMMA list_content
list : LBRACE list_content RBRACE
value : INT
| list
Here, value depends on list, list depends on list_content, and list_content depends on value.
I have seen recursive definitions in grammars before, such as:
sum : NUMBER + NUMBER
| NUMBER + sum
| LBRACE sum RBRACE
However, I think my circular definition is different (that is, dirtier), because it is harder to take in at a glance and the defining circle spans multiple grammar rules. I am not sure whether it creates an ambiguity in my grammar, and I fear it might make my code hard to debug.
So, I have two questions:
A) Should I restructure my grammar (and my lexer) or is it OK to live with this circular definition?
B) If I should restructure, how would I best do so?
A circular dependence like this is fine -- it's a recursive definition, analogous to using recursion in a program. As such, the important thing to look at is how the base case is realized, since that is how the recursion terminates. If you don't have a base case (or it can't be reached without triggering further recursion), you have a problem -- an infinite loop that can never match any finite input.
In your case, the base case is the INT rule -- since value can reduce to a single INT and list_content to a single value, everything is fine.
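Annotated with those base cases (comments added to the questioner's grammar):
list_content : value /* base: a single value */
| list_content COMMA list_content
list : LBRACE list_content RBRACE
value : INT /* base: terminates the recursion */
| list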
You do have one issue in your grammar in that the rule
list_content: list_content COMMA list_content
is ambiguous. What this means is that for any list with three or more elements (two or more COMMAs), there are multiple ways to parse it -- matching either the left comma (left recursive) or the right comma (right recursive) first. This will cause problems with most parser tools, which cannot deal with ambiguity, though in your case it is probably harmless (you don't really care which way it is parsed, since you'll likely just concatenate the lists).
The fix is to rewrite the rule as a simple left- or right-recursive rule (but not both). Which one you want depends on the parser style you are using -- for an LL (top-down or recursive descent) parser, you want a right recursive rule; for an LR (bottom-up or shift/reduce) parser, you (generally) want a left recursive rule.
left recursive: list_content : value | list_content COMMA value ;
right recursive: list_content : value | value COMMA list_content ;
As far as I understand, most languages are context-free with some exceptions. For instance, a * b may stand for type * pointer_declaration or for multiplication in C++; which one applies depends on the context, the meaning of the first identifier. Another example is the name production in VHDL:
enum_literal ::= char_literal | identifier
physical_literal ::= [num] unit_identifier
func_call ::= func_identifier [parenthesized_args]
array_indexing ::= arr_name (index_expr)
name ::= func_call | physical_literal | enum_literal | array_indexing
You see that the syntactic forms are different, but they can match the same input when the optional parts are omitted: given a bare f, does it stand for a func_call, a physical_literal (like 1 meter with the optional amount 1 implied), or an enum_literal?
Talking to the Scala plugin designers, I learned that you build an AST so you can re-evaluate it when dependencies change; there is no need to re-parse a file if you have its AST. The AST is also useful for displaying the file contents. But the AST is invalidated if the grammar is context-sensitive (suppose f was a function defined in another file, but the user later requalified it into an enum literal, or left it undefined). The AST changes whenever the dependencies change. The other option, which I am asking you to evaluate and tell me how to build, is an ambiguous AST.
As far as I know, parser combinators are of the PEG kind. They hide the ambiguity by returning the first matched production, so f would match a function call simply because that is the first alternative in my grammar. I am asking for a combinator that, instead of stopping at the first success, proceeds to the next alternative and, in the end, returns a list of all matching alternatives -- it would return the ambiguity.
I do not know how you would display the ambiguous contents tree to the user, but it would eliminate the need to re-parse the dependent files. I would also be happy to know how modern language design solves this problem.
Once an ambiguous node is parsed and the ambiguous results are returned, I would like the parser to converge, because I want to continue parsing beyond the name rather than parse to the end of file after every ambiguity. The situation is complicated by cases like f(10), which can be a function call with a single argument, or a nullary function call returning an array that is then indexed. So f(10) can match name two ways: directly as func_call, or recursively as array_indexing -> name ~ (expr). It is not just an ambiguity between parallel rules like fcall | literal; some branches may run longer than one parser before re-converging, like fcall ~ (expr) | fcall.
How would you go about solving it? Is it possible to add ambiguating combinator to PEG?
First, you claim that "most languages are context-free with some exceptions"; this is not totally true. When designing a computer language, we mostly try to keep it as context-free as possible, since CFGs are the de-facto standard for the job and staying context-free eases a lot of the work. This is not always feasible, though, and many languages depend on the semantic analysis phase to disambiguate any possible ambiguities.
Parser combinators usually do not use a formal model; PEGs, on the other hand, are a formalism for grammars, as are CFGs. Over the last decade a few people have decided to use PEGs over CFGs for two reasons: PEGs are, by design, unambiguous, and they can always be parsed in linear time. A parser combinator library might use PEGs as its underlying formalism, but might as well use CFGs, or none at all.
PEGs are attractive for designing computer languages because we usually do not want to handle ambiguities, which are hard (or even impossible) to avoid with CFGs. Because of their determinism, PEGs can be parsed in O(n) time using dynamic programming (the so-called packrat parser). It is not simple to "add ambiguities to them", most importantly because the language a PEG recognizes depends on the fact that the choices are deterministic, which is used, for example, when checking lookahead. It isn't as simple as "just picking the first choice". For example, you could define a PEG:
S = "a" S "a" / "aa"
This grammar parses only sequences of N "a"s, where N is a power of 2: it recognizes sequences of 2, 4, 8, 16, 32, 64, etc. letters "a". Reading the choice nondeterministically, as a CFG does, would instead recognize any even number of "a"s (2, 4, 6, 8, 10, etc.), which is a different language.
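For contrast, here are the same rules read as a CFG, written in yacc-style notation (illustration only -- this grammar needs to find the middle of the input, so it is not something yacc itself could handle with one token of lookahead); it derives every even-length string of "a"s:
s : 'a' s 'a'
| 'a' 'a'
;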
To answer your question,
How would you go about solving it? Is it possible to add ambiguating combinator to PEG?
First I must say that this is probably not a good idea. If you wish to keep ambiguity on the AST, you probably should use a CFG parser instead.
One could, for example, make a parser for PEGs that is similar to a parser for Boolean grammars, but then the asymptotic parsing time would grow from O(n) to O(n³) to keep all the alternatives alive, while still recognizing the same language. We would actually lose both good things about PEGs at once.
Another way would be to keep a packrat parser in memory and traverse its table to handle the semantics from the AST. Not really a good idea either, since this would imply a large memory footprint.
Ideally, one should build an AST which already has information regarding possible ambiguities by changing the grammar structure. While this requires manual work, and usually isn't simple, you wouldn't have to go back a phase to check the grammar again.
I am making a parser using bison. I just want to ask whether it is still necessary for a grammar to be left-factored when used with bison. I tried giving bison a non-left-factored grammar and it gave no warning or error, and it also accepted the example syntax I gave the parser, but I'm worried that the parser may not be accurate on every input.
Left factoring is how you remove LL-conflicts in a grammar. Since Bison uses LALR it has no problems with left recursion or any other LL-conflicts (indeed, left recursion is preferable as it minimizes stack requirements), so left factoring is neither necessary nor desirable.
Note that left factoring won't break anything -- bison can deal with a left-factored grammar as well as a non-left factored one, but it may require more resources (memory) to parse the left-factored grammar, so in general, don't.
edit
You seem to be confused about how LL-vs-LR parsing work and how the structure of the grammar affects each.
LL parsing is top down -- you start with just the start symbol on the parse stack, and at each step, you replace the non-terminal on top of the stack with the symbols from the right side of some rule for that non-terminal. When there is a terminal on top of the stack, it must match the next token of input, so you pop it and consume the input. The goal being to consume all the input and end up with an empty stack.
LR parsing is bottom up -- you start with an empty stack, and at each step you either copy a token from the input to the stack (consuming it), or you replace a sequence of symbols on the top of the stack corresponding to the right side of some rule with the single symbol from the left side of the rule. The goal being to consume all the input and be left with just the start symbol on the stack.
So different rules for the same non-terminal which start with the same symbols on the right side are a big problem for LL parsing -- you could replace that non-terminal with the symbols from either rule and match the next few tokens of input, so you would need more lookahead to know which to do. But for LR parsing, there's no problem -- you just shift (move) the tokens from the input to the stack and when you get to the later tokens you decide which right side it matches.
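For example (a hypothetical fragment, not from the question), these two rules share the prefix IDENTIFIER '(' and are fine as-is for an LR parser, while an LL(1) parser would need the left-factored form:
call : IDENTIFIER '(' ')'
| IDENTIFIER '(' arg_list ')'
;
/* left-factored form for LL: */
call : IDENTIFIER '(' opt_args ')'
;
opt_args : /* nothing */
| arg_list
;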
LR parsing tends to have problems with rules that end with the same tokens on the right hand side, rather than rules that start with the same tokens. In your example from John Levine's book, there are rules "cart_animal ::= HORSE" and "work_animal ::= HORSE", so after shifting a HORSE symbol, it could be reduced to (replaced by) either "cart_animal" or "work_animal". Since the context allows either to be followed by the "AND" token, you end up with a reduce/reduce (LR) conflict when the next token is "AND".
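Reconstructed from that description (a sketch, not the book's exact text), the conflicting grammar looks something like this:
phrase : cart_animal AND CART
| work_animal AND PLOW
;
cart_animal : HORSE
;
work_animal : HORSE
;
/* with HORSE on the stack and AND as the lookahead, the parser cannot
   choose between the two reductions: a reduce/reduce conflict */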
In fact, the opposite is true. Parsers generated by LALR(1) parser generators not only support left recursion, they in fact work better with left recursion. Ironically, you may have to refactor right recursion out of your grammar.
Right recursion works; however, it delays reduction, so the parse stack usage is proportional to the size of the recursive construct being parsed.
For instance, building a Lisp-style list like this:
list : item { $$ = cons($1, nil); }
| item list { $$ = cons($1, $2); }
means that the parser stack is proportional to the length of the list. No reduction takes place until the rightmost item is reached, and then a cascade of reductions takes place, building the list from right to left by a sequence of cons calls.
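Concretely, for a three-item list the shift/reduce sequence looks like this (a sketch):
/* input items: x y z, with  list : item | item list
   shift x, shift y, shift z -- the stack grows with the input
   reduce z -> list
   reduce y list -> list -- cons(y, ...)
   reduce x list -> list -- all the cons calls cascade at the end */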
You might not encounter this issue until you start parsing data, rather than code, and the data gets large.
If you modify this for left recursion, you can build the list in a constant amount of parser stack, because the actions "reduce as you go":
list : item { $$ = cons($1, nil); }
| list item { $$ = append($1, cons($2, nil)); }
(Now there is a performance problem with append searching for the tail of the list; for which there are various solutions, unrelated to the parsing.)
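One common solution (a sketch; it assumes cons and a hypothetical reverse helper) is to prepend in O(1) inside the rule and reverse the list once wherever the finished list is consumed:
list : item { $$ = cons($1, nil); }
| list item { $$ = cons($2, $1); } /* prepend: O(1) per item */
;
/* at the use site: $$ = reverse($1); -- one O(n) pass at the end */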
I wrote a parser that analyzes code and reduces it much as lex and yacc would (pretty much).
I am wondering about one aspect of the matter. If I have a set of rules such as the following:
unary: IDENTIFIER
| IDENTIFIER '(' expr_list ')'
The first rule, with just an IDENTIFIER, can be reduced as soon as an identifier is found. However, the second rule can only be reduced if the input also includes a valid list of expressions written between parentheses.
How is a parser expected to work in a case like this one?
If I reduce the first identifier immediately, I can keep the result and throw it away if I realize that the second rule does match. If the second rule does not match, then I can return the result of the early reduction.
This also means that both reduction functions are going to be called if the second rule applies.
Are we instead expected to hold off on the early reduction and apply it only if the second, longer rule turns out not to apply?
For those interested, I put a more complete version of my parser grammar in this answer: https://codereview.stackexchange.com/questions/41769/numeric-expression-parser-calculator/41786#41786
Bottom-up parsers (like bison and yacc) do not reduce until they reach the end of the production. They don't have to guess which reduction they will use until they need it. So it's really not an issue to have two productions with the same prefix. In this sense, the algorithm is radically different from a top-down algorithm, used in recursive descent parsing, for example.
In order for the fragment which you provide to be parseable by an LALR(1) parser-generator -- that is, a bottom-up parser with the ability to examine only one (1) token beyond the end of a production -- the grammar must be such that there is no place in which unary might be followed by (. As long as that is true, the fact that the parser can see a ( is sufficient to prevent the unit reduction unary: IDENTIFIER from happening in a context in which it should reduce with the other unary production.
(That's a bit of an oversimplification, but I don't think that it would be correct to reproduce a standard text on LALR parsing here on SO.)
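To illustrate that caveat (a hypothetical extension, not from the question): if the surrounding grammar did allow a '(' to follow unary, the parser could no longer tell, on seeing '(', whether to reduce unary: IDENTIFIER or to shift toward the longer production, giving a shift/reduce conflict:
expr : unary
| expr '(' expr_list ')' /* now '(' can follow a unary */
;
unary : IDENTIFIER
| IDENTIFIER '(' expr_list ')'
;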
Can I write a parser for Communicating Sequential Processes (CSP) in ANTLR? I think it uses left recursion, as in the statement
VMS = (coin → (choc → VMS))
The complete language specification can be found in CSPM: A Reference Manual.
So it is not an LL grammar. Am I right?
In general, even if you have a grammar with left recursion, you can refactor the grammar to remove it. So ANTLR is reasonably likely to be able to process your grammar. There's no a priori reason you can't write a CSP grammar for ANTLR.
Whether the one you have is suitable is another question.
If your quoted phrase is a grammar rule, it doesn't have left recursion. (If it is one, I don't understand the syntax of your grammar rules, especially why the parentheses [terminals?] would be unbalanced; that's pretty untraditional.)
So ANTLR should be able to process it, modulo converting it to ANTLR grammar rule syntax.
You didn't show the rest of the grammar, so one can't have an opinion about the rest of it.
The case above does not have left recursion. It would look something like the following. Note this is a simplified version; CSP is much more complicated. I'm just showing that it is possible.
assignment : PROCNAME EQ process
;
process : LPAREN EVENT ARROW process RPAREN
| PROCNAME
;
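For example, the quoted definition VMS = (coin → (choc → VMS)) would derive under this sketch roughly as follows:
/* assignment
   -> PROCNAME EQ process
   -> PROCNAME EQ LPAREN EVENT ARROW process RPAREN
   -> PROCNAME EQ LPAREN EVENT ARROW LPAREN EVENT ARROW PROCNAME RPAREN RPAREN */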
Besides, you can factor out left recursion with ANTLRWorks' 'Remove Left Recursion' function.
CSPs are definitely possible in ANTLR; check http://www.carnap.info/ for an example.