Circular Dependency in Parser Grammar

I am trying to build my first parser. Unfortunately I am not familiar with the theory of grammars, and now I wonder whether it is
- plainly forbidden,
- just a bad idea, or
- kind of OK
to have circular dependencies in my grammar. My intuition raises a yellow flag, but since I am not familiar with the theory of parsers, I am not sure.
Assume my lexer is well-defined and its tokens are what their names suggest; then I have the following grammar:
list_content : value
| list_content COMMA list_content
list : LBRACE list_content RBRACE
value : INT
| list
In there, value depends on list, list depends on list_content and list_content depends on value.
I have seen recursive definitions in grammars before, such as:
sum : NUMBER + NUMBER
| NUMBER + sum
| LBRACE sum RBRACE
However, I think my circular definition is different (that is: dirtier), because it is harder to survey and the defining circle spans multiple grammar rules. I am not sure whether my circular definition creates an ambiguity in my grammar, and I also fear it might make my code hard to debug.
So, I have two questions:
A) Should I restructure my grammar (and my lexer) or is it OK to live with this circular definition?
B) If I should restructure, how would I best do so?

A circular dependence like this is fine -- it's a recursive definition and is analogous to using recursion in a program. As such, the important thing to look at is how the base case is realized, since that is how the recursion terminates. If you don't have a base case (or it can't be reached without also triggering additional recursion), you have a problem -- an infinite loop that can never match any finite input.
In your case, the base case is the INT token -- since value can reduce to a single INT and list_content to a single value, everything is fine.
You do have one issue in your grammar in that the rule
list_content: list_content COMMA list_content
is ambiguous. What this means is that for any list with three or more elements (two or more COMMAs), there are multiple ways to parse it -- matching either the left COMMA (left recursive) or the right COMMA (right recursive) first. This will cause problems with most parser tools, which cannot deal with ambiguity, though in your case it is probably semantically irrelevant (you don't really care which way it is parsed, since you'll likely just concatenate the lists).
The fix for this is to rewrite the rule as a simple left- or right-recursive rule (but not both). Which one you want depends on the parser style you are using -- for an LL (top-down or recursive descent) parser, you want a right recursive rule. For an LR (bottom-up or shift/reduce) parser, you (generally) want a left recursive rule.
left recursive: list_content : value | list_content COMMA value ;
right recursive: list_content : value | value COMMA list_content ;
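To make the recursion concrete, here is a minimal recursive-descent sketch of the grammar above, using the right-recursive form of list_content (the tokenizer and all names are illustrative stand-ins for the asker's lexer, not part of the question):

import re

def tokenize(text):
    # Hypothetical stand-in lexer: digits become INT, punctuation as named.
    kinds = {'{': 'LBRACE', '}': 'RBRACE', ',': 'COMMA'}
    return [(kinds.get(t, 'INT'), t) for t in re.findall(r'\d+|[{},]', text)]

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f'expected {kind} at token {self.pos}')
        self.pos += 1
        return self.tokens[self.pos - 1][1]

    def value(self):           # value : INT | list -- INT is the base case
        if self.peek() == 'INT':
            return int(self.expect('INT'))
        return self.list()

    def list(self):            # list : LBRACE list_content RBRACE
        self.expect('LBRACE')
        items = self.list_content()
        self.expect('RBRACE')
        return items

    def list_content(self):    # list_content : value | value COMMA list_content
        items = [self.value()]
        if self.peek() == 'COMMA':
            self.expect('COMMA')
            items += self.list_content()
        return items

print(Parser(tokenize('{1,{2,3},4}')).value())   # [1, [2, 3], 4]

The mutual recursion value -> list -> list_content -> value mirrors the circular dependency in the grammar, and it terminates because every pass through the cycle consumes either an INT or an opening brace.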

Related

Are the following grammars LL(1)?

S -> Abc|aAcb
A -> b|c|ε
I think the first one is LL(1)
S -> aAS|b
A -> a|bSA
But the problem is the second one: there's no conflict problem, but I think it doesn't satisfy right-recursion.
I'm not sure about those problems.
The first grammar is not LL(1) because of several conflicts. For example, for the input bc, an LL parser needs 2 tokens of look-ahead to parse it:
enter in rule S
enter in rule A
recognize character b
load the next token c
exit rule A
another b is expected, but the current token is c
go back into A
move one token backwards, again to b
exit rule A without recognizing anything (because of the epsilon)
recognize the b character after the reference to rule A that was "jumped" without the use of any token
load the next token c
recognize c
success
You have a similar case in the second alternative of S for an input acb. The grammar is not ambiguous, because in the end there is only one possible syntax tree. It's not LL(1); it is in fact LL(2).
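A small brute-force sketch (my own illustration, not tied to any particular tool) makes the conflict visible: when the next token is b, both A -> b and A -> ε can begin a valid parse of S, so a single token of lookahead cannot decide between them:

# All parses of S -> Abc | aAcb, A -> b | c | epsilon, by brute force.
# Each function returns the set of positions reachable after its rule.

def parse_A(s, i):
    out = {i}                          # A -> epsilon
    if i < len(s) and s[i] in 'bc':
        out.add(i + 1)                 # A -> b  or  A -> c
    return out

def parse_S(s, i):
    out = set()
    for j in parse_A(s, i):            # S -> A b c
        if s[j:j + 2] == 'bc':
            out.add(j + 2)
    if i < len(s) and s[i] == 'a':     # S -> a A c b
        for j in parse_A(s, i + 1):
            if s[j:j + 2] == 'cb':
                out.add(j + 2)
    return out

def accepts(s):
    return len(s) in parse_S(s, 0)

print(accepts('bc'), accepts('bbc'))   # True True -- after seeing 'b', only
                                       # the next token tells A from epsilon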
The second grammar is deterministic -- there is only one way to parse any input that is valid according to the grammar. This means that it can be used for parsing by an LL(1) parser.
I have made a tool (Tunnel Grammar Studio) that detects the grammar conflicts for nondeterministic grammars and generates parsers. This grammar in ABNF (RFC 5234) like syntax is:
S = 'a' A S / 'b'
A = 'a' / 'b' S A
The right recursion by itself does not create ambiguities inside the grammar. One way to get a right-recursion ambiguity is to have some dangling element, as in this grammar:
S = 'c' / 'a' S 0*1 'b'
You can read it as: rule S recognizes the character c, or the character a followed by rule S itself, optionally (zero or one time) followed by the character b.
The grammar above has a right-recursion-related ambiguity because of the dangling b character. This means that for an input aacb there is more than one way to parse it: recognize the first a in S; enter into S again; recognize the second a; enter again into S and recognize c; exit S one time; then there are two choices:
case one) recognize the b character and exit S two times, or case two) first exit S one time and then recognize the b character. Both cases can be seen in the visual debugger of TGS.
This grammar is thus ambiguous (and so not LL(1)) because more than one syntax tree can be generated for some valid inputs. For this input there are only two possible trees, but for an input aaacb there are three trees, as there are for aaacbb, because there are 3 possible places to 'attach' the two b characters: two of these places will have a b, and one will remain empty. For the input aaacbbb there is of course only one possible syntax tree, but a grammar is defined to be ambiguous if there is at least one input for which there is more than one possible syntax tree.
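A short enumeration sketch (mine, not from the original answer) confirms those tree counts by reading S = 'c' / 'a' S 0*1 'b' as a CFG, i.e. keeping every alternative instead of letting the first match win:

# Enumerate all parse trees of S -> 'c' | 'a' S ('b')?

def parse_S(s, i):
    """Return all (tree, next_position) pairs for S starting at i."""
    results = []
    if s[i:i + 1] == 'c':
        results.append(('c', i + 1))
    if s[i:i + 1] == 'a':
        for tree, j in parse_S(s, i + 1):
            results.append((('a', tree), j))               # no dangling b
            if s[j:j + 1] == 'b':
                results.append((('a', tree, 'b'), j + 1))  # with dangling b
    return results

def count_trees(s):
    return sum(1 for _, j in parse_S(s, 0) if j == len(s))

for word in ['aacb', 'aaacb', 'aaacbb', 'aaacbbb']:
    print(word, count_trees(word))   # 2, 3, 3, 1 -- as described above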

ANTLR4 '+' operation

I'm using ANTLR4 for a class I'm taking right now and I seem to understand most of it, but I can't figure out what '+' does. All I can tell is that it's usually after a set of characters in brackets.
The plus is one of the BNF operators in ANTLR that allow you to determine the cardinality of an expression. There are 3 of them: plus, star (a.k.a. the Kleene operator) and question mark. The meaning is easily understandable:
Question mark stands for: zero or one
Plus stands for: one or more
Star stands for: zero or more
Such an operator applies to the expression that directly precedes it, e.g. ab+ (one a and one or more b), [AB]? (zero or one of either A or B) or a (b | c | d)* (a followed by zero or more occurrences of either b, c or d).
ANTLR4 also uses a special construct to denote ungreedy matches. The syntax is one of the BNF operators plus a question mark (+?, *?, ??). This is useful when you have an introducer match, then arbitrary content, and then a closing token. Take for instance a string (quote, any char, quote). With a greedy match, ANTLR4 would match multiple strings as one (up to the final quote). An ungreedy match, however, only matches until the first end token found (here, the quote char).
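For example, a string token is commonly written in an ANTLR4 lexer with the ungreedy star (the rule name here is my own choice):

STRING : '"' .*? '"' ;

With the greedy form '"' .* '"', an input containing two strings would be consumed as one token stretching to the last quote; the ungreedy form stops at the first closing quote.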
Side note: I don't know what ?? could be useful for, since it matches at most a single entry, so greediness doesn't really play a role here.
Actually, these operators are not part of traditional BNF, but rather of the Extended Backus-Naur Form. These are one of the reasons it's easier (or even possible) to document certain grammars in EBNF than in old-school BNF, which lacks many of these operators.

How do I convert a PEG parser into an ambiguous one?

As far as I understand, most languages are context-free with some exceptions. For instance, a * b may stand for type * pointer_declaration or multiplication in C++. Which one takes place depends on the context, the meaning of the first identifier. Another example is name production in VHDL
enum_literal ::= char_literal | identifier
physical_literal ::= [num] unit_identifier
func_call ::= func_identifier [parenthized_args]
array_indexing ::= arr_name (index_expr)
name ::= func_call | physical_literal | enum_literal | array_indexing
You see that the syntactic forms are different, but they can coincide when optional parts are omitted. Given a bare f, does it stand for a func_call, a physical_literal (like 1 meter with the optional amount 1 implied), or an enum_literal?
Talking to the Scala plugin designers, I learned that you build an AST so you can re-evaluate it when dependencies change; there is no need to re-parse the file if you have its AST. The AST is also useful for displaying the file contents. But the AST is invalidated if the grammar is context-sensitive (suppose f was a function defined in another file, but the user later requalified it into an enum literal or left it undefined). The AST changes in this case, and it changes whenever you change the dependencies. Another option, which I am asking you to evaluate and explain how to implement, is to build an ambiguous AST.
As far as I know, parser combinators are of the PEG kind. They hide the ambiguity by returning the first matched production, so f would match a function call because that is the first alternative in my grammar. I am asking for a combinator that, instead of stopping at the first success, proceeds to the next alternative. In the end, it would return a list of all matching alternatives -- it would return the ambiguity.
I do not know how you would display the ambiguous file-contents tree to the user, but it would eliminate the need to re-parse the dependent files. I would also be happy to know how modern language design solves this problem.
Once an ambiguous node is parsed and the ambiguous set of results is returned, I would like the parser to converge, because I want to continue parsing beyond the name and I do not want to parse to the end of the file after every ambiguity. The situation is complicated by cases like f(10), which can be a function call with a single argument, or a nullary function call returning an array that is indexed afterwards. So f(10) can match name in two ways: either as func_call directly, or recursively as arr_indexing -> name ~ (expr). So it won't always be an ambiguity between parallel rules like fcall | literal; some branches may be longer than one parser before re-converging, like fcall ~ (expr) | fcall.
How would you go about solving it? Is it possible to add ambiguating combinator to PEG?
First, you claim that "most languages are context-free with some exceptions"; this is not totally true. When designing a computer language, we mostly try to keep it as context-free as possible, since CFGs are the de-facto standard for that and it eases a lot of the work. This is not always feasible, though, and many languages depend on the semantic analysis phase to disambiguate any possible ambiguities.
Parser combinators do not usually use a formal model; PEGs, on the other hand, are a formalism for grammars, as are CFGs. In the last decade some people have decided to use PEGs over CFGs due to two facts: PEGs are, by design, unambiguous, and they can always be parsed in linear time. A parser combinator library might use PEGs as its underlying formalism, but might as well use CFGs, or even none.
PEGs are attractive for designing computer languages because we usually do not want to handle ambiguities, which is hard (or even impossible) to avoid when using CFGs. And, because of their determinism, they can be parsed in O(n) time by using dynamic programming (the so-called packrat parser). It's not simple to "add ambiguities to them", most importantly because the language they recognize depends on the fact that the choices are deterministic, which is used, for example, when checking lookahead. It isn't as simple as "just picking the first choice". For example, consider the PEG:
S = "a" S "a" / "aa"
which only parses sequences of N "a"s, where N is a power of 2. So it recognizes sequences of 2, 4, 8, 16, 32, 64, etc. letters "a". By adding ambiguity, as a CFG would have, you would instead recognize any even number of "a"s (2, 4, 6, 8, 10, etc.), which is a different language.
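A minimal sketch of that ordered choice (my own illustration; the rule itself is the one above) shows the power-of-2 behaviour:

# PEG S <- 'a' S 'a' / 'aa': the first alternative that matches wins,
# and the parser never reconsiders that decision.

def parse_S(s, i):
    """Return the position after matching S at i, or None on failure."""
    if s[i:i + 1] == 'a':                 # first alternative: 'a' S 'a'
        j = parse_S(s, i + 1)
        if j is not None and s[j:j + 1] == 'a':
            return j + 1
    if s[i:i + 2] == 'aa':                # second alternative: 'aa'
        return i + 2
    return None

def accepts(s):
    return parse_S(s, 0) == len(s)        # whole input must be consumed

print([n for n in range(1, 20) if accepts('a' * n)])   # [2, 4, 8, 16]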
To answer your question,
How would you go about solving it? Is it possible to add ambiguating combinator to PEG?
First I must say that this is probably not a good idea: if you wish to keep ambiguity in the AST, you should probably use a CFG parser instead.
One could, for example, make a parser for PEGs which is similar to a parser for Boolean grammars, but then our asymptotic parsing time would grow from O(n) to O(n³) in order to keep all alternatives alive while still recognizing the same language. And we would actually lose both good things about PEGs at once.
Another way would be to keep a packrat parser in memory and traverse its table to handle the semantics from the AST. This is not really a good idea either, since it would imply a large memory footprint.
Ideally, one should build an AST which already carries information about possible ambiguities, by changing the grammar structure. While this requires manual work and usually isn't simple, you wouldn't have to go back a phase to check the grammar again.
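For contrast, here is roughly what an "ambiguating" alternative combinator looks like once you give up PEG semantics entirely -- the classic list-of-successes style, which behaves like a backtracking CFG parser and so forfeits the linear-time guarantee (all names here are my own illustration):

# Parsers are functions: (input, position) -> list of (tree, next_position).
# alt() keeps every way each alternative matches, not just the first success.

def lit(ch):
    def p(s, i):
        return [(ch, i + 1)] if s[i:i + 1] == ch else []
    return p

def alt(*parsers):
    def p(s, i):
        return [r for q in parsers for r in q(s, i)]
    return p

def seq(*parsers):
    def p(s, i):
        results = [((), i)]
        for q in parsers:
            results = [(t + (u,), k) for t, j in results for u, k in q(s, j)]
        return results
    return p

# A bare 'f' matched by two rules at once: the ambiguity is returned, not hidden.
func_call = lit('f')       # stand-in for a real func_call rule
enum_literal = lit('f')    # stand-in for a real enum_literal rule
name = alt(func_call, enum_literal)
print(name('f', 0))        # [('f', 1), ('f', 1)] -- both parses survive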

Grammar: Precedence of grammar alternatives

This is a very basic question about grammar alternatives. If you have the following alternative:
Myalternative: 'a' | .;
Myalternative2: 'a' | 'b';
Would 'a' have priority over the '.' and over 'b'?
I understand that this may also depend on the behaviour of the parser generated from this syntax, but in purely theoretical grammar terms: could you imagine these rules being matched in parallel, i.e. tested against 'a' and '.' at the same time, selecting the one with the highest priority? Or are 'a' and '.' ambiguous due to the lack of precedence in grammars?
The answer depends primarily on the tool you are using, and what the semantics of that tool is. As written, this is not a context-free grammar in canonical form, and you'd need to produce that to get a theoretical answer, because only in that way can you clarify the intended semantics.
Since the question is tagged antlr, I'm going to guess that this is part of an Antlr lexical definition in which . is a wildcard character. In that case, 'a' | . means exactly the same thing as ..
Since MyAlternative matches everything that MyAlternative2 matches, and since MyAlternative comes first in the Antlr lexical definition, MyAlternative2 can never match anything. Any single character will be matched by MyAlternative (unless there is some other lexical rule which matches a longer sequence of input characters).
If you put the definition of MyAlternative2 first in the grammar file, then a or b would be matched as MyAlternative2, while any other character would be matched as MyAlternative.
The question of precedence within alternatives is meaningless. It doesn't matter whether MyAlternative considers the match of an a to be a match of a or a match of .. It is, in the end, a match of MyAlternative, and that symbol can only have one associated action.
Between lexical rules, there is a precedence relationship: The first one wins. (More accurately, as far as I know, Antlr obeys the usual convention that the longest match wins; between two rules which both match the same longest sequence, the first one in the grammar file wins.) That is not in any way influenced by alternative bars in the rules themselves.
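A small sketch (mine, approximating the convention described above rather than Antlr's actual implementation) of "longest match wins, first rule breaks ties":

import re

# Rules in grammar-file order; '.' plays the wildcard here.
rules = [('MyAlternative', re.compile(r'a|.')),
         ('MyAlternative2', re.compile(r'a|b'))]

def next_token(text, pos):
    best = None
    for name, rx in rules:                 # earlier rules win ties
        m = rx.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token('a', 0))   # ('MyAlternative', 'a')
print(next_token('b', 0))   # ('MyAlternative', 'b') -- MyAlternative2 never wins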

Is there such thing as a left-associative prefix operator or right-associative postfix operator?

This page says "Prefix operators are usually right-associative, and postfix operators left-associative" (emphasis mine).
Are there real examples of left-associative prefix operators, or right-associative postfix operators? If not, what would a hypothetical one look like, and how would it be parsed?
It's not particularly easy to make the concepts of "left-associative" and "right-associative" precise, since they don't directly correspond to any clear grammatical feature. Still, I'll try.
Despite the lack of math layout, I tried to insert an explanation of precedence relations here, and it's the best I can do, so I won't repeat it. The basic idea is that given an operator grammar (i.e., a grammar in which no production has two non-terminals without an intervening terminal), it is possible to define precedence relations ⋖, ≐, and ⋗ between grammar symbols, and then this relation can be extended to terminals.
Put simply, if a and b are two terminals, a ⋖ b holds if there is some production in which a is followed by a non-terminal which has a derivation (possibly not immediate) in which the first terminal is b. a ⋗ b holds if there is some production in which b follows a non-terminal which has a derivation in which the last terminal is a. And a ≐ b holds if there is some production in which a and b are either consecutive or are separated by a single non-terminal. The use of symbols which look like arithmetic comparisons is unfortunate, because none of the usual arithmetic laws apply. It is not necessary (in fact, it is rare) for a ≐ a to be true; a ≐ b does not imply b ≐ a and it may be the case that both (or neither) of a ⋖ b and a ⋗ b are true.
An operator grammar is an operator precedence grammar iff given any two terminals a and b, at most one of a ⋖ b, a ≐ b and a ⋗ b hold.
If a grammar is an operator-precedence grammar, it may be possible to find an assignment of integers to terminals which make the precedence relationships more or less correspond to integer comparisons. Precise correspondence is rarely possible, because of the rarity of a ≐ a. However, it is often possible to find two functions, f(t) and g(t) such that a ⋖ b is true if f(a) < g(b) and a ⋗ b is true if f(a) > g(b). (We don't worry about only if, because it may be the case that no relation holds between a and b, and often a ≐ b is handled with a different mechanism: indeed, it means something radically different.)
%left and %right (the yacc/bison/lemon/... declarations) construct the functions f and g. The way they do it is pretty simple. If OP (an operator) is "left-associative", that means that expr1 OP expr2 OP expr3 must be parsed as <expr1 OP expr2> OP expr3, in which case OP ⋗ OP (which you can see from the derivation). Similarly, if ROP were "right-associative", then expr1 ROP expr2 ROP expr3 must be parsed as expr1 ROP <expr2 ROP expr3>, in which case ROP ⋖ ROP.
Since f and g are separate functions, this is fine: a left-associative operator will have f(OP) > g(OP) while a right-associative operator will have f(ROP) < g(ROP). This can easily be implemented by using two consecutive integers for each precedence level and assigning them to f and g in turn if the operator is right-associative, and to g and f in turn if it's left-associative. (This procedure will guarantee that f(T) is never equal to g(T). In the usual expression grammar, the only ≐ relationships are between open and close bracket-type-symbols, and these are not usually ambiguous, so in a yacc-derivative grammar it's not necessary to assign them precedence values at all. In a Floyd parser, they would be marked as ≐.)
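A sketch of that construction (my own illustration of the procedure just described, not yacc's actual code):

# Build precedence functions f and g from declarations, lowest level first,
# spending two consecutive integers per precedence level.

def precedence_functions(decls):
    """decls: list of (assoc, ops) pairs, e.g. ('left', ['+', '-'])."""
    f, g = {}, {}
    level = 0
    for assoc, ops in decls:
        for op in ops:
            if assoc == 'left':       # want f(op) > g(op), i.e. OP ⋗ OP
                g[op], f[op] = level, level + 1
            else:                     # 'right': want f(op) < g(op), OP ⋖ OP
                f[op], g[op] = level, level + 1
        level += 2
    return f, g

f, g = precedence_functions([('left', ['+', '-']),
                             ('left', ['*', '/']),
                             ('right', ['**'])])
print(f['+'] > g['+'])     # True: left-associative, so reduce on a tie
print(f['**'] < g['**'])   # True: right-associative, so shift on a tie
print(f['+'] < g['*'])     # True: '*' binds tighter than '+'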
Now, what about prefix and postfix operators? Prefix operators are always found in a production of the form [1]:
non-terminal-1: PREFIX non-terminal-2;
There is no non-terminal preceding PREFIX so it is not possible for anything to be ⋗ PREFIX (because the definition of a ⋗ b requires that there be a non-terminal preceding b). So if PREFIX is associative at all, it must be right-associative. Similarly, postfix operators correspond to:
non-terminal-3: non-terminal-4 POSTFIX;
and thus POSTFIX, if it is associative at all, must be left-associative.
Operators may be either semantically or syntactically non-associative (in the sense that applying the operator to the result of an application of the same operator is undefined or ill-formed). For example, in C++, ++ ++ a is semantically incorrect (unless operator++() has been redefined for a in some way), but it is accepted by the grammar (in case operator++() has been redefined). On the other hand, new new T is not syntactically correct. So new is syntactically non-associative.
[1] In Floyd grammars, all non-terminals are coalesced into a single non-terminal type, usually expression. However, the definition of precedence-relations doesn't require this, so I've used different place-holders for the different non-terminal types.
There could be in principle. Consider for example the prefix unary plus and minus operators: suppose + is the identity operation and - negates a numeric value.
They are "usually" right-associative, meaning that +-1 is equivalent to +(-1), the result is minus one.
Suppose they were left-associative, then the expression +-1 would be equivalent to (+-)1.
The language would therefore have to give a meaning to the sub-expression +-. Languages "usually" don't need this to have a meaning and don't give it one, but you can probably imagine a functional language in which the result of applying the identity operator to the negation operator is an operator/function that has exactly the same effect as the negation operator. Then the result of the full expression would again be -1 for this example.
Indeed, if the result of juxtaposing functions/operators is defined to be a function/operator with the same effect as applying both in right-to-left order, then it never makes any difference to the result of the expression which way you associate them. Those are just two different ways of defining (f g)(x) == f(g(x)). If your language defines +- to mean something other than -, though, then the direction of associativity would matter (and I suspect the language would be very difficult to read for someone used to the "usual" languages...)
On the other hand, if the language doesn't allow juxtaposing operators/functions then prefix operators must be right-associative to allow the expression +-1. Disallowing juxtaposition is another way of saying that (+-) has no meaning.
I'm not aware of such a thing in a real language (e.g., one that's been used by at least a dozen people). I suspect the "usually" was merely because proving a negative is next to impossible, so it's easier to avoid arguments over trivia by not making an absolute statement.
As to how you'd theoretically do such a thing, there seem to be two possibilities. Given two prefix operators @ and # that you were going to treat as left associative, you could parse @#a so that the leftmost operator applies first, i.e. as equivalent to #(@(a)). At least to me, this seems like a truly dreadful idea -- theoretically possible, but a language nobody should wish on even their worst enemy.
The other possibility is that @#a would be parsed as (@#)a. In this case, we'd basically compose @ and # into a single operator, which would then be applied to a.
In most typical languages, this probably wouldn't be terribly interesting (it would have essentially the same meaning as if they were right associative). On the other hand, I can imagine a language oriented to multi-threaded programming that decreed that the application of a single operator is always atomic -- and when you compose two operators into a single one with the left-associative parse, the resulting fused operator is still a single, atomic operation, whereas just applying them successively wouldn't (necessarily) be.
Honestly, even that's kind of a stretch, but I can at least imagine it as a possibility.
I hate to shoot down a question that I myself asked, but having looked at the two other answers, would it be wrong to suggest that I've inadvertently asked a subjective question, and that in fact the interpretation of left-associative prefixes and right-associative postfixes is simply undefined?
Remembering that even notation as pervasive as expressions is built upon a handful of conventions, if there's an edge case that the conventions never took into account, then maybe, until some standards committee decides on a definition, it's better to simply pretend it doesn't exist.
I do not remember any left-associative prefix operators or right-associative postfix ones, but I can imagine that both could easily exist. They are not common because the natural way people think about operators is: the one closest to the operand applies first.
Easy example from C#/C++ languages:
~-3 is equal to 2, but
-~3 is equal to 4
This is because these prefix operators are right-associative: for ~-3, the - operator is applied first, and then the ~ operator is applied to its result, which makes the value of the whole expression equal to 2.
Hypothetically, if those operators were left-associative, then for ~-3 the left-most operator ~ would be applied first, and then - would be applied to its result, which would make the value of the whole expression equal to 4.
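A quick check of the two application orders (Python's ~ and unary - behave like the C#/C++ operators on integers here):

print(~(-3))   # 2: '-' applied first, then '~' -- the usual ~-3 reading
print(-(~3))   # 4: '~' applied first, then '-' -- the usual -~3 reading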
[EDIT] Answering Steve Jessop:
Steve said that the meaning of "left-associativity" is that +-1 is equivalent to (+-)1.
I do not agree with this and think it is totally wrong. To better understand left-associativity, consider the following example:
Suppose I have a hypothetical programming language with left-associative prefix operators:
@ - multiplies its operand by 3
# - adds 7 to its operand
Then the construction @#5 in my language would be equal to (5*3)+7 == 22
If my language were right-associative (as most usual languages are), I would have (5+7)*3 == 36
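A sketch of both application orders for this hypothetical language (the operators @ and # and all names here are illustrative):

ops = {'@': lambda x: x * 3,    # @ multiplies its operand by 3
       '#': lambda x: x + 7}    # # adds 7 to its operand

def eval_prefix(operators, operand, right_assoc=True):
    order = reversed(operators) if right_assoc else operators
    for op in order:
        operand = ops[op](operand)
    return operand

print(eval_prefix('@#', 5, right_assoc=False))  # 22: left-assoc, (5*3)+7
print(eval_prefix('@#', 5, right_assoc=True))   # 36: right-assoc, (5+7)*3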
Please let me know if you have any questions.
Hypothetical example: a language has a prefix operator @ and a postfix operator # with the same precedence. An expression @x# would be equal to (@x)# if both operators are left-associative, and to @(x#) if both are right-associative.
