Parsing optional recursive parser runs in infinite recursion

Parsing optional recursive parser runs in infinite recursion - parsing

I'm currently writing a parser for ECMAScript 5 (as a toy). The standard dictates how logical or expressions should be parsed:
<LogicalORExpression> :
<LogicalANDExpression>
<LogicalORExpression> || <LogicalANDExpression>
basicly this is equivalent to
<logicalOrExpression> = [<logicalOrExpression> ||] <LogicalAndExpression>
but how should I parse this without running into an infite loop? My current parser obviously does:
logicalOrExpression :: Parser LogicalOrExpression
logicalOrExpression = do
orExpr <- optional $ do
e <- logicalOrExpression
_ <- symbol "||"
return e
andExpr <- logicalAndExpression
case orExpr of
Just e -> return $ LogicalOrExpression (e, andExpr)
Nothing -> return $ AndExpression andExpr
Thanks

It's easiest to use megaparsec's built-in tools if you need to parse a grammar of operators with precedence and associativity.
expr = makeExprParser term table
where
term = literal <|> parenthesised expr
table = [[InfixL (string "&&" $> And)], [InfixL (string "||" $> Or)]]
For suitable definitions of literal and parenthesised, this'll parse a grammar of literal expressions composed with left-associative infix && and || operators, with && having greater precedence than ||. Megaparsec takes care of the tedious work of generating an LL(k) parser, and produces correct (left-associative, in this instance) parse trees.
Of course JavaScript's expression grammar is much larger than two operators. This example can be straightforwardly extended to include (eg) unary prefix operators like !, postfix function calls, etc. See the module's documentation.

That grammar looks equivalent to
<LogicalORExpression> :
<LogicalANDExpression>
<LogicalANDExpression> || <LogicalORExpression>
which becomes
<LogicalORExpression> :
<LogicalANDExpression> [|| <LogicalORExpression>]
In general, you need to rewrite the grammar in (roughly) LL(1) form, if possible.

The empty string matches this parser, which I believe, causes infinite recursion in Magaparsec. I think you are missing a "term" or "boolean" somewhere in your function. If I wrote True || False what would capture the first "True"

As I want to stay true to the spec in terms of the AST produced, I decided to switch to an Earley based parser and not parser combinators, as Earley's algorithm can handle left recursion.
If I would be OK with flattening the grammar, I would use the answer of Benjamin Hodgson

Related

Combining unary operators with different precedence

I was having some trouble with Bison creating an operator as such:
<- = identity postfix operator with a low precedence to force evaluation of what's on the left first, e.g. 1+2<-*3 (equivalent (1+2)*3) as well as -> which is a prefix operator which does the same thing but to the right.
I was not able to get the syntax to work properly and tested with Python using - not False, which resulted in a syntax error (in Python, - has a greater precedence than not). However, this is not a problem in C or C++, where - and !/not have the same precedence.
Of course, the difference in precedence has nothing to do with the relationship between the 2 operators, only a relationship with other operators that result in the relative precedences between them.
Why is chaining prefix or postfix operators with different precedences a problem when parsing and how can implement the <- and -> operators while still having higher-precedence operators like !, ++, NOT, etc.?
Obligatory Bison (this pattern is repeated for all operators, where copy has greater precedence than post_unary):
post_unary:
copy
| post_unary "++"
| post_unary "--"
| post_unary '!'
;
Chaining operators in this category, e.g. x ! -- ! works fine syntactically.

Ok, let me suggest a possible erroneous grammar based on your sketch:
low_postfix:
mid_infix
| low_postfix "<-"
mid_infix:
high_postfix
| mid_infix '+' high_postfix
high_postfix:
term
| high_postfix "++"
term:
ID
'(' expr ')'
It should be clear just looking at those productions that var <- ++ is not part of the language. The only things that can be used as an operand to ++ are terms and other applications of ++. var <- is neither of these things.
On the other hand, var ++ <- is fine, because the operand to <- can be a mid_infix which can be a high_postfix which is an application of the ++ operator.
If the intention were to allow both of those postfix sequences, then that grammar is incorrect.
A version of that cascade is present in the Python grammar (albeit using prefix operators) which is why not - False is OK, but - not False is a syntax error. I'm reluctant to call that a bug because it may have been intentional. (Really, neither of those expressions makes much sense.) We could disagree about the value of such an intention but not on SO, which prefers to avoid opinionated discussions.
Note that what we might call "strict precedence" in this grammar and the Python grammar is by no means restricted to combinations of unary operators. Here's another one which you have likely never tried:
$ python3 -c 'print(41 + not False)'
File "<string>", line 1
print(41 + not False)
^
SyntaxError: invalid syntax
So, how can we fix that?
On some level, it would be nice to be able to just write an unambiguous grammar which conveyed our intention. And it is certainly possible to write an unambiguous grammar, which would convey the intention to bison. But it's at least an open question as to whether it would convey anything to a human reader, because the massive clutter of multiple rules necessary in order to keep track of what is and is not an acceptable grouping would be pretty daunting.
On the other hand, it's dead simple to do with bison/yacc precedence declarations. We just list the operators in order, and the parser generator resolves all the ambiguities accordingly. [See Note 1 below]
Here's a similar grammar to the above, with precedence declarations. (I left the actions in place in case you want to play with it, although it's by no means a Reproducible Example; the infrastructure it relies upon is much bigger than the grammar itself, and of little use to anyone other than me. So you'll have to define the three functions and fill in some of the bison type declarations. Or just delete the AST functions and use your own.)
%left ','
%precedence "<-"
%precedence "->"
%left '+'
%left '*'
%precedence NEG
%right "++" '('
%%
expr: expr ',' expr { $$ = make_binop(OP_LIST, $1, $3); }
| "<-" expr { $$ = make_unop(OP_LARR, $2); }
| expr "->" { $$ = make_unop(OP_RARR, $1); }
| expr '+' expr { $$ = make_binop(OP_ADD, $1, $3); }
| expr '*' expr { $$ = make_binop(OP_MUL, $1, $3); }
| '-' expr %prec NEG { $$ = make_unop(OP_NEG, $2); }
| expr '(' expr ')' %prec '(' { $$ = make_binop(OP_CALL, $1, $3); }
| "++" expr { $$ = make_unop(OP_PREINC, $2); }
| expr "++" { $$ = make_unop(OP_POSTINC, $1); }
| VALUE { $$ = make_ident($1); }
| '(' expr ')' { $$ = $2; }
A couple of notes:
I used %prec NEG on the unary minus production in order to separate that production from the subtraction production. I also used a %prec declaration to modify the precedence of the call production (the default would be ')'), although in this particular case that's unnecessary. It is necessary to put '(' into the precedence list, though. ( is the lookahead symbol which is used in precedence comparisons.
For many unary operators, I used bison %precedence declaration in the precedence list, rather than %right or %left. Really, there is no such thing as associativity with unary operators, so I think that it's more self-documenting to use %precedence, which doesn't resolve conflicts involving reductions and shifts in the same precedence level. However, even though there is no such thing as associativity between unary operators, the nature of the precedence resolution algorithm is that you can put prefix operators and postfix operators in the same precedence level and choose whether the postfix or prefix operators have priority by using %right or %left, respectively. %right is almost always correct. I did that with ++, because I was a bit lazy by the time I got to that point.
This does "work" (I think). It certainly resolves all the conflicts; bison happily produces a parser without warnings. And the tests that I tried worked at least as I expected them to:
? a++->
=> [-> [++/post a]]
? a->++
=> [++/post [-> a]]
? 3*f(a)+2
=> [+ [* 3 [CALL f a]] 2]
? 3*f(a)->+2
=> [+ [-> [* 3 [CALL f a]]] 2]
? 2+<-f(a)*3
=> [+ 2 [<- [* [CALL f a] 3]]]
? 2+<-f(a)*3->
=> [+ 2 [<- [-> [* [CALL f a] 3]]]]
But there are some expressions where the operator precedence, while "correct", might not be easily explained to a novice user. For example, although the arrow operators look somewhat like parentheses, they don't group that way. Furthermore, the decision as to which of the two operators has higher precedence seems to me to be totally arbitrary (and indeed I might have done it differently from what you expected). Consider:
? <-2*f(a)->+3
=> [<- [+ [-> [* 2 [CALL f a]]] 3]]
? <-2+f(a)->*3
=> [<- [* [-> [+ 2 [CALL f a]]] 3]]
? 2+<-f(a)->*3
=> [+ 2 [<- [* [-> [CALL f a]] 3]]]
There's also something a bit odd about how the arrow operators override normal operator precedence, so that you can't just drop them into a formula without changing its meaning:
? 2+f(a)*3
=> [+ 2 [* [CALL f a] 3]]
? 2+f(a)->*3
=> [* [-> [+ 2 [CALL f a]]] 3]
If that's your intention, fine. It's your language.
Note that there are operator precedence problems which are not quite so easy to solve by just listing operators in precedence order. Sometimes it would be convenient for a binary operator to have different binding power on the left- and right-hand sides.
A classic (but perhaps controversial) case is the assignment operator, if it is an operator. Assignment must associate to the right (because parsing a = b = 0 as (a = b) = 0 would be ridiculous), and the usual expectation is that it greedily accepts as much to the right as possible. If assignment had consistent precedence, then it would also accept as much to the left as possible, which seems a bit strange, at least to me. If a = 2 + b = 7 is meaningful, my intuitions say that its meaning should be a = (2 + (b = 7)) [Note 2]. That would require differential precedence, which is a bit complicated but not unheard of. C solves this problem by restricting the left-hand side of the assignment operators to (syntactic) lvalues, which cannot be binary operator expressions. But in C++, it really does mean a = ((2 + b) = 7), which is semantically valid if 2 + b has been overloaded by a function which returns a reference.
Notes
Precedence declarations do not really add any power to the parser generator. The languages it can produce a parser for are exactly the same languages; it produces the same sort of parsing machine (a pushdown automaton); and it is at least theoretically possible to take that pushdown automaton and reverse engineer a grammar out of it. (In practice, the grammars produced by this process are usually monstrous. But they exist.)
All that the precedence declarations do is resolve parsing conflicts (typically in an ambiguous grammar) according to some user-supplied rules. So it's worth asking why it's so much simpler with precedence declarations than by writing an unambiguous grammar.
The simple hand-waving answer is that precedence rules only apply when there is a conflict. If the parser is in a state where only one action is possible, that's the action which remains, regardless of what the precedence rules might say. In a simple expression grammar, an infix operator followed by a prefix operator is not at all ambiguous: the prefix operator must be shifted, because there is no reduce action for a partial sequence ending with an infix operator.
But when we're writing a grammar, we have to specify explicitly what constructs are possible at each point in the grammar, which we usually do by defining a bunch of non-terminals, each corresponding to some parsing state. An unambiguous grammar for expressions already has split the expression non-terminal into a cascading series of non-terminals, one for each operator precedence value. But unary operators do not have the same binding power on both sides (since, as noted above, one side of the unary operator cannot take an operand). That means that a binary operator could well be able to accept a unary operator for one of its operands, and not be able to accept the same unary operator for its other operand. Which in turn means that we need to split all of our non-terminals again, corresponding to whether the non-terminal appears on the left or the right side of a binary operator.
That's a lot of work, and it's really easy to make a mistake. If you're lucky, the mistake will result in a parsing conflict; but equally it could result in the grammar not being able to recognise a particular construct which you would never think of trying, but which some irate language user feels is an absolute necessity. (Like 41 + not False)
It's possible that my intuitions have been permanently marked by learning APL at a very early age. In APL, all operators associate to the right, basically without any precedence differences.

Invalid grammar for a top-down recursive parser

I am trying to create grammar for a naive top-down recursive parser. As I understand the basic idea is to write a list of functions (top-down) that correspond to the productions in the grammar. Each function can call other functions (recursive).
The rules for a list include any number of numbers, but they must be separated by commas.
Here's an example of grammar I came up with:
LIST ::= NUM | LIST "," NUM
NUM ::= [0-9]+
Apparently this is incorrect, so my question is: why is this grammar not able to be parsed by a naive top-down recursive descent parser? What would be an example of a valid solution?

The issue is that for a LL(1) recursive decent parser such as this:
For any i and j (where j ≠ i) there is no symbol that can start both an instance of Wi and an instance of Wj.
This is because otherwise the parser will have errors knowing what path to take.
The correct solution can be obtained by left-factoring, it would be:
LIST ::= NUM REST
REST ::= "" | "," NUM
NUM ::= [0-9]+

How to differentiate identifier from function call in LL(1) parser

My graduate student and I are working on a training compiler, which we will use to teach students at the subject "Compilers and Interpreters".
The input program language is a limited subset of the Java language and the compiler implementation language is Java.
The grammar of the input language syntax is LL(1), because it is easier to be understood and implemented by students. We have the following general problem in the parser implementation. How to differentiate identifier from function call during the parsing?
For example we may have:
b = sum(10,5) //sum is a function call
or
b = a //a is an identifier
In both cases after the = symbol we have an identifier.
Is it possible to differentiate what kind of construct (a function call or an identifier) we have after the equality symbol =?
May be it is not possible in LL(1) parser, as we can look only 1 symbol ahead? If this is true, how do you recommend to define the function call in the grammar? Maybe some additional symbol in front of the function call is necessary, e.g. b = #sum(10,5)?
Do You think this symbol would be confusing for students? What kind of symbol for the function call would be proper?

You indeed can't have separate rules for function calls and variables in an LL(1) grammar because that would require additional lookahead. The common solution to this is to combine them into one rule that matches an identifier, optionally followed by an argument list:
primary_expression ::= ID ( "(" expression_list ")" )?
| ...
In a language where a function can be an arbitrary expression, not just an identifier, you'll want to treat it just like any other postfix operator:
postfix_expression ::= primary_expression postfix_operator*
postfix_operator ::= "++"
| "--"
| "[" expression "]"
| "(" expression_list ")"

ANTLR does not automatically do lookahead matching?

I'm currently writing a simple grammar that requires operator precedence and mixed associativities in one expression. An example expression would be a -> b ?> C ?> D -> e, which should be parsed as (a -> (((b ?> C) ?> D) -> e). That is, the ?> operator is a high-precedence left-associative operator wheras the -> operator is a lower-precedence right-associative operator.
I'm prototyping the grammar in ANTLR 3.5.1 (via ANTLRWorks 1.5.2) and find that it can't handle the following grammar:
prog : expr EOF;
expr : term '->' expr
| term;
term : ID rest;
rest : '?>' ID rest
| ;
It produces rule expr has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2 error.
The term and rest productions work fine in isolation when I tested it , so I assumed this happened because the parser is getting confused by expr. To get around that, I did the following refactor:
prog : expr EOF;
expr : term exprRest;
exprRest
: '->' expr
| ;
term : ID rest;
rest : DU ID rest
| ;
This works fine. However, because of this refactor I now need to check for empty exprRest nodes in the output parse tree, which is non-ideal. Is there a way to make ANTLR work around the ambiguity in the initial declaration of expr? I would of assumed that the generated parser would fully match term and then do a lookahead search for "->" and either continue parsing or return the lone term. What am I missing?

As stated, the problem is in this rule:
expr : term '->' expr
| term;
The problematic part is the term which is common to both alternatives.
LL(1) grammar doesn't allow this at all (unless term only matches zero tokens - but such rules would be pointless), because it cannot decide which alternative to use with only being able to see one token ahead (that's the 1 in LL(1)).
LL(k) grammar would only allow this if the term rule could match at most k - 1 tokens.
LL(*) grammar which ANTLR 3.5 uses does some tricks that allows it to handle rules that match any number of tokens (ANTLR author calls this "variable look-ahead").
However, one thing that these tricks cannot handle is if the rule is recursive, i.e. if it or any rules it invokes reference itself in any way (direct or indirect) - and that is exactly what your term rule does:
term : ID rest;
rest : '?>' ID rest
| ;
- the rule rest, referenced from term, recursively references itself. Thus, the error message
rule expr has non-LL(*) decision due to recursive rule invocations ...
The way to solve this limitation of LL grammars is called left-factoring:
expr : term
( '->' expr )?
;
What I did here is said "match term first" (since you want to match it in both alternatives, there's no point in deciding which one to match it in), then decide whether to match '->' expr (this can be decided just by looking at the very next token - if it's ->, use it - so this is even LL(1) decision).
This is very similar to what you came to as well, but the parse tree should look very much like you intended with the original grammar.

BNF grammar for left-associative operators

I have the following EBNF grammar for simple arithmetic expressions with left-associative operators:
expression:
term {+ term}
term:
factor {* factor}
factor:
number
( expression )
How can I convert this into a BNF grammar without changing the operator associativity? The following BNF grammar does not work for me, because now the operators have become right-associative:
expression:
term
term + expression
term:
factor
factor * term
factor:
number
( expression )
Wikipedia says:
Several solutions are:
rewrite the grammar to be left recursive, or
rewrite the grammar with more nonterminals to force the correct precedence/associativity, or
if using YACC or Bison, there are operator declarations, %left, %right and %nonassoc, which tell the parser generator which associativity to force.
But it does not say how to rewrite the grammar, and I don't use any parsing tools like YACC or Bison, just simple recursive descent. Is what I'm asking for even possible?

expression
: term
| expression + term;
Just that simple. You will, of course, need an LR parser of some description to recognize a left-recursive grammar. Or, if recursive descent, recognizing such grammars is possible, but not as simple as right-associative ones. You must roll a small recursive ascent parser to match such.
Expression ParseExpr() {
Expression term = ParseTerm();
while(next_token_is_plus()) {
consume_token();
Term next = ParseTerm();
term = PlusExpression(term, next);
}
return term;
}
This pseudocode should recognize a left-recursive grammar in that style.

What Puppy suggests can also be expressed by the following grammar:
expression: term opt_add
opt_add: '+' term opt_add
| /* empty */
term: factor opt_mul
opt_mul: '*' factor opt_mul
| /* emtpty */
factor: number
| '(' expression ')

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Parsing optional recursive parser runs in infinite recursion - parsing

That grammar looks equivalent to <LogicalORExpression> : <LogicalANDExpression> <LogicalANDExpression> || <LogicalORExpression> which becomes <LogicalORExpression> : <LogicalANDExpression> [|| <LogicalORExpression>] In general, you need to rewrite the grammar in (roughly) LL(1) form, if possible.

The empty string matches this parser, which I believe, causes infinite recursion in Magaparsec. I think you are missing a "term" or "boolean" somewhere in your function. If I wrote True || False what would capture the first "True"

As I want to stay true to the spec in terms of the AST produced, I decided to switch to an Earley based parser and not parser combinators, as Earley's algorithm can handle left recursion. If I would be OK with flattening the grammar, I would use the answer of Benjamin Hodgson

Related

Combining unary operators with different precedence

Invalid grammar for a top-down recursive parser

How to differentiate identifier from function call in LL(1) parser

ANTLR does not automatically do lookahead matching?

BNF grammar for left-associative operators

Categories

Resources