Basic parsing structure for nested structures?

I've come across the following bnf quite frequently for parsing a basic arithmetic expression:
S :== EXPRESSION
EXPRESSION :== TERM | TERM { [+,-] TERM }
TERM :== FACTOR | FACTOR { [*,/] FACTOR }
FACTOR :== number | '(' EXPRESSION ')'
-- or --
expression : term | term + term | term − term
term : factor | factor * factor | factor / factor
factor : number | ( expression ) | + factor | − factor
This would parse something like 2+3-4*(1+2). However, what is required to parse something like 1+1+1+1, as the above factor cannot also refer to the expression production itself?

Your first version is correct, but not BNF. It uses "extended" syntax, which includes the repetition operator { ... }, which means "zero or more instances of the pattern inside the braces". So TERM { [+,-] TERM } means TERM or TERM + TERM or TERM - TERM + TERM or TERM + TERM + TERM - TERM, etc.
Your second version is not correct (and does not correspond to the first version). You don't provide any citation for it, so I can't tell if you copied it incorrectly or your source was simply wrong. Here's a correct version:
expression: term | expression + term | expression − term
term: factor | term * factor | term / factor
factor: number | ( expression ) | + factor | − factor
I hope it becomes clear how that analyses 1 + 1 + 1 + 1 into a left-associated sequence of additions.
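To make the left-associativity concrete: in a recursive-descent parser, a left-recursive rule like expression: term | expression + term is usually realized as a loop that folds each new term into the running left operand. Here is a minimal Python sketch (the function names are mine, purely illustrative) for additions and subtractions over bare numbers:

# Fold each new term into the running left operand, so
# 1 + 1 + 1 + 1 is grouped as ((1 + 1) + 1) + 1.
def parse_expression(tokens):
    value, rest = parse_term(tokens)
    while rest and rest[0] in ("+", "-"):
        op, rest = rest[0], rest[1:]
        rhs, rest = parse_term(rest)
        value = value + rhs if op == "+" else value - rhs
    return value, rest

def parse_term(tokens):
    # Terms are bare numbers in this cut-down sketch.
    return int(tokens[0]), tokens[1:]

print(parse_expression("1 + 1 + 1 + 1".split()))  # (4, [])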

Related

How would I implement operator-precedence in my grammar?

I'm trying to make an expression parser and although it works, it does its calculations strictly left to right rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features, but right now I'm not sure how to make multiplication and division take precedence over addition and subtraction, etc. How should I modify my grammar to implement operator-precedence?
Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
Here ( ... )* means zero or more repetitions. The grammar is deterministic and unambiguous. Note that the flat repetition does not by itself fix the shape of the parse trees: a naive right-recursive implementation yields right-associative trees, while folding the operands from the left (as in the sketch below) gives the usual left-associative ones. Multiplication and division share one precedence level, and addition and subtraction share a lower one.
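For illustration, here is a sketch of a recursive-descent evaluator for that grammar in Python (the class and helper names are my own, not part of the answer). Folding the repeated operands from the left gives the conventional grouping, so 1 + 2 * 3 + 4 evaluates to 11:

# Sketch of a recursive-descent evaluator for:
#   exp  ::= add
#   add  ::= mul (("+" | "-") mul)*
#   mul  ::= term (("*" | "/") term)*
#   term ::= number | "(" exp ")"
import re

def tokenize(src):
    return re.findall(r"\d+|[()+\-*/]", src)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def exp(self):
        return self.add()

    def add(self):
        value = self.mul()
        while self.peek() in ("+", "-"):
            if self.next() == "+":
                value += self.mul()
            else:
                value -= self.mul()
        return value

    def mul(self):
        value = self.term()
        while self.peek() in ("*", "/"):
            if self.next() == "*":
                value *= self.term()
            else:
                value /= self.term()
        return value

    def term(self):
        tok = self.next()
        if tok == "(":
            value = self.exp()
            assert self.next() == ")", "expected ')'"
            return value
        return int(tok)

print(Parser(tokenize("1 + 2 * 3 + 4")).exp())  # 11

The same structure can return an AST instead of a value by building a node in each loop iteration rather than combining the numbers directly.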

Parsing + and * in boolean expressions by recursive descent

I am writing a recursive descent parser for Boolean expressions, for example:
(1 * 0)
(0 + ~1)
(0 * (1 + c))
Where 1 is 'True', 0 is 'False', + is 'or', * is 'and', ~ is 'not' and 'c' is just some variable name (it could be any single alphabetic letter). I plan on using parentheses rather than implementing some kind of order of operations.
My current parser can recognize the following form of expression
Expression ::= 1
| 0
| Character
| ~ Expression
But I am unsure as to how I would implement + and * on top of this. I am fairly certain from what I have read that the obvious implementation of
Expression ::= 1
| 0
| Character
| ( Expression + Expression )
| ( Expression * Expression )
would cause an infinite loop as it is 'left-recursive'. I am unsure how to change this to remove such infinite recursion.
With the parentheses in place, what you have there is not left-recursive. Left recursion is when a production can reach itself (directly or indirectly) with no tokens consumed in between. Such grammars do indeed cause infinite recursion in recursive descent parsers, but that can't happen with yours.
You do have the issue that the grammar as it stands is ambiguous: After a parenthesis, it isn't known whether the + or the * form is being parsed until the entire left-hand expression has been parsed.
One way of getting around that issue is by pulling up the common parts in a shared prefix/suffix production:
Expression ::= 1
| 0
| Character
| ParExpr
ParExpr ::= ( Expression ParOp Expression )
ParOp ::= +
| *
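For what it's worth, here is a minimal recursive-descent sketch of that refactored grammar in Python (the function names and token handling are mine, and the asker's ~ production is kept):

# Recognizer/evaluator sketch for:
#   Expression ::= 1 | 0 | Character | ~ Expression | ParExpr
#   ParExpr    ::= ( Expression ParOp Expression )
#   ParOp      ::= + | *
# Variables are looked up in an environment dict supplied by the caller.
def parse_expression(tokens, pos, env):
    tok = tokens[pos]
    if tok == "1":
        return True, pos + 1
    if tok == "0":
        return False, pos + 1
    if tok == "~":
        value, pos = parse_expression(tokens, pos + 1, env)
        return not value, pos
    if tok.isalpha():
        return env[tok], pos + 1
    if tok == "(":
        return parse_par_expr(tokens, pos, env)
    raise SyntaxError("unexpected token %r" % tok)

def parse_par_expr(tokens, pos, env):
    assert tokens[pos] == "("
    left, pos = parse_expression(tokens, pos + 1, env)
    op = tokens[pos]                      # ParOp: '+' is or, '*' is and
    if op not in ("+", "*"):
        raise SyntaxError("expected '+' or '*'")
    right, pos = parse_expression(tokens, pos + 1, env)
    if tokens[pos] != ")":
        raise SyntaxError("expected ')'")
    result = (left or right) if op == "+" else (left and right)
    return result, pos + 1

tokens = list("(0+~1)")
print(parse_expression(tokens, 0, {"c": True})[0])  # False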
Let me search that for you ...
https://en.wikipedia.org/wiki/Recursive_descent_parser
The leading LPAREN keeps this from being left-recursive.
If you want to generalize the expressions and have some operator precedence, follow the expression portion of the BNF in the Wikipedia article.
However, you have a syntax ambiguity in the grammar you've chosen. When you have operators of the same precedence, combine them into a non-terminal, such as
LogOp ::= + | *
Label similar operators to allow for expansion:
UnaryOp ::= ~
Now you can ... never mind, #500 just posted a good answer that covers my final point.

Is it possible to write a recursive-descent parser for this grammar?

From this question, a grammar for expressions involving binary operators (+ - * /) which disallows outer parentheses:
top_level : expression PLUS term
| expression MINUS term
| term TIMES factor
| term DIVIDE factor
| NUMBER
expression : expression PLUS term
| expression MINUS term
| term
term : term TIMES factor
| term DIVIDE factor
| factor
factor : NUMBER
| LPAREN expression RPAREN
This grammar is LALR(1). I have therefore been able to use PLY (a Python implementation of yacc) to create a bottom-up parser for the grammar.
For comparison purposes, I would now like to try building a top-down recursive-descent parser for the same language. I have transformed the grammar, removing left-recursion and applying left-factoring:
top_level : expression top_level1
| term top_level2
| NUMBER
top_level1 : PLUS term
| MINUS term
top_level2 : TIMES factor
| DIVIDE factor
expression : term expression1
expression1 : PLUS term expression1
| MINUS term expression1
| empty
term : factor term1
term1 : TIMES factor term1
| DIVIDE factor term1
| empty
factor : NUMBER
| LPAREN expression RPAREN
Without the top_level rules this grammar is LL(1), so writing a recursive-descent parser would be fairly straightforward. Unfortunately, including top_level, the grammar is not LL(1).
Is there an "LL" classification for this grammar (e.g. LL(k), LL(*))?
Is it possible to write a recursive-descent parser for this grammar? How would that be done? (Is backtracking required?)
Is it possible to simplify this grammar to ease the recursive-descent approach?
The grammar is not LL with finite lookahead, but the language is LL(1) because an LL(1) grammar exists. Pragmatically, a recursive descent parser is easy to write even without modifying the grammar.
Is there an "LL" classification for this grammar (e.g. LL(k), LL(*))?
If α is a derivation of expression, β of term and γ of factor, then top_level can derive both the sentence (α)+β and the sentence (α)*γ (but it cannot derive the sentence (α).) However, (α) is a possible derivation of both expression and term, so it is impossible to decide which production of top_level to use until the symbol following the ) is encountered. Since α can be of arbitrary length, there is no k for which a lookahead of k is sufficient to distinguish the two productions. Some people might call that LL(∞), but that doesn't seem to be a very useful grammatical category to me. (LL(*) is, afaik, the name of a parsing strategy invented by Terence Parr, and not an accepted name for a class of grammars.) I would simply say that the grammar is not LL(k) for any k.
Is it possible to write a recursive-descent parser for this grammar? How would that be done? (Is backtracking required?)
Sure. It's not even that difficult.
The first symbol must either be a NUMBER or a (. If it is a NUMBER, we predict (call) expression. If it is (, we consume it, call expression, consume the following ) (or declare an error, if the next symbol is not a close parenthesis), and then either call expression1 or term1 and then expression1, depending on what the next symbol is. Again, if the next symbol doesn't match the FIRST set of either expression1 or term1, we declare a syntax error. Note that the above strategy does not require the top_level* productions at all.
Since that will clearly work without backtracking, it can serve as the basis for how to write an LL(1) grammar.
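As an illustration only, here is a Python sketch of that dispatch written as a recognizer (the tokenizer and all names are mine, and it is not the only way to write it). It accepts 7 and (1+2)+3 but rejects a bare (1+2), and it never backtracks:

# Recognizer sketch for the "outer parentheses disallowed" language,
# following the dispatch described above.  "$" marks end of input.
import re

def tokenize(src):
    return re.findall(r"\d+|[()+\-*/]", src) + ["$"]

class Parser:
    def __init__(self, tokens):
        self.toks = tokens
        self.i = 0

    def peek(self):
        return self.toks[self.i]

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError("expected %r, got %r" % (tok, self.peek()))
        self.i += 1

    def top_level(self):
        if self.peek() == "(":
            # '(' expression ')' must be followed by an operator:
            # a bare parenthesized expression is not allowed at top level.
            self.expect("(")
            self.expression()
            self.expect(")")
            if self.peek() in ("*", "/"):
                self.term1()
                self.expression1()
            elif self.peek() in ("+", "-"):
                self.expression1()
            else:
                raise SyntaxError("outer parentheses are not allowed")
        else:
            self.expression()          # starts with NUMBER (or is an error)
        self.expect("$")

    def expression(self):
        self.term()
        self.expression1()

    def expression1(self):
        while self.peek() in ("+", "-"):
            self.i += 1
            self.term()

    def term(self):
        self.factor()
        self.term1()

    def term1(self):
        while self.peek() in ("*", "/"):
            self.i += 1
            self.factor()

    def factor(self):
        if self.peek() == "(":
            self.expect("(")
            self.expression()
            self.expect(")")
        elif self.peek().isdigit():
            self.i += 1
        else:
            raise SyntaxError("expected a number or '('")

for src in ("(1+2)+3", "7", "(1+2)"):
    try:
        Parser(tokenize(src)).top_level()
        print(src, "accepted")
    except SyntaxError as e:
        print(src, "rejected:", e)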
Is it possible to simplify this grammar to ease the recursive-descent approach?
I'm not sure if the following grammar is any simpler, but it does correspond to the recursive descent parser described above.
top_level : NUMBER optional_expression_or_term_1
| LPAREN expression RPAREN expression_or_term_1
optional_expression_or_term_1: empty
| expression_or_term_1
expression_or_term_1
: PLUS term expression1
| MINUS term expression1
| TIMES factor term1 expression1
| DIVIDE factor term1 expression1
expression : term expression1
expression1 : PLUS term expression1
| MINUS term expression1
| empty
term : factor term1
term1 : TIMES factor term1
| DIVIDE factor term1
| empty
factor : NUMBER
| LPAREN expression RPAREN
I'm left with two observations, both of which you are completely free to ignore (particularly the second one which is 100% opinion).
The first is that it seems odd to me to ban (1+2) but allow (((1)))+2, or ((1+2))+3. But no doubt you have your reasons. (Of course, you could easily ban the redundant double parentheses by replacing expression with top_level in the second production for factor.)
Second, it seems to me that the hoop-jumping involved in the LL(1) grammar in the third section is just one more reason to ask why there is any reason to use LL grammars. The LR(1) grammar is easier to read, and its correspondence with the language's syntactic structure is clearer. The logic of the generated recursive-descent parser may be easier to understand, but to me that seems secondary.
To make the grammar LL(1) you need to finish left-factoring top_level.
You stopped at:
top_level : expression top_level1
| term top_level2
| NUMBER
expression and term both have NUMBER in their FIRST sets, so they must first be substituted to left-factor:
top_level : NUMBER term1 expression1 top_level1
| NUMBER term1 top_level2
| NUMBER
| LPAREN expression RPAREN term1 expression1 top_level1
| LPAREN expression RPAREN term1 top_level2
which you can then left-factor to
top_level : NUMBER term1 top_level3
| LPAREN expression RPAREN term1 top_level4
top_level3 : expression1 top_level1
| top_level2
| empty
top_level4 : expression1 top_level1
| top_level2
Note that this is still not LL(1), as there are epsilon rules (term1, expression1) whose FIRST and FOLLOW sets overlap. So you need to factor those out too to make it LL(1).

Grammar of calculator in a finite field

I have a working calculator apart from one thing: unary operator '-'.
It has to be evaluated and dealt with in 2 different cases:
When it is applied to a further expression, as in -(3+3)
When it isn't: -3
For case 1, I want to get a postfix output 3 3 + -
For case 2, I want to get just correct value of this token in this field, so for example in Z10 it's 10-3 = 7.
My current idea:
E: ...
| '-' NUM %prec NEGATIVE { $$ = correct(-yylval); appendNumber($$); }
| '-' E %prec NEGATIVE { $$ = correct(P-$2); strcat(rpn, "-"); }
| NUM { appendNumber(yylval); $$ = correct(yylval); }
Where NUM is a token, but obviously the compiler says there is a reduce/reduce conflict, as E can also be a NUM in some cases. Although it works, I want to get rid of the compiler warning, and I ran out of ideas.
It has to be evaluated and dealt with in 2 different cases:
No it doesn't. The cases are not distinct.
Both - E and - NUM are incorrect. The correct grammar would be something like:
primary
: NUM
| '-' primary
| '+' primary /* for completeness */
| '(' expression ')'
;
Normally, this should be implemented as two rules (pseudocode, I don't know bison syntax):
This is the likely rule for the 'terminal' element of an expression. Naturally, a parenthesized expression leads to a recursion to the top rule:
Element => Number
| '(' Expression ')'
The unary minus (and also the unary plus!) are just one level up in the stack of productions (grammar rules):
Term => '-' Element
| '+' Element
| Element
Naturally, this can unbundle into all possible combinations such as '-' Number, '-' '(' Expression ')', likewise with '+' and without any unary operator at all.
Suppose we want addition / subtraction, and multiplication / division. Then the rest of the grammar would look like this:
Expression => Expression '+' MultiplicationExpr
| Expression '-' MultiplicationExpr
| MultiplicationExpr
MultiplicationExpr => MultiplicationExpr '*' Term
| MultiplicationExpr '/' Term
| Term
For the sake of completeness:
Terminals:
Number
Non-terminals:
Expression
Element
Term
MultiplicationExpr
Number, which is a terminal, shall match a regexp like this [0-9]+. In other words, it does not parse a minus sign — it's always a positive integer (or zero). Negative integers are calculated by matching a '-' Number sequence of tokens.
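To make that concrete, here is a Python sketch of the layered rules above as a recursive-descent evaluator over Z_P (it evaluates rather than emitting the postfix string the asker wants, and all names are mine). Division is omitted because it would need a modular inverse. In Z10, -3 evaluates to 7 and -(3+3) to 4:

# Sketch of the layered grammar as an evaluator over Z_P; all values
# are reduced mod P.  Names are illustrative only.
import re

P = 10  # the asker's Z10 example

def tokenize(src):
    return re.findall(r"\d+|[()+\-*]", src) + ["$"]

class Parser:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0

    def peek(self):
        return self.toks[self.i]

    def take(self):
        tok = self.toks[self.i]
        self.i += 1
        return tok

    # Expression => Expression ('+'|'-') MultiplicationExpr | MultiplicationExpr
    # (left recursion realized as a left-folding loop)
    def expression(self):
        value = self.mult_expr()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.mult_expr()
            value = (value + rhs) % P if op == "+" else (value - rhs) % P
        return value

    # MultiplicationExpr => MultiplicationExpr '*' Term | Term
    def mult_expr(self):
        value = self.term()
        while self.peek() == "*":
            self.take()
            value = (value * self.term()) % P
        return value

    # Term => ('-'|'+') Element | Element   (unary sign handled here)
    def term(self):
        if self.peek() in ("+", "-"):
            sign = self.take()
            value = self.element()
            return value if sign == "+" else (-value) % P
        return self.element()

    # Element => Number | '(' Expression ')'
    def element(self):
        tok = self.take()
        if tok == "(":
            value = self.expression()
            assert self.take() == ")", "expected ')'"
            return value
        return int(tok) % P

print(Parser(tokenize("-3")).expression())       # 7 in Z10
print(Parser(tokenize("-(3+3)")).expression())   # 4 in Z10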

Relation between grammar and operator associativity

Some compiler books / articles / papers talk about the design of a grammar and its relation to operator associativity. I'm a big fan of top-down, especially recursive descent, parsers, and so far most (if not all) compilers I've written use the following expression grammar:
Expr ::= Term { ( "+" | "-" ) Term }
Term ::= Factor { ( "*" | "/" ) Factor }
Factor ::= INTEGER | "(" Expr ")"
which is an EBNF representation of this BNF:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor = INTEGER | "(" Expr ")"
According to what I've read, some regard this grammar as "wrong" because it changes the associativity of those four operators (which should be left to right), as shown by the parse tree growing to the right instead of to the left. For a parser implemented through an attribute grammar this might be true, since an L-attributed value must be created first and then passed down to the child nodes. However, when implementing a plain recursive descent parser, it's up to me whether to construct a node first and then pass it to its children (top-down), or to let the child nodes be created first and then attach the returned values as children of this node, passed in its constructor (bottom-up). There must be something I'm missing here, because I don't agree with the claim that this grammar is "wrong", and the grammar has been used in many languages, especially Wirthian ones. Most (or all?) of the readings that make this claim promote LR parsing instead of LL.
I think the issue here is that a language has an abstract syntax which is just like:
E ::= E + E | E - E | E * E | E / E | Int | (E)
but this is actually implemented via a concrete syntax which is used to specify associativity and precedence. So, if you're writing a recursive descent parser, you're implicitly writing the concrete syntax into it as you go along, and that's fine, though it may be good to specify it exactly as a phrase-structured grammar as well!
There are a couple of issues with your grammar if it is to be a fully-fledged concrete grammar. First of all, you need to add productions to just 'go to the next level down', so relaxing your syntax a bit:
Expr ::= Term + Term | Term - Term | Term
Term ::= Factor * Factor | Factor / Factor | Factor
Factor ::= INTEGER | (Expr)
Otherwise there's no way to derive valid sentences starting from the start symbol (in this case Expr). For example, how would you derive '1 * 2' without those extra productions?
Expr -> Term
-> Factor * Factor
-> 1 * Factor
-> 1 * 2
We can see the other grammar handles this in a slightly different way:
Expr -> Term Expr'
-> Factor Term' Expr'
-> 1 Term' Expr'
-> 1 * Factor Term' Expr'
-> 1 * 2 Term' Expr'
-> 1 * 2 ε Expr'
-> 1 * 2 ε ε
= 1 * 2
but this achieves the same effect.
Your parser is actually non-associative. To see this ask how E + E + E would be parsed and find that it couldn't. Whichever + is consumed first, we get E on one side and E + E on the other, but then we're trying to parse E + E as a Term which is not possible. Equivalently, think about deriving that expression from the start symbol, again not possible.
Expr -> Term + Term
-> ? (can't get another + in here)
The other grammar is left-associative because an arbitrarily long string of E + E + ... + E can be derived.
So anyway, to sum up, you're right that when writing the RDP, you can implement whatever concrete version of the abstract syntax you like and you probably know a lot more about that than me. But there are these issues when trying to produce the grammar which describes your RDP precisely. Hope that helps!
To get associative trees, you really need to have the trees formed with the operator as the subtree root node, with children having similar roots.
Your implementation grammar:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor ::= INTEGER | "(" Expr ")"
must make that awkward; if you implement recursive descent on this, the Expr' routine has no access to the "left child" and so can't build the tree. You can always patch this up by passing around pieces (in this case, passing tree parts up the recursion) but that just seems awkward. You could have chosen this instead as a grammar:
Expr ::= Term ( ("+"|"-") Term )*;
Term ::= Factor ( ( "*" | "/" ) Factor )* ;
Factor ::= INTEGER | "(" Expr ")"
which is just as easy (easier?) to code recursive descent-wise, but now you can form the trees you need without trouble.
This doesn't really get you associativity; it just shapes the trees so that it could be allowed. Associativity means that the tree (+ (+ a b) c) means the same thing as (+ a (+ b c)); it's actually a semantic property (it certainly doesn't hold for "-", but the grammar as posed can't distinguish the cases).
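To illustrate the point about the EBNF form, here is a small Python sketch (names mine, purely illustrative): because the loop body always has the tree built so far as its left operand, it can wrap that tree on each iteration and produce left-leaning trees without passing pieces up the recursion:

# Building left-associated trees from the EBNF form
#   Expr ::= Term ( ("+"|"-") Term )*
# The loop keeps the tree built so far as its left operand, so the Expr
# routine, unlike Expr' in the other grammar, always has the "left child".
def parse_expr(tokens, pos=0):
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)       # wrap the tree built so far
    return node, pos

def parse_term(tokens, pos):
    # Terms are bare integers in this cut-down sketch.
    return int(tokens[pos]), pos + 1

print(parse_expr("1 + 2 - 3".split())[0])   # ('-', ('+', 1, 2), 3)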
We have a tool (the DMS Software Reengineering Toolkit) that includes parsers and term-rewriting (using source-to-source transformations) in which the associativity is explicitly expressed. We'd write your grammar:
Expr ::= Term ;
[Associative Commutative] Expr ::= Expr "+" Term ;
Expr ::= Expr "-" Term ;
Term ::= Factor ;
[Associative Commutative] Term ::= Term "*" Factor ;
Term ::= Term "/" Factor ;
Factor ::= INTEGER ;
Factor ::= "(" Expr ")" ;
The grammar seems longer and clumsier this way, but it in fact allows us to break out the special cases and mark them as needed. In particular, we can now distinguish operators that are associative from those that are not, and mark them accordingly. With that semantic marking, our tree-rewrite engine automatically accounts for associativity and commutativity. You can see a full example of such DMS rules being used to symbolically simplify high-school algebra using explicit rewrite rules over a typical expression grammar that don't have to account for such semantic properties. That is built into the rewrite engine.
