Non-left-recursive PEG grammar for an "expression"

It's either a simple identifier (like cow), something surrounded by brackets ((...)), something that looks like a method call (...(...)), or something that looks like a member access (thing.member):
def expr = identifier |
"(" ~> expr <~ ")" |
expr ~ ("(" ~> expr <~ ")") |
expr ~ "." ~ identifier
It's given in Scala Parser Combinator syntax, but it should be pretty straightforward to understand. It's similar to how expressions end up looking in many programming languages (hence the name expr). However, as it stands, it is left-recursive and causes my nice PEG parser to explode.
I have not succeeded in factoring out the left recursion while still maintaining correctness for cases like (cow.head).moo(dog.run(fast)). How can I refactor this, or would I need to shift to some parser generator that can tolerate left-recursive grammars?

The trick is to have multiple rules where the first element of each rule is the next rule instead of being a recursive call to the same rule, and the rest of the rule is optional and repeating. For example the following would work for your example:
def expr = method_call
def method_call = member_access ~ ( "(" ~> expr <~ ")" ).*
def member_access = atomic_expression ~ ( "." ~> identifier).*
def atomic_expression = identifier |
"(" ~> expr <~ ")"

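To make the layering concrete, here is a minimal hand-written sketch of the same rule structure in Python (chosen only so that it runs without extra libraries; it is not the Scala combinator code). The tuple node shapes and the names tokenize, Parser and parse are illustrative assumptions. Each `.*` in the grammar becomes a while loop that folds what it reads into the node built so far, which gives left-nested trees without any left recursion.

```python
import re

# Identifiers plus the three punctuation tokens used by the grammar.
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|[().])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def identifier(self):
        token = self.peek()
        if token is None or token in ("(", ")", "."):
            raise SyntaxError(f"expected identifier, found {token!r}")
        self.pos += 1
        return token

    def expr(self):
        # expr = method_call
        return self.method_call()

    def method_call(self):
        # method_call = member_access ~ ( "(" ~> expr <~ ")" ).*
        node = self.member_access()
        while self.peek() == "(":
            self.expect("(")
            argument = self.expr()
            self.expect(")")
            node = ("call", node, argument)  # previous node becomes the callee
        return node

    def member_access(self):
        # member_access = atomic_expression ~ ( "." ~> identifier ).*
        node = self.atomic_expression()
        while self.peek() == ".":
            self.expect(".")
            node = ("member", node, self.identifier())
        return node

    def atomic_expression(self):
        # atomic_expression = identifier | "(" ~> expr <~ ")"
        if self.peek() == "(":
            self.expect("(")
            node = self.expr()
            self.expect(")")
            return node
        return ("id", self.identifier())


def parse(text):
    parser = Parser(text)
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

With this sketch, parse("(cow.head).moo(dog.run(fast))") yields a call node whose callee is the member access on (cow.head) and whose argument is the nested call dog.run(fast). One thing to be aware of: because calls sit in a layer above member access, this exact layering does not accept a member access after a call, such as a(b).c. Handling both suffixes in a single loop would be one way to allow that.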
Related

Parens in BNF, EBNF

I could capture a parenthetical group using something like:
expr ::= "(" <something> ")"
However, sometimes it's useful to use multiple levels of nesting, so it's (theoretically) possible to have more than one pair of parentheses, as long as they match. For example:
>>> (1)+1
2
>>> (((((-1)))))+2
1
>>> ((2+2)+(1+1))
6
>>> (2+2))
SyntaxError: invalid syntax
Is there a way to specify a "matching-ness" in EBNF, or how is parenthetical-matching handled by most parsers?
In order to be able to match an arbitrary amount of anything (be it parentheses, operators, list items etc.) you need recursion (EBNF also features repetition operators that can be used instead of recursion in some cases, but not for constructs that need to be matched like parentheses).
For well-matched parentheses, the proper production is simply:
expr ::= "(" expr ")"
That's in addition to productions for other types of expressions, of course, so a complete grammar might look like this:
expr ::= "(" expr ")"
expr ::= NUMBER
expr ::= expr "+" expr
expr ::= expr "-" expr
expr ::= expr "*" expr
expr ::= expr "/" expr
Or for an unambiguous grammar:
expr ::= expr "+" multExpr
expr ::= expr "-" multExpr
expr ::= multExpr
multExpr ::= multExpr "*" primaryExpr
multExpr ::= multExpr "/" primaryExpr
multExpr ::= primaryExpr
primaryExpr ::= "(" expr ")"
primaryExpr ::= NUMBER
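As a rough illustration of how a hand-written parser handles matching parentheses with this layered grammar, here is a small recursive-descent evaluator in Python. It is a sketch under some assumptions: NUMBER is taken to be an unsigned integer literal (so the unary minus in (((((-1)))))+2 is outside this grammar), the left-recursive productions are implemented as loops because a top-down parser cannot call itself in leftmost position, and the names tokenize and evaluate are made up for the example.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expr():
        # expr ::= expr ("+" | "-") multExpr | multExpr, written as a loop
        value = mult_expr()
        while peek() in ("+", "-"):
            op = advance()
            right = mult_expr()
            value = value + right if op == "+" else value - right
        return value

    def mult_expr():
        # multExpr ::= multExpr ("*" | "/") primaryExpr | primaryExpr
        value = primary_expr()
        while peek() in ("*", "/"):
            op = advance()
            right = primary_expr()
            value = value * right if op == "*" else value / right
        return value

    def primary_expr():
        # primaryExpr ::= "(" expr ")" | NUMBER
        token = peek()
        if token == "(":
            advance()
            value = expr()  # recursion is what matches nested parentheses
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        if token is not None and token.isdigit():
            advance()
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result
```

The recursion in primary_expr only happens after an opening parenthesis has been consumed, so every recursive call makes progress through the input. An unmatched closing parenthesis, as in (2+2)), is left over after the top-level expr returns and is reported as a syntax error.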
Also, how do you usually go about 'testing' that it is correct -- is there an online tool or something that can validate a syntax?
There are many parser generators that can accept some form of BNF- or EBNF-like notation and generate a parser from it. You can use one of those and then test whether the generated parser parses what you want it to. They're usually not available as online tools though. Also note that parser generators generally need the grammar to be unambiguous or you to add precedence declarations to disambiguate it.
Also, wouldn't expr ::= "(" expr ")" loop infinitely?
No. The exact mechanics depend on the parsing algorithm used of course, but if the character at the current input position is not an opening parenthesis, then clearly this isn't the right production to use and another one needs to be applied (or a syntax error raised if none of the productions apply).
Left recursion can cause infinite recursion when using top-down parsing algorithms (though in case of parser generators it's more likely that the grammar will either be rejected or in some cases automatically rewritten than that you get an actual infinite recursion or loop), but non-left recursion doesn't cause that kind of problem with any algorithm.

Solving shift/reduce conflict in expression grammar

I am new to bison and I am trying to write a grammar for parsing expressions.
I am facing a shift/reduce conflict right now that I am not able to solve.
The grammar is the following:
%left "[" "("
%left "+"
%%
expression_list : expression_list "," expression
| expression
| /*empty*/
;
expression : "(" expression ")"
| STRING_LITERAL
| INTEGER_LITERAL
| DOUBLE_LITERAL
| expression "(" expression_list ")" /*function call*/
| expression "[" expression "]" /*index access*/
| expression "+" expression
;
This is my grammar, but I am facing a shift/reduce conflict with those two rules "(" expression ")" and expression "(" expression_list ")".
How can I resolve this conflict?
EDIT: I know I could solve this using precedence climbing, but I would like to not do so, because this is only a small part of the expression grammar, and the size of the expression grammar would explode using precedence climbing.
There is no shift-reduce conflict in the grammar as presented, so I suppose that it is just an excerpt of the full grammar. In particular, there will be precisely the shift/reduce conflict mentioned if the real grammar includes:
%start program
%%
program: %empty
| program expression
In that case, you will run into an ambiguity because given, for example, a(b), the parser cannot tell whether it is a single call-expression or two consecutive expressions: first a single variable, and second a parenthesized expression. To avoid this problem, you need some token that separates expressions (statements).
There are some other issues:
expression_list : expression_list "," expression
| expression
| /*empty*/
;
That allows an expression list to be ,foo (as in f(,foo)), which is likely not desirable. Better would be
arguments: %empty
| expr_list
expr_list: expr
| expr_list ',' expr
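To see what the arguments / expr_list split accepts and rejects, here is a tiny Python sketch of the same shape. It is only an illustration: each expression is simplified to a single non-comma token, and the function name parse_arguments is made up for the example rather than anything bison generates.

```python
def parse_arguments(tokens):
    """Mirror of: arguments: %empty | expr_list ; expr_list: expr | expr_list ',' expr"""
    if not tokens:
        return []  # arguments: %empty
    args = []
    pos = 0
    while True:
        # Every element of expr_list must be an expression, never a comma.
        if pos >= len(tokens) or tokens[pos] == ",":
            raise SyntaxError("expected an expression")
        args.append(tokens[pos])
        pos += 1
        if pos == len(tokens):
            return args
        if tokens[pos] != ",":
            raise SyntaxError("expected ','")
        pos += 1  # consume the separator and require another expression
```

Because the empty case is handled once at the top level instead of being an alternative of the list itself, a leading comma as in f(,foo) is rejected, and so is a trailing comma.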
And the precedences are probably backwards. Usually one wants postfix operators like call and index to bind more tightly than arithmetic operators, so they should come at the end. Otherwise a+b(7) is (a+b)(7), which is unconventional.

ANTLR4 - Syntax error on '#' (alternative rule label)

I have made a grammar that will be used with ANTLR4 with the following definition for expressions:
// Expressions
Expr : Integer # Expr_Integer
| Float # Expr_Float
| Double # Expr_Double
| String # Expr_String
| Variable # Expr_Variable
| FuncCall # Expr_FuncCall
| Expr Op_Infix Expr # Expr_Infix
| Op_Prefix Expr # Expr_Prefix
| Expr Op_Postfix # Expr_Postfix
| Expr 'is' Id # Expr_Is
| 'this' # Expr_This
| Expr '?' Expr ':' Expr # Expr_Ternary
| '(' Expr ')' # Expr_Bracketed
;
I added the labels so that I could easily differentiate between the different expression types when analysing the generated syntax tree. However, ANTLR4 throws the following error for every single one of the above lines (excluding the one with the comment):
error(50): Ash.g4:88:19: syntax error: '#' came as a complete surprise to me while looking for lexer rule element
Line 88 is the final rule alternative ( '(' Expr ')' )
I have looked through the documentation and various online examples, and my syntax seems correct.
What could be causing the error to be thrown?
In ANTLR, rules beginning with an uppercase letter are lexer rules, and those beginning with a lowercase letter are parser rules. ANTLR relies on this distinction heavily to decide what you can and cannot do in a rule. Usually, the lexer is faster to process but less powerful than the parser.
In your case, Expr should definitely be a parser rule, as should basically every other rule you have referenced there. Alternative labels (# ...) are only valid on parser rule alternatives, which is why the '#' is rejected inside what ANTLR sees as a lexer rule. Changing it to expr should give the expected behavior.
As a rule of thumb, lexer rules should be used only when there is no context, that is, when it doesn't matter what is next to the generated token: things like numeric constants, string constants, identifiers and such.

BNFC parser and bracket Mathematica like syntax

I played a bit with the BNF Converter and tried to re-engineer parts of the Mathematica language. My BNF already had about 150 lines and worked OK, until I noticed a very basic bug. Brackets [] in Mathematica are used for two different things:
expr[arg] to call a function
list[[spec]] to access elements of an expression, e.g. a List
Let's assume I want to create the parser for a language which consists only of identifiers, function calls, element access and sequence of expressions as arguments. These forms would be valid
f[]
f[a]
f[a,b,c]
f[[a]]
f[[a,b]]
f[a,f[b]]
f[[a,f[x]]]
A direct, but obviously wrong input-file for BNFC could look like
entrypoints Expr ;
TSymbol. Expr1 ::= Ident ;
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[[" [Sequence] "]]" ;
coercions Expr 1 ;
separator Sequence "," ;
SequenceExpr. Sequence ::= Expr ;
This BNF does not work for the last two examples of the first code-block.
The problem seems to be located in the created Yylex lexer file, which matches ] and ]] separately. This is wrong because, as can be seen in the last two examples, whether it's a closing ] or ]] depends on the context. So either you have to create a stack of braces to ensure the right matching, or you leave that to the parser.
Can someone enlighten me whether it's possible to realize this with BNFC?
(Btw, other hints would be gratefully taken too)
Your problem is the token "]]". If the lexer collects this without having any memory of its past, it might be mistaken. So just don't do that! The parser by definition remembers its left context, so you can get it to do the bracket matching correctly.
I would define your grammar this way:
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[" "[" [Sequence] "]" "]" ;
with the lexer detecting only single "[" "]" as tokens.
An odd variant:
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[[" [Sequence] "]" "]" ;
with the lexer also detecting "[[" as a token, since it can't be mistaken.
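The first variant (single-bracket tokens only) can be sketched as a small hand-written parser in Python. This is not BNFC output, just an illustration of letting the parser do the matching. The node shapes and the names tokenize and parse are assumptions, and it relies on the fact that in this mini-language no expression starts with "[", so two opening brackets in a row can only begin a Part.

```python
import re

# The lexer only ever produces single "[" and "]" tokens.
TOKEN = re.compile(r"\s*([A-Za-z]\w*|[\[\],])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(token):
        nonlocal pos
        if peek() != token:
            raise SyntaxError(f"expected {token!r}, found {peek()!r}")
        pos += 1

    def sequence():
        # Possibly empty, comma-separated list of expressions.
        items = []
        if peek() == "]":
            return items
        items.append(expr())
        while peek() == ",":
            expect(",")
            items.append(expr())
        return items

    def expr():
        nonlocal pos
        token = peek()
        if token is None or token in ("[", "]", ","):
            raise SyntaxError(f"expected identifier, found {token!r}")
        pos += 1
        node = ("sym", token)
        while peek() == "[":
            expect("[")
            if peek() == "[":
                # Part: Expr "[" "[" [Sequence] "]" "]"
                expect("[")
                items = sequence()
                expect("]")
                expect("]")
                node = ("part", node, items)
            else:
                # FunctionCall: Expr "[" [Sequence] "]"
                items = sequence()
                expect("]")
                node = ("call", node, items)
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

In f[[a,f[x]]] the three closing brackets are consumed one at a time: the first closes the inner call f[x], and the remaining two close the Part. A lexer that greedily groups the first two into "]]" has no way of knowing that.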

Relation between grammar and operator associativity

Some compiler books / articles / papers talk about the design of a grammar in relation to the associativity of its operators. I'm a big fan of top-down parsers, especially recursive descent ones, and so far most (if not all) compilers I've written use the following expression grammar:
Expr ::= Term { ( "+" | "-" ) Term }
Term ::= Factor { ( "*" | "/" ) Factor }
Factor ::= INTEGER | "(" Expr ")"
which is an EBNF representation of this BNF:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor ::= INTEGER | "(" Expr ")"
According to what I have read, some regard this grammar as "wrong" because it changes the operators' associativity (left to right for those four operators), as shown by the parse tree growing to the right instead of to the left. For a parser implemented through an attribute grammar this might be true, since an L-attributed value has to be created first and then passed to child nodes. However, when implementing a normal recursive descent parser, it is up to me whether to construct this node first and then pass it to child nodes (top-down), or to let the child nodes be created first and then add the returned values as children of this node, passed in this node's constructor (bottom-up). There must be something I am missing here, because I don't agree with the statement that this grammar is "wrong", and this grammar has been used in many languages, especially Wirthian ones. Usually (or always?) the reading that calls it wrong promotes LR parsing over LL.
I think the issue here is that a language has an abstract syntax which is just like:
E ::= E + E | E - E | E * E | E / E | Int | (E)
but this is actually implemented via a concrete syntax which is used to specify associativity and precedence. So, if you're writing a recursive descent parser, you're implicitly writing the concrete syntax into it as you go along, and that's fine, though it may be good to specify it exactly as a phrase-structure grammar as well!
There are a couple of issues with your grammar if it is to be a fully-fledged concrete grammar. First of all, you need to add productions to just 'go to the next level down', so relaxing your syntax a bit:
Expr ::= Term + Term | Term - Term | Term
Term ::= Factor * Factor | Factor / Factor | Factor
Factor ::= INTEGER | (Expr)
Otherwise there's no way to derive valid sentences starting from the start symbol (in this case Expr). For example, how would you derive '1 * 2' without those extra productions?
Expr -> Term
-> Factor * Factor
-> 1 * Factor
-> 1 * 2
We can see the other grammar handles this in a slightly different way:
Expr -> Term Expr'
-> Factor Term' Expr'
-> 1 Term' Expr'
-> 1 * Factor Term' Expr'
-> 1 * 2 Term' Expr'
-> 1 * 2 ε Expr'
-> 1 * 2 ε ε
= 1 * 2
but this achieves the same effect.
The relaxed grammar above is actually non-associative. To see this, ask how E + E + E would be parsed and find that it cannot be. Whichever + is consumed first, we get E on one side and E + E on the other, but then we're trying to parse E + E as a Term, which is not possible. Equivalently, think about deriving that expression from the start symbol: again, not possible.
Expr -> Term + Term
-> ? (can't get another + in here)
The other grammar is left-associative because an arbitrarily long string of E + E + ... + E can be derived.
So anyway, to sum up, you're right that when writing the RDP, you can implement whatever concrete version of the abstract syntax you like and you probably know a lot more about that than me. But there are these issues when trying to produce the grammar which describes your RDP precisely. Hope that helps!
To get associative trees, you really need to have the trees formed with the operator as the subtree root node, with children having similar roots.
Your implementation grammar:
Expr ::= Term Expr'
Expr' ::= ( "+" | "-" ) Term Expr' | ε
Term ::= Factor Term'
Term' ::= ( "*" | "/" ) Factor Term' | ε
Factor ::= INTEGER | "(" Expr ")"
must make that awkward; if you implement recursive descent on this, the Expr' routine has no access to the "left child" and so can't build the tree. You can always patch this up by passing around pieces (in this case, passing tree parts up the recursion) but that just seems awkward. You could have chosen this instead as a grammar:
Expr ::= Term ( ("+"|"-") Term )*;
Term ::= Factor ( ( "*" | "/" ) Factor )* ;
Factor ::= INTEGER | "(" Expr ")"
which is just as easy (easier?) to code recursive descent-wise, but now you can form the trees you need without trouble.
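For comparison, the "passing pieces around" patch for the Expr' form can be sketched in Python as follows. This is only an illustration under stated assumptions: INTEGER is an unsigned integer literal, trees are plain tuples of the form (operator, left, right), and the names TreeBuilder and build_tree are invented for the example. The routine for Expr' receives the tree built so far as a parameter, so the result is left-nested even though the grammar rule is right-recursive.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class TreeBuilder:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expr(self):
        # Expr ::= Term Expr'
        return self.expr_rest(self.term())

    def expr_rest(self, left):
        # Expr' ::= ("+" | "-") Term Expr' | ε, with the left operand passed in
        if self.peek() in ("+", "-"):
            op = self.advance()
            return self.expr_rest((op, left, self.term()))
        return left

    def term(self):
        # Term ::= Factor Term'
        return self.term_rest(self.factor())

    def term_rest(self, left):
        # Term' ::= ("*" | "/") Factor Term' | ε, with the left operand passed in
        if self.peek() in ("*", "/"):
            op = self.advance()
            return self.term_rest((op, left, self.factor()))
        return left

    def factor(self):
        # Factor ::= INTEGER | "(" Expr ")"
        token = self.peek()
        if token == "(":
            self.advance()
            node = self.expr()
            if self.peek() != ")":
                raise SyntaxError("expected ')'")
            self.advance()
            return node
        if token is not None and token.isdigit():
            self.advance()
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def build_tree(text):
    builder = TreeBuilder(text)
    tree = builder.expr()
    if builder.peek() is not None:
        raise SyntaxError(f"unexpected token {builder.peek()!r}")
    return tree
```

Here build_tree("1-2-3") groups as (1-2)-3. The repetition form of the grammar amounts to the same thing with the tail recursion in expr_rest and term_rest turned into a while loop, which is why it tends to read more naturally in a recursive descent parser.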
This doesn't really get you associativity; it just shapes the trees so that it could be allowed. Associativity means that the tree (+ (+ a b) c) means the same thing as (+ a (+ b c)); it's actually a semantic property (it certainly doesn't hold for "-", but the grammar as posed can't distinguish).
We have a tool (the DMS Software Reengineering Toolkit) that includes parsers and term-rewriting (using source-to-source transformations) in which the associativity is explicitly expressed. We'd write your grammar:
Expr ::= Term ;
[Associative Commutative] Expr ::= Expr "+" Term ;
Expr ::= Expr "-" Term ;
Term ::= Factor ;
[Associative Commutative] Term ::= Term "*" Factor ;
Term ::= Term "/" Factor ;
Factor ::= INTEGER ;
Factor ::= "(" Expr ")" ;
The grammar seems longer and clumsier this way, but it in fact allows us to break out the special cases and mark them as needed. In particular, we can now distinguish operators that are associative from those that are not, and mark them accordingly. With that semantic marking, our tree-rewrite engine automatically accounts for associativity and commutativity. You can see a full example of such DMS rules being used to symbolically simplify high-school algebra using explicit rewrite rules over a typical expression grammar that don't have to account for such semantic properties. That is built into the rewrite engine.

Resources