BNFC parser and Mathematica-like bracket syntax

I have been playing with the BNF Converter and tried to re-engineer parts of the Mathematica language. My grammar was already about 150 lines long and worked OK, until I noticed a very basic bug. Brackets [] in Mathematica are used for two different things:
expr[arg] to call a function
list[[spec]] to access elements of an expression, e.g. a List
Let's assume I want to create a parser for a language which consists only of identifiers, function calls, element access, and sequences of expressions as arguments. These forms would be valid:
f[]
f[a]
f[a,b,c]
f[[a]]
f[[a,b]]
f[a,f[b]]
f[[a,f[x]]]
A direct, but obviously wrong, input file for BNFC could look like this:
entrypoints Expr ;
TSymbol. Expr1 ::= Ident ;
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[[" [Sequence] "]]" ;
coercions Expr 1 ;
separator Sequence "," ;
SequenceExpr. Sequence ::= Expr ;
This BNF does not work for the last two examples in the first code block.
The problem seems to be located in the generated Yylex lexer file, which matches ] and ]] separately. This is wrong because, as can be seen in the last two examples, whether a closing ] or a closing ]] is meant depends on the context. So either you have to maintain a stack of brackets to ensure the right matching, or you leave that to the parser.
Can someone enlighten me whether it's possible to realize this with BNFC?
(Btw, other hints would be gratefully taken too)

Your problem is the token "]]". If the lexer collects this without having any memory of its past, it might be mistaken. So just don't do that! The parser by definition remembers its left context, so you can get it to do the bracket matching correctly.
I would define your grammar this way:
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[" "[" [Sequence] "]" "]" ;
with the lexer detecting only single "[" "]" as tokens.
An odd variant:
FunctionCall. Expr ::= Expr "[" [Sequence] "]" ;
Part. Expr ::= Expr "[[" [Sequence] "]" "]" ;
with the lexer also detecting "[[" as a token, since it can't be mistaken.
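To convince yourself that single "[" and "]" tokens plus parser context are enough, here is a small sketch outside of BNFC, written with Scala's parser combinators (the object and AST names Sym, Call and Part are my own inventions, not anything BNFC generates). The Part suffix is tried before the Call suffix, and a fold rebuilds the chain of postfix brackets:
import scala.util.parsing.combinator.RegexParsers

object MmaBrackets extends RegexParsers {
  sealed trait Expr
  case class Sym(name: String) extends Expr
  case class Call(head: Expr, args: List[Expr]) extends Expr
  case class Part(head: Expr, spec: List[Expr]) extends Expr

  def ident: Parser[Expr] = """[a-zA-Z][a-zA-Z0-9]*""".r ^^ (name => Sym(name))

  // A postfix chain: each step is either [[...]] (tried first) or [...].
  // Both are built from single "[" / "]" tokens; the parser's context decides.
  def expr: Parser[Expr] = ident ~ rep(part | call) ^^ {
    case head ~ suffixes => suffixes.foldLeft(head)((acc, f) => f(acc))
  }

  def call: Parser[Expr => Expr] =
    "[" ~> repsep(expr, ",") <~ "]" ^^ (args => (head: Expr) => Call(head, args))

  def part: Parser[Expr => Expr] =
    "[" ~ "[" ~> repsep(expr, ",") <~ "]" ~ "]" ^^ (spec => (head: Expr) => Part(head, spec))

  def main(args: Array[String]): Unit = {
    // prints Part(Sym(f),List(Sym(a), Call(Sym(f),List(Sym(x)))))
    println(parseAll(expr, "f[[a,f[x]]]").get)
    // prints Call(Sym(f),List())
    println(parseAll(expr, "f[]").get)
  }
}
Both f[[a,f[x]]] and f[] parse correctly with nothing but single-character bracket tokens, so no lexer state or bracket stack is needed.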

Related

When does the order of alternatives matter in ANTLR?

In the following example, the order matters in terms of precedence:
grammar Precedence;
root: expr EOF;
expr
: expr ('+'|'-') expr
| expr ('*' | '/') expr
| Atom
;
Atom: [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
For example, for the expression 1+1*2 the grammar above produces a parse tree that groups the input as (1+1)*2, which evaluates to 4.
Whereas if I swapped the first and second alternatives in expr, I would get a parse tree that groups it as 1+(1*2), which evaluates to 3.
What are the 'rules', then, for when the ordering of alternatives actually matters? Is this only relevant when one of the 'edges' of an alternative recursively calls expr? For example, something like ~ expr or expr '+' expr would matter, but something like func_call '(' expr ')' or Atom would not. In short, when is it important to order alternatives for precedence?
If ANTLR did not have the rule to give precedence to the first alternative that could match, then either of those trees would be a valid interpretation of your input (which means the grammar is technically ambiguous).
However, when there are two alternatives that could be used to match your input, ANTLR will use the first alternative to resolve the ambiguity, in this case establishing operator precedence. Typically you would put the multiplication/division alternative before the addition/subtraction one, since that matches the traditional order of operations:
grammar Precedence;
root: expr EOF;
expr
: expr ('*' | '/') expr
| expr ('+' | '-') expr
| Atom
;
Atom: [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
Most grammar authors will just put them in precedence order, but things like Atoms or parenthesized exprs won’t really care about the order since there’s only a single alternative that could be used.

Parens in BNF, EBNF

I could capture a parenthetical group using something like:
expr ::= "(" <something> ")"
However, sometimes it's useful to use multiple levels of nesting, and so it's (theoretically) possible to have more than one pair of parens as long as they match. For example:
>>> (1)+1
2
>>> (((((-1)))))+2
1
>>> ((2+2)+(1+1))
6
>>> (2+2))
SyntaxError: invalid syntax
Is there a way to specify a "matching-ness" in EBNF, or how is parenthetical-matching handled by most parsers?
In order to be able to match an arbitrary amount of anything (be it parentheses, operators, list items, etc.), you need recursion. (EBNF also features repetition operators that can be used instead of recursion in some cases, but not for constructs that need to be matched, like parentheses.)
For well-matched parentheses, the proper production is simply:
expr ::= "(" expr ")"
That's in addition to productions for other types of expressions, of course, so a complete grammar might look like this:
expr ::= "(" expr ")"
expr ::= NUMBER
expr ::= expr "+" expr
expr ::= expr "-" expr
expr ::= expr "*" expr
expr ::= expr "/" expr
Or for an unambiguous grammar:
expr ::= expr "+" multExpr
expr ::= expr "-" multExpr
multExpr ::= multExpr "*" primaryExpr
multExpr ::= multExpr "/" primaryExpr
primaryExpr ::= "(" expr ")"
primaryExpr ::= NUMBER
Also, how do you usually go about 'testing' that it is correct -- is there an online tool or something that can validate a syntax?
There are many parser generators that can accept some form of BNF- or EBNF-like notation and generate a parser from it. You can use one of those and then test whether the generated parser parses what you want it to. They're usually not available as online tools though. Also note that parser generators generally need the grammar to be unambiguous or you to add precedence declarations to disambiguate it.
Also, wouldn't the recursive production expr ::= "(" expr ")" loop forever?
No. The exact mechanics depend on the parsing algorithm used of course, but if the character at the current input position is not an opening parenthesis, then clearly this isn't the right production to use and another one needs to be applied (or a syntax error raised if none of the productions apply).
Left recursion can cause infinite recursion when using top-down parsing algorithms (though in case of parser generators it's more likely that the grammar will either be rejected or in some cases automatically rewritten than that you get an actual infinite recursion or loop), but non-left recursion doesn't cause that kind of problem with any algorithm.
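As a concrete way to test the grammar without an online tool, here is a minimal transcription of the unambiguous grammar above into Scala's parser combinators (the object and rule names are my own); the left recursion is rewritten as repetition plus a left fold so it also works top-down:
import scala.util.parsing.combinator.RegexParsers

object Arith extends RegexParsers {
  def number: Parser[Double] = """\d+""".r ^^ (_.toDouble)

  // primaryExpr ::= "(" expr ")" | NUMBER
  def primary: Parser[Double] = "(" ~> expr <~ ")" | number

  // multExpr, with the left recursion written as repetition
  def mult: Parser[Double] = primary ~ rep(("*" | "/") ~ primary) ^^ {
    case first ~ rest => rest.foldLeft(first) {
      case (acc, "*" ~ x) => acc * x
      case (acc, _ ~ x)   => acc / x
    }
  }

  // expr, likewise
  def expr: Parser[Double] = mult ~ rep(("+" | "-") ~ mult) ^^ {
    case first ~ rest => rest.foldLeft(first) {
      case (acc, "+" ~ x) => acc + x
      case (acc, _ ~ x)   => acc - x
    }
  }

  def main(args: Array[String]): Unit = {
    println(parseAll(expr, "((2+2)+(1+1))")) // succeeds with 6.0
    println(parseAll(expr, "(2+2))"))        // fails: the second ")" has no matching "("
  }
}
The recursive "(" expr ")" production does not loop here, because it only recurses after consuming an opening parenthesis, so each recursive call works on strictly shorter input and every level of nesting must be closed by a matching ")".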

Solving shift/reduce conflict in expression grammar

I am new to bison and I am trying to write a grammar for parsing expressions.
I am facing a shift/reduce conflict right now that I am not able to solve.
The grammar is the following:
%left "[" "("
%left "+"
%%
expression_list : expression_list "," expression
| expression
| /*empty*/
;
expression : "(" expression ")"
| STRING_LITERAL
| INTEGER_LITERAL
| DOUBLE_LITERAL
| expression "(" expression_list ")" /*function call*/
| expression "[" expression "]" /*index access*/
| expression "+" expression
;
This is my grammar, but I am facing a shift/reduce conflict between the two rules "(" expression ")" and expression "(" expression_list ")".
How can I resolve this conflict?
EDIT: I know I could solve this using precedence climbing, but I would like to not do so, because this is only a small part of the expression grammar, and the size of the expression grammar would explode using precedence climbing.
There is no shift/reduce conflict in the grammar as presented, so I suppose it is just an excerpt of the full grammar. In particular, there will be precisely the shift/reduce conflict mentioned if the real grammar includes:
%start program
%%
program: %empty
| program expression
In that case, you will run into an ambiguity because, given for example a(b), the parser cannot tell whether it is a single call expression or two consecutive expressions, first a single variable and second a parenthesized expression. To avoid this problem you need some token which separates expressions (statements).
There are some other issues:
expression_list : expression_list "," expression
| expression
| /*empty*/
;
That allows an expression list to be ,foo (as in f(,foo)), which is likely not desirable. Better would be
arguments: %empty
| expr_list
expr_list: expr
| expr_list ',' expr
And the precedences are probably backwards. Usually one wants postfix operators like call and index to bind more tightly than arithmetic operators, so their %left line should come last (bison gives tokens declared later a higher precedence). Otherwise a+b(7) parses as (a+b)(7), which is unconventional.
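To illustrate both points (an argument list that may be empty but may not start with a comma, and call/index binding tighter than "+"), here is a small sketch written with Scala's parser combinators rather than bison; all of the names are my own, and the left-recursive rules are expressed as postfix suffixes folded onto the expression to their left:
import scala.util.parsing.combinator.RegexParsers

object ExprSketch extends RegexParsers {
  sealed trait Expr
  case class Id(name: String) extends Expr
  case class Num(value: Int) extends Expr
  case class Call(callee: Expr, args: List[Expr]) extends Expr
  case class Index(target: Expr, index: Expr) extends Expr
  case class Add(lhs: Expr, rhs: Expr) extends Expr

  def primary: Parser[Expr] =
    """\d+""".r ^^ (s => Num(s.toInt)) |
    """[a-zA-Z_]\w*""".r ^^ (s => Id(s)) |
    "(" ~> expr <~ ")"

  // repsep means "empty, or a non-empty comma-separated list",
  // so f() is accepted while f(,foo) is rejected
  def callSuffix: Parser[Expr => Expr] =
    "(" ~> repsep(expr, ",") <~ ")" ^^ (args => (e: Expr) => Call(e, args))
  def indexSuffix: Parser[Expr => Expr] =
    "[" ~> expr <~ "]" ^^ (idx => (e: Expr) => Index(e, idx))

  // postfix call/index bind tighter than "+"
  def postfix: Parser[Expr] = primary ~ rep(callSuffix | indexSuffix) ^^ {
    case base ~ suffixes => suffixes.foldLeft(base)((e, f) => f(e))
  }

  def expr: Parser[Expr] = postfix ~ rep("+" ~> postfix) ^^ {
    case first ~ rest => rest.foldLeft(first)((l, r) => Add(l, r))
  }

  def main(args: Array[String]): Unit = {
    // a plus (b applied to 7), not (a+b)(7):
    // prints Add(Id(a),Call(Id(b),List(Num(7))))
    println(parseAll(expr, "a+b(7)").get)
  }
}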

What to use in ANTLR4 to resolve ambiguities in more complex cases (instead of syntactic predicates)?

In ANTLR v3, syntactic predicates could be used to resolve ambiguities, i.e., to explicitly tell ANTLR which alternative should be chosen. ANTLR4 seems to simply accept grammars with similar ambiguities, but during parsing it reports these ambiguities. It produces a parse tree despite the ambiguities (by choosing the first alternative, according to the documentation). But what can I do if I want it to choose some other alternative? In other words, how can I explicitly resolve ambiguities?
(For the simple case of the dangling else problem see: What to use in ANTLR4 to resolve ambiguities (instead of syntactic predicates)?)
A more complex example:
If I have a rule like this:
expr
: expr '[' expr? ']'
| ID expr
| '[' expr ']'
| ID
| INT
;
This will parse foo[4] as (expr foo (expr [ (expr 4) ])). But I may want to parse it as (expr (expr foo) [ (expr 4) ]). (I.e., always take the first alternative if possible. It is the first alternative, so according to the documentation it should have higher precedence. So why does it build this tree?)
If I understand correctly, I have 2 solutions:
Basically implement the syntactic predicate with a semantic predicate (however, I'm not sure how, in this case).
Restructure the grammar.
For example, replace expr with e:
e : expr | pe
;
expr
: expr '[' expr? ']'
| ID expr
| ID
| INT
;
pe : '[' expr ']'
;
This seems to work, although the grammar became more complex.
I may have misunderstood some things, but both of these solutions seem less elegant and more complicated than syntactic predicates. I do like the solution for the dangling else problem with the ?? operator, but I'm not sure how to use it in this case. Is that possible?
You may be able to resolve this by placing the ID alternative above ID expr. When left-recursion is eliminated, all of your alternatives which are not left recursive are parsed before your alternatives which are left recursive.
For your example, the first non-left-recursive alternative ID expr matches the entire expression, so there is nothing left to parse afterwards.
To get the tree (expr (expr foo) [ (expr 4) ]), you can use:
top : expr EOF;
expr : expr '[' expr? ']' | ID | INT ;

Non-left-recursive PEG grammar for an "expression"

It's either a simple identifier (like cow), something surrounded by brackets ((...)), something that looks like a method call (...(...)), or something that looks like a member access (thing.member):
def expr = identifier |
           "(" ~> expr <~ ")" |
           expr ~ ("(" ~> expr <~ ")") |
           expr ~ "." ~ identifier
It's given in Scala parser combinator syntax, but it should be pretty straightforward to understand. It's similar to how expressions end up looking in many programming languages (hence the name expr). However, as it stands, it is left-recursive and causes my nice PEG parser to explode.
I have not succeeded in factoring out the left recursion while still maintaining correctness for cases like (cow.head).moo(dog.run(fast)). How can I refactor this, or would I need to switch to some parser generator that can tolerate left-recursive grammars?
The trick is to have multiple rules where the first element of each rule is the next rule instead of being a recursive call to the same rule, and the rest of the rule is optional and repeating. For example the following would work for your example:
def expr = method_call
def method_call = member_access ~ ( "(" ~> expr <~ ")" ).*
def member_access = atomic_expression ~ ( "." ~> identifier).*
def atomic_expression = identifier |
"(" ~> expr <~ ")"
