Can this language be described by a non-ambiguous BNF grammar?

I'm getting back into language design/specification (via BNF/EBNF grammars) after 20+ years of leaving the topic alone (since my CS undergrad degree).
I only vaguely recall the various related terms in this space, like LR(1), LALR, etc. I've been attempting to refresh via some googling and reading, but it's coming slowly (probably because I didn't fully understand this stuff back in school). So I'm probably doing things quite roughly.
I decided I would describe a toy language with a grammar, and then try to analyze and possibly optimize it, as a part of my re-learning.
NOTE: all the snippets below can also be found in a gist here.
I started with an EBNF representation (as processed/validated by this tool):
Program := WhSp* (StmtSemi WhSp*)* StmtSemiOpt? WhSp*;
Stmt := AStmt | BStmt | CStmt | DStmt;
StmtSemi := Stmt? (WhSp* ";")+;
StmtSemiOpt := Stmt? (WhSp* ";")*;
WhSp := "_";
AStmt := "a";
BStmt := "b";
CStmt := "c";
DStmt := "d";
Here are some valid matches for this language (one match per line):
_____
;;;;;
_;_;_
a
__a__
a;
a;b;
a;_b;
_a;_b;_
_a_;_b_;_
__a__;;
_;_a;_b;c;;;__;;__d;___a___
And here are some values that wouldn't be in the language (again, one per line):
ab
a_b
a;_b_c
I then hand-converted this to the following BNF form (as processed/analyzed by this tool):
Program -> StmtSemi FinalStmtSemiOpt .
StmtSemi -> WhSp StmtSemiOpt | StmtSemiOpt .
FinalStmtSemiOpt -> StmtOpt SemiOpt WhSpOpt | WhSpOpt .
Stmt -> AStmt | BStmt | CStmt | DStmt .
StmtOpt -> Stmt | .
StmtSemiOpt -> StmtOpt Semi | StmtOpt Semi WhSpOpt StmtSemiOpt | .
Semi -> WhSpOpt ; | WhSpOpt ; Semi .
SemiOpt -> Semi | .
WhSp -> _ | _ WhSp .
WhSpOpt -> WhSp | .
AStmt -> a .
BStmt -> b .
CStmt -> c .
DStmt -> d .
That tool's analysis says my grammar is ambiguous. I guess that's not surprising or necessarily a bad outcome, but I know ambiguous grammars limit some kinds of analysis and automatic conversion or parser generation.
So... finally, here are my questions:
Is this a context-free grammar? What specifically makes it so, or would make it non-CFG?
[Edit: "Yes", see #rici's answer]
Can the language I'm describing be specified in a non-ambiguous grammar (BNF or EBNF)? Or is it just inherently ambiguous?
If it's inherently ambiguous, what specific aspects of the language make it so? In other words, what minimally would I have to change/remove to arrive at a language that had a non-ambiguous grammar?
Are there meaningful ways my BNF grammar form could be simplified and still describe the same language as the EBNF?
Does the BNF grammar currently have left-recursion, right-recursion, or both? I'm having trouble convincing myself of the answer. Could the BNF be re-arranged to avoid one or the other, and what would the impacts (performance, etc) of that be?
[Edit: I believe the updated BNF only has right-recursion, according to the analysis tool.]
Apologies if I'm fumbling around with incorrect terminology or asking imprecise questions. Thanks for any insight you might be able to offer.
[EDIT: Here's a new BNF that I believe is equivalent but isn't ambiguous -- thanks to @rici for confirming it was possible. I didn't use any particular algorithm/strategy for this; I just kept fiddling by trial and error.]
Leading -> WhSp Leading Program | Semi Leading Program | Program .
Program -> Stmt | Stmt WhSp | Stmt WhSpOptSemi Program | Stmt WhSpOptSemi WhSp Program | .
Stmt -> AStmt | BStmt | CStmt | DStmt .
WhSpOptSemi -> Semi | WhSp Semi | Semi WhSpOptSemi | WhSp Semi WhSpOptSemi .
WhSp -> _ | _ WhSp .
Semi -> ; .
AStmt -> a .
BStmt -> b .
CStmt -> c .
DStmt -> d .
So that seems to answer questions (2), (3), and partly (4) I guess.

It's not always easy (or even possible) to demonstrate that a grammar is ambiguous, but if there is a short ambiguous sentence, then it can be found with brute-force enumeration, which is what I believe that tool does. And the output is revealing; the shortest ambiguous sentence is the empty string.
So it remains only to figure out why the empty string can be parsed in two (or more) ways, and a quick look at the productions reveals that FinalStmtSemiOpt has two nullable productions, which means it has two ways of deriving the empty string. That's evident by inspection, if you believe that every production whose name ends in Opt does in fact describe an optional sequence, since FinalStmtSemiOpt has two productions, each of which consists only of XOpts. A little more effort is required to verify that the optional non-terminals are, in fact, optional, which is as it should be.
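Concretely, here are the two leftmost derivations of the empty string:

FinalStmtSemiOpt -> StmtOpt SemiOpt WhSpOpt -> SemiOpt WhSpOpt -> WhSpOpt -> ε
FinalStmtSemiOpt -> WhSpOpt -> ε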
So the solution is to rework FinalStmtSemiOpt without using two nullable right-hand sides. That shouldn't be too hard.
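For a grammar this small, the brute-force check is easy to sketch. Here is a rough Python version (an illustration, not necessarily what that tool actually does): it counts leftmost derivations per sentence up to a length bound, so any count above one witnesses ambiguity. It's shown here on a miniature grammar with the same two-nullable-productions shape as FinalStmtSemiOpt:

from collections import deque

def leftmost_derivation_counts(grammar, start, max_terminals):
    # grammar maps each non-terminal to a list of alternatives (tuples of
    # symbols); any symbol not in the dict is a terminal. Counts distinct
    # leftmost derivations per sentence, pruning once a sentential form
    # carries more terminals than the bound. (Caveat: an epsilon-cycle
    # such as X -> X would make this loop forever.)
    counts = {}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        for i, sym in enumerate(form):
            if sym in grammar:  # expand the leftmost non-terminal
                for alt in grammar[sym]:
                    new = form[:i] + alt + form[i + 1:]
                    if sum(s not in grammar for s in new) <= max_terminals:
                        queue.append(new)
                break
        else:  # all terminals: one complete leftmost derivation
            counts["".join(form)] = counts.get("".join(form), 0) + 1
    return counts

# Miniature analogue of FinalStmtSemiOpt: two nullable right-hand sides.
toy = {
    "S": [("A", "B"), ("B",)],
    "A": [("a",), ()],
    "B": [("b",), ()],
}
counts = leftmost_derivation_counts(toy, "S", 2)
print({s: n for s, n in counts.items() if n > 1})
# -> {'': 2, 'b': 2} (order may vary): the empty string, like "b",
#    derives in two distinct ways, witnessing the ambiguity.

The terminal-count prune works because terminals are never erased, so the search stays finite for any grammar without epsilon-cycles.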
Almost all the other questions you raise are answerable by inspection:
A grammar is context-free (by definition) iff every production has precisely one symbol on the left hand side. (If it had more than one symbol, then the extra symbols define a context in which the substitution is valid. If no production has a context, the grammar is context-free.)
A non-terminal is left-recursive if some production for the non-terminal either starts with the non-terminal itself (direct left-recursion), or starts with a sequence of nullable non-terminals followed by the non-terminal itself (indirect left-recursion). In other words, the non-terminal is or could be the left-most symbol in a production. Similarly, a non-terminal is right-recursive if some production ends with the non-terminal (again, possibly followed by nullable non-terminals).
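For example, in the question's first BNF, WhSp -> _ WhSp and StmtSemiOpt -> StmtOpt Semi WhSpOpt StmtSemiOpt are directly right-recursive: the non-terminal is the last symbol of its own production. A hypothetical WhSp -> WhSp _ would be directly left-recursive, and a hypothetical StmtSemiOpt -> StmtOpt StmtSemiOpt Semi would be indirectly left-recursive, because the nullable StmtOpt can vanish and leave StmtSemiOpt leftmost. By this test, neither of your BNFs contains any left-recursion.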
There is no algorithm which can tell you in general if a language is inherently ambiguous, but in the particular case that a language is regular, then it is definitely not inherently ambiguous. A language with a regular grammar is regular, but a non-regular grammar could still describe a regular language. Your grammar is not regular, but it can be made regular without too much work (mostly refactoring). A hint that it is probably regular is that there is no recursively nested syntax (eg. parentheses or statement blocks). Most useful grammars are recursively nested, and nesting in no way implies ambiguity. So that doesn't take you too far, but it happens to be the case here.
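As a concrete check, the EBNF transcribes almost symbol-for-symbol into a single regular expression. Here is a Python sketch (the transcription is mine, so treat it as illustrative); if it is faithful, regularity follows by construction:

import re

# Stmt is one of a-d, WhSp is "_", and (?:_*;) plays the role of WhSp* ";".
STMT = r"[abcd]?"
SEMI = r"(?:_*;)"
PROGRAM = re.compile(
    r"_*"                     # WhSp*
    rf"(?:{STMT}{SEMI}+_*)*"  # (StmtSemi WhSp*)*  with StmtSemi = Stmt? (WhSp* ";")+
    rf"(?:{STMT}{SEMI}*)?"    # StmtSemiOpt?       with StmtSemiOpt = Stmt? (WhSp* ";")*
    r"_*"                     # WhSp*
)

for good in ["_____", ";;;;;", "a;_b;", "__a__;;", "_;_a;_b;c;;;__;;__d;___a___"]:
    assert PROGRAM.fullmatch(good)
for bad in ["ab", "a_b", "a;_b_c"]:
    assert not PROGRAM.fullmatch(bad)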

Related

Interpretation variants of binary operators

I'm writing a grammar for a language that contains some binary operators that can also be used as unary operators (the argument goes on the right side of the operator), and for better error recovery I'd like them to be usable as nular (zero-argument) operators as well.
My simplified grammar looks like this:
start:
    code EOF
;

code:
    (binaryExpression SEMICOLON?)*
;

binaryExpression:
    binaryExpression BINARY_OPERATOR binaryExpression //TODO: check before primaryExpression
    | primaryExpression
;

primaryExpression:
    unaryExpression
    | nularExpression
;

unaryExpression:
    operator primaryExpression
    | BINARY_OPERATOR primaryExpression
;

nularExpression:
    operator
    | BINARY_OPERATOR
    | NUMBER
    | STRING
;

operator:
    ID
;
BINARY_OPERATOR is just a set of defined keywords that are fed into the parser.
My problem is that Antlr prefers to use BINARY_OPERATORs as unary expressions (or nular ones if there is no other choice) instead of trying to use them in a binary expression, as I need it to.
For example, consider the input for varDec from one to twelve do something, where from, to and do are binary operators. The resulting parse tree (not reproduced here) shows that the parser interprets all of the binary operators as unary ones.
What I'm trying to achieve is the following: try to match each BINARY_OPERATOR in a binary expression; only if that is not possible, try to match it as a unary expression; and if that isn't possible either, it may be considered a nular expression (which can only be the case if the BINARY_OPERATOR is the only content of an expression).
Does anyone have an idea about how to achieve the desired behaviour?
A fairly standard approach is to use a single recursive rule to establish the acceptable expression syntax. ANTLR is left-associative by default, so op expr meets the stated unary-op requirement of "argument to the right side of the operator". See pg. 70 of TDAR (The Definitive ANTLR 4 Reference) for a further discussion of associativity.
Ex1: -y+x -> binaryOp{unaryOp{-, literal}, +, literal}
Ex2: -y+-x -> binaryOp{unaryOp{-, literal}, +, unaryOp{-, literal}}
expr
    : LPAREN expr RPAREN
    | expr op expr      #binaryOp
    //| op expr         #unaryOp // standard formulation
    | op literal        #unaryOp // limited formulation
    | op                #errorOp
    | literal
    ;

op : .... ;

literal
    : KEYWORD
    | ID
    | NUMBER
    | STRING
    ;
You allow operators to act like operands ("nularExpression") and operands to act like operators ("operator: ID"). Between those two curious decisions, your grammar is 100% ambiguous, and there is never any need for a binary operator to be parsed. I don't know much about Antlr, but it surprises me that it doesn't warn you that your grammar is completely ambiguous.
Antlr has mechanisms to handle and recover from errors. You would be much better off using them than writing a deliberately ambiguous grammar which makes erroneous constructs part of the accepted grammar. (As I said, I'm not an Antlr expert, but there are Antlr experts who pass by here pretty regularly; if you ask a specific question about error recovery, I'm sure you'll get a good answer. You might also want to search this site for questions and answers about Antlr error recovery.)
I think what I'm going to write down now is what @GRosenberg meant with his answer. However, as it took me a while to fully understand it, I will provide a concrete solution for my problem in case someone else stumbles across this question and is searching for an answer:
The trick was to remove the option to use a BINARY_OPERATOR inside the unaryExpression rule because this always got preferred. Instead what I really wanted was to specify that if there was no left side argument it should be okay to use a BINARY_OPERATOR in a unary way. And that's the way I had to specify it:
binaryExpression:
    binaryExpression BINARY_OPERATOR binaryExpression
    | BINARY_OPERATOR primaryExpression
    | primaryExpression
;
That way, the unary syntax only becomes possible if there is nothing on the left side of a BINARY_OPERATOR; in every other case the binary syntax has to be used.
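Assuming for is lexed as an ordinary ID (so for varDec matches operator primaryExpression) and given ANTLR's default left associativity, the earlier example should now parse roughly as

(((for varDec) from one) to twelve) do something

with each of from, to and do getting its own binaryExpression level.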

ANTLR 4 Parser Grammar

How can I improve my parser grammar so that, instead of creating an AST that contains a couple of decFunc subtrees for my test code, it creates only one, with sum becoming the second root? I have tried to solve this problem in multiple different ways, but I always get a left-recursion error.
This is my testing code :
f :: [Int] -> [Int] -> [Int]
f x y = zipWith (sum) x y
sum :: [Int] -> [Int]
sum a = foldr(+) a
Here is an image of the parse tree showing the two decFunc roots: http://postimg.org/image/w5goph9b7/
This is my grammar:
prog : stat+ ;

stat : decFunc | impFunc ;

decFunc : ID '::' formalType ( ARROW formalType )* NL impFunc
        ;

anotherFunc : ID+ ;

formalType : 'Int' | '[' formalType ']' ;

impFunc : ID+ '=' hr NL
        ;

hr : 'map' '(' ID* ')' ID*
   | 'zipWith' '(' ('*' |'/' |'+' |'-') ')' ID+
   | 'zipWith' '(' anotherFunc ')' ID+
   | 'foldr' '(' ('*' |'/' |'+' |'-') ')' ID+
   | hr op=('*'| '/' | '.&.' | 'xor' ) hr
   | 'shiftL' hr hr
   | 'shiftR' hr hr
   | hr op=('+'| '-') hr
   | DIGIT
   | '(' hr ')'
   | ID '(' ID* ')'
   | ID
   ;
Your test input contains two instances of content that will match the decFunc rule. The generated parse-tree shows exactly that: two sub-trees, each having a decFunc as the root.
Antlr v4 will not produce a true AST where f and sum are the roots of separate sub-trees.
Is there anything I can do with the grammar to make both f and sum roots? – Jonny Magnam
Not directly in an Antlr v4 grammar. You could:
switch to Antlr v3, or another parser tool, and define the generated AST as you wish.
walk the Antlr v4 parse-tree and create a separate AST of your desired form (a minimal sketch of this option appears below).
just use the parse-tree directly, with the realization that it is informationally equivalent to a classically defined AST and that the implementation provides a number of practical benefits.
Specifically, the standard academic AST is mutable, meaning that every visitor (or every one after the first) is custom, not generated, and that any change in the underlying grammar, or in an interim structure of the AST, will require reconsideration of, and likely changes to, every subsequent visitor and its implemented logic.
The Antlr v4 parse-tree is essentially immutable, allowing decorations to be accumulated against tree nodes without loss of relational integrity. Visitors all use a common base structure, greatly reducing brittleness due to grammar changes and effects of prior executed visitors. As a practical matter, tree-walks are easily constructed, fast, and mutually independent except where expressly desired. They can achieve a greater separation of concerns in design and easier code maintenance in practice.
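For the second option, a minimal sketch using the ANTLR Python target (hypothetical names: it assumes the grammar file is Hask.g4, generated with the -visitor flag; visitProg/visitDecFunc follow ANTLR's generated-name convention):

# antlr4 -Dlanguage=Python3 -visitor Hask.g4
from HaskVisitor import HaskVisitor

class AstBuilder(HaskVisitor):
    # Collapse the parse tree into a plain AST in which each declared
    # function (f, sum, ...) becomes the root of its own subtree.
    def visitProg(self, ctx):
        return [self.visit(s) for s in ctx.stat()]

    def visitStat(self, ctx):
        return self.visit(ctx.getChild(0))  # decFunc or impFunc

    def visitDecFunc(self, ctx):
        return ("func", ctx.ID().getText(), self.visit(ctx.impFunc()))

    def visitImpFunc(self, ctx):
        return ("impl", [t.getText() for t in ctx.ID()])  # hr elided

AstBuilder().visit(tree) then yields one ('func', ...) node per declaration, which is the shape the question asks for.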
Choose the right tool for the whole job, in whatever way you define it.

Overloading multiplication using menhir and OCaml

I have written a lexer and parser to analyze linear algebra statements. Each statement consists of one or more expressions followed by one or more declarations. I am using menhir and OCaml to write the lexer and parser.
For example:
Ax = b, where A is invertible.
This should be read as A * x = b, (A, invertible)
In an expression, every id is a single uppercase or lowercase character. I would like to overload the multiplication operator so that the user does not have to type in the '*' symbol.
However, since the lexer also needs to be able to read strings (such as "invertible" in this case), the "Ax" portion of the expression is sent over to the parser as a string. This causes a parser error since no strings should be encountered in the expression portion of the statement.
Here is the basic idea of the grammar
stmt :=
| expr "."
| decl "."
| expr "," decl "."
expr :=
| term
| unop expr
| expr binop expr
term :=
| <int> num
| <char> id
| "(" expr ")"
decl :=
| id "is" kinds
kinds :=
| <string> kind
| kind "and" kinds
Is there some way to separate the individual characters and tell the parser that they should be treated as multiplication? Is there a way to change the lexer so that it is smart enough to know that all character clusters before a comma are ids and all clusters after should be treated as strings?
It seems to me you have two problems:
You want your lexer to treat sequences of characters differently in different places.
You want multiplication to be indicated by adjacent expressions (no operator in between).
The first problem I would tackle in the lexer.
One question is why you say you need to use strings. This implies that there is a completely open-ended set of things you can say. It might be true, but if you can limit yourself to a smallish number, you can use keywords rather than strings. E.g., invertible would be a keyword.
If you really want to allow any string at all in such places, it's definitely still possible to hack a lexer so that it maintains a state describing what it has seen, and looks ahead to see what's coming. If you're not required to adhere to a pre-defined grammar, you could adjust your grammar to make this easier. (E.g., you could use commas for only one purpose.)
For the second problem, I'd say you need to add adjacency to your grammar. I.e., your grammar needs a rule that says something like term := term term. I suspect it's tricky to get this to work correctly, but it does work in OCaml (where adjacent expressions represent function application) and in awk (where adjacent expressions represent string concatenation).
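To make the lexer-state idea concrete, here is a rough sketch, in Python rather than OCaml/menhir for brevity (all token names are hypothetical; a real implementation would live in your ocamllex rules): letter clusters before the comma are split into single-character ids with an implicit MUL inserted between them, and clusters after the comma are kept whole.

import re

def lex(stmt):
    # "Ax = b, where A is invertible."  ->  A * x = b, (A, invertible)
    expr, _, decl = stmt.partition(",")
    tokens = []
    # Expression mode: split letter clusters into single-character ids,
    # inserting an implicit MUL between adjacent ids.
    for cluster in re.findall(r"[A-Za-z]+|=", expr):
        if cluster == "=":
            tokens.append(("EQUALS", "="))
            continue
        for i, ch in enumerate(cluster):
            if i > 0:
                tokens.append(("MUL", "*"))  # the overloaded multiplication
            tokens.append(("ID", ch))
    # Declaration mode: whole words survive; some become keywords.
    for word in re.findall(r"[A-Za-z]+", decl):
        kind = "KEYWORD" if word in ("where", "is", "and") else \
               ("ID" if len(word) == 1 else "STRING")
        tokens.append((kind, word))
    return tokens

print(lex("Ax = b, where A is invertible."))
# [('ID', 'A'), ('MUL', '*'), ('ID', 'x'), ('EQUALS', '='), ('ID', 'b'),
#  ('KEYWORD', 'where'), ('ID', 'A'), ('KEYWORD', 'is'), ('STRING', 'invertible')]

With the lexer materializing the '*' tokens, the parser needs no adjacency rule at all; alternatively, keep the lexer simple and add the term := term term adjacency rule described above.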

Producing Expressions from This Grammar with Recursive Descent

I've got a simple grammar. Actually, the grammar I'm using is more complex, but this is the smallest subset that illustrates my question.
Expr ::= Value Suffix
       | "(" Expr ")" Suffix

Suffix ::= "->" Expr
         | "<-" Expr
         | Expr
         | epsilon
Value matches identifiers, strings, numbers, et cetera. The Suffix rule is there to eliminate left-recursion. This matches expressions such as:
a -> b (c -> (d) (e))
That is, a graph where a goes to both b and the result of (c -> (d) (e)), and c goes to d and e. I'm trying to produce an abstract syntax tree for these expressions, but I'm running into difficulty because all of the operators can accept any number of operands on each side. I'd rather keep the logic for producing the AST within the recursive descent parsing methods, since it avoids having to duplicate the logic of extracting an expression. My current strategy is as follows:
1. If a Value appears, push it to the output.
2. If a From or To appears:
   2.1. Output a separator.
   2.2. Get the next Expr.
   2.3. Create a Link node.
   2.4. Pop the first set of operands from the output into the Link until a separator appears.
   2.5. Erase the separator discovered.
   2.6. Pop the second set of operands into the Link until a separator.
   2.7. Push the Link to the output.
If I run this through without obeying steps 2.3–2.7, I get a list of values and separators. For the expression quoted above, a -> b (c -> (d) (e)), the output should be:
A sep_1 B sep_2 C sep_3 D E
Applying the To rule would then yield:
A sep_1 B sep_2 (link from C to {D, E})
And subsequently:
(link from A to {B, (link from C to {D, E})})
The important thing to note is that sep_2, crucial to delimit the left-hand operands of the second ->, does not appear, so the parser believes that the expression was actually written:
a -> (b c -> (d) (e))
In order to solve this with my current strategy, I would need a way to produce a separator between adjacent expressions, but only if the current expression is a From or To expression enclosed in parentheses. If that's possible, then I'm just not seeing it and the answer ought to be simple. If there's a better way to go about this, however, then please let me know!
I haven't tried to analyze it in detail, but: "From or To expression enclosed in parentheses" starts to sound a lot like "context dependent", which recursive descent can't handle directly. To avoid context dependence you'll probably need a separate production for a From or To in parentheses vs. a From or To without the parens.
Edit: Though it may be too late to do any good, if my understanding of what you want to match is correct, I think I'd write it more like this:
Graph := List
       | List Sep Graph
;

Sep := "->"
     | "<-"
;

List := Value
      | Value List
;

Value := Number
       | Identifier
       | String
       | '(' Graph ')'
;
It's hard to be certain, but I think this should at least be close to matching (only) the inputs you want, and should make it reasonably easy to generate an AST that reflects the input correctly.
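If it helps, here is a quick recursive-descent sketch of that grammar in Python (node shapes are arbitrary illustrations). The right-recursion in Graph and the loop in List are what make the separators group the way the question intends:

import re

def tokenize(src):
    # arrows, parens, and word atoms (numbers/identifiers/strings collapsed)
    return re.findall(r"->|<-|[()]|\w+", src)

class Parser:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def graph(self):                      # Graph := List | List Sep Graph
        left = self.list_()
        if self.peek() in ("->", "<-"):
            return ("link", self.take(), left, self.graph())
        return left

    def list_(self):                      # List := Value | Value List
        values = [self.value()]
        while self.peek() not in (None, "->", "<-", ")"):
            values.append(self.value())
        return ("list", values)

    def value(self):                      # Value := atom | '(' Graph ')'
        tok = self.take()
        if tok == "(":
            inner = self.graph()
            assert self.take() == ")", "expected ')'"
            return inner
        return ("atom", tok)

print(Parser(tokenize("a -> b (c -> (d) (e))")).graph())
# roughly (pretty-printed):
# ('link', '->', ('list', [('atom', 'a')]),
#                ('list', [('atom', 'b'),
#                          ('link', '->', ('list', [('atom', 'c')]),
#                                         ('list', [('list', [('atom', 'd')]),
#                                                   ('list', [('atom', 'e')])]))]))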

Shift-reduce: when to stop reducing?

I'm trying to learn about shift-reduce parsing. Suppose we have the following grammar, using recursive rules that enforce order of operations, inspired by the ANSI C Yacc grammar:
S : A ;

P : NUMBER
  | '(' S ')'
  ;

M : P
  | M '*' P
  | M '/' P
  ;

A : M
  | A '+' M
  | A '-' M
  ;
And we want to parse 1+2 using shift-reduce parsing. First, the 1 is shifted as a NUMBER. My question is, is it then reduced to P, then M, then A, then finally S? How does it know where to stop?
Suppose it does reduce all the way to S, then shifts '+'. We'd now have a stack containing:
S '+'
If we shift '2', the reductions might be:
S '+' NUMBER
S '+' P
S '+' M
S '+' A
S '+' S
Now, on either side of the '+' in the last line, the S could instead be P, M, A, or NUMBER, and it would still be valid in the sense that any combination would be a correct representation of the text. How does the parser "know" to make it
A '+' M
So that it can reduce the whole expression to A, then S? In other words, how does it know to stop reducing before shifting the next token? Is this a key difficulty in LR parser generation?
Edit: An addition to the question follows.
Now suppose we parse 1+2*3. Some shift/reduce operations are as follows:
Stack    | Input | Operation
---------+-------+----------------------------------------------
         | 1+2*3 |
NUMBER   | +2*3  | Shift
A        | +2*3  | Reduce (looking ahead, we know to stop at A)
A+       | 2*3   | Shift
A+NUMBER | *3    | Shift
A+M      | *3    | Reduce (looking ahead, we know to stop at M)
Is this correct (granted, it's not fully parsed yet)? Moreover, does lookahead by 1 symbol also tell us not to reduce A+M to A, as doing so would result in an inevitable syntax error after reading *3 ?
The problem you're describing is an issue with creating LR(0) parsers - that is, bottom-up parsers that don't do any lookahead to symbols beyond the current one they are parsing. The grammar you've described doesn't appear to be an LR(0) grammar, which is why you run into trouble when trying to parse it w/o lookahead. It does appear to be LR(1), however, so by looking 1 symbol ahead in the input you could easily determine whether to shift or reduce. In this case, an LR(1) parser would look ahead when it had the 1 on the stack, see that the next symbol is a +, and realize that it shouldn't reduce past A (since that is the only thing it could reduce to that would still match a rule with + in the second position).
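Under that one-token-lookahead discipline, the trace from the question finishes like this (a sketch):

Stack      | Input | Operation
-----------+-------+----------------------------------------------
A+M        | *3    | Shift ('*' lookahead forbids reducing A+M to A)
A+M*       | 3     | Shift
A+M*NUMBER |       | Reduce NUMBER to P
A+M*P      |       | Reduce M '*' P to M
A+M        |       | Reduce A '+' M to A (at end of input)
A          |       | Reduce A to S; accept

So, to the edit's question: yes, the same single token of lookahead ('*') is what tells the parser not to reduce A+M to A prematurely.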
An interesting property of LR grammars is that for any grammar which is LR(k) for k>1, it is possible to construct an LR(1) grammar which is equivalent. However, the same does not extend all the way down to LR(0) - there are many grammars which cannot be converted to LR(0).
See here for more details on LR(k)-ness:
http://en.wikipedia.org/wiki/LR_parser
I'm not exactly sure of the Yacc/Bison parsing algorithm and when it prefers shifting over reducing; however, I know that Bison supports LR(1) parsing, which means it has a lookahead token. This means that tokens aren't pushed onto the stack immediately; rather, they wait until no more reductions can happen. Then, if shifting the next token makes sense, it applies that operation.
First of all, in your case, if you're evaluating 1 + 2, it will shift 1. It will reduce that token to an A because the '+' lookahead token indicates that this is the only valid course. Since there are no more reductions, it will shift the '+' token onto the stack and hold 2 as the lookahead. It will shift the 2 and reduce it to an M, since A '+' M produces an A and the expression is complete.
