Improve BNF for mathematic expressions - parsing

A good exercise while learning programming, is to write a calculator. To do this, I created some kind of DSL in BNF and want to ask for your help to improve it. With this minilanguage you should be able to add, multiply and assign values and expressions to names (a.k.a. create variables and functions).
Hava a look at the BNF first:
<Program> ::= <Line>(<NewLine><Line>)*
<Line> ::= {"("}<Expression>{")"}|<Assignment>
<Assignment> ::= <Identifier>"="<Expression>
<Identifier> ::= <Name>{"("<Name>(","<Name>)*")"}
<Expression> ::= <Summand>(("+"|"-")<Summand>)*
<Summand> ::= <Factor>(("*"|"/")<Factor>)*
<Factor> ::= <Number>|<Call>
<Call> ::= <Name> {"("<Expression>(","<Expression>)*")"}
<Name> ::= <Letter>(<Letter>|<Digit>)*
<Number> ::= {"+"|"-"}(<Digit>|<DigitNoZero><Digit>+)
<Digit> ::= "0"|<DigitNoZero>
<DigitNoZero> ::= "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
<Letter> ::= [a-zA-Z]
<NewLine> ::= "\n"|"\r"|"\r\n"
As you can see, this BNF treats no Whitespace beside NewLine. Before the parsing begins I plan to remove all whitespace (beside NewLine of course) from the string to parse. It is not nessesary for the parser anyway.
There are 4 things, that might lead to problems when using this language as defined right now and I hope you can help me figure out appropriate solutions:
I tried to follow the top-down approach, while generating this gramar, but there is a circle between <Expression>, <Summand>, <Factor> and <Call>.
The gramar treats variables and functions exactly the same way. Most programming languages make a difference. Is it nessesary to differentiate here?
There are maybe some things that I don't know about programming, BNF, whatever, that will kill me later, while trying to implement the BNF. But you might be able to spot that before I start.
There might be simple and stupid mistakes that I could not find myself. Sorry in that case. I hope there are none of these mistakes anymore.
Using hand and brain I could successfully parse the following test cases:
"3"
"-3"
"3-3"
"a=3"
"a=3+b"
"a=3+b\nc=a+3"
"a(b,c)=b*c\ra(1+2,2*3)"
Please help to improve the BNF, that it can be used to successfully write a calculator.
edit:
This BNF is really not finished. It does not treat the cases "2+-3" (should fail, but doesn't) and "2+(-3)" (should not fail, but does) correctly.

The gramar treats variables and functions exactly the same way. Most programming languages make a difference. Is it nessesary to differentiate here?
Being able to treat the result of a function invocation exactly the same as a local variable or a constant expression is precisely the point of defining (mathematical) functions in the first place. I can't imagine the use of a grammar that allowed functions but didn't treat
1 + 1
exactly the same as
1 + a
or
1 + sin(x)

Related

ANTLR: Why is this grammar rule for a tuples not LL(1)?

I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
Also based on the discussion I found here I am further confused: Amend JSON - based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.

Parser/Grammar: 2x parenthesis in nested rules

Despite my limited knowledge about compiling/parsing I dared to build a small recursive-descent parser for OData $filter expressions. The parser only needs to check the expression for correctness and output a corresponding condition in SQL. As input and output have almost the same tokens and structure, this was fairly straightforward, and my implementation does 90% of what I want.
But now I got stuck with parentheses, which appear in separate rules for logical and arithmetic expressions. The full OData grammar in ABNF is here, a condensed version of the rules involved is this:
boolCommonExpr = ( boolMethodCallExpr
/ notExpr
/ commonExpr [ eqExpr / neExpr / ltExpr / ... ]
/ boolParenExpr
) [ andExpr / orExpr ]
commonExpr = ( primitiveLiteral
/ firstMemberExpr ; = identifier
/ methodCallExpr
/ parenExpr
) [ addExpr / subExpr / mulExpr / divExpr / modExpr ]
boolParenExpr = "(" boolCommonExpr ")"
parenExpr = "(" commonExpr ")"
How does this grammar match a simple expression like (1 eq 2)? From what I can see all ( are consumed by the rule parenExpr inside commonExpr, i.e. they must also close after commonExpr to not cause an error and boolParenExpr never gets hit. I suppose my experience / intuition on reading such a grammar is just insufficient to get it. A comment in the ABNF says: "Note that boolCommonExpr is also a commonExpr". Maybe that's part of the mystery?
Obviously an opening ( alone won't tell me where it's going to close: After the current commonExpr expression or further away after boolCommonExpr. My lexer has a list of all tokens ahead (URL is very short input). I was thinking to use that to find out what type of ( I have. Good idea?
I'd rather have restrictions in input or a little hack than switching to a generally more powerful parser model. For a simple expression translation like this I also want to avoid compiler tools.
Edit 1: Extension after answer by rici - Is grammar rewrite correct?
Actually I started out with the example for recursive-descent parsers given on Wikipedia. Then I though to better adapt to the official grammar given by the OData standard to be more "conformant". But with the advice from rici (and the comment from "Internal Server Error") to rewrite the grammar I would tend to go back to the more comprehensible structure provided on Wikipedia.
Adapted to the boolean expression for the OData $filter this could maybe look like this:
boolSequence= boolExpr {("and"|"or") boolExpr} .
boolExpr = ["not"] expression ("eq"|"ne"|"lt"|"gt"|"lt"|"le") expression .
expression = term {("add"|"sum") term} .
term = factor {("mul"|"div"|"mod") factor} .
factor = IDENT | methodCall | LITERAL | "(" boolSequence")" .
methodCall = METHODNAME "(" [ expression {"," expression} ] ")" .
Does the above make sense in general for boolean expressions, is it mostly equivalent to the original structure above and digestible for a recursive descent parser?
#rici: Thanks for your detailed remarks on type checking. The new grammar should resolve your concerns about precedence in arithmetic expressions.
For all three terminals (UPPERCASE in the grammar above) my lexer supplies a type (string, number, datetime or boolean). Non-terminals return the type they produce. With this I managed quite nicely do type checking on the fly in my current implementation, including decent error messages. Hopefully this will also work for the new grammar.
Edit 2: Return to original OData grammar
The differentiation between a "logical" and "arithmetic" ( is not a trivial one. To solve the problem even N.Wirth uses a dodgy workaround to keep the grammar of Pascal simple. As a consequence, in Pascal an extra pair of () is mandatory around and and or expressions. Neither intuitive nor OData conformant :-(. The best read about the "() difficulty" I found is in Let's Build a Compiler (Part VI). Other languages seem to go to great length in the grammar to solve the problem. As I don't have experience with grammar construction I stopped doing my own.
I ended up implementing the original OData grammar. Before I run the parser I go over all tokens backwards to figure out which ( belong to a logical/arithmetic expression. Not a problem for the potential length of a URL.
Personally, I'd just modify the grammar so that it has only one type of expression and therefore one type of parenthesis. I'm not convinced that the OData grammar is actually correct; it is certainly not usable in an LL(1) (or recursive descent) parser for exactly the reason you mention.
Specifically, if the goal is boolCommonExpr, there are two productions which can match the ( lookahead token:
boolCommonExpr = ( …
/ commonExpr [ eqExpr / neExpr / … ]
/ boolParenExpr
/ …
) …
commonExpr = ( …
/ parenExpr
/ …
) …
For the most part, this is a misguided attempt to make the grammar detect a type violation. (If in fact it is a type violation.) It's misguided because it is doomed to failure if there are boolean variables, which there apparently are in this environment. Since there is not syntactic clue as to the type of a variable, the parser is not capable of deciding whether particular expressions are well-formed or not, so there is a good argument for not trying at all, particularly if it creates parsing headaches. A better solution is to first parse the expression into an AST of some form, and then do another pass over the AST to check that each operators has operands of the correct type (and possibly inserting explicit cast operators if that is necessary).
Aside from any other advantage, doing the type check in a separate pass lets you produce much better error messages. If you make (some) type violations syntax errors, then you may leave the user puzzled about why their expression was rejected; in contrast, if you notice that a comparison operation is being used as an operand to multiply (and if your language's semantics don't allow an automatic conversion from True/False to 1/0), then you can produce a well-targetted error message ("comparisons cannot be used as the operand of an arithmetic operator", for example).
One possible reason to put different operators (but not parentheses) into different grammatical variables is to express grammatical precedence. That consideration might encourage you to rewrite the grammar with explicit precedence. (As written, the grammar assumes that all arithmetic operators have the same precedence, which would presumably lead to 2 + 3 * a being parsed as (2 + 3) * a, which might be a huge surprise.) Alternatively, you might use some simple precedence aware subparser for expressions.
If you want to test your ABNF grammar for determinism (i.e. LL(1)), you can use Tunnel Grammar Studio (TGS). I have tested the full grammar, and there are plenty of conflicts, not only this scopes. If you are able to extract the relevant rules, you can use the desktop version of TGS to visualize the conflicts (the online version checker is with a textual result only). If the rules are not too many, the demo may help you to create an LL(1) grammar from your rules.
If you extract all rules you need, and add them to your question, I can run it for you and will tell you is it LL(1). Note that the grammar is not exactly in ABNF meta syntax, because the case sensitivity is typed with ' for case sensitive strings. The ABNF (RFC 5234) by definition is case insensitive, as RFC 7405 defines the sensitivity with %s and %i (sensitive and insensitive) prefixes before the the actual string. The default case (without a prefix) still means insensitive. This means that you have to replace this invalid '...' strings with %s"..." before testing in TGS.
TGS is a project I work on.

Shift/Reduce conflict in CUP

I'm trying to write a parser for a javascript-ish language with JFlex and Cup, but I'm having some issues with those deadly shift/reduce problems and reduce/reduce problems.
I have searched thoroughly and have found tons of examples, but I'm not able to extrapolate these to my grammar. My understanding so far is that these problems are because the parser cannot decide which way it should take because it can't distinguish.
My grammar is the following one:
start with INPUT;
INPUT::= PROGRAM;
PROGRAM::= FUNCTION NEWLINE PROGRAM
| NEWLINE PROGRAM;
FUNCTION ::= function OPTIONAL id p_izq ARG p_der NEWLINE l_izq NEWLINE BODY l_der;
OPTIONAL ::=
| TYPE;
TYPE::= integer
| boolean
ARG ::=
| TYPE id MORE_ARGS;
MORE_ARGS ::=
| colon TYPE id MORE_ARGS;
NEWLINE ::= salto NEWLINE
| ;
BODY ::= ;
I'm getting several conflicts but these 2 are a mere example:
Warning : *** Shift/Reduce conflict found in state #5
between NEWLINE ::= (*)
and NEWLINE ::= (*) salto NEWLINE
under symbol salto
Resolved in favor of shifting.
Warning : *** Shift/Reduce conflict found in state #0
between NEWLINE ::= (*)
and FUNCTION ::= (*) function OPTIONAL id p_izq ARG p_der NEWLINE l_izq NEWLINE BODY l_der
under symbol function
Resolved in favor of shifting.
PS: The grammar is far more complex, but I think that if I see how these shift/reduce problems are solved, I'll be able to fix the rest.
Thanks for your answers.
PROGRAM is useless (in the technical sense). That is, it cannot produce any sentence because in
PROGRAM::= FUNCTION NEWLINE PROGRAM
| NEWLINE PROGRAM;
both productions for PROGRAM are recursive. For a non-terminal to be useful, it needs to be able to eventually produce some string of terminals, and for it to do that, it must have at least one non-recursive production; otherwise, the recursion can never terminate. I'm surprised CUP didn't mention this to you. Or maybe it did, and you chose to ignore the warning.
That's a problem -- useless non-terminals really cannot ever match anything, so they will eventually result in a parse error -- but it's not the parsing conflict you're reporting. The conflicts come from another feature of the same production, which is related to the fact that you can't divide by 0.
The thing about nothing is that any number of nothings are still nothing. So if you had plenty of nothings and someone came along and asked you exactly how many nothings you had, you'd have a bit of problem, because to get "plenty" back from "0 * plenty", you'd have to compute "0 / 0", and that's not a well-defined value. (If you had plenty of two's, and someone asked you how many two's you had, that wouldn't be a problem: suppose the plenty of two's worked out to 40; you could easily compute that 40 / 2 = 20, which works out perfectly because 20 * 2 = 40.)
So here we don't have arithmetic, we have strings of symbols. And unfortunately the string containing no symbols is really invisible, like 0 was for all those millenia until some Arabian mathematician noticed the value of being able to write nothing.
Where this is all coming around to is that you have
PROGRAM::= ... something ...
| NEWLINE PROGRAM;
But NEWLINE is allowed to produce nothing.
NEWLINE ::= salto NEWLINE
| ;
So the second recursive production of PROGRAM might add nothing. And it might add nothing plenty of times, because the production is recursive. But the parser needs to be deterministic: it needs to know exactly how many nothings are present, so that it can reduce each nothing to a NEWLINE non-terminal and then reduce a new PROGRAM non-terminal. And it really doesn't know how many nothings to add.
In short, optional nothings and repeated nothings are ambiguous. If you are going to insert a nothing into your language, you need to make sure that there are a fixed finite number of nothings, because that's the only way the parser can unambiguously dissect nothingness.
Now, since the only point of that particular recursion (as far as I can see) is to allow empty newline-terminated statements (which are visible because of the newline, but do nothing). And you could do that by changing the recursion to avoid nothing:
PROGRAM ::= ...
| salto PROGRAM;
Although it's not relevant to your current problem, I feel obliged to mention that CUP is an LALR parser generator and all that stuff you might have learned or read on the internet about recursive descent parsers not being able to handle left recursion does not apply. (I deleted a rant about the way parsing technique are taught, so you'll have to try to recover it from the hints I've left behind.) Bottom-up parser generators like CUP and yacc/bison love left recursion. They can handle right recursion, too, of course, but reluctantly because they need to waste stack space for every recursion other than left recursion. So there is no need to warp your grammar in order to deal with the deficiency; just write the grammar naturally and be happy. (So you rarely if ever need non-terminals representing "the rest of" something.)
PD: Plenty of nothing is a culturally-specific reference to a song from the 1934 opera Porgy and Bess.

Detect wrong grammar (error alternatives) for quick fixes

In my ANTLR grammar I would like to detect wrong keywords or wrong typed constants, e.g. 'null' instead of 'NULL'. I added error alternatives in the grammar, e.g.:
| extra='null' #error1
If the wrong constant is detected in my custom editor I can fix it by replacing it with the correct constant.
But I don't know if this is the correct way to address and detect wrong keywords or constants in a grammar.
In addition I tried to detect missing closing in the grammar (see ANTLR book chapter 9.4):
.......
| 'if' '(' expr expr #error2
| '{' exprlist #error3
| 'while' '(' expr expr #error4
........
But this massively slows down the parsing process and so I think that it is wrong to do so.
My questions are:
Is it correct to detect wrong keywords, constants, etc. in that way as described above?
Is it somehow possible to catch missing closing in the grammar without the massive speed decrease?
Any help is appreciated.
I found out that I can extract the second Token from the rule (to add a brace at that position):
| 'if' '(' expr expr #error2
with:
Token myToken = tokens.get(ctx.getChild(2).getSourceInterval().b);
parser.notifyErrorListeners(........
However it massively decreases the speed of the parser.

Overloading multiplication using menhir and OCaml

I have written a lexer and parser to analyze linear algebra statements. Each statement consists of one or more expressions followed by one or more declarations. I am using menhir and OCaml to write the lexer and parser.
For example:
Ax = b, where A is invertible.
This should be read as A * x = b, (A, invertible)
In an expression all ids must be either an uppercase or lowercase symbol. I would like to overload the multiplication operator so that the user does not have to type in the '*' symbol.
However, since the lexer also needs to be able to read strings (such as "invertible" in this case), the "Ax" portion of the expression is sent over to the parser as a string. This causes a parser error since no strings should be encountered in the expression portion of the statement.
Here is the basic idea of the grammar
stmt :=
| expr "."
| decl "."
| expr "," decl "."
expr :=
| term
| unop expr
| expr binop expr
term :=
| <int> num
| <char> id
| "(" expr ")"
decl :=
| id "is" kinds
kinds :=
| <string> kind
| kind "and" kinds
Is there some way to separate the individual characters and tell the parser that they should be treated as multiplication? Is there a way to change the lexer so that it is smart enough to know that all character clusters before a comma are ids and all clusters after should be treated as strings?
It seems to me you have two problems:
You want your lexer to treat sequences of characters differently in different places.
You want multiplication to be indicated by adjacent expressions (no operator in between).
The first problem I would tackle in the lexer.
One question is why you say you need to use strings. This implies that there is a completely open-ended set of things you can say. It might be true, but if you can limit yourself to a smallish number, you can use keywords rather than strings. E.g., invertible would be a keyword.
If you really want to allow any string at all in such places, it's definitely still possible to hack a lexer so that it maintains a state describing what it has seen, and looks ahead to see what's coming. If you're not required to adhere to a pre-defined grammar, you could adjust your grammar to make this easier. (E.g., you could use commas for only one purpose.)
For the second problem, I'd say you need to add adjacency to your grammar. I.e., your grammar needs a rule that says something like term := term term. I suspect it's tricky to get this to work correctly, but it does work in OCaml (where adjacent expressions represent function application) and in awk (where adjacent expressions represent string concatenation).

Resources