Handling whitespace in EBNF - parsing

Let's say I have the following EBNF defined for a simpler two-term adder:
<expression> ::= <number> <plus> <number>
<number> ::= [0-9]+
<plus> ::= "+"
Shown here.
What would be the proper way to allow any amount of whitespace except a newline/return between the terms? For example to allow:
1 + 2
1 <tab> + 2
1 + 2
etc.
For example, doing something like the following fails:
<whitespace>::= " " | \t
Furthermore, it seems (almost) every term would be preceded and followed by an optional space. Something like:
<plus> ::= <whitespace>? "+" <whitespace>?
How would that be properly addressed?

The XML standard, as an example, uses the following production for whitespace:
S ::= (#x20 | #x9 | #xD | #xA)+
You could omit CR (#xD) and LF (#xA) if you don't want those.
Regarding your observation that grammars could become overwhelmed by whitespace non-terminals, note that whitespace handling can be done in lexical analysis rather than in parsing. See EBNF Grammar for list of words separated by a space.

Related

How would I implement operator-precedence in my grammar?

I'm trying to make an expression parser and although it works, it does calculations chronologically rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features but right now I'm not sure how to make division precede multiplication, etc.. How should I modify my grammar to implement operator-precedence?
Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
Here ()* means zero or more times. This grammar will produce right associative trees and it is deterministic and unambiguous. The multiplication and the division are with the same priority. The addition and subtraction also.

Find out if there is a theory fow writing expression section of BNF Grammar

I want to write a new parser for mathematical program. When i write BNF grammar for that, I'm stuck in expressions. And this is the BNF notation I wrote.
program : expression*
expression: additive_expression
additive_expression : multiplicative_expression
| additive_expression '+' multiplicative_expression
| additive_expression '-' multiplicative_expression
multiplicative_expression : number
| number '*' multiplicative_expression
| number '/' multiplicative_expression
But i cannot understand how to write BNF expression grammar for ++, --, +=, -=, &&, || etc. this operators. I mean operators in languages like C, C++, C#, Python, Java etc.
I know that when using +, -, *, / operators, grammar should be written according to the BODMAS theory.
What i want to know is whether any theory should be used in writing grammar for other operators?
I took a good look at the C++ Language grammar ( expressions part ). But I can not understand it.
<expression> ::= <assignment_expression>
| <assignment_expression> <expression>
<assignment_expression> ::= <logical_or_expression> "=" <assignment_expression>
| <logical_or_expression> "+=" <assignment_expression>
| <logical_or_expression> "-=" <assignment_expression>
| <logical_or_expression> "*=" <assignment_expression>
| <logical_or_expression> "/=" <assignment_expression>
| <logical_or_expression> "%=" <assignment_expression>
| <logical_or_expression> "<<=" <assignment_expression>
| <logical_or_expression> ">>=" <assignment_expression>
| <logical_or_expression> "&=" <assignment_expression>
| <logical_or_expression> "|=" <assignment_expression>
| <logical_or_expression> "^=" <assignment_expression>
| <logical_or_expression>
<constant_expression> ::= <conditional_expression>
<conditional_expression> ::= <logical_or_expression>
<logical_or_expression> ::= <logical_or_expression> "||" <logical_and_expression>
| <logical_and_expression>
<logical_and_expression> ::= <logical_and_expression> "&&" <inclusive_or_expression>
| <inclusive_or_expression>
<inclusive_or_expression> ::= <inclusive_or_expression> "|" <exclusive_or_expression>
| <exclusive_or_expression>
<exclusive_or_expression> ::= <exclusive_or_expression> "^" <and_expression>
| <and_expression>
<and_expression> ::= <and_expression> "&" <equality_expression>
| <equality_expression>
<equality_expression> ::= <equality_expression> "==" <relational_expression>
| <equality_expression> "!=" <relational_expression>
| <relational_expression>
<relational_expression> ::= <relational_expression> ">" <shift_expression>
| <relational_expression> "<" <shift_expression>
| <relational_expression> ">=" <shift_expression>
| <relational_expression> "<=" <shift_expression>
| <shift_expression>
<shift_expression> ::= <shift_expression> ">>" <addictive_expression>
| <shift_expression> "<<" <addictive_expression>
| <addictive_expression>
<addictive_expression> ::= <addictive_expression> "+" <multiplicative_expression>
| <addictive_expression> "-" <multiplicative_expression>
| <multiplicative_expression>
<multiplicative_expression> ::= <multiplicative_expression> "*" <unary_expression>
| <multiplicative_expression> "/" <unary_expression>
| <multiplicative_expression> "%" <unary_expression>
| <unary_expression>
<unary_expression> ::= "++" <unary_expression>
| "--" <unary_expression>
| "+" <unary_expression>
| "-" <unary_expression>
| "!" <unary_expression>
| "~" <unary_expression>
| "size" <unary_expression>
| <postfix_expression>
<postfix_expression> ::= <postfix_expression> "++"
| <postfix_expression> "--"
| <primary_expression>
<primary_expression> ::= <integer_literal>
| <floating_literal>
| <character_literal>
| <string_literal>
| <boolean_literal>
| "(" <expression> ")"
| IDENTIFIER
<integer_literal> ::= INTEGER
<floating_literal> ::= FLOAT
<character_literal> ::= CHARACTER
<string_literal> ::= STRING
<boolean_literal> ::= "true"
| "false"
Can you help me understand this? I searched a lot on the internet about this. But I could not find the right solution for this.
Thank you
I gather that the problem is not that you don't know how to write BNF grammars. Rather, you don't know which grammar you should write. In other words, if someone told you the precedence order for your operators, you would have no trouble writing down a grammar which parsed the operators with that particular precedence order. But you don't know which precedence order you should use.
Unfortunately there is not an International High Commission on Operator Precedence, and neither does any religion that I know of offer a spiritual reference including divinely inspired operator precedence rules.
For some operators, the precedence order is reasonably clear: BODMAS was adopted for good reasons, for example, but the main reason is that most people already wrote arithmetic according to those rules. You can certainly make a plausible mathematical argument based on group theory as to why it seems natural to give multiplication precedence over addition, but there will always be the suspicion that the argument was produced post facto to justify an already-made decision. Be that as it may, the argument for making multiplication bind more tightly than addition also works for giving bitwise-and precedence over bitwise-or, or boolean-and precedence over boolean-or. (Although I note that C compiler writers don't trust that programmers will have that particular intuition, since most modern compilers issue a warning if you omit redundant parentheses in such expressions.)
One precedence relationship which I think just about everyone would agree was intuitive is giving arithmetic operators precedence over comparison operators, and comparison operators precedence operators precedence over boolean operators. Most programmers would find a language which chose to interpret a > b + 7 as meaning "add seven to the boolean value of comparing a with b". Similarly, it might be considered outrageous to interpret a > 0 || b > 0 in any way other than (a > 0) || (b > 0). But other precedence choices seem a lot more arbitrary, and not all languages make them the same. (For example, the relative precedence of "not equal to" and "greater than", or of "exclusive or" and "inclusive or".
So what's a novice language designer to do? Well, first appeal to your own intuitions, particularly if you have used the operators in question a lot. Also, ask your friends, contacts, and potential language users what they think. Look at what other languages have done, but look with a critical eye. Search for complaints on language-specific forums. It may well be that the original language designer made an unfortunate (or even stupid) choice, and that choice now cannot be changed because it would break too many existing programs. (Which is why it is a good thing that you are worrying about operator precedence: getting it wrong can bring serious future problems.)
Direct experimentation can help, too, particularly with operators which are rarely used together. Define a plausible expression using the operators, and then write it out two (or more) times leaving out parentheses according to the various possible rules. Which of those looks more understandable? (Again, recruit your friends and ask them the same question.)
If, in the end, you really cannot decide what the precedence order between a particular pair of operators is, you can consider one more solution: Make parentheses mandatory for these operators. (Which is the intent of the C compiler whining about leaving out redundant parentheses in boolean expressions.)

parsing maxscript - problem with newlines

I am trying to create parser for MAXScript language using their official grammar description of the language. I use flex and bison to create the lexer and parser.
However, I have run into following problem. In traditional languages (e.g. C) statements are separated by a special token (; in C). But in MAXScript expressions inside a compound expression can be separated either by ; or newline. There are other languages that use whitespace characters in their parsers, like Python. But Python is much more strict about the placement of the newline, and following program in Python is invalid:
# compile error
def
foo(x):
print(x)
# compile error
def bar
(x):
foo(x)
However in MAXScript following program is valid:
fn
foo x =
( // parenthesis start the compound expression
a = 3 + 2; // the semicolon is optional
print x
)
fn bar
x =
foo x
And you can even write things like this:
for
x
in
#(1,2,3,4)
do
format "%," x
Which will evaluate fine and print 1,2,3,4, to the output. So newlines can be inserted into many places with no special meaning.
However if you insert one more newline in the program like this:
for
x
in
#(1,2,3,4)
do
format "%,"
x
You will get a runtime error as format function expects to have more than one parameter passed.
Here is part of the bison input file that I have:
expr:
simple_expr
| if_expr
| while_loop
| do_loop
| for_loop
| expr_seq
expr_seq:
"(" expr_semicolon_list ")"
expr_semicolon_list:
expr
| expr TK_SEMICOLON expr_semicolon_list
| expr TK_EOL expr_semicolon_list
if_expr:
"if" expr "then" expr "else" expr
| "if" expr "then" expr
| "if" expr "do" expr
// etc.
This will parse only programs which use newline only as expression separator and will not expect newlines to be scattered in other places in the program.
My question is: Is there some way to tell bison to treat a token as an optional token? For bison it would mean this:
If you find newline token and you can shift with it or reduce, then do so.
Otherwise just discard the newline token and continue parsing.
Because if there is no way to do this, the only other solution I can think of is modifying the bison grammar file so that it expects those newlines everywhere. And bump the precedence of the rule where newline acts as an expression separator. Like this:
%precedence EXPR_SEPARATOR // high precedence
%%
// w = sequence of whitespace tokens
w: %empty // either nothing
| TK_EOL w // or newline followed by other whitespace tokens
expr:
w simple_expr w
| w if_expr w
| w while_loop w
| w do_loop w
| w for_loop w
| w expr_seq w
expr_seq:
w "(" w expr_semicolon_list w ")" w
expr_semicolon_list:
expr
| expr w TK_SEMICOLON w expr_semicolon_list
| expr TK_EOL w expr_semicolon_list %prec EXPR_SEPARATOR
if_expr:
w "if" w expr w "then" w expr w "else" w expr w
| w "if" w expr w "then" w expr w
| w "if" w expr w "do" w expr w
// etc.
However this looks very ugly and error-prone, and I would like to avoid such solution if possible.
My question is: Is there some way to tell bison to treat a token as an optional token?
No, there isn't. (See below for a longer explanation with diagrams.)
Still, the workaround is not quite as ugly as you think, although it's not without its problems.
In order to simplify things, I'm going to assume that the lexer can be convinced to produce only a single '\n' token regardless of how many consecutive newlines appear in the program text, including the case where there are comments scattered among the blank lines. That could be achieved with a complex regular expression, but a simpler way to do it is to use a start condition to suppress \n tokens until a regular token is encountered. The lexer's initial start condition should be the one which suppresses newline tokens, so that blank lines at the beginning of the program text won't confuse anything.
Now, the key insight is that we don't have to insert "maybe a newline" markers all over the grammar, since every newline must appear right after some real token. And that means that we can just add one non-terminal for every terminal:
tok_id: ID | ID '\n'
tok_if: "if" | "if" '\n'
tok_then: "then" | "then" '\n'
tok_else: "else" | "else" '\n'
tok_do: "do" | "do" '\n'
tok_semi: ';' | ';' '\n'
tok_dot: '.' | '.' '\n'
tok_plus: '+' | '+' '\n'
tok_dash: '-' | '-' '\n'
tok_star: '*' | '*' '\n'
tok_slash: '/' | '/' '\n'
tok_caret: '^' | '^' '\n'
tok_open: '(' | '(' '\n'
tok_close: ')' | ')' '\n'
tok_openb: '[' | '[' '\n'
tok_closeb: ']' | ']' '\n'
/* Etc. */
Now, it's just a question of replacing the use of a terminal with the corresponding non-terminal defined above. (No w non-terminal is required.) Once we do that, bison will report a number of shift-reduce conflicts in the non-terminal definitions just added; any terminal which can appear at the end of an expression will instigate a conflict, since the newline could be absorbed either by the terminal's non-terminal wrapper or by the expr_semicolon_list production. We want the newline to be part of expr_semicolon_list, so we need to add precedence declarations starting with newline, so that it is lower precedence than any other token.
That will most likely work for your grammar, but it is not 100% certain. The problem with precedence-based solutions is that they can have the effect of hiding real shift-reduce conflict issues. So I'd recommend running bison on the grammar and verifying that all the shift-reduce conflicts appear where expected (in the wrapper productions) before adding the precedence declarations.
Why token fallback is not as simple as it looks
In theory, it would be possible to implement a feature similar to the one you suggest. [Note 1]
But it's non-trivial, because of the way the LALR parser construction algorithm combines states. The result is that the parser might not "know" that the lookahead token cannot be shifted until it has done one or more reductions. So by the time it figures out that the lookahead token is not valid, it has already performed reductions which would have to be undone in order to continue the parse without the lookahead token.
Most parser generators compound the problem by removing error actions corresponding to a lookahead token if the default action in the state for that token is a reduction. The effect is again to delay detection of the error until after one or more futile reductions, but it has the benefit of significantly reducing the size of the transition table (since default entries don't need to be stored explicitly). Since the delayed error will be detected before any more input is consumed, the delay is generally considered acceptable. (Bison has an option to prevent this optimisation, however.)
As a practical illustration, here's a very simple expression grammar with only two operators:
prog: expr '\n' | prog expr '\n'
expr: prod | expr '+' prod
prod: term | prod '*' term
term: ID | '(' expr ')'
That leads to this state diagram [Note 2]:
Let's suppose that we wanted to ignore newlines pythonically, allowing the input
(
a + b
)
That means that the parser must ignore the newline after the b, since the input might be
(
a + b
* c
)
(Which is fine in Python but not, if I understand correctly, in MAXScript.)
Of course, the newline would be recognised as a statement separator if the input were not parenthesized:
a + b
Looking at the state diagram, we can see that the parser will end up in State 15 after the b is read, whether or not the expression is parenthesized. In that state, a newline is marked as a valid lookahead for the reduction action, so the reduction action will be performed, presumably creating an AST node for the sum. Only after this reduction will the parser notice that there is no action for the newline. If it now discards the newline character, it's too late; there is now no way to reduce b * c in order to make it an operand of the sum.
Bison does allow you to request a Canonical LR parser, which does not combine states. As a result, the state machine is much, much bigger; so much so that Canonical-LR is still considered impractical for non-toy grammars. In the simple two-operator expression grammar above, asking for a Canonical LR parser only increases the state count from 16 to 26, as shown here:
In the Canonical LR parser, there are two different states for the reduction term: term '+' prod. State 16 applies at the top-level, and thus the lookahead includes newline but not ) Inside parentheses the parser will instead reach state 26, where ) is a valid lookahead but newline is not. So, at least in some grammars, using a Canonical LR parser could make the prediction more precise. But features which are dependent on the use of a mammoth parsing automaton are not particularly practical.
One alternative would be for the parser to react to the newline by first simulating the reduction actions to see if a shift would eventually succeed. If you request Lookahead Correction (%define parse.lac full), bison will insert code to do precisely this. This code can create significant overhead, but many people request it anyway because it makes verbose error messages more accurate. So it would certainly be possible to repurpose this code to do token fallback handling, but no-one has actually done so, as far as I know.
Notes:
A similar question which comes up from time to time is whether you can tell bison to cause a token to be reclassified to a fallback token if there is no possibility to shift the token. (That would be useful for parsing languages like SQL which have a lot of non-reserved keywords.)
I generated the state graphs using Bison's -g option:
bison -o ex.tab.c --report=all -g ex.y
dot -Tpng -oex.png ex.dot
To produce the Canonical LR, I defined lr.type to be canonical-lr:
bison -o ex_canon.c --report=all -g -Dlr.type=canonical-lr ex.y
dot -Tpng -oex_canon.png ex_canon.dot

ANTLR4 - How to tokenize differently inside quotes?

I am defining an ANTLR4 grammar and I'd like it to tokenize certain - but not all - things differently when they appear inside double-quotes than when they appear outside double-quotes. Here's the grammar I have so far:
grammar SimpleGrammar;
AND: '&';
TERM: TERM_CHAR+;
PHRASE_TERM: (TERM_CHAR | '%' | '&' | ':' | '$')+;
TRUNCATION: TERM '!';
WS: WS_CHAR+ -> skip;
fragment TERM_CHAR: 'a' .. 'z' | 'A' .. 'Z';
fragment WS_CHAR: [ \t\r\n];
// Parser rules
expr:
expr AND expr
| '"' phrase '"'
| TERM
| TRUNCATION
;
phrase:
(TERM | PHRASE_TERM | TRUNCATION)+
;
The above grammar works when parsing a! & b, which correctly parses to:
AND
/ \
/ \
a! b
However, when I attempt to parse "a! & b", I get:
line 1:4 extraneous input '&' expecting {'"', TERM, PHRASE_TERM, TRUNCATION}
The error message makes sense, because the & is getting tokenized as AND. What I would like to do, however, is have the & get tokenized as a PHRASE_TERM when it appears inside of double-quotes (inside a "phrase"). Note, I do want the a! to tokenize as TRUNCATION even when it appears inside the phrase.
Is this possible?
It is possible if you use lexer modes. It is possible to change mode after encounter of specific token. But lexer rules must be defined separately, not in combined grammar.
In your case, after encountering quote, you will change mode and after encountering another quote, you will change mode back to the default one.
LBRACK : '[' -> pushMode(CharSet);
RBRACK : ']' -> popMode;
For more information google 'ANTLR lexer Mode'

Entry rule position convention in BNF?

Is it mandatory for the first (topmost) rule of an BNF (or EBNF) grammar to represent the entry point? For example, from the wikipedia BNF page, the US Postal address grammar below has <postal-address> as the first derivation rule, and also the entry point:
<postal-address> ::= <name-part> <street-address> <zip-part>
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
| <personal-part> <name-part>
<personal-part> ::= <first-name> | <initial> "."
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""
Am I at liberty to put the <postal-address> rule in, say, the second position, and so provide the grammar in the following alternate form:
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
| <personal-part> <name-part>
<postal-address> ::= <name-part> <street-address> <zip-part>
<personal-part> ::= <first-name> | <initial> "."
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""
No, this isn't a requirement. It is just a convention used by some.
In practice, one must designate "the" goal rule. We have set of tools in which one identifies the nonterminal which is the goal nonterminal, and you can provide the rules (including goal rules) in any order. How you designate that may be outside the grammar formalism, or may be a special rule included in the grammar.
As a practical matter, this is not a big deal (OK, so some tool insists you put all the goal rules first, not actually that hard) and not that hard to do nicely (ok, the tool checks the left hand side of a grammar rule to see if it matches the goal nonterminal).
Of course, you need to know which way your tool works, but that takes about 2 minutes to figure out.
Some tools only allow one goal rule. As a practical matter, real (re-engineering, see my bio) parsers often find it useful to allow multiple rules (consider parsing COBOL as "whole programs" and as "COPYLIBS"), so you end up writing (clumsily IMHO):
G = G1 | G2 | G3 ... ;
G1 = ...
in this case. Still not a big deal. None of these constraints hurt expressiveness or in fact cost you much engineering time.

Resources