Entry rule position convention in BNF? - parsing

Is it mandatory for the first (topmost) rule of an BNF (or EBNF) grammar to represent the entry point? For example, from the wikipedia BNF page, the US Postal address grammar below has <postal-address> as the first derivation rule, and also the entry point:
<postal-address> ::= <name-part> <street-address> <zip-part>
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
| <personal-part> <name-part>
<personal-part> ::= <first-name> | <initial> "."
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""
Am I at liberty to put the <postal-address> rule in, say, the second position, and so provide the grammar in the following alternate form:
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
| <personal-part> <name-part>
<postal-address> ::= <name-part> <street-address> <zip-part>
<personal-part> ::= <first-name> | <initial> "."
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""

No, this isn't a requirement. It is just a convention used by some.
In practice, one must designate "the" goal rule. We have set of tools in which one identifies the nonterminal which is the goal nonterminal, and you can provide the rules (including goal rules) in any order. How you designate that may be outside the grammar formalism, or may be a special rule included in the grammar.
As a practical matter, this is not a big deal (OK, so some tool insists you put all the goal rules first, not actually that hard) and not that hard to do nicely (ok, the tool checks the left hand side of a grammar rule to see if it matches the goal nonterminal).
Of course, you need to know which way your tool works, but that takes about 2 minutes to figure out.
Some tools only allow one goal rule. As a practical matter, real (re-engineering, see my bio) parsers often find it useful to allow multiple rules (consider parsing COBOL as "whole programs" and as "COPYLIBS"), so you end up writing (clumsily IMHO):
G = G1 | G2 | G3 ... ;
G1 = ...
in this case. Still not a big deal. None of these constraints hurt expressiveness or in fact cost you much engineering time.

Related

How would I implement operator-precedence in my grammar?

I'm trying to make an expression parser and although it works, it does calculations chronologically rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features but right now I'm not sure how to make division precede multiplication, etc.. How should I modify my grammar to implement operator-precedence?
Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
Here ()* means zero or more times. This grammar will produce right associative trees and it is deterministic and unambiguous. The multiplication and the division are with the same priority. The addition and subtraction also.

Handling whitespace in EBNF

Let's say I have the following EBNF defined for a simpler two-term adder:
<expression> ::= <number> <plus> <number>
<number> ::= [0-9]+
<plus> ::= "+"
Shown here.
What would be the proper way to allow any amount of whitespace except a newline/return between the terms? For example to allow:
1 + 2
1 <tab> + 2
1 + 2
etc.
For example, doing something like the following fails:
<whitespace>::= " " | \t
Furthermore, it seems (almost) every term would be preceded and followed by an optional space. Something like:
<plus> ::= <whitespace>? "+" <whitespace>?
How would that be properly addressed?
The XML standard, as an example, uses the following production for whitespace:
S ::= (#x20 | #x9 | #xD | #xA)+
You could omit CR (#xD) and LF (#xA) if you don't want those.
Regarding your observation that grammars could become overwhelmed by whitespace non-terminals, note that whitespace handling can be done in lexical analysis rather than in parsing. See EBNF Grammar for list of words separated by a space.

Find out if there is a theory fow writing expression section of BNF Grammar

I want to write a new parser for mathematical program. When i write BNF grammar for that, I'm stuck in expressions. And this is the BNF notation I wrote.
program : expression*
expression: additive_expression
additive_expression : multiplicative_expression
| additive_expression '+' multiplicative_expression
| additive_expression '-' multiplicative_expression
multiplicative_expression : number
| number '*' multiplicative_expression
| number '/' multiplicative_expression
But i cannot understand how to write BNF expression grammar for ++, --, +=, -=, &&, || etc. this operators. I mean operators in languages like C, C++, C#, Python, Java etc.
I know that when using +, -, *, / operators, grammar should be written according to the BODMAS theory.
What i want to know is whether any theory should be used in writing grammar for other operators?
I took a good look at the C++ Language grammar ( expressions part ). But I can not understand it.
<expression> ::= <assignment_expression>
| <assignment_expression> <expression>
<assignment_expression> ::= <logical_or_expression> "=" <assignment_expression>
| <logical_or_expression> "+=" <assignment_expression>
| <logical_or_expression> "-=" <assignment_expression>
| <logical_or_expression> "*=" <assignment_expression>
| <logical_or_expression> "/=" <assignment_expression>
| <logical_or_expression> "%=" <assignment_expression>
| <logical_or_expression> "<<=" <assignment_expression>
| <logical_or_expression> ">>=" <assignment_expression>
| <logical_or_expression> "&=" <assignment_expression>
| <logical_or_expression> "|=" <assignment_expression>
| <logical_or_expression> "^=" <assignment_expression>
| <logical_or_expression>
<constant_expression> ::= <conditional_expression>
<conditional_expression> ::= <logical_or_expression>
<logical_or_expression> ::= <logical_or_expression> "||" <logical_and_expression>
| <logical_and_expression>
<logical_and_expression> ::= <logical_and_expression> "&&" <inclusive_or_expression>
| <inclusive_or_expression>
<inclusive_or_expression> ::= <inclusive_or_expression> "|" <exclusive_or_expression>
| <exclusive_or_expression>
<exclusive_or_expression> ::= <exclusive_or_expression> "^" <and_expression>
| <and_expression>
<and_expression> ::= <and_expression> "&" <equality_expression>
| <equality_expression>
<equality_expression> ::= <equality_expression> "==" <relational_expression>
| <equality_expression> "!=" <relational_expression>
| <relational_expression>
<relational_expression> ::= <relational_expression> ">" <shift_expression>
| <relational_expression> "<" <shift_expression>
| <relational_expression> ">=" <shift_expression>
| <relational_expression> "<=" <shift_expression>
| <shift_expression>
<shift_expression> ::= <shift_expression> ">>" <addictive_expression>
| <shift_expression> "<<" <addictive_expression>
| <addictive_expression>
<addictive_expression> ::= <addictive_expression> "+" <multiplicative_expression>
| <addictive_expression> "-" <multiplicative_expression>
| <multiplicative_expression>
<multiplicative_expression> ::= <multiplicative_expression> "*" <unary_expression>
| <multiplicative_expression> "/" <unary_expression>
| <multiplicative_expression> "%" <unary_expression>
| <unary_expression>
<unary_expression> ::= "++" <unary_expression>
| "--" <unary_expression>
| "+" <unary_expression>
| "-" <unary_expression>
| "!" <unary_expression>
| "~" <unary_expression>
| "size" <unary_expression>
| <postfix_expression>
<postfix_expression> ::= <postfix_expression> "++"
| <postfix_expression> "--"
| <primary_expression>
<primary_expression> ::= <integer_literal>
| <floating_literal>
| <character_literal>
| <string_literal>
| <boolean_literal>
| "(" <expression> ")"
| IDENTIFIER
<integer_literal> ::= INTEGER
<floating_literal> ::= FLOAT
<character_literal> ::= CHARACTER
<string_literal> ::= STRING
<boolean_literal> ::= "true"
| "false"
Can you help me understand this? I searched a lot on the internet about this. But I could not find the right solution for this.
Thank you
I gather that the problem is not that you don't know how to write BNF grammars. Rather, you don't know which grammar you should write. In other words, if someone told you the precedence order for your operators, you would have no trouble writing down a grammar which parsed the operators with that particular precedence order. But you don't know which precedence order you should use.
Unfortunately there is not an International High Commission on Operator Precedence, and neither does any religion that I know of offer a spiritual reference including divinely inspired operator precedence rules.
For some operators, the precedence order is reasonably clear: BODMAS was adopted for good reasons, for example, but the main reason is that most people already wrote arithmetic according to those rules. You can certainly make a plausible mathematical argument based on group theory as to why it seems natural to give multiplication precedence over addition, but there will always be the suspicion that the argument was produced post facto to justify an already-made decision. Be that as it may, the argument for making multiplication bind more tightly than addition also works for giving bitwise-and precedence over bitwise-or, or boolean-and precedence over boolean-or. (Although I note that C compiler writers don't trust that programmers will have that particular intuition, since most modern compilers issue a warning if you omit redundant parentheses in such expressions.)
One precedence relationship which I think just about everyone would agree was intuitive is giving arithmetic operators precedence over comparison operators, and comparison operators precedence operators precedence over boolean operators. Most programmers would find a language which chose to interpret a > b + 7 as meaning "add seven to the boolean value of comparing a with b". Similarly, it might be considered outrageous to interpret a > 0 || b > 0 in any way other than (a > 0) || (b > 0). But other precedence choices seem a lot more arbitrary, and not all languages make them the same. (For example, the relative precedence of "not equal to" and "greater than", or of "exclusive or" and "inclusive or".
So what's a novice language designer to do? Well, first appeal to your own intuitions, particularly if you have used the operators in question a lot. Also, ask your friends, contacts, and potential language users what they think. Look at what other languages have done, but look with a critical eye. Search for complaints on language-specific forums. It may well be that the original language designer made an unfortunate (or even stupid) choice, and that choice now cannot be changed because it would break too many existing programs. (Which is why it is a good thing that you are worrying about operator precedence: getting it wrong can bring serious future problems.)
Direct experimentation can help, too, particularly with operators which are rarely used together. Define a plausible expression using the operators, and then write it out two (or more) times leaving out parentheses according to the various possible rules. Which of those looks more understandable? (Again, recruit your friends and ask them the same question.)
If, in the end, you really cannot decide what the precedence order between a particular pair of operators is, you can consider one more solution: Make parentheses mandatory for these operators. (Which is the intent of the C compiler whining about leaving out redundant parentheses in boolean expressions.)

LR(1) Parser: Why adding an epsilon production makes r/r or s/r conflicts

I wanted to make a reader which reads configuration files similar to INI files for mswin.
It is for exercise to teach myself using a lexer/parser generator which I made.
The grammar is:
%lexer
HEADER ::= "\\[[0-9a-zA-Z]+\\]"
TRUE ::= "yes|true"
FALSE ::= "no|false"
ASSIGN ::= "="
OPTION_NAME ::= "[a-zA-Z][0-9a-zA-Z]*"
INT ::= "[0-9]+"
STRING ::= "\"(\\\"|[^\"])*\""
CODE ::= "<{(.*)}>"
BLANK ::= "[ \t\f]+" :ignore
COMMENT ::= "#[^\n\r]*(\r|\n)?" :ignore
NEWLINE ::= "\r|\n"
%parser
Options ::= OptionGroup Options | OptionGroup | #epsilon#
OptionGroup ::= HEADER NEWLINE OptionsList
OptionsList ::= Option NEWLINE OptionsList | Option
Option ::= OPTION_NAME ASSIGN OptionValue
OptionValue ::= TRUE | FALSE | INT | STRING | CODE
The problem lies in the #epsilon# production. I added it because I want my reader to accept also empty files. But I'm getting conflicts when 'OptionsList' or 'OptionGroup' contains an epsilon production. I tried rearrange elements in productions, but I'm only getting conflicts (r/r or s/r, depending of what I did), unless I remove the epsilon completely from my grammar. It removes the problem, but...in my logic one of 'OptionsList' or 'OptionGroup' should contain an epsilon, otherwise my goal to accepting empty files is not met.
My parser generator uses LR(1) method, so I thought I can use epsilon productions in my grammar. It seems I'm good at writing generators, but not in constructing error-less grammars :(.
Should I forget about epsilons? Or is my grammar accepting empty inputs even when there is no epsilon production?
Your Options production allows an Options to be a sequence of OptionGroups, starting with either an empty list or a list consisting of a single element. That's obviously ambiguous, because a list of exactly one OptionGroup could be:
The base case OptionGroup
The base case #epsilon# with the addition of an OptionGroup.
In short, instead of
Options ::= OptionGroup Options | OptionGroup | #epsilon#
you need
Options ::= OptionGroup Options | #epsilon#
which matches exactly the same set of sentences, but unambiguously.
In general terms, you are usually better off writing left-recursive rules for bottom-up parsers. So I would have written
Options ::= Options OptionGroup | #epsilon#

Parse grammar alternating and repeating

I was able to add support to my parser's grammar for alternating characters (e.g. ababa or baba) by following along with this question.
I'm now looking to extend that by allowing repeats of characters.
For example, I'd like to be able to support abaaabab and aababaaa as well. In my particular case, only the a is allowed to repeat but a solution that allows for repeating b's would also be useful.
Given the rules from the other question:
expr ::= A | B
A ::= "a" B | "a"
B ::= "b" A | "b"
... I tried extending it to support repeats, like so:
expr ::= A | B
# support 1 or more "a"
A_one_or_more = A_one_or_more "a" | "a"
A ::= A_one_or_more B | A_one_or_more
B ::= "b" A | "b"
... but that grammar is ambiguous. Is it possible for this to be made unambiguous, and if so could anyone help me disambiguate it?
I'm using the lemon parser which is an LALR(1) parser.
The point of parsing, in general, is to parse; that is, determine the syntactic structure of an input. That's significantly different from simply verifying that an input belongs to a language.
For example, the language which consists of arbitrary repetitions of a and b can be described with the regular expression (a|b)*, which can be written in BNF as
S ::= /* empty */ | S a | S b
But that probably does not capture the syntactic structure you are trying to defind. On the other hand, since you don't specify that structure, it is hard to know.
Here are a couple more possibilities, which build different parse trees:
S ::= E | S E
E ::= A b | E b
A ::= a | A a
S ::= E | S E
E ::= A B
A ::= a | A a
B ::= b | B b
When writing a grammar to parse a language, it is useful to start by drawing your proposed parse trees. Usually, you can write the grammar directly from the form of the trees, which shows that a formal grammar is primarily a documentation tool, since it clearly describes the language in a way that informal descriptions cannot. Using a parser generator to turn that grammar into a parser ensures that the parser implements the described language. Or, at least, that is the goal.
Here is a nice tool for checking your grammar online http://smlweb.cpsc.ucalgary.ca/start.html. It actually accepts the grammar you provided as a valid LALR(1) grammar.
A different LALR(1) grammar, that allows reapeating a's, would be:
S ::= "a" S | "a" | "b" A | "b"
A ::= "a" S .

Resources