What is the start symbol in this grammar? - parsing

What is the start symbol?
Based on some research: "The start symbol we choose should allow the grammar to parse the most input sentences."
Clearly <Var> is NOT the start symbol, as it would parse the fewest input sentences. So is the start symbol <One> or <Group>?
<Group> ::= [ <One>, <Group> ] | <One>
<One> ::= <Var> | ( <Group> )
<Var> ::= a | b | c

The start symbol is also called the AXIOM.
It is always given explicitly. It should never be deduced. It is decided by the author of the grammar.

Related

Handling whitespace in EBNF

Let's say I have the following EBNF defined for a simpler two-term adder:
<expression> ::= <number> <plus> <number>
<number> ::= [0-9]+
<plus> ::= "+"
Shown here.
What would be the proper way to allow any amount of whitespace except a newline/return between the terms? For example to allow:
1 + 2
1 <tab> + 2
1 + 2
etc.
For example, doing something like the following fails:
<whitespace>::= " " | \t
Furthermore, it seems (almost) every term would be preceded and followed by an optional space. Something like:
<plus> ::= <whitespace>? "+" <whitespace>?
How would that be properly addressed?
The XML standard, as an example, uses the following production for whitespace:
S ::= (#x20 | #x9 | #xD | #xA)+
You could omit CR (#xD) and LF (#xA) if you don't want those.
Regarding your observation that grammars could become overwhelmed by whitespace non-terminals, note that whitespace handling can be done in lexical analysis rather than in parsing. See EBNF Grammar for list of words separated by a space.
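As a rough illustration of handling whitespace in lexical analysis, here is a minimal Python sketch of a tokenizer for the two-term adder. It discards spaces and tabs before the parser sees them and rejects newlines. The token names NUMBER, PLUS and WS and the function names are assumptions made for this example, not part of the question.

```python
import re

# One named group per token kind. WS covers spaces and tabs only,
# so a newline or carriage return matches nothing and is rejected.
TOKEN_RE = re.compile(r"(?P<NUMBER>[0-9]+)|(?P<PLUS>\+)|(?P<WS>[ \t]+)")


def tokenize(text):
    """Split text into (kind, value) pairs, dropping spaces and tabs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_expression(text):
    """Check <expression> ::= <number> <plus> <number> on the token stream."""
    kinds = [kind for kind, _ in tokenize(text)]
    return kinds == ["NUMBER", "PLUS", "NUMBER"]
```

With this split, the grammar rule itself never mentions whitespace: `parse_expression("1 \t+  2")` is accepted, while an input containing a newline raises `SyntaxError` in the tokenizer.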

Is This Grammar SLR(1)?

I used this tool to generate the SLR(1) parsing table for this LL(1)/LR(1) grammar (which generates a small subset of XML):
document ::= element EOF
element ::= < elementPrefix
elementPrefix ::= NAME attribute elementSuffix
attribute ::= NAME = STRING attribute
attribute ::= EPSILON
elementSuffix ::= > elementOrData endTag
elementSuffix ::= />
elementOrData ::= < elementPrefix elementOrData
elementOrData ::= DATA elementOrData
elementOrData ::= EPSILON
endTag ::= </ NAME >
The tool correctly generates the table and associated automaton, which suggests that the grammar is SLR(1). Is that really the case? I understand that every LR(0) grammar is also SLR(1), but I was not sure how that relates to LL(1)/LR(1) grammars.
LL(1) and SLR(1) are both subsets of LR(1). They don't have a simple relationship to each other.

Is this a LL(1) grammar?

Considering the following grammar for propositional logic:
<A> ::= <B> <-> <A> | <B>
<B> ::= <C> -> <B> | <C>
<C> ::= <D> \/ <C> | <D>
<D> ::= <E> /\ <D> | <E>
<E> ::= <F> | -<F>
<F> ::= <G> | <H>
<G> ::= (<A>)
<H> ::= p | q | r | ... | z
Precedence for connectives is: -, /\, \/, ->, <->.
Associativity is also considered; for example, p\/q\/r should be the same as p\/(q\/r). The same holds for the other connectives.
I intend to write a predictive top-down parser in Java. I don't see ambiguity or direct left recursion here, but I'm not sure whether that is all I need to consider this an LL(1) grammar. Maybe indirect left recursion?
If this is not an LL(1) grammar, what steps would be required to transform it for my purposes?
It's not LL(1). Here's why:
The first rule of an LL(1) grammar is:
A grammar G is LL(1) if and only if whenever A --> C | D are two distinct productions of G, the following conditions hold:
For no terminal a do both C and D derive strings beginning with a.
This rule exists so that the parser never has to choose between two productions on the same lookahead token. If it is violated, then on encountering a (, for example, the parser won't know which production to use.
Your grammar violates this first rule. In <A>, <B>, <C> and <D>, both alternatives of the same production (your Cs and Ds) begin with the same nonterminal, which eventually derives <G> and <H>, so both alternatives derive at least one string beginning with (.
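The overlap can be checked mechanically. The following Python sketch computes FIRST sets for the grammar in the question and reports the nonterminals whose alternatives share a leading terminal. It assumes the grammar has no empty productions (true here), abbreviates <H> to p | q | r, and uses function names chosen for this example.

```python
# Grammar from the question; H is abbreviated to p | q | r.
GRAMMAR = {
    "A": [["B", "<->", "A"], ["B"]],
    "B": [["C", "->", "B"], ["C"]],
    "C": [["D", "\\/", "C"], ["D"]],
    "D": [["E", "/\\", "D"], ["E"]],
    "E": [["F"], ["-", "F"]],
    "F": [["G"], ["H"]],
    "G": [["(", "A", ")"]],
    "H": [["p"], ["q"], ["r"]],
}


def first_of_symbol(symbol, grammar, seen=frozenset()):
    """FIRST set of one symbol, assuming there are no empty productions."""
    if symbol not in grammar:      # anything without a rule is a terminal
        return {symbol}
    if symbol in seen:             # guard against left recursion
        return set()
    result = set()
    for alternative in grammar[symbol]:
        result |= first_of_symbol(alternative[0], grammar, seen | {symbol})
    return result


def first_conflicts(grammar):
    """Map each nonterminal to the terminals shared by two of its alternatives."""
    conflicts = {}
    for head, alternatives in grammar.items():
        firsts = [first_of_symbol(alt[0], grammar) for alt in alternatives]
        shared = set()
        for i in range(len(firsts)):
            for j in range(i + 1, len(firsts)):
                shared |= firsts[i] & firsts[j]
        if shared:
            conflicts[head] = shared
    return conflicts
```

Calling `first_conflicts(GRAMMAR)` reports conflicts for A, B, C and D only; the alternatives of E, F and H are already distinguishable on one token. The usual remedy is left-factoring, for example rewriting <A> as <B> <A'> with <A'> ::= <-> <A> | empty. That introduces empty productions, which this sketch does not handle, so the factored grammar would also need its FOLLOW sets checked.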

Entry rule position convention in BNF?

Is it mandatory for the first (topmost) rule of a BNF (or EBNF) grammar to represent the entry point? For example, from the Wikipedia BNF page, the US Postal address grammar below has <postal-address> as the first derivation rule, and also the entry point:
<postal-address> ::= <name-part> <street-address> <zip-part>
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
| <personal-part> <name-part>
<personal-part> ::= <first-name> | <initial> "."
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""
Am I at liberty to put the <postal-address> rule in, say, the second position, and so provide the grammar in the following alternate form:
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
| <personal-part> <name-part>
<postal-address> ::= <name-part> <street-address> <zip-part>
<personal-part> ::= <first-name> | <initial> "."
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""
No, this isn't a requirement. It is just a convention used by some.
In practice, one must designate "the" goal rule. We have a set of tools in which one identifies the goal nonterminal explicitly, and you can provide the rules (including goal rules) in any order. How you designate the goal may be outside the grammar formalism, or it may be a special rule included in the grammar.
As a practical matter, this is not a big deal (OK, so some tool insists you put all the goal rules first; that is not actually hard) and not hard to do nicely (OK, the tool checks the left-hand side of each grammar rule to see if it matches the goal nonterminal).
Of course, you need to know which way your tool works, but that takes about 2 minutes to figure out.
Some tools only allow one goal rule. As a practical matter, real (re-engineering, see my bio) parsers often find it useful to allow multiple goal rules (consider parsing COBOL as "whole programs" and as "COPYLIBS"), so you end up writing (clumsily, IMHO):
G = G1 | G2 | G3 ... ;
G1 = ...
in this case. Still not a big deal. None of these constraints hurt expressiveness or in fact cost you much engineering time.
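To make the idea concrete, here is a small Python sketch in which the goal nonterminal is named explicitly, so the order of the rules carries no meaning. The `Grammar` class and the `with_multiple_goals` helper are hypothetical names invented for this illustration; they do not correspond to any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    rules: dict   # nonterminal -> list of alternatives (each a list of symbols)
    start: str    # goal nonterminal, designated by name rather than by position

    def __post_init__(self):
        if self.start not in self.rules:
            raise ValueError(f"goal nonterminal {self.start!r} has no rule")


def with_multiple_goals(rules, goals, goal_name="G"):
    """Wrap several goals in one synthetic rule: G = G1 | G2 | ..."""
    if goal_name in rules:
        raise ValueError(f"{goal_name!r} is already defined")
    missing = [goal for goal in goals if goal not in rules]
    if missing:
        raise ValueError(f"undefined goal nonterminals: {missing}")
    combined = dict(rules)
    combined[goal_name] = [[goal] for goal in goals]
    return Grammar(combined, goal_name)
```

Two `Grammar` objects built from the same rules in a different order, with the same `start`, compare equal, which is the sense in which rule position is only a convention.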

Handling Token Ambiguity in JavaCC

I'm attempting to write a parser in JavaCC that can recognize a language that has some ambiguity at the token level. In this particular case the language supports the "/" token by itself as a division operator while it also supports regular expression literals.
Consider the following JavaCC grammar:
TOKEN :
{
...
< VAR : "var" > |
< DIV : "/" > |
< EQUALS : "=" > |
< SEMICOLON : ";" > |
...
}
TOKEN :
{
< IDENTIFIER : <IDENTIFIER_START> (<IDENTIFIER_START> | <IDENTIFIER_CHAR>)* > |
< #IDENTIFIER_START : ( [ "$","_","A"-"Z","a"-"z" ] )> |
< #IDENTIFIER_CHAR : ( [ "$","_","A"-"Z","a"-"z","0"-"9" ] ) > |
< REGEX_LITERAL : ("/" <REGEX_BODY> "/" ( <REGEX_FLAGS> )? ) > |
< #REGEX_BODY : ( <REGEX_FIRST_CHAR> <REGEX_CHARS> ) > |
< #REGEX_CHARS : ( <REGEX_CHAR> )* > |
< #REGEX_FIRST_CHAR : ( ~["\r", "\n", "*", "/", "\\"] | <BACKSLASH_SEQUENCE> ) > |
< #REGEX_CHAR : ( ~[ "\r", "\n", "/", "\\" ] | <BACKSLASH_SEQUENCE> ) > |
< #BACKSLASH_SEQUENCE : ("\\" ~[ "\r", "\n"] ) > |
< #REGEX_FLAGS : ( <IDENTIFIER_CHAR> )* >
}
Given the following code:
var y = a/b/c;
Two different sets of tokens could be generated. The token stream could be either:
<VAR> <IDENTIFIER> <EQUALS> <IDENTIFIER> <DIV> <IDENTIFIER> <DIV> <IDENTIFIER> <SEMICOLON>
or
<VAR> <IDENTIFIER> <EQUALS> <IDENTIFIER> <REGEX_LITERAL> <SEMICOLON>
How can I ensure that the TokenManager generates the token stream that I expect for this case?
JavaCC always consumes the longest token available, and there is no way to configure it otherwise. The only way to accomplish this is to add a lexical state, say IGNORE_REGEX, that excludes the token in question, in this case <REGEX_LITERAL>. Then, whenever a token is recognized that cannot be followed by a <REGEX_LITERAL>, the lexical state must be switched to IGNORE_REGEX.
With the input:
var y = a/b/c
The following would occur:
<VAR> is consumed, lexical state is set to DEFAULT
<IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
<EQUALS> is consumed, lexical state is set to DEFAULT
<IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
At this point there is an ambiguity: either a <DIV> or a <REGEX_LITERAL> could be consumed. Since the lexical state is IGNORE_REGEX, and that state does not match <REGEX_LITERAL>, a <DIV> is consumed.
<DIV> is consumed, lexical state is set to DEFAULT
<IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
<DIV> is consumed, lexical state is set to DEFAULT
<IDENTIFIER> is consumed, lexical state is set to IGNORE_REGEX
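The state switching in the walkthrough above can be sketched outside JavaCC. The following Python code is only an illustration of the technique, not JavaCC syntax: the token patterns are simplified versions of those in the question, whitespace skipping is assumed, and the set of tokens that switch to IGNORE_REGEX contains only IDENTIFIER because that is the only case the walkthrough shows.

```python
import re

# Simplified versions of the token definitions in the question.
COMMON = [
    ("VAR", re.compile(r"var")),
    ("IDENTIFIER", re.compile(r"[$_A-Za-z][$_A-Za-z0-9]*")),
    ("EQUALS", re.compile(r"=")),
    ("SEMICOLON", re.compile(r";")),
    ("DIV", re.compile(r"/")),
]
REGEX_LITERAL = (
    "REGEX_LITERAL",
    re.compile(r"/(?:[^\r\n*/\\]|\\[^\r\n])(?:[^\r\n/\\]|\\[^\r\n])*/[$_A-Za-z0-9]*"),
)

# Each lexical state lists the tokens it is allowed to produce.
STATES = {
    "DEFAULT": COMMON + [REGEX_LITERAL],
    "IGNORE_REGEX": COMMON,
}

# Tokens that cannot be followed by a regex literal (assumption: the
# walkthrough only shows IDENTIFIER; a real grammar would list more).
NO_REGEX_AFTER = {"IDENTIFIER"}


def tokenize(text):
    """Return the token kinds for text, switching state after every token."""
    state, pos, kinds = "DEFAULT", 0, []
    while pos < len(text):
        if text[pos].isspace():    # assumed: whitespace is skipped
            pos += 1
            continue
        best_kind, best_end = None, pos
        for kind, pattern in STATES[state]:
            match = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier definition is kept.
            if match and match.end() > best_end:
                best_kind, best_end = kind, match.end()
        if best_kind is None:
            raise SyntaxError(f"no token matches at offset {pos}")
        kinds.append(best_kind)
        pos = best_end
        state = "IGNORE_REGEX" if best_kind in NO_REGEX_AFTER else "DEFAULT"
    return kinds
```

In this sketch `tokenize("var y = a/b/c;")` yields two DIV tokens, because each `/` follows an identifier, whereas `tokenize("var y = /b/c;")` yields a single REGEX_LITERAL, because the `/` follows `=`.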
As far as I remember (I worked with JavaCC some time back), the order in which you write the rules matters: when two token definitions match text of the same length, the one written first is chosen. So write your rules in an order that generates the tokens you want, keeping in mind that a longer match still wins over rule order.
Since JavaScript/EcmaScript does the same thing (that is, it contains regex literals and a divide operator that look just like those in your examples) you might want to look for an existing JavaCC grammar to learn from. I found one linked to from this blog entry, there may be others.
