I used this tool to generate the SLR(1) parsing table for this LL(1)/LR(1) grammar (which generates a small subset of XML):
document ::= element EOF
element ::= < elementPrefix
elementPrefix ::= NAME attribute elementSuffix
attribute ::= NAME = STRING attribute
attribute ::= EPSILON
elementSuffix ::= > elementOrData endTag
elementSuffix ::= />
elementOrData ::= < elementPrefix elementOrData
elementOrData ::= DATA elementOrData
elementOrData ::= EPSILON
endTag ::= </ NAME >
The tool correctly generates the table and associated automaton, which suggests that the grammar is SLR(1). Is that really the case? I understand that every LR(0) grammar is also SLR(1), but I was not sure how that relates to LL(1)/LR(1) grammars.
LL(1) and SLR(1) are both proper subsets of LR(1), but neither contains the other. Left recursion, for instance, immediately rules out LL(1) but is perfectly acceptable to an SLR(1) constructor, while the classic grammar S ::= A a A b | B b B a with A ::= EPSILON and B ::= EPSILON (from the Dragon Book) is LL(1) but not SLR(1). So knowing a grammar is LL(1) tells you nothing either way about SLR(1). The real test is the one your tool already ran: if the SLR(1) table it built contains no shift/reduce or reduce/reduce conflicts, then the grammar is SLR(1) by definition.
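If you also want to double-check the LL(1) part of that description, that condition is easy to test mechanically. Below is a minimal Python sketch (the grammar is hard-coded as a dict, an empty right-hand side stands for EPSILON, and None marks epsilon inside the set computations); it computes FIRST and FOLLOW and reports any terminal that predicts two different alternatives of the same nonterminal. It is an LL(1) check only, not an SLR(1) one.

GRAMMAR = {
    "document":      [["element", "EOF"]],
    "element":       [["<", "elementPrefix"]],
    "elementPrefix": [["NAME", "attribute", "elementSuffix"]],
    "attribute":     [["NAME", "=", "STRING", "attribute"], []],
    "elementSuffix": [[">", "elementOrData", "endTag"], ["/>"]],
    "elementOrData": [["<", "elementPrefix", "elementOrData"],
                      ["DATA", "elementOrData"], []],
    "endTag":        [["</", "NAME", ">"]],
}
NONTERMS = set(GRAMMAR)

def first_of_seq(seq, first):
    """FIRST of a symbol sequence; None means the sequence can derive epsilon."""
    out = set()
    for sym in seq:
        sym_first = first[sym] if sym in NONTERMS else {sym}
        out |= sym_first - {None}
        if None not in sym_first:
            return out
    out.add(None)
    return out

# FIRST sets, by fixpoint iteration.
first = {nt: set() for nt in NONTERMS}
changed = True
while changed:
    changed = False
    for nt, prods in GRAMMAR.items():
        for prod in prods:
            new = first_of_seq(prod, first)
            if not new <= first[nt]:
                first[nt] |= new
                changed = True

# FOLLOW sets, by fixpoint iteration ("$" marks end of input, "document" is the start).
follow = {nt: set() for nt in NONTERMS}
follow["document"].add("$")
changed = True
while changed:
    changed = False
    for nt, prods in GRAMMAR.items():
        for prod in prods:
            for i, sym in enumerate(prod):
                if sym not in NONTERMS:
                    continue
                tail = first_of_seq(prod[i + 1:], first)
                new = (tail - {None}) | (follow[nt] if None in tail else set())
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True

# LL(1) check: each terminal may predict at most one alternative per nonterminal.
for nt, prods in GRAMMAR.items():
    seen = {}
    for prod in prods:
        f = first_of_seq(prod, first)
        predict = (f - {None}) | (follow[nt] if None in f else set())
        for tok in predict:
            if tok in seen and seen[tok] != prod:
                print(f"LL(1) conflict in {nt} on {tok!r}: {seen[tok]} vs {prod}")
            seen[tok] = prod

print("check finished")  # for this grammar, no conflicts are reported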
I could capture a parenthetical group using something like:
expr ::= "(" <something> ")"
However, sometimes it's useful to have multiple levels of nesting, so it's (theoretically) possible to have more than one pair of parentheses, as long as they match. For example:
>>> (1)+1
2
>>> (((((-1)))))+2
1
>>> ((2+2)+(1+1))
6
>>> (2+2))
SyntaxError: invalid syntax
Is there a way to specify a "matching-ness" in EBNF, or how is parenthetical-matching handled by most parsers?
In order to match an arbitrary amount of anything (parentheses, operators, list items, etc.) you need recursion. EBNF also has repetition operators that can be used instead of recursion in some cases, but not for constructs whose parts have to be matched up pairwise, like parentheses.
For well-matched parentheses, the proper production is simply:
expr ::= "(" expr ")"
That's in addition to productions for other types of expressions, of course, so a complete grammar might look like this:
expr ::= "(" expr ")"
expr ::= NUMBER
expr ::= expr "+" expr
expr ::= expr "-" expr
expr ::= expr "*" expr
expr ::= expr "/" expr
Or for an unambiguous grammar:
expr ::= expr "+" multExpr
expr ::= expr "-" multExpr
multExpr ::= multExpr "*" primaryExpr
multExpr ::= multExpr "/" primaryExpr
primaryExpr ::= "(" expr ")"
primaryExpr ::= NUMBER
Also, how do you usually go about 'testing' that it is correct -- is there an online tool or something that can validate a syntax?
There are many parser generators that can accept some form of BNF- or EBNF-like notation and generate a parser from it. You can use one of those and then test whether the generated parser parses what you want it to. They're usually not available as online tools though. Also note that parser generators generally need the grammar to be unambiguous or you to add precedence declarations to disambiguate it.
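For example, here is a sketch using the Lark parser generator for Python (an arbitrary choice, assumed to be installed via pip install lark); the grammar is a transliteration of the unambiguous one above, and once it builds you can simply throw test inputs at it.

from lark import Lark

GRAMMAR = r"""
    ?expr: expr "+" mult_expr         -> add
         | expr "-" mult_expr         -> sub
         | mult_expr
    ?mult_expr: mult_expr "*" primary -> mul
              | mult_expr "/" primary -> div
              | primary
    ?primary: "(" expr ")"
            | NUMBER
    %import common.NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(GRAMMAR, start="expr", parser="lalr")
print(parser.parse("((2+2)+(1+1))").pretty())   # accepted, prints the parse tree

try:
    parser.parse("(2+2))")                      # unbalanced, should be rejected
except Exception as exc:
    print("rejected:", type(exc).__name__)

If the generator accepts the grammar without reporting conflicts and the resulting parser accepts and rejects your test inputs as expected, that's reasonable evidence the grammar says what you meant.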
Also, wouldn't expr ::= "(" expr ")" infinite-loop?
No. The exact mechanics depend on the parsing algorithm used of course, but if the character at the current input position is not an opening parenthesis, then clearly this isn't the right production to use and another one needs to be applied (or a syntax error raised if none of the productions apply).
Left recursion can cause infinite recursion with top-down parsing algorithms (though with a parser generator it's more likely that the grammar will be rejected or, in some cases, automatically rewritten than that you get an actual infinite recursion or loop), but recursion that is not left recursion doesn't cause that kind of problem with any algorithm.
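To make that concrete, here is a hand-written recursive-descent sketch in Python for the unambiguous grammar above (the class and helper names are made up). Since top-down parsing can't use the left-recursive rules directly, the two left-recursive rules are rewritten as loops, which is exactly what an EBNF repetition operator would express; the parenthesis matching is handled entirely by the recursive call in primary(), and it cannot loop forever because it only recurses after actually consuming a "(".

import re

def tokenize(src):
    tokens = re.findall(r"\d+|[()+\-*/]", src)
    if "".join(tokens) != src.replace(" ", ""):
        raise SyntaxError("unexpected character in input")
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {self.peek()!r}")
        self.pos += 1

    def expr(self):                      # expr ::= multExpr (("+" | "-") multExpr)*
        value = self.mult_expr()
        while self.peek() in ("+", "-"):
            op = self.tokens[self.pos]
            self.pos += 1
            rhs = self.mult_expr()
            value = value + rhs if op == "+" else value - rhs
        return value

    def mult_expr(self):                 # multExpr ::= primaryExpr (("*" | "/") primaryExpr)*
        value = self.primary()
        while self.peek() in ("*", "/"):
            op = self.tokens[self.pos]
            self.pos += 1
            rhs = self.primary()
            value = value * rhs if op == "*" else value / rhs
        return value

    def primary(self):                   # primaryExpr ::= "(" expr ")" | NUMBER
        if self.peek() == "(":           # recurse only after actually seeing "(",
            self.eat("(")                # which is why this cannot loop forever
            value = self.expr()
            self.eat(")")
            return value
        if self.peek() is not None and self.peek().isdigit():
            value = int(self.peek())
            self.pos += 1
            return value
        raise SyntaxError(f"unexpected token {self.peek()!r}")

def parse(src):
    p = Parser(tokenize(src))
    result = p.expr()
    if p.peek() is not None:             # leftover input, e.g. the stray ")" in "(2+2))"
        raise SyntaxError(f"trailing input at {p.peek()!r}")
    return result

print(parse("((2+2)+(1+1))"))  # 6
# parse("(2+2))") raises SyntaxError: trailing input at ')'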
I want to write a reader for configuration files similar to the INI files used on Windows.
It is an exercise to teach myself how to use a lexer/parser generator that I wrote myself.
The grammar is:
%lexer
HEADER ::= "\\[[0-9a-zA-Z]+\\]"
TRUE ::= "yes|true"
FALSE ::= "no|false"
ASSIGN ::= "="
OPTION_NAME ::= "[a-zA-Z][0-9a-zA-Z]*"
INT ::= "[0-9]+"
STRING ::= "\"(\\\"|[^\"])*\""
CODE ::= "<{(.*)}>"
BLANK ::= "[ \t\f]+" :ignore
COMMENT ::= "#[^\n\r]*(\r|\n)?" :ignore
NEWLINE ::= "\r|\n"
%parser
Options ::= OptionGroup Options | OptionGroup | #epsilon#
OptionGroup ::= HEADER NEWLINE OptionsList
OptionsList ::= Option NEWLINE OptionsList | Option
Option ::= OPTION_NAME ASSIGN OptionValue
OptionValue ::= TRUE | FALSE | INT | STRING | CODE
The problem lies in the #epsilon# production. I added it because I want my reader to also accept empty files, but I'm getting conflicts when 'OptionsList' or 'OptionGroup' contains an epsilon production. I tried rearranging elements in the productions, but I only get conflicts (r/r or s/r, depending on what I did) unless I remove the epsilon from my grammar completely. That removes the problem, but in my logic one of 'OptionsList' or 'OptionGroup' should contain an epsilon, otherwise my goal of accepting empty files is not met.
My parser generator uses the LR(1) method, so I thought I could use epsilon productions in my grammar. It seems I'm good at writing generators, but not at constructing conflict-free grammars :(.
Should I forget about epsilons? Or is my grammar accepting empty inputs even when there is no epsilon production?
Your Options production allows an Options to be a sequence of OptionGroups whose recursion bottoms out in either a single OptionGroup or nothing at all. That's ambiguous, because a list of exactly one OptionGroup could be derived as either:
The base case OptionGroup, or
The base case #epsilon# preceded by one OptionGroup (via OptionGroup Options).
In short, instead of
Options ::= OptionGroup Options | OptionGroup | #epsilon#
you need
Options ::= OptionGroup Options | #epsilon#
which matches exactly the same set of sentences, but unambiguously.
In general terms, you are usually better off writing left-recursive rules for bottom-up parsers: a left-recursive list lets the parser reduce each OptionGroup as soon as it is complete, instead of piling the whole list onto the stack before the first reduction. So I would have written
Options ::= Options OptionGroup | #epsilon#
I am trying to parse positive and negative decimals.
number(N) ::= pnumber(N1).
number(N) ::= nnumber(N1).
number(N) ::= pnumber(N1) DOT pnumber(N2).
number(N) ::= nnumber(N1) DOT pnumber(N2).
pnumber(N) ::= NUMBER(N1).
nnumber(N) ::= MINUS NUMBER(N1).
The inclusion of the first two rules gives a shift/reduce conflict but I don't know how I can write the grammar such that the conflict never occurs.
I am using the Lemon parser.
Edit: conflicts from .out file
State 79:
(56) number ::= nnumber *
number ::= nnumber * DOT pnumber
DOT shift 39
DOT reduce 56 ** Parsing conflict **
{default} reduce 56 number ::= nnumber
State 80:
(55) number ::= pnumber *
number ::= pnumber * DOT pnumber
DOT shift 40
DOT reduce 55 ** Parsing conflict **
{default} reduce 55 number ::= pnumber
State 39:
number ::= nnumber DOT * pnumber
pnumber ::= * NUMBER
NUMBER shift-reduce 59 pnumber ::= NUMBER
pnumber shift-reduce 58 number ::= nnumber DOT pnumber
State 40:
number ::= pnumber DOT * pnumber
pnumber ::= * NUMBER
NUMBER shift-reduce 59 pnumber ::= NUMBER
pnumber shift-reduce 57 number ::= pnumber DOT pnumber
Edit 2: Minimal grammar that causes issue
start ::= prog.
prog ::= rule.
rule ::= REVERSE_IMPLICATION body DOT.
body ::= bodydef.
body ::= body CONJUNCTION bodydef.
bodydef ::= literal.
literal ::= variable.
variable ::= number.
number ::= pnumber.
number ::= nnumber.
number ::= pnumber DOT pnumber.
number ::= nnumber DOT pnumber.
pnumber ::= NUMBER.
nnumber ::= MINUS NUMBER.
The conflicts you show indicate a problem with how the number non-terminal is used, not with number itself.
The basic problem is that after seeing a pnumber or nnumber, when the next token of lookahead is a DOT, it can't decide if that should be the end of the number (reduce, so DOT is part of some other non-terminal after the number), or if the DOT should be treated as part of the number (shifted so it can later reduce one of the p/nnumber DOT pnumber rules.)
So in order to diagnose the problem, you'll need to show all the rules that use number anywhere on the right hand side (and recursively any other rules that use any of those rules' non-terminals on the right).
Note that it is rarely useful to post just a fragment of a grammar, as the LR parser construction process depends heavily on the context of where the rules are used elsewhere in the grammar...
So the problem here is that you need two-token lookahead to differentiate between a DOT in a (real) number literal and a DOT at the end of a rule.
The easy fix is to let the lexer deal with it -- lexers can do small amounts of lookahead quite easily, so you can recognize REAL_NUMBER as a token distinct from NUMBER (probably still without the leading -), and you'd end up with
number ::= NUMBER | MINUS NUMBER | REAL_NUMBER | MINUS REAL_NUMBER
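Here is a minimal Python sketch of that lexer idea (the token names are the hypothetical ones from the rule above, and the patterns are assumptions about what NUMBER looks like): the REAL_NUMBER pattern is tried before NUMBER and DOT, so the lexer effectively peeks past the '.' to see whether a digit follows, and the parser then never needs two tokens of lookahead.

import re

TOKEN_SPEC = [
    ("REAL_NUMBER", r"[0-9]+\.[0-9]+"),   # tried before NUMBER and DOT
    ("NUMBER",      r"[0-9]+"),
    ("MINUS",       r"-"),
    ("DOT",         r"\."),
    ("WS",          r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    # Yield (token_name, lexeme) pairs, skipping whitespace.
    for match in MASTER.finditer(text):
        if match.lastgroup != "WS":
            yield (match.lastgroup, match.group())

print(list(tokenize("-12.5 .")))
# [('MINUS', '-'), ('REAL_NUMBER', '12.5'), ('DOT', '.')]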
It's much harder to remove the conflict by factoring the grammar but it can be done.
In general, to refactor a grammar to remove a lookahead conflict, you need to figure out the rules that manifest the conflict (rule and number here) and refactor things to bring them together into rules that have common prefixes until you get far enough along to disambiguate.
First, I'm going to assume there are other rules besides number that can appear here, as otherwise we could just eliminate all the intervening rules.
variable ::= number | name
We want to move the number rule "up" in the grammar to get it into the same place as the rule with DOT. So we need to split the containing rules, special-casing the versions that end with a number. We add an _n suffix to denote the variant of each original rule with the number-final alternatives split off:
variable ::= number | variable_n
variable_n ::= name
...and propagate that "up"
literal ::= number | literal_n
literal_n ::= variable_n
...and again
bodydef ::= number | bodydef_n
bodydef_n ::= literal_n
...and again
body ::= number | body_n
body ::= body CONJUNCTION number
body_n ::= bodydef_n
body_n ::= body CONJUNCTION bodydef_n
Notice that as you move it up, you need to split up more and more rules, so this process can blow up the grammar quite a bit. However, rules that are used only at the end of a rhs that you're refactoring will end up only needing the _n version, so you don't necessarily have to double the number of rules.
...last step
rule ::= REVERSE_IMPLICATION body_n DOT
rule ::= REVERSE_IMPLICATION number DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION number DOT
Now you have the DOTs in all the same places, so expand the number rules:
rule ::= REVERSE_IMPLICATION body_n DOT
rule ::= REVERSE_IMPLICATION integer DOT
rule ::= REVERSE_IMPLICATION integer DOT pnumber DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION integer DOT
rule ::= REVERSE_IMPLICATION body CONJUNCTION integer DOT pnumber DOT
and the shift-reduce conflicts are gone, because the rules have common prefixes up until past the needed lookahead to determine which to use.
I've reduced the number of rules in this final expansion by adding
integer ::= pnumber | nnumber
You have to declare the associativity of the DOT operator token with %left or %right.
Or, another idea is to drop this intermediate reduction. The obvious feature in your grammar is that numbers grow by DOT followed by a number. That can be captured with a single rule:
number : number DOT NUMBER
A number followed by a DOT followed by a NUMBER token is still a number.
This rule doesn't require DOT to have an associativity declared, because there is no ambiguity; the rule is purely left-recursive, and the right-hand side of DOT is a terminal token. The parser must reduce the top of the stack to number when the state machine is at this point (using * to mark the parse position, as in the .out listing above), and then shift the DOT:
number : number * DOT NUMBER
The language you are parsing here is regular; it can be matched by a regular expression, with no recursion at all. That is why rules that are both left- and right-recursive, and therefore need an associativity declaration, are somewhat of a "big hammer" for it.
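As a sketch of that point (in Python, and assuming NUMBER is just a digit string), the whole number language of the grammar, an optional sign, digits, and an optional single fractional part, fits in one regular expression:

import re

# Optional MINUS, a NUMBER, optionally a DOT followed by another NUMBER.
NUMBER_RE = re.compile(r"-?[0-9]+(\.[0-9]+)?")

print(bool(NUMBER_RE.fullmatch("-12.5")))  # True
print(bool(NUMBER_RE.fullmatch("3.")))     # False: the grammar requires digits after the DOT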
Considering the following grammar for propositional logic:
<A> ::= <B> <-> <A> | <B>
<B> ::= <C> -> <B> | <C>
<C> ::= <D> \/ <C> | <D>
<D> ::= <E> /\ <D> | <E>
<E> ::= <F> | -<F>
<F> ::= <G> | <H>
<G> ::= (<A>)
<H> ::= p | q | r | ... | z
Precedence for the connectives is: -, /\, \/, ->, <->.
Associativity is also considered; for example, p\/q\/r should be the same as p\/(q\/r). The same goes for the other connectives.
I intend to write a predictive top-down parser in Java. I don't see any ambiguity or direct left recursion here, but I'm not sure that's all I need for this to be an LL(1) grammar. Maybe indirect left recursion?
If this is not an LL(1) grammar, what steps would be required to transform it for my purposes?
It's not LL(1). Here's why:
The first rule of an LL(1) grammar is:
A grammar G is LL(1) if and only if whenever A --> C | D are two distinct productions of G, the following conditions hold:
For no terminal a do both C and D derive strings beginning with a.
This rule exists precisely so that there are no conflicts while parsing: a single token of lookahead must be enough to choose a production.
Your grammar violates it. In each of the rules <A> through <D>, both alternatives start with the same nonterminal, and those nonterminals all eventually derive through <F> to <G> and <H>, so both alternatives can derive strings beginning with the same terminal. For instance, both alternatives of <A> can begin with a (, and when the parser encounters that (, it won't know which production to use.
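As for the transformation the question asks about: the usual fix for alternatives that share a common prefix is left-factoring. A sketch for the first rule (with <A'> as a new, made-up nonterminal and EPSILON standing for the empty string):
<A> ::= <B> <A'>
<A'> ::= <-> <A> | EPSILON
The same applies to <B>, <C> and <D>; <E> and <F> already start their alternatives with distinct terminals, so they can stay as they are. This keeps the right associativity of the original rules.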
When I try to compile this simple parser with Lemon, I get a conflict, but I can't see which rule is wrong. The conflict disappears if I remove either the binaryexpression or the callexpression rule.
%left Add.
program ::= expression.
expression ::= binaryexpression.
expression ::= callexpression.
binaryexpression ::= expression Add expression.
callexpression ::= expression arguments.
arguments ::= LParenthesis argumentlist RParenthesis.
arguments ::= LParenthesis RParenthesis.
argumentlist ::= expression argumentlist.
argumentlist ::= expression.
[edit] Adding left associativity for LParenthesis has solved the conflict.
However, I'd like to know whether that is the correct thing to do: I've seen that some grammars (e.g. C++) give the construction operator '()' and the call operator '()' different precedences, so I'm not sure what the right approach is.
The problem is that the grammar is ambiguous: after expression Add expression, a following LParenthesis could start the arguments of a call whose callee is just the right-hand expression, or of a call whose callee is the whole binaryexpression, so the parser cannot choose between shifting and reducing with one token of lookahead. On top of that, expression is left-recursive through both of its alternatives and has no alternative that derives a terminal, so as written the recursion can never bottom out.
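As for whether the precedence fix is the right thing to do: giving LParenthesis higher precedence than Add resolves the conflict by shifting the parenthesis, so a call binds tighter than addition (a Add b followed by arguments becomes a plus a call of b, not a call of the whole sum), which is what most languages want. In Lemon, as in yacc, tokens declared in later precedence directives bind more tightly, so a sketch of the declarations would be:
%left Add.
%left LParenthesis.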