Shift/Reduce conflict in CUP - parsing

I'm trying to write a parser for a javascript-ish language with JFlex and Cup, but I'm having some issues with those deadly shift/reduce problems and reduce/reduce problems.
I have searched thoroughly and have found tons of examples, but I'm not able to extrapolate these to my grammar. My understanding so far is that these problems are because the parser cannot decide which way it should take because it can't distinguish.
My grammar is the following one:
start with INPUT;
INPUT::= PROGRAM;
PROGRAM::= FUNCTION NEWLINE PROGRAM
| NEWLINE PROGRAM;
FUNCTION ::= function OPTIONAL id p_izq ARG p_der NEWLINE l_izq NEWLINE BODY l_der;
OPTIONAL ::=
| TYPE;
TYPE::= integer
| boolean
ARG ::=
| TYPE id MORE_ARGS;
MORE_ARGS ::=
| colon TYPE id MORE_ARGS;
NEWLINE ::= salto NEWLINE
| ;
BODY ::= ;
I'm getting several conflicts but these 2 are a mere example:
Warning : *** Shift/Reduce conflict found in state #5
between NEWLINE ::= (*)
and NEWLINE ::= (*) salto NEWLINE
under symbol salto
Resolved in favor of shifting.
Warning : *** Shift/Reduce conflict found in state #0
between NEWLINE ::= (*)
and FUNCTION ::= (*) function OPTIONAL id p_izq ARG p_der NEWLINE l_izq NEWLINE BODY l_der
under symbol function
Resolved in favor of shifting.
PS: The grammar is far more complex, but I think that if I see how these shift/reduce problems are solved, I'll be able to fix the rest.
Thanks for your answers.

PROGRAM is useless (in the technical sense). That is, it cannot produce any sentence because in
PROGRAM::= FUNCTION NEWLINE PROGRAM
| NEWLINE PROGRAM;
both productions for PROGRAM are recursive. For a non-terminal to be useful, it needs to be able to eventually produce some string of terminals, and for it to do that, it must have at least one non-recursive production; otherwise, the recursion can never terminate. I'm surprised CUP didn't mention this to you. Or maybe it did, and you chose to ignore the warning.
That's a problem -- useless non-terminals really cannot ever match anything, so they will eventually result in a parse error -- but it's not the parsing conflict you're reporting. The conflicts come from another feature of the same production, which is related to the fact that you can't divide by 0.
The thing about nothing is that any number of nothings are still nothing. So if you had plenty of nothings and someone came along and asked you exactly how many nothings you had, you'd have a bit of problem, because to get "plenty" back from "0 * plenty", you'd have to compute "0 / 0", and that's not a well-defined value. (If you had plenty of two's, and someone asked you how many two's you had, that wouldn't be a problem: suppose the plenty of two's worked out to 40; you could easily compute that 40 / 2 = 20, which works out perfectly because 20 * 2 = 40.)
So here we don't have arithmetic, we have strings of symbols. And unfortunately the string containing no symbols is really invisible, like 0 was for all those millenia until some Arabian mathematician noticed the value of being able to write nothing.
Where this is all coming around to is that you have
PROGRAM::= ... something ...
| NEWLINE PROGRAM;
But NEWLINE is allowed to produce nothing.
NEWLINE ::= salto NEWLINE
| ;
So the second recursive production of PROGRAM might add nothing. And it might add nothing plenty of times, because the production is recursive. But the parser needs to be deterministic: it needs to know exactly how many nothings are present, so that it can reduce each nothing to a NEWLINE non-terminal and then reduce a new PROGRAM non-terminal. And it really doesn't know how many nothings to add.
In short, optional nothings and repeated nothings are ambiguous. If you are going to insert a nothing into your language, you need to make sure that there are a fixed finite number of nothings, because that's the only way the parser can unambiguously dissect nothingness.
Now, since the only point of that particular recursion (as far as I can see) is to allow empty newline-terminated statements (which are visible because of the newline, but do nothing). And you could do that by changing the recursion to avoid nothing:
PROGRAM ::= ...
| salto PROGRAM;
Although it's not relevant to your current problem, I feel obliged to mention that CUP is an LALR parser generator and all that stuff you might have learned or read on the internet about recursive descent parsers not being able to handle left recursion does not apply. (I deleted a rant about the way parsing technique are taught, so you'll have to try to recover it from the hints I've left behind.) Bottom-up parser generators like CUP and yacc/bison love left recursion. They can handle right recursion, too, of course, but reluctantly because they need to waste stack space for every recursion other than left recursion. So there is no need to warp your grammar in order to deal with the deficiency; just write the grammar naturally and be happy. (So you rarely if ever need non-terminals representing "the rest of" something.)
PD: Plenty of nothing is a culturally-specific reference to a song from the 1934 opera Porgy and Bess.

Related

Parser/Grammar: 2x parenthesis in nested rules

Despite my limited knowledge about compiling/parsing I dared to build a small recursive-descent parser for OData $filter expressions. The parser only needs to check the expression for correctness and output a corresponding condition in SQL. As input and output have almost the same tokens and structure, this was fairly straightforward, and my implementation does 90% of what I want.
But now I got stuck with parentheses, which appear in separate rules for logical and arithmetic expressions. The full OData grammar in ABNF is here, a condensed version of the rules involved is this:
boolCommonExpr = ( boolMethodCallExpr
/ notExpr
/ commonExpr [ eqExpr / neExpr / ltExpr / ... ]
/ boolParenExpr
) [ andExpr / orExpr ]
commonExpr = ( primitiveLiteral
/ firstMemberExpr ; = identifier
/ methodCallExpr
/ parenExpr
) [ addExpr / subExpr / mulExpr / divExpr / modExpr ]
boolParenExpr = "(" boolCommonExpr ")"
parenExpr = "(" commonExpr ")"
How does this grammar match a simple expression like (1 eq 2)? From what I can see all ( are consumed by the rule parenExpr inside commonExpr, i.e. they must also close after commonExpr to not cause an error and boolParenExpr never gets hit. I suppose my experience / intuition on reading such a grammar is just insufficient to get it. A comment in the ABNF says: "Note that boolCommonExpr is also a commonExpr". Maybe that's part of the mystery?
Obviously an opening ( alone won't tell me where it's going to close: After the current commonExpr expression or further away after boolCommonExpr. My lexer has a list of all tokens ahead (URL is very short input). I was thinking to use that to find out what type of ( I have. Good idea?
I'd rather have restrictions in input or a little hack than switching to a generally more powerful parser model. For a simple expression translation like this I also want to avoid compiler tools.
Edit 1: Extension after answer by rici - Is grammar rewrite correct?
Actually I started out with the example for recursive-descent parsers given on Wikipedia. Then I though to better adapt to the official grammar given by the OData standard to be more "conformant". But with the advice from rici (and the comment from "Internal Server Error") to rewrite the grammar I would tend to go back to the more comprehensible structure provided on Wikipedia.
Adapted to the boolean expression for the OData $filter this could maybe look like this:
boolSequence= boolExpr {("and"|"or") boolExpr} .
boolExpr = ["not"] expression ("eq"|"ne"|"lt"|"gt"|"lt"|"le") expression .
expression = term {("add"|"sum") term} .
term = factor {("mul"|"div"|"mod") factor} .
factor = IDENT | methodCall | LITERAL | "(" boolSequence")" .
methodCall = METHODNAME "(" [ expression {"," expression} ] ")" .
Does the above make sense in general for boolean expressions, is it mostly equivalent to the original structure above and digestible for a recursive descent parser?
#rici: Thanks for your detailed remarks on type checking. The new grammar should resolve your concerns about precedence in arithmetic expressions.
For all three terminals (UPPERCASE in the grammar above) my lexer supplies a type (string, number, datetime or boolean). Non-terminals return the type they produce. With this I managed quite nicely do type checking on the fly in my current implementation, including decent error messages. Hopefully this will also work for the new grammar.
Edit 2: Return to original OData grammar
The differentiation between a "logical" and "arithmetic" ( is not a trivial one. To solve the problem even N.Wirth uses a dodgy workaround to keep the grammar of Pascal simple. As a consequence, in Pascal an extra pair of () is mandatory around and and or expressions. Neither intuitive nor OData conformant :-(. The best read about the "() difficulty" I found is in Let's Build a Compiler (Part VI). Other languages seem to go to great length in the grammar to solve the problem. As I don't have experience with grammar construction I stopped doing my own.
I ended up implementing the original OData grammar. Before I run the parser I go over all tokens backwards to figure out which ( belong to a logical/arithmetic expression. Not a problem for the potential length of a URL.
Personally, I'd just modify the grammar so that it has only one type of expression and therefore one type of parenthesis. I'm not convinced that the OData grammar is actually correct; it is certainly not usable in an LL(1) (or recursive descent) parser for exactly the reason you mention.
Specifically, if the goal is boolCommonExpr, there are two productions which can match the ( lookahead token:
boolCommonExpr = ( …
/ commonExpr [ eqExpr / neExpr / … ]
/ boolParenExpr
/ …
) …
commonExpr = ( …
/ parenExpr
/ …
) …
For the most part, this is a misguided attempt to make the grammar detect a type violation. (If in fact it is a type violation.) It's misguided because it is doomed to failure if there are boolean variables, which there apparently are in this environment. Since there is not syntactic clue as to the type of a variable, the parser is not capable of deciding whether particular expressions are well-formed or not, so there is a good argument for not trying at all, particularly if it creates parsing headaches. A better solution is to first parse the expression into an AST of some form, and then do another pass over the AST to check that each operators has operands of the correct type (and possibly inserting explicit cast operators if that is necessary).
Aside from any other advantage, doing the type check in a separate pass lets you produce much better error messages. If you make (some) type violations syntax errors, then you may leave the user puzzled about why their expression was rejected; in contrast, if you notice that a comparison operation is being used as an operand to multiply (and if your language's semantics don't allow an automatic conversion from True/False to 1/0), then you can produce a well-targetted error message ("comparisons cannot be used as the operand of an arithmetic operator", for example).
One possible reason to put different operators (but not parentheses) into different grammatical variables is to express grammatical precedence. That consideration might encourage you to rewrite the grammar with explicit precedence. (As written, the grammar assumes that all arithmetic operators have the same precedence, which would presumably lead to 2 + 3 * a being parsed as (2 + 3) * a, which might be a huge surprise.) Alternatively, you might use some simple precedence aware subparser for expressions.
If you want to test your ABNF grammar for determinism (i.e. LL(1)), you can use Tunnel Grammar Studio (TGS). I have tested the full grammar, and there are plenty of conflicts, not only this scopes. If you are able to extract the relevant rules, you can use the desktop version of TGS to visualize the conflicts (the online version checker is with a textual result only). If the rules are not too many, the demo may help you to create an LL(1) grammar from your rules.
If you extract all rules you need, and add them to your question, I can run it for you and will tell you is it LL(1). Note that the grammar is not exactly in ABNF meta syntax, because the case sensitivity is typed with ' for case sensitive strings. The ABNF (RFC 5234) by definition is case insensitive, as RFC 7405 defines the sensitivity with %s and %i (sensitive and insensitive) prefixes before the the actual string. The default case (without a prefix) still means insensitive. This means that you have to replace this invalid '...' strings with %s"..." before testing in TGS.
TGS is a project I work on.

Faulty bison reduction using %glr-parser and %merge rules

Currently I'm trying to build a parser for VHDL which
has some of the problems C++-Parsers have to face.
The context-free grammar of VHDL produces a parse
forest rather than a single parse tree because of it's
ambiguity regarding function calls and array subscriptions
foo := fun(3) + arr(5);
This assignment can only be parsed unambiguous if the parser
would carry around a hirachically, type-aware symbol table
which it'd use to resolve the ambiguities somewhat on-the-fly.
I don't want to do this, because for statements like the
aforementioned, the parse forest would not grow exponentially, but
rather linear depending on the amount of function calls and
array subscriptions.
(Except, of course, one would torture the parser with statements like)
foo := fun(fun(fun(fun(fun(4)))));
Since bison forces the user to just create one single parse-tree,
I used %merge attributes to collect all subtrees recursively and
added those subtrees under so called AMBIG nodes in the singleton
AST.
The result looks like this.
In order to produce the above, I parsed the token stream "I=I(N);".
The substance of the grammar I used inside the parse.y file, is
collected below. It tries to resemble the ambiguous parts of VHDL:
start: prog
;
/* I cut out every semantic action to make this
text more readable */
prog: assignment ';'
| prog assignment ';'
;
assignment: 'I' '=' expression
;
expression: function_call %merge <stmtmerge2>
| array_indexing %merge <stmtmerge2>
| 'N'
;
function_call: 'I' '(' expression ')'
| 'I'
;
array_indexing: function_call '(' expression ')' %merge <stmtmerge>
| 'I' '(' expression ')' %merge <stmtmerge>
;
The whole sourcecode can be read at this github repository.
And now, let's get down to the actual Problem.
As you can see in the generated parse tree above,
the nodes FCALL1 and ARRIDX1 refer to the same
single node EXPR1 which in turn refers to N1 twice.
This, by all means, should not have happened and I don't
know why. Instead there should be the paths
FCALL1 -> EXPR2 -> N2
ARRIDX1 -> EXPR1 -> N1
Do you have any idea why bison reuses the aforementioned
nodes?
I also wrote a bugreport on the official gnu mailing
list for bison, without a reply to this point though.
Unfortunately, due to the restictions for new stackoverflow
users, I can't provide no link to this bug report...
That behaviour is expected.
expression can be unambiguously reduced, and that reduced value is used by both possible ambiguous reductions which include the value. Remember that GLR, like LR, is a left-to-right parser. When a reduction action is executed, all of the child reductions have already happened. The effect is not different from the use of a terminal in a right-hand side; the terminal will not be artificially copied in order to produce different instances in the ambiguous productions which use it.
For most people, this would be a feature rather than a bug, and I don't mean that as a joke. Without the graph-structured stack, GLR has exponential run-time. If you really want to do a deep copy of shared AST nodes when you merge parse trees, you will have to do it yourself, but I suggest that you find a way to make use of the fact that the parse forest is really an directed acyclic graph rather than a tree; you will probably be able to take advantage of the lack of duplication.

Bison: how to fix reduce/reduce conflict

Below is a a Bison grammar which illustrates my problem. The actual grammar that I'm using is more complicated.
%glr-parser
%%
s : e | p '=' s;
p : fp | p ',' fp;
fp : 'x';
e : te | e ';' te;
te : fe | te ',' fe;
fe : 'x';
Some examples of input would be:
x
x = x
x,x = x,x
x,x = x;x
x,x,x = x,x;x,x
x = x,x = x;x
What I'm after is for the x's on the left side of an '=' to be parsed differently than those on the right. However, the set of legal "expressions" which may appear on the right of an '='-sign is larger than those on the left (because of the ';').
Bison prints the message (input file was test.y):
test.y: conflicts: 1 reduce/reduce.
There must be some way around this problem. In C, you have a similar situation. The program below passes through gcc with no errors.
int main(void) {
int x;
int *px;
x;
*px;
*px = x = 1;
}
In this case, the 'px' and 'x' get treated differently depending on whether they appear to the left or right of an '='-sign.
You're using %glr-parser, so there's no need to "fix" the reduce/reduce conflict. Bison just tells you there is one, so that you know you grammar might be ambiguous, so you might need to add ambiguity resolution with %dprec or %merge directives. But in your case, the grammar is not ambiguous, so you don't need to do anything.
A conflict is NOT an error, its just an indication that your grammar is not LALR(1).
The reduce-reduce conflict in your grammar comes from the context:
... = ... x ,
At this point, the parser has to decide whether x is an fe or an fp, and it cannot know with one symbol lookahead. Indeed, it cannot know with any finite lookahead, you could have any number of repetitions of x , following that point without encountering a =, ; or the end of the input, any of which would reveal the answer.
This is not quite the same as the C issue, which can be resolved with single symbol lookahead. However, the C example is a classic illustration of why SLR(1) grammars are less powerful than LALR(1) grammars -- it's used for that purpose in the dragon book -- and a similarly problematic grammar is an example of the difference between LALR(1) and LR(1); it can be found in the bison manual (here):
def: param_spec return_spec ',';
param_spec: type | name_list ':' type;
return_spec: type | name ':' type;
type: "id";
name: "id";
name_list: name | name ',' name_list;
(The bison manual explains how to resolve this issue for LALR(1) grammars, although using a GLR grammar is always a possibility.)
The key to resolving such conflicts without using a GLR grammar is to avoid forcing the parser to make premature decisions.
For example, it is traditional to distinguish syntactically between lvalues and rvalues, and some languages continue to do so. C and C++ do not, however; and this turns out to be an extremely powerful feature in C++ because it allows the definition of functions which can act as lvalues.
In C, I think it's just to simplify the grammar a bit: the C grammar allows the result of any unary operator to appear on the left hand side of an assignment operator, but unary operators are actually a mix of lvalues (*v, v[expr]) and rvalues (sizeof v, f(expr)). The grammar could have distinguished between the two kinds of unary operators, but it could not resolve the actual restriction, which is that only modifiable lvalues may appears on the left side of an assignment operator.
C++ allows an arbitrary expression to appear on the left-hand side of an assignment operator (although some need to be parenthesized); consequently, the following is totally legal:
(predicate(x) ? *some_pointer : some_variable) = 42;
In your case, you could resolve the conflict syntactically by replacing te with p, since both non-terminals produce the same set of derivations. That's probably not the general solution, unless it is really the case in your full grammar that left-side expressions are a strict subset of right-side expressions. In a full grammar, you might end up with three types of expression (left-only, right-only, common), which could considerably complicated the grammar, and leaving the resolution for semantic analysis might prove to be easier (and even, as in the case of C++, surprisingly useful).

Improve BNF for mathematic expressions

A good exercise while learning programming, is to write a calculator. To do this, I created some kind of DSL in BNF and want to ask for your help to improve it. With this minilanguage you should be able to add, multiply and assign values and expressions to names (a.k.a. create variables and functions).
Hava a look at the BNF first:
<Program> ::= <Line>(<NewLine><Line>)*
<Line> ::= {"("}<Expression>{")"}|<Assignment>
<Assignment> ::= <Identifier>"="<Expression>
<Identifier> ::= <Name>{"("<Name>(","<Name>)*")"}
<Expression> ::= <Summand>(("+"|"-")<Summand>)*
<Summand> ::= <Factor>(("*"|"/")<Factor>)*
<Factor> ::= <Number>|<Call>
<Call> ::= <Name> {"("<Expression>(","<Expression>)*")"}
<Name> ::= <Letter>(<Letter>|<Digit>)*
<Number> ::= {"+"|"-"}(<Digit>|<DigitNoZero><Digit>+)
<Digit> ::= "0"|<DigitNoZero>
<DigitNoZero> ::= "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
<Letter> ::= [a-zA-Z]
<NewLine> ::= "\n"|"\r"|"\r\n"
As you can see, this BNF treats no Whitespace beside NewLine. Before the parsing begins I plan to remove all whitespace (beside NewLine of course) from the string to parse. It is not nessesary for the parser anyway.
There are 4 things, that might lead to problems when using this language as defined right now and I hope you can help me figure out appropriate solutions:
I tried to follow the top-down approach, while generating this gramar, but there is a circle between <Expression>, <Summand>, <Factor> and <Call>.
The gramar treats variables and functions exactly the same way. Most programming languages make a difference. Is it nessesary to differentiate here?
There are maybe some things that I don't know about programming, BNF, whatever, that will kill me later, while trying to implement the BNF. But you might be able to spot that before I start.
There might be simple and stupid mistakes that I could not find myself. Sorry in that case. I hope there are none of these mistakes anymore.
Using hand and brain I could successfully parse the following test cases:
"3"
"-3"
"3-3"
"a=3"
"a=3+b"
"a=3+b\nc=a+3"
"a(b,c)=b*c\ra(1+2,2*3)"
Please help to improve the BNF, that it can be used to successfully write a calculator.
edit:
This BNF is really not finished. It does not treat the cases "2+-3" (should fail, but doesn't) and "2+(-3)" (should not fail, but does) correctly.
The gramar treats variables and functions exactly the same way. Most programming languages make a difference. Is it nessesary to differentiate here?
Being able to treat the result of a function invocation exactly the same as a local variable or a constant expression is precisely the point of defining (mathematical) functions in the first place. I can't imagine the use of a grammar that allowed functions but didn't treat
1 + 1
exactly the same as
1 + a
or
1 + sin(x)

How to resolve a shift-reduce conflict in unambiguous grammar

I'm trying to parse a simple grammar using an LALR(1) parser generator (Bison, but the problem is not specific to that tool), and I'm hitting a shift-reduce conflict. The docs and other sources I've found about fixing these tend to say one or more of the following:
If the grammar is ambiguous (e.g. if-then-else ambiguity), change the language to fix the ambiguity.
If it's an operator precedence issue, specify precedence explicitly.
Accept the default resolution and tell the generator not to complain about it.
However, none of these seem to apply to my situation: the grammar is unambiguous so far as I can tell (though of course it's ambiguous with only one character of lookahead), it has only one operator, and the default resolution leads to parse errors on correctly-formed input. Are there any techniques for reworking the definition of a grammar to remove shift-reduce conflicts that don't fall into the above buckets?
For concreteness, here's the grammar in question:
%token LETTER
%%
%start input;
input: /* empty */ | input input_elt;
input_elt: rule | statement;
statement: successor ';';
rule: LETTER "->" successor ';';
successor: /* empty */ | successor LETTER;
%%
The intent is to parse semicolon-separated lines of the form "[A-Za-z]+" or "[A-Za-z] -> [A-Za-z]+".
Using the Solaris version of yacc, I get:
1: shift/reduce conflict (shift 5, red'n 7) on LETTER
state 1
$accept : input_$end
input : input_input_elt
successor : _ (7)
$end accept
LETTER shift 5
. reduce 7
input_elt goto 2
rule goto 3
statement goto 4
successor goto 6
So, the trouble is, as it very often is, the empty rule - specifically, the empty successor. It isn't completely clear whether you want to allow a semi-colon as a valid input - at the moment, it is. If you modified the successor rule to:
successor: LETTER | successor LETTER;
the shift/reduce conflict is eliminated.
Thanks for whittling down the grammar and posting it. Changing the successor rule to -
successor: /* empty */ | LETTER successor;
...worked for me. ITYM the language looked unambiguous.

Resources