Below is a very simplified XML grammar for Bison:
head : NODE_START NAME atts
| NODE_START NAME
;
element : head NODE_CLOSE NODE_END
| head NODE_END anys NODE_START NODE_CLOSE NAME NODE_END
| head NODE_END NODE_START NODE_CLOSE NAME NODE_END
;
text : TEXT
;
comment : NODE_START COMMENT_START COMMENT_END NODE_END
;
cdata : NODE_START CDATA_START CDATA_END NODE_END
;
attr : NAME EQUALS value
;
value : QUOTED
| APOSED
;
atts : attr atts
| attr
elt : element
| comment
| cdata
any : elt
| text
;
elts : elt elts
| elt
;
anys : text elts anys
| elts
| text
;
s : any
| PROLOG any
;
The alleged conflict is the rule anys -> text.
When I look at the corresponding output:
State 35
21 anys: text elts . anys
NODE_START shift, and go to state 1
TEXT shift, and go to state 2
head go to state 4
element go to state 5
text go to state 25
comment go to state 7
cdata go to state 8
elt go to state 26
elts go to state 27
anys go to state 42
How do I understand what is at conflict here?
1. Interpreting the dump file
If you look at the beginning of the .output file, you will see the following:
Rules useless in parser due to conflicts
23 anys: text
State 26 conflicts: 1 shift/reduce
State 27 conflicts: 1 shift/reduce
The first warning tells you that the production anys: text was eliminated altogether because the resolution of parsing conflicts (elsewhere in the grammar) made it impossible for the rule to ever be used. (Thus, it is "useless".) The next two lines tell you where to find the conflicts: in states 26 and 27.
So the rule you quote is not the "alleged conflict" and the state you quote has nothing to do with conflicts (indeed, I have no idea why you focused on it.)
In the states with conflicts, you will see, for example:
State 26
21 anys: text . elts anys
23 | text .
NODE_START shift, and go to state 1
NODE_START [reduce using rule 23 (anys)]
head go to state 4
element go to state 5
comment go to state 7
cdata go to state 8
elt go to state 27
elts go to state 35
The conflict is indicated by a lookahead (in this case NODE_START) with two or more different actions. The action(s) enclosed in brackets (in this case [reduce using rule 23 (anys)]) were eliminated by bison's conflict resolution mechanism (which, in the absence of precedence declarations, chooses the shift action if there is one, and otherwise the reduce action with the smallest production number).
The state dumps should make it clear why the rule anys: text became useless. In both cases where it could be reduced, there was a shift-reduce conflict and the shift action was preferred.
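As a rough illustration of the reading technique above, the following Python sketch scans a bison-style dump and lists every bracketed action, that is, every action discarded by conflict resolution. The sample text is a shortened copy of State 26 from above; the function name `discarded_actions` and the regular expressions are assumptions of this sketch, not part of bison.

```python
import re

# Shortened copy of the State 26 dump quoted above.
SAMPLE_DUMP = """\
State 26
   21 anys: text . elts anys
   23     | text .
    NODE_START  shift, and go to state 1
    NODE_START  [reduce using rule 23 (anys)]
    head     go to state 4
"""

def discarded_actions(dump):
    """Return (state, lookahead, action) for every bracketed action."""
    found = []
    state = None
    for line in dump.splitlines():
        header = re.match(r"\s*[Ss]tate (\d+)\s*$", line)
        if header:
            state = int(header.group(1))
            continue
        # A discarded action looks like: LOOKAHEAD  [reduce using rule N (name)]
        bracketed = re.match(r"\s*(\S+)\s+\[(.+)\]\s*$", line)
        if bracketed and state is not None:
            found.append((state, bracketed.group(1), bracketed.group(2)))
    return found

print(discarded_actions(SAMPLE_DUMP))
```

Each tuple names a state, the lookahead on which the conflict occurred, and the action that lost.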
2. Cause of the shift-reduce conflict
The problem is anys: text elts anys. Consider an input consisting of three elts. This could be parsed as an elts consisting of two elts followed by an elts consisting of a single elt, or vice versa. The ambiguity causes a shift-reduce conflict.
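The ambiguity can be checked by brute force. The sketch below counts parse trees for the three anys productions over a string in which `t` stands for a text and `e` for an elt; that encoding and the function names are assumptions made for illustration.

```python
from functools import lru_cache

def count_elts(s):
    """elts: elt elts | elt -- exactly one parse for a non-empty run of 'e'."""
    return 1 if s and set(s) == {"e"} else 0

@lru_cache(maxsize=None)
def count_anys(s):
    """Count parse trees for anys: text elts anys | elts | text."""
    if not s:
        return 0
    total = count_elts(s)                  # anys: elts
    if s == "t":
        total += 1                         # anys: text
    if s[0] == "t":                        # anys: text elts anys
        # k is the length of the elts part; the trailing anys must be non-empty.
        for k in range(1, len(s) - 1):
            total += count_elts(s[1:1 + k]) * count_anys(s[1 + k:])
    return total

print(count_anys("teee"))  # text followed by three elt
```

A text followed by three elt has two parse trees, one for each way of dividing the elt run between the inner elts and the trailing anys, which is what the parser generator reports as a conflict.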
Another problem with that production is that it does not permit an elts to end with a text (unless it consists only of a single text).
A better definition would be the simple
anys: any | anys any
Note: you are using a bottom-up parser and right recursion is (literally) an anti-pattern. Writing your lists left-recursively as above will limit parser stack usage and cause semantic actions to run in the expected order (that is, left to right). Unless you have very specific needs, you should avoid right recursion.
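To make the stack-usage point concrete, here is a small Python simulation of how a shift/reduce parser handles a list written left-recursively versus right-recursively. It is a hand-written model for a non-empty list, not generated parser code, and both function names are invented for this sketch.

```python
def parse_left_recursive(items):
    """Simulate shift/reduce for  list: item | list item."""
    stack, max_depth, actions = [], 0, []
    for item in items:
        stack.append(item)                      # shift the item
        max_depth = max(max_depth, len(stack))
        if len(stack) == 1:
            stack[:] = ["list"]                 # reduce list: item
        else:
            stack[-2:] = ["list"]               # reduce list: list item
        actions.append(item)                    # semantic action runs here
    return max_depth, actions

def parse_right_recursive(items):
    """Simulate shift/reduce for  list: item | item list  (non-empty input)."""
    stack, actions = list(items), []            # every item is shifted first
    max_depth = len(stack)
    actions.append(stack[-1])
    stack[-1:] = ["list"]                       # reduce list: item
    while len(stack) > 1:
        actions.append(stack[-2])
        stack[-2:] = ["list"]                   # reduce list: item list
    return max_depth, actions

print(parse_left_recursive(["a", "b", "c"]))
print(parse_right_recursive(["a", "b", "c"]))
```

In the left-recursive version the stack never holds more than two symbols and the actions fire in input order; in the right-recursive version the stack grows with the list and the actions fire in reverse.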
I've been stuck with some ambiguous grammar for a while now as yacc reports 6 shift/reduce conflicts. I've looked in the y.output file and have tried to understand how to look at the states and figure out what to do to fix the ambiguous grammar but to no avail. I'm legitimately stuck at how I'm supposed to fix the issues. I've looked at a lot of questions on stack overflow to see if other people's explanation would help me with my problem, but that hasn't helped me much either. For the record, I cannot use any precedence defining directives such as %left to solve the parsing conflicts.
Would someone be able to help me out by guiding me as to how I should change the grammar to fix the shift/reduce conflicts? Maybe by trying to resolve one of the issues and showing me the thinking process behind it? I know the grammar is quite long and hefty and I apologize in advance for that. If anyone is willing to spare their free time on this it would be greatly appreciated, but I realize that I may not be able to have that.
Anyways, here is my grammar in question (it is a slight expansion of the MiniJava grammar):
Grammar
0 $accept: program $end
1 program: main_class class_decl_list
2 main_class: CLASS ID '{' PUBLIC STATIC VOID MAIN '(' STRING '[' ']' ID ')' '{' statement '}' '}'
3 class_decl_list: class_decl_list class_decl
4 | %empty
5 class_decl: CLASS ID '{' var_decl_list method_decl_list '}'
6 | CLASS ID EXTENDS ID '{' var_decl_list method_decl_list '}'
7 var_decl_list: var_decl_list var_decl
8 | %empty
9 method_decl_list: method_decl_list method_decl
10 | %empty
11 var_decl: type ID ';'
12 method_decl: PUBLIC type ID '(' formal_list ')' '{' var_decl_list statement_list RETURN exp ';' '}'
13 formal_list: type ID formal_rest_list
14 | %empty
15 formal_rest_list: formal_rest_list formal_rest
16 | %empty
17 formal_rest: ',' type ID
18 type: INT
19 | BOOLEAN
20 | ID
21 | type '[' ']'
22 statement: '{' statement_list '}'
23 | IF '(' exp ')' statement ELSE statement
24 | WHILE '(' exp ')' statement
25 | SOUT '(' exp ')' ';'
26 | SOUT '(' STRING_LITERAL ')' ';'
27 | ID '=' exp ';'
28 | ID index '=' exp ';'
29 statement_list: statement_list statement
30 | %empty
31 index: '[' exp ']'
32 | index '[' exp ']'
33 exp: exp OP exp
34 | '!' exp
35 | '+' exp
36 | '-' exp
37 | '(' exp ')'
38 | ID index
39 | ID '.' LENGTH
40 | ID index '.' LENGTH
41 | INTEGER_LITERAL
42 | TRUE
43 | FALSE
44 | object
45 | object '.' ID '(' exp_list ')'
46 object: ID
47 | THIS
48 | NEW ID '(' ')'
49 | NEW type index
50 exp_list: exp exp_rest_list
51 | %empty
52 exp_rest_list: exp_rest_list exp_rest
53 | %empty
54 exp_rest: ',' exp
And here are the relevant states from y.output that have shift/reduce conflicts.
State 58
7 var_decl_list: var_decl_list . var_decl
12 method_decl: PUBLIC type ID '(' formal_list ')' '{' var_decl_list . statement_list RETURN exp ';' '}'
INT shift, and go to state 20
BOOLEAN shift, and go to state 21
ID shift, and go to state 22
ID [reduce using rule 30 (statement_list)]
$default reduce using rule 30 (statement_list)
var_decl go to state 24
type go to state 25
statement_list go to state 69
State 76
38 exp: ID . index
39 | ID . '.' LENGTH
40 | ID . index '.' LENGTH
46 object: ID .
'[' shift, and go to state 64
'.' shift, and go to state 97
'.' [reduce using rule 46 (object)]
$default reduce using rule 46 (object)
index go to state 98
State 100
33 exp: exp . OP exp
34 | '!' exp .
OP shift, and go to state 103
OP [reduce using rule 34 (exp)]
$default reduce using rule 34 (exp)
State 101
33 exp: exp . OP exp
35 | '+' exp .
OP shift, and go to state 103
OP [reduce using rule 35 (exp)]
$default reduce using rule 35 (exp)
State 102
33 exp: exp . OP exp
36 | '-' exp .
OP shift, and go to state 103
OP [reduce using rule 36 (exp)]
$default reduce using rule 36 (exp)
State 120
33 exp: exp . OP exp
33 | exp OP exp .
OP shift, and go to state 103
OP [reduce using rule 33 (exp)]
$default reduce using rule 33 (exp)
And there we have it. I apologize again for the length of this grammar and the number of shift/reduce conflicts. I just cannot seem to understand how to fix them by changing the grammar in question. Any help would be thoroughly appreciated, though if no one has time to look through such a massive post, I would understand. If anyone needs more information, don't hesitate to ask.
The basic problem is that when parsing a method_decl body, the parser can't tell where the var_decl_list ends and the statement_list begins. This is because when the lookahead is ID, it doesn't know whether that is the start of another var_decl or the start of the first statement, and it needs to reduce an empty statement_list before it can start working on the statements.
There are a number of ways you can deal with this:
have the lexer return different tokens for type IDs and other IDs -- that way the difference will tell the parser which is next.
don't require an empty statement_list at the start of a statement list. Change the grammar to:
statement_list: statement | statement_list statement ;
opt_statement_list: statement_list | %empty ;
and use opt_statement_list in the method_decl rule. This gets around the problem of having to reduce an empty statement_list before you start parsing statements. This is a process known as "unfactoring" the grammar, as you are replacing rules with multiple variations. It makes the grammar more complex, and in this case it doesn't solve the problem, it just moves it; you'll then see shift/reduce conflicts between statement: ID . index and type: ID on a [ lookahead. This problem can also be solved by unfactoring, but is harder.
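For the first option (distinct tokens for type names and other identifiers), a minimal sketch of the lexer side might look like the following. It assumes the set of class names is already known when tokens are produced, for example from an earlier pass over the source; the token names TYPE_ID and ID and the helper `classify_identifiers` are hypothetical.

```python
def classify_identifiers(words, class_names):
    """Tag each identifier as TYPE_ID if it names a known class, else ID.

    `class_names` is assumed to be collected beforehand (hypothetical pre-pass).
    """
    return [("TYPE_ID" if word in class_names else "ID", word) for word in words]

known_classes = {"Point", "Shape"}
print(classify_identifiers(["Point", "p", "x"], known_classes))
```

With such tokens the grammar would use TYPE_ID in the type rule and ID at the start of a statement, so a single lookahead token is enough to tell a var_decl from a statement.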
So this brings up the general idea of resolving shift-reduce conflicts by unfactoring. The basic idea is to get rid of the rule causing the reduce half of the shift-reduce conflict, replacing it with rules that are more limited in context and so don't trigger the conflict. The example above is easily solved by the "replace a 0-or-more recursive repeat with a 1-or-more recursive repeat and an optional rule" technique. This works well for shift-reduce conflicts on the epsilon rule of the repeat if the following context means you can easily resolve when the 0-case should be legal (only when the next token is } or RETURN in this case).
The second conflict is tougher. Here the conflict is on reducing type: ID when the lookahead is [. So we need to duplicate type rules until that is not necessary. Something like:
type: simpleType | arrayType ;
simpleType: INT | BOOLEAN | ID ;
arrayType: INT '[' ']' | BOOLEAN '[' ']' | ID '[' ']'
| arrayType '[' ']' ;
This replaces the "0 or more repetitions of the '[' ']' suffix" with "1 or more" and works for similar reasons (it defers the reduction until after seeing the '[' ']' instead of requiring it before). The key is that the simpleType: ID rule never needs to be reduced when the lookahead is '[', as it is only valid in other contexts.
I'm using PLY to parse this grammar. I implemented a metagrammar for EBNF used in the linked spec, but PLY reports multiple shift/reduce conflicts.
Grammar:
Rule 0 S' -> grammar
Rule 1 grammar -> prod_list
Rule 2 grammar -> empty
Rule 3 prod_list -> prod
Rule 4 prod_list -> prod prod_list
Rule 5 prod -> id : : = rule_list
Rule 6 rule_list -> rule
Rule 7 rule_list -> rule rule_list
Rule 8 rule -> rule_simple
Rule 9 rule -> rule_group
Rule 10 rule -> rule_opt
Rule 11 rule -> rule_rep0
Rule 12 rule -> rule_rep1
Rule 13 rule -> rule_alt
Rule 14 rule -> rule_except
Rule 15 rule_simple -> terminal
Rule 16 rule_simple -> id
Rule 17 rule_simple -> char_range
Rule 18 rule_group -> ( rule_list )
Rule 19 rule_opt -> rule_simple ?
Rule 20 rule_opt -> rule_group ?
Rule 21 rule_rep0 -> rule_simple *
Rule 22 rule_rep0 -> rule_group *
Rule 23 rule_rep1 -> rule_simple +
Rule 24 rule_rep1 -> rule_group +
Rule 25 rule_alt -> rule | rule
Rule 26 rule_except -> rule - rule_simple
Rule 27 rule_except -> rule - rule_group
Rule 28 terminal -> SQ string_no_sq SQ
Rule 29 terminal -> DQ string_no_dq DQ
Rule 30 string_no_sq -> LETTER string_no_sq
Rule 31 string_no_sq -> DIGIT string_no_sq
Rule 32 string_no_sq -> SYMBOL string_no_sq
Rule 33 string_no_sq -> DQ string_no_sq
Rule 34 string_no_sq -> + string_no_sq
Rule 35 string_no_sq -> * string_no_sq
Rule 36 string_no_sq -> ( string_no_sq
Rule 37 string_no_sq -> ) string_no_sq
Rule 38 string_no_sq -> ? string_no_sq
Rule 39 string_no_sq -> | string_no_sq
Rule 40 string_no_sq -> [ string_no_sq
Rule 41 string_no_sq -> ] string_no_sq
Rule 42 string_no_sq -> - string_no_sq
Rule 43 string_no_sq -> : string_no_sq
Rule 44 string_no_sq -> = string_no_sq
Rule 45 string_no_sq -> empty
Rule 46 string_no_dq -> LETTER string_no_dq
Rule 47 string_no_dq -> DIGIT string_no_dq
Rule 48 string_no_dq -> SYMBOL string_no_dq
Rule 49 string_no_dq -> SQ string_no_dq
Rule 50 string_no_dq -> + string_no_dq
Rule 51 string_no_dq -> * string_no_dq
Rule 52 string_no_dq -> ( string_no_dq
Rule 53 string_no_dq -> ) string_no_dq
Rule 54 string_no_dq -> ? string_no_dq
Rule 55 string_no_dq -> | string_no_dq
Rule 56 string_no_dq -> [ string_no_dq
Rule 57 string_no_dq -> ] string_no_dq
Rule 58 string_no_dq -> - string_no_dq
Rule 59 string_no_dq -> : string_no_dq
Rule 60 string_no_dq -> = string_no_dq
Rule 61 string_no_dq -> empty
Rule 62 id -> LETTER LETTER id
Rule 63 id -> LETTER DIGIT id
Rule 64 id -> LETTER
Rule 65 id -> DIGIT
Rule 66 rest_of_id -> LETTER rest_of_id
Rule 67 rest_of_id -> DIGIT rest_of_id
Rule 68 rest_of_id -> empty
Rule 69 char_range -> [ UNI_CH - UNI_CH ]
Rule 70 empty -> <empty>
Conflicts:
id : LETTER LETTER id
| LETTER DIGIT id
| LETTER
| DIGIT
.
state 4
(62) id -> LETTER . LETTER id
(63) id -> LETTER . DIGIT id
(64) id -> LETTER .
! shift/reduce conflict for LETTER resolved as shift
! shift/reduce conflict for DIGIT resolved as shift
LETTER shift and go to state 10
DIGIT shift and go to state 9
| reduce using rule 64 (id -> LETTER .)
- reduce using rule 64 (id -> LETTER .)
( reduce using rule 64 (id -> LETTER .)
SQ reduce using rule 64 (id -> LETTER .)
DQ reduce using rule 64 (id -> LETTER .)
[ reduce using rule 64 (id -> LETTER .)
$end reduce using rule 64 (id -> LETTER .)
) reduce using rule 64 (id -> LETTER .)
: reduce using rule 64 (id -> LETTER .)
? reduce using rule 64 (id -> LETTER .)
* reduce using rule 64 (id -> LETTER .)
+ reduce using rule 64 (id -> LETTER .)
! LETTER [ reduce using rule 64 (id -> LETTER .) ]
! DIGIT [ reduce using rule 64 (id -> LETTER .) ]
The id rule is supposed to guarantee that productions' ids start with a letter.
Next conflict:
rule_alt : rule '|' rule
.
state 113
(25) rule_alt -> rule | rule .
(25) rule_alt -> rule . | rule
(26) rule_except -> rule . - rule_simple
(27) rule_except -> rule . - rule_group
! shift/reduce conflict for | resolved as shift
! shift/reduce conflict for - resolved as shift
( reduce using rule 25 (rule_alt -> rule | rule .)
SQ reduce using rule 25 (rule_alt -> rule | rule .)
DQ reduce using rule 25 (rule_alt -> rule | rule .)
LETTER reduce using rule 25 (rule_alt -> rule | rule .)
DIGIT reduce using rule 25 (rule_alt -> rule | rule .)
[ reduce using rule 25 (rule_alt -> rule | rule .)
) reduce using rule 25 (rule_alt -> rule | rule .)
$end reduce using rule 25 (rule_alt -> rule | rule .)
| shift and go to state 76
- shift and go to state 74
! | [ reduce using rule 25 (rule_alt -> rule | rule .) ]
! - [ reduce using rule 25 (rule_alt -> rule | rule .) ]
Connected to a similar one:
rule_except : rule '-' rule_simple
| rule '-' rule_group
How do I fix these?
You really should think seriously about using the usual scanner/parser architecture. Otherwise, you will have to find a way to deal with whitespace.
As it is, you seem to be ignoring whitespace altogether. That means that the parser cannot see the whitespace between three consecutive identifiers. It will see them run together as asoupofundifferentiatedletters, and it has no way to know what the original intent was. This makes your grammar deeply ambiguous, because in the grammar two identifiers can follow each other on the assumption that something will cause them to be differentiated from each other. And ambiguous grammars always result in LR conflicts.
Having the identifiers (and other multi-character tokens) recognized by the lexer is much easier. Otherwise, you will have to rewrite your grammar to identify all the places where whitespace is allowed (such as around the punctuation in (identifier1|identifier2)) or required (such as between two identifiers).
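A minimal sketch of such a scanner, written with Python's standard `re` module rather than PLY's own lexer, is shown below. The token names (ID, ASSIGN, STRING, PUNCT) and the exact character sets are assumptions for illustration and do not cover the whole EBNF notation (character ranges, for instance, are left out).

```python
import re

# One alternative per token kind; whitespace is matched and then dropped.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<ASSIGN>::=)
  | (?P<ID>[A-Za-z][A-Za-z0-9]*)
  | (?P<STRING>'[^']*'|"[^"]*")
  | (?P<PUNCT>[()\[\]|*+?-])
""", re.VERBOSE)

def tokenize(text):
    """Split `text` into (kind, lexeme) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("foo bar baz"))   # three ID tokens
print(tokenize("foobarbaz"))     # a single ID token
```

Because whitespace is consumed by the scanner, the parser sees three separate identifiers in the first call and one in the second, which is the distinction a character-level grammar cannot make.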
Identifying identifiers in the scanner using regular expressions will also remove the other problems with your grammar and identifiers:
id -> LETTER LETTER id
id -> LETTER DIGIT id
id -> LETTER
These rules require id to be an odd number of characters, where the digits only appear in even positions. So a1b would be an id, but not ab1 or ab or a1. I'm sure that's not what you meant.
You seem to be trying to avoid left-recursion. Instead, you should embrace left-recursion. Bottom-up parsers, like PLY, love left-recursion. (They handle right-recursion, but at the cost of excessive parser stack usage.) So what you really want is:
id: LETTER | id LETTER | id DIGIT
There are other places in the grammar where similar changes are necessary.
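The difference between the original id rules and the left-recursive version can be checked directly. In this sketch, `matches_original_id` follows the three right-recursive rules quoted above, and `matches_left_recursive_id` uses the regular expression equivalent to id: LETTER | id LETTER | id DIGIT; both function names are made up here, and only ASCII letters and digits are considered.

```python
import re
import string

def matches_original_id(s):
    """id -> LETTER LETTER id | LETTER DIGIT id | LETTER"""
    if len(s) == 1:
        return s in string.ascii_letters
    if len(s) >= 3:
        # Consume two characters (letter, then letter or digit) and recurse.
        return (s[0] in string.ascii_letters
                and s[1] in string.ascii_letters + string.digits
                and matches_original_id(s[2:]))
    return False

def matches_left_recursive_id(s):
    """id: LETTER | id LETTER | id DIGIT, i.e. a letter then letters or digits."""
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", s) is not None

for candidate in ["a1b", "ab1", "ab", "a1"]:
    print(candidate, matches_original_id(candidate), matches_left_recursive_id(candidate))
```

Only a1b passes the original rules, while all four candidates pass the left-recursive definition.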
The other conflict is caused by your unorthodox handling of operator precedence, which might also be a result of your attempt to avoid left-recursion. The EBNF operators can be parsed with a simple precedence scheme, as with algebraic operators. However, the use of precedence declarations (%left and friends) will be complicated because of the "invisible" concatenation operator. Generally, you'll find it easier to use explicit precedence as in the standard expr/factor/term algebraic grammar. In your case, the equivalent would be something like:
item: id
| terminal
| '(' rule ')'
term: item
| item '*'
| item '+'
| item '?'
seq : term
| seq term
alt : seq
| alt '|' seq
except: term '-' term
rule: alt
| except
The handling of except in the above corresponds to the lack of information about the precedence of the - operator. That's expressed by effectively disallowing any mix of - and | operators without explicit parentheses.
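To see how the layered grammar groups operators, here is a hand-written recursive-descent sketch of the same item/term/seq/alt/except layering in Python. This swaps in a different parsing technique (top-down, not the LR parser PLY builds), and the tuple-based tree shape and the function name `parse_rule` are assumptions of the sketch. Tokens are plain strings, as a scanner might produce.

```python
def parse_rule(tokens):
    """Parse a list of token strings; return a nested-tuple tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def starts_item(tok):
        # An item begins with '(', an identifier, or a quoted terminal.
        return tok is not None and (tok == "(" or tok[0].isalnum() or tok[0] in "'\"")

    def item():
        nonlocal pos
        tok = peek()
        if not starts_item(tok):
            raise SyntaxError(f"unexpected token {tok!r}")
        pos += 1
        if tok == "(":
            inner = rule()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return inner
        return tok

    def term():                      # item with an optional postfix operator
        nonlocal pos
        node = item()
        if peek() in ("*", "+", "?"):
            node = (peek(), node)
            pos += 1
        return node

    def seq(first):                  # one or more terms (concatenation)
        parts = [first]
        while starts_item(peek()):
            parts.append(term())
        return parts[0] if len(parts) == 1 else ("seq", *parts)

    def rule():                      # except: term '-' term, otherwise alt
        nonlocal pos
        first = term()
        if peek() == "-":
            pos += 1
            return ("-", first, term())
        node = seq(first)
        while peek() == "|":
            pos += 1
            node = ("|", node, seq(term()))
        return node

    tree = rule()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree

print(parse_rule(["a", "b", "*", "|", "c"]))
```

Postfix operators bind tightest, then concatenation, then |. An input such as a | b - c is rejected, while ( a | b ) - c is accepted, mirroring the restriction on mixing - and | without parentheses.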
You will also find that you have a shift/reduce conflict here:
# The following will create a problem
prod: id "::=" rule
prod_list
: prod
| prod_list prod
(NOTE: the fact that I wrote that with left-recursion does not create the problem.)
That is not ambiguous, but it is not left-to-right parseable with a single lookahead token. It requires two tokens, because you cannot know whether or not the id is part of the currently-being-parsed sequence, or the beginning of a new production until you see the token after the id: if it is ::=, then the id was the start of a new production and should not be shifted into the current rule. The usual solution to that problem is a hack in the lexer: the lexer is wrapped by a function which keeps one extra token of lookahead, so that it can emit id ::= as a single token of type definition. There are a number of examples of this hack for various LR parsers in other SO questions.
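The lexer wrapper described above can be sketched as a Python generator that holds back one token. The token shapes (kind, text) and the kind names ID, ASSIGN and DEFINITION are assumptions of this sketch; a real PLY lexer would need an adapter around its token objects.

```python
def merge_definitions(tokens):
    """Yield tokens, fusing an ID followed by ASSIGN into one DEFINITION token."""
    stream = iter(tokens)
    pending = next(stream, None)
    while pending is not None:
        following = next(stream, None)          # the one extra token of lookahead
        if pending[0] == "ID" and following is not None and following[0] == "ASSIGN":
            yield ("DEFINITION", pending[1])
            pending = next(stream, None)        # both tokens were consumed
        else:
            yield pending
            pending = following

raw = [("ID", "a"), ("ASSIGN", "::="), ("ID", "b"), ("ID", "c"),
       ("ID", "d"), ("ASSIGN", "::="), ("ID", "e")]
print(list(merge_definitions(raw)))
```

The grammar can then use prod: DEFINITION rule, and the parser no longer needs a second lookahead token to decide whether an id starts a new production.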
Having said all of that, I really don't understand why you want to build a parser for EBNF in order to parse XML. Building a working parser from EBNF is basically what PLY does, except that it doesn't implement the "E" part, so you have to rewrite rules which use the ?, *, + and - operators. This can be handled automatically, although the - operator is non-trivial in general, but it is not going to be simple. It would be easier, IMHO, to rewrite the few EBNF rules into BNF and then just use PLY. But if you are looking for a challenge, go for it.
First of all, you have apparently slavishly translated the grammar. You need to tokenize the input stream.
Normally, something like id would be a terminal to be discerned by the lexical analyzer, rather than parsed as part of the grammar:
id : LETTER LETTER id
| LETTER DIGIT id
| LETTER
| DIGIT
It looks like everything you have under terminal should not be part of the grammar.
Second, you use right recursion in your grammar. While LALR works with both left and right recursion, you get smaller tables with left recursion.
Suppose you have the input string AA
If you were to insist on parsing identifiers, you'd want something more like
id : id LETTER
| id DIGIT
| LETTER
Finally, Shift-Reduce conflicts are not necessarily bad. They frequently occur in numeric expressions, to be resolved by operator precedence.
Reduce-Reduce conflicts are always bad.
The bison grammar I wrote for parsing a text file gives me 10 shift/reduce conflicts. The parser.output file doesn't help me enough. The file gives me information as:
State 38 conflicts: 5 shift/reduce
State 40 conflicts: 4 shift/reduce
State 46 conflicts: 1 shift/reduce
Grammar
0 $accept: session $end
1 session: boot_data section_start
2 boot_data: head_desc statement head_desc head_desc
3 head_desc: OPEN_TOK BOOT_TOK statement CLOSE_TOK
4 | OPEN_TOK statement CLOSE_TOK
5 statement: word
6 | statement word
7 word: IDENTIFIER
8 | TIME
9 | DATE
10 | DATA
11 section_start: section_details
12 | section_start section_details
13 | section_start head_desc section_details
14 $#1: /* empty */
15 section_details: $#1 section_head section_body section_end
16 section_head: START_TOK head_desc START_TOK time_stamp
17 time_stamp: statement DATE TIME
18 section_body: log_entry
19 | section_body log_entry
20 log_entry: entry_prefix body_statements
21 | entry_prefix TIME body_statements
22 body_statements: statement
23 | head_desc
24 entry_prefix: ERROR_TOK
25 | WARN_TOK
26 | /* empty */
27 $#2: /* empty */
28 section_end: END_TOK statement $#2 END_TOK head_desc
state 38
8 word: TIME .
21 log_entry: entry_prefix TIME . body_statements
OPEN_TOK shift, and go to state 1
TIME shift, and go to state 6
DATE shift, and go to state 7
DATA shift, and go to state 8
IDENTIFIER shift, and go to state 9
OPEN_TOK [reduce using rule 8 (word)]
TIME [reduce using rule 8 (word)]
DATE [reduce using rule 8 (word)]
DATA [reduce using rule 8 (word)]
IDENTIFIER [reduce using rule 8 (word)]
$default reduce using rule 8 (word)
head_desc go to state 39
statement go to state 40
word go to state 11
body_statements go to state 45
state 39
23 body_statements: head_desc .
$default reduce using rule 23 (body_statements)
state 40
6 statement: statement . word
22 body_statements: statement .
TIME shift, and go to state 6
DATE shift, and go to state 7
DATA shift, and go to state 8
IDENTIFIER shift, and go to state 9
TIME [reduce using rule 22 (body_statements)]
DATE [reduce using rule 22 (body_statements)]
DATA [reduce using rule 22 (body_statements)]
IDENTIFIER [reduce using rule 22 (body_statements)]
$default reduce using rule 22 (body_statements)
word go to state 19
state 46
9 word: DATE .
17 time_stamp: statement DATE . TIME
TIME shift, and go to state 48
TIME [reduce using rule 9 (word)]
$default reduce using rule 9 (word)
The equivalent part of my grammar is:
statement : word
{
printf("WORD\n");
$$=$1;
}
|statement word
{
printf("STATEMENTS\n");
$$=$1;
printf("STATEMENT VALUE== %s\n\n",$$);
}
;
word : IDENTIFIER
{
printf("IDENTIFIER\n");
$$=$1;
}
|TIME
{
printf("TIME\n");
$$=$1;
}
|DATE
{
printf("DATE\n");
$$=$1;
}
|DATA
{
}
;
section_start : section_details
{
printf("SINGLE SECTIONS\n");
}
|section_start section_details
{
printf("MULTIPLE SECTIONS\n");
}
|section_start head_desc section_details
;
section_details :
{
fprintf(fp,"\n%d:\n",set_count);
}
section_head section_body section_end
{
printf("SECTION DETAILS\n");
set_count++;
}
;
section_head : START_TOK head_desc START_TOK statement time_stamp
{
printf("SECTION HEAD...\n\n%s===\n\n%s\n",$2,$4);
fprintf(fp,"%s\n",$4);
}
;
time_stamp : DATET TIME
{
}
;
section_body :log_entry
{
}
|section_body log_entry
{
}
;
log_entry : entry_prefix body_statements
{
}
|entry_prefix TIME body_statements
{
}
;
body_statements : statement
{
}
|head_desc
{
}
;
Please help me to fix this..
Thanks
A conflict in a yacc/bison parser means that the grammar is not LALR(1) -- which generally means that something is either ambiguous or needs more than 1 token of lookahead. The default resolution is to choose always shifting over reducing, or choose always reducing the first rule (for reduce/reduce conflicts), which results in a parser that will recognize a subset of the language described by the grammar. That may or may not be ok -- in the case of an ambiguous grammar, it is often the case that the "subset" is in fact the entire language, and the default resolution trims off the ambiguous case. For cases that require more lookahead, however, as well as some ambiguous cases, the default resolution will result in failing to parse some things in the language.
To figure out what is going wrong on any given conflict, the .output file generally gives you all you need. In your case, you have 3 states with conflicts -- generally the conflicts in a single state are all a single related issue.
state 38
8 word: TIME .
21 log_entry: entry_prefix TIME . body_statements
This conflict is an ambiguity between the rules for log_entry and body_statements:
log_entry: entry_prefix body_statements
| entry_prefix TIME body_statements
a body_statements can be a sequence of one or more TIME/DATE/DATA/IDENTIFIER tokens, so when you have an input with (eg) entry_prefix TIME DATA, it could be either the first log_entry rule with TIME DATA as the body_statements or the second log_entry rule with just DATA as the body_statements.
The default resolution in this case will favor the second rule (shifting to treat the TIME as part of log_entry rather than reducing it as word to be part of a body_statements), and will result in a "subset" that is the entire language -- the only parses that will be missed are ambiguous. This is a case analogous to the "dangling else" that shows up in some languages, where the default shift likely does exactly what you want.
To eliminate this conflict, the easiest way is just to get rid of the log_entry: entry_prefix TIME body_statements rule. This has the opposite effect of the default resolution -- now TIME will always be considered part of the body. The issue is that now you don't have a separate rule to reduce if you want to treat the case of there being an initial TIME in the body differently. You can do a check in the action for a body that begins with TIME if you need to do something special.
state 40
6 statement: statement . word
22 body_statements: statement .
This is another ambiguity problem, this time with section_body where it can't tell where one log_entry ends and another begins. A section_body consists of one or more log_entries, each of which is an entry_prefix followed by a body_statements. The body_statements as noted above may be one or more word tokens, while an entry_prefix may be empty. So if you have a section_body that is just a sequence of word tokens, it can be parsed as either a single log_entry (with no entry_prefix) or as a sequence of log_entry rules, each with no entry_prefix. The default shift over reduce resolution will favor putting as many tokens as possible into a single statement before reducing a body_statements, so will parse it as a single log_entry, which is likely ok.
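The size of this ambiguity is easy to enumerate. The sketch below lists every way to cut a run of word tokens into consecutive log_entry groups, assuming every entry_prefix is empty and ignoring the separate TIME rule; the function name `splits` is invented for the example.

```python
def splits(words):
    """Return every way to cut `words` into consecutive non-empty groups."""
    if not words:
        return [[]]
    result = []
    for size in range(1, len(words) + 1):
        # The first log_entry takes `size` words; recurse on the remainder.
        for rest in splits(words[size:]):
            result.append([words[:size]] + rest)
    return result

for grouping in splits(["a", "b", "c"]):
    print(grouping)
```

Three words already give four groupings. The last one printed, with all words in one group, corresponds to the parse chosen by the default shift-over-reduce resolution.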
To eliminate this conflict, you'll need to refactor the grammar. Since the tail of any statement in a log_entry might be another log_entry with no entry_prefix and statement for the body_statements, you pretty much need to eliminate that case (which is what the default conflict resolution does). Assuming you've fixed the previous conflict by removing the second log_entry rule, first unfactor log_entry to make the problematic case its own rule:
log_entry: ERROR_TOK body_statements
| WARN_TOK body_statements
| head_desc
initial_log_entry: log_entry
| statement
Now change the section_body rule so it uses the split-off rule for just the first one:
section_body: initial_log_entry
| section_body log_entry
and the conflict goes away.
state 46
9 word: DATE .
17 time_stamp: statement DATE . TIME
This conflict is a lookahead ambiguity problem -- since DATE and TIME tokens can appear in a statement, when parsing a time_stamp it can't tell where the statement ends and the terminal DATE/TIME begins. The default resolution will result in treating any DATE TIME pair seen as the end of the time_stamp. Now since time_stamp only appears at the end of a section_head just before a section_body, and a section_body may begin with a statement, this may be fine as well.
So it may well be the case that all the conflicts in your grammar are ignorable, and it may even be desirable to do so, as that would be simpler than rewriting the grammar to get rid of them. On the other hand, the presence of conflicts makes it tougher to modify the grammar, since any time you do, you need to reexamine all of the conflicts to make sure they are still benign.
There's a confusing issue with "the default resolution of a conflict" and "the default action in a state". These two defaults have nothing to do with one another -- the first is a decision made by yacc/bison when it builds the parser, and the second is a decision by the parser at runtime. So when you have a state in the output file like:
state 46
9 word: DATE .
17 time_stamp: statement DATE . TIME
TIME shift, and go to state 48
TIME [reduce using rule 9 (word)]
$default reduce using rule 9 (word)
This is telling you that bison has determined that the possible actions from this state are to either shift to state 48 or reduce rule 9. The shift action is only possible if the lookahead token is TIME, while the reduce action is possible for many lookahead tokens including TIME. So in the interest of minimizing table size, rather than enumerating all the possible next tokens for the reduce action, it just says $default -- meaning the parser will do the reduce action as long as no previous action matches the lookahead token. The $default action will always be the last action in the state.
So the parser's code in this state will run down the list of actions until it finds one that matches the lookahead token and does that action. The TIME [reduce... action is only included to make it clear that there was a conflict in this state and that bison resolved it by disallowing the action reduce when the lookahead is TIME. So the actual action table constructed for this state will have just a single action (shift on token TIME) followed by a default action (reduce on any token).
Note that it does this even though not all tokens are legal after a word: it still does the reduction on any token. That's because even if the next token is something illegal, the reduction doesn't read it from the input (it's still the lookahead), so a later state (after potentially multiple default reductions) will see (and report) the error.
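As a rough model of the runtime lookup described above, the following Python sketch represents state 46 as one explicit action plus a default. The dictionary layout and the name `next_action` are assumptions for illustration; bison's real tables are packed integer arrays.

```python
# State 46 after conflict resolution: one explicit action and a default.
STATE_46 = {
    "actions": {"TIME": ("shift", 48)},
    "default": ("reduce", 9),
}

def next_action(state, lookahead):
    """Use the explicit action for this lookahead if any, else the default."""
    return state["actions"].get(lookahead, state["default"])

print(next_action(STATE_46, "TIME"))      # explicit shift
print(next_action(STATE_46, "END_TOK"))   # falls through to the default reduce
```

The discarded reduce on TIME does not appear in the table at all, and the default reduce is taken even for a lookahead that will later turn out to be an error.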
Here are the relevant parts of my Bison grammar rules:
statement:
expression ';' |
IF expression THEN statement ELSE statement END_IF ';'
;
expression:
IDENTIFIER |
IDENTIFIER '('expressions')' |
LIT_INT |
LIT_REAL |
BOOL_OP |
LOG_NOT expression |
expression operator expression |
'('expression')'
;
expressions:
expression |
expressions ',' expression
;
operator:
REL_OP |
ADD_OP |
MULT_OP |
LOG_OR |
LOG_AND
;
When compiling, I get 10 shift/reduce conflicts:
5 conflicts are caused by the LOG_NOT expression rule:
State 45
25 expression: LOG_NOT expression .
26 | expression . operator expression
REL_OP shift, and go to state 48
ADD_OP shift, and go to state 49
MULT_OP shift, and go to state 50
LOG_OR shift, and go to state 51
LOG_AND shift, and go to state 52
REL_OP [reduce using rule 25 (expression)]
ADD_OP [reduce using rule 25 (expression)]
MULT_OP [reduce using rule 25 (expression)]
LOG_OR [reduce using rule 25 (expression)]
LOG_AND [reduce using rule 25 (expression)]
$default reduce using rule 25 (expression)
operator go to state 54
5 conflicts are caused by the expression operator expression rule:
State 62
26 expression: expression . operator expression
26 | expression operator expression .
REL_OP shift, and go to state 48
ADD_OP shift, and go to state 49
MULT_OP shift, and go to state 50
LOG_OR shift, and go to state 51
LOG_AND shift, and go to state 52
REL_OP [reduce using rule 26 (expression)]
ADD_OP [reduce using rule 26 (expression)]
MULT_OP [reduce using rule 26 (expression)]
LOG_OR [reduce using rule 26 (expression)]
LOG_AND [reduce using rule 26 (expression)]
$default reduce using rule 26 (expression)
operator go to state 54
I know that the problem has to do with precedence. For instance, if the expression was:
a + b * c
Does Bison shift after the a + and hope to find an expression, or does it reduce the a to an expression? I have a feeling that this is due to the Bison 1-token look-ahead limitation, but I can't figure out how to rewrite the rule(s) to resolve the conflicts.
My professor will take points off for shift/reduce conflicts, so I can't use %expect. My professor has also stated that we cannot use %left or %right precedence values.
This is my first post on Stack, so please let me know if I'm posting this all wrong. I've searched existing posts, but this really seems a case-by-case thing. If I use any code from Stack, I will note the source in my submitted project.
Thanks!
As written, your grammar is ambiguous. So it must have conflicts.
Nothing in the grammar says how tightly each operator binds, and apparently you're not allowed to use bison's precedence declarations either. Even if you were allowed to, you wouldn't be able to use operator as a non-terminal, because you need to distinguish between
expr1 + expr2 * expr3        expr1 * expr2 + expr3
  |       |       |            |       |       |
  |       +---+---+            +---+---+       |
  |           |                    |           |
  |         expr                 expr          |
  |           |                    |           |
  +-----+-----+                    +-----+-----+
        |                                |
      expr                             expr
And you cannot distinguish between them if + and * are replaced with operator. The terminals actually have to be visible.
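For illustration only, since the assignment rules out precedence declarations: a sketch of how the expression rules might look with the operator tokens written directly in the productions. The precedence order (LOG_OR lowest, LOG_NOT highest) is an assumption, not something stated in the question, and the statement rule is left out.

```yacc
/* Sketch only: the precedence order below is an assumption. */
%token IDENTIFIER LIT_INT LIT_REAL BOOL_OP

%left LOG_OR            /* lowest precedence */
%left LOG_AND
%left REL_OP
%left ADD_OP
%left MULT_OP
%right LOG_NOT          /* highest precedence */

%%

expression:
    IDENTIFIER |
    IDENTIFIER '(' expressions ')' |
    LIT_INT |
    LIT_REAL |
    BOOL_OP |
    LOG_NOT expression |
    expression LOG_OR expression |
    expression LOG_AND expression |
    expression REL_OP expression |
    expression ADD_OP expression |
    expression MULT_OP expression |
    '(' expression ')'
;

expressions:
    expression |
    expressions ',' expression
;
```

Bison takes a rule's precedence from the last terminal in it, so expression ADD_OP expression gets the precedence of ADD_OP and can be compared with the lookahead token. A rule written as expression operator expression contains no terminal at all, so it has no precedence and the declarations cannot resolve its conflicts.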
Now, here's a quick clue:
expr1 + expr2 + expr3 reduces expr1 + expr2 first
expr1 * expr2 + expr3 reduces expr1 * expr2 first
So in non-terminal-1 + non-terminal-2, non-terminal-1 can produce x + y or x * y. But in non-terminal-1 * non-terminal-2, non-terminal-1 cannot produce x + y (although it can produce x * y).
Thanks! I did some more troubleshooting, and fixed the shift/reduce conflicts by rewriting the expression and operator rules:
expression:
expression LOG_OR term1 |
term1
;
term1:
term1 LOG_AND term2 |
term2
;
term2:
term2 REL_OP term3 |
term3
;
term3:
term3 ADD_OP term4 |
term4
;
term4:
term4 MULT_OP factor |
factor
;
factor:
IDENTIFIER |
IDENTIFIER '('expressions')' |
LIT_INT |
LIT_REAL |
BOOL_OP |
LOG_NOT factor |
'('expression')'
;
expressions:
expression |
expressions ',' expression
;
I had to rearrange what was actually an expression and what was actually a factor. I made a factor rule that holds all of the operands (identifiers, calls, literals, negations and parenthesized expressions), which has the highest precedence. I then made a term# rule for each binary operator, which puts each operator at its own precedence level (term4 binds tighter than term3, term3 binds tighter than term2, etc.).
This allowed me to set each operator at a different precedence without using any of the built-in % precedence declarations.
I was able to parse all of my test input files without error. Any thoughts on the design?
This is part of the y.output file:
state 65
15 Expression: Expression . "&&" Expression
16 | Expression . "<" Expression
17 | Expression . "+" Expression
18 | Expression . "-" Expression
19 | Expression . "*" Expression
20 | Expression . "[" Expression "]"
21 | Expression . "." "length"
22 | Expression . "." Identifier "(" Expression "," Expression ")"
25 | "!" Expression .
"[" shift, and go to state 67
"<" shift, and go to state 69
"+" shift, and go to state 70
"-" shift, and go to state 71
"*" shift, and go to state 72
"." shift, and go to state 73
"[" [reduce using rule 25 (Expression)]
"<" [reduce using rule 25 (Expression)]
"+" [reduce using rule 25 (Expression)]
"-" [reduce using rule 25 (Expression)]
"*" [reduce using rule 25 (Expression)]
"." [reduce using rule 25 (Expression)]
$default reduce using rule 25 (Expression)
This is how the precedence of the operators is set:
%left "&&"
%left '<'
%left '-' '+'
%left '*'
%right '!'
%left '.'
%left '(' ')'
%left '[' ']'
In bison, there is a difference between "x" and 'x'; they are not the same token. So, assuming you are using bison, your precedence declarations don't refer to the terminals in the productions.
Bison also allows %token definitions of the following form:
%token name quoted-string ...
For example (a short excerpt from bison's own grammar file):
%token
PERCENT_CODE "%code"
PERCENT_DEBUG "%debug"
PERCENT_DEFAULT_PREC "%default-prec"
PERCENT_DEFINE "%define"
PERCENT_DEFINES "%defines"
PERCENT_ERROR_VERBOSE "%error-verbose"
Once the symbols have been aliased, they can be used interchangeably in the grammar, making it possible to use the double-quoted string in productions; some people find such grammars easier to read. However, there is no mechanism to ensure that the lexer produces the correct token number for a double-quoted string since it only has access to the token names.
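A minimal sketch of how the quoting could be made consistent, assuming bison. The names AND, LENGTH and IDENT are hypothetical, the method-call rule is omitted, and only the quoting is changed relative to the question; the precedence order is kept as posted.

```yacc
/* Sketch only: AND, LENGTH and IDENT are hypothetical token names. */
%token IDENT
%token AND    "&&"      /* the lexer returns AND; the grammar may write "&&" */
%token LENGTH "length"  /* the lexer returns LENGTH */

%left "&&"
%left '<'
%left '-' '+'
%left '*'
%right '!'
%left '.'
%left '[' ']'

%%

Expression
    : Expression "&&" Expression
    | Expression '<' Expression
    | Expression '+' Expression
    | Expression '-' Expression
    | Expression '*' Expression
    | Expression '[' Expression ']'
    | Expression '.' "length"
    | '!' Expression
    | IDENT
    ;
```

Single-character tokens are written with single quotes both in the declarations and in the productions, so the lexer can simply return the character itself (for example, return '<';). Multi-character tokens get a name through %token, and the lexer returns that name.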
The "original" yacc, at least in the current "byacc" version maintained by Thomas Dickey, allows both single- and double-quoted token names, but does not distinguish between them; both "+" and '+' are mapped to token number 43 ('+'). It also does not provide any easy way to alias token names, so the double-quoted multi-character strings are not particularly easy to use in a reliable way.