Bison - shift/reduce conflict identifier - parsing

I have 1 shift/reduce conflict at state 19. I think that there may be a problem with the different occurrences of 'identifier' but I'm struggling to understand the bison report and resolve the conflict. Below is my grammar followed by the bison report with state information:
%{
#include <cstdio>
#include <iostream>
using namespace std;
extern "C" int yylex();
extern "C" int yyparse();
void yyerror(const char *s);
%}
%union{
int int_val;
double d_val;
char *strng;
}
%token TLINEC TBLOCKC TPRINT
%token TASSIGN TCPA TOB TCB TCOMMA TSEMIC
%token TIF TELSE TFOR
%token TINT TFLOAT TD TCHAR
%token TPLUS TMINUS TDIV TMULT
%token TLT TGT TAND TOR TEQUAL TNE
%token<int_val> TINTEGER
%token<d_val> TDOUBLE
%token<strng> TID
%token TOPA
%start program
%%
program : command_list
command_list : declaration
| command_list declaration
declaration : function_dec
| variable_dec
| expression
variable_dec : type identifier TSEMIC
| type assignment TSEMIC
assignment : identifier TASSIGN expression
var_list : expression
| var_list TCOMMA expression
|
function_dec : type identifier TOPA p_list TCPA scope
p_list : variable_dec
| p_list TCOMMA variable_dec
type : TINT
| TD
| TFLOAT
| TCHAR
/* inside function scope */
scope : TOB command_list TCB
| TOB TCB
function_call : identifier TOPA var_list TCPA TSEMIC
/* expression rules */
expression : assignment
| function_call
| TOPA expression TCPA
| constant arithmetic_op expression
| constant logical_op expression
| constant
constant : identifier
| num
identifier : TID
num : TINTEGER
| TDOUBLE
arithmetic_op : TPLUS | TMINUS | TDIV | TMULT
logical_op : TLT | TGT | TAND | TOR | TEQUAL | TNE
%%
int main(int, char**){
int y=0;
do{
y=yyparse();
}while(y);
}
void yyerror(const char *s){
cout << "parse error! Message: " << s << endl;
exit(-1);
}
Bison Report:
State 19 conflicts: 1 shift/reduce
Grammar
0 $accept: program $end
1 program: command_list
2 command_list: declaration
3 | command_list declaration
4 declaration: function_dec
5 | variable_dec
6 | expression
7 variable_dec: type identifier TSEMIC
8 | type assignment TSEMIC
9 assignment: identifier TASSIGN expression
10 var_list: expression
11 | var_list TCOMMA expression
12 | /* empty */
13 function_dec: type identifier TOPA p_list TCPA scope
14 p_list: variable_dec
15 | p_list TCOMMA variable_dec
16 type: TINT
17 | TD
18 | TFLOAT
19 | TCHAR
20 scope: TOB command_list TCB
21 | TOB TCB
22 function_call: identifier TOPA var_list TCPA TSEMIC
23 expression: assignment
24 | function_call
25 | TOPA expression TCPA
26 | constant arithmetic_op expression
27 | constant logical_op expression
28 | constant
29 constant: identifier
30 | num
31 identifier: TID
32 num: TINTEGER
33 | TDOUBLE
34 arithmetic_op: TPLUS
35 | TMINUS
36 | TDIV
37 | TMULT
38 logical_op: TLT
39 | TGT
40 | TAND
41 | TOR
42 | TEQUAL
43 | TNE
Terminals, with rules where they appear
$end (0) 0
error (256)
TLINEC (258)
TBLOCKC (259)
TPRINT (260)
TASSIGN (261) 9
TCPA (262) 13 22 25
TOB (263) 20 21
TCB (264) 20 21
TCOMMA (265) 11 15
TSEMIC (266) 7 8 22
TIF (267)
TELSE (268)
TFOR (269)
TINT (270) 16
TFLOAT (271) 18
TD (272) 17
TCHAR (273) 19
TPLUS (274) 34
TMINUS (275) 35
TDIV (276) 36
TMULT (277) 37
TLT (278) 38
TGT (279) 39
TAND (280) 40
TOR (281) 41
TEQUAL (282) 42
TNE (283) 43
TINTEGER (284) 32
TDOUBLE (285) 33
TID (286) 31
TOPA (287) 13 22 25
Nonterminals, with rules where they appear
$accept (33)
on left: 0
program (34)
on left: 1, on right: 0
command_list (35)
on left: 2 3, on right: 1 3 20
declaration (36)
on left: 4 5 6, on right: 2 3
variable_dec (37)
on left: 7 8, on right: 5 14 15
assignment (38)
on left: 9, on right: 8 23
var_list (39)
on left: 10 11 12, on right: 11 22
function_dec (40)
on left: 13, on right: 4
p_list (41)
on left: 14 15, on right: 13 15
type (42)
on left: 16 17 18 19, on right: 7 8 13
scope (43)
on left: 20 21, on right: 13
function_call (44)
on left: 22, on right: 24
expression (45)
on left: 23 24 25 26 27 28, on right: 6 9 10 11 25 26 27
constant (46)
on left: 29 30, on right: 26 27 28
identifier (47)
on left: 31, on right: 7 9 13 22 29
num (48)
on left: 32 33, on right: 30
arithmetic_op (49)
on left: 34 35 36 37, on right: 26
logical_op (50)
on left: 38 39 40 41 42 43, on right: 27
state 18
26 expression: constant . arithmetic_op expression
27 | constant . logical_op expression
28 | constant . [$end, TCPA, TCB, TCOMMA, TSEMIC, TINT, TFLOAT, TD, TCHAR, TINTEGER, TDOUBLE, TID, TOPA]
34 arithmetic_op: . TPLUS
35 | . TMINUS
36 | . TDIV
37 | . TMULT
38 logical_op: . TLT
39 | . TGT
40 | . TAND
41 | . TOR
42 | . TEQUAL
43 | . TNE
TPLUS shift, and go to state 26
TMINUS shift, and go to state 27
TDIV shift, and go to state 28
TMULT shift, and go to state 29
TLT shift, and go to state 30
TGT shift, and go to state 31
TAND shift, and go to state 32
TOR shift, and go to state 33
TEQUAL shift, and go to state 34
TNE shift, and go to state 35
$default reduce using rule 28 (expression)
arithmetic_op go to state 36
logical_op go to state 37
state 19
9 assignment: identifier . TASSIGN expression
22 function_call: identifier . TOPA var_list TCPA TSEMIC
29 constant: identifier . [$end, TCPA, TCB, TCOMMA, TSEMIC, TINT, TFLOAT, TD, TCHAR, TPLUS, TMINUS, TDIV, TMULT, TLT, TGT, TAND, TOR, TEQUAL, TNE, TINTEGER, TDOUBLE, TID, TOPA]
TASSIGN shift, and go to state 38
TOPA shift, and go to state 39
TOPA [reduce using rule 29 (constant)]
$default reduce using rule 29 (constant)
state 20
30 constant: num .
$default reduce using rule 30 (constant)

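According to the output of bison, in state 19 it is possible to reduce identifier to constant (rule 29) on a lookahead of (. A constant is followed either by an operator or by whatever follows the enclosing expression, so this means an expression can be followed by (. How can this be possible? In other words, under what circumstances can expression be followed by an open parenthesis?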
A search through the grammar reveals only three uses of TOPA. Two of them (function declarations and function calls) follow identifier and identifier cannot derive expression, so it must be the third one:
expression: TOPA expression TCPA;
However, the only way that a reduction of expression could occur immediately before this instance of ( is if it were possible for two expressions to occur consecutively. Normally, in C-like languages that possibility is eliminated by requiring a ; to separate statements (which might be, start with, or end with expression), and I suppose that was your intention.
However, we see that:
command_list: declaration
| command_list declaration
declaration: expression
which allows two consecutive expressions without intervening semicolon.
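The reasoning above can be checked mechanically. The following is a minimal Python sketch, not part of bison, that computes classic FOLLOW sets for the grammar in the question. FOLLOW sets are coarser than bison's LALR lookaheads, but they are enough to show that ( can follow expression. The helper names (load, first_sets, follow_sets) and the "fixed" variant that appends TSEMIC to declaration: expression are assumptions made for illustration only.

```python
RULES = """
program: command_list
command_list: declaration | command_list declaration
declaration: function_dec | variable_dec | expression
variable_dec: type identifier TSEMIC | type assignment TSEMIC
assignment: identifier TASSIGN expression
var_list: expression | var_list TCOMMA expression |
function_dec: type identifier TOPA p_list TCPA scope
p_list: variable_dec | p_list TCOMMA variable_dec
type: TINT | TD | TFLOAT | TCHAR
scope: TOB command_list TCB | TOB TCB
function_call: identifier TOPA var_list TCPA TSEMIC
expression: assignment | function_call | TOPA expression TCPA | constant arithmetic_op expression | constant logical_op expression | constant
constant: identifier | num
identifier: TID
num: TINTEGER | TDOUBLE
arithmetic_op: TPLUS | TMINUS | TDIV | TMULT
logical_op: TLT | TGT | TAND | TOR | TEQUAL | TNE
"""

def load(text):
    """Parse 'head: alt | alt' lines into {head: [[symbol, ...], ...]}."""
    grammar = {}
    for line in text.strip().splitlines():
        head, body = line.split(":", 1)
        grammar[head.strip()] = [alt.split() for alt in body.split("|")]
    return grammar

def first_sets(grammar):
    """Fixed-point computation of FIRST sets and the set of nullable nonterminals."""
    first = {nt: set() for nt in grammar}
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, alts in grammar.items():
            for alt in alts:
                all_nullable = True
                for sym in alt:
                    add = first[sym] if sym in grammar else {sym}
                    if not add <= first[head]:
                        first[head] |= add
                        changed = True
                    if sym not in nullable:
                        all_nullable = False
                        break
                if all_nullable and head not in nullable:
                    nullable.add(head)
                    changed = True
    return first, nullable

def follow_sets(grammar, start):
    """Fixed-point computation of FOLLOW sets, scanning each alternative right to left."""
    first, nullable = first_sets(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add("$end")
    changed = True
    while changed:
        changed = False
        for head, alts in grammar.items():
            for alt in alts:
                trailer = set(follow[head])  # tokens that may follow the rest of alt
                for sym in reversed(alt):
                    if sym in grammar:
                        if not trailer <= follow[sym]:
                            follow[sym] |= trailer
                            changed = True
                        trailer = (trailer | first[sym]) if sym in nullable else set(first[sym])
                    else:
                        trailer = {sym}
    return follow

original_follow = follow_sets(load(RULES), "program")
print("TOPA" in original_follow["expression"])

# Hypothetical variant: require a semicolon after an expression used as a declaration.
FIXED_RULES = RULES.replace("| variable_dec | expression", "| variable_dec | expression TSEMIC")
fixed_follow = follow_sets(load(FIXED_RULES), "program")
print(sorted(fixed_follow["expression"]))
```

In the original grammar TOPA ends up in FOLLOW(expression) because command_list lets one declaration directly follow another. In the hypothetical variant only TSEMIC, TCPA and TCOMMA remain, which is one way to see why the conflict is tied to the missing separator.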
As always, I encourage the use of more readable tokens in a bison grammar. '(' is much easier to understand than TOPA, and I honestly have no idea what TOB might be. But it's a question of style.

Related

Ply shift/reduce conflicts: dangling else and empty productions

I had lots of conflicts, most of them due to operators and relational operators with different precedences. But I still face some conflicts that I don't really know how to tackle; some of them are below. I suspect that maybe I should do epsilon elimination for stmtlist, but to be honest I'm not sure about it.
state 70:
state 70
(27) block -> LCB varlist . stmtlist RCB
(25) varlist -> varlist . vardec
(28) stmtlist -> . stmt
(29) stmtlist -> . stmtlist stmt
(30) stmtlist -> .
(15) vardec -> . type idlist SEMICOLON
(33) stmt -> . RETURN exp SEMICOLON
(34) stmt -> . exp SEMICOLON
(35) stmt -> . WHILE LRB exp RRB stmt
(36) stmt -> . FOR LRB exp SEMICOLON exp SEMICOLON exp RRB stmt
(37) stmt -> . IF LRB exp RRB stmt elseiflist
(38) stmt -> . IF LRB exp RRB stmt elseiflist ELSE stmt
(39) stmt -> . PRINT LRB ID RRB SEMICOLON
(40) stmt -> . block
(7) type -> . INTEGER
(8) type -> . FLOAT
(9) type -> . BOOLEAN
(44) exp -> . lvalue ASSIGN exp
(45) exp -> . exp SUM exp
(46) exp -> . exp MUL exp
(47) exp -> . exp SUB exp
(48) exp -> . exp DIV exp
(49) exp -> . exp MOD exp
(50) exp -> . exp AND exp
(51) exp -> . exp OR exp
(52) exp -> . exp LT exp
(53) exp -> . exp LE exp
(54) exp -> . exp GT exp
(55) exp -> . exp GE exp
(56) exp -> . exp NE exp
(57) exp -> . exp EQ exp
(58) exp -> . const
(59) exp -> . lvalue
(60) exp -> . ID LRB explist RRB
(61) exp -> . LRB exp RRB
(62) exp -> . ID LRB RRB
(63) exp -> . SUB exp
(64) exp -> . NOT exp
(27) block -> . LCB varlist stmtlist RCB
(31) lvalue -> . ID
(32) lvalue -> . ID LSB exp RSB
(72) const -> . INTEGERNUMBER
(73) const -> . FLOATNUMBER
(74) const -> . TRUE
(75) const -> . FALSE
! shift/reduce conflict for RETURN resolved as shift
! shift/reduce conflict for WHILE resolved as shift
! shift/reduce conflict for FOR resolved as shift
! shift/reduce conflict for IF resolved as shift
! shift/reduce conflict for PRINT resolved as shift
! shift/reduce conflict for ID resolved as shift
! shift/reduce conflict for LRB resolved as shift
! shift/reduce conflict for SUB resolved as shift
! shift/reduce conflict for NOT resolved as shift
! shift/reduce conflict for LCB resolved as shift
! shift/reduce conflict for INTEGERNUMBER resolved as shift
! shift/reduce conflict for FLOATNUMBER resolved as shift
! shift/reduce conflict for TRUE resolved as shift
! shift/reduce conflict for FALSE resolved as shift
RCB reduce using rule 30 (stmtlist -> .)
RETURN shift and go to state 99
WHILE shift and go to state 101
FOR shift and go to state 102
IF shift and go to state 103
PRINT shift and go to state 104
INTEGER shift and go to state 8
FLOAT shift and go to state 9
BOOLEAN shift and go to state 10
ID shift and go to state 31
LRB shift and go to state 36
SUB shift and go to state 34
NOT shift and go to state 37
LCB shift and go to state 45
INTEGERNUMBER shift and go to state 38
FLOATNUMBER shift and go to state 39
TRUE shift and go to state 40
FALSE shift and go to state 41
! RETURN [ reduce using rule 30 (stmtlist -> .) ]
! WHILE [ reduce using rule 30 (stmtlist -> .) ]
! FOR [ reduce using rule 30 (stmtlist -> .) ]
! IF [ reduce using rule 30 (stmtlist -> .) ]
! PRINT [ reduce using rule 30 (stmtlist -> .) ]
! ID [ reduce using rule 30 (stmtlist -> .) ]
! LRB [ reduce using rule 30 (stmtlist -> .) ]
! SUB [ reduce using rule 30 (stmtlist -> .) ]
! NOT [ reduce using rule 30 (stmtlist -> .) ]
! LCB [ reduce using rule 30 (stmtlist -> .) ]
! INTEGERNUMBER [ reduce using rule 30 (stmtlist -> .) ]
! FLOATNUMBER [ reduce using rule 30 (stmtlist -> .) ]
! TRUE [ reduce using rule 30 (stmtlist -> .) ]
! FALSE [ reduce using rule 30 (stmtlist -> .) ]
stmtlist shift and go to state 96
vardec shift and go to state 97
stmt shift and go to state 98
type shift and go to state 72
exp shift and go to state 100
block shift and go to state 105
lvalue shift and go to state 33
const shift and go to state 35
here is a list of all productions:
program → declist main ( ) block
declist → dec | declist dec | 𝜖
dec → vardec | funcdec
type → int | float | bool
iddec → id | id [ exp ] | id=exp
idlist → iddec | idlist , iddec
vardec → type idlist ;
funcdec → type id (paramdecs) block | void id (paramdecs) block
paramdecs → paramdecslist | 𝜖
paramdecslist → paramdec | paramdecslist , paramdec
paramdec → type id | type id []
varlist → vardec | varlist vardec | 𝜖
block → { varlist stmtlist }
stmtlist → stmt | stmtlist stmt | 𝜖
lvalue → id | id [exp]
stmt → return exp ; | exp ;| block |
while (exp) stmt |
for(exp ; exp ; exp) stmt |
if (exp) stmt elseiflist | if (exp) stmt elseiflist else stmt |
print ( id) ;
elseiflist → elif (exp) stmt | elseiflist elif (exp) stmt | 𝜖
exp → lvalue=exp | exp operator exp |exp relop exp|
const | lvalue | id(explist) | (exp) | id() | - exp | ! exp
operator → “||” | && | + | - | * | / | %
const → intnumber | floatnumber | true | false
relop → > | < | != | == | <= | >=
explist → exp | explist,exp
Another problem is the famous dangling else. I added ('nonassoc', 'IFP'), ('left', 'ELSE' , 'ELIF') to the precedence tuple and changed the grammar in this way:
def p_stmt_5(self, p):
    """stmt : IF LRB exp RRB stmt elseiflist %prec IFP """
    print("""stmt : IF LRB exp RRB stmt elseiflist """)
def p_stmt_6(self, p):
    """stmt : IF LRB exp RRB stmt elseiflist ELSE stmt"""
    print("""stmt : IF LRB exp RRB stmt elseiflist else stmt """)
But it didn't make the conflict go away. Below is the state where the shift/reduce conflict happens.
state 130
(37) stmt -> IF LRB exp RRB stmt . elseiflist
(38) stmt -> IF LRB exp RRB stmt . elseiflist ELSE stmt
(41) elseiflist -> . ELIF LRB exp RRB stmt
(42) elseiflist -> . elseiflist ELIF LRB exp RRB stmt
(43) elseiflist -> .
! shift/reduce conflict for ELIF resolved as shift
ELIF shift and go to state 134
RCB reduce using rule 43 (elseiflist -> .)
RETURN reduce using rule 43 (elseiflist -> .)
WHILE reduce using rule 43 (elseiflist -> .)
FOR reduce using rule 43 (elseiflist -> .)
IF reduce using rule 43 (elseiflist -> .)
PRINT reduce using rule 43 (elseiflist -> .)
ID reduce using rule 43 (elseiflist -> .)
LRB reduce using rule 43 (elseiflist -> .)
SUB reduce using rule 43 (elseiflist -> .)
NOT reduce using rule 43 (elseiflist -> .)
LCB reduce using rule 43 (elseiflist -> .)
INTEGERNUMBER reduce using rule 43 (elseiflist -> .)
FLOATNUMBER reduce using rule 43 (elseiflist -> .)
TRUE reduce using rule 43 (elseiflist -> .)
FALSE reduce using rule 43 (elseiflist -> .)
ELSE reduce using rule 43 (elseiflist -> .)
! ELIF [ reduce using rule 43 (elseiflist -> .) ]
elseiflist shift and go to state 133
Finally there are two more states with shift/reduce errors which I list below:
state 45
(27) block -> LCB . varlist stmtlist RCB
(24) varlist -> . vardec
(25) varlist -> . varlist vardec
(26) varlist -> .
(15) vardec -> . type idlist SEMICOLON
(7) type -> . INTEGER
(8) type -> . FLOAT
(9) type -> . BOOLEAN
! shift/reduce conflict for INTEGER resolved as shift
! shift/reduce conflict for FLOAT resolved as shift
! shift/reduce conflict for BOOLEAN resolved as shift
RETURN reduce using rule 26 (varlist -> .)
WHILE reduce using rule 26 (varlist -> .)
FOR reduce using rule 26 (varlist -> .)
IF reduce using rule 26 (varlist -> .)
PRINT reduce using rule 26 (varlist -> .)
ID reduce using rule 26 (varlist -> .)
LRB reduce using rule 26 (varlist -> .)
SUB reduce using rule 26 (varlist -> .)
NOT reduce using rule 26 (varlist -> .)
LCB reduce using rule 26 (varlist -> .)
INTEGERNUMBER reduce using rule 26 (varlist -> .)
FLOATNUMBER reduce using rule 26 (varlist -> .)
TRUE reduce using rule 26 (varlist -> .)
FALSE reduce using rule 26 (varlist -> .)
RCB reduce using rule 26 (varlist -> .)
INTEGER shift and go to state 8
FLOAT shift and go to state 9
BOOLEAN shift and go to state 10
! INTEGER [ reduce using rule 26 (varlist -> .) ]
! FLOAT [ reduce using rule 26 (varlist -> .) ]
! BOOLEAN [ reduce using rule 26 (varlist -> .) ]
varlist shift and go to state 70
vardec shift and go to state 71
type shift and go to state 72
And:
state 0
(0) S' -> . program
(1) program -> . declist MAIN LRB RRB block
(2) declist -> . dec
(3) declist -> . declist dec
(4) declist -> .
(5) dec -> . vardec
(6) dec -> . funcdec
(15) vardec -> . type idlist SEMICOLON
(16) funcdec -> . type ID LRB paramdecs RRB block
(17) funcdec -> . VOID ID LRB paramdecs RRB block
(7) type -> . INTEGER
(8) type -> . FLOAT
(9) type -> . BOOLEAN
! shift/reduce conflict for VOID resolved as shift
! shift/reduce conflict for INTEGER resolved as shift
! shift/reduce conflict for FLOAT resolved as shift
! shift/reduce conflict for BOOLEAN resolved as shift
MAIN reduce using rule 4 (declist -> .)
VOID shift and go to state 7
INTEGER shift and go to state 8
FLOAT shift and go to state 9
BOOLEAN shift and go to state 10
! VOID [ reduce using rule 4 (declist -> .) ]
! INTEGER [ reduce using rule 4 (declist -> .) ]
! FLOAT [ reduce using rule 4 (declist -> .) ]
! BOOLEAN [ reduce using rule 4 (declist -> .) ]
program shift and go to state 1
declist shift and go to state 2
dec shift and go to state 3
vardec shift and go to state 4
funcdec shift and go to state 5
type shift and go to state 6
Thank you so much in advance.
There are actually two somewhat related problems here, both having to do with ambiguity induced by duplicate base cases in recursive productions:
1. Ambiguity in stmtlist
First, as you imply, there is a problem with stmtlist. Your grammar for stmtlist is:
stmtlist → stmt | stmtlist stmt | 𝜖
which has two base cases: stmtlist → stmt and stmtlist → 𝜖. This duplication means that a single stmt can be parsed in two ways:
stmtlist → stmt
stmtlist → stmtlist stmt → 𝜖 stmt
Grammatical ambiguities always manifest as conflicts. To eliminate the conflict, eliminate the ambiguity. If you want stmtlist to be possibly empty, use:
stmtlist → stmtlist stmt | 𝜖
If you want to insist that stmtlist contains at least one stmt, use:
stmtlist → stmtlist stmt | stmt
Above all, try to understand the logic of the above suggestion.
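To make the duplication concrete, here is a small Python sketch that counts how many distinct parse trees derive a list of n stmt tokens from stmtlist, depending on which base cases the grammar keeps. The function name count_parses and its flags are hypothetical and exist only for this illustration; it is not something PLY provides.

```python
def count_parses(n, allow_single, allow_empty):
    """Count parse trees that derive exactly n stmt tokens from stmtlist.

    The recursive production stmtlist -> stmtlist stmt is always present;
    the two flags select which base cases the grammar includes.
    """
    total = 0
    if allow_empty and n == 0:
        total += 1  # stmtlist -> (empty)
    if allow_single and n == 1:
        total += 1  # stmtlist -> stmt
    if n >= 1:
        # stmtlist -> stmtlist stmt: the last stmt is fixed, the prefix has n - 1 stmts
        total += count_parses(n - 1, allow_single, allow_empty)
    return total

# Both base cases, as in the question: every non-empty list has two parse trees.
print(count_parses(1, True, True), count_parses(3, True, True))
# Only the empty base case, or only the single-stmt base case: exactly one tree.
print(count_parses(3, False, True), count_parses(3, True, False))
```

With both base cases every non-empty list has two derivations, which is the ambiguity the parser generator reports as a conflict. With either single base case the count drops to one.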
In addition, you allow stmt to be empty. It should be obvious that this is going to lead to an ambiguity in stmtlist because it is impossible to know how many empty stmts there are in a list. It could be 3; it could be 42; it could be eight million. Empty is invisible.
The potential nothingness of stmt also creates an ambiguity with those compound statements which end with stmt, such as "while" '(' exp ')' stmt. If stmt could be nothing, then
while (x) while(y) c;
could be two statements: while(x) with an empty repeated statement, and then while(y) with a loop on c;. Or it could have the (probably expected) meaning of a while(x) loop whose repeated statement is a nested while(y) c;. I would suggest that no-one would expect the first interpretation and that the grammar should not allow it. If you wanted an empty while target, you would use ; as the repeated statement, not nothing.
I'm sure you didn't intend that a stmt can be nothing. It makes lots of sense to allow the empty statement written as ; (that is, an emptiness followed by a semicolon), but that's obviously a different syntax. (Inside {…} you might want to allow nothing, rather than insisting on a semicolon. To achieve that, you need an empty stmtlist, not an empty stmt.)
2. Dangling else: actually an ambiguity in elseiflist
I think this is the grammar you are using:
(37) stmt -> "if" '(' exp ')' stmt elseiflist %prec IFP
(38) stmt -> "if" '(' exp ')' stmt elseiflist "else" stmt
(41) elseiflist -> "elif" '(' exp ')' stmt
(42) elseiflist -> elseiflist "elif" '(' exp ')' stmt
(43) elseiflist ->
Just as with the stmtlist production, elseiflist is a recursive production with two base cases, one of which is redundant. Again, it is necessary to decide whether or not elseiflist can really be empty (Hint: it can be), and then to remove one or the other of the base cases to avoid an ambiguous parse.
Having said that, I don't think that's the best way of writing the grammar for an if statement; the parse tree it builds might not be quite as you expect. But I guess it will work.

Different YouTube URLs point to the same video

I found what looks like a bug in YouTube while reverse engineering its video ID generator. If I change the last character of the video ID, the URL still leads to the same video. How is this possible?
Example:
https://www.youtube.com/watch?v=9bZkp7q19f0
https://www.youtube.com/watch?v=9bZkp7q19f1
https://www.youtube.com/watch?v=9bZkp7q19f2
https://www.youtube.com/watch?v=9bZkp7q19f3
But this URL doesn't work:
https://www.youtube.com/watch?v=9bZkp7q19f4
The videoId is 8 Bytes (64 bit) base64 encoded. From this post :
For the videoId, it is an 8-byte (64-bit) integer. Applying
Base64-encoding to 8 bytes of data requires 11 characters. However,
since each Base64 character conveys exactly 6 bits, this allocation
could actually hold up to 11 × 6 = 66 bits--a surplus of 2 bits over
what our payload needs. The excess bits are set to zero, which has the
effect of excluding certain characters from ever appearing in the last
position of the encoded string. In particular, the videoId will always
end with one of the following: { A, E, I, M, Q, U, Y, c, g, k, o, s,
w, 0, 4, 8 }
In your case, your videoId is 9bZkp7q19f0 :
enc. | 9 b Z k p 7 q 1 9 f | 0
value | 61 27 25 36 41 59 42 53 61 31 | 52
bin. | 111101 011011 011001 100100 101001 111011 101010 110101 111101 011111 | 1101 00
If you modify the last character, the 64-bit id changes only if the 4 most significant bits (MSB) of that character are modified:
9bZkp7q19f1 :
enc. | 9 b Z k p 7 q 1 9 f | 1
value | 61 27 25 36 41 59 42 53 61 31 | 53
bin. | 111101 011011 011001 100100 101001 111011 101010 110101 111101 011111 | 1101 01
9bZkp7q19f2 :
enc. | 9 b Z k p 7 q 1 9 f | 2
value | 61 27 25 36 41 59 42 53 61 31 | 54
bin. | 111101 011011 011001 100100 101001 111011 101010 110101 111101 011111 | 1101 10
9bZkp7q19f3 :
enc. | 9 b Z k p 7 q 1 9 f | 3
value | 61 27 25 36 41 59 42 53 61 31 | 55
bin. | 111101 011011 011001 100100 101001 111011 101010 110101 111101 011111 | 1101 11
This will give a different video id (note that the 4 MSB of the last character change from 1101 to 1110):
enc. | 9 b Z k p 7 q 1 9 f | 4
value | 61 27 25 36 41 59 42 53 61 31 | 56
bin. | 111101 011011 011001 100100 101001 111011 101010 110101 111101 011111 | 1110 00
9bZkp7q19f4 will give a different 64-bit id. Note that if such an id exists, 9bZkp7q19f4, 9bZkp7q19f5, 9bZkp7q19f6 and 9bZkp7q19f7 will all give that same id.
You can check the base64 encoding/values here
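The bit arithmetic in the tables above can be reproduced with a short Python sketch. It assumes the URL-safe Base64 alphabet (A-Z, a-z, 0-9, -, _) and that the two surplus low bits are simply discarded, as this answer describes; the function name video_id_to_int is hypothetical, and nothing here is a documented YouTube API.

```python
import string

# URL-safe Base64 alphabet: index in this string is the 6-bit value of the character.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"

def video_id_to_int(video_id):
    """Decode an 11-character id into a 64-bit integer, dropping the 2 surplus bits."""
    if len(video_id) != 11:
        raise ValueError("expected an 11-character video id")
    bits = 0
    for ch in video_id:
        bits = (bits << 6) | ALPHABET.index(ch)  # append 6 bits per character
    return bits >> 2  # 11 * 6 = 66 bits; the lowest 2 carry no payload

base = video_id_to_int("9bZkp7q19f0")
for suffix in "01234567":
    candidate = "9bZkp7q19f" + suffix
    print(candidate, video_id_to_int(candidate) == base)
```

Under these assumptions the ids ending in 0 to 3 decode to one integer and those ending in 4 to 7 decode to another, matching the behaviour observed in the question.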

Why does 'no viable alternative at input' occur?

I wrote the following combined grammar:
grammar KeywordGrammar;
options{
TokenLabelType = MyToken;
}
//start rule
start: sequence+ EOF;
sequence: keyword filter?;
filter: simpleFilter | logicalFilter | rangeFilter;
logicalFilter: andFilter | orFilter | notFilter;
simpleFilter: lessFilter | greatFilter | equalFilter | containsFilter;
andFilter: simpleFilter AND? simpleFilter;
orFilter: simpleFilter OR simpleFilter;
lessFilter: LESS (DIGIT | FLOAT|DATE);
notFilter: NOT IN? (STRING|ID);
greatFilter: GREATER (DIGIT|FLOAT|DATE);
equalFilter: EQUAL (DIGIT|FLOAT|DATE);
containsFilter: EQUAL (STRING|ID);
rangeFilter: RANGE? DATE DATE? | RANGE? FLOAT FLOAT?;
keyword: ID | STRING;
DATE: DIGIT DIGIT? SEPARATOR MONTH SEPARATOR DIGIT DIGIT (DIGIT DIGIT)?;
MONTH: JAN
| FEV
| MAR
| APR
| MAY
| JUN
| JUL
| AUG
| SEP
| OCT
| NOV
| DEC
;
JAN : 'janeiro'|'jan'|'01'|'1';
FEV : 'fevereiro'|'fev'|'02'|'2';
MAR : 'março'|'mar'|'03'|'3';
APR : 'abril' |'abril'|'04'|'4';
MAY : 'maio'| 'mai'| '05'|'5';
JUN : 'junho'|'jun'|'06'|'6';
JUL : 'julho'|'jul'|'07'|'7';
AUG : 'agosto'|'ago'|'08'|'8';
SEP : 'setembro'|'set'|'09'|'9';
OCT : 'outubro'|'out'|'10';
NOV : 'novembro'|'nov'|'11';
DEC : 'dezembro'|'dez'|'12';
SEPARATOR: '/'|'-';
AND: ('e'|'E');
OR: ('O'|'o')('U'|'u');
NOT: ('N'|'n')('Ã'|'ã')('O'|'o');
IN: ('E'|'e')('M'|'m');
GREATER: '>' | ('m'|'M')('a'|'A')('i'|'I')('o'|'O')('r'|'R') ;
LESS: '<' | ('m'|'M')('e'|'E')('n'|'N')('o'|'O')('r'|'R');
EQUAL: '=' | ('i'|'I')('g'|'G')('u'|'U')('a'|'A')('l'|'L');
RANGE: ('e'|'E')('n'|'N')('t'|'T')('r'|'R')('e'|'E');
FLOAT: DIGIT+ | DIGIT+ POINT DIGIT+;
ID: (LETTER|DIGIT+ SYMBOL) (LETTER|SYMBOL|DIGIT)*;
STRING: '"' ( ESC_SEQ | ~('\\'|'"') )* '"';
DIGIT: [0-9];
WS: (' '
| '\t'
| '\r'
| '\n') -> skip
;
POINT: '.' | ',';
fragment
LETTER: 'A'..'Z'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
;
fragment
SYMBOL: '-' | '_';
fragment
HEX_DIGIT: ('0'..'9'|'a'..'f'|'A'..'F');
fragment
ESC_SEQ: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT;
But a 'no viable alternative at input' error occurs only when parsing sentences of the form keyword OPERATOR DIGIT; for example:
filter = 2
filter < 2
filter > 2
With zero as the value, it works!
Where is the error?
Thanks for your help,
Yenier
You have a lot of ambiguity in your lexer rules. What specifically breaks your case is that the digits 1-9 can be matched by both DIGIT and MONTH (through JAN, FEV, etc.). Digit 0 is immune to this problem. Use grun with -tokens to diagnose problems of the sort you encountered:
$ grun KeywordGrammar start -tokens
filter = 0
[#0,0:5='filter',<24>,1:0]
[#1,7:7='=',<21>,1:7]
[#2,9:9='0',<23>,1:9]
[#3,11:10='<EOF>',<-1>,2:0]
$ grun KeywordGrammar start -tokens
filter = 2
[#0,0:5='filter',<24>,1:0]
[#1,7:7='=',<21>,1:7]
[#2,9:9='2',<1>,1:9]
[#3,11:10='<EOF>',<-1>,2:0]
line 1:9 no viable alternative at input '=2'
As you can see, 0 in the first case has token type <23>, while in the second case 2 has token type <1>. Look at your generated KeywordGrammar.tokens:
MONTH=1
JAN=2
...
FLOAT=23
...
So it is not a DIGIT or FLOAT: it is a MONTH. As a result, your filter rule does not match. And yes, the order of rules matters: when several lexer rules match the same input, ANTLR picks the one defined first.
Remove the ambiguity from the lexer by turning months and similar tokens into parser rules. There are plenty of other trouble spots; for example, your FLOAT rule makes it impossible for DIGIT to appear standalone, yet you still refer to DIGIT alongside FLOAT in the parser rules. If DIGIT has no significance at the grammar level, make it a fragment and use only FLOAT in parser rules.
And make it a habit to use grun and/or ANTLR plugins for IDE to make sure you know what your lexers and parsers actually see.
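The tie-breaking behaviour described above (longest match wins, and among equally long matches the rule defined first wins) can be imitated with a toy Python function. This is only a sketch of the idea, not ANTLR's implementation: first_token, the abbreviated MONTH pattern and the rule lists are all made up for the example, and the alternatives inside MONTH list longer spellings first because Python's regex alternation is ordered.

```python
import re

def first_token(rules, text):
    """Return (rule_name, lexeme) for the start of text.

    The longest match wins; among equally long matches the earlier rule wins,
    because a later rule must be strictly longer to replace the current best.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and m.end() > 0 and (best is None or m.end() > len(best[1])):
            best = (name, m.group(0))
    return best

# Abbreviated, hypothetical versions of the lexer rules from the question.
MONTH = r"janeiro|jan|01|1|fevereiro|fev|02|2"
FLOAT = r"[0-9]+(?:[.,][0-9]+)?"

month_first = [("MONTH", MONTH), ("FLOAT", FLOAT)]
float_first = [("FLOAT", FLOAT), ("MONTH", MONTH)]

for text in ("2", "0", "25"):
    print(text, first_token(month_first, text), first_token(float_first, text))
```

In this toy model the input 2 is classified by whichever rule comes first, 0 is always a FLOAT because no month alternative matches it, and 25 is a FLOAT in both orders because the longer match beats rule order.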
Testing here, I saw that the problem disappears when the FLOAT token definition is placed before the DATE definition.
...
FLOAT: DIGIT+ (POINT DIGIT+)?;
DATE: DIGIT DIGIT? SEPARATOR MONTH SEPARATOR DIGIT DIGIT (DIGIT DIGIT)?;
...
I do not know why. Does the order matter?

Why doesn't ANTLR "over-reduce" this expression?

I have the following grammar:
expr : factor op ;
op
: '+' factor op
| // Blank rule for left-recursion elimination
;
factor
: NUM
| '(' expr ')'
;
NUM : ('0'..'9')+ ;
I supply 2 + 3, using expr as the start rule. The resulting parse tree from ANTLR is correct; however, I think I am misunderstanding the shift-reduce methods it uses.
I would expect the following to happen:
Step # | Parse Stack   | Lookahead | Unscanned | Action
1      |               | NUM       | + 3       | Shift
2      | NUM           | +         | 3         | Reduce by factor -> NUM
3      | factor        | +         | 3         | Shift a 'null'?
4      | factor null   | +         | 3         | Reduce by op -> null
5      | factor op     | +         | 3         | Reduce by expr -> factor op
6      | expr          | +         | 3         | Shift
7      | expr +        | NUM       |           | Shift
8      | expr + NUM    |           |           | Reduce by factor -> NUM
9      | expr + factor |           |           | ERROR (no production)
I would've expected an error to occur at step 3, wherein the parser would shift a null onto the stack as a prerequisite to reducing the factor "up" to an expr.
Does ANTLR only shift a null when it's strictly "required" because the resulting reduce will satisfy the grammar?
It seems to me that ANTLR doesn't use a shift-reduce parser; the generated parsers are recursive descent using an arbitrary amount of lookahead.
The steps of the parser would be something like:
Rule          | Consumed  | Input
--------------+-----------+------
expr          |           | 2 + 3
..factor      |           | 2 + 3
....NUM       | 2         | + 3
..op          | 2         | + 3
....'+'       | 2 +       | 3
....factor    | 2 +       | 3
......NUM     | 2 + 3     |
....op        | 2 + 3     |
......(empty) | 2 + 3     |
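The trace above corresponds to a hand-written recursive descent parser along the following lines. This is a simplified Python sketch with a single token of lookahead, not the code ANTLR generates (ANTLR's prediction is more general); the Parser class, tokenize and parse are names invented for the illustration. The point to notice is that op picks its empty alternative only when the next token is not '+', so nothing is ever "shifted" for it.

```python
import re

def tokenize(text):
    """Split the input into NUM, '+', '(' and ')' tokens, skipping whitespace."""
    return re.findall(r"\d+|[+()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expr(self):
        # expr : factor op ;
        return ("expr", self.factor(), self.op())

    def op(self):
        # op : '+' factor op | /* empty */ ;
        if self.peek() == "+":
            self.pos += 1
            return ("op", "+", self.factor(), self.op())
        return ("op",)  # empty alternative: chosen by lookahead, consumes nothing

    def factor(self):
        # factor : NUM | '(' expr ')' ;
        tok = self.peek()
        if tok == "(":
            self.pos += 1
            inner = self.expr()
            if self.peek() != ")":
                raise SyntaxError("expected ')'")
            self.pos += 1
            return ("factor", inner)
        if tok is not None and tok.isdigit():
            self.pos += 1
            return ("factor", tok)
        raise SyntaxError("expected NUM or '(' but found %r" % (tok,))

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("unexpected token %r" % (parser.peek(),))
    return tree

print(parse("2 + 3"))
```

Each grammar rule becomes one method, and the call stack plays the role that the parse stack plays in a shift-reduce parser.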
From what I read about ANTLR, you could achieve the same result with the following changes to the original grammar:
expr: factor op*;
op: '+' factor;
...

Webcrawler - Duplicates and weird count

I've taken some code from "Expert F# 2.0" that shows how to build a web crawler using MailboxProcessor. As you can see, I have a print expression at line 23 that prints the current number of URLs in the visited Set. The number of URLs to crawl is also limited: the printed count stops at 49.
open System
open System.Net
open System.Text.RegularExpressions
open Microsoft.FSharp.Control.WebExtensions
let getLinks (txt:string) =
    [ for m in Regex.Matches(txt, "href=\s*\"[^\"h]*(http://[^&\"]*)\"") -> m.Groups.Item(1).Value ]
let collectLinks (url:string) =
    async { let web = new WebClient()
            let! data = web.AsyncDownloadString <| Uri url
            let links = getLinks data
            return links }
let urlCollector =
    MailboxProcessor.Start(fun self ->
        let rec waitForUrl (visited : Set<string>) =
            async { // Checks whether we have reached the limit of pages to crawl
                    if visited.Count < 50 then
                        // Waits for a URL...
                        let! url = self.Receive()
                        printfn "%A | %A" visited.Count url
                        // If not the URL already has been crawled...
                        if not (visited.Contains url) then
                            // Start
                            do! Async.StartChild(
                                    async { let! links = collectLinks url
                                            Seq.iter self.Post links}) |> Async.Ignore
                        return! waitForUrl (visited.Add url) }
        waitForUrl Set.empty)
urlCollector.Post "http://news.google.com/"
That seems all right, eh? But the output looks like this:
0 | "http://news.google.com/"
1 | "http://www.gstatic.com/news/img/favicon.ico"
2 | "http://mail.google.com/mail/?tab=nm"
3 | "http://www.google.com/intl/en/options/"
4 | "http://docs.google.com/?tab=no"
5 | "http://www.google.com/reader/?tab=ny"
6 | "http://sites.google.com/?tab=n3"
7 | "http://www.google.com/intl/en/options/"
7 | "http://www.google.com/preferences?hl=en"
8 | "http://www.guardian.co.uk/uk/2011/aug/07/tottenham-riots-police-had-not-anticipated-violence"
9 | "http://www.bloomberg.com/news/2011-08-07/london-rioters-clash-with-police-loot-in-tottenham-after-shooting-death.html"
10 | "http://www.hindustantimes.com/Rioters-battle-police-after-shooting-protest/Article1-730371.aspx"
11 | "http://www.telegraph.co.uk/news/uknews/crime/8687177/Tottenham-riot-live.html"
12 | "http://www.guardian.co.uk/uk/2011/aug/07/tottenham-riots-police-had-not-anticipated-violence"
12 | "http://www.montrealgazette.com/London+wakes+riot+aftermath/5218849/story.html"
13 | "http://themediablog.typepad.com/the-media-blog/2011/08/daily-mail-tottenham-violence-twitter.html"
14 | "http://en.wikipedia.org/wiki/2011_Tottenham_riots"
15 | "http://www.babnet.net/festivaldetail-37897.asp"
16 | "http://www.youtube.com/watch?v=l9UImSbegj4"
17 | "http://www.babnet.net/festivaldetail-37897.asp"
17 | "http://www.youtube.com/watch?v=l9UImSbegj4"
17 | "http://www.telegraph.co.uk/news/uknews/crime/8687177/Tottenham-riot-live.html"
17 | "http://www.telegraph.co.uk/news/uknews/crime/8687177/Tottenham-riot-live.html"
17 | "http://www.guardian.co.uk/uk/2011/aug/07/tottenham-riots-police-had-not-anticipated-violence"
17 | "http://www.guardian.co.uk/uk/2011/aug/07/tottenham-riots-police-had-not-anticipated-violence"
17 | "http://www.bbc.co.uk/news/uk-14436001"
18 | "http://www.bbc.co.uk/news/uk-14436001"
18 | "http://www.kbc.co.ke/news.asp?nid=71755"
19 | "http://www.kbc.co.ke/news.asp?nid=71755"
19 | "http://news.sky.com/skynews/Home/UK-News/Tottenham-Riots-Simmering-Anger-Erupts-In-North-London-After-Protest-At-Mans-Shooting-Death/Article/201108116045172?f=rss"
20 | "http://news.sky.com/skynews/Home/UK-News/Tottenham-Riots-Simmering-Anger-Erupts-In-North-London-After-Protest-At-Mans-Shooting-Death/Article/201108116045172?f=rss"
20 | "http://www.irishtimes.com/newspaper/breaking/2011/0807/breaking2.html?via=mr"
21 | "http://www.irishtimes.com/newspaper/breaking/2011/0807/breaking2.html?via=mr"
21 | "http://www.cbc.ca/news/world/story/2011/08/07/tottenham-riot.html"
22 | "http://www.cbc.ca/news/world/story/2011/08/07/tottenham-riot.html"
22 | "http://www.newsday.com/news/police-officer-hospitalized-7-injured-in-uk-riot-1.3079769"
23 | "http://www.newsday.com/news/police-officer-hospitalized-7-injured-in-uk-riot-1.3079769"
23 | "http://www.msnbc.msn.com/id/44049721/ns/world_news-europe/"
24 | "http://www.msnbc.msn.com/id/44049721/ns/world_news-europe/"
24 | "http://www.timeslive.co.za/world/2011/08/07/eight-london-police-hospitalised-after-riots"
25 | "http://www.timeslive.co.za/world/2011/08/07/eight-london-police-hospitalised-after-riots"
25 | "http://www.cnn.com/2011/WORLD/europe/08/07/uk.riots/"
26 | "http://www.cnn.com/2011/WORLD/europe/08/07/uk.riots/"
26 | "http://www.dailymail.co.uk/news/article-2023348/Tottenham-anarchy-Grim-echo-1985-Broadwater-farm-riot.html"
27 | "http://www.dailymail.co.uk/news/article-2023348/Tottenham-anarchy-Grim-echo-1985-Broadwater-farm-riot.html"
27 | "http://www.mirror.co.uk/news/top-stories/2011/08/06/tottenham-riot-protesters-torch-police-cars-shops-and-a-bus-115875-23325724/"
28 | "http://www.mirror.co.uk/news/top-stories/2011/08/06/tottenham-riot-protesters-torch-police-cars-shops-and-a-bus-115875-23325724/"
28 | "http://www.theglobeandmail.com/news/world/images-of-the-destruction-from-londons-tottenham-riots/article2122026/"
29 | "http://www.theglobeandmail.com/news/world/images-of-the-destruction-from-londons-tottenham-riots/article2122026/"
29 | "http://thelede.blogs.nytimes.com/2011/08/06/shops-and-cars-burn-in-anti-police-riot-in-london/"
30 | "http://thelede.blogs.nytimes.com/2011/08/06/shops-and-cars-burn-in-anti-police-riot-in-london/"
30 | "http://www.stuff.co.nz/world/5403614/Crowds-attack-police-after-UK-protest"
31 | "http://www.stuff.co.nz/world/5403614/Crowds-attack-police-after-UK-protest"
31 | "http://www.google.com/hostednews/afp/article/ALeqM5jOCV_DVSYR1S50v6vdSBjsR5H9Jw?docId=CNG.36dce69df0a155bfd2fa1a3a5f92f6e1.5c1"
32 | "http://www.google.com/hostednews/afp/article/ALeqM5jOCV_DVSYR1S50v6vdSBjsR5H9Jw?docId=CNG.36dce69df0a155bfd2fa1a3a5f92f6e1.5c1"
32 | "http://fallenscoop.com/16993/tottenham-riot-2011-north-london-burns-after-protest-of-mark-duggan"
33 | "http://fallenscoop.com/16993/tottenham-riot-2011-north-london-burns-after-protest-of-mark-duggan"
33 | "http://www.thedailybeast.com/cheats/2011/08/07/riots-grip-north-london.html"
34 | "http://www.thedailybeast.com/cheats/2011/08/07/riots-grip-north-london.html"
34 | "http://www.thehindu.com/news/article2333142.ece"
35 | "http://www.sfgate.com/cgi-bin/article.cgi?f=/g/a/2011/08/07/bloomberg1376-LPHCT11A1I4H01-3ULNPF643I4ERSIU09MO54CQ4B.DTL"
36 | "http://online.wsj.com/community/groups/question-day-229/topics/do-you-agree-sps-decision?commentid=2864110"
37 | "http://www.businessweek.com/ap/financialnews/D9OUMJVO1.htm"
38 | "http://www.cnn.com/2011/BUSINESS/08/06/global.economy.cnn/"
39 | "http://www.chicagotribune.com/news/opinion/editorials/ct-edit-credit-20110806,0,6468631.story"
40 | "http://www.foxbusiness.com/markets/2011/08/07/treasury-hits-back-against-sp-downgrade/"
41 | "http://en.wikipedia.org/wiki/Standard_%26_Poor%27s"
42 | "http://www.usatoday.com/money/companies/management/2011-08-07-verizon-strike_n.htm"
43 | "http://www.businessweek.com/ap/financialnews/D9OV028O3.htm"
44 | "http://www.nbcnewyork.com/news/local/Verizon-Workers-Demonstrate-in-Manhattan-Part-of-45K-Worker-Strike-127087478.html"
45 | "http://www.poughkeepsiejournal.com/article/20110807/NEWS03/110807003/45K-Verizon-workers-strike-over-new-labor-contract-?odyssey=tab%7Ctopnews%7Ctext%7CPoughkeepsieJournal.com"
46 | "http://www.nypost.com/p/news/national/verizon_hit_by_strike_Ga9JjKphZrKCEAr608bqkI"
47 | "http://www.nytimes.com/2011/08/07/us/07verizon.html"
48 | "http://www.ctv.ca/CTVNews/World/20110807/afghanistan-helicopter-crash-fighting-110807/"
49 | "http://abcnews.go.com/International/nato-crash-team-seal-members-killed-afghanistan/story?id=14249189"
What's up with all the duplicates? Also, why do some of them print the same "current URLs in visited Set" count (like 17, 33, 34, etc.)? I'm pretty sure that I'm missing something totally fundamental, but I can't figure out what.
In your snippet, the printing using printfn is done before you check if the URL is already present in the set. This means that it will print the URL even if it will not be added in the next step. (You can see that it wasn't added if you look at the numbers in the left column - if the count wasn't incremented, the number on the next line is the same).
Moving printfn to the body of the if expression should give the expected results:
// Waits for a URL...
let! url = self.Receive()
// If not the URL already has been crawled...
if not (visited.Contains url) then
    printfn "%A | %A" visited.Count url
    // Start
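The effect of moving the print inside the membership check can be seen in a small synchronous Python stand-in for the agent loop. It is only a sketch: crawl_log is a hypothetical name, the URLs are plain strings taken from a list rather than messages from a mailbox, and no downloading happens.

```python
def crawl_log(urls, limit=50):
    """Return (count, url) pairs, logging a URL only the first time it is seen."""
    visited = set()
    log = []
    for url in urls:
        if len(visited) >= limit:
            break  # same role as the visited.Count < 50 check
        if url not in visited:
            # Logging here, after the check, means duplicates never appear in the log.
            log.append((len(visited), url))
            visited.add(url)
    return log

incoming = ["a", "b", "a", "c", "b"]
for count, url in crawl_log(incoming):
    print(count, "|", url)
```

Each URL is logged once and the count increases by one per line, which is the output the question expected.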

Resources