Try the first rule if the second rule fails - parsing

I have defined two sets of identifiers IDENTIFIER_ONE and IDENTIFIER_TWO which are both exculsive subsets of IDENTIFIER. I would like to write a parser such that:
"i1(arg) EOS" can't be parsed (1)
"i2(arg) EOS" can be parsed (2)
"i1(arg) = value EOS" can be parsed (3)
"i2(arg) = value EOS" can be parsed (4)
where i1(resp., i2) belongs to IDENTIFIER_ONE (resp., IDENTIFIER_TWO); arg and value belong to IDENTIFIER. The following parser.mly has already realized all the points I am after, except (4):
identifier:
| IDENTIFIER_ONE { $1 }
| IDENTIFIER_TWO { $1 }
block_statement_EOS:
| identifier LPAREN identifier RPAREN EQUAL identifier EOS { BSE_Let ($1, $3, $6) }
| IDENTIFIER_TWO LPAREN identifier RPAREN EOS { BSE_I_I ($1, $3) }
Given i1(arg) = value EOS as input, as goal (3) it is correctly read as BSE_Let (i1, arg, value). However, given i2(arg) = value EOS as input, it stops the parsing after reading EQUAL. I guess it is because once the parse meets i2(arg), it goes to the 2nd rule of block_statement_EOS, and later EQUAL can't be parsed.
Ideally, I would hope the parser could try the 1st rule of block_statement_EOS if the 2nd rule fails. Could anyone help me to make this possible?
PS: If I write the parser.mly as follows, all the goals can be achieved. Does anyone know why? Additionally I really don't like this workaround, because I do need to write identifier instead of two subsets in many other rules, I would hope a more elegant solution...
block_statement_EOS:
| IDENTIFIER_ONE LPAREN identifier RPAREN EQUAL identifier EOS { BSE_Let ($1, $3, $6) }
| IDENTIFIER_TWO LPAREN identifier RPAREN EQUAL identifier EOS { BSE_Let ($1, $3, $6) }
| IDENTIFIER_TWO LPAREN identifier RPAREN EOS { BSE_I_I ($1, $3) }

When your parser encounters an LPAREN after an IDENTIFIER_TWO, it has to decide whether to shift or to reduce:
shift: put LPAREN on the stack;
reduce: replace IDENTIFIER_TWO, which is on top of the stack, by identifier.
If your parser chooses to shift, it will never reduce this particular IDENTIFIER_TWO into identifier (because this particular IDENTIFIER_TWO will never be on top of the stack again), meaning that it will always reduce the second rule of block_statement_EOS.
If your parser chooses to reduce, it will never reduce the second rule of block_statement_EOS, as this rule starts with IDENTIFIER_TWO and not identifier.
This is why your second version works, because there is no need to choose between shifting and reducing after IDENTIFIER_TWO. The choice is made later, if you wish.

Related

How to resolve this grammar ambiguity?

I am able to parse the following valid SQL expression:
(select 1 limit 1) union all select 1 union all (select 2);
However, it seems there is an ambiguity in it, which I've been have a lot of trouble resolving. Here is a working version of the program (I've obviously cut down the statements just to create a minimal producible question) --
parser grammar DBParser;
options { tokenVocab = DBLexer;}
root
: selectStatement SEMI? EOF
;
selectStatement
: (selectStatementWithParens|selectClause|setOperation)
limitClause?
;
selectClause
: SELECT NUMBER
;
limitClause
: LIMIT NUMBER
;
selectStatementWithParens
: OPEN_PAREN selectStatement CLOSE_PAREN
;
setOperation:
(selectClause | selectStatementWithParens)
(setOperand (selectClause | selectStatementWithParens))*
;
setOperand
: UNION ALL?
;
lexer grammar DBLexer;
options { caseInsensitive=true; }
SELECT : 'SELECT'; // SELECT *...
LIMIT : 'LIMIT'; // ORDER BY x LIMIT 20
ALL : 'ALL'; // SELECT ALL vs. SELECT DISTINCT; WHERE ALL (...); UNION ALL...
UNION : 'UNION'; // Set operation
SEMI : ';'; // Statement terminator
OPEN_PAREN : '('; // Function calls, object declarations
CLOSE_PAREN : ')';
NUMBER
: [0-9]+
;
WHITESPACE
: [ \t\r\n] -> skip
;
Where is the ambiguity coming from, and what could be a possible way to solve this?
Update: I'm not sure if this is the solution, but it seems the following helped eliminate that ambiguity:
selectStatement:
withClause?
(selectStatementWithParens|selectClause)
(setOperand (selectClause|selectStatementWithParens))*
orderClause?
(limitClause offsetClause?)?
;
In other words -- making it such that the setOperand doesn't re-start with the select.
your setOpertaion rule can match single selectClause or selectStatementWithParens (because you use the * cardinality for the second half of the rule, so 0 instances of the second half still matches the rule). This means that a selectClause can match the selectClause rule in selectStatement, or it could be used to construct a setOperation (which is the other alternative in your ambiguity).
If you change setOperation to use + cardinality for the second half of the rule, you resolve the ambiguity.
setOperation
: (selectClause | selectStatementWithParens)
(setOperand (selectClause | selectStatementWithParens))+
;
This also seems logical, that you'd only want to consider something a setOperation if there's a setOperand involved.
That explains and corrects the ambiguity, but still leaves you with a "max k" of 7.

Error: popping nterm

I'm trying to understand the diagnostic messages given by Flex:
Entering state 5
Return for a new token:
Reading a token: Next token is token END_OF_FILE (4.0: )
Shifting token END_OF_FILE (4.0: )
Entering state 43
Reducing stack by rule 143 (line 331):
$1 = nterm syntax (0.0-17: )
$2 = nterm top_levels (0.18-4.0: )
$3 = token END_OF_FILE (4.0: )
-> $$ = nterm s (0.0-4.0: )
Stack now 0
Entering state 3
Return for a new token:
Reading a token: Next token is token END_OF_FILE (4.0: )
4/0: syntax error
Error: popping nterm s (0.0-4.0: )
Stack now 0
Cleanup: discarding lookahead token END_OF_FILE (4.0: )
Stack now 0
I cannot understand why / what is it trying to do with EOF token. Below are the Flex rules:
<<EOF>> { return END_OF_FILE; }
And the Bison rules:
top_level : message
| enum
| service
| import { $$ = Py_None; }
| package { $$ = Py_None; }
| option_def { $$ = Py_None; }
| ';' { $$ = Py_None; } ;
top_levels : %empty { $$ = py_list(Py_None); }
| top_levels top_level { $$ = py_append($1, $2); } ;
s : syntax top_levels END_OF_FILE { $$ = $2; } ;
And the output file generated by Bison:
State 3
0 $accept: s . $end
$end shift, and go to state 6
State 5
142 top_levels: top_levels . top_level
143 s: syntax top_levels . END_OF_FILE
BOOL shift, and go to state 9
... bunch of similar rules
END_OF_FILE shift, and go to state 43
';' shift, and go to state 44
import go to state 45
... bunch of similar rules
top_level go to state 55
State 6
0 $accept: s $end .
$default accept
I have no idea what's going on. Why does it report reading EOF token twice? What was exactly the problem with popping s? To me it seems like it actually accepted the whole thing, and then decided to reject it because it red the token second time... but the whole reporting is very confusing.
1. The problem
Don't do this:
<<EOF>> { return END_OF_FILE; }
Yacc/bison parsers augment grammars with an internal rule which produces the start symbol followed by an internal eof token called $end, whose token number is 0. (You can see this rule in states 3 and 6.) That is the only accepting rule in the grammar.
By default, (f)lex scanners return 0 when EOF is detected. So that all Just Works.
When you try to send a different token on EOF, you are attempting to defeat this mechanism, but it won't work because the start symbol is not an accepting rule. After the start symbil is reduced, the parser tries to reduce the $accept rule, so it asks the scanner for another token. But the scanner has already hit EOF. In most cases, the scanner will execute the <<EOF>> action again (although this is not guaranteed), but that's not going to produce the $end token it needs. So you get a syntax error.
2. The underlying problem (maybe)
Normally, people try this in order to create a user action which runs when the input is accepted, typically in order to return the result of the parse to yyparse's caller through an "out" parameter. Trying to explicitly recognize an EOF token (or even the $end token) in the start production cannot work, but there is a much simpler solution: an extra unit rule:
%start return
%%
return: s { *out = $1; }
s: syntax top_levels { $$ = $2; }
Note that you could also do this without top_levels:
%start return
%%
return: { *out = $1; }
s: syntax { $$ = py_list(Py_None); }
| s top_level { $$ = py_append($1, $2); }
An alternative is to use the special YYACCEPT action macro in the action for the start rule. However, I believe the standard solution outlined above is simpler because it doesn't require anything from the scanner.
3. The trace output
Error: popping nterm s (0.0-4.0: )
Means:
A syntax error was detected.
As part of error recovery, the parser popped the non-terminal s from the stack.
That non-terminal's source location extends from 0.0 to 4.0 (line . column)
If s (or its semantic type) had had a registered destructor, that would have run at step 2. You probably will want to register destructor for syntactic types which reference Python values in order to decrement their reference counts so that you don't leak memory on syntax errors. But perhaps I'm wrong about that.
Also, you could register a %printer for the syntactic value, in which case it would have been printed after the colon.

Debug parser by printing useful information

I would like to parse a set of expressions, for instance:X[3], X[-3], XY[-2], X[4]Y[2], etc.
In my parser.mly, index (which is inside []) is defined as follows:
index:
| INTEGER { $1 }
| MINUS INTEGER { 0 - $2 }
The token INTEGER, MINUS etc. are defined in lexer as normal.
I try to parse an example, it fails. However, if I comment | MINUS INTEGER { 0 - $2 }, it works well. So the problem is certainly related to that. To debug, I want to get more information, in other words I want to know what is considered to be MINUS INTEGER. I tried to add print:
index:
| INTEGER { $1 }
| MINUS INTEGER { Printf.printf "%n" $2; 0 - $2 }
But nothing is printed while parsing.
Could anyone tell me how to print information or debug that?
I tried coming up with an example of what you describe and was able to get output of 8 with what I show below. [This example is completely stripped down so that it only works for [1] and [- 1 ], but I believe it's equivalent logically to what you said you did.]
However, I also notice that your example's debug string in your example does not have an explicit flush with %! at the end, so that the debugging output might not be flushed to the terminal until later than you expect.
Here's what I used:
Test.mll:
{
open Ytest
open Lexing
}
rule test =
parse
"-" { MINUS }
| "1" { ONE 1 }
| "[" { LB }
| "]" { RB }
| [ ' ' '\t' '\r' '\n' ] { test lexbuf }
| eof { EOFTOKEN }
Ytest.mly:
%{
%}
%token <int> ONE
%token MINUS LB RB EOFTOKEN
%start item
%type <int> index item
%%
index:
ONE { 2 }
| MINUS ONE { Printf.printf "%n" 8; $2 }
item : LB index RB EOFTOKEN { $2 }
Parse.ml
open Test;;
open Ytest;;
open Lexing;;
let lexbuf = Lexing.from_channel stdin in
ignore (Ytest.item Test.test lexbuf)

how to setup flex/bison rules for parsing a comma-delimited argument list

I would like to be able to parse a non-empty, one-or-many element, comma-delimited (and optionally parenthesized) list using flex/bison parse rules.
some e.g. of parseable lists :
1
1,2
(1,2)
(3)
3,4,5
(3,4,5,6)
etc.
I am using the following rules to parse the list (final result is parse element 'top level list'), but they do not seem to give the desired result when parsing (I get a syntax-error when supplying a valid list). Any suggestion on how I might set this up ?
cList : ELEMENT
{
...
}
| cList COMMA ELEMENT
{
...
}
;
topLevelList : LPAREN cList RPAREN
{
...
}
| cList
{
...
}
;
This sounds simple. Tell me if i missed something or if my example doesnt work
RvalCommaList:
RvalCommaListLoop
| '(' RvalCommaListLoop ')'
RvalCommaListLoop:
Rval
| RvalCommaListLoop ',' Rval
Rval: INT_LITERAL | WHATEVER
However if you accept rvals as well as this list you'll have a conflict confusing a regular rval with a single item list. In this case you can use the below which will either require the '('')' around them or require 2 items before it is a list
RvalCommaList2:
Rval ',' RvalCommaListLoop
| '(' RvalCommaListLoop ')'
I too want to know how to do this, thinking about it briefly, one way to achieve this would be to use a linked list of the form,
struct list;
struct list {
void *item;
struct list *next;
};
struct list *make_list(void *item, struct list *next);
and using the rule:
{ $$ = make_list( $1, $2); }
This solution is very similar in design to:
Using bison to parse list of elements
The hard bit is to figure out how to handle lists in the scheme of a (I presume) binary AST.
%start input
%%
input:
%empty
| integer_list
;
integer_list
: integer_loop
| '(' integer_loop ')'
;
integer_loop
: INTEGER
| integer_loop COMMA INTEGER
;
%%

Representing optional syntax and repetition with OcamlYacc / FsYacc

I'm trying to build up some skills in lexing/parsing grammars. I'm looking back on a simple parser I wrote for SQL, and I'm not altogether happy with it -- it seems like there should have been an easier way to write the parser.
SQL tripped me up because it has a lot of optional tokens and repetition. For example:
SELECT *
FROM t1
INNER JOIN t2
INNER JOIN t3
WHERE t1.ID = t2.ID and t1.ID = t3.ID
Is equivalent to:
SELECT *
FROM t1
INNER JOIN t2 ON t1.ID = t2.ID
INNER JOIN t3 on t1.ID = t3.ID
The ON and WHERE clauses are optional and can occur more than once. I handled these in my parser as follows:
%{
open AST
%}
// ...
%token <string> ID
%token AND OR COMMA
%token EQ LT LE GT GE
%token JOIN INNER LEFT RIGHT ON
// ...
%%
op: EQ { Eq } | LT { Lt } | LE { Le } | GT { Gt } | GE { Ge }
// WHERE clause is optional
whereClause:
| { None }
| WHERE whereList { Some($2) }
whereList:
| ID op ID { Cond($1, $2, $3) }
| ID op ID AND whereList { And(Cond($1, $2, $3), $5) }
| ID op ID OR whereList { Or(Cond($1, $2, $3), $5) }
// Join clause is optional
joinList:
| { [] }
| joinClause { [$1] }
| joinClause joinList { $1 :: $2 }
joinClause:
| INNER JOIN ID joinOnClause { $3, Inner, $4 }
| LEFT JOIN ID joinOnClause { $3, Left, $4 }
| RIGHT JOIN ID joinOnClause { $3, Right, $4 }
// "On" clause is optional
joinOnClause:
| { None }
| ON whereList { Some($2) }
// ...
%%
In other words, I handled optional syntax by breaking it into separate rules, and handled repetition using recursion. This works, but it breaks parsing into a bunch of little subroutines, and it's very hard to see what the grammar actually represents.
I think it would be much easier to write if I could specify optional syntax inside brackets and repetition with an * or +. This would reduce my whereClause and joinList rules to the following:
whereClause:
| { None }
// $1 $2, I envision $2 being return as an (ID * op * ID * cond) list
| WHERE [ ID op ID { (AND | OR) }]+
{ Some([for name1, op, name2, _ in $1 -> name1, op, name2]) }
joinClause:
| { None }
// $1, returned as (joinType
// * JOIN
// * ID
// * ((ID * op * ID) list) option) list
| [ (INNER | LEFT | RIGHT) JOIN ID [ON whereClause] ]*
{ let joinType, _, table, onClause = $1;
Some(table, joinType, onClause) }
I think this form is much easier to read and expresses the grammar it's trying to capture more intuitively. Unfortunately, I can't find anything in either the Ocaml or F# documentation which supports this notation or anything similar.
Is there an easy way to represent grammars with optional or repetitive tokens in OcamlYacc or FsYacc?
When you compose all the little pieces you should get something like you want though. Instead of:
(INNER | RIGHT | LEFT )
you just have
inner_right_left
and define that to be the union of those three keywords.
You can also define the union in the lexer. in the way you define the tokens, or using camlp4, I haven't done much in that area, so I cannot advise you to take those routes. And I don't think they'll work for you as well as just having little pieces everywhere.
EDIT:
So, for camlp4 you can look at Camlp4 Grammar module and a tutorial and a better tutorial. This isn't exactly what you want, mind you, but it's pretty close. The documentation is pretty bad, as expressed in the recent discussion on the ocaml groups, but for this specific area, I don't think you'll have too many problems. I did a little with it and can field more questions.
Menhir allows to parametrize nonterminal symbols by another symbols and provides the library of standard shortcuts, like optionals and lists, and you can create your own. Example:
option(X): x=X { Some x}
| { None }
There is also some syntax sugar, 'token?' is equivalent to 'option(token)', 'token+' to 'nonempty_list(token)'.
All of this really shortens grammar definition. Also it is supported by ocamlbuild and can be a drop-in replacement for ocamlyacc. Highly recommended!
Funny, I used it to parse SQL too :)

Resources