I'm trying to build up some skills in lexing/parsing grammars. I'm looking back on a simple parser I wrote for SQL, and I'm not altogether happy with it -- it seems like there should have been an easier way to write the parser.
SQL tripped me up because it has a lot of optional tokens and repetition. For example:
SELECT *
FROM t1
INNER JOIN t2
INNER JOIN t3
WHERE t1.ID = t2.ID and t1.ID = t3.ID
Is equivalent to:
SELECT *
FROM t1
INNER JOIN t2 ON t1.ID = t2.ID
INNER JOIN t3 on t1.ID = t3.ID
The ON and WHERE clauses are optional and can occur more than once. I handled these in my parser as follows:
%{
open AST
%}
// ...
%token <string> ID
%token AND OR COMMA
%token EQ LT LE GT GE
%token JOIN INNER LEFT RIGHT ON
// ...
%%
op: EQ { Eq } | LT { Lt } | LE { Le } | GT { Gt } | GE { Ge }
// WHERE clause is optional
whereClause:
| { None }
| WHERE whereList { Some($2) }
whereList:
| ID op ID { Cond($1, $2, $3) }
| ID op ID AND whereList { And(Cond($1, $2, $3), $5) }
| ID op ID OR whereList { Or(Cond($1, $2, $3), $5) }
// Join clause is optional
joinList:
| { [] }
| joinClause { [$1] }
| joinClause joinList { $1 :: $2 }
joinClause:
| INNER JOIN ID joinOnClause { $3, Inner, $4 }
| LEFT JOIN ID joinOnClause { $3, Left, $4 }
| RIGHT JOIN ID joinOnClause { $3, Right, $4 }
// "On" clause is optional
joinOnClause:
| { None }
| ON whereList { Some($2) }
// ...
%%
In other words, I handled optional syntax by breaking it into separate rules, and handled repetition using recursion. This works, but it breaks parsing into a bunch of little subroutines, and it's very hard to see what the grammar actually represents.
I think it would be much easier to write if I could specify optional syntax inside brackets and repetition with an * or +. This would reduce my whereClause and joinList rules to the following:
whereClause:
| { None }
// $1 $2, I envision $2 being returned as an (ID * op * ID * cond) list
| WHERE [ ID op ID { (AND | OR) }]+
{ Some([for name1, op, name2, _ in $1 -> name1, op, name2]) }
joinClause:
| { None }
// $1, returned as (joinType
// * JOIN
// * ID
// * ((ID * op * ID) list) option) list
| [ (INNER | LEFT | RIGHT) JOIN ID [ON whereClause] ]*
{ let joinType, _, table, onClause = $1;
Some(table, joinType, onClause) }
I think this form is much easier to read and expresses the grammar more intuitively. Unfortunately, I can't find anything in either the OCaml or F# documentation that supports this notation or anything similar.
Is there an easy way to represent grammars with optional or repetitive tokens in OcamlYacc or FsYacc?
When you compose all the little pieces, you should end up with something close to what you want, though. Instead of:
(INNER | RIGHT | LEFT )
you just have
inner_right_left
and define that to be the union of those three keywords.
You can also define the union in the lexer, in the way you define the tokens, or use camlp4. I haven't done much in that area, so I cannot advise you to take those routes, and I don't think they'll work for you as well as just having little pieces everywhere.
EDIT:
So, for camlp4 you can look at the Camlp4 Grammar module, a tutorial, and a better tutorial. This isn't exactly what you want, mind you, but it's pretty close. The documentation is pretty bad, as expressed in the recent discussion on the OCaml groups, but for this specific area I don't think you'll have too many problems. I did a little with it and can field more questions.
Menhir lets you parameterize nonterminal symbols by other symbols and provides a standard library of shortcuts, such as options and lists. You can also define your own. Example:
option(X): x=X { Some x}
| { None }
There is also some syntactic sugar: 'token?' is equivalent to 'option(token)', 'token+' to 'nonempty_list(token)', and 'token*' to 'list(token)'.
All of this really shortens the grammar definition. Menhir is also supported by ocamlbuild and can be a drop-in replacement for ocamlyacc. Highly recommended!
Funny, I used it to parse SQL too :)
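To illustrate what parameterized rules such as option(X) and nonempty_list(X) buy you, here is a minimal Python sketch. It uses backtracking parser combinators, which is a different technique from the LR parsing that Menhir performs, so it only shows the shape of the idea. All names (token, seq, choice, option, nonempty_list, join_list) are made up for this example, and tokens are plain strings naming the token kind.

```python
# A rule is a function (tokens, position) -> (value, new_position), or None on failure.

def token(kind):
    """Match a single token of the given kind."""
    def parse(tokens, i):
        if i < len(tokens) and tokens[i] == kind:
            return tokens[i], i + 1
        return None
    return parse

def seq(*rules):
    """Match every rule in order and collect the values in a list."""
    def parse(tokens, i):
        values = []
        for rule in rules:
            result = rule(tokens, i)
            if result is None:
                return None
            value, i = result
            values.append(value)
        return values, i
    return parse

def choice(*rules):
    """Return the result of the first alternative that matches."""
    def parse(tokens, i):
        for rule in rules:
            result = rule(tokens, i)
            if result is not None:
                return result
        return None
    return parse

def option(rule):
    """Like X?: the rule's value if it matches, otherwise None without consuming input."""
    def parse(tokens, i):
        result = rule(tokens, i)
        return result if result is not None else (None, i)
    return parse

def nonempty_list(rule):
    """Like X+: one or more matches. The rule must consume input, or this loops forever."""
    def parse(tokens, i):
        result = rule(tokens, i)
        if result is None:
            return None
        values = []
        while result is not None:
            value, i = result
            values.append(value)
            result = rule(tokens, i)
        return values, i
    return parse

# [ (INNER | LEFT | RIGHT) JOIN ID [ON ID] ]+ written with the helpers above.
join_clause = seq(
    choice(token("INNER"), token("LEFT"), token("RIGHT")),
    token("JOIN"),
    token("ID"),
    option(seq(token("ON"), token("ID"))),
)
join_list = nonempty_list(join_clause)
```

With these helpers, the optional ON part and the repetition are written inline in one rule, which is what the question was asking for.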
I am able to parse the following valid SQL expression:
(select 1 limit 1) union all select 1 union all (select 2);
However, it seems there is an ambiguity in it, which I've been having a lot of trouble resolving. Here is a working version of the program (I've obviously cut down the statements to create a minimal reproducible example) --
parser grammar DBParser;
options { tokenVocab = DBLexer;}
root
: selectStatement SEMI? EOF
;
selectStatement
: (selectStatementWithParens|selectClause|setOperation)
limitClause?
;
selectClause
: SELECT NUMBER
;
limitClause
: LIMIT NUMBER
;
selectStatementWithParens
: OPEN_PAREN selectStatement CLOSE_PAREN
;
setOperation:
(selectClause | selectStatementWithParens)
(setOperand (selectClause | selectStatementWithParens))*
;
setOperand
: UNION ALL?
;
lexer grammar DBLexer;
options { caseInsensitive=true; }
SELECT : 'SELECT'; // SELECT *...
LIMIT : 'LIMIT'; // ORDER BY x LIMIT 20
ALL : 'ALL'; // SELECT ALL vs. SELECT DISTINCT; WHERE ALL (...); UNION ALL...
UNION : 'UNION'; // Set operation
SEMI : ';'; // Statement terminator
OPEN_PAREN : '('; // Function calls, object declarations
CLOSE_PAREN : ')';
NUMBER
: [0-9]+
;
WHITESPACE
: [ \t\r\n] -> skip
;
Where is the ambiguity coming from, and what could be a possible way to solve this?
Update: I'm not sure if this is the solution, but it seems the following helped eliminate that ambiguity:
selectStatement:
withClause?
(selectStatementWithParens|selectClause)
(setOperand (selectClause|selectStatementWithParens))*
orderClause?
(limitClause offsetClause?)?
;
In other words -- making it such that the setOperand doesn't re-start with the select.
Your setOperation rule can match a single selectClause or selectStatementWithParens: because you use the * cardinality for the second half of the rule, zero instances of the second half still match the rule. This means that a selectClause can either match the selectClause alternative in selectStatement directly, or be used to construct a setOperation (which is the other alternative in your ambiguity).
If you change setOperation to use + cardinality for the second half of the rule, you resolve the ambiguity.
setOperation
: (selectClause | selectStatementWithParens)
(setOperand (selectClause | selectStatementWithParens))+
;
This also seems logical, that you'd only want to consider something a setOperation if there's a setOperand involved.
That explains and corrects the ambiguity, but still leaves you with a "max k" of 7.
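The effect of * versus + can be checked by counting derivations. The following Python sketch is a hand-written model of the cut-down grammar from the question, not ANTLR output. The function count_parses and its min_operands parameter are invented for this illustration: 0 models the original * rule and 1 models the + fix. Tokens are given as plain strings.

```python
from collections import Counter
from functools import lru_cache

def count_parses(tokens, min_operands):
    """Count the parse trees of selectStatement that cover the whole token list."""
    tokens = tuple(tokens)

    # Every rule maps a start position to a Counter {end position: number of derivations}.
    def token(kind, i):
        return Counter({i + 1: 1}) if i < len(tokens) and tokens[i] == kind else Counter()

    def then(ends, rule):
        out = Counter()
        for end, ways in ends.items():
            for end2, ways2 in rule(end).items():
                out[end2] += ways * ways2
        return out

    def optional(ends, rule):
        return ends + then(ends, rule)

    def select_clause(i):          # SELECT NUMBER
        return then(token("SELECT", i), lambda j: token("NUMBER", j))

    def limit_clause(i):           # LIMIT NUMBER
        return then(token("LIMIT", i), lambda j: token("NUMBER", j))

    @lru_cache(maxsize=None)
    def with_parens(i):            # OPEN_PAREN selectStatement CLOSE_PAREN
        inner = then(token("(", i), select_statement)
        return then(inner, lambda j: token(")", j))

    def operand(i):                # selectClause | selectStatementWithParens
        return select_clause(i) + with_parens(i)

    def set_operand(i):            # UNION ALL?
        return optional(token("UNION", i), lambda j: token("ALL", j))

    def set_operation(i):          # operand (setOperand operand) repeated >= min_operands times
        ends = operand(i)
        total = Counter(ends) if min_operands == 0 else Counter()
        repeats = 0
        while ends:
            ends = then(then(ends, set_operand), operand)
            repeats += 1
            if repeats >= min_operands:
                total += ends
        return total

    @lru_cache(maxsize=None)
    def select_statement(i):       # (withParens | selectClause | setOperation) limitClause?
        body = with_parens(i) + select_clause(i) + set_operation(i)
        return optional(body, limit_clause)

    return select_statement(0)[len(tokens)]

single = ["SELECT", "NUMBER"]
print(count_parses(single, 0))  # 2: once as selectClause, once as a setOperation with no operand
print(count_parses(single, 1))  # 1: only the selectClause alternative remains
```

Under this model, the full statement from the question has four derivations with * (each parenthesized select can be read two ways) and exactly one with +.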
I have defined two sets of identifiers, IDENTIFIER_ONE and IDENTIFIER_TWO, which are mutually exclusive subsets of IDENTIFIER. I would like to write a parser such that:
"i1(arg) EOS" can't be parsed (1)
"i2(arg) EOS" can be parsed (2)
"i1(arg) = value EOS" can be parsed (3)
"i2(arg) = value EOS" can be parsed (4)
where i1(resp., i2) belongs to IDENTIFIER_ONE (resp., IDENTIFIER_TWO); arg and value belong to IDENTIFIER. The following parser.mly has already realized all the points I am after, except (4):
identifier:
| IDENTIFIER_ONE { $1 }
| IDENTIFIER_TWO { $1 }
block_statement_EOS:
| identifier LPAREN identifier RPAREN EQUAL identifier EOS { BSE_Let ($1, $3, $6) }
| IDENTIFIER_TWO LPAREN identifier RPAREN EOS { BSE_I_I ($1, $3) }
Given i1(arg) = value EOS as input, it is correctly read as BSE_Let (i1, arg, value), as goal (3) requires. However, given i2(arg) = value EOS as input, parsing stops after reading EQUAL. I guess this is because once the parser meets i2(arg), it commits to the 2nd rule of block_statement_EOS, and the EQUAL that follows can't be parsed.
Ideally, I would hope the parser could try the 1st rule of block_statement_EOS if the 2nd rule fails. Could anyone help me to make this possible?
PS: If I write the parser.mly as follows, all the goals are achieved. Does anyone know why? I really don't like this workaround, though, because I need to write identifier instead of the two subsets in many other rules. I would hope for a more elegant solution.
block_statement_EOS:
| IDENTIFIER_ONE LPAREN identifier RPAREN EQUAL identifier EOS { BSE_Let ($1, $3, $6) }
| IDENTIFIER_TWO LPAREN identifier RPAREN EQUAL identifier EOS { BSE_Let ($1, $3, $6) }
| IDENTIFIER_TWO LPAREN identifier RPAREN EOS { BSE_I_I ($1, $3) }
When your parser encounters an LPAREN after an IDENTIFIER_TWO, it has to decide whether to shift or to reduce:
shift: put LPAREN on the stack;
reduce: replace IDENTIFIER_TWO, which is on top of the stack, by identifier.
If your parser chooses to shift, it will never reduce this particular IDENTIFIER_TWO into identifier (because this particular IDENTIFIER_TWO will never be on top of the stack again), meaning that it will always reduce the second rule of block_statement_EOS.
If your parser chooses to reduce, it will never reduce the second rule of block_statement_EOS, as this rule starts with IDENTIFIER_TWO and not identifier.
This is why your second version works, because there is no need to choose between shifting and reducing after IDENTIFIER_TWO. The choice is made later, if you wish.
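To make the two choices concrete, here is a small Python model of the first grammar. It is not a real LR parser: it only records what ends up on the stack once the parser has committed to shifting or reducing at the leading IDENTIFIER_TWO, and then checks which rule, if any, matches. The names RULES, parse and first_action are made up for this sketch.

```python
# Right-hand sides of the two block_statement_EOS rules from the first grammar.
RULES = {
    "BSE_Let": ["identifier", "LPAREN", "identifier", "RPAREN", "EQUAL", "identifier", "EOS"],
    "BSE_I_I": ["IDENTIFIER_TWO", "LPAREN", "identifier", "RPAREN", "EOS"],
}

def parse(tokens, first_action):
    """Return the name of the matching rule, or None for a syntax error.

    first_action is the decision taken when a leading IDENTIFIER_TWO is followed
    by LPAREN: "shift" keeps the raw token on the stack, "reduce" turns it into
    identifier. Every other identifier token is always reduced to identifier.
    """
    symbols = []
    for position, token in enumerate(tokens):
        is_identifier = token in ("IDENTIFIER_ONE", "IDENTIFIER_TWO")
        keep_raw = (
            position == 0 and token == "IDENTIFIER_TWO" and first_action == "shift"
        )
        symbols.append("identifier" if is_identifier and not keep_raw else token)
    for name, right_hand_side in RULES.items():
        if symbols == right_hand_side:
            return name
    return None

# i2(arg) = value EOS and i2(arg) EOS, with arg and value lexed as IDENTIFIER_ONE.
assignment = ["IDENTIFIER_TWO", "LPAREN", "IDENTIFIER_ONE", "RPAREN",
              "EQUAL", "IDENTIFIER_ONE", "EOS"]
call = ["IDENTIFIER_TWO", "LPAREN", "IDENTIFIER_ONE", "RPAREN", "EOS"]

print(parse(assignment, "shift"))   # None: only BSE_I_I is still possible, and it wants EOS
print(parse(assignment, "reduce"))  # BSE_Let
print(parse(call, "shift"))         # BSE_I_I
print(parse(call, "reduce"))        # None: only BSE_Let is still possible, and it wants EQUAL
```

Neither choice accepts both inputs, which is the conflict described above. Yacc-style generators resolve a shift/reduce conflict in favor of shifting by default, which matches the failure after EQUAL reported in the question.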
I would like to parse a set of expressions, for instance: X[3], X[-3], XY[-2], X[4]Y[2], etc.
In my parser.mly, index (which is inside []) is defined as follows:
index:
| INTEGER { $1 }
| MINUS INTEGER { 0 - $2 }
The tokens INTEGER, MINUS, etc. are defined in the lexer as usual.
When I try to parse an example, it fails. However, if I comment out | MINUS INTEGER { 0 - $2 }, it works well, so the problem is certainly related to that rule. To debug, I want more information; in other words, I want to know what is being matched as MINUS INTEGER. I tried to add a print:
index:
| INTEGER { $1 }
| MINUS INTEGER { Printf.printf "%n" $2; 0 - $2 }
But nothing is printed while parsing.
Could anyone tell me how to print information or debug that?
I tried coming up with an example of what you describe and was able to get output of 8 with what I show below. [This example is completely stripped down so that it only works for [1] and [- 1 ], but I believe it's equivalent logically to what you said you did.]
However, I also notice that the debug string in your example does not end with an explicit flush (%!), so the debugging output might not reach the terminal until later than you expect.
Here's what I used:
Test.mll:
{
open Ytest
open Lexing
}
rule test =
parse
"-" { MINUS }
| "1" { ONE 1 }
| "[" { LB }
| "]" { RB }
| [ ' ' '\t' '\r' '\n' ] { test lexbuf }
| eof { EOFTOKEN }
Ytest.mly:
%{
%}
%token <int> ONE
%token MINUS LB RB EOFTOKEN
%start item
%type <int> index item
%%
index:
ONE { 2 }
| MINUS ONE { Printf.printf "%n" 8; $2 }
item : LB index RB EOFTOKEN { $2 }
Parse.ml:
open Test;;
open Ytest;;
open Lexing;;
let lexbuf = Lexing.from_channel stdin in
ignore (Ytest.item Test.test lexbuf)
If I write a grammar file in Yacc/Bison like this:
Module
:ModuleName "=" Functions
{ $$ = Builder::concat($1, $2, ","); }
Functions
:Functions Function
{ $$ = Builder::concat($1, $2, ","); }
| Function
{ $$ = $1; }
Function
: DEF ID ARGS BODY
{
/** Lacks module name to do name mangling for the function **/
/** How can I obtain the "parent" node's module name here ?? **/
module_name = ; //????
$$ = Builder::def_function(module_name, $ID, $ARGS, $BODY);
}
And this parser should parse code like this:
main_module:
def funA (a,b,c) { ... }
In my AST, the name "funA" should be renamed to main_module.funA. But I can't get the module's information while the parser is processing the Function node!
Are there any Yacc/Bison facilities that can help me handle this problem, or should I change my parsing style to avoid such awkward situations?
There is a bison feature, but as the manual says, use it with care:
$N with N zero or negative is allowed for reference to tokens and groupings on the stack before those that match the current rule. This is a very risky practice, and to use it reliably you must be certain of the context in which the rule is applied. Here is a case in which you can use this reliably:
foo: expr bar '+' expr { ... }
| expr bar '-' expr { ... }
;
bar: /* empty */
{ previous_expr = $0; }
;
As long as bar is used only in the fashion shown here, $0 always refers to the expr which precedes bar in the definition of foo.
More cleanly, you could use a mid-rule action (in Module) to push the module name on a name stack (which would have to be part of the parsing context). You would then pop the stack at the end of the rule.
For more information and examples of mid-rule actions, see the manual.
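The name-stack idea can be sketched outside Bison. The Python below is a hand-written recursive-descent parser, not Bison code, and it assumes a simplified input where a module is NAME ":" followed by "def NAME" entries (arguments and bodies are left out). The class Parser and all of its methods are invented for this illustration.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0
        self.module_names = []  # the name stack, kept in the parsing context

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return token

    def expect(self, expected):
        token = self.next()
        if token != expected:
            raise SyntaxError(f"expected {expected!r}, got {token!r}")

    def parse_module(self):
        # Module: NAME ':' Function+
        name = self.next()
        self.expect(":")
        # Plays the role of the mid-rule action: it runs before any Function is parsed.
        self.module_names.append(name)
        functions = [self.parse_function()]
        while self.peek() == "def":
            functions.append(self.parse_function())
        # Plays the role of the end-of-rule action: restore the context.
        self.module_names.pop()
        return functions

    def parse_function(self):
        # Function: 'def' NAME, mangled with the innermost module name.
        self.expect("def")
        name = self.next()
        return f"{self.module_names[-1]}.{name}"

parser = Parser(["main_module", ":", "def", "funA", "def", "funB"])
print(parser.parse_module())  # ['main_module.funA', 'main_module.funB']
```

Using a stack instead of a single variable keeps the approach working if modules can ever be nested.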
I would like to be able to parse a non-empty, one-or-many element, comma-delimited (and optionally parenthesized) list using flex/bison parse rules.
Some examples of parseable lists:
1
1,2
(1,2)
(3)
3,4,5
(3,4,5,6)
etc.
I am using the following rules to parse the list (the final result is the parse element 'top level list'), but they do not give the desired result: I get a syntax error when supplying a valid list. Any suggestions on how I might set this up?
cList : ELEMENT
{
...
}
| cList COMMA ELEMENT
{
...
}
;
topLevelList : LPAREN cList RPAREN
{
...
}
| cList
{
...
}
;
This sounds simple. Tell me if I missed something or if my example doesn't work.
RvalCommaList:
RvalCommaListLoop
| '(' RvalCommaListLoop ')'
RvalCommaListLoop:
Rval
| RvalCommaListLoop ',' Rval
Rval: INT_LITERAL | WHATEVER
However, if you accept plain rvals as well as this list, you'll have a conflict: a regular rval is indistinguishable from a single-item list. In that case you can use the rule below, which either requires the '(' ')' around the items or requires at least two items before it counts as a list.
RvalCommaList2:
Rval ',' RvalCommaListLoop
| '(' RvalCommaListLoop ')'
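As a way to check which inputs these rules should accept, here is a small Python sketch of the same language (a non-empty comma-separated list of integers, optionally wrapped in one pair of parentheses). It is a hand-written parser, not something generated from the grammar, and the function names are made up for this example.

```python
import re

def parse_elements(tokens):
    """RvalCommaListLoop: Rval | RvalCommaListLoop ',' Rval (integers only here)."""
    elements = []
    expect_element = True
    for token in tokens:
        if expect_element:
            if not re.fullmatch(r"[0-9]+", token):
                raise SyntaxError(f"expected an element, got {token!r}")
            elements.append(int(token))
        elif token != ",":
            raise SyntaxError(f"expected ',', got {token!r}")
        expect_element = not expect_element
    if expect_element:
        # Either the list is empty or it ends with a dangling comma.
        raise SyntaxError("expected an element at the end of the list")
    return elements

def parse_top_level_list(text):
    """RvalCommaList: RvalCommaListLoop | '(' RvalCommaListLoop ')'"""
    # Integers become one token each; any other non-space character is its own token.
    tokens = re.findall(r"[0-9]+|\S", text)
    if tokens and tokens[0] == "(":
        if tokens[-1] != ")":
            raise SyntaxError("missing ')'")
        tokens = tokens[1:-1]
    return parse_elements(tokens)

print(parse_top_level_list("(3,4,5,6)"))  # [3, 4, 5, 6]
print(parse_top_level_list("1"))          # [1]
```

Inputs such as "()", "1," or "(1,2" raise SyntaxError, which matches the non-empty requirement in the question.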
I too want to know how to do this. Thinking about it briefly, one way to achieve it would be to use a linked list of the form:
struct list;
struct list {
void *item;
struct list *next;
};
struct list *make_list(void *item, struct list *next);
and using the rule:
{ $$ = make_list( $1, $2); }
This solution is very similar in design to:
Using bison to parse list of elements
The hard bit is to figure out how to handle lists in the scheme of a (I presume) binary AST.
%token INTEGER COMMA
%start input
%%
input:
%empty
| integer_list
;
integer_list
: integer_loop
| '(' integer_loop ')'
;
integer_loop
: INTEGER
| integer_loop COMMA INTEGER
;
%%