SQL Parser Disambiguation - parsing

I have written a very simple SQL Parser for a very small subset of the language to handle a one time specific problem. I had to translate an extremely large amount of old SQL expressions into an intermediate form that could then possibly be brought into a business rule system. The initial attempt worked for about 80% of the existing data.
I looked at some commercial solutions but thought I could do this pretty easy based on some past experience and some reading. I hit a problem and decided to go and finish the task with a commercial solution, I know when to admit defeat. However I am still curious as to how to handle this or what I may have done wrong.
My initial solution was based on a simple recursive descent parser, found in many books and online articles, producing an Abstract Syntax Tree and then during the analysis phase, I would determine type differences and whether logical expressions were being mixed with algebraic expressions and such.
I referenced the ANTLR SQL Lite grammar by Bark Kiers
https://github.com/bkiers/sqlite-parser
I also referenced an online SQL grammar site
http://savage.net.au/SQL/
The main question is how to make the parser differentiate between the following
expr AND expr
BETWEEN expr AND expr
The problem I am encountering is when I hit the following unit test case
case when PP_ID between '009000' and '009999' then 'MA' when PP_ID between '001000' and '001999' then 'TL' else 'LA' end
The '009000' and '009999' is matched as a Binary Expression so the parser throws an error expecting the keyword AND but instead encounters THEN.
The online ANSI grammar actually breaks down expressions into finer grained productions and I suspect that is the proper approach. I am also wondering if my parser should detect if an expression is actually Boolean vs. Algebraic during the parse phase and not the semantic phase, and use that information to handle the above case.
I am sure I could brute force the solution but I want to learn the correct way to handle this.
Thanks for any help offered.

I also met with this problem while developed Jison (Bison) parser for SQLite, and solved it with who different rules in grammar for binary operations: one for AND and one for BETWEEN (this is a Jison grammar):
%left BETWEEN // Here I defined that AND has higher priority over BETWEEN
%left AND //
: expr AND expr // Rule for AND
{ $$ = {op: 'AND', left: $1, right: $3}; }
;
: expr BETWEEN expr // Rule for BETWEEN
{
if($3.op != 'AND') throw new Error('Wrong syntax of BETWEEN AND');
$$ = {op: 'BETWEEN', expr: $1, left:$3.left, right:$3.right};
}
;
and then parser checks right expression, and pass only expressions with AND operations. May be this approach can help you.
For ANTLR grammar I found the following rule (see this grammar made by Bart Kiers)
expr
:
| expr K_AND expr
| expr K_NOT? K_BETWEEN expr K_AND expr
;
But I am not sure, that it works in proper way.

Related

Validating expressions in the parser

I am working on a SQL grammar where pretty much anything can be an expression, even places where you might not realize it. Here are a few examples:
-- using an expression on the list indexing
SELECT ([1,2,3])[(select 1) : (select 1 union select 1 limit 1)];
Of course this is an extreme example, but my point being, many places in SQL you can use an arbitrarily nested expression (even when it would seem "Oh that is probably just going to allow a number or string constant).
Because of this, I currently have one long rule for expressions that may reference itself, the following being a pared down example:
grammar DBParser;
options { caseInsensitive=true; }
statement:select_statement EOF;
select_statement
: 'SELECT' expr
'WHERE' expr // The WHERE clause should only allow a BoolExpr
;
expr
: expr '=' expr # EqualsExpr
| expr 'OR' expr # BoolExpr
| ATOM # ConstExpr
;
ATOM: [0-9]+ | '\'' [a-z]+ '\'';
WHITESPACE: [ \t\r\n] -> skip;
With sample input SELECT 1 WHERE 'abc' OR 1=2. However, one place I do want to limit what expressions are allowed is in the WHERE (and HAVING) clause, where the expression must be a boolean expression, in other words WHERE 1=1 is valid, but WHERE 'abc' is invalid. In practical terms what this means is the top node of the expression must be a BoolExpr. Is this something that I should modify in my parser rules, or should I be doing this validation downstream, for example in the semantic phase of validation? Doing it this way would probably be quite a bit simpler (even if the lexer rules are a bit lax), as there would be so much indirection and probably indirect left-recursion involved that it would become incredibly convoluted. What would be a good approach here?
Your intuition is correct that breaking this out would probably create indirect left recursion. Also, is it possible that an IDENTIFIER could represent a boolean value?
This is the point of #user207421's comment. You can't fully capture types (i.e. whether an expression is boolean or not) in the parser.
The parser's job (in the Lexer & Parser sense), put fairly simply, is to convert your input stream of characters into the a parse tree that you can work with. As long as it gives a parse tree that is the only possible way to interest the input (whether it is semantically valid or not), it has served its purpose. Once you have a parse tree then during semantic validation, you can consider the expression passed as a parameter to your where clause and determine whether or not it has a boolean value (this may even require consulting a symbol table to determine the type of an identifier). Just like your semantic validation of an OR expression will need to determine that both the lhs and rhs are, themselves, boolean expressions.
Also consider that even if you could torture the parser into catching some of your type exceptions, the error messages you produce from semantic validation are almost guaranteed to be more useful than the generated syntax errors. The parser only catches syntax errors, and it should probably feel a bit "odd" to consider a non-boolean expression to be a "syntax error".

Differences in the two different parsing methodologies

I am quite new to Antlr4, and am interested in simplifying a large grammar that I have. There are two possible approaches that I may take. The first one, is more of the "in-line" approach, which often makes things simpler as I run into less left-recursion errors (and everything is 'there in one place' so it it's a bit easier to track. Here is an example parse with it:
// inline
// input: CASE WHEN 1 THEN 1 WHEN 1 THEN 1 ELSE 1 END
grammar DBParser;
statement: expr EOF;
expr
: 'CASE' expr? ('WHEN' expr 'THEN' expr)+ ('ELSE' expr)? 'END' # caseExpressionInline
| ATOM # constantExpression
;
ATOM: [a-z]+ | [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
The second approach is to break everything out into separate small components, for example:
// separated
grammar DBParser;
statement: expr EOF;
expr
: case_expr # caseExpression
| ATOM # constantExpression
;
case_expr
: 'CASE' case_arg? when_clause_list case_default? 'END'
;
when_clause_list
: when_clause+
;
when_clause
: 'WHEN' expr 'THEN' expr
;
case_default
: 'ELSE' expr
;
case_arg
: expr
;
ATOM: [a-z]+ | [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
A couple high-level things I've noticed are:
The inline version comes out to about half the size of the output code files of the second approach.
The in-line version is about 20% faster! I tried both parsers on a 10MB file of repetitions of the same statement, and the inline version never took over 1s and the separated version took around 1.2s and never went under 1.1s (I ran each about 20x). That seems crazy to me that just in-lining the expressions would have such a large performance difference! That was certainly not my goal but noticed it as I was comparing the two different things in writing up this question.
What would be the preferable way to split things up then?
(With the caveat that SO frowns upon opinion-based questions…)
I think that you should let the programming experience of coding against the generated classes be the most important guidance on how much to break things out.
I find the first style much easier to deal with, and you’ve already pointed out performance benefits.
Like most guidance it can be taken too far in either direction. The second is pretty extreme in the “breaking things out” category and, in my opinion, going to be a pain to deal with the deeply nested structure. I think it appeals to our desire to “keep functions small” in general purpose languages, but you’ll spend a lot more time writing the rest of your code than you will getting the parser right.
Your first example comes close to going to the other extreme. I don’t think it quite crosses the line, but if it were to start getting out of hand, I might opt for something like:
grammar DBParser;
statement: expr EOF;
expr
: case_expr # caseExpression
| ATOM # constantExpression
;
case_expr
: 'CASE' expr? ('WHEN' expr 'THEN' expr)+ ('ELSE' expr)? 'END'
;
I’ve also found that I go back and rework the grammar sometimes as I’m fleshing out the language and find I want a slightly different structure. For example, I found that, when doing code completion, it was sometimes helpful to have different parser rules for different contexts that just required an ID token. The code completion could then tell me that I could use an assigment_dest instead of an `IDK and I could do much better completion, so, that’s a case where an extra level proved useful once I started using the grammar in practice.
To me, refining the grammar to give me the most useful ParseTree is the first priority.

Parser/Grammar: 2x parenthesis in nested rules

Despite my limited knowledge about compiling/parsing I dared to build a small recursive-descent parser for OData $filter expressions. The parser only needs to check the expression for correctness and output a corresponding condition in SQL. As input and output have almost the same tokens and structure, this was fairly straightforward, and my implementation does 90% of what I want.
But now I got stuck with parentheses, which appear in separate rules for logical and arithmetic expressions. The full OData grammar in ABNF is here, a condensed version of the rules involved is this:
boolCommonExpr = ( boolMethodCallExpr
/ notExpr
/ commonExpr [ eqExpr / neExpr / ltExpr / ... ]
/ boolParenExpr
) [ andExpr / orExpr ]
commonExpr = ( primitiveLiteral
/ firstMemberExpr ; = identifier
/ methodCallExpr
/ parenExpr
) [ addExpr / subExpr / mulExpr / divExpr / modExpr ]
boolParenExpr = "(" boolCommonExpr ")"
parenExpr = "(" commonExpr ")"
How does this grammar match a simple expression like (1 eq 2)? From what I can see all ( are consumed by the rule parenExpr inside commonExpr, i.e. they must also close after commonExpr to not cause an error and boolParenExpr never gets hit. I suppose my experience / intuition on reading such a grammar is just insufficient to get it. A comment in the ABNF says: "Note that boolCommonExpr is also a commonExpr". Maybe that's part of the mystery?
Obviously an opening ( alone won't tell me where it's going to close: After the current commonExpr expression or further away after boolCommonExpr. My lexer has a list of all tokens ahead (URL is very short input). I was thinking to use that to find out what type of ( I have. Good idea?
I'd rather have restrictions in input or a little hack than switching to a generally more powerful parser model. For a simple expression translation like this I also want to avoid compiler tools.
Edit 1: Extension after answer by rici - Is grammar rewrite correct?
Actually I started out with the example for recursive-descent parsers given on Wikipedia. Then I though to better adapt to the official grammar given by the OData standard to be more "conformant". But with the advice from rici (and the comment from "Internal Server Error") to rewrite the grammar I would tend to go back to the more comprehensible structure provided on Wikipedia.
Adapted to the boolean expression for the OData $filter this could maybe look like this:
boolSequence= boolExpr {("and"|"or") boolExpr} .
boolExpr = ["not"] expression ("eq"|"ne"|"lt"|"gt"|"lt"|"le") expression .
expression = term {("add"|"sum") term} .
term = factor {("mul"|"div"|"mod") factor} .
factor = IDENT | methodCall | LITERAL | "(" boolSequence")" .
methodCall = METHODNAME "(" [ expression {"," expression} ] ")" .
Does the above make sense in general for boolean expressions, is it mostly equivalent to the original structure above and digestible for a recursive descent parser?
#rici: Thanks for your detailed remarks on type checking. The new grammar should resolve your concerns about precedence in arithmetic expressions.
For all three terminals (UPPERCASE in the grammar above) my lexer supplies a type (string, number, datetime or boolean). Non-terminals return the type they produce. With this I managed quite nicely do type checking on the fly in my current implementation, including decent error messages. Hopefully this will also work for the new grammar.
Edit 2: Return to original OData grammar
The differentiation between a "logical" and "arithmetic" ( is not a trivial one. To solve the problem even N.Wirth uses a dodgy workaround to keep the grammar of Pascal simple. As a consequence, in Pascal an extra pair of () is mandatory around and and or expressions. Neither intuitive nor OData conformant :-(. The best read about the "() difficulty" I found is in Let's Build a Compiler (Part VI). Other languages seem to go to great length in the grammar to solve the problem. As I don't have experience with grammar construction I stopped doing my own.
I ended up implementing the original OData grammar. Before I run the parser I go over all tokens backwards to figure out which ( belong to a logical/arithmetic expression. Not a problem for the potential length of a URL.
Personally, I'd just modify the grammar so that it has only one type of expression and therefore one type of parenthesis. I'm not convinced that the OData grammar is actually correct; it is certainly not usable in an LL(1) (or recursive descent) parser for exactly the reason you mention.
Specifically, if the goal is boolCommonExpr, there are two productions which can match the ( lookahead token:
boolCommonExpr = ( …
/ commonExpr [ eqExpr / neExpr / … ]
/ boolParenExpr
/ …
) …
commonExpr = ( …
/ parenExpr
/ …
) …
For the most part, this is a misguided attempt to make the grammar detect a type violation. (If in fact it is a type violation.) It's misguided because it is doomed to failure if there are boolean variables, which there apparently are in this environment. Since there is not syntactic clue as to the type of a variable, the parser is not capable of deciding whether particular expressions are well-formed or not, so there is a good argument for not trying at all, particularly if it creates parsing headaches. A better solution is to first parse the expression into an AST of some form, and then do another pass over the AST to check that each operators has operands of the correct type (and possibly inserting explicit cast operators if that is necessary).
Aside from any other advantage, doing the type check in a separate pass lets you produce much better error messages. If you make (some) type violations syntax errors, then you may leave the user puzzled about why their expression was rejected; in contrast, if you notice that a comparison operation is being used as an operand to multiply (and if your language's semantics don't allow an automatic conversion from True/False to 1/0), then you can produce a well-targetted error message ("comparisons cannot be used as the operand of an arithmetic operator", for example).
One possible reason to put different operators (but not parentheses) into different grammatical variables is to express grammatical precedence. That consideration might encourage you to rewrite the grammar with explicit precedence. (As written, the grammar assumes that all arithmetic operators have the same precedence, which would presumably lead to 2 + 3 * a being parsed as (2 + 3) * a, which might be a huge surprise.) Alternatively, you might use some simple precedence aware subparser for expressions.
If you want to test your ABNF grammar for determinism (i.e. LL(1)), you can use Tunnel Grammar Studio (TGS). I have tested the full grammar, and there are plenty of conflicts, not only this scopes. If you are able to extract the relevant rules, you can use the desktop version of TGS to visualize the conflicts (the online version checker is with a textual result only). If the rules are not too many, the demo may help you to create an LL(1) grammar from your rules.
If you extract all rules you need, and add them to your question, I can run it for you and will tell you is it LL(1). Note that the grammar is not exactly in ABNF meta syntax, because the case sensitivity is typed with ' for case sensitive strings. The ABNF (RFC 5234) by definition is case insensitive, as RFC 7405 defines the sensitivity with %s and %i (sensitive and insensitive) prefixes before the the actual string. The default case (without a prefix) still means insensitive. This means that you have to replace this invalid '...' strings with %s"..." before testing in TGS.
TGS is a project I work on.

Set a rule based on the value of a global variable

In my lexer & parser by ocamllex and ocamlyacc, I have a .mly as follows:
%{
open Params
open Syntax
%}
main:
| expr EOF { $1 }
expr:
| INTEGER { EE_integer $1 }
| LBRACKET expr_separators RBRACKET { EE_brackets (List.rev $2) }
expr_separators:
/* empty */ { [] }
| expr { [$1] }
| expr_separators ...... expr_separators { $3 :: $1 }
In params.ml, a variable separator is defined. Its value is either ; or , and set by the upstream system.
In the .mly, I want the rule of expr_separators to be defined based on the value of Params.separator. For example, when params.separtoris ;, only [1;2;3] is considered as expr, whereas [1,2,3] is not. When params.separtoris ,, only [1,2,3] is considered as expr, whereas [1;2;3] is not.
Does anyone know how to amend the lexer and parser to realize this?
PS:
The value of Params.separator is set before the parsing, it will not change during the parsing.
At the moment, in the lexer, , returns a token COMMA and ; returns SEMICOLON. In the parser, there are other rules where COMMA or SEMICOLON are involved.
I just want to set a rule expr_separators such that it considers ; and ignores , (which may be parsed by other rules), when Params.separator is ;; and it considers , and ignore ; (which may be parsed by other rules), when Params.separator is ,.
In some ways, this request is essentially the same as asking a macro preprocessor to alter its substitution at runtime, or a compiler to alter the type of a variable. As with the program itself, once the grammar has been compiled (whether into executable code or a parsing table), it's not possible to go back and modify it. At least, that's the case for most LR(k) parser generators, which produce deterministic parsers.
Moreover, it seems unlikely that the only difference the configuration parameter makes is the selection of a single separator token. If the non-selected separator token "may be parsed by other rules", then it may be parsed by those other rules when it is the selected separator token, unless the configuration setting also causes those other rules to be suppressed. So at a minimum, it seems like you'd be looking at something like:
expr : general_expr
expr_list : expr
%if separator is comma
expr : expr_using_semicolon
expr_list : expr_list ',' expr
%else
expr : expr_using_comma
expr_list : expr_list ';' expr
%endif
Without a more specific idea of what you're trying to achieve, the best suggestion I can provide is that you write two grammars and select which one to use at runtime, based on the configuration setting. Presumably the two grammars will be mostly similar, so you can probably use your own custom-written preprocessor to generate both of them from the same input text, which might look a bit like the above example. (You can use m4, which is a general-purpose macro processor, but you might feel the learning curve is too steep for such a simple application.)
Parser generators which produce general parsers have an easier time with run-time dynamic modifications; many such parser generators have mechanisms which can do that (although they are not necessarily efficient mechanisms). For example, the Bison tool can produce GLR parsers, in which case you can select or deselect specific rules using a predicate action. The OCAML GLR generator Dypgen allows sets of rules to be dynamically added to the grammar during the parse. (I've never used dypgen, but I keep on meaning to try it; it looks interesting.) And there are many others.
Having played around with dynamic parsing features in some GLR parsers, I can only say that my personal experience has been a bit mixed. Modifying grammars at run-time is a brittle technique; grammars tend not to be very easy to split into independent pieces, so modifying a grammar rule can have unexpected consequences in places you don't expect to be affected. You don't always know exactly what language your parsing accepts, because the dynamic modifications can be hard to predict. And so on. My suggest, if you try this technique, is to start with the simplest modification possible and put a lot more effort into grammar tests (which is always a good idea, anyway).

yacc shift-reduce for ambiguous lambda syntax

I'm writing a grammar for a toy language in Yacc (the one packaged with Go) and I have an expected shift-reduce conflict due to the following pseudo-issue. I have to distilled the problem grammar down to the following.
start:
stmt_list
expr:
INT | IDENT | lambda | '(' expr ')' { $$ = $2 }
lambda:
'(' params ')' '{' stmt_list '}'
params:
expr | params ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
A lambda function looks something like this:
map((v) { v * 2 }, collection)
My parser emits:
conflicts: 1 shift/reduce
Given the input:
(a)
It correctly parses an expr by the '(' expr ')' rule. However given an input of:
(a) { a }
(Which would be a lambda for the identity function, returning its input). I get:
syntax error: unexpected '{'
This is because when (a) is read, the parser is choosing to reduce it as '(' expr ')', rather than consider it to be '(' params ')'. Given this conflict is a shift-reduce and not a reduce-reduce, I'm assuming this is solvable. I just don't know how to structure the grammar to support this syntax.
EDIT | It's ugly, but I'm considering defining a token so that the lexer can recognize the ')' '{' sequence and send it through as a single token to resolve this.
EDIT 2 | Actually, better still, I'll make lambdas require syntax like ->(a, b) { a * b} in the grammar, but have the lexer emit the -> rather than it being in the actual source code.
Your analysis is indeed correct; although the grammar is not ambiguous, it is impossible for the parser to decide with the input reduced to ( <expr> and with lookahead ) whether or not the expr should be reduced to params before shifting the ) or whether the ) should be shifted as part of a lambda. If the next token were visible, the decision could be made, so the grammar LR(2), which is outside of the competence of go/yacc.
If you were using bison, you could easily solve this problem by requesting a GLR parser, but I don't believe that go/yacc provides that feature.
There is an LR(1) grammar for the language (there is always an LR(1) grammar corresponding to any LR(k) grammar for any value of k) but it is rather annoying to write by hand. The essential idea of the LR(k) to LR(1) transformation is to shift the reduction decisions k-1 tokens forward by accumulating k-1 tokens of context into each production. So in the case that k is 2, each production P: N → α will be replaced with productions TNU → Tα U for each T in FIRST(α) and each U in FOLLOW(N). [See Note 1] That leads to a considerable blow-up of non-terminals in any non-trivial grammar.
Rather than pursuing that idea, let me propose two much simpler solutions, both of which you seem to be quite close to.
First, in the grammar you present, the issue really is simply the need for a two-token lookahead when the two tokens are ){. That could easily be detected in the lexer, and leads to a solution which is still hacky but a simpler hack: Return ){ as a single token. You need to deal with intervening whitespace, etc., but it doesn't require retaining any context in the lexer. This has the added bonus that you don't need to define params as a list of exprs; they can just be a list of IDENT (if that's relevant; a comment suggests that it isn't).
The alternative, which I think is a bit cleaner, is to extend the solution you already seem to be proposing: accept a little too much and reject the errors in a semantic action. In this case, you might do something like:
start:
stmt_list
expr:
INT
| IDENT
| lambda
| '(' expr_list ')'
{ // If $2 has more than one expr, report error
$$ = $2
}
lambda:
'(' expr_list ')' '{' stmt_list '}'
{ // If anything in expr_list is not a valid param, report error
$$ = make_lambda($2, $4)
}
expr_list:
expr | expr_list ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
Notes
That's only an outline; the complete algorithm includes the mechanism to recover the original parse tree. If k is greater than 2 then T and U are strings the the FIRSTk-1 and FOLLOWk-1 sets.
If it really is a shift-reduce conflict, and you want only the shift behavior, your parser generator may give you a way to prefer a shift vs. a reduce. This is classically how the conflict for grammar rules for "if-then-stmt" and "if-then-stmt-else-stmt" is resolved, when the if statement can also be a statement.
See http://www.gnu.org/software/bison/manual/html_node/Shift_002fReduce.html
You can get this effect two ways:
a) Count on the accidental behavior of the parsing engine.
If an LALR parser handles shifts first, and then reductions if there are no shifts, then you'll get this "prefer shift" for free. All the parser generator has to do is built the parse tables anyway, even if there is a detected conflict.
b) Enforce the accidental behavior. Design (or a get a) parser generator to accept "prefer shift on token T". Then one can supress the ambiguity. One still have to implement the parsing engine as in a) but that's pretty easy.
I think this is easier/cleaner than abusing the lexer to make strange tokens (and that doesn't always work anyway).
Obviously, you could make a preference for reductions to turn it the other way. With some extra hacking, you could make shift-vs-reduce specific the state in which the conflict occured; you can even make it specific to the pair of conflicting rules but now the parsing engine needs to keep preference data around for nonterminals. That still isn't hard. Finally, you could add a predicate for each nonterminal which is called when a shift-reduce conflict is about to occur, and it have it provide a decision.
The point is you don't have to accept "pure" LALR parsing; you can bend it easily in a variety of ways, if you are willing to modify the parser generator/engine a little bit. This gives a really good reason to understand how these tools work; then you can abuse them to your benefit.

Resources