Validating expressions in the parser - parsing

I am working on a SQL grammar where pretty much anything can be an expression, even places where you might not realize it. Here are a few examples:
-- using an expression on the list indexing
SELECT ([1,2,3])[(select 1) : (select 1 union select 1 limit 1)];
Of course this is an extreme example, but my point being, many places in SQL you can use an arbitrarily nested expression (even when it would seem "Oh that is probably just going to allow a number or string constant).
Because of this, I currently have one long rule for expressions that may reference itself, the following being a pared down example:
grammar DBParser;
options { caseInsensitive=true; }
statement:select_statement EOF;
select_statement
: 'SELECT' expr
'WHERE' expr // The WHERE clause should only allow a BoolExpr
;
expr
: expr '=' expr # EqualsExpr
| expr 'OR' expr # BoolExpr
| ATOM # ConstExpr
;
ATOM: [0-9]+ | '\'' [a-z]+ '\'';
WHITESPACE: [ \t\r\n] -> skip;
With sample input SELECT 1 WHERE 'abc' OR 1=2. However, one place I do want to limit what expressions are allowed is in the WHERE (and HAVING) clause, where the expression must be a boolean expression, in other words WHERE 1=1 is valid, but WHERE 'abc' is invalid. In practical terms what this means is the top node of the expression must be a BoolExpr. Is this something that I should modify in my parser rules, or should I be doing this validation downstream, for example in the semantic phase of validation? Doing it this way would probably be quite a bit simpler (even if the lexer rules are a bit lax), as there would be so much indirection and probably indirect left-recursion involved that it would become incredibly convoluted. What would be a good approach here?

Your intuition is correct that breaking this out would probably create indirect left recursion. Also, is it possible that an IDENTIFIER could represent a boolean value?
This is the point of #user207421's comment. You can't fully capture types (i.e. whether an expression is boolean or not) in the parser.
The parser's job (in the Lexer & Parser sense), put fairly simply, is to convert your input stream of characters into the a parse tree that you can work with. As long as it gives a parse tree that is the only possible way to interest the input (whether it is semantically valid or not), it has served its purpose. Once you have a parse tree then during semantic validation, you can consider the expression passed as a parameter to your where clause and determine whether or not it has a boolean value (this may even require consulting a symbol table to determine the type of an identifier). Just like your semantic validation of an OR expression will need to determine that both the lhs and rhs are, themselves, boolean expressions.
Also consider that even if you could torture the parser into catching some of your type exceptions, the error messages you produce from semantic validation are almost guaranteed to be more useful than the generated syntax errors. The parser only catches syntax errors, and it should probably feel a bit "odd" to consider a non-boolean expression to be a "syntax error".

Related

Parser/Grammar: 2x parenthesis in nested rules

Despite my limited knowledge about compiling/parsing I dared to build a small recursive-descent parser for OData $filter expressions. The parser only needs to check the expression for correctness and output a corresponding condition in SQL. As input and output have almost the same tokens and structure, this was fairly straightforward, and my implementation does 90% of what I want.
But now I got stuck with parentheses, which appear in separate rules for logical and arithmetic expressions. The full OData grammar in ABNF is here, a condensed version of the rules involved is this:
boolCommonExpr = ( boolMethodCallExpr
/ notExpr
/ commonExpr [ eqExpr / neExpr / ltExpr / ... ]
/ boolParenExpr
) [ andExpr / orExpr ]
commonExpr = ( primitiveLiteral
/ firstMemberExpr ; = identifier
/ methodCallExpr
/ parenExpr
) [ addExpr / subExpr / mulExpr / divExpr / modExpr ]
boolParenExpr = "(" boolCommonExpr ")"
parenExpr = "(" commonExpr ")"
How does this grammar match a simple expression like (1 eq 2)? From what I can see all ( are consumed by the rule parenExpr inside commonExpr, i.e. they must also close after commonExpr to not cause an error and boolParenExpr never gets hit. I suppose my experience / intuition on reading such a grammar is just insufficient to get it. A comment in the ABNF says: "Note that boolCommonExpr is also a commonExpr". Maybe that's part of the mystery?
Obviously an opening ( alone won't tell me where it's going to close: After the current commonExpr expression or further away after boolCommonExpr. My lexer has a list of all tokens ahead (URL is very short input). I was thinking to use that to find out what type of ( I have. Good idea?
I'd rather have restrictions in input or a little hack than switching to a generally more powerful parser model. For a simple expression translation like this I also want to avoid compiler tools.
Edit 1: Extension after answer by rici - Is grammar rewrite correct?
Actually I started out with the example for recursive-descent parsers given on Wikipedia. Then I though to better adapt to the official grammar given by the OData standard to be more "conformant". But with the advice from rici (and the comment from "Internal Server Error") to rewrite the grammar I would tend to go back to the more comprehensible structure provided on Wikipedia.
Adapted to the boolean expression for the OData $filter this could maybe look like this:
boolSequence= boolExpr {("and"|"or") boolExpr} .
boolExpr = ["not"] expression ("eq"|"ne"|"lt"|"gt"|"lt"|"le") expression .
expression = term {("add"|"sum") term} .
term = factor {("mul"|"div"|"mod") factor} .
factor = IDENT | methodCall | LITERAL | "(" boolSequence")" .
methodCall = METHODNAME "(" [ expression {"," expression} ] ")" .
Does the above make sense in general for boolean expressions, is it mostly equivalent to the original structure above and digestible for a recursive descent parser?
#rici: Thanks for your detailed remarks on type checking. The new grammar should resolve your concerns about precedence in arithmetic expressions.
For all three terminals (UPPERCASE in the grammar above) my lexer supplies a type (string, number, datetime or boolean). Non-terminals return the type they produce. With this I managed quite nicely do type checking on the fly in my current implementation, including decent error messages. Hopefully this will also work for the new grammar.
Edit 2: Return to original OData grammar
The differentiation between a "logical" and "arithmetic" ( is not a trivial one. To solve the problem even N.Wirth uses a dodgy workaround to keep the grammar of Pascal simple. As a consequence, in Pascal an extra pair of () is mandatory around and and or expressions. Neither intuitive nor OData conformant :-(. The best read about the "() difficulty" I found is in Let's Build a Compiler (Part VI). Other languages seem to go to great length in the grammar to solve the problem. As I don't have experience with grammar construction I stopped doing my own.
I ended up implementing the original OData grammar. Before I run the parser I go over all tokens backwards to figure out which ( belong to a logical/arithmetic expression. Not a problem for the potential length of a URL.
Personally, I'd just modify the grammar so that it has only one type of expression and therefore one type of parenthesis. I'm not convinced that the OData grammar is actually correct; it is certainly not usable in an LL(1) (or recursive descent) parser for exactly the reason you mention.
Specifically, if the goal is boolCommonExpr, there are two productions which can match the ( lookahead token:
boolCommonExpr = ( …
/ commonExpr [ eqExpr / neExpr / … ]
/ boolParenExpr
/ …
) …
commonExpr = ( …
/ parenExpr
/ …
) …
For the most part, this is a misguided attempt to make the grammar detect a type violation. (If in fact it is a type violation.) It's misguided because it is doomed to failure if there are boolean variables, which there apparently are in this environment. Since there is not syntactic clue as to the type of a variable, the parser is not capable of deciding whether particular expressions are well-formed or not, so there is a good argument for not trying at all, particularly if it creates parsing headaches. A better solution is to first parse the expression into an AST of some form, and then do another pass over the AST to check that each operators has operands of the correct type (and possibly inserting explicit cast operators if that is necessary).
Aside from any other advantage, doing the type check in a separate pass lets you produce much better error messages. If you make (some) type violations syntax errors, then you may leave the user puzzled about why their expression was rejected; in contrast, if you notice that a comparison operation is being used as an operand to multiply (and if your language's semantics don't allow an automatic conversion from True/False to 1/0), then you can produce a well-targetted error message ("comparisons cannot be used as the operand of an arithmetic operator", for example).
One possible reason to put different operators (but not parentheses) into different grammatical variables is to express grammatical precedence. That consideration might encourage you to rewrite the grammar with explicit precedence. (As written, the grammar assumes that all arithmetic operators have the same precedence, which would presumably lead to 2 + 3 * a being parsed as (2 + 3) * a, which might be a huge surprise.) Alternatively, you might use some simple precedence aware subparser for expressions.
If you want to test your ABNF grammar for determinism (i.e. LL(1)), you can use Tunnel Grammar Studio (TGS). I have tested the full grammar, and there are plenty of conflicts, not only this scopes. If you are able to extract the relevant rules, you can use the desktop version of TGS to visualize the conflicts (the online version checker is with a textual result only). If the rules are not too many, the demo may help you to create an LL(1) grammar from your rules.
If you extract all rules you need, and add them to your question, I can run it for you and will tell you is it LL(1). Note that the grammar is not exactly in ABNF meta syntax, because the case sensitivity is typed with ' for case sensitive strings. The ABNF (RFC 5234) by definition is case insensitive, as RFC 7405 defines the sensitivity with %s and %i (sensitive and insensitive) prefixes before the the actual string. The default case (without a prefix) still means insensitive. This means that you have to replace this invalid '...' strings with %s"..." before testing in TGS.
TGS is a project I work on.

Set a rule based on the value of a global variable

In my lexer & parser by ocamllex and ocamlyacc, I have a .mly as follows:
%{
open Params
open Syntax
%}
main:
| expr EOF { $1 }
expr:
| INTEGER { EE_integer $1 }
| LBRACKET expr_separators RBRACKET { EE_brackets (List.rev $2) }
expr_separators:
/* empty */ { [] }
| expr { [$1] }
| expr_separators ...... expr_separators { $3 :: $1 }
In params.ml, a variable separator is defined. Its value is either ; or , and set by the upstream system.
In the .mly, I want the rule of expr_separators to be defined based on the value of Params.separator. For example, when params.separtoris ;, only [1;2;3] is considered as expr, whereas [1,2,3] is not. When params.separtoris ,, only [1,2,3] is considered as expr, whereas [1;2;3] is not.
Does anyone know how to amend the lexer and parser to realize this?
PS:
The value of Params.separator is set before the parsing, it will not change during the parsing.
At the moment, in the lexer, , returns a token COMMA and ; returns SEMICOLON. In the parser, there are other rules where COMMA or SEMICOLON are involved.
I just want to set a rule expr_separators such that it considers ; and ignores , (which may be parsed by other rules), when Params.separator is ;; and it considers , and ignore ; (which may be parsed by other rules), when Params.separator is ,.
In some ways, this request is essentially the same as asking a macro preprocessor to alter its substitution at runtime, or a compiler to alter the type of a variable. As with the program itself, once the grammar has been compiled (whether into executable code or a parsing table), it's not possible to go back and modify it. At least, that's the case for most LR(k) parser generators, which produce deterministic parsers.
Moreover, it seems unlikely that the only difference the configuration parameter makes is the selection of a single separator token. If the non-selected separator token "may be parsed by other rules", then it may be parsed by those other rules when it is the selected separator token, unless the configuration setting also causes those other rules to be suppressed. So at a minimum, it seems like you'd be looking at something like:
expr : general_expr
expr_list : expr
%if separator is comma
expr : expr_using_semicolon
expr_list : expr_list ',' expr
%else
expr : expr_using_comma
expr_list : expr_list ';' expr
%endif
Without a more specific idea of what you're trying to achieve, the best suggestion I can provide is that you write two grammars and select which one to use at runtime, based on the configuration setting. Presumably the two grammars will be mostly similar, so you can probably use your own custom-written preprocessor to generate both of them from the same input text, which might look a bit like the above example. (You can use m4, which is a general-purpose macro processor, but you might feel the learning curve is too steep for such a simple application.)
Parser generators which produce general parsers have an easier time with run-time dynamic modifications; many such parser generators have mechanisms which can do that (although they are not necessarily efficient mechanisms). For example, the Bison tool can produce GLR parsers, in which case you can select or deselect specific rules using a predicate action. The OCAML GLR generator Dypgen allows sets of rules to be dynamically added to the grammar during the parse. (I've never used dypgen, but I keep on meaning to try it; it looks interesting.) And there are many others.
Having played around with dynamic parsing features in some GLR parsers, I can only say that my personal experience has been a bit mixed. Modifying grammars at run-time is a brittle technique; grammars tend not to be very easy to split into independent pieces, so modifying a grammar rule can have unexpected consequences in places you don't expect to be affected. You don't always know exactly what language your parsing accepts, because the dynamic modifications can be hard to predict. And so on. My suggest, if you try this technique, is to start with the simplest modification possible and put a lot more effort into grammar tests (which is always a good idea, anyway).

Faulty bison reduction using %glr-parser and %merge rules

Currently I'm trying to build a parser for VHDL which
has some of the problems C++-Parsers have to face.
The context-free grammar of VHDL produces a parse
forest rather than a single parse tree because of it's
ambiguity regarding function calls and array subscriptions
foo := fun(3) + arr(5);
This assignment can only be parsed unambiguous if the parser
would carry around a hirachically, type-aware symbol table
which it'd use to resolve the ambiguities somewhat on-the-fly.
I don't want to do this, because for statements like the
aforementioned, the parse forest would not grow exponentially, but
rather linear depending on the amount of function calls and
array subscriptions.
(Except, of course, one would torture the parser with statements like)
foo := fun(fun(fun(fun(fun(4)))));
Since bison forces the user to just create one single parse-tree,
I used %merge attributes to collect all subtrees recursively and
added those subtrees under so called AMBIG nodes in the singleton
AST.
The result looks like this.
In order to produce the above, I parsed the token stream "I=I(N);".
The substance of the grammar I used inside the parse.y file, is
collected below. It tries to resemble the ambiguous parts of VHDL:
start: prog
;
/* I cut out every semantic action to make this
text more readable */
prog: assignment ';'
| prog assignment ';'
;
assignment: 'I' '=' expression
;
expression: function_call %merge <stmtmerge2>
| array_indexing %merge <stmtmerge2>
| 'N'
;
function_call: 'I' '(' expression ')'
| 'I'
;
array_indexing: function_call '(' expression ')' %merge <stmtmerge>
| 'I' '(' expression ')' %merge <stmtmerge>
;
The whole sourcecode can be read at this github repository.
And now, let's get down to the actual Problem.
As you can see in the generated parse tree above,
the nodes FCALL1 and ARRIDX1 refer to the same
single node EXPR1 which in turn refers to N1 twice.
This, by all means, should not have happened and I don't
know why. Instead there should be the paths
FCALL1 -> EXPR2 -> N2
ARRIDX1 -> EXPR1 -> N1
Do you have any idea why bison reuses the aforementioned
nodes?
I also wrote a bugreport on the official gnu mailing
list for bison, without a reply to this point though.
Unfortunately, due to the restictions for new stackoverflow
users, I can't provide no link to this bug report...
That behaviour is expected.
expression can be unambiguously reduced, and that reduced value is used by both possible ambiguous reductions which include the value. Remember that GLR, like LR, is a left-to-right parser. When a reduction action is executed, all of the child reductions have already happened. The effect is not different from the use of a terminal in a right-hand side; the terminal will not be artificially copied in order to produce different instances in the ambiguous productions which use it.
For most people, this would be a feature rather than a bug, and I don't mean that as a joke. Without the graph-structured stack, GLR has exponential run-time. If you really want to do a deep copy of shared AST nodes when you merge parse trees, you will have to do it yourself, but I suggest that you find a way to make use of the fact that the parse forest is really an directed acyclic graph rather than a tree; you will probably be able to take advantage of the lack of duplication.

How to adapt this LL(1) parser to a LL(k) parser?

In the appendices of the Dragon-book, a LL(1) front end was given as a example. I think it is very helpful. However, I find out that for the context free grammar below, a at least LL(2) parser was needed instead.
statement : variable ':=' expression
| functionCall
functionCall : ID'(' (expression ( ',' expression )*)? ')'
;
variable : ID
| ID'.'variable
| ID '[' expression ']'
;
How could I adapt the lexer for LL(1) parser to support k look ahead tokens?
Are there some elegant ways?
I know I can add some buffers for tokens. I'd like to discuss some details of programming.
this is the Parser:
class Parser
{
private Lexer lex;
private Token look;
public Parser(Lexer l)
{
lex = l;
move();
}
private void move()
{
look = lex.scan();
}
}
and the Lexer.scan() returns the next token from the stream.
In effect, you need to buffer k lookahead tokens in order to do LL(k) parsing. If k is 2, then you just need to extend your current method, which buffers one token in look, using another private member look2 or some such. For larger k, you could use a ring buffer.
In practice, you don't need the full lookahead all the time. Most of the time, one-token lookahead is sufficient. You should structure the code as a decision tree, where future tokens are only consulted if necessary to resolve ambiguity. (It's often useful to provide a special token type, "unknown", which can be assigned to the buffered token list to indicate that the lookahead hasn't reached that point yet. Alternatively, you can just always maintain k tokens of lookahead; for handbuilt parsers, that can be simpler.)
Alternatively, you can use a fallback structure where you simply try one alternative and if that doesn't work, instead of reporting a syntax error, restore the state of the parser and lexer to the next alternative. In this model, the lexer takes as an explicit argument the current input buffer position, and the input buffer needs to be rewindable. However, you can use a lookahead buffer to effectively memoize the lexer function, which can avoid rewinding and rescanning. (Scanning is usually fast enough that occasional rescans don't matter, so you might want to put off adding code complexity until your profiling indicates that it would be useful.)
Two notes:
1) I'm skeptical about the rule:
functionCall : ID'(' (expression ( ',' expression )*)* ')'
;
That would allow, for example:
function(a[3], b[2] c[x] d[y], e.foo)
which doesn't look right to me. Normally, you'd mark the contents of the () as optional instead of repeatable, eg. using an optional marker ? instead of the second Kleene star *:
functionCall : ID'(' (expression ( ',' expression )*)? ')'
;
2) In my opinion, you really should consider using bottom-up parsing for an expression language, either a generated LR(1) parser or a hand-built Pratt parser. LL(1) is rarely adequate. Of course, if you're using a parser generator, you can use tools like ANTLR which effectively implement LL(∞); that will take care of the lookahead for you.

Overloading multiplication using menhir and OCaml

I have written a lexer and parser to analyze linear algebra statements. Each statement consists of one or more expressions followed by one or more declarations. I am using menhir and OCaml to write the lexer and parser.
For example:
Ax = b, where A is invertible.
This should be read as A * x = b, (A, invertible)
In an expression all ids must be either an uppercase or lowercase symbol. I would like to overload the multiplication operator so that the user does not have to type in the '*' symbol.
However, since the lexer also needs to be able to read strings (such as "invertible" in this case), the "Ax" portion of the expression is sent over to the parser as a string. This causes a parser error since no strings should be encountered in the expression portion of the statement.
Here is the basic idea of the grammar
stmt :=
| expr "."
| decl "."
| expr "," decl "."
expr :=
| term
| unop expr
| expr binop expr
term :=
| <int> num
| <char> id
| "(" expr ")"
decl :=
| id "is" kinds
kinds :=
| <string> kind
| kind "and" kinds
Is there some way to separate the individual characters and tell the parser that they should be treated as multiplication? Is there a way to change the lexer so that it is smart enough to know that all character clusters before a comma are ids and all clusters after should be treated as strings?
It seems to me you have two problems:
You want your lexer to treat sequences of characters differently in different places.
You want multiplication to be indicated by adjacent expressions (no operator in between).
The first problem I would tackle in the lexer.
One question is why you say you need to use strings. This implies that there is a completely open-ended set of things you can say. It might be true, but if you can limit yourself to a smallish number, you can use keywords rather than strings. E.g., invertible would be a keyword.
If you really want to allow any string at all in such places, it's definitely still possible to hack a lexer so that it maintains a state describing what it has seen, and looks ahead to see what's coming. If you're not required to adhere to a pre-defined grammar, you could adjust your grammar to make this easier. (E.g., you could use commas for only one purpose.)
For the second problem, I'd say you need to add adjacency to your grammar. I.e., your grammar needs a rule that says something like term := term term. I suspect it's tricky to get this to work correctly, but it does work in OCaml (where adjacent expressions represent function application) and in awk (where adjacent expressions represent string concatenation).

Resources