Faulty bison reduction using %glr-parser and %merge rules

I'm currently trying to build a parser for VHDL, which
has some of the same problems that C++ parsers have to face.
The context-free grammar of VHDL produces a parse
forest rather than a single parse tree because of its
ambiguity regarding function calls and array subscripts:
foo := fun(3) + arr(5);
This assignment can only be parsed unambiguously if the parser
carries around a hierarchical, type-aware symbol table
which it uses to resolve the ambiguities on the fly.
I don't want to do this, because for statements like the
one above, the parse forest does not grow exponentially, but
rather linearly with the number of function calls and
array subscripts
(unless, of course, one tortures the parser with statements like this):
foo := fun(fun(fun(fun(fun(4)))));
Since bison forces the user to create one single parse tree,
I used %merge attributes to collect all subtrees recursively and
added those subtrees under so-called AMBIG nodes in the single
AST.
The result looks like this.
In order to produce the above, I parsed the token stream "I=I(N);".
The substance of the grammar I used inside the parse.y file, is
collected below. It tries to resemble the ambiguous parts of VHDL:
start: prog
;
/* I cut out every semantic action to make this
text more readable */
prog: assignment ';'
| prog assignment ';'
;
assignment: 'I' '=' expression
;
expression: function_call %merge <stmtmerge2>
| array_indexing %merge <stmtmerge2>
| 'N'
;
function_call: 'I' '(' expression ')'
| 'I'
;
array_indexing: function_call '(' expression ')' %merge <stmtmerge>
| 'I' '(' expression ')' %merge <stmtmerge>
;
The whole source code can be read at this GitHub repository.
And now, let's get down to the actual problem.
As you can see in the generated parse tree above,
the nodes FCALL1 and ARRIDX1 refer to the same
single node EXPR1 which in turn refers to N1 twice.
This, by all means, should not have happened and I don't
know why. Instead there should be the paths
FCALL1 -> EXPR2 -> N2
ARRIDX1 -> EXPR1 -> N1
Do you have any idea why bison reuses the aforementioned
nodes?
I also wrote a bug report on the official GNU mailing
list for bison, without a reply so far.
Unfortunately, due to the restrictions for new Stack Overflow
users, I can't provide a link to this bug report...

That behaviour is expected.
expression can be unambiguously reduced, and that reduced value is used by both possible ambiguous reductions which include the value. Remember that GLR, like LR, is a left-to-right parser. When a reduction action is executed, all of the child reductions have already happened. The effect is not different from the use of a terminal in a right-hand side; the terminal will not be artificially copied in order to produce different instances in the ambiguous productions which use it.
For most people, this would be a feature rather than a bug, and I don't mean that as a joke. Without the graph-structured stack, GLR has exponential run-time. If you really want to do a deep copy of shared AST nodes when you merge parse trees, you will have to do it yourself, but I suggest that you find a way to make use of the fact that the parse forest is really a directed acyclic graph rather than a tree; you will probably be able to take advantage of the lack of duplication.
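To make the sharing concrete, here is a minimal Python sketch (not bison code). The Node class and the helper count_unique are hypothetical, and the labels only mirror the ones in the question. Both alternatives under AMBIG point at the same EXPR1 object; a traversal that remembers visited nodes handles the DAG directly, and a deep copy per alternative is only needed if you really want separate trees.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node; not part of bison's API."""
    label: str
    children: list = field(default_factory=list)


def count_unique(node, seen=None):
    """Count distinct node objects reachable from node, visiting shared ones once."""
    if seen is None:
        seen = set()
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_unique(child, seen) for child in node.children)


# The inner expression is reduced once and shared by both readings.
expr = Node("EXPR1", [Node("N1")])
fcall = Node("FCALL1", [expr])
arridx = Node("ARRIDX1", [expr])
ambig = Node("AMBIG", [fcall, arridx])

# Copying each alternative separately duplicates the shared subtree.
unshared = Node("AMBIG", [copy.deepcopy(fcall), copy.deepcopy(arridx)])

shared_count = count_unique(ambig)       # AMBIG, FCALL1, ARRIDX1, EXPR1, N1
unshared_count = count_unique(unshared)  # EXPR1 and N1 now exist twice
```

With the shared form, any pass that keys its work on node identity (as count_unique does) processes EXPR1 once, no matter how many ambiguous parents refer to it.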

Related

Validating expressions in the parser

I am working on a SQL grammar where pretty much anything can be an expression, even places where you might not realize it. Here are a few examples:
-- using an expression on the list indexing
SELECT ([1,2,3])[(select 1) : (select 1 union select 1 limit 1)];
Of course this is an extreme example, but my point is that in many places in SQL you can use an arbitrarily nested expression (even when it would seem "Oh, that is probably just going to allow a number or string constant").
Because of this, I currently have one long rule for expressions that may reference itself, the following being a pared down example:
grammar DBParser;
options { caseInsensitive=true; }
statement:select_statement EOF;
select_statement
: 'SELECT' expr
'WHERE' expr // The WHERE clause should only allow a BoolExpr
;
expr
: expr '=' expr # EqualsExpr
| expr 'OR' expr # BoolExpr
| ATOM # ConstExpr
;
ATOM: [0-9]+ | '\'' [a-z]+ '\'';
WHITESPACE: [ \t\r\n] -> skip;
With sample input SELECT 1 WHERE 'abc' OR 1=2. However, one place I do want to limit what expressions are allowed is in the WHERE (and HAVING) clause, where the expression must be a boolean expression, in other words WHERE 1=1 is valid, but WHERE 'abc' is invalid. In practical terms what this means is the top node of the expression must be a BoolExpr. Is this something that I should modify in my parser rules, or should I be doing this validation downstream, for example in the semantic phase of validation? Doing it this way would probably be quite a bit simpler (even if the lexer rules are a bit lax), as there would be so much indirection and probably indirect left-recursion involved that it would become incredibly convoluted. What would be a good approach here?
Your intuition is correct that breaking this out would probably create indirect left recursion. Also, is it possible that an IDENTIFIER could represent a boolean value?
This is the point of @user207421's comment. You can't fully capture types (i.e. whether an expression is boolean or not) in the parser.
The parser's job (in the Lexer & Parser sense), put fairly simply, is to convert your input stream of characters into a parse tree that you can work with. As long as it gives a parse tree that is the only possible way to interpret the input (whether it is semantically valid or not), it has served its purpose. Once you have a parse tree, then during semantic validation you can consider the expression passed as a parameter to your WHERE clause and determine whether or not it has a boolean value (this may even require consulting a symbol table to determine the type of an identifier). In the same way, your semantic validation of an OR expression will need to determine that both the lhs and rhs are, themselves, boolean expressions.
Also consider that even if you could torture the parser into catching some of your type exceptions, the error messages you produce from semantic validation are almost guaranteed to be more useful than the generated syntax errors. The parser only catches syntax errors, and it should probably feel a bit "odd" to consider a non-boolean expression to be a "syntax error".
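As a rough illustration of doing this check after parsing, here is a small Python sketch. The node classes (Const, Equals, Or) and the typing rules are assumptions made up for the example, loosely following the labels in the question's grammar; a real implementation would walk whatever tree ANTLR produces and would probably consult a symbol table.

```python
from dataclasses import dataclass


@dataclass
class Const:
    text: str          # ATOM token text, e.g. "1" or "'abc'"


@dataclass
class Equals:
    left: object
    right: object


@dataclass
class Or:
    left: object
    right: object


def type_of(expr):
    """Return 'bool', 'int' or 'str'; raise TypeError on a semantic error."""
    if isinstance(expr, Const):
        return "str" if expr.text.startswith("'") else "int"
    if isinstance(expr, Equals):
        left, right = type_of(expr.left), type_of(expr.right)
        if left != right:  # assumed rule: both sides must have the same type
            raise TypeError(f"cannot compare {left} with {right}")
        return "bool"
    if isinstance(expr, Or):
        for side in (expr.left, expr.right):
            if type_of(side) != "bool":
                raise TypeError("operands of OR must be boolean")
        return "bool"
    raise TypeError(f"unknown node {expr!r}")


def check_where(expr):
    """Reject a WHERE clause whose expression is not boolean."""
    if type_of(expr) != "bool":
        raise TypeError("WHERE clause requires a boolean expression")
```

Under these assumed rules, check_where(Equals(Const("1"), Const("1"))) passes, while the tree for 'abc' OR 1=2 is rejected with a message about OR operands rather than a generic syntax error.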

Set a rule based on the value of a global variable

In my lexer & parser by ocamllex and ocamlyacc, I have a .mly as follows:
%{
open Params
open Syntax
%}
main:
| expr EOF { $1 }
expr:
| INTEGER { EE_integer $1 }
| LBRACKET expr_separators RBRACKET { EE_brackets (List.rev $2) }
expr_separators:
/* empty */ { [] }
| expr { [$1] }
| expr_separators ...... expr_separators { $3 :: $1 }
In params.ml, a variable separator is defined. Its value is either ; or , and set by the upstream system.
In the .mly, I want the rule of expr_separators to be defined based on the value of Params.separator. For example, when Params.separator is ;, only [1;2;3] is considered as expr, whereas [1,2,3] is not. When Params.separator is ,, only [1,2,3] is considered as expr, whereas [1;2;3] is not.
Does anyone know how to amend the lexer and parser to realize this?
PS:
The value of Params.separator is set before the parsing, it will not change during the parsing.
At the moment, in the lexer, , returns a token COMMA and ; returns SEMICOLON. In the parser, there are other rules where COMMA or SEMICOLON are involved.
I just want to set a rule expr_separators such that it considers ; and ignores , (which may be parsed by other rules), when Params.separator is ;; and it considers , and ignores ; (which may be parsed by other rules), when Params.separator is ,.
In some ways, this request is essentially the same as asking a macro preprocessor to alter its substitution at runtime, or a compiler to alter the type of a variable. As with the program itself, once the grammar has been compiled (whether into executable code or a parsing table), it's not possible to go back and modify it. At least, that's the case for most LR(k) parser generators, which produce deterministic parsers.
Moreover, it seems unlikely that the only difference the configuration parameter makes is the selection of a single separator token. If the non-selected separator token "may be parsed by other rules", then it may be parsed by those other rules when it is the selected separator token, unless the configuration setting also causes those other rules to be suppressed. So at a minimum, it seems like you'd be looking at something like:
expr : general_expr
expr_list : expr
%if separator is comma
expr : expr_using_semicolon
expr_list : expr_list ',' expr
%else
expr : expr_using_comma
expr_list : expr_list ';' expr
%endif
Without a more specific idea of what you're trying to achieve, the best suggestion I can provide is that you write two grammars and select which one to use at runtime, based on the configuration setting. Presumably the two grammars will be mostly similar, so you can probably use your own custom-written preprocessor to generate both of them from the same input text, which might look a bit like the above example. (You can use m4, which is a general-purpose macro processor, but you might feel the learning curve is too steep for such a simple application.)
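A custom preprocessor of that kind can be very small. The following Python sketch is only an illustration: the %if/%else/%endif directives are the made-up syntax from the example above (not something ocamlyacc understands), the symbol SEPARATOR_IS_COMMA is a hypothetical name, and nesting is not supported.

```python
def preprocess(template, defines):
    """Expand %if NAME / %else / %endif lines (no nesting) using the set `defines`."""
    output = []
    keep = True           # whether lines in the current branch are emitted
    for line in template.splitlines():
        words = line.split()
        if len(words) == 2 and words[0] == "%if":
            keep = words[1] in defines
        elif words == ["%else"]:
            keep = not keep
        elif words == ["%endif"]:
            keep = True
        elif keep:
            output.append(line)
    return "\n".join(output)


TEMPLATE = """\
expr_list : expr
%if SEPARATOR_IS_COMMA
expr_list : expr_list COMMA expr
%else
expr_list : expr_list SEMICOLON expr
%endif
"""

# Generate both grammar variants from the same source text.
comma_grammar = preprocess(TEMPLATE, {"SEPARATOR_IS_COMMA"})
semicolon_grammar = preprocess(TEMPLATE, set())
```

Each output would be written to its own .mly file and compiled separately; the program then picks one of the two generated parsers at startup according to Params.separator.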
Parser generators which produce general parsers have an easier time with run-time dynamic modifications; many such parser generators have mechanisms which can do that (although they are not necessarily efficient mechanisms). For example, the Bison tool can produce GLR parsers, in which case you can select or deselect specific rules using a predicate action. The OCaml GLR generator Dypgen allows sets of rules to be dynamically added to the grammar during the parse. (I've never used dypgen, but I keep on meaning to try it; it looks interesting.) And there are many others.
Having played around with dynamic parsing features in some GLR parsers, I can only say that my personal experience has been a bit mixed. Modifying grammars at run-time is a brittle technique; grammars tend not to be very easy to split into independent pieces, so modifying a grammar rule can have unexpected consequences in places you don't expect to be affected. You don't always know exactly what language your parser accepts, because the dynamic modifications can be hard to predict. And so on. My suggestion, if you try this technique, is to start with the simplest modification possible and put a lot more effort into grammar tests (which is always a good idea, anyway).

Shift/Reduce conflict in CUP

I'm trying to write a parser for a javascript-ish language with JFlex and Cup, but I'm having some issues with those deadly shift/reduce problems and reduce/reduce problems.
I have searched thoroughly and have found tons of examples, but I'm not able to extrapolate these to my grammar. My understanding so far is that these problems are because the parser cannot decide which way it should take because it can't distinguish.
My grammar is the following one:
start with INPUT;
INPUT::= PROGRAM;
PROGRAM::= FUNCTION NEWLINE PROGRAM
| NEWLINE PROGRAM;
FUNCTION ::= function OPTIONAL id p_izq ARG p_der NEWLINE l_izq NEWLINE BODY l_der;
OPTIONAL ::=
| TYPE;
TYPE::= integer
| boolean;
ARG ::=
| TYPE id MORE_ARGS;
MORE_ARGS ::=
| colon TYPE id MORE_ARGS;
NEWLINE ::= salto NEWLINE
| ;
BODY ::= ;
I'm getting several conflicts but these 2 are a mere example:
Warning : *** Shift/Reduce conflict found in state #5
between NEWLINE ::= (*)
and NEWLINE ::= (*) salto NEWLINE
under symbol salto
Resolved in favor of shifting.
Warning : *** Shift/Reduce conflict found in state #0
between NEWLINE ::= (*)
and FUNCTION ::= (*) function OPTIONAL id p_izq ARG p_der NEWLINE l_izq NEWLINE BODY l_der
under symbol function
Resolved in favor of shifting.
PS: The grammar is far more complex, but I think that if I see how these shift/reduce problems are solved, I'll be able to fix the rest.
Thanks for your answers.
PROGRAM is useless (in the technical sense). That is, it cannot produce any sentence because in
PROGRAM::= FUNCTION NEWLINE PROGRAM
| NEWLINE PROGRAM;
both productions for PROGRAM are recursive. For a non-terminal to be useful, it needs to be able to eventually produce some string of terminals, and for it to do that, it must have at least one non-recursive production; otherwise, the recursion can never terminate. I'm surprised CUP didn't mention this to you. Or maybe it did, and you chose to ignore the warning.
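The "useless" property can be checked mechanically with a standard fixed-point computation. The Python sketch below is an illustration only: it is not how CUP reports the problem, the function name is made up, and the FUNCTION rule is simplified from the question's grammar.

```python
def productive_nonterminals(grammar):
    """Return the nonterminals that can derive some string of terminals.

    `grammar` maps each nonterminal to a list of right-hand sides (lists of
    symbols). Any symbol that is not a key of `grammar` is a terminal.
    """
    productive = set()
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            if lhs in productive:
                continue
            for rhs in alternatives:
                # A rule is usable once every symbol is a terminal or is
                # already known to be productive (an empty rhs qualifies).
                if all(sym not in grammar or sym in productive for sym in rhs):
                    productive.add(lhs)
                    changed = True
                    break
    return productive


grammar = {
    "PROGRAM": [["FUNCTION", "NEWLINE", "PROGRAM"], ["NEWLINE", "PROGRAM"]],
    "FUNCTION": [["function", "id", "p_izq", "p_der", "l_izq", "l_der"]],  # simplified
    "NEWLINE": [["salto", "NEWLINE"], []],
}

useless = set(grammar) - productive_nonterminals(grammar)
```

Here useless contains only PROGRAM, because every PROGRAM alternative mentions PROGRAM again. Adding a non-recursive alternative (for example an empty one) makes it productive.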
That's a problem -- useless non-terminals really cannot ever match anything, so they will eventually result in a parse error -- but it's not the parsing conflict you're reporting. The conflicts come from another feature of the same production, which is related to the fact that you can't divide by 0.
The thing about nothing is that any number of nothings are still nothing. So if you had plenty of nothings and someone came along and asked you exactly how many nothings you had, you'd have a bit of a problem, because to get "plenty" back from "0 * plenty", you'd have to compute "0 / 0", and that's not a well-defined value. (If you had plenty of two's, and someone asked you how many two's you had, that wouldn't be a problem: suppose the plenty of two's worked out to 40; you could easily compute that 40 / 2 = 20, which works out perfectly because 20 * 2 = 40.)
So here we don't have arithmetic, we have strings of symbols. And unfortunately the string containing no symbols is really invisible, like 0 was for all those millennia until some Arabian mathematician noticed the value of being able to write nothing.
Where this is all coming around to is that you have
PROGRAM::= ... something ...
| NEWLINE PROGRAM;
But NEWLINE is allowed to produce nothing.
NEWLINE ::= salto NEWLINE
| ;
So the second recursive production of PROGRAM might add nothing. And it might add nothing plenty of times, because the production is recursive. But the parser needs to be deterministic: it needs to know exactly how many nothings are present, so that it can reduce each nothing to a NEWLINE non-terminal and then reduce a new PROGRAM non-terminal. And it really doesn't know how many nothings to add.
In short, optional nothings and repeated nothings are ambiguous. If you are going to insert a nothing into your language, you need to make sure that there are a fixed finite number of nothings, because that's the only way the parser can unambiguously dissect nothingness.
Now, the only point of that particular recursion (as far as I can see) is to allow empty newline-terminated statements (which are visible because of the newline, but do nothing), and you could do that by changing the recursion to avoid nothing:
PROGRAM ::= ...
| salto PROGRAM;
Although it's not relevant to your current problem, I feel obliged to mention that CUP is an LALR parser generator and all that stuff you might have learned or read on the internet about recursive descent parsers not being able to handle left recursion does not apply. (I deleted a rant about the way parsing techniques are taught, so you'll have to try to recover it from the hints I've left behind.) Bottom-up parser generators like CUP and yacc/bison love left recursion. They can handle right recursion, too, of course, but reluctantly because they need to waste stack space for every recursion other than left recursion. So there is no need to warp your grammar in order to deal with the deficiency; just write the grammar naturally and be happy. (So you rarely if ever need non-terminals representing "the rest of" something.)
PS: Plenty of nothing is a culturally-specific reference to a song from the 1935 opera Porgy and Bess.

bison: a specific number of recursions?

I've been writing a parser with flex and bison for a few weeks now and have ground to a halt on account of a double recursion, the definitions of which are similar for the first few rules. Bison always chooses the wrong path at one particular stage and crashes because the grammar doesn't fit. The bison code looks a little like this:
set :
TOKEN_ /* token */
QString
QString
Integer /* number of descrs (see below) */
M_op /*'M' optional*/
alts;
and
alts :
alt | alts alt ;
alt :
QString
pName_op /* empty | TOKEN1 QString */
deVal_op /* empty | TOKEN2 Integer */
descrs
;
and
descrs :
descr | descrs descr ;
descr :
QString
QString_op /* optional qstring */
Integer
D_op /* optional 'D' */
;
Bison stays in the descrs recursion and never exits it to progress to the next alt. The integer that is read in in the initial block, however, tells us how many instances of descr are going to come. So my question is this:
Is there a way of preparing bison for a specific number of instances of the recursion so that it can exit this recursion and enter the recursion "above"? I can access this integer in the C code, but I'm not aware of syntax for said move, something like a descrs : {for (int i=0;i<n;++i){descr}} (I'm aware that probably looks ridiculous)
Failing this, is there any other way around this problem?
Any input would be much appreciated. Thanks in advance.
A context-free grammar cannot be contingent on semantic information. Yet, that is precisely what you are seeking: you wish the value of a numeric token to be taken into account in the syntax of an expression.
As a request, that's not unreasonable or immoral; it's simply outside of the reach of context-free grammars. And bison is intended to create parsers for context-free grammars. So it's simply not the correct tool for this problem.
Having said that, it is possible to use bison in this manner, if you are using a reasonably recent version of bison which includes support for GLR grammars. Bison's GLR support includes the option of using semantic predicates to control the parse. (See the bison manual for details.) A solution based on that mechanism is possible, and probably not too complicated.
Much easier -- if the grammar allows for it -- would be to use a top-down parser. Parsing a number and then that number of descrs would be trivial in a recursive-descent parser, for example.
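To show how little is involved, here is a rough recursive-descent sketch in Python. It is a simplification, not the question's full grammar: the leading TOKEN_, the two QStrings, M_op, pName_op and deVal_op are left out, and the token kinds (INT, QSTRING, D) and class layout are assumptions for the example.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens   # list of (kind, value) pairs
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, found {self.peek()} at token {self.pos}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def accept(self, kind):
        """Consume and return an optional token's value, or None if it is absent."""
        return self.expect(kind) if self.peek() == kind else None

    def descr(self):
        name = self.expect("QSTRING")
        extra = self.accept("QSTRING")          # optional second string
        number = self.expect("INT")
        flag = self.accept("D") is not None     # optional 'D'
        return (name, extra, number, flag)

    def alt(self, n_descrs):
        name = self.expect("QSTRING")
        # The count read earlier decides how many descrs belong to this alt.
        return (name, [self.descr() for _ in range(n_descrs)])

    def set_(self):
        n_descrs = self.expect("INT")
        alts = []
        while self.peek() == "QSTRING":
            alts.append(self.alt(n_descrs))
        return alts


tokens = [
    ("INT", 2),
    ("QSTRING", "a1"), ("QSTRING", "d1"), ("INT", 10),
    ("QSTRING", "d2"), ("QSTRING", "x"), ("INT", 20), ("D", "D"),
    ("QSTRING", "a2"), ("QSTRING", "d3"), ("INT", 30),
    ("QSTRING", "d4"), ("INT", 40),
]
result = Parser(tokens).set_()
```

Because the loop in alt runs exactly n_descrs times, the parser never has to guess whether the next QString starts another descr or the next alt.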
The liberal use of FOO_op non-terminals in the grammar suggests that top-down parsing would not be problematic, but it is impossible to say for sure without seeing the entire grammar. Artificial non-terminals (like FOO_op) often cause shift-reduce conflicts in LR(1) languages, because they force an immediate shift/reduce decision to be made. In an LR(1) language, a production of the form: A → ω B? χ
would normally be rendered as the pair of productions A → ω B χ; A → ω χ, rather than the substitution Bop → B | ε; A → ω Bop χ, in order to avoid creating conflicts with other productions of the form C → ω ζ where FIRST(ζ) ∩ FIRST(B ∪ ω) ≠ ∅.

SQL Parser Disambiguation

I have written a very simple SQL Parser for a very small subset of the language to handle a one time specific problem. I had to translate an extremely large amount of old SQL expressions into an intermediate form that could then possibly be brought into a business rule system. The initial attempt worked for about 80% of the existing data.
I looked at some commercial solutions but thought I could do this pretty easily based on some past experience and some reading. I hit a problem and decided to go and finish the task with a commercial solution; I know when to admit defeat. However, I am still curious as to how to handle this or what I may have done wrong.
My initial solution was based on a simple recursive descent parser, found in many books and online articles, producing an Abstract Syntax Tree and then during the analysis phase, I would determine type differences and whether logical expressions were being mixed with algebraic expressions and such.
I referenced the ANTLR SQLite grammar by Bart Kiers
https://github.com/bkiers/sqlite-parser
I also referenced an online SQL grammar site
http://savage.net.au/SQL/
The main question is how to make the parser differentiate between the following
expr AND expr
BETWEEN expr AND expr
The problem I am encountering is when I hit the following unit test case
case when PP_ID between '009000' and '009999' then 'MA' when PP_ID between '001000' and '001999' then 'TL' else 'LA' end
The '009000' and '009999' is matched as a Binary Expression so the parser throws an error expecting the keyword AND but instead encounters THEN.
The online ANSI grammar actually breaks down expressions into finer grained productions and I suspect that is the proper approach. I am also wondering if my parser should detect if an expression is actually Boolean vs. Algebraic during the parse phase and not the semantic phase, and use that information to handle the above case.
I am sure I could brute force the solution but I want to learn the correct way to handle this.
Thanks for any help offered.
I also ran into this problem while developing a Jison (Bison) parser for SQLite, and solved it with two different rules in the grammar for binary operations: one for AND and one for BETWEEN (this is a Jison grammar):
%left BETWEEN // Here I defined that AND has higher priority over BETWEEN
%left AND //
expr
: expr AND expr // Rule for AND
{ $$ = {op: 'AND', left: $1, right: $3}; }
;
expr
: expr BETWEEN expr // Rule for BETWEEN
{
if($3.op != 'AND') throw new Error('Wrong syntax of BETWEEN AND');
$$ = {op: 'BETWEEN', expr: $1, left:$3.left, right:$3.right};
}
;
The parser then checks the right-hand expression and accepts only expressions whose top-level operation is AND. Maybe this approach can help you.
For ANTLR grammar I found the following rule (see this grammar made by Bart Kiers)
expr
:
| expr K_AND expr
| expr K_NOT? K_BETWEEN expr K_AND expr
;
But I am not sure that it works properly.
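For a hand-written recursive-descent parser like the one in the question, one common way to get the same effect is to parse the bounds of BETWEEN at a level below AND, so that the AND keyword is left for BETWEEN to consume. The Python sketch below is an assumption-laden illustration, not either of the grammars above: the tokenizer, class and method names are made up, and operands are single tokens only.

```python
import re


def tokenize(text):
    """Split into quoted strings and words; enough for this illustration."""
    return re.findall(r"'[^']*'|\w+", text)


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos].upper() if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        if self.pos >= len(self.tokens) or (expected and self.peek() != expected):
            raise SyntaxError(f"expected {expected or 'operand'} at token {self.pos}")
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def and_expr(self):
        left = self.predicate()
        while self.peek() == "AND":
            self.take("AND")
            left = ("AND", left, self.predicate())
        return left

    def predicate(self):
        value = self.operand()
        if self.peek() == "BETWEEN":
            self.take("BETWEEN")
            low = self.operand()      # operand(), not and_expr(): stop before AND
            self.take("AND")          # this AND belongs to BETWEEN
            high = self.operand()
            return ("BETWEEN", value, low, high)
        return value

    def operand(self):
        return self.take()


tree = Parser("PP_ID between '009000' and '009999' and FLAG").and_expr()
```

The first AND is consumed inside predicate as part of BETWEEN, and only the second one is treated as the logical operator by and_expr.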

Resources