Differences between two parsing methodologies

I am quite new to Antlr4 and am interested in simplifying a large grammar that I have. There are two possible approaches I might take. The first is more of the "in-line" approach, which often makes things simpler because I run into fewer left-recursion errors (and everything is there in one place, so it's a bit easier to track). Here is an example with it:
// inline
// input: CASE WHEN 1 THEN 1 WHEN 1 THEN 1 ELSE 1 END
grammar DBParser;
statement: expr EOF;
expr
: 'CASE' expr? ('WHEN' expr 'THEN' expr)+ ('ELSE' expr)? 'END' # caseExpressionInline
| ATOM # constantExpression
;
ATOM: [a-z]+ | [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
The second approach is to break everything out into separate small components, for example:
// separated
grammar DBParser;
statement: expr EOF;
expr
: case_expr # caseExpression
| ATOM # constantExpression
;
case_expr
: 'CASE' case_arg? when_clause_list case_default? 'END'
;
when_clause_list
: when_clause+
;
when_clause
: 'WHEN' expr 'THEN' expr
;
case_default
: 'ELSE' expr
;
case_arg
: expr
;
ATOM: [a-z]+ | [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
A couple of high-level things I've noticed are:
The generated code files for the inline version come out to about half the size of those for the second approach.
The inline version is about 20% faster! I tried both parsers on a 10 MB file of repetitions of the same statement; the inline version never took over 1 s, while the separated version took around 1.2 s and never went under 1.1 s (I ran each about 20 times). It seems crazy to me that just inlining the expressions would make such a large performance difference. That was certainly not my goal, but I noticed it while comparing the two approaches as I wrote up this question.
What would be the preferable way to split things up then?

(With the caveat that SO frowns upon opinion-based questions…)
I think that you should let the programming experience of coding against the generated classes be the most important guidance on how much to break things out.
I find the first style much easier to deal with, and you’ve already pointed out performance benefits.
Like most guidance, it can be taken too far in either direction. The second example is pretty extreme in the "breaking things out" category, and the deeply nested structure it produces is, in my opinion, going to be a pain to deal with. I think it appeals to our desire to "keep functions small" in general-purpose languages, but you'll spend a lot more time writing the rest of your code than you will getting the parser right.
Your first example comes close to going to the other extreme. I don’t think it quite crosses the line, but if it were to start getting out of hand, I might opt for something like:
grammar DBParser;
statement: expr EOF;
expr
: case_expr # caseExpression
| ATOM # constantExpression
;
case_expr
: 'CASE' expr? ('WHEN' expr 'THEN' expr)+ ('ELSE' expr)? 'END'
;
I've also found that I sometimes go back and rework the grammar as I'm fleshing out the language and find I want a slightly different structure. For example, I found that, when doing code completion, it was sometimes helpful to have different parser rules for different contexts that just required an ID token. The code completion could then tell me that I could use an assignment_dest instead of an `ID`, and I could do much better completion; so that's a case where an extra level proved useful once I started using the grammar in practice.
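For illustration, here is a minimal sketch of that kind of extra level; the grammar, rule, and token names are invented for this example rather than taken from a real project:
grammar CompletionExample;
assignment      : assignment_dest '=' expr ';' ;
assignment_dest : ID ;            // only wraps ID, but gives code completion a distinct context to report
expr            : ID | NUMBER ;
ID              : [a-zA-Z_]+ ;
NUMBER          : [0-9]+ ;
WHITESPACE      : [ \t\r\n] -> skip ;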
To me, refining the grammar to give me the most useful ParseTree is the first priority.

Related

ANTLR: Why is this grammar rule for tuples not LL(1)?

I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2"
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
Also, based on the discussion I found here, I am further confused: Amend JSON-based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
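To make that concrete, the original rule (the one with the trailing (IDENT)?) could be expanded into plain BNF with helper non-terminals, something like the sketch below (the helper names are invented, and right recursion is used so the expansion stays suitable for a top-down parser):
tuple      : '(' IDENT ident_tail opt_ident ')' ;
ident_tail :                         /* expansion of (COMMA IDENT)* */
           | COMMA IDENT ident_tail
           ;
opt_ident  :                         /* expansion of (IDENT)? */
           | IDENT
           ;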
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.

Set a rule based on the value of a global variable

In my lexer and parser built with ocamllex and ocamlyacc, I have a .mly file as follows:
%{
open Params
open Syntax
%}
main:
| expr EOF { $1 }
expr:
| INTEGER { EE_integer $1 }
| LBRACKET expr_separators RBRACKET { EE_brackets (List.rev $2) }
expr_separators:
/* empty */ { [] }
| expr { [$1] }
| expr_separators ...... expr_separators { $3 :: $1 }
In params.ml, a variable separator is defined. Its value is either ; or , and is set by the upstream system.
In the .mly, I want the rule for expr_separators to be defined based on the value of Params.separator. For example, when Params.separator is ;, only [1;2;3] is considered an expr, whereas [1,2,3] is not. When Params.separator is ,, only [1,2,3] is considered an expr, whereas [1;2;3] is not.
Does anyone know how to amend the lexer and parser to realize this?
PS:
The value of Params.separator is set before parsing; it will not change during the parse.
At the moment, in the lexer, , returns a token COMMA and ; returns SEMICOLON. In the parser, there are other rules where COMMA or SEMICOLON are involved.
I just want to set a rule expr_separators such that it considers ; and ignores , (which may be parsed by other rules) when Params.separator is ;, and considers , and ignores ; (which may be parsed by other rules) when Params.separator is ,.
In some ways, this request is essentially the same as asking a macro preprocessor to alter its substitution at runtime, or a compiler to alter the type of a variable. As with the program itself, once the grammar has been compiled (whether into executable code or a parsing table), it's not possible to go back and modify it. At least, that's the case for most LR(k) parser generators, which produce deterministic parsers.
Moreover, it seems unlikely that the only difference the configuration parameter makes is the selection of a single separator token. If the non-selected separator token "may be parsed by other rules", then it may be parsed by those other rules when it is the selected separator token, unless the configuration setting also causes those other rules to be suppressed. So at a minimum, it seems like you'd be looking at something like:
expr : general_expr
expr_list : expr
%if separator is comma
expr : expr_using_semicolon
expr_list : expr_list ',' expr
%else
expr : expr_using_comma
expr_list : expr_list ';' expr
%endif
Without a more specific idea of what you're trying to achieve, the best suggestion I can provide is that you write two grammars and select which one to use at runtime, based on the configuration setting. Presumably the two grammars will be mostly similar, so you can probably use your own custom-written preprocessor to generate both of them from the same input text, which might look a bit like the above example. (You can use m4, which is a general-purpose macro processor, but you might feel the learning curve is too steep for such a simple application.)
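Concretely, in the ocamlyacc setting the two generated grammars might differ only in the expr_separators rule, along the lines of the sketch below; this assumes the lexer keeps emitting COMMA and SEMICOLON as in the question, and that the driver code picks which generated parser to invoke based on Params.separator:
/* expr_separators as it would appear in the semicolon grammar */
expr_separators:
  /* empty */                      { [] }
| expr                             { [$1] }
| expr_separators SEMICOLON expr   { $3 :: $1 }

/* expr_separators as it would appear in the comma grammar */
expr_separators:
  /* empty */                      { [] }
| expr                             { [$1] }
| expr_separators COMMA expr       { $3 :: $1 }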
Parser generators which produce general parsers have an easier time with run-time dynamic modification; many such generators have mechanisms for it (although not necessarily efficient ones). For example, the Bison tool can produce GLR parsers, in which case you can select or deselect specific rules using a predicate action. The OCaml GLR generator dypgen allows sets of rules to be dynamically added to the grammar during the parse. (I've never used dypgen, but I keep on meaning to try it; it looks interesting.) And there are many others.
Having played around with dynamic parsing features in some GLR parsers, I can only say that my personal experience has been a bit mixed. Modifying grammars at run-time is a brittle technique; grammars tend not to be easy to split into independent pieces, so modifying a grammar rule can have consequences in places you don't expect to be affected. You don't always know exactly what language your parser accepts, because the dynamic modifications can be hard to predict. And so on. My suggestion, if you try this technique, is to start with the simplest modification possible and put a lot more effort into grammar tests (which is always a good idea, anyway).

Shift/Reduce conflict in CUP

I'm trying to write a parser for a javascript-ish language with JFlex and Cup, but I'm having some issues with those deadly shift/reduce problems and reduce/reduce problems.
I have searched thoroughly and have found tons of examples, but I'm not able to extrapolate them to my grammar. My understanding so far is that these problems arise because the parser cannot decide which path to take, since it cannot distinguish between the alternatives with the lookahead it has.
My grammar is the following one:
start with INPUT;
INPUT::= PROGRAM;
PROGRAM::= FUNCTION NEWLINE PROGRAM
| NEWLINE PROGRAM;
FUNCTION ::= function OPTIONAL id p_izq ARG p_der NEWLINE l_izq NEWLINE BODY l_der;
OPTIONAL ::=
| TYPE;
TYPE::= integer
| boolean;
ARG ::=
| TYPE id MORE_ARGS;
MORE_ARGS ::=
| colon TYPE id MORE_ARGS;
NEWLINE ::= salto NEWLINE
| ;
BODY ::= ;
I'm getting several conflicts but these 2 are a mere example:
Warning : *** Shift/Reduce conflict found in state #5
between NEWLINE ::= (*)
and NEWLINE ::= (*) salto NEWLINE
under symbol salto
Resolved in favor of shifting.
Warning : *** Shift/Reduce conflict found in state #0
between NEWLINE ::= (*)
and FUNCTION ::= (*) function OPTIONAL id p_izq ARG p_der NEWLINE l_izq NEWLINE BODY l_der
under symbol function
Resolved in favor of shifting.
PS: The grammar is far more complex, but I think that if I see how these shift/reduce problems are solved, I'll be able to fix the rest.
Thanks for your answers.
PROGRAM is useless (in the technical sense). That is, it cannot produce any sentence because in
PROGRAM::= FUNCTION NEWLINE PROGRAM
| NEWLINE PROGRAM;
both productions for PROGRAM are recursive. For a non-terminal to be useful, it needs to be able to eventually produce some string of terminals, and for it to do that, it must have at least one non-recursive production; otherwise, the recursion can never terminate. I'm surprised CUP didn't mention this to you. Or maybe it did, and you chose to ignore the warning.
That's a problem -- useless non-terminals really cannot ever match anything, so they will eventually result in a parse error -- but it's not the parsing conflict you're reporting. The conflicts come from another feature of the same production, which is related to the fact that you can't divide by 0.
The thing about nothing is that any number of nothings are still nothing. So if you had plenty of nothings and someone came along and asked you exactly how many nothings you had, you'd have a bit of a problem, because to get "plenty" back from "0 * plenty", you'd have to compute "0 / 0", and that's not a well-defined value. (If you had plenty of twos, and someone asked you how many twos you had, that wouldn't be a problem: suppose the plenty of twos worked out to 40; you could easily compute that 40 / 2 = 20, which works out perfectly because 20 * 2 = 40.)
So here we don't have arithmetic, we have strings of symbols. And unfortunately the string containing no symbols is really invisible, like 0 was for all those millennia until some Arabian mathematician noticed the value of being able to write nothing.
Where this is all coming around to is that you have
PROGRAM::= ... something ...
| NEWLINE PROGRAM;
But NEWLINE is allowed to produce nothing.
NEWLINE ::= salto NEWLINE
| ;
So the second recursive production of PROGRAM might add nothing. And it might add nothing plenty of times, because the production is recursive. But the parser needs to be deterministic: it needs to know exactly how many nothings are present, so that it can reduce each nothing to a NEWLINE non-terminal and then reduce a new PROGRAM non-terminal. And it really doesn't know how many nothings to add.
In short, optional nothings and repeated nothings are ambiguous. If you are going to insert a nothing into your language, you need to make sure that there are a fixed finite number of nothings, because that's the only way the parser can unambiguously dissect nothingness.
Now, the only point of that particular recursion (as far as I can see) is to allow empty newline-terminated statements (which are visible because of the newline, but do nothing). You could do that by changing the recursion to avoid nothing:
PROGRAM ::= ...
| salto PROGRAM;
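Combining that change with a non-recursive base case (the missing piece pointed out at the top of this answer), a conflict-free sketch of this fragment might look like the following; note that it requires at least one salto after each function and allows blank lines anywhere:
PROGRAM ::=                         /* empty: the base case that lets PROGRAM terminate */
          | FUNCTION salto PROGRAM  /* a function followed by at least one newline */
          | salto PROGRAM;          /* a blank line */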
Although it's not relevant to your current problem, I feel obliged to mention that CUP is an LALR parser generator, and all that stuff you might have learned or read on the internet about recursive descent parsers not being able to handle left recursion does not apply. (I deleted a rant about the way parsing techniques are taught, so you'll have to try to recover it from the hints I've left behind.) Bottom-up parser generators like CUP and yacc/bison love left recursion. They can handle right recursion, too, of course, but reluctantly, because they need to waste stack space for every recursion other than left recursion. So there is no need to warp your grammar in order to deal with the deficiency; just write the grammar naturally and be happy. (So you rarely if ever need non-terminals representing "the rest of" something.)
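For example, a list of statements would normally be written with the recursion on the left and no "rest of the list" non-terminal in sight; STMT here is a stand-in for whatever a statement looks like in your language:
STMT_LIST ::= STMT                /* left recursion: the parser reduces as it goes, */
            | STMT_LIST STMT;     /* so the stack stays shallow however long the list is */
The right-recursive form (STMT_LIST ::= STMT | STMT STMT_LIST;) accepts the same language, but every STMT sits on the stack until the whole list has been read.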
PD: Plenty of nothing is a culturally-specific reference to a song from the 1935 opera Porgy and Bess.

Faulty bison reduction using %glr-parser and %merge rules

Currently I'm trying to build a parser for VHDL, which has some of the problems C++ parsers have to face. The context-free grammar of VHDL produces a parse forest rather than a single parse tree because of its ambiguity regarding function calls and array indexing:
foo := fun(3) + arr(5);
This assignment can only be parsed unambiguously if the parser carried around a hierarchical, type-aware symbol table which it would use to resolve the ambiguities somewhat on the fly. I don't want to do this, because for statements like the one above, the parse forest would not grow exponentially, but rather linearly with the number of function calls and array subscripts.
(Unless, of course, one were to torture the parser with statements like)
foo := fun(fun(fun(fun(fun(4)))));
Since Bison forces the user to create just one single parse tree, I used %merge attributes to collect all subtrees recursively and added those subtrees under so-called AMBIG nodes in the singleton AST.
The result looks like this.
In order to produce the above, I parsed the token stream "I=I(N);". The substance of the grammar I used inside the parse.y file is collected below. It tries to resemble the ambiguous parts of VHDL:
start: prog
;
/* I cut out every semantic action to make this
text more readable */
prog: assignment ';'
| prog assignment ';'
;
assignment: 'I' '=' expression
;
expression: function_call %merge <stmtmerge2>
| array_indexing %merge <stmtmerge2>
| 'N'
;
function_call: 'I' '(' expression ')'
| 'I'
;
array_indexing: function_call '(' expression ')' %merge <stmtmerge>
| 'I' '(' expression ')' %merge <stmtmerge>
;
The whole sourcecode can be read at this github repository.
And now, let's get down to the actual problem.
As you can see in the generated parse tree above, the nodes FCALL1 and ARRIDX1 refer to the same single node EXPR1, which in turn refers to N1 twice. This, by all means, should not have happened, and I don't know why. Instead there should be the paths
FCALL1 -> EXPR2 -> N2
ARRIDX1 -> EXPR1 -> N1
Do you have any idea why Bison reuses the aforementioned nodes?
I also wrote a bug report on the official GNU mailing list for Bison, though it has had no reply to this point. Unfortunately, due to the restrictions for new Stack Overflow users, I can't provide a link to this bug report...
That behaviour is expected.
expression can be unambiguously reduced, and that reduced value is used by both possible ambiguous reductions which include the value. Remember that GLR, like LR, is a left-to-right parser. When a reduction action is executed, all of the child reductions have already happened. The effect is not different from the use of a terminal in a right-hand side; the terminal will not be artificially copied in order to produce different instances in the ambiguous productions which use it.
For most people, this would be a feature rather than a bug, and I don't mean that as a joke. Without the graph-structured stack, GLR has exponential run-time. If you really want to do a deep copy of shared AST nodes when you merge parse trees, you will have to do it yourself, but I suggest that you find a way to make use of the fact that the parse forest is really a directed acyclic graph rather than a tree; you will probably be able to take advantage of the lack of duplication.

SQL Parser Disambiguation

I have written a very simple SQL parser for a very small subset of the language to handle a specific one-time problem. I had to translate an extremely large number of old SQL expressions into an intermediate form that could then possibly be brought into a business rule system. The initial attempt worked for about 80% of the existing data.
I looked at some commercial solutions but thought I could do this pretty easily based on some past experience and some reading. I hit a problem and decided to go and finish the task with a commercial solution; I know when to admit defeat. However, I am still curious as to how to handle this, and what I may have done wrong.
My initial solution was based on a simple recursive descent parser, found in many books and online articles, producing an abstract syntax tree; then, during the analysis phase, I would determine type differences and whether logical expressions were being mixed with algebraic expressions, and so on.
I referenced the ANTLR SQLite grammar by Bart Kiers
https://github.com/bkiers/sqlite-parser
I also referenced an online SQL grammar site
http://savage.net.au/SQL/
The main question is how to make the parser differentiate between the following:
expr AND expr
BETWEEN expr AND expr
The problem I am encountering appears when I hit the following unit test case:
case when PP_ID between '009000' and '009999' then 'MA' when PP_ID between '001000' and '001999' then 'TL' else 'LA' end
The '009000' and '009999' is matched as a Binary Expression, so the parser throws an error: it expects the keyword AND but instead encounters THEN.
The online ANSI grammar actually breaks down expressions into finer-grained productions, and I suspect that is the proper approach. I am also wondering whether my parser should detect that an expression is Boolean rather than algebraic during the parse phase, not the semantic phase, and use that information to handle the above case.
I am sure I could brute force the solution but I want to learn the correct way to handle this.
Thanks for any help offered.
I also met this problem while developing a Jison (Bison) parser for SQLite, and solved it with two different rules in the grammar for binary operations: one for AND and one for BETWEEN (this is a Jison grammar):
%left BETWEEN // Here I defined that AND has higher priority over BETWEEN
%left AND //
expr
: expr AND expr // Rule for AND
{ $$ = {op: 'AND', left: $1, right: $3}; }
;
expr
: expr BETWEEN expr // Rule for BETWEEN
{
if($3.op != 'AND') throw new Error('Wrong syntax of BETWEEN AND');
$$ = {op: 'BETWEEN', expr: $1, left:$3.left, right:$3.right};
}
;
and then the parser checks the right-hand expression and passes only expressions with AND operations. Maybe this approach can help you.
For an ANTLR grammar, I found the following rule (see this grammar made by Bart Kiers):
expr
:
| expr K_AND expr
| expr K_NOT? K_BETWEEN expr K_AND expr
;
But I am not sure that it works properly.
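For comparison, the "finer grained productions" approach mentioned in the question, which the ANSI grammar takes, puts AND at a higher level than BETWEEN and draws BETWEEN's operands from a level where AND cannot appear, so the keyword AND inside a BETWEEN can never be mistaken for the boolean operator. Here is a rough sketch; the rule and token names are illustrative, not taken from the standard:
search_condition : boolean_term ( OR boolean_term )* ;
boolean_term     : predicate ( AND predicate )* ;
predicate        : value_expr BETWEEN value_expr AND value_expr
                 | value_expr COMPARISON_OP value_expr
                 | value_expr
                 ;
value_expr       : IDENTIFIER | STRING | NUMBER ;   // no AND can appear at this level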

Resources