Detect wrong grammar (error alternatives) for quick fixes - parsing

In my ANTLR grammar I would like to detect wrong keywords or wrong typed constants, e.g. 'null' instead of 'NULL'. I added error alternatives in the grammar, e.g.:
| extra='null' #error1
If the wrong constant is detected in my custom editor I can fix it by replacing it with the correct constant.
But I don't know if this is the correct way to address and detect wrong keywords or constants in a grammar.
In addition I tried to detect missing closing in the grammar (see ANTLR book chapter 9.4):
.......
| 'if' '(' expr expr #error2
| '{' exprlist #error3
| 'while' '(' expr expr #error4
........
But this massively slows down the parsing process and so I think that it is wrong to do so.
My questions are:
Is it correct to detect wrong keywords, constants, etc. in that way as described above?
Is it somehow possible to catch missing closing in the grammar without the massive speed decrease?
Any help is appreciated.

I found out that I can extract the second Token from the rule (to add a brace at that position):
| 'if' '(' expr expr #error2
with:
Token myToken = tokens.get(ctx.getChild(2).getSourceInterval().b);
parser.notifyErrorListeners(........
However it massively decreases the speed of the parser.

Related

ANTLR: Why is this grammar rule for a tuples not LL(1)?

I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
Also based on the discussion I found here I am further confused: Amend JSON - based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.

Interpretation variants of binary operators

I'm writing a grammar for a language that contains some binary operators that can also be used as unary operator (argument to the right side of the operator) and for a better error recovery I'd like them to be usable as nular operators as well).
My simplified grammar looks like this:
start:
code EOF
;
code:
(binaryExpression SEMICOLON?)*
;
binaryExpression:
binaryExpression BINARY_OPERATOR binaryExpression //TODO: check before primaryExpression
| primaryExpression
;
primaryExpression:
unaryExpression
| nularExpression
;
unaryExpression:
operator primaryExpression
| BINARY_OPERATOR primaryExpression
;
nularExpression:
operator
| BINARY_OPERATOR
| NUMBER
| STRING
;
operator:
ID
;
BINARY_OPERATOR is just a set of defined keywords that are fed into the parser.
My problem is that Antlr prefers to use BINARY_OPERATORs as unary expressions (or nualr ones if there is no other choice) instead of trying to use them in a binary expression as I need it to be done.
For example consider the following intput: for varDec from one to twelve do something where from, to and do are binary operators the output of the parser is the following:
As you can see it interprets all binary operators as unary ones.
What I'm trying to achieve is the following: Try to match each BINARY_OPERATOR in a binary expression and only if that is not possible try to match them as a unary expression and if that isn't possible as well then it might be considered a nular expression (which can only be the case if the BINARY_OPERATORis the only content of an expression).
Has anyone an idea about how to achieve the desired behaviour?
Fairly standard approach is to use a single recursive rule to establish the acceptable expression syntax. ANTLR is default left associative, so op expr meets the stated unary op requirement of "argument to the right side of the operator". See, pg 70 of TDAR for a further discussion of associativity.
Ex1: -y+x -> binaryOp{unaryOp{-, literal}, +, literal}
Ex2: -y+-x -> binaryOp{unaryOp{-, literal}, +, unaryOp{-, literal}}
expr
: LPAREN expr RPAREN
| expr op expr #binaryOp
//| op expr #unaryOp // standard formulation
| op literal #unaryOp // limited formulation
| op #errorOp
| literal
;
op : .... ;
literal
: KEYWORD
| ID
| NUMBER
| STRING
;
You allow operators to act like operands ("nularExpression") and operands to act like operators ("operator: ID"). Between those two curious decisions, your grammar is 100% ambiguous, and there is never any need for a binary operator to be parsed. I don't know much about Antlr, but it surprises me that it doesn't warn you that your grammar is completely ambiguous.
Antlr has mechanisms to handle and recover from errors. You would be much better off using them than writing a deliberately ambiguous grammar which makes erroneous constructs part of the accepted grammar. (As I said, I'm not an Antlr expert, but there are Antlr experts who pass by here pretty regularly; if you ask a specific question about error recovery, I'm sure you'll get a good answer. You might also want to search this site for questions and answers about Antlr error recovery.)
I think what I'm going to write down now is what #GRosenberg meant with his answer. However as it took me a while to fully understand it I will provide a concrete solution for my problem in case someone else is stumbling across this question and is searching or an answer:
The trick was to remove the option to use a BINARY_OPERATOR inside the unaryExpression rule because this always got preferred. Instead what I really wanted was to specify that if there was no left side argument it should be okay to use a BINARY_OPERATOR in a unary way. And that's the way I had to specify it:
binaryExpression:
binaryExpression BINARY_OPERATOR binaryExpression
| BINARY_OPERATOR primaryExpression
| primaryExpression
;
That way this syntax only becomes possible if there is nothing to the left side of a BINARY_OPERATOR and in every other case the binary syntax has to be used.

yacc shift-reduce for ambiguous lambda syntax

I'm writing a grammar for a toy language in Yacc (the one packaged with Go) and I have an expected shift-reduce conflict due to the following pseudo-issue. I have to distilled the problem grammar down to the following.
start:
stmt_list
expr:
INT | IDENT | lambda | '(' expr ')' { $$ = $2 }
lambda:
'(' params ')' '{' stmt_list '}'
params:
expr | params ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
A lambda function looks something like this:
map((v) { v * 2 }, collection)
My parser emits:
conflicts: 1 shift/reduce
Given the input:
(a)
It correctly parses an expr by the '(' expr ')' rule. However given an input of:
(a) { a }
(Which would be a lambda for the identity function, returning its input). I get:
syntax error: unexpected '{'
This is because when (a) is read, the parser is choosing to reduce it as '(' expr ')', rather than consider it to be '(' params ')'. Given this conflict is a shift-reduce and not a reduce-reduce, I'm assuming this is solvable. I just don't know how to structure the grammar to support this syntax.
EDIT | It's ugly, but I'm considering defining a token so that the lexer can recognize the ')' '{' sequence and send it through as a single token to resolve this.
EDIT 2 | Actually, better still, I'll make lambdas require syntax like ->(a, b) { a * b} in the grammar, but have the lexer emit the -> rather than it being in the actual source code.
Your analysis is indeed correct; although the grammar is not ambiguous, it is impossible for the parser to decide with the input reduced to ( <expr> and with lookahead ) whether or not the expr should be reduced to params before shifting the ) or whether the ) should be shifted as part of a lambda. If the next token were visible, the decision could be made, so the grammar LR(2), which is outside of the competence of go/yacc.
If you were using bison, you could easily solve this problem by requesting a GLR parser, but I don't believe that go/yacc provides that feature.
There is an LR(1) grammar for the language (there is always an LR(1) grammar corresponding to any LR(k) grammar for any value of k) but it is rather annoying to write by hand. The essential idea of the LR(k) to LR(1) transformation is to shift the reduction decisions k-1 tokens forward by accumulating k-1 tokens of context into each production. So in the case that k is 2, each production P: N → α will be replaced with productions TNU → Tα U for each T in FIRST(α) and each U in FOLLOW(N). [See Note 1] That leads to a considerable blow-up of non-terminals in any non-trivial grammar.
Rather than pursuing that idea, let me propose two much simpler solutions, both of which you seem to be quite close to.
First, in the grammar you present, the issue really is simply the need for a two-token lookahead when the two tokens are ){. That could easily be detected in the lexer, and leads to a solution which is still hacky but a simpler hack: Return ){ as a single token. You need to deal with intervening whitespace, etc., but it doesn't require retaining any context in the lexer. This has the added bonus that you don't need to define params as a list of exprs; they can just be a list of IDENT (if that's relevant; a comment suggests that it isn't).
The alternative, which I think is a bit cleaner, is to extend the solution you already seem to be proposing: accept a little too much and reject the errors in a semantic action. In this case, you might do something like:
start:
stmt_list
expr:
INT
| IDENT
| lambda
| '(' expr_list ')'
{ // If $2 has more than one expr, report error
$$ = $2
}
lambda:
'(' expr_list ')' '{' stmt_list '}'
{ // If anything in expr_list is not a valid param, report error
$$ = make_lambda($2, $4)
}
expr_list:
expr | expr_list ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
Notes
That's only an outline; the complete algorithm includes the mechanism to recover the original parse tree. If k is greater than 2 then T and U are strings the the FIRSTk-1 and FOLLOWk-1 sets.
If it really is a shift-reduce conflict, and you want only the shift behavior, your parser generator may give you a way to prefer a shift vs. a reduce. This is classically how the conflict for grammar rules for "if-then-stmt" and "if-then-stmt-else-stmt" is resolved, when the if statement can also be a statement.
See http://www.gnu.org/software/bison/manual/html_node/Shift_002fReduce.html
You can get this effect two ways:
a) Count on the accidental behavior of the parsing engine.
If an LALR parser handles shifts first, and then reductions if there are no shifts, then you'll get this "prefer shift" for free. All the parser generator has to do is built the parse tables anyway, even if there is a detected conflict.
b) Enforce the accidental behavior. Design (or a get a) parser generator to accept "prefer shift on token T". Then one can supress the ambiguity. One still have to implement the parsing engine as in a) but that's pretty easy.
I think this is easier/cleaner than abusing the lexer to make strange tokens (and that doesn't always work anyway).
Obviously, you could make a preference for reductions to turn it the other way. With some extra hacking, you could make shift-vs-reduce specific the state in which the conflict occured; you can even make it specific to the pair of conflicting rules but now the parsing engine needs to keep preference data around for nonterminals. That still isn't hard. Finally, you could add a predicate for each nonterminal which is called when a shift-reduce conflict is about to occur, and it have it provide a decision.
The point is you don't have to accept "pure" LALR parsing; you can bend it easily in a variety of ways, if you are willing to modify the parser generator/engine a little bit. This gives a really good reason to understand how these tools work; then you can abuse them to your benefit.

SQL Parser Disambiguation

I have written a very simple SQL Parser for a very small subset of the language to handle a one time specific problem. I had to translate an extremely large amount of old SQL expressions into an intermediate form that could then possibly be brought into a business rule system. The initial attempt worked for about 80% of the existing data.
I looked at some commercial solutions but thought I could do this pretty easy based on some past experience and some reading. I hit a problem and decided to go and finish the task with a commercial solution, I know when to admit defeat. However I am still curious as to how to handle this or what I may have done wrong.
My initial solution was based on a simple recursive descent parser, found in many books and online articles, producing an Abstract Syntax Tree and then during the analysis phase, I would determine type differences and whether logical expressions were being mixed with algebraic expressions and such.
I referenced the ANTLR SQL Lite grammar by Bark Kiers
https://github.com/bkiers/sqlite-parser
I also referenced an online SQL grammar site
http://savage.net.au/SQL/
The main question is how to make the parser differentiate between the following
expr AND expr
BETWEEN expr AND expr
The problem I am encountering is when I hit the following unit test case
case when PP_ID between '009000' and '009999' then 'MA' when PP_ID between '001000' and '001999' then 'TL' else 'LA' end
The '009000' and '009999' is matched as a Binary Expression so the parser throws an error expecting the keyword AND but instead encounters THEN.
The online ANSI grammar actually breaks down expressions into finer grained productions and I suspect that is the proper approach. I am also wondering if my parser should detect if an expression is actually Boolean vs. Algebraic during the parse phase and not the semantic phase, and use that information to handle the above case.
I am sure I could brute force the solution but I want to learn the correct way to handle this.
Thanks for any help offered.
I also met with this problem while developed Jison (Bison) parser for SQLite, and solved it with who different rules in grammar for binary operations: one for AND and one for BETWEEN (this is a Jison grammar):
%left BETWEEN // Here I defined that AND has higher priority over BETWEEN
%left AND //
: expr AND expr // Rule for AND
{ $$ = {op: 'AND', left: $1, right: $3}; }
;
: expr BETWEEN expr // Rule for BETWEEN
{
if($3.op != 'AND') throw new Error('Wrong syntax of BETWEEN AND');
$$ = {op: 'BETWEEN', expr: $1, left:$3.left, right:$3.right};
}
;
and then parser checks right expression, and pass only expressions with AND operations. May be this approach can help you.
For ANTLR grammar I found the following rule (see this grammar made by Bart Kiers)
expr
:
| expr K_AND expr
| expr K_NOT? K_BETWEEN expr K_AND expr
;
But I am not sure, that it works in proper way.

Incorrect Parsing of simple arithmetic grammar in ANTLR

I recently started studying ANTLR. Below is the grammar for the arithmetic expression.
The problem is that when I am putting (calling) expression rule in the term rule then it is parsing incorrectly even for (9+8). It is somehow ignoring the right parenthesis.
While when I put add rule instead of calling expression rule from the rule term, it is working fine.
As in:
term:
INTEGER
| '(' add ')'
;
Can anyone tell me why it is happening because more or les they both are the same.
Grammer for which it is giving incorrect results
term
:
INTEGER
| '(' expression ')'
;
mult
:
term ('*' term)*
;
add
:
mult ('+' mult)*
;
expression
:
add
;
When I parse "(8+9)" with a parser generated from your grammar, starting with the expression rule, I get the following parse tree:
In other words: it works just fine.
Perhaps you're using ANTLRWorks' (or ANTLR IDE's) interpreter to test your grammar? In thta case: don't use the interpreter, it's buggy. Use ANTLRWorks' debugger instead (the image is exported from ANTLRWorks' debugger).

Resources