I'm working with Antlr4 to parse a boolean-like DSL.
Here is my grammar:
grammar filter;
filter: overall EOF;
overall
: LPAREN overall RPAREN
| category
;
category
: expression # InferenceCategory
| category AND category # CategoryAndBlock
| label COLON expression # CategoryBlock
| LPAREN category RPAREN # NestedCategory
;
expression
: NOT expression # NotExpr
| expression AND expression # AndExpr
| expression OR expression # OrExpr
| atom # AtomExpr
| LPAREN expression RPAREN # NestedExpression
;
label
: ALPHANUM
;
atom
: ALPHANUM
;
Here is an example input string to parse:
(cat1:(1 OR 2) AND cat2:( 4 ))
This grammar works fine with this input; it produces the following parse tree which perfectly suits my needs:
However, there is a weird case in the DSL, where the "cat1" label is implicit when no other category is specified. This is what the InferenceCategory tag catches: such an expression will be handled as a category in my code later.
For example, with
((1 OR 2) AND cat2:( 4 ))
I get (as expected):
However, in the following instance:
cat2:( 4 ) AND (1 OR 2)
I get:
Notice that the second block is not identified as an InferenceCategory but instead as a normal expression, under the first category. This is because the grammar parses ( 4 ) following cat2: as a normal expression, and everything past that is parsed as a normal expression too.
Is there any way to fix this? I've tried:
label COLON expression (AND category)* # CategoryBlock
(which doesn't work)
and
category AND category AND category
(which "works", but is extremely hacky and only works in the specific case that I have exactly three categories. Any more, and it breaks again.)
The "alternative labels" like NOT expression # NotExpr do not make a difference in your parse tree. They are semantic-only. They will cause the code generation process to create specific signatures that you can override in your Visitor or Listener.
The rationale behind this is, for example, instead of getting just one Visitor override for expression, you'll get several, one for each alternative label. That way, you don't have to examine expression and determine what type it is before acting on it. Instead, you'll get an override for # OrExpr for example, and as soon as you're in that override code, you know you're dealing with an OR, with an expression on each side of the OR token.
The parse tree is useful, but much of the semantics only become apparent when you code up your Listener or Visitor.
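To make the per-label dispatch concrete, here is a minimal sketch in plain Python. It does not use the ANTLR runtime: the classes AtomExpr, AndExpr and OrExpr and the visit_* naming are hypothetical stand-ins for the per-label context classes and visitor methods a generator would produce.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for per-label context classes
# (# AtomExpr, # AndExpr, # OrExpr); not generated by ANTLR.
@dataclass
class AtomExpr:
    text: str

@dataclass
class AndExpr:
    left: object
    right: object

@dataclass
class OrExpr:
    left: object
    right: object

class Visitor:
    def visit(self, node):
        # Dispatch on the node's class name, e.g. OrExpr -> visit_OrExpr.
        return getattr(self, "visit_" + type(node).__name__)(node)

class ToPrefix(Visitor):
    # One override per label: no need to inspect the node to learn its kind.
    def visit_AtomExpr(self, node):
        return node.text

    def visit_AndExpr(self, node):
        return f"(and {self.visit(node.left)} {self.visit(node.right)})"

    def visit_OrExpr(self, node):
        return f"(or {self.visit(node.left)} {self.visit(node.right)})"

tree = AndExpr(OrExpr(AtomExpr("1"), AtomExpr("2")), AtomExpr("4"))
print(ToPrefix().visit(tree))
```

Inside visit_OrExpr the code already knows it is handling an OR with an expression on each side, which is the point of labelling the alternatives.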
I'm working on a language that is meant to read much like English, and having issues with the grammar for if statements. In case you are curious, the language is inspired by HyperTalk, so I'm trying to make sure I match all the valid constructs in that language. The sample input I'm using that demonstrates all the possible if constructs can be viewed here. There are a lot, so I didn't want to inline the code.
I've removed most other constructs from the grammar to make it a bit easier to read, but basically statements look like this:
start
: statementList
;
statementList
: '\n'
| statement '\n'
| statementList '\n'
| statementList statement '\n'
;
statement
: ID
| ifStatement
;
The shift/reduce conflicts I'm seeing are in the ifStatement rules:
ifStatement
: ifCondition THEN statement
| ifCondition THEN statement ELSE statement
| ifCondition THEN statement ELSE '\n' statementList END IF
| ifCondition THEN '\n' statementList END IF
| ifCondition THEN '\n' END IF
| ifCondition THEN '\n' ELSE statement
| ifCondition THEN '\n' ELSE '\n' statementList END IF
| ifCondition THEN '\n' statementList ELSE statement
| ifCondition THEN '\n' statementList ELSE '\n' statementList END IF
// The following rules cause issues, but should be legal:
| ifCondition THEN statement newlines ELSE statement
| ifCondition THEN statement newlines ELSE '\n' statementList END IF
;
ifCondition
: IF expression
| IF expression '\n'
;
expression
: TRUE
| FALSE
;
newlines
: '\n'
| newlines '\n'
;
The problem is that I need to support this construct:
if true then statement # <- Any number of newlines
else statement
The problem (as I understand it) is that there isn't enough context to correctly determine whether to shift the else, or reduce just the if true then statement part without knowing what comes later (the end of the statement list, or another statement). Is this even parseable?
I have gists for the parser, scanner, and sample input to try out.
Getting this right is surprisingly difficult, so I've tried to annotate the steps. There are a lot of annoying details.
At its core, this is just a manifestation of the dangling else ambiguity, whose resolution is pretty well-known (force the parser to always shift the else). The solution below instead removes the ambiguity from the grammar itself, so the resulting grammar is unambiguous.
The basic principle that I've used here is the one outlined several decades ago in Principles of Compiler Design by Alfred Aho and Jeffrey Ullman (the so-called "Dragon book", which I mention since its authors were recently granted the Turing award precisely for that and their other influential works). In particular, I use the terms "matched" and "unmatched" (rather than "open" and "closed", which are also popular) because that's the way I learned it.
It is also possible to solve this grammar problem using precedence declarations; indeed, that often turns out to be much simpler. But in this particular case, it's not easy to work with operator precedence because the relevant token (the else) can be preceded by an arbitrary number of newline tokens. I'm pretty sure you could still construct a precedence-based solution, but there are advantages to using an unambiguous grammar, including the ease of porting to a parser generator which doesn't use the same precedence algorithm, and the fact that it is possible to analyze mechanically.
The basic outline of the solution is to divide all statements into two categories:
"matched" (or "closed") statements, which are complete in the sense that it is not possible to extend the statement with an else clause. (In other words, every if…then is matched by a corresponding else.) These
"unmatched" (or "open") statements, which could have been extended with an else clause. (In other words, at least one if…then clause is not matched by an else.) Since the unmatched statement is a complete statement, it cannot be immediately followed by an else token; had an else token appeared, it would have served to extend the statement.
Once we manage to construct grammars for these two categories of statement, it's only necessary to figure out which uses of statement in the ambiguous grammar can be followed by else. In all of these contexts, the non-terminal statement must be replaced with the non-terminal matched-statement, because only matched statements can be followed by else without interacting with it. In other contexts, where else could not be the next token, either category of statement is valid.
So the essential grammar style is (taken from the Dragon book):
stmt → matched_stmt
| unmatched_stmt
matched_stmt → "if" expr "then" matched_stmt "else" matched_stmt
| other_stmt
unmatched_stmt → "if" expr "then" matched_stmt "else" unmatched_stmt
| "if" expr "then" stmt
other_stmt is anything other than a conditional statement. Or, to be more precise, anything other than a compound statement which precisely ends with a stmt.
In Hypertalk, as far as I know, if statements are the only compound statements which can end with a statement. Other compound statements are precisely terminated with an end X, which effectively closes the statement. But in other languages, such as C, there are a variety of compound statements, and most of these need to be divided into "matched" and "unmatched" depending precisely on whether their terminating substatement is (recursively) matched or unmatched.
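The matched/unmatched distinction can also be stated as a small predicate over a syntax tree. This is only a sketch under an assumed representation: a statement is a tuple, with ("if", cond, then_branch, else_branch) for conditionals (else_branch is None when absent) and anything else standing in for other_stmt.

```python
def is_matched(stmt):
    """Return True if stmt cannot be extended with an else clause."""
    if stmt[0] != "if":
        return True  # other_stmt is always matched
    _, _cond, _then_branch, else_branch = stmt
    if else_branch is None:
        return False  # "if expr then stmt" could still take an else
    # In a tree derived from the grammar above, the then-branch of an
    # if...else is always matched, so only the else-branch decides.
    return is_matched(else_branch)

simple = ("do", "x")
print(is_matched(("if", "c", simple, simple)))
print(is_matched(("if", "c", simple, ("if", "d", simple, None))))
```

The second call reports an unmatched statement because its trailing if has no else, which is exactly the case that must not appear directly before an else token.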
One thing I want to note here, which is apparent from that outline grammar if you look at it a bit sideways, is that the if…then…else part of the if statement is grammatically similar to a bracketed prefix operator. That is, both matched_stmt and unmatched_stmt are similar to the right-recursive rule for unary minus:
unary → '-' unary
| atom
which in turn could be written in an Extended BNF dialect which allows Kleene stars as
unary → ('-')* atom
If we were to do that transformation to Aho&Ullman's grammar, we'd end up with:
if_then_else → "if" expr "then" matched_stmt "else"
matched_stmt → (if_then_else)* other_stmt
unmatched_stmt → (if_then_else)* "if" expr "then" stmt
That makes it reasonably clear how to implement this grammar with a top-down recursive-descent parser. (A bit of left-factoring is needed, but it still ends up being similar to the unary minus grammar.) I'm not planning on developing this thought further in this answer, but I think that the EBNF conversion helps guide the intuitions about how this grammar actually works to undangle the else.
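For comparison, here is a rough recursive-descent sketch in Python. It does not encode the matched/unmatched split explicitly; it uses the simpler "always take the else when you see one" rule, which yields the same nearest-if binding. The token format (a pre-split list of words, single-token conditions, any other word being a simple statement) is an assumption made for illustration.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] in ("then", "else"):
        raise SyntaxError(f"unexpected {tokens[pos]!r}")
    if tokens[pos] != "if":
        return ("do", tokens[pos]), pos + 1  # simple statement
    if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
        raise SyntaxError("expected 'if <expr> then'")
    cond = tokens[pos + 1]
    then_branch, pos = parse_stmt(tokens, pos + 3)
    else_branch = None
    # Greedy step: an else directly after the then-branch belongs to this if.
    # Because inner ifs get the first chance, it binds to the nearest one.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
    return ("if", cond, then_branch, else_branch), pos

tree, end = parse_stmt("if a then if b then x else y".split())
print(tree)
```

Here the else attaches to the inner if b, leaving the outer if a without an else clause.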
It was also really helpful in figuring out how to deal with newlines. The key insight (for me) was that statements must end with a newline. The one exception is the condensed single-line version of the if command. But that exception only happens just before an else token (and only when the then which it matches is on the same line). In this grammar, that case is implemented with the inner-matched non-terminal, assisted by the fact that one-line statements (like do-statement) lack the terminating newline. The newline which terminates one-line statements is added in the recursive base case for matched (single-statement NL); that's the only place it needs to be handled. Multi-line compound statements are all defined with a terminating newline (see, for example, repeat-statement).
Most of the rest of the complications deal with the variety of syntactic forms. The only one which is really interesting is the handling of blocks after a then token at the end of a line. That block can be terminated in two ways:
with an end if line, without an else clause. This is treated as a "matched" case, since it clearly could not be extended with an else clause.
with an else clause (which could be a single line else or a block else, where the else token is at the end of the line). But here there is a possible ambiguity; if the last statement in the block is an unmatched if, then an else line should extend that statement, rather than terminating the block. That's not really different from the rest of the matched/unmatched logic; to implement it, I created two different block non-terminals, one ending with a matched statement and the other ending with an unmatched statement. And then, as usual, only the matched block can be used before an else.
(I found the new counterexample generator in bison 3.7.6 extremely helpful here; my initial attempt just used block because I'd failed to notice the ambiguity. But it is a real ambiguity, and it led to a shift-reduce conflict whose origins seemed mysterious. Once I saw the counterexample produced by the counterexample generator -- which showed the conflict happening inside a block following an if-then -- the problem became a lot more evident.)
The alternation between matched-block and unmatched-block is a simple example of the correspondence between grammar productions and state machines. The two non-terminals represent the two states in a very simple state machine, whose state records a single bit: whether or not the last statement was matched. The non-terminals must be right-recursive for this to work, which is a deviation from the usual "prefer left-recursion" heuristic for building LALR(1) grammars.
OK, with that overlong preamble, here's the grammar. In the interests of compactification, I simplified expressions down to just variables and boolean constants, included only one simple statement (do expr) and included only one other compound statement (repeat until expr / block / end repeat). (The last one is there as a placeholder.)
program : block
block : %empty
| matched-block
| unmatched-block
NL : '\n'
| NL '\n'
matched-block
: block matched
unmatched-block
: block unmatched
simple-statement
: "do" expression
repeat-statement
: "repeat" "until" expression NL block "end" "repeat" NL
matched : if-then matched else-matched
| if-then inner-matched else-matched
| if-then NL matched-block else-matched
| if-then NL else-matched
| if-then NL block "end" "if" NL
| repeat-statement
| simple-statement NL
inner-matched
: %empty
| simple-statement
| if-then inner-matched "else" inner-matched
unmatched
: if-then matched
| if-then unmatched
| if-then inner-matched "else" unmatched
| if-then matched "else" unmatched
if-then : "if" expression NL "then"
| "if" expression "then"
else-matched
: "else" NL block "end" "if" NL
| "else" matched
expression
: ID
| "true"
| "false"
Previous answer (to original question, only visible in the edit history)
There is an obvious ambiguity between
ifCondition THEN statement EOL ELSE statement
and
ifCondition THEN EOL statementList ELSE statement
Recall that
statement: %empty
statementList: statement
with the result that both statement and statementList can derive the empty sequence. So both of the above productions for ifStatement can derive:
ifCondition THEN EOL ELSE statement
The parser has no way to know whether there is an empty statement before the EOL or an empty statementList after it. (You might not care which of these is chosen but parsers obsess about this kind of decision.)
Nullable productions are often problematic. Where possible, avoid them. Instead of letting statement derive empty, indicate explicitly where an empty statement might go by adding a rule where the optional statement is omitted. And consider rewriting statementList so that it must end with an EOL, which I think was your intention anyway (but perhaps I'm wrong).
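To check which non-terminals can derive the empty sequence, the usual fixed-point computation is enough. The sketch below assumes a grammar given as a dict from non-terminal to a list of productions (each a list of symbols); the extra ID and EOL alternatives are hypothetical and only there to make the example less trivial.

```python
def nullable_nonterminals(grammar):
    """Return the set of non-terminals that can derive the empty sequence."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, productions in grammar.items():
            if head in nullable:
                continue
            # A production is nullable if every symbol in it is nullable;
            # the empty production [] satisfies this trivially.
            if any(all(sym in nullable for sym in prod) for prod in productions):
                nullable.add(head)
                changed = True
    return nullable

grammar = {
    "statement": [[], ["ID"]],  # statement: %empty | ID (ID is hypothetical)
    "statementList": [["statement"], ["statementList", "EOL", "statement"]],
}
print(sorted(nullable_nonterminals(grammar)))
```

Both statement and statementList come out nullable, which is what makes the two ifStatement productions overlap.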
I'm new to Antlr4/CFG and am trying to write a parser for a boolean querying DSL of the form
(id AND id AND ID (OR id OR id OR id))
The logic can also take the form
(id OR id OR (id AND id AND id))
A more complex example might be:
(((id AND id AND (id OR id OR (id AND id)))))
(enclosed in an arbitrary amount of parentheses)
I've tried two things. First, I did a very simple parser, which ended up parsing everything left to right:
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN
| expression (AND expression)+
| expression (OR expression)+
| atom;
atom
: INT;
I got the following parse tree for input:
( 60 ) AND ( 55 ) AND ( 53 ) AND ( 3337 OR 2830 OR 23)
This "works", but ideally I want to be able to separate my AND and OR blocks. Trying to separate these blocks into separate grammars leads to left-recursion. Secondly, I want my AND and OR blocks to be grouped together, instead of reading left-to-right, for example, on input (id AND id AND id),
I want:
(and id id id)
not
(and id (and id (and id)))
as it currently is.
The second thing I've tried is making OR blocks directly descendant of AND blocks (ie the first case).
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN
| and_expr;
and_expr
: term (AND term)* ;
term
: LPAREN or_expr RPAREN
| LPAREN atom RPAREN ;
or_expr
: atom (OR atom)+;
atom: INT ;
For the same input, I get the following parse tree, which is more along the lines of what I'm looking for but has one main problem: there isn't an actual hierarchy to OR and AND blocks in the DSL, so this doesn't work for the second case. This approach also seems a bit hacky, for what I'm trying to do.
What's the best way to proceed? Again, I'm not too familiar with parsing and CFGs, so some guidance would be great.
Both are equivalent in their ability to parse your sample input. If you simplify your input by removing the unnecessary parentheses, the output of this grammar looks pretty good too:
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN
| expression (AND expression)+
| expression (OR expression)+
| atom;
atom : INT;
INT: DIGITS;
DIGITS : [0-9]+;
AND : 'AND';
OR : 'OR';
LPAREN : '(';
RPAREN : ')';
WS: [ \t\r\n]+ -> skip;
Which is what I suspect your first grammar looks like in its entirety.
Your second one requires too many parentheses for my liking (mainly in term), and the breaking up of AND and OR into separate rules instead of alternatives doesn't seem as clean to me.
You can simplify even more though:
grammar filter;
filter: expression EOF;
expression
: LPAREN expression RPAREN # ParenExp
| expression AND expression # AndBlock
| expression OR expression # OrBlock
| atom # AtomExp
;
atom : INT;
INT: DIGITS;
DIGITS : [0-9]+;
AND : 'AND';
OR : 'OR';
LPAREN : '(';
RPAREN : ')';
WS: [ \t\r\n]+ -> skip;
This gives a tree with a different shape but still is equivalent. And note the use of the # AndBlock and # OrBlock labels... these "alternative labels" will cause your generated listener or visitor to have separate methods for each, allowing you to completely separate these two in your code semantically as well as syntactically. Perhaps that's what you're looking for?
I like this one the best because it's the simplest, has the clearest recursion, and offers specific code alternatives for AND and OR.
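If you still want (and id id id) rather than nested binary nodes, one option is to flatten the tree after parsing, for example from within a visitor. The sketch below is plain Python and assumes a hypothetical tuple representation: atoms are strings and operators are tuples like ("and", left, right).

```python
def flatten(node):
    """Merge nested nodes of the same operator into one n-ary node."""
    if isinstance(node, str):
        return node  # atom
    op, *children = node
    flat = [op]
    for child in map(flatten, children):
        if not isinstance(child, str) and child[0] == op:
            flat.extend(child[1:])  # same operator: splice its operands in
        else:
            flat.append(child)      # atom or different operator: keep as is
    return tuple(flat)

nested = ("and", ("and", "60", "55"), ("or", ("or", "3337", "2830"), "23"))
print(flatten(nested))
```

Only nodes with the same operator are merged, so an OR block nested inside an AND block stays a separate group.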
I'm trying to parse VBA code, and the 5.4.2.10 section of the spec defines the Select Case statement, which we've defined as follows:
// 5.4.2.10 Select Case Statement
selectCaseStmt :
SELECT whiteSpace? CASE whiteSpace? selectExpression endOfStatement
caseClause*
caseElseClause?
END_SELECT
;
selectExpression : expression;
caseClause :
CASE whiteSpace rangeClause (whiteSpace? COMMA whiteSpace? rangeClause)* endOfStatement block
;
caseElseClause : CASE whiteSpace? ELSE endOfStatement block;
rangeClause :
expression
| selectStartValue whiteSpace TO whiteSpace selectEndValue
| (IS whiteSpace?)? comparisonOperator whiteSpace? expression
;
selectStartValue : expression;
selectEndValue : expression;
The problem is that the expression in rangeClause is taking precedence, and makes this:
Select Case foo
Case Is = 42
Exit Sub
End Select
...ultimately get picked up and treated as {undeclared-variable} {EQ} {literal}, which is a problem, because Is ought to be a lexer token, not the LHS of a comparison expression:
expression whiteSpace? (EQ | NEQ | LT | GT | LEQ | GEQ | LIKE | IS) whiteSpace? expression # relationalOp
I tried reordering the alternatives so that the expression branch has lower precedence, like this:
rangeClause :
selectStartValue whiteSpace TO whiteSpace selectEndValue
| (IS whiteSpace?)? comparisonOperator whiteSpace? expression
| expression
;
But that broke the entire grammar in all kinds of ways (breaks ~1000 tests in my project), so instead I tried changing the rangeClause to this (removed optional tokens, because Is without = is actually illegal VBA code):
rangeClause :
expression (whiteSpace TO whiteSpace expression)? #caseFromTo
| (IS whiteSpace comparisonOperator whiteSpace)? expression #caseIs
;
And then working with CaseFromToContext and CaseIsContext classes in the code (had to, to keep it compiling), but again it broke ~1000 tests in my project.
Then I figured, "hey that's potentially ambiguous!" and turned it into this:
rangeClause :
expression whiteSpace TO whiteSpace expression #caseFromTo
| IS whiteSpace comparisonOperator whiteSpace expression #caseIs
| expression #caseExpr
;
...but no luck, same identical outcome.
How can I make the rangeClause understand this annoying Case Is = foobar syntax? I'm using ANTLR 4.3, but we're planning to upgrade to ANTLR 4.6 soon-ish.
If additional context is needed, the complete VBAParser.g4 grammar is on github.
Turns out that re-ordering actually does work, but in order to keep the ambiguity out of the parse, the IS whiteSpace comparisonOperator has to come first:
rangeClause :
(IS whiteSpace?)? comparisonOperator whiteSpace? expression
| selectStartValue whiteSpace TO whiteSpace selectEndValue
| expression
;
The problem is with expression (and by extension selectStartValue and selectEndValue) which will recursively match Is = because comparisonOperator comparisonOperator is an expression match. There's probably some work that can be done to prevent comparisonOperator comparisonOperator from matching expression (it's never valid in VBA AFAIK), but the above works as a quick and dirty fix.
Basically all the above grammar does is ensure that the "invalid" comparisonOperator comparisonOperator matches as a rangeClause before it can be matched as an expression.
I'm working on a grammar that is context-sensitive. Here is its description:
It describes the set of expressions.
Each expression contains one or more parts separated by a logical operator.
Each part consists of an optional field identifier followed by a comparison operator (also optional) and a list of values.
Values are separated by a logical operator as well.
By default a value is a sequence of characters. Sometimes (depending on context) the set of possible characters for a value can be extended. It can even consume the comparison operator (which separates field identifiers from the list of values, according to the 3rd rule) and treat it as one of the value's characters.
Here's the simplified version of a grammar:
grammar TestGrammar;
@members {
boolean isValue = false;
}
exprSet: (expr NL?)+;
expr: expr log_op expr
| part
| '(' expr ')'
;
part: (fieldId comp_op)? values;
fieldId: STRNG;
values: values log_op values
| value
| '(' values ')'
;
value: strng;
strng: ( STRNG
| {isValue}? comp_op
)+;
log_op: '&' '&';
comp_op: '=';
NL: '\r'? '\n';
WS: ' ' -> channel(HIDDEN);
STRNG: CHR+;
CHR: [A-Za-z];
I'm using a semantic predicate in the strng rule. It should extend the set of possible tokens depending on the isValue variable.
The problem occurs when the semantic predicate evaluates to false. I expect that 2 STRNG tokens with an '=' token between them will be treated as a part node. Instead, it parses each STRNG token as a value, and throws out the '=' token when re-synchronizing.
Here's the input string and the resulting expression tree that is incorrect:
a && b=c
To get the correct expression tree, it's enough to remove the alternative with the semantic predicate from the strng rule (but that makes the rule static, so it is inappropriate for my solution):
strng: ( STRNG
// | {isValue}? comp_op
)+;
Here's resulting expression tree:
BTW, when semantic predicate evaluates to true - the result is as expected: strng rule matches an extended set of tokens:
strng: ( STRNG
| {!isValue}? comp_op
)+;
Please explain why this happens in such way, and help to find out correct solution. Thanks!
What about removing one option from values? Otherwise the text a && b may be parsed as either
expr -> expr log_op expr
or
expr -> part -> values log_op values
It seems ANTLR resolves this by using the second option!
values
: //values log_op values
value
| '(' values ')'
;
I believe your expr rule is written in the wrong order. Try moving the binary expression to be the last alternative instead of the first.
OK, I've realized that the current approach is inappropriate for my task.
I've chosen another approach based on overriding the Lexer's nextToken() and emit() methods, as described in ANTLR4: How to inject tokens.
It has given me almost full control over the stream of tokens. I got the following advantages:
assigning required types to tokens;
postponing the sending of tokens with a not-yet-defined type to the parser (by sending fake tokens on the hidden channel);
the possibility to split and merge tokens;
the possibility to organize postponed tokens into queues.
With all these possibilities I'm able to resolve all the ambiguities in the parser.
P.S. Thanks to everyone who tried to help, I appreciate it!
I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (ply, a lex&yacc clone written in Python) does not allow that.
Here are the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first-class objects), and that type has a constructor with no arguments.
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends in an error state. I would clearly like the parser to backtrack and try constructing an expression = new typereference(SomeGuy .SomeOtherGuy) LPAREN empty_list RPAREN, and then things would work because the ; would match the var statement syntax.
However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!
I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is to avoid forcing the parser to make a premature decision; once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
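The same "decide late" idea can be illustrated with a hand-written parser in plain Python. This is only a sketch, not PLY code and not how an LALR parser works internally: it parses the whole postfix chain first and, after new, treats the last call in the chain as the constructor call. The tuple node shapes and the tokenizer are assumptions, and super is handled like any other name.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z_]\w*|[().,\[\]])")

def tokenize(text):
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"bad input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def name(self):
        tok = self.take()
        if not re.fullmatch(r"[A-Za-z_]\w*", tok) or tok == "new":
            raise SyntaxError(f"expected a name, got {tok!r}")
        return tok

    def expression(self):
        if self.peek() != "new":
            return self.postfixed()
        self.take("new")
        node = self.postfixed()
        # PRODUCTION 2: the final call of the chain is the constructor call.
        if node[0] != "call":
            raise SyntaxError("new needs an argument list")
        _, target, args = node
        return ("new", target, args)

    def postfixed(self):
        node = ("name", self.name())
        while self.peek() in (".", "[", "("):
            op = self.take()
            if op == ".":
                node = ("attr", node, self.name())
            elif op == "[":
                node = ("index", node, self.expression())
                self.take("]")
            else:  # PRODUCTION 1: an ordinary call
                node = ("call", node, self.call_args())
                self.take(")")
        return node

    def call_args(self):
        args = []
        if self.peek() != ")":
            args.append(self.expression())
            while self.peek() == ",":
                self.take(",")
                args.append(self.expression())
        return args

print(Parser(tokenize("new SomeGuy.SomeOtherGuy()")).expression())
```

As in the grammar above, SomeGuy.SomeOtherGuy ends up as the type being constructed and the empty parentheses become the constructor's argument list.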
(Grammar tested with bison, just to make sure there are no conflicts.)