So, I was wondering what makes a parser like:
line : expression EOF;
expression : m_expression (PLUS m_expression)?;
m_expression: basic (TIMES basic)?;
basic : NUMBER | VARIABLE | (OPENING expression CLOSING) | expression;
left recursive and invalid, while a parser like
line : expression EOF;
expression : m_expression (PLUS m_expression)?;
m_expression: basic (TIMES basic)?;
basic : NUMBER | VARIABLE | (OPENING expression CLOSING);
is valid and works even though the definition of 'basic' still refers to 'expression'. In particular, I'd like to be able to parse expressions in the form of
a+b+c
without introducing operations acting on more than two operands.
line calls expression calls m_expression calls basic which calls expression... that is indirectly left recursive and bad for both v3 antlr and v4. The definition of left recursion means you can get back to the same rule without consuming a token. You have the OPENING token in front of expression in the second instance.
Related
I am working on a SQL grammar where pretty much anything can be an expression, even places where you might not realize it. Here are a few examples:
-- using an expression on the list indexing
SELECT ([1,2,3])[(select 1) : (select 1 union select 1 limit 1)];
Of course this is an extreme example, but my point being, many places in SQL you can use an arbitrarily nested expression (even when it would seem "Oh that is probably just going to allow a number or string constant).
Because of this, I currently have one long rule for expressions that may reference itself, the following being a pared down example:
grammar DBParser;
options { caseInsensitive=true; }
statement:select_statement EOF;
select_statement
: 'SELECT' expr
'WHERE' expr // The WHERE clause should only allow a BoolExpr
;
expr
: expr '=' expr # EqualsExpr
| expr 'OR' expr # BoolExpr
| ATOM # ConstExpr
;
ATOM: [0-9]+ | '\'' [a-z]+ '\'';
WHITESPACE: [ \t\r\n] -> skip;
With sample input SELECT 1 WHERE 'abc' OR 1=2. However, one place I do want to limit what expressions are allowed is in the WHERE (and HAVING) clause, where the expression must be a boolean expression, in other words WHERE 1=1 is valid, but WHERE 'abc' is invalid. In practical terms what this means is the top node of the expression must be a BoolExpr. Is this something that I should modify in my parser rules, or should I be doing this validation downstream, for example in the semantic phase of validation? Doing it this way would probably be quite a bit simpler (even if the lexer rules are a bit lax), as there would be so much indirection and probably indirect left-recursion involved that it would become incredibly convoluted. What would be a good approach here?
Your intuition is correct that breaking this out would probably create indirect left recursion. Also, is it possible that an IDENTIFIER could represent a boolean value?
This is the point of #user207421's comment. You can't fully capture types (i.e. whether an expression is boolean or not) in the parser.
The parser's job (in the Lexer & Parser sense), put fairly simply, is to convert your input stream of characters into the a parse tree that you can work with. As long as it gives a parse tree that is the only possible way to interest the input (whether it is semantically valid or not), it has served its purpose. Once you have a parse tree then during semantic validation, you can consider the expression passed as a parameter to your where clause and determine whether or not it has a boolean value (this may even require consulting a symbol table to determine the type of an identifier). Just like your semantic validation of an OR expression will need to determine that both the lhs and rhs are, themselves, boolean expressions.
Also consider that even if you could torture the parser into catching some of your type exceptions, the error messages you produce from semantic validation are almost guaranteed to be more useful than the generated syntax errors. The parser only catches syntax errors, and it should probably feel a bit "odd" to consider a non-boolean expression to be a "syntax error".
I'm currently writing a simple grammar that requires operator precedence and mixed associativities in one expression. An example expression would be a -> b ?> C ?> D -> e, which should be parsed as (a -> (((b ?> C) ?> D) -> e). That is, the ?> operator is a high-precedence left-associative operator wheras the -> operator is a lower-precedence right-associative operator.
I'm prototyping the grammar in ANTLR 3.5.1 (via ANTLRWorks 1.5.2) and find that it can't handle the following grammar:
prog : expr EOF;
expr : term '->' expr
| term;
term : ID rest;
rest : '?>' ID rest
| ;
It produces rule expr has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2 error.
The term and rest productions work fine in isolation when I tested it , so I assumed this happened because the parser is getting confused by expr. To get around that, I did the following refactor:
prog : expr EOF;
expr : term exprRest;
exprRest
: '->' expr
| ;
term : ID rest;
rest : DU ID rest
| ;
This works fine. However, because of this refactor I now need to check for empty exprRest nodes in the output parse tree, which is non-ideal. Is there a way to make ANTLR work around the ambiguity in the initial declaration of expr? I would of assumed that the generated parser would fully match term and then do a lookahead search for "->" and either continue parsing or return the lone term. What am I missing?
As stated, the problem is in this rule:
expr : term '->' expr
| term;
The problematic part is the term which is common to both alternatives.
LL(1) grammar doesn't allow this at all (unless term only matches zero tokens - but such rules would be pointless), because it cannot decide which alternative to use with only being able to see one token ahead (that's the 1 in LL(1)).
LL(k) grammar would only allow this if the term rule could match at most k - 1 tokens.
LL(*) grammar which ANTLR 3.5 uses does some tricks that allows it to handle rules that match any number of tokens (ANTLR author calls this "variable look-ahead").
However, one thing that these tricks cannot handle is if the rule is recursive, i.e. if it or any rules it invokes reference itself in any way (direct or indirect) - and that is exactly what your term rule does:
term : ID rest;
rest : '?>' ID rest
| ;
- the rule rest, referenced from term, recursively references itself. Thus, the error message
rule expr has non-LL(*) decision due to recursive rule invocations ...
The way to solve this limitation of LL grammars is called left-factoring:
expr : term
( '->' expr )?
;
What I did here is said "match term first" (since you want to match it in both alternatives, there's no point in deciding which one to match it in), then decide whether to match '->' expr (this can be decided just by looking at the very next token - if it's ->, use it - so this is even LL(1) decision).
This is very similar to what you came to as well, but the parse tree should look very much like you intended with the original grammar.
Currently I'm trying to build a parser for VHDL which
has some of the problems C++-Parsers have to face.
The context-free grammar of VHDL produces a parse
forest rather than a single parse tree because of it's
ambiguity regarding function calls and array subscriptions
foo := fun(3) + arr(5);
This assignment can only be parsed unambiguous if the parser
would carry around a hirachically, type-aware symbol table
which it'd use to resolve the ambiguities somewhat on-the-fly.
I don't want to do this, because for statements like the
aforementioned, the parse forest would not grow exponentially, but
rather linear depending on the amount of function calls and
array subscriptions.
(Except, of course, one would torture the parser with statements like)
foo := fun(fun(fun(fun(fun(4)))));
Since bison forces the user to just create one single parse-tree,
I used %merge attributes to collect all subtrees recursively and
added those subtrees under so called AMBIG nodes in the singleton
AST.
The result looks like this.
In order to produce the above, I parsed the token stream "I=I(N);".
The substance of the grammar I used inside the parse.y file, is
collected below. It tries to resemble the ambiguous parts of VHDL:
start: prog
;
/* I cut out every semantic action to make this
text more readable */
prog: assignment ';'
| prog assignment ';'
;
assignment: 'I' '=' expression
;
expression: function_call %merge <stmtmerge2>
| array_indexing %merge <stmtmerge2>
| 'N'
;
function_call: 'I' '(' expression ')'
| 'I'
;
array_indexing: function_call '(' expression ')' %merge <stmtmerge>
| 'I' '(' expression ')' %merge <stmtmerge>
;
The whole sourcecode can be read at this github repository.
And now, let's get down to the actual Problem.
As you can see in the generated parse tree above,
the nodes FCALL1 and ARRIDX1 refer to the same
single node EXPR1 which in turn refers to N1 twice.
This, by all means, should not have happened and I don't
know why. Instead there should be the paths
FCALL1 -> EXPR2 -> N2
ARRIDX1 -> EXPR1 -> N1
Do you have any idea why bison reuses the aforementioned
nodes?
I also wrote a bugreport on the official gnu mailing
list for bison, without a reply to this point though.
Unfortunately, due to the restictions for new stackoverflow
users, I can't provide no link to this bug report...
That behaviour is expected.
expression can be unambiguously reduced, and that reduced value is used by both possible ambiguous reductions which include the value. Remember that GLR, like LR, is a left-to-right parser. When a reduction action is executed, all of the child reductions have already happened. The effect is not different from the use of a terminal in a right-hand side; the terminal will not be artificially copied in order to produce different instances in the ambiguous productions which use it.
For most people, this would be a feature rather than a bug, and I don't mean that as a joke. Without the graph-structured stack, GLR has exponential run-time. If you really want to do a deep copy of shared AST nodes when you merge parse trees, you will have to do it yourself, but I suggest that you find a way to make use of the fact that the parse forest is really an directed acyclic graph rather than a tree; you will probably be able to take advantage of the lack of duplication.
If I have an ANTLR grammar as follows:
grammar Test;
options {
language = Java;
}
rule : (foo | bar);
foo : FOO ',' FOO;
bar : BAR;
FOO: ('0'..'9')+;
BAR: ('a'..'z' | 'A'..'Z' | '0'..'9' | ' ')+;
WHITESPACE: (' ' | '\t')+ { $channel=HIDDEN; };
And I use a test string:
12abc3
this (I believe) is a BAR token which satisfies a bar rule and is parsed as such. Bravo.
However, if I have this string:
12
I receive line 1:2 mismatched input '' expecting ','
This seems rather non-deterministic although I'm sure it's not. I understand I'm already in trouble by having two tokens: FOO and BAR that accept digits. But if the parser is going to succeed or fail it should succeed or fail consistently. In other words, in the first case the first character is a 1 and apparently is being evaluated as a member of the BAR token and thus the parser heads down a successful path. In the second case, the SAME first character is being evaluated as a FOO token and thus the path is doomed to fail despite the fact that the string COULD be a successful bar parse. Why the inconsistency? Or am I missing something more fundamental about ANTLR and/or parsing?
ANTLR doesn't determine the token type until it sees the first character for the next token(or EOF). ANTLR will also attempt a longest match, which is why you see '12abc3' as BAR and not as FOO BAR. In the second case ANTLR will use FOO for '12' because it is listed first in the grammar.
ANTLR basics
ANTLR lexers
In addition to Adam answer, you must realize that the lexer and parser, although defined in the same grammar, are being constructed at different times. First the input source is being tokenized and when that has happened, only then the parser operates on these tokens. The tokens are not created while the parser goes through the source (character stream) to favor a complete match (ie. tokenize "12" as BAR). The fact that "12" is being tokenized as FOO is because FOO comes before the BAR rule and has therefor a higher precedence in case of an equal long match.
In short: ANTLR grammars are not PEG's.
I've got a simple grammar. Actually, the grammar I'm using is more complex, but this is the smallest subset that illustrates my question.
Expr ::= Value Suffix
| "(" Expr ")" Suffix
Suffix ::= "->" Expr
| "<-" Expr
| Expr
| epsilon
Value matches identifiers, strings, numbers, et cetera. The Suffix rule is there to eliminate left-recursion. This matches expressions such as:
a -> b (c -> (d) (e))
That is, a graph where a goes to both b and the result of (c -> (d) (e)), and c goes to d and e. I'm trying to produce an abstract syntax tree for these expressions, but I'm running into difficulty because all of the operators can accept any number of operands on each side. I'd rather keep the logic for producing the AST within the recursive descent parsing methods, since it avoids having to duplicate the logic of extracting an expression. My current strategy is as follows:
If a Value appears, push it to the output.
If a From or To appears:
Output a separator.
Get the next Expr.
Create a Link node.
Pop the first set of operands from output into the Link until a separator appears.
Erase the separator discovered.
Pop the second set of operands into the Link until a separator.
Push the Link to the output.
If I run this through without obeying steps 2.3–2.7, I get a list of values and separators. For the expression quoted above, a -> b (c -> (d) (e)), the output should be:
A sep_1 B sep_2 C sep_3 D E
Applying the To rule would then yield:
A sep_1 B sep_2 (link from C to {D, E})
And subsequently:
(link from A to {B, (link from C to {D, E})})
The important thing to note is that sep_2, crucial to delimit the left-hand operands of the second ->, does not appear, so the parser believes that the expression was actually written:
a -> (b c -> (d) (e))
In order to solve this with my current strategy, I would need a way to produce a separator between adjacent expressions, but only if the current expression is a From or To expression enclosed in parentheses. If that's possible, then I'm just not seeing it and the answer ought to be simple. If there's a better way to go about this, however, then please let me know!
I haven't tried to analyze it in detail, but: "From or To expression enclosed in parentheses" starts to sound a lot like "context dependent", which recursive descent can't handle directly. To avoid context dependence you'll probably need a separate production for a From or To in parentheses vs. a From or To without the parens.
Edit: Though it may be too late to do any good, if my understanding of what you want to match is correct, I think I'd write it more like this:
Graph :=
| List Sep Graph
;
Sep := "->"
| "<-"
;
List :=
| Value List
;
Value := Number
| Identifier
| String
| '(' Graph ')'
;
It's hard to be certain, but I think this should at least be close to matching (only) the inputs you want, and should make it reasonably easy to generate an AST that reflects the input correctly.