Obscure Antlr Error when Parsing Data Type

I am trying to parse a variable type for a toy language meant to teach Antlr fundamentals. I want to parse starting at the rule var, using the grammar below.
// Parser
var : TYPE ID;
// Lexer
TYPE: SIGNED PTR? DIMENSIONS?
| UNSIGNED PTR? DIMENSIONS?
| UNSIGNABLE PTR? DIMENSIONS?;
fragment DIMENSIONS : '[' ((NAT | ':') ',')* (NAT | ':')? ']';
fragment SIGNED : 'I16' | 'I32' | 'I64' | 'F32' | 'CHAR';
fragment UNSIGNED : 'U_I16' | 'U_I32' | 'U_I64' | 'U_F32' | 'U_CHAR';
fragment UNSIGNABLE : 'VOID' | 'STR' | 'BOOL' | 'CPLX';
PTR : 'PTR';
NAT : [0-9]+;
ID : [A-Z][A-Z0-9_]*;
However, when I test my program with the example declaration I32 HELLO_9, I receive the following error.
line 1:0 missing TYPE at 'I32'
PTR and DIMENSIONS are marked as optional, so I am unsure why my lexer will not identify the I32 token via the SIGNED fragment. As a secondary question, I wonder how it is ever possible for professional programmers to create sophisticated projects with Antlr. I have experimented with Haskell parsing libraries in the past, and it appears (from my subjective view) that Antlr is more prone to producing obscure errors. My perception is probably just a consequence of my inexperience, and I would be thankful to hear the opinions of more experienced programmers.

Given your grammar, I can't reproduce this. If I add SPACE : [ \t\r\n] -> skip; to it, the following code:
TLexer lexer = new TLexer(CharStreams.fromString("I32 HELLO_9"));
TParser parser = new TParser(new CommonTokenStream(lexer));
ParseTree root = parser.var();
System.out.println(root.toStringTree(parser));
produces no warnings/errors and prints:
(var I32 HELLO_9)
representing the parse tree.
The real problem is either something @rici mentioned, or it is hidden by the fact that you've minimized your real grammar and the minimized form does not produce the error your real grammar does.
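If your real grammar does reproduce the error, a useful first step is to dump the token stream and check which token type I32 actually gets. Below is a minimal sketch, assuming the same generated TLexer as above (the wrapper class name DumpTokens is made up here):
import org.antlr.v4.runtime.*;

public class DumpTokens {
    public static void main(String[] args) {
        // Print every token the lexer produces so you can see whether I32
        // really comes out as TYPE or as some other token type.
        TLexer lexer = new TLexer(CharStreams.fromString("I32 HELLO_9"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();                                   // tokenize the whole input
        for (Token t : tokens.getTokens()) {
            System.out.printf("%-8s '%s'%n",
                    lexer.getVocabulary().getSymbolicName(t.getType()), t.getText());
        }
    }
}
If TYPE does not appear in that listing, the problem lies in how the full grammar defines or shadows that token, not in the optional PTR/DIMENSIONS parts.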

Related

Interpretation variants of binary operators

I'm writing a grammar for a language that contains some binary operators that can also be used as unary operators (with the argument to the right side of the operator), and for better error recovery I'd like them to be usable as nullary operators as well.
My simplified grammar looks like this:
start:
code EOF
;
code:
(binaryExpression SEMICOLON?)*
;
binaryExpression:
binaryExpression BINARY_OPERATOR binaryExpression //TODO: check before primaryExpression
| primaryExpression
;
primaryExpression:
unaryExpression
| nularExpression
;
unaryExpression:
operator primaryExpression
| BINARY_OPERATOR primaryExpression
;
nularExpression:
operator
| BINARY_OPERATOR
| NUMBER
| STRING
;
operator:
ID
;
BINARY_OPERATOR is just a set of defined keywords that are fed into the parser.
My problem is that Antlr prefers to use BINARY_OPERATORs as unary expressions (or nullary ones if there is no other choice) instead of trying to use them in a binary expression as I need it to.
For example, consider the following input: for varDec from one to twelve do something, where from, to and do are binary operators. In the parse tree the parser produces (not reproduced here), all of the binary operators are interpreted as unary ones.
What I'm trying to achieve is the following: try to match each BINARY_OPERATOR in a binary expression, and only if that is not possible try to match it as a unary expression; if that isn't possible either, it may be considered a nullary expression (which can only be the case if the BINARY_OPERATOR is the only content of an expression).
Does anyone have an idea how to achieve the desired behaviour?
A fairly standard approach is to use a single recursive rule to establish the acceptable expression syntax. ANTLR is left-associative by default, so op expr meets the stated unary-op requirement of "argument to the right side of the operator". See pg. 70 of TDAR for a further discussion of associativity.
Ex1: -y+x -> binaryOp{unaryOp{-, literal}, +, literal}
Ex2: -y+-x -> binaryOp{unaryOp{-, literal}, +, unaryOp{-, literal}}
expr
: LPAREN expr RPAREN
| expr op expr #binaryOp
//| op expr #unaryOp // standard formulation
| op literal #unaryOp // limited formulation
| op #errorOp
| literal
;
op : .... ;
literal
: KEYWORD
| ID
| NUMBER
| STRING
;
You allow operators to act like operands ("nularExpression") and operands to act like operators ("operator: ID"). Between those two curious decisions, your grammar is 100% ambiguous, and there is never any need for a binary operator to be parsed. I don't know much about Antlr, but it surprises me that it doesn't warn you that your grammar is completely ambiguous.
Antlr has mechanisms to handle and recover from errors. You would be much better off using them than writing a deliberately ambiguous grammar which makes erroneous constructs part of the accepted grammar. (As I said, I'm not an Antlr expert, but there are Antlr experts who pass by here pretty regularly; if you ask a specific question about error recovery, I'm sure you'll get a good answer. You might also want to search this site for questions and answers about Antlr error recovery.)
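As a rough illustration of that suggestion (not tied to this particular grammar, and assuming the Java target): ANTLR 4 lets you replace the default console error listener with your own, while its built-in recovery strategy keeps the parse going, so erroneous constructs do not have to become part of the accepted grammar. The class name ErrorReporting is made up for this sketch; the listener API is ANTLR's.
import org.antlr.v4.runtime.*;

class ErrorReporting {
    // Replace the default console listener with custom reporting; ANTLR's
    // built-in recovery (token insertion/deletion, resynchronization) still runs.
    static void attach(Parser parser) {
        parser.removeErrorListeners();
        parser.addErrorListener(new BaseErrorListener() {
            @Override
            public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                                    int line, int charPositionInLine,
                                    String msg, RecognitionException e) {
                System.err.printf("line %d:%d %s%n", line, charPositionInLine, msg);
            }
        });
    }
}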
I think what I'm going to write down now is what @GRosenberg meant with his answer. However, as it took me a while to fully understand it, I will provide a concrete solution to my problem in case someone else stumbles across this question and is searching for an answer:
The trick was to remove the option to use a BINARY_OPERATOR inside the unaryExpression rule, because that alternative always got preferred. Instead, what I really wanted was to specify that if there is no left-side argument it is okay to use a BINARY_OPERATOR in a unary way. And this is how I had to specify it:
binaryExpression:
binaryExpression BINARY_OPERATOR binaryExpression
| BINARY_OPERATOR primaryExpression
| primaryExpression
;
That way the unary syntax only becomes possible if there is nothing to the left of a BINARY_OPERATOR, and in every other case the binary syntax has to be used.

ANTLR 4 Parser Grammar

How can I improve my parser grammar so that, instead of creating an AST that contains a couple of decFunc subtrees for my test code, it creates only one, with sum becoming the second root? I have tried to solve this in multiple different ways, but I always get a left-recursion error.
This is my testing code :
f :: [Int] -> [Int] -> [Int]
f x y = zipWith (sum) x y
sum :: [Int] -> [Int]
sum a = foldr(+) a
This is my grammar:
This is the image that shows the two decFunc subtrees:
http://postimg.org/image/w5goph9b7/
prog : stat+;
stat : decFunc | impFunc ;
decFunc : ID '::' formalType ( ARROW formalType )* NL impFunc
;
anotherFunc : ID+;
formalType : 'Int' | '[' formalType ']' ;
impFunc : ID+ '=' hr NL
;
hr : 'map' '(' ID* ')' ID*
| 'zipWith' '(' ('*' |'/' |'+' |'-') ')' ID+ | 'zipWith' '(' anotherFunc ')' ID+
| 'foldr' '(' ('*' |'/' |'+' |'-') ')' ID+
| hr op=('*'| '/' | '.&.' | 'xor' ) hr | DIGIT
| 'shiftL' hr hr | 'shiftR' hr hr
| hr op=('+'| '-') hr | DIGIT
| '(' hr ')'
| ID '(' ID* ')'
| ID
;
Your test input contains two instances of content that will match the decFunc rule. The generated parse tree shows exactly that: two sub-trees, each having a decFunc as the root.
Antlr v4 will not produce a true AST where f and sum are the roots of separate sub-trees.
Is there anything I can do with the grammar to make both f and sum roots? – Jonny Magnam
Not directly in an Antlr v4 grammar. You could:
switch to Antlr v3, or another parser tool, and define the generated AST as you wish.
walk the Antlr v4 parse-tree and create a separate AST of your desired form (a generic sketch is shown at the end of this answer).
just use the parse-tree directly with the realization that it is informationally equivalent to a classically defined AST and the implementation provides a number of practical benefits.
Specifically, the standard academic AST is mutable, meaning that every (or all but the first) visitor is custom, not generated, and that any change in the underlying grammar or an interim structure of the AST will require reconsideration and likely changes to every subsequent visitor and their implemented logic.
The Antlr v4 parse-tree is essentially immutable, allowing decorations to be accumulated against tree nodes without loss of relational integrity. Visitors all use a common base structure, greatly reducing brittleness due to grammar changes and effects of prior executed visitors. As a practical matter, tree-walks are easily constructed, fast, and mutually independent except where expressly desired. They can achieve a greater separation of concerns in design and easier code maintenance in practice.
Choose the right tool for the whole job, in whatever way you define it.
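As a generic sketch of the second option above (walking the ANTLR v4 parse tree to build your own AST): the AstNode class is invented here purely for illustration, everything else is the standard Java runtime API.
import org.antlr.v4.runtime.Parser;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.TerminalNode;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, freely reshapeable AST node type.
class AstNode {
    final String label;                          // rule name or token text
    final List<AstNode> children = new ArrayList<>();
    AstNode(String label) { this.label = label; }
}

class AstBuilder {
    // Copy the (effectively immutable) parse tree into a separate AST.
    static AstNode build(ParseTree t, Parser parser) {
        if (t instanceof TerminalNode) {
            return new AstNode(t.getText());     // leaf: keep the token text
        }
        ParserRuleContext ctx = (ParserRuleContext) t;
        AstNode node = new AstNode(parser.getRuleNames()[ctx.getRuleIndex()]);
        for (int i = 0; i < t.getChildCount(); i++) {
            node.children.add(build(t.getChild(i), parser));
        }
        return node;
    }
}
Once the tree is copied into your own structure, you are free to reshape it however you like, for example hanging the sum subtree wherever you want it.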

Can I do something to avoid the need to backtrack in this grammar?

I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (PLY, a lex & yacc clone written in Python) does not allow that.
Here are the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first class objects) - and that type has a constructor with no arguments
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends in an error state. I would clearly like the parser to backtrack and try constructing an expression = new typereference(SomeGuy.SomeOtherGuy) LPAREN empty_list RPAREN; then things would work, because the ; would match the var statement syntax all right.
However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!
I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is to avoid forcing the parser to make a premature decision: once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
(Grammar tested with bison, just to make sure there are no conflicts.)

Left recursion, associativity and AST evaluation

So I have been reading a bit on lexers, parsers, interpreters and even compilers.
For a language I'm trying to implement, I settled on a Recursive Descent Parser. Since the original grammar of the language had left recursion, I had to slightly rewrite it.
Here's a simplified version of the grammar I had (note that it's not in any standard grammar format, but somewhat pseudo, I guess; it's how I found it in the documentation):
expr:
-----
expr + expr
expr - expr
expr * expr
expr / expr
( expr )
integer
identifier
To get rid of the left-recursion, I turned it into this (note the addition of the NOT operator):
expr:
-----
expr_term {+ expr}
expr_term {- expr}
expr_term {* expr}
expr_term {/ expr}
expr_term:
----------
! expr_term
( expr )
integer
identifier
And then go through my tokens using the following sub-routines (simplified pseudo-code-ish):
public string Expression()
{
    string term = ExpressionTerm();
    if (term != null)
    {
        while (PeekToken() == OperatorToken)
        {
            term += ReadToken() + Expression();
        }
    }
    return term;
}

public string ExpressionTerm()
{
    // PeekToken and ReadToken accordingly, otherwise return null
}
This works! The result after calling Expression is always equal to the input it was given.
This makes me wonder: if I created AST nodes rather than strings in these subroutines, and evaluated the AST using an infix evaluator (which also keeps in mind associativity and precedence of operators, etcetera), wouldn't I get the same result?
And if I do, then why are there so many topics covering "fixing left recursion, keeping in mind associativity and what not" when it's actually "dead simple" to solve, or even a non-problem, as it seems? Or is it really the structure of the resulting AST that people are concerned about (rather than what it evaluates to)? Could anyone shed some light? I might be getting it all wrong as well, haha!
The shape of the AST is important, since a+(b*3) is not usually the same as (a+b)*3 and one might reasonably expect the parser to indicate which of those a+b*3 means.
Normally, the AST will actually delete parentheses. (A parse tree wouldn't, but an AST is expected to abstract away syntactic noise.) So the AST for a+(b*3) should look something like:
         Sum
          |
      +---+---+
      |       |
     Var     Prod
      |       |
      a   +---+---+
          |       |
         Var    Const
          |       |
          b       3
If your language obeys the usual mathematical notation conventions, so will the AST for a+b*3.
An "infix evaluator" -- or what I imagine you're referring to -- is just another parser. So, yes, if you are happy to parse later, you don't have to parse now.
By the way, showing that you can put tokens back together in the order that you read them doesn't actually demonstrate much about the parser functioning. You could do that much more simply by just echoing the tokenizer's output.
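To make the "AST nodes instead of strings" idea concrete, here is a minimal, self-contained sketch (not the OP's code) with the same parse-a-term-then-loop shape: it builds a small AST and evaluates it, and the loops keep - and / left-associative, which the string-concatenation version cannot demonstrate. Single-digit operands keep the tokenizer trivial.
// Sketch: recursive descent without left recursion, building an AST.
public class TinyExprParser {
    interface Expr { int eval(); }
    record Num(int v) implements Expr { public int eval() { return v; } }
    record Bin(char op, Expr l, Expr r) implements Expr {
        public int eval() {
            return switch (op) {
                case '+' -> l.eval() + r.eval();
                case '-' -> l.eval() - r.eval();
                case '*' -> l.eval() * r.eval();
                default  -> l.eval() / r.eval();
            };
        }
    }

    private final String src;
    private int pos;
    TinyExprParser(String src) { this.src = src; }
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    // expr := term (('+'|'-') term)*   -- a loop, not right recursion
    Expr expr() {
        Expr node = term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            node = new Bin(op, node, term());   // fold to the left: ((a-b)-c)
        }
        return node;
    }

    // term := factor (('*'|'/') factor)*
    Expr term() {
        Expr node = factor();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            node = new Bin(op, node, factor());
        }
        return node;
    }

    // factor := '(' expr ')' | digit
    Expr factor() {
        if (peek() == '(') {
            pos++;                              // consume '('
            Expr inner = expr();
            pos++;                              // consume ')'
            return inner;
        }
        return new Num(src.charAt(pos++) - '0');
    }

    public static void main(String[] args) {
        System.out.println(new TinyExprParser("8-3-2").expr().eval());  // 3, not 7
        System.out.println(new TinyExprParser("2+3*4").expr().eval());  // 14
    }
}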
The standard and easiest way to deal with expressions, mathematical or other, is with a rule hierarchy that reflects the intended associations and operator precedence:
expre = sum
sum = addend '+' sum | addend
addend = term '*' addend | term
term = '(' expre ')' | '-' integer | '+' integer | integer
Such grammars let the parse or abstract trees be directly evaluatable. You can expand the rule hierarchy to include power and bitwise operators, or make it part of the hierarchy for logical expressions with and, or, and comparisons.

ANTLR doesn't find the defined start rule

I'm facing a strange ANTLR issue with a grammar that should just output an AST.
grammar ltxt.g;
options
{
language=CSharp3;
}
prog : start
;
start : '{Start 'loopname'}'statement'{Ende 'loopname'}'
| statement
;
loopname : (('a'..'z')|('A'..'Z')|('1'..'9'))*;
statement : '<%' table_ref '>'
| start;
table_ref : '{'format'}'ID;
format : FSTRING
| FSTRING OFSTRING{0,5}
;
FSTRING : '#F'
| '#D'
| '#U'
| '#K'
;
OFSTRING: 'F'
| 'D'
| 'U'
| 'K'
//| 1..65536
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
When I try to code-gen this I get
error(100):LTXT.g:1:13:syntax error: antlr: MismatchedTokenException(74!=52). I didn't declare any 74 or 52.
Also, I do not get a syntax diagram, since rule "start" cannot be found as a start state...
I know that this isn't pretty, but I thought it would work at least :)
Best,
wishi
There are four errors that I see.
A grammar name can't contain a period. That's the syntax error you're getting. The 74!=52 error message is a hint telling you that ANTLR found token id 74 when it was expecting token id 52, which in this case just translates to "it found one thing when it expected something else."
The grammar name ("ltxt") and the file name before the extension ("LTXT") need to match exactly.
The grammar won't produce an AST unless you specify output=AST; in the options section.
format's second alternative (FSTRING OFSTRING{0,5}) won't do what I think you think it's going to do. ANTLR doesn't support an arbitrary number of matches such as "match zero to five OFSTRINGs". You'll need to redefine the rule using semantic predicates that count occurrences for you. They aren't hard to use, but they're one of the trickier parts of ANTLR.
I hope that helps get you started.
