Writing java-style If statements in XText results in an error - xtext

I'm writing a custom Xtext editor for my own DSL-Language and now want to add If-Statements to my language.
The Statements look something like this:
if (TRUE) {
(...)
}
But when I try adding them in, I get an error "A class may not be a super type of itself".
This is my code so far:
grammar XtextTest with org.eclipse.xtext.common.Terminals
generate xtextTest "http://www.my.xtext/Test"
Model:
statements+=Statement*;
Statement:
VariableAssignment |
IfStatement;
IfStatement:
'if' '(' BooleanExpression ')' '{' Statement '}';
BooleanExpression:
'TRUE' | 'FALSE';
VariableAssignment:
name=ID "=" INT ';';
How can I implement this? Or am I doing something obviously wrong?
Any help is appreciated ^^

Assignments are a important thing in Xtext. if you just call rules without assigning them it influences the supertype hierarchy that is inferred. => it is better to change the grammar to
IfStatement:
'if' '(' condition=BooleanExpression ')' '{' statement=Statement '}';
If you want to introduce a common supertype/subtype relationship don't use assignments
Number: Double | Long;

Related

Antlr4 - Match rule only if a previous rule matched

I have a parser laid out similarly to (though not exactly like) this:
compilationUnit: statement* EOF;
methodBody: statement*;
ifBlock: IF '(' expression ')' '{' statement* '}' ;
statement: methodBody | ifBlock | returnStatement | ... ;
This parser works fine, and I can use it. However, it has the flaw that returnStatement will parse even if it's not in a method body. Ideally, I would be able to set it up such that returnStatement will only match in the statement rule if one of its parents down the line was methodBody. Is there a way to do that?
You have to differentiate the statements that appear inside the methodBody from the ones that appear outside of that scope. Ideally you will have two different productions. Something like this:
compilationUnit: member* EOF;
member: method | class | ... ;
method: methodName '(' parameters ')' '{' methodBody '}' ;
methodBody: statement*;
statement: methodBody | ifBlock | returnStatement | ... ;
ifBlock: IF '(' expression ')' '{' statement* '}' ;
You are trying to solve this problem at the wrong level. It shouldn't be handled at the grammar level, but in the following semantic phase (which is used to find logical/semantic errors, instead of syntax errors, what your parser is dealing with). You can see the same approach in the C grammar. The statement rule references the jumpStatement rule, which in turn matches (among others) the return statement.
Handling such semantic errors also allows for better error messages. Instead of "no viable alt" you can print an error that says "return only allowed in a function body" or similar. To do that examine the generated parse tree, search for return statements and check the parent context(s) of that sub tree to know if it is valid or not.

Ambiguous call expression in ANTLR4 grammar

I have a simple grammar (for demonstration)
grammar Test;
program
: expression* EOF
;
expression
: Identifier
| expression '(' expression? ')'
| '(' expression ')'
;
Identifier
: [a-zA-Z_] [a-zA-Z_0-9?]*
;
WS
: [ \r\t\n]+ -> channel(HIDDEN)
;
Obviously the second and third alternatives in the expression rule are ambiguous. I want to resolve this ambiguity by permitting the second alternative only if an expression is immediately followed by a '('.
So the following
bar(foo)
should match the second alternative while
bar
(foo)
should match the 1st and 3rd alternatives (even if the token between them is in the HIDDEN channel).
How can I do that? I have seen these ambiguities, between call expressions and parenthesized expressions, present in languages that have no (or have optional) expression terminator tokens (or rules) - example
The solution to this is to temporary "unhide" whitespace in your second alternative. Have a look at this question for how this can be done.
With that solution your code could look somthing like this
expression
: Identifier
| {enableWS();} expression '(' {disableWS();} expression? ')'
| '(' expression ')'
;
That way the second alternative matches the input WS-sensitive and will therefore only be matched if the identifier is directly followed by the bracket.
See here for the implementation of the MultiChannelTokenStream that is mentioned in the linked question.

Can I do something to avoid the need to backtrack in this grammar?

I am trying to implement an interpreter for a programming language, and ended up stumbling upon a case where I would need to backtrack, but my parser generator (ply, a lex&yacc clone written in Python) does not allow that
Here's the rules involved:
'var_access_start : super'
'var_access_start : NAME'
'var_access_name : DOT NAME'
'var_access_idx : OPSQR expression CLSQR'
'''callargs : callargs COMMA expression
| expression
| '''
'var_access_metcall : DOT NAME LPAREN callargs RPAREN'
'''var_access_token : var_access_name
| var_access_idx
| var_access_metcall'''
'''var_access_tokens : var_access_tokens var_access_token
| var_access_token'''
'''fornew_var_access_tokens : var_access_tokens var_access_name
| var_access_tokens var_access_idx
| var_access_name
| var_access_idx'''
'type_varref : var_access_start fornew_var_access_tokens'
'hard_varref : var_access_start var_access_tokens'
'easy_varref : var_access_start'
'varref : easy_varref'
'varref : hard_varref'
'typereference : NAME'
'typereference : type_varref'
'''expression : new typereference LPAREN callargs RPAREN'''
'var_decl_empty : NAME'
'var_decl_value : NAME EQUALS expression'
'''var_decl : var_decl_empty
| var_decl_value'''
'''var_decls : var_decls COMMA var_decl
| var_decl'''
'statement : var var_decls SEMIC'
The error occurs with statements of the form
var x = new SomeGuy.SomeOtherGuy();
where SomeGuy.SomeOtherGuy would be a valid variable that stores a type (types are first class objects) - and that type has a constructor with no arguments
What happens when parsing that expression is that the parser constructs a
var_access_start = SomeGuy
var_access_metcall = . SomeOtherGuy ( )
and then finds a semicolon and ends in an error state - I would clearly like the parser to backtrack, and try constructing an expression = new typereference(SomeGuy .SomeOtherGuy) LPAREN empty_list RPAREN and then things would work because the ; would match the var statement syntax all right
However, given that PLY does not support backtracking and I definitely do not have enough experience in parser generators to actually implement it myself - is there any change I can make to my grammar to work around the issue?
I have considered using -> instead of . as the "method call" operator, but I would rather not change the language just to appease the parser.
Also, I have methods as a form of "variable reference" so you can do
myObject.someMethod().aChildOfTheResult[0].doSomeOtherThing(1,2,3).helloWorld()
but if the grammar can be reworked to achieve the same effect, that would also work for me
Thanks!
I assume that your language includes expressions other than the ones you've included in the excerpt. I'm also going to assume that new, super and var are actually terminals.
The following is only a rough outline. For readability, I'm using bison syntax with quoted literals, but I don't think you'll have any trouble converting.
You say that "types are first-class values" but your syntax explicitly precludes using a method call to return a type. In fact, it also seems to preclude a method call returning a function, but that seems odd since it would imply that methods are not first-class values, even though types are. So I've simplified the grammar by allowing expressions like:
new foo.returns_method_which_returns_type()()()
It's easy enough to add the restrictions back in, but it makes the exposition harder to follow.
The basic idea is that to avoid forcing the parser to make a premature decision; once new is encountered, it is only possible to distinguish between a method call and a constructor call from the lookahead token. So we need to make sure that the same reductions are used up to that point, which means that when the open parenthesis is encountered, we must still retain both possibilities.
primary: NAME
| "super"
;
postfixed: primary
| postfixed '.' NAME
| postfixed '[' expression ']'
| postfixed '(' call_args ')' /* PRODUCTION 1 */
;
expression: postfixed
| "new" postfixed '(' call_args ')' /* PRODUCTION 2 */
/* | other stuff not relevant here */
;
/* Your callargs allows (,,,3). This one doesn't */
call_args : /* EMPTY */
| expression_list
;
expression_list: expression
| expression_list ',' expression
;
/* Another slightly simplified production */
var_decl: NAME
| NAME '=' expression
;
var_decl_list: var_decl
| var_decl_list ',' var_decl
;
statement: "var" var_decl_list ';'
/* | other stuff not relevant here */
;
Now, take a look at PRODUCTION 1 and PRODUCTION 2, which are very similar. (Marked with comments.) These are basically the ambiguity for which you sought backtracking. However, in this grammar, there is no issue, since once a new has been encountered, the reduction of PRODUCTION 2 can only be performed when the lookahead token is , or ;, while PRODUCTION 1 can only be performed with lookahead tokens ., ( and [.
(Grammar tested with bison, just to make sure there are no conflicts.)

Why if I add lambda left recursion occurs?

I am trying to write if syntax by using flex bison and in parser I have a problem
here is a grammar for if syntax in cpp
program : //start rule
|statements;
block:
TOKEN_BEGIN statements ';'TOKEN_END;
reexpression:
| TOKEN_OPERATOR expression;
expression: //x+2*a*3
TOKEN_ID reexpression
| TOKEN_NUMBER reexpression;
assignment:
TOKEN_ID'='expression
statement:
assignment;
statements:
statement';'
| block
| if_statement;
else_statement:
TOKEN_ELSE statements ;
else_if_statement:
TOKEN_ELSE_IF '(' expression ')' statements;
if_statement:
TOKEN_IF '(' expression ')' statements else_if_statement else_statement;
I can't understand why if I replace these three rules , left recursion happen I just add lambda to These rules
else_statement:
|TOKEN_ELSE statements ;
else_if_statement:
|TOKEN_ELSE_IF '(' expression ')' statements;
if_statement:
TOKEN_IF '(' expression ')' statements else_if_statement else_statement;
please help me understand.
There's no lambda or left-recursion involved.
When you add epsilon to the if rules (making the else optional), you get conflicts, because the resulting grammar is ambiguous. This is the classic dangling else ambiguity where when you have TWO ifs with a single else, the else can bind to either if.
IF ( expr1 ) IF ( expr2 ) block1 ELSE block2

Shift/reduce conflict in yacc due to look-ahead token limitation?

I've been trying to tackle a seemingly simple shift/reduce conflict with no avail. Naturally, the parser works fine if I just ignore the conflict, but I'd feel much safer if I reorganized my rules. Here, I've simplified a relatively complex grammar to the single conflict:
statement_list
: statement_list statement
|
;
statement
: lvalue '=' expression
| function
;
lvalue
: IDENTIFIER
| '(' expression ')'
;
expression
: lvalue
| function
;
function
: IDENTIFIER '(' ')'
;
With the verbose option in yacc, I get this output file describing the state with the mentioned conflict:
state 2
lvalue -> IDENTIFIER . (rule 5)
function -> IDENTIFIER . '(' ')' (rule 9)
'(' shift, and go to state 7
'(' [reduce using rule 5 (lvalue)]
$default reduce using rule 5 (lvalue)
Thank you for any assistance.
The problem is that this requires 2-token lookahead to know when it has reached the end of a statement. If you have input of the form:
ID = ID ( ID ) = ID
after parser shifts the second ID (lookahead is (), it doesn't know whether that's the end of the first statement (the ( is the beginning of a second statement), or this is a function. So it shifts (continuing to parse a function), which is the wrong thing to do with the example input above.
If you extend function to allow an argument inside the parenthesis and expression to allow actual expressions, things become worse, as the lookahead required is unbounded -- the parser needs to get all the way to the second = to determine that this is not a function call.
The basic problem here is that there's no helper punctuation to aid the parser in finding the end of a statement. Since text that is the beginning of a valid statement can also appear in the middle of a valid statement, finding statement boundaries is hard.

Resources