Problems with left-recursion - xtext

I have a little grammar containing a few commands which have to be used with Numbers and some of these commands return Numbers as well.
My grammar snippet looks like this:
Command:
name Numbers
| Numbers "test"
;
name:
"abs"
| "acos"
;
Numbers:
NUMBER
| numberReturn
;
numberReturn:
name Numbers
;
terminal NUMBER:
('0'..'9')+("."("0".."9")+)?
;
After I inserted the Numbers "test" alternative into the Command rule, the compiler complains about non-LL(*) decisions and tells me I have to work around them (left-factoring, syntactic predicates, backtracking). My problem is that I have no idea what kind of input would be non-LL(*) in this case, nor do I know how to left-factor my grammar (I don't want to turn on backtracking).
EDIT:
A few examples of what this grammar should match:
abs 3;
acos abs 4; //interpreted as "acos (abs 4)"
acos 3 test; //(acos 3) test
Best regards
Raven

The grammar you are trying to achieve is left-recursive; that means the parser does not know how to tell between (acos 10) test and acos (10 test) (without the parentheses). However, you can give the parser some hints for it to know the correct order, such as parenthesized expressions.
This would be a valid Xtext grammar, with parenthesized test expressions:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model
: operations += UnaryOperation*
;
UnaryOperation returns Expression
: 'abs' exp = Primary
| 'acos' exp = Primary
| '(' exp = Primary 'test' ')'
;
Primary returns Expression
: NumberLiteral
| UnaryOperation
;
NumberLiteral
: value = INT
;
The parser will correctly recognize expressions such as:
(acos abs (20 test) test)
acos abs 20
acos 20
(20 test)
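To see why the parenthesized form is unproblematic, the two rules above can be mirrored by a small hand-written recursive-descent recognizer. This is only a sketch in Python, not what Xtext generates: the function names and the tuple-shaped result are made up for illustration, and it parses a single operation rather than the whole Model.

```python
import re

# Tokens of the sketch: integers, the two operator keywords, 'test', parentheses.
TOKEN = re.compile(r"\s*(\d+|abs|acos|test|\(|\))")

def tokenize(text):
    """Split the input into keywords, integers and parentheses."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected input at {pos}: {text[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_unary(tokens, i):
    """UnaryOperation: 'abs' Primary | 'acos' Primary | '(' Primary 'test' ')'"""
    tok = tokens[i] if i < len(tokens) else None
    if tok in ("abs", "acos"):
        operand, j = parse_primary(tokens, i + 1)
        return (tok, operand), j
    if tok == "(":
        operand, j = parse_primary(tokens, i + 1)
        if tokens[j:j + 2] != ["test", ")"]:
            raise SyntaxError("expected 'test' ')'")
        return ("test", operand), j + 2
    raise SyntaxError(f"unexpected token {tok!r}")

def parse_primary(tokens, i):
    """Primary: NumberLiteral | UnaryOperation"""
    tok = tokens[i] if i < len(tokens) else None
    if tok is not None and tok.isdigit():
        return int(tok), i + 1
    return parse_unary(tokens, i)

def parse(text):
    """Parse exactly one UnaryOperation and reject trailing input."""
    tokens = tokenize(text)
    tree, i = parse_unary(tokens, 0)
    if i != len(tokens):
        raise SyntaxError("trailing input")
    return tree
```

Every alternative starts with a distinct token ('abs', 'acos', '(' or a number), so one token of lookahead is enough to choose. The unparenthesized input acos 3 test is rejected by this sketch, because 'test' is only accepted before a closing parenthesis.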
These articles may be helpful for you:
https://dslmeinte.wordpress.com/tag/unary-operator/
http://blog.efftinge.de/2010/08/parsing-expressions-with-xtext.html

Related

Cannot seem to resolve ambiguity in Antlr Grammar

I am creating the simplest grammar possible that basically recognizes arithmetic expressions. The grammar needs to correctly follow arithmetic operators precedence rules (PEMDAS), and for that I placed expr ('*'|'/') term before expr ('+'|'-') term to ensure this precedence.
This is the arithmetic.g4 file that I have:
/*Productions */
expr: expr ('*'|'/') term
| expr ('+'|'-') term
| term
;
term: '('expr')'
| ID
| NUM
;
/*Tokens */
ID: [a-z]+;
NUM: [0-9]+;
WS: [ \t\r\n]+ -> skip;
The output of the grammar is however not what it should be. For example for the arithmetic expression 4 * (3 + 10) I get the below parse tree (which is absolutely not correct):
Any suggestions on how I can change the grammar to get what I am looking for? I am new to ANTLR and am not sure what mistake I am making. (By the way, my OS is Windows.)
(I'm assuming that you've made a mistake in your example (which looks fine) and you really meant that you're getting the wrong tree for the input 4 + 3 * 10, so that's what I'm going to answer. If that's not what you meant, please clarify.)
You're right that ANTLR resolves ambiguities based on the order of rules, but that does not apply to your grammar because your grammar is not ambiguous. For an input like 4 + 3 * 10, there's only one way to parse it according to your grammar: with * being the outer operator, with 4 + 3 as its left and 10 as its right operand. The correct way (+ as the outer operator with 3 * 10 as the right operand) doesn't work with your grammar because 3 * 10 is not a valid term and the right operand needs to be a term according to your grammar.
In order to get an ambiguity that's resolved in the way you want, you'll need to make both operands of your operators exprs.
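In ANTLR 4 that would mean writing the operator alternatives as expr ('*'|'/') expr and expr ('+'|'-') expr, so that alternative order decides precedence. The effect can be illustrated outside ANTLR with a small precedence-climbing parser. The Python below is a hand-written sketch, not the code ANTLR generates; the PRECEDENCE table and the tuple-shaped tree are assumptions made for the example.

```python
import re

# Higher number binds tighter; this plays the role of alternative order in the grammar.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def tokenize(text):
    """Split into NUM, ID, operators and parentheses; whitespace is skipped."""
    return re.findall(r"\d+|[a-z]+|[-+*/()]", text)

def parse_term(tokens, pos):
    """term: '(' expr ')' | ID | NUM"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        inner, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return inner, pos + 1
    if not tok.isalnum():
        raise SyntaxError(f"unexpected token {tok!r}")
    return tok, pos + 1

def parse_expr(tokens, pos=0, min_prec=1):
    """Both operands of an operator are full expressions, limited by precedence."""
    left, pos = parse_term(tokens, pos)
    while (pos < len(tokens) and tokens[pos] in PRECEDENCE
           and PRECEDENCE[tokens[pos]] >= min_prec):
        op = tokens[pos]
        # The right operand may only contain operators that bind tighter than op,
        # which also makes operators of equal precedence left-associative.
        right, pos = parse_expr(tokens, pos + 1, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left, pos

def parse(text):
    tokens = tokenize(text)
    tree, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree
```

With this sketch, 4 + 3 * 10 yields a tree whose outer operator is +, with 3 * 10 as its right operand, which is the shape the question was after.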

xtext not accepting string constant - expecting RULE_ID

I have tried to cut down my problem to the simplest problem I can in xtext - I would like to use the following grammar:
M: lines += T*;
T:
DT
| BDT
| N
;
BDT:
name = ('a' | 'b' | 'c')
;
DT:
'd' name=ID
('(' (ts += BDT (','ts += BDT)*) ')')?
;
N:
'n' name=ID ':' type=[T]
;
I am intending to parse expressions of the form d f(a,b,b) for example which works fine. I would also like to be able to parse n g:f which also works, but not n g:a - where a here is part of the BDT rule. The error given is "Missing RULE_ID at 'a'".
I'd like to allow the grammar to parse n g:a for example, and I'd be very grateful if anyone could point out where I'm going wrong here on this very simple grammar.
Lexing is done context-free. A keyword can never be an ID. You can address this through parser rules.
You can introduce a datatype rule:
MyID: ID | "a" | ... | "c";
And use it wherever you currently use ID.
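The effect of context-free lexing can be imitated in a few lines of Python. This is a simplified sketch, not Xtext's lexer: the KEYWORDS set, the token kinds and both matcher functions are hypothetical names, and the set {'a', 'b', 'c'} is taken from the BDT rule in the question. In the Xtext grammar itself, the cross-reference would presumably become type=[T|MyID].

```python
import re

# Every literal used in the grammar becomes a keyword, regardless of where it appears.
KEYWORDS = {"a", "b", "c", "d", "n"}

def lex(text):
    """Tokenize without any parser context: a keyword never becomes an ID."""
    tokens = []
    for word in re.findall(r"[A-Za-z_]\w*|[():,]", text):
        if word in KEYWORDS:
            tokens.append(("KEYWORD", word))
        elif word[0].isalpha() or word[0] == "_":
            tokens.append(("ID", word))
        else:
            tokens.append(("SYMBOL", word))
    return tokens

def match_id(token):
    """Plain ID reference: only an ID token is accepted."""
    return token[0] == "ID"

def match_my_id(token):
    """MyID-style datatype rule: an ID token or one of the keywords 'a', 'b', 'c'."""
    kind, text = token
    return kind == "ID" or (kind == "KEYWORD" and text in {"a", "b", "c"})
```

For n g:a the last token comes out as a keyword, so match_id rejects it (the "Missing RULE_ID" situation), while match_my_id accepts it. The decision is made at the parser level, after lexing has already happened.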

Bison subscript expression unexpected error

With the following grammar:
program: /*empty*/ | stmt program;
stmt: var_decl | assignment;
var_decl: type ID '=' expr ';';
assignment: expr '=' expr ';';
type: ID | ID '[' NUMBER ']';
expr: ID | NUMBER | subscript_expr;
subscript_expr: expr '[' expr ']';
I'd expect the following to be valid:
array[5] = 0;
That's just an assignment with a subscript_expr on the left-hand-side. However the generated parser gives an error for that statement:
syntax error, unexpected '=', expecting ID
Generating the parser also warns that there's 1 shift/reduce conflict. Removing subscript_expr makes it go away.
Why does this happen and how can I get it to parse array[5] = 0; as an assignment with a subscript_expr?
I'm using Bison 2.3.
The following two statements are both valid in your language:
x [ 3 ] = 42;
x [ 3 ] y = 42;
The first is an assignment of an element of the array variable x, while the second is a declaration and initialization of the array variable y whose elements are of type x.
But from the parser's viewpoint, x and y are both just IDs; it has no way of knowing that x is a variable in the first case and a type in the second case. All it can do is notice that the two statements match the productions assignment and var_decl, respectively.
Unfortunately, it cannot do that until it sees the token after the ]. If that token is an ID, then the statement must be a var_decl; otherwise, it's an assignment (assuming the statement is valid, of course).
But in order to parse the statement as an assignment, the parser must be able to produce
expr '=' expr
which in this case is the result of expr: subscript_expr, which in turn is the result of subscript_expr: expr '[' expr ']'.
So the set of reductions for the first statement will be as follows: (Note: I didn't write the shifts; rather, I mark the progress of the parse by putting a • at the end of each reduction. To get to the next step, just shift the • until you reach the end of the handle.)
ID • [ NUMBER ] = NUMBER ;        expr: ID
expr [ NUMBER • ] = NUMBER ;      expr: NUMBER
expr [ expr ] • = NUMBER ;        subscript_expr: expr '[' expr ']'
subscript_expr • = NUMBER ;       expr: subscript_expr
expr = NUMBER • ;                 expr: NUMBER
expr = expr ; •                   assignment: expr '=' expr ';'
assignment
The second statement must be parsed as follows:
ID [ NUMBER ] • ID = NUMBER ;     type: ID '[' NUMBER ']'
type ID = NUMBER • ;              expr: NUMBER
type ID = expr ; •                var_decl: type ID '=' expr ';'
var_decl
That's a shift/reduce conflict, because the crucial decision must be made immediately after the first ID. In the first statement, we need to reduce the identifier to an expr. In the second statement, we must continue shifting until we are ready to reduce a type.
Of course, this problem wouldn't exist if we could lexically distinguish type IDs from variable name IDs, but that may not be possible (or, if possible, it may not be desirable because it requires feedback from the parser to the lexer).
As written, the shift/reduce prediction can be made with fixed lookahead, since the fourth token after the ID will determine the possibilities. That makes the grammar LALR(4), but that doesn't help much since bison only implements LALR(1) parsers. In any case, it is likely that a less simplified grammar will not be fixed-lookahead, for example if constant expressions are allowed for array sizes, or if arrays can have multiple dimensions.
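The fixed-lookahead argument can be made concrete with a small Python sketch. It is not how bison works internally; tokenize and classify are hypothetical helpers that only handle statements beginning with ID '[' NUMBER ']', and they show that the token after the ] is what separates the two productions.

```python
import re

TOKEN = re.compile(
    r"\s*(?:(?P<ID>[A-Za-z_]\w*)|(?P<NUMBER>\d+)|(?P<PUNCT>[\[\]=;]))"
)

def tokenize(text):
    """Return (kind, text) pairs; punctuation uses its own text as the kind."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected input at {pos}: {text[pos:]!r}")
        kind = m.lastgroup
        value = m.group(kind)
        tokens.append((value if kind == "PUNCT" else kind, value))
        pos = m.end()
    return tokens

def classify(tokens):
    """Decide between var_decl and assignment for ID '[' NUMBER ']' ... input."""
    kinds = [kind for kind, _ in tokens]
    if kinds[:4] != ["ID", "[", "NUMBER", "]"]:
        raise ValueError("sketch only handles statements starting with ID [ NUMBER ]")
    # The fourth token after the leading ID decides: another ID means a declaration.
    if len(kinds) > 4 and kinds[4] == "ID":
        return "var_decl"
    return "assignment"
```

An LALR(1) parser has to commit right after the first ID, with only the [ visible, which is why it cannot make this decision on its own.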
Even so, the grammar is not ambiguous, so it is possible to use a GLR parser to parse it. Bison does implement GLR parsers; it is only necessary to insert
%glr-parser
into the prologue. (The shift/reduce warning will still be produced, but the parser will correctly identify both kinds of statement.)
It's worth noting that C doesn't have this particular parsing problem precisely because it puts the array size after the name of the variable being declared. I don't believe this was done to avoid parsing problems (although who knows?) but rather because it was believed that it is more natural to write declarations the way variables are used. Hence, we write int a[3] and char *p, because in the program we will dereference using a[i] and *p.
It is possible to write an LALR(1) grammar for this syntax, but it's a bit annoying. The key is to delay the reduction of the syntax ID [ NUMBER ] until we know for sure which production it will be the start of. That means we need to include the production expr: ID '[' NUMBER ']'. That will result in a larger number of shift/reduce warnings (since it makes the grammar ambiguous), but since bison always prefers to shift, it should produce a correct parser.
Adding %glr-parser solves this.

ANTLR 4 Parser Grammar

How can I improve my parser grammar so that, instead of creating an AST that contains a couple of decFunc rules for my test code, it creates only one, with sum becoming the second root? I have tried to solve this in several different ways, but I always get a left-recursion error.
This is my testing code :
f :: [Int] -> [Int] -> [Int]
f x y = zipWith (sum) x y
sum :: [Int] -> [Int]
sum a = foldr(+) a
This image shows the parse tree with the two decFunc nodes:
http://postimg.org/image/w5goph9b7/
This is my grammar:
prog : stat+;
stat : decFunc | impFunc ;
decFunc : ID '::' formalType ( ARROW formalType )* NL impFunc
;
anotherFunc : ID+;
formalType : 'Int' | '[' formalType ']' ;
impFunc : ID+ '=' hr NL
;
hr : 'map' '(' ID* ')' ID*
| 'zipWith' '(' ('*' |'/' |'+' |'-') ')' ID+ | 'zipWith' '(' anotherFunc ')' ID+
| 'foldr' '(' ('*' |'/' |'+' |'-') ')' ID+
| hr op=('*'| '/' | '.&.' | 'xor' ) hr | DIGIT
| 'shiftL' hr hr | 'shiftR' hr hr
| hr op=('+'| '-') hr | DIGIT
| '(' hr ')'
| ID '(' ID* ')'
| ID
;
Your test input contains two instances of content that will match the decFunc rule. The generated parse-tree shows exactly that: two sub-trees, each having a decFunc as the root.
Antlr v4 will not produce a true AST where f and sum are the roots of separate sub-trees.
Is there anything I can do with the grammar to make both f and sum roots? – Jonny Magnam
Not directly in an Antlr v4 grammar. You could:
switch to Antlr v3, or another parser tool, and define the generated AST as you wish.
walk the Antlr v4 parse-tree and create a separate AST of your desired form.
just use the parse-tree directly with the realization that it is informationally equivalent to a classically defined AST and the implementation provides a number of practical benefits.
Specifically, the standard academic AST is mutable, meaning that every (or all but the first) visitor is custom, not generated, and that any change in the underlying grammar or an interim structure of the AST will require reconsideration and likely changes to every subsequent visitor and their implemented logic.
The Antlr v4 parse-tree is essentially immutable, allowing decorations to be accumulated against tree nodes without loss of relational integrity. Visitors all use a common base structure, greatly reducing brittleness due to grammar changes and effects of prior executed visitors. As a practical matter, tree-walks are easily constructed, fast, and mutually independent except where expressly desired. They can achieve a greater separation of concerns in design and easier code maintenance in practice.
Choose the right tool for the whole job, in whatever way you define it.
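The second option, walking the parse-tree to build a separate AST, might look roughly like the following. This is a Python sketch over a hypothetical stand-in for the parse tree: nodes are (rule_name, children) tuples and tokens are plain strings. It does not use the ANTLR runtime API, and the dictionary layout of the result is an assumption chosen so that each function name becomes the root of its own entry.

```python
def leaves(node):
    """Flatten a subtree into its token texts, left to right."""
    if isinstance(node, str):
        return [node]
    _rule, children = node
    return [text for child in children for text in leaves(child)]

def build_ast(prog):
    """Turn each decFunc subtree into one entry rooted at the function name."""
    functions = []
    for _stat, stat_children in prog[1]:
        for child in stat_children:
            if isinstance(child, str) or child[0] != "decFunc":
                continue
            children = child[1]
            subtrees = [c for c in children if not isinstance(c, str)]
            functions.append({
                "name": children[0],  # the leading ID token
                "types": ["".join(leaves(c)) for c in subtrees if c[0] == "formalType"],
                "body": [t for c in subtrees if c[0] == "impFunc" for t in leaves(c)],
            })
    return functions

# Hand-built tree for the 'sum' declaration from the question (NL tokens omitted).
tree = ("prog", [
    ("stat", [("decFunc", [
        "sum", "::",
        ("formalType", ["[", ("formalType", ["Int"]), "]"]), "->",
        ("formalType", ["[", ("formalType", ["Int"]), "]"]),
        ("impFunc", ["sum", "a", "=", ("hr", ["foldr", "(", "+", ")", "a"])]),
    ])]),
])
```

The parse tree is left untouched and the AST is an independent structure derived from it, which keeps the walk decoupled from later passes.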

Parsing floating-point number and ranges separated by two periods with ANTLR 3

I am working on a parser for a DSL that has two currently 'conflicting' features:
Floating-point numbers like 123.4.
Ranges specified like ID[2..5] (ID is defined as 'a'..'z'+ and doesn't matter much; the part '[2..5]' matters most).
The test grammar that should parse it looks as follows:
grammar DotTest;
span returns [double value]
: ID'['e=INT'..'f=INT']' { /*some code to process the values*/ $value = (double)(Int32.Parse($e.text) + Int32.Parse($f.text)); } ;
num returns [double value]
: DOUBLE {$value = double.Parse($DOUBLE.text); } ;
INT : '0'..'9'+ ;
DOUBLE : '0'..'9'+'.''0'..'9'+ ;
ID : 'a'..'z'+ ;
WS : ( ' ' | '\t' | '\r' | '\n' ) {$channel=HIDDEN;} ;
The problem: the rule span cannot parse its input correctly, because it conflicts with the DOUBLE token. The lexer tries to match 2..5 as a DOUBLE and fails. Here is how it looks in ANTLRWorks:
What will be the correct way to solve this conflict and parse the two INTs in the span correctly?
P.S. I'm using ANTLR 3 and not ANTLR 4 as I'm going to generate a C# parser, which is not currently implemented in ANTLR 4.
This solution (the second grammar) works fine. After I transformed the lexer rules to the following:
NUM : (INT RNG)=> INT {$type=INT;}
| (DOUBLE)=> DOUBLE {$type=DOUBLE;}
| INT {$type=INT;};
fragment INT : '0'..'9'+ ;
fragment DOUBLE : '0'..'9'+'.''0'..'9'+ ;
RNG: '..' ;
parsing of intervals like 1..2 started working smoothly.
The DOUBLE rule you posted above does not conflict with the .. operator since the '0'..'9'+ following the '.' contains at least one digit. The following alternate definition of DOUBLE would in fact conflict:
DOUBLE : '0'..'9'+ '.' '0'..'9'*;
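The difference between the two DOUBLE definitions can be tried out with an idealized tokenizer. The Python below is a sketch built on ordered regex alternation with backtracking, which is not how the ANTLR 3 lexer makes its decisions; make_lexer and the token names other than INT, DOUBLE, ID and RNG are assumptions for the example.

```python
import re

def make_lexer(double_pattern):
    """Build a tokenizer whose DOUBLE token uses the given regex."""
    token_re = re.compile(
        rf"(?P<DOUBLE>{double_pattern})|(?P<INT>\d+)|(?P<RNG>\.\.)"
        rf"|(?P<ID>[a-z]+)|(?P<PUNCT>[\[\]])|(?P<WS>\s+)"
    )

    def lex(text):
        tokens, pos = [], 0
        while pos < len(text):
            m = token_re.match(text, pos)
            if m is None:
                raise SyntaxError(f"no token at position {pos}: {text[pos:]!r}")
            if m.lastgroup != "WS":  # whitespace is dropped, like the hidden channel
                tokens.append((m.lastgroup, m.group()))
            pos = m.end()
        return tokens

    return lex

# DOUBLE as posted in the question: at least one digit after the '.'.
strict = make_lexer(r"\d+\.\d+")
# The alternate definition: digits after the '.' are optional.
loose = make_lexer(r"\d+\.\d*")
```

With the strict pattern, a[2..5] splits into ID, '[', INT, RNG, INT, ']', since 2. followed by another . cannot be a DOUBLE. With the loose pattern, 2. is consumed as a DOUBLE and the remaining .5 no longer forms a token, which is the conflict described above.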
I suspect you are using the interpreter in ANTLRWorks, which is known to give incorrect results in many cases.
