I am trying to understand a xtext grammar I have found (below). I have two questions:
The XFeatureCall has return Type XExpression but this is overruled by {XFeatureCall} so I could set "returns XFeatureCall" as well?. Or is it actually necessary to do it this way?
Line 8 and 14 start with "=>". Are these "chosen predicates" or something else that did not come to my attention so far? I could not find this variation of chosen predicates in the xtext documentation. So I would appreciate clarification in its application.
xtext grammar:
StaticEquals:':=';
XFeatureCall returns XExpression:
// Same as Xbase...
{XFeatureCall}
(declaringType=[JvmDeclaredType|StaticQualifier])?
('<' typeArguments+=JvmArgumentTypeReference (',' typeArguments+=JvmArgumentTypeReference)* '>')?
(feature=[JvmIdentifiableElement|IdOrSuper]|'class')
(=>explicitOperationCall?='('
(
featureCallArguments+=XShortClosure
| featureCallArguments+=XExpression (',' featureCallArguments+=XExpression)*
)?
')')?
=>featureCallArguments+=XClosure?
// ... Except with this additional optional clause that allows static members to be set with := operator
({XAssignment.assignable = current} StaticEquals value = XAssignment)?;
First question: In fact in this case your rule returns a XFeatureCall but XFeatureCall has XExpression as its supertype. It is useful for example if you have:
SomeRule: (parts+=XFeatureCall)* (parts+=XOtherFeatureCall)*
Let XOtherFeatureCall also extend XExpression, and parts be a list of XExpressions.
Second question: it is a priority operator and means that what follows should be parsed now, even if there are other parsing solutions. See this classic example:
if a
if b
do;
else
doelse;
else could be parsed for the inner if or the outer if. Of course we want it in the inner if. Setting a rule such as:
=>'else' else=ElseExpression
forces the grammar to parse the else as soon as it finds it instead of returning to the outer rule that could consume a else too. So it solves an ambiguity.
Related
Here's my grammar to the for statements:
FOR x>0 {
//somthing
}
// or
FOR x = 0; x > 0; x++ {
//somthing
}
it has the same prefix FOR, and I'd want to print the for_begin label after InitExpression,
however the codes right after FOR will become useless because of confliction.
ForStmt
: FOR {
printf("for_begin_%d:\n", n);
} Expression {
printf("ifeq for_exit_%d\n", n);
} ForBlock
| FOR ForClause ForBlock
;
ForClause
: InitExpression ';' {
printf("for_begin_%d:\n", n);
} Expression ';' Expression { printf("ifeq for_exit_%d\n", n); }
;
I had tried to change it to something like:
ForStart
: FOR
| FOR InitExpression
;
or use a flag to mention where to print the for_begin label,
but also fail to resolve the conflict.
How to make it not conflict?
How can the parser know which alternative of the FOR statement it sees?
While it's possible that an InitExpression has identifiable form, such as an assignment statement, which could not be used in a conditional expression. That strikes me as too restrictive for practical purposes -- there are many things you might do to initialise a loop other than a direct assignment -- but leaving that aside, it means that the earliest the InitExpression can be definitively identified is when the assignment operator is seen. If lvalues in your language can only be simple identifiers, that would make it the second lookahead token after the FOR, but in most useful language lvalues can be much more complicated than just simple identifiers, and so it's likely that the InitExpression cannot be definitively identified with finite lookahead.
But it's more likely that the only significant difference between the two forms is that the expression in the first form is followed by a block (which I suppose cannot start with a semicolon) and the first expression in the second form is followed by a semicolon. So the parser knows what it is parsing at the end of the first expression and no earlier.
Normally, that would not cause a problem. Were it not for the MidRule Action which inserts a label, the parser does not have to make a reduction decision until it reaches the end of the first expression, at which point it needs to decide whether to reduce the first expression as an InitExpression or an Expression. But at that point, the lookahead token as either a semicolon or the first token of a block, so the lookahead token can guide the decision.
But the Mid-Rule Action makes that impossible. The Mid-Rule Action must either be reduced or not before shifting the token which immediately follows the FOR token, and -- as your examples show -- the lookahead token could be the same (i) in both cases.
Fundamentally, the issue is that you want to build a one-pass compiler rather than just parsing the input into an AST and then walking the AST to generate assembler code (possibly after doing some other traverses over the AST in order to perform other analyses and allow for code optimisation). The one-pass code generator depends on Mid-Rule Actions, and Mid-Rule Actions in turn can easily generate unresolvable parsing conflicts. This issue is so notorious that there is a chapter in the bison manual dedicated to it, which is well worth reading.
So there is no good solution. But in this case, there is a simple solution, because the action you want to take is just to insert a label, and inserting a label which happens never to be used is not in any way going to affect the code which will ultimately be executed. So you might as well insert a label immediately after the FOR statement, whether you will need it or not, and then insert another label after the InitExpression if it turns out that there was such a thing. You don't need to actually know which label to use until you reach the end of the conditional expression, which is much later.
As explained in the Bison manual chapter I already linked to, this cannot be done using Mid-Rule Actions, because Bison doesn't attempt to compare Mid-Rule Actions with each other. Even if two actions happen to be identical, Bison will still need to decide which one to execute, thereby generating a conflict. So instead of using an MRA, you need to house the action in a marker non-terminal -- a non-terminal with an empty right-hand side, used only to trigger an action.
That would make the grammar look something like this:
ForLabel
: %empty { $$ = n; printf("for_begin_%d:\n", n++); }
ForStmt
: FOR
ForLabel[label]
Expression { printf("ifeq for_exit_%d\n", label); }
ForBlock { printf("jmp for_begin_%d\n", label);
printf("for_exit_%d:\n", label); }
| FOR
ForLabel
InitExpress ';'
ForLabel[label]
Expression ';'
Expression { printf("ifeq for_exit_%d\n", label); }
ForBlock { printf("jmp for_begin_%d\n", label);
printf("for_exit_%d:\n", label); }
;
([label] gives a name to a semantic value, which avoids having to use a rather mysterious and possibly incorrect $2 or $6. See Named References in the handy Bison manual.)
Could someone help me with using context free grammars. Up until now I've used regular expressions to remove comments, block comments and empty lines from a string so that it can be used to count the PLOC. This seems to be extremely slow so I was looking for a different more efficient method.
I saw the following post: What is the best way to ignore comments in a java file with Rascal?
I have no idea how to use this, the help doesn't get me far as well. When I try to define the line used in the post I immediately get an error.
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to setup such a context free grammar and then to actually extract the wanted data?
Kind regards,
Bob
First this will help: the ~ in Rascal CFG notation is not in the language, the negation of a character class is written like so: ![\n].
To use a context-free grammar in Rascal goes in three steps:
write it, like for example the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Use it to parse input, like so:
// This is the basic parse command, but be careful it will not accept spaces and newlines before and after the TopNonTerminal text:
Prog myParseTree = parse(#Prog, "example string");
// you can do the same directly to an input file:
Prog myParseTree = parse(#TopNonTerminal, |home:///myProgram.func|);
// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[TopNonTerminal], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;
// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Once you have the tree you can start using visit and / deepmatch to extract information from the tree, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func , but here are some common idioms as well to extract information from a parse tree:
// produces the source location of each node in the tree:
myParseTree#\loc
// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }
// pattern match an if-then-else and bind the three expressions and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }
// collect all locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree. It uses |unknown:///| for small sub-trees which have not been annotated for efficiency's sake, like literals and character classes:
[ t#\loc?|unknown:///| | /Tree t := myParseTree ]
That should give you a start. I'd go try out some stuff and look at more examples. Writing a grammar is a nice thing to do, but it does require some trial and error methods like writing a regex, but even more so.
For the grammar you might be writing, which finds source code comments but leaves the rest as "any character" you will need to use the longest match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is called an "Island Grammar" metaphorically, because you will write precise rules for the parts you want to recognize (the comments are "Islands") while leaving the rest as everything else (the rest is "Water"). See https://dl.acm.org/citation.cfm?id=837160
Using ANTLR 3, my lexer has rule
SELECT_ASSIGN:
'SELECT' WS+ IDENTIFIER WS+ 'ASSIGN' WS+ (('TO'|'USING') WS+)?
using this these match correctly
SELECT VAR1 ASSIGN TO
SELECT VAR1 ASSIGN USING
and this also matches
SELECT VAR1 ASSIGN FOO
However this does not match
SELECT VAR1 ASSIGN TWO
Whereas I have marked TO|USING as optional in the rule.
From generated Java code I see...
When lexer notices T of TWO, it goes to match('TO')
but since does not find O after T
then generates failure.... and returns all the way from the rule -- hence not matching it.
How do I get my lexer rule to match, when input has word with chars starting with suffixed optional part of the rule
Basically I want my rule to match this also (beside what it already matches - as lised at the start):
SELECT VAR1 ASSIGN TWO
Kindly suggest how I approach/resolve this situation.
NOTE:
Such rules are recommended in the parser - But I have this in lexer - because I do not want to parse the entire input by the parser, and want to parse only content of interest. So using such rules in lexer, I locate sections which I really want to parse by the parser.
UPDATE 1
I could circumvent this problem by making 2 rules, like so:
SELECT_ASSIGN_USING_TO
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN' WS+ ('USING'|'TO')
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
But is it possible to do the desired in one lexer rule?
An approach to get this in one rule, suggested by my senior - use syntactic predicate
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
(
(WS+ ('TO'|'USING') WS+)=> (WS+ ('TO'|'USING') WS+)
| (WS+)
)
Tokens match a complete char sequence or none. It cannot match partially and the grammar rule determines which exactly. You cannot expect a rule for TO to match TWO. If you want TWO to match too you have to add it to your lexer rule.
A few notes here:
The solution your "senior" gave you makes no sense at all. A
syntactic predicate is a kinda lookahead to guide the parser in case
of ambiquities. There are no ambiquities involved here.
Writing
the entire SELECT_ASSIGN rule as a lexer rule is very uncommon and
not flexible. A lexer rule should not be used for entire sentences,
but only for a small set of characters to find tokens to assign them
a type (usually elementary structures of a language like string,
number, comment etc.).
ANTLR3 is totally outdated and I wonder why this is still used in your class. ANTLR4 is out since 5 years and should be the choice for any new project.
This syntax module is syntactically valid:
module mod1
syntax Empty =
;
And so is this one, which should be an equivalent grammar to the previous one:
module mod2
syntax Empty =
( )
;
(The resulting parser accepts only empty strings.)
Which means that you can make grammars such as this one:
module mod3
syntax EmptyOrKitchen =
( ) | "kitchen"
;
But, the following is not allowed (nested parenthesis):
module mod4
syntax Empty =
(( ))
;
I would have guessed that redundant parenthesis are allowed, since they are allowed in things like expressions, e.g. ((2)) + 2.
This problem came up when working with the data types for internal representation of rascal syntax definitions. The following code will create the same module as in the last example, namely mod4 (modulo some whitespace):
import Grammar;
import lang::rascal::format::Grammar;
str sm1 = definition2rascal(\definition("unknown_main",("the-module":\module("unknown",{},{},grammar({sort("Empty")},(sort("Empty"):prod(sort("Empty"),[
alt({seq([])})
],{})))))));
The problematic part of the data is on its own line - alt({seq([])}). If this code is changed to seq([]), then you get the same syntax module as mod2. If you further delete this whole expression, i.e. so that you get this:
str sm3 =
definition2rascal(\definition("unknown_main",("the-module":\module("unknown",{},{},grammar({sort("Empty")},(sort("Empty"):prod(sort("Empty"),[
], {})))))));
Then you get mod1.
So should such redundant parenthesis by printed by the definition2rascal(...) function? And should it matter with regards to making the resulting module valid or not?
Why they are not allowed is basically we wanted to see if we could do without. There is currently no priority relation between the symbol kinds, so in general there is no need to have a bracket syntax (like you do need to + and * in expressions).
Already the brackets have two different semantics, one () being the epsilon symbol and two (Sym1 Sym2 ...) being a nested sequence. This nested sequence is defined (syntactically) to expect at least two symbols. Now we could without ambiguity introduce a third semantics for the brackets with a single symbol or relax the requirement for sequence... But we reckoned it would be confusing that in one case you would get an extra layer in the resulting parse tree (sequence), while in the other case you would not (ignored superfluous bracket).
More detailed wise, the problem of printing seq([]) is not so much a problem of the meta syntax but rather that the backing abstract notation is more relaxed than the concrete notation (i.e. it is a bigger language or an over-approximation). The parser generator will generate a working parser for seq([]). But, there is no Rascal notation for an empty sequence and I guess the pretty printer should throw an exception.
I was writing a parser to parse C-like grammars.
First, it could now parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
Where the new line is processed by the lexer that written by myself(the code are simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
the instruction :PASS here is equivalent to return nothing in LEX, which drop current matched text and skip to the next rule, just like what is usually done with whitespaces.
Because of this, I can't just simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
| '\n'
;
So I chose to change the lexer's state dynamically by the parser correspondingly.
Like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
And add a lexer rule that can match new line with the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
Which means when the lexer is under :STATEMENT_END state. it will first increase the line number as usual, and then set the state into initial one, and then pretend itself is a semicolon.
It's strange that it doesn't actually work with following code:
a = 1
b = 2
I debugged it and got it is not actually get a ';' as expect when scanned the newline after the number 1, and the state specified rule is not really executed.
And the code to set the new state is executed after it already scanned the new line and returned nothing, that means, these works is done as following order:
scan a, = and 1
scan newline and skip, so get the next value b
the inserted code({ state = :STATEMENT_END }) is executed
raising error -- unexpected b here
This is what I expect:
scan a, = and 1
found that it matches the rule expr, so reduce into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ; according the new state matching rule
continue to scan & parse the following line
After introspection I found that might caused as YACC uses LALR(1), this parser will read forward for one token first. When it scans to there, the state is not set yet, so it cannot get a correct token.
My question is: how to make it work as expected? I have no idea on this.
Thanks.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
-c;
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
(c);
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parenthesis, so you end up needing an additional state var to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or defer returning the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis seen (+1 for (, -1 for )), and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (ID or constant or ) or postfix-only operator), return the extra ;
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (its now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.