AntlrWorks 2 output - antlrworks

So I am using ANTLRWorks 2, working on a rather large grammar. The problem is that this grammar contains multiple ambiguities that I am trying to work through.
I was wondering if there is a way to interpret which rules were invoked when there was a failure.
For instance, when I run my rule I get the following output:
[#0,0:1='99',<20>,1:0]
[#1,2:1='<EOF>',<-1>,1:2]
line 1:0 mismatched input '99' expecting Digit2
(dummy 99)
I am wondering what [#0,0:1='99',<20>,1:0] means. Do the #0 or the <20> have any relationship to the rule numbers in my grammar or something?

Here is a breakdown of the default token formatting.
[#{TokenIndex},{StartIndex}:{StopIndex}='{Text}',<{TokenType}>,{Line}:{Column}]
So [#0,0:1='99',<20>,1:0] is token index 0, spanning characters 0 through 1, with text '99', token type 20, on line 1 at column 0.
The {TokenType} field generally corresponds to a particular lexer rule (the constant will be declared in your generated lexer). However, the -> type(X) command can be used in any lexer rule to reassign tokens produced by that rule to another type. If the value 20 is assigned to the token named Foo, then the first token in your listing was produced by one of the following: a lexer rule named Foo, a lexer rule containing the action -> type(Foo), or a user-defined action which explicitly assigns the type Foo to a token produced by some other rule (that last case would be code you wrote, not code generated by ANTLR).
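As a small illustration of the -> type(X) command (a sketch with made-up rule names, not rules from your grammar), both of the following lexer rules emit tokens that report the same type:
Number    : [0-9]+ ;                            // tokens produced here have type Number
HexNumber : '0x' [0-9a-fA-F]+ -> type(Number) ; // reassigned, so these also show up as <Number>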

Related

Handling arbitrary text blocks in an Xtext grammar

In an effort to better understand Xtext, I'm working on writing a grammar and have hit a roadblock. I've boiled it down to the following scenario. I have some input such as this:
thing {abc}
{def}
There may be keywords (e.g. 'thing') followed by other language elements (e.g. ID) in braces. Or, there can just be a block of content inside braces. This content should simply be passed along to the parser en masse.
If I try something like this:
Model: (things+=AThing | blocks+=ABlock)*;
AThing : 'thing' '{' name = ID '}';
ABlock : block=BLOCK;
terminal BLOCK:'{' -> '}';
and parse the sample text above, I get an error:
'mismatched input '{abc}' expecting '{'' on ABlock, offset 6, length 5
So, '{abc}' is being matched by the BLOCK terminal rule, which I understand. But how do I alter the grammar to properly handle the sample input? I've been wrestling with this problem for a while and have come up empty. So it's either something very simple that I've missed, or the problem is really complex and I don't realize it. Any enlightenment would be greatly appreciated.
Parsing happens in two stages: lexing and parsing. In the first stage the text input is divided into tokens; in the second stage the tokens are matched against the parser rules. Broadly it works something like this (with some arbitrary language):
1st phase:
text:    class  X   {   this  ;   }
         -----  --- --- ----  --- ---
tokens:  ID     ID  LB  ID    SC  RB
2nd phase:
Is there a rule that starts with a 'class' string?
  YES: Is the next expected token an ID?
    YES: Is the next expected token a LB?
    ...
  NO: Is there another rule that starts with 'class'?
  ...
NO: Is there a rule that starts with an ID token?
...
The actual implementation is a bit more complex, but I hope you get the idea.
The issue with your grammar is that your terminal BLOCK rule is used during the first phase, hence you get
thing  {abc}  {def}
-----  -----  -----
ID     BLOCK  BLOCK
That is why the error message says it found '{abc}' and not a '{'. The parser matched thing and was expecting the next token to be a '{', but it got a BLOCK.
If you want arbitrary text inside the block, I don't think you can use '{' to identify the name of things.
This looks like what is mentioned here:
A quite common case requiring backtracking is when your language uses the same delimiter pair for two different semantics
So the simplest solution seems to use different delimiters. Otherwise you may have to look into enabling backtracking.
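For example, a sketch with a different delimiter pair for the named element (the choice of '[' and ']' is just an assumption to illustrate the idea):
Model: (things+=AThing | blocks+=ABlock)*;
AThing : 'thing' '[' name=ID ']';
ABlock : block=BLOCK;
terminal BLOCK : '{' -> '}';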

Lexer rule optional suffix not matching, when it should match

Using ANTLR 3, my lexer has rule
SELECT_ASSIGN:
'SELECT' WS+ IDENTIFIER WS+ 'ASSIGN' WS+ (('TO'|'USING') WS+)? ;
Using this, these match correctly:
SELECT VAR1 ASSIGN TO
SELECT VAR1 ASSIGN USING
and this also matches
SELECT VAR1 ASSIGN FOO
However this does not match
SELECT VAR1 ASSIGN TWO
even though I have marked 'TO'|'USING' as optional in the rule.
From the generated Java code I see that when the lexer notices the T of TWO, it goes to match('TO'), but since it does not find O after T, it generates a failure and returns all the way from the rule, hence not matching it.
How do I get my lexer rule to match when the input has a word whose first characters coincide with the optional suffix of the rule?
Basically I want my rule to match this as well (besides what it already matches, as listed at the start):
SELECT VAR1 ASSIGN TWO
Kindly suggest how I approach/resolve this situation.
NOTE:
Such rules are usually recommended in the parser, but I have this in the lexer because I do not want the parser to process the entire input; I want to parse only the content of interest. Using such rules in the lexer, I locate the sections that I actually want the parser to handle.
UPDATE 1
I could circumvent this problem by making 2 rules, like so:
SELECT_ASSIGN_USING_TO
  : tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN' WS+ ('USING'|'TO')
  ;
SELECT_ASSIGN
  : tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
  ;
But is it possible to do the desired in one lexer rule?
An approach to get this in one rule, suggested by my senior, is to use a syntactic predicate:
SELECT_ASSIGN
  : tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
    (
      (WS+ ('TO'|'USING') WS+)=> (WS+ ('TO'|'USING') WS+)
    | (WS+)
    )
  ;
A token matches a complete character sequence or none at all. It cannot match partially, and the lexer rule determines exactly which sequence that is. You cannot expect a rule for TO to match TWO. If you want TWO to match too, you have to add it to your lexer rule.
A few notes here:
The solution your "senior" gave you makes no sense at all. A syntactic predicate is a kind of lookahead to guide the parser in case of ambiguities. There are no ambiguities involved here.
Writing the entire SELECT_ASSIGN rule as a lexer rule is very uncommon and not flexible. A lexer rule should not be used for entire sentences, but only for a small set of characters, to find tokens and assign them a type (usually elementary structures of a language like strings, numbers, comments etc.).
ANTLR3 is totally outdated and I wonder why it is still used in your class. ANTLR4 has been out for five years and should be the choice for any new project.
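As a rough sketch of that recommendation (in ANTLR4 syntax; the grammar and rule names are assumptions, not taken from the original grammar), the sentence structure would live in a parser rule while the lexer only produces elementary tokens, so an input word like TWO is lexed as an IDENTIFIER (the longest match) rather than tripping over a partial 'TO':
grammar SelectAssign;                                 // hypothetical combined grammar

selectAssign
  : 'SELECT' IDENTIFIER 'ASSIGN' ('TO' | 'USING')?    // the optional suffix lives in the parser
  ;

IDENTIFIER : [a-zA-Z] [a-zA-Z0-9]* ;
WS         : [ \t\r\n]+ -> skip ;                     // whitespace is skipped, not spelled out per rule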

ANTLR4: implicit or explicit token definition

What are the benefits and drawbacks of using explicit token definitions in ANTLR4? I find the text in single quotes more descriptive and easier to use than creating a separate token and using that in place of the text.
E.g.:
grammar SimpleTest;
top: library | module ;
library: 'library' library_name ';' ;
library_name: IDENTIFIER;
module: MODULE module_name ';' ;
module_name: IDENTIFIER;
MODULE: 'module' ;
IDENTIFIER: [a-zA-Z0-9]+;
The generated tokens are:
T__0=1
T__1=2
MODULE=3
IDENTIFIER=4
'library'=1
';'=2
'module'=3
If I am not interested in the 'library' "token", since the rule already establishes what I am matching against, and I will just skip over it anyway, does it make any sense to replace it with LIBRARY and a token declaration? (The number of tokens then will grow.) Why is this a warning in ANTLRWorks?
Actually, there is a difference between implicit and explicit tokens:
From "The Definitive ANTLR4 Reference", page 76:
ANTLR collects and separates all of the string literals and lexer rules from the parser rules. Literals such as 'enum' become lexical rules and go immediately after the parser rules but before the explicit lexical rules.
ANTLR lexers resolve ambiguities between lexical rules by favoring the rule specified first.
Emphasis mine.
Antlr (like most compilers and compiler generators) uses the concept of a separate lexer and parser, mostly for performance reasons. In this model, the lexer is responsible for reading the actual characters in the input string and returning a list of the tokens found, in a more concise representation, like an enum or int-codes for each token. The parser then works on these tokens instead of the original input, for ease of implementation and performance.
There are two ways to "declare" the usage of a token in Antlr: one is explicit and may have a regular-expression-like pattern, the other is implicit and is always a fixed string.
ExplicitRegExp: [A-Z][a-z]+; // lexer rule starts with uppercase letter
ExplicitFixed: 'fixed';
parserRule: 'implicit' ExplicitRegExp; // parser rules starts with lowercase letter
When you declare a token explicitly, it is assigned an int-code to be used in the parsing state machine. Let's say ExplicitRegExp becomes 1 and ExplicitFixed becomes 2. But the parser also needs the implicit tokens to be able to parse the grammar correctly, so the implicit token is assigned the code 3 implicitly.
How is that bad? You may have typos in different parts of the grammar:
a : 'implicit' c;
b : 'implcit' d; // typo here
And your grammar will not work as expected, because implcit will be a valid token, assigned the int-code 4. It also makes your grammar/lexer harder to debug, because Antlr auto-generates names for the implicit rules, like T__0. Another thing is that you lose the ordering of lexer rules, which could make a difference (usually it does not, because implicit tokens are all fixed content).
The Antlr compiler could choose to give you an error message and require you to write the tokens explicitly, but it chooses to let it go and just warns you that you should not do that, probably for prototyping/testing reasons.
To keep Antlr happy, do it the verbose way and declare all of your tokens:
grammar SimpleTest;
top: library | module ;
library: 'library' library_name=IDENTIFIER ';' ; // I'm using aliasing instead of different parser rule here, just a preference
module: 'module' module_name=IDENTIFIER ';' ;
MODULE: 'module' ;
LIBRARY: 'library' ;
IDENTIFIER: [a-zA-Z0-9]+;
Then it makes no difference whether you reference a fixed token by its explicit name (like MODULE) or by its content (like 'module').

Why are redundant parentheses not allowed in syntax definitions?

This syntax module is syntactically valid:
module mod1
syntax Empty =
;
And so is this one, which should be an equivalent grammar to the previous one:
module mod2
syntax Empty =
( )
;
(The resulting parser accepts only empty strings.)
Which means that you can make grammars such as this one:
module mod3
syntax EmptyOrKitchen =
( ) | "kitchen"
;
But, the following is not allowed (nested parenthesis):
module mod4
syntax Empty =
(( ))
;
I would have guessed that redundant parentheses are allowed, since they are allowed in things like expressions, e.g. ((2)) + 2.
This problem came up when working with the data types for internal representation of rascal syntax definitions. The following code will create the same module as in the last example, namely mod4 (modulo some whitespace):
import Grammar;
import lang::rascal::format::Grammar;
str sm1 = definition2rascal(\definition("unknown_main",("the-module":\module("unknown",{},{},grammar({sort("Empty")},(sort("Empty"):prod(sort("Empty"),[
alt({seq([])})
],{})))))));
The problematic part of the data is on its own line - alt({seq([])}). If this code is changed to seq([]), then you get the same syntax module as mod2. If you further delete this whole expression, i.e. so that you get this:
str sm3 =
definition2rascal(\definition("unknown_main",("the-module":\module("unknown",{},{},grammar({sort("Empty")},(sort("Empty"):prod(sort("Empty"),[
], {})))))));
Then you get mod1.
So should such redundant parentheses be printed by the definition2rascal(...) function? And should it matter with regards to making the resulting module valid or not?
The reason they are not allowed is basically that we wanted to see if we could do without them. There is currently no priority relation between the symbol kinds, so in general there is no need for a bracket syntax (like you do need for + and * in expressions).
The brackets already have two different semantics: () is the epsilon symbol, and (Sym1 Sym2 ...) is a nested sequence. This nested sequence is defined (syntactically) to expect at least two symbols. Now we could, without ambiguity, introduce a third semantics for brackets around a single symbol, or relax the requirement for sequences. But we reckoned it would be confusing that in one case you would get an extra layer in the resulting parse tree (sequence), while in the other case you would not (ignored superfluous bracket).
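As a small illustration (the rule names here are made up, not taken from the question), both bracket meanings look like this in a syntax definition:
syntax Eps = () ;             // () is the epsilon symbol
syntax Pair = "a" ("b" "c") ; // (Sym1 Sym2) is a nested sequence of at least two symbols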
In more detail, the problem of printing seq([]) is not so much a problem of the meta syntax, but rather that the backing abstract notation is more relaxed than the concrete notation (i.e. it is a bigger language, an over-approximation). The parser generator will generate a working parser for seq([]). But there is no Rascal notation for an empty sequence, and I guess the pretty printer should throw an exception.

JvmFormalParameter rule ambiguous?

I have a simple little grammar which keeps giving a multiple alternatives error when I try to generate Xtext artefacts.
The grammar is:
grammar org.xtext.example.hyrule.HyRule with org.eclipse.xtext.xbase.Xbase
generate hyRule
Start:
rules+=Rule+
;
Rule:
'FOR' 'PAYLOAD' payload=PAYLOAD 'ELEMENTS' elements+=JvmFormalParameter+ 'CONSTRAINED' 'BY' expressions+=XExpression*;
PAYLOAD:
"Stacons"|"PFResults"|"any"
;
And the exact error I get is:
warning(200): ../org.xtext.example.hyrule/src-gen/org/xtext/example/hyrule/parser/antlr/internal/InternalHyRule.g:3197:2: Decision can match input such as "{RULE_ID, '=>', '('}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
error(201): ../org.xtext.example.hyrule/src-gen/org/xtext/example/hyrule/parser/antlr/internal/InternalHyRule.g:3197:2: The following alternatives can never be matched: 2
I have attached the syntax diagram for the generated ANTLR grammar in ANTLRWorks, and can clearly see the multiple alternatives (JvmFormalParameter can match RULE_ID via either the JvmTypeReference or the ValidID rule).
So it looks as if JvmFormalParameter is ambiguous... Apologies for my stupidity, but could someone point out what it is I'm missing? Is there some way of overcoming this ambiguity when using the JvmFormalParameter rule in my grammar?
The rule JvmFormalParameter is defined as
JvmFormalParameter returns types::JvmFormalParameter:
(parameterType=JvmTypeReference)? name=ValidID;
so the type of the parameter is optional. If you use elements+=JvmFormalParameter+, you allow multiple parameters without a delimiter, thus the parser cannot decide how to read an input sequence like
String s
since both String and s could be the names of two parameters, or String s could be a single parameter with the type String and the name s. You should use a delimiter like
elements+=JvmFormalParameter (',' elements+=JvmFormalParameter)*
or use the rule FullJvmFormalParameter which is defined with a mandatory type reference:
FullJvmFormalParameter returns types::JvmFormalParameter:
parameterType=JvmTypeReference name=ValidID;
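Applied to the grammar above, the comma-delimited variant would look roughly like this (a sketch; only the ELEMENTS part changes, the rest of the rule is kept as in the question):
Rule:
    'FOR' 'PAYLOAD' payload=PAYLOAD 'ELEMENTS'
    elements+=JvmFormalParameter (',' elements+=JvmFormalParameter)*
    'CONSTRAINED' 'BY' expressions+=XExpression*;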
