JvmFormalParameter rule ambigouous? - xtext

I have a simple little grammar which keeps giving a multiple alternatives error when I try to generate Xtext artefacts.
The grammar is:
grammar org.xtext.example.hyrule.HyRule with org.eclipse.xtext.xbase.Xbase
generate hyRule (You can only use links to eclipse.org sites while you have fewer than 25 messages )
Start:
rules+=Rule+
;
Rule:
'FOR''PAYLOAD'payload=PAYLOAD'ELEMENTS' elements+=JvmFormalParameter+'CONSTRAINED' 'BY' expressions+= XExpression*;
PAYLOAD:
"Stacons"|"PFResults"|"any"
;
And the exact error I get is:
![warning(200): ../org.xtext.example.hyrule/src-gen/org/xtext/example/hyrule/parser/antlr/internal/InternalHyRule.g:3197:2: Decision can match input such as "{RULE_ID, '=>', '('}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
error(201): ../org.xtext.example.hyrule/src-gen/org/xtext/example/hyrule/parser/antlr/internal/InternalHyRule.g:3197:2: The following alternatives can never be matched: 2][1]
I have attached the Syntax diagram for the generated antlr grammar in antlrworks, and can clearly see the multiple alternatives(JvmFormalParameter can match RULE_ID via the JvmTypeReference or the ValidID rule).
So it looks as if JvmFormalParameter is ambiguous...Apologies for my stupidity but could someone point out what it is I'm missing? Is there some way of overcoming this ambiguity when using the JvmFormalParameter rule in my grammar?

The rule JvmFormalParameter is defined as
JvmFormalParameter returns types::JvmFormalParameter:
(parameterType=JvmTypeReference)? name=ValidID;
so the type of the parameter is optional. If you use elements+=JvmFormalParameter+, you allow multiple parameters without a delimiter thus the parser cannot decide about the input sequence
String s
since both String and s could be names of two parameters or String s could be a single parameter with a type String and the name s. You should use a delimiter like
elements+=JvmFormalParameter (',' elements+=JvmFormalParameter)*
or use the rule FullJvmFormalParameter which is defined with a mandatory type reference:
FullJvmFormalParameter returns types::JvmFormalParameter:
parameterType=JvmTypeReference name=ValidID;

Related

How to differentiate between a parser and lexer rule in the following?

Sometimes I get a bit confused between a lexing rule vs. a parsing rule, and there's been a nice thread on it here. For example in the following:
value
: string CAST_OPERATOR type
;
string
: S_QUOTE STRING_VALUE S_QUOTE
;
# <-- what is this?
type
: 'date' | 'string'
;
STRING_VALUE
: [a-zA-Z0-9-]+
;
CAST_OPERATOR
: '::'
;
For the type -- this is either the string (or character stream) date or string. Should that be a lexing rule or a parsing rule? I suppose I could break it down even more into:
type
: DATE_TYPE | STRING_TYPE
;
DATE_TYPE
: 'date'
;
STRING_TYPE
: 'string'
;
But still I'm not quite sure which of the above is preferable, and why it would be so. The first two rules -- value and string seem clear to me to be parsing rules -- and the last two rules -- STRING_VALUE and CAST_OPERATOR seem clear to me to be lexing rules (only by intuition though, I could not give a proper explanation). So why would the type be one way or the other?
Literally the only practical difference I've found is that a lexing rule can include a character class and a parsing rule cannot.
Update: I suppose another thing is a lexing rule is terminal, it won't provide any subdivision of parts. For example in the following we can break down $55 into $ and 55:
But if we set the cost as a lexing rule, it will not break it down any further:
So basically a lexing rule is atomic and terminal, whereas a parsing rule is more like a molecule that consists of various parts (atoms) that can be seen within it. Is that a good description/understanding of it?
Your "Update" is on the right track. That's a definite distinction.
You also need to understand the ANTLR pipeline. I.e. that the stream of characters is processed by the Lexer rules to produce a stream of tokens (atoms, in you analogy). It does not do that with recursive descent rule matching, but rather attempts to match you input against all of the Lexer rules. Where:
The rule that matches the longest sequence of input characters will "win"
In the event that multiple Lexer rules match the same length character sequence, then the rule that occurs first will "win"
Once you've got you stream of "atoms" (aka Tokens), then ANTLR uses the parser rules (recursively from the start rule) to try to match sequences of tokens.

Is there a way to insert phases between the lexer and parser in ANTLR

I am writing a lexer/parser for a language that allows abbreviations (and globs) for its keywords. And, I am trying to determine the best way to do it.
And one thought that occurs to me, is to insert a phase between the lexer and the parser, where the lexer recognizes the general class, e.g. is this a "command name" or is it an "option" and then passes those general tokens to a second phase which does further analysis and recognizes which command name it is and passes that on as the token type to the parser.
It will make the parser simple. I will only have to deal with well formed command names. Every token will be clear what it means.
It will keep the lexer simple. It will only have to divide things into classes. This is a simple name. This is a glob. This is an option name (starts with a dash).
The phase is the middle will also be relatively simple. The simple name (and option forms) will only have to deal with strings. The glob form can use standard glob techniques to match the glob against the legal candidates, which are in the tables for the simple names and options.
The question is how to insert that phase into ANTLR, so that I call the lexer and it creates tokens and the intermediate phase massages them and then the parser gets the tokens the intermediate phase has categorized.
Is there a known solution for this?
Something like:
lexer grammar simple
letter: [A-Z][a-z];
digit: [0-9];
glob-char: [*?];
name: letter (letter | digit)*;
option: '-'name;
glob: (glob-char|letter)(glob-char|letter|digit)*;
glob-option: '-'glob;
filter grammar name;
end: 'e' | 'end';
generate: 'ge' | 'generate';
goto: 'go' | 'goto';
help: 'h' | 'help';
if: 'i' | 'if';
then: 't' | 'then';
parser grammar simple;
The user (programmer writing the language I am parsing) need to be to write
g*te and have if match generate.
The phase between the lexer and the parser when it sees a glob needs to look at the glob (and the list of keywords) and see if only one of them matches the glob and if so, return that keyword. The stuff I listed in the "filter grammar" is the stuff that builds the list of keywords globs can match. I have found code on the web that matches globs to a list of names. That part isn't hard.
And, I've since found in the ANTLR doc how to run arbitrary code on matching a token and how to change the resulting tokens type. (See my answer.)
It looks like you can use lexerCustomActions to achieve the desired effect. Something like the following.
in your lexer:
GLOB: [-A-Za-z0-9_.]* '*' [-A-Za-z0-9_.*]* { setType(lexGlob(getText())); }
in your Java (or whatever language you are using code):
void int lexGlob(String origText()) {
return xyzzy; // some code that computes the right kind of token type
}

Lexer rule optional suffix not matching, when it should match

Using ANTLR 3, my lexer has rule
SELECT_ASSIGN:
'SELECT' WS+ IDENTIFIER WS+ 'ASSIGN' WS+ (('TO'|'USING') WS+)?
using this these match correctly
SELECT VAR1 ASSIGN TO
SELECT VAR1 ASSIGN USING
and this also matches
SELECT VAR1 ASSIGN FOO
However this does not match
SELECT VAR1 ASSIGN TWO
Whereas I have marked TO|USING as optional in the rule.
From generated Java code I see...
When lexer notices T of TWO, it goes to match('TO')
but since does not find O after T
then generates failure.... and returns all the way from the rule -- hence not matching it.
How do I get my lexer rule to match, when input has word with chars starting with suffixed optional part of the rule
Basically I want my rule to match this also (beside what it already matches - as lised at the start):
SELECT VAR1 ASSIGN TWO
Kindly suggest how I approach/resolve this situation.
NOTE:
Such rules are recommended in the parser - But I have this in lexer - because I do not want to parse the entire input by the parser, and want to parse only content of interest. So using such rules in lexer, I locate sections which I really want to parse by the parser.
UPDATE 1
I could circumvent this problem by making 2 rules, like so:
SELECT_ASSIGN_USING_TO
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN' WS+ ('USING'|'TO')
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
But is it possible to do the desired in one lexer rule?
An approach to get this in one rule, suggested by my senior - use syntactic predicate
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
(
(WS+ ('TO'|'USING') WS+)=> (WS+ ('TO'|'USING') WS+)
| (WS+)
)
Tokens match a complete char sequence or none. It cannot match partially and the grammar rule determines which exactly. You cannot expect a rule for TO to match TWO. If you want TWO to match too you have to add it to your lexer rule.
A few notes here:
The solution your "senior" gave you makes no sense at all. A
syntactic predicate is a kinda lookahead to guide the parser in case
of ambiquities. There are no ambiquities involved here.
Writing
the entire SELECT_ASSIGN rule as a lexer rule is very uncommon and
not flexible. A lexer rule should not be used for entire sentences,
but only for a small set of characters to find tokens to assign them
a type (usually elementary structures of a language like string,
number, comment etc.).
ANTLR3 is totally outdated and I wonder why this is still used in your class. ANTLR4 is out since 5 years and should be the choice for any new project.

How Lexer lookahead works with greedy and non-greedy matching in ANTLR3 and ANTLR4?

If someone would clear my mind from the confusion behind look-ahead relation to tokenizing involving greery/non-greedy matching i'd be more than glad. Be ware this is a slightly long post because it's following my thought process behind.
I'm trying to write antlr3 grammar that allows me to match input such as:
"identifierkeyword"
I came up with a grammar like so in Antlr 3.4:
KEYWORD: 'keyword' ;
IDENTIFIER
:
(options {greedy=false;}: (LOWCHAR|HIGHCHAR))+
;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
parse: IDENTIFIER KEYWORD EOF;
however it complains about it can never match IDENTIFIER this way, which i don't really understand. (The following alternatives can never be matched: 1)
Basically I was trying to specify for the lexer that try to match (LOWCHAR|HIGHCHAR) non-greedy way so it stops at KEYWORD lookahead. What i've read so far about ANTLR lexers that there supposed to be some kind of precedence of the lexer rules. If i specify KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after shouldn't be able to match the consumed characters.
After some searching I understand that problem here is that it can't tokenize the input the right way because for example for input: "identifierkeyword" the "identifier" part comes first so it decides to start matching the IDENTIFIER rule when there is no KEYWORD tokens matched yet.
Then I tried to write the same grammar in ANTLR 4, to test if the new run-ahead capabilities can match what i want, it looks like this:
KEYWORD: 'keyword' ;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
IDENTIFIER
:
(LOWCHAR|HIGHCHAR)+?
;
parse: IDENTIFIER KEYWORD EOF;
for the input: "identifierkeyword" it produces this error:
line 1:1 mismatched input 'd' expecting 'keyword'
it matches character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token which he doesn't get this way.
Isn't the non-greedy matching for the lexer supposed to match till any other possibility is available in the look ahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?
I'm really confused about this, I have watched the video where Terence Parr introduces the new capabilities of ANTLR4 where he talks about run-ahead threads that watch for all "right" solutions till the end while actually matching a rule. I thought it would work for Lexer rules too, where a possible right solution for tokenizing input "identifierkeyword" is matching IDENTIFIER: "identifier" and matching KEYWORD: "keyword"
I think I have lots of wrongs in my head about non-greedy/greedy matching. Could somebody please explain me how it works?
After all this I've found a similar question here: ANTLR trying to match token within longer token and made a grammar corresponding to that:
parse
:
identifier 'keyword'
;
identifier
:
(HIGHCHAR | LOWCHAR)+
;
/** lowercase letters */
LOWCHAR
: 'a'..'z';
/** uppercase letters */
HIGHCHAR
: 'A'..'Z';
This does what I want now, however I can't see why I can't change the identifier rule to a Lexer rule and LOWCHAR and HIGHCHAR to fragments.
A Lexer doesn't know that letters in "keyword" can be matched as an identifier? or vice versa? Or maybe it is that rules are only defined to have a lookahead inside themselves, not all possible matching syntaxes?
The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to only allow IDENTIFIER to match a single input character, and then create a parser rule to handle sequences of these characters.
identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;
This would cause the lexer to skip the input identifier as 10 separate characters, and then read keyword as a single KEYWORD token.
The behavior you observed in ANTLR 4 using the non-greedy operator +? is similar to this. This operator says "match as few (HIGHCHAR|LOWCHAR) blocks as possible while still creating an IDENTIFIER token". Clearly the fewest number to create the token is one, so this was effectively a highly inefficient way of writing IDENTIFIER to match a single character. The reason the parse rule failed to handle this is it only allows a single IDENTIFIER token to appear before the KEYWORD token. By creating a parser rule identifier like I showed above, the parser would be able to treat sequences of IDENTIFIER tokens (which are each a single character), as a single identifier.
Edit: The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is the static analysis has determined that the positive closure in the rule IDENTIFIER will never match more than 1 character because the rule will always be successful with exactly 1 character.

AntlrWorks 2 output

So i am using Antlrworks 2, working on a rather large grammar. Problem is that in this grammar there are multiple ambiguities that i am trying to work through.
I was wondering if there is a way to interpret which rules were invoked when there was a failure.
For instance, when i run my rule i get following output
[#0,0:1='99',<20>,1:0]
[#1,2:1='<EOF>',<-1>,1:2]
line 1:0 mismatched input '99' expecting Digit2
(dummy 99)
I am wondering what [#0,0:1='99',<20>,1:0] means. Do the #0 or <20> have any relationship to the rule number in my grammar or something ?
Here is a breakdown of the default token formatting.
[#{TokenIndex},{StartIndex}:{StopIndex}={Text},<{TokenType}>,{Line}:{Column}]
The {TokenType} field generally corresponds to a particular lexer rule (the constant will be declared in your generated lexer). However, the -> type(X) command can be used in any lexer rule to reassign tokens produced by that rule to another type. If the value 20 is assigned to the token named Foo, then the first token in your listing was produced by either a lexer rule named Foo or a lexer rule containing the action -> type(foo) or you have a user-defined action which explicitly assigns the type Foo to a token produced by some other rule (this will be code you wrote, not code generated by ANTLR).

Resources