Handling Grammar Ambiguity in ANTLR4 - parsing

I have a grammar that should parse the following snippet (as an example):
vmthread programm_start
{
CALL main
}
subcall main
{
// Declarations
DATAF i
CALL i
// Statements
MOVEF_F 3 i
}
The problem is the ambiguity between the CALL statement. This op code is valid in the vmthread section (and only the CALL!) but also in those subcall sections. If I define a OP_CODES token with all op codes and an additional OC_CALL token, the lexer can't handle the situation (obviously).
The following listings are snippets of my grammar (first lexer, second parser):
VMTHREAD
: 'vmthread'
;
SUBCALL
: 'subcall'
;
CURLY_OPEN
: '{'
;
CURLY_CLOSE
: '}'
;
OP_CODES
: 'DATA8'
| 'DATAF'
| 'MOVE8_8'
| 'MOVEF_F'
| 'CALL'
;
OC_CALL
: 'CALL'
;
lms
: vmthread subcalls+
;
vmthread
: VMTHREAD name = ID CURLY_OPEN vmthreadCall CURLY_CLOSE
;
vmthreadCall
: oc = OC_CALL name = ID
;
subcalls
: SUBCALL name = ID CURLY_OPEN ins = instruction* CURLY_CLOSE
;
//instruction+
instruction
: oc = OP_CODES args = argumentList
;
argumentList
: arguments+
;
arguments
: INTEGER
| NUMBER
| TEXT
| ID
;
To continue my work I've switched the OC_CALL token in the vmthreadCall parser rule with the OP_CODES token. That solves the problem for now, because the code is auto generated. But there's the possibility that a user can type this code so this could go wrong.
Is there a solution for this or should I move the validation into the parser. There I can easily determine if the statement in the vmthread section contains just the call statement.
For clarification: In the vmthread there's only the CALL allowed. In the subcall (could be more than one) every op code is allowed (CALL + every other op code defined). And I do not want to distinguish between those different CALL statements. I know that's not possible in a context free grammar. I will handle this in the parser. I just want to restrict the vmthread to the one CALL statement and allow all statements (all op codes) in the subcalls. Hopefully that's more clear.

Change your lexer rules like this:
OP_CODES
: 'DATA8'
| 'DATAF'
| 'MOVE8_8'
| 'MOVEF_F'
| OP_CALL
;
OC_CALL
: 'CALL'
;
or alternatively so:
OP_CODES
: 'DATA8'
| 'DATAF'
| 'MOVE8_8'
| 'MOVEF_F'
| CALL
;
OC_CALL
: CALL
;
fragment CALL: 'CALL';
Btw, I recommend that you create explicit lexer rules for your literals (like that CALL fragment), which will make later processing easier. ANTLR assigns generic names to implicitly created literals, which makes it hard to find out which token belongs to which literal.

Related

The following token definitions can never be matched because prior tokens match the same input: INT,STRING

Trying a simple Grammar on antlr. it should parse inputs such as L=[1,2,hello].
However, antlr is producing this error: The following token definitions can never be matched because prior tokens match the same input: INT,STRING.Any Help?
grammar List;
decl: ID '=[' Inside1 ']'; // Declaration of a List. Example : L=[1,'hello']
Inside1: (INT|STRING) Inside2| ; // First element in the List. Could be nothing
Inside2:',' (INT|STRING) Inside2 | ; //
ID:('0'..'Z')+;
INT:('0'..'9')+;
STRING:('a'..'Z')+;
EDIT: The updated Grammar. The error remains with INT Only.
grammar List;
decl: STRING '=[' Inside1 ']'; // Declaration of a List. Example : L=[1,'hello']
Inside1: (INT|'"'STRING'"') Inside2| ; // First element in the List. Could be nothing
Inside2:',' (INT|'"'STRING'"') Inside2 | ; //
STRING:('A'..'Z')+;
INT:('0'..'9')+;
Your ID pattern matches everything that would be matched by INT or STRING, making them irrelevant. I don't think that's what you want.
ID shouldn't match tokens starting with a digit; 42 is not an identifier. And your comment implies that STRING is intended to be a string literal ('hello') but your lexical pattern makes no attempt to match '.

How to resolve Xtext variables' names and keywords statically?

I have a grammar describing an assembler dialect. In code section programmer can refer to registers from a certain list and to defined variables. Also I have a rule matching both [reg0++413] and [myVariable++413]:
BinaryBiasInsideFetchOperation:
'['
v = (Register|[IntegerVariableDeclaration]) ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
But when I try to compile it, Xtext throws a warning:
Decision can match input such as "'[' '++' 'reg0' ']'" using multiple alternatives: 2, 3. As a result, alternative(s) 3 were disabled for that input
Spliting the rules I've noticed, that
BinaryBiasInsideFetchOperation:
'['
v = Register ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
BinaryBiasInsideFetchOperation:
'['
v = [IntegerVariableDeclaration] ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
work well separately, but not at the same time. When I try to compile both of them, XText writes a number of errors saying that registers from list could be processed ambiguously. So:
1) Am I right, that part of rule v = (Register|[IntegerVariableDeclaration]) matches any IntegerVariable name including empty, but rule v = [IntegerVariableDeclaration] matches only nonempty names?
2) Is it correct that when I try to compile separate rules together Xtext thinks that [IntegerVariableDeclaration] can concur with Register?
3) How to resolve this ambiguity?
edit: definitors
Register:
areg = ('reg0' | 'reg1' | 'reg2' | 'reg3' | 'reg4' | 'reg5' | 'reg6' | 'reg7' )
;
IntegerVariableDeclaration:
section = SectionServiceWord? name=ID ':' type = IntegerType ('[' size = IntValue ']')? ( value = IntegerVariableDefinition )? ';'
;
ID is a standart terminal which parses a single word, a.k.a identifier
No, (Register|[IntegerVariableDeclaration]) can't match Empty. Actually, [IntegerVariableDeclaration] is the same than [IntegerVariableDeclaration|ID], it is matching ID rule.
Yes, i think you can't split your rules.
I can't reproduce your problem (i need full grammar), but, in order to solve your problem you should look at this article about xtext grammar debugging:
Compile grammar in debug mode by adding the following line into your workflow.mwe2
fragment = org.eclipse.xtext.generator.parser.antlr.DebugAntlrGeneratorFragment {}
Open generated antrl debug grammar with AntlrWorks and check the diagram.
In addition to Fabien's answer, I'd like to add that an omnimatching rule like
AnyId:
name = ID
;
instead of
(Register|[IntegerVariableDeclaration])
solves the problem. One need to dynamically check if AnyId.name is a Regiser, Variable or something else like Constant.

ANTRL 3 grammar omitted part of input source code

I am using this ANTLR 3 grammar and ANTLRWorks for testing that grammar.
But I can't figure out why some parts of my input text are omitted.
I would like to rewrite this grammar and display every element (lparen, keywords, semicolon,..) of the source file (input) in AST / CST.
I've tried everything, but without success. Can someone who is experienced with ANTLR help me?
Parse tree:
I've managed to narrow it down to the semic rule:
/*
This rule handles semicolons reported by the lexer and situations where the ECMA 3 specification states there should be semicolons automaticly inserted.
The auto semicolons are not actually inserted but this rule behaves as if they were.
In the following situations an ECMA 3 parser should auto insert absent but grammaticly required semicolons:
- the current token is a right brace
- the current token is the end of file (EOF) token
- there is at least one end of line (EOL) token between the current token and the previous token.
The RBRACE is handled by matching it but not consuming it.
The EOF needs no further handling because it is not consumed by default.
The EOL situation is handled by promoting the EOL or MultiLineComment with an EOL present from off channel to on channel
and thus making it parseable instead of handling it as white space. This promoting is done in the action promoteEOL.
*/
semic
#init
{
// Mark current position so we can unconsume a RBRACE.
int marker = input.mark();
// Promote EOL if appropriate
promoteEOL(retval);
}
: SEMIC
| EOF
| RBRACE { input.rewind(marker); }
| EOL | MultiLineComment // (with EOL in it)
;
So, the EVIL semicolon insertion strikes again!
I'm not really sure, but I think these mark/rewind calls are getting out of sync. The #init block is executed when the rule is entered for branch selection and for actual matching. It's actually creating a lot of marks but not cleaning them up. But I don't know why it messes up the parse tree like that.
Anyway, here's a working version of the same rule:
semic
#init
{
// Promote EOL if appropriate
promoteEOL(retval);
}
: SEMIC
| EOF
| { int pos = input.index(); } RBRACE { input.seek(pos); }
| EOL | MultiLineComment // (with EOL in it)
;
It's much simpler and doesn't use the mark/rewind mechanism.
But there's a catch: the semic rule in the parse tree will have a child node } in the case of a semicolon insertion before a closing brace. Try to remove the semicolon after i-- and see the result. You'll have to detect this and handle it in your code. semic should either contain a ; token, or contain EOL (which means a semicolon got silently inserted at this point).

Xtext Grammar: "The following alternatives can never be matched"

I am running into a problem with ambiguity in a rather complicated grammar I have been building up. It's too complex to post here, so I've reduced my problem down to aid comprehension.
I am getting the following error:
error(201): ../org.xtext.example.mydsl.ui/src-gen/org/xtext/example/mydsl/ui/contentassist/antlr/internal/InternalMyDsl.g:398:1: The following alternatives can never be matched: 2
From this grammar:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
(contents+=ModelMember)*;
ModelMember:
Field | Assignment | Static | Class
;
Static:
"static" type=TypeDef name=ID
;
Class:
"class" name=ID "{"
(fields+=Field)*
"}"
;
Field:
"var" type=TypeDef name=ID
;
TypeDef:
{Primtive} ("String" | "int") |
{Object} clazz=[Class]
;
Reference:
(
{StaticField} static=[Static] (withDiamond?="<>")?
|
{DynamicField} field=[Field]
)
;
ObjectReference:
reference=Reference ({ObjectReference.target=current} '.' reference=Reference)*
;
Assignment:
field=ObjectReference "=" value=ObjectReference
;
I know the problem relates to Reference, which is struggling with the ambiguity of which rule to chose.
I can get it to compile with the following grammar change, but this allows syntax that I deem to be illegal:
Reference:
ref=[RefType] (withDiamond?="<>")?
;
RefType:
Static|Field
;
Where my use-case is:
static String a
class Person {
String name
}
Person paul
// This should be legal
paul.name = a<>;
// This should be illegal, diamond not vaild against non-static vars
paul.name = paul.name<>;
// This sohuld be legal
paul.name = paul.name
Your second grammar is the way to go. The fact that diamond is only legal for static variables can be handled in your language's validator.
Generally, make your grammar loose and your validation strict. That makes your grammar easier to maintain. It also gives your users better error messages ("Diamand is not allowed for non-static vars" instead of "Invalid input '<'")

Skipping tokens in yacc

I want to have a grammar rule like below in my yacc file:
insert_statement: INSERT INTO NAME (any_token)* ';'
We can skip all the tokens until a given token at an error, in yacc as follows:
stat: error ';'
Is there any mechanism to skip any number of characters in yacc, when there is no error?
Thanks
After sometime I could solve my problem the following way and would like to mention it as it would be helpful to someone:
Add a token definition to lex including the characters that should be in a skipping token:
<*>[A-Za-z0-9_:.-]* { return SKIPPINGTOKS; }
(this would identify any token like a, 1, hello, hello123 etc.)
Then add the following such rules to yacc as required:
insert_statement: INSERT INTO NAME skipping_portion ';'
skipping_portion: SKIPPINGTOKS | skipping_portion SKIPPINGTOKS
Hope this may help someone...
I think you would want to do something like this. It skips any and all tokens that are not the semicolon.
insert_statement: INSERT INTO NAME discardable_tokens_or_epsilon ';' ;
discardable_tokens_or_epsilon: discardable_tokens
| epsilon
;
discardable_tokens: discardable_tokens discardable_token
| discardable_token
;
discardable_token: FOO
| BAR
| BLETCH
...et cetera... anything other than a semicolon
;
epsilon: ;
Simply don't specify a production rule containing those tokens, you'd like to skip.

Resources