I am running into a problem with ambiguity in a rather complicated grammar I have been building up. It's too complex to post here, so I've reduced my problem down to aid comprehension.
I am getting the following error:
error(201): ../org.xtext.example.mydsl.ui/src-gen/org/xtext/example/mydsl/ui/contentassist/antlr/internal/InternalMyDsl.g:398:1: The following alternatives can never be matched: 2
From this grammar:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
(contents+=ModelMember)*;
ModelMember:
Field | Assignment | Static | Class
;
Static:
"static" type=TypeDef name=ID
;
Class:
"class" name=ID "{"
(fields+=Field)*
"}"
;
Field:
"var" type=TypeDef name=ID
;
TypeDef:
{Primtive} ("String" | "int") |
{Object} clazz=[Class]
;
Reference:
(
{StaticField} static=[Static] (withDiamond?="<>")?
|
{DynamicField} field=[Field]
)
;
ObjectReference:
reference=Reference ({ObjectReference.target=current} '.' reference=Reference)*
;
Assignment:
field=ObjectReference "=" value=ObjectReference
;
I know the problem relates to Reference, which is struggling with the ambiguity of which rule to choose.
I can get it to compile with the following grammar change, but this allows syntax that I deem to be illegal:
Reference:
ref=[RefType] (withDiamond?="<>")?
;
RefType:
Static|Field
;
Where my use-case is:
static String a
class Person {
String name
}
Person paul
// This should be legal
paul.name = a<>;
// This should be illegal, diamond not valid against non-static vars
paul.name = paul.name<>;
// This should be legal
paul.name = paul.name
Your second grammar is the way to go. The fact that diamond is only legal for static variables can be handled in your language's validator.
Generally, make your grammar loose and your validation strict. That makes your grammar easier to maintain. It also gives your users better error messages ("Diamond is not allowed for non-static vars" instead of "Invalid input '<'").
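To illustrate the "loose grammar, strict validation" idea: in Xtext the check would normally live in the language's validator class (Java or Xtend). The following is only a language-neutral sketch in Python, not Xtext API. The Reference class and its fields is_static and with_diamond are hypothetical stand-ins for what the parser and linker would provide.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    # Hypothetical model element; in Xtext this would be a generated EObject.
    name: str
    is_static: bool      # assumed to be known after linking (Static vs. Field)
    with_diamond: bool   # True if "<>" was written after the name


def check_diamond(ref: Reference) -> list[str]:
    """Return validation messages for one reference (empty list means valid)."""
    if ref.with_diamond and not ref.is_static:
        return [f"Diamond is not allowed for non-static var '{ref.name}'"]
    return []


refs = [
    Reference("a", is_static=True, with_diamond=True),        # a<>          legal
    Reference("name", is_static=False, with_diamond=True),    # paul.name<>  illegal
    Reference("name", is_static=False, with_diamond=False),   # paul.name    legal
]
errors = [msg for ref in refs for msg in check_diamond(ref)]
for msg in errors:
    print(msg)
```

The grammar accepts all three references; only the validation step rejects the second one, with a message that names the actual problem.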
I have a grammar that should parse the following snippet (as an example):
vmthread programm_start
{
CALL main
}
subcall main
{
// Declarations
DATAF i
CALL i
// Statements
MOVEF_F 3 i
}
The problem is the ambiguity of the CALL statement. This op code is valid in the vmthread section (and only the CALL!) but also in the subcall sections. If I define an OP_CODES token with all op codes and an additional OC_CALL token, the lexer can't handle the situation (obviously).
The following listings are snippets of my grammar (first lexer, second parser):
VMTHREAD
: 'vmthread'
;
SUBCALL
: 'subcall'
;
CURLY_OPEN
: '{'
;
CURLY_CLOSE
: '}'
;
OP_CODES
: 'DATA8'
| 'DATAF'
| 'MOVE8_8'
| 'MOVEF_F'
| 'CALL'
;
OC_CALL
: 'CALL'
;
lms
: vmthread subcalls+
;
vmthread
: VMTHREAD name = ID CURLY_OPEN vmthreadCall CURLY_CLOSE
;
vmthreadCall
: oc = OC_CALL name = ID
;
subcalls
: SUBCALL name = ID CURLY_OPEN ins = instruction* CURLY_CLOSE
;
//instruction+
instruction
: oc = OP_CODES args = argumentList
;
argumentList
: arguments+
;
arguments
: INTEGER
| NUMBER
| TEXT
| ID
;
To continue my work I've switched the OC_CALL token in the vmthreadCall parser rule with the OP_CODES token. That solves the problem for now, because the code is auto generated. But there's the possibility that a user can type this code so this could go wrong.
Is there a solution for this, or should I move the validation into the parser? There I can easily determine if the statement in the vmthread section contains just the call statement.
For clarification: In the vmthread there's only the CALL allowed. In the subcall (could be more than one) every op code is allowed (CALL + every other op code defined). And I do not want to distinguish between those different CALL statements. I know that's not possible in a context free grammar. I will handle this in the parser. I just want to restrict the vmthread to the one CALL statement and allow all statements (all op codes) in the subcalls. Hopefully that's more clear.
Change your lexer rules like this:
OP_CODES
: 'DATA8'
| 'DATAF'
| 'MOVE8_8'
| 'MOVEF_F'
| OC_CALL
;
OC_CALL
: 'CALL'
;
or alternatively so:
OP_CODES
: 'DATA8'
| 'DATAF'
| 'MOVE8_8'
| 'MOVEF_F'
| CALL
;
OC_CALL
: CALL
;
fragment CALL: 'CALL';
Btw, I recommend that you create explicit lexer rules for your literals (like that CALL fragment), which will make later processing easier. ANTLR assigns generic names to implicitly created literals, which makes it hard to find out which token belongs to which literal.
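The alternative raised in the question, keeping a single op code token and checking the vmthread body after parsing, could look roughly like the sketch below. It is written in Python rather than against the ANTLR runtime, and the (opcode, args) tuples are a hypothetical stand-in for whatever the parse tree listener or visitor collects.

```python
def check_vmthread(instructions):
    """Validate a vmthread body given as a list of (opcode, args) tuples.

    Returns a list of error messages; an empty list means the body is valid.
    """
    errors = []
    if len(instructions) != 1:
        errors.append(
            f"vmthread must contain exactly one CALL, found {len(instructions)} statements"
        )
    for opcode, _args in instructions:
        if opcode != "CALL":
            errors.append(f"op code '{opcode}' is not allowed in vmthread, only CALL")
    return errors


ok = check_vmthread([("CALL", ["main"])])
bad = check_vmthread([("MOVEF_F", ["3", "i"])])
print(ok)
print(bad)
```

With this approach the lexer only needs one op code token, and a user who types a wrong op code in the vmthread gets a targeted message instead of a generic syntax error.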
I'm working on a Delphi Grammar in Rascal and I'm having some problems parsing its “record” type. The relevant section of Delphi code can look as follows:
record
private
a,b,c : Integer;
x : Cardinal;
end
Where the "private" can be optional, and the variable declaration lines can also be optional.
I tried to interpret this section using the rules below:
syntax FieldDecl = IdentList ":" Type
| IdentList ":" Type ";"
;
syntax FieldSection = FieldDecl
| "var" FieldDecl
| "class" "var" FieldDecl
;
syntax Visibility = "private" | "protected" | "public"| "published" ;
syntax VisibilitySectionContent = FieldSection
| MethodOrProperty
| ConstSection
| TypeSection
;
syntax VisibilitySection = Visibility? VisibilitySectionContent+
;
syntax RecordType = "record" "end"
| "record" VisibilitySection+ "end"
;
The problem is ambiguity. The entire text between “record” and “end” can be parsed as a single VisibilitySection, but every line on its own can also be a separate VisibilitySection.
I can change the rule VisibilitySection to
syntax VisibilitySection = Visibility
| VisibilitySectionContent
;
Then the grammar is no longer ambiguous, but the VisibilitySection becomes flat: the variable lines are no longer nested under an optional 'private' node, which I would prefer.
Any suggestions on how to solve this problem? What I would like to do is demand a longest/greedy match on the VisibilitySectionContent+ symbol of VisibilitySection.
But changing
syntax VisibilitySection = Visibility? VisibilitySectionContent+
to
syntax VisibilitySection = Visibility? VisibilitySectionContent+ !>> VisibilitySectionContent
does not seem to work for this.
I also ran the Ambiguity report tool on Rascal, but it does not provide me any insights.
Any thoughts?
Thanks
I can't check since you did not provide the full grammar, but I believe this should work to get your "longest match" behavior:
syntax VisibilitySection
= Visibility? VisibilitySectionContent+ ()
>> "public"
>> "private"
>> "published"
>> "protected"
>> "end"
;
In my mind this should remove the interpretation where your nested VisibilitySections are cut short. Now we only accept such sections if they are immediately followed by either the end of the record, or the next section. I'm curious to find out if it really works because it is always hard to predict the behavior of a grammar :-)
The () at the end of the rule (empty non-terminal) makes sure we can skip to the start of the next part before applying the restriction. This only works if you have a longest match rule on layout already somewhere in the grammar.
The VisibilitySectionContent+ in VisibilitySection should be VisibilitySectionContent (without the Kleene plus).
I’m guessing here, but your intention is probably to allow a number of sections/declarations within the record type, and any of those may or may not have a Visibility modifier. To avoid putting this optional Visibility in every section, you have created a VisibilitySectionContent nonterminal which basically models “things that can happen within the record type definition”, one thing per nonterminal, without worrying about visibility modifiers. In this case, you’re fine with one VisibilitySectionContent per VisibilitySection since there is explicit repetition when you refer to the VisibilitySection from the RecordType anyway.
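If the flat variant (one VisibilitySectionContent per VisibilitySection) is used, the nesting can still be recovered after parsing. The sketch below shows one possible regrouping step in Python; it is an assumption about how one might post-process the tree, not Rascal code, and the flat list of strings is a hypothetical stand-in for the parsed nodes.

```python
VISIBILITIES = {"private", "protected", "public", "published"}


def group_sections(items):
    """Group a flat list of items into (visibility, contents) pairs.

    Each visibility keyword opens a new section; content that appears before
    any keyword goes into a section whose visibility is None.
    """
    sections = []
    for item in items:
        if item in VISIBILITIES:
            sections.append((item, []))
        else:
            if not sections:
                sections.append((None, []))
            sections[-1][1].append(item)
    return sections


flat = ["private", "a,b,c : Integer;", "x : Cardinal;"]
grouped = group_sections(flat)
print(grouped)
# [('private', ['a,b,c : Integer;', 'x : Cardinal;'])]
```

This is the same greedy grouping the longest-match restriction tries to express in the grammar, just moved to a later phase.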
What approach would allow me to get the most on reporting lexing errors?
For a simple example I would like to write a grammar for the following text
(white space is ignored and string constants cannot have a \" in them for simplicity):
myvariable = 2
myvariable = "hello world"
Group myvariablegroup {
myvariable = 3
anothervariable = 4
}
Catching errors with a lexer
How can you maximize the error reporting potential of a lexer?
After reading this post: Where should I draw the line between lexer and parser?
I understood that the lexer should match as much as it can with regards to the parser grammar but what about lexical error reporting strategies?
What are the ordinary strategies for catching lexing errors?
I am imagining a grammar which would have the following "error" tokens:
GROUP_OPEN: 'Group' WS ID WS '{';
EMPTY_GROUP: 'Group' WS ID WS '{' WS '}';
EQUALS: '=';
STRING_CONSTANT: '"~["]+"';
GROUP_CLOSE: '}';
GROUP_ERROR: 'Group' .; // the . character is an invalid token
// you probably meant '{'
GROUP_ERROR2: .'roup' ; // Did you mean 'group'?
STRING_CONSTANT_ERROR: '"' .+; // Unterminated string constant
ID: [a-z][a-z0-9]+;
WS: [ \n\r\t]* -> skip();
SINGLE_TOKEN_ERRORS: .+?;
There are clearly some problems with your approach:
You are skipping WS (which is good), yet you're using it in your other rules. But you're in the lexer, which leads us to...
Your groups are being recognized by the lexer. I don't think you want them to become a single token. Your groups belong in the parser.
Your grammar, as written, will create specific token types for things ending in roup, so croup for instance may never match an ID. That's not good.
STRING_CONSTANT_ERROR is much too broad. It's able to glob the entire input. See my UNTERMINATED_STRING below.
I'm not quite sure what happens with SINGLE_TOKEN_ERRORS... See below for an alternative.
Now, here are some examples of error tokens I use, and this works very well for error reporting:
UNTERMINATED_STRING
: '"' ('\\' ["\\] | ~["\\\r\n])*
;
UNTERMINATED_COMMENT_INLINE
: '/*' ('*' ~'/' | ~'*')*? EOF -> channel(HIDDEN)
;
// This should be the LAST lexer rule in your grammar
UNKNOWN_CHAR
: .
;
Note that these unterminated tokens represent single atomic values, they don't span logical structures.
Also, UNKNOWN_CHAR will be a single char no matter what, if you define it as .+? it will always match exactly one char anyway, since it will be trying to match as few chars as possible, and that minimum is one char.
Non-greedy quantifiers make sense when something follows them. For instance in the expression .+? '#', the .+? will be forced to consume characters until it encounters a # sign. If the .+? expression is alone, it won't have to consume more than a single character to match, and is therefore equivalent to a single . (one wildcard character).
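The same behavior can be observed with ordinary regular expressions. The sketch below uses Python's re module as an analogy; it is not ANTLR, but the non-greedy quantifier behaves the same way in this respect.

```python
import re

text = "abc#def"

# Alone, the non-greedy quantifier stops as soon as it can: after one character.
lone = re.match(r".+?", text).group()

# Followed by '#', it is forced to consume characters until the '#' is reached.
followed = re.match(r".+?#", text).group()

# A plain wildcard matches the same single character as the lone ".+?".
single = re.match(r".", text).group()

print(lone)      # a
print(followed)  # abc#
print(single)    # a
```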
I use the following code in the lexer (.NET ANTLR):
partial class MyLexer
{
public override IToken Emit()
{
CommonToken token;
RecognitionException ex;
switch (Type)
{
case UNTERMINATED_STRING:
Type = STRING;
token = (CommonToken)base.Emit();
ex = new UnterminatedTokenException(this, (ICharStream)InputStream, token);
ErrorListenerDispatch.SyntaxError(this, UNTERMINATED_STRING, Line, Column, "Unterminated string: " + GetTokenTextForDisplay(token), ex);
return token;
case UNTERMINATED_COMMENT_INLINE:
Type = COMMENT_INLINE;
token = (CommonToken)base.Emit();
ex = new UnterminatedTokenException(this, (ICharStream)InputStream, token);
ErrorListenerDispatch.SyntaxError(this, UNTERMINATED_COMMENT_INLINE, Line, Column, "Unterminated comment: " + GetTokenTextForDisplay(token), ex);
return token;
default:
return base.Emit();
}
}
// ...
}
Notice that when the lexer encounters a bad token type, it explicitly changes it to a valid token, so the parser can actually make sense of it.
Now, it is the job of the parser to identify bad structure. ANTLR is smart enough to perform single-token deletion and single-token insertion while trying to resynchronize itself with an invalid input. This is also the reason why I'm letting UNKNOWN_CHAR slip through to the parser, so it can discard it with an error message.
Just take the errors it generates and alter them in order to present something nicer to the user.
So, just make your groups into a parser rule.
An example:
Consider the following input:
Group ,ygroup {
Here, the , is clearly a typo (user pressed , instead of m).
If you use UNKNOWN_CHAR: .; you will get the following tokens:
Group of type GROUP
, of type UNKNOWN_CHAR
ygroup of type ID
{ of type '{'
The parser will be able to figure out the UNKNOWN_CHAR token needs to be deleted and will correctly match a group (defined as GROUP ID '{' ...).
ANTLR will insert so-called error nodes at the points where it finds unexpected tokens (in this case between GROUP and ID). These nodes are then ignored for the purposes of parsing, but you can retrieve them with your visitors/listeners to handle them (you can use a visitor's VisitErrorNode method for instance).
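To make the token stream above concrete, here is a small stand-alone tokenizer sketch in Python. It only imitates the idea (error tokens plus a catch-all rule placed last) with the re module and is not ANTLR-generated code. The token names NUMBER, LBRACE and RBRACE are assumptions added for the example. Note that Python alternation picks the first alternative that matches, not the longest, so STRING must be listed before UNTERMINATED_STRING.

```python
import re

# Order matters: the first alternative that matches wins.
TOKEN_SPEC = [
    ("GROUP", r"Group\b"),
    ("ID", r"[a-z][a-z0-9]*"),
    ("NUMBER", r"[0-9]+"),
    ("STRING", r'"[^"\r\n]*"'),
    ("UNTERMINATED_STRING", r'"[^"\r\n]*'),
    ("LBRACE", r"\{"),
    ("RBRACE", r"\}"),
    ("EQUALS", r"="),
    ("WS", r"[ \t\r\n]+"),
    ("UNKNOWN_CHAR", r"."),  # catch-all, must stay last
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return (tokens, errors); whitespace is skipped, bad strings are retyped."""
    tokens, errors = [], []
    for match in MASTER.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "WS":
            continue
        if kind == "UNTERMINATED_STRING":
            # Report the problem, then hand the parser a normal STRING token.
            errors.append(f"Unterminated string: {value}")
            kind = "STRING"
        tokens.append((kind, value))
    return tokens, errors


tokens, errors = tokenize("Group ,ygroup {")
print(tokens)
# [('GROUP', 'Group'), ('UNKNOWN_CHAR', ','), ('ID', 'ygroup'), ('LBRACE', '{')]
```

The stray comma survives as its own UNKNOWN_CHAR token, so a later phase can report it and continue instead of failing on the whole line.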
I am new to Antlr and parsing, so this is a learning exercise for me.
I am trying to parse a language that allows free-format text in some locations. The free-format text may therefore be ANY word or words, including the keywords in the language - their location in the language's sentences defines them as keywords or free text.
In the following example, the first instance of "JOB" is a keyword; the second "JOB" is free-form text:
JOB=(JOB)
I have tried the following grammar, which avoids defining the language's keywords in lexer rules.
grammar Test;
test1 : 'JOB' EQ OPAREN (utext) CPAREN ;
utext : UNQUOTEDTEXT ;
COMMA : ',' ;
OPAREN : '(' ;
CPAREN : ')' ;
EQ : '=' ;
UNQUOTEDTEXT : ~[a-z,()\'\" \r\n\t]*? ;
SPC : [ \t]+ -> skip ;
I was hoping that by defining the keywords as string literals in the parser rules, as above, they would apply only in the location in which they were defined. This appears not to be the case. On testing the "test1" rule (with the Antlr4 plug-in in IDEA), using the example phrase shown above - "JOB=(JOB)" (without quotes) - as input, I get the following error message:
line 1:5 mismatched input 'JOB' expecting UNQUOTEDTEXT
So after creating an implicit token for 'JOB', it looks like Antlr uses that token in other points in the parser grammar, too, i.e. whenever it sees the 'JOB' string. To test this, I added another parser rule:
test2 : 'DATA' EQ OPAREN (utext) CPAREN ;
and tested with "DATA=(JOB)"
I got the following error (similar to before):
line 1:6 mismatched input 'JOB' expecting UNQUOTEDTEXT
Is there any way to ask Antlr to enforce the token recognition in the locations only where it is defined/introduced?
Thanks!
What you have is essentially a lake grammar, the opposite of an island grammar. A lake grammar is one in which you mostly have structured text, with lakes of stuff you don't care about. Generally the key is having some lexical sentinel that says "enter unstructured text area" and then "reenter structured text area". In your case it seems to be (...). ANTLR has the notion of a lexical mode, which is what you want in order to handle areas with different lexical structures. When you see a '(' you want to switch modes to some free-form area. When you see a ')' in that area you want to switch back to the default mode. Anyway, "mode" is your keyword here.
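The mode-switching idea can be sketched with a tiny hand-written lexer. The Python below is only an illustration of the technique, not ANTLR's mode syntax. The mode names DEFAULT and FREE_TEXT, the KEYWORDS set, and the decision to treat everything up to the next ')' as free text are all assumptions made for the example.

```python
KEYWORDS = {"JOB", "DATA"}  # hypothetical keyword set
PUNCTUATION = {"=": "EQ", ",": "COMMA"}


def lex(text):
    """Tokenize text, switching to FREE_TEXT mode between '(' and ')'."""
    tokens = []
    mode = "DEFAULT"
    i = 0
    while i < len(text):
        ch = text[i]
        if mode == "FREE_TEXT":
            # Everything up to the closing parenthesis is one free-text token.
            end = text.find(")", i)
            if end == -1:
                end = len(text)
            if end > i:
                tokens.append(("UNQUOTEDTEXT", text[i:end]))
            if end < len(text):
                tokens.append(("CPAREN", ")"))
                mode = "DEFAULT"
                end += 1
            i = end
        elif ch == "(":
            tokens.append(("OPAREN", ch))
            mode = "FREE_TEXT"
            i += 1
        elif ch in PUNCTUATION:
            tokens.append((PUNCTUATION[ch], ch))
            i += 1
        elif ch in " \t":
            i += 1
        else:
            end = i
            while end < len(text) and (text[end].isalnum() or text[end] == "_"):
                end += 1
            if end == i:
                raise ValueError(f"unexpected character {ch!r} at offset {i}")
            word = text[i:end]
            tokens.append(("KEYWORD" if word in KEYWORDS else "ID", word))
            i = end
    return tokens


result = lex("JOB=(JOB)")
print(result)
# [('KEYWORD', 'JOB'), ('EQ', '='), ('OPAREN', '('), ('UNQUOTEDTEXT', 'JOB'), ('CPAREN', ')')]
```

The first JOB is lexed in the default mode and becomes a keyword; the second is lexed in the free-text mode, where keywords are never recognized.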
I had a similar problem with keywords that are sometimes only identifiers. I did it this way:
OnlySometimesAKeyword : 'value' ;
identifier
: Identifier // defined as usual
| maybeKeywords
;
maybeKeywords
: OnlySometimesAKeyword
// ...
;
In your parser rules simply use identifier instead of Identifier and you'll also be able to match the "maybe keywords". This will of course also match them in places where they will be keywords, but you could check this in the parser if necessary.
I want to have a grammar rule like below in my yacc file:
insert_statement: INSERT INTO NAME (any_token)* ';'
We can skip all the tokens until a given token at an error, in yacc as follows:
stat: error ';'
Is there any mechanism to skip any number of characters in yacc, when there is no error?
Thanks
After some time I solved my problem in the following way, and I would like to mention it as it may be helpful to someone:
Add a token definition to lex including the characters that should be in a skipping token:
<*>[A-Za-z0-9_:.-]+ { return SKIPPINGTOKS; }
(this would identify any token like a, 1, hello, hello123 etc.)
Then add the following such rules to yacc as required:
insert_statement: INSERT INTO NAME skipping_portion ';'
skipping_portion: SKIPPINGTOKS | skipping_portion SKIPPINGTOKS
Hope this may help someone...
I think you would want to do something like this. It skips any and all tokens that are not the semicolon.
insert_statement: INSERT INTO NAME discardable_tokens_or_epsilon ';' ;
discardable_tokens_or_epsilon: discardable_tokens
| epsilon
;
discardable_tokens: discardable_tokens discardable_token
| discardable_token
;
discardable_token: FOO
| BAR
| BLETCH
...et cetera... anything other than a semicolon
;
epsilon: ;
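What these rules accomplish, accepting INSERT INTO NAME and then discarding every token up to the semicolon, can be sketched outside yacc as well. The Python below is a hand-written illustration under the assumption that tokens arrive as (type, text) pairs; the token type names mirror the grammar above, and the sample input is made up.

```python
def parse_insert(tokens):
    """Parse INSERT INTO NAME <anything>* ';' from a list of (type, text) pairs.

    Returns the table name and the list of skipped token texts.
    """
    if [kind for kind, _ in tokens[:3]] != ["INSERT", "INTO", "NAME"]:
        raise SyntaxError("expected INSERT INTO NAME")
    table = tokens[2][1]
    skipped = []
    for kind, text in tokens[3:]:
        if kind == ";":
            return table, skipped
        skipped.append(text)  # any token other than ';' is discarded
    raise SyntaxError("missing ';'")


sample = [
    ("INSERT", "INSERT"),
    ("INTO", "INTO"),
    ("NAME", "users"),
    ("FOO", "values"),
    ("BAR", "42"),
    (";", ";"),
]
table, skipped = parse_insert(sample)
print(table, skipped)
# users ['values', '42']
```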
Simply don't specify a production rule containing the tokens you'd like to skip.