Xtext - enum literal overrides id - parsing

I have the following excerpt of my grammar, where the enum rule Format seems to override the FieldColumnName rule.
Statement:
'select * from' table=Table where=WhereClause;
WhereClause:
'where' symbol=FieldColumn op="=" right=STRING;
FieldColumn:
fieldName=FieldColumnName;
FieldColumnName hidden():
ID ('.' ID)?;
enum Format:
iso | de | en;
When I write a DSL script against this grammar, the editor reports a validation error for the following statement:
select * from foo where foo.de = 'bar';
The error marks the de in foo.de and its message is:
mismatched input 'de' expecting RULE_ID
How can I use reserved words like de in contexts where that keyword is not expected?

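Be very careful with spaces inside keywords: 'select * from' is a single keyword that matches only that exact character sequence, single spaces included. Please refactor your grammar to use separate keywords, e.g. 'select' '*' 'from' instead of 'select * from'.
The error itself occurs because the literals of the enum Format become keywords, so the lexer never produces an ID token for de. To fix your issue, introduce a rule ValidID: ID | 'de' | 'en' | 'iso'; and use ValidID instead of ID in FieldColumnName.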
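To make the mechanism concrete, here is a minimal Python sketch of what happens. It is not Xtext's generated lexer or parser; the names KEYWORDS, tokenize and parse_field_column_name are made up for illustration. It only assumes that a word equal to a keyword is always lexed as a keyword, and that a ValidID-style rule accepts those keyword tokens where a name is expected.

```python
import re

# Assumption: the literals of the Format enum act as keywords in the lexer.
KEYWORDS = {"iso", "de", "en"}
TOKEN_RE = re.compile(r"(?P<word>[A-Za-z_]\w*)|(?P<dot>\.)")


def tokenize(text):
    """Return (kind, value) pairs; a word that equals a keyword never becomes an ID."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        pos = m.end()
        if m.group("word"):
            word = m.group("word")
            tokens.append(("KEYWORD" if word in KEYWORDS else "ID", word))
        else:
            tokens.append(("DOT", "."))
    return tokens


def parse_field_column_name(tokens, allow_keywords):
    """FieldColumnName: name ('.' name)?  where name is ID or, optionally, a keyword."""
    def name(index):
        kind, value = tokens[index]
        if kind == "ID" or (allow_keywords and kind == "KEYWORD"):
            return value
        raise SyntaxError(f"mismatched input '{value}' expecting RULE_ID")

    parts = [name(0)]
    if len(tokens) == 3 and tokens[1][0] == "DOT":
        parts.append(name(2))
    elif len(tokens) != 1:
        raise SyntaxError("expected name ('.' name)?")
    return ".".join(parts)
```

With allow_keywords=False the sketch behaves like the original ID-only rule and rejects foo.de; with allow_keywords=True it behaves like the ValidID variant and accepts it.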

Related

Token with different interpretations (i.e. keyword and identifier)

I am writing a grammar with a lot of case-insensitive keywords in ANTLR4. I collected some example files for the format and am test-parsing them. Some of them use words that are keywords in one place as identifiers in another. For example, there is a CORE keyword that elsewhere is used as an ID for a user-defined structure. Here are some parts of my grammar:
fragment A : [aA]; // match either an 'a' or 'A'
fragment B : [bB];
fragment C : [cC];
[...]
CORE: C O R E ;
[...]
IDSTRING: [a-zA-Z_] [a-zA-Z0-9_]*;
id: IDSTRING ;
The resulting error is line 7982:8 mismatched input 'core' expecting IDSTRING: the user input is meant to be an IDSTRING, but it is always consumed by the keyword rule. In the input, the word appears both as a keyword and as an ID, like this:
MACRO oa12f01
CLASS CORE ; #here it is a KEYWORD
[...]
SITE core ; #here it is a ID
Is there a way to let users use some keywords as identifiers by changing my grammar, something like "casting" the token to IDSTRING in rules like this? Or is that a false hope for parsers that are not handwritten?
You can simply list the keywords that are allowed as identifiers as alternatives in the id rule:
id: IDSTRING | CORE | ... ;

How do underivable rules affect parsing?

When writing an Xtext grammar for a simple SQL dialect, I found out that rules that cannot be derived from the start symbol apparently affect parsing.
E.g. given the following (very simplified) extract of my grammar which should be able to parse expressions like FROM table1;:
Start:
subquery ';';
subquery:
/*select=select_clause */tables=from_clause;
from_clause:
'FROM' tables;
tables:
tables+=table (',' tables+=table)*;
table:
name=table_name (alias=alias)?;
table_name:
prefix=qualified_name_prefix? name=qualified_name;
qualified_name_prefix:
ID'.';
qualified_name :
=>qualified_name_prefix? ID;
alias returns EString:
'AS'? alias=ID;
with_clause :
'WITH' elements+=with_list_element (',' elements+=with_list_element)*;
with_list_element :
name=ID (column_list_clause=column_list_clause)? 'AS' '(' subquery=subquery ')';
column_list_clause :
'(' names+=ID+ ')';
When trying to parse the string FROM table1;, I get the following error:
'no viable alternative at input ';'' on EString
If I remove rule with_clause, the error is gone and the string is parsed properly. How is this possible even though with_clause cannot be derived from Start?
The problem is that the syntactic predicate (=>) covers up an ambiguity.
Maybe you can merge prefix and name into a single rule:
Table_name:
name=Qualified_name;
Qualified_name :
(ID '.' (ID '.')?)? ID;
Or you could try something like this:
Table_name:
((prefix=ID ".")? =>name=Qualified_name);
Qualified_name :
=>(ID '.' ID) | ID;

Parsing of optionals with PEG (Grako) falling short?

My colleague PaulS asked me the following:
I'm writing a parser for an existing language (SystemVerilog - an IEEE standard), and the specification has a rule in it that is similar in structure to this:
cover_point
=
[[data_type] identifier ':' ] 'coverpoint' identifier ';'
;
data_type
=
'int' | 'float' | identifier
;
identifier
=
?/\w+/?
;
The problem is that when parsing the following legal string:
anIdentifier: coverpoint another_identifier;
anIdentifier matches with data_type (via its identifier option) successfully, which means Grako is looking for another identifier after it and then fails. It doesn't then try to parse without the data_type part.
I can re-write the rule as follows,
cover_point_rewrite
=
[data_type identifier ':' | identifier ':' ] 'coverpoint' identifier ';'
;
but I wonder if:
this is intentional and
if there's a better syntax?
Is this a PEG-in-general issue, or a tool (Grako) one?
In PEGs the choice operator is ordered: the first alternative that matches is used, which avoids the ambiguities of CFGs.
In your first example [data_type] succeeds parsing id, so it fails when it finds : instead of another identifier.
That may be because [data_type] behaves like (data_type | ε) so it will always parse data_type with the first id.
In [data_type identifier ':' | identifier ':' ] the first choice fails when there is no second id, so the parser backtracks and tries with the second choice.
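The behaviour can be reproduced with a few hand-written PEG-style combinators. This is only a sketch in plain Python, not Grako's implementation; the helper names (lit, regex, seq, choice, opt) are made up for illustration. Each parser returns the new position on success or None on failure, and opt never revisits its decision once the inner parser has succeeded.

```python
import re


def skip_ws(text, pos):
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def lit(s):
    def parse(text, pos):
        pos = skip_ws(text, pos)
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def regex(pattern):
    compiled = re.compile(pattern)

    def parse(text, pos):
        m = compiled.match(text, skip_ws(text, pos))
        return m.end() if m else None
    return parse


def seq(*parsers):
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def choice(*parsers):
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end  # ordered choice: commit to the first success
        return None
    return parse


def opt(p):
    def parse(text, pos):
        end = p(text, pos)
        return pos if end is None else end  # behaves like (p | empty)
    return parse


identifier = regex(r"\w+")
data_type = choice(lit("int"), lit("float"), identifier)
tail = seq(lit("coverpoint"), identifier, lit(";"))

# [[data_type] identifier ':' ] 'coverpoint' identifier ';'
cover_point = seq(opt(seq(opt(data_type), identifier, lit(":"))), tail)

# [data_type identifier ':' | identifier ':' ] 'coverpoint' identifier ';'
cover_point_rewrite = seq(
    opt(choice(seq(data_type, identifier, lit(":")),
               seq(identifier, lit(":")))),
    tail,
)

text = "anIdentifier: coverpoint another_identifier;"
print(cover_point(text, 0))          # None: the label is consumed as data_type
print(cover_point_rewrite(text, 0))  # end position: the second alternative matches
```

In the first rule the inner optional succeeds on anIdentifier, the following identifier fails on ':', and the whole label group is dropped, so 'coverpoint' is then tried at position 0. In the rewritten rule the failure happens inside the first alternative of a choice, so the second alternative still gets its turn.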

Need keywords to be recognized as such only in the correct places

I am new to Antlr and parsing, so this is a learning exercise for me.
I am trying to parse a language that allows free-format text in some locations. The free-format text may therefore be ANY word or words, including the keywords in the language - their location in the language's sentences defines them as keywords or free text.
In the following example, the first instance of "JOB" is a keyword; the second "JOB" is free-form text:
JOB=(JOB)
I have tried the following grammar, which avoids defining the language's keywords in lexer rules.
grammar Test;
test1 : 'JOB' EQ OPAREN (utext) CPAREN ;
utext : UNQUOTEDTEXT ;
COMMA : ',' ;
OPAREN : '(' ;
CPAREN : ')' ;
EQ : '=' ;
UNQUOTEDTEXT : ~[a-z,()\'\" \r\n\t]*? ;
SPC : [ \t]+ -> skip ;
I was hoping that by defining the keywords as string literals in the parser rules, as above, they would apply only in the location in which they were defined. This appears not to be the case. On testing the "test1" rule (with the Antlr4 plug-in in IDEA), using the example phrase shown above - "JOB=(JOB)" (without quotes) - as input, I get the following error message:
line 1:5 mismatched input 'JOB' expecting UNQUOTEDTEXT
So after creating an implicit token for 'JOB', it looks like Antlr uses that token in other points in the parser grammar, too, i.e. whenever it sees the 'JOB' string. To test this, I added another parser rule:
test2 : 'DATA' EQ OPAREN (utext) CPAREN ;
and tested with "DATA=(JOB)"
I got the following error (similar to before):
line 1:6 mismatched input 'JOB' expecting UNQUOTEDTEXT
Is there any way to ask Antlr to enforce the token recognition in the locations only where it is defined/introduced?
Thanks!
What you have is essentially a lake grammar, the opposite of an island grammar. A lake grammar is one in which you mostly have structured text and then lakes of stuff you don't care about. Generally the key is having some lexical sentinel that says "enter unstructured text area" and then "re-enter structured text area". In your case it seems to be (...). ANTLR has the notion of a lexical mode, which is what you want in order to handle areas with different lexical structures. When you see a '(' you want to switch modes to some free-form area. When you see a ')' in that area you want to switch back to the default mode. Anyway, "mode" is your key word here.
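The idea of a lexical mode can be sketched as a small hand-written tokenizer. This is plain Python rather than an ANTLR lexer grammar, and the names KEYWORDS, DEFAULT and FREE are assumptions made up for the example. The point is only that the same characters produce different tokens depending on the mode the lexer is in.

```python
# Assumption: these are the words that act as keywords in structured text.
KEYWORDS = {"JOB", "DATA"}


def tokenize(text):
    """Tokenize with two modes: DEFAULT (structured) and FREE (inside parentheses)."""
    tokens = []
    mode = "DEFAULT"
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if mode == "DEFAULT":
            if ch in " \t":
                pos += 1  # skip whitespace in structured text
            elif ch == "=":
                tokens.append(("EQ", ch))
                pos += 1
            elif ch == "(":
                tokens.append(("OPAREN", ch))
                pos += 1
                mode = "FREE"  # sentinel: enter the free-form area
            elif ch.isalpha():
                end = pos
                while end < len(text) and text[end].isalpha():
                    end += 1
                word = text[pos:end]
                tokens.append(("KEYWORD" if word in KEYWORDS else "WORD", word))
                pos = end
            else:
                raise SyntaxError(f"unexpected character {ch!r} at {pos}")
        else:
            # FREE mode: everything up to the closing parenthesis is plain text.
            end = text.find(")", pos)
            if end == -1:
                raise SyntaxError("unterminated free-form text")
            tokens.append(("UNQUOTEDTEXT", text[pos:end]))
            tokens.append(("CPAREN", ")"))
            pos = end + 1
            mode = "DEFAULT"  # sentinel: back to structured text
    return tokens


print(tokenize("JOB=(JOB)"))
```

The first JOB is lexed in the default mode and becomes a keyword token, while the second one is lexed in the free-form mode and becomes UNQUOTEDTEXT.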
I had a similar problem with keywords that are sometimes only identifiers. I did it this way:
OnlySometimesAKeyword : 'value' ;
identifier
: Identifier // defined as usual
| maybeKeywords
;
maybeKeywords
: OnlySometimesAKeyword
// ...
;
In your parser rules simply use identifier instead of Identifier and you'll also be able to match the "maybe keywords". This will of course also match them in places where they will be keywords, but you could check this in the parser if necessary.

Xtext Grammar: "The following alternatives can never be matched"

I am running into a problem with ambiguity in a rather complicated grammar I have been building up. It's too complex to post here, so I've reduced my problem down to aid comprehension.
I am getting the following error:
error(201): ../org.xtext.example.mydsl.ui/src-gen/org/xtext/example/mydsl/ui/contentassist/antlr/internal/InternalMyDsl.g:398:1: The following alternatives can never be matched: 2
From this grammar:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
(contents+=ModelMember)*;
ModelMember:
Field | Assignment | Static | Class
;
Static:
"static" type=TypeDef name=ID
;
Class:
"class" name=ID "{"
(fields+=Field)*
"}"
;
Field:
"var" type=TypeDef name=ID
;
TypeDef:
{Primtive} ("String" | "int") |
{Object} clazz=[Class]
;
Reference:
(
{StaticField} static=[Static] (withDiamond?="<>")?
|
{DynamicField} field=[Field]
)
;
ObjectReference:
reference=Reference ({ObjectReference.target=current} '.' reference=Reference)*
;
Assignment:
field=ObjectReference "=" value=ObjectReference
;
I know the problem relates to Reference, which is struggling with the ambiguity of which rule to choose.
I can get it to compile with the following grammar change, but this allows syntax that I deem to be illegal:
Reference:
ref=[RefType] (withDiamond?="<>")?
;
RefType:
Static|Field
;
Where my use-case is:
static String a
class Person {
String name
}
Person paul
// This should be legal
paul.name = a<>;
// This should be illegal, diamond not valid against non-static vars
paul.name = paul.name<>;
// This should be legal
paul.name = paul.name
Your second grammar is the way to go. The fact that diamond is only legal for static variables can be handled in your language's validator.
Generally, make your grammar loose and your validation strict. That makes your grammar easier to maintain. It also gives your users better error messages ("Diamond is not allowed for non-static vars" instead of "Invalid input '<'").
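The "loose grammar, strict validation" split can be sketched as a check that runs on the parsed model instead of in the parser. The following is plain Python, not an Xtext validator; the Reference class and the validate function are hypothetical stand-ins for the model element and the check you would write in your language's validator.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    """Hypothetical parsed reference: the loose grammar accepts '<>' on any of them."""
    name: str
    is_static: bool
    with_diamond: bool = False


def validate(ref):
    """Return a list of error messages for one parsed reference."""
    errors = []
    if ref.with_diamond and not ref.is_static:
        errors.append(f"Diamond is not allowed for non-static var '{ref.name}'")
    return errors


print(validate(Reference("a", is_static=True, with_diamond=True)))
print(validate(Reference("name", is_static=False, with_diamond=True)))
```

The parser accepts both a<> and paul.name<>, and only the second one yields a validation error with a message phrased in terms of the language rather than in terms of tokens.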

Resources