Parsing of optionals with PEG (Grako) falling short? - parsing

My colleague PaulS asked me the following:
I'm writing a parser for an existing language (SystemVerilog - an IEEE standard), and the specification has a rule in it that is similar in structure to this:
cover_point
=
[[data_type] identifier ':' ] 'coverpoint' identifier ';'
;
data_type
=
'int' | 'float' | identifier
;
identifier
=
?/\w+/?
;
The problem is that when parsing the following legal string:
anIdentifier: coverpoint another_identifier;
anIdentifier matches with data_type (via its identifier option) successfully, which means Grako is looking for another identifier after it and then fails. It doesn't then try to parse without the data_type part.
I can re-write the rule as follows,
cover_point_rewrite
=
[data_type identifier ':' | identifier ':' ] 'coverpoint' identifier ';'
;
but I wonder if:
this is intentional and
if there's a better syntax?
Is this a PEG-in-general issue, or a tool (Grako) one?

In PEGs the choice operator is ordered: it avoids the ambiguities of CFGs by committing to the first alternative that matches.
In your first example [data_type] succeeds parsing id, so it fails when it finds : instead of another identifier.
That may be because [data_type] behaves like (data_type | ε) so it will always parse data_type with the first id.
In [data_type identifier ':' | identifier ':' ] the first choice fails when there is no second id, so the parser backtracks and tries with the second choice.
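The behaviour described above can be reproduced outside Grako. What follows is only a minimal sketch in Python using hand-rolled PEG combinators; the helper names (tok, seq, choice, opt, matches) are made up for this illustration and are not Grako's API, and it assumes whitespace is skipped before every token.

```python
import re

# Each parser takes (text, pos) and returns the new position, or None on failure.
def tok(pattern):
    rx = re.compile(r"\s*(?:" + pattern + r")")  # assumption: skip leading whitespace
    def parse(text, pos):
        m = rx.match(text, pos)
        return m.end() if m else None
    return parse

def seq(*parsers):
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*parsers):
    # Ordered choice: the first alternative that succeeds wins.
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse

def opt(parser):
    # [e] behaves like (e | ε): once e has matched, ε is never tried instead.
    def parse(text, pos):
        end = parser(text, pos)
        return pos if end is None else end
    return parse

identifier = tok(r"\w+")
data_type = choice(tok(r"int\b"), tok(r"float\b"), identifier)
colon, semi, kw = tok(":"), tok(";"), tok(r"coverpoint\b")

# [[data_type] identifier ':' ] 'coverpoint' identifier ';'
cover_point = seq(opt(seq(opt(data_type), identifier, colon)), kw, identifier, semi)

# [data_type identifier ':' | identifier ':' ] 'coverpoint' identifier ';'
cover_point_rewrite = seq(
    opt(choice(seq(data_type, identifier, colon), seq(identifier, colon))),
    kw, identifier, semi,
)

def matches(rule, text):
    end = rule(text, 0)
    return end is not None and text[end:].strip() == ""

src = "anIdentifier: coverpoint another_identifier;"
print(matches(cover_point, src))          # False
print(matches(cover_point_rewrite, src))  # True
```

In the first rule the inner optional consumes anIdentifier as a data_type, the following identifier fails on ':', and the whole bracketed group falls back to empty, leaving 'coverpoint' to be matched against anIdentifier. In the rewritten rule the failure of the first alternative makes the parser try the second one from the same position.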

Related

The following token definitions can never be matched because prior tokens match the same input: INT,STRING

I am trying a simple grammar in ANTLR. It should parse inputs such as L=[1,2,hello].
However, ANTLR produces this error: The following token definitions can never be matched because prior tokens match the same input: INT,STRING. Any help?
grammar List;
decl: ID '=[' Inside1 ']'; // Declaration of a List. Example : L=[1,'hello']
Inside1: (INT|STRING) Inside2| ; // First element in the List. Could be nothing
Inside2:',' (INT|STRING) Inside2 | ; //
ID:('0'..'Z')+;
INT:('0'..'9')+;
STRING:('a'..'Z')+;
EDIT: The updated grammar. The error now remains for INT only.
grammar List;
decl: STRING '=[' Inside1 ']'; // Declaration of a List. Example : L=[1,'hello']
Inside1: (INT|'"'STRING'"') Inside2| ; // First element in the List. Could be nothing
Inside2:',' (INT|'"'STRING'"') Inside2 | ; //
STRING:('A'..'Z')+;
INT:('0'..'9')+;
Your ID pattern matches everything that would be matched by INT or STRING, making them irrelevant. I don't think that's what you want.
ID shouldn't match tokens starting with a digit; 42 is not an identifier. And your comment implies that STRING is intended to be a string literal ('hello') but your lexical pattern makes no attempt to match '.
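To see why an earlier, broader token rule shadows later ones, here is a small sketch in Python of a lexer that picks the longest match and breaks ties in favour of the rule listed first, which is how ANTLR lexers resolve such conflicts. The tokenize function and the two rule tables are hypothetical and only approximate the grammar above with regular expressions (the STRING pattern is taken from the updated grammar).

```python
import re

def tokenize(rules, text):
    """Longest match wins; on a tie the rule listed first wins."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two rules match the same length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out

# ID listed first and covering '0'..'Z': INT and STRING can never win.
shadowed = [("ID", r"[0-Z]+"), ("INT", r"[0-9]+"), ("STRING", r"[A-Z]+"), ("WS", r"\s+")]

# ID restricted so that it cannot start with a digit.
disjoint = [("INT", r"[0-9]+"), ("ID", r"[A-Z][A-Z0-9]*"), ("WS", r"\s+")]

print([kind for kind, _ in tokenize(shadowed, "42 HELLO")])  # ['ID', 'WS', 'ID']
print([kind for kind, _ in tokenize(disjoint, "42 HELLO")])  # ['INT', 'WS', 'ID']
```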

How to resolve Xtext variables' names and keywords statically?

I have a grammar describing an assembler dialect. In code section programmer can refer to registers from a certain list and to defined variables. Also I have a rule matching both [reg0++413] and [myVariable++413]:
BinaryBiasInsideFetchOperation:
'['
v = (Register|[IntegerVariableDeclaration]) ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
But when I try to compile it, Xtext throws a warning:
Decision can match input such as "'[' '++' 'reg0' ']'" using multiple alternatives: 2, 3. As a result, alternative(s) 3 were disabled for that input
Splitting the rule, I noticed that
BinaryBiasInsideFetchOperation:
'['
v = Register ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
BinaryBiasInsideFetchOperation:
'['
v = [IntegerVariableDeclaration] ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
work well separately, but not together. When I try to compile both of them, Xtext reports a number of errors saying that registers from the list could be processed ambiguously. So:
1) Am I right that the rule fragment v = (Register|[IntegerVariableDeclaration]) matches any IntegerVariable name, including the empty one, whereas v = [IntegerVariableDeclaration] matches only non-empty names?
2) Is it correct that when I compile the separate rules together, Xtext thinks that [IntegerVariableDeclaration] can clash with Register?
3) How can I resolve this ambiguity?
Edit: definitions
Register:
areg = ('reg0' | 'reg1' | 'reg2' | 'reg3' | 'reg4' | 'reg5' | 'reg6' | 'reg7' )
;
IntegerVariableDeclaration:
section = SectionServiceWord? name=ID ':' type = IntegerType ('[' size = IntValue ']')? ( value = IntegerVariableDefinition )? ';'
;
ID is a standard terminal that parses a single word, i.e. an identifier.
No, (Register|[IntegerVariableDeclaration]) can't match empty input. In fact, [IntegerVariableDeclaration] is the same as [IntegerVariableDeclaration|ID]: it matches the ID rule.
Yes, I think you can't split your rules.
I can't reproduce your problem (I would need the full grammar), but to track it down you should look at this article about Xtext grammar debugging:
Compile the grammar in debug mode by adding the following line to your workflow.mwe2:
fragment = org.eclipse.xtext.generator.parser.antlr.DebugAntlrGeneratorFragment {}
Open the generated ANTLR debug grammar with ANTLRWorks and check the diagram.
In addition to Fabien's answer, I'd like to add that an omnimatching rule like
AnyId:
name = ID
;
instead of
(Register|[IntegerVariableDeclaration])
solves the problem. One then needs to check dynamically whether AnyId.name is a Register, a Variable, or something else such as a Constant.
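The dynamic check could look roughly like the following Python sketch. It is not Xtext code: classify and declared_variables are hypothetical names, and only the register list reg0 to reg7 comes from the Register rule in the question.

```python
# reg0 .. reg7, taken from the Register rule in the question.
REGISTERS = {f"reg{i}" for i in range(8)}

def classify(name, declared_variables):
    """Decide after parsing what an AnyId-style name refers to."""
    if name in REGISTERS:
        return "register"
    if name in declared_variables:
        return "variable"
    return "unknown"

declared = {"myVariable"}  # hypothetical result of collecting declarations
print(classify("reg0", declared))        # register
print(classify("myVariable", declared))  # variable
print(classify("reg9", declared))        # unknown
```

The grammar then accepts any identifier inside the brackets, and the decision about what the name means moves into a later pass where the full list of declarations is available.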

Need keywords to be recognized as such only in the correct places

I am new to Antlr and parsing, so this is a learning exercise for me.
I am trying to parse a language that allows free-format text in some locations. The free-format text may therefore be ANY word or words, including the keywords in the language - their location in the language's sentences defines them as keywords or free text.
In the following example, the first instance of "JOB" is a keyword; the second "JOB" is free-form text:
JOB=(JOB)
I have tried the following grammar, which avoids defining the language's keywords in lexer rules.
grammar Test;
test1 : 'JOB' EQ OPAREN (utext) CPAREN ;
utext : UNQUOTEDTEXT ;
COMMA : ',' ;
OPAREN : '(' ;
CPAREN : ')' ;
EQ : '=' ;
UNQUOTEDTEXT : ~[a-z,()\'\" \r\n\t]*? ;
SPC : [ \t]+ -> skip ;
I was hoping that by defining the keywords as string literals in the parser rules, as above, they would apply only in the location in which they were defined. This appears not to be the case. On testing the "test1" rule (with the ANTLR4 plug-in in IDEA), using the example phrase shown above, "JOB=(JOB)" (without quotes), as input, I get the following error message:
line 1:5 mismatched input 'JOB' expecting UNQUOTEDTEXT
So after creating an implicit token for 'JOB', it looks like Antlr uses that token in other points in the parser grammar, too, i.e. whenever it sees the 'JOB' string. To test this, I added another parser rule:
test2 : 'DATA' EQ OPAREN (utext) CPAREN ;
and tested with "DATA=(JOB)"
I got the following error (similar to before):
line 1:6 mismatched input 'JOB' expecting UNQUOTEDTEXT
Is there any way to ask Antlr to enforce the token recognition in the locations only where it is defined/introduced?
Thanks!
What you have is essentially a lake grammar, the opposite of an island grammar. A lake grammar is one in which you mostly have structured text, with lakes of stuff you don't care about. Generally the key is having some lexical sentinel that says "enter unstructured text area" and then "re-enter structured text area". In your case it seems to be (...). ANTLR has the notion of a lexical mode, which is what you want in order to handle areas with different lexical structures. When you see a '(' you want to switch modes to some free-form area. When you see a ')' in that area you want to switch back to the default mode. Anyway, "mode" is your keyword here.
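The idea of lexical modes can be illustrated without ANTLR. The Python sketch below is a hand-written lexer with two modes; the MODES table, the mode names and the token names KEYWORD and UNQUOTEDTEXT are assumptions chosen to mirror the question, not ANTLR syntax.

```python
import re

# Each entry: (token name, pattern, mode to switch to or None).
MODES = {
    "default": [
        ("KEYWORD", r"[A-Z]+", None),
        ("EQ", r"=", None),
        ("OPAREN", r"\(", "free"),      # '(' enters the free-text area
        ("WS", r"[ \t]+", None),
    ],
    "free": [
        ("UNQUOTEDTEXT", r"[^()]+", None),
        ("CPAREN", r"\)", "default"),   # ')' returns to structured text
    ],
}

def tokenize(text):
    mode, pos, out = "default", 0, []
    while pos < len(text):
        for name, pattern, next_mode in MODES[mode]:
            m = re.compile(pattern).match(text, pos)
            if m:
                if name != "WS":        # whitespace is skipped, as in the grammar
                    out.append((name, m.group()))
                pos = m.end()
                if next_mode:
                    mode = next_mode
                break
        else:
            raise ValueError(f"unexpected character at position {pos} in mode {mode}")
    return out

print(tokenize("JOB=(JOB)"))
# [('KEYWORD', 'JOB'), ('EQ', '='), ('OPAREN', '('), ('UNQUOTEDTEXT', 'JOB'), ('CPAREN', ')')]
```

The same three characters produce a keyword token outside the parentheses and free text inside them, because the set of active lexer rules depends on the current mode.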
I had a similar problem with keywords that are sometimes only identifiers. I did it this way:
OnlySometimesAKeyword : 'value' ;
identifier
: Identifier // defined as usual
| maybeKeywords
;
maybeKeywords
: OnlySometimesAKeyword
// ...
;
In your parser rules simply use identifier instead of Identifier and you'll also be able to match the "maybe keywords". This will of course also match them in places where they will be keywords, but you could check this in the parser if necessary.

Xtext Grammar: "The following alternatives can never be matched"

I am running into a problem with ambiguity in a rather complicated grammar I have been building up. It's too complex to post here, so I've reduced my problem down to aid comprehension.
I am getting the following error:
error(201): ../org.xtext.example.mydsl.ui/src-gen/org/xtext/example/mydsl/ui/contentassist/antlr/internal/InternalMyDsl.g:398:1: The following alternatives can never be matched: 2
From this grammar:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
(contents+=ModelMember)*;
ModelMember:
Field | Assignment | Static | Class
;
Static:
"static" type=TypeDef name=ID
;
Class:
"class" name=ID "{"
(fields+=Field)*
"}"
;
Field:
"var" type=TypeDef name=ID
;
TypeDef:
{Primtive} ("String" | "int") |
{Object} clazz=[Class]
;
Reference:
(
{StaticField} static=[Static] (withDiamond?="<>")?
|
{DynamicField} field=[Field]
)
;
ObjectReference:
reference=Reference ({ObjectReference.target=current} '.' reference=Reference)*
;
Assignment:
field=ObjectReference "=" value=ObjectReference
;
I know the problem relates to Reference, where the parser cannot decide which alternative to choose.
I can get it to compile with the following grammar change, but this allows syntax that I deem to be illegal:
Reference:
ref=[RefType] (withDiamond?="<>")?
;
RefType:
Static|Field
;
Where my use-case is:
static String a
class Person {
String name
}
Person paul
// This should be legal
paul.name = a<>;
// This should be illegal, diamond not valid against non-static vars
paul.name = paul.name<>;
// This should be legal
paul.name = paul.name
Your second grammar is the way to go. The fact that diamond is only legal for static variables can be handled in your language's validator.
Generally, make your grammar loose and your validation strict. That makes your grammar easier to maintain. It also gives your users better error messages ("Diamond is not allowed for non-static vars" instead of "Invalid input '<'").
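As a rough illustration of "loose grammar, strict validation", the Python sketch below checks the diamond rule after parsing. It is not Xtext validator code: SYMBOLS and validate_reference are hypothetical, and the symbol table is filled by hand from the use case in the question.

```python
# Hypothetical symbol table built from the declarations in the use case.
SYMBOLS = {"a": "static", "name": "field", "paul": "field"}

def validate_reference(name, with_diamond):
    """Return the list of error messages for one parsed reference."""
    kind = SYMBOLS.get(name)
    if kind is None:
        return [f"Unknown reference '{name}'"]
    if with_diamond and kind != "static":
        return ["Diamond is not allowed for non-static vars"]
    return []

print(validate_reference("a", True))      # []
print(validate_reference("name", True))   # ['Diamond is not allowed for non-static vars']
print(validate_reference("name", False))  # []
```

The parser accepts `<>` after any reference, and the check that it only follows a static is a separate step that can produce a domain-specific message.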

Skipping tokens in yacc

I want to have a grammar rule like below in my yacc file:
insert_statement: INSERT INTO NAME (any_token)* ';'
We can skip all the tokens until a given token at an error, in yacc as follows:
stat: error ';'
Is there any mechanism to skip any number of characters in yacc, when there is no error?
Thanks
After some time I solved my problem in the following way, and I am mentioning it here in case it helps someone:
Add a token definition to lex covering the characters that may appear in a skipped token:
<*>[A-Za-z0-9_:.-]* { return SKIPPINGTOKS; }
(this would identify any token like a, 1, hello, hello123 etc.)
Then add rules such as the following to yacc as required:
insert_statement: INSERT INTO NAME skipping_portion ';'
skipping_portion: SKIPPINGTOKS | skipping_portion SKIPPINGTOKS
Hope this may help someone...
I think you would want to do something like this. It skips any and all tokens that are not the semicolon.
insert_statement: INSERT INTO NAME discardable_tokens_or_epsilon ';' ;
discardable_tokens_or_epsilon: discardable_tokens
| epsilon
;
discardable_tokens: discardable_tokens discardable_token
| discardable_token
;
discardable_token: FOO
| BAR
| BLETCH
...et cetera... anything other than a semicolon
;
epsilon: ;
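For comparison, the effect of these rules (accept the fixed prefix, then discard every token up to the semicolon) can be written by hand. The Python sketch below is only an analogue of the yacc rule, not generated parser code; parse_insert and the pre-split token list are assumptions made for the example.

```python
def parse_insert(tokens):
    """Accept INSERT INTO <name> <anything...> ';' and return (name, skipped tokens)."""
    if len(tokens) < 4 or tokens[0] != "INSERT" or tokens[1] != "INTO":
        raise SyntaxError("expected INSERT INTO <name> ... ;")
    name = tokens[2]
    try:
        end = tokens.index(";", 3)  # first semicolon after the table name
    except ValueError:
        raise SyntaxError("missing ';'") from None
    return name, tokens[3:end]      # everything in between is skipped, not analysed

tokens = ["INSERT", "INTO", "t", "VALUES", "(", "1", ")", ";"]
print(parse_insert(tokens))  # ('t', ['VALUES', '(', '1', ')'])
```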
Simply don't specify a production rule containing the tokens you'd like to skip.
