How to resolve Xtext variables' names and keywords statically? - parsing

I have a grammar describing an assembler dialect. In code section programmer can refer to registers from a certain list and to defined variables. Also I have a rule matching both [reg0++413] and [myVariable++413]:
BinaryBiasInsideFetchOperation:
'['
v = (Register|[IntegerVariableDeclaration]) ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
But when I try to compile it, Xtext throws a warning:
Decision can match input such as "'[' '++' 'reg0' ']'" using multiple alternatives: 2, 3. As a result, alternative(s) 3 were disabled for that input
Spliting the rules I've noticed, that
BinaryBiasInsideFetchOperation:
'['
v = Register ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
BinaryBiasInsideFetchOperation:
'['
v = [IntegerVariableDeclaration] ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
work well separately, but not at the same time. When I try to compile both of them, XText writes a number of errors saying that registers from list could be processed ambiguously. So:
1) Am I right, that part of rule v = (Register|[IntegerVariableDeclaration]) matches any IntegerVariable name including empty, but rule v = [IntegerVariableDeclaration] matches only nonempty names?
2) Is it correct that when I try to compile separate rules together Xtext thinks that [IntegerVariableDeclaration] can concur with Register?
3) How to resolve this ambiguity?
edit: definitors
Register:
areg = ('reg0' | 'reg1' | 'reg2' | 'reg3' | 'reg4' | 'reg5' | 'reg6' | 'reg7' )
;
IntegerVariableDeclaration:
section = SectionServiceWord? name=ID ':' type = IntegerType ('[' size = IntValue ']')? ( value = IntegerVariableDefinition )? ';'
;
ID is a standart terminal which parses a single word, a.k.a identifier

No, (Register|[IntegerVariableDeclaration]) can't match Empty. Actually, [IntegerVariableDeclaration] is the same than [IntegerVariableDeclaration|ID], it is matching ID rule.
Yes, i think you can't split your rules.
I can't reproduce your problem (i need full grammar), but, in order to solve your problem you should look at this article about xtext grammar debugging:
Compile grammar in debug mode by adding the following line into your workflow.mwe2
fragment = org.eclipse.xtext.generator.parser.antlr.DebugAntlrGeneratorFragment {}
Open generated antrl debug grammar with AntlrWorks and check the diagram.

In addition to Fabien's answer, I'd like to add that an omnimatching rule like
AnyId:
name = ID
;
instead of
(Register|[IntegerVariableDeclaration])
solves the problem. One need to dynamically check if AnyId.name is a Regiser, Variable or something else like Constant.

Related

Tatsu Parser, unclear why it isn't moving to the next rule in the line?

I am writing a code parser/formatter for a language that doesn't have one, OSTW (Overwatch higher level language for workshop code). So that I can be lazy and have pretty code.
I am pretty new to this idea, so if tatsu is a poor choice for this usecase, please let me know, I am rather ignorant. I have been going back and forth between the grammar syntax and some of the tutorials and it isn't clicking for me yet.
My sample document:
doSomething(param1,param2,arg=stuff,arg2=stuff2);
My EBNF:
##grammar::Ostw
##eol_comments :: /\/\/.*?$/
start = statement $ ;
statement = func:alpha '(' ','%{param:alpha}* [',' ','%{kwarg}*] ')' eol ;
eol = ';' ;
kwarg = key:alpha '=' val:alpha ;
alpha = (numbers|letters) ;
numbers = /\d+/ ;
letters = /\w+/ ;
The grammar compiles successfully, but when I attempt to parse my code, I get this output:
FailedToken: (1:30) expecting ')' :
doSomething(param1,param2,arg=stuff,arg2=stuff2);
^
statement
start
My expectation would be that, since the = is not a valid character for the alpha rule, it would go to the next thing in the list, since it is an unknown number of entries of either types.
My intention is to have my parser expect similarly to Python, params then keyword arguments.
I think I missed a paragraph somewhere in something basic is what it feels like!
Thanks for any help!
Mriswithe

Alphabetic characters not recognized in tatsu parse

I have defined a very simple grammar, but tatsu does not behave as expected.
I have added a "start" rule and terminated it with a "$" character, but I still see the same behavior.
If I define the "fingering" rule with a regular expression (digit = /[1-5x]/) instead of the individual terminal symbols, the problem disappears. But shouldn't the old-school BNF-like syntax below work?
from pprint import pprint
from tatsu import parse
GRAMMAR = """
##grammar :: test
##nameguard :: False
start = sequence $ ;
sequence = {digit}+ ;
digit = 'x' | '1' | '2' | '3' | '4' | '5' ;"""
test = "23"
ast = parse(GRAMMAR, test)
pprint(ast) # Prints ['2', '3']
test = "xx"
ast = parse(GRAMMAR, test)
pprint(ast) # Throws tatsu.exceptions.FailedParse: (1:1) no available options :
The "xx" test should produce "['x', 'x']" and not throw an exception.
What am I missing?
You probably need to check interactions with ##nameguard, which is turned on by default.
For the first version of the grammar, use:
##nameguard :: False
You can also consider the definitions of ##whitespace and ##namechars that best suite the language and grammar.
Okay, I think there is a problem with ##nameguard. See https://github.com/neogeny/TatSu/issues/95. The easy workaround for the time being is to use a pattern expression in lieu of individual alphabetic terminals. Also, when ##nameguard is fixed, the documentation should clarify that it only relates to alphanumerics that begin with an alphabetic. Clearly, we did not need ##nameguard for the numeric terminals here.

How to remove ambiguity from this piece of Delphi grammar

I'm working on a Delphi Grammar in Rascal and I'm having some problems parsing its “record” type. The relevant section of Delphi code can look as follows:
record
private
a,b,c : Integer;
x : Cardinal;
end
Where the "private" can be optional, and the variable declaration lines can also be optional.
I tried to interpret this section using the rules below:
syntax FieldDecl = IdentList ":" Type
| IdentList ":" Type ";"
;
syntax FieldSection = FieldDecl
| "var" FieldDecl
| "class" "var" FieldDecl
;
syntax Visibility = "private" | "protected" | "public"| "published" ;
syntax VisibilitySectionContent = FieldSection
| MethodOrProperty
| ConstSection
| TypeSection
;
syntax VisibilitySection = Visibility? VisibilitySectionContent+
;
syntax RecordType = "record" "end"
| "record" VisibilitySection+ "end"
;
Problem is ambiguity. The entire text between “record” and “end” can be parsed in a single VisibilitySection, but every line on its own can also be a seperate VisibilitySection.
I can change the rule VisibilitySection to
syntax VisibilitySection = Visibility
| VisibilitySectionContent
;
Then the grammar is no longer ambiguous, but the VisibilitySection becomes, flat, there is no nesting anymore of the variable lines under an optional 'private' node, which I would prefer.
Any suggestions on how to solve this problem? What I would like to do is demand a longest /greedy match on the VisibilitySectionContent+ symbol of VisibilitySection.
But changing
syntax VisibilitySection = Visibility? VisibilitySectionContent+
to
syntax VisibilitySection = Visibility? VisibilitySectionContent+ !>> VisibilitySectionContent
does not seem to work for this.
I also ran the Ambiguity report tool on Rascal, but it does not provide me any insights.
Any thoughts?
Thanks
I can't check since you did not provide the full grammar, but I believe this should work to get your "longest match" behavior:
syntax VisibilitySection
= Visibility? VisibilitySectionContent+ ()
>> "public"
>> "private"
>> "published"
>> "protected"
>> "end"
;
In my mind this should remove the interpretation where your nested VisibilitySections are cut short. Now we only accept such sections if they are immediately followed by either the end of the record, or the next section. I'm curious to find out if it really works because it is always hard to predict the behavior of a grammar :-)
The () at the end of the rule (empty non-terminal) makes sure we can skip to the start of the next part before applying the restriction. This only works if you have a longest match rule on layout already somewhere in the grammar.
The VisibilitySectionContent+ in VisibilitySection should be VisibilitySectionContent (without the Kleene plus).
I’m guessing here, but your intention is probably to allow a number of sections/declarations within the record type, and any of those may or may not have a Visibility modifier. To avoid putting this optional Visibility in every section, you have created a VisibilitySectionContent nonterminal which basically models “things that can happen within the record type definition”, one thing per nonterminal, without worrying about visibility modifiers. In this case, you’re fine with one VisibilitySectionContent per VisibilitySection since there is explicit repetition when you refer to the VisibilitySection from the RecordType anyway.

Ambiguous ANTLR parser rule

I have a very simple example text which I want to parse with ANTLR, and yet I'm getting wrong results due to ambiguous definition of the rule.
Here is the grammar:
grammar SimpleExampleGrammar;
prog : event EOF;
event : DEFINE EVT_HEADER eventName=eventNameRule;
eventNameRule : DIGIT+;
DEFINE : '#define';
EVT_HEADER : 'EVT_';
DIGIT : [0-9a-zA-Z_];
WS : ('' | ' ' | '\r' | '\n' | '\t') -> channel(HIDDEN);
First text example:
#define EVT_EX1
Second text example:
#define EVT_EX1
#define EVT_EX2
So, the first example is parsed correctly.
However, the second example doesn't work, as the eventNameRule matches the next "#define ..." and the parse tree is incorrect
Appreciate any help to change the grammar to parse this correctly.
Thanks,
Busi
Beside the missing loop specifier you also have a problem in your WS rule. The first alt matches anything. Remove that. And, btw, give your DIGIT rule a different name. It matches more than just digits.
As Adrian pointed out, my main mistake here is that in the initial rule (prog) I used "event" and not "event+" this will solve the issue.
Thanks Adrian.

Parsing of optionals with PEG (Grako) falling short?

My colleague PaulS asked me the following:
I'm writing a parser for an existing language (SystemVerilog - an IEEE standard), and the specification has a rule in it that is similar in structure to this:
cover_point
=
[[data_type] identifier ':' ] 'coverpoint' identifier ';'
;
data_type
=
'int' | 'float' | identifier
;
identifier
=
?/\w+/?
;
The problem is that when parsing the following legal string:
anIdentifier: coverpoint another_identifier;
anIdentifier matches with data_type (via its identifier option) successfully, which means Grako is looking for another identifier after it and then fails. It doesn't then try to parse without the data_type part.
I can re-write the rule as follows,
cover_point_rewrite
=
[data_type identifier ':' | identifier ':' ] 'coverpoint' identifier ';'
;
but I wonder if:
this is intentional and
if there's a better syntax?
Is this a PEG-in-general issue, or a tool (Grako) one?
It says here that in PEGs the choice operator is ordered to avoid CFGs ambiguities by using the first match.
In your first example [data_type] succeeds parsing id, so it fails when it finds : instead of another identifier.
That may be because [data_type] behaves like (data_type | ε) so it will always parse data_type with the first id.
In [data_type identifier ':' | identifier ':' ] the first choice fails when there is no second id, so the parser backtracks and tries with the second choice.

Resources