Alphabetic characters not recognized in TatSu parse

I have defined a very simple grammar, but tatsu does not behave as expected.
I have added a "start" rule and terminated it with a "$" character, but I still see the same behavior.
If I define the "fingering" rule with a regular expression (digit = /[1-5x]/) instead of the individual terminal symbols, the problem disappears. But shouldn't the old-school BNF-like syntax below work?
from pprint import pprint
from tatsu import parse
GRAMMAR = """
@@grammar :: test
@@nameguard :: False
start = sequence $ ;
sequence = {digit}+ ;
digit = 'x' | '1' | '2' | '3' | '4' | '5' ;"""
test = "23"
ast = parse(GRAMMAR, test)
pprint(ast) # Prints ['2', '3']
test = "xx"
ast = parse(GRAMMAR, test) # Throws tatsu.exceptions.FailedParse: (1:1) no available options :
pprint(ast) # Never reached
The "xx" test should produce "['x', 'x']" and not throw an exception.
What am I missing?

You probably need to check interactions with @@nameguard, which is turned on by default.
For the first version of the grammar, use:
@@nameguard :: False
You can also consider the definitions of @@whitespace and @@namechars that best suit the language and grammar.

Okay, I think there is a problem with @@nameguard. See https://github.com/neogeny/TatSu/issues/95. The easy workaround for the time being is to use a pattern expression in lieu of individual alphabetic terminals. Also, when @@nameguard is fixed, the documentation should clarify that it only relates to alphanumeric tokens that begin with an alphabetic character. Clearly, we did not need @@nameguard for the numeric terminals here.
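To make the discussion above concrete, here is a rough standard-library emulation of a name guard. It is only a sketch: the guard condition (a token that starts with a letter must not be followed directly by another alphanumeric character) is modelled on the description in this thread, and the helper names match_token, parse_digits and parse_with_pattern are hypothetical. It is not TatSu's implementation.

```python
import re

TOKENS = ["x", "1", "2", "3", "4", "5"]


def match_token(text, pos, token, nameguard=True):
    """Return the position after `token` if it matches at `pos`, else None."""
    if not text.startswith(token, pos):
        return None
    end = pos + len(token)
    # Assumed guard: only tokens that begin with a letter are protected.
    if nameguard and token.isalnum() and token[0].isalpha():
        if end < len(text) and text[end].isalnum():
            return None
    return end


def parse_digits(text, nameguard=True):
    """Emulate `{digit}+ $` with individual terminal symbols."""
    pos, result = 0, []
    while pos < len(text):
        for token in TOKENS:
            end = match_token(text, pos, token, nameguard)
            if end is not None:
                result.append(token)
                pos = end
                break
        else:
            raise ValueError(f"no available options at position {pos}")
    return result


def parse_with_pattern(text):
    """Emulate the workaround: a single pattern instead of terminals."""
    if not re.fullmatch(r"[1-5x]+", text):
        raise ValueError("no match")
    return list(text)
```

In this model "23" parses either way, "xx" fails only while the guard is active, and the pattern version is unaffected because no alphabetic terminal is involved.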

Related

Tatsu Parser, unclear why it isn't moving to the next rule in the line?

I am writing a code parser/formatter for a language that doesn't have one, OSTW (Overwatch higher level language for workshop code), so that I can be lazy and have pretty code.
I am pretty new to this idea, so if TatSu is a poor choice for this use case, please let me know. I have been going back and forth between the grammar syntax and some of the tutorials, and it isn't clicking for me yet.
My sample document:
doSomething(param1,param2,arg=stuff,arg2=stuff2);
My EBNF:
@@grammar::Ostw
@@eol_comments :: /\/\/.*?$/
start = statement $ ;
statement = func:alpha '(' ','%{param:alpha}* [',' ','%{kwarg}*] ')' eol ;
eol = ';' ;
kwarg = key:alpha '=' val:alpha ;
alpha = (numbers|letters) ;
numbers = /\d+/ ;
letters = /\w+/ ;
The grammar compiles successfully, but when I attempt to parse my code, I get this output:
FailedToken: (1:30) expecting ')' :
doSomething(param1,param2,arg=stuff,arg2=stuff2);
^
statement
start
My expectation would be that, since the = is not a valid character for the alpha rule, it would go to the next thing in the list, since it is an unknown number of entries of either types.
My intention is to have my parser expect similarly to Python, params then keyword arguments.
I think I missed a paragraph somewhere in something basic is what it feels like!
Thanks for any help!
Mriswithe
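One possible reading of the error, offered as a sketch rather than a confirmed diagnosis: the positional list is greedy, and because the alpha rule also accepts arg, the list swallows it before the parser ever sees the = sign. The standard-library emulation below mimics item (',' item)* with the same \w+ pattern as the letters rule. The names comma_list and PARAM are hypothetical, and the guarded pattern is only one way to stop a word that is followed by =.

```python
import re

TEXT = "doSomething(param1,param2,arg=stuff,arg2=stuff2);"

ALPHA = re.compile(r"\w+")         # same pattern as the `letters` rule
PARAM = re.compile(r"\w+\b(?!=)")  # hypothetical: a whole word not followed by '='


def comma_list(item, text, pos):
    """Greedily match item (',' item)* and return (items, end position)."""
    items = []
    m = item.match(text, pos)
    while m:
        items.append(m.group())
        pos = m.end()
        # Only consume the comma if another item follows it.
        m = item.match(text, pos + 1) if text.startswith(",", pos) else None
    return items, pos


start = TEXT.index("(") + 1
greedy_items, greedy_end = comma_list(ALPHA, TEXT, start)
guarded_items, guarded_end = comma_list(PARAM, TEXT, start)

print(greedy_items, greedy_end + 1)    # ['param1', 'param2', 'arg'] 30
print(guarded_items, guarded_end + 1)  # ['param1', 'param2'] 26
```

The greedy list stops at column 30, which is where the reported error points. In the grammar itself, a negative lookahead in front of the positional item might play the role of the guarded pattern; that suggestion is untested here.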

How to understand ANTLRWorks 1.5.2 MismatchedTokenException(80!=21)

I'm testing a simple grammar (shown below) with simple input strings and get the following error message from the Antlrworks interpreter: MismatchedTokenException(80!=21).
My input (abc45{r24}) means "repeat the keys a, b, c, 4 and 5, 24 times."
ANTLRWorks 1.5.2 Grammar:
expr : '(' (key)+ repcount ')' EOF;
key : KEY | digit ;
repcount : '{' 'r' count '}';
count : (digit)+;
digit : DIGIT;
DIGIT : '0'..'9';
KEY : ('a'..'z'|'A'..'Z') ;
Inputs:
(abc4{r4}) - ok
(abc44{r4}) - fails NoViableAltException
(abc4 4{r4}) - ok
(abc4{r45}) - fails MismatchedTokenException(80!=21)
(abc4{r4 5}) - ok
The parse succeeds with input (abc4{r4}) (single digits only).
The parse fails with input (abc44{r4}) (NoViableAltException).
The parse fails with input (abc4{r45}) (MismatchedTokenException(80!=21)).
The parse errors go away if I put a space between 44 or 45 to separate the individual digits.
Q1. What does NoViableAltException mean? How can I interpret it to look for a problem in the grammar/input pair?
Q2. What does the expression 80!=21 mean? Can I do anything useful with the information to look for a problem in the grammar/input pair?
I don't understand why the grammar has a problem reading successive digits. I thought my expressions (key)+ and (digit)+ specify that successive digits are allowed and would be read as successive individual digits.
If someone could explain what I'm doing wrong, I would be grateful. This seems like a simple problem, but hours later, I still don't understand why and how to solve it. Thank you.
UPDATE:
Further down in my simple grammar file I had a lexer rule for FLOAT copied from another grammar. I did not think to include it above (or check it as a source of the errors) because it was not used by any parser rule and would never match my input characters. Here is the FLOAT grammar rule (which contains sequences of DIGITs):
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
If I delete the whole rule, all my test cases above parse successfully. If I leave any one of the three FLOAT clauses in the grammar/lexer file, the parses fail as shown above.
Q3. Why does the FLOAT rule cause failures in the parse? The DIGIT lexer rule appears first, and so should "win" and be used in preference to the FLOAT rule. Besides, the FLOAT rule doesn't match the input stream.
I hazard a guess that the lexer is skipping the DIGIT rule and getting stuck in the FLOAT rule, even though FLOAT comes after DIGIT in the grammar file.
SCREENSHOTS
I took these two screenshots after Bart's comment below to show the parse failures that I am experiencing. Not that it matters, but ANTLRWorks 1.5.2 will not accept the SPACE : [ \t\r\n]+; regular expression syntax in Bart's kind replies. Maybe the screenshots will help. They show all the rules in my grammar file.
The only difference between the two screenshots is that one input has two sets of multiple digits and the other input string has only one set of multiple digits. Maybe this extra info will help somehow.
If I remember correctly, ANTLR's v3 lexer is less powerful than v4's version. When the lexer gets the input "123x", the first 3 chars (123) are consumed by the lexer rule FLOAT, but after that, when the lexer encounters the x, it knows it cannot complete the FLOAT rule. However, the v3 lexer does not give up on its partial match and tries to find another rule, below it, that matches these 3 chars (123). Since there is no such rule, the lexer throws an exception. Again, I am not 100% sure; this is how I remember it.
ANTLRv4's lexer will give up on the partial 123 match and will return 23 to the char stream to create a single DIGIT token for the input 1.
I highly suggest you move away from v3 and opt for the more powerful v4 version.
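The v4-style behavior described above can be sketched in Python as a lexer that keeps only the longest complete match at each position and falls back to a shorter rule when a longer one cannot finish. This is a simplified model, not ANTLR's algorithm: FLOAT is reduced to two alternatives because EXPONENT is not shown in the question, LITERAL is a hypothetical stand-in for the quoted tokens in the parser rules, and the v3 behavior is not modelled at all.

```python
import re

RULES = [
    ("LITERAL", re.compile(r"[(){}r]")),  # stand-in for '(' ')' '{' '}' 'r'
    ("DIGIT", re.compile(r"[0-9]")),
    ("KEY", re.compile(r"[a-zA-Z]")),
    ("FLOAT", re.compile(r"[0-9]+\.[0-9]*|\.[0-9]+")),  # simplified
]


def lex_longest(text):
    """Tokenize by longest complete match; earlier rules win ties."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


print(lex_longest("123x"))
# [('DIGIT', '1'), ('DIGIT', '2'), ('DIGIT', '3'), ('KEY', 'x')]
print(lex_longest("1.5"))
# [('FLOAT', '1.5')]
```

Under this model the input (abc44{r4}) lexes into single DIGIT tokens even with FLOAT present, which matches the outcome the answer attributes to v4.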

ANTLR Tries to Match an Expression That Wasn't Specified as an Option

I'm trying to understand how ANTLR grammars work and I've come across a situation where it behaves unexpectedly and I can't explain why or figure out how to fix it.
Here's the example:
root : title '\n' fields EOF;
title : STR;
fields : field_1 field_2;
field_1 : 'a' | 'b' | 'c';
field_2 : 'd' | 'e' | 'f';
STR : [a-z]+;
There are two parts:
A title that is a lowercase string with no special characters
A two character string representing a set of possible configurations
When I go to test the grammar, here's what happens: first I write the title and, on a new line, give the character for the first field. So far so good. The parse tree looks as I would expect up to this point.
When I add the next field is when the problem comes up. ANTLR decides to reinterpret the line as an instance of STR instead of a concatenation of the fields that I was expecting.
I do not understand why ANTLR tries to force an unrelated terminal expression when it wasn't specified as an option by the grammar. Shouldn't it know to only look for characters matching the field rules since it is descended from the fields node in the parse tree? What's going on here and how do I write my ANTLR grammars so they don't have this problem?
I've read that ANTLR tries to match the format greedily from the top of the grammar to the bottom, but this doesn't explain why this is happening, because the STR terminal is the very last line in the file. If ANTLR gives special precedence to matching terminals, how do I format the grammar so that it interprets it properly? As far as I understand, regexes do not work for non-terminals, so it seems that I have to define it the way it is now.
A note of clarification: this is just an example of a possible grammar that I'm trying to make work with the text format as is, so I'm not looking for answers like adding a space between the fields or changing the title to be uppercase.
What I didn't understand before is that there are two steps in generating a parser:
Scanning the input for a list of tokens using the lexer rules (uppercase statements) and then...
Constructing a parse tree using the parser rules (lowercase statements) and generated tokens
My problem was that there was no way for ANTLR to know I wanted that specific string to be interpreted differently when it was generating the tokens. To fix this problem, I wrote a new lexer rule for the fields string so that it would be identifiable as a token. The key was making the FIELDS rule appear before the STR rule because ANTLR checks them in the order they appear.
root : title FIELDS EOF;
title : STR;
FIELDS : [a-c] [d-f];
STR : [a-z]+;
Note: I had to bite the bullet and read the ANTLR Mega Tutorial to figure this out.
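The two steps can be illustrated with a small Python model of the lexing phase alone. It assumes the usual ANTLR 4 rules: the longest match wins, ties go to the rule listed first, and literals used in parser rules behave like tokens defined ahead of the lexer rules. The token names FIELD_1, FIELD_2 and NL are hypothetical labels for those implicit literal tokens.

```python
import re


def tokenize(text, rules):
    """Context-free lexing: longest match wins, earliest rule wins ties."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


ORIGINAL = [
    ("FIELD_1", r"[abc]"),  # implicit tokens for 'a' | 'b' | 'c'
    ("FIELD_2", r"[def]"),  # implicit tokens for 'd' | 'e' | 'f'
    ("NL", r"\n"),
    ("STR", r"[a-z]+"),
]
FIXED = [
    ("FIELDS", r"[a-c][d-f]"),
    ("NL", r"\n"),
    ("STR", r"[a-z]+"),
]

print(tokenize("title\na", ORIGINAL))
# [('STR', 'title'), ('NL', '\n'), ('FIELD_1', 'a')]
print(tokenize("title\nad", ORIGINAL))
# [('STR', 'title'), ('NL', '\n'), ('STR', 'ad')]
print(tokenize("title\nad", FIXED))
# [('STR', 'title'), ('NL', '\n'), ('FIELDS', 'ad')]
```

The lexer never consults the parser rules, so ad becomes STR in the original grammar simply because STR is the longest match. One caveat of the fix in this model: a title that happens to be ad would also be lexed as FIELDS.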

How to resolve Xtext variables' names and keywords statically?

I have a grammar describing an assembler dialect. In the code section, the programmer can refer to registers from a certain list and to defined variables. I also have a rule matching both [reg0++413] and [myVariable++413]:
BinaryBiasInsideFetchOperation:
'['
v = (Register|[IntegerVariableDeclaration]) ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
But when I try to compile it, Xtext throws a warning:
Decision can match input such as "'[' '++' 'reg0' ']'" using multiple alternatives: 2, 3. As a result, alternative(s) 3 were disabled for that input
Splitting the rules, I've noticed that
BinaryBiasInsideFetchOperation:
'['
v = Register ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
BinaryBiasInsideFetchOperation:
'['
v = [IntegerVariableDeclaration] ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
work well separately, but not at the same time. When I try to compile both of them, Xtext reports a number of errors saying that registers from the list could be processed ambiguously. So:
1) Am I right that the rule fragment v = (Register|[IntegerVariableDeclaration]) matches any IntegerVariable name, including an empty one, but the rule v = [IntegerVariableDeclaration] matches only nonempty names?
2) Is it correct that when I try to compile the separate rules together, Xtext thinks that [IntegerVariableDeclaration] can compete with Register?
3) How to resolve this ambiguity?
Edit: definitions
Register:
areg = ('reg0' | 'reg1' | 'reg2' | 'reg3' | 'reg4' | 'reg5' | 'reg6' | 'reg7' )
;
IntegerVariableDeclaration:
section = SectionServiceWord? name=ID ':' type = IntegerType ('[' size = IntValue ']')? ( value = IntegerVariableDefinition )? ';'
;
ID is a standard terminal which parses a single word, a.k.a. an identifier.
1) No, (Register|[IntegerVariableDeclaration]) can't match empty input. Actually, [IntegerVariableDeclaration] is the same as [IntegerVariableDeclaration|ID]: it matches the ID rule.
2) Yes, I think you can't split your rules.
3) I can't reproduce your problem (I would need the full grammar), but in order to solve it you should look at this article about Xtext grammar debugging:
Compile the grammar in debug mode by adding the following line to your workflow.mwe2:
fragment = org.eclipse.xtext.generator.parser.antlr.DebugAntlrGeneratorFragment {}
Open the generated ANTLR debug grammar with ANTLRWorks and check the diagram.
In addition to Fabien's answer, I'd like to add that an omnimatching rule like
AnyId:
name = ID
;
instead of
(Register|[IntegerVariableDeclaration])
solves the problem. One then needs to check dynamically whether AnyId.name is a Register, a Variable or something else, such as a Constant.
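The dynamic check mentioned above could look roughly like the following Python sketch. In an Xtext project this logic would live in the language's own validation or scoping code; the function classify and its return labels are hypothetical and only illustrate the order of the checks.

```python
# Register names as listed in the Register rule of the question.
REGISTERS = {f"reg{i}" for i in range(8)}


def classify(name, declared_variables):
    """Decide after parsing what an AnyId name refers to."""
    if name in REGISTERS:
        return "register"
    if name in declared_variables:
        return "variable"
    return "unknown"  # for example a constant, or an error to report


print(classify("reg0", {"myVariable"}))        # register
print(classify("myVariable", {"myVariable"}))  # variable
print(classify("other", {"myVariable"}))       # unknown
```

Checking registers first means a variable cannot shadow a register name in this model; whether that is the desired policy is a design choice for the language.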

Spirit: Allowing a character at the beginning but not in the middle

I'm trying to write a parser for JavaScript identifiers. So far this is what I have:
// All these rules have a string as their attribute.
identifier_ = identifier_start
    >>
    *(
        identifier_part >> -(qi::char_(".") > identifier_part)
    )
;
identifier_part = +(qi::alnum | qi::char_("_"));
identifier_start = qi::char_("a-zA-Z$_");
This parser works fine for the list of "good identifiers" in my tests:
"x__",
"__xyz",
"_",
"$",
"foo4_.bar_3",
"$foo.bar",
"$foo",
"_foo_bar.foo",
"_foo____bar.foo"
but I'm having trouble with one of the bad identifiers: foo$bar. This is supposed to fail, but it succeeds, and the synthesized attribute has the value "foo".
Here is the debug ouput for foo$bar:
<identifier_>
<try>foo$bar</try>
<identifier_start>
<try>foo$bar</try>
<success>oo$bar</success>
<attributes>[[f]]</attributes>
</identifier_start>
<identifier_part>
<try>oo$bar</try>
<success>$bar</success>
<attributes>[[f, o, o]]</attributes>
</identifier_part>
<identifier_part>
<try>$bar</try>
<fail/>
</identifier_part>
<success>$bar</success>
<attributes>[[f, o, o]]</attributes>
</identifier_>
What I want is for the parser to fail when parsing foo$bar but not when parsing $foobar.
What am I missing?
You don't require that the parser needs to consume all input.
When a rule stops matching before the $ sign, it returns with success, because nothing says it can't be followed by a $ sign. So, you would like to assert that it isn't followed by a character that could be part of an identifier:
identifier_ = identifier_start
    >>
    *(
        identifier_part >> -(qi::char_(".") > identifier_part)
    ) >> !identifier_start
;
A related directive is distinct from the Qi repository: http://www.boost.org/doc/libs/1_55_0/libs/spirit/repository/doc/html/spirit_repository/qi_components/directives/distinct.html
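The effect of the trailing negative lookahead can be tried out with a loose Python regex analogue. This is only an illustration: regular expressions backtrack where Qi does not, so the lookahead here rejects any identifier character rather than just identifier_start, and the pattern names UNGUARDED and GUARDED are hypothetical.

```python
import re

PART = r"[A-Za-z0-9_]+"
BODY = rf"[A-Za-z$_](?:{PART}(?:\.{PART})?)?"

UNGUARDED = re.compile(BODY)
# Refuse to stop in front of a character that could belong to an identifier.
GUARDED = re.compile(BODY + r"(?![A-Za-z0-9_$])")

m = UNGUARDED.match("foo$bar")
print(m.group())                 # foo
print(GUARDED.match("foo$bar"))  # None
print(GUARDED.match("$foobar").group())  # $foobar
```

Like the original rule, UNGUARDED reports success after consuming only foo, while GUARDED fails because the match would end directly before $.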
