Need help starting with Tatsu to parse grammar - tatsu

I am getting a Tatsu error
"tatsu.exceptions.FailedExpectingEndOfText: (1:1) Expecting end of text"
when running a test with a grammar I supplied, and it is not clear what the problem is.
In essence, the statement calling the parser is:
ast = parse(GRAMMAR, '(instance ?FIFI Dog)')
The whole python file follows:
GRAMMAR = """
@@grammar::SUOKIF
KIF = {KIFexpression}* $ ;
WHITESPACE = /\s+/ ;
StringLiteral = /['"'][A-Za-z]+['"']/ ;
NumericLiteral = /[0-9]+/ ;
Identifier = /[A-Za-z]+/ ;
LPAREN = "(" ;
RPAREN = ")" ;
QUESTION = "?" ;
MENTION = "#" ;
EQUALS = "=" ;
RARROW = ">" ;
LARROW = "<" ;
NOT = "not"|"NOT" ;
OR = "or"|"OR" ;
AND = "and"|"AND" ;
FORALL = "forall"|"FORALL" ;
EXISTS = "exists"|"EXISTS" ;
STRINGLITERAL = {StringLiteral} ;
NUMERICLITERAL = {NumericLiteral} ;
IDENTIFIER = {Identifier} ;
KIFexpression
= Word
| Variable
| String
| Number
| Sentence
;
Sentence = Equation
| RelSent
| LogicSent
| QuantSent
;
LogicSent
= Negation
| Disjunction
| Conjunction
| Implication
| Equivalence
;
QuantSent
= UniversalSent
| ExistentialSent
;
Word = IDENTIFIER ;
Variable = ( QUESTION | MENTION ) IDENTIFIER ;
String = STRINGLITERAL ;
Number = NUMERICLITERAL ;
ArgumentList
= {KIFexpression}*
;
VariableList
= {Variable}+
;
Equation = LPAREN EQUALS KIFexpression KIFexpression RPAREN ;
RelSent = LPAREN ( Variable | Word ) ArgumentList RPAREN ;
Negation = LPAREN NOT KIFexpression RPAREN ;
Disjunction
= LPAREN OR ArgumentList RPAREN
;
Conjunction
= LPAREN AND ArgumentList RPAREN
;
Implication
= LPAREN EQUALS RARROW KIFexpression KIFexpression RPAREN
;
Equivalence
= LPAREN LARROW EQUALS RARROW KIFexpression KIFexpression RPAREN
;
UniversalSent
= LPAREN FORALL LPAREN VariableList RPAREN KIFexpression RPAREN
;
ExistentialSent
= LPAREN EXISTS LPAREN VariableList RPAREN KIFexpression RPAREN
;
"""
if __name__ == '__main__':
    import pprint
    import json
    from tatsu import parse
    from tatsu.util import asjson
    ast = parse(GRAMMAR, '(instance ?FIFI Dog)')
    print('# PPRINT')
    pprint.pprint(ast, indent=2, width=20)
    print()
    print('# JSON')
    print(json.dumps(asjson(ast), indent=2))
    print()
Can anyone help me with a fix?
Thanks.
Colin Goldberg

I can see two problems with that grammar.
As the TatSu documentation notes, rule names that start with an upper-case character have a special meaning: such rules do not skip whitespace before they start parsing. Change all the rule names to lower case.
Also, let's review the IDENTIFIER rule:
IDENTIFIER = {Identifier} ;
This means that an identifier may be repeated any number of times, or may be missing altogether. Remove the closure by defining IDENTIFIER directly:
IDENTIFIER = /[A-Za-z]+/ ;
You can do the same for NUMERICLITERAL and STRINGLITERAL.
When I did those steps, the expression could be parsed.
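A rough illustration of why the closure matters, written with a few hand-made matcher functions rather than Tatsu itself. The helper names regex and closure are invented for this sketch, and the rule that a repetition stops on an empty match is an assumption about how a PEG closure behaves:

```python
import re


def regex(pattern):
    """Terminal rule: match a regular expression at the current position."""
    compiled = re.compile(pattern)

    def parse(text, pos):
        m = compiled.match(text, pos)
        return m.end() if m else None  # None signals failure

    return parse


def closure(rule):
    """Zero-or-more repetition: never fails, stops on failure or an empty match."""

    def parse(text, pos):
        while True:
            end = rule(text, pos)
            if end is None or end == pos:
                return pos
            pos = end

    return parse


identifier = regex(r"[A-Za-z]+")
loose_word = closure(identifier)   # mirrors: IDENTIFIER = {Identifier} ;
strict_word = identifier           # mirrors: IDENTIFIER = /[A-Za-z]+/ ;

text = "(instance ?FIFI Dog)"
print(loose_word(text, 0))   # 0: "succeeds" at '(' without consuming anything
print(strict_word(text, 0))  # None: fails, so other alternatives get a chance
```

If the rule for a word succeeds without consuming anything, the first alternative of KIFexpression succeeds at the opening parenthesis, the outer repetition makes no progress, and the end-of-text check then fails at position 1:1. That would be consistent with the reported error.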

You need to pass the name of the "start" symbol to parse().
Alternatively, you can define a start rule in the grammar:
start = KIF ;

Related

ANTLR 3 bug, mismatched input, but what's wrong?

I have the following problem:
My ANTLR 3 grammar compiles, but my simple test program doesn't work. The grammar is as follows:
grammar Rietse;
options {
k=1;
language=Java;
output=AST;
}
tokens {
COLON = ':' ;
SEMICOLON = ';' ;
OPAREN = '(' ;
CPAREN = ')' ;
COMMA = ',' ;
OCURLY = '{' ;
CCURLY = '}' ;
SINGLEQUOTE = '\'' ;
// operators
BECOMES = '=' ;
PLUS = '+' ;
MINUS = '-' ;
TIMES = '*' ;
DIVIDE = '/' ;
MODULO = '%' ;
EQUALS = '==' ;
LT = '<' ;
LTE = '<=' ;
GT = '>' ;
GTE = '>=' ;
UNEQUALS = '!=' ;
AND = '&&' ;
OR = '||' ;
NOT = '!' ;
// keywords
PROGRAM = 'program' ;
COMPOUND = 'compound' ;
UNARY = 'unary' ;
DECL = 'decl' ;
SDECL = 'sdecl' ;
STATIC = 'static' ;
PRINT = 'print' ;
READ = 'read' ;
IF = 'if' ;
THEN = 'then' ;
ELSE = 'else' ;
DO = 'do' ;
WHILE = 'while' ;
// types
INTEGER = 'int' ;
CHAR = 'char' ;
BOOLEAN = 'boolean' ;
TRUE = 'true' ;
FALSE = 'false' ;
}
@lexer::header {
package Eindopdracht;
}
@header {
package Eindopdracht;
}
// Parser rules
program
: program2 EOF
-> ^(PROGRAM program2)
;
program2
: (declaration* statement)+
;
declaration
: STATIC type IDENTIFIER SEMICOLON -> ^(SDECL type IDENTIFIER)
| type IDENTIFIER SEMICOLON -> ^(DECL type IDENTIFIER)
;
type
: INTEGER
| CHAR
| BOOLEAN
;
statement
: assignment_expr SEMICOLON!
| while_stat SEMICOLON!
| print_stat SEMICOLON!
| if_stat SEMICOLON!
| read_stat SEMICOLON!
;
while_stat
: WHILE^ OPAREN! or_expr CPAREN! OCURLY! statement+ CCURLY! // while (expression) {statement+}
;
print_stat
: PRINT^ OPAREN! or_expr (COMMA! or_expr)* CPAREN! // print(expression)
;
read_stat
: READ^ OPAREN! IDENTIFIER (COMMA! IDENTIFIER)+ CPAREN! // read(expression)
;
if_stat
: IF^ OPAREN! or_expr CPAREN! comp_expr (ELSE! comp_expr)? // if (expression) compound else compound
;
assignment_expr
: or_expr (BECOMES^ or_expr)*
;
or_expr
: and_expr (OR^ and_expr)*
;
and_expr
: compare_expr (AND^ compare_expr)*
;
compare_expr
: plusminus_expr ((LT|LTE|GT|GTE|EQUALS|UNEQUALS)^ plusminus_expr)?
;
plusminus_expr
: timesdivide_expr ((PLUS | MINUS)^ timesdivide_expr)*
;
timesdivide_expr
: unary_expr ((TIMES | DIVIDE | MODULO)^ unary_expr)*
;
unary_expr
: operand
| PLUS operand -> ^(UNARY PLUS operand)
| MINUS operand -> ^(UNARY MINUS operand)
| NOT operand -> ^(UNARY NOT operand)
;
operand
: TRUE
| FALSE
| charliteral
| IDENTIFIER
| NUMBER
| OPAREN! or_expr CPAREN!
;
comp_expr
: OCURLY program2 CCURLY -> ^(COMPOUND program2)
;
// Lexer rules
charliteral
: SINGLEQUOTE! LETTER SINGLEQUOTE!
;
IDENTIFIER
: LETTER (LETTER | DIGIT)*
;
NUMBER
: DIGIT+
;
COMMENT
: '//' .* '\n'
{ $channel=HIDDEN; }
;
WS
: (' ' | '\t' | '\f' | '\r' | '\n')+
{ $channel=HIDDEN; }
;
fragment DIGIT : ('0'..'9') ;
fragment LOWER : ('a'..'z') ;
fragment UPPER : ('A'..'Z') ;
fragment LETTER : LOWER | UPPER ;
// EOF
I then use the following Java file to test programs:
package Package;
import java.io.FileInputStream;
import java.io.InputStream;
import org.antlr.runtime.ANTLRInputStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.tree.BufferedTreeNodeStream;
import org.antlr.runtime.tree.CommonTree;
import org.antlr.runtime.tree.CommonTreeNodeStream;
import org.antlr.runtime.tree.DOTTreeGenerator;
import org.antlr.runtime.tree.TreeNodeStream;
import org.antlr.stringtemplate.StringTemplate;
public class Rietse {
    public static void main (String[] args)
    {
        String inputFile = args[0];
        try {
            InputStream in = inputFile == null ? System.in : new FileInputStream(inputFile);
            RietseLexer lexer = new RietseLexer(new ANTLRInputStream(in));
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            RietseParser parser = new RietseParser(tokens);
            RietseParser.program_return result = parser.program();
        } catch (RietseException e) {
            System.err.print("ERROR: RietseException thrown by compiler: ");
            System.err.println(e.getMessage());
        } catch (RecognitionException e) {
            System.err.print("ERROR: recognition exception thrown by compiler: ");
            System.err.println(e.getMessage());
            e.printStackTrace();
        } catch (Exception e) {
            System.err.print("ERROR: uncaught exception thrown by compiler: ");
            System.err.println(e.getMessage());
            e.printStackTrace();
        }
    }
}
And finally, the test program itself:
print('a');
Now when I run this, I get the following errors:
line 1:7 mismatched input 'a' expecting LETTER
line 1:9 mismatched input ')' expecting LETTER
I have no clue what causes this bug. I have tried several changes, but nothing fixed it. Does anyone here know what's wrong with my code and how I can fix it?
Every bit of help is greatly appreciated, thanks in advance.
Greetings,
Rien
Using a rule:
CHARLITERAL
: SINGLEQUOTE (LETTER | DIGIT) SINGLEQUOTE
;
and changing operand to:
operand
: TRUE
| FALSE
| CHARLITERAL
| IDENTIFIER
| NUMBER
| OPAREN! or_expr CPAREN!
;
will fix the problem. It does leave the single quotes in the text of the AST node, but that can be fixed, if desired, by changing the node's text with the setText(String) method.
Turn charliteral into a lexer rule (rename it to CHARLITERAL). Right now, the string 'a' is tokenized like this: SINGLEQUOTE IDENTIFIER SINGLEQUOTE, so you're getting an IDENTIFIER instead of a LETTER.
I wonder how this code can compile at all given that you're using a fragment (LETTER) from a parser rule.
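The tokenization described above can be imitated with a small longest-match tokenizer. This is only a simplified stand-in for an ANTLR lexer (the tokenize function and the two rule tables are made up for illustration, and only the tokens needed for the test program are listed), but it shows why the parser never sees a LETTER:

```python
import re


def tokenize(text, rules):
    """Longest-match tokenizer; when two rules match the same length, the earlier one wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}")
        if best_name != "WS":  # whitespace is dropped, like the HIDDEN channel
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


BASE_RULES = [
    ("PRINT", r"print"),
    ("OPAREN", r"\("),
    ("CPAREN", r"\)"),
    ("SEMICOLON", r";"),
    ("SINGLEQUOTE", r"'"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("NUMBER", r"[0-9]+"),
    ("WS", r"\s+"),
]

# Same table plus a CHARLITERAL lexer rule, as the answers suggest.
RULES_WITH_CHARLITERAL = [("CHARLITERAL", r"'[A-Za-z0-9]'")] + BASE_RULES

source = "print('a');"
print([name for name, _ in tokenize(source, BASE_RULES)])
# ['PRINT', 'OPAREN', 'SINGLEQUOTE', 'IDENTIFIER', 'SINGLEQUOTE', 'CPAREN', 'SEMICOLON']
print([name for name, _ in tokenize(source, RULES_WITH_CHARLITERAL)])
# ['PRINT', 'OPAREN', 'CHARLITERAL', 'CPAREN', 'SEMICOLON']
```

Without the CHARLITERAL rule the letter between the quotes is already an IDENTIFIER by the time the parser runs, so a parser rule that asks for LETTER cannot match it.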

Changing associativity schema in a grammar

I'm trying to use SableCC to generate a parser for models that I call LAMs. LAMs themselves are simple, and a simplified grammar for them (omitting a lot of detail) is:
L := 0 | (x,y) | F(x1,...,xn) | L || L | L ; L
I wrote this grammar:
Helpers
number = ['0' .. '9'] ;
letter = ['a' .. 'z'] ;
uletter = ['A' .. 'Z'] ;
Tokens
zero = '0' ;
comma = ',' ;
parallel = '||' ;
point = ';' ;
lpar = '(' ;
rpar = ')' ;
identifier = letter+ number* ;
uidentifier = uletter+ number* ;
Productions
expr = {term} term |
{parallel} expr parallel term |
{point} expr point term;
term = {parenthesis} lpar expr rpar |
{zero} zero |
{invk} uidentifier lpar paramlist rpar |
{pair} lpar [left]:identifier comma [right]:identifier rpar ;
paramlist = {list} list |
{empty} ;
list = {var} identifier |
{com} identifier comma list ;
This basically works, but there is a side effect: it is left associative. For example, if I have
L = L1 || L2 ; L3 || L4
Then it is parsed like:
L = ((L1 || L2) ; L3) || L4
I want to give all precedence to the ";" operator, and so have L parsed like
L = (L1 || L2) ; (L3 || L4)
(other things, like "||", can remain left-associative)
My questions are:
Are there tips for doing such conversions in an "automated" way?
What would a grammar that gives all the precedence to ";" look like?
An "RTFM" link is also an acceptable answer :-D
Thank you all
You need to create a hierarchy of rules that matches the desired operator precedence.
expr = {subexp} subexp |
{parallel} subexp parallel expr ;
subexp = {term} term |
{point} term point subexp;
Note that I also changed the associativity: both operators are now right-associative.
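To see what a two-level rule hierarchy does to the example from the question, here is a small recursive-descent sketch in Python. It is not SableCC output: the functions tokenize and parse and the parameters outer and inner are made up for illustration, and plain identifiers such as L1 stand in for full terms. The operator handled at the inner level binds tighter:

```python
import re


def tokenize(text):
    """Split the input into '||', ';' and identifier tokens."""
    return re.findall(r"\|\||;|[A-Za-z]\w*", text)


def parse(tokens, outer, inner):
    """Right-recursive two-level grammar:
         expr   = subexp | subexp OUTER expr
         subexp = term   | term INNER subexp
    Returns a fully parenthesised string showing the grouping.
    """
    pos = 0

    def term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def subexp():
        nonlocal pos
        left = term()
        if pos < len(tokens) and tokens[pos] == inner:
            pos += 1
            return f"({left} {inner} {subexp()})"
        return left

    def expr():
        nonlocal pos
        left = subexp()
        if pos < len(tokens) and tokens[pos] == outer:
            pos += 1
            return f"({left} {outer} {expr()})"
        return left

    return expr()


tokens = tokenize("L1 || L2 ; L3 || L4")

# ';' at the inner level, as in the grammar above: ';' binds tighter.
print(parse(tokens, outer="||", inner=";"))
# (L1 || ((L2 ; L3) || L4))

# ';' at the outer level: ';' binds loosest.
print(parse(tokens, outer=";", inner="||"))
# ((L1 || L2) ; (L3 || L4))
```

If the grouping (L1 || L2) ; (L3 || L4) from the question is the goal, this suggests putting point in the outer rule and parallel in the inner one, that is, swapping the two operators in the grammar above.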

Is it possible to create a very permissive grammar using Menhir?

I'm trying to parse some bits and pieces of Verilog - I'm primarily interested in extracting module definitions and instantiations.
In Verilog a module is defined like:
module foo ( ... ) endmodule;
And a module is instantiated in one of two different possible ways:
foo fooinst ( ... );
foo #( ...list of params... ) fooinst ( .... );
At this point I'm only interested in finding the name of the defined or instantiated module; 'foo' in both cases above.
Given this menhir grammar (verParser.mly):
%{
type expr = Module of expr
| ModInst of expr
| Ident of string
| Int of int
| Lparen
| Rparen
| Junk
| ExprList of expr list
%}
%token <string> INT
%token <string> IDENT
%token LPAREN RPAREN MODULE TICK OTHER HASH EOF
%start expr2
%type <expr> mod_expr
%type <expr> expr1
%type <expr list> expr2
%%
mod_expr:
| MODULE IDENT LPAREN { Module ( Ident $2) }
| IDENT IDENT LPAREN { ModInst ( Ident $1) }
| IDENT HASH LPAREN { ModInst ( Ident $1) };
junk:
| LPAREN { }
| RPAREN { }
| HASH { }
| INT { };
expr1:
| junk* mod_expr junk* { $2 } ;
expr2:
| expr1* EOF { $1 };
When I try this out in the Menhir interpreter, it works fine extracting the module definition:
MODULE IDENT LPAREN
ACCEPT
[expr2:
[list(expr1):
[expr1:
[list(junk):]
[mod_expr: MODULE IDENT LPAREN]
[list(junk):]
]
[list(expr1):]
]
EOF
]
It works fine for the single module instantiation:
IDENT IDENT LPAREN
ACCEPT
[expr2:
[list(expr1):
[expr1:
[list(junk):]
[mod_expr: IDENT IDENT LPAREN]
[list(junk):]
]
[list(expr1):]
]
EOF
]
But of course, if there is an IDENT that appears prior to any of these it will REJECT:
IDENT MODULE IDENT LPAREN IDENT IDENT LPAREN
REJECT
... and of course there will be identifiers in an actual Verilog file prior to these defs.
I'm trying not to have to fully specify a Verilog grammar, instead I want to build the grammar up slowly and incrementally to eventually parse more and more of the language.
If I add IDENT to the junk rule, that fixes the problem above, but then the module instantiation rule doesn't work because now the junk rule is capturing the IDENT.
Is it possible to create a very permissive rule that will bypass stuff I don't want to match, or is it generally required that you must create a complete grammar to actually do something like this?
Is it possible to create a rule that would let me match:
MODULE IDENT LPAREN stuff* RPAREN ENDMODULE
where "stuff*" initially matches everything but RPAREN?
Something like:
stuff:
| !RPAREN { } ;
I've used PEG parsers in the past which would allow constructs like that.
I've decided that PEG is a better fit for a permissive, non-exhaustive grammar. Took a look at peg/leg and was able to very quickly put together a leg grammar that does what I need to do:
start = ( comment | mod_match | char)
line = < (( '\n' '\r'* ) | ( '\r' '\n'* )) > { lines++; chars += yyleng; }
module_decl = module modnm:ident lparen ( !rparen . )* rparen { chars += yyleng; printf("Module decl: <%s>\n",yytext);}
module_inst = modinstname:ident ident lparen { chars += yyleng; printf("Module Inst: <%s>\n",yytext);}
|modinstname:ident hash lparen { chars += yyleng; printf("Module Inst: <%s>\n",yytext);}
mod_match = ( module_decl | module_inst )
module = 'module' ws { modules++; chars +=yyleng; printf("Module: <%s>\n", yytext); }
endmodule = 'endmodule' ws { endmodules++; chars +=yyleng; printf("EndModule: <%s>\n", yytext); }
kwd = (module|endmodule)
ident = !kwd<[a-zA-Z][a-zA-Z0-9_]+>- { words++; chars += yyleng; printf("Ident: <%s>\n", yytext); }
char = . { chars++; }
lparen = '(' -
rparen = ')' -
hash = '#'
- = ( space | comment )*
ws = space+
space = ' ' | '\t' | EOL
comment = '//' ( !EOL .)* EOL
| '/*' ( !'*/' .)* '*/'
EOF = !.
EOL = '\r\n' | '\n' | '\r'
Aurochs is possibly also an option, but I have concerns about speed and memory usage of an Aurochs generated parser. peg/leg produce a parser in C which should be quite speedy.
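For comparison, the same "match what I care about, skip everything else" idea can be roughly approximated in Python with regular expressions. This is only a sketch under stated assumptions, not a Verilog parser and not equivalent to the leg grammar: the names COMMENT, IDENT, MODULE_DECL, MODULE_INST, find_modules and SAMPLE are invented here, and the sample source is made up.

```python
import re

# Comments are removed first so that text inside them is never matched.
COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)

# An identifier that is not one of the two keywords (like "!kwd" in the leg grammar).
IDENT = r"(?!(?:module|endmodule)\b)[A-Za-z_][A-Za-z0-9_]*"

# module <name> (
MODULE_DECL = re.compile(r"\bmodule\s+(" + IDENT + r")\s*\(")

# <name> <instance> (     or     <name> #(
MODULE_INST = re.compile(r"\b(" + IDENT + r")(?:\s+" + IDENT + r"\s*\(|\s*#\s*\()")


def find_modules(source):
    """Return (declared module names, instantiated module names)."""
    text = COMMENT.sub(" ", source)
    return MODULE_DECL.findall(text), MODULE_INST.findall(text)


SAMPLE = """
// top-level design
module top ( input clk );
  counter #( .WIDTH(8) ) u_cnt ( .clk(clk) );
  adder u_add ( .a(x), .b(y) );
endmodule
"""

declared, instantiated = find_modules(SAMPLE)
print(declared)       # ['top']
print(instantiated)   # ['counter', 'adder']
```

Anything that does not fit one of the two patterns is simply ignored, which is the permissive behaviour asked about. The price is that constructs resembling "identifier identifier (" elsewhere in a real design would be reported as instantiations.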

Error generating files in ANTLR

I'm trying to write a parser in ANTLR. This is my first time using it, and I'm running into a problem that I can't find a solution for; apologies if this is a very simple problem. The error I'm getting is:
"(100): Expr.g:1:13:syntax error: antlr: MismatchedTokenException(74!=52)"
The code I'm currently using is:
grammar Expr.g;
options{
output=AST;
}
tokens{
MAIN = 'main';
OPENBRACKET = '(';
CLOSEBRACKET = ')';
OPENCURLYBRACKET = '{';
CLOSECURLYBRACKET = '}';
COMMA = ',';
SEMICOLON = ';';
GREATERTHAN = '>';
LESSTHAN = '<';
GREATEROREQUALTHAN = '>=';
LESSTHANOREQUALTHAN = '<=';
NOTEQUAL = '!=';
ISEQUALTO = '==';
WHILE = 'while';
IF = 'if';
ELSE = 'else';
READ = 'read';
OUTPUT = 'output';
PRINT = 'print';
RETURN = 'return';
READC = 'readc';
OUTPUTC = 'outputc';
PLUS = '+';
MINUS = '-';
DIVIDE = '/';
MULTIPLY = '*';
PERCENTAGE = '%';
}
@header {
//package test;
import java.util.HashMap;
}
@lexer::header {
//package test;
}
@members {
/** Map variable name to Integer object holding value */
HashMap memory = new HashMap();
}
prog: stat+ ;
stat: expr NEWLINE {System.out.println($expr.value);}
| ID '=' expr NEWLINE
{memory.put($ID.text, new Integer($expr.value));}
| NEWLINE
;
expr returns [int value]
: e=multExpr {$value = $e.value;}
( '+' e=multExpr {$value += $e.value;}
| '-' e=multExpr {$value -= $e.value;}
)*
;
multExpr returns [int value]
: e=atom {$value = $e.value;} ('*' e=atom {$value *= $e.value;})*
;
atom returns [int value]
: INT {$value = Integer.parseInt($INT.text);}
| ID
{
Integer v = (Integer)memory.get($ID.text);
if ( v!=null ) $value = v.intValue();
else System.err.println("undefined variable "+$ID.text);
}
| '(' e=expr ')' {$value = $e.value;}
;
IDENT : ('a'..'z'^|'A'..'Z'^)+ ; : .;
INT : '0'..'9'+ ;
NEWLINE:'\r'? '\n' ;
WS : (' '|'\t')+ {skip();} ;
Thanks for any help.
EDIT: Well, I'm an idiot, it's just a formatting error. Thanks for the responses from those who helped out.
You have some illegal characters after your IDENT token:
IDENT : ('a'..'z'^|'A'..'Z'^)+ ; : .;
The : .; are invalid there. And you're also trying to mix the tree-rewrite operator ^ inside a lexer rule, which is illegal: remove them. Lastly, you've named it IDENT while in your parser rules, you're using ID.
It should be:
ID : ('a'..'z' | 'A'..'Z')+ ;

Assignment as expression in Antlr grammar

I'm trying to extend the grammar of the Tiny Language to treat assignment as expression. Thus it would be valid to write
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
a = 1 = 2; // invalid
Assignment differs from other operators in two respects. It's right-associative (not a big deal), and its left-hand side has to be a variable. So I changed the grammar like this:
statement: assignmentExpr | functionCall ...;
assignmentExpr: Identifier indexes? '=' expression;
expression: assignmentExpr | condExpr;
It doesn't work, because it contains a non-LL(*) decision. I also tried this variant:
assignmentExpr: Identifier indexes? '=' (expression | condExpr);
but I got the same error. I am interested in:
1. This specific question
2. Given a grammar with a non-LL(*) decision, how to find the two paths that cause the problem
3. How to fix it
I think you can change your grammar like this to achieve the same, without using syntactic predicates:
statement: Expr ';' | functionCall ';'...;
Expr: Identifier indexes? '=' Expr | condExpr ;
condExpr: .... and so on;
I altered Bart's example with this idea in mind:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: stat+ EOF -> ^(ROOT stat+)
;
stat
: expr ';'
;
expr
: Id Assign expr -> ^(Assign Id expr)
| add
;
add
: mult (('+' | '-')^ mult)*
;
mult
: atom (('*' | '/')^ atom)*
;
atom
: Id
| Num
| '('! expr ')' !
;
Assign : '=' ;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
And for the input:
a=b=4;
a = 2 * (b = 1);
you get the following parse tree:
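The shape of that tree can be reproduced with a small hand-written recursive-descent parser in Python. This is a sketch of the same rule layout (expr : Id '=' expr | add), not ANTLR output: the Parser class, tokenize and parse_statement are made-up names, and trees are shown as nested tuples of the form (operator, left, right).

```python
import re


def tokenize(text):
    text = re.sub(r"//[^\n]*", "", text)  # drop line comments
    return re.findall(r"[a-z]+|[0-9]+|[=+\-*/();]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self, offset=0):
        i = self.pos + offset
        return self.tokens[i] if i < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):
        # expr : Id '=' expr | add   (two tokens of lookahead pick the branch)
        first = self.peek()
        if first is not None and first.isalpha() and self.peek(1) == "=":
            name = self.take()
            self.take("=")
            return ("=", name, self.expr())  # recursion makes '=' right-associative
        return self.add()

    def add(self):
        node = self.mult()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.mult())
        return node

    def mult(self):
        node = self.atom()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.atom())
        return node

    def atom(self):
        tok = self.peek()
        if tok == "(":
            self.take("(")
            node = self.expr()
            self.take(")")
            return node
        if tok is not None and (tok.isalpha() or tok.isdigit()):
            return self.take()
        raise SyntaxError(f"unexpected token {tok!r}")


def parse_statement(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    parser.take(";")
    return tree


print(parse_statement("a=b=4;"))            # ('=', 'a', ('=', 'b', '4'))
print(parse_statement("a = 2 * (b = 1);"))  # ('=', 'a', ('*', '2', ('=', 'b', '1')))
```

With this layout, a = 1 = 2; is rejected: after a = the parser reads 1 through the add branch and then expects ';' but finds a second '='.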
The key here is that you need to "assure" the parser that inside an expression, there is something ahead that satisfies the expression. This can be done using a syntactic predicate (the ( ... )=> parts in the add and mult rules).
A quick demo:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
ASSIGN;
}
parse
: stat* EOF -> ^(ROOT stat*)
;
stat
: expr ';' -> expr
;
expr
: add
;
add
: mult ((('+' | '-') mult)=> ('+' | '-')^ mult)*
;
mult
: atom ((('*' | '/') atom)=> ('*' | '/')^ atom)*
;
atom
: (Id -> Id) ('=' expr -> ^(ASSIGN Id expr))?
| Num
| '(' expr ')' -> expr
;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse the input:
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
into the following AST:
