No viable input ANTLR4 - parsing

I'm currently having some parsing issues with my grammar and I just can't figure out what is going wrong.
Here's my grammar:
grammar Demo;
@header {
import java.util.List;
import java.util.ArrayList;
}
program:
functionList #programFunction
;
functionList:
function*
;
function:
'haupt()' '{' stmntList '}' #hauptFunction
| type='zahl' ID '(' paramList ')' '{' stmntList '}' #zahlFunction
| type='Zeichenkette' ID '(' paramList ')' '{' stmntList '}' #zeichenketteFunction
| type='nix' ID '(' paramList ')' '{' stmntList '}' #nixFunction
;
paramList:
param (',' paramList)?
;
param:
'zahl' ID
| 'Zeichenkette' ID
|
;
variableList:
ID (',' variableList)?
;
stmntList:
stmnt (stmntList)?
;
stmnt:
'zahl' varName=ID ';' #zahlStmnt
| 'Zeichenkette' varName=ID ';' #zeichenketteStmnt
| varName=ID '=' varValue=expr ';' #varAssignment
| 'Schreibe' '(' argument=expr ')' ';' #schreibeImmediat
| 'Schreibe''(' argument=ID ')' ';' #schreibeText
| 'zuZeichenkette' '(' ID ')'';' #convertString
| 'zuZahl''('ID')'';' #convertInteger
| 'wenn' '(' boolExpr ')' '{' stmntList '}' ('sonst' '{' stmntList '}')? #wennsonstStmnt
| 'fuer' '(' ID '=' expr ',' boolExpr ',' stmnt ')' '{' stmntList '}' #forLoop
| 'waehrend' '(' boolExpr ')' '{' stmntList '}' #whileLoop
| 'tu' '{' stmntList '}' 'waehrend' '(' boolExpr ')' ';' #doWhile
| 'return' expr ';' #returnVar
| fctName=ID '(' (variableList)? ')'';' #functionCall
;
boolExpr:
boolParts ('&&' boolExpr)? #logicAnd
| boolParts ('||' boolExpr)? #logicOr
;
boolParts:
expr '==' expr #isEqual
| expr '!=' expr #isUnequal
| expr '>' expr #biggerThan
| expr '<' expr #smallerThan
| expr '>=' expr #biggerEqual
| expr '<=' expr #smallerEqual
;
expr:
links=expr '+' rechts=product #addi
| links = expr '-' rechts=product #diff
|product #prod
;
product:
links=product '*' rechts=factor #mult
| links=product '/' rechts=factor #teil
| factor #fact
;
factor:
'(' expr')' #bracket
| var=ID #var
| zahl=NUMBER #numb
;
ID : [a-zA-Z]+;
NUMBER : '0'|[1-9][0-9]*;
WS: [\r\n\t ]+ -> skip ;
And this is the code I am trying to parse:
haupt() {
zahl zz;
zz = 2;
zahl cc;
cc = 3;
zz = zz+cc;
Schreibe(cc+cc+cc);
}
The problems already arise in the first line: the parser tells me that it expects a '{' instead of ' '. I cannot understand this, since my grammar skips all whitespace. The next errors come from the second line: the variable declaration "zahl zz;" is not recognized, although the first alternative of stmnt should match it.
Here are the errors ANTLR's TestRig gives me:
line 2:6 no viable alternative at input 'zahlzz'
line 4:6 no viable alternative at input 'zahlcc'
line 9:12 mismatched input '+' expecting ')'
Thanks for your help!
Tim

When I see a weird, nonsensical bugaboo like this, it generally means that the lexer and the parser disagree about the token types. Make sure that you regenerate all of your grammars using ANTLR and recompile everything. Hopefully that will clear it up.

Related

Parsing nested 'if/else' statements [duplicate]

To solve the dangling else problem, I used the following solution:
stmt : stmt_matched
| stmt_unmatched
;
stmt_unmatched : IF '(' exp ')' stmt
| IF '(' exp ')' stmt_matched ELSE stmt_unmatched
;
stmt_matched : IF '(' exp ')' stmt_matched ELSE stmt_matched
| stmt_for
| ...
;
When I define the grammar rule for the for loop, I get a shift/reduce conflict caused by the same problem:
stmt_for : FOR '(' exp ';' exp ';' exp ')' stmt
;
How can I solve this problem?
Not all for statements are matched. Consider, for example
if (c) for (;;) if (d) ; else ;
So it is necessary to divide for statements into for_matched and for_unmatched. (And similarly with other compound statements such as while.)

ANTLR4 mismatched input '' expecting

Currently, I've just defined simple rules in ANTLR4:
// Recognizer Rules
program : (class_dcl)+ EOF;
class_dcl: 'class' ID ('extends' ID)? '{' class_body '}';
class_body: (const_dcl|var_dcl|method_dcl)*;
const_dcl: ('static')? 'final' PRIMITIVE_TYPE ID '=' expr ';';
var_dcl: ('static')? id_list ':' type ';';
method_dcl: PRIMITIVE_TYPE ('static')? ID '(' para_list ')' block_stm;
para_list: (para_dcl (';' para_dcl)*)?;
para_dcl: id_list ':' PRIMITIVE_TYPE;
block_stm: '{' '}';
expr: <assoc=right> expr '=' expr | expr1;
expr1: term ('<' | '>' | '<=' | '>=' | '==' | '!=') term | term;
term: ('+'|'-') term | term ('*'|'/') term | term ('+'|'-') term | fact;
fact: INTLIT | FLOATLIT | BOOLLIT | ID | '(' expr ')';
type: PRIMITIVE_TYPE ('[' INTLIT ']')?;
id_list: ID (',' ID)*;
// Lexer Rules
KEYWORD: PRIMITIVE_TYPE | BOOLLIT | 'class' | 'extends' | 'if' | 'then' | 'else'
| 'null' | 'break' | 'continue' | 'while' | 'return' | 'self' | 'final'
| 'static' | 'new' | 'do';
SEPARATOR: '[' | ']' | '{' | '}' | '(' | ')' | ';' | ':' | '.' | ',';
OPERATOR: '^' | 'new' | '=' | UNA_OPERATOR | BIN_OPERATOR;
UNA_OPERATOR: '!';
BIN_OPERATOR: '+' | '-' | '*' | '\\' | '/' | '%' | '>' | '>=' | '<' | '<='
| '==' | '<>' | '&&' | '||' | ':=';
PRIMITIVE_TYPE: 'integer' | 'float' | 'bool' | 'string' | 'void';
BOOLLIT: 'true' | 'false';
FLOATLIT: [0-9]+ ((('.'[0-9]* (('E'|'e')('+'|'-')?[0-9]+)? ))|(('E'|'e')('+'|'-')? [0-9]+));
INTLIT: [0-9]+;
STRINGLIT: '"' ('\\'[bfrnt\\"]|~[\r\t\n\\"])* '"';
ILLEGAL_ESC: '"' (('\\'[bfrnt\\"]|~[\n\\"]))* ('\\'(~[bfrnt\\"]))
{if (true) throw new bkool.parser.IllegalEscape(getText());};
UNCLOSED_STRING: '"'('\\'[bfrnt\\"]|~[\r\t\n\\"])*
{if (true) throw new bkool.parser.UncloseString(getText());};
COMMENT: (BLOCK_COMMENT|LINE_COMMENT) -> skip;
BLOCK_COMMENT: '(''*'(('*')?(~')'))*'*'')';
LINE_COMMENT: '#' (~[\n])* ('\n'|EOF);
ID: [a-zA-z_]+ [a-zA-z_0-9]* ;
WS: [ \t\r\n]+ -> skip ;
ERROR_TOKEN: . {if (true) throw new bkool.parser.ErrorToken(getText());};
I opened the parse tree, and tried to test:
class abc
{
final integer x=1;
}
It returned errors:
BKOOL::program:3:8: mismatched input 'integer' expecting PRIMITIVE_TYPE
BKOOL::program:3:17: mismatched input '=' expecting {':', ','}
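I still don't understand why. Could you please help me see why it didn't recognize the rules and tokens as I expected?
Lexer rules are mutually exclusive: each piece of input gets exactly one token type. The longest match wins, and ties are broken by the order of the rules in the grammar.
In your case, integer is lexed as a KEYWORD instead of a PRIMITIVE_TYPE, because both rules match the same text and KEYWORD is defined first.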
What you should do here:
Make one distinct token per keyword instead of an all-catching KEYWORD rule.
Turn PRIMITIVE_TYPE into a parser rule
Same for operators
Right now, your example:
class abc
{
final integer x=1;
}
gets converted to tokens such as:
class ID { final KEYWORD ID = INTLIT ; }
This is thanks to the implicit token typing, as you've used definitions such as 'class' in your parser rules. These get converted to anonymous tokens such as T_001 : 'class'; which get the highest priority.
If this weren't the case, you'd end up with:
KEYWORD ID SEPARATOR KEYWORD KEYWORD ID OPERATOR INTLIT ; SEPARATOR
And that's... not quite easy to parse ;-)
That's why I'm telling you to break down your tokens properly.
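As a rough illustration of the longest-match rule and the grammar-order tiebreak, here is a minimal Python sketch. It is not how ANTLR is implemented: the rule list is a hypothetical, cut-down version of the lexer rules in the question, and it ignores the implicit tokens ANTLR creates for literals used in parser rules.

```python
import re

# Hypothetical, simplified lexer rules, listed in grammar order.
RULES = [
    ("KEYWORD", r"integer|float|true|false"),
    ("PRIMITIVE_TYPE", r"integer|float"),
    ("ID", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("WS", r"\s+"),
]

def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when the lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # whitespace is skipped, as in the grammar
            out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out

print(tokenize("integer x", RULES))
# [('KEYWORD', 'integer'), ('ID', 'x')]
```

In this sketch PRIMITIVE_TYPE can never be produced, because KEYWORD matches the same text and comes first. Moving PRIMITIVE_TYPE above KEYWORD, or dropping the catch-all KEYWORD rule, makes 'integer' come out as PRIMITIVE_TYPE.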

Context-Free-Grammar for assignment statements in ANTLR

I'm writing an ANTLR lexer/parser for a context-free grammar.
This is what I have now:
statement
: assignment_statement
;
assignment_statement
: IDENTIFIER '=' expression ';'
;
term
: IDENT
| '(' expression ')'
| INTEGER
| STRING_LITERAL
| CHAR_LITERAL
| IDENT '(' actualParameters ')'
;
negation
: 'not'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/' | 'mod') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
So my assignment statement is identified by the form
IDENTIFIER = expression;
However, an assignment statement should also cover the case where the right-hand side is a function call (assigning its return value). For example,
items = getItems();
What grammar rule should I add for this? I thought of adding a function call to the "expression" rule, but I wasn't sure whether a function call should be regarded as an expression.
Thanks
This grammar looks fine to me. I am assuming that IDENT and IDENTIFIER are the same and that you have additional productions for the remaining terminals.
This production seems to define a function call.
| IDENT '(' actualParameters ')'
You need a production for the actual parameters, something like this.
actualParameters : ( expression ( ',' expression )* )? ;

Context Free Grammar in ANTLR throwing error for if-statement

I wrote a grammar in ANTLR for a Java-like if statement as follows:
if_statement
: 'if' expression
(statement | '{' statement+ '}')
('elif' expression (statement | '{' statement+ '}'))*
('else' (statement | '{' statement+ '}'))?
;
I've implemented the "statement" and "expression" correctly, but the if_statement is giving me the following error:
Decision can match input such as "'elif'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
|---> ('elif' expression (statement | '{' statement+ '}'))*
warning(200): /OptDB/src/OptDB/XL.g:38:9:
Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
|---> ('else' (statement | '{' statement+ '}'))?
It seems like there are problems with the "elif" and "else" block.
Basically, we can have 0 or more "elif" blocks, so I wrapped them with *
Also, we can have 0 or 1 "else" block, so I wrapped it with ?.
What seems to cause the error?
========================================================================
I'll also put the implementations of "expression" and "statements":
statement
: assignment_statement
| if_statement
| while_statement
| for_statement
| function_call_statement
;
term
: IDENTIFIER
| '(' expression ')'
| INTEGER
| STRING_LITERAL
| CHAR_LITERAL
| IDENTIFIER '(' actualParameters ')'
;
negation
: 'not'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/' | 'mod') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
actualParameters
: expression (',' expression)*
;
Because your grammar allows a statement block that is not wrapped in {...}, you've got yourself a classic dangling else ambiguity.
Short explanation. The input:
if expr1 if expr2 ... else ...
could be parsed as:
Parse 1
if expr1
  if expr2
    ...
  else
    ...
but also as this:
Parse 2
if expr1
  if expr2
    ...
else
  ...
To eliminate the ambiguity, either change:
(statement | '{' statement+ '}')
into:
'{' statement+ '}'
// or
'{' statement* '}'
so that the braces make it clear which if the else belongs to, or add a predicate to force the parser to choose Parse 1:
if_statement
: 'if' expression statement_block
(('elif')=> 'elif' expression statement_block)*
(('else')=> 'else' statement_block)?
;
statement_block
: '{' statement* '}'
| statement
;
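To see what "forcing Parse 1" amounts to, here is a minimal Python sketch of a greedy recursive-descent parser. It is only an illustration under simplifying assumptions: the token list, the tuple-shaped tree and the function name parse_stmt are hypothetical, conditions and plain statements are single tokens, and 'elif' is left out.

```python
def parse_stmt(tokens, pos=0):
    """Parse stmt := 'if' COND stmt ('else' stmt)? | NAME, greedily.

    Returns (tree, next_position).
    """
    if tokens[pos] == "if":
        cond = tokens[pos + 1]
        then_branch, pos = parse_stmt(tokens, pos + 2)
        else_branch = None
        # Greedy choice: an 'else' is consumed by the innermost open 'if'.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    # Anything else is treated as a one-token statement.
    return tokens[pos], pos + 1

tree, _ = parse_stmt("if e1 if e2 s1 else s2".split())
print(tree)
# ('if', 'e1', ('if', 'e2', 's1', 's2'), None)
```

The else ends up attached to the inner if and the outer if has no else branch, which is the nearest-if reading that the predicate version of the grammar selects.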

ANTLR ambiguity '-'

I have a grammar and everything works fine until this portion:
lexp
: factor ( ('+' | '-') factor)*
;
factor :('-')? IDENT;
This of course introduces an ambiguity. For example, a-a can be matched either as factor - factor or as two factors, the second one being - IDENT.
I get the following warning stating this:
[18:49:39] warning(200): withoutWarningButIncomplete.g:57:31:
Decision can match input such as "'-' {IDENT, '-'}" using multiple alternatives: 1, 2
How can I resolve this ambiguity? I just don't see a way around it. Is there some kind of option that I can use?
Here is the full grammar:
program
: includes decls (procedure)*
;
/* Check if correct! */
includes
: ('#include' STRING)*
;
decls
: (typedident ';')*
;
typedident
: ('int' | 'char') IDENT
;
procedure
: ('int' | 'char') IDENT '(' args ')' body
;
args
: typedident (',' typedident )* /* Check if correct! */
| /* epsilon */
;
body
: '{' decls stmtlist '}'
;
stmtlist
: (stmt)*;
stmt
: '{' stmtlist '}'
| 'read' '(' IDENT ')' ';'
| 'output' '(' IDENT ')' ';'
| 'print' '(' STRING ')' ';'
| 'return' (lexp)* ';'
| 'readc' '(' IDENT ')' ';'
| 'outputc' '(' IDENT ')' ';'
| IDENT '(' (IDENT ( ',' IDENT )*)? ')' ';'
| IDENT '=' lexp ';';
lexp
: term (( '+' | '-' ) term) * /*Add in | '-' to reveal the warning! !*/
;
term
: factor (('*' | '/' | '%') factor )*
;
factor : '(' lexp ')'
| ('-')? IDENT
| NUMBER;
fragment DIGIT
: ('0' .. '9')
;
IDENT : ('A' .. 'Z' | 'a' .. 'z') (( 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_'))* ;
NUMBER
: ( ('-')? DIGIT+)
;
CHARACTER
: '\'' ('a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '\\n' | '\\t' | '\\\\' | '\\' | 'EOF' |'.' | ',' |':' ) '\'' /* IS THIS COMPLETE? */
;
As mentioned in the comments: these rules are not ambiguous:
lexp
: factor (('+' | '-') factor)*
;
factor : ('-')? IDENT;
This is the cause of the ambiguity:
'return' (lexp)* ';'
which can parse the input a-b in two different ways:
a-b as a single binary expression
a as a single expression, and -b as a unary expression
You will need to change your grammar. Perhaps separate multiple return values with a comma? Something like this:
'return' (lexp (',' lexp)*)? ';'
which will match:
return;
return a;
return a, -b;
return a-b, c+d+e, f;
...
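The two readings of a-b under 'return' (lexp)* ';' can be checked mechanically. The following Python sketch counts the ways a token list can be split into a sequence of lexp, using the simplified rules lexp := factor (('+' | '-') factor)* and factor := '-'? IDENT. The function name count_parses and the pre-split token lists are assumptions made for this illustration only.

```python
from functools import lru_cache

def count_parses(tokens):
    """Count the ways to read tokens as (lexp)*."""
    toks = tuple(tokens)

    def factor_ends(i):
        # Position just past a factor ('-'? IDENT) starting at i, if any.
        if i < len(toks) and toks[i] == "-":
            i += 1
        if i < len(toks) and toks[i].isalpha():
            return [i + 1]
        return []

    def lexp_ends(i):
        # Every position at which one lexp starting at i may stop.
        ends, frontier = [], factor_ends(i)
        while frontier:
            ends.extend(frontier)
            nxt = []
            for j in frontier:
                if j < len(toks) and toks[j] in ("+", "-"):
                    nxt.extend(factor_ends(j + 1))
            frontier = nxt
        return ends

    @lru_cache(maxsize=None)
    def seqs(i):
        # Number of ways to cover toks[i:] with zero or more lexp.
        if i == len(toks):
            return 1
        return sum(seqs(j) for j in lexp_ends(i))

    return seqs(0)

print(count_parses(["a", "-", "b"]))  # 2
print(count_parses(["a", "+", "b"]))  # 1
```

Only the '-' case has two readings, because '-' can both continue an lexp and start a new factor. Requiring a comma between the expressions removes the second reading.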

Resources