Currently, I've just defined simple rules in ANTLR4:
// Recognizer Rules
program : (class_dcl)+ EOF;
class_dcl: 'class' ID ('extends' ID)? '{' class_body '}';
class_body: (const_dcl|var_dcl|method_dcl)*;
const_dcl: ('static')? 'final' PRIMITIVE_TYPE ID '=' expr ';';
var_dcl: ('static')? id_list ':' type ';';
method_dcl: PRIMITIVE_TYPE ('static')? ID '(' para_list ')' block_stm;
para_list: (para_dcl (';' para_dcl)*)?;
para_dcl: id_list ':' PRIMITIVE_TYPE;
block_stm: '{' '}';
expr: <assoc=right> expr '=' expr | expr1;
expr1: term ('<' | '>' | '<=' | '>=' | '==' | '!=') term | term;
term: ('+'|'-') term | term ('*'|'/') term | term ('+'|'-') term | fact;
fact: INTLIT | FLOATLIT | BOOLLIT | ID | '(' expr ')';
type: PRIMITIVE_TYPE ('[' INTLIT ']')?;
id_list: ID (',' ID)*;
// Lexer Rules
KEYWORD: PRIMITIVE_TYPE | BOOLLIT | 'class' | 'extends' | 'if' | 'then' | 'else'
| 'null' | 'break' | 'continue' | 'while' | 'return' | 'self' | 'final'
| 'static' | 'new' | 'do';
SEPARATOR: '[' | ']' | '{' | '}' | '(' | ')' | ';' | ':' | '.' | ',';
OPERATOR: '^' | 'new' | '=' | UNA_OPERATOR | BIN_OPERATOR;
UNA_OPERATOR: '!';
BIN_OPERATOR: '+' | '-' | '*' | '\\' | '/' | '%' | '>' | '>=' | '<' | '<='
| '==' | '<>' | '&&' | '||' | ':=';
PRIMITIVE_TYPE: 'integer' | 'float' | 'bool' | 'string' | 'void';
BOOLLIT: 'true' | 'false';
FLOATLIT: [0-9]+ ((('.'[0-9]* (('E'|'e')('+'|'-')?[0-9]+)? ))|(('E'|'e')('+'|'-')? [0-9]+));
INTLIT: [0-9]+;
STRINGLIT: '"' ('\\'[bfrnt\\"]|~[\r\t\n\\"])* '"';
ILLEGAL_ESC: '"' (('\\'[bfrnt\\"]|~[\n\\"]))* ('\\'(~[bfrnt\\"]))
{if (true) throw new bkool.parser.IllegalEscape(getText());};
UNCLOSED_STRING: '"'('\\'[bfrnt\\"]|~[\r\t\n\\"])*
{if (true) throw new bkool.parser.UncloseString(getText());};
COMMENT: (BLOCK_COMMENT|LINE_COMMENT) -> skip;
BLOCK_COMMENT: '(''*'(('*')?(~')'))*'*'')';
LINE_COMMENT: '#' (~[\n])* ('\n'|EOF);
ID: [a-zA-z_]+ [a-zA-z_0-9]* ;
WS: [ \t\r\n]+ -> skip ;
ERROR_TOKEN: . {if (true) throw new bkool.parser.ErrorToken(getText());};
I opened the parse tree, and tried to test:
class abc
{
final integer x=1;
}
It returned errors:
BKOOL::program:3:8: mismatched input 'integer' expecting PRIMITIVE_TYPE
BKOOL::program:3:17: mismatched input '=' expecting {':', ','}
I still don't understand why. Could you please help me see why it didn't recognize the rules and tokens as I expected?
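Lexer rules are exclusive: each stretch of input becomes exactly one token. The longest match wins, and when two rules match the same length, the rule defined first in the grammar wins.
In your case, integer is lexed as KEYWORD instead of PRIMITIVE_TYPE, because KEYWORD is defined earlier.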
What you should do here:
Make one distinct token per keyword instead of a catch-all KEYWORD rule.
Turn PRIMITIVE_TYPE into a parser rule.
Do the same for operators.
Right now, your example:
class abc
{
final integer x=1;
}
gets converted to tokens such as:
class ID { final KEYWORD ID = INTLIT ; }
This is thanks to implicit token definitions: you've used literals such as 'class' directly in your parser rules. These get converted to anonymous tokens such as T__0 : 'class'; which get the highest priority.
If this weren't the case, you'd end up with:
KEYWORD ID SEPARATOR KEYWORD KEYWORD ID OPERATOR INTLIT SEPARATOR SEPARATOR
And that's... not quite easy to parse ;-)
That's why I'm telling you to break down your tokens properly.
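To see the tie-breaking rule in action without running ANTLR, here is a rough Python simulation of longest-match lexing. It is only a sketch: the tokenize helper, the reduced rule lists, and token names such as INTEGER or LBRACE are made up for illustration and are not part of ANTLR or of the original grammar.

```python
import re

def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer only, so an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # whitespace is skipped
            out.append(best_name)
        pos += best_len
    return out

SOURCE = "class abc { final integer x=1; }"

# Order as in the question: implicit literals first, then KEYWORD before PRIMITIVE_TYPE.
ORIGINAL = [
    ("'class'", r"class"), ("'final'", r"final"),
    ("'{'", r"\{"), ("'}'", r"\}"), ("'='", r"="), ("';'", r";"),
    ("KEYWORD", r"integer|float|bool|string|void|class|extends|final|static"),
    ("PRIMITIVE_TYPE", r"integer|float|bool|string|void"),
    ("INTLIT", r"[0-9]+"),
    ("ID", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("WS", r"\s+"),
]

# Suggested shape: one token per keyword, no catch-all rule.
FIXED = [
    ("CLASS", r"class"), ("FINAL", r"final"), ("INTEGER", r"integer"),
    ("LBRACE", r"\{"), ("RBRACE", r"\}"), ("ASSIGN", r"="), ("SEMI", r";"),
    ("INTLIT", r"[0-9]+"),
    ("ID", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("WS", r"\s+"),
]

print(tokenize(SOURCE, ORIGINAL))
print(tokenize(SOURCE, FIXED))
```

With the original ordering the fifth token comes out as KEYWORD, which matches the "expecting PRIMITIVE_TYPE" error. With one token per keyword it comes out as INTEGER, which a parser rule listing the primitive types can then accept.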
I'm writing an ANTLR lexer/parser for a context-free grammar.
This is what I have now:
statement
: assignment_statement
;
assignment_statement
: IDENTIFIER '=' expression ';'
;
term
: IDENT
| '(' expression ')'
| INTEGER
| STRING_LITERAL
| CHAR_LITERAL
| IDENT '(' actualParameters ')'
;
negation
: 'not'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/' | 'mod') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
So my assignment statement is identified by the form
IDENTIFIER = expression;
However, an assignment statement should also cover cases where the right-hand side is a function call (that is, the value assigned is the function's return value). For example,
items = getItems();
What grammar rule should I add for this? I thought of adding a function call to the "expression" rule, but I wasn't sure whether a function call should be regarded as an expression.
Thanks
This grammar looks fine to me. I am assuming that IDENT and IDENTIFIER are the same and that you have additional productions for the remaining terminals.
This production seems to define a function call.
| IDENT '(' actualParameters ')'
You need a production for the actual parameters that also allows an empty list, something like this:
actualParameters : ( expression ( ',' expression )* )? ;
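To show that treating a call as just another kind of term is enough, here is a small hand-written recursive-descent sketch in Python. It covers only a slice of the grammar (assignment, '+', integers, identifiers, and calls). The Parser class and lex function are illustrative names, not anything ANTLR generates.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9]*|\d+|[()=;,+])")

def lex(src):
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def assignment(self):            # IDENT '=' expression ';'
        name = self.eat()
        self.eat("=")
        value = self.expression()
        self.eat(";")
        return ("assign", name, value)

    def expression(self):            # term ('+' term)*
        node = self.term()
        while self.peek() == "+":
            self.eat("+")
            node = ("+", node, self.term())
        return node

    def term(self):
        tok = self.eat()
        if tok == "(":               # '(' expression ')'
            node = self.expression()
            self.eat(")")
            return node
        if tok.isdigit():            # INTEGER
            return ("int", int(tok))
        if not tok[0].isalpha():
            raise SyntaxError(f"unexpected token {tok!r}")
        if self.peek() == "(":       # IDENT '(' actualParameters ')'
            self.eat("(")
            args = self.actual_parameters()
            self.eat(")")
            return ("call", tok, args)
        return ("var", tok)          # plain IDENT

    def actual_parameters(self):     # (expression (',' expression)*)?
        args = []
        if self.peek() != ")":
            args.append(self.expression())
            while self.peek() == ",":
                self.eat(",")
                args.append(self.expression())
        return args

print(Parser(lex("items = getItems();")).assignment())
```

Because the call lives inside term, it can appear anywhere an expression can, including on the right-hand side of an assignment, without a separate rule in assignment_statement.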
I wrote a grammar in ANTLR for a Java-like if statement as follows:
if_statement
: 'if' expression
(statement | '{' statement+ '}')
('elif' expression (statement | '{' statement+ '}'))*
('else' (statement | '{' statement+ '}'))?
;
I've implemented the "statement" and "expression" correctly, but the if_statement is giving me the following error:
Decision can match input such as "'elif'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
|---> ('elif' expression (statement | '{' statement+ '}'))*
warning(200): /OptDB/src/OptDB/XL.g:38:9:
Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
|---> ('else' (statement | '{' statement+ '}'))?
It seems like there are problems with the "elif" and "else" block.
Basically, we can have 0 or more "elif" blocks, so I wrapped them with *.
Also, we can have 0 or 1 "else" block, so I wrapped it with ?.
What seems to cause the error?
========================================================================
I'll also put the implementations of "expression" and "statements":
statement
: assignment_statement
| if_statement
| while_statement
| for_statement
| function_call_statement
;
term
: IDENTIFIER
| '(' expression ')'
| INTEGER
| STRING_LITERAL
| CHAR_LITERAL
| IDENTIFIER '(' actualParameters ')'
;
negation
: 'not'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/' | 'mod') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
actualParameters
: expression (',' expression)*
;
Because your grammar allows a statement block that is not wrapped in {...}, you've got yourself a classic dangling-else ambiguity.
Short explanation. The input:
if expr1 if expr2 ... else ...
could be parsed as:
Parse 1
if expr1
  if expr2
    ...
  else
    ...
but also as this:
Parse 2
if expr1
  if expr2
    ...
else
  ...
To eliminate the ambiguity, either change:
(statement | '{' statement+ '}')
into:
'{' statement+ '}'
// or
'{' statement* '}'
so that it's clear from the braces which if the else belongs to, or add a predicate to force the parser to choose Parse 1:
if_statement
: 'if' expression statement_block
(('elif')=> 'elif' expression statement_block)*
(('else')=> 'else' statement_block)?
;
statement_block
: '{' statement* '}'
| statement
;
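The effect of resolving the ambiguity greedily, so that the else attaches to the nearest if, can be sketched with a tiny hand-written parser in Python. This is only an illustration of the idea, not what ANTLR generates; parse_statement and the flat token list are assumptions made for the example.

```python
def parse_statement(tokens, pos=0):
    """Return (tree, next_pos) for: stmt : 'if' EXPR stmt ('else' stmt)? | NAME ;"""
    if tokens[pos] == "if":
        cond = tokens[pos + 1]
        then_branch, pos = parse_statement(tokens, pos + 2)
        else_branch = None
        # Greedy choice: if an 'else' follows, the innermost open 'if' takes it.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_statement(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    return tokens[pos], pos + 1

tree, end = parse_statement("if e1 if e2 s1 else s2".split())
print(tree)
```

The outer if ends up with no else branch and the inner if owns s2. Requiring braces around every block removes the choice altogether, which is why that is the cleaner fix when you control the language.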
I'm currently having some parsing issues with my grammar and I just can't figure out what is going wrong.
Here's my grammar:
grammar Demo;
@header {
import java.util.List;
import java.util.ArrayList;
}
program:
functionList #programFunction
;
functionList:
function*
;
function:
'haupt()' '{' stmntList '}' #hauptFunction
| type='zahl' ID '(' paramList ')' '{' stmntList '}' #zahlFunction
| type='Zeichenkette' ID '(' paramList ')' '{' stmntList '}' #zeichenketteFunction
| type='nix' ID '(' paramList ')' '{' stmntList '}' #nixFunction
;
paramList:
param (',' paramList)?
;
param:
'zahl' ID
| 'Zeichenkette' ID
|
;
variableList:
ID (',' variableList)?
;
stmntList:
stmnt (stmntList)?
;
stmnt:
'zahl' varName=ID ';' #zahlStmnt
| 'Zeichenkette' varName=ID ';' #zeichenketteStmnt
| varName=ID '=' varValue=expr ';' #varAssignment
| 'Schreibe' '(' argument=expr ')' ';' #schreibeImmediat
| 'Schreibe''(' argument=ID ')' ';' #schreibeText
| 'zuZeichenkette' '(' ID ')'';' #convertString
| 'zuZahl''('ID')'';' #convertInteger
| 'wenn' '(' boolExpr ')' '{' stmntList '}' ('sonst' '{' stmntList '}')? #wennsonstStmnt
| 'fuer' '(' ID '=' expr ',' boolExpr ',' stmnt ')' '{' stmntList '}' #forLoop
| 'waehrend' '(' boolExpr ')' '{' stmntList '}' #whileLoop
| 'tu' '{' stmntList '}' 'waehrend' '(' boolExpr ')' ';' #doWhile
| 'return' expr ';' #returnVar
| fctName=ID '(' (variableList)? ')'';' #functionCall
;
boolExpr:
boolParts ('&&' boolExpr)? #logicAnd
| boolParts ('||' boolExpr)? #logicOr
;
boolParts:
expr '==' expr #isEqual
| expr '!=' expr #isUnequal
| expr '>' expr #biggerThan
| expr '<' expr #smallerThan
| expr '>=' expr #biggerEqual
| expr '<=' expr #smallerEqual
;
expr:
links=expr '+' rechts=product #addi
| links = expr '-' rechts=product #diff
|product #prod
;
product:
links=product '*' rechts=factor #mult
| links=product '/' rechts=factor #teil
| factor #fact
;
factor:
'(' expr')' #bracket
| var=ID #var
| zahl=NUMBER #numb
;
ID : [a-zA-Z]+;
NUMBER : '0'|[1-9][0-9]*;
WS: [\r\n\t ]+ -> skip ;
And this is the code I am trying to parse:
haupt() {
zahl zz;
zz = 2;
zahl cc;
cc = 3;
zz = zz+cc;
Schreibe(cc+cc+cc);
}
The problems arise already in the first line: it tells me that it expects a '{' instead of ' '. I can't understand this, since my grammar skips all whitespace. The next error is the wrong recognition of the second line: the variable declaration "zahl zz;" is not understood as it should be. The first alternative of stmnt should match it, but it does not...
Here are the errors ANTLR's TestRig gives me:
line 2:6 no viable alternative at input 'zahlzz'
line 4:6 no viable alternative at input 'zahlcc'
line 9:12 mismatched input '+' expecting ')'
Thanks for your help!
Tim
When I see a weird, nonsensical bugaboo like this, it generally means that there is a token mismatch between what the lexer and parser think the token types are. Make sure that you regenerate all of your grammars using ANTLR and recompile everything. Hopefully that will clear it up.
I'm trying to implement a grammar for parsing Lucene queries. Everything went smoothly until I tried to add support for range queries. Lucene details aside, my grammar looks like this:
grammar ModifiedParser;
TERM_RANGE : '[' ('*' | TERM_TEXT) 'TO' ('*' | TERM_TEXT) ']'
| '{' ('*' | TERM_TEXT) 'TO' ('*' | TERM_TEXT) '}'
;
query : not (booleanOperator? not)* ;
booleanOperator : andClause
| orClause
;
andClause : 'AND' ;
notClause : 'NOT' ;
orClause : 'OR' ;
not : notClause? MODIFIER? clause;
clause : unqualified
| qualified
;
unqualified : TERM_RANGE # termRange
| TERM_PHRASE # termPhrase
| TERM_PHRASE_ANYTHING # termTruncatedPhrase
| '(' query ')' # queryUnqualified
| TERM_TEXT_TRUNCATED # termTruncatedText
| TERM_NORMAL # termText
;
qualified : TERM_NORMAL ':' unqualified
;
fragment TERM_CHAR : (~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\'' | '\"' | '(' | ')' | '[' | ']' | '{' | '}'
| '+' | '-' | '!' | ':' | '~' | '^'
| '?' | '*' | '\\' ))
;
fragment TERM_START_CHAR : TERM_CHAR
| ESCAPE
;
fragment ESCAPE : '\\' ~[];
MODIFIER : '-'
| '+'
;
AND : 'AND';
OR : 'OR';
NOT : 'NOT';
TERM_PHRASE_ANYTHING : '"' (ESCAPE|~('\"'|'\\'))+ '"' ;
TERM_PHRASE : '"' (ESCAPE|~('\"'|'\\'|'?'|'*'))+ '"' ;
TERM_TEXT_TRUNCATED : ('*'|'?')(TERM_CHAR+ ('*'|'?'))+ TERM_CHAR*
| TERM_START_CHAR (TERM_CHAR* ('?'|'*'))+ TERM_CHAR+
| ('?'|'*') TERM_CHAR+
;
TERM_NORMAL : TERM_TEXT;
fragment TERM_TEXT : TERM_START_CHAR TERM_CHAR* ;
WS : [ \t\r\n] -> skip ;
When I write a visitor and work with the tokens, parsing asd [ 10 TO 100 ] { 1 TO 1000 } 100..1000 throws token recognition errors for [, ], } and {, and only tries to visit the termRange rule on the third range. Do you know what I'm missing here? Thanks in advance.
Since you made TERM_RANGE a lexer rule, you must account for everything at a character level. In particular, you forgot to allow whitespace characters in your input.
You would likely be in a much better position if you instead created termRange, a parser rule.
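The difference between the two approaches can be imitated with regular expressions in Python. This is a loose sketch, not ANTLR: TERM_TEXT is a simplified stand-in for the fragment in the question (escapes are left out), and lex and is_term_range are hypothetical helpers that play the role of a whitespace-skipping lexer and a termRange parser rule.

```python
import re

# Simplified TERM_TEXT: one or more characters outside the excluded set (escapes omitted).
TERM_TEXT = r"""[^\s'"()\[\]{}+\-!:~^?*\\]+"""

# TERM_RANGE as a single lexer rule: every character must be spelled out,
# and nothing here permits whitespace between the pieces.
RANGE_AS_LEXER_RULE = re.compile(
    r"\[(?:\*|" + TERM_TEXT + r")TO(?:\*|" + TERM_TEXT + r")\]"
)

# Parser-rule style: lex small tokens first, skipping whitespace between them.
TOKEN = re.compile(r"\s*(\[|\]|\{|\}|\*|" + TERM_TEXT + r")")

def lex(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"token recognition error at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def is_term_range(tokens):
    """termRange : '[' bound 'TO' bound ']' | '{' bound 'TO' bound '}' ;"""
    if len(tokens) != 5:
        return False
    opener, low, to, high, closer = tokens
    brackets_ok = (opener, closer) in {("[", "]"), ("{", "}")}
    bound_ok = lambda t: t == "*" or re.fullmatch(TERM_TEXT, t) is not None
    return brackets_ok and to == "TO" and bound_ok(low) and bound_ok(high)

print(RANGE_AS_LEXER_RULE.fullmatch("[10TO100]") is not None)
print(RANGE_AS_LEXER_RULE.fullmatch("[ 10 TO 100 ]") is not None)
print(is_term_range(lex("[ 10 TO 100 ]")))
```

The single-rule version only accepts the range when it is written without spaces, while the token-then-parse version does not care about spacing because whitespace is dropped before the range is assembled.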
I have a grammar and everything works fine until this portion:
lexp
: factor ( ('+' | '-') factor)*
;
factor :('-')? IDENT;
This of course introduces an ambiguity. For example, a-a can be matched either as Factor - Factor or as Factor -> - IDENT.
I get the following warning stating this:
[18:49:39] warning(200): withoutWarningButIncomplete.g:57:31:
Decision can match input such as "'-' {IDENT, '-'}" using multiple alternatives: 1, 2
How can I resolve this ambiguity? I just don't see a way around it. Is there some kind of option that I can use?
Here is the full grammar:
program
: includes decls (procedure)*
;
/* Check if correct! */
includes
: ('#include' STRING)*
;
decls
: (typedident ';')*
;
typedident
: ('int' | 'char') IDENT
;
procedure
: ('int' | 'char') IDENT '(' args ')' body
;
args
: typedident (',' typedident )* /* Check if correct! */
| /* epsilon */
;
body
: '{' decls stmtlist '}'
;
stmtlist
: (stmt)*;
stmt
: '{' stmtlist '}'
| 'read' '(' IDENT ')' ';'
| 'output' '(' IDENT ')' ';'
| 'print' '(' STRING ')' ';'
| 'return' (lexp)* ';'
| 'readc' '(' IDENT ')' ';'
| 'outputc' '(' IDENT ')' ';'
| IDENT '(' (IDENT ( ',' IDENT )*)? ')' ';'
| IDENT '=' lexp ';';
lexp
: term (( '+' | '-' ) term) * /*Add in | '-' to reveal the warning! !*/
;
term
: factor (('*' | '/' | '%') factor )*
;
factor : '(' lexp ')'
| ('-')? IDENT
| NUMBER;
fragment DIGIT
: ('0' .. '9')
;
IDENT : ('A' .. 'Z' | 'a' .. 'z') (( 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_'))* ;
NUMBER
: ( ('-')? DIGIT+)
;
CHARACTER
: '\'' ('a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '\\n' | '\\t' | '\\\\' | '\\' | 'EOF' |'.' | ',' |':' ) '\'' /* IS THIS COMPLETE? */
;
As mentioned in the comments: these rules are not ambiguous:
lexp
: factor (('+' | '-') factor)*
;
factor : ('-')? IDENT;
This is the cause of the ambiguity:
'return' (lexp)* ';'
which can parse the input a-b in two different ways:
a-b as a single binary expression
a as a single expression, and -b as a unary expression
You will need to change your grammar. Perhaps separate multiple return values with commas? Something like this:
'return' (lexp (',' lexp)*)? ';'
which will match:
return;
return a;
return a, -b;
return a-b, c+d+e, f;
...
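The two readings of a-b can be counted mechanically. The Python sketch below is only an illustration under simplifying assumptions: it models lexp as factor (('+' | '-') factor)* with factor as an optional '-' followed by an identifier, and the helper names is_lexp, splits and return_values are invented for the example.

```python
def is_lexp(tokens):
    """lexp : factor (('+' | '-') factor)* ;  factor : '-'? IDENT ;"""
    def factor(i):
        if i < len(tokens) and tokens[i] == "-":
            i += 1
        if i < len(tokens) and tokens[i].isalpha():
            return i + 1
        return None

    i = factor(0)
    while i is not None and i < len(tokens):
        if tokens[i] not in ("+", "-"):
            return False
        i = factor(i + 1)
    return i == len(tokens)

def splits(tokens):
    """Every way to cut the tokens into consecutive lexps, as (lexp)* allows."""
    if not tokens:
        return [[]]
    result = []
    for cut in range(1, len(tokens) + 1):
        head = tokens[:cut]
        if is_lexp(head):
            result += [[head] + rest for rest in splits(tokens[cut:])]
    return result

def return_values(tokens):
    """With lexp (',' lexp)*, the commas fix where each value starts and ends."""
    values, current = [], []
    for tok in tokens:
        if tok == ",":
            values.append(current)
            current = []
        else:
            current.append(tok)
    values.append(current)
    return values if all(is_lexp(v) for v in values) else None

print(splits("a - b".split()))
print(return_values("a , - b".split()))
```

Without a separator, a - b can be cut in two ways (one value, or a followed by -b). With commas each input has exactly one reading.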