ANTLR parse assignments - parsing

I want to parse some assignments, where I only care about the assignment as a whole, not about what's inside it. An assignment is indicated by ':='. (EDIT: other things may come before and after the assignments.)
Some examples:
a := TRUE & FALSE;
c := a ? 3 : 5;
b := case
a : 1;
!a : 0;
esac;
Currently I distinguish between assignments containing a 'case' and other assignments. For simple assignments I tried something like ~('case' | 'esac' | ';'), but then ANTLR complained about unmatched tokens (like '=').
assignment :
NAME ':='! expression ;
expression :
( simple_expression | case_expression) ;
simple_expression :
((OPERATOR | NAME) & ~('case' | 'esac'))+ ';'! ;
case_expression :
'case' .+ 'esac' ';'! ;
I tried replacing it with the following, because the Eclipse interpreter did not seem to like ((OPERATOR | NAME) & ~('case' | 'esac'))+ ';'! ; because of the 'and'.
(~(OPERATOR | ~NAME | ('case' | 'esac')) |
~(~OPERATOR | NAME | ('case' | 'esac')) |
~(~OPERATOR | ~NAME | ('case' | 'esac'))) ';'!
But this does not work. I get
"error(139): /AntlrTutorial/src/foo/NusmvInput.g:78:5: set complement is empty |---> ~(~OPERATOR | ~NAME | ('case' | 'esac'))) EOC! ;"
How can I parse it?

There are a couple of things going wrong here:
you're using & in your grammar, while it should have quotes around it: '&';
unless you know exactly what you're doing, don't use ~ and . (especially not .+ !) inside parser rules: use them in lexer rules only;
create lexer rules instead of defining 'case' and 'esac' in your parser rules. It's safe to use literal tokens in your parser rules if no other lexer rule can potentially match them, but 'case' and 'esac' look a lot like NAME and they could end up in your AST, in which case it's better to define them explicitly in the lexer.
Here's a quick demo:
grammar T;
options {
output=AST;
}
tokens {
ROOT;
CASES;
CASE;
}
parse
: (assignment SCOL)* EOF -> ^(ROOT assignment*)
;
assignment
: NAME ASSIGN^ expression
;
expression
: ternary_expression
;
ternary_expression
: or_expression (QMARK^ ternary_expression COL! ternary_expression)?
;
or_expression
: unary_expression ((AND | OR)^ unary_expression)*
;
unary_expression
: NOT^ atom
| atom
;
atom
: TRUE
| FALSE
| NUMBER
| NAME
| CASE single_case+ ESAC -> ^(CASES single_case+)
| '(' expression ')' -> expression
;
single_case
: expression COL expression SCOL -> ^(CASE expression expression)
;
TRUE : 'TRUE';
FALSE : 'FALSE';
CASE : 'case';
ESAC : 'esac';
ASSIGN : ':=';
AND : '&';
OR : '|';
NOT : '!';
QMARK : '?';
COL : ':';
SCOL : ';';
NAME : ('a'..'z' | 'A'..'Z')+;
NUMBER : ('0'..'9')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse your input:
a := TRUE & FALSE;
c := a ? 3 : 5;
b := case
a : 1;
!a : 0;
esac;
as follows: [AST image not preserved]

Related

Parsing Decaf grammar in Antlr4

I am creating parser and lexer rules for the Decaf programming language in ANTLR4. I am trying to run a parser test file and get its parse tree by printing the visited nodes to the terminal and pasting them into the D3_parser_tree.html file. The current parse tree is missing the right square bracket along with the number 10 for this test file: class program { int i [10]; }
The error I am getting: mismatched input '10' expecting INT_LITERAL
I am not sure why I am getting this error, since I have declared a lexer rule for INT_LITERAL and used it in a parser rule within field_decl, following the given Decaf spec:
Parser rules:
<program> → class Program ‘{‘ <field_decl>* <method_decl>* ‘}’
<field_decl> → <type> { <id> | <id> ‘[‘ <int_literal> ‘]’ }+, ;
<method_decl> → { <type> | void } <id> ( [ { <type> <id> }+, ] ) <block>
<digit> → 0 | 1 | 2 | … | 9
<block> → ‘{‘ <var_decl>* <statement>* ‘}’
<literal> → <int_literal> | <char_literal> | <bool_literal>
<hex_digit> → <digit> | a | b | c | … | f | A | B | C | … | F
<int_literal> → <decimal_literal> | <hex_literal>
<decimal_literal> → <digit> <digit>*
<hex_literal> → 0x <hex_digit> <hex_digit>*
Related Lexer rules :
NUMBER : [0-9]+;
fragment ALPHA : [_a-zA-Z0-9];
fragment DIGIT : [0-9];
fragment DECIMAL_LITERAL : DIGIT+;
CHAR_LITERAL : '\'' CHAR '\'';
STRING_LITERAL : '"' CHAR+ '"' ;
COMMENT : '//' ~('\n')* '\n' -> skip;
WS : (' ' | '\n' | '\t' | '\r') + -> skip;
Related Parser rules :
program : CLASS VAR LCURLYBRACE field_decl*method_decl* RCURLYBRACE EOF;
field_decl : data_type field ( COMMA field )* SEMICOLON;
Please let me know if you need further details & I appreciate your help a lot.
The following rules conflict:
VAR : ALPHA+;
...
NUMBER : [0-9]+;
...
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;
They all match 10, but the lexer will always choose VAR, since that is the rule defined first.
This is just how ANTLR's lexer works: it tries to match as many characters as possible, and when two (or more) rules match the same number of characters, the one defined first "wins".
You will see that it parses correctly if you change field into:
field : VAR | VAR LSQUAREBRACE VAR RSQUAREBRACE;
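The tie-break described above can be illustrated with a small Python sketch. This is a toy simulation, not ANTLR itself: the regexes only approximate the quoted rules, and `RULES` and `pick_rule` are hypothetical names introduced for the example.

```python
import re

# Toy stand-ins for the conflicting Decaf rules, in grammar order.
# The regexes are assumptions that approximate the rules quoted above.
RULES = [
    ("VAR", r"[_a-zA-Z0-9]+"),                  # VAR : ALPHA+ (ALPHA includes digits)
    ("NUMBER", r"[0-9]+"),
    ("INT_LITERAL", r"0x[0-9a-fA-F]+|[0-9]+"),
]


def pick_rule(text, rules):
    """Return (rule_name, lexeme) for the token at the start of `text`.

    The longest match wins; on a tie, the rule listed first wins.
    """
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strict '>' keeps the earlier rule when the lengths are equal.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name, text[:best_len]


print(pick_rule("10];", RULES))        # ('VAR', '10')

# Listing INT_LITERAL first changes the winner for the same input.
reordered = [RULES[2], RULES[0], RULES[1]]
print(pick_rule("10];", reordered))    # ('INT_LITERAL', '10')
```

In this sketch all three rules match the two characters of 10, so only the rule order decides which token type comes out.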

antlr4 line 2:0 mismatched input 'if' expecting {'if', OTHER}

I am having a bit of difficulty in my g4 file. Below is my grammar:
// Define a grammar called Hello
grammar GYOO;
program : 'begin' block+ 'end';
block
: statement+
;
statement
: assign
| print
| add
| ifstatement
| OTHER {System.err.println("unknown char: " + $OTHER.text);}
;
assign
: 'let' ID 'be' expression
;
print
: 'print' (NUMBER | ID)
;
ifstatement
: 'if' condition_block (ELSE IF condition_block)* (ELSE stat_block)?
;
add
: (NUMBER | ID) OPERATOR (NUMBER | ID) ASSIGN ID
;
stat_block
: OBRACE block CBRACE
| statement
;
condition_block
: expression stat_block
;
expression
: NOT expression //notExpr
| expression (MULT | DIV | MOD) expression //multiplicationExpr
| expression (PLUS | MINUS) expression //additiveExpr
| expression (LTEQ | GTEQ | LT | GT) expression //relationalExpr
| expression (EQ | NEQ) expression //equalityExpr
| expression AND expression //andExpr
| expression OR expression //orExpr
| atom //atomExpr
;
atom
: (NUMBER | FLOAT) //numberAtom
| (TRUE | FALSE) //booleanAtom
| ID //idAtom
| STRING //stringAtom
| NULL //nullAtom
;
ID : [a-z]+ ;
NUMBER : [0-9]+ ;
OPERATOR : '+' | '-' | '*' | '/';
ASSIGN : '=';
WS : (' ' | '\t' | '\r' | '\n') + -> skip;
OPAR : '(';
CPAR : ')';
OBRACE : '{';
CBRACE : '}';
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
IF : 'if';
ELSE : 'else';
OR : 'or';
AND : 'and';
EQ : 'is'; //'=='
NEQ : 'is not'; //'!='
GT : 'greater'; //'>'
LT : 'lower'; //'<'
GTEQ : 'is greater'; //'>='
LTEQ : 'is lower'; //'<='
PLUS : '+';
MINUS : '-';
MULT : '*';
DIV : '/';
MOD : '%';
POW : '^';
NOT : 'not';
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
STRING
: '"' (~["\r\n] | '""')* '"'
;
COMMENT
: '/*' .*? '*/' -> channel(HIDDEN)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(HIDDEN)
;
OTHER
: .
;
When I try to show the tree with ANTLR's -gui option, I get this error:
line 2:3 missing OPERATOR at 'a'
This error is given from this code example:
begin
let a be true
if a is true
print a
end
Basically, it does not recognize the ifstatement beginning with IF 'if', and it shows the tree as if I were making an assignment.
How can I fix this?
P.S. I also tried reordering my statements. I also tried removing all statements except ifstatement, and the same thing happens.
Thanks
There is at least one issue:
ID : [a-z]+ ;
...
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
IF : 'if';
ELSE : 'else';
OR : 'or';
...
NOT : 'not';
Since ID is placed before TRUE .. NOT, those tokens will never be created since ID has precedence over them (and ID matches these tokens as well).
Start by moving ID beneath the NOT token.
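A rough way to spot this kind of shadowing is sketched below in Python. It is not an ANTLR feature: `shadowed_keywords` and the rule-list format are hypothetical, and the check only covers literal rules hidden by an earlier pattern rule that matches the same text.

```python
import re


def shadowed_keywords(rules):
    """List (keyword_rule, shadowing_rule) pairs for unreachable literal rules.

    `rules` is an ordered list of (name, kind, value), where kind is
    "literal" (value is the exact text) or "pattern" (value is a regex).
    A literal is shadowed when an earlier pattern matches its whole text:
    both rules then match the same length and the earlier one wins.
    """
    shadowed = []
    for i, (name, kind, value) in enumerate(rules):
        if kind != "literal":
            continue
        for earlier_name, earlier_kind, earlier_value in rules[:i]:
            if earlier_kind == "pattern" and re.fullmatch(earlier_value, value):
                shadowed.append((name, earlier_name))
                break
    return shadowed


# A subset of the lexer rules from the question, in their original order.
ORIGINAL = [
    ("ID", "pattern", r"[a-z]+"),
    ("TRUE", "literal", "true"),
    ("IF", "literal", "if"),
    ("NOT", "literal", "not"),
]
# The same rules with ID moved beneath NOT, as suggested above.
FIXED = ORIGINAL[1:] + ORIGINAL[:1]

print(shadowed_keywords(ORIGINAL))  # [('TRUE', 'ID'), ('IF', 'ID'), ('NOT', 'ID')]
print(shadowed_keywords(FIXED))     # []
```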

ANTLR 'or' regular expression

I have a serious problem with the | operator.
My grammar contains rules like this:
...ifelse : 'IF' condition 'THEN' dosomething+ 'ENDIF'
...dosomething : assign | print | input;
but dosomething seems to get fixed to a single alternative. For example:
IF a > 3 THEN
PRINT "HEllo"
b = a
ENDIF
so the first dosomething is a print, and the grammar then can't read an assign or an input.
If the statements look like this, it works correctly:
IF a > 3 THEN
PRINT "HEllo"
PRINT myName
ENDIF
So I mean that an 'or' expression ( | | )+ gets fixed to whichever alternative occurred first.
grammar hellog;
prog : command+;
command : maincommand
| expressioncommand
| flowcommand
;
//main
maincommand : printcommand
| inputcommand
;
printcommand : 'PRINT' (IDINT | IDSTR | STRING) NL
| 'PRINT' (IDINT | IDSTR | STRING) (',' (IDINT | IDSTR | STRING))* NL
;
inputcommand : 'INPUT' (IDINT | IDSTR) NL
| 'INPUT' STRING? (IDINT | IDSTR) NL
;
//expression
expressioncommand : intexpression
| strexpression
;
intexpression : IDINT '=' (IDINT | INT) NL
| IDINT '=' (IDINT | INT) (OPERATORMATH (IDINT | INT))* NL
;
strexpression : IDSTR '=' (IDSTR | STRING) NL
| IDSTR '=' (IDSTR | STRING) ('+' (IDSTR | STRING))* NL
;
//flow
flowcommand : ifelseflow
| whileflow
;
ifelseflow : 'IF' conditionflow 'THEN' NL dosomething+ ('ELSEIF' conditionflow 'THEN' NL dosomething+)* ('ELSE' NL dosomething+)? 'ENDIF' NL;
whileflow : 'WHILE' conditionflow NL (dosomething)+ 'WEND' NL;
dosomething : command;
conditionflow : (INT | IDINT) OPERATORBOOL (INT | IDINT)
| (STRING | IDSTR) '=' (STRING | IDSTR)
;
INT : [0-9]+;
STRING : '"' .*? '"';
IDINT : [a-zA-Z]+;
IDSTR : [a-zA-Z]+'$';
NL : '\n';
WS : [ \t\r]+ -> skip;
OPERATORMATH : '+' | '-' | '*' | '/';
OPERATORBOOL : '=' | '>' | '<' | '>=' | '<=';
I just need a grammar to run these statements:
PRINT "Your name"
INPUT name
PRINT "HELLO" name
a = 6
IF a > 3 THEN
PRINT a
a = a -1
END IF
WHILE b = 3
PRINT b
a = b
WEND
My answer isn't exactly about the | alternatives, but please keep reading, because like you, I found if..else constructs in a BASIC-like language a real challenge to implement. I found some good resources online. When I got it right, many, many problems disappeared all at once and it just started to work. Please take a look at my grammar snippet:
ifstmt
: IF condition_block (ELSE IF condition_block)* (ELSE stmt_block)?
;
condition_block
: expr stmt_block
;
stmt_block
: OBRACE statement+ CBRACE
| statement
;
And my implementation (in C#, using the visitor pattern):
public override MuValue VisitIfstmt(LISBASICParser.IfstmtContext context)
{
    LISBASICParser.Condition_blockContext[] conditions = context.condition_block();
    bool evaluatedBlock = false;

    // Run the first condition block whose expression evaluates to true.
    foreach (LISBASICParser.Condition_blockContext condition in conditions)
    {
        MuValue evaluated = Visit(condition.expr());
        if (evaluated.AsBoolean())
        {
            evaluatedBlock = true;
            Visit(condition.stmt_block());
            break;
        }
    }

    // No condition matched: fall back to the trailing ELSE block, if present.
    if (!evaluatedBlock && context.stmt_block() != null)
    {
        Visit(context.stmt_block());
    }
    return MuValue.Void;
}
Much of this is borrowed from Bart Kiers's excellent implementation of his Mu demonstration language, a project with lots of great ideas. It really showed me the light, and the code above handles if statements well, nested arbitrarily deep if you need that. This is production code running a critical domain-specific language.

Antlr parsing matching fixed string length instead of rule

Below is a cut-down version of a grammar that parses an input assembly file. Everything in my grammar is fine until I use labels that have 3 characters (i.e. the same length as an OPCODE in my grammar), so I'm assuming ANTLR is matching them as an OPCODE rather than a LABEL. How do I say "in this position, it should be a LABEL, not an OPCODE"?
Trial input:
set a, label1
set b, abc
Output from a standard rig gives:
line 2:5 missing EOF at ','
(OP_BAS set a (REF label1)) (OP_SPE set b)
When I step-debug through ANTLRWorks, I see it start down instruction rule 2, but at the reference to "abc" it jumps to rule 3 and then fails at the ",".
I can solve this with massive left factoring, but it makes the grammar incredibly unreadable. I'm trying to find a compromise (there isn't so much input that the global backtrack is a hit on performance) between readability and functionality.
grammar TestLabel;
options {
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens {
NEGATION;
OP_BAS;
OP_SPE;
OP_CMD;
REF;
DEF;
}
program
: instruction* EOF!
;
instruction
: LABELDEF -> ^(DEF LABELDEF)
| OPCODE dst_op ',' src_op -> ^(OP_BAS OPCODE dst_op src_op)
| OPCODE src_op -> ^(OP_SPE OPCODE src_op)
| OPCODE -> ^(OP_CMD OPCODE)
;
operand
: REG
| LABEL -> ^(REF LABEL)
| expr
;
dst_op
: PUSH
| operand
;
src_op
: POP
| operand
;
term
: '('! expr ')'!
| literal
;
unary
: ('+'! | negation^ )* term
;
negation
: '-' -> NEGATION
;
mult
: unary ( ( '*'^ | '/'^ ) unary )*
;
expr
: mult ( ( '+'^ | '-'^ ) mult )*
;
literal
: number
| CHAR
;
number
: HEX
| BIN
| DECIMAL
;
REG: ('A'..'C'|'I'..'J'|'X'..'Z'|'a'..'c'|'i'..'j'|'x'..'z') ;
OPCODE: LETTER LETTER LETTER;
HEX: '0x' ( 'a'..'f' | 'A'..'F' | DIGIT )+ ;
BIN: '0b' ('0'|'1')+;
DECIMAL: DIGIT+ ;
LABEL: ( '.' | LETTER | DIGIT | '_' )+ ;
LABELDEF: ':' ( '.' | LETTER | DIGIT | '_' )+ {setText(getText().substring(1));} ;
STRING: '\"' .* '\"' {setText(getText().substring(1, getText().length()-1));} ;
CHAR: '\'' . '\'' {setText(getText().substring(1, 2));} ;
WS: (' ' | '\n' | '\r' | '\t' | '\f')+ { $channel = HIDDEN; } ;
fragment LETTER: ('a'..'z'|'A'..'Z') ;
fragment DIGIT: '0'..'9' ;
fragment PUSH: ('P'|'p')('U'|'u')('S'|'s')('H'|'h');
fragment POP: ('P'|'p')('O'|'o')('P'|'p');
The parser has no influence on what tokens the lexer produces. So, the input "abc" will always be tokenized as an OPCODE, no matter what the parser tries to match.
What you can do is create a label parser rule that matches either a LABEL or an OPCODE, and then use this label rule in your operand rule:
label
: LABEL
| OPCODE
;
operand
: REG
| label -> ^(REF label)
| expr
;
resulting in the following AST for your example input: [AST image not preserved]
This will only match OPCODE, but will not change the type of the token. If you want the type to be changed as well, add a bit of custom code to the rule that changes it to type LABEL:
label
: LABEL
| t=OPCODE {$t.setType(LABEL);}
;
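The retyping idea can be sketched outside ANTLR in a few lines of Python. Everything here is illustrative: the `Token` class and the `label` function are hypothetical stand-ins for the generated parser, and the token list is written by hand to mirror what the lexer above would produce for `set b, abc`.

```python
from dataclasses import dataclass


@dataclass
class Token:
    type: str
    text: str


def label(tokens, pos):
    """Accept a LABEL or an OPCODE at tokens[pos] and return it typed as LABEL.

    Mirrors the parser rule above: the lexer has already decided the token
    type, so the parser accepts both types and rewrites the type afterwards.
    """
    tok = tokens[pos]
    if tok.type not in ("LABEL", "OPCODE"):
        raise SyntaxError(f"expected a label, got {tok.type} {tok.text!r}")
    tok.type = "LABEL"
    return tok, pos + 1


# Hand-written token stream for: set b, abc
# "abc" is three letters, so the lexer emits it as OPCODE rather than LABEL.
tokens = [
    Token("OPCODE", "set"),
    Token("REG", "b"),
    Token("COMMA", ","),
    Token("OPCODE", "abc"),
]

ref, next_pos = label(tokens, 3)
print(ref)  # Token(type='LABEL', text='abc')
```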

Assignment as expression in Antlr grammar

I'm trying to extend the grammar of the Tiny Language to treat assignment as expression. Thus it would be valid to write
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
a = 1 = 2; // invalid
Assignment differs from other operators in two aspects. It's right associative (not a big deal), and its left-hand side has to be a variable. So I changed the grammar like this:
statement: assignmentExpr | functionCall ...;
assignmentExpr: Identifier indexes? '=' expression;
expression: assignmentExpr | condExpr;
It doesn't work, because it contains a non-LL(*) decision. I also tried this variant:
assignmentExpr: Identifier indexes? '=' (expression | condExpr);
but I got the same error. I am interested in:
This specific question
Given a grammar with a non-LL(*) decision, how to find the two paths that cause the problem
How to fix it
I think you can change your grammar like this to achieve the same, without using syntactic predicates:
statement: expr ';' | functionCall ';' ...;
expr: Identifier indexes? '=' expr | condExpr ;
condExpr: .... and so on;
I altered Bart's example with this idea in mind:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: stat+ EOF -> ^(ROOT stat+)
;
stat
: expr ';'
;
expr
: Id Assign expr -> ^(Assign Id expr)
| add
;
add
: mult (('+' | '-')^ mult)*
;
mult
: atom (('*' | '/')^ atom)*
;
atom
: Id
| Num
| '('! expr ')' !
;
Assign : '=' ;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
And for the input:
a=b=4;
a = 2 * (b = 1);
you get the following parse tree: [parse tree image not preserved]
The key here is that you need to "assure" the parser that inside an expression, there is something ahead that satisfies the expression. This can be done using a syntactic predicate (the ( ... )=> parts in the add and mult rules).
A quick demo:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
ASSIGN;
}
parse
: stat* EOF -> ^(ROOT stat*)
;
stat
: expr ';' -> expr
;
expr
: add
;
add
: mult ((('+' | '-') mult)=> ('+' | '-')^ mult)*
;
mult
: atom ((('*' | '/') atom)=> ('*' | '/')^ atom)*
;
atom
: (Id -> Id) ('=' expr -> ^(ASSIGN Id expr))?
| Num
| '(' expr ')' -> expr
;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse the input:
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
into the following AST: [AST image not preserved]
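For readers who want to see the shape of the result without running ANTLR, here is a minimal hand-written recursive-descent sketch in Python of the `expr : Id '=' expr | add` idea from this thread. It is an illustration under assumptions, not the output of either grammar: the function names and the nested-tuple tree format are made up for the example, and the ';' statement terminator is left out.

```python
import re


def parse(src):
    """Parse one expression into nested tuples like ('=', 'a', rhs)."""
    # Numbers, lowercase identifiers, and single punctuation characters.
    toks = re.findall(r"\d+|[a-z]+|[^\s\w]", src)
    node, i = parse_expr(toks, 0)
    if i != len(toks):
        raise SyntaxError(f"unexpected token {toks[i]!r}")
    return node


def parse_expr(toks, i):
    # expr : Id '=' expr | add
    # Two tokens of lookahead decide; recursing on the right-hand side
    # makes assignment right associative.
    if i + 1 < len(toks) and toks[i].isalpha() and toks[i + 1] == "=":
        rhs, j = parse_expr(toks, i + 2)
        return ("=", toks[i], rhs), j
    return parse_add(toks, i)


def parse_add(toks, i):
    left, i = parse_mult(toks, i)
    while i < len(toks) and toks[i] in ("+", "-"):
        op = toks[i]
        right, i = parse_mult(toks, i + 1)
        left = (op, left, right)
    return left, i


def parse_mult(toks, i):
    left, i = parse_atom(toks, i)
    while i < len(toks) and toks[i] in ("*", "/"):
        op = toks[i]
        right, i = parse_atom(toks, i + 1)
        left = (op, left, right)
    return left, i


def parse_atom(toks, i):
    if i >= len(toks):
        raise SyntaxError("unexpected end of input")
    tok = toks[i]
    if tok == "(":
        node, i = parse_expr(toks, i + 1)
        if i >= len(toks) or toks[i] != ")":
            raise SyntaxError("expected ')'")
        return node, i + 1
    if tok.isalnum():
        return tok, i + 1
    raise SyntaxError(f"unexpected token {tok!r}")


print(parse("a = b = 1"))        # ('=', 'a', ('=', 'b', '1'))
print(parse("a = 2 * (b = 1)"))  # ('=', 'a', ('*', '2', ('=', 'b', '1')))
```

With this sketch, `a = 1 = 2` raises a SyntaxError at the second '=', because the left-hand side of an assignment must be an identifier.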
