antlr4 does't parse obvious tree - parsing

I want to create a Grammar that will parse the input statement
myvar is 43+23
and
otherVar of myvar is "hallo"
But the parser doesn't recognize anything here.
(sorry, I am not allowed to post images :( imagine a statement node with the Tokens
[myvar] [is] [43] [+] [23] as children all marked red. Same goes for the other statement)
I'm getting error messages that confuse me:
line 2:7 no viable alternative at input 'myvaris'
line 3:19 no viable alternative at input 'otherVarofmyvaris'
Where are the spaces gone? I assume, It's something with my lexer, but I can't see what the problem is. Just in case here is the grammar for these statements:
statement
: envCall #call_Environment_Function
| identifier IS expression # assignment_statement // This one should be used
| loopHeader statement_block # loop_statement
etc...
expression
: '(' expression ')' #bracket_Expression
| mathExpression #math_Expression
| identifier #identifier_Expression // this one should be used
| objectExpression #object_Expression
etc ...
identifier //both of these should be used
: selector=IDENTIFIER OF object=expression #ofIdentifier
| selector=IDENTIFIER #idLocal
;
here are all the Lexer rules I have so far:
IdentifierNamespace: IDENTIFIER '.' IDENTIFIER;
FromIn: FROM | IN;
OPENBLOCK: NEWLINE? '{';
CLOSEBLOCK: '}' NEWLINE;
NEWLINE: ['\n''\t']+;
NUMBER: INT | FLOAT;
INT: [0-9]+;
FLOAT: [0-9]* '.' [0-9]+;
IsAre: IS | ARE;
OF: 'of';
IS: 'is';
ARE: 'are';
DO: 'do';
FROM: 'from';
IN: 'in';
IDENTIFIER : [a-zA-Z]+ ;
//WHITESPACE: [ \t]+ -> skip;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
STRING : '"' (ESC | ~["\\])* '"' ;
END: 'END'[.]* EOF;
WHITESPACE : ( '\t' | ' ' )+ -> skip ;

Ok, found it. There was a compOP defined for the parser, and it was messing up the treegeneration.
compOP: '<'
| '>'
| '=' // the programmers '=='
| '>='
| '<='
| '<>'
| '!='
| 'in'
| 'not' 'in'
| 'is' <- removed this one and it works now
;
So: never assign the same keyword to Parser and Lexer, I guess.

Related

Problems with precedence in ANTLR4 Grammar

I developed this small grammar here i have an issue with:
grammar test;
term : above_term | below_term;
above_term :
<assoc=right> 'forall' binders ',' forall_term
| <assoc=right> above_term '->' above_term
| <assoc=right> above_term '->' below_term
| <assoc=right> below_term '->' above_term
| <assoc=right> below_term '->' below_term
;
below_term :
<assoc = right> below_term arg (arg)*
| '#' qualid (term)*
| below_term '%' IDENT
| qualid
| sort
| '(' term ')'
;
forall_term : term;
arg : term| '(' IDENT ':=' term ')';
binders : binder (binder)*;
binder : name |<assoc=right>name (name)* ':' term | '(' name (name)* ':' term ')' |<assoc=right> name (':' term)? ':=' term;
name : IDENT | '_';
qualid : IDENT | qualid ACCESS_IDENT;
sort : 'Prop' | 'Set' | 'Type' ;
/**************************************
* LEXER RULES
**************************************/
/*
* STRINGS
*/
STRING : '"' (~["])* '"';
/*
* IDENTIFIER AND ACCESS IDENTIFIER
*/
ACCESS_IDENT : '.' IDENT;
IDENT : FIRST_LETTER (SUBSEQUENT_LETTER)*;
fragment FIRST_LETTER : [a-z] | [A-Z] | '_' | UNICODE_LETTER;
fragment SUBSEQUENT_LETTER : [a-z] | [A-Z] | DIGIT | '_' | '"' | UNICODE_LETTER | UNICODE_ID_PART;
fragment UNICODE_LETTER : '\\' 'u' HEX HEX HEX HEX;
fragment UNICODE_ID_PART : '\\' 'u' HEX HEX HEX HEX;
fragment HEX : [0-9a-fA-F];
/*
* NATURAL NUMBERS AND INTEGERS
*/
NUM : DIGIT (DIGIT)*;
INTEGER : ('-')? NUM;
fragment DIGIT : [0-9];
WS : [ \n\t\r] -> skip;
You can copy this grammar and test it with antlr if you want, it will work. Now for my question:
Let's consider an expression like this: a b -> c d -> forall n:nat, c.
Now according to my grammar the ("->") rule (right after forall rule) has the highest precedence.
As for this I want this term to be parsed so that both ("->") rules are on top of the parse tree. like this: (Please note, that this is an abstract view, i know that there are many above and below terms between the leafs)
However sadly it doesn't get parsed this way but this way:
Howcome the parser doesn't see the (->) rules both on top of the parse tree? Is this a precedence issue?
By changing term to below_term in the (arg) rule we can fix the problem arg : below_term| '(' IDENT ':=' term ')'; .
Lets take this expression as an example: a b c.
Once the parser sees, that the pattern a b matches this rule: below_term arg (arg)* he puts a as a below_term and trys to match b with the arg rule. However since arg points to the below_term rule now, no above_term is alowed except when it is braced. This solved my problem.
The term a b -> a b c -> forall n:nat, n now gets parsed this way:

How to make certain rules mandatory in Antlr

I wrote the following grammar which should check for a conditional expression.
Examples below is what I want to achieve using this grammar:
test invalid
test = 1 valid
test = 1 and another_test>=0.2 valid
test = 1 kasd y = 1 invalid (two conditions MUST be separated by AND/OR)
a = 1 or (b=1 and c) invalid (there cannot be a lonely character like 'c'. It should always be a triplet. i.e, literal operator literal)
grammar expression;
expr
: literal_value
| expr ( '='|'<>'| '<' | '<=' | '>' | '>=' ) expr
| expr K_AND expr
| expr K_OR expr
| function_name '(' ( expr ( ',' expr )* | '*' )? ')'
| '(' expr ')'
;
literal_value
: NUMERIC_LITERAL
| STRING_LITERAL
| IDENTIFIER
;
keyword
: K_AND
| K_OR
;
name
: any_name
;
function_name
: any_name
;
database_name
: any_name
;
table_name
: any_name
;
column_name
: any_name
;
any_name
: IDENTIFIER
| keyword
| STRING_LITERAL
| '(' any_name ')'
;
K_AND : A N D;
K_OR : O R;
IDENTIFIER
: '"' (~'"' | '""')* '"'
| '`' (~'`' | '``')* '`'
| '[' ~']'* ']'
| [a-zA-Z_] [a-zA-Z_0-9]*
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
WS: [ \n\t\r]+ -> skip;
So my question is, how can I get the grammar to work for the examples mentioned above? Can we make certain words as mandatory between two triplets (literal operator literal)? In a sense I'm just trying to get a parser to validate the where clause condition but only simple condition and functions are permitted. I also want have a visitor that retrieves the values like function, parenthesis, any literal etc in Java, how to achieve that?
Yes and no.
You can change your grammar to only allow expressions that are comparisons and logical operations on the same:
expr
: term ( '='|'<>'| '<' | '<=' | '>' | '>=' ) term
| expr K_AND expr
| expr K_OR expr
| '(' expr ')'
;
term
: literal_value
| function_name '(' ( expr ( ',' expr )* | '*' )? ')'
;
The issue comes if you want to allow boolean variables or functions -- you need to classify the functions/vars in your lexer and have a different terminal for each, which is tricky and error prone.
Instead, it is generally better to NOT do this kind of checking in the parser -- have your parser be permissive and accept anything expression-like, and generate an expression tree for it. Then have a separate pass over the tree (called a type checker) that checks the types of the operands of operations and the arguments to functions.
This latter approach (with a separate type checker) generally ends up being much simpler, clearer, more flexible, and gives better error messages (rather than just 'syntax error').

"No viable alternative at input" error for ANTLR 4 JSON grammar

I am trying to adapt the STRING part of Pair in Object to a CamelString, but it fails. and report "No viable alternative at input".
I have tried to used my CamelString as an independent grammar, it works well. I think it means there is ambiguity in my grammar, but I can not understand why.
For the test input
{
'BaaaBcccCdddd':'cc'
}
Ther error is
line 2:2 no viable alternative at input '{'BaaaBcccCdddd''
The following is my grammar. It's almost the same with the standard JSON grammar for ANTLR 4.
/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */
// Derived from http://json.org
grammar Json;
json: object
| array
;
object
: '{' pair (',' pair)* '}'
| '{' '}' // empty object
;
pair : camel_string ':' value;
camel_string : '\'' (camel_body)*? '\'';
STRING
: '\'' (ESC | ~['\\])* '\'';
camel_body: CAMEL_BODY;
CAMEL_START: [a-z] ALPHA_NUM_LOWER*;
CAMEL_BODY: [A-Z] ALPHA_NUM_LOWER*;
CAMEL_END: [A-Z]+;
fragment ALPHA_NUM_LOWER : [0-9a-z];
array
: '[' value (',' value)* ']'
| '[' ']' // empty array
;
value
: STRING
| NUMBER
| object // recursion
| array // recursion
| 'true' // keywords
| 'false'
| 'null'
;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
NUMBER
: '-'? INT '.' [0-9]+ EXP? // 1.35, 1.35E-9, 0.3, -4.5
| '-'? INT EXP // 1e10 -3e4
| '-'? INT // -3, 45
;
fragment INT : '0' | [1-9] [0-9]* ; // no leading zeros
fragment EXP : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]
WS : [ \t\n\r]+ -> skip ;

Context-Free-Grammar for assignment statements in ANTLR

I'm writing an ANTLR lexer/parser for context free grammar.
This is what I have now:
statement
: assignment_statement
;
assignment_statement
: IDENTIFIER '=' expression ';'
;
term
: IDENT
| '(' expression ')'
| INTEGER
| STRING_LITERAL
| CHAR_LITERAL
| IDENT '(' actualParameters ')'
;
negation
: 'not'* term
;
unary
: ('+' | '-')* negation
;
mult
: unary (('*' | '/' | 'mod') unary)*
;
add
: mult (('+' | '-') mult)*
;
relation
: add (('=' | '/=' | '<' | '<=' | '>=' | '>') add)*
;
expression
: relation (('and' | 'or') relation)*
;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
So my assignment statement is identified by the form
IDENTIFIER = expression;
However, assignment statement should also take into account cases when the right hand side is a function call (the return value of the statement). For example,
items = getItems();
What grammar rule should I add for this? I thought of adding a function call to the "expression" rule, but I wasn't sure if function call should be regarded as expression..
Thanks
This grammar looks fine to me. I am assuming that IDENT and IDENTIFIER are the same and that you have additional productions for the remaining terminals.
This production seems to define a function call.
| IDENT '(' actualParameters ')'
You need a production for the actual parameters, something like this.
actualParameters : nothing | expression ( ',' expression )*

Antlr parsing matching fixed string length instead of rule

Below is a cut down version of a grammar that is parsing an input assembly file. Everything in my grammar is fine until i use labels that have 3 characters (i.e. same length as an OPCODE in my grammar), so I'm assuming Antlr is matching it as an OPCODE rather than a LABEL, but how do I say "in this position, it should be a LABEL, not an OPCODE"?
Trial input:
set a, label1
set b, abc
Output from a standard rig gives:
line 2:5 missing EOF at ','
(OP_BAS set a (REF label1)) (OP_SPE set b)
When I step debug through ANTLRWorks, I see it start down instruction rule 2, but at the reference to "abc" jumps to rule 3 and then fail at the ",".
I can solve this with massive left factoring, but it makes the grammar incredibly unreadable. I'm trying to find a compromise (there isn't so much input that the global backtrack is a hit on performance) between readability and functionality.
grammar TestLabel;
options {
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens {
NEGATION;
OP_BAS;
OP_SPE;
OP_CMD;
REF;
DEF;
}
program
: instruction* EOF!
;
instruction
: LABELDEF -> ^(DEF LABELDEF)
| OPCODE dst_op ',' src_op -> ^(OP_BAS OPCODE dst_op src_op)
| OPCODE src_op -> ^(OP_SPE OPCODE src_op)
| OPCODE -> ^(OP_CMD OPCODE)
;
operand
: REG
| LABEL -> ^(REF LABEL)
| expr
;
dst_op
: PUSH
| operand
;
src_op
: POP
| operand
;
term
: '('! expr ')'!
| literal
;
unary
: ('+'! | negation^ )* term
;
negation
: '-' -> NEGATION
;
mult
: unary ( ( '*'^ | '/'^ ) unary )*
;
expr
: mult ( ( '+'^ | '-'^ ) mult )*
;
literal
: number
| CHAR
;
number
: HEX
| BIN
| DECIMAL
;
REG: ('A'..'C'|'I'..'J'|'X'..'Z'|'a'..'c'|'i'..'j'|'x'..'z') ;
OPCODE: LETTER LETTER LETTER;
HEX: '0x' ( 'a'..'f' | 'A'..'F' | DIGIT )+ ;
BIN: '0b' ('0'|'1')+;
DECIMAL: DIGIT+ ;
LABEL: ( '.' | LETTER | DIGIT | '_' )+ ;
LABELDEF: ':' ( '.' | LETTER | DIGIT | '_' )+ {setText(getText().substring(1));} ;
STRING: '\"' .* '\"' {setText(getText().substring(1, getText().length()-1));} ;
CHAR: '\'' . '\'' {setText(getText().substring(1, 2));} ;
WS: (' ' | '\n' | '\r' | '\t' | '\f')+ { $channel = HIDDEN; } ;
fragment LETTER: ('a'..'z'|'A'..'Z') ;
fragment DIGIT: '0'..'9' ;
fragment PUSH: ('P'|'p')('U'|'u')('S'|'s')('H'|'h');
fragment POP: ('P'|'p')('O'|'o')('P'|'p');
The parser has no influence on what tokens the lexer produces. So, the input "abc" will always be tokenized as a OPCODE, no matter what the parser tries to match.
What you can do is create a label parser rules that matches either a LABEL or OPCODE and then use this label rule in your operand rule:
label
: LABEL
| OPCODE
;
operand
: REG
| label -> ^(REF label)
| expr
;
resulting in the following AST for your example input:
This will only match OPCODE, but will not change the type of the token. If you want the type to be changed as well, add a bit of custom code to the rule that changes it to type LABEL:
label
: LABEL
| t=OPCODE {$t.setType(LABEL);}
;

Resources