Antlr4 parsing issue

When I run my message.expr file against the Zmes.g4 grammar using antlr-4.7.1-complete, only the first line is parsed and there is no output for the second one. The grammar is
grammar Zmes;
prog : stat+;
stat : (message|define);
message : 'MSG' MSGNUM TEXT;
define : 'DEF:' ('String '|'int ') ID ( ',' ('String '|'Int ') ID)* ';';
fragment QUOTE : '\'';
MSGNUM : [0-9]+;
TEXT : QUOTE ~[']* QUOTE;
MODULE : [A-Z][A-Z][A-Z] ;
ID : [A-Z]([A-Za-z0-9_])*;
SKIPS : (' '|'\t'|'\r'?'\n'|'\r')+ -> skip;
and message.expr is
MSG 100 'MESSAGE YU';
DEF: String Svar1,Int Intv1;
When I run it on the command line like this
grun Zmes prog -tree message.expr
(prog (stat (message MSG 100 'MESSAGE YU')))
and nothing is printed for the second line. Why could that be?

Your message rule should include the ';' at the end:
message : 'MSG' MSGNUM TEXT ';';
Also, your define rule has 'int ' in its first alternative, which should probably be 'Int' (capital I, no trailing space).
I'd go for something like this:
grammar Zmes;
prog : stat+ EOF;
stat : (message | define) SCOL;
message : MSG MSGNUM TEXT;
define : DEF COL type ID (COMMA type ID)*;
type : STRING | INT;
MSG : 'MSG';
DEF : 'DEF';
STRING : 'String';
INT : 'Int';
COL : ':';
SCOL : ';';
COMMA : ',';
MSGNUM : [0-9]+;
TEXT : '\'' ~[']* '\'';
MODULE : [A-Z] [A-Z] [A-Z] ;
ID : [A-Z] [A-Za-z0-9_]*;
SKIPS : (' '|'\t'|'\r'?'\n'|'\r')+ -> skip;
which parses both statements in your input.

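You should also add EOF if you want to parse the entire input, e.g.
prog : stat+ EOF;
Without EOF, the parser is allowed to stop at the first token that does not fit stat+ and can do so without reporting an error, which is why the second line produced no output.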
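To see why the missing EOF hides the problem, here is a minimal Python sketch of a `stat+` loop. It is not ANTLR itself: the token names and the `parse_prog` and `parse_message` helpers are made up for illustration. Without an end-of-input check, the loop simply stops at the first token that cannot start a statement, which mirrors the silent output in the question.

```python
# Toy model of `prog : stat+ ;` versus `prog : stat+ EOF ;`.
# Tokens are plain strings; this is an illustration, not the ANTLR runtime.

def parse_message(tokens, pos):
    """message : 'MSG' MSGNUM TEXT ;  Returns the new position or None."""
    if tokens[pos:pos + 3] == ["MSG", "MSGNUM", "TEXT"]:
        return pos + 3
    return None

def parse_prog(tokens, require_eof):
    """Match one or more messages; optionally insist on reaching EOF."""
    pos = 0
    count = 0
    while True:
        nxt = parse_message(tokens, pos)
        if nxt is None:
            break
        pos, count = nxt, count + 1
    if count == 0:
        raise SyntaxError("expected at least one statement")
    if require_eof and tokens[pos] != "EOF":
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at index {pos}")
    return count

# Token stream for the question's input: the ';' after the message is not
# part of the original `message` rule, so the loop stops right there.
tokens = ["MSG", "MSGNUM", "TEXT", ";", "DEF:", "EOF"]

print(parse_prog(tokens, require_eof=False))  # stops quietly after 1 statement
try:
    parse_prog(tokens, require_eof=True)
except SyntaxError as err:
    print("error:", err)
```

With `require_eof=True` the leftover ';' is reported instead of being ignored, which is the behaviour the EOF token buys you in the real grammar.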

Related

ANTLR : issue with greedy rule

I have never worked with ANTLR or generative grammars, so this is my first attempt.
I have a custom language I need to parse.
Here's an example:
-- This is a comment
CMD.CMD1:foo_bar_123
CMD.CMD2
CMD.CMD4:9 of 28 (full)
CMD.NOTES:
This is an note.
A line
(1) there could be anything here foo_bar_123 & $ £ _ , . ==> BOOM
(3) same here
CMD.END_NOTES:
Briefly, there could be 4 types of lines:
1) -- comment
2) <section>.<command>
3) <section>.<command>: <arg>
4) <section>.<command>:
<arg1>
<arg2>
...
<section>.<end_command>:
<section> is the literal "CMD"
<command> is a single word (uppercase, lowercase letters, numbers, '_')
<end_command> is the same word of <command> but preceded by the literal "end_"
<arg> could be any character
Here's what I've done so far:
grammar MyGrammar;
/*
* Parser Rules
*/
root : line+ EOF ;
line : (comment_line | command_line | normal_line) NEWLINE;
comment_line : COMMENT ;
command_line : section '.' command ((COLON WHITESPACE*)? arg)? ;
normal_line : TEXT ;
section : CMD ;
command : WORD ;
arg : TEXT ;
/*
* Lexer Rules
*/
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT : [0-9] ;
NUMBER : DIGIT+ ([.,] DIGIT+)? ;
CMD : 'CMD';
COLON : ':' ;
COMMENT : '--' ~[\r\n]*;
WHITESPACE : (' ' | '\t') ;
NEWLINE : ('\r'? '\n' | '\r')+;
WORD : (LOWERCASE | UPPERCASE | NUMBER | '_')+ ;
TEXT : ~[\r\n]* ;
This is a test for my grammar:
$antlr4 MyGrammar.g4
warning(146): MyGrammar.g4:45:0: non-fragment lexer rule TEXT can match the empty string
$javac MyGrammar*.java
$grun MyGrammar root -tokens
CMD.NEW
[#0,0:6='CMD.NEW',<TEXT>,1:0]
[#1,7:7='\n',<NEWLINE>,1:7]
[#2,8:7='<EOF>',<EOF>,2:0]
The problem is that "CMD.NEW" gets swallowed by TEXT, because that rule is greedy.
Can anyone help me with this?
Thanks
The problem is in the lexer rather than in the parser rules.
ANTLR tokenizes the input before any parser rule is considered, and at each position the lexer keeps the rule that matches the most characters. For CMD.NEW, CMD matches 3 characters while TEXT matches all 7, so the whole line becomes a single TEXT token.
Thus, given the rule:
line : (comment_line | command_line | normal_line) NEWLINE;
the parser only ever sees TEXT NEWLINE, which fits normal_line, so command_line never gets a chance to match.
Consider rewriting your grammar so that the lexer cannot turn an entire command line into one catch-all token.
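The token dump in the question can be reproduced with a small Python sketch of how an ANTLR lexer picks a rule: it tries every rule at the current position and keeps the longest match, using rule order only to break ties. The `next_token` helper and the regex translations of the rules are assumptions for illustration (WORD in particular is only approximated).

```python
import re

# Regex approximations of some lexer rules from the question, in grammar order.
RULES = [
    ("CMD", r"CMD"),
    ("COLON", r":"),
    ("COMMENT", r"--[^\r\n]*"),
    ("NEWLINE", r"(?:\r?\n|\r)+"),
    ("WORD", r"[A-Za-z0-9_]+"),
    ("TEXT", r"[^\r\n]*"),
]

def next_token(text, pos):
    """Return (rule_name, lexeme) using longest match; earlier rule wins ties."""
    best_name, best_len = None, 0
    for name, pattern in RULES:
        m = re.compile(pattern).match(text, pos)
        # Strict '>' keeps the earlier rule when two rules match the same length.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    if best_name is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best_name, text[pos:pos + best_len]

print(next_token("CMD.NEW\n", 0))   # ('TEXT', 'CMD.NEW')
print(next_token("CMD\n", 0))       # ('CMD', 'CMD')
```

The second call shows that CMD is only produced when TEXT cannot match anything longer, which is why a catch-all rule like TEXT needs to be restricted or moved into its own lexer mode.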
UPDATE:
Try this as a starting point (I have not tested it, and TEXT can still match a whole line, so it may need to be restricted further):
grammar MyGrammar;
/*
* Parser Rules
*/
root : line+ EOF ;
line : (comment_line | command_line) NEWLINE;
comment_line : COMMENT ;
command_line : CMD '.' (note_cmd | command);
command : command_name ((COLON WHITESPACE*)? arg)? ;
note_cmd : NOTES .*? (CMD '.' END_NOTES) ;
command_name : WORD ;
arg : TEXT ;
/*
* Lexer Rules
*/
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT : [0-9] ;
NUMBER : DIGIT+ ([.,] DIGIT+)? ;
CMD : 'CMD';
COLON : ':' ;
COMMENT : '--' ~[\r\n]*;
WHITESPACE : (' ' | '\t') ;
NEWLINE : ('\r'? '\n' | '\r')+;
// NOTES and END_NOTES must come before WORD, which matches the same text
NOTES : 'NOTES';
END_NOTES : 'END_NOTES';
WORD : (LOWERCASE | UPPERCASE | NUMBER | '_')+ ;
TEXT : ~[\r\n]* ;

Match any printable letter-like characters in ANTLR4 with Go as target

This is freaking me out, I just can't find a solution to it. I have a grammar for search queries and would like to match any searchterm in a query composed out of printable letters except for special characters "(", ")". Strings enclosed in quotes are handled separately and work.
Here is a somewhat working grammar:
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start
: searchclause EOF
;
searchclause
: table expr
;
expr
: fieldsearch
| searchop fieldsearch
| unop expr
| expr relop expr
| lparen expr relop expr rparen
;
lparen
: '('
;
rparen
: ')'
;
unop
: NOT
;
relop
: AND
| OR
;
searchop
: NO
| EVERY
;
fieldsearch
: field EQ searchterm
;
field
: ID
;
table
: ID
;
searchterm
:
| STRING
| ID+
| DIGIT+
| DIGIT+ ID+
;
STRING
: '"' ~('\n'|'"')* ('"' )
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '='
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: ('0' .. '9')
;
/*
NOT_SPECIAL
: ~(' ' | '\t' | '\n' | '\r' | '\'' | '"' | ';' | '.' | '=' | '(' | ')' )
; */
WS
: [ \r\n\t] + -> skip
;
The problem is that searchterm is too restricted. It should match any character that is in the commented out NOT_SPECIAL, i.e., valid queries would be:
Person Name=%
Person Address=^%Street%%%$^&*#^
But whenever I try to put NOT_SPECIAL into the definition of searchterm in any way, it doesn't work. I have also tried putting it literally into the rule (commenting out NOT_SPECIAL) and many other things, but it just doesn't work. In most of my attempts ANTLR complained about extraneous input after "=" and said it was expecting EOF. But I also cannot put EOF into NOT_SPECIAL.
Is there any way I can simply parse every text after "=" in rule fieldsearch until there is a whitespace or ")", "("?
N.B. The STRING rule works fine, but the user ought not be required to use quotes every time, because this is a command line tool and they'd need to be escaped.
Target language is Go.
You could solve that by introducing a lexical mode that you'll enter whenever you match an EQ token. Once in that lexical mode, you either match a (, ) or a whitespace (in which case you pop out of the lexical mode), or you keep matching your NOT_SPECIAL chars.
When you use lexical modes, you must define your lexer and parser rules in separate files. Be sure to use lexer grammar ... and parser grammar ... instead of the grammar ... you use in a combined .g4 file.
A quick demo:
lexer grammar MdbLexer;
STRING
: '"' ~[\r\n"]* '"'
;
OPAR
: '('
;
CPAR
: ')'
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '=' -> pushMode(NOT_SPECIAL_MODE)
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: [0-9]
;
WS
: [ \r\n\t]+ -> skip
;
fragment VALID_ID_START
: [a-zA-Z_]
;
fragment VALID_ID_CHAR
: [a-zA-Z_0-9]
;
mode NOT_SPECIAL_MODE;
OPAR2
: '(' -> type(OPAR), popMode
;
CPAR2
: ')' -> type(CPAR), popMode
;
WS2
: [ \t\r\n] -> skip, popMode
;
NOT_SPECIAL
: ~[ \t\r\n()]+
;
Your parser grammar would start like this:
parser grammar MdbParser;
options {
tokenVocab=MdbLexer;
}
start
: searchclause EOF
;
// your other parser rules
My Go is a bit rusty, but a small Java test:
String source = "Person Address=^%Street%%%$^&*#^()";
MdbLexer lexer = new MdbLexer(CharStreams.fromString(source));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-15s %s\n", MdbLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
prints the following:
ID Person
ID Address
EQ =
NOT_SPECIAL ^%Street%%%$^&*#^
OPAR (
CPAR )
EOF <EOF>
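For readers who want to see the mode mechanics without generating a lexer, the following Python sketch imitates the push/pop behaviour of the demo grammar with an explicit mode stack. It is a simplified stand-in, not the ANTLR Go or Java runtime: the `tokenize` function and the `MODES` table are made up here, and only a few of the default-mode rules are included.

```python
import re

# Each mode maps to ordered (token_type, regex, action) rules.
# Actions: "push" enters NOT_SPECIAL_MODE, "pop" leaves it, None does nothing.
MODES = {
    "DEFAULT": [
        ("OPAR", r"\(", None),
        ("CPAR", r"\)", None),
        ("EQ", r"=", "push"),
        ("ID", r"[A-Za-z_][A-Za-z_0-9]*", None),
        ("WS", r"[ \t\r\n]+", None),
    ],
    "NOT_SPECIAL_MODE": [
        ("OPAR", r"\(", "pop"),
        ("CPAR", r"\)", "pop"),
        ("WS", r"[ \t\r\n]", "pop"),
        ("NOT_SPECIAL", r"[^ \t\r\n()]+", None),
    ],
}

def tokenize(text):
    """Return (type, lexeme) pairs, skipping WS, using longest match per mode."""
    stack = ["DEFAULT"]
    pos = 0
    tokens = []
    while pos < len(text):
        best = None
        for ttype, pattern, action in MODES[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (ttype, m.end(), action)
        if best is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        ttype, end, action = best
        if ttype != "WS":
            tokens.append((ttype, text[pos:end]))
        if action == "push":
            stack.append("NOT_SPECIAL_MODE")
        elif action == "pop":
            stack.pop()
        pos = end
    return tokens

for tok in tokenize("Person Address=^%Street%%%$^&*#^()"):
    print(tok)
```

The key design point is that characters such as % or ^ are only legal while the stack top is NOT_SPECIAL_MODE, so the default-mode rules stay as strict as before.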

antlr4 line 2:0 mismatched input 'if' expecting {'if', OTHER}

I am having a bit of difficulty in my g4 file. Below is my grammar:
// Define a grammar called Hello
grammar GYOO;
program : 'begin' block+ 'end';
block
: statement+
;
statement
: assign
| print
| add
| ifstatement
| OTHER {System.err.println("unknown char: " + $OTHER.text);}
;
assign
: 'let' ID 'be' expression
;
print
: 'print' (NUMBER | ID)
;
ifstatement
: 'if' condition_block (ELSE IF condition_block)* (ELSE stat_block)?
;
add
: (NUMBER | ID) OPERATOR (NUMBER | ID) ASSIGN ID
;
stat_block
: OBRACE block CBRACE
| statement
;
condition_block
: expression stat_block
;
expression
: NOT expression //notExpr
| expression (MULT | DIV | MOD) expression //multiplicationExpr
| expression (PLUS | MINUS) expression //additiveExpr
| expression (LTEQ | GTEQ | LT | GT) expression //relationalExpr
| expression (EQ | NEQ) expression //equalityExpr
| expression AND expression //andExpr
| expression OR expression //orExpr
| atom //atomExpr
;
atom
: (NUMBER | FLOAT) //numberAtom
| (TRUE | FALSE) //booleanAtom
| ID //idAtom
| STRING //stringAtom
| NULL //nullAtom
;
ID : [a-z]+ ;
NUMBER : [0-9]+ ;
OPERATOR : '+' | '-' | '*' | '/';
ASSIGN : '=';
WS : (' ' | '\t' | '\r' | '\n') + -> skip;
OPAR : '(';
CPAR : ')';
OBRACE : '{';
CBRACE : '}';
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
IF : 'if';
ELSE : 'else';
OR : 'or';
AND : 'and';
EQ : 'is'; //'=='
NEQ : 'is not'; //'!='
GT : 'greater'; //'>'
LT : 'lower'; //'<'
GTEQ : 'is greater'; //'>='
LTEQ : 'is lower'; //'<='
PLUS : '+';
MINUS : '-';
MULT : '*';
DIV : '/';
MOD : '%';
POW : '^';
NOT : 'not';
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
STRING
: '"' (~["\r\n] | '""')* '"'
;
COMMENT
: '/*' .*? '*/' -> channel(HIDDEN)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(HIDDEN)
;
OTHER
: .
;
When I try to view the tree with -gui, ANTLR shows me this error:
line 2:3 missing OPERATOR at 'a'
The error comes from this input:
begin
let a be true
if a is true
print a
end
Basically, it does not recognize the ifstatement beginning with IF 'if', and the tree looks as if I were making an assignment.
How can I fix this?
P.S. I also tried reordering my statement alternatives, and I tried removing every alternative except ifstatement; the same thing happens.
Thanks
There is at least one issue:
ID : [a-z]+ ;
...
TRUE : 'true';
FALSE : 'false';
NULL : 'null';
IF : 'if';
ELSE : 'else';
OR : 'or';
...
NOT : 'not';
Since ID is placed before TRUE .. NOT and matches the same text, those tokens will never be created: when two lexer rules match the same number of characters, the rule defined first wins, so every keyword is tokenized as ID.
Start by moving ID below the NOT token.
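The effect of rule order can be checked with a small Python sketch. It models only the selection step of the lexer (longest match, with ties going to the rule listed first); the `classify` helper and the shortened rule lists are assumptions for illustration, not ANTLR API.

```python
import re

def classify(word, rules):
    """Return the rule with the longest match; on ties the earlier rule wins."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, word)
        if m and m.end() > best_len:   # strict '>' keeps the earlier rule on ties
            best_name, best_len = name, m.end()
    return best_name

# Order as in the question: ID comes first, so it wins every tie.
id_first = [("ID", r"[a-z]+"), ("TRUE", r"true"), ("IF", r"if")]
# Order after the fix: keywords first, ID last.
id_last = [("TRUE", r"true"), ("IF", r"if"), ("ID", r"[a-z]+")]

for word in ["true", "if", "iffy"]:
    print(word, classify(word, id_first), classify(word, id_last))
```

Note that "iffy" is still an ID after the reordering, because ID matches more characters than IF does; only exact keyword matches change.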

ANTLR4 parser detection

This is my first attempt at an ANTLR4 grammar. It should recognize a very simple statement: the command 'label', followed by a colon, then arbitrary text, terminated by a semicolon. But the parser does not recognize 'label' as a description. Why?
grammar test;
prog: stat+;
stat:
description content
;
description:
'label' COLON
;
content:
TEXT
;
TEXT:
.*? ';'
;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
COMMENT
: '//' (~('\n'|'\r'))*
;
COLON : ':' ;
ID: [a-zA-Z]+;
INT: [0-9]+;
NEWLINE: '\r'? '\n';
WS : [ \t\n\r]+ -> skip ;
An example for the code:
label:
this is an error;
wronglabel:YYY
this should be a error;
The error is:
line 1:0 mismatched input 'label: \nthis is an error;' expecting 'label'
(prog label: \nthis is an error; \n\n\nwronglabel:YYY\nthis should be a error; \n)
This works much better:
grammar test;
prog: stat+;
stat:
description content
;
description:
'label' COLON
;
content:
text
;
text:
.*? ';'
;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
COMMENT
: '//' (~('\n'|'\r'))*
;
COLON : ':' ;
ID: [a-zA-Z]+;
NEWLINE: '\r'? '\n';
WS : [ \t\n\r]+ -> skip ;
It seems I mixed up lexer and parser rules:
lexer rule names have to start with an uppercase letter,
parser rule names with a lowercase letter.
So I changed the TEXT rule into a text rule.
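Why the uppercase version misbehaved can be shown with a short Python sketch. A rule whose name starts with an uppercase letter is a lexer rule, and the lexer keeps whichever rule matches the most characters. The regexes below are rough stand-ins for the 'label' literal and the original TEXT rule, and the `longest` helper is made up for illustration.

```python
import re

source = "label: \nthis is an error;\n"

# Rough regex equivalents of two lexer rules competing at position 0.
candidates = [
    ("'label'", r"label"),
    ("TEXT", r"(?s).*?;"),   # .*? ';' : anything up to the first semicolon
]

def longest(text, rules):
    """Return (rule_name, lexeme) for the rule matching the most characters."""
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and (best is None or m.end() > len(best[1])):
            best = (name, m.group())
    return best

print(longest(source, candidates))
```

The winning lexeme is the same text that appears in the "mismatched input" error above: TEXT consumed 'label' together with everything up to the semicolon, so the parser never saw a separate 'label' token.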

ANTLR grammar: Add "dynamic" proximity operator

For a study project, I am using the following ANTLR grammar to parse query strings containing some simple boolean operators like AND, NOT and others:
grammar SimpleBoolean;
options { language = CSharp2; output = AST; }
tokens { AndNode; }
@lexer::namespace { INR.Infrastructure.QueryParser }
@parser::namespace { INR.Infrastructure.QueryParser }
LPARENTHESIS : '(';
RPARENTHESIS : ')';
AND : 'AND';
OR : 'OR';
ANDNOT : 'ANDNOT';
NOT : 'NOT';
PROX : ?
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9'|'ä'|'Ä'|'ü'|'Ü'|'ö'|'Ö');
fragment QUOTE : ('"');
fragment SPACE : (' '|'\n'|'\r'|'\t'|'\u000C');
WS : (SPACE) { $channel=Hidden; };
WORD : (~( ' ' | '\t' | '\r' | '\n' | '/' | '(' | ')' ))*;
PHRASE : (QUOTE)(CHARACTER)+((SPACE)+(CHARACTER)+)+(QUOTE);
startExpression : andExpression;
andExpression : (andnotExpression -> andnotExpression) (AND? a=andnotExpression -> ^(AndNode $andExpression $a))*;
andnotExpression : orExpression (ANDNOT^ orExpression)*;
proxExpression : ?
orExpression : notExpression (OR^ notExpression)*;
notExpression : (NOT^)? atomicExpression;
atomicExpression : PHRASE | WORD | LPARENTHESIS! andExpression RPARENTHESIS!;
Now I would like to add an operator for so-called proximity queries. For example, the query "A /5 B" should return everything that contains A with B following within the next 5 words. The number 5 could be any other positive int of course. In other words, a proximity query should result in the following syntax tree:
http://graph.gafol.net/pic/ersaDEbBJ.png
Unfortunately, I don't know how to (syntactically) add such a "PROX" operator to my existing ANTLR grammar.
Any help is appreciated. Thanks!
You could do that like this:
PROX : '/' '0'..'9'+;
...
startExpression : andExpression;
andExpression : (andnotExpression -> andnotExpression) (AND? a=andnotExpression -> ^(AndNode $andExpression $a))*;
andnotExpression : proxExpression (ANDNOT^ proxExpression)*;
proxExpression : orExpression (PROX^ orExpression)*;
orExpression : notExpression (OR^ notExpression)*;
notExpression : (NOT^)? atomicExpression;
atomicExpression : PHRASE | WORD | LPARENTHESIS! andExpression RPARENTHESIS!;
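The layering of rules is what gives PROX its precedence: each rule only calls the next, tighter-binding one. Here is a compact Python sketch of the same idea, restricted to the PROX and OR levels with plain words as atoms. The `parse` function, the tuple-based tree, and the tokenizer regex are all assumptions for illustration, not generated ANTLR code.

```python
import re

def tokenize(query):
    """Split into PROX tokens like '/5', the keyword OR, and plain words."""
    return re.findall(r"/[0-9]+|[^\s/()]+", query)

def parse(query):
    tokens = tokenize(query)
    pos = 0

    def atom():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def or_expr():
        # orExpression : atom (OR^ atom)*
        nonlocal pos
        node = atom()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            node = ("OR", node, atom())
        return node

    def prox_expr():
        # proxExpression : orExpression (PROX^ orExpression)*
        nonlocal pos
        node = or_expr()
        while pos < len(tokens) and re.fullmatch(r"/[0-9]+", tokens[pos]):
            op = tokens[pos]
            pos += 1
            node = (op, node, or_expr())   # the operator becomes the subtree root
        return node

    tree = prox_expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree

print(parse("A /5 B"))          # ('/5', 'A', 'B')
print(parse("A /500 B OR D"))   # ('/500', 'A', ('OR', 'B', 'D'))
```

Because `prox_expr` calls `or_expr` for its operands, OR binds tighter than PROX in this sketch, matching the rule nesting in the grammar above.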
If you parse the input:
A /500 B OR D NOT E AND F
the following AST is created:
