ANTLR4 parser detection - parsing

This is my first try with an ANTLR4-grammar. It should recognize a very easy statement, starting with the command 'label', followed by a colon, then an arbitrary text, ended by semicolon. But the parser does not recognize 'label' as description. Why?
grammar test;
prog: stat+;
stat:
description content
;
description:
'label' COLON
;
content:
TEXT
;
TEXT:
.*? ';'
;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
COMMENT
: '//' (~('\n'|'\r'))*
;
COLON : ':' ;
ID: [a-zA-z]+;
INT: [0-9]+;
NEWLINE: '\r'? '\n';
WS : [ \t\n\r]+ -> skip ;
An example for the code:
label:
this is an error;
wronglabel:YYY
this should be a error;
The error is:
line 1:0 mismatched input 'label: \nthis is an error;' expecting 'label'
(prog label: \nthis is an error; \n\n\nwronglabel:YYY\nthis should be a error; \n)

This works much better:
grammar test;
prog: stat+;
stat:
description content
;
description:
'label' COLON
;
content:
text
;
text:
.*? ';'
;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
COMMENT
: '//' (~('\n'|'\r'))*
;
COLON : ':' ;
ID: [a-zA-z]+;
NEWLINE: '\r'? '\n';
WS : [ \t\n\r]+ -> skip ;
Seems I mixed lexer and parser rules:
lexer rules have to be lower case,
parser rules uppercase.
So I changed the TEXT-rule into a text-rule.

Related

ANTLR4 parsing a keyword-contained variable name

I'm trying to parse a simple integer declaration in antlr4.
The grammar I'm doing now is:
main : 'int' var '=' NUMBER+ ;
var : LETTER (LETTER | NUMBER)* ;
LETTER: [a-zA-Z_] ;
NUMBER: [0-9] ;
WS : [ \t\r\n]+ -> skip ;
When I tried to test the main rule with int int_A = 0, I got an error:
extraneous input 'int' expecting LETTER.
I know it's because the variable name 'int_A' contains the keyword 'int', but how do I modify my grammar? Thanks.
The lexer creates tokens with as much characters as possible. So int_A is being tokenised as the following 3 tokens:
'int' (int keyword defined in parser)
LETTER (_)
LETTER (A)
So the parser cannot create a var with these tokens.
Instead of a parser rule var, make it a lexer rule:
main : 'int' VAR '=' NUMBER+ ;
VAR : [a-zA-Z_] ([a-zA-Z_] | [0-9])* ;
NUMBER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;

ANTLR : issue with greedy rule

I never worked with ANTLR and generative grammars, so this is my first attempt.
I have a custom language I need to parse.
Here's an example:
-- This is a comment
CMD.CMD1:foo_bar_123
CMD.CMD2
CMD.CMD4:9 of 28 (full)
CMD.NOTES:
This is an note.
A line
(1) there could be anything here foo_bar_123 & $ £ _ , . ==> BOOM
(3) same here
CMD.END_NOTES:
Briefly, there could be 4 types of lines:
1) -- comment
2) <section>.<command>
3) <section>.<command>: <arg>
4) <section>.<command>:
<arg1>
<arg2>
...
<section>.<end_command>:
<section> is the literal "CMD"
<command> is a single word (uppercase, lowercase letters, numbers, '_')
<end_command> is the same word of <command> but preceded by the literal "end_"
<arg> could be any character
Here's what I've done so far:
grammar MyGrammar;
/*
* Parser Rules
*/
root : line+ EOF ;
line : (comment_line | command_line | normal_line) NEWLINE;
comment_line : COMMENT ;
command_line : section '.' command ((COLON WHITESPACE*)? arg)? ;
normal_line : TEXT ;
section : CMD ;
command : WORD ;
arg : TEXT ;
/*
* Lexer Rules
*/
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT : [0-9] ;
NUMBER : DIGIT+ ([.,] DIGIT+)? ;
CMD : 'CMD';
COLON : ':' ;
COMMENT : '--' ~[\r\n]*;
WHITESPACE : (' ' | '\t') ;
NEWLINE : ('\r'? '\n' | '\r')+;
WORD : (LOWERCASE | UPPERCASE | NUMBER | '_')+ ;
TEXT : ~[\r\n]* ;
This is a test for my grammar:
$antlr4 MyGrammar.g4
warning(146): MyGrammar.g4:45:0: non-fragment lexer rule TEXT can match the empty string
$javac MyGrammar*.java
$grun MyGrammar root -tokens
CMD.NEW
[#0,0:6='CMD.NEW',<TEXT>,1:0]
[#1,7:7='\n',<NEWLINE>,1:7]
[#2,8:7='<EOF>',<EOF>,2:0]
The problem is that "CMD.NEW" gets swallowed by TEXT, because that rule is greedy.
Anyone can help me with this?
Thanks
There is a grammar ambiguity.
In the example you have provided CMD.NEW can match both command_line and normal_line.
Thus, given the expression:
line : (comment_line | command_line | normal_line) NEWLINE;
the parser can not definitely say what rule to accept (command_line or normal_line), so it matches it to normal_line which is actually a simple TEXT.
Consider rewriting your grammar in the way the parser can always say what rule to accept.
UPDATE:
Try this (I did not test that, but it should work):
grammar MyGrammar;
/*
* Parser Rules
*/
root : line+ EOF ;
line : (comment_line | command_line) NEWLINE;
comment_line : COMMENT ;
command_line : CMD '.' (note_cmd | command);
command : command_name ((COLON WHITESPACE*)? arg)? ;
note_cmd : notes .*? (CMD '.' END_NOTES) ;
command_name : WORD ;
arg : TEXT ;
/*
* Lexer Rules
*/
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT : [0-9] ;
NUMBER : DIGIT+ ([.,] DIGIT+)? ;
CMD : 'CMD';
COLON : ':' ;
COMMENT : '--' ~[\r\n]*;
WHITESPACE : (' ' | '\t') ;
NEWLINE : ('\r'? '\n' | '\r')+;
WORD : (LOWERCASE | UPPERCASE | NUMBER | '_')+ ;
NOTES : 'NOTES';
END_NOTES : 'END_NOTES';
TEXT : ~[\r\n]* ;

Match any printable letter-like characters in ANTLR4 with Go as target

This is freaking me out, I just can't find a solution to it. I have a grammar for search queries and would like to match any searchterm in a query composed out of printable letters except for special characters "(", ")". Strings enclosed in quotes are handled separately and work.
Here is a somewhat working grammar:
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start
: searchclause EOF
;
searchclause
: table expr
;
expr
: fieldsearch
| searchop fieldsearch
| unop expr
| expr relop expr
| lparen expr relop expr rparen
;
lparen
: '('
;
rparen
: ')'
;
unop
: NOT
;
relop
: AND
| OR
;
searchop
: NO
| EVERY
;
fieldsearch
: field EQ searchterm
;
field
: ID
;
table
: ID
;
searchterm
:
| STRING
| ID+
| DIGIT+
| DIGIT+ ID+
;
STRING
: '"' ~('\n'|'"')* ('"' )
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '='
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: ('0' .. '9')
;
/*
NOT_SPECIAL
: ~(' ' | '\t' | '\n' | '\r' | '\'' | '"' | ';' | '.' | '=' | '(' | ')' )
; */
WS
: [ \r\n\t] + -> skip
;
The problem is that searchterm is too restricted. It should match any character that is in the commented out NOT_SPECIAL, i.e., valid queries would be:
Person Name=%
Person Address=^%Street%%%$^&*#^
But whenever I try to put NOT_SPECIAL in any way into the definition of searchterm it doesn't work. I have tried putting it literally into the rule, too (commenting out NOT_SPECIAL) and many others things, but it just doesn't work. In most of my attempts the grammar just complained about extraneous input after "=" and said it was expecting EOF. But I also cannot put EOF into NOT_SPECIAL.
Is there any way I can simply parse every text after "=" in rule fieldsearch until there is a whitespace or ")", "("?
N.B. The STRING rule works fine, but the user ought not be required to use quotes every time, because this is a command line tool and they'd need to be escaped.
Target language is Go.
You could solve that by introducing a lexical mode that you'll enter whenever you match an EQ token. Once in that lexical mode, you either match a (, ) or a whitespace (in which case you pop out of the lexical mode), or you keep matching your NOT_SPECIAL chars.
By using lexical modes, you must define your lexer- and parser rules in their own files. Be sure to use lexer grammar ... and parser grammar ... instead of the grammar ... you use in a combined .g4 file.
A quick demo:
lexer grammar MdbLexer;
STRING
: '"' ~[\r\n"]* '"'
;
OPAR
: '('
;
CPAR
: ')'
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '=' -> pushMode(NOT_SPECIAL_MODE)
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: [0-9]
;
WS
: [ \r\n\t]+ -> skip
;
fragment VALID_ID_START
: [a-zA-Z_]
;
fragment VALID_ID_CHAR
: [a-zA-Z_0-9]
;
mode NOT_SPECIAL_MODE;
OPAR2
: '(' -> type(OPAR), popMode
;
CPAR2
: ')' -> type(CPAR), popMode
;
WS2
: [ \t\r\n] -> skip, popMode
;
NOT_SPECIAL
: ~[ \t\r\n()]+
;
Your parser grammar would start like this:
parser grammar MdbParser;
options {
tokenVocab=MdbLexer;
}
start
: searchclause EOF
;
// your other parser rules
My Go is a bit rusty, but a small Java test:
String source = "Person Address=^%Street%%%$^&*#^()";
MdbLexer lexer = new MdbLexer(CharStreams.fromString(source));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-15s %s\n", MdbLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
print the following:
ID Person
ID Address
EQ =
NOT_SPECIAL ^%Street%%%$^&*#^
OPAR (
CPAR )
EOF <EOF>

antlr4 matching of choice expression

Im writing chrome DEPS file parser. How to match one of either following grammar rule defintion of rightexpr. My grammar is like
the following one:
grammar Depsgrammar;
prog: expr+ EOF;
expr: varline
;
varline:
ID EQ rightexpr
;
rightexpr :
basicvalue | bentukonejsonval| bentuktwojsonval
;
bentukonejsonval :
'[' string? (COMMA string )* COMMA? ']'
;
bentuktwojsonval :
'{' singledictexpr? (COMMA singledictexpr )* COMMA? '}'
;
singledictexpr :
string ':' basicvalue
;
basicvalue :
True
| False
| string
| NUM
| varfunc
;
varfunc :
Var '(' string ')'
;
string :
SIMPLESTRINGEXPRDOUBLEQUOTE
| SIMPLESTRINGEXPRSINGLEQUOTE
;
Var : 'Var' ;
COMMA : ',' ;
NUM : [0-9]+;
ID : [a-zA-Z0-9_]+;
True : [tT] [Rr] [Uu] [Ee];
False: [Ff] [Aa] [Ll] [Ss] [Ee];
fragment SIMPLESTRINGEXPRDOUBLEQUOTEBASE : ~ ( '\n' | '\r' | '"' )* ;
SIMPLESTRINGEXPRDOUBLEQUOTE: '"' SIMPLESTRINGEXPRDOUBLEQUOTEBASE '"' ;
fragment SIMPLESTRINGEXPRSINGLEQUOTEBASE : ~ ( '\n' | '\r' | '\'' )* ;
SIMPLESTRINGEXPRSINGLEQUOTE : '\'' SIMPLESTRINGEXPRSINGLEQUOTEBASE '\'' ;
EQ : '=';
COMMENT:
'#' ~ ( '\n' | '\r' )* '\n' -> skip ;
WS : [ \n\t\r]+ -> skip ;
I want user could enter this input
#adas21 #FS;SFD33
_as= Var('das') # somelongth comment
_as_0= FALSE # somelongth comment
_as_0= 'as!' # somelongth comment
gclient_gn_args = [
#ad as!~;
'checkout_libaom',
'checkout_nacl',
'"{cros_board}" == "amd64-generic"',
'checkout_oculus_sdk',
]
vars = {
'checkout_libaom':1,
'checkout_nacl': "SS",
'checkout_oculus_sdk': FalSe,
'checkout_oculus_sdk':'',
}
s=[
]
whenever I enter simple syntax in grun
sa=true
always give me line 1:3 mismatched input 'true' expecting blah..(rightexpr def). I'm missing in understanding of basic antlr4 choice matching decision. Could you please teach me?
Thanks
Whenever you have an error where the list of expected tokens seemingly includes the unexpected token, it is a good idea to list the generated tokens. You can do that by passing the -tokens option to grun. If you do this for your input, you'll see that true is interpreted as an ID token, not a True token.
The reason for that is that when multiple lexer rules would match on the current input and produce a match of the same size, the one that's defined earlier in the grammar is chosen. So because ID is defined before True, it takes precedence. Generally all keywords should be defined before the ID rule to prevent exactly this issue.
In other words, moving the True and False rules before ID will solve your issue.

Antlr4 parsing issue

When I try to work my message.expr file with Zmes.g4 grammar file via antlr-4.7.1-complete only first line works and there is no reaction for second one. Grammar is
grammar Zmes;
prog : stat+;
stat : (message|define);
message : 'MSG' MSGNUM TEXT;
define : 'DEF:' ('String '|'int ') ID ( ',' ('String '|'Int ') ID)* ';';
fragment QUOTE : '\'';
MSGNUM : [0-9]+;
TEXT : QUOTE ~[']* QUOTE;
MODULE : [A-Z][A-Z][A-Z] ;
ID : [A-Z]([A-Za-z0-9_])*;
SKIPS : (' '|'\t'|'\r'?'\n'|'\r')+ -> skip;
and message.expr is
MSG 100 'MESSAGE YU';
DEF: String Svar1,Int Intv1;`
On cmd when I run like this
grun Zmes prog -tree message.expr
(prog (stat (message MSG 100 'MESSAGE YU')))
and there is no second reaction. Why can it be.
Your message should include ';' at the end:
message : 'MSG' MSGNUM TEXT ';';
Also, in your define rule you have 'int ', which should probably be 'Int' (no space and a capital i).
I'd go for something like this:
grammar Zmes;
prog : stat+ EOF;
stat : (message | define) SCOL;
message : MSG MSGNUM TEXT;
define : DEF COL type ID (COMMA type ID)*;
type : STRING | INT;
MSG : 'MSG';
DEF : 'DEF';
STRING : 'String';
INT : 'Int';
COL : ':';
SCOL : ';';
COMMA : ',';
MSGNUM : [0-9]+;
TEXT : '\'' ~[']* '\'';
MODULE : [A-Z] [A-Z] [A-Z] ;
ID : [A-Z] [A-Za-z0-9_]*;
SKIPS : (' '|'\t'|'\r'?'\n'|'\r')+ -> skip;
which produces:
You should also add EOF if you want to parse the entire input, e.g.
prog : stat+ EOF;
See here why.

Resources