I'm a newbie to antlr4 and trying to make a parser. However, I am stuck on the most basic step of parsing a single word.
My grammar looks like:
grammar test;
program : WORD EOF;
WORD : 'test';
And my input file looks like:
test
The input file is only one line, has no trailing spaces. If I open it in a hex-editor it shows as 5 bytes: test<EOF>.
From what I understand of antlr, this rule should match a WORD token then an EOF token then stop parsing. However, I get line 1:4 missing 'test' at '<EOF>' when parsing the file.
I've worked with lex/yacc before and not encountered an error like this. I understand that antlr works differently, so I am curious why I am encountering this error.
I'm trying to understand how ANTLR grammars work and I've come across a situation where it behaves unexpectedly and I can't explain why or figure out how to fix it.
Here's the example:
root : title '\n' fields EOF;
title : STR;
fields : field_1 field_2;
field_1 : 'a' | 'b' | 'c';
field_2 : 'd' | 'e' | 'f';
STR : [a-z]+;
There are two parts:
A title that is a lowercase string with no special characters
A two character string representing a set of possible configurations
When I go to test the grammar, here's what happens: first I write the title and, on a new line, give the character for the first field. So far so good. The parse tree looks as I would expect up to this point.
The problem comes up when I add the next field: ANTLR reinterprets the whole line as an instance of STR instead of the concatenation of fields that I was expecting.
I do not understand why ANTLR tries to force an unrelated terminal expression when it wasn't specified as an option by the grammar. Shouldn't it know to only look for characters matching the field rules since it is descended from the fields node in the parse tree? What's going on here and how do I write my ANTLR grammars so they don't have this problem?
I've read that ANTLR tries to match greedily from the top of the grammar to the bottom, but that doesn't explain what is happening here, because the STR terminal is the very last line in the file. If ANTLR gives special precedence to matching terminals, how do I format the grammar so that it is interpreted properly? As far as I understand, regexes do not work for non-terminals, so it seems that I have to define it the way it is now.
A note of clarification: this is just an example of a possible grammar that I'm trying to make work with the text format as is, so I'm not looking for answers like adding a space between the fields or changing the title to be uppercase.
What I didn't understand before is that there are two steps in generating a parser:
Scanning the input for a list of tokens using the lexer rules (uppercase statements) and then...
Constructing a parse tree using the parser rules (lowercase statements) and generated tokens
My problem was that ANTLR had no way of knowing, while it was generating the tokens, that I wanted that specific string to be interpreted differently. To fix this, I wrote a new lexer rule for the fields string so that it would be identifiable as a token of its own. The key was making the FIELDS rule appear before the STR rule: when two lexer rules match the same text, ANTLR picks the one that appears first.
root : title FIELDS EOF;
title : STR;
FIELDS : [a-c] [d-f];
STR : [a-z]+;
Note: I had to bite the bullet and read the ANTLR Mega Tutorial to figure this out.
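To make the two steps a bit more concrete, here is a rough Python sketch of how the lexer might choose between FIELDS and STR. This is not ANTLR code: RULES, next_token and tokenize are made-up names, regexes stand in for the lexer rules, and the NL entry is an assumption standing in for the '\n' token from the original grammar. It assumes the longest match wins and rule order only breaks ties.

```python
import re

# Lexer rules in grammar order; the regexes stand in for ANTLR rules (assumption).
RULES = [
    ("FIELDS", re.compile(r"[a-c][d-f]")),
    ("STR", re.compile(r"[a-z]+")),
    ("NL", re.compile(r"\n")),  # hypothetical stand-in for the '\n' token
]

def next_token(text, pos, rules=RULES):
    """Return (name, lexeme) for the longest match at pos; earlier rules win ties."""
    best = None
    for name, pattern in rules:
        m = pattern.match(text, pos)
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

def tokenize(text, rules=RULES):
    """Split the whole input into tokens before any parser rule is considered."""
    pos, out = 0, []
    while pos < len(text):
        tok = next_token(text, pos, rules)
        if tok is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        out.append(tok)
        pos += len(tok[1])
    return out

print(tokenize("title\nad"))
# → [('STR', 'title'), ('NL', '\n'), ('FIELDS', 'ad')]
```

With STR listed before FIELDS, the same input would come out with 'ad' as STR, which is the behaviour from the question. Under the same assumptions, a title that happens to be exactly two matching characters (such as "ad") would also be tokenized as FIELDS, since the lexer does not know which parser rule is active.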
I am currently trying to improve and fix bugs in an existing grammar that someone else created.
We have our own language, for which we have created an editor. We are using the Eclipse IDE.
Some examples from the grammar:
calc : choice INTEGER INTEGER;
choice : add|sub|div|mul;
INTEGER : ('0'..'9')+;
So in my editor, if I type
calc add 2 aaa
ANTLR's error handling recognizes this as an error, since it is expecting an integer and we typed a string, and reports a message such as
extraneous input 'aaa' expecting {'{', INTEGER}
(I have my class extends BaseErrorListener, where I create markers for these errors )
Similarly, I have such grammar defined for my editor.
Now the question: in all these cases it identifies that something is wrong in the syntax and reports errors, but what about input that is not part of the grammar at all?
If I type any garbage value such as
abc add 2 3
or
just_type_junk_in_editor
it does not report any error, since 'abc' or 'just_type_junk_in_editor' is not in my grammar.
So is there a way to make ANTLR report keywords that are not part of the grammar as errors?
Without having seen the full grammar, I think your problem is the missing EOF token in your main rule. ANTLR4 consumes as much input as it can match, and once the main rule has been matched it ignores whatever is left, which explains why you don't see an error. By adding EOF you tell ANTLR4 that all input must be matched:
calc: choice INTEGER INTEGER EOF;
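As a toy illustration of that point (plain Python, not ANTLR's generated parser; parse_calc and the token names CHOICE, INTEGER and JUNK are hypothetical), a rule that does not end in EOF never looks at the tokens left over after it has matched:

```python
def parse_calc(tokens, require_eof):
    """Match CHOICE INTEGER INTEGER from the start of tokens; return a list of errors."""
    expected = ["CHOICE", "INTEGER", "INTEGER"]
    if require_eof:
        expected.append("EOF")  # corresponds to ending the rule with EOF
    errors = []
    for i, want in enumerate(expected):
        got = tokens[i] if i < len(tokens) else "EOF"
        if got != want:
            errors.append(f"expected {want}, got {got}")
            break
    # Tokens beyond len(expected) are never inspected.
    return errors

tokens = ["CHOICE", "INTEGER", "INTEGER", "JUNK", "EOF"]
print(parse_calc(tokens, require_eof=False))  # → []
print(parse_calc(tokens, require_eof=True))   # → ['expected EOF, got JUNK']
```

Without the EOF requirement the trailing JUNK token goes unreported; with it, the leftover input becomes an error.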
I'm trying to create a grammar to parse Solr queries (only mildly relevant and you don't need to know anything about solr to answer the question -- just know more than I do about antlr 4.7). I'm basing it on the QueryParser.jj file from solr 6. I looked for an existing one, but there doesn't seem to be one that isn't old and out-of-date.
I'm stuck because when I try to run the parser I get "token recognition error"s.
The lexer I created uses lexer modes, which, as I understand it, means I need to have a separate lexer grammar file. So, I have a parser file and a lexer file.
I whittled it down to a simple example to show what I'm seeing. Maybe someone can tell me what I'm doing wrong. Here's the parser (Junk.g4):
grammar Junk;
options {
language = Java;
tokenVocab=JLexer;
}
term : TERM '\r\n';
I can't use an import because of the lexer modes in the lexer file I'm trying to create (the tokens in the modes become "undefined" if I use an import). That's why I reference the lexer file with the tokenVocab parameter (as shown in the XML example in github).
Here's the lexer (JLexer.g4):
lexer grammar JLexer;
TERM : TERM_START_CHAR TERM_CHAR* ;
TERM_START_CHAR : [abc] ;
TERM_CHAR : [efg] ;
WS : [ \t\n\r\u3000]+ -> skip;
If I copy the lexer code into the parser, then things work as expected (e.g., "aeee" is a term). Also, if I run the lexer file with grun (specifying tokens as the target), then the string parses as a TERM (as expected).
If I run the parser ("grun Junk term -tokens"), then I get:
line 1:0 token recognition error at: 'a'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'e'
line 1:3 token recognition error at: 'e'
[#0,4:5='\r\n',<'
'>,1:4]
I "compile" the lexer first, then "compile" the parser and then javac the resulting java files. I do this in a batch file, so I'm pretty confident that I'm doing this every time.
I don't understand what I'm doing wrong. Is it the way I'm running grun? Any suggestions would be appreciated.
Always trust your intuition! There is a naming convention internal to grun :-) See TestRig.java, c. lines 125, 150. It would have been a lot nicer if some additional CLI args had been added as well.
When the lexer and parser are compiled separately, the grammar name, in your case, would be (as far as TestRig is concerned) "Junk", and the two files must be named "JunkLexer.g4" and "JunkParser.g4". Accordingly, the header in the parser file JunkParser.g4 should be modified too (and likewise the lexer file should start with lexer grammar JunkLexer;):
parser grammar JunkParser;
options { tokenVocab=JunkLexer; }
... stuff
Now you can run your tests
> antlr4 JunkLexer.g4
> antlr4 JunkParser.g4
> javac Junk*.java
> grun Junk term -tokens
aeee
^Z
[#0,0:3='aeee',<TERM>,1:0]
[#1,6:5='<EOF>',<EOF>,2:0]
>
In ANTLR 4 I try to parse a text file, but some of my defined tokens are constantly ignored in favor of others. I produced a small example to show what I mean:
File to parse:
hello world
hello world
Grammar:
grammar TestLexer;
file : line line;
line : 'hello' ' ' 'world' '\n';
LINE : ~[\n]+? '\n';
The ANTLR book explains that 'hello' would become an implicit token, which is placed before the LINE token, and that token order matters. So I'd expect that the parser would NOT match the LINE token, but it does, as the resulting tree shows:
How can I fix this, so that I get the actual implicit tokens?
Btw. I also tried to write explicit tokens before LINE, but that didn't change anything.
Found it myself:
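It seems that ANTLR's lexer always prefers the rule that matches the longest piece of input; token order only matters when two rules match the same length.
Since LINE always matches a whole line, it is always preferred over the shorter implicit tokens.
To still include a catch-all ("joker") token in a grammar, it should match only a single character.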
In my case
grammar TestLexer;
file : line line;
line : 'hello' ' ' 'world' '\n';
LINE : ~[\n];
would work.
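The difference between the two LINE rules can be sketched in Python. This is only an approximation: first_token is a made-up helper, the regexes stand in for the lexer rules, and the names HELLO, SPACE, WORLD and NEWLINE are hypothetical labels for the implicit tokens, which are listed first as if they were defined before LINE.

```python
import re

def first_token(text, rules):
    """Longest match at the start of text; on a tie the earlier rule wins."""
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and (best is None or m.end() > len(best[1])):
            best = (name, m.group())
    return best

# Stand-ins for the implicit tokens 'hello', ' ', 'world' and '\n'.
LITERALS = [("HELLO", r"hello"), ("SPACE", r" "), ("WORLD", r"world"), ("NEWLINE", r"\n")]

original = LITERALS + [("LINE", r"[^\n]+?\n")]  # LINE : ~[\n]+? '\n';
fixed = LITERALS + [("LINE", r"[^\n]")]         # LINE : ~[\n];

print(first_token("hello world\n", original))  # → ('LINE', 'hello world\n')
print(first_token("hello world\n", fixed))     # → ('HELLO', 'hello')
```

With the original rule, LINE covers twelve characters against five for 'hello', so it wins despite being defined last. With the single-character version, LINE can only ever tie with a one-character literal, and the literal wins because it comes first.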
I'm trying to have a UNICODE grammar in ANTLR, but this always causes error (snippet of grammar):
grammar Expression;
options {
charVocabulary='\u000'..'\uFFFE';
}
parse
: exp EOF
;
exp
: 'a'
;
It always ends up at: '\uFFFE' not expected ';'. How do I write correct UNICODE grammars, and what is the correct charVocabulary definition?
I'm using ANTLR 3.2, but newer versions give the same error.
charVocabulary is an ANTLR v2 option, not available in ANTLR v3 grammars. All lexers generated from ANTLR v3 grammars accept characters in the range \u0000..\uFFFF (be sure to use the proper encoding while creating an ANTLRInputStream!).
When using ANTLRWorks, you can see this by defining a rule, Any, that matches any character:
Any : . ;
and you will see the following diagram being displayed in the lower part of ANTLRWorks: