ANTLR4 no viable alternative at input after adding parser rule - parsing

I'm trying to define the language of XQuery and XPath in test.g4. The part of the file relevant to my question looks like:
grammar test;
ap: 'doc' '(' '"' FILENAME '"' ')' '/' rp
| 'doc' '(' '"' FILENAME '"' ')' '//' rp
;
rp: ...;
f: ...;
xq: STRING
| ...
;
FILENAME : [a-zA-Z0-9/_]+ '.xml' ;
STRING : '"' [a-zA-Z0-9~!##$%^&*()=+._ -]+ '"';
WS: [ \n\t\r]+ -> skip;
I tried to parse something like doc("movies.xml")//TITLE, but it gives
line 1:4 no viable alternative at input 'doc("movies.xml"'
But if I remove the STRING rule, it works fine. And since FILENAME appears before STRING, I don't know why it fails to match doc("movies.xml")//TITLE with the FILENAME rule. How can I fix this? Thank you!

The literal tokens you have in your grammar are nothing more than regular tokens. So your lexer will look like this:
TOKEN_1 : 'doc';
TOKEN_2 : '(';
TOKEN_3 : '"';
TOKEN_4 : ')';
TOKEN_5 : '/';
TOKEN_6 : '//';
FILENAME : [a-zA-Z0-9/_]+ '.xml' ;
STRING : '"' [a-zA-Z0-9~!##$%^&*()=+._ -]+ '"';
WS : [ \n\t\r]+ -> skip;
(they're not really called TOKEN_..., but that's unimportant)
Now, the way ANTLR creates tokens is to try to match as many characters as possible. Whenever two (or more) rules match the same number of characters, the one defined first "wins". Given these rules, the input doc("movies.xml") will be tokenised as follows:
doc → TOKEN_1
( → TOKEN_2
"movies.xml" → STRING
) → TOKEN_4
Since ANTLR tries to match as many characters as possible, "movies.xml" is tokenised as a single token. The lexer does not "listen" to what the parser might need at a given time. This is how ANTLR works, you cannot change this.
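One way to work with this behaviour instead of against it (a sketch, not the only possible fix) is to pull the surrounding quotes into FILENAME itself. For a name like "movies.xml", FILENAME and STRING then match the same number of characters, and since FILENAME is defined first, it wins, so ap can reference the token directly:
ap : 'doc' '(' FILENAME ')' '/' rp
| 'doc' '(' FILENAME ')' '//' rp
;
// the quotes are now part of the token, so "movies.xml" lexes as FILENAME
FILENAME : '"' [a-zA-Z0-9/_]+ '.xml' '"' ;
STRING : '"' [a-zA-Z0-9~!##$%^&*()=+._ -]+ '"';
WS: [ \n\t\r]+ -> skip;
Alternatively, keep the lexer as it is, let ap match a STRING token, and check for the .xml suffix in your own code after parsing.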
FYI, there's a user contributed XPath grammar here: https://github.com/antlr/grammars-v4/blob/master/xpath/xpath.g4

Related

How to parse a single escape character between escape characters using ANTLR

I have a string like RANDOM = "SOMEGIBBERISH ("DOG CAT-DOG","DOG CAT-DOG")". For quoted string literals I use:
StringLiteralSQ : UnterminatedStringLiteralSQ '\'' ;
UnterminatedStringLiteralSQ : '\'' (~['\r\n] | '\\' (. | EOF))* ;
StringLiteralDQ : UnterminatedStringLiteralDQ '"' ;
UnterminatedStringLiteralDQ : '"' (~[\r\n] | '\\' (. | EOF))* ;
This parses the above-mentioned string. I need to identify the words as comma-separated tokens, like DOG CAT-DOG. For this I use something like:
options : name EQUALS value
| OPTIONS L_PAREN (name EQUALS value) (COMMA (name EQUALS value))* R_PAREN
;
However, when I change the string to this format, RANDOM = "SOMEGIBBERISH ("DOG CAT-DOG"DOG CAT-DOG")", it fails with an out-of-memory error.
I want to keep parsing the strings I was parsing before, but also parse this kind of string ("DOG CAT-DOG"DOG CAT-DOG") and maybe treat it as a single token. How can I do that?
Your question is a bit confusing, so I'm not sure I understand what you are after. You ask for handling escaped characters, but then you don't show any input which uses escapes.
However, I think you are making things way too complicated. Look in other grammars to see how they define string tokens, including escape handling. Here's a typical example:
fragment SINGLE_QUOTE: '\'';
fragment DOUBLE_QUOTE: '"';
DOUBLE_QUOTED_TEXT: (
DOUBLE_QUOTE ('\\'? .)*? DOUBLE_QUOTE
)+
;
SINGLE_QUOTED_TEXT: (
SINGLE_QUOTE ('\\'? .)*? SINGLE_QUOTE
)+
;
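As a quick sanity check (my own scratch grammar, not something from the question), you can drop those rules into a tiny grammar and confirm that an escaped quote no longer terminates the token, because '\\'? . consumes the backslash together with whatever character follows it:
grammar Strings; // hypothetical test grammar, only here to exercise the token rules
start : (SINGLE_QUOTED_TEXT | DOUBLE_QUOTED_TEXT)* EOF ;
fragment SINGLE_QUOTE: '\'';
fragment DOUBLE_QUOTE: '"';
DOUBLE_QUOTED_TEXT: (DOUBLE_QUOTE ('\\'? .)*? DOUBLE_QUOTE)+ ;
SINGLE_QUOTED_TEXT: (SINGLE_QUOTE ('\\'? .)*? SINGLE_QUOTE)+ ;
WS : [ \t\r\n]+ -> skip ;
// "DOG \"CAT-DOG\"" lexes as one DOUBLE_QUOTED_TEXT token
// 'it\'s'           lexes as one SINGLE_QUOTED_TEXT token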

antlr4 not parsing according to grammar

I'm trying to parse 'for loop' according to this (partial) grammar:
grammar GaleugParserNew;
/*
* PARSER RULES
*/
relational
: '>'
| '<'
;
varChange
: '++'
| '--'
;
values
: ID
| DIGIT
;
for_stat
: FOR '(' ID '=' values ';' values relational values ';' ID varChange ')' '{' '}'
;
/*
* LEXER RULES
*/
FOR : 'for' ;
ID : [a-zA-Z_] [a-zA-Z_0-9]* ;
DIGIT : [0-9]+ ;
SPACE : [ \t\r\n] -> skip ;
When I generate the GUI view of the parse, it doesn't follow the grammar I provided above. This is what it produces:
I've encountered this problem before; what I did then was simply exit cmd, open it again and recompile everything, and somehow that worked. It's not working now, though.
I'm not really very knowledgeable about ANTLR4, so I'm not sure where to look to solve this problem.
This must be a problem with the IDE you are using. The grammar is fine and produces this parse tree in Visual Studio Code:
I guess the IDE is using the wrong parser or lexer (maybe from a different work file?). Print the lexer tokens to see whether they are what you expect. Hint: avoid defining implicit lexer tokens (like '(', '}', etc.), which will let you give the tokens good names.
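For example (my own rewrite, with token names I picked myself), the implicit literals in for_stat can be promoted to named lexer rules, which makes a token dump much easier to read:
for_stat
: FOR LPAREN ID ASSIGN values SEMI values relational values SEMI ID varChange RPAREN LBRACE RBRACE
;
FOR : 'for' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
ASSIGN : '=' ;
SEMI : ';' ;
ID : [a-zA-Z_] [a-zA-Z_0-9]* ;
DIGIT : [0-9]+ ;
SPACE : [ \t\r\n] -> skip ;
The '<'/'>' and '++'/'--' literals in relational and varChange can be given names in the same way.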

Unindented code breaks my grammar

I have a .g4 lexer/parser grammar for VBA/VB6, where the lexer skips line continuation tokens - not skipping them breaks the parser and isn't an option. Here's the lexer rule in question:
LINE_CONTINUATION : ' ' '_' '\r'? '\n' -> skip;
The problem this causes is that whenever a continued line starts at column 1, the parser blows up:
Sub Test()
Debug.Print "Some text " & _
vbNewLine & "Some more text"
End Sub
I thought "Hey I know! I'll just pre-process the string I'm feeding ANTLR to insert an extra whitespace before the underscore, and change the grammar to accept it!"
So I changed the rule like this:
LINE_CONTINUATION : WS? WS '_' NEWLINE -> skip;
NEWLINE : WS? ('\r'? '\n') WS?;
WS : [ \t]+;
...and the test VBA code above gave me this parser error:
extraneous input 'vbNewLine' expecting WS
For now my only solution is to tell my users to properly indent their code. Is there any way I can fix that grammar rule?
(Full VBA.g4 grammar file on GitHub)
You basically want line continuation to be treated like whitespace.
OK, then add the lexical definition of line continuation to the WS token. Then WS will pick up the line continuation, and you don't need the LINE_CONTINUATION rule anywhere.
//LINE_CONTINUATION : ' ' '_' '\r'? '\n' -> skip;
NEWLINE : WS? ('\r'? '\n') WS?;
WS : ([ \t]+)|(' ' '_' '\r'? '\n');
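If the code base can also have a tab, or more than one blank, in front of the underscore (an assumption on my part; the question only shows a single space), the same idea generalizes to:
WS : ([ \t]+) | ([ \t]+ '_' '\r'? '\n') ;
With the continuation folded into WS like this, the pre-processing step of inserting an extra space is no longer needed either.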

Ignoring whitespace (in certain parts) in Antlr4

I am not so familiar with ANTLR. I am using version 4, and I have a grammar where whitespace is not important in some parts (but it might be in others, or rather it's luck).
So say we have the following grammar
grammar Foo;
program : A* ;
A : ID '#' ID '(' IDList ')' ';' ;
ID : [a-zA-Z]+ ;
IDList : ID (',' IDList)* ;
WS : [ \t\r\n]+ -> skip ;
and a test input
foo#bar(X,Y);
foo#baz ( z,Z) ;
The first line is parsed correctly whereas the second one is not.
I don't want to pollute my rules with the places where whitespace is not relevant, since my actual grammar is more complicated than this toy example. In case it's not clear: the ID '#' ID part should not contain any whitespace, while whitespace in any other position shouldn't matter at all.
Even though you are skipping WS, lexer rules are still sensitive to the presence of whitespace characters. Skip simply means that no token is generated for consumption by the parser. Thus, the lexer Addr rule below explicitly does not permit any interior whitespace characters.
Conversely, the a and idList parser rules never see interior whitespace tokens, so those rules are insensitive to whitespace occurring between the generated tokens.
grammar Foo;
program : a* EOF ; // EOF will require parsing the entire input
a : Addr LParen idList RParen Semi ;
idList : ID (Comma ID)* ; // simpler equivalent construct
Addr : ID '#' ID ;
ID : [a-zA-Z]+ ;
LParen : '(' ; // tokens referenced above
RParen : ')' ;
Comma : ',' ;
Semi : ';' ;
WS : [ \t\r\n]+ -> skip ;
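With that grammar, the second input line from the question lexes cleanly and the skipped whitespace never reaches the parser; roughly:
foo#baz → Addr
( → LParen
z → ID
, → Comma
Z → ID
) → RParen
; → Semi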
Define ID '#' ID as a lexer token rather than as a parser token.
A : AID '(' IDList ')' ';' ;
AID : [a-zA-Z]+ '#' [a-zA-Z]+;
Other options
enable/disable whitespace in your token stream, e.g. here
enable/disable whitespace with lexer modes (this may be a problem because lexer modes are triggered by context, which is not easy to determine in your case)

Help with parsing a log file (ANTLR3)

I need a little guidance in writing a grammar to parse the log file of the game Aion. I've decided upon using Antlr3 (because it seems to be a tool that can do the job and I figured it's good for me to learn to use it). However, I've run into problems because the log file is not exactly structured.
The log file I need to parse looks like the one below:
2010.04.27 22:32:22 : You changed the connection status to Online.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:28 : Legion Message: www.xxxxxxxx.com (forum)
ventrillo: 19x.xxx.xxx.xxx
Port: 3712
Pass: xxxx (blabla)
4/27/2010 7:47 PM
2010.04.27 22:32:28 : You have item(s) left to settle in the sales agency window.
As you can see, most lines start with a timestamp, but there are exceptions. What I'd like to do in Antlr3 is write a parser that uses only the lines starting with the timestamp while silently discarding the others.
This is what I've written so far (I'm a beginner with these things so please don't laugh :D)
grammar Antlr;
options {
language = Java;
}
logfile: line* EOF;
line : dataline | textline;
dataline: timestamp WS ':' WS text NL ;
textline: ~DIG text NL;
timestamp: four_dig '.' two_dig '.' two_dig WS two_dig ':' two_dig ':' two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
text: ~NL+;
/* Whitespace */
WS: (' ' | '\t')+;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
So what I need is an example of how to parse this without generating errors for lines without the timestamp.
Thanks!
No one is going to laugh. In fact, you did a pretty good job for a first try. Of course, there's room for improvement! :)
First some remarks: you can only negate single characters. Since your NL rule can possibly consist of two characters, you can't negate it. Also, when negating from within your parser rule(s), you don't negate single characters, but you're negating lexer rules. This may sound a bit confusing so let me clarify with an example. Take the combined (parser & lexer) grammar T:
grammar T;
// parser rule
foo
: ~A
;
// lexer rules
A
: 'a'
;
B
: 'b'
;
C
: 'c'
;
As you can see, I'm negating the A lexer rule in the foo parser rule. The foo rule does not mean "any character except 'a'"; it matches any token except A. In other words, it will only match a 'b' or 'c' character.
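To make the difference concrete (my own toy addition on top of grammar T), add a start rule and feed it a short input:
start : foo* EOF ;
// input "bc"  → tokens B C   → both consumed by foo
// input "bca" → tokens B C A → foo matches B and C, then the parser
//               reports an error at the A token, because ~A excludes it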
Also, you don't need to put:
options {
language = Java;
}
in your grammar: the default target is Java (it does not hurt to leave it in there of course).
Now, you can already make the distinction between data lines and text lines in your lexer grammar. Here's a possible way to do so:
logfile
: line+
;
line
: dataline
| textline
;
dataline
: DataLine
;
textline
: TextLine
;
DataLine
: TwoDigits TwoDigits '.' TwoDigits '.' TwoDigits Space+ TwoDigits ':' TwoDigits ':' TwoDigits Space+ ':' TextLine
;
TextLine
: ~('\r' | '\n')* (NewLine | EOF)
;
fragment
NewLine
: '\r'? '\n'
| '\r'
;
fragment
TwoDigits
: '0'..'9' '0'..'9'
;
fragment
Space
: ' '
| '\t'
;
Note that the fragment keyword in the lexer rules means that no tokens are created from those rules: they are only used inside other lexer rules. So the lexer will only create two different types of tokens: DataLine and TextLine.
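Applied to the sample log (my reading of the rules above, not generated output), each timestamped line matches both DataLine and TextLine for the same number of characters, so the earlier-defined DataLine wins, while the other lines can only be TextLine:
2010.04.27 22:32:22 : You changed the connection status to Online. → DataLine
ventrillo: 19x.xxx.xxx.xxx → TextLine
Port: 3712 → TextLine
4/27/2010 7:47 PM → TextLine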
Trying to keep your grammar as close to the original as possible, here is how I was able to get it to work based on the example input. Because whitespace is being passed from the lexer to the parser, I moved all your tokens from the parser into actual lexer rules. The main change is really just adding another line alternative and then trying to get it to match your test data but not the other, good data. I also assumed that a blank line should be discarded, as you can tell from the rule. So here is what I was able to get working:
logfile: line* EOF;
//line : dataline | textline;
line : dataline | textline | discardline;
dataline: timestamp WS COLON WS text NL ;
textline: ~DIG text NL;
//"new"
discardline: (WS)+ discardtext (text|DIG|PERIOD|COLON|SLASH|WS)* NL
| (WS)* NL;
discardtext: (two_dig| DIG) WS* SLASH;
// two_dig SLASH four_dig;
timestamp: four_dig PERIOD two_dig PERIOD two_dig WS two_dig COLON two_dig COLON two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
//Following is very different
text: CHAR (CHAR|DIG|PERIOD|COLON|SLASH|WS)*;
/* Whitespace */
WS: (' ' | '\t')+ ;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
//new lexer rules
CHAR : 'a'..'z'|'A'..'Z';
PERIOD : '.';
COLON : ':';
SLASH : '/' | '\\';
Hopefully that helps you. Good luck.
