If I use this grammar:
grammar NameValue;
nameValue: (name=ID ':' value=ID)+ EOF;
//idWithSpace : ID (' ' ID)*;
ID : [a-zA-Z]+ ;
WS : [ \t\r\n]+ -> skip ; // Define whitespace rule, toss it out
and this input:
a:b
a : b
A : B
I get this parse:
(nameValue a : b a : b A : B <EOF>)
But if I uncomment the idWithSpace line, I get this parse:
line 2:1 extraneous input ' ' expecting ':'
line 2:3 extraneous input ' ' expecting ID
(nameValue a : b a : b A : B <EOF>)
Why does adding the rule idWithSpace
idWithSpace : ID (' ' ID)*;
that is not referenced, cause the parse to change?
This rule:
idWithSpace : ID (' ' ID)*;
due to the embedded string ' ', implicitly creates a lexer rule that matches a single space character and is placed before all other lexer rules. As a result, it masks your WS rule whenever exactly one space is encountered: both rules match one character, and the rule defined first wins the tie. A single space is therefore no longer skipped; it is tokenized and passed to the parser. But there is no parser rule that allows a single space before the :, so the parser complains about the extra ' ' input:
line 2:1 extraneous input ' ' expecting ':'
I was bitten by this kind of bug a lot when I first used ANTLR.
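The masking can be reproduced outside ANTLR. The following is a minimal Python sketch, not ANTLR's actual implementation: it assumes ANTLR's usual disambiguation (longest match wins; on a tie, the rule listed first wins) and approximates each lexer rule with a regular expression. The `tokenize` helper and the rule lists `PLAIN` and `WITH_SPACE` are hypothetical names introduced for illustration.

```python
import re

def tokenize(rules, text):
    """Longest match wins; on a tie, the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # mimic '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Grammar with idWithSpace commented out: only ':' is an implicit literal token.
PLAIN = [("':'", r":"), ("ID", r"[a-zA-Z]+"), ("WS", r"[ \t\r\n]+")]

# Grammar with idWithSpace enabled: the implicit ' ' token (assumed to be
# placed ahead of the named lexer rules) now competes with WS.
WITH_SPACE = [("':'", r":"), ("' '", r" ")] + PLAIN[1:]

print(tokenize(PLAIN, "a : b"))       # spaces are skipped
print(tokenize(WITH_SPACE, "a : b"))  # each single space becomes a ' ' token
```

In this sketch, two consecutive spaces are skipped again even with `WITH_SPACE`, because WS then matches the longer text.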
I'm trying to define the language of XQuery and XPath in test.g4. The part of the file relevant to my question looks like:
grammar test;
ap: 'doc' '(' '"' FILENAME '"' ')' '/' rp
| 'doc' '(' '"' FILENAME '"' ')' '//' rp
;
rp: ...;
f: ...;
xq: STRING
| ...
;
FILENAME : [a-zA-Z0-9/_]+ '.xml' ;
STRING : '"' [a-zA-Z0-9~!##$%^&*()=+._ -]+ '"';
WS: [ \n\t\r]+ -> skip;
I tried to parse something like doc("movies.xml")//TITLE, but it gives
line 1:4 no viable alternative at input 'doc("movies.xml"'
But if I remove the STRING lexer rule, it works fine. And since FILENAME appears before STRING, I don't know why it fails to match doc("movies.xml")//TITLE with the FILENAME lexer rule. How can I fix this? Thank you!
The literal tokens you have in your grammar are nothing more than regular tokens. So your lexer will look like this:
TOKEN_1 : 'doc';
TOKEN_2 : '(';
TOKEN_3 : '"';
TOKEN_4 : ')';
TOKEN_5 : '/';
TOKEN_6 : '//';
FILENAME : [a-zA-Z0-9/_]+ '.xml' ;
STRING : '"' [a-zA-Z0-9~!##$%^&*()=+._ -]+ '"';
WS : [ \n\t\r]+ -> skip;
(they're not really called TOKEN_..., but that's unimportant)
Now, the way ANTLR creates tokens is to try to match as many characters as possible. Whenever two (or more) rules match the same number of characters, the one defined first "wins". Given these rules, the input doc("movies.xml") will be tokenised as follows:
doc → TOKEN_1
( → TOKEN_2
"movies.xml" → STRING
) → TOKEN_4
Since ANTLR tries to match as many characters as possible, "movies.xml" is tokenised as a single token. The lexer does not "listen" to what the parser might need at a given time. This is how ANTLR works; you cannot change this.
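To see the effect concretely, here is a small Python sketch that imitates the two rules just described (longest match, then first-defined). It is an approximation for illustration, not how ANTLR is implemented: each lexer rule is rewritten by hand as a regular expression, and the `lex` helper and `RULES` list are hypothetical names.

```python
import re

# Literal tokens first (named TOKEN_n as in the listing above), then the named rules.
RULES = [
    ("TOKEN_1", r"doc"),
    ("TOKEN_2", r"\("),
    ("TOKEN_3", r'"'),
    ("TOKEN_4", r"\)"),
    ("TOKEN_5", r"/"),
    ("TOKEN_6", r"//"),
    ("FILENAME", r"[a-zA-Z0-9/_]+\.xml"),
    ("STRING", r'"[a-zA-Z0-9~!##$%^&*()=+._ -]+"'),
    ("WS", r"[ \n\t\r]+"),
]

def lex(text, rules):
    tokens, pos = [], 0
    while pos < len(text):
        candidates = []
        for order, (name, pattern) in enumerate(rules):
            m = re.compile(pattern).match(text, pos)
            if m:
                # Sort key: longest match first, then the earliest rule.
                candidates.append((-len(m.group()), order, name, m.group()))
        if not candidates:
            raise ValueError(f"no rule matches at offset {pos}")
        _, _, name, lexeme = min(candidates)
        if name != "WS":  # mimic '-> skip'
            tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens

# With STRING present, the quoted file name becomes one STRING token.
print(lex('doc("movies.xml")', RULES))

# Without STRING, the quotes and FILENAME come out as separate tokens.
print(lex('doc("movies.xml")', [r for r in RULES if r[0] != "STRING"]))
```

The second call shows the token sequence the `ap` parser rule expects, which is why removing STRING makes the parse succeed.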
FYI, there's a user contributed XPath grammar here: https://github.com/antlr/grammars-v4/blob/master/xpath/xpath.g4
I have a .g4 grammar for a vba/vb6 lexer/parser, where the lexer skips line continuation tokens; not skipping them breaks the parser and isn't an option. Here's the lexer rule in question:
LINE_CONTINUATION : ' ' '_' '\r'? '\n' -> skip;
The problem is that whenever a continued line starts at column 1, the parser blows up:
Sub Test()
Debug.Print "Some text " & _
vbNewLine & "Some more text"
End Sub
I thought "Hey I know! I'll just pre-process the string I'm feeding ANTLR to insert an extra whitespace before the underscore, and change the grammar to accept it!"
So I changed the rule like this:
LINE_CONTINUATION : WS? WS '_' NEWLINE -> skip;
NEWLINE : WS? ('\r'? '\n') WS?;
WS : [ \t]+;
...and the test vba code above gave me this parser error:
extraneous input 'vbNewLine' expecting WS
For now my only solution is to tell my users to properly indent their code. Is there any way I can fix that grammar rule?
(Full VBA.g4 grammar file on GitHub)
You basically want line continuation to be treated like whitespace.
OK, then add the lexical definition of line continuation to the WS token. Then WS will pick up the line continuation, and you don't need LINE_CONTINUATION anywhere.
//LINE_CONTINUATION : ' ' '_' '\r'? '\n' -> skip;
NEWLINE : WS? ('\r'? '\n') WS?;
WS : ([ \t]+)|(' ' '_' '\r'? '\n');
I'm trying to write a simple ANTLR4 grammar for parsing SRT subtitle files. I thought it would be an easy, introductory task, but I guess I must be missing something. But first things first, the grammar:
grammar Srt;
file : subtitle (NL NL subtitle)* EOF;
subtitle: SUBNO NL
TSTAMP ' --> ' TSTAMP NL
LINE (NL LINE)*;
TSTAMP : I99 ':' I59 ':' I59 ',' I999;
SUBNO : D09+;
NL : '\r'? '\n';
LINE : ~('\r'|'\n')+;
fragment I999 : D09 D09 D09;
fragment I99 : D09 D09;
fragment I59 : D05 D09;
fragment D09 : [0-9];
fragment D05 : [0-5];
And here's the beginning of an SRT file where the problem starts:
1
00:00:20,000 --> 00:00:26,000
The error I get is:
line 2:0 mismatched input '00:00:20,000 --> 00:00:26,000' expecting TSTAMP
So it looks like the second line was matched by the lexer rule LINE (as this is the longest token that could be matched), whereas I expected it to match the rule TSTAMP (that's why it's defined before the LINE rule in the grammar). My ANTLR4 knowledge is too weak at this point to tweak the grammar so that the lexer tries to match only a subset of tokens depending on the current position in the parser rule. What I want is to match TSTAMP and not LINE, as TSTAMP is in fact the expected input. Maybe I could trick it with some lexer modes, but I can hardly believe it couldn't be written in a simpler way. Can it?
As CoronA suggested, the trick was to defer the decision about what a line is from the lexer to the parser. I modified the grammar a bit more and now it parses subtitles smoothly:
grammar Srt;
file : subtitle (NL NL subtitle)* EOF;
subtitle: SUBNO NL
TSTAMP ' --> ' TSTAMP NL
lines;
lines : line (NL line)*;
line : (LINECHAR | SUBNO | TSTAMP)*;
TSTAMP : I99 ':' I59 ':' I59 ',' I999;
SUBNO : D09+;
NL : '\r'? '\n';
LINECHAR: ~[\r\n];
fragment I999 : D09 D09 D09?;
fragment I99 : D09 D09;
fragment I59 : D05 D09;
fragment D09 : [0-9];
fragment D05 : [0-5];
Your definition of the token LINE subsumes everything:
LINE : ~('\r'|'\n')+;
Each TSTAMP is also a LINE, but LINE can match longer lexemes, and it does, as you can see. ANTLR prefers the longest match.
To make your grammar work, transfer the decision what a line is from the lexer into the parser:
subtitle: SUBNO NL
TSTAMP ' --> ' TSTAMP NL
line*;
line: (LINECHAR | TSTAMP | SUBNO)* NL?;
...
LINECHAR : ~('\r'|'\n' ) ; //remove the '+'
You can see that a line may contain any LINECHAR but also TSTAMPs and SUBNOs.
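The length comparison behind this can be checked with a few lines of Python. This is only an illustrative sketch: the three lexer rules are transcribed by hand into regular expressions, and `match_len` is a hypothetical helper, not part of ANTLR.

```python
import re

# Hand-written regex equivalents of the lexer rules (an approximation).
TSTAMP = re.compile(r"[0-9]{2}:[0-5][0-9]:[0-5][0-9],[0-9]{3}")
LINE = re.compile(r"[^\r\n]+")    # original rule: swallows the whole line
LINECHAR = re.compile(r"[^\r\n]") # reworked rule: exactly one character

text = "00:00:20,000 --> 00:00:26,000"

def match_len(rule, s):
    """Number of characters the rule matches at the start of s (0 if none)."""
    m = rule.match(s)
    return len(m.group()) if m else 0

print(match_len(TSTAMP, text))    # 12
print(match_len(LINE, text))      # 29, so LINE beats TSTAMP
print(match_len(LINECHAR, text))  # 1, so TSTAMP beats LINECHAR
```

Once the catch-all rule matches a single character, it can no longer out-match TSTAMP, and the parser assembles lines from the individual tokens.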
I need a little guidance in writing a grammar to parse the log file of the game Aion. I've decided upon using Antlr3 (because it seems to be a tool that can do the job and I figured it's good for me to learn to use it). However, I've run into problems because the log file is not exactly structured.
The log file I need to parse looks like the one below:
2010.04.27 22:32:22 : You changed the connection status to Online.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:28 : Legion Message: www.xxxxxxxx.com (forum)
ventrillo: 19x.xxx.xxx.xxx
Port: 3712
Pass: xxxx (blabla)
4/27/2010 7:47 PM
2010.04.27 22:32:28 : You have item(s) left to settle in the sales agency window.
As you can see, most lines start with a timestamp, but there are exceptions. What I'd like to do in Antlr3 is write a parser that uses only the lines starting with the timestamp while silently discarding the others.
This is what I've written so far (I'm a beginner with these things so please don't laugh :D)
grammar Antlr;
options {
language = Java;
}
logfile: line* EOF;
line : dataline | textline;
dataline: timestamp WS ':' WS text NL ;
textline: ~DIG text NL;
timestamp: four_dig '.' two_dig '.' two_dig WS two_dig ':' two_dig ':' two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
text: ~NL+;
/* Whitespace */
WS: (' ' | '\t')+;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
So what I need is an example of how to parse this without generating errors for lines without the timestamp.
Thanks!
No one is going to laugh. In fact, you did a pretty good job for a first try. Of course, there's room for improvement! :)
First some remarks: you can only negate single characters. Since your NL rule can possibly consist of two characters, you can't negate it. Also, when negating from within your parser rule(s), you don't negate single characters, but you're negating lexer rules. This may sound a bit confusing so let me clarify with an example. Take the combined (parser & lexer) grammar T:
grammar T;
// parser rule
foo
: ~A
;
// lexer rules
A
: 'a'
;
B
: 'b'
;
C
: 'c'
;
As you can see, I'm negating the A lexer rule in the foo parser rule. The foo rule does not match any character except 'a'; instead, it matches any token except A. In other words, it will only match a 'b' or 'c' character.
Also, you don't need to put:
options {
language = Java;
}
in your grammar: the default target is Java (it does not hurt to leave it in there of course).
Now, in your grammar, you can already make a distinction between data- and text-lines in your lexer grammar. Here's a possible way to do so:
logfile
: line+
;
line
: dataline
| textline
;
dataline
: DataLine
;
textline
: TextLine
;
DataLine
: TwoDigits TwoDigits '.' TwoDigits '.' TwoDigits Space+ TwoDigits ':' TwoDigits ':' TwoDigits Space+ ':' TextLine
;
TextLine
: ~('\r' | '\n')* (NewLine | EOF)
;
fragment
NewLine
: '\r'? '\n'
| '\r'
;
fragment
TwoDigits
: '0'..'9' '0'..'9'
;
fragment
Space
: ' '
| '\t'
;
Note that the fragment keyword in the lexer rules means that no tokens are created from those rules: they are only used in other lexer rules. So the lexer will only create two different types of tokens: DataLine's and TextLine's.
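For comparison, the same data-line/text-line distinction can be sketched with a plain regular expression in Python instead of ANTLR. This is a swapped-in technique shown only to illustrate what the DataLine rule accepts; `DATA_LINE`, `data_lines`, and `sample` are hypothetical names, and the sample is taken from the log excerpt in the question.

```python
import re

# Mirrors the DataLine lexer rule: date, spaces, time, spaces, ':' and the rest of the line.
DATA_LINE = re.compile(
    r"([0-9]{4}\.[0-9]{2}\.[0-9]{2})[ \t]+([0-9]{2}:[0-9]{2}:[0-9]{2})[ \t]+:(.*)"
)

def data_lines(log_text):
    """Return (date, time, message) for timestamped lines; silently skip the rest."""
    result = []
    for line in log_text.splitlines():
        m = DATA_LINE.match(line)
        if m:
            date, time, message = m.groups()
            result.append((date, time, message.strip()))
    return result

sample = """2010.04.27 22:32:28 : Legion Message: www.xxxxxxxx.com (forum)
ventrillo: 19x.xxx.xxx.xxx
Port: 3712
4/27/2010 7:47 PM
2010.04.27 22:32:28 : You have item(s) left to settle in the sales agency window.
"""

for entry in data_lines(sample):
    print(entry)
```

Lines such as "Port: 3712" or "4/27/2010 7:47 PM" do not start with the timestamp pattern and are dropped, which corresponds to ignoring TextLine tokens in the grammar above.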
Trying to keep your grammar as close to the original as possible, here is how I got it to work on the example input. Because whitespace is being passed from the lexer to the parser, I moved all the literal tokens in your parser rules into actual lexer rules. The main change is an additional line alternative, tuned so that it matches the lines you want discarded and not the good data. I also assumed that a blank line should be discarded, as you can tell from the rule. So here is what I was able to get working:
logfile: line* EOF;
//line : dataline | textline;
line : dataline | textline | discardline;
dataline: timestamp WS COLON WS text NL ;
textline: ~DIG text NL;
//"new"
discardline: (WS)+ discardtext (text|DIG|PERIOD|COLON|SLASH|WS)* NL
| (WS)* NL;
discardtext: (two_dig| DIG) WS* SLASH;
// two_dig SLASH four_dig;
timestamp: four_dig PERIOD two_dig PERIOD two_dig WS two_dig COLON two_dig COLON two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
//Following is very different
text: CHAR (CHAR|DIG|PERIOD|COLON|SLASH|WS)*;
/* Whitespace */
WS: (' ' | '\t')+ ;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
//new lexer rules
CHAR : 'a'..'z'|'A'..'Z';
PERIOD : '.';
COLON : ':';
SLASH : '/' | '\\';
Hopefully that helps you, good luck.
Suppose I'm having white spaces (WS) in the hidden channel. And
for a particular rule alone, I want white spaces also to be considered, is
it possible to bring WS to the default channel for that particular rule alone in the parser?
Have a look at the answer to your path question and notice how I put a '\n' into the parser rule. You should be able to put ' ' in as well. The only concern is whether every alternative that your hidden-channel WS rule covers needs to appear in the parser rule.
e.g.
rulename : Token1 ' ' Token2 ' ' Token1 {place action here};
Please note that the rule name starts with a lowercase letter and it is a parser rule while the "Token#" start with uppercase letter and are lexer rules. In between the different tokens the rule requires a space in this example, and I suppose you could put something like (' '|'\t'|'\r'|'\n')+ but I have not tried this and will leave that for you to attempt.
You can always query the hidden token stream
e.g. in C++
myrule: MYTOK { static_cast<antlr::CommonHiddenStreamToken*>(LT(1).get())->getHiddenAfter()->getType() == WS}? MYTOK
The semantic predicate checks whether there is a whitespace token after the lexical token MYTOK has been matched.
Lexer rules are evaluated in the order they are listed in your grammar file.
This means you can have something like this:
STRING_LITERAL: '"' NONCONTROL_CHAR* '"';
fragment NONCONTROL_CHAR: LETTER | DIGIT | UNDERSCORE | SPACE | BACKSLASH | MINUS | COMMA;
fragment LETTER: LOWER | UPPER;
fragment LOWER: 'a'..'z';
fragment UPPER: 'A'..'Z';
fragment DIGIT: '0'..'9';
fragment SPACE: ' ' | '\t';
fragment UNDERSCORE: '_';
fragment MINUS: '-';
fragment BACKSLASH: '\\';
COMMA: ',';
NEWLINE: ('\r'? '\n')+ { $channel = HIDDEN; };
TERMINATOR : ';';
WHITESPACE: SPACE+ { $channel = HIDDEN; };
LINE_COMMENT
:
'//' ~('\n'|'\r')* ('\r\n' | '\r' | '\n')
{
$channel = HIDDEN;
}
|
'//' ~('\n'|'\r')*
{
$channel = HIDDEN;
}
;
As you can see, a string literal can have spaces or tabs in it. However, a standalone space or tab will be sent to the HIDDEN channel.