How to break up a grammar file per statement - parsing

I have the following single-file grammar which will parse a (fake) SELECT statement or ALTER statement:
grammar Calc;
statements
: statement (NEWLINE statement)*
;
statement
: select_statement
| alter_statement
;
select_statement
: SELECT IDENTIFIER
;
alter_statement
: ALTER IDENTIFIER
;
NEWLINE: '\n';
ALTER: 'ALTER';
SELECT: 'SELECT';
IDENTIFIER: 'one' | 'two' | 'three';
WS: [ \t\n\r]+ -> skip; // We put it at the end so it's always overridden
// when anything declared above it, such as NEWLINE, matches the same input
And the test input:
SELECT one
ALTER two
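For reference, this is how I expect the test input to be tokenized: the single spaces are skipped by WS, and the line break becomes a NEWLINE token (NEWLINE wins over WS for a lone '\n' because it is declared first):
SELECT  IDENTIFIER(one)  NEWLINE  ALTER  IDENTIFIER(two)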
I would like to separate this grammar into separate parser files, one for each statement. This can be done in two steps. In the first step I'll separate the parser and the lexer:
# SQLParser.g4
parser grammar SQLParser;
options { tokenVocab = SQLLexer; }
program: statements EOF;
statements: statement (NEWLINE statement)*;
statement
: select_statement
| alter_statement
;
select_statement: SELECT IDENTIFIER;
alter_statement: ALTER IDENTIFIER;
# SQLLexer.g4
lexer grammar SQLLexer;
options { caseInsensitive=true; }
NEWLINE: '\n';
ALTER: 'ALTER';
SELECT: 'SELECT';
IDENTIFIER: 'one' | 'two' | 'three';
WS: [ \t\n\r]+ -> skip;
How would I then separate the two statements into their own files, such as SQLSelectParser.g4 and SQLAlterParser.g4? Are there any downsides to breaking up multiple complex statements into their own files? (Of course, in this example it's trivial, but it serves to illustrate the question.)
Update: it seems the following approach works fine, though it'd be good to have someone experienced comment on the approach and on whether it's even a good idea to do at all:
# SQLParser.g4
parser grammar SQLParser;
import SQLAlterParser, SQLSelectParser;
options { tokenVocab = SQLLexer; }
program
: statements EOF
;
statements
: statement (NEWLINE statement)*
;
statement
: select_statement
| alter_statement
;
# SQLSelectParser.g4
parser grammar SQLSelectParser;
options { tokenVocab = SQLLexer; }
select_statement
: SELECT IDENTIFIER
;
# SQLAlterParser.g4
parser grammar SQLAlterParser;
options { tokenVocab = SQLLexer; }
alter_statement
: ALTER IDENTIFIER
;
# SQLLexer.g4
lexer grammar SQLLexer;
options { caseInsensitive=true; }
NEWLINE: '\n';
ALTER: 'ALTER';
SELECT: 'SELECT';
IDENTIFIER: 'one' | 'two' | 'three';
WS: [ \t\n\r]+ -> skip;

ANTLR import is not a simple include. Each grammar is treated as its own grammar "object".
When importing you get more of a superclass behavior as explained in the
Grammar Imports documentation.
You may find that it gets tricky to keep everything straight (especially as you find common sub-rules).
As the documentation shows, it can definitely be useful.
You didn't elaborate on your motivation for breaking things up. If it's just to split things into smaller source files for organizational purposes, you may be taking on more complexity from the way imports work than you save, compared with keeping the grammar in a single file (or just splitting the Lexer and Parser, which is very common).
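To make the "superclass" behavior concrete: if the importing grammar and an imported grammar both define a rule with the same name, the importing grammar's definition wins and the imported one is silently ignored, much like a subclass method overriding a superclass method. A minimal sketch (the grammar and rule names below are made up for illustration, they are not part of the grammar above):
# CommonRules.g4 -- hypothetical imported ("superclass") grammar
parser grammar CommonRules;
options { tokenVocab = SQLLexer; }
target_name
: IDENTIFIER
;
# MainParser.g4 -- hypothetical importing ("subclass") grammar
parser grammar MainParser;
import CommonRules;
options { tokenVocab = SQLLexer; }
// This local definition overrides target_name from CommonRules;
// imported rules that are NOT redefined here are inherited unchanged.
target_name
: SELECT IDENTIFIER
;
statements
: target_name (NEWLINE target_name)*
;
That override behavior is exactly where common sub-rules can get tricky: which definition is actually in effect depends on where the rule lives and on who imports whom.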

Related

Differentiate between multiplication and comment

I am writing an ANTLR grammar for SAS and running into an issue where the lexer cannot differentiate between a single line comment and a multiplication operation.
The SAS syntax for comments is regrettably:
*message;
or
/*message*/
I have written a simple test grammar to illustrate the problem:
grammar TEST;
prog: expr* EOF;
expr
: VAR #base
| expr '*' expr #mult
;
VAR: ALPHA+;
fragment ALPHA : [a-zA-Z]+ ;
COMMENT: '*' ~[\r\n];
WS: [ \t\r\n] -> skip;
I'm not sure how I can qualify the lexer to differentiate between these two situations. I am an ANTLR beginner so I may have missed something obvious.

ANTLR4: Unrecognized constant value in a lexer command

I am learning how to use the "more" lexer command. I typed in the lexer grammar shown in the ANTLR book, page 281:
lexer grammar Lexer_To_Test_More_Command ;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \t\r\n]+ -> skip ;
mode STR ;
STRING : '"' -> mode(DEFAULT_MODE) ;
TEXT : . -> more ;
Then I created this simple parser to use the lexer:
grammar Parser_To_Test_More_Command ;
import Lexer_To_Test_More_Command ;
test: STRING EOF ;
Then I opened a DOS window and entered this command:
antlr4 Parser_To_Test_More_Command.g4
That generated this warning message:
warning(155): Parser_To_Test_More_Command.g4:3:29: rule LQUOTE
contains a lexer command with an unrecognized constant value; lexer
interpreters may produce incorrect output
Am I doing something wrong in the lexer or parser?
Combined grammars (which are grammars that start with just grammar, instead of parser grammar or lexer grammar) cannot use lexer modes. Instead of using the import feature¹, you should use the tokenVocab feature like this:
Lexer_To_Test_More_Command.g4:
lexer grammar Lexer_To_Test_More_Command;
// lexer rules and modes here
Parser_To_Test_More_Command.g4:
parser grammar Parser_To_Test_More_Command;
options {
tokenVocab = Lexer_To_Test_More_Command;
}
// parser rules here
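Filled in with the rules from the question, the rearranged files would look something like this (just the question's rules moved around):
Lexer_To_Test_More_Command.g4:
lexer grammar Lexer_To_Test_More_Command ;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \t\r\n]+ -> skip ;
mode STR ;
STRING : '"' -> mode(DEFAULT_MODE) ;
TEXT : . -> more ;
Parser_To_Test_More_Command.g4:
parser grammar Parser_To_Test_More_Command ;
options {
tokenVocab = Lexer_To_Test_More_Command;
}
test : STRING EOF ;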
¹ I actually recommend avoiding the import statement altogether in ANTLR. The method I described above is almost always preferable.

Overlapping Tokens in ANTLR 4

I have the following ANTLR 4 combined grammar:
grammar Example;
fieldList: field* ;
field: 'field' identifier '{' note '}' ;
note: NOTE ;
identifier: IDENTIFIER ;
NOTE: [A-Ga-g] ;
IDENTIFIER: [A-Za-z0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
This parses:
field x { A }
field x { B }
This does not:
field a { A }
field b { B }
In the case where parsing fails, I think the lexer is getting confused and emitting a NOTE token where I want it to emit an IDENTIFIER token.
Edit:
In the tokens coming out of the lexer, the 'NOTE' token is showing up where the parser is expecting 'IDENTIFIER'. 'NOTE' has higher precedence because it's shown first in the grammar. So, I can think of two ways to fix this... first, I could alter the grammar to disambiguate 'NOTE' and 'IDENTIFIER' (like adding a '$' in front of 'NOTE'). Or, I could just use 'IDENTIFIER' where I would use note and then deal with detecting issues when I walk the parse tree. Neither of those feels optimal. Surely there must be a way to fix this?
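To make that concrete, here is how I believe the lexer tokenizes the failing line: the lone 'a' matches both NOTE and IDENTIFIER at the same length, and NOTE wins because it is listed first:
field a { A }
  ->  'field'  NOTE(a)  '{'  NOTE(A)  '}'
so the identifier rule, which only accepts IDENTIFIER, fails on NOTE(a).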
I actually ended up solving it like this:
grammar Example;
fieldList: field* ;
field: 'field' identifier '{' note '}' ;
note: NOTE ;
identifier: IDENTIFIER | NOTE ;
NOTE: [A-Ga-g] ;
IDENTIFIER: [A-Za-z0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
My parse tree still ends up looking how I'd like.
The actual grammar I'm developing is more complicated, as is the workaround based on this approach. But in general, the approach seems to work well.
A quick and dirty fix for your problem can be: change IDENTIFIER to match only the complement of NOTE. Then you put them together in identifier.
Resulting grammar:
grammar Example;
fieldList: field* ;
field: 'field' identifier '{' note '}' ;
note: NOTE ;
identifier: (NOTE|IDENTIFIER_C)+ ;
NOTE: [A-Ga-g] ;
IDENTIFIER_C: [H-Zh-z0-9] ;
WS: [ \t\r\n]+ -> skip ;
The disadvantage of this solution is that you do not get the identifier as a single token and you tokenize every single character.
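For example (my reading of the rules, not verified output), an input like field xy { A } now comes out of the lexer as:
field xy { A }
  ->  'field'  IDENTIFIER_C(x)  IDENTIFIER_C(y)  '{'  NOTE(A)  '}'
and the identifier parser rule has to stitch the single-character tokens back together.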

Antlr conditional rewrites

I have the following Antlr grammar rule:
expression1
: e=expression2 (BINOR^ e2=expression2)*
;
However if I have '3 | 1 | 2 | 6' this results in a flat tree, with 3, 1, 2, 6 all children of the BINOR node. What I really want is to be able to pattern match on either
expression2
or
^(BINOR expression2 expression2)
How can I change the rewrite so that these are the 2 patterns?
EDIT:
If I use custom rewrites, I'm thinking along the lines of
expression1
: e=expression2 (BINOR e2=expression2)*
-> {$BINOR != null}? ^(BINOR $e $e2*)
-> $e
But when I do this with '1|2|3' the resulting tree only has one BINOR node with two children which are 1 and 3, so 2 is missing.
Many thanks
You were close, this would work:
expression1
@init{boolean or = false;}
: e=expression2 (BINOR {or=true;} expression2)* -> {or}? ^(BINOR expression2+)
-> $e
;
But this is preferred since it doesn't use any custom code:
grammar T;
options {
output=AST;
}
expression1
: (e=expression2 -> $e) ((BINOR expression2)+ -> ^(BINOR expression2+))?
;
expression2
: NUMBER
;
NUMBER
: '0'..'9'+
;
BINOR
: '|'
;
The parser generated from the grammar above will parse the input "3|1|2|6" into a flat AST with a single BINOR ('|') root and the four operands as its children, ^('|' 3 1 2 6), and the input "3" into an AST consisting of just the single node 3.
But your original try:
expression1
: e=expression2 (BINOR^ e2=expression2)*
;
does not produce a flat tree (assuming you have output=AST; in your options). For "3|1|2|6" it instead generates a left-nested AST: ^('|' ^('|' ^('|' 3 1) 2) 6).
If you "see" a flat tree, I guess you're using the interpreter in ANTLRWorks, which does not show the AST but the parse tree. The interpreter is also rather buggy (it does not handle predicates and does not evaluate custom code), so best not to use it. Use the ANTLRWorks debugger instead, which works like a charm (the AST shapes described above are what the debugger shows).

Help with parsing a log file (ANTLR3)

I need a little guidance in writing a grammar to parse the log file of the game Aion. I've decided upon using Antlr3 (because it seems to be a tool that can do the job and I figured it's good for me to learn to use it). However, I've run into problems because the log file is not exactly structured.
The log file I need to parse looks like the one below:
2010.04.27 22:32:22 : You changed the connection status to Online.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:28 : Legion Message: www.xxxxxxxx.com (forum)
ventrillo: 19x.xxx.xxx.xxx
Port: 3712
Pass: xxxx (blabla)
4/27/2010 7:47 PM
2010.04.27 22:32:28 : You have item(s) left to settle in the sales agency window.
As you can see, most lines start with a timestamp, but there are exceptions. What I'd like to do in Antlr3 is write a parser that uses only the lines starting with the timestamp while silently discarding the others.
This is what I've written so far (I'm a beginner with these things so please don't laugh :D)
grammar Antlr;
options {
language = Java;
}
logfile: line* EOF;
line : dataline | textline;
dataline: timestamp WS ':' WS text NL ;
textline: ~DIG text NL;
timestamp: four_dig '.' two_dig '.' two_dig WS two_dig ':' two_dig ':' two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
text: ~NL+;
/* Whitespace */
WS: (' ' | '\t')+;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
So what I need is an example of how to parse this without generating errors for lines without the timestamp.
Thanks!
No one is going to laugh. In fact, you did a pretty good job for a first try. Of course, there's room for improvement! :)
First some remarks: in the lexer you can only negate single characters. Since your NL rule can consist of two characters, you can't negate it. Also, when negating from within your parser rule(s), you're not negating single characters but entire lexer rules (tokens). This may sound a bit confusing, so let me clarify with an example. Take the combined (parser & lexer) grammar T:
grammar T;
// parser rule
foo
: ~A
;
// lexer rules
A
: 'a'
;
B
: 'b'
;
C
: 'c'
;
As you can see, I'm negating the A lexer rule in the foo parser rule. The foo rule does not mean "any character except 'a'"; it means "any token except A". In other words, it will only match a 'b' or a 'c' character.
Also, you don't need to put:
options {
language = Java;
}
in your grammar: the default target is Java (it does not hurt to leave it in there of course).
Now, you can already make the distinction between data-lines and text-lines in your lexer grammar. Here's a possible way to do so:
logfile
: line+
;
line
: dataline
| textline
;
dataline
: DataLine
;
textline
: TextLine
;
DataLine
: TwoDigits TwoDigits '.' TwoDigits '.' TwoDigits Space+ TwoDigits ':' TwoDigits ':' TwoDigits Space+ ':' TextLine
;
TextLine
: ~('\r' | '\n')* (NewLine | EOF)
;
fragment
NewLine
: '\r'? '\n'
| '\r'
;
fragment
TwoDigits
: '0'..'9' '0'..'9'
;
fragment
Space
: ' '
| '\t'
;
Note that the fragment part in the lexer rules means that no tokens are created from those rules: they are only used inside other lexer rules. So the lexer will only create two different types of tokens: DataLine and TextLine.
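For instance, my reading of these rules is that the sample lines from the question tokenize roughly like this (not verified output):
2010.04.27 22:32:22 : You changed the connection status to Online.  ->  DataLine
ventrillo: 19x.xxx.xxx.xxx                                           ->  TextLine
Port: 3712                                                           ->  TextLine
4/27/2010 7:47 PM                                                    ->  TextLine
so the dataline parser rule only ever sees DataLine tokens, while the textline alternative quietly absorbs everything else.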
Trying to keep your grammar as close to the original as possible, here is how I was able to get it to work on the example input. Because whitespace is passed on to the parser by the lexer, I moved all of your character-level rules out of the parser and into proper lexer rules. The main change is really just adding another line alternative (discardline) and tuning it to match your junk lines but not the actual good data. I also assumed that a blank line should be discarded, as you can tell from the rule. So here is what I was able to get working:
logfile: line* EOF;
//line : dataline | textline;
line : dataline | textline | discardline;
dataline: timestamp WS COLON WS text NL ;
textline: ~DIG text NL;
//"new"
discardline: (WS)+ discardtext (text|DIG|PERIOD|COLON|SLASH|WS)* NL
| (WS)* NL;
discardtext: (two_dig| DIG) WS* SLASH;
// two_dig SLASH four_dig;
timestamp: four_dig PERIOD two_dig PERIOD two_dig WS two_dig COLON two_dig COLON two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
//Following is very different
text: CHAR (CHAR|DIG|PERIOD|COLON|SLASH|WS)*;
/* Whitespace */
WS: (' ' | '\t')+ ;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
//new lexer rules
CHAR : 'a'..'z'|'A'..'Z';
PERIOD : '.';
COLON : ':';
SLASH : '/' | '\\';
Hopefully that helps you, good luck.
