Parse (possibly empty) text and variable definitions inside brackets - parsing

I'm trying to parse an input into text and variable items. The input may be empty, have only text items, only variable items, or a mix of them.
The following is some sample input:
<>
<a text>
<an escaped \< and an escaped \>>
<${aVar}>
<a text and ${aVar}>
<${aVar} and a text>
<a text ${aVar} ${someMoreVars} and more text>
I tried to parse it with the following grammar:
grammar Translation;
file : ( translation | comment)* EOF ;
translation : '<' ( text | varDef )* '>' ;
varDef : '${' VARDEF '}';
text : TEXT ;
But whatever I tried for the TEXT rule, I either end up parsing everything as text or I get the nasty problem
non-fragment lexer rule TEXT can match the empty string
Which sends me straight on to a stack overflow.
How can I solve the problem? Do I have to go to an 'island grammar'? I can't see how that would help.
The problem can't be this hard to solve but I am really stuck now.

You didn't post your entire grammar, so I cannot tell you what exactly is wrong with your grammar. You can however do something like this to parse your input:
file
: ( translation | COMMENT )* EOF
;
translation : '<' ( text | var_def )* '>' ;
text
: TEXT+
;
var_def
: VAR_DEF_START text VAR_DEF_END
;
COMMENT
: '//' ~[\r\n]*
;
SPACES
: [ \t\r\n]+ -> skip
;
VAR_DEF_START
: '${'
;
VAR_DEF_END
: '}'
;
TEXT
: '\\' [\\<>]
| ~[\\<> \t\r\n]
;
It is important to match single (!) TEXT characters in the lexer and "glue" them together inside the text parser rule. If you try to match multiple TEXT characters in the lexer, you will end up matching too many characters.
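To see what the lexer hands to the parser, here is a rough Python emulation of the token rules above. It is only a sketch, not ANTLR output: the names LT and GT for the implicit '<' and '>' tokens and the helpers tokenize and glue_text are made up for illustration, and a plain ordered regex alternation stands in for ANTLR's matching.

```python
import re

# Token rules in the same order as the grammar above. LT and GT are
# hypothetical names for the implicit '<' and '>' literal tokens.
RULES = [
    ("LT", r"<"),
    ("GT", r">"),
    ("VAR_DEF_START", r"\$\{"),
    ("VAR_DEF_END", r"\}"),
    ("SPACES", r"[ \t\r\n]+"),
    ("TEXT", r"\\[\\<>]|[^\\<> \t\r\n]"),  # one (possibly escaped) character
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))


def tokenize(source):
    """Return (type, text) pairs; SPACES tokens are skipped."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"no rule matches at position {pos}")
        if match.lastgroup != "SPACES":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def glue_text(tokens):
    """Mimic the parser rule text : TEXT+ by merging adjacent TEXT tokens."""
    glued = []
    for kind, text in tokens:
        if kind == "TEXT" and glued and glued[-1][0] == "TEXT":
            glued[-1] = ("TEXT", glued[-1][1] + text)
        else:
            glued.append((kind, text))
    return glued


print(glue_text(tokenize("<a text and ${aVar}>")))
```

Because SPACES is skipped, the glued text loses its blanks ("atextand"). If the whitespace matters, it would have to be kept as a token instead of skipped.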

Related

How to parse a single escape character between escape characters using ANTLR

I have a string like RANDOM = "SOMEGIBBERISH ("DOG CAT-DOG","DOG CAT-DOG")". For quoted string literals I use:
StringLiteralSQ : UnterminatedStringLiteralSQ '\'' ;
UnterminatedStringLiteralSQ : '\'' (~['\r\n] | '\\' (. | EOF))* ;
StringLiteralDQ : UnterminatedStringLiteralDQ '"' ;
UnterminatedStringLiteralDQ : '"' (~[\r\n] | '\\' (. | EOF))* ;
This parses the string mentioned above. I need to identify the words in it as comma-separated tokens, like DOG CAT-DOG. For this I use something like:
options : name EQUALS value
| OPTIONS L_PAREN (name EQUALS value) (COMMA (name EQUALS value))* R_PAREN
;
However, when I make the string of this format RANDOM = "SOMEGIBBERISH ("DOG CAT-DOG"DOG CAT-DOG")", it fails with an out-of-memory error.
I wanted to parse the strings that I have been parsing before and also parse this kind of string ("DOG CAT-DOG"DOG CAT-DOG") and consider it a single token maybe. How can I do that?
Your question is a bit confusing, so I'm not sure I understand what you are after. You ask for handling escaped characters, but then you don't show any input which uses escapes.
However, I think you are making things way too complicated. Look in other grammars to see how they define string tokens, including escape handling. Here's a typical example:
fragment SINGLE_QUOTE: '\'';
fragment DOUBLE_QUOTE: '"';
DOUBLE_QUOTED_TEXT: (
DOUBLE_QUOTE ('\\'? .)*? DOUBLE_QUOTE
)+
;
SINGLE_QUOTED_TEXT: (
SINGLE_QUOTE ('\\'? .)*? SINGLE_QUOTE
)+
;

Comment conflict in HQL grammar

I am trying to create the --i; statement.
But my problem lies with the single line comment rule of HQL which states:
L_S_COMMENT : ('--' | '//') .*? '\r'? '\n' -> channel(HIDDEN) ;
I wrote the rules in the lexer:
T_SUB2 : '--' ;
T_SEMICOLON : ';' ;
Rule in parser:
dummy_rule: T_SUB2 'i' T_SEMICOLON ;
When I test the rule it works fine and the parse tree is displayed correctly. But when I press ENTER for a new line it shows an error and won't accept any more rules. I know it's the L_S_COMMENT rule, because when I remove it the rules work fine.
But deleting it is not the optimal solution. Any ideas what might cause this and how to bypass it?
If the relevant statements always have to be terminated with a SEMI, then effectively exclude them from the comment definition:
COMMENT
: ( CMark .*? Vws
| DMark .*? ~[; \t\r\n\f] Hws* Vws
) -> channel(HIDDEN)
;
fragment CMark : '//' ;
fragment DMark : '--' ;
fragment Hws : [ \t] ;
fragment Vws : [\r\n]+ ;
Explanation
The first alt of the rule matches a standard // comment.
The second alt will match a -- comment iff the one visible character immediately prior to the terminating whitespace is not a SEMI. The ~ is set negation, while [; \t\r\n\f] is a set of characters. Since there is no operator modifying the set, ~[; \t\r\n\f] will match only a single character that is not one of the named characters.
Hence, the comment rule will not match the terminal portion of a line of code that contains -- and terminates in a SEMI.

ANTLR4 no viable alternative at input after adding parser rule

I'm trying to define the language of XQuery and XPath in test.g4. The part of the file relevant to my question looks like:
grammar test;
ap: 'doc' '(' '"' FILENAME '"' ')' '/' rp
| 'doc' '(' '"' FILENAME '"' ')' '//' rp
;
rp: ...;
f: ...;
xq: STRING
| ...
;
FILENAME : [a-zA-Z0-9/_]+ '.xml' ;
STRING : '"' [a-zA-Z0-9~!##$%^&*()=+._ -]+ '"';
WS: [ \n\t\r]+ -> skip;
I tried to parse something like doc("movies.xml")//TITLE, but it gives
line 1:4 no viable alternative at input 'doc("movies.xml"'
But if I remove the STRING parser rule, it works fine. And since FILENAME appears before STRING, I don't know why it fails to match doc("movies.xml")//TITLE with the FILENAME parser rule. How can I fix this? Thank you!
The literal tokens you have in your grammar are nothing more than regular tokens. So your lexer will look like this:
TOKEN_1 : 'doc';
TOKEN_2 : '(';
TOKEN_3 : '"';
TOKEN_4 : ')';
TOKEN_5 : '/';
TOKEN_6 : '//';
FILENAME : [a-zA-Z0-9/_]+ '.xml' ;
STRING : '"' [a-zA-Z0-9~!##$%^&*()=+._ -]+ '"';
WS : [ \n\t\r]+ -> skip;
(they're not really called TOKEN_..., but that's unimportant)
Now, the way ANTLR creates tokens is to try to match as many characters as possible. Whenever two (or more) rules match the same number of characters, the one defined first "wins". Given these rules, the input doc("movies.xml") will be tokenised as follows:
doc → TOKEN_1
( → TOKEN_2
"movies.xml" → STRING
) → TOKEN_4
Since ANTLR tries to match as many characters as possible, "movies.xml" is tokenised as a single STRING token. The lexer does not "listen" to what the parser might need at a given time. This is how ANTLR works, and you cannot change it.
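The two rules of thumb (longest match wins, ties go to the rule defined first) can be tried out with a small Python sketch. This is only an emulation under assumptions, not ANTLR itself: the TOKEN_n names are the same placeholders as above, tokenize is a made-up helper, and the STRING character class is copied loosely from the question.

```python
import re

# Rules in definition order; TOKEN_1..TOKEN_6 stand in for the implicit
# literal tokens, as in the listing above.
RULES = [
    ("TOKEN_1", r"doc"),
    ("TOKEN_2", r"\("),
    ("TOKEN_3", r'"'),
    ("TOKEN_4", r"\)"),
    ("TOKEN_5", r"/"),
    ("TOKEN_6", r"//"),
    ("FILENAME", r"[a-zA-Z0-9/_]+\.xml"),
    ("STRING", r'"[a-zA-Z0-9~!#$%^&*()=+._ -]+"'),
    ("WS", r"[ \n\t\r]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Longest match wins; on a tie the rule defined first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in COMPILED:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for token in tokenize('doc("movies.xml")'):
    print(token)
```

At the opening quote both TOKEN_3 (one character) and STRING (twelve characters) match, so STRING wins and FILENAME never gets a chance. The parser rule ap, which expects '"' FILENAME '"', then sees a token sequence it cannot use.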
FYI, there's a user contributed XPath grammar here: https://github.com/antlr/grammars-v4/blob/master/xpath/xpath.g4

Require newline or EOF after statement match

Just looking for a simple way of getting ANTLR4 to generate a parser that will do the following (ignore anything after the ;):
int #i ; defines an int
int #j ; see how I have to go to another line for another statement?
My parser is as the following:
compilationUnit:
(statement END?)*
statement END?
EOF
;
statement:
intdef |
WS
;
// 10 - 1F block.
intdef:
'intdef' Identifier
;
// Lexer.
Identifier: '#' Letter LetterOrDigit*;
fragment Letter: [a-zA-Z_];
fragment LetterOrDigit: [a-zA-Z0-9$_];
// Whitespace, fragments and terminals.
WS: [ \t\r\n\u000C]+ -> skip;
//COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
END: (';' ~[\r\n]*) | '\n';
In essence, any time I have a statement, I need it to REQUIRE a newline before another is entered. I don't care if there's 3 new lines and then on the second one a bunch of tabs persist, as long as there's a new line.
The issue is, the ANTLR4 Parse Tree seems to be giving me errors for inputs such as:
.
(Pretend the dot isn't there; it's literally no input.)
int #i int #j
Woops, we got two on the same line!
Any ideas on how I can achieve this? I appreciate the help.
I've simplified your grammar a bit but made it require an end-of-line sequence after each statement to parse correctly.
grammar Testnl;
program: (statement )* EOF ;
statement: 'int' Identifier EOL;
Identifier: '#' Letter LetterOrDigit*;
fragment Letter: [a-zA-Z_];
fragment LetterOrDigit: [a-zA-Z0-9$_];
EOL: ';' .*? '\r\n'
| ';' .*? '\n'
;
WS: [ \t\r\n\u000C]+ -> skip;
It parses
int #i ;
int #j;
[#0,0:2='int',<'int'>,1:0]
[#1,4:5='#i',<Identifier>,1:4]
[#2,7:9=';\r\n',<EOL>,1:7]
[#3,10:12='int',<'int'>,2:0]
[#4,14:15='#j',<Identifier>,2:4]
[#5,16:18=';\r\n',<EOL>,2:6]
[#6,19:18='<EOF>',<EOF>,3:0]
It also ignores anything after the semicolon, treating it as part of the EOL token:
[#0,0:2='int',<'int'>,1:0]
[#1,4:5='#i',<Identifier>,1:4]
[#2,7:20='; ignore this\n',<EOL>,1:7]
[#3,21:23='int',<'int'>,2:0]
[#4,25:26='#j',<Identifier>,2:4]
[#5,27:28=';\n',<EOL>,2:6]
[#6,29:28='<EOF>',<EOF>,3:0]
This works with either a line feed or a carriage return plus line feed as the line ending. Is that what you're looking for?
EDIT
Per the OP's comment, I made a small change to allow consecutive line endings within an EOL token, and also moved the EOL token out of statement and into program to reduce repetition:
grammar Testnl;
program: ( statement EOL )* EOF ;
statement: 'int' Identifier;
Identifier: '#' Letter LetterOrDigit*;
fragment Letter: [a-zA-Z_];
fragment LetterOrDigit: [a-zA-Z0-9$_];
EOL: ';' .*? ('\r\n')+
| ';' .*? ('\n')+
;
WS: [ \t\r\n\u000C]+ -> skip;

Help with parsing a log file (ANTLR3)

I need a little guidance in writing a grammar to parse the log file of the game Aion. I've decided upon using Antlr3 (because it seems to be a tool that can do the job and I figured it's good for me to learn to use it). However, I've run into problems because the log file is not exactly structured.
The log file I need to parse looks like the one below:
2010.04.27 22:32:22 : You changed the connection status to Online.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:22 : You changed the group to the Solo state.
2010.04.27 22:32:28 : Legion Message: www.xxxxxxxx.com (forum)
ventrillo: 19x.xxx.xxx.xxx
Port: 3712
Pass: xxxx (blabla)
4/27/2010 7:47 PM
2010.04.27 22:32:28 : You have item(s) left to settle in the sales agency window.
As you can see, most lines start with a timestamp, but there are exceptions. What I'd like to do in Antlr3 is write a parser that uses only the lines starting with the timestamp while silently discarding the others.
This is what I've written so far (I'm a beginner with these things so please don't laugh :D)
grammar Antlr;
options {
language = Java;
}
logfile: line* EOF;
line : dataline | textline;
dataline: timestamp WS ':' WS text NL ;
textline: ~DIG text NL;
timestamp: four_dig '.' two_dig '.' two_dig WS two_dig ':' two_dig ':' two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
text: ~NL+;
/* Whitespace */
WS: (' ' | '\t')+;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
So what I need is an example of how to parse this without generating errors for lines without the timestamp.
Thanks!
No one is going to laugh. In fact, you did a pretty good job for a first try. Of course, there's room for improvement! :)
First some remarks: you can only negate single characters. Since your NL rule can possibly consist of two characters, you can't negate it. Also, when negating from within your parser rule(s), you don't negate single characters, but you're negating lexer rules. This may sound a bit confusing so let me clarify with an example. Take the combined (parser & lexer) grammar T:
grammar T;
// parser rule
foo
: ~A
;
// lexer rules
A
: 'a'
;
B
: 'b'
;
C
: 'c'
;
As you can see, I'm negating the A lexer rule in the foo parser rule. The foo rule does not match "any character except 'a'"; instead, it matches any token (lexer rule) except A. In other words, it will only match a 'b' or 'c' character.
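The difference between negating characters and negating tokens can be sketched in a few lines of Python. This is only an illustration of the idea, not generated parser code; LEXER_RULES, lex and foo_accepts are hypothetical names.

```python
# Lexer rules of grammar T: each token type matches exactly one character.
LEXER_RULES = {"A": "a", "B": "b", "C": "c"}


def lex(text):
    """Map each character to its token type (KeyError if no rule matches)."""
    by_char = {char: name for name, char in LEXER_RULES.items()}
    return [by_char[char] for char in text]


def foo_accepts(token_type):
    """Parser rule foo : ~A ; accepts any known token type other than A."""
    return token_type in LEXER_RULES and token_type != "A"


print([foo_accepts(t) for t in lex("abc")])
```

A character such as 'd' never reaches foo at all, because no lexer rule turns it into a token. That is why ~A in a parser rule is not the same as "any character except 'a'".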
Also, you don't need to put:
options {
language = Java;
}
in your grammar: the default target is Java (it does not hurt to leave it in there of course).
Now, in your grammar, you can already make a distinction between data- and text-lines in your lexer grammar. Here's a possible way to do so:
logfile
: line+
;
line
: dataline
| textline
;
dataline
: DataLine
;
textline
: TextLine
;
DataLine
: TwoDigits TwoDigits '.' TwoDigits '.' TwoDigits Space+ TwoDigits ':' TwoDigits ':' TwoDigits Space+ ':' TextLine
;
TextLine
: ~('\r' | '\n')* (NewLine | EOF)
;
fragment
NewLine
: '\r'? '\n'
| '\r'
;
fragment
TwoDigits
: '0'..'9' '0'..'9'
;
fragment
Space
: ' '
| '\t'
;
Note that the fragment keyword in front of a lexer rule means that no tokens are created from that rule: it is only used inside other lexer rules. So the lexer will only create two types of tokens: DataLine and TextLine.
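The same line-level split can be prototyped outside ANTLR to check it against the sample log. The following Python sketch is only an approximation of the DataLine and TextLine rules above; the regular expression and the parse_log helper are assumptions made for illustration.

```python
import re

# Approximates DataLine: yyyy.mm.dd, spaces, hh:mm:ss, spaces, ':', rest of line.
DATA_LINE = re.compile(
    r"(\d{4}\.\d{2}\.\d{2})[ \t]+(\d{2}:\d{2}:\d{2})[ \t]+:(.*)"
)


def parse_log(text):
    """Keep lines that start with a timestamp; silently drop all others."""
    entries = []
    for line in text.splitlines():
        match = DATA_LINE.fullmatch(line)
        if match:
            date, time, message = match.groups()
            entries.append((date, time, message.strip()))
    return entries


sample = """2010.04.27 22:32:28 : Legion Message: www.xxxxxxxx.com (forum)
ventrillo: 19x.xxx.xxx.xxx
Port: 3712
4/27/2010 7:47 PM
2010.04.27 22:32:28 : You have item(s) left to settle in the sales agency window.
"""

for entry in parse_log(sample):
    print(entry)
```

Lines that do not match the timestamp pattern play the role of TextLine tokens here and are simply ignored.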
Trying to keep your grammar as close to the original as possible, here is how I was able to get it to work with the example input. Because whitespace is passed from the lexer to the parser, I moved all the literal tokens from your parser rules into actual lexer rules. The main change is really just adding another line alternative and getting it to match the lines that should be discarded without also matching the good data. I also assumed that a blank line should be discarded, as you can tell from the rule. So here is what I was able to get working:
logfile: line* EOF;
//line : dataline | textline;
line : dataline | textline | discardline;
dataline: timestamp WS COLON WS text NL ;
textline: ~DIG text NL;
//"new"
discardline: (WS)+ discardtext (text|DIG|PERIOD|COLON|SLASH|WS)* NL
| (WS)* NL;
discardtext: (two_dig| DIG) WS* SLASH;
// two_dig SLASH four_dig;
timestamp: four_dig PERIOD two_dig PERIOD two_dig WS two_dig COLON two_dig COLON two_dig ;
four_dig: DIG DIG DIG DIG;
two_dig: DIG DIG;
//Following is very different
text: CHAR (CHAR|DIG|PERIOD|COLON|SLASH|WS)*;
/* Whitespace */
WS: (' ' | '\t')+ ;
/* New line goes to \r\n or EOF */
NL: '\r'? '\n' ;
/* Digits */
DIG : '0'..'9';
//new lexer rules
CHAR : 'a'..'z'|'A'..'Z';
PERIOD : '.';
COLON : ':';
SLASH : '/' | '\\';
Hopefully that helps you, good luck.

Resources