ANTLR4 - parse a file line-by-line - parsing

I try to write a grammar to parse a file line by line.
My grammar looks like this:
grammar simple;
parse: (line NL)* EOF
line
: statement1
| statement2
| // empty line
;
statement1 : KW1 (INT|FLOAT);
statement2 : KW2 INT;
...
NL : '\r'? '\n';
WS : (' '|'\t')-> skip; // toss out whitespace
If the last line in my input file does not have a newline, I get the following error message:
line xx:37 missing NL at <EOF>
Can somebody please explain, how to write a grammar that actually accepts the last line without newline

Simply don't require NL to fall after the last line. This form is efficient, and simplified based on the fact that line can match the empty sequence (essentially the last line will always be the empty line).
// best if `line` can be empty
parse
: line (NL line)* EOF
;
If line was not allowed to be empty, the following rule is efficient and performs the same operation:
// best if `line` cannot be empty
parse
: (line (NL line)* NL?)? EOF
;
The following rule is equivalent for the case where line cannot be empty, but is much less efficient. I'm including it because I frequently see people write rules in this form where it's easy to see that the specific NL token which was made optional is the one following the last line.
// INEFFICIENT!
parse
: (line NL)* (line NL?)? EOF
;

Related

Xtext terminal overlapping

I am new to Xtext and I am facing following issue:
Under every "error id :" line i can expect every printable character with spaces/tabs between. My language is indent-based so this "terminal" cannot start with space character.
Edit/:
Example code for this language would look like this:
package somepkg:
error UNKNOWN:
Unknown error.
error ZERO_DIVISION:
Do not divide by zero you {0} donkey!.
Closest i get to this language specification is this:
grammar com.example.lang.ermsglang.Ermsglang with org.eclipse.xtext.xbase.Xbase hidden(WS)
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate ermsglang "http://www.example.com/lang/ermsglang/Ermsglang"
Model:
{Model}
'package' name=ENAME ':'
(BEGIN
(expressions+=Error)+
END)?
;
Error:
{Error}
'error' name=ENAME ':'
(BEGIN
(expressions+=Anything)+
END)?
;
Anything:
(ENAME|EMSG|INT)
;
//Terminals must be disjunctive
terminal ENAME:
('_'|'A'..'Z') ('_'|'A'..'Z')*
;
terminal EMSG:
('!'..'/'|':'..'#'|'['..'~')+
;
terminal SL_COMMENT:
'#' !('\n'|'\r')* ('\r'? '\n')?
;
// The following synthetic tokens are used for the indentation-aware blocks
terminal BEGIN: 'synthetic:BEGIN'; // increase indentation
terminal END: 'synthetic:END'; // decrease indentation
But still, this allows either ENAME or EMSG or INT terminals, so you cant mix for example numbers with characters. Problem is terminals have to be disjunctive so if i modify rule "ANYTHING" like this:
terminal ANYTHING:
(ENAME|EMSG|INT)+
;
or
Anything:
(ENAME|EMSG|INT)+
;
will be a problem with lexer/parser which cannot determine which terminal is which. How to deal with this situation? Thanks.
//Edit: Thank to Christian for working example, there is still one problem with SL_COMMENT, in this example second error keyword is highlighted with message
missing RULE_END at 'error'
package A :
error B :
a
#bopsa Akfkfndsfio
error A_C_S :
:aasdasdasd
the follwoing grammar works for me
grammar org.xtext.example.mydsl3.MyDsl hidden (WS, SL_COMMENT)
generate myDsl "http://www.xtext.org/example/mydsl3/MyDsl"
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
Model:
{Model}
'package' name=ENAME ':'
(BEGIN
(expressions+=Error)+
END)?
;
Error:
{Error}
'error' name=ENAME ':'
(BEGIN
(expressions+=Anything)+
END)?
;
Anything:
(ENAME|EMSG|INT|':')
;
//Terminals must be disjunctive
terminal ENAME:
('_'|'A'..'Z'|'a'..'z') ('_'|'A'..'Z'|'a'..'z')*
;
terminal INT returns ecore::EInt: ('0'..'9')+;
terminal EMSG:
('!'..'/'|';'..'#'|'['..'~')+
;
terminal SL_COMMENT:
'#' !('\n'|'\r')* ('\r'? '\n')?
;
// The following synthetic tokens are used for the indentation-aware blocks
terminal BEGIN: 'synthetic:BEGIN'; // increase indentation
terminal END: 'synthetic:END'; // decrease indentation
terminal WS : (' '|'\t'|'\r'|'\n')+;
terminal ANY_OTHER: .;

Antlr4 token existence messing up parsing

first time poster so my greatest apologies if I break the rules.
I'm using Antlr4 to create a log parser and I'm running into some issues that I don't understand.
I'm trying to parse the following input log sequence:
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
With the following grammar:
grammar Juniper;
WS : (' '|'\t')+ -> skip ;
NL : '\r'? '\n' -> skip ;
fragment DIGIT : '0'..'9' ;
NUMBER : DIGIT+ ;
IPADDRESS : NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ;
SLASH : '/' -> skip ;
RIGHTARROW : '->' -> skip ;
CREATED: 'created' -> skip ;
HOSTNAME : [a-zA-Z0-9\-]+ ;
/* Input sample for rule: USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443 */
testcase : HOSTNAME WS CREATED WS IPADDRESS SLASH NUMBER RIGHTARROW IPADDRESS SLASH NUMBER NL;
It's failing and I can't for the life of me figure out why. I know that the token recognition error has something to do with the token that I've defined for HOSTNAME containing the dash in the character class but I'm not sure how to fix it.
$ antlr4 Juniper.g4 && javac Juniper*.java && grun Juniper testcase -tree
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
line 1:48 token recognition error at: '>'
line 1:30 mismatched input '10.20.30.40' expecting WS
(testcase SA1-RR-SRX240-EDGE-01 10.20.30.40 50985- 11.12.13.14 443)
Please note the second line of the above output is data that I paste into grun and then hit enter and hit control+D.
Any assistance on this would be highly appreciated, been banging me head against the keyboard on this for a bit now.
The problem with recognizing -> is that HOSTNAME matches any sequence of letters, numbers and dashes, and that includes 50985-. Since that match is longer than what NUMBER would match (50985), HOSTNAME wins. That's evidently not what you want.
Parsing log lines generally requires a context-sensitive scanner, and standard parser generators -- which are more oriented towards parsing programming languages -- are not always the ideal tool. In this case, for example, HOSTNAME cannot appear in the context in which it is being recognized, so it shouldn't even be in the list of possible tokens.
Of course, you could define a token which consisted of an ip number and port separated by a slash, which would solve the ambiguity, but (in my opinion) that would be suboptimal because you'll end up rescanning that token to parse it.

Antlr4 ignores tokens

In ANTLR 4 I try to parse a text file, but some of my defined tokens are constantly ignored in favor of others. I produced a small example to show what I mean:
File to parse:
hello world
hello world
Grammar:
grammar TestLexer;
file : line line;
line : 'hello' ' ' 'world' '\n';
LINE : ~[\n]+? '\n';
The ANTLR book explains that 'hello' would become an implicit token, which is placed before the LINE token, and that token order matters. So I'd expect that the parser would NOT match the LINE token, but it does, as the resulting tree shows:
How can I fix this, so that I get the actual implicit tokens?
Btw. I also tried to write explicit tokens before LINE, but that didn't change anything.
Found it myself:
It seems that ANTLR chooses longest tokens first.
So since LINE would always match a whole line it is always preferred.
To still include some "joker" token into a grammar it should be a single symbol.
In my case
grammar TestLexer;
file : line line;
line : 'hello' ' ' 'world' '\n';
LINE : ~[\n];
would work.

Grammar for a recognizer of a spice-like language

I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master thesis where the student has done the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm basing myself in this work to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1/n", but cannot interpret an input like "R1 5 0 1/n", because it maps "1" to the token NODE, instead of mapping it to the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone has an idea of how can I map the "1" to the correct token VALUE, or a suggestion of how can I alter the grammar so that I can correctly interpret the input?
The second problem is the presence of a comment at the end of a line. Because the NEWLINE token delimits: (1) the end of a comment; and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary to the parser correctly recognize the line of code, otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as much characters as possible. In case two tokens match the same amount of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
E.g., I removed VALUE, added value and removed fragment from REAL
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.

Antlr4 lexer predicate needed?

I try to parse this piece of text
:20: test :254:
aapje
:21: rest
...
:20: and :21: are special tags, because they start the line. :254: should be 'normal' text, as it does not start on a newline.
I would like the result to be
(20, 'test :254: \naapje')
(21, 'rest')
Lines are terminated using either \r\n or '\n'
I started out trying to ignore the whitespace, but then I match the ':254:' tag as well. So I have to create something that uses the whitespace information.
What I would like to be able to do is something like this:
lexer grammar MT9740_lexer;
InTagNewLine : '\r\n' ~':';
ReadNewLine :'\r\n' ;
But the first would consume the : How can I still generate these tokens? Or is there a smarted approach?
What I understand is that you're looking for some lexer rules that match the start of a line. This lexer rule should tokenize your :20: or :21: appearing at the start of a line only
SOL : {getCharPositionInLine() == 0}? ':' [0-9]+ ':' ;
Your parser rules can then look for this SOL token before parsing the rest of the line.

Resources