Antlr4 token existence messing up parsing - parsing

first time poster so my greatest apologies if I break the rules.
I'm using Antlr4 to create a log parser and I'm running into some issues that I don't understand.
I'm trying to parse the following input log sequence:
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
With the following grammar:
grammar Juniper;
WS : (' '|'\t')+ -> skip ;
NL : '\r'? '\n' -> skip ;
fragment DIGIT : '0'..'9' ;
NUMBER : DIGIT+ ;
IPADDRESS : NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ;
SLASH : '/' -> skip ;
RIGHTARROW : '->' -> skip ;
CREATED: 'created' -> skip ;
HOSTNAME : [a-zA-Z0-9\-]+ ;
/* Input sample for rule: USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443 */
testcase : HOSTNAME WS CREATED WS IPADDRESS SLASH NUMBER RIGHTARROW IPADDRESS SLASH NUMBER NL;
It's failing and I can't for the life of me figure out why. I know that the token recognition error has something to do with the token that I've defined for HOSTNAME containing the dash in the character class but I'm not sure how to fix it.
$ antlr4 Juniper.g4 && javac Juniper*.java && grun Juniper testcase -tree
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
line 1:48 token recognition error at: '>'
line 1:30 mismatched input '10.20.30.40' expecting WS
(testcase SA1-RR-SRX240-EDGE-01 10.20.30.40 50985- 11.12.13.14 443)
Please note the second line of the above output is data that I paste into grun and then hit enter and hit control+D.
Any assistance on this would be highly appreciated, been banging me head against the keyboard on this for a bit now.

The problem with recognizing -> is that HOSTNAME matches any sequence of letters, numbers and dashes, and that includes 50985-. Since that match is longer than what NUMBER would match (50985), HOSTNAME wins. That's evidently not what you want.
Parsing log lines generally requires a context-sensitive scanner, and standard parser generators -- which are more oriented towards parsing programming languages -- are not always the ideal tool. In this case, for example, HOSTNAME cannot appear in the context in which it is being recognized, so it shouldn't even be in the list of possible tokens.
Of course, you could define a token which consisted of an ip number and port separated by a slash, which would solve the ambiguity, but (in my opinion) that would be suboptimal because you'll end up rescanning that token to parse it.

Related

How to match [BOF]"Begin of file" in Antlr4 Lexer?

In one Antlr4 syntax, I need the comment (// xxxx) to be always at the start of a line.
The following grammar works fine for most cases.
grammar com;
comment: COMMENT;
COMMENT
: '\n' '//' .*? '\n'
;
By design, it will match \n//comment\n but not //comment\n. But I also want it to match <BOF>//comment\n. How can I implement it?
You may find that this edit is better handled post-parsing, in a semantic validation pass of your parseTree. (NOTE: It's not a requirement that a parser ONLY recognize valid input, just that it correctly interprets the only way to understand that input.)
For example, does // might be a comment have some other, alternate interpretation if it's not at the beginning of the line?
If not, I would probably just accept the // comment ...\n as a token regardless of it's position in the line.
Then, once you have the parse tree, you can check that you comments always have a column of 0. Doing it this way, your grammar is not tied to a particular target language, and, perhaps more importantly, you can give a "nice" error message like "Comments must begin in the first column of a line".
If you try to handle this in the Lexer (or parser), then, if it's NOT in the correct column, you'll get a much more obtuse recognition error that will be more difficult for users to understand.
That is not possible in a language agnostic way. You will have to add target specific code in your grammar and use a predicate to check if the char position is 0:
COMMENT
: {getCharPositionInLine() == 0}? '//' ~[\r\n]*
;
OTHER
: .
;
If you now tokenize the input:
// start
// middle
?//...
// end
with the Java code:
String input = "// start\n// middle\n?//...\n// end";
comLexer lexer = new comLexer(CharStreams.fromString(input));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
System.out.printf("%-10s'%s'%n",
comLexer.VOCABULARY.getSymbolicName(t.getType()),
t.getText().replace("\n", "\\n"));
}
the following will be printed to your console:
COMMENT '// start'
OTHER '\n'
COMMENT '// middle'
OTHER '\n'
OTHER '?'
OTHER '/'
OTHER '/'
OTHER '.'
OTHER '.'
OTHER '.'
OTHER '\n'
COMMENT '// end'
EOF '<EOF>'
Note that I also removed the \n at the end of the COMMENT, otherwise a comment at the end of the input would not be matched.
EDIT
How I can do it with JavaScript? I cannot find good examples on internet.
By looking at the Javascript source, it looks like {this.column === 0}? is the Javascript equivalent of {getCharPositionInLine() == 0}?
By the way, does the Intellij Plugin support predict? If it does, does it support only Java?
No, the IntelliJ plugin ignores predicates. After all, the code inside a predicate can be any arbitrary chunk of code, making it quite hard to support.

How to understand ANTLRWorks 1.5.2 MismatchedTokenException(80!=21)

I'm testing a simple grammar (shown below) with simple input strings and get the following error message from the Antlrworks interpreter: MismatchedTokenException(80!=21).
My input (abc45{r24}) means "repeat the keys a, b, c, 4 and 5, 24 times."
ANTLRWorks 1.5.2 Grammar:
expr : '(' (key)+ repcount ')' EOF;
key : KEY | digit ;
repcount : '{' 'r' count '}';
count : (digit)+;
digit : DIGIT;
DIGIT : '0'..'9';
KEY : ('a'..'z'|'A'..'Z') ;
Inputs:
(abc4{r4}) - ok
(abc44{r4}) - fails NoViableAltException
(abc4 4{r4}) - ok
(abc4{r45}) - fails MismatchedTokenException(80!=21)
(abc4{r4 5}) - ok
The parse succeeds with input (abc4{r4}) (single digits only).
The parse fails with input (abc44{r4}) (NoViableAltException).
The parse fails with input (abc4{r45}) (MismatchedTokenException(80!=21)).
The parse errors go away if I put a space between 44 or 45 to separate the individual digits.
Q1. What does NoViableAltException mean? How can I interpret it to look for a problem in the grammar/input pair?
Q2. What does the expression 80!=21 mean? Can I do anything useful with the information to look for a problem in the grammar/input pair?
I don't understand why the grammar has a problem reading successive digits. I thought my expressions (key)+ and (digit)+ specify that successive digits are allowed and would be read as successive individual digits.
If someone could explain what I'm doing wrong, I would be grateful. This seems like a simple problem, but hours later, I still don't understand why and how to solve it. Thank you.
UPDATE:
Further down in my simple grammar file I had a lexer rule for FLOAT copied from another grammar. I did not think to include it above (or check it as a source of the errors) because it was not used by any parser rule and would never match my input characters. Here is the FLOAT grammar rule (which contains sequences of DIGITs):
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
If I delete the whole rule, all my test cases above parse successfully. If I leave any one of the three FLOAT clauses in the grammar/lexer file, the parses fail as shown above.
Q3. Why does the FLOAT rule cause failures in the parse? The DIGIT lexer rule appears first, and so should "win" and be used in preference to the FLOAT rule. Besides, the FLOAT rule doesn't match the input stream.
I hazard a guess that the lexer is skipping the DIGIT rule getting stuck in the FLOAT rule, even though FLOAT comes after DIGIT in the input file.
SCREENSHOTS
I took these two screenshots after Bart's comment below to show the parse failures that I am experiencing. Not that it matters, but ANTLRWorks 1.5.2 will not accept the syntax SPACE : [ \t\r\n]+; regular expression syntax in Bart's kind replies. Maybe the screenshots will help. They show all the rules in my grammar file.
The only difference in the two screenshots is that one input has two sets of multiple digits and the other input string has only set of multiple digits. Maybe this extra info will help somehow.
If I remember correctly, ANTLR's v3 lexer is less powerful than v4's version. When the lexer gets the input "123x", this first 3 chars (123) are consumed by the lexer rule FLOAT, but after that, when the lexer encounters the x, it knows it cannot complete the FLOAT rule. However, the v3 lexer does not give up on its partial match and tries to find another rule, below it, that matches these 3 chars (123). Since there is no such rule, the lexer throws an exception. Again, not 100% sure, this is how I remember it.
ANTLRv4's lexer will give up on the partial 123 match and will return 23 to the char stream to create a single KEY token for the input 1.
I highly suggest you move away from v3 and opt for the more powerful v4 version.

Does -> skip change the behavior of the lexer rule precedence?

I am writing a grammar to parse a configuration export file from a closed system. when a parameter identified in the export file has a particularly long string value assigned to it, the export file inserts "\r\n\t" (double quotes included) every so often in the value. In the file I'll see something like:
"stuff""morestuff""maybesomemorestuff"\r\n\t"morestuff""morestuff"...etc."
In that line, "" is the way the export file escapes a " that is part of the actual string value - vs. a single " which indicates the end of the string value.
my current approach for the grammar to get this string value to the parser is to grab "stuff" as a token and \r\n\t as a token. So I have rules like:
quoted_value : (QUOTED_PART | QUOTE_SEPARATOR)+ ;
QUOTED_PART : '"' .*? '"';
QUOTE_SEPARATOR : '\r\n\t';
WS : [ \t\r\n] -> skip; //note - just one char at a time
I get no errors when I lex or parse a sample string. However, in the token stream - no QUOTE_SEPARATOR tokens show up and there is literally nothing in the stream where they should have been.
I had expected that since QUOTE_SEPARATOR is longer than WS and that it is first in the grammar that it would be selected, but it is behaving as if WS was matched and the characters were skipped and not send to the token string.
Does the -> skip do something to change how rule precedence works?
I am also open to a different approach to the lexing that completely removes the "\r\n\t" (all five characters) - this way just seemed easier and it should be easy enough for the program that will process the parse tree to deal with as other manipulations to the data will be done there anyway (my first grammar - teach me;) ).
No, skip does not affect rule precedence.
Change the QUOTE_SEPARATOR rule to
QUOTE_SEPARATOR : '\\r\\n\\t' ;
in order to match the actual textual content of the source string.

Grammar for a recognizer of a spice-like language

I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master thesis where the student has done the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm basing myself in this work to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1/n", but cannot interpret an input like "R1 5 0 1/n", because it maps "1" to the token NODE, instead of mapping it to the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone has an idea of how can I map the "1" to the correct token VALUE, or a suggestion of how can I alter the grammar so that I can correctly interpret the input?
The second problem is the presence of a comment at the end of a line. Because the NEWLINE token delimits: (1) the end of a comment; and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary to the parser correctly recognize the line of code, otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as much characters as possible. In case two tokens match the same amount of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
E.g., I removed VALUE, added value and removed fragment from REAL
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.

Antlr4 lexer predicate needed?

I try to parse this piece of text
:20: test :254:
aapje
:21: rest
...
:20: and :21: are special tags, because they start the line. :254: should be 'normal' text, as it does not start on a newline.
I would like the result to be
(20, 'test :254: \naapje')
(21, 'rest')
Lines are terminated using either \r\n or '\n'
I started out trying to ignore the whitespace, but then I match the ':254:' tag as well. So I have to create something that uses the whitespace information.
What I would like to be able to do is something like this:
lexer grammar MT9740_lexer;
InTagNewLine : '\r\n' ~':';
ReadNewLine :'\r\n' ;
But the first would consume the : How can I still generate these tokens? Or is there a smarted approach?
What I understand is that you're looking for some lexer rules that match the start of a line. This lexer rule should tokenize your :20: or :21: appearing at the start of a line only
SOL : {getCharPositionInLine() == 0}? ':' [0-9]+ ':' ;
Your parser rules can then look for this SOL token before parsing the rest of the line.

Resources