Antlr4 lexer predicate needed? - parsing

I try to parse this piece of text
:20: test :254:
aapje
:21: rest
...
:20: and :21: are special tags, because they start the line. :254: should be 'normal' text, as it does not start on a newline.
I would like the result to be
(20, 'test :254: \naapje')
(21, 'rest')
Lines are terminated using either \r\n or '\n'
I started out trying to ignore the whitespace, but then I match the ':254:' tag as well. So I have to create something that uses the whitespace information.
What I would like to be able to do is something like this:
lexer grammar MT9740_lexer;
InTagNewLine : '\r\n' ~':';
ReadNewLine :'\r\n' ;
But the first would consume the : How can I still generate these tokens? Or is there a smarted approach?

What I understand is that you're looking for some lexer rules that match the start of a line. This lexer rule should tokenize your :20: or :21: appearing at the start of a line only
SOL : {getCharPositionInLine() == 0}? ':' [0-9]+ ':' ;
Your parser rules can then look for this SOL token before parsing the rest of the line.

Related

How to match [BOF]"Begin of file" in Antlr4 Lexer?

In one Antlr4 syntax, I need the comment (// xxxx) to be always at the start of a line.
The following grammar works fine for most cases.
grammar com;
comment: COMMENT;
COMMENT
: '\n' '//' .*? '\n'
;
By design, it will match \n//comment\n but not //comment\n. But I also want it to match <BOF>//comment\n. How can I implement it?
You may find that this edit is better handled post-parsing, in a semantic validation pass of your parseTree. (NOTE: It's not a requirement that a parser ONLY recognize valid input, just that it correctly interprets the only way to understand that input.)
For example, does // might be a comment have some other, alternate interpretation if it's not at the beginning of the line?
If not, I would probably just accept the // comment ...\n as a token regardless of it's position in the line.
Then, once you have the parse tree, you can check that you comments always have a column of 0. Doing it this way, your grammar is not tied to a particular target language, and, perhaps more importantly, you can give a "nice" error message like "Comments must begin in the first column of a line".
If you try to handle this in the Lexer (or parser), then, if it's NOT in the correct column, you'll get a much more obtuse recognition error that will be more difficult for users to understand.
That is not possible in a language agnostic way. You will have to add target specific code in your grammar and use a predicate to check if the char position is 0:
COMMENT
: {getCharPositionInLine() == 0}? '//' ~[\r\n]*
;
OTHER
: .
;
If you now tokenize the input:
// start
// middle
?//...
// end
with the Java code:
String input = "// start\n// middle\n?//...\n// end";
comLexer lexer = new comLexer(CharStreams.fromString(input));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
System.out.printf("%-10s'%s'%n",
comLexer.VOCABULARY.getSymbolicName(t.getType()),
t.getText().replace("\n", "\\n"));
}
the following will be printed to your console:
COMMENT '// start'
OTHER '\n'
COMMENT '// middle'
OTHER '\n'
OTHER '?'
OTHER '/'
OTHER '/'
OTHER '.'
OTHER '.'
OTHER '.'
OTHER '\n'
COMMENT '// end'
EOF '<EOF>'
Note that I also removed the \n at the end of the COMMENT, otherwise a comment at the end of the input would not be matched.
EDIT
How I can do it with JavaScript? I cannot find good examples on internet.
By looking at the Javascript source, it looks like {this.column === 0}? is the Javascript equivalent of {getCharPositionInLine() == 0}?
By the way, does the Intellij Plugin support predict? If it does, does it support only Java?
No, the IntelliJ plugin ignores predicates. After all, the code inside a predicate can be any arbitrary chunk of code, making it quite hard to support.

How to get a quote-enclosed string and discard the quote itself

I would like to parse text enclosed in single quotes as just the text itself. For example:
"hello" --> hello
I'm able to parse the string with the quotes in antlr using the following rule:
grammar Test;
root
: string EOF
;
string
: QUOTE WORD QUOTE
;
WORD
: [a-zA-Z0-9-]+
;
QUOTE
: '"'
;
And with the input text "mrrogers":
However I'm not sure how to 'discard' the S_QUOTE values, I've tried doing the following two items and it seems like I'm on the wrong course:
fragment QUOTE
: '\''
;
And:
QUOTE
: '\'' -> skip
;
What would be the proper way to do this?
IMO it's better to just keep the quotes in the token and eliminate them at the stage where you need the text. This also allows you to handle special case, like the conversion of double-quotes (two consecutive quote char) to single quotes or handle escape codes (if this is something you want to support).
If you separate your grammar into a Lexer grammar and a Parser grammar, you can use lexical modes to control which tokens are emitted for use by the parser.
As #MikeCargal has intimated, this is a simplistic solution but may help you see how your grammars may be structured.
TestLexer.g4
lexer grammar TestLexer;
tokens { WORD }
NEWLINE
: [\n\r]
->channel(HIDDEN)
;
QUOTE
: '"'
->pushMode(STRING_MODE),channel(HIDDEN)
;
mode STRING_MODE;
WORD
: [a-zA-Z0-9-]+
;
STRING_MODE_QUOTE
: '"'
->popMode,channel(HIDDEN)
;
The tokens { WORD } is an instruction to ANTLR to include the WORD token in the generated TestLexer.tokens file. This makes it visible to the parser.
The pushMode(STRING_MODE) is a way to have the lexer only emit tokens defined in the section mode STRING_MODE. ANTLR maintains a stack of modes, lexing starts out in DEFAULT_MODE and each pushMode() pushes a new mode onto the stack as the current mode governing which tokens will be emitted. Each popMode pops the current mode off the stack and the next one on the stack takes precedence.
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
root
: string EOF
;
string
: WORD
;
These are very simplistic rules for a String type. How would you handle a string that should contain a ', for example. (Maybe you're string content is as simple as your word rule, but it looks more like this is just a starting point.
It will probably serve you well to take a look at the String rules in a grammar for a language with strings similar to what you want to allow. (You can find many grammars here. (It's pretty common to need to use Lexer modes to properly parse a String token)
I think you'll find that you need to capture the initial and terminal ' (or ") characters in the Lexer rule, so they will, necessarily be a part of the token. It's really trivial to strip thee first and last character from your token to get the string content from the token in your ParseTree.

Antlr4 token existence messing up parsing

first time poster so my greatest apologies if I break the rules.
I'm using Antlr4 to create a log parser and I'm running into some issues that I don't understand.
I'm trying to parse the following input log sequence:
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
With the following grammar:
grammar Juniper;
WS : (' '|'\t')+ -> skip ;
NL : '\r'? '\n' -> skip ;
fragment DIGIT : '0'..'9' ;
NUMBER : DIGIT+ ;
IPADDRESS : NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ;
SLASH : '/' -> skip ;
RIGHTARROW : '->' -> skip ;
CREATED: 'created' -> skip ;
HOSTNAME : [a-zA-Z0-9\-]+ ;
/* Input sample for rule: USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443 */
testcase : HOSTNAME WS CREATED WS IPADDRESS SLASH NUMBER RIGHTARROW IPADDRESS SLASH NUMBER NL;
It's failing and I can't for the life of me figure out why. I know that the token recognition error has something to do with the token that I've defined for HOSTNAME containing the dash in the character class but I'm not sure how to fix it.
$ antlr4 Juniper.g4 && javac Juniper*.java && grun Juniper testcase -tree
USA1-RR-SRX240-EDGE-01 created 10.20.30.40/50985->11.12.13.14/443
line 1:48 token recognition error at: '>'
line 1:30 mismatched input '10.20.30.40' expecting WS
(testcase SA1-RR-SRX240-EDGE-01 10.20.30.40 50985- 11.12.13.14 443)
Please note the second line of the above output is data that I paste into grun and then hit enter and hit control+D.
Any assistance on this would be highly appreciated, been banging me head against the keyboard on this for a bit now.
The problem with recognizing -> is that HOSTNAME matches any sequence of letters, numbers and dashes, and that includes 50985-. Since that match is longer than what NUMBER would match (50985), HOSTNAME wins. That's evidently not what you want.
Parsing log lines generally requires a context-sensitive scanner, and standard parser generators -- which are more oriented towards parsing programming languages -- are not always the ideal tool. In this case, for example, HOSTNAME cannot appear in the context in which it is being recognized, so it shouldn't even be in the list of possible tokens.
Of course, you could define a token which consisted of an ip number and port separated by a slash, which would solve the ambiguity, but (in my opinion) that would be suboptimal because you'll end up rescanning that token to parse it.

Antlr4 ignores tokens

In ANTLR 4 I try to parse a text file, but some of my defined tokens are constantly ignored in favor of others. I produced a small example to show what I mean:
File to parse:
hello world
hello world
Grammar:
grammar TestLexer;
file : line line;
line : 'hello' ' ' 'world' '\n';
LINE : ~[\n]+? '\n';
The ANTLR book explains that 'hello' would become an implicit token, which is placed before the LINE token, and that token order matters. So I'd expect that the parser would NOT match the LINE token, but it does, as the resulting tree shows:
How can I fix this, so that I get the actual implicit tokens?
Btw. I also tried to write explicit tokens before LINE, but that didn't change anything.
Found it myself:
It seems that ANTLR chooses longest tokens first.
So since LINE would always match a whole line it is always preferred.
To still include some "joker" token into a grammar it should be a single symbol.
In my case
grammar TestLexer;
file : line line;
line : 'hello' ' ' 'world' '\n';
LINE : ~[\n];
would work.

Grammar for a recognizer of a spice-like language

I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master thesis where the student has done the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm basing myself in this work to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1/n", but cannot interpret an input like "R1 5 0 1/n", because it maps "1" to the token NODE, instead of mapping it to the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone has an idea of how can I map the "1" to the correct token VALUE, or a suggestion of how can I alter the grammar so that I can correctly interpret the input?
The second problem is the presence of a comment at the end of a line. Because the NEWLINE token delimits: (1) the end of a comment; and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary to the parser correctly recognize the line of code, otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as much characters as possible. In case two tokens match the same amount of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
E.g., I removed VALUE, added value and removed fragment from REAL
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.

Resources