I am using ANTLR4 to try and parse the following text:
ex1, ex2: examples
var1,var2,var3: variables
Since the second line does not have whitespace after the commas, it doesn't parse correctly. If I add in the whitespace, then it works. The rules I am currently using to parse this:
line : list ':' name;
list : listitem (',' listitem)*;
listitem : [a-zA-Z0-9]+;
name : [a-zA-Z0-9]+;
This works perfectly for lines like line 1 but fails on lines like line 2. If there are parentheses or pretty much any other punctuation, the parser wants some whitespace after the punctuation, and I can't always guarantee that about the input.
Does anyone know how to fix this?
First add explicit lexer rules (starting with a capital letter). Then add a lexer rule for whitespace and ignore the whitespace:
line : list ':' name;
list : listitem (',' listitem)*;
listitem : Identifier;
name : Identifier;
Identifier : [a-zA-Z0-9]+; // only one lexer rule for name and listitem, since an Identifier may be a name or a listitem depending only on its position
WhiteSpace : (' '|'\t') -> skip;
NewLine : ('\r'?'\n'|'\r') -> skip; // or don't skip if you need it as a statement terminator
I would like to parse text enclosed in single quotes as just the text itself. For example:
"hello" --> hello
I'm able to parse the string with the quotes in antlr using the following rule:
grammar Test;
root
: string EOF
;
string
: QUOTE WORD QUOTE
;
WORD
: [a-zA-Z0-9-]+
;
QUOTE
: '"'
;
And with the input text "mrrogers":
However I'm not sure how to 'discard' the S_QUOTE values, I've tried doing the following two items and it seems like I'm on the wrong course:
fragment QUOTE
: '\''
;
And:
QUOTE
: '\'' -> skip
;
What would be the proper way to do this?
IMO it's better to just keep the quotes in the token and eliminate them at the stage where you need the text. This also allows you to handle special cases, like converting doubled quotes (two consecutive quote chars) to a single quote, or handling escape codes (if this is something you want to support).
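To make the "strip them later" approach concrete, here is a minimal Java sketch. The names QuotedText and unquote are made up for illustration and are not part of ANTLR, and it assumes doubled quotes are the only escape you want to support:

```java
final class QuotedText {
    private QuotedText() {}

    /**
     * Strips one pair of surrounding quote characters and collapses doubled
     * quotes inside the literal. Text that is not wrapped in the given quote
     * character is returned unchanged.
     * (Hypothetical helper; you would call it on the token text in your
     * listener or visitor.)
     */
    static String unquote(String tokenText, char quote) {
        int n = tokenText.length();
        if (n < 2 || tokenText.charAt(0) != quote || tokenText.charAt(n - 1) != quote) {
            return tokenText;
        }
        String q = String.valueOf(quote);
        // Drop the first and last character, then turn each doubled quote into one.
        return tokenText.substring(1, n - 1).replace(q + q, q);
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"hello\"", '"'));          // prints: hello
        System.out.println(unquote("\"say \"\"hi\"\"\"", '"')); // prints: say "hi"
    }
}
```

Keeping this logic outside the grammar means the grammar itself stays free of target-language code.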
If you separate your grammar into a Lexer grammar and a Parser grammar, you can use lexical modes to control which tokens are emitted for use by the parser.
As @MikeCargal has intimated, this is a simplistic solution but may help you see how your grammars may be structured.
TestLexer.g4
lexer grammar TestLexer;
tokens { WORD }
NEWLINE
: [\n\r]
->channel(HIDDEN)
;
QUOTE
: '"'
->pushMode(STRING_MODE),channel(HIDDEN)
;
mode STRING_MODE;
WORD
: [a-zA-Z0-9-]+
;
STRING_MODE_QUOTE
: '"'
->popMode,channel(HIDDEN)
;
The tokens { WORD } is an instruction to ANTLR to include the WORD token in the generated TestLexer.tokens file. This makes it visible to the parser.
The pushMode(STRING_MODE) is a way to have the lexer only emit tokens defined in the section mode STRING_MODE. ANTLR maintains a stack of modes: lexing starts out in DEFAULT_MODE, and each pushMode() pushes a new mode onto the stack as the current mode governing which tokens will be emitted. Each popMode pops the current mode off the stack, and the next one on the stack takes over.
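If the mode stack feels abstract, the following hand-written Java sketch imitates what the lexer grammar above does for this one input shape. It is only an illustration of the idea, not how ANTLR is implemented; ModeStackSketch and visibleTokens are hypothetical names:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ModeStackSketch {
    enum Mode { DEFAULT, STRING }

    /** Returns the text of the tokens a parser would see; hidden tokens are dropped. */
    static List<String> visibleTokens(String input) {
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);                  // lexing starts in the default mode
        List<String> visible = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (modes.peek() == Mode.DEFAULT) {
                if (c == '"') {
                    modes.push(Mode.STRING);       // like pushMode(STRING_MODE); the quote stays hidden
                } else if (c != '\n' && c != '\r') {
                    throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
                }
                i++;                               // newlines are hidden as well
            } else if (c == '"') {
                modes.pop();                       // like popMode; the closing quote stays hidden
                i++;
            } else if (isWordChar(c)) {
                int start = i;
                while (i < input.length() && isWordChar(input.charAt(i))) {
                    i++;
                }
                visible.add(input.substring(start, i));   // a WORD token
            } else {
                throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
            }
        }
        return visible;
    }

    // Same character set as the WORD rule: [a-zA-Z0-9-]
    private static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '-';
    }

    public static void main(String[] args) {
        System.out.println(visibleTokens("\"mrrogers\"\n")); // prints: [mrrogers]
    }
}
```

The point to notice is that the same character, ", means different things depending on which mode is on top of the stack.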
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
root
: string EOF
;
string
: WORD
;
These are very simplistic rules for a String type. How would you handle a string that should contain a ', for example? (Maybe your string content is as simple as your word rule, but it looks more like this is just a starting point.)
It will probably serve you well to take a look at the String rules in a grammar for a language with strings similar to what you want to allow. (You can find many grammars here.) It's pretty common to need to use Lexer modes to properly parse a String token.
I think you'll find that you need to capture the initial and terminal ' (or ") characters in the Lexer rule, so they will necessarily be a part of the token. It's really trivial to strip the first and last character from your token to get the string content from the token in your ParseTree.
In several projects I have run into a similar effect in my grammars.
I have the need to parse something like Key="Value"
So I create a grammar (simplest I could make to show the effect):
grammar test;
KEY : [a-zA-Z0-9]+ ;
VALUE : DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE ;
DOUBLEQUOTE : '"' ;
EQUALS : '=' ;
entry : key=KEY EQUALS value=VALUE;
I can now parse thing="One Two Three" and in my code I receive
key = thing
value = "One Two Three"
In all of my projects I end up with an extra step to strip those " from the value.
Usually something like this (I use Java)
String value = ctx.value.getText();
value = value.substring(1, value.length()-1);
In my real grammars I find it very hard to move the check of the surrounding " into the parser.
Is there a clean way to already drop the " by doing something in the lexer/parser?
Essentially I want ctx.value.getText() to return One Two Three instead of "One Two Three".
Update:
I have been playing with the excellent answer provided by Bart Kiers and found this variation which does exactly what I was looking for.
By putting the DOUBLEQUOTE on a hidden channel they are used by the lexer and hidden from the parser.
TestLexer.g4
lexer grammar TestLexer;
KEY : [a-zA-Z0-9]+;
DOUBLEQUOTE : '"' -> channel(HIDDEN), pushMode(STRING_MODE);
EQUALS : '=';
mode STRING_MODE;
STRING_DOUBLEQUOTE
: '"' -> channel(HIDDEN), type(DOUBLEQUOTE), popMode
;
STRING
: [ _a-zA-Z0-9.-]+
;
and
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
entry : key=KEY EQUALS value=STRING ;
Try this:
VALUE
: DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE
{setText(getText().substring(1, getText().length()-1));}
;
Needless to say: this ties your grammar to Java, and (depending on how much embedded Java code you have) your grammar will be hard to port to some other target language.
EDIT
Once a token is created, there is no built-in way to separate it (other than doing so in embedded actions, as I demonstrated). What you're looking for can be done, but that means rewriting your grammar so that a string literal is not constructed as a single token. This can be done by using lexical modes so that the string can be constructed in the parser.
A quick demo:
TestLexer.g4
lexer grammar TestLexer;
KEY : [a-zA-Z0-9]+;
DOUBLEQUOTE : '"' -> pushMode(STRING_MODE);
EQUALS : '=';
mode STRING_MODE;
STRING_DOUBLEQUOTE
: '"' -> type(DOUBLEQUOTE), popMode
;
STRING_ATOM
: [ _a-zA-Z0-9.-]
;
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
entry : key=KEY EQUALS value;
value : DOUBLEQUOTE string_atoms DOUBLEQUOTE;
string_atoms : STRING_ATOM*;
If you now run the Java code:
Lexer lexer = new TestLexer(CharStreams.fromString("Key=\"One Two Three\""));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
TestParser.EntryContext entry = parser.entry();
System.out.println(entry.value().string_atoms().getText());
this will be printed:
One Two Three
I am trying to create a Lexer/Parser with ANTLR that can parse plain text with 'tags' scattered in between.
These tags are denoted by opening ({) and closing (}) brackets and they represent Java objects that can evaluate to a string, that is then replaced in the original input to create a dynamic template of sorts.
Here is an example:
{player:name} says hi!
The {player:name} should be replaced by the name of the player and result in the output i.e. Mark says hi! for the player named Mark.
Now I can recognize and parse the tags just fine, what I have problems with is the text that comes after.
This is the grammar I use:
grammar : content+
content : tag
| literal
;
tag : player_tag
| <...>
| <other kinds of tags, not important for this example>
| <...>
;
player_tag : BRACKET_OPEN player_identifier SEMICOLON player_string_parameter BRACKET_CLOSE ;
player_string_parameter : NAME
| <...>
;
player_identifier : PLAYER ;
literal : NUMBER
| STRING
;
BRACKET_OPEN : '{';
BRACKET_CLOSE : '}';
PLAYER : 'player'
NAME : 'name'
NUMBER : <...>
STRING : (.+)? /* <- THIS IS THE PROBLEMATIC PART !*/
Now this STRING Lexer definition should match anything that is not an empty string but the problem is that it is too greedy and then also consumes the { } bracket tokens needed for the tag rule.
I have tried setting it to ~[{}]+ which is supposed to match anything that does not include the { } brackets but that screws with the tag parsing which I don't understand either.
I could set it to something like [ a-zA-Z0-9!"§$%&/()= etc...]+ but I really don't want to restrict it to parse only characters available on the British keyboard (German umlauts, French accents and all the other special characters other languages have must work!)
The only thing that somewhat works though I really dislike it is to force strings to have a prefix and a suffix like so:
STRING : '\'' ~[}{]+ '\'' ;
This forces me to alter the form from "{player:name} says hi!" to "{player:name}' says hi!'" and I really desperately want to avoid such restrictions because I would then have to account for literal ' characters in the string itself and it's just ugly to work with.
The two solutions I have in mind are the following:
- Is there any way to match any number of characters that has not been matched by the lexer as a STRING token and pass it to the parser? That way I could match all the tags and say the rest of the input is just plain text, give it back to me as a STRING token or whatever...
- Does ANTLR support lookahead and lookbehind regex expressions with which I could match any number of characters before the first '{', after the last '}' and anything in between '}' and '{' ?
I have tried
STRING : (?<=})(.+)?(?={) ;
but I can't seem to get the syntax right because that won't compile at all, which leads me to believe that ANTLR does not support lookahead and lookbehind syntax, but I could not find a definitive answer on the internet to that question.
Any advice on what to do?
Antlr does not support lookahead or lookbehind. It does support non-greedy wildcard matches, but only when the .*? non-greedy wildcard is followed in the rule by the termination sequence (which, as you say, is also contained in the match, although you could push it back into the input stream).
So ~[{}]* is correct. But there's a little problem: lexer rules are (normally) always active. So that lexer rule will be active inside the braces as well, which means that it will swallow the entire contents between the braces (unless there are nested braces or braces inside quotes or some such, and that's even worse).
So you need to define different lexical contexts, called "lexical modes" in Antlr. There's a publicly viewable example in the Antlr Definitive Reference, which shows a solution to a very similar problem: parsing HTML.
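As a rough picture of what two lexical contexts buy you, here is a hand-written Java sketch (not ANTLR, and not the book's HTML example) that scans the template with a "text" context outside braces and a "tag" context inside them. TemplateScanSketch and scan are hypothetical names, and the tag body is kept as one piece instead of being tokenized further:

```java
import java.util.ArrayList;
import java.util.List;

final class TemplateScanSketch {
    /** Splits a template into TEXT(...) and TAG(...) pieces. */
    static List<String> scan(String template) {
        List<String> tokens = new ArrayList<>();
        boolean insideTag = false;               // false = text context, true = tag context
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < template.length(); i++) {
            char c = template.charAt(i);
            if (!insideTag && c == '{') {
                // Leaving the text context: emit the text collected so far.
                if (current.length() > 0) {
                    tokens.add("TEXT(" + current + ")");
                }
                current.setLength(0);
                insideTag = true;
            } else if (insideTag && c == '}') {
                // Leaving the tag context: emit the tag body.
                tokens.add("TAG(" + current + ")");
                current.setLength(0);
                insideTag = false;
            } else {
                current.append(c);               // any other character, umlauts included
            }
        }
        if (insideTag) {
            throw new IllegalArgumentException("unclosed '{'");
        }
        if (current.length() > 0) {
            tokens.add("TEXT(" + current + ")");
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints: [TAG(player:name), TEXT( says hi!)]
        System.out.println(scan("{player:name} says hi!"));
    }
}
```

In an ANTLR lexer the '{' rule would switch into the tag mode and the '}' rule would switch back, so the "anything but braces" text rule is never active inside a tag.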
I am new to Antlr and parsing, so this is a learning exercise for me.
I am trying to parse a language that allows free-format text in some locations. The free-format text may therefore be ANY word or words, including the keywords in the language - their location in the language's sentences defines them as keywords or free text.
In the following example, the first instance of "JOB" is a keyword; the second "JOB" is free-form text:
JOB=(JOB)
I have tried the following grammar, which avoids defining the language's keywords in lexer rules.
grammar Test;
test1 : 'JOB' EQ OPAREN (utext) CPAREN ;
utext : UNQUOTEDTEXT ;
COMMA : ',' ;
OPAREN : '(' ;
CPAREN : ')' ;
EQ : '=' ;
UNQUOTEDTEXT : ~[a-z,()\'\" \r\n\t]*? ;
SPC : [ \t]+ -> skip ;
I was hoping that by defining the keywords as string literals in the parser rules, as above, they would apply only in the location in which they were defined. This appears not to be the case. On testing the "test1" rule (with the Antlr4 plug-in in IDEA), using the example phrase shown above - "JOB=(JOB)" (without quotes) - as input, I get the following error message:
line 1:5 mismatched input 'JOB' expecting UNQUOTEDTEXT
So after creating an implicit token for 'JOB', it looks like Antlr uses that token in other points in the parser grammar, too, i.e. whenever it sees the 'JOB' string. To test this, I added another parser rule:
test2 : 'DATA' EQ OPAREN (utext) CPAREN ;
and tested with "DATA=(JOB)"
I got the following error (similar to before):
line 1:6 mismatched input 'JOB' expecting UNQUOTEDTEXT
Is there any way to ask Antlr to enforce the token recognition in the locations only where it is defined/introduced?
Thanks!
What you have is essentially a lake grammar, the opposite of an island grammar. A lake grammar is one in which you mostly have structured text and then lakes of stuff you don't care about. Generally the key is having some lexical sentinel that says "enter unstructured text area" and then "reenter structured text area". In your case it seems to be (...). ANTLR has the notion of a lexical mode, which is what you want in order to handle areas with different lexical structures. When you see a '(' you want to switch modes to some free-form area. When you see a ')' in that area you want to switch back to the default mode. Anyway, "mode" is your key word here.
I had a similar problem with keywords that are sometimes only identifiers. I did it this way:
OnlySometimesAKeyword : 'value' ;
identifier
: Identifier // defined as usual
| maybeKeywords
;
maybeKeywords
: OnlySometimesAKeyword
// ...
;
In your parser rules simply use identifier instead of Identifier and you'll also be able to match the "maybe keywords". This will of course also match them in places where they will be keywords, but you could check this in the parser if necessary.
I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master's thesis where the student has done the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm basing my work on this thesis to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1\n", but cannot interpret an input like "R1 5 0 1\n", because it maps "1" to the token NODE, instead of mapping it to the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone have an idea of how I can map the "1" to the correct token VALUE, or a suggestion of how I can alter the grammar so that I can correctly interpret the input?
The second problem is the presence of a comment at the end of a line. The NEWLINE token delimits both (1) the end of a comment and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary for the parser to correctly recognize the line of code; otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as many characters as possible. In case two tokens match the same number of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
I.e., I removed VALUE, added value and removed fragment from REAL.
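The "longest match wins, ties go to the rule defined first" behaviour can be imitated in a few lines of plain Java. This is only a sketch of the tie-break idea, not ANTLR's actual algorithm; the class and method names are hypothetical, and the regular expressions are my hand translation of NODE and REAL with the fragments inlined:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatchSketch {
    /** Returns the name of the rule that wins at the start of the input, or null. */
    static String winningRule(Map<String, Pattern> rulesInOrder, String input) {
        String winner = null;
        int longest = 0;
        for (Map.Entry<String, Pattern> rule : rulesInOrder.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The strict '>' keeps the earlier rule when two matches have equal length.
            if (m.lookingAt() && m.end() > longest) {
                longest = m.end();
                winner = rule.getKey();
            }
        }
        return winner;
    }

    /** NODE before REAL, in the same order as in the grammar above. */
    static Map<String, Pattern> spiceRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("NODE", Pattern.compile("[0-9]+"));
        rules.put("REAL", Pattern.compile(
                "([0-9]+\\.[0-9]+|\\.[0-9]+|[0-9]+\\.|[0-9]+)([Ee][+-]?[0-9]+)?"));
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(winningRule(spiceRules(), "1"));   // prints: NODE (tie, first rule wins)
        System.out.println(winningRule(spiceRules(), "1.1")); // prints: REAL (longer match)
    }
}
```

This is why the parser rule value has to accept both NODE and REAL: the lexer decides on the token type before the parser gets any say.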
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.
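To see the effect outside of ANTLR, here is a small Java sketch using the regex counterpart of that rule. CommentSketch and stripComment are hypothetical names; the only claim is that a pattern excluding line breaks stops in front of the newline, so the newline is still there to end the line of code:

```java
import java.util.regex.Pattern;

final class CommentSketch {
    // Java-regex counterpart of: '%' ~('\r' | '\n')*
    static final Pattern COMMENT = Pattern.compile("%[^\\r\\n]*");

    /** Removes the first comment from the text and leaves any line break in place. */
    static String stripComment(String text) {
        return COMMENT.matcher(text).replaceFirst("");
    }

    public static void main(String[] args) {
        String line = "R1 5 0 1 % load resistor\n";
        // The trailing newline survives, so a single newline is enough to end the line.
        System.out.println(stripComment(line).endsWith("\n")); // prints: true
    }
}
```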