How do I fix my ANTLR Parser to separate comments from multiplication? - parsing

I'm using ANTLR4 to try to parse code that has asterisk-leading comments, like:
* This is a comment
I was initially having issues with multiplication expressions getting mistaken for these comments, so decided to make my lexer rule:
LINE_COMMENT : '\r\n' '*' ~[\r\n]* ;
This forces there to be a newline so it doesn't see 2 * 3, with '* 3' being a comment.
This worked just fine until I had code that starts with a comment on the first line, which does not have a newline to begin with. For example:
* This is the first line of the code's file\r\n
* This is the second line of the codes's file\r\n
I have also tried the {getCharPositionInLine==x}? to make sure that it only recognizes a comment if there is an asterisk or spaces/tabs coming first in the current line. This works when using
antlr4 *.g4
, but will not work with my JavaScript parser generated using
antlr4 -Dlanguage=JavaScript *.g4
Is there a way to get the same results of {getCharPositionInLine==x}? with my JavaScript parser or some way to prevent multiplication from being recognized as a comment? I should also mention that this coding language doesn't use semicolons at the end of lines.
I've tried playing around with this simple grammar, but I haven't had any luck.
grammar wow;
program : expression | Comment ;
expression : expression '*' expression
| NUMBER ;
Comment : '*' ~[\r\n]*;
NUMBER : [0-9]+ ;
Asterisk : '*' ;
Space : ' ' -> skip;
and using a test file: test.txt
5 * 5

Make the comment rule match at least one more non-whitespace character, otherwise it could match the same content as the Asterisk rule, like so:
Comment: '*' ' '* ~[\r\n]+;

Do comments have to be at the beginning of line?
If so you can check it with this._tokenStartCharPositionInLine == 0 and have lexer rule like this
Comment : '*' ~[\r\n]* {this._tokenStartCharPositionInLine == 0}?;
If not, you should gather information about previous tokens, which could allow us to have multiplication (for example your NUMBER rule), so you should write something like (java code)
#lexer::members {
private static final Set<Integer> MULTIPLIABLE_TOKENS = new HashSet<>();
static {
MULTIPLIABLE_TOKENS.add(NUMBER);
}
private boolean canBeMultiplied = false;
#Override
public void emit(final Token token) {
final int type = token.getType();
if (type != Whitespace && type != Newline) { // skip ws tokens from consideration
canBeMultiplied = MULTIPLIABLE_TOKENS.contains(type);
}
super.emit(token);
}
}
Comment : {!canBeMultiplied}? '*' ~[\r\n]*;
UPDATE
If you need function analogs for JavaScript, take a look into the sources -> Lexer.js

Related

Drop the required surrounding quotes in the lexer/parser

I several projects I have run into a similar effect in my grammars.
I have the need to parse something like Key="Value"
So I create a grammar (simplest I could make to show the effect):
grammar test;
KEY : [a-zA-Z0-9]+ ;
VALUE : DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE ;
DOUBLEQUOTE : '"' ;
EQUALS : '=' ;
entry : key=KEY EQUALS value=VALUE;
I can now parse thing="One Two Three" and in my code I receive
key = thing
value = "One Two Three"
In all of my projects I end up with an extra step to strip those " from the value.
Usually something like this (I use Java)
String value = ctx.value.getText();
value = value.substring(1, value.length()-1);
In my real grammars I find it very hard to move the check of the surrounding " into the parser.
Is there a clean way to already drop the " by doing something in the lexer/parser?
Essentially I want ctx.value.getText() to return One Two Three instead of "One Two Three".
Update:
I have been playing with the excellent answer provided by Bart Kiers and found this variation which does exactly what I was looking for.
By putting the DOUBLEQUOTE on a hidden channel they are used by the lexer and hidden from the parser.
TestLexer.g4
lexer grammar TestLexer;
KEY : [a-zA-Z0-9]+;
DOUBLEQUOTE : '"' -> channel(HIDDEN), pushMode(STRING_MODE);
EQUALS : '=';
mode STRING_MODE;
STRING_DOUBLEQUOTE
: '"' -> channel(HIDDEN), type(DOUBLEQUOTE), popMode
;
STRING
: [ _a-zA-Z0-9.-]+
;
and
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
entry : key=KEY EQUALS value=STRING ;
Try this:
VALUE
: DOUBLEQUOTE [ _a-zA-Z0-9.-]+ DOUBLEQUOTE
{setText(getText().substring(1, getText().length()-1));}
;
Needless to say: this ties your grammar to Java, and (depending how many embedded Java code you have) your grammar will be hard to port to some other target language.
EDIT
Once a token is created, there is no built-in way to separate it (other than doing so in embedded actions, as I demonstrated). What you're looking for can be done, but that means rewriting your grammar so that a string literal is not constructed as a single token. This can be done by using lexical modes so that the string can be constructed in the parser.
A quick demo:
TestLexer.g4
lexer grammar TestLexer;
KEY : [a-zA-Z0-9]+;
DOUBLEQUOTE : '"' -> pushMode(STRING_MODE);
EQUALS : '=';
mode STRING_MODE;
STRING_DOUBLEQUOTE
: '"' -> type(DOUBLEQUOTE), popMode
;
STRING_ATOM
: [ _a-zA-Z0-9.-]
;
TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
entry : key=KEY EQUALS value;
value : DOUBLEQUOTE string_atoms DOUBLEQUOTE;
string_atoms : STRING_ATOM*;
If you now run the Java code:
Lexer lexer = new TestLexer(CharStreams.fromString("Key=\"One Two Three\""));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
TestParser.EntryContext entry = parser.entry();
System.out.println(entry.value().string_atoms().getText());
this will be printed:
One Two Three

How can I get mandatory whitespace in a specific rule while having other whitespace ignored?

The relevant part of my grammar is structured like this:
someRule: subrule1 | WS sign=('+' | '-') subrule2 ; // whitespace required here
// ... etc
WS: [ \t\r\n]+ -> channel(HIDDEN) ; // whitespace is usually ignored
I want to ignore whitespace, but require it on a specific rule. I'm pretty sure there was a way to do it in a previous ANTLR version (though I don't remember exactly, I think there was a syntax allowing to not hide them on a specific rule). I don't know how to do it in ANTLR4, of if it can be done at all without using language-specific actions.
I thought about making WS a parser rule somehow, but I don't think that's the right approach...
(and obviously I don't want to put WS? everywhere in the grammar)
Is there a (preferably language-independent) way to either (a) ensure that a specific point has whitespace, or (b) ensure both ends on a specific point are not "touching" on that channel, or (c) selectively choose the WS channel (default or hidden) depending on which rule it appears in somehow?
I'm guessing (c) is impossible and (a|b) would require language-dependent actions, unless I'm missing something?
I don't believe there is any way to have parser rules evaluate tokens on the HIDDEN channel (or any channel other than 0). Maybe I'm missing something but i couldn't find it.
The question I can't answer from your excerpts is whether there is another parser rule that should match if there is NOT a WS before your sign. That makes a big difference.
I tend to think of a successful grammar as one that will produce a parse tree that represents the correct way to interpret the input stream. IMHO, too many people complicate grammars by trying to encode "all the rules" into the grammar. If you have an accurate tree of the only way to interpret the input (whether it's "error free" or not), then you can write a Listener (maybe a visitor) that visits the tree and performs edits for additional rules (such as "the 'sign' much be preceded by whitespace).
This accomplishes a couple of things:
keeps the grammar simpler
allows you to be very specific in your error messages.
ANTLR is pretty good about error messages, for what information it has, but "expected WS, but saw '+'", is just not going to be as good an error message as "signs must follow whitespace".
With that in mind, you can get to the HIDDEN channel inside a listener.
First of all you'll need to make the token Stream available in your Listener:
class TestListener extends TestBaseListener {
BufferedTokenStream tokens;
public TestListener(BufferedTokenStream tokens) {
this.tokens = tokens;
}
// ...
}
and pass it to the constructor of your listener:
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestListener listener = new TestListener(tokens);
then, in the enter* method for whatever rule you need to add this to, you can do something like the following:
const int HIDDEN = 1;
#Override
public void enterAddSub(TxlParser.AddSubContext ctx) {
Token op = ctx.op;
int opIndex = op.getTokenIndex();
List<Token> hiddenChannel = tokens.getHiddenTokensToLeft(opIndex, HIDDEN);
if (hiddenChannel != null) {
Token ws = hiddenChannel.get(0);
if (ws != null) {
System.out.println("Found Ws (" + ws.getText() + ")");
}
} else {
System.out.println("There was no WS to the left of the operator");
// Your code here to add an error
}
}
for reference, this was the rule for I used with AddSub
expr:
expr (MULT | DIV) expr # MulDiv
| lExpr = expr op = (PLUS | MINUS) rExpr = expr # AddSub
// ...
;
If I run this with input a=x+y I get:
There was no WS to the left of the operator
But the input a=x +y gives me:
Found Ws ( )

ANTLR: Help on Lexing Errors for a custom grammar example

What approach would allow me to get the most on reporting lexing errors?
For a simple example I would like to write a grammar for the following text
(white space is ignored and string constants cannot have a \" in them for simplicity):
myvariable = 2
myvariable = "hello world"
Group myvariablegroup {
myvariable = 3
anothervariable = 4
}
Catching errors with a lexer
How can you maximize the error reporting potential of a lexer?
After reading this post: Where should I draw the line between lexer and parser?
I understood that the lexer should match as much as it can with regards to the parser grammar but what about lexical error reporting strategies?
What are the ordinary strategies for catching lexing errors?
I am imagining a grammar which would have the following "error" tokens:
GROUP_OPEN: 'Group' WS ID WS '{';
EMPTY_GROUP: 'Group' WS ID WS '{' WS '}';
EQUALS: '=';
STRING_CONSTANT: '"~["]+"';
GROUP_CLOSE: '}';
GROUP_ERROR: 'Group' .; // the . character is an invalid token
// you probably meant '{'
GROUP_ERROR2: .'roup' ; // Did you mean 'group'?
STRING_CONSTANT_ERROR: '"' .+; // Unterminated string constant
ID: [a-z][a-z0-9]+;
WS: [ \n\r\t]* -> skip();
SINGLE_TOKEN_ERRORS: .+?;
There are clearly some problems with your approach:
You are skipping WS (which is good), but yet you're using it in your other rules. But you're in the lexer, which leads us to...
Your groups are being recognized by the lexer. I don't think you want them to become a single token. Your groups belong in the parser.
Your grammar, as written, will create specific token types for things ending in roup, so croup for instance may never match an ID. That's not good.
STRING_CONSTANT_ERROR is much too broad. It's able to glob the entire input. See my UNTERMINATED_STRING below.
I'm not quite sure what happens with SINGLE_TOKEN_ERRORS... See below for an alternative.
Now, here are some examples of error tokens I use, and this works very well for error reporting:
UNTERMINATED_STRING
: '"' ('\\' ["\\] | ~["\\\r\n])*
;
UNTERMINATED_COMMENT_INLINE
: '/*' ('*' ~'/' | ~'*')*? EOF -> channel(HIDDEN)
;
// This should be the LAST lexer rule in your grammar
UNKNOWN_CHAR
: .
;
Note that these unterminated tokens represent single atomic values, they don't span logical structures.
Also, UNKNOWN_CHAR will be a single char no matter what, if you define it as .+? it will always match exactly one char anyway, since it will be trying to match as few chars as possible, and that minimum is one char.
Non-greedy quantifiers make sense when something follows them. For instance in the expression .+? '#', the .+? will be forced to consume characters until it encounters a # sign. If the .+? expression is alone, it won't have to consume more than a single character to match, and therefore will be equivalent to ..
I use the following code in the lexer (.NET ANTLR):
partial class MyLexer
{
public override IToken Emit()
{
CommonToken token;
RecognitionException ex;
switch (Type)
{
case UNTERMINATED_STRING:
Type = STRING;
token = (CommonToken)base.Emit();
ex = new UnterminatedTokenException(this, (ICharStream)InputStream, token);
ErrorListenerDispatch.SyntaxError(this, UNTERMINATED_STRING, Line, Column, "Unterminated string: " + GetTokenTextForDisplay(token), ex);
return token;
case UNTERMINATED_COMMENT_INLINE:
Type = COMMENT_INLINE;
token = (CommonToken)base.Emit();
ex = new UnterminatedTokenException(this, (ICharStream)InputStream, token);
ErrorListenerDispatch.SyntaxError(this, UNTERMINATED_COMMENT_INLINE, Line, Column, "Unterminated comment: " + GetTokenTextForDisplay(token), ex);
return token;
default:
return base.Emit();
}
}
// ...
}
Notice that when the lexer encounters a bad token type, it explicitly changes it it to a valid token, so the parser can actually make sense of it.
Now, it is the job of the parser to identify bad structure. ANTLR is smart enough to perform single-token deletion and single-token insertion while trying to resynchronize itself with an invalid input. This is also the reason why I'm letting UNKNOWN_CHAR slip though to the parser, so it can discard it with an error message.
Just take the errors it generates and alter them in order to present something nicer to the user.
So, just make your groups into a parser rule.
An example:
Consider the following input:
Group ,ygroup {
Here, the , is clearly a typo (user pressed , instead of m).
If you use UNKNOWN_CHAR: .; you will get the following tokens:
Group of type GROUP
, of type UNKNOWN_CHAR
ygroup of type ID
{ of type '{ '
The parser will be able to figure out the UNKNOWN_CHAR token needs to be deleted and will correctly match a group (defined as GROUP ID '{' ...).
ANTLR will insert so-called error nodes at the points where it finds unexpected tokens (in this case between GROUP and ID). These nodes are then ignored for the purposes of parsing, but you can retrieve them with your visitors/listeners to handle them (you can use a visitor's VisitErrorNode method for instance).

Grammar for a recognizer of a spice-like language

I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master thesis where the student has done the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm basing myself in this work to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1/n", but cannot interpret an input like "R1 5 0 1/n", because it maps "1" to the token NODE, instead of mapping it to the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone has an idea of how can I map the "1" to the correct token VALUE, or a suggestion of how can I alter the grammar so that I can correctly interpret the input?
The second problem is the presence of a comment at the end of a line. Because the NEWLINE token delimits: (1) the end of a comment; and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary to the parser correctly recognize the line of code, otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as much characters as possible. In case two tokens match the same amount of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
E.g., I removed VALUE, added value and removed fragment from REAL
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.

bison error recovery

I have found out that I can use 'error' in the grammar rule as a mechanism for error recovery. So if there was an error, the parser must discard the current line and resume parsing from the next line. An example from bison manual to achieve this could be something like this:
stmts:
exp
|stmts exp
| error '\n'
But I cannot use that; because I had to make flex ignores '\n' in my scannar, so that an expression is not restricted to be expressed in one line. How can I make the parser -when encountering an error- continue parsing to the following line, given that there is no special character (i.e. semicolon) to indicate an end of expression and there is no 'newline' token?
Thanks..
Since you've eliminated the marker used by the example, you're going to have to pull a stunt to get the equivalent effect.
I think you can use this:
stmts:
exp
| stmts exp
| error { eat_to_newline(); }
Where eat_to_newline() is a function in the scanner (source file) that arranges to discard any saved tokens and read up to the next newline.
extern void eat_to_newline(void);
void eat_to_newline(void)
{
int c;
while ((c = getchar()) != EOF && c != '\n')
;
}
It probably needs to be a little more complex than that, but not a lot more complex than that. You might need to use yyerrok; (and, as the comment reminds me, yyclearin; too) after calling eat_to_newline().

Resources