Can a lexer rule be applied in only one parser rule? - parsing

The issue we're having with ANTLR is that we have a grammar that's parsing something like this:
Hello, my name is bob.
bob offset: 5
Keep in mind that the "bob" in the first line is dynamic and could be anything; one of the things it could be is "bob". The "bob offset" line is not dynamic and appears in every file of the type we are parsing.
So, to parse this, we have a couple of rules:
greeting: 'Hello, my name is' id1=IDENT '.' NEWLINE
  { System.out.println("Name: " + $id1.text); }
  ;
bob_offset: 'bob offset:' id1=INT NEWLINE
  { System.out.println("bob offset: " + $id1.text); }
  ;
So, the issue is that 'bob offset:' is a single token that the lexer reads. Now, when the greeting rule runs, an error is thrown because the lexer starts matching 'bob' as the beginning of 'bob offset:' and can't complete it.
The ideal solution would be for ANTLR to have some way to specify context- or parser-rule-specific lexer rules. That way, the 'bob offset:' token couldn't be mistakenly matched anywhere else in the grammar.
Any thoughts on this issue would be appreciated.

We ended up having to work around this with more parser rules that spell the structure out more explicitly for ANTLR, roughly like the sketch below.
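This is only a sketch of the idea, not our exact grammar: the multi-word label is no longer lexed as a single token and the parser assembles it instead. It assumes IDENT, INT, NEWLINE and a whitespace-skipping rule exist in the lexer.
greeting   : 'Hello, my name is' id1=name '.' NEWLINE
             { System.out.println("Name: " + $id1.text); }
           ;
// 'bob offset:' is no longer a single lexer token, so the lexer never
// starts matching it while it is reading a greeting line.
bob_offset : 'bob' 'offset' ':' off=INT NEWLINE
             { System.out.println("bob offset: " + $off.text); }
           ;
// Because 'bob' is now a keyword token of its own, identifier positions
// must accept it explicitly as well.
name       : IDENT | 'bob' ;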

Related

How to match [BOF]"Begin of file" in Antlr4 Lexer?

In an ANTLR4 grammar, I need the comment (// xxxx) to always be at the start of a line.
The following grammar works fine for most cases.
grammar com;
comment: COMMENT;
COMMENT
: '\n' '//' .*? '\n'
;
By design, it will match \n//comment\n but not //comment\n. But I also want it to match <BOF>//comment\n. How can I implement it?
You may find that this is better handled post-parsing, in a semantic validation pass over your parse tree. (NOTE: it's not a requirement that a parser ONLY recognize valid input, just that it correctly interprets the only way to understand that input.)
For example, does // (which might otherwise be a comment) have some other, alternate interpretation if it's not at the beginning of the line?
If not, I would probably just accept the // comment ...\n as a token regardless of its position in the line.
Then, once you have the parse tree, you can check that your comments always have a column of 0. Doing it this way, your grammar is not tied to a particular target language, and, perhaps more importantly, you can give a "nice" error message like "Comments must begin in the first column of a line".
If you try to handle this in the Lexer (or parser), then, if it's NOT in the correct column, you'll get a much more obtuse recognition error that will be more difficult for users to understand.
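As a rough sketch of that validation pass (assuming a generated comLexer with a COMMENT token type; the check is done directly over the token stream here, which carries the same position information as the tree):
import org.antlr.v4.runtime.*;

class CommentPlacement {
    // A comment token is only legal if it starts in the first column of its line.
    static void check(CommonTokenStream tokens) {
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            if (t.getType() == comLexer.COMMENT && t.getCharPositionInLine() != 0) {
                System.err.printf("line %d: comments must begin in the first column of a line%n",
                        t.getLine());
            }
        }
    }
}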
That is not possible in a language agnostic way. You will have to add target specific code in your grammar and use a predicate to check if the char position is 0:
COMMENT
: {getCharPositionInLine() == 0}? '//' ~[\r\n]*
;
OTHER
: .
;
If you now tokenize the input:
// start
// middle
?//...
// end
with the Java code:
String input = "// start\n// middle\n?//...\n// end";
comLexer lexer = new comLexer(CharStreams.fromString(input));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();

for (Token t : stream.getTokens()) {
    System.out.printf("%-10s'%s'%n",
            comLexer.VOCABULARY.getSymbolicName(t.getType()),
            t.getText().replace("\n", "\\n"));
}
the following will be printed to your console:
COMMENT '// start'
OTHER '\n'
COMMENT '// middle'
OTHER '\n'
OTHER '?'
OTHER '/'
OTHER '/'
OTHER '.'
OTHER '.'
OTHER '.'
OTHER '\n'
COMMENT '// end'
EOF '<EOF>'
Note that I also removed the \n at the end of the COMMENT, otherwise a comment at the end of the input would not be matched.
EDIT
How can I do it with JavaScript? I cannot find good examples on the internet.
By looking at the JavaScript source, it looks like {this.column === 0}? is the JavaScript equivalent of {getCharPositionInLine() == 0}?
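In other words, something along these lines should work (an untested sketch based on the rule above):
COMMENT
 : {this.column === 0}? '//' ~[\r\n]*
 ;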
By the way, does the IntelliJ plugin support predicates? If it does, does it support only Java?
No, the IntelliJ plugin ignores predicates. After all, the code inside a predicate can be any arbitrary chunk of code, making it quite hard to support.

Jison Lexer - Detect Certain Keyword as an Identifier at Certain Times

"end" { return 'END'; }
...
0[xX][0-9a-fA-F]+ { return 'NUMBER'; }
[A-Za-z_$][A-Za-z0-9_$]* { return 'IDENT'; }
...
Call
: IDENT ArgumentList
{{ $$ = ['CallExpr', $1, $2]; }}
| IDENT
{{ $$ = ['CallExprNoArgs', $1]; }}
;
CallArray
: CallElement
{{ $$ = ['CallArray', $1]; }}
;
CallElement
: CallElement "." Call
{{ $$ = ['CallElement', $1, $3]; }}
| Call
;
Hello! So, in my grammar I want "res.end();" to not detect end as a keyword, but as an ident. I've been thinking for a while about this one but couldn't solve it. Does anyone have any ideas? Thank you!
edit: It's a C-like programming language.
There's not quite enough information in the question to justify the assumptions I'm making here, so this answer may be inexact.
Let's suppose we have a somewhat Lua-like language in which a.b is syntactic sugar for a["b"]. Furthermore, since the . must be followed by a lexical identifier -- in other words, it is never followed by a syntactic keyword -- we'd like to inhibit keyword recognition in this context.
That's a pretty simple rule. It's simple enough that the lexer could implement it without any semantic information at all; all that it says is that the token which follows a . must be an identifier. In this context, keywords should be treated as identifiers, and anything else other than an identifier is an error.
We can do this with start conditions. Specifically, we define a start condition which is only used after a . token:
%x selector
%%
/* White space and comment rules need to explicitly include
* the selector condition
*/
<INITIAL,selector>\s+ ;
/* Other rules, including keywords, are unmodified */
"end" return "END";
/* The dot rule triggers a new start condition */
"." this.begin("selector"); return ".";
/* Outside of the start condition, identifiers don't change state. */
[A-Za-z_]\w* yylval = yytext; return "ID";
/* Only identifiers are valid in this start condition, and if found
* the start condition is changed back. Anything else is an error.
*/
<selector>[A-Za-z_]\w* yylval = yytext; this.popState(); return "ID";
<selector>. parse_error("Expecting identifier");
Modify your parser so it always knows what it is expecting to read next (that will be some set of tokens; you can compute this using the notion of First(x) for x being any nonterminal).
When lexing, have the lexer ask the parser what set of tokens it expects next.
Your keyword recognizer for 'end' asks the parser, and it either says "expecting 'end'", at which point the lexer simply hands over the 'end' lexeme, or it says "expecting ID", at which point it hands the parser an ID with name text "end".
This may or may not be convenient to get your parser to do. But you need something like this.
We use a GLR parser; our parser accepts multiple tokens in the same place. Our solution is to generate both the 'end' keyword and the identifier with text "end" and shove them both into the GLR parser. It can handle local ambiguity; the multiple parses caused by this proceed until the parser with the wrong assumption encounters a syntax error, and then it just vanishes, by fiat. The last standing parser is the one with the right set of assumptions. This scheme is somewhat like the first one, except that we hand the parser the choices and it decides, rather than making the lexer decide.
You might be able to send your parser a "two-interpretation" lexeme, e.g., a keyword-in-context lexeme, which in essence claims it is both a keyword and/or an identifier. With a single token of lookahead internally, the parser can likely decide easily and restamp the lexeme. Not as general as the GLR solution, but it probably works in a lot of cases.

ANTLR: Help on Lexing Errors for a custom grammar example

What approach would allow me to get the most out of reporting lexing errors?
For a simple example I would like to write a grammar for the following text
(white space is ignored and string constants cannot have a \" in them for simplicity):
myvariable = 2
myvariable = "hello world"
Group myvariablegroup {
myvariable = 3
anothervariable = 4
}
Catching errors with a lexer
How can you maximize the error reporting potential of a lexer?
After reading this post: Where should I draw the line between lexer and parser?
I understood that the lexer should match as much as it can with regard to the parser grammar, but what about lexical error reporting strategies?
What are the ordinary strategies for catching lexing errors?
I am imagining a grammar which would have the following "error" tokens:
GROUP_OPEN: 'Group' WS ID WS '{';
EMPTY_GROUP: 'Group' WS ID WS '{' WS '}';
EQUALS: '=';
STRING_CONSTANT: '"' ~["]+ '"';
GROUP_CLOSE: '}';
GROUP_ERROR: 'Group' .; // the . character is an invalid token
// you probably meant '{'
GROUP_ERROR2: .'roup' ; // Did you mean 'group'?
STRING_CONSTANT_ERROR: '"' .+; // Unterminated string constant
ID: [a-z][a-z0-9]+;
WS: [ \n\r\t]+ -> skip;
SINGLE_TOKEN_ERRORS: .+?;
There are clearly some problems with your approach:
You are skipping WS (which is good), but yet you're using it in your other rules. But you're in the lexer, which leads us to...
Your groups are being recognized by the lexer. I don't think you want them to become a single token. Your groups belong in the parser.
Your grammar, as written, will create specific token types for things ending in roup, so croup for instance may never match an ID. That's not good.
STRING_CONSTANT_ERROR is much too broad. It's able to glob the entire input. See my UNTERMINATED_STRING below.
I'm not quite sure what happens with SINGLE_TOKEN_ERRORS... See below for an alternative.
Now, here are some examples of error tokens I use, and this works very well for error reporting:
UNTERMINATED_STRING
: '"' ('\\' ["\\] | ~["\\\r\n])*
;
UNTERMINATED_COMMENT_INLINE
: '/*' ('*' ~'/' | ~'*')*? EOF -> channel(HIDDEN)
;
// This should be the LAST lexer rule in your grammar
UNKNOWN_CHAR
: .
;
Note that these unterminated tokens represent single atomic values, they don't span logical structures.
Also, UNKNOWN_CHAR will be a single char no matter what: if you define it as .+? it will always match exactly one char anyway, since it will be trying to match as few chars as possible, and that minimum is one char.
Non-greedy quantifiers make sense when something follows them. For instance in the expression .+? '#', the .+? will be forced to consume characters until it encounters a # sign. If the .+? expression is alone, it won't have to consume more than a single character to match, and therefore will be equivalent to ..
I use the following code in the lexer (.NET ANTLR):
partial class MyLexer
{
    public override IToken Emit()
    {
        CommonToken token;
        RecognitionException ex;

        switch (Type)
        {
            case UNTERMINATED_STRING:
                Type = STRING;
                token = (CommonToken)base.Emit();
                ex = new UnterminatedTokenException(this, (ICharStream)InputStream, token);
                ErrorListenerDispatch.SyntaxError(this, UNTERMINATED_STRING, Line, Column,
                    "Unterminated string: " + GetTokenTextForDisplay(token), ex);
                return token;

            case UNTERMINATED_COMMENT_INLINE:
                Type = COMMENT_INLINE;
                token = (CommonToken)base.Emit();
                ex = new UnterminatedTokenException(this, (ICharStream)InputStream, token);
                ErrorListenerDispatch.SyntaxError(this, UNTERMINATED_COMMENT_INLINE, Line, Column,
                    "Unterminated comment: " + GetTokenTextForDisplay(token), ex);
                return token;

            default:
                return base.Emit();
        }
    }

    // ...
}
Notice that when the lexer encounters a bad token type, it explicitly changes it to a valid token, so the parser can actually make sense of it.
Now, it is the job of the parser to identify bad structure. ANTLR is smart enough to perform single-token deletion and single-token insertion while trying to resynchronize itself with an invalid input. This is also the reason why I'm letting UNKNOWN_CHAR slip through to the parser, so it can discard it with an error message.
Just take the errors it generates and alter them in order to present something nicer to the user.
So, just make your groups into a parser rule.
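For instance, the group structure could be expressed with parser rules along these lines (a sketch based on the sample input in the question; the rule names and the INT/STRING tokens are assumptions):
// The lexer only produces simple tokens such as GROUP, ID, '=', '{', '}',
// INT and STRING; the structure of a group lives entirely in the parser.
group      : GROUP ID '{' statement* '}' ;
statement  : assignment | group ;
assignment : ID '=' (INT | STRING) ;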
An example:
Consider the following input:
Group ,ygroup {
Here, the , is clearly a typo (user pressed , instead of m).
If you use UNKNOWN_CHAR: .; you will get the following tokens:
Group of type GROUP
, of type UNKNOWN_CHAR
ygroup of type ID
{ of type '{'
The parser will be able to figure out the UNKNOWN_CHAR token needs to be deleted and will correctly match a group (defined as GROUP ID '{' ...).
ANTLR will insert so-called error nodes at the points where it finds unexpected tokens (in this case between GROUP and ID). These nodes are then ignored for the purposes of parsing, but you can retrieve them with your visitors/listeners to handle them (you can use a visitor's VisitErrorNode method for instance).
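For example, with the Java target a listener can report those error nodes along these lines (a sketch; MyGrammarBaseListener stands in for whatever base listener your grammar generates, and the .NET API used above is analogous):
import org.antlr.v4.runtime.tree.ErrorNode;

public class ErrorNodeReporter extends MyGrammarBaseListener {
    @Override
    public void visitErrorNode(ErrorNode node) {
        // Called once for each token the parser turned into an error node,
        // e.g. the stray ',' in "Group ,ygroup {".
        System.err.printf("line %d:%d unexpected input '%s'%n",
                node.getSymbol().getLine(),
                node.getSymbol().getCharPositionInLine(),
                node.getText());
    }
}
After parsing, walk the tree with ParseTreeWalker.DEFAULT.walk(new ErrorNodeReporter(), tree) to trigger the callbacks.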

Antlr4 lexer predicate needed?

I'm trying to parse this piece of text:
:20: test :254:
aapje
:21: rest
...
:20: and :21: are special tags, because they start the line. :254: should be 'normal' text, as it does not start on a newline.
I would like the result to be
(20, 'test :254: \naapje')
(21, 'rest')
Lines are terminated using either \r\n or '\n'
I started out trying to ignore the whitespace, but then I match the ':254:' tag as well. So I have to create something that uses the whitespace information.
What I would like to be able to do is something like this:
lexer grammar MT9740_lexer;
InTagNewLine : '\r\n' ~':';
ReadNewLine :'\r\n' ;
But the first would consume the ':'. How can I still generate these tokens? Or is there a smarter approach?
What I understand is that you're looking for a lexer rule that matches tags at the start of a line. The following lexer rule tokenizes your :20: or :21: only when it appears at the start of a line:
SOL : {getCharPositionInLine() == 0}? ':' [0-9]+ ':' ;
Your parser rules can then look for this SOL token before parsing the rest of the line.
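A sketch of what the parser side could look like (only SOL comes from the rule above; the other rule names are assumptions, and the lexer is assumed to produce some token for ordinary text):
file  : block+ EOF ;
block : SOL text ;       // one tag plus everything that belongs to it
text  : (~SOL)* ;        // any tokens that are not another start-of-line tag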

How to match any symbol in ANTLR parser (not lexer)?

How to match any symbol in ANTLR parser (not lexer)? Where is the complete language description for ANTLR4 parsers?
UPDATE
Is the answer "impossible"?
You first need to understand the roles of each part in parsing:
The lexer: this is the object that tokenizes your input string. Tokenizing means to convert a stream of input characters to an abstract token symbol (usually just a number).
The parser: this is the object that only works with tokens to determine the structure of a language. A language (written as one or more grammar files) defines the token combinations that are valid.
As you can see, the parser doesn't even know what a letter is. It only knows tokens. So your question is already wrong.
Having said that, it would probably help to know why you want to skip individual input letters in your parser. It looks like your base concept needs adjustments.
It depends what you mean by "symbol". To match any token inside a parser rule, use the . (DOT) meta char. If you're trying to match any character inside a parser rule, then you're out of luck, there is a strict separation between parser- and lexer rules in ANTLR. It is not possible to match any character inside a parser rule.
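As a tiny illustration of that token wildcard (a generic sketch, not tied to any particular grammar):
// '.' inside a parser rule matches any single token, never a raw character.
anyToken : . ;
upToSemi : .*? ';' ;   // an arbitrary run of tokens up to the next ';'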
It is possible, but only if you have such a basic grammar that the reason to use ANTLR is negated anyway.
If you had the grammar:
text : ANY_CHAR* ;
ANY_CHAR : . ;
it would do what you (seem to) want.
However, as many have pointed out, this would be a pretty strange thing to do. The purpose of the lexer is to identify different tokens that can be strung together in the parser to form a grammar, so your lexer can either identify the specific string "JSTL/EL" as a token, or [A-Z]'/EL', [A-Z]'/'[A-Z][A-Z], etc - depending on what you need.
The parser is then used to define the grammar, so:
phrase : CHAR* jstl CHAR* ;
jstl : JSTL SLASH QUALIFIER ;
JSTL : 'JSTL' ;
SLASH : '/' ;
QUALIFIER : [A-Z][A-Z] ;
CHAR : . ;
would accept "blah blah JSTL/EL..." as input, but not "blah blah EL/JSTL...".
I'd recommend looking at The Definitive ANTLR 4 Reference, in particular the section on "Islands in the stream" and the Grammar Reference (Ch 15) that specifically deals with Unicode.
