How to match [BOF]"Begin of file" in Antlr4 Lexer? - parsing

In one Antlr4 syntax, I need the comment (// xxxx) to be always at the start of a line.
The following grammar works fine for most cases.
grammar com;
comment: COMMENT;
COMMENT
: '\n' '//' .*? '\n'
;
By design, it will match \n//comment\n but not //comment\n. But I also want it to match <BOF>//comment\n. How can I implement it?

You may find that this edit is better handled post-parsing, in a semantic validation pass of your parseTree. (NOTE: It's not a requirement that a parser ONLY recognize valid input, just that it correctly interprets the only way to understand that input.)
For example, does // might be a comment have some other, alternate interpretation if it's not at the beginning of the line?
If not, I would probably just accept the // comment ...\n as a token regardless of it's position in the line.
Then, once you have the parse tree, you can check that you comments always have a column of 0. Doing it this way, your grammar is not tied to a particular target language, and, perhaps more importantly, you can give a "nice" error message like "Comments must begin in the first column of a line".
If you try to handle this in the Lexer (or parser), then, if it's NOT in the correct column, you'll get a much more obtuse recognition error that will be more difficult for users to understand.

That is not possible in a language agnostic way. You will have to add target specific code in your grammar and use a predicate to check if the char position is 0:
COMMENT
: {getCharPositionInLine() == 0}? '//' ~[\r\n]*
;
OTHER
: .
;
If you now tokenize the input:
// start
// middle
?//...
// end
with the Java code:
String input = "// start\n// middle\n?//...\n// end";
comLexer lexer = new comLexer(CharStreams.fromString(input));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();
for (Token t : stream.getTokens()) {
System.out.printf("%-10s'%s'%n",
comLexer.VOCABULARY.getSymbolicName(t.getType()),
t.getText().replace("\n", "\\n"));
}
the following will be printed to your console:
COMMENT '// start'
OTHER '\n'
COMMENT '// middle'
OTHER '\n'
OTHER '?'
OTHER '/'
OTHER '/'
OTHER '.'
OTHER '.'
OTHER '.'
OTHER '\n'
COMMENT '// end'
EOF '<EOF>'
Note that I also removed the \n at the end of the COMMENT, otherwise a comment at the end of the input would not be matched.
EDIT
How I can do it with JavaScript? I cannot find good examples on internet.
By looking at the Javascript source, it looks like {this.column === 0}? is the Javascript equivalent of {getCharPositionInLine() == 0}?
By the way, does the Intellij Plugin support predict? If it does, does it support only Java?
No, the IntelliJ plugin ignores predicates. After all, the code inside a predicate can be any arbitrary chunk of code, making it quite hard to support.

Related

Tatsu Parser, unclear why it isn't moving to the next rule in the line?

I am writing a code parser/formatter for a language that doesn't have one, OSTW (Overwatch higher level language for workshop code). So that I can be lazy and have pretty code.
I am pretty new to this idea, so if tatsu is a poor choice for this usecase, please let me know, I am rather ignorant. I have been going back and forth between the grammar syntax and some of the tutorials and it isn't clicking for me yet.
My sample document:
doSomething(param1,param2,arg=stuff,arg2=stuff2);
My EBNF:
##grammar::Ostw
##eol_comments :: /\/\/.*?$/
start = statement $ ;
statement = func:alpha '(' ','%{param:alpha}* [',' ','%{kwarg}*] ')' eol ;
eol = ';' ;
kwarg = key:alpha '=' val:alpha ;
alpha = (numbers|letters) ;
numbers = /\d+/ ;
letters = /\w+/ ;
The grammar compiles successfully, but when I attempt to parse my code, I get this output:
FailedToken: (1:30) expecting ')' :
doSomething(param1,param2,arg=stuff,arg2=stuff2);
^
statement
start
My expectation would be that, since the = is not a valid character for the alpha rule, it would go to the next thing in the list, since it is an unknown number of entries of either types.
My intention is to have my parser expect similarly to Python, params then keyword arguments.
I think I missed a paragraph somewhere in something basic is what it feels like!
Thanks for any help!
Mriswithe

ANTLR4 match any not-matched sections into one single STRING token

I am trying to create a Lexer/Parser with ANTLR that can parse plain text with 'tags' scattered inbetween.
These tags are denoted by opening ({) and closing (}) brackets and they represent Java objects that can evaluate to a string, that is then replaced in the original input to create a dynamic template of sorts.
Here is an example:
{player:name} says hi!
The {player:name} should be replaced by the name of the player and result in the output i.e. Mark says hi! for the player named Mark.
Now I can recognize and parse the tags just fine, what I have problems with is the text that comes after.
This is the grammar I use:
grammar : content+
content : tag
| literal
;
tag : player_tag
| <...>
| <other kinds of tags, not important for this example>
| <...>
;
player_tag : BRACKET_OPEN player_identifier SEMICOLON player_string_parameter BRACKET_CLOSE ;
player_string_parameter : NAME
| <...>
;
player_identifier : PLAYER ;
literal : NUMBER
| STRING
;
BRACKET_OPEN : '{';
BRACKET_CLOSE : '}';
PLAYER : 'player'
NAME : 'name'
NUMBER : <...>
STRING : (.+)? /* <- THIS IS THE PROBLEMATIC PART !*/
Now this STRING Lexer definition should match anything that is not an empty string but the problem is that it is too greedy and then also consumes the { } bracket tokens needed for the tag rule.
I have tried setting it to ~[{}]+ which is supposed to match anything that does not include the { } brackets but that screws with the tag parsing which I don't understand either.
I could set it to something like [ a-zA-Z0-9!"ยง$%&/()= etc...]+ but I really don't want to restrict it to parse only characters available on the british keyboard (German umlaute or French accents and all other special characters other languages have must to work!)
The only thing that somewhat works though I really dislike it is to force strings to have a prefix and a suffix like so:
STRING : '\'' ~[}{]+ '\'' ;
This forces me to alter the form from "{player:name} says hi!" to "{player:name}' says hi!'" and I really desperately want to avoid such restrictions because I would then have to account for literal ' characters in the string itself and it's just ugly to work with.
The two solutions I have in mind are the following:
- Is there any way to match any number of characters that has not been matched by the lexer as a STRING token and pass it to the parser? That way I could match all the tags and say the rest of the input is just plain text, give it back to me as a STRING token or whatever...
- Does ANTLR support lookahead and lookbehind regex expressions with which I could match any number of characters before the first '{', after the last '}' and anything inbetween '}' and '{' ?
I have tried
STRING : (?<=})(.+)?(?={) ;
but I can't seem to get the syntax right because that won't compile at all, which leads me to believe that ANTLR does not support lookahead and lookbehind syntax, but I could not find a definitive answer on the internet to that question.
Any advice on what to do?
Antlr does not support lookahead or lookbehind. It does support non-greedy wildcard matches, but only when the .* non-greedy wildcard is followed in the rule with the termination sequence (which, as you say, is also contained in the match, although you could push it back into the input stream).
So ~[{}]* is correct. But there's a little problem: lexer rules are (normally) always active. So that lexer rule will be active inside the braces as well, which means that it will swallow the entire contents between the braces (unless there are nested braces or braces inside quotes or some such, and that's even worse).
So you need to define different lexical contents, called "lexical modes" in Antlr. There's a publically viewable example in the Antlr Definitive Reference, which shows a solution to a very similar problem: parsing HTML.

ANTLR: Help on Lexing Errors for a custom grammar example

What approach would allow me to get the most on reporting lexing errors?
For a simple example I would like to write a grammar for the following text
(white space is ignored and string constants cannot have a \" in them for simplicity):
myvariable = 2
myvariable = "hello world"
Group myvariablegroup {
myvariable = 3
anothervariable = 4
}
Catching errors with a lexer
How can you maximize the error reporting potential of a lexer?
After reading this post: Where should I draw the line between lexer and parser?
I understood that the lexer should match as much as it can with regards to the parser grammar but what about lexical error reporting strategies?
What are the ordinary strategies for catching lexing errors?
I am imagining a grammar which would have the following "error" tokens:
GROUP_OPEN: 'Group' WS ID WS '{';
EMPTY_GROUP: 'Group' WS ID WS '{' WS '}';
EQUALS: '=';
STRING_CONSTANT: '"~["]+"';
GROUP_CLOSE: '}';
GROUP_ERROR: 'Group' .; // the . character is an invalid token
// you probably meant '{'
GROUP_ERROR2: .'roup' ; // Did you mean 'group'?
STRING_CONSTANT_ERROR: '"' .+; // Unterminated string constant
ID: [a-z][a-z0-9]+;
WS: [ \n\r\t]* -> skip();
SINGLE_TOKEN_ERRORS: .+?;
There are clearly some problems with your approach:
You are skipping WS (which is good), but yet you're using it in your other rules. But you're in the lexer, which leads us to...
Your groups are being recognized by the lexer. I don't think you want them to become a single token. Your groups belong in the parser.
Your grammar, as written, will create specific token types for things ending in roup, so croup for instance may never match an ID. That's not good.
STRING_CONSTANT_ERROR is much too broad. It's able to glob the entire input. See my UNTERMINATED_STRING below.
I'm not quite sure what happens with SINGLE_TOKEN_ERRORS... See below for an alternative.
Now, here are some examples of error tokens I use, and this works very well for error reporting:
UNTERMINATED_STRING
: '"' ('\\' ["\\] | ~["\\\r\n])*
;
UNTERMINATED_COMMENT_INLINE
: '/*' ('*' ~'/' | ~'*')*? EOF -> channel(HIDDEN)
;
// This should be the LAST lexer rule in your grammar
UNKNOWN_CHAR
: .
;
Note that these unterminated tokens represent single atomic values, they don't span logical structures.
Also, UNKNOWN_CHAR will be a single char no matter what, if you define it as .+? it will always match exactly one char anyway, since it will be trying to match as few chars as possible, and that minimum is one char.
Non-greedy quantifiers make sense when something follows them. For instance in the expression .+? '#', the .+? will be forced to consume characters until it encounters a # sign. If the .+? expression is alone, it won't have to consume more than a single character to match, and therefore will be equivalent to ..
I use the following code in the lexer (.NET ANTLR):
partial class MyLexer
{
public override IToken Emit()
{
CommonToken token;
RecognitionException ex;
switch (Type)
{
case UNTERMINATED_STRING:
Type = STRING;
token = (CommonToken)base.Emit();
ex = new UnterminatedTokenException(this, (ICharStream)InputStream, token);
ErrorListenerDispatch.SyntaxError(this, UNTERMINATED_STRING, Line, Column, "Unterminated string: " + GetTokenTextForDisplay(token), ex);
return token;
case UNTERMINATED_COMMENT_INLINE:
Type = COMMENT_INLINE;
token = (CommonToken)base.Emit();
ex = new UnterminatedTokenException(this, (ICharStream)InputStream, token);
ErrorListenerDispatch.SyntaxError(this, UNTERMINATED_COMMENT_INLINE, Line, Column, "Unterminated comment: " + GetTokenTextForDisplay(token), ex);
return token;
default:
return base.Emit();
}
}
// ...
}
Notice that when the lexer encounters a bad token type, it explicitly changes it it to a valid token, so the parser can actually make sense of it.
Now, it is the job of the parser to identify bad structure. ANTLR is smart enough to perform single-token deletion and single-token insertion while trying to resynchronize itself with an invalid input. This is also the reason why I'm letting UNKNOWN_CHAR slip though to the parser, so it can discard it with an error message.
Just take the errors it generates and alter them in order to present something nicer to the user.
So, just make your groups into a parser rule.
An example:
Consider the following input:
Group ,ygroup {
Here, the , is clearly a typo (user pressed , instead of m).
If you use UNKNOWN_CHAR: .; you will get the following tokens:
Group of type GROUP
, of type UNKNOWN_CHAR
ygroup of type ID
{ of type '{ '
The parser will be able to figure out the UNKNOWN_CHAR token needs to be deleted and will correctly match a group (defined as GROUP ID '{' ...).
ANTLR will insert so-called error nodes at the points where it finds unexpected tokens (in this case between GROUP and ID). These nodes are then ignored for the purposes of parsing, but you can retrieve them with your visitors/listeners to handle them (you can use a visitor's VisitErrorNode method for instance).

Parsing optional semicolon at statement end

I was writing a parser to parse C-like grammars.
First, it could now parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
Where the new line is processed by the lexer that written by myself(the code are simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
the instruction :PASS here is equivalent to return nothing in LEX, which drop current matched text and skip to the next rule, just like what is usually done with whitespaces.
Because of this, I can't just simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
| '\n'
;
So I chose to change the lexer's state dynamically by the parser correspondingly.
Like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
And add a lexer rule that can match new line with the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
Which means when the lexer is under :STATEMENT_END state. it will first increase the line number as usual, and then set the state into initial one, and then pretend itself is a semicolon.
It's strange that it doesn't actually work with following code:
a = 1
b = 2
I debugged it and got it is not actually get a ';' as expect when scanned the newline after the number 1, and the state specified rule is not really executed.
And the code to set the new state is executed after it already scanned the new line and returned nothing, that means, these works is done as following order:
scan a, = and 1
scan newline and skip, so get the next value b
the inserted code({ state = :STATEMENT_END }) is executed
raising error -- unexpected b here
This is what I expect:
scan a, = and 1
found that it matches the rule expr, so reduce into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ; according the new state matching rule
continue to scan & parse the following line
After introspection I found that might caused as YACC uses LALR(1), this parser will read forward for one token first. When it scans to there, the state is not set yet, so it cannot get a correct token.
My question is: how to make it work as expected? I have no idea on this.
Thanks.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
-c;
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
(c);
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parenthesis, so you end up needing an additional state var to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or defer returning the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis seen (+1 for (, -1 for )), and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (ID or constant or ) or postfix-only operator), return the extra ;
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (its now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.

Create a Print Function

I'm learning Bison and at this time the only thing that I do was the rpcalc example, but now I want to implement a print function(like printf of C), but I don't know how to do this and I'm planning to have a syntax like this print ("Something here");, but I don't know how to build the print function and I don't know how to create that ; as a end of line. Thanks for your help.
You first need to ask yourself:
What are the [sub-]parts of my 'print ("something");' syntax ?
Once you identify these parts, "simply" describe them in the form of grammar syntax rules, along with applicable production rules. And then let Bison generate the parser for you; that's about it.
To put you on your way:
The semi-column is probably a element you will use to separate statemements (such a one "call" to print from another).
'print' itself is probably a keyword, or preferably a native function name of your language.
The print statement appears to take a literal string as [one of] its arguments. a literal string starts and ends with a double quote (and probably allow for escaped quotes within itself)
etc.
The bolded and italic expressions above are some of the entities (the 'symbols' in parser lingo) you'll likely need to define in the syntax for your language. For that you'll use Bison grammar rules, such as
stmt : print_stmt ';' | input_stmt ';'| some_other_stmt ';' ;
prnt_stmt : print '(' args ')'
{ printf( $3 ); }
;
args : arg ',' args;
...
Since the question asked about the semi-column, maybe some confusion was from the different uses thereof; see for example above how the ';' belong to your language's syntax whereby the ; (no quotes) at the end of each grammar rule are part of Bison's language.
Note: this is of course a simplistic implementation, aimed at showing the essential. Also the Bison syntax may be a tat off (been there / done it, but a long while back ;-) I then "met" ANTLR never to return to Bison, although I do see how its lightweight and fully self contained nature can make it appropriate in some cases)

Resources