How to catch skipped tokens with antlr?

How to catch skipped tokens with antlr? - parsing

I've been playing around with antlr to do a kind of excel formula validation. Antlr looks pretty nice, however, I have some doubts about the way it works.
Imagine I have a grammar that already knows about all kind of tokens needed to perform an excel formula validation (rules references, operations, etc). In this grammar, there is no valid token for currency symbols (€,£, etc), though I have an 'ERROR_CHAR' token that matches anything: ERROR_CHAR: .;
Here's what I want to know about an example input: =€€€+SUM(1,2)
The formula is not valid
All the tokens after €€€ are valid and there are rules for them -> +SUM(1,2)
My parser only knows that € is invalid, but don't know about a sequence of ERROR_CHAR, just like €€€, and so, all the input is wrong and all subsequent tokens are caught by the error listener. I assume that this is because, based on my parser rules, I am not saying that ERROR_CHAR could be present anywhere in the input.
I don't want to skip those tokens, because I'd like to highlight the position of the error and I am already skipping whitespaces.
Do you have any idea how could I handle this?

If you just want to highlight the position of an error, let ANTLR detect errors and their positions. It does it quite well. No grammar changes required.
Use ErrorListener to detect errors and handle them.
You can find more information here: Handling errors in ANTLR4

Related

ANTLR4 - Parse subset of a language (e.g. just query statements)

I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
statement
: queryStatement
| undefinedStatement
;
// ...
undefinedStatement
: (.)+?
;
// ...
UndefinedToken
: (.)+?
;
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!

Parsing a subpart of a grammar is super easy. Usually you have a top level rule which you call to parse the full input with the entire grammar.
For the subpart use the function that parses only a subrule like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
Keep in mind however, that subrules usually are not termined with the EOF token (as the top level rule should be). This will cause no syntax error if more than the subelement is in the token stream (the parser just stops when the subrule has matched completely). If that's a problem for you then add a copy of the subrule you wanna parse, give it a dedicated name and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
;
dataType: // type in sql_yacc.yy
type = (
...
Check the MySQL grammar for more details.

This general idea -- to parse the interesting bits of an input and ignore the sea of surrounding tokens -- is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be sea of uninspected input characters. You can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)

Do we have to assume a parser is pre-reading tokens ahead of time?

Coming back to lexers and parsers after many years away, I find myself puzzled over the concept of a state change, for the purposes of context. I'm using Lemon as a parser and putting together my own lexer.
Let's take an example input like this one:
[groups]
syscon:
0x000 sysmemremap
0x004 presetctrl
[registers]
sysmemremap:
map 1-0
rsvd 31-2
presetctrl:32
mux 2-0
..etc...
So "syscon:" and "sysmemremap:" look the same but one is a GROUPNAME and the other is a REGISTERNAME. There's a context change between [groups] and [registers] that determines what each token is in reality.
Is it the parser that is in the best position to make that contexual change? As the parser doesn't have a sectional grammar, where one set of grammar applies in one set of circumstances and another in a different set, I presume the lexer should be the one deciding that "syscon:" generates a GROUPNAME if the mode is such that it should.
EDIT: Just spotted "the lexer hack" entry in Wikipedia that summarises the issue:
Without added context, the lexer cannot distinguish type identifiers
from other identifiers because all identifiers have the same format.
.... The solution generally consists of feeding information from the
semantic symbol table back into the lexer. That is, rather than
functioning as a pure one-way pipeline from the lexer to the parser,
there is a backchannel from semantic analysis back to the lexer.
Except (and this is the question I have) what can you assume about the parser's pre-reading of tokens? If the parser is bashing ahead and reading more tokens to do a better match - which I would expect it to do to some extent at least, it could well run into the situation that a state change in the parser is too late for the lexer as it already met and processed that token!
Or am I overthinking this?

I may be able to answer my own question. I think that any sort of reliance on what the parser is doing internally is likely a Bad Plan.
Now that I've found my Lex and Yacc book (the O'Reilly one), one of the examples in the Lex section is a state change one - If it see the word "verb" it starts defining verbs as opposed to looking them up. That work is done in the lexer so I guess that's the way it is - do it in the lexer.

Improving errors output by Grako-generated parser

I'm trying to figure out the best approach to improving the errors displayed to a user of a Grako-generated parser. It seems like the default parse errors displayed by the Grako-generated parser when it hits some parsing issue in the input file are not helpful. The errors often seem to imply the issue is in one part of the input file when the true error is somewhere different.
I've been looking into the Grako Semantics class to put in some checks which would display better error messages if the checks fail, but it also seems like there could be tons of edge cases that must be specified to be able to catch all of the possible ways the parsing of a rule can fail.
Does anyone have any recommendations or examples I can view?

A PEG parser will exhaust all options, sometimes leaving you at a failure corresponding to the last, and least likely option.
With Grako, you can add cut elements (~) to the grammar to have the parser commit to certain options when it can be sure they are the ones to match.
term = '(' ~ expression ')' | int ;
Cut elements also prune the memoization cache, which improves parser performance.

Encode FIRST and FOLLOW sets into a recursive descent parser

This is a follow up to a previous question I asked How to encode FIRST & FOLLOW sets inside a compiler, but this one is more about the design of my program.
I am implementing the Syntax Analysis phase of my compiler by writing a recursive descent parser. I need to be able to take advantage of the FIRST and FOLLOW sets so I can handle errors in the syntax of the source program more efficiently. I have already calculated the FIRST and FOLLOW for all of my non-terminals, but am have trouble deciding where to logically place them in my program and what the best data-structure would be to do so.
Note: all code will be pseudo code
Option 1) Use a map, and map all non-terminals by their name to two Sets that contains their FIRST and FOLLOW sets:
class ParseConstants
Map firstAndFollowMap = #create a map .....
firstAndFollowMap.put("<program>", FIRST_SET, FOLLOW_SET)
end
This seems like a viable option, but inside of my parser I would then need sorta ugly code like this to retrieve the FIRST and FOLLOW and pass to error function:
#processes the <program> non-terminal
def program
List list = firstAndFollowMap.get("<program>")
Set FIRST = list.get(0)
Set FOLLOW = list.get(1)
error(current_symbol, FOLLOW)
end
Option 2) Create a class for each non-terminal and have a FIRST and FOLLOW property:
class Program
FIRST = .....
FOLLOW = ....
end
this leads to code that looks a little nicer:
#processes the <program> non-terminal
def program
error(current_symbol, Program.FOLLOW)
end
These are the two options I thought up, I would love to hear any other suggestions for ways to encode these two sets, and also any critiques and additions to the two ways I posted would be helpful.
Thanks
I have also posted this question here: http://www.coderanch.com/t/570697/java/java/Encode-FIRST-FOLLOW-sets-recursive

You don't really need the FIRST and FOLLOW sets. You need to compute those to get the parse table. That is a table of {<non-terminal, token> -> <action, rule>} if LL(k) (which means seeing a non-terminal in stack and token in input, which action to take and if applies, which rule to apply), or a table of {<state, token> -> <action, state>} if (C|LA|)LR(k) (which means given state in stack and token in input, which action to take and go to which state.
After you get this table, you don't need the FIRST and FOLLOWS any more.
If you are writing a semantic analyzer, you must assume the parser is working correctly. Phrase level error handling (which means handling parse errors), is totally orthogonal to semantic analysis.
This means that in case of parse error, the phrase level error handler (PLEH) would try to fix the error. If it couldn't, parsing stops. If it could, the semantic analyzer shouldn't know if there was an error which was fixed, or there wasn't any error at all!
You can take a look at my compiler library for examples.
About phrase level error handling, you again don't need FIRST and FOLLOW. Let's talk about LL(k) for now (simply because about LR(k) I haven't thought about much yet). After you build the grammar table, you have many entries, like I said like this:
<non-terminal, token> -> <action, rule>
Now, when you parse, you take whatever is on the stack, if it was a terminal, then you must match it with the input. If it didn't match, the phrase level error handler kicks in:
Role one: handle missing terminals - simply generate a fake terminal of the type you need in your lexer and have the parser retry. You can do other stuff as well (for example check ahead in the input, if you have the token you want, drop one token from lexer)
If what you get is a non-terminal (T) from the stack instead, you must look at your lexer, get the lookahead and look at your table. If the entry <T, lookahead> existed, then you're good to go. Follow the action and push to/pop from the stack. If, however, no such entry existed, again, the phrase level error handler kicks in:
Role two: handle unexpected terminals - you can do many things to get passed this. What you do depends on what T and lookahead are and your expert knowledge of your grammar.
Examples of the things you can do are:
Fail! You can do nothing
Ignore this terminal. This means that you push lookahead to the stack (after pushing T back again) and have the parser continue. The parser would first match lookahead, throw it away and continues. Example: if you have an expression like this: *1+2/0.5, you can drop the unexpected * this way.
Change lookahead to something acceptable, push T back and retry. For example, an expression like this: 5id = 10; could be illegal because you don't accept ids that start with numbers. You can replace it with _5id for example to continue

Parsing code with syntax errors

Parsing techniques are well described in CS literature. But the algorithms I know of require that the source is syntactically correct. If a syntax error is encountered, parsing is immediately aborted.
But IDE's (like Visual Studio) are typically able to provide meaningful code completion and other hints while typing, which mean the syntax is often not in a valid state. E.g. you type an opening parenthesis in a function call, and the IDE provide parameter hints for the function, even though the syntax is invalid until the closing parenthesis is typed.
It seems to me this must rely on some kind of guessing or error-tolerant parser. Anyone know what techniques or algorithms are used for this?

The standard trick is to do some kind of error repair using the parsing machinery to help make predictions.
For table-based parsers (such as LALR or GLR), when a syntax error occurs, the parser was recently in some state in which the error had not yet happened. One can record the parse stack to remember this before each shift (or alternatively record reductions before the error). Given that an error as been encountered, one can inspect the parse state for the saved stack to determine which tokens might be next (this is also how one can do code completion in terms of syntax tokens). A more sophisticated technique can invent the smallest possible sequence of tokens that allow a shift by the error token, or the smallest possible tree that could replace the error token and allow a shift on the next.
This isn't so easy with recursive descent parsers because there isn't a lot of information lying around with which make a predication. For error recovery, a cheesy trick is define error recovery points (e.g., where a "stmt" might be accepted) and continue scanning until a ";" is found and accept and "error stmt". This doesn't help if you want code completion.

Packrat is promising - it provides information on both successful and failed parsing attempt at key points, which can be recovered and used for smart error reporting, completion, hints and so on. For example, if the cursor is at a point where all the parsing attempts are marked as failed in a cache, a list of tokens tried can be given for completion options.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart