Ambiguity error while rendering parse tree - parsing

In Rascal, when rendering a parse tree, on an ambiguous grammar, why do I sometimes get an error message stating "Ambiguity" at some location instead of Rascal just rendering a parse forest and showing the ambiguity?
I always just call render(renderParsetree(parse(SomeSymbol, SomeLocation))); but I have no clue about when this just renders a parse forest and when it presents an error message about ambiguity. In my opinion the parse forests display ambiguity much clearer and I would like to know if a way exists to show it instead, when Rascal presents the error message.
Edit: Not just rendering a parse tree but even 'Dr. Ambiguity' (diagnose) fails with an Ambiguity error in these cases so this is no way to find the cause of the ambiguity either.

I received the following answer from jurgenv by email: by a recent change you need to set allowAmbiguity=true when calling parse to allow ambiguity. The behaviour of this method was changed to avoid the parser takes a very long time to process a file that is accidently very ambiguous and allows one to discover ambiguity faster.

Related

Recognize specific grammatical mistakes with Bison

I'm attempting to use Bison to develop my own programming language. I've got the .y file written for my grammar. However, I'm wondering if there's a way, in the case that the user attempts to parse source code with invalid grammar, to have Bison give a useful error message. For example, suppose I have the following rule in my grammar:
if_statement: IF expr '{' statement_list '}' {$$=createNode(IF,$2,$4);}
;
Suppose the source code left out the closing brace. According to my understanding, Bison would report that it was unable to find a rule to reduce the code. Could Bison be made to recognize that there is an unfinished if which begins on line such-and-such and report that to the user?
Missing braces are very rarely detected where they happen, because it is usually the case that whatever follows the missing brace could just as well have come before it. That's particularly clear if the missing close brace is immediately followed by another closing brace, but it could simply be followed (in this case) by another statement:
function some_function() {
....
while (some_condition) {
...
if (some_other_condition) {
...
break;
// } /* Commented out by mistake */
a = 3;
...
}
return a;
}
function another_function() {
...
}
If your language doesn't allow nested function definitions then the definition of another_function will trigger an error; if it does allow nested function definitions, then another_function will just be defined in an unexpected scope and the parse will continue, perhaps until the end of file.
One way of detecting errors like this is to check indentation of every line with the expected indentation. However, unless your language has some concept of correct indentation (like, for example, Python), you cannot flag misleading indentation as an error. So the best you can do is record the unexpected indentation, in order to use it as a clue when a syntax error is finally encountered (if there is a syntax error, since it might just be that the programmer doesn't care to make their programmes human-readable). The complications in this approach to error detection are probably why it is so uncommon in mainstream languages, although personally I think it's an approach with a lot of potential.
I usually advocate parsing erroneous programs twice. The first parse is optimised for correct programs, which means that it doesn't need any of the overhead required for good error messages, such as tracking the position of every token. If the program turns out to be syntactically correct, you can then move on to turning the AST into compiled code. If the program turns out to have an syntax error, you can restart the parse at the beginning, and then you are certainly free to use heuristics like indentation checks to attempt to better localise errors.
Having said all that, you may well do better to move on to implementation of your language and return to the problem of producing better diagnostics later.
Bison does offer a mechanism for producing more useful error messages in some cases.
First, you should at least enable line number tracking from Flex, which is almost zero effort. You might also want to track precise token position, which is a bit more work but not too much. (See Character Position from starting of a line, https://stackoverflow.com/a/48879103/1566221 and yyllocp->first_line returns uninitialized value in second iteration of a reEntrant Bison parser (among others) for sample code.)
Second, ask bison to produce verbose error messages. That only requires two extra lines in your bison prologue:
%define parse.error verbose
%define parse.lac full
Please do read the bison manual for some important caveats. In particular, LAC may involve significant overhead. But the error messages produced are often helpful.
Finally, use bison's error recovery mechanism to continue the parse after the first syntax error is detected, thus allowing you to report several syntax errors in a single run. That's usually less frustrating for a user, although you should terminate the parse at some threshold error count, because really high error counts after error recovery usually mean that the error recovery itself failed and that many of the subsequent error messages were bogus.
Again, the bison manual has some useful suggestions about how to use the error facilities.
Bison manual table of contents

How to catch skipped tokens with antlr?

I've been playing around with antlr to do a kind of excel formula validation. Antlr looks pretty nice, however, I have some doubts about the way it works.
Imagine I have a grammar that already knows about all kind of tokens needed to perform an excel formula validation (rules references, operations, etc). In this grammar, there is no valid token for currency symbols (€,£, etc), though I have an 'ERROR_CHAR' token that matches anything: ERROR_CHAR: .;
Here's what I want to know about an example input: =€€€+SUM(1,2)
The formula is not valid
All the tokens after €€€ are valid and there are rules for them -> +SUM(1,2)
My parser only knows that € is invalid, but don't know about a sequence of ERROR_CHAR, just like €€€, and so, all the input is wrong and all subsequent tokens are caught by the error listener. I assume that this is because, based on my parser rules, I am not saying that ERROR_CHAR could be present anywhere in the input.
I don't want to skip those tokens, because I'd like to highlight the position of the error and I am already skipping whitespaces.
Do you have any idea how could I handle this?
If you just want to highlight the position of an error, let ANTLR detect errors and their positions. It does it quite well. No grammar changes required.
Use ErrorListener to detect errors and handle them.
You can find more information here: Handling errors in ANTLR4

Improving errors output by Grako-generated parser

I'm trying to figure out the best approach to improving the errors displayed to a user of a Grako-generated parser. It seems like the default parse errors displayed by the Grako-generated parser when it hits some parsing issue in the input file are not helpful. The errors often seem to imply the issue is in one part of the input file when the true error is somewhere different.
I've been looking into the Grako Semantics class to put in some checks which would display better error messages if the checks fail, but it also seems like there could be tons of edge cases that must be specified to be able to catch all of the possible ways the parsing of a rule can fail.
Does anyone have any recommendations or examples I can view?
A PEG parser will exhaust all options, sometimes leaving you at a failure corresponding to the last, and least likely option.
With Grako, you can add cut elements (~) to the grammar to have the parser commit to certain options when it can be sure they are the ones to match.
term = '(' ~ expression ')' | int ;
Cut elements also prune the memoization cache, which improves parser performance.

What is the strategy for adding error productions to a grammar?

How are error productions typically added? I'm encountering the issue that my error productions are too shallow: when the parser starts popping states on an error in a statement, it pops until it hits the error production for the section in which it is located, and prints out an invalid error message.
Is it a good idea to just add some descriptive error production to every nonterminal?
Error productions are about recovering from an error in order to attempt to continue processing the input, not about printing reasonable or useful error messages. Therefor they should be used at points in the grammar where its likely that you can recognize and resynchronize the input stream properly. For example, if your language consists of a sequence of constructs ending with ; characters, a good error production is something like construct: error ';', which will recover from errors in a construct (whatever that is) by skipping forward in the input to a ; and attempting to go on from there.
Putting many error recovery rules is generaly a bad idea, since the parser will only recover to the closest one, and its often the most global ones at the top level that are most likely to be useful and trying to use a finer granularity will just lead to error cascades as the error recovery rules can't resync with the input properly.

Parsing code with syntax errors

Parsing techniques are well described in CS literature. But the algorithms I know of require that the source is syntactically correct. If a syntax error is encountered, parsing is immediately aborted.
But IDE's (like Visual Studio) are typically able to provide meaningful code completion and other hints while typing, which mean the syntax is often not in a valid state. E.g. you type an opening parenthesis in a function call, and the IDE provide parameter hints for the function, even though the syntax is invalid until the closing parenthesis is typed.
It seems to me this must rely on some kind of guessing or error-tolerant parser. Anyone know what techniques or algorithms are used for this?
The standard trick is to do some kind of error repair using the parsing machinery to help make predictions.
For table-based parsers (such as LALR or GLR), when a syntax error occurs, the parser was recently in some state in which the error had not yet happened. One can record the parse stack to remember this before each shift (or alternatively record reductions before the error). Given that an error as been encountered, one can inspect the parse state for the saved stack to determine which tokens might be next (this is also how one can do code completion in terms of syntax tokens). A more sophisticated technique can invent the smallest possible sequence of tokens that allow a shift by the error token, or the smallest possible tree that could replace the error token and allow a shift on the next.
This isn't so easy with recursive descent parsers because there isn't a lot of information lying around with which make a predication. For error recovery, a cheesy trick is define error recovery points (e.g., where a "stmt" might be accepted) and continue scanning until a ";" is found and accept and "error stmt". This doesn't help if you want code completion.
Packrat is promising - it provides information on both successful and failed parsing attempt at key points, which can be recovered and used for smart error reporting, completion, hints and so on. For example, if the cursor is at a point where all the parsing attempts are marked as failed in a cache, a list of tokens tried can be given for completion options.

Resources