Recognize specific grammatical mistakes with Bison

I'm attempting to use Bison to develop my own programming language. I've got the .y file written for my grammar. However, I'm wondering if there's a way, in the case that the user attempts to parse source code with invalid grammar, to have Bison give a useful error message. For example, suppose I have the following rule in my grammar:
if_statement: IF expr '{' statement_list '}' {$$=createNode(IF,$2,$4);}
;
Suppose the source code left out the closing brace. According to my understanding, Bison would report that it was unable to find a rule to reduce the code. Could Bison be made to recognize that there is an unfinished if which begins on line such-and-such and report that to the user?

Missing braces are very rarely detected where they happen, because it is usually the case that whatever follows the missing brace could just as well have come before it. That's particularly clear if the missing close brace is immediately followed by another closing brace, but it could simply be followed (in this case) by another statement:
function some_function() {
    ...
    while (some_condition) {
        ...
        if (some_other_condition) {
            ...
            break;
        // } /* Commented out by mistake */
        a = 3;
        ...
    }
    return a;
}
function another_function() {
    ...
}
If your language doesn't allow nested function definitions then the definition of another_function will trigger an error; if it does allow nested function definitions, then another_function will just be defined in an unexpected scope and the parse will continue, perhaps until the end of file.
One way of detecting errors like this is to compare the indentation of every line with the expected indentation. However, unless your language gives indentation syntactic meaning (like, for example, Python), you cannot flag misleading indentation as an error. So the best you can do is record the unexpected indentation, in order to use it as a clue when a syntax error is finally encountered (if there is a syntax error, since it might just be that the programmer doesn't care to make their programs human-readable). The complications in this approach to error detection are probably why it is so uncommon in mainstream languages, although personally I think it's an approach with a lot of potential.
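To make the idea concrete, here is a minimal sketch in C of that bookkeeping (all names are hypothetical, not from any real compiler): remember the indentation of the line on which each { appears, and record the first line whose indentation drops below that of the innermost open brace as a candidate site for a missing }.

/* Record suspicious dedents as clues for later syntax-error reporting. */
#define MAX_DEPTH 256

static int brace_indent[MAX_DEPTH]; /* indentation of each open brace's line */
static int depth;
static int suspect_line;            /* first line that dedented past a brace */

/* Call once per line, before its tokens are processed. */
void note_line(int lineno, int indent) {
    if (depth > 0 && indent < brace_indent[depth - 1] && suspect_line == 0)
        suspect_line = lineno;      /* possibly a missing '}' before here */
}

void note_open_brace(int line_indent) {
    if (depth < MAX_DEPTH)
        brace_indent[depth++] = line_indent;
}

void note_close_brace(void) {
    if (depth > 0)
        --depth;
}

/* When the parser finally reports a syntax error, a non-zero suspect_line
   can be offered as a hint: "possibly missing '}' near line N". */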
I usually advocate parsing erroneous programs twice. The first parse is optimised for correct programs, which means that it doesn't need any of the overhead required for good error messages, such as tracking the position of every token. If the program turns out to be syntactically correct, you can then move on to turning the AST into compiled code. If the program turns out to have a syntax error, you can restart the parse from the beginning, and then you are certainly free to use heuristics like indentation checks to attempt to better localise errors.
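The driver for that two-pass strategy can be very small; in this sketch, parse_fast and parse_with_diagnostics are hypothetical stand-ins for two builds of the same grammar, one cheap and one instrumented:

#include <stdio.h>

int parse_fast(FILE *src);             /* no location tracking, no overhead    */
int parse_with_diagnostics(FILE *src); /* tracks positions, checks indentation */

int compile(FILE *src) {
    if (parse_fast(src) == 0)
        return 0;                      /* correct program: go on to codegen    */
    rewind(src);                       /* syntax error: restart from the start */
    return parse_with_diagnostics(src);
}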
Having said all that, you may well do better to move on to implementation of your language and return to the problem of producing better diagnostics later.
Bison does offer a mechanism for producing more useful error messages in some cases.
First, you should at least enable line number tracking from Flex, which is almost zero effort. You might also want to track precise token position, which is a bit more work but not too much. (See Character Position from starting of a line, https://stackoverflow.com/a/48879103/1566221 and yyllocp->first_line returns uninitialized value in second iteration of a reEntrant Bison parser (among others) for sample code.)
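The usual Flex pattern (consistent with those linked answers, though the details here are a sketch) is %option yylineno plus a YY_USER_ACTION that stamps each token's location before its rule runs; yycolumn is a variable you maintain yourself, and the scanner must include the bison-generated header so that yylloc is visible:

%option yylineno

%{
static int yycolumn = 1;
/* Runs before every rule: record where this token starts and ends. */
#define YY_USER_ACTION                                    \
    yylloc.first_line   = yylloc.last_line = yylineno;    \
    yylloc.first_column = yycolumn;                       \
    yylloc.last_column  = yycolumn + yyleng - 1;          \
    yycolumn += yyleng;
%}

%%
\n      { yycolumn = 1; /* reset the column count at each newline */ }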
Second, ask bison to produce verbose error messages. That only requires two extra lines in your bison prologue:
%define parse.error verbose
%define parse.lac full
Please do read the bison manual for some important caveats. In particular, LAC may involve significant overhead. But the error messages produced are often helpful.
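Putting the pieces together, a prologue along these lines (a sketch for a plain, non-reentrant parser; %locations makes the global yylloc available to yyerror) reports the position alongside bison's verbose message:

%locations
%define parse.error verbose
%define parse.lac full

%{
#include <stdio.h>
void yyerror(const char *msg);
%}

%%
/* ... grammar rules ... */
%%

void yyerror(const char *msg) {
    /* yylloc holds the location of the offending token */
    fprintf(stderr, "line %d, column %d: %s\n",
            yylloc.first_line, yylloc.first_column, msg);
}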
Finally, use bison's error recovery mechanism to continue the parse after the first syntax error is detected, thus allowing you to report several syntax errors in a single run. That's usually less frustrating for a user, although you should terminate the parse at some threshold error count, because really high error counts after error recovery usually mean that the error recovery itself failed and that many of the subsequent error messages were bogus.
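In grammar terms that looks something like the sketch below (the rule names are just examples): the predefined error token resynchronizes at the next ;, yyerrok re-arms error reporting, and bison's yynerrs counter supplies the bail-out threshold:

statement: if_statement
         | simple_statement ';'
         | error ';'            { yyerrok;         /* resync at the next ';' */
                                  if (yynerrs > 20)
                                      YYABORT;     /* probably cascading     */
                                }
         ;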
Again, the bison manual has some useful suggestions about how to use the error facilities.
Bison manual table of contents

Related

Parsing expressions - unary, binary and incrementing/decrementing operators

I am trying to write a parser for my language (for learning and fun). The problem is that I don't know how to parse expressions like a--b or a----b when there are multiple operators made from the - character: unary minus (-x), binary minus (x-y), pre-decrement (--x, like in C) and post-decrement (x--). Both a--b and a----b should be valid and produce:
a--b  ->    sub            a----b  ->      sub
           +-+-+                        +--+--+
           a   neg                   decr     neg
                |                      |       |
                b                      a       b
When the lexer tokenizes a--b, it does not know whether it is a decrement or a minus sign repeated two times, so the parser must find out which one it is.
How could I determine if - is part of decrement operator or just minus sign?
The problem is not really parsing so much as deciding the rules.
Why should a----b be a-- - -b and not a - -(--b)? For that matter, should a---b be a-- - b or a - --b?
And what about a---3 or 3---a? Neither 3-- nor --3 makes any sense, so if the criterion were "choose (one of) the sensible interpretations", you'd end up with a-- - 3 and 3 - --a. But even if that were implementable without excessive effort, it would place a huge cognitive load on coders and code readers.
Once upon a time, submitting a program for execution was a laborious and sometimes bureaucratic process, and having a run cancelled because the compiler couldn't find the correct interpretation was enormously frustrating. I still carry the memories of my student days, waiting in one queue to hand my programs to a computer operator and then in another queue to receive the printed results.
So it became briefly popular to create programming languages which went to heroic lengths to find a valid interpretation of what they were given. But that effort also meant that some bugs passed without error, because the programmer and the programming language had different understandings of what the "natural interpretation" might be.
If you program in C/C++, you may well have at some time written a & 3 == 0 instead of (a & 3) == 0. Fortunately, modern compilers will warn you about this bug, if warnings are enabled. But it's at least reasonable to ask whether the construct should even be permitted. Although it's a little annoying to have to add parentheses and recompile, it's not nearly as frustrating as trying to debug the obscure behaviour which results. Or to have accepted the code in a code review without noticing the subtle error.
These days, the compile / test / edit cycle is much quicker, so there's little excuse for not insisting on clarity. If I were writing a compiler today, I'd probably flag as an error any potentially ambiguous sequence of operator characters. But that might be going too far.
In most languages, a relatively simple rule is used: at each point in the program, the lexical analysis chooses the longest possible token, whether or not it "makes sense". That's what C and C++ do (mostly) and it has the advantage of being easy to implement and also easy to verify. (Even so, in a code review I would insist that a---b be written as a-- -b.)
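In a hand-written lexer, maximal munch for these operators needs only one character of lookahead; a sketch in C (the token names are hypothetical):

#include <stdio.h>

enum { TOK_MINUS, TOK_DECR /* , ... */ };

/* Having just seen '-', return the longest possible token. */
int lex_minus(FILE *in) {
    int c = fgetc(in);
    if (c == '-')
        return TOK_DECR;    /* consumed "--" */
    if (c != EOF)
        ungetc(c, in);      /* not "--": give the character back */
    return TOK_MINUS;
}

Under this rule a----b lexes as a, --, --, b, which the parser then rejects - exactly the "whether or not it makes sense" behaviour described above.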
You could slightly modify this rule so that only the first pair of -- is taken as a token, which would capture some of your desired parses without placing too much load on the code reader.
You could use even more complicated rules, but remember that whatever you implement, you have to document. If it's too hard to document clearly, it's probably unsuitable.
Once you articulate your list of rules, you can work on the implementation. In many cases, the simplest approach is to just try the possibilities in order, with backtracking or in parallel. Or you might be able to precompute the possible parses.
Alternatively, you could use a GLR or GLL parser generator capable of finding all parses in an ambiguous grammar, and then select the "best" parse based on whatever criteria you prefer.

Improving errors output by Grako-generated parser

I'm trying to figure out the best approach to improving the errors displayed to a user of a Grako-generated parser. It seems that the default parse errors displayed by the Grako-generated parser when it hits a parsing issue in the input file are not helpful. The errors often imply the issue is in one part of the input file when the true error is somewhere else.
I've been looking into the Grako Semantics class to put in some checks which would display better error messages if the checks fail, but it also seems like there could be tons of edge cases that must be specified to be able to catch all of the possible ways the parsing of a rule can fail.
Does anyone have any recommendations or examples I can view?
A PEG parser will exhaust all options, sometimes leaving you at a failure corresponding to the last, and least likely option.
With Grako, you can add cut elements (~) to the grammar to have the parser commit to certain options when it can be sure they are the ones to match.
term = '(' ~ expression ')' | int ;
Cut elements also prune the memoization cache, which improves parser performance.

Encode FIRST and FOLLOW sets into a recursive descent parser

This is a follow-up to a previous question I asked, How to encode FIRST & FOLLOW sets inside a compiler, but this one is more about the design of my program.
I am implementing the Syntax Analysis phase of my compiler by writing a recursive descent parser. I need to be able to take advantage of the FIRST and FOLLOW sets so I can handle errors in the syntax of the source program more efficiently. I have already calculated FIRST and FOLLOW for all of my non-terminals, but am having trouble deciding where to logically place them in my program and what the best data structure to hold them would be.
Note: all code will be pseudo code
Option 1) Use a map, and map each non-terminal by its name to the two Sets that contain its FIRST and FOLLOW sets:
class ParseConstants
    Map firstAndFollowMap = # create a map .....
    firstAndFollowMap.put("<program>", FIRST_SET, FOLLOW_SET)
end
This seems like a viable option, but inside my parser I would then need sorta ugly code like this to retrieve the FIRST and FOLLOW sets and pass them to the error function:
# processes the <program> non-terminal
def program
    List list = firstAndFollowMap.get("<program>")
    Set FIRST = list.get(0)
    Set FOLLOW = list.get(1)
    error(current_symbol, FOLLOW)
end
Option 2) Create a class for each non-terminal and have a FIRST and FOLLOW property:
class Program
    FIRST = .....
    FOLLOW = ....
end
This leads to code that looks a little nicer:
# processes the <program> non-terminal
def program
    error(current_symbol, Program.FOLLOW)
end
These are the two options I thought up. I would love to hear any other suggestions for ways to encode these two sets, and any critiques of, or additions to, the two approaches I posted would be helpful.
Thanks
I have also posted this question here: http://www.coderanch.com/t/570697/java/java/Encode-FIRST-FOLLOW-sets-recursive
You don't really need the FIRST and FOLLOW sets at parse time; you compute them in order to build the parse table. That is a table of {<non-terminal, token> -> <action, rule>} if LL(k) (which means: seeing this non-terminal on the stack and this token in the input, which action to take and, if it applies, which rule to apply), or a table of {<state, token> -> <action, state>} if (C|LA|)LR(k) (which means: given this state on the stack and this token in the input, which action to take and which state to go to).
After you get this table, you don't need FIRST and FOLLOW any more.
If you are writing a semantic analyzer, you must assume the parser is working correctly. Phrase-level error handling (which means handling parse errors) is totally orthogonal to semantic analysis.
This means that in case of a parse error, the phrase-level error handler (PLEH) tries to fix the error. If it can't, parsing stops. If it can, the semantic analyzer shouldn't know whether there was an error which was fixed, or there was no error at all!
You can take a look at my compiler library for examples.
About phrase-level error handling: again, you don't need FIRST and FOLLOW. Let's talk about LL(k) for now (simply because I haven't yet thought much about LR(k)). After you build the grammar table, you have many entries, each of the form I gave above:
<non-terminal, token> -> <action, rule>
Now, when you parse, you take whatever is on top of the stack. If it is a terminal, you must match it against the input. If it doesn't match, the phrase-level error handler kicks in:
Role one: handle missing terminals - simply generate a fake terminal of the type you need in your lexer and have the parser retry. You can do other things as well (for example, look ahead in the input: if the token you want is there, drop one token from the lexer).
If what you get from the stack is instead a non-terminal (T), you must ask your lexer for the lookahead and consult your table. If the entry <T, lookahead> exists, then you're good to go: follow the action and push to/pop from the stack. If, however, no such entry exists, again the phrase-level error handler kicks in:
Role two: handle unexpected terminals - you can do many things to get past this. What you do depends on what T and the lookahead are, and on your expert knowledge of your grammar. (A sketch of the whole loop, with both roles, follows the examples below.)
Examples of the things you can do are:
Fail! You can do nothing.
Ignore this terminal. This means that you push the lookahead onto the stack (after pushing T back again) and have the parser continue. The parser will first match the lookahead, throw it away and continue. Example: in an expression like *1+2/0.5, you can drop the unexpected * this way.
Change the lookahead to something acceptable, push T back and retry. For example, an expression like 5id = 10; could be illegal because you don't accept identifiers that start with digits. You could replace the token with, say, _5id and continue.
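As promised, here is a compact sketch in C of the loop described above. Everything in it - the token values, table_lookup, the report functions - is hypothetical, not from any particular compiler; it only illustrates where the two handler roles live:

/* Table-driven LL(1) parse loop with the two phrase-level error roles. */
enum { TOK_EOF, TOK_ID, TOK_NUM, N_TOKENS };           /* terminals     */
enum { NT_PROGRAM = N_TOKENS };                        /* non-terminals */

typedef struct { const int *rhs; int rhs_len; } Rule;  /* rule to expand */

extern Rule *table_lookup(int nonterminal, int token); /* the LL(1) table */
extern int   next_token(void);                         /* the lexer       */
extern void  report_missing(int expected, int got);    /* role one        */
extern void  report_unexpected(int nonterminal, int got); /* role two     */

static int stack[1024];
static int top;

void parse(void) {
    int lookahead = next_token();
    stack[top++] = NT_PROGRAM;
    while (top > 0) {
        int t = stack[--top];
        if (t < N_TOKENS) {                   /* terminal on the stack    */
            if (t == lookahead)
                lookahead = next_token();     /* matched: advance input   */
            else
                report_missing(t, lookahead); /* role one: report, then
                                                 act as if it were there  */
        } else {                              /* non-terminal on the stack */
            Rule *r = table_lookup(t, lookahead);
            if (r) {
                for (int i = r->rhs_len - 1; i >= 0; i--)
                    stack[top++] = r->rhs[i]; /* push RHS in reverse      */
            } else if (lookahead == TOK_EOF) {
                return;                       /* give up at end of input  */
            } else {
                report_unexpected(t, lookahead); /* role two              */
                stack[top++] = t;             /* push T back              */
                lookahead = next_token();     /* drop the offending token */
            }
        }
    }
}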

What is the strategy for adding error productions to a grammar?

How are error productions typically added? I'm encountering the issue that my error productions are too shallow: when the parser starts popping states on an error in a statement, it pops until it hits the error production for the section in which it is located, and prints out an invalid error message.
Is it a good idea to just add some descriptive error production to every nonterminal?
Error productions are about recovering from an error in order to attempt to continue processing the input, not about printing reasonable or useful error messages. Therefore they should be used at points in the grammar where it's likely that you can recognize and resynchronize the input stream properly. For example, if your language consists of a sequence of constructs ending with ; characters, a good error production is something like construct: error ';', which will recover from errors in a construct (whatever that is) by skipping forward in the input to a ; and attempting to go on from there.
Putting in many error recovery rules is generally a bad idea, since the parser will only recover to the closest one, and it's often the most global ones at the top level that are most likely to be useful; trying to use a finer granularity will just lead to error cascades as the error recovery rules fail to resync with the input properly.

Parsing code with syntax errors

Parsing techniques are well described in the CS literature, but the algorithms I know of require that the source be syntactically correct. If a syntax error is encountered, parsing is immediately aborted.
But IDEs (like Visual Studio) are typically able to provide meaningful code completion and other hints while you type, which means the syntax is often not in a valid state. E.g. you type an opening parenthesis in a function call, and the IDE provides parameter hints for the function, even though the syntax is invalid until the closing parenthesis is typed.
It seems to me this must rely on some kind of guessing or error-tolerant parser. Anyone know what techniques or algorithms are used for this?
The standard trick is to do some kind of error repair using the parsing machinery to help make predictions.
For table-based parsers (such as LALR or GLR), when a syntax error occurs, the parser was recently in some state in which the error had not yet happened. One can record the parse stack before each shift to remember this (or, alternatively, record reductions before the error). Given that an error has been encountered, one can inspect the saved parse state to determine which tokens might come next (this is also how one can do code completion in terms of syntax tokens). A more sophisticated technique can invent the smallest possible sequence of tokens that allows a shift of the erroneous token, or the smallest possible tree that could replace it and allow a shift on the next.
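For instance, Bison (3.6 or later) exposes exactly this kind of state inspection: with %define parse.error custom you write the reporting function yourself, and can query the parser context for the set of tokens that would have been acceptable - the same information a completion engine needs. A sketch (the grammar itself is elided):

%define parse.error custom

%{
#include <stdio.h>
%}

%%
/* ... grammar rules ... */
%%

int yyreport_syntax_error(const yypcontext_t *ctx) {
    yysymbol_kind_t expected[10];
    int n = yypcontext_expected_tokens(ctx, expected, 10);
    fprintf(stderr, "syntax error; expected one of:");
    for (int i = 0; i < n; i++)
        fprintf(stderr, " %s", yysymbol_name(expected[i]));
    fputc('\n', stderr);
    return 0;
}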
This isn't so easy with recursive descent parsers because there isn't a lot of information lying around with which to make a prediction. For error recovery, a cheesy trick is to define error recovery points (e.g., places where a "stmt" might be accepted), continue scanning until a ";" is found, and accept an "error stmt". This doesn't help if you want code completion.
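That cheesy trick, in sketch form (the token names and helpers are hypothetical):

/* Recovery point: on a parse error inside a statement, skip to the
   next ';' and record an "error stmt" so parsing can continue. */
enum { TOK_EOF, TOK_SEMI /* , ... */ };
extern int  current_token;          /* maintained by the lexer  */
extern void advance(void);          /* fetch the next token     */
extern void emit_error_stmt(void);  /* placeholder AST node     */

void recover_statement(void) {
    while (current_token != TOK_SEMI && current_token != TOK_EOF)
        advance();                  /* discard tokens up to ';' */
    if (current_token == TOK_SEMI)
        advance();                  /* consume the ';' itself   */
    emit_error_stmt();
}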
Packrat parsing is promising - it provides information on both successful and failed parsing attempts at key points, which can be recovered and used for smart error reporting, completion, hints and so on. For example, if the cursor is at a point where all the parsing attempts are marked as failed in the cache, a list of the tokens that were tried can be offered as completion options.
