Can the lexer pass a bad token to the parser? - parsing

During the lexical analysis phase of a compiler, if a bad token is encountered, the lexer goes into error recovery mode and, say, discards input until the next semicolon is seen, then resumes its analysis. Is the overall stream of tokens generated then passed to the parser?
What I mean is: if the lexer has encountered an error, does compilation stop at that point, or does it continue on to the parsing phase?

During the lexical analysis phase of a compiler, if a bad token is encountered, the lexer goes into error recovery mode and, say, discards input until the next semicolon is seen, then resumes its analysis.
That's only one way of doing it, and not the best.
Is the overall stream of tokens generated then passed to the parser?
No, only the next legal token.
What I mean is: if the lexer has encountered an error, does compilation stop at that point, or does it continue on to the parsing phase?
It continues.
But for several decades I've been practising the opposite: instead of having the lexer try to do its own error recovery, I just return the offending character to the parser. As the parser is usually equipped with much better error recovery, this leads to a much more fault-tolerant parse.
Sample lex/flex implementation:
. return yytext[0];
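For context, this catch-all rule belongs at the very end of the flex rules section, after all the real token rules. A minimal sketch, in which the NUMBER and WORD tokens are purely illustrative:

%%
[0-9]+      { return NUMBER; }      /* illustrative token */
[A-Za-z]+   { return WORD; }        /* illustrative token */
[ \t\n]+    ;                       /* skip whitespace */
.           { return yytext[0]; }   /* hand anything else straight to the parser */

Since no grammar rule can shift the stray character, bison reports it as an ordinary syntax error and the parser's own recovery machinery takes over.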

Related

Recognize specific grammatical mistakes with Bison

I'm attempting to use Bison to develop my own programming language. I've got the .y file written for my grammar. However, I'm wondering if there's a way, in the case that the user attempts to parse source code with invalid grammar, to have Bison give a useful error message. For example, suppose I have the following rule in my grammar:
if_statement: IF expr '{' statement_list '}' {$$=createNode(IF,$2,$4);}
;
Suppose the source code left out the closing brace. According to my understanding, Bison would report that it was unable to find a rule to reduce the code. Could Bison be made to recognize that there is an unfinished if which begins on line such-and-such and report that to the user?
Missing braces are very rarely detected where they happen, because it is usually the case that whatever follows the missing brace could just as well have come before it. That's particularly clear if the missing close brace is immediately followed by another closing brace, but it could simply be followed (in this case) by another statement:
function some_function() {
    ....
    while (some_condition) {
        ...
        if (some_other_condition) {
            ...
            break;
        // } /* Commented out by mistake */
        a = 3;
        ...
    }
    return a;
}
function another_function() {
    ...
}
If your language doesn't allow nested function definitions then the definition of another_function will trigger an error; if it does allow nested function definitions, then another_function will just be defined in an unexpected scope and the parse will continue, perhaps until the end of file.
One way of detecting errors like this is to compare the indentation of every line against the expected indentation. However, unless your language has some concept of correct indentation (like, for example, Python), you cannot flag misleading indentation as an error. So the best you can do is record the unexpected indentation, in order to use it as a clue when a syntax error is finally encountered (if there is a syntax error, since it might just be that the programmer doesn't care to make their programmes human-readable). The complications in this approach to error detection are probably why it is so uncommon in mainstream languages, although personally I think it's an approach with a lot of potential.
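As a rough illustration only (nothing here comes from a real compiler; every name is an assumption), the bookkeeping might look like this:

/* Sketch: remember the indentation in force at each open brace, and
   record suspiciously indented lines as clues for a later syntax error. */
#define MAX_DEPTH 256

static int indent_at[MAX_DEPTH];   /* indentation of the line of each open '{' */
static int depth = 0;
static int first_odd_line = 0;     /* first line whose indentation looked wrong */

void on_open_brace(int indent)  { if (depth < MAX_DEPTH) indent_at[depth++] = indent; }
void on_close_brace(void)       { if (depth > 0) depth--; }

void on_line(int lineno, int indent) {
    /* A line indented at or left of its enclosing '{' hints that a '}'
       may be missing above it; record the clue, but don't report yet. */
    if (depth > 0 && indent <= indent_at[depth - 1] && first_odd_line == 0)
        first_odd_line = lineno;
}
/* If a syntax error is eventually reported, mention first_odd_line. */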
I usually advocate parsing erroneous programs twice. The first parse is optimised for correct programs, which means that it doesn't need any of the overhead required for good error messages, such as tracking the position of every token. If the program turns out to be syntactically correct, you can then move on to turning the AST into compiled code. If the program turns out to have a syntax error, you can restart the parse at the beginning, and then you are certainly free to use heuristics like indentation checks to attempt to better localise errors.
Having said all that, you may well do better to move on to implementation of your language and return to the problem of producing better diagnostics later.
Bison does offer a mechanism for producing more useful error messages in some cases.
First, you should at least enable line number tracking from Flex, which is almost zero effort. You might also want to track precise token position, which is a bit more work but not too much. (See Character Position from starting of a line, https://stackoverflow.com/a/48879103/1566221 and yyllocp->first_line returns uninitialized value in second iteration of a reEntrant Bison parser (among others) for sample code.)
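A minimal sketch of both levels of tracking, assuming a bison grammar with %locations enabled; yycolumn is a user-maintained variable, adapted from the pattern in the linked answers:

%option yylineno
%{
int yycolumn = 1;
/* Runs before every rule's action: copy the token's position into yylloc. */
#define YY_USER_ACTION                                  \
    yylloc.first_line   = yylloc.last_line = yylineno;  \
    yylloc.first_column = yycolumn;                     \
    yylloc.last_column  = yycolumn + yyleng - 1;        \
    yycolumn += yyleng;
%}

A rule that matches '\n' should also reset yycolumn to 1.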
Second, ask bison to produce verbose error messages. That only requires two extra lines in your bison prologue:
%define parse.error verbose
%define parse.lac full
Please do read the bison manual for some important caveats. In particular, LAC may involve significant overhead. But the error messages produced are often helpful.
Finally, use bison's error recovery mechanism to continue the parse after the first syntax error is detected, thus allowing you to report several syntax errors in a single run. That's usually less frustrating for a user, although you should terminate the parse at some threshold error count, because really high error counts after error recovery usually mean that the error recovery itself failed and that many of the subsequent error messages were bogus.
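As a hedged sketch, a recovery rule with such a cap might look like this; error_count, MAX_ERRORS and the stmt nonterminal are assumptions, while error, yyerrok and YYABORT are standard bison facilities:

stmt: expr ';'
    | error ';'   { if (++error_count >= MAX_ERRORS)  /* error_count: a global int (assumption) */
                        YYABORT;                      /* past the threshold: recovery has likely failed */
                    yyerrok; }                        /* otherwise resume normal error reporting */
    ;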
Again, the bison manual has some useful suggestions about how to use the error facilities.
Bison manual table of contents

Context dependent lexer

I see the following in bash's parse.y, which means that the lexical analysis is context dependent. How can flex be used for this kind of context-dependent analysis? Will such a context-dependent requirement make the flex code too messy? Thanks.
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n3006
/* Handle special cases of token recognition:
IN is recognized if the last token was WORD and the token
before that was FOR or CASE or SELECT.
DO is recognized if the last token was WORD and the token
before that was FOR or SELECT.
ESAC is recognized if the last token caused `esacs_needed_count'
to be set
`{' is recognized if the last token was WORD and the token
before that was FUNCTION, or if we just parsed an arithmetic
`for' command.
`}' is recognized if there is an unclosed `{' present.
`-p' is returned as TIMEOPT if the last read token was TIME.
`--' is returned as TIMEIGN if the last read token was TIMEOPT.
']]' is returned as COND_END if the parser is currently parsing
a conditional expression ((parser_state & PST_CONDEXPR) != 0)
`time' is returned as TIME if and only if it is immediately
preceded by one of `;', `\n', `||', `&&', or `&'.
*/
(F)lex provides start conditions to allow for context-dependent lexical analysis.
If you avoid the temptation to reproduce the parsing logic as a hand-written state machine in the lexical scanner, then start conditions can certainly simplify the implementation of context-dependent scanners.
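As a hedged illustration, something like bash's "`]]' is returned as COND_END" case could be approximated with an exclusive start condition; the names and patterns below are simplified assumptions, not bash's actual implementation:

%x CONDEXPR
%%
"[["                  { BEGIN(CONDEXPR); return COND_START; }
<CONDEXPR>"]]"        { BEGIN(INITIAL);  return COND_END; }
<CONDEXPR>[ \t]+      ;                    /* skip whitespace inside the conditional */
<CONDEXPR>[^ \t\]\n]+ { return COND_WORD; }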
For the particular application of conditionally-recognised keywords -- often called "semi-reserved words" -- context-dependent lexical analysis is often not the best solution. Instead, consider writing the scanner to always recognise the keywords and then add rules in the grammar to treat the words as identifiers in contexts in which the keyword is not possible. See this answer for an example.
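A minimal sketch of that alternative, with token and nonterminal names as assumptions: the scanner always returns IN and TIME as keyword tokens, and the grammar lets them double as ordinary words wherever the keyword reading is impossible:

word: WORD
    | IN      /* `in' is only special after for/case/select */
    | TIME    /* `time' is only special at the start of a pipeline */
    ;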

What is the strategy for adding error productions to a grammar?

How are error productions typically added? I'm encountering the issue that my error productions are too shallow: when the parser starts popping states on an error in a statement, it pops until it hits the error production for the section in which it is located, and prints out an invalid error message.
Is it a good idea to just add some descriptive error production to every nonterminal?
Error productions are about recovering from an error in order to attempt to continue processing the input, not about printing reasonable or useful error messages. Therefore they should be used at points in the grammar where it's likely that you can recognize and resynchronize the input stream properly. For example, if your language consists of a sequence of constructs ending with ; characters, a good error production is something like construct: error ';', which will recover from errors in a construct (whatever that is) by skipping forward in the input to a ; and attempting to go on from there.
Adding many error recovery rules is generally a bad idea, since the parser will only recover to the closest one, and it's often the most global ones at the top level that are most likely to be useful; trying to use a finer granularity will just lead to error cascades as the error recovery rules fail to resync with the input properly.
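Concretely, a hedged sketch of keeping recovery at the top level only (the nonterminal names are illustrative):

program: /* empty */
       | program construct
       ;
construct: declaration
         | statement
         | error ';'   { yyerrok; }   /* skip to the next ';' and carry on */
         ;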

Parsing code with syntax errors

Parsing techniques are well described in CS literature. But the algorithms I know of require that the source is syntactically correct. If a syntax error is encountered, parsing is immediately aborted.
But IDEs (like Visual Studio) are typically able to provide meaningful code completion and other hints while you are typing, which means the syntax is often not in a valid state. E.g. you type an opening parenthesis in a function call, and the IDE provides parameter hints for the function, even though the syntax is invalid until the closing parenthesis is typed.
It seems to me this must rely on some kind of guessing or error-tolerant parser. Anyone know what techniques or algorithms are used for this?
The standard trick is to do some kind of error repair using the parsing machinery to help make predictions.
For table-based parsers (such as LALR or GLR), when a syntax error occurs, the parser was recently in some state in which the error had not yet happened. One can record the parse stack before each shift to remember this (or alternatively record reductions before the error). Given that an error has been encountered, one can inspect the parse state for the saved stack to determine which tokens might come next (this is also how one can do code completion in terms of syntax tokens). A more sophisticated technique can invent the smallest possible sequence of tokens that allows a shift of the erroneous token, or the smallest possible tree that could replace the error token and allow a shift on the next one.
This isn't so easy with recursive descent parsers, because there isn't a lot of information lying around with which to make a prediction. For error recovery, a cheesy trick is to define error recovery points (e.g., where a "stmt" might be accepted) and continue scanning until a ";" is found, then accept an "error stmt". This doesn't help if you want code completion.
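A hedged C sketch of that trick; every type, token and helper name here is an assumption:

/* Panic-mode recovery in a recursive-descent parser: on an error,
   skip ahead to the next ';' and yield an "error stmt" node. */
typedef struct Node Node;
typedef enum { TOK_IF, TOK_IDENT, TOK_SEMI, TOK_EOF } Token;

extern Token lookahead;                 /* current input token */
extern void  advance(void);             /* fetch the next token */
extern void  report_error(const char *msg);
extern Node *parse_if(void), *parse_assignment(void), *make_error_stmt(void);

Node *parse_stmt(void) {
    switch (lookahead) {
    case TOK_IF:    return parse_if();
    case TOK_IDENT: return parse_assignment();
    default:
        report_error("unexpected token at start of statement");
        while (lookahead != TOK_SEMI && lookahead != TOK_EOF)
            advance();                  /* skip to the recovery point */
        if (lookahead == TOK_SEMI)
            advance();                  /* consume the ';' itself */
        return make_error_stmt();       /* accept an "error stmt" */
    }
}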
Packrat parsing is promising - it provides information on both successful and failed parsing attempts at key points, which can be recovered and used for smart error reporting, completion, hints and so on. For example, if the cursor is at a point where all the parsing attempts are marked as failed in the cache, the list of tokens tried can be offered as completion options.

GLR parser with error recovery: too much alternatives when there are errors in input

Preamble
I have written a GLR parser with error recovery. When it encounters an error, it splits into the following alternatives:
1. Insert the expected element into the input (maybe the user just missed it) and proceed as usual.
2. Replace the erroneous element with the expected one (maybe the user just made a typo) and proceed as usual.
3. Skip the erroneous element, and if the next element is also erroneous, go to #2.
But if the input has a lot of errors (for example, the user has fed a JPEG file to the parser by mistake), the number of alternatives grows exponentially.
Example
Such a parser corresponding to the following grammar:
Program -> Identifier WS Identifier WS '=' WS Identifier
Identifier -> ('a'..'z' | 'A'..'Z' | '0'..'9')*
WS -> ' '*
applied to the following text:
x = "abc\"def"; y = "ghi\"jkl";
fails with "out of memory" on a moderately modern desktop computer.
Question
How can the number of alternatives be reduced when the input contains errors?
Doing GLR parsing (and therefore error correction) at the character level is possible, but it aggravates your problem.
The GLR error recovery procedure we use operates on tokens, so it isn't as bad.
But when the input has a huge number of errors, it is pretty hard to recover. More sophisticated error recovery schemes basically use the parser to identify valid substrings of the language in the input, and then attempt to patch the substrings together to get the result. That's pretty ambitious.
I've built GLR parsers with error recovery, but I wasn't that ambitious. In general, the parser simply aborts when the number of live parsers gets above "a large number" (e.g., 10,000) or the number of syntax errors encountered exceeds a threshold (e.g., 10 or 20). You might also consider aborting the parse if it hasn't advanced the input stream in the last second, which is an indirect sign that it has too many live parsers.
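A hedged sketch of those cut-offs in a GLR driver loop; the GSS type and every function here are assumptions, not a real API:

typedef struct GSS GSS;            /* graph-structured stack (assumption) */
typedef struct Input Input;

extern int    at_end(Input *in);
extern void   step_all_live_parsers(GSS *gss, Input *in);
extern int    live_parser_count(GSS *gss);
extern int    syntax_error_count(GSS *gss);
extern double seconds_since_last_shift(GSS *gss);
extern int    accept_if_complete(GSS *gss);

enum { MAX_LIVE_PARSERS = 10000, MAX_SYNTAX_ERRORS = 20, PARSE_ABORTED = -1 };

int run_parse(GSS *gss, Input *in) {
    while (!at_end(in)) {
        step_all_live_parsers(gss, in);              /* advance every alternative by one token */
        if (live_parser_count(gss) > MAX_LIVE_PARSERS ||
            syntax_error_count(gss) > MAX_SYNTAX_ERRORS ||
            seconds_since_last_shift(gss) > 1.0)     /* input no longer advancing */
            return PARSE_ABORTED;                    /* recovery has probably diverged */
    }
    return accept_if_complete(gss);
}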
