Flex/bison scanner fails on unput() when loading text with yy_scan_bytes() - flex-lexer

I am adapting an already-fully-functional parser to read from a client-provided text buffer, so I am following the examples I found on this site, which have me load the buffer using yy_scan_bytes(). Unfortunately, this leads to a fatal error using the existing grammar due to this (long-existing) Flex rule:
.|"\n" { BEGIN INIT; unput(yytext[0]); }
This rule is the first one hit when I parse any input, and the unput() always fails with the "flex scanner push-back overflow" error. I am not quite sure what this all-purpose rule is doing, but taking it out causes everything to fail in other ways. Any ideas or enlightenment would be appreciated.

That action unconditionally sets the start condition to INIT. (Actually, it's not quite unconditional. It requires that the input contain at least one byte; otherwise, the EOF action will be performed instead.)
However, unput(yytext[0]); is really not a very good idea, although I'm a little surprised that it doesn't work. Much better is yyless(0);, which in this case does exactly the same thing (arranges for the character just scanned to be rescanned in a different start condition), but without doing nearly as much work. In particular, it does not need to modify the input buffer, so it will not fail in the same way that unput does.
The problem with unput appears to be that flex cannot relocate the unconsumed input in the current buffer because the current buffer is exactly the size needed to hold the input. It's not clear to me why it feels that it needs to relocate the input, though.

Related

Recognize specific grammatical mistakes with Bison

I'm attempting to use Bison to develop my own programming language. I've got the .y file written for my grammar. However, I'm wondering if there's a way, in the case that the user attempts to parse source code with invalid grammar, to have Bison give a useful error message. For example, suppose I have the following rule in my grammar:
if_statement: IF expr '{' statement_list '}' {$$=createNode(IF,$2,$4);}
;
Suppose the source code left out the closing brace. According to my understanding, Bison would report that it was unable to find a rule to reduce the code. Could Bison be made to recognize that there is an unfinished if which begins on line such-and-such and report that to the user?
Missing braces are very rarely detected where they happen, because it is usually the case that whatever follows the missing brace could just as well have come before it. That's particularly clear if the missing close brace is immediately followed by another closing brace, but it could simply be followed (in this case) by another statement:
function some_function() {
    ....
    while (some_condition) {
        ...
        if (some_other_condition) {
            ...
            break;
        // } /* Commented out by mistake */
        a = 3;
        ...
    }
    return a;
}
function another_function() {
    ...
}
If your language doesn't allow nested function definitions then the definition of another_function will trigger an error; if it does allow nested function definitions, then another_function will just be defined in an unexpected scope and the parse will continue, perhaps until the end of file.
One way of detecting errors like this is to compare the indentation of every line with the expected indentation. However, unless your language has some concept of correct indentation (like, for example, Python), you cannot flag misleading indentation as an error. So the best you can do is record the unexpected indentation, in order to use it as a clue when a syntax error is finally encountered (if there is a syntax error, since it might just be that the programmer doesn't care to make their programmes human-readable). The complications in this approach to error detection are probably why it is so uncommon in mainstream languages, although personally I think it's an approach with a lot of potential.
I usually advocate parsing erroneous programs twice. The first parse is optimised for correct programs, which means that it doesn't need any of the overhead required for good error messages, such as tracking the position of every token. If the program turns out to be syntactically correct, you can then move on to turning the AST into compiled code. If the program turns out to have a syntax error, you can restart the parse at the beginning, and then you are certainly free to use heuristics like indentation checks to attempt to better localise errors.
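The indentation check can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not part of any Bison or Flex facility: the helper name first_indent_mismatch is hypothetical, and the sketch assumes space-only indentation with a fixed unit, '{' and '}' as block delimiters, and '//' line comments. Braces inside string literals or block comments are not handled.

```c
#include <stddef.h>

/* Hypothetical helper: return the 1-based number of the first line whose
 * leading-space count differs from (brace depth * unit), or 0 if every
 * line matches. Assumes space indentation, '{'/'}' block delimiters and
 * '//' comments; braces in strings or block comments are not handled. */
int first_indent_mismatch(const char *src, int unit)
{
    int depth = 0, line = 1;
    const char *p = src;

    while (*p != '\0') {
        int indent = 0;
        while (*p == ' ') { indent++; p++; }

        if (*p != '\n' && *p != '\0') {          /* skip blank lines */
            /* A line starting with '}' belongs to the enclosing level. */
            int expected = (*p == '}' ? depth - 1 : depth) * unit;
            if (indent != expected)
                return line;
        }

        /* Update the depth from the braces on this line. */
        while (*p != '\n' && *p != '\0') {
            if (p[0] == '/' && p[1] == '/') {    /* ignore rest of line */
                while (*p != '\n' && *p != '\0') p++;
                break;
            }
            if (*p == '{') depth++;
            else if (*p == '}') depth--;
            p++;
        }
        if (*p == '\n') { p++; line++; }
    }
    return 0;
}
```

A parser would not report the returned line as an error by itself. It would store it and mention it only if a syntax error shows up later, as a hint about where the block structure probably went wrong.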
Having said all that, you may well do better to move on to implementation of your language and return to the problem of producing better diagnostics later.
Bison does offer a mechanism for producing more useful error messages in some cases.
First, you should at least enable line number tracking from Flex, which is almost zero effort. You might also want to track precise token position, which is a bit more work but not too much. (See "Character Position from starting of a line", https://stackoverflow.com/a/48879103/1566221, and "yyllocp->first_line returns uninitialized value in second iteration of a reEntrant Bison parser", among others, for sample code.)
Second, ask bison to produce verbose error messages. That only requires two extra lines in your bison prologue:
%define parse.error verbose
%define parse.lac full
Please do read the bison manual for some important caveats. In particular, LAC may involve significant overhead. But the error messages produced are often helpful.
Finally, use bison's error recovery mechanism to continue the parse after the first syntax error is detected, thus allowing you to report several syntax errors in a single run. That's usually less frustrating for a user, although you should terminate the parse at some threshold error count, because really high error counts after error recovery usually mean that the error recovery itself failed and that many of the subsequent error messages were bogus.
Again, the bison manual has some useful suggestions about how to use the error facilities.
Bison manual table of contents

Is there a situation in delphi where a GOTO is the only solution?

The goto statement is taboo at my work.
So the following question is born...
Is there a situation possible where a goto is the only valid solution?
Originally GOTO was added to Pascal for error handling, including inter-procedural forms that Borland(/Embarcadero) never implemented (example: GOTOing from an inner procedure to the parent), just like Borland never implemented other inner function functionality like passing inner functions to procedure-typed parameters. (*)
In that way GOTO can be considered the precursor to exceptions.
There are still some practical uses: the last time I checked, jumping out of a nested IF statement with goto was still faster in Delphi than letting the code exit from a nested if naturally.
Optimizations like these are sometimes used in e.g. compression code, and other complex tree processing code with deeply nested loops or conditional statements.
Such routines often still use goto for error handling, because it is faster. (Exceptions are not only slow, but their border conditions inhibit some optimizations.)
One could see this as part of the plain Pascal level of Object Pascal, just like C++ still allows plain C nearly completely.
(of course, since the optimized compression code in Delphi is only delivered in .o form, it is hard to find examples in the Delphi codebase. The JPEG code has some, but that is a C translation)
(*) Original Pascal, and IIRC even Turbo Pascal, don't allow prematurely exiting a procedure with EXIT. Same for CONTINUE and BREAK.
Is there a situation possible where a GOTO is the only valid solution?
I suppose it depends on what you mean by valid. I suppose you are asking if there exists a program that can only be written with the use of the goto statement. In which case the answer is that there is no such program. Delphi is Turing complete with or without the goto statement.
However, if we are prepared to widen the discussion to include other languages, there are situations where goto is a good solution, even the best solution. The scenario that most commonly comes to mind is implementing tidy-up and error handling in languages without structured exception handling. If you peruse the Linux source code you will find that goto is widely used. I expect that the same is true of the Windows source code.
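The tidy-up pattern mentioned here usually looks something like the following C sketch. The function read_first_line and its buffer size are hypothetical, chosen only for illustration; the point is the shape, where each failure jumps forward to the label that releases exactly what has been acquired so far.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical example: copy the first line of a file into a freshly
 * allocated buffer. Every failure jumps forward to the label that
 * releases exactly what has been acquired so far. */
int read_first_line(const char *path, char **out)
{
    int status = -1;            /* assume failure until the end */
    char *buf = NULL;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        goto done;              /* nothing to release yet */

    buf = malloc(256);
    if (buf == NULL)
        goto close_file;

    if (fgets(buf, 256, fp) == NULL)
        goto free_buf;          /* empty file or read error */

    *out = buf;                 /* ownership passes to the caller */
    status = 0;
    goto close_file;            /* success: keep buf, still close fp */

free_buf:
    free(buf);
close_file:
    fclose(fp);
done:
    return status;
}
```

In a language with try/finally, such as Delphi, the same release order would normally be written with nested try...finally blocks instead of labels.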
Goto is very old. It predates sub-routines like functions and procedures! It is also very dangerous and can make your code less readable (to others, or to yourself a few months later).
In theory it's not possible to have a situation where goto is required. I won't repeat the theory about Turing tape machines here, but using selection and iteration, you can re-order the code so that the same output results for every possible input.
In practice though, it's sometimes 'handy' and 'better readable' to 'jump away' from the flow of code in certain conditions, and that's where Exceptions come in. raise breaks away from the current execution and jumps to the closest finally or except section. This is safer because they work cascaded, and provide a better way to handle the context in case of one of these border conditions. (And there's also break and abort and exit)
GOTO is never necessary. Any computable algorithm can be expressed with assignment and the combination of IF...THEN, BEGIN...END, and your choice of WHILE...DO...END or REPEAT...UNTIL. You don't even need subroutines. :)
This is known as the structured program theorem.
For a proof, see the 1966 paper, Flow Diagrams, Turing Machines and Languages with Only Two Formation Rules (PDF) by Corrado Böhm and Giuseppe Jacopini.
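As a small illustration of the idea (not the construction used in the paper), here is a C sketch of the same search written twice: once with a goto that leaves two nested loops, and once using only sequence, selection and iteration, with the result variable doubling as a flag. Both function names are hypothetical.

```c
#include <stddef.h>

/* Version with goto: leave both loops as soon as the value is found.
 * Returns the flat index of the first match, or -1 if there is none. */
int find_with_goto(const int *m, size_t rows, size_t cols, int target)
{
    size_t r, c;
    for (r = 0; r < rows; r++)
        for (c = 0; c < cols; c++)
            if (m[r * cols + c] == target)
                goto found;
    return -1;
found:
    return (int)(r * cols + c);
}

/* Same behaviour using only sequence, selection and iteration:
 * the loop conditions test the result instead of jumping. */
int find_structured(const int *m, size_t rows, size_t cols, int target)
{
    int result = -1;
    size_t r = 0;
    while (result < 0 && r < rows) {
        size_t c = 0;
        while (result < 0 && c < cols) {
            if (m[r * cols + c] == target)
                result = (int)(r * cols + c);
            c++;
        }
        r++;
    }
    return result;
}
```

The structured version pays for the missing jump with an extra test in each loop condition, which is the kind of cost the performance-minded answers above are referring to.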
Something like 15 years ago I used the goto statement in Delphi to convert one of Bob Jenkins's hash functions from C to Pascal. The C function has a switch() statement without breaks after each case, and you can't do that with Pascal's case statement. So I converted it into a bunch of Pascal labels and gotos. I guess you would still have to do it the same way with the newest Delphi versions.
Edit: I guess using gotos would still be a reasonable way to do this. Gets the job done, easy to understand, limited to a short block of code, not dangerous.
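For readers who have not met the C idiom, the sketch below shows the kind of fall-through switch being described, next to the same control flow spelled with labels and gotos, which is the shape a Pascal translation would take. The tail-mixing step is a made-up stand-in, not Bob Jenkins's actual code, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tail-mixing step (not Bob Jenkins's actual code):
 * fold the last 1..3 bytes of a key into an accumulator. Each case
 * deliberately falls through into the next one. */
uint32_t mix_tail_switch(const unsigned char *k, size_t len)
{
    uint32_t a = 0;
    switch (len) {
    case 3: a += (uint32_t)k[2] << 16;  /* fall through */
    case 2: a += (uint32_t)k[1] << 8;   /* fall through */
    case 1: a += k[0];
    }
    return a;
}

/* The same control flow spelled with labels and gotos. */
uint32_t mix_tail_goto(const unsigned char *k, size_t len)
{
    uint32_t a = 0;
    if (len == 3) goto three;
    if (len == 2) goto two;
    if (len == 1) goto one;
    goto done;
three: a += (uint32_t)k[2] << 16;
two:   a += (uint32_t)k[1] << 8;
one:   a += k[0];
done:
    return a;
}
```

Pascal's case statement runs exactly one branch, so without gotos each branch would have to repeat the statements of all the branches below it.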

Incremental Parsing from Handle in Haskell

I'm trying to interface Haskell with a command line program that has a read-eval-print loop. I'd like to put some text into an input handle, and then read from an output handle until I find a prompt (and then repeat). The reading should block until a prompt is found, but no longer. Instead of coding up my own little state machine that reads one character at a time until it constructs a prompt, it would be nice to use Parsec or Attoparsec. (One issue is that the prompt changes over time, so I can't just check for a constant string of characters.)
What is the best way to read the appropriate amount of data from the output handle and feed it to a parser? I'm confused because most of the handle-reading primitives require me to decide beforehand how much data I want to read. But it's the parser that should decide when to stop.
You seem to have two questions wrapped up in here. One is about incremental parsing, and one is about incremental reading.
Attoparsec supports incremental parsing directly. See the IResult type in Data.Attoparsec.Text. Parsec, alas, doesn't. You can run your parser on what you have, and if it gives an error, add more input and try again, but you really don't know if the error was an unrecoverable parse error, or just a need for more input.
In your case, REPLs usually read one line at a time. Hence you can use hGetLine to read a line and pass it to Attoparsec; if it parses, evaluate it, and if not, get another line.
If you want to see all this in action, I do this kind of thing in Plush.Job.Output, but with three small differences: 1) I'm parsing byte streams, not strings. 2) I've set it up to pull as much as is available from the input and parse as many items as I can. 3) I'm reading directly from file descriptors. But the same structure should help you do it in your situation.

How to "disable" a file output stream

I'm working on some legacy code in which there are a lot of WriteLn(F, '...') commands scattered pretty much all over the place. There is some useful information in these commands (what information variables contain, etc), so I'd prefer not to delete it or comment it out, but I want to stop the program from writing the file.
Is there any way that I can assign the F variable so that anything written to it is ignored? We use the console output, so that's not an option.
Going back a long long time to the good old days of DOS - If you assign 'f' to the device 'nul', then there should be no output.
assign (f, 'nul')
I don't know whether this still works in Windows.
Edit:
You could also assign 'f' to a file - assignfile (f, 'c:\1.txt') - for example.
Opening the null device and letting output go there would probably work. Under DOS, the performance of the NUL device was astonishingly bad IIRC (from what I understand, it wasn't buffered, so the system had to look up NUL in the device table when processing each byte) but I would not be at all surprised if it's improved under newer systems. In any case, that's probably the easiest thing you can do unless you really need to maximize performance. If performance is critical, it might in theory be possible to override the WriteLn function so it does nothing for certain files, but unfortunately I believe it allows syntax forms that were not permissible for any user-defined functions.
Otherwise, I would suggest doing a regular-expression find/replace to comment out the WriteLn statements in a fashion that can be mechanically restored.
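The null-device idea is not specific to Pascal. As a rough C analogue (a sketch only; the macro NULL_DEVICE and the function open_discard_stream are names invented here), a stream opened on the platform's null device accepts writes and discards them:

```c
#include <stdio.h>

/* Name of the platform's null device: "NUL" on Windows, "/dev/null" on
 * POSIX systems. */
#ifdef _WIN32
#define NULL_DEVICE "NUL"
#else
#define NULL_DEVICE "/dev/null"
#endif

/* Open a stream whose output is discarded; returns NULL on failure. */
FILE *open_discard_stream(void)
{
    return fopen(NULL_DEVICE, "w");
}
```

Existing output calls stay as they are; only the place where the file is opened changes, which is the attraction of the approach for legacy code with writes scattered everywhere.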

Parsing code with syntax errors

Parsing techniques are well described in CS literature. But the algorithms I know of require that the source is syntactically correct. If a syntax error is encountered, parsing is immediately aborted.
But IDEs (like Visual Studio) are typically able to provide meaningful code completion and other hints while typing, which means the syntax is often not in a valid state. E.g. you type an opening parenthesis in a function call, and the IDE provides parameter hints for the function, even though the syntax is invalid until the closing parenthesis is typed.
It seems to me this must rely on some kind of guessing or error-tolerant parser. Anyone know what techniques or algorithms are used for this?
The standard trick is to do some kind of error repair using the parsing machinery to help make predictions.
For table-based parsers (such as LALR or GLR), when a syntax error occurs, the parser was recently in some state in which the error had not yet happened. One can record the parse stack to remember this before each shift (or alternatively record reductions before the error). Given that an error has been encountered, one can inspect the parse state for the saved stack to determine which tokens might be next (this is also how one can do code completion in terms of syntax tokens). A more sophisticated technique can invent the smallest possible sequence of tokens that allows a shift by the error token, or the smallest possible tree that could replace the error token and allow a shift on the next.
This isn't so easy with recursive descent parsers because there isn't a lot of information lying around with which to make a prediction. For error recovery, a cheesy trick is to define error recovery points (e.g., where a "stmt" might be accepted), continue scanning until a ";" is found, and accept an "error stmt". This doesn't help if you want code completion.
Packrat is promising - it provides information on both successful and failed parsing attempts at key points, which can be recovered and used for smart error reporting, completion, hints and so on. For example, if the cursor is at a point where all the parsing attempts are marked as failed in a cache, a list of tokens tried can be given for completion options.
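The "scan to the next semicolon" trick can be sketched in C for a toy grammar. Everything here is an assumption made for illustration: the grammar (program := stmt*, stmt := LETTER '=' DIGIT ';'), the single-character tokens, and the names parse_stmt, parse_program and struct parse_result. A real parser would have a proper lexer and would build tree nodes rather than count statements.

```c
#include <ctype.h>

/* Hypothetical toy grammar: program := stmt* ; stmt := LETTER '=' DIGIT ';'
 * Tokens are single characters and spaces are ignored. */
struct parse_result {
    int good_stmts;    /* statements that matched the grammar */
    int error_stmts;   /* spans replaced by an "error stmt" */
};

static const char *skip_spaces(const char *p)
{
    while (*p == ' ') p++;
    return p;
}

/* Try to parse one statement at *pp. On success, advance *pp past the
 * ';' and return 1. On failure, return 0 and leave *pp at the token
 * where the error was noticed. */
static int parse_stmt(const char **pp)
{
    const char *p = skip_spaces(*pp);
    if (!isalpha((unsigned char)*p)) { *pp = p; return 0; }
    p = skip_spaces(p + 1);
    if (*p != '=') { *pp = p; return 0; }
    p = skip_spaces(p + 1);
    if (!isdigit((unsigned char)*p)) { *pp = p; return 0; }
    p = skip_spaces(p + 1);
    if (*p != ';') { *pp = p; return 0; }
    *pp = p + 1;
    return 1;
}

struct parse_result parse_program(const char *src)
{
    struct parse_result r = {0, 0};
    const char *p = skip_spaces(src);

    while (*p != '\0') {
        if (parse_stmt(&p)) {
            r.good_stmts++;
        } else {
            /* Panic mode: discard input up to and including the next ';'
             * and treat the discarded span as a single error statement. */
            while (*p != '\0' && *p != ';') p++;
            if (*p == ';') p++;
            r.error_stmts++;
        }
        p = skip_spaces(p);
    }
    return r;
}
```

Because the recovery simply throws tokens away, the parser learns nothing about what the user was in the middle of typing, which is why this approach does not help with completion.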
