As part of a toy project, I've been trying to make a small modification to someone else's flex/bison-based parser. I'm really not experienced with either. You can find the original parser here.
I've been trying to put together a simple function that accepts a string and returns a parse tree, so I can expose this via FFI for use in another programming language. What I have is mostly based on the main() function in the original program; my butchered version is below:
TreeNode* parse_string(char *s)
{
    FILE *in = fmemopen(s, strlen(s), "r");
    lex2_initialise();
    parse_file(in);
    fclose(in);
    preprocess_tokens();
    yyparse();
    return top;
}
This actually works fine, at least the first time I call it. The second time it complains about misparsed tokens, and the error reporting function used appears to be called from somewhere inside a maze of goto statements within the generated parser during the call to yyparse(), at which point I don't understand what's going on anymore.
The original program itself only appears to be designed to take all its input upfront and then exit, so it doesn't leave me with much of a clue about what I'm missing. Putting aside the not-altogether-outlandish idea that some old state is being retained elsewhere in the rest of the program, my main questions are:
Does either Flex or Bison maintain global state between calls to yyparse()?
Is there some simple function call I could put at the end of the function above to wipe it all and reset everything back to the initial state?
Does either Flex or Bison maintain global state between calls to yyparse()?
Flex maintains information about the current input stream. If the parse does not consume the entire input stream (which is quite common for parsers which terminate abnormally on errors), then the next call to yyparse will continue reading from where the previous one left off. Providing a new input buffer will (mostly) reset the lexer's state, but there may be some aspects which have not been reset, notably the current start condition, and the condition stack if that option has been enabled.
The bison-generated parser does not rely on global state. It is designed to clear its internal state prior to returning from yyparse. However, if a parser action executes a return statement directly (this is not recommended), then the cleanup will be bypassed, which is likely to create a memory leak. Actions which prematurely terminate the parse should use the macros YYACCEPT or YYABORT rather than a return statement.
Is there some simple function call I could put at the end of the function above to wipe it all and reset everything back to the initial state?
The default flex-generated scanner, which is designed to be called every time a token is required, is heavily reliant on global variables. Most, but not all, of the flex state is maintained in the current YY_BUFFER_STATE (which is kept in a global variable), and that object can be reset by the yyrestart function, or any of the functions which provide a character buffer as lexer input. However, these functions do not reset the start condition, nor do they flush the condition stack (if enabled) or the buffer stack. If you want to reset the state completely, you need to flush the stacks manually and reset the start condition with BEGIN(INITIAL), as in the sketch below.
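For example, here is a rough sketch of such a reset helper, written as if it lived in the user-code section of the .l file (so that BEGIN and the buffer macros are in scope). The name reset_lexer is my own, not part of flex:

/* Not part of flex's API: a hand-written reset, assuming a flex recent
   enough (2.5.31+) to generate the buffer-stack functions. If %option
   stack is in use, the start-condition stack has to be unwound separately;
   flex has no public "clear" call for it. */
void reset_lexer(FILE *new_input)
{
    /* Drop any buffers left behind by an aborted or nested parse. */
    while (YY_CURRENT_BUFFER)
        yypop_buffer_state();

    /* Install a fresh buffer reading from the new input. */
    yyrestart(new_input);

    /* Return to the initial start condition. */
    BEGIN(INITIAL);
}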
One approach to making a more easily restartable scanner is to build a reentrant scanner. A reentrant scanner keeps all of its state (including start conditions and buffer stack) in a scanner structure, which means that you can completely reset the scanner state simply by creating a new scanner structure (and, of course, destroying the old one to avoid leaking memory.)
There are lots of good reasons to use reentrant scanners [Note 1]. For one thing, it allows you to have more than one parser active at the same time, and it eliminates a reliance on global state. But unfortunately, it's not as simple as just setting a flex option.
Reentrant scanners have a different API (which includes a pointer to the scanner state structure). This state structure needs to be passed into yyparse and yyparse needs to pass it to yylex; all of this requires some modifications to the bison options. Also, reentrant scanners cannot use the global yylval to communicate the semantic value of a token to the parser [Note 2].
If you use flex's bison-bridge option and tell bison to generate a reentrant parser, then yylex will expect to be called with another additional parameter (or two, if you use locations), and the reentrant bison parser will supply the additional parameters. That all works fine, but it has the effect of changing yylval (and yylloc, if used) to a pointer, which means that you need to go through all the scanner actions changing yylval.something to yylval->something.
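To give a flavour of what the reentrant setup looks like, here is a sketch (not the original project's code; the option spellings and the parse_string driver should be checked against your flex and bison versions):

/* flex side (.l): */
%option reentrant bison-bridge noyywrap

/* bison side (.y): yyscan_t must be visible to the parser, e.g. via a
   typedef placed in %code requires. */
%define api.pure full
%lex-param   { yyscan_t scanner }
%parse-param { yyscan_t scanner }

/* Driver: each call builds a brand-new scanner state, so nothing can leak
   from one parse into the next. yy_scan_string reads directly from the
   string, so fmemopen is no longer needed. */
TreeNode* parse_string(const char *s)
{
    yyscan_t scanner;
    yylex_init(&scanner);
    yy_scan_string(s, scanner);
    int ok = (yyparse(scanner) == 0);
    yylex_destroy(scanner);
    return ok ? top : NULL;     /* 'top' is the original program's tree root */
}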
Notes
You can also create a reentrant parser, using some additional bison options. Normally, the only mutable globals used by a bison-generated parser are yylval and yylloc (if you use location reporting). (And yynerrs, but it is rare to refer to that variable outside of a parser action.) Specifying a reentrant parser turns those globals into lexer arguments, but it does not create an externally visible parser state structure. But it also gives you the option of using a "push parser", which does have a persistent parser state structure. In some cases, the flexibility of push parsers can significantly simplify scanners.
Strictly speaking, nothing stops you from creating a reentrant scanner which still uses globals to communicate with the parser, except that it is not really reentrant any more. I wouldn't recommend this option for obvious reasons, but you might want to do it as a transitional strategy, since it requires less modification to the parser and to scanner actions.
Even if you are using a non-reentrant scanner, you can call yylex_destroy (without arguments) after lexing to force reinitialisation the next time the lexer is invoked:
extern int yylex_destroy(void);
...
// do parsing here
...
yylex_destroy();
For reentrant parsers, see here.
Related
I am trying to create a simple programming language from scratch (interpreter) but I wonder why I should use a lexer.
For me, it looks like it would be easier to create a parser that directly parses the code. What am I overlooking?
I think you'll agree that most languages (likely including the one you are implementing) have conceptual tokens:
operators, e.g., * (usually multiply), '(', ')', ';'
keywords, e.g., "IF", "GOTO"
identifiers, e.g. FOO, count, ...
numbers, e.g. 0, -527.23E-41
comments, e.g., /* this text is ignored in your file */
whitespace, e.g., sequences of blanks, tabs and newlines, that are ignored
As a practical matter, it takes a specific chunk of code to scan for/collect the characters that make each individual token. You'll need such a code chunk for each type of token your language has.
If you write a parser without a lexer, at each point where your parser is trying to decide what comes next, you'll have to have ALL the code that recognizes the tokens that might occur at that point in the parse. At the next parser point, you'll need all the code to recognize the tokens that are possible there. This gives you an immense amount of code duplication; how many times do you want the code for blanks to occur in your parser?
If you think that's not a good way, the obvious cure is to remove all the duplication: place the code for each token in a subroutine for that token, and at each parser place, call the subroutines for the tokens. At this point, in some sense, you already have a lexer: an isolated collection of code to recognize tokens. You can code perfectly fine recursive descent parsers this way.
The next thing you'll discover is that you call the token subroutines for many of the tokens at each parser point. Even that seems like a lot of work and duplication. So, replace all the calls with a single "GetNextToken" call that itself invokes the token-recognizing code for all tokens and returns an enum identifying the specific token encountered. Now your parser starts to look reasonable: at each parser point, it makes one call on GetNextToken, and then branches on the enum returned. This is basically the interface that people have standardized on as a "lexer".
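A hand-written version of that interface might look roughly like this (a sketch; the token set and names are invented for illustration):

#include <ctype.h>

enum token { TOK_EOF, TOK_NUMBER, TOK_IDENT, TOK_LPAREN, TOK_RPAREN, TOK_STAR, TOK_UNKNOWN };

static const char *src;     /* current position in the source text */

enum token GetNextToken(void)
{
    while (isspace((unsigned char)*src))
        src++;                                        /* whitespace handled in exactly one place */
    if (*src == '\0') return TOK_EOF;
    if (*src == '(') { src++; return TOK_LPAREN; }
    if (*src == ')') { src++; return TOK_RPAREN; }
    if (*src == '*') { src++; return TOK_STAR; }
    if (isdigit((unsigned char)*src)) {
        while (isdigit((unsigned char)*src)) src++;   /* collect the whole number */
        return TOK_NUMBER;
    }
    if (isalpha((unsigned char)*src) || *src == '_') {
        while (isalnum((unsigned char)*src) || *src == '_') src++;
        return TOK_IDENT;
    }
    src++;                                            /* don't get stuck on stray characters */
    return TOK_UNKNOWN;
}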
One thing you will discover is that the token lexers sometimes have trouble with overlaps; keywords and identifiers usually have this trouble. It is actually easier to merge all the token recognizers into a single finite state machine, which can then distinguish the tokens more easily. This also turns out to be spectacularly fast when processing the programming language source text. Your toy language may never parse more than 100 lines, but real compilers process millions of lines of code a day, and most of that time is spent doing token recognition ("lexing"), especially white space suppression.
You can code this state machine by hand. This isn't hard, but it is rather tedious. Or you can use a tool like FLEX to do it for you; that's just a matter of convenience. As the number of different kinds of tokens in your language grows, the FLEX solution gets more and more attractive.
TLDR: Your parser is easier to write, and less bulky, if you use a lexer. In addition, if you compile the individual lexemes into a state machine (by hand or using a "lexer generator"), it will run faster and that's important.
Well, for an intelligently simplified programming language you can get away without either a lexer or a parser :-) Not kidding. Look up Forth. You can start with tags here on SO (gforth is GNU's) and then go to the Standard's site, which has pointers to a few interpreters, sites and its Glossary.
Then you can check out Win32Forth and that should keep you busy for quite a while :-)
The interpreter also compiles (when you invoke words that switch the system to compilation context). All without a distinct parser. Lookahead is actually lookbehind :-) - not kidding. It rarely absorbs even one following word (== lookahead is max 1). The "words" (aka tokens) are at the same time keywords and variable names, and they all live in a Dictionary. There's a whole online book at that site (plus pdf).
Control structures are also just words (they compile a few addresses and jumps on the fly).
You can find old Journals there as well, covering a wide spectrum from machine code generation to object-oriented extensions. Yes, still without a parser - believe it or not.
There used to be more sophisticated (commercial) Forth systems which reduced words to machine call instructions with immediate addressing (which makes the engine run 2-4 times faster), but even plain interpreters were always considered to be fast. One is apparently still active - SwiftForth, but don't expect any freebies there.
There's one Forth on GitHub, CiForth, which is quite spartan but has builds and releases for Win, Linux and Mac, 32 and 64 bit, so you can just download and run it. It claims to have a 16-bit build as well :-) For embedded systems, I suppose.
I am new to flex/bison. Reading books, it seems that in nearly all compiler implementations, the parser interacts with the scanner in a "coroutine" manner: whenever the parser needs a token, it calls the scanner to get one, and leaves the scanner aside while it's busy with shift/reduce. A natural question is: why not let the scanner produce the token stream (from the input byte stream) as a whole, and then pass the entire token stream to the parser, so that there is no explicit interaction between the two? Well, I can imagine that there are some drawbacks to this manner, and I can also see some benefits of doing so.
My question is, is there a sort of "comprehensive" discussion of that aspect, or is there any compiler implementation that uses a scanner/parser interaction scheme other than the "coroutine" manner?
In the traditional arrangement, the parser calls the scanner whenever it needs a token.
That's the same logic as used in the scanner (or many other programs) which call the I/O library every time they need more input. That's not usually described as a coroutine, and I'm not convinced it's an accurate description of the parser/scanner interaction either.
In coroutine control flow, two functions call each other in tandem. That's not usually the way I/O is handled. The fread() interface does maintain state for the next call (the file position, at least, and maybe a buffer), but the calls are self-contained.
In a sense, there is no difference between calling yylex() to get the next token and calling scanf() to get the next data value.
This is not always the most convenient architecture for a scanner. Sometimes, it would be convenient for the scanner to be able to feed tokens into the parser. A typical use case is when the scanner is generating tokens, for example through macro expansion, but sometimes it is just that the match of a single scanner pattern contains more than one token.
Many parser generators, including Bison, can generate callable parsers, usually called "push parsers". In this model, the scanner calls the parser with each successive token. This is still not a coroutine model, really; it is just control-flow inversion. In the analogy with ordinary I/O, it's the equivalent of taking a data processor which called fgets() to read each input line and rewriting it as a process_line() function which is given a line of data to process (and thus does not interact with the I/O library). An early implementation of push parsing can be found in the Lemon parser generator.
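With bison, that inversion looks roughly like this; a sketch, with next_token standing in for whatever scanner entry point you actually use (it is not a real bison or flex function):

/* In the .y file: */
%define api.push-pull push

/* In the scanner's driver loop (no locations, for brevity): */
YYSTYPE value;
yypstate *ps = yypstate_new();
int status;
do {
    int tok = next_token(&value);            /* hypothetical scanner call; returns 0 at end of input */
    status = yypush_parse(ps, tok, &value);  /* parser consumes one token and returns */
} while (status == YYPUSH_MORE);
yypstate_delete(ps);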
Coroutine-like control flow could be useful for creating a parser whose eventual input stream must be handled asynchronously. But that doesn't really require coroutining between the parser and the scanner; rather, it requires coroutining between the scanner and the input stream. Again, coroutining is not really necessary and might be overkill: inverting control flow should suffice. Flex does not provide a "push scanner" interface, but other scanner generators do. I believe this feature is supported by Re2c, for example.
I'm using lex & yacc to write a VHDL parser. VHDL has some languages features which make it context sensitive in a manner similar to C. For example, typedef-like constructs which impact whether the parser should tokenize something as an IDENTIFIER vs. TYPEDEF_NAME.
The difficulty comes in when you need to build a symbol table based on another file which is referenced by "use" statements (similar to "import" in Java or Python).
library ieee;
use ieee.std_logic_1164.all;
-- code which uses something defined in ieee.std_logic_1164 package
In C, this is fairly straightforward because the preprocessor has already combined all of the header files into a single translation unit which can be scanned from top to bottom. But 'use' statements in VHDL are not preprocessor commands.
So, somehow, as I'm parsing the file, I have to recognize when I see a use statement and then go off and parse the relevant file, and then continue parsing the original file with that symbol table.
Is there an elegant way to do this with lex/yacc? I know there is yyrestart but I'm not sure if that's going down the right track.
If you are using flex, then it is pretty easy.
The basic mechanism (including two functioning code samples) is described in the "Multiple Input Buffers" chapter of the flex manual. You can also take a glance at this question on SO.
The parser (yacc/bison) reduction which recognizes the use construction can include the code which calls yypush_buffer_state. In the example code, the end of the included file is recognized by the scanner (lex/flex), which simply pops the buffer stack.
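Sketching the flex side of that (open_use_file is an invented helper; the real lookup depends on how your library files are stored):

/* In the user-code section of the .l file: a helper the parser action can call. */
void push_use_file(const char *name)
{
    FILE *f = open_use_file(name);    /* hypothetical: map "ieee.std_logic_1164" to a file */
    if (f)
        yypush_buffer_state(yy_create_buffer(f, YY_BUF_SIZE));
}

/* In the rules section: pop the stack when an included file runs out. */
<<EOF>>   {
              yypop_buffer_state();
              if (!YY_CURRENT_BUFFER)
                  yyterminate();      /* end of the outermost file */
          }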
Depending on the formal rules of file inclusion, you might want the parser to know that the included file has finished, in order to avoid having syntactic constructs which start in the included file and continue in the includer. (C allows this, even though it is almost always an error; I don't know anything about VHDL, but there are definitely languages which do not allow it.) One possibility is to recursively call the parser in order to parse the included file, which will require a re-entrant ("pure") parser. In that case, the scanner should return an end-of-included-file token when it hits the end of the included file, because your included file grammar production will need to be terminated with such a token.
You may need to worry about the possibility that the parser has already requested the next input token. Most LALR(1) grammars do not depend on the lookahead token for semi-colon terminated statements, and bison usually doesn't request a lookahead token in a context in which it doesn't need it. But that behaviour is not guaranteed by all Posix-compatible yacc implementations and you might be using one which doesn't.
In that case, you would have to preserve the lookahead token so that you can reread it after the included file has been parsed. That would most conveniently be done by stashing the lookahead token somewhere the scanner can see it, and having the scanner return that token (if set) when it sees the end of the included file. In a bison action, you can find the lookahead token in yychar and its semantic value and location (if locations are enabled) are in yylval and yylloc. If bison has not read the lookahead token, the value of yychar will be YYEMPTY, and the simplest possible bison implementation would assert(yychar == YYEMPTY) when it is about to push the input buffer. If the assert fails, you'll need to implement a more sophisticated strategy.
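In a bison action, the simple check can be a one-liner (use_clause and push_use_file are placeholders continuing the sketch above; assert comes from <assert.h>):

use_clause: USE qualified_name ';'
        {
            /* The ';' lets most LALR(1) grammars reduce without peeking
               further, so nothing should have been read past it yet. */
            assert(yychar == YYEMPTY);
            push_use_file($2);
        }
    ;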
I want to build a one-token-per-call ragel grammar / thing.
I'm relatively new to Ragel (but not new to compilers, etc).
I've written a grammar for a json-like notation (three levels deep). It emits C code.
My input comes in complete strings (no need to cross buffer boundaries).
I want to call my grammar with the input string, have the grammar return one token. Then call it again and have it return the next token and so on. Until end of string. Then, call again with a new string.
One would think that a state machine would be perfectly suited to this kind of behaviour, but I haven't yet been able to figure out how to accomplish this in Ragel.
Your best bet is probably to execute fbreak at the end of each token's action, then call the machine again without re-initializing p or cs.
From the (Ragel 6.9) manual:
fbreak; – Advance p, save the target state to cs and immediately break out of the execute loop. This statement is useful in conjunction with the noend write option. Rather than process input until pe is arrived at, the fbreak statement can be used to stop processing from an action. After an fbreak statement the p variable will point to the next character in the input. The current state will be the target of the current transition. Note that fbreak causes the target state’s to-state actions to be skipped.
Note that you don't actually need the noend option. That option is for ignoring pe, which is probably not what you want to do in this case, since you want the parser to be able to detect the end of the string it's parsing.
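A minimal sketch of that shape (the machine and token names are invented, not the poster's JSON-like grammar; it uses a Ragel scanner, so the ts/te/act variables are required, and scanners also want eof set):

#include <string.h>

enum { TK_NONE = 0, TK_NUMBER, TK_WORD };

static int cs, act, token;
static const char *p, *pe, *eof, *ts, *te;

%%{
    machine toy;

    main := |*
        digit+  => { token = TK_NUMBER; fbreak; };
        alpha+  => { token = TK_WORD;   fbreak; };
        space+;
    *|;

    write data;
}%%

/* Point the machine at a new string; only here is cs re-initialized. */
void toy_start(const char *s)
{
    p = s;
    pe = eof = s + strlen(s);
    %% write init;
}

/* Resume where the previous call left off; returns TK_NONE at end of input. */
int toy_next_token(void)
{
    token = TK_NONE;
    %% write exec;
    return token;
}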
Suppose I already have a complete YACC grammar. Let that be C grammar for example. Now I want to create a separate parser for domain-specific language, with simple grammar, except that it still needs to parse complete C type declarations. I wouldn't like to duplicate long rules from the original grammar with associated handling code, but instead would like to call out to the original parser to handle exactly one rule (let's call it "declarator").
If it was a recursive descent parser, there would be a function for each rule, easy to call in. But what about YACC with its implicit stack automaton?
Basically, no. Composing LR grammars is not easy, and bison doesn't offer much help.
But all is not lost. Nothing stops you from including the entire grammar (except the %start declaration), and just using part of it, except for one little detail: bison will complain about useless productions.
If that's a show-stopper for you, then you can use a trick to make it possible to create a grammar with multiple start rules. In fact, you can create a grammar which lets you specify the start symbol every time you call the parser; it doesn't even have to be baked in. Then you can tuck that into a library and use whichever parser you want.
Of course, this also comes at a cost: the cost is that the parser is bigger than it would otherwise need to be. However, it shouldn't be any slower, or at least not much -- there might be some cache effects -- and the extra size is probably insignificant compared to the rest of your compiler.
The hack is described in the bison FAQ in quite a lot of detail, so I'll just do an outline here: for each start production you want to support, you create one extra production which starts with a pseudo-token (that is, a lexical code which will never be generated by the lexer). For example, you might do the following:
%start meta_start
%token START_C START_DSL
meta_start: START_C c_start | START_DSL dsl_start;
Now you just have to arrange for the lexer to produce the appropriate START token when it first starts up. There are various ways to do that; the FAQ suggests using a global variable, but if you use a re-entrant flex scanner, you can just put the desired start token in the scanner state (along with a flag which is set when the start token has been sent).
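The global-variable version mentioned in the FAQ can be wired up like this (a sketch; start_token is our own variable, not part of flex or bison, and it relies on the fact that indented code at the top of the rules section runs on every entry to yylex):

/* In the .l file, at the top of the rules section (after the first '%%'),
   indented so that flex copies it into the body of yylex(): */
%%
    if (start_token) {
        int t = start_token;
        start_token = 0;        /* deliver the pseudo-token exactly once per parse */
        return t;
    }

/* In the caller, before starting a parse: */
start_token = START_DSL;        /* or START_C */
yyparse();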