Ragel - how to return one token at a time

I want to build a Ragel grammar that returns one token per call.
I'm relatively new to Ragel (but not new to compilers, etc.).
I've written a grammar for a JSON-like notation (three levels deep). It emits C code.
My input comes in complete strings (no need to cross buffer boundaries).
I want to call my grammar with the input string, have the grammar return one token. Then call it again and have it return the next token and so on. Until end of string. Then, call again with a new string.
One would think that a state machine would be perfectly suited to this kind of behaviour, but I haven't yet been able to figure out how to accomplish this in Ragel.

Your best bet is probably to call fbreak after each token, then call the machine again without re-initializing p or cs.
From the (Ragel 6.9) manual:
fbreak; – Advance p, save the target state to cs and immediately break out of the execute loop. This statement is useful in conjunction with the noend write option. Rather than process input until pe is arrived at, the fbreak statement can be used to stop processing from an action. After an fbreak statement the p variable will point to the next character in the input. The current state will be the target of the current transition. Note that fbreak causes the target state’s to-state actions to be skipped.
Note that you don't actually need the noend option. That option is for ignoring pe, which is probably not what you want to do in this case, since you want the parser to be able to detect the end of the string it's parsing.
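For illustration, here is a minimal, untested sketch of that pattern (the Scanner struct, token names, and the toy machine are mine, not the poster's code):

#include <string.h>

enum { TK_EOF = 0, TK_NUMBER, TK_IDENT };

typedef struct {
    const char *p, *pe, *eof;   /* Ragel input pointers        */
    const char *ts, *te;        /* token start/end (scanner)   */
    int cs, act;                /* machine and scanner state   */
} Scanner;

%%{
    machine toy;
    access s->;
    variable p   s->p;
    variable pe  s->pe;
    variable eof s->eof;

    main := |*
        digit+ => { tok = TK_NUMBER; fbreak; };
        alpha+ => { tok = TK_IDENT;  fbreak; };
        space+;
    *|;
}%%

%% write data;

void scanner_init(Scanner *s, const char *str)
{
    s->p   = str;
    s->pe  = str + strlen(str);
    s->eof = s->pe;
    %% write init;
}

/* Returns the next token each time it is called, or TK_EOF once the
   string is exhausted. All state lives in the Scanner between calls,
   so p and cs are never re-initialized. */
int next_token(Scanner *s)
{
    int tok = TK_EOF;
    if (s->p == s->pe)
        return TK_EOF;
    %% write exec;
    return tok;
}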

Why would I use a lexer and not directly parse code?

I am trying to create a simple programming language from scratch (interpreter) but I wonder why I should use a lexer.
For me, it looks like it would be easier to create a parser that directly parses the code. What am I overlooking?
I think you'll agree that most languages (likely including the one you are implementing) have conceptual tokens:
operators, e.g., * (usually multiply), '(', ')', ;
keywords, e.g., "IF", "GOTO"
identifiers, e.g., FOO, count, ...
numbers, e.g., 0, -527.23E-41
comments, e.g., /* this text is ignored in your file */
whitespace, e.g., sequences of blanks, tabs and newlines that are ignored
As a practical matter, it takes a specific chunk of code to scan for and collect the characters that make up each individual token. You'll need such a code chunk for each type of token your language has.
If you write a parser without a lexer, at each point where your parser is trying to decide what comes next, you'll have to have ALL the code that recognizes the tokens that might occur at that point in the parse. At the next parser point, you'll need all the code to recognize the tokens that are possible there. This gives you an immense amount of code duplication; how many times do you want the code for blanks to occur in your parser?
If you think that's not a good way, the obvious cure is to remove all the duplication: place the code for each token in a subroutine for that token, and at each parser place, call the subroutines for the tokens. At this point, in some sense, you already have a lexer: an isolated collection of code to recognize tokens. You can code perfectly fine recursive descent parsers this way.
The next thing you'll discover is that you call the token subroutines for many of the tokens at each parser point. Even that seems like a lot of work and duplication. So, replace all the calls with a single "GetNextToken" call that itself invokes the token-recognizing code for all tokens and returns an enum identifying the specific token encountered. Now your parser starts to look reasonable: at each parser point, it makes one call to GetNextToken and then branches on the enum returned. This is basically the interface that people have standardized on as a "lexer".
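As a sketch of that interface in C (the token names and the statement() example are hypothetical, just to show the shape):

typedef enum {
    TOK_IF, TOK_GOTO, TOK_IDENT, TOK_NUMBER,
    TOK_LPAREN, TOK_RPAREN, TOK_SEMI, TOK_EOF
} Token;

/* Skips whitespace and comments, then recognizes and returns the
   next token; implemented once, shared by every parser point. */
Token GetNextToken(void);

void statement(void)
{
    switch (GetNextToken()) {
    case TOK_IF:    /* parse an IF statement   */ break;
    case TOK_GOTO:  /* parse a GOTO statement  */ break;
    case TOK_IDENT: /* parse an assignment     */ break;
    default:        /* report a syntax error   */ break;
    }
}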
One thing you will discover is that token lexers sometimes have trouble with overlaps; keywords and identifiers usually have this trouble. It is actually easier to merge all the token recognizers into a single finite state machine, which can then distinguish the tokens more easily. This also turns out to be spectacularly fast when processing the programming language source text. Your toy language may never parse more than 100 lines, but real compilers process millions of lines of code a day, and most of that time is spent doing token recognition ("lexing"), especially white-space suppression.
You can code this state machine by hand. This isn't hard, but it is rather tedious. Or, you can use a tool like FLEX to do it for you; that's just a matter of convenience. As the number of different kinds of tokens in your language grows, the FLEX solution gets more and more attractive.
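For a feel of what the FLEX version looks like, here is a rough sketch covering the token classes above (the token codes are hypothetical; a real scanner would share them with the parser, and sign handling for numbers is left to the parser):

%{
enum { TK_IF = 256, TK_GOTO, TK_IDENT, TK_NUMBER };
%}
%option noyywrap
%%
[ \t\r\n]+                            ;  /* whitespace: ignored */
"/*"([^*]|"*"+[^*/])*"*"+"/"          ;  /* comments: ignored   */
"IF"                                  return TK_IF;
"GOTO"                                return TK_GOTO;
[A-Za-z][A-Za-z0-9]*                  return TK_IDENT;
[0-9]+("."[0-9]+)?([Ee][+-]?[0-9]+)?  return TK_NUMBER;
[*();]                                return yytext[0];
%%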
TLDR: Your parser is easier to write, and less bulky, if you use a lexer. In addition, if you compile the individual lexemes into a state machine (by hand or using a "lexer generator"), it will run faster and that's important.
Well, for an intelligently simplified programming language you can get away without either a lexer or a parser :-) Not kidding. Look up Forth. You can start with the tags here on SO (gforth is GNU's) and then go to the Standard's site, which has pointers to a few interpreters, sites, and its Glossary.
Then you can check out Win32Forth and that should keep you busy for quite a while :-)
The interpreter also compiles (when you invoke words that switch the system to compilation context). All without a distinct parser. Lookahead is actually lookbehind :-) - not kidding. Only rarely does a word absorb the one following word (== lookahead is at most 1). The "words" (aka tokens) are at the same time keywords and variable names, and they all live in a Dictionary. There's a whole online book at that site (plus a pdf).
Control structures are also just words (they compile a few addresses and jumps on the fly).
You can find old Journals there as well, covering a wide spectrum from machine code generation to object-oriented extensions. Yes, still without a parser - believe it or not.
There used to be more sophisticated (commercial) Forth systems which reduced words to machine call instructions with immediate addressing (this makes the engine run 2-4 times faster), but even plain interpreters were always considered fast. One is apparently still active - SwiftForth - but don't expect any freebies there.
There's one Forth on GitHub, CiForth, which is quite spartan but has builds and releases for Win, Linux and Mac, 32- and 64-bit, so you can just download and run. It claims to have a 16-bit build as well :-) For embedded systems, I suppose.

Resetting the state of flex and/or bison

As part of a toy project I've been trying to make a small modification of someone else's parser based on flex/bison. I'm really not experienced with either. You can find the original parser here.
I've been trying to put together a simple function that accepts a string and returns a parse tree, so I can expose this via FFI for use in another programming language. What I have is mostly based on the main() function in the original program, my butchered version is below:
TreeNode* parse_string(char *s)
{
    FILE *in = fmemopen(s, strlen(s), "r");
    lex2_initialise();
    parse_file(in);
    fclose(in);
    preprocess_tokens();
    yyparse();
    return top;
}
This actually works fine, at least the first time I call it. The second time it complains about misparsed tokens, and the error reporting function used appears to be called from somewhere inside a maze of goto statements within the generated parser during the call to yyparse(), at which point I don't understand what's going on anymore.
The original program itself only appears to be designed to take all its input upfront and then exit, so it doesn't leave me with much of a clue about what I'm missing. Putting aside the not-altogether-outlandish idea that some old state is being retained elsewhere in the rest of the program, my main questions are:
Do either Flex or Bison maintain global state between calls to yyparse()?
Is there some simple function call I could put at the end of the function above to wipe it all and reset everything back to the initial state?
Do either Flex or Bison maintain global state between calls to yyparse()?
Flex maintains information about the current input stream. If the parse does not consume the entire input stream (which is quite common for parsers which terminate abnormally on errors), then the next call to yyparse will continue reading from where the previous one left off. Providing a new input buffer will (mostly) reset the lexer's state, but there may be some aspects which have not been reset, notably the current start condition, and the condition stack if that option has been enabled.
The bison-generated parser does not rely on global state. It is designed to clear its internal state prior to returning from yyparse. However, if a parser action executes a return statement directly (this is not recommended), then the cleanup will be bypassed, which is likely to create a memory leak. Actions which prematurely terminate the parse should use the macros YYACCEPT or YYABORT rather than a return statement.
Is there some simple function call I could put at the end of the function above to wipe it all and reset everything back to the initial state?
The default flex-generated scanner, which is designed to be called every time a token is required, is heavily reliant on global variables. Most, but not all, of the flex state is maintained in the current YY_BUFFER_STATE (which is kept in a global variable), and that object can be reset by the yyrestart function, or by any of the functions which provide a character buffer as lexer input. However, these functions do not reset the start condition, nor do they flush the condition stack (if enabled) or the buffer stack. If you want to reset the state completely, you need to flush the stacks manually and reset the start condition with BEGIN(INITIAL).
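As a rough sketch of such a reset (placed in the user-code section of the .l file so that BEGIN and the yy_* helpers are visible; the function name is mine):

/* Reset the non-reentrant scanner to read from a fresh stream.
   yyrestart discards any pending buffered input; BEGIN(INITIAL)
   resets the start condition, which yyrestart does not touch. */
void reset_scanner(FILE *in)
{
    yyrestart(in);
    BEGIN(INITIAL);
    /* If %option stack is enabled, any start conditions pushed with
       yy_push_state (and any pushed buffers) still have to be popped
       manually here, e.g. with yy_pop_state(). */
}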
One approach to making a more easily restartable scanner is to build a reentrant scanner. A reentrant scanner keeps all of its state (including start conditions and buffer stack) in a scanner structure, which means that you can completely reset the scanner state simply by creating a new scanner structure (and, of course, destroying the old one to avoid leaking memory.)
There are lots of good reasons to use reentrant scanners [Note 1]. For one thing, it allows you to have more than one parser active at the same time, and it eliminates a reliance on global state. But unfortunately, it's not as simple as just setting a flex option.
Reentrant scanners have a different API (which includes a pointer to the scanner state structure). This state structure needs to be passed into yyparse and yyparse needs to pass it to yylex; all of this requires some modifications to the bison options. Also, reentrant scanners cannot use the global yylval to communicate the semantic value of a token to the parser [Note 2].
If you use the bison-bridge option and tell bison to generate a reentrant parser, then yylex will expect to be called with another additional parameter (or two, if you use locations), and the reentrant bison parser will supply the additional parameters. That all works fine, but it has the effect of changing yylval (and yylloc, if used) to a pointer, which means that you need to go through all the scanner actions changing yylval.something to yylval->something.
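Very roughly, the wiring involves lines like these (a sketch, not the asker's project; exact spellings vary between flex/bison versions):

/* scanner.l: request a reentrant, bison-bridge scanner */
%option reentrant bison-bridge noyywrap

/* parser.y: make the parser pure and thread the scanner through */
%code requires { typedef void *yyscan_t; /* simplified */ }
%define api.pure full
%param { yyscan_t scanner }

With this, bison calls yylex(&yylval, scanner), which is exactly why yylval turns into a pointer inside the scanner actions.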
Notes
You can also create a reentrant parser, using some additional bison options. Normally, the only mutable globals used by a bison-generated parser are yylval and yylloc (if you use location reporting). (And yynerrs, but it is rare to refer to that variable outside of a parser action.) Specifying a reentrant parser turns those globals into lexer arguments, but it does not create an externally visible parser state structure. But it also gives you the option of using a "push parser", which does have a persistent parser state structure. In some cases, the flexibility of push parsers can significantly simplify scanners.
Strictly speaking, nothing stops you from creating a reentrant scanner which still uses globals to communicate with the parser, except that it is not really reentrant any more. I wouldn't recommend this option for obvious reasons, but you might want to do it as a transitional strategy, since it requires less modification to the parser and to scanner actions.
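To illustrate the push parser mentioned in note 1, the driving loop looks roughly like this (an untested sketch, assuming the grammar was generated with %define api.push-pull push and %define api.pure full, and that a reentrant scanner handle named scanner already exists):

yypstate *ps = yypstate_new();
int tok, status;
do {
    YYSTYPE val;
    tok = yylex(&val, scanner);           /* reentrant scanner call */
    status = yypush_parse(ps, tok, &val);
} while (status == YYPUSH_MORE);          /* done when != YYPUSH_MORE */
yypstate_delete(ps);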
Even if you are using a non-reentrant scanner, you can call yylex_destroy (without arguments) after lexing to force re-initialisation the next time the lexer is invoked:
extern int yylex_destroy(void);
...
// do parsing here
...
yylex_destroy();
For reentrant parsers see here.

Encode FIRST and FOLLOW sets into a recursive descent parser

This is a follow up to a previous question I asked How to encode FIRST & FOLLOW sets inside a compiler, but this one is more about the design of my program.
I am implementing the Syntax Analysis phase of my compiler by writing a recursive descent parser. I need to be able to take advantage of the FIRST and FOLLOW sets so I can handle errors in the syntax of the source program more efficiently. I have already calculated FIRST and FOLLOW for all of my non-terminals, but am having trouble deciding where to logically place them in my program and what the best data structure would be to do so.
Note: all code will be pseudo code
Option 1) Use a map, and map each non-terminal by its name to two Sets that contain its FIRST and FOLLOW sets:
class ParseConstants
    Map firstAndFollowMap = #create a map .....
    firstAndFollowMap.put("<program>", FIRST_SET, FOLLOW_SET)
end
This seems like a viable option, but inside my parser I would then need sorta ugly code like this to retrieve the FIRST and FOLLOW sets and pass them to the error function:
# processes the <program> non-terminal
def program
    List list = firstAndFollowMap.get("<program>")
    Set FIRST = list.get(0)
    Set FOLLOW = list.get(1)
    error(current_symbol, FOLLOW)
end
Option 2) Create a class for each non-terminal and have a FIRST and FOLLOW property:
class Program
    FIRST = .....
    FOLLOW = ....
end
this leads to code that looks a little nicer:
# processes the <program> non-terminal
def program
    error(current_symbol, Program.FOLLOW)
end
These are the two options I thought up. I would love to hear any other suggestions for ways to encode these two sets; any critiques of, or additions to, the two approaches I posted would also be helpful.
Thanks
I have also posted this question here: http://www.coderanch.com/t/570697/java/java/Encode-FIRST-FOLLOW-sets-recursive
You don't really need the FIRST and FOLLOW sets at parse time. You need to compute them to get the parse table. That is a table of {<non-terminal, token> -> <action, rule>} for LL(k) (meaning: seeing a non-terminal on the stack and a token in the input, which action to take and, if applicable, which rule to apply), or a table of {<state, token> -> <action, state>} for (C|LA|)LR(k) (meaning: given a state on the stack and a token in the input, which action to take and which state to go to).
After you get this table, you don't need FIRST and FOLLOW any more.
If you are writing a semantic analyzer, you must assume the parser is working correctly. Phrase level error handling (which means handling parse errors), is totally orthogonal to semantic analysis.
This means that in case of parse error, the phrase level error handler (PLEH) would try to fix the error. If it couldn't, parsing stops. If it could, the semantic analyzer shouldn't know if there was an error which was fixed, or there wasn't any error at all!
You can take a look at my compiler library for examples.
About phrase level error handling, you again don't need FIRST and FOLLOW. Let's talk about LL(k) for now (simply because I haven't thought much about LR(k) yet). After you build the grammar table, you have many entries, as I said, of this form:
<non-terminal, token> -> <action, rule>
Now, when you parse, you take whatever is on top of the stack. If it is a terminal, you must match it with the input. If it doesn't match, the phrase level error handler kicks in:
Role one: handle missing terminals - simply generate a fake terminal of the type you need in your lexer and have the parser retry. You can do other things as well (for example, check ahead in the input: if the token you want is there, drop one token from the lexer).
If what you get from the stack is a non-terminal (T) instead, you must look at your lexer, get the lookahead, and look at your table. If the entry <T, lookahead> exists, then you're good to go: follow the action and push to/pop from the stack. If, however, no such entry exists, the phrase level error handler again kicks in:
Role two: handle unexpected terminals - you can do many things to get past this. What you do depends on what T and lookahead are, and on your expert knowledge of your grammar.
Examples of the things you can do are:
Fail! You can do nothing
Ignore this terminal. This means that you push lookahead onto the stack (after pushing T back again) and have the parser continue. The parser will first match lookahead, throw it away, and continue. Example: if you have an expression like *1+2/0.5, you can drop the unexpected * this way.
Change lookahead to something acceptable, push T back and retry. For example, an expression like this: 5id = 10; could be illegal because you don't accept ids that start with numbers. You can replace it with _5id for example to continue
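To make the two roles concrete, here is a toy LL(1) driver in C with both recovery hooks marked (grammar, table, and token names are all hypothetical; an untested sketch of the idea, not production code):

#include <stdio.h>

/* Hypothetical toy grammar:
     EXPR -> num TAIL
     TAIL -> '+' EXPR | (empty)                                    */

enum { TK_NUM, TK_PLUS, TK_EOF, NUM_TOKENS };         /* terminals     */
enum { NT_EXPR = NUM_TOKENS, NT_TAIL, SYM_END };      /* non-terminals */
#define NUM_NONTERMS (SYM_END - NUM_TOKENS)

/* Rule bodies stored right-to-left so they can be pushed directly
   onto the stack; -1 terminates a body.                            */
static const int rules[][4] = {
    { NT_TAIL, TK_NUM, -1 },     /* 0: EXPR -> num TAIL */
    { NT_EXPR, TK_PLUS, -1 },    /* 1: TAIL -> '+' EXPR */
    { -1 },                      /* 2: TAIL -> (empty)  */
};

/* table[non-terminal][lookahead] -> rule index, or -1 for error */
static const int table[NUM_NONTERMS][NUM_TOKENS] = {
    /* EXPR */ {  0, -1, -1 },
    /* TAIL */ { -1,  1,  2 },
};

int parse(const int *input)
{
    int stack[64], sp = 0;
    stack[sp++] = TK_EOF;
    stack[sp++] = NT_EXPR;

    while (sp > 0) {
        int top = stack[--sp], la = *input;
        if (top < NUM_TOKENS) {            /* terminal on the stack */
            if (top == la) { input++; continue; }
            /* PLEH role one: missing terminal. Pretend the fake
               terminal was seen and carry on.                      */
            fprintf(stderr, "inserted missing token %d\n", top);
        } else {                           /* non-terminal          */
            int r = table[top - NUM_TOKENS][la];
            if (r < 0) {
                if (la == TK_EOF) return 1;   /* give up at EOF     */
                /* PLEH role two: unexpected terminal. Here we drop
                   the offending token, push T back, and retry.     */
                fprintf(stderr, "dropped unexpected token %d\n", la);
                input++;
                stack[sp++] = top;
                continue;
            }
            for (const int *s = rules[r]; *s >= 0; s++)
                stack[sp++] = *s;          /* bodies already reversed */
        }
    }
    return 0;
}

int main(void)
{
    const int input[] = { TK_NUM, TK_PLUS, TK_NUM, TK_EOF };
    return parse(input);   /* 0 on success */
}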

Sharing Iterators

I'm writing a (simple) compiler in Scala and have made the tokenizer iterable and now need to write the parser. The plan is to use a recursive-descent strategy and so the parser is going to be split up into a number of methods, each of which calls (some of) the others.
I assume it's going to be necessary/preferable to maintain the state of the tokenizer iterator and share it among the various methods. Is this the case? How should I go about it? If it's not the case, what are the alternatives?
If you have to maintain the state of the iterator, don't use an iterator! Iterators are for when you can destroy your state as you go.
You might be able to get away with using a stream. Streams have a habit of not giving up their memory when they ought to because of references persisting where you don't want them (but where you can tell they exist if you think about it). So if you started with an iterator, you could .toStream it and pass the substreams in, and then pass on the stream for further processing. But you'd have to be very careful about not keeping a reference to the head of the stream if you wanted to avoid keeping everything in memory.
Another way to go is to just dump everything into a vector or array and keep the whole problem in memory; you can then drop the irrelevant parts (or advance the index) as you proceed.
Finally, if you're absolutely positive that you don't need any backtracking, then you can just use the iterator as it is without worrying about "maintaining the state". That is, when you get back from the sub-method, you will already have consumed exactly the right tokens and no more, and will be free to keep parsing. For this to work without at least a one-element "next token that I didn't consume" on the return value, you need to be able to predict where the last token is (e.g. a list of unbounded length would have to end with a token that was part of the list, so {1,2,3} could be a list (if you go into list processing when you see { and drop out when you hit }), but not 1,2,3 + 7 (because you'd consume + before you realized that the list was over)).
You could just construct the token iterator and pass it down each recursive parser call so that the token-level parsing reads from it:
def parse2(tokens: Iterator[String]) = List(tokens.next, tokens.next)
def parse1(tokens: Iterator[String]) = List(parse2(tokens), parse2(tokens))
val tokens = List("a","b","c","d").iterator
val parsed = parse1(tokens) //List(List(a, b), List(c, d))

Is it acceptable to store the previous state as a global variable?

One of the biggest problems with designing a lexical analyzer/parser combination is overzealousness in designing the analyzer. (f)lex isn't designed to have parser logic, which can sometimes interfere with the design of mini-parsers (by means of yy_push_state(), yy_pop_state(), and yy_top_state()).
My goal is to parse a document of the form:
CODE1 this is the text that might appear for a 'CODE' entry
SUBCODE1 the CODE group will have several subcodes, which
may extend onto subsequent lines.
SUBCODE2 however, not all SUBCODEs span multiple lines
SUBCODE3 still, however, there are some SUBCODES that span
not only one or two lines, but any number of lines.
this makes it a challenge to use something like \r\n
as a record delimiter.
CODE2 Moreover, it's not guaranteed that a SUBCODE is the
only way to exit another SUBCODE's scope. There may
be CODE blocks that accomplish this.
In the end, I've decided that this section of the project is better left to the lexical analyzer, since I don't want to create a pattern that matches each line (and identifies continuation records). Part of the reason is that I want the lexical analyzer to have knowledge of the contents of each line, without my having to write my own tokenizing logic. That is to say, if I match ^SUBCODE[ ][ ].{71}\r\n (all records are blocked in 80-character records) I would not be able to harness the power of flex to tokenize the structured data residing in .{71}.
Given these constraints, I'm thinking about doing the following (see the flex sketch after this list):
1. Entering a CODE1 state from the <INITIAL> start condition results in calls to:
   yy_push_state(CODE_STATE)
   yy_push_state(CODE_CODE1_STATE)
   (do something with the contents of the CODE1 state identifier, if such contents exist)
   yy_push_state(SUBCODE_STATE) (to tell the analyzer to expect SUBCODE states belonging to CODE_CODE1_STATE; this is where the analyzer begins to masquerade as a parser)
2. The <SUBCODE1_STATE> start condition is nested as follows: <CODE_STATE>{ <CODE_CODE1_STATE>{ <SUBCODE_STATE>{ <SUBCODE1_STATE>{ (perform actions based on the matching patterns) } } } }. It also sets the global previous_state variable to yy_top_state(), to wit SUBCODE1_STATE.
3. Within <SUBCODE1_STATE>'s scope, \r\n calls yy_pop_state(). If a continuation record is present (a pattern at the highest scope against which all text is matched), yy_push_state(continuation_record_states[previous_state]) is called, bringing us back to the scope of item 2. continuation_record_states[] maps each state to its continuation-record state, which is used by the parser.
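A heavily simplified flex sketch of that skeleton (hypothetical state names and rules, collapsing the intermediate CODE_CODE1/SUBCODE layers, with %option stack enabling the state stack):

%option stack noyywrap

%x CODE_STATE SUBCODE1_STATE

%{
/* global: which SUBCODE state a continuation record returns to */
static int previous_state;
%}

%%

^"CODE1"                  { yy_push_state(CODE_STATE); }
<CODE_STATE>^"SUBCODE1"   { previous_state = SUBCODE1_STATE;
                            yy_push_state(SUBCODE1_STATE); }
<SUBCODE1_STATE>\r\n      { yy_pop_state(); }
<SUBCODE1_STATE>[^\r\n]+  { /* tokenize the record contents here */ }

%%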
As you can see, this is quite complicated, which leads me to conclude that I'm massively over-complicating the task.
Questions
For states lacking an extremely clear token signifying the end of their scope, is my proposed solution acceptable?
Given that I want to tokenize the input using flex, is there any way to do so without start conditions?
The biggest problem I'm having is that each record (beginning with the (SUB)CODE prefix) is unique, but the information appearing after the (SUB)CODE prefix is not. Therefore, it almost appears mandatory to have multiple states like this, and the abstract CODE_STATE and SUBCODE_STATE states would act as groupings for each of the concrete SUBCODE[0-9]+_STATE and CODE[0-9]+_STATE states.
I would look at how the OMeta parser handles these things.
