LL(1) Parsers: Is the FOLLOW set really necessary?

As far as I understand it, the FOLLOW set is there to tell me at the earliest possible moment whether there is an error in the input stream. Is that right?
Because otherwise I'm wondering what you actually need it for. Suppose your parser has a non-terminal on top of the stack (in our class we used a stack as the abstraction for LL parsers),
i.e.
[TOP] X...[BOTTOM]
X - let it be a non-terminal - is to be replaced in the next step since it is at the top of the stack, so the parser asks the parsing table which derivation to use for X. Suppose the input is
+ b
Where + and b are both terminals.
Suppose X has "" (the empty string) in its FIRST set, and that + is NOT in its FIRST set.
As far as I can see, in this situation the parser could simply check that there is no + in the FIRST set of X and then use the derivation which dissolves X into the empty string, since that is the only way it could possibly continue parsing the input without throwing an error. If the input stream is invalid, the parser will recognize that at some later point anyway. I understand that the FOLLOW set can help here to identify right away whether parsing can continue without an error or not.
My question is - is that really the only role that the FOLLOW set plays?
I hope my question belongs here - I'm sorry if not. Also feel free to request clarification in case something is unclear.
Thank you in advance

You are right about that: the parser could just continue parsing and would eventually find the conflict in another way.
Besides that, the FOLLOW set can be very convenient for reasoning about a grammar - not for the parser, but for the people who construct the grammar. For instance, if you discover that a FIRST/FIRST or FIRST/FOLLOW conflict exists, the grammar is not LL(1), and you may want to revise it.
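To make the timing difference concrete, here is a minimal sketch (the toy grammar, names, and table layout are made up for illustration, not taken from the question) of how FOLLOW restricts the epsilon-entries of an LL(1) table, so that an impossible lookahead is rejected immediately instead of after X has silently vanished:

# Toy grammar: S -> X b ; X -> a X | "" (epsilon). FOLLOW(X) = {b}.
FIRST_X = {"a"}     # terminals that can begin a string derived from X
FOLLOW_X = {"b"}    # terminals that can legally appear right after X

table = {}
for t in FIRST_X:
    table[("X", t)] = ["a", "X"]   # use X -> a X
for t in FOLLOW_X:
    table[("X", t)] = []           # use X -> epsilon, ONLY for t in FOLLOW(X)

def lookup(nonterminal, token):
    try:
        return table[(nonterminal, token)]
    except KeyError:
        raise SyntaxError(f"unexpected {token!r} while expanding {nonterminal}")

print(lookup("X", "a"))   # ['a', 'X']
print(lookup("X", "b"))   # []  (epsilon)
# lookup("X", "+")  -> SyntaxError immediately; without the FOLLOW
# restriction, X would dissolve to epsilon and the error would only
# surface later, when '+' fails to match 'b'.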

Do we have to assume a parser is pre-reading tokens ahead of time?

Coming back to lexers and parsers after many years away, I find myself puzzled over the concept of a state change, for the purposes of context. I'm using Lemon as a parser and putting together my own lexer.
Let's take an example input like this one:
[groups]
syscon:
0x000 sysmemremap
0x004 presetctrl
[registers]
sysmemremap:
map 1-0
rsvd 31-2
presetctrl:32
mux 2-0
..etc...
So "syscon:" and "sysmemremap:" look the same but one is a GROUPNAME and the other is a REGISTERNAME. There's a context change between [groups] and [registers] that determines what each token is in reality.
Is it the parser that is in the best position to make that contextual change? Since the parser doesn't have a sectional grammar, where one set of rules applies in one set of circumstances and another set in a different one, I presume the lexer should be the one deciding that "syscon:" generates a GROUPNAME if the mode says it should.
EDIT: Just spotted "the lexer hack" entry in Wikipedia that summarises the issue:
Without added context, the lexer cannot distinguish type identifiers
from other identifiers because all identifiers have the same format.
.... The solution generally consists of feeding information from the
semantic symbol table back into the lexer. That is, rather than
functioning as a pure one-way pipeline from the lexer to the parser,
there is a backchannel from semantic analysis back to the lexer.
Except (and this is the question I have) what can you assume about the parser's pre-reading of tokens? If the parser is reading ahead to do a better match - which I would expect it to do to some extent at least - it could well run into the situation where a state change in the parser comes too late for the lexer, because the lexer has already met and processed that token!
Or am I overthinking this?
I may be able to answer my own question. I think that any sort of reliance on what the parser is doing internally is likely a Bad Plan.
Now that I've found my lex & yacc book (the O'Reilly one), one of the examples in the lex section is exactly this kind of state change: if the lexer sees the word "verb", it starts defining verbs as opposed to looking them up. That work is done in the lexer, so I guess that's the way it is - do it in the lexer.
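For illustration, here is a minimal sketch of that conclusion (hypothetical and hand-written, not Lemon- or lex-specific): the lexer itself tracks the current section and decides which token type a name line produces:

import re

SECTION_RE = re.compile(r"\[(\w+)\]")   # e.g. [groups] or [registers]
NAME_RE = re.compile(r"(\w+):")         # e.g. syscon: or sysmemremap:

def tokenize(lines):
    mode = None                          # current section name
    for line in lines:
        line = line.strip()
        section = SECTION_RE.fullmatch(line)
        if section:
            mode = section.group(1)      # the state change lives in the lexer
            continue
        name = NAME_RE.match(line)
        if name:
            kind = "GROUPNAME" if mode == "groups" else "REGISTERNAME"
            yield (kind, name.group(1))
        # field lines like "0x000 sysmemremap" are omitted in this sketch

src = ["[groups]", "syscon:", "[registers]", "sysmemremap:"]
print(list(tokenize(src)))
# [('GROUPNAME', 'syscon'), ('REGISTERNAME', 'sysmemremap')]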

How to read through an LL parse?

I read that an LL parser is a top-down parser, so logically I suppose that we read through it from the top to the bottom.
However, there are many ways to read from top to bottom.
I found a page on Wikipedia about depth-first traversal, which describes walking a tree data structure (a binary tree).
There are three kinds of depth-first traversal: pre-order, in-order, and post-order.
In my mind, I suppose that I need to use the post-order one, but how can I be sure?
How do I know which kind of depth-first traversal to use for LL parsing?
Depth-first traversal: https://en.wikipedia.org/wiki/Tree_traversal
Thanks
There's usually an infinite number of ways to traverse a grammar, just like there's an infinite number of possible inputs that adhere to the grammar.
When you walk the grammar you typically don't do it like you would a traditional tree or graph structure. Rather, your walk is dictated by an input stream of tokens coming from the lexer.
E.g. if you find yourself at a place in the grammar where there is a production in which either an identifier or an integer literal may occur, the branch taken is dictated by whether the current token is one or the other (or something else, which would then be a syntax error for that input).
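A minimal sketch of that token-driven branching in a recursive-descent parser (the token names and list-of-pairs representation are hypothetical):

def parse_primary(tokens):
    """Parse 'primary : IDENTIFIER | INTEGER' by looking at the current token."""
    kind, value = tokens[0]
    if kind == "IDENTIFIER":
        return ("id", value), tokens[1:]
    if kind == "INTEGER":
        return ("int", int(value)), tokens[1:]
    # anything else cannot start a primary: a syntax error for this input
    raise SyntaxError(f"expected identifier or integer, got {kind}")

node, rest = parse_primary([("INTEGER", "42"), ("EOF", "")])
print(node)   # ('int', 42)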

Issue with left recursion in top down parsing

I have read this to better understand the difference between top-down and bottom-up parsing. Can anyone explain the problems associated with left recursion in a top-down parser?
In a top-down parser, the parser begins with the start symbol and tries to guess which productions to apply to reach the input string. To do so, top-down parsers need to use contextual clues from the input string to guide their guesswork.
Most top-down parsers are directional parsers, which scan the input in some direction (typically, left to right) when trying to determine which productions to guess. The LL(k) family of parsers is one example of this - these parsers use information about the next k symbols of input to determine which productions to use.
Typically, the parser uses the next few tokens of input to guess productions by looking at which productions can ultimately lead to strings that start with the upcoming tokens. For example, if you had the production
A → bC
you wouldn't choose to use this production unless the next character to match was b. Otherwise, you'd be guaranteed there was a mismatch. Similarly, if the next input character was b, you might want to choose this production.
So where does left recursion come in? Well, suppose that you have these two productions:
A → Ab | b
This grammar generates all strings of one or more copies of the character b. If you see a b in the input as your next character, which production should you pick? If you choose Ab, then you're assuming there are multiple b's ahead of you even though you're not sure this is the case. If you choose b, you're assuming there's only one b ahead of you, which might be wrong. In other words, if you have to pick one of the two productions, you can't always choose correctly.
The issue with left recursion is that if you have a nonterminal that's left-recursive and find a string that might match it, you can't necessarily know whether to use the recursion to generate a longer string or avoid the recursion and generate a shorter string. Most top-down parsers will either fail to work for this reason (they'll report that there's some uncertainty about how to proceed and refuse to parse), or they'll potentially use extra memory to track each possible branch, running out of space.
In short, top-down parsers usually try to guess what to do from limited information about the string. Because of this, they get confused by left recursion because they can't always accurately predict which productions to use.
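To make the fix concrete: the standard cure is to eliminate the left recursion, e.g. rewriting A → Ab | b as A → b A' with A' → b A' | ε, which a top-down parser can then implement as a simple loop. A minimal sketch (not part of the original answer):

def parse_A(s, i=0):
    """Parse A -> Ab | b after eliminating the left recursion:
    A -> b A' ; A' -> b A' | epsilon. The recursion becomes a loop."""
    if i >= len(s) or s[i] != "b":
        raise SyntaxError(f"expected 'b' at position {i}")
    i += 1                        # the mandatory first b
    while i < len(s) and s[i] == "b":
        i += 1                    # each further b is one A' step
    return i                      # index just past the parsed A

print(parse_A("bbb"))   # 3 -- all three b's consumed, no guessing needed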
Hope this helps!
Reasons
1) A grammar which is left-recursive (directly or indirectly) is not in Greibach normal form (GNF), so the left recursion has to be eliminated, rewriting it in a right-recursive form.
2) Left-recursive grammars are also not LL(1), so again, eliminating the left recursion may result in an LL(1) grammar.
GNF
A grammar whose productions have the form A → aV (a terminal followed by non-terminals) is in Greibach normal form.

Is it possible to call one yacc parser from another to parse a specific token substream?

Suppose I already have a complete YACC grammar. Let that be C grammar for example. Now I want to create a separate parser for domain-specific language, with simple grammar, except that it still needs to parse complete C type declarations. I wouldn't like to duplicate long rules from the original grammar with associated handling code, but instead would like to call out to the original parser to handle exactly one rule (let's call it "declarator").
If it was a recursive descent parser, there would be a function for each rule, easy to call in. But what about YACC with its implicit stack automaton?
Basically, no. Composing LR grammars is not easy, and bison doesn't offer much help.
But all is not lost. Nothing stops you from including the entire grammar (except the %start declaration), and just using part of it, except for one little detail: bison will complain about useless productions.
If that's a show-stopper for you, then you can use a trick to make it possible to create a grammar with multiple start rules. In fact, you can create a grammar which lets you specify the start symbol every time you call the parser; it doesn't even have to be baked in. Then you can tuck that into a library and use whichever parser you want.
Of course, this also comes at a cost: the cost is that the parser is bigger than it would otherwise need to be. However, it shouldn't be any slower, or at least not much -- there might be some cache effects -- and the extra size is probably insignificant compared to the rest of your compiler.
The hack is described in the bison FAQ in quite a lot of detail, so I'll just do an outline here: for each start production you want to support, you create one extra production which starts with a pseudo-token (that is, a lexical code which will never be generated by the lexer). For example, you might do the following:
%start meta_start
%token START_C START_DSL
meta_start: START_C c_start | START_DSL dsl_start;
Now you just have to arrange for the lexer to produce the appropriate START token when it first starts up. There are various ways to do that; the FAQ suggests using a global variable, but if you use a re-entrant flex scanner, you can just put the desired start token in the scanner state (along with a flag which is set when the start token has been sent).
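The lexer side can be as small as a one-shot flag. Here is a language-agnostic sketch of the idea (in Python with hypothetical names, just to show the shape; a real re-entrant flex scanner would keep the flag in its scanner state instead):

def make_lexer(real_tokens, start_token):
    """Wrap a token stream so the parser's very first token is the
    pseudo-token (START_C or START_DSL) that selects the start rule."""
    sent_start = False
    def next_token():
        nonlocal sent_start
        if not sent_start:
            sent_start = True
            return start_token           # emitted exactly once, at startup
        return next(real_tokens, "EOF")
    return next_token

lex = make_lexer(iter(["ID", "EQUALS", "NUM"]), "START_DSL")
print(lex())   # START_DSL -- steers the parser into the dsl_start rule
print(lex())   # ID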

Encode FIRST and FOLLOW sets into a recursive descent parser

This is a follow-up to a previous question I asked, How to encode FIRST & FOLLOW sets inside a compiler, but this one is more about the design of my program.
I am implementing the syntax analysis phase of my compiler by writing a recursive descent parser. I need to be able to take advantage of the FIRST and FOLLOW sets so I can handle errors in the syntax of the source program more efficiently. I have already calculated the FIRST and FOLLOW sets for all of my non-terminals, but am having trouble deciding where to logically place them in my program and what the best data structure for them would be.
Note: all code will be pseudocode
Option 1) Use a map, and map each non-terminal by its name to a pair of sets containing its FIRST and FOLLOW sets:
class ParseConstants
Map firstAndFollowMap = #create a map .....
firstAndFollowMap.put("<program>", [FIRST_SET, FOLLOW_SET])
end
This seems like a viable option, but inside my parser I would then need somewhat ugly code like this to retrieve the FIRST and FOLLOW sets and pass them to the error function:
#processes the <program> non-terminal
def program
List list = firstAndFollowMap.get("<program>")
Set FIRST = list.get(0)
Set FOLLOW = list.get(1)
error(current_symbol, FOLLOW)
end
Option 2) Create a class for each non-terminal and have a FIRST and FOLLOW property:
class Program
FIRST = .....
FOLLOW = ....
end
This leads to code that looks a little nicer:
#processes the <program> non-terminal
def program
error(current_symbol, Program.FOLLOW)
end
These are the two options I thought up, I would love to hear any other suggestions for ways to encode these two sets, and also any critiques and additions to the two ways I posted would be helpful.
Thanks
I have also posted this question here: http://www.coderanch.com/t/570697/java/java/Encode-FIRST-FOLLOW-sets-recursive
You don't really need the FIRST and FOLLOW sets at parse time. You need to compute them to get the parse table. For LL(k), that is a table of {<non-terminal, token> -> <action, rule>} (meaning: seeing this non-terminal on the stack and this token in the input, take this action and, if it applies, use this rule). For (C|LA|)LR(k), it is a table of {<state, token> -> <action, state>} (meaning: given this state on the stack and this token in the input, take this action and go to this state).
After you have this table, you don't need FIRST and FOLLOW any more.
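A minimal sketch of that separation (toy grammar and names, purely illustrative): FIRST/FOLLOW were only needed offline to fill the table; the parse loop itself just consults it:

# Table built offline (from FIRST/FOLLOW) for the toy grammar: S -> a S | b
table = {
    ("S", "a"): ["a", "S"],
    ("S", "b"): ["b"],
}
NONTERMINALS = {"S"}

def parse(tokens):
    stack = ["S"]                            # start symbol
    i = 0
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:
            if i >= len(tokens):
                raise SyntaxError("unexpected end of input")
            rule = table.get((top, tokens[i]))
            if rule is None:
                raise SyntaxError(f"no rule for ({top}, {tokens[i]})")
            stack.extend(reversed(rule))     # push RHS, leftmost symbol on top
        elif i < len(tokens) and top == tokens[i]:
            i += 1                           # terminal matched: consume input
        else:
            raise SyntaxError(f"expected {top!r}")
    return i == len(tokens)

print(parse(["a", "a", "b"]))   # True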
If you are writing a semantic analyzer, you must assume the parser is working correctly. Phrase level error handling (which means handling parse errors) is totally orthogonal to semantic analysis.
This means that in case of a parse error, the phrase level error handler (PLEH) tries to fix the error. If it can't, parsing stops. If it can, the semantic analyzer shouldn't know whether there was an error that got fixed, or no error at all!
You can take a look at my compiler library for examples.
About phrase level error handling, you again don't need FIRST and FOLLOW. Let's talk about LL(k) for now (simply because I haven't thought as much about LR(k) yet). After you build the grammar table, you have many entries of the form I described above:
<non-terminal, token> -> <action, rule>
Now, when you parse, you take whatever is on top of the stack. If it is a terminal, you must match it against the input. If it doesn't match, the phrase level error handler kicks in:
Role one: handle missing terminals - simply generate a fake terminal of the type you need in your lexer and have the parser retry. You can do other things as well (for example, look ahead in the input: if the token you want is there, drop one token from the lexer).
If what you get from the stack is a non-terminal (T) instead, you must look at your lexer, get the lookahead, and consult your table. If the entry <T, lookahead> exists, you're good to go: follow the action and push to/pop from the stack. If, however, no such entry exists, the phrase level error handler kicks in again:
Role two: handle unexpected terminals - you can do many things to get past this. What you do depends on what T and the lookahead are, and on your expert knowledge of your grammar.
Examples of the things you can do are:
Fail! You can do nothing
Ignore this terminal. This means that you push the lookahead to the stack (after pushing T back again) and have the parser continue. The parser will first match the lookahead, throw it away, and continue. Example: if you have an expression like *1+2/0.5, you can drop the unexpected * this way.
Change the lookahead to something acceptable, push T back, and retry. For example, an expression like 5id = 10; could be illegal because you don't accept identifiers that start with digits. You could replace it with, say, _5id and continue.
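A minimal sketch of those two recovery roles (hypothetical helpers, simplified to inserting or skipping single tokens; the real strategy would depend on the grammar):

def match_terminal(expected, tokens, i):
    """Role one: if the expected terminal is missing, pretend we saw it
    (insert a fake token and report) instead of aborting the parse."""
    if i < len(tokens) and tokens[i] == expected:
        return i + 1                  # normal match: consume the token
    print(f"error: inserted missing {expected!r}")
    return i                          # fake terminal: nothing consumed

def expand_nonterminal(T, tokens, i, table):
    """Role two: on an unexpected lookahead, drop the offending token
    and retry, rather than failing immediately."""
    while i < len(tokens) and (T, tokens[i]) not in table:
        print(f"error: ignoring unexpected {tokens[i]!r}")
        i += 1                        # skip the token and retry
    if i == len(tokens):
        raise SyntaxError(f"ran out of input while expanding {T}")
    return table[(T, tokens[i])], i

rule, i = expand_nonterminal("E", ["*", "1"], 0, {("E", "1"): ["1"]})
print(rule, i)   # ['1'] 1 -- the stray '*' from '*1+2/0.5' was dropped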
