Can't understand what professor is saying. LL(1) parsing [closed] - parsing

I'm reading it again and again but can't understand.
http://awesomescreenshot.com/09c45nhted
A few things I don't understand:
The meaning of epsilon, aside from "empty string".
The meaning of $.
How is R3 possible? It has term, which would go to factor, which would go to something that does not exist in the input stream.
The 3rd bullet point on the second page.
I appreciate any help. Thank you!

Epsilon meaning, aside from "empty string"?
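In the simplest terms, the ϵ symbol means nothing: a production whose right-hand side is ϵ replaces its non-terminal with the empty string and consumes no input.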
$ meaning ?
In general, $ can mark either the start of the input or the end of the input. Here it means the end of the input: the input cannot begin with $, because the start symbol of this CFG is stmt.
How R3 is possible? It has term which would go to factor which would
go to something that is not exist in input stream.
Beginners often have trouble with this kind of thing, and that is normal. R3 is a recursive production, but the recursion is resolved while parsing the input. Notice the next production, R4 : term_tail ---> ϵ. Whenever nothing more needs to be derived from term_tail (the next input symbol cannot start an add_op), R4 is used instead of R3. So there is no infinite recursion, as you might have been thinking.
3rd bullet point on second page?
It is the input character that can follow term_tail in the grammar. This statement answers the question in the second bullet point, "So what input character can be consumed if we apply R4?" The input derived from term_tail can be produced in 2 ways:
EITHER term_tail ---> add_op term term_tail OR term_tail ---> ϵ
Through those bulleted points, the author is trying to highlight the practical significance of the FOLLOW() function in top-down parsing. The author's intent is to work out the conditions under which R4 can be applied in top-down parsing, as stated at the top of the 2nd page: "the possible input characters for which R4 can be applied?".
The FOLLOW() of term_tail comes out to be { '$', ')' }. You will be able to calculate this once you have studied the rules for computing FOLLOW().
NOTE (VERY-VERY IMPORTANT) :-
FOLLOW() shows us the terminals that can come after a derived non-terminal. Note, this does not mean the last terminal derived from a non-terminal. It's the set of terminals that can come after it. We define FOLLOW() for all the non-terminals in the grammar.
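How do we figure out FOLLOW()? Instead of looking at the first terminal for each phrase on the right side of the arrow, we find every place our non-terminal is located on the right side of any of the arrows. Then we look at what can come immediately after it there: the terminals that can start the symbols following it and, if nothing follows it or everything after it can derive ϵ, the FOLLOW() of the non-terminal on the left side of that arrow.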
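The FOLLOW() rules can be tried out with a short Python sketch. The professor's grammar is only in the screenshot, so the grammar below is an assumption: a small expression grammar that uses the names mentioned in this thread (stmt, term_tail, add_op, term, factor) and is chosen so that the result matches the FOLLOW() stated above. The helper names are made up for illustration.

```python
EPS = "ε"   # marker for the empty string
END = "$"   # end-of-input marker

# Hypothetical grammar (assumed, not taken from the screenshot).
# An empty list is an ε-production such as R4: term_tail ---> ε.
GRAMMAR = {
    "stmt": [["expr"]],
    "expr": [["term", "term_tail"]],
    "term_tail": [["add_op", "term", "term_tail"], []],
    "term": [["factor"]],
    "factor": [["(", "expr", ")"], ["id"]],
    "add_op": [["+"], ["-"]],
}
START = "stmt"


def first_of_seq(seq, first):
    """FIRST of a symbol sequence; includes EPS if the whole sequence can vanish."""
    out = set()
    for sym in seq:
        sym_first = first[sym] if sym in GRAMMAR else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)
    return out


def compute_first():
    """Iterate to a fixed point over all productions."""
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nt, prods in GRAMMAR.items():
            for prod in prods:
                new = first_of_seq(prod, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(first):
    """For each occurrence of a non-terminal on a right side, look at what comes after it."""
    follow = {nt: set() for nt in GRAMMAR}
    follow[START].add(END)
    changed = True
    while changed:
        changed = False
        for lhs, prods in GRAMMAR.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in GRAMMAR:
                        continue  # terminals have no FOLLOW set
                    rest = first_of_seq(prod[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:  # nothing (or only nullable symbols) after sym
                        new |= follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
print(sorted(FOLLOW["term_tail"]))
```

With this assumed grammar, a top-down parser would pick R4 exactly when the next input symbol is in FOLLOW["term_tail"], and R3 when it is in FIRST of add_op.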

Related

All viable prefixes of a Context Free Grammar

I am stuck on a problem from the famous Dragon Book of compiler design. How do I find all the viable prefixes of the following grammar:
S -> 0S1 | 01
The grammar actually generates the language of the regex 0^n1^n.
I presume the set of all viable prefixes might come out as a regex too. I came up with the following solution:
0+
0+S
0+S1
0+1
S
(By plus, I mean the number of zeroes is 1..inf.)
I got these after reducing the string 000111 with the following steps:
stack    input
(empty)  000111
0        00111
00       0111
000      111
0001     11
00S      11
00S1     1
0S       1
0S1      $
S        $
Is my solution correct, or am I missing something?
0^n1^n is not a regular language; regexen don't have variables like n and they cannot enforce an equal number of repetitions of two distinct subsequences. Nonetheless, for any context-free grammar, the set of viable prefixes is a regular language. (A proof of this fact, in some form, appears at the beginning of Part II of Donald Knuth's seminal 1965 paper, On the Translation of Languages from Left to Right, which demonstrated both a test for the LR(k) property and an algorithm for parsing LR(k) grammars in linear time.)
OK, to the actual question. A viable prefix for a grammar is (by definition) the prefix of a sentential form which can appear on the stack during a parse using that grammar. It's called "viable" (which means "still alive" or "could continue growing") precisely because it must be the prefix of some right sentential form whose suffix contains no non-terminal symbol. In other words, there exists a sequence of terminals which can be appended to the viable prefix in order to produce a right-sentential form; the viable prefix can grow.
Knuth shows how to create a DFA which recognizes all viable prefixes, but it's easier to see this DFA if we already have the LR(k) parser produced by an LR(k) algorithm. That parser is a finite-state machine whose alphabet is the set of terminal and non-terminal symbols of a grammar. To get the viable-prefix machine, we use exactly the same state machine, but we remove the stack (so that it becomes just a state machine) and the reduce actions, leaving only the shift and goto actions as transitions. All states in the viable-prefix machine are accepting states, since any prefix of a viable prefix is itself a viable prefix.
A key feature of this new automaton is that it cannot extend a prefix with a reduce action (since we removed all the reduce actions). A prefix with a reduce action is a prefix which ends in a handle -- recall that a handle is the right-hand side of some production -- so another definition of a viable prefix is that it is a prefix of a right-sentential form (that is, of a possible step in a rightmost derivation) which does not extend beyond the right-most handle.
The grammar you are working with has only two productions, so there are only two handles, 01 and 0S1. Note that 10 and 1S cannot be subsequences of any right-sentential form, nor can a right-sentential form contain more than one S. Any right-sentential form must either be a sentence 0^n1^n where n>0 or a sentential form 0^nS1^n where n≥0. But every handle ends at the first 1 of a sentential form, and so a viable prefix must end at or before the first 1. This produces precisely the four possibilities you list that start with 0 (0+, 0+S, 0+S1 and 0+1), which we can condense to the regular expression 0*0(1|S1?)?. Your fifth entry, S alone, is what remains on the stack once the whole input has been reduced.
Chopping off the suffix removed the second n from the formula, so there is no longer a requirement of concordance and the language is regular.
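A quick way to sanity-check a candidate regular expression is to simulate the shift-reduce parse and test every stack content against it. The Python sketch below does that for S -> 0S1 | 01. The function name and the exact pattern are my own: the pattern spells out the prefixes listed in the question (0+, 0+1, 0+S, 0+S1, S) plus the empty stack.

```python
import re

# Candidate pattern for viable prefixes (an assumption being tested, not a given).
VIABLE = re.compile(r"(0+(1|S1?)?|S)?")


def stack_snapshots(sentence):
    """Shift-reduce parse of a 0^n 1^n sentence; return every stack content seen."""
    stack = ""
    snapshots = [stack]
    for token in sentence:
        stack += token                      # shift
        snapshots.append(stack)
        # Reduce while a handle (0S1 or 01) sits on top of the stack.
        while stack.endswith("0S1") or stack.endswith("01"):
            handle = "0S1" if stack.endswith("0S1") else "01"
            stack = stack[: -len(handle)] + "S"
            snapshots.append(stack)
    return snapshots


for n in range(1, 6):
    for snap in stack_snapshots("0" * n + "1" * n):
        assert VIABLE.fullmatch(snap), snap

print(stack_snapshots("000111"))
```

For 000111 the snapshots are the same stack contents as in the trace in the question, and none of them falls outside the pattern.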
Note:
Questions like this and their answers are begging to be rendered using MathJax. StackOverflow, unfortunately, does not provide this extension, which is apparently considered unnecessary for programming. However, there is a site in the StackExchange constellation dedicated to computing science questions, http://cs.stackexchange.com, and another one dedicated to mathematical questions, http://math.stackexchange.com. Formal language theory is part of both computing science and mathematics. Both of those sites permit MathJax, and questions on those sites will not be closed because they are not programming questions. I suggest you take this information into account for questions like this one.

LL-1 Parsers: Is the FOLLOW-Set really necessary?

As far as I understand, the FOLLOW set is there to tell me at the first possible moment if there is an error in the input stream. Is that right?
Because otherwise I'm wondering what you actually need it for. Consider that your parser has a non-terminal on top of the stack (in our class we used a stack as an abstraction for LL parsers)
i.e.
[TOP] X...[BOTTOM]
The X - let it be a non-terminal - is to be replaced in the next step since it is at the top of the stack. Therefore the parser asks the parsing table what derivation to use for X. Consider the input is
+ b
Where + and b are both terminals.
Suppose X has "" (i.e. the empty string) in its FIRST set, and it has NO + in its FIRST set.
As far as I see it in this situation, the parser could simply check that there is no + in the FIRST set of X and then use the derivation which lets X dissolve into an "" i.e. empty string since it is the only way how the parser possibly can continue parsing the input without throwing an error. If the input stream is invalid the parser will recognize it anyway at some moment later in time. I understand that the FOLLOW set can help here to right away identify whether parsing can continue without an error or not.
My question is - is that really the only role that the FOLLOW set plays?
I hope my question belongs here - I'm sorry if not. Also feel free to request clarification in case something is unclear.
Thank you in advance
You are right about that. The parser could just continue parsing and would eventually detect the error in another way.
Besides that, the FOLLOW set can be very convenient in reasoning about a grammar. Not by the parser, but by the people that construct the grammar. For instance, if you discover that any FIRST/FIRST or FIRST/FOLLOW conflict exists, your grammar is not LL(1) (it may even be ambiguous), and you may want to revise it.
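The "sooner or later" point can be seen in a small Python sketch. The grammar here is made up for illustration (S -> X b ; X -> a X | ε), with FIRST(X) and FOLLOW(X) written out by hand. The parser either checks FOLLOW(X) before erasing X or simply erases X whenever the lookahead is not in FIRST(X).

```python
# Hypothetical grammar, chosen only for this example:
#   S -> X b
#   X -> a X | ε
FIRST_X = {"a"}    # terminals that can start X (ε left out)
FOLLOW_X = {"b"}   # terminals that can come right after X


def parse(tokens, use_follow):
    """Return (accepted, symbol_on_top_of_stack_when_the_error_was_noticed)."""
    tokens = list(tokens) + ["$"]
    stack = ["$", "S"]          # top of the stack is the end of the list
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top == "S":
            stack += ["b", "X"]             # S -> X b, pushed in reverse
        elif top == "X":
            if look in FIRST_X:
                stack += ["X", "a"]         # X -> a X, pushed in reverse
            elif use_follow and look not in FOLLOW_X:
                return False, "X"           # error noticed right away
            # otherwise X -> ε: push nothing
        else:                               # terminal or end marker
            if top != look:
                return False, top           # error noticed while matching
            pos += 1
    return True, None


print(parse(["+", "b"], use_follow=True))
print(parse(["+", "b"], use_follow=False))
```

Both variants reject + b. The difference is only where the error is reported: at X when FOLLOW is consulted, and one step later (when b fails to match +) when it is not.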

Are there different types of parsers? [closed]

Consider this simple grammar:
S -> a | b
The set of strings that may be generated by the grammar is:
{a, b}
Thus, a grammar generates a set of strings.
A parser for a grammar takes an input string and determines if the string could be generated by the grammar.
Thus, a parser is a recognizer for a grammar.
At least, that is one use of a parser.
But oftentimes a parser is used for other things. For example, a parser for a grammar may take an input string and create a tree structure which contains the input data and conforms to the grammar.
In that case the parser is not a recognizer, it is a data structure builder.
I conclude that there are different types of parsers.
Am I thinking logically? Are there indeed different types of parsers?
Has someone created a list of the different types of things that parsers have been created for?
Please let me know of any looseness or ambiguity in the above statements. I am trying to learn to be precise in statements about these concepts. For example, do you agree that "a grammar generates a set of strings"? Is that precise? correct?
No, I do not agree. A grammar does not generate anything. It is a set of rules which define the structure of something. A parser takes a grammar and some form of input and produces some form of output, whether that be an abstract syntax tree, an indication of whether the input is well-formed according to the grammar, or whatever else it might be. There are different types of parsers, but not because of what they produce. Rather, parsers are classified based on what kind of grammars they can accept and how that grammar is interpreted. For example, there are LL parsers and LR parsers, with various subtypes having additional restrictions on, for example, how many tokens of lookahead are needed.
Regarding a grammar "generating" something, what would this generate?
S -> ("a" | "b") S?
As soon as the grammar becomes non-trivial, finding all valid input starts to no longer make sense.
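To make the recognizer versus tree-builder distinction from the question concrete, here is a minimal Python sketch for the grammar S -> ("a" | "b") S? above. The function names and the tuple shape of the tree are my own choices for illustration. Both functions follow the same grammar and differ only in what they produce.

```python
def recognize(text):
    """Recognizer: True if text can be derived from S -> ("a" | "b") S?."""
    return len(text) > 0 and all(ch in "ab" for ch in text)


def parse_tree(text, pos=0):
    """Tree builder: return ('S', letter, subtree_or_None); raise ValueError on bad input."""
    if pos >= len(text) or text[pos] not in "ab":
        raise ValueError(f"expected 'a' or 'b' at position {pos}")
    # The optional trailing S is present only if there is more input.
    rest = parse_tree(text, pos + 1) if pos + 1 < len(text) else None
    return ("S", text[pos], rest)


print(recognize("abba"))
print(parse_tree("ab"))
```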

Can an LL(1) parse table be valid if there is a column with no entries in its cells?

I'm doing exam questions for revision for an exam. One of the questions is to construct an LL(1) parse table from the first and follow sets calculated in the previous question.
Now I am nearly positive that I have constructed the first and follow sets correctly and the table does not have any duplicate entries in any of its cells, so I assumed that the grammar was a valid LL(1) grammar (we are asked to determine if it is valid, hence why I needed to construct the table).
However, the next question is to convert the grammar into a valid LL(1) grammar, obviously implying that it is not LL(1).
So my question is actually 2 questions.
Is the grammar not an LL(1) grammar due to the fact that there is a column without any entries?
OR
If this is allowable in an LL(1) parse table, is it most likely that I went wrong creating the first and follow sets?
Here is my working out of the question and the grammar which is in the box
http://imgur.com/UwmOAvX
It is perfectly ok for a column to have no entries -- that just means that the terminal in question is not in the FIRST set of any non-terminal, which can easily happen for symbols that don't appear in lead context anywhere (for example, a ) will often be such a symbol).
In your case, the problem appears to be that you forgot to put the rule B -> B v into the table. You also have an error in FIRST(D) and FOLLOW(B) -- the latter comes from the former.
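Both points can be seen in a small Python sketch. The grammar from the image is not reproduced here, so the one below is a made-up stand-in that contains a left-recursive rule of the same shape (B -> B v). It has no ε-productions, so FIRST alone is enough to fill the table in this simplified version.

```python
from collections import defaultdict

# Hypothetical grammar (not the one from the exam question), no ε-productions.
GRAMMAR = {
    "A": [["B", "x"]],
    "B": [["B", "v"], ["w"]],
}
TERMINALS = ["v", "w", "x"]


def first_sets(grammar):
    """FIRST sets by fixed-point iteration (valid because nothing is nullable)."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                head = prod[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def build_table(grammar):
    """Map (non-terminal, terminal) to the list of productions placed in that cell."""
    first = first_sets(grammar)
    table = defaultdict(list)
    for nt, prods in grammar.items():
        for prod in prods:
            head = prod[0]
            for terminal in (first[head] if head in grammar else {head}):
                table[(nt, terminal)].append(prod)
    return table


TABLE = build_table(GRAMMAR)
empty_columns = [t for t in TERMINALS if not any(col == t for (_, col) in TABLE)]
conflicts = {cell: prods for cell, prods in TABLE.items() if len(prods) > 1}
print("empty columns:", empty_columns)
print("conflicts:", conflicts)
```

The columns for v and x stay empty, which is harmless. What makes this grammar fail the LL(1) test is the cell (B, w), which receives both B -> B v and B -> w.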

How to turn a token stream into a parse tree [closed]

I have a lexer built that streams out tokens from an input, but I'm not sure how to build the next step in the process - the parse tree. Does anybody have any good resources or examples on how to accomplish this?
I would really recommend http://www.antlr.org/ and of course the classic Dragon Compilers book.
For an easy language like JavaScript it's not hard to hand roll a recursive descent parser, but it's almost always easier to use a tool like yacc or antlr.
I think to step back to the basics of your question, you really want to study up on BNF-esque grammar syntax and pick a syntax for your target. If you have that, the parse tree should sort of fall out, being the 'instance' manifestation of that grammar.
Also, don't try to turn the creation of your parse tree into your final solution (like generating code, or what-not). It might seem do-able and more efficient, but invariably there will come a time when you'll really wish you had that parse tree 'as is' lying around.
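For a sense of what hand-rolling looks like, here is a minimal recursive descent sketch in Python. It assumes the lexer already delivers a list where numbers are ints and operators and parentheses are one-character strings. That token format, the function names and the nested-tuple tree are all assumptions made for this example, and the result is closer to an abstract syntax tree than a full parse tree.

```python
def parse_expr(tokens, pos=0):
    """expr := term (('+' | '-') term)*   (left-associative)"""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)
    return node, pos


def parse_term(tokens, pos):
    """term := factor (('*' | '/') factor)*   (binds tighter than + and -)"""
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        node = (op, node, right)
    return node, pos


def parse_factor(tokens, pos):
    """factor := INT | '(' expr ')'"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        node, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    if isinstance(tok, int):
        return tok, pos + 1
    raise SyntaxError(f"unexpected token {tok!r}")


def parse(tokens):
    """Parse a whole token list and insist that every token is consumed."""
    tree, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse([1, "+", 2, "*", 3]))
```

Each grammar rule becomes one function, and the tree is kept as plain data so that later stages (evaluation, code generation) can walk it separately.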
You should investigate parser generator tools for your platform. A parser generator allows you to specify a context-free grammar for your language. The grammar consists of a number of rules which "reduce" a series of symbols into a new symbol. You can usually also specify precedence and associativity for different rules to eliminate ambiguity in the language. For instance, a very simple calculator language might look something like this:
%left PLUS, MINUS   # low precedence, evaluated left-to-right
%left TIMES, DIV    # high precedence, left-to-right
expr ::= INT
       | expr PLUS expr
       | expr MINUS expr
       | expr TIMES expr
       | expr DIV expr
       | LEFT_PAREN expr RIGHT_PAREN
Usually, you can associate a bit of code with each rule to construct a new value (in this case an expression) from the other symbols in that rule. The parser generator will take in the grammar and produce code in your language that translates a token stream to a parse tree.
Most parser generators are language specific. ANTLR is well-known and supports C, C++, Objective C, Java, and Python. I've heard it's hard to use though. I've used bison for C/C++, CUP for Java, and ocamlyacc for OCaml, and they're all pretty good. If you are already using a lexer generator, you should look for a parser generator that is specifically compatible with it.
I believe a common approach is to use a Finite State Machine. For example, if you read an operand you go into a state where you next expect an operator, and you usually use the operator as the root node for the operands, and so on.
As described above by Marcos Marin, a state machine that uses your language rules in BNF to parse your token list will do the trick if you want to do it yourself. Only, as said in the above comment by Paul Hollingsworth, the easier way is to use a Pushdown-Automaton, which has a simple LIFO memory stack.
Every class of token has a next expected token in your grammar, which also is represented in your state-machine. The stack is used to "remember" what the previous token class was, to reduce the required states (could be done without stack, but you would need a new state for every class and subclass split in the grammar tree).
The accepting state(s) would be (in natural languages and most programming languages too) the starting state, and maybe some other state in particular cases.
Antlr would be my suggestion if you want to use a tool (waaay faster and less extensive). Good luck!

Resources