Bottom-Up-Parser: When to apply which reduction rule?

Let's take the following context-free grammar:
G = ( {Sum, Product, Number}, {decimal representations of numbers, +, *}, P, Sum)
Being P:
Sum → Sum + Product
Sum → Product
Product → Product * Number
Product → Number
Number → decimal representation of a number
I am trying to parse expressions produced by this grammar with a bottom-up parser and a look-ahead buffer (LAB) of length 1 (which supposedly should do without guessing and backtracking).
Now, given a stack and a LAB, there are often several possibilities of how to reduce the stack or whether to reduce it at all or push another token.
Currently I use this decision tree:
If any top n tokens of the stack plus the LAB are the beginning of the
right side of a rule, I push the next token onto the stack.
Otherwise, I reduce the maximum number of tokens on top of the stack.
I.e. if it is possible to reduce the topmost item and at the same time
it is possible to reduce the three topmost items, I do the latter.
If no such reduction is possible, I push another token onto the stack.
Rinse and repeat.
This seems (!) to work, but it requires an awful amount of rule searching, finding matching prefixes, etc. No way this can run in O(NM).
What is the standard (and possibly only sensible) approach to decide whether to reduce or push (shift), and in case of reduction, which reduction to apply?
Thank you in advance for your comments and answers.

The easiest bottom-up parsing approach for grammars like yours (basically, expression grammars) is operator-precedence parsing.
Recall that bottom-up parsing involves building the parse tree left-to-right from the bottom. In other words, at any given time during the parse, we have a partially assembled tree with only terminal symbols to the right of where we're reading, and a combination of terminals and non-terminals to the left (the "prefix"). The only possible reduction is one which applies to a suffix of the prefix; if no reduction applies, we must be able to shift a terminal from the input to the prefix.
An operator grammar has the feature that there are never two consecutive non-terminals in any production. Consequently, in a bottom-up parse of an operator grammar, either the last symbol in the prefix is a terminal or the second-last symbol is one. (Both of them could be.)
An operator precedence parser is essentially blind to non-terminals; it simply doesn't distinguish between them. So you cannot have two productions whose right-hand sides contain exactly the same sequence of terminals, because the op-prec parser wouldn't know which of these two productions to apply. (That's the traditional view. It's actually possible to extend that a bit so that you can have two productions with the same terminals, provided that the non-terminals are in different places. That allows grammars which have unary - operators, for example, since the right hand sides <non-terminal> - <non-terminal> and - <non-terminal> can be distinguished without knowing the names of the non-terminals; only their presence.)
The other requirement is that you have to be able to build a precedence relationship between the terminals. More precisely, we define three precedence relations, usually written <·, ·> and ·=· (or some typographic variation on the theme), and insist that for any two terminals x and y, at most one of the relations x ·> y, x ·=· y and x <· y is true.
Roughly speaking, the < and > in the relations correspond to the edges of a production. In other words, if x <· y, that means that x can be followed by a non-terminal with a production whose first terminal is y. Similarly, x ·> y means that y can follow a non-terminal with a production whose last terminal is x. And x ·=· y means that there is some right-hand side where x and y are consecutive terminals, in that order (possibly with an intervening non-terminal).
If the single-relation restriction is true, then we can parse as follows:
Let x be the last terminal in the prefix (that is, either the last or second-last symbol), and let y be the lookahead symbol, which must be a terminal. If x ·> y then we reduce, and repeat the rule. Otherwise, we shift y onto the prefix.
In order to reduce, we need to find the start of the production. We move backwards over the prefix, comparing consecutive terminals (all of which must have <· or ·=· relations) until we find one with a <· relation. Then the terminals between the <· and the ·> are the right-hand side of the production we're looking for, and we can slot the non-terminals into the right-hand side as indicated.
There is no guarantee that there will be an appropriate production; if there isn't, the parse fails. But if the input is a valid sentence, and if the grammar is an operator-precedence grammar, then we will be able to find the right production to reduce.
Note that it is usually really easy to find the production, because most productions have only one (<non-terminal> * <non-terminal>) or two (( <non-terminal> )) terminals. A naive implementation might just run the terminals together into a string and use that as the key in a hash-table.
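To make that concrete, here is a rough sketch of such a parse loop for the grammar in the question. The precedence table, the token name num (standing for any decimal number) and the placeholder non-terminal E are choices made for this sketch; an operator-precedence parser does not track non-terminal names.
LT, GT = '<', '>'                      # the <· and ·> relations; this grammar needs no ·=·

PREC = {                               # precedence relations between terminals
    ('$', 'num'): LT, ('$', '+'): LT, ('$', '*'): LT,
    ('+', 'num'): LT, ('+', '*'): LT, ('+', '+'): GT, ('+', '$'): GT,
    ('*', 'num'): LT, ('*', '*'): GT, ('*', '+'): GT, ('*', '$'): GT,
    ('num', '+'): GT, ('num', '*'): GT, ('num', '$'): GT,
}

HANDLES = {('num',), ('E', '+', 'E'), ('E', '*', 'E')}    # terminal patterns of the RHSs

def top_terminal(stack):
    return next(s for s in reversed(stack) if s != 'E')

def parse(tokens):
    stack, tokens, i = ['$'], tokens + ['$'], 0
    while True:
        x, y = top_terminal(stack), tokens[i]
        if x == '$' and y == '$':
            return stack == ['$', 'E']                    # accept?
        rel = PREC.get((x, y))
        if rel == LT:
            stack.append(y); i += 1                       # shift
        elif rel == GT:                                   # reduce
            handle = []
            while True:                                   # pop back to the <·
                handle.append(stack.pop())
                if handle[-1] != 'E' and \
                   PREC.get((top_terminal(stack), handle[-1])) == LT:
                    break
            if stack[-1] == 'E':                          # a handle may start with a non-terminal
                handle.append(stack.pop())
            if tuple(reversed(handle)) not in HANDLES:
                return False                              # no production matches: error
            stack.append('E')
        else:
            return False                                  # no relation between x and y: error

print(parse(['num', '+', 'num', '*', 'num']))             # True
print(parse(['num', 'num']))                              # False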
The classic implementation of operator-precedence parsing is the so-called "Shunting Yard Algorithm", devised by Edsger Dijkstra. In that algorithm, the precedence relations are modelled by providing two functions, left-precedence and right-precedence, which map terminals to integers such that x <· y is true only if right-precedence(x) < left-precedence(y) (and similarly for the other operators). It is not always possible to find such mappings, and the mappings are a cover of the actual precedence relations because it is often the case that there are pairs of terminals for which no precedence relationship applies. Nonetheless, it is often the case that these mappings can be found, and almost always the case for simple expression grammars.
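As a minimal sketch of that idea for the question's grammar (the precedence numbers are choices made for this sketch, and the output is postfix rather than a parse tree):
LEFT_PREC  = {'+': 2, '*': 4}   # left-precedence of each operator
RIGHT_PREC = {'+': 3, '*': 5}   # right-precedence; right > left makes the operator left-associative

def to_postfix(tokens):
    output, op_stack = [], []
    for tok in tokens:
        if tok in LEFT_PREC:
            # While the operator on the stack relates ·> to the incoming one,
            # it must be reduced first, i.e. popped to the output.
            while op_stack and RIGHT_PREC[op_stack[-1]] > LEFT_PREC[tok]:
                output.append(op_stack.pop())
            op_stack.append(tok)
        else:
            output.append(tok)          # a number goes straight to the output
    return output + op_stack[::-1]

print(to_postfix(['2', '+', '3', '*', '4']))   # ['2', '3', '4', '*', '+']
print(to_postfix(['2', '*', '3', '+', '4']))   # ['2', '3', '*', '4', '+']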
I hope that's enough to get you started. I encourage you to actually read some texts about bottom-up parsing, because I think I've already written far too much for an SO answer, and beyond the quick sketches above I haven't included any real code. :)

Related

All viable prefixes of a Context Free Grammar

I am stuck on a problem from the famous Dragon Book of compiler design. How do I find all the viable prefixes of the following grammar:
S -> 0S1 | 01
The grammar is actually the language of the regex 0ⁿ1ⁿ.
I presume the set of all viable prefixes might come as a regex too. I came up with the following solution:
0+
0+S
0+S1
0+1
S
(By plus, I mean the number of zeroes is 1..inf.)
after reducing string 000111 with the following steps:
stack   input
        000111
0       00111
00      0111
000     111
0001    11
00S     11
00S1    1
0S      1
0S1     $
S       $
Is my solution correct or I am missing something?
0ⁿ1ⁿ is not a regular language; regexen don't have variables like n and they cannot enforce an equal number of repetitions of two distinct subsequences. Nonetheless, for any context-free grammar, the set of viable prefixes is a regular language. (A proof of this fact, in some form, appears at the beginning of Part II of Donald Knuth's seminal 1965 paper, On the Translation of Languages from Left to Right, which demonstrated both a test for the LR(k) property and an algorithm for parsing LR(k) grammars in linear time.)
OK, to the actual question. A viable prefix for a grammar is (by definition) the prefix of a sentential form which can appear on the stack during a parse using that grammar. It's called "viable" (which means "still alive" or "could continue growing") precisely because it must be the prefix of some right sentential form whose suffix contains no non-terminal symbol. In other words, there exists a sequence of terminals which can be appended to the viable prefix in order to produce a right-sentential form; the viable prefix can grow.
Knuth shows how to create a DFA which recognizes all viable prefixes, but it's easier to see this DFA if we already have the LR(k) parser produced by an LR(k) algorithm. That parser is a finite-state machine whose alphabet is the set of terminal and non-terminal symbols of a grammar. To get the viable-prefix machine, we use exactly the same state machine, but we remove the stack (so that it becomes just a state machine) and the reduce actions, leaving only the shift and goto actions as transitions. All states in the viable-prefix machine are accepting states, since any prefix of a viable prefix is itself a viable prefix.
A key feature of this new automaton is that it cannot extend a prefix with a reduce action (since we removed all the reduce actions). A prefix with a reduce action is a prefix which ends in a handle -- recall that a handle is the right-hand side of some production -- so another definition of a viable prefix is that it is a prefix of a right-sentential form (that is, a possible step in a derivation) that does not extend beyond the right-most handle.
The grammar you are working with has only two productions, so there are only two handles, 01 and 0S1. Note that 10 and 1S cannot be subsequences of any right-sentential form, nor can a right-sentential form contain more than one S. Any right-sentential form must either be a sentence 0ⁿ1ⁿ (n ≥ 1) or a sentential form 0ⁿS1ⁿ (n ≥ 0). But every handle ends at the first 1 of a sentential form, and so a viable prefix must end at or before the first 1. This produces precisely the possibilities you list, which we can condense to the regular expression 0*0(S1?|1)?|S.
Chopping off the suffix removed the second n from the formula, so there is no longer a requirement of concordance and the language is regular.
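For this grammar the machine is tiny. A sketch of the viable-prefix automaton (the state numbering is arbitrary; it follows one possible LR(0) construction), in which every state accepts:
TRANSITIONS = {
    0: {'0': 2, 'S': 1},            # start state
    1: {},                          # after S
    2: {'0': 2, 'S': 3, '1': 4},    # after 0+
    3: {'1': 5},                    # after 0+ S
    4: {},                          # after 0+ 1   (ends at the handle 01)
    5: {},                          # after 0+ S 1 (ends at the handle 0S1)
}

def is_viable_prefix(symbols):
    state = 0
    for sym in symbols:
        if sym not in TRANSITIONS[state]:
            return False
        state = TRANSITIONS[state][sym]
    return True                     # every state is accepting

print(is_viable_prefix(list('00S1')))   # True
print(is_viable_prefix(list('0011')))   # False: it extends past the handle 01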
Note:
Questions like this and their answers are begging to be rendered using MathJax. StackOverflow, unfortunately, does not provide this extension, which is apparently considered unnecessary for programming. However, there is a site in the StackExchange constellation dedicated to computing science questions, http://cs.stackexchange.com, and another one dedicated to mathematical questions, http://math.stackexchange.com. Formal language theory is part of both computing science and mathematics. Both of those sites permit MathJax, and questions on those sites will not be closed because they are not programming questions. I suggest you take this information into account for questions like this one.

LALR parsers and look-ahead

I'm implementing the automatic construction of an LALR parse table for no reason at all. There are two flavors of this parser, LALR(0) and LALR(1), where the number signifies the amount of look-ahead.
I have gotten myself confused on what look-ahead means.
If my input stream is 'abc' and I have the following production, would I need 0 look-ahead, or 1?
P :== a E
Same question, but I can't choose the correct P production in advance by only looking at the 'a' in the input.
P :== a b E
| a b F
I have additional confusion in that I don't think the latter P-productions really happen when building an LALR parser generator. The reason is that the grammar is effectively left-factored automatically as we compute the closures.
I was working through this page and was ok until I got to the first/follow section. My issue here is that I don't know why we are calculating these things, so I am having trouble abstracting this in my head.
I almost get the idea that the look-ahead is not related to shifting input, but instead in deciding when to reduce.
I've been reading the Dragon book, but it is about as linear as a Tarantino script. It seems like a great reference for people who already know how to do this.
The first thing you need to do when learning about bottom-up parsing (such as LALR) is to remember that it is completely different from top-down parsing. Top-down parsing starts with a nonterminal, the left-hand-side (LHS) of a production, and guesses which right-hand-side (RHS) to use. Bottom-up parsing, on the other hand, starts by identifying the RHS and then figures out which LHS to select.
To be more specific, a bottom-up parser accumulates incoming tokens into a queue until a right-hand side is at the right-hand end of the queue. Then it reduces that RHS by replacing it with the corresponding LHS, and checks to see whether an appropriate RHS is at the right-hand edge of the modified accumulated input. It keeps on doing that until it decides that no more reductions will take place at that point in the input, and then reads a new token (or, in other words, takes the next input token and shifts it onto the end of the queue.)
This continues until the last token is read and all possible reductions are performed, at which point if what remains is the single non-terminal which is the "start symbol", it accepts the parse.
It is not obligatory for the parser to reduce a RHS just because it appears at the end of the current queue, but it cannot reduce a RHS which is not at the end of the queue. That means that it has to decide whether to reduce or not before it shifts any other token. Since the decision is not always obvious, it may examine one or more tokens which it has not yet read ("lookahead tokens", because it is looking ahead into the input) in order to decide. But it can only look at the next k tokens for some value of k, typically 1.
Here's a very simple example; a comma separated list:
1. Start -> List
2. List -> ELEMENT
3. List -> List ',' ELEMENT
Let's suppose the input is:
ELEMENT , ELEMENT , ELEMENT
At the beginning, the input queue is empty, and since no RHS is empty the only alternative is to shift:
queue             remaining input              action
----------------  ---------------------------  --------
                  ELEMENT , ELEMENT , ELEMENT  SHIFT
At the next step, the parser decides to reduce using production 2:
ELEMENT           , ELEMENT , ELEMENT          REDUCE 2
Now there is a List at the end of the queue, so the parser could reduce using production 1, but it decides not to based on the fact that it sees a ',' in the incoming input. This goes on for a while:
List              , ELEMENT , ELEMENT          SHIFT
List ,            ELEMENT , ELEMENT            SHIFT
List , ELEMENT    , ELEMENT                    REDUCE 3
List              , ELEMENT                    SHIFT
List ,            ELEMENT                      SHIFT
List , ELEMENT    --                           REDUCE 3
Now the lookahead token is the "end of input" pseudo-token. This time, it does decide to reduce:
List              --                           REDUCE 1
Start             --                           ACCEPT
and the parse is successful.
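A rough sketch of the loop just traced, with the reduce decisions simply hard-wired (prefer the longest matching right-hand side, and reduce Start -> List only at end of input) rather than derived from a parser table, might look like this:
RULES = [('List',  ['List', ',', 'ELEMENT']),   # production 3 (checked first)
         ('List',  ['ELEMENT']),                # production 2
         ('Start', ['List'])]                   # production 1

def parse(tokens):
    queue, i = [], 0
    tokens = tokens + ['$']                     # '$' is the end-of-input pseudo-token
    while True:
        lookahead = tokens[i]
        for lhs, rhs in RULES:
            if queue[-len(rhs):] == rhs:
                if lhs == 'Start' and lookahead != '$':
                    continue                    # ',' is not in FOLLOW(Start): don't reduce
                queue[-len(rhs):] = [lhs]       # REDUCE: replace the RHS by the LHS
                break
        else:                                   # no reduction applied
            if lookahead == '$':
                return queue == ['Start']       # ACCEPT if only the start symbol remains
            queue.append(lookahead)             # SHIFT
            i += 1

print(parse(['ELEMENT', ',', 'ELEMENT', ',', 'ELEMENT']))   # True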
That still leaves a few questions. To start with, how do we use the FIRST and FOLLOW sets?
As a simple answer, the FOLLOW set of a non-terminal cannot be computed without knowing the FIRST sets for the non-terminals which might follow that non-terminal. And one way we can decide whether or not a reduction should be performed is to see whether the lookahead is in the FOLLOW set for the target non-terminal of the reduction; if not, the reduction certainly should not be performed. That algorithm is sufficient for the simple grammar above, for example: the reduction of Start -> List is not possible with a lookahead of ',', because ',' is not in FOLLOW(Start). Grammars whose only conflicts can be resolved in this way are SLR grammars (where S stands for "Simple", which it certainly is).
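For the list grammar those sets are easy to compute by hand (FOLLOW(Start) = {$} and FOLLOW(List) = {',', $}), but as a small sketch, assuming no non-terminal is nullable (which is true here), the computation might look like this:
GRAMMAR = {'Start': [['List']],
           'List':  [['ELEMENT'], ['List', ',', 'ELEMENT']]}

def first_of(symbol):
    # FIRST set; simplified by assuming nothing is nullable.
    if symbol not in GRAMMAR:                      # terminals are their own FIRST
        return {symbol}
    result = set()
    for rhs in GRAMMAR[symbol]:
        if rhs[0] != symbol:                       # skip direct left recursion
            result |= first_of(rhs[0])
    return result

def follow_sets(start='Start'):
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add('$')                         # end of input follows the start symbol
    changed = True
    while changed:                                 # iterate to a fixed point
        changed = False
        for lhs, alternatives in GRAMMAR.items():
            for rhs in alternatives:
                for i, sym in enumerate(rhs):
                    if sym not in GRAMMAR:         # only non-terminals have FOLLOW sets
                        continue
                    new = first_of(rhs[i + 1]) if i + 1 < len(rhs) else follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

FOLLOW = follow_sets()
# The SLR-style check described above: with lookahead ',' the reduction
# Start -> List is ruled out, because ',' is not in FOLLOW(Start).
print(',' in FOLLOW['Start'])    # False
print(',' in FOLLOW['List'])     # True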
For most grammars, that is not sufficient, and more analysis has to be performed. It is possible that a symbol might be in the FOLLOW set of a non-terminal, but not in the context which led to the current stack configuration. In order to determine that, we need to know more about how we got to the current configuration; the various possible analyses lead to LALR, IELR and canonical LR parsing, amongst other possibilities.

Issue with left recursion in top down parsing

I have read this to better understand the difference between top-down and bottom-up parsing. Can anyone explain the problems associated with left recursion in a top-down parser?
In a top-down parser, the parser begins with the start symbol and tries to guess which productions to apply to reach the input string. To do so, top-down parsers need to use contextual clues from the input string to guide their guesswork.
Most top-down parsers are directional parsers, which scan the input in some direction (typically, left to right) when trying to determine which productions to guess. The LL(k) family of parsers is one example of this - these parsers use information about the next k symbols of input to determine which productions to use.
Typically, the parser uses the next few tokens of input to guess productions by looking at which productions can ultimately lead to strings that start with the upcoming tokens. For example, if you had the production
A → bC
you wouldn't choose to use this production unless the next character to match was b. Otherwise, you'd be guaranteed there was a mismatch. Similarly, if the next input character was b, you might want to choose this production.
So where does left recursion come in? Well, suppose that you have these two productions:
A → Ab | b
This grammar generates all strings of one or more copies of the character b. If you see a b in the input as your next character, which production should you pick? If you choose Ab, then you're assuming there are multiple b's ahead of you even though you're not sure this is the case. If you choose b, you're assuming there's only one b ahead of you, which might be wrong. In other words, if you have to pick one of the two productions, you can't always choose correctly.
The issue with left recursion is that if you have a nonterminal that's left-recursive and find a string that might match it, you can't necessarily know whether to use the recursion to generate a longer string or avoid the recursion and generate a shorter string. Most top-down parsers will either fail to work for this reason (they'll report that there's some uncertainty about how to proceed and refuse to parse), or they'll potentially use extra memory to track each possible branch, running out of space.
In short, top-down parsers usually try to guess what to do from limited information about the string. Because of this, they get confused by left recursion because they can't always accurately predict which productions to use.
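To see one concrete way this bites, here is a sketch of a naive recursive-descent parser for A → Ab | b, together with the usual cure of rewriting the rule right-recursively (the token handling is simplified for illustration):
def parse_A_naive(tokens, pos):
    # Try the alternative A -> A b first.  It recurses before consuming any
    # input, so on every input it recurses until Python raises RecursionError.
    pos_after_A = parse_A_naive(tokens, pos)
    if pos_after_A is not None and pos_after_A < len(tokens) and tokens[pos_after_A] == 'b':
        return pos_after_A + 1
    # Fall back to A -> b.
    if pos < len(tokens) and tokens[pos] == 'b':
        return pos + 1
    return None

# The usual cure: rewrite the rule right-recursively (or iteratively):
#   A  -> b A'
#   A' -> b A' | <empty>
def parse_A_fixed(tokens, pos):
    if pos >= len(tokens) or tokens[pos] != 'b':
        return None                      # A must start with at least one b
    pos += 1
    while pos < len(tokens) and tokens[pos] == 'b':
        pos += 1                         # each iteration is one A' -> b A' step
    return pos

print(parse_A_fixed(['b', 'b', 'b'], 0))   # 3: consumed all three b's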
Hope this helps!
Reasons
1) A grammar that is left-recursive (directly or indirectly) is not in Greibach normal form (GNF)*, so the left recursion must be eliminated by converting the grammar to a right-recursive form.
2) Left-recursive grammars are also not LL(1), so eliminating the left recursion may yield an LL(1) grammar.
* GNF: a grammar is in Greibach normal form if every production has the form A -> aV, where a is a terminal and V is a (possibly empty) string of non-terminals.

Negative lookahead in LR parsing algorithm

Consider the following kind of rule in a grammar for an LR-family parser generator (e.g. YACC, BISON, etc.):
Nonterminal : [ lookahead not in {Terminal1, ..., TerminalN} ] Rule ;
It's an ordinary rule, except that it has a restriction: a phrase produced with this rule cannot begin with Terminal1, ..., TerminalN. (Surely, this rule can be replaced with a set of usual rules, but it will result in a bigger grammar). This can be useful for resolving conflicts.
The question is, is there a modification of LR table construction algorithm that accepts such restrictions? It seems to me that such a modification is possible (like precedence relations).
Surely, it can be checked at runtime, but I mean a compile-time check (a check performed while building the parsing table, like the %prec, %left, %right and %nonassoc directives in yacc-compatible generators).
I don't see why this shouldn't be possible, but I also don't see any obvious reason why it would be useful. Do you have an example in mind?
The easiest way to do this would be to do the grammar transform you mention in parentheses. This would make a larger grammar, but it won't artificially increase the number of LR states.
The basic transformation, with only a bit of hand-waving:
For any production with terminal restrictions:
If the production starts with a non-nullable non-terminal, replace the non-terminal with a terminal-restricted version.
If the production starts with a terminal in the terminal restriction list, remove the production
If the production starts with a terminal not in the terminal restriction list, no change is necessary.
If a production starts with a nullable non-terminal, you have to create two versions of the nullable non-terminal, one of which is always null, and the other of which is non-nullable; and then create two versions of the production, one starting with each of the new non-terminals. Then apply the above transforms, but interpreting "starts with" to mean "starts with after any always-null non-terminals."
You don't actually need to modify the grammar, since the above transformations can be done on the fly during the construction of the underlying SLR machine, at least for LR(0) and LALR(1) constructions.
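As a hypothetical illustration of the transformation (all of the names here are invented for the example), suppose the restricted rule is
Stmt : [ lookahead not in { '-' } ] Expr ';' ;
over the ordinary expression rules
Expr : Term | Expr '+' Term ;
Term : NUM | '-' Term | '(' Expr ')' ;
Since Expr and Term are not nullable, only the first three cases are needed, and the transformation introduces restricted copies of the non-terminals that can start the restricted phrase:
Stmt         : Expr_nominus ';' ;
Expr_nominus : Term_nominus | Expr_nominus '+' Term ;
Term_nominus : NUM | '(' Expr ')' ;   /* the '-' Term alternative is removed */
The unrestricted Expr and Term stay in the grammar for the positions (after '+' and inside parentheses) where a leading '-' is still allowed.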

Is there such thing as a left-associative prefix operator or right-associative postfix operator?

This page says "Prefix operators are usually right-associative, and postfix operators left-associative" (emphasis mine).
Are there real examples of left-associative prefix operators, or right-associative postfix operators? If not, what would a hypothetical one look like, and how would it be parsed?
It's not particularly easy to make the concepts of "left-associative" and "right-associative" precise, since they don't directly correspond to any clear grammatical feature. Still, I'll try.
Despite the lack of math layout, I tried to insert an explanation of precedence relations here, and it's the best I can do, so I won't repeat it. The basic idea is that given an operator grammar (i.e., a grammar in which no production has two non-terminals without an intervening terminal), it is possible to define precedence relations ⋖, ≐, and ⋗ between grammar symbols, and then this relation can be extended to terminals.
Put simply, if a and b are two terminals, a ⋖ b holds if there is some production in which a is followed by a non-terminal which has a derivation (possibly not immediate) in which the first terminal is b. a ⋗ b holds if there is some production in which b follows a non-terminal which has a derivation in which the last terminal is a. And a ≐ b holds if there is some production in which a and b are either consecutive or are separated by a single non-terminal. The use of symbols which look like arithmetic comparisons is unfortunate, because none of the usual arithmetic laws apply. It is not necessary (in fact, it is rare) for a ≐ a to be true; a ≐ b does not imply b ≐ a and it may be the case that both (or neither) of a ⋖ b and a ⋗ b are true.
An operator grammar is an operator precedence grammar iff given any two terminals a and b, at most one of a ⋖ b, a ≐ b and a ⋗ b hold.
If a grammar is an operator-precedence grammar, it may be possible to find an assignment of integers to terminals which make the precedence relationships more or less correspond to integer comparisons. Precise correspondence is rarely possible, because of the rarity of a ≐ a. However, it is often possible to find two functions, f(t) and g(t) such that a ⋖ b is true if f(a) < g(b) and a ⋗ b is true if f(a) > g(b). (We don't worry about only if, because it may be the case that no relation holds between a and b, and often a ≐ b is handled with a different mechanism: indeed, it means something radically different.)
%left and %right (the yacc/bison/lemon/... declarations) construct functions f and g. The way they do it is pretty simple. If OP (an operator) is "left-associative", that means that expr1 OP expr2 OP expr3 must be parsed as <expr1 OP expr2> OP expr3, in which case OP ⋗ OP (which you can see from the derivation). Similarly, if ROP were "right-associative", then expr1 ROP expr2 ROP expr3 must be parsed as expr1 ROP <expr2 ROP expr3>, in which case ROP ⋖ ROP.
Since f and g are separate functions, this is fine: a left-associative operator will have f(OP) > g(OP) while a right-associative operator will have f(ROP) < g(ROP). This can easily be implemented by using two consecutive integers for each precedence level and assigning them to f and g in turn if the operator is right-associative, and to g and f in turn if it's left-associative. (This procedure will guarantee that f(T) is never equal to g(T). In the usual expression grammar, the only ≐ relationships are between open and close bracket-type-symbols, and these are not usually ambiguous, so in a yacc-derivative grammar it's not necessary to assign them precedence values at all. In a Floyd parser, they would be marked as ≐.)
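A small sketch of that construction (the declaration format used here is invented for the sketch):
def build_precedence_functions(declarations):
    """declarations: (associativity, [operators]) pairs, lowest precedence
    first, e.g. [('left', ['+']), ('left', ['*'])]."""
    f, g = {}, {}
    level = 0
    for assoc, operators in declarations:
        lo, hi = level, level + 1          # two consecutive integers per level
        for op in operators:
            if assoc == 'left':            # f(op) > g(op): OP ⋗ OP, so reduce
                g[op], f[op] = lo, hi
            else:                          # 'right': f(op) < g(op), so shift
                f[op], g[op] = lo, hi
        level += 2
    return f, g

f, g = build_precedence_functions([('left', ['+']), ('left', ['*'])])
assert f['+'] > g['+']     # '+' ⋗ '+' : left-associative
assert f['+'] < g['*']     # '+' ⋖ '*' : '*' binds tighter
assert f['*'] > g['+']     # '*' ⋗ '+'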
Now, what about prefix and postfix operators? Prefix operators are always found in a production of the form [1]:
non-terminal-1: PREFIX non-terminal-2;
There is no non-terminal preceding PREFIX so it is not possible for anything to be ⋗ PREFIX (because the definition of a ⋗ b requires that there be a non-terminal preceding b). So if PREFIX is associative at all, it must be right-associative. Similarly, postfix operators correspond to:
non-terminal-3: non-terminal-4 POSTFIX;
and thus POSTFIX, if it is associative at all, must be left-associative.
Operators may be either semantically or syntactically non-associative (in the sense that applying the operator to the result of an application of the same operator is undefined or ill-formed). For example, in C++, ++ ++ a is semantically incorrect (unless operator++() has been redefined for a in some way), but it is accepted by the grammar (in case operator++() has been redefined). On the other hand, new new T is not syntactically correct. So new is syntactically non-associative.
[1] In Floyd grammars, all non-terminals are coalesced into a single non-terminal type, usually expression. However, the definition of precedence-relations doesn't require this, so I've used different place-holders for the different non-terminal types.
There could be in principle. Consider for example the prefix unary plus and minus operators: suppose + is the identity operation and - negates a numeric value.
They are "usually" right-associative, meaning that +-1 is equivalent to +(-1), the result is minus one.
Suppose they were left-associative, then the expression +-1 would be equivalent to (+-)1.
The language would therefore have to give a meaning to the sub-expression +-. Languages "usually" don't need this to have a meaning and don't give it one, but you can probably imagine a functional language in which the result of applying the identity operator to the negation operator is an operator/function that has exactly the same effect as the negation operator. Then the result of the full expression would again be -1 for this example.
Indeed, if the result of juxtaposing functions/operators is defined to be a function/operator with the same effect as applying both in right-to-left order, then it always makes no difference to the result of the expression which way you associate them. Those are just two different ways of defining that (f g)(x) == f(g(x)). If your language defines +- to mean something other than -, though, then the direction of associativity would matter (and I suspect the language would be very difficult to read for someone used to the "usual" languages...)
On the other hand, if the language doesn't allow juxtaposing operators/functions then prefix operators must be right-associative to allow the expression +-1. Disallowing juxtaposition is another way of saying that (+-) has no meaning.
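A tiny sketch of that composition point: if juxtaposition is defined as composition, both readings of +-1 give the same value.
neg = lambda x: -x          # prefix -
ident = lambda x: +x        # prefix +
compose = lambda f, g: (lambda x: f(g(x)))   # juxtaposition defined as composition

print(ident(neg(1)))            # right-associative reading +(-1): prints -1
print(compose(ident, neg)(1))   # left-associative reading (+-)(1): also -1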
I'm not aware of such a thing in a real language (e.g., one that's been used by at least a dozen people). I suspect the "usually" was merely because proving a negative is next to impossible, so it's easier to avoid arguments over trivia by not making an absolute statement.
As to how you'd theoretically do such a thing, there seem to be two possibilities. Given two prefix operators @ and # that you were going to treat as left associative, you could parse @#a as equivalent to #(@(a)), applying the left-most operator first. At least to me, this seems like a truly dreadful idea--theoretically possible, but a language nobody should wish on even their worst enemy.
The other possibility is that @#a would be parsed as (@#)a. In this case, we'd basically compose @ and # into a single operator, which would then be applied to a.
In most typical languages, this probably wouldn't be terribly interesting (would have essentially the same meaning as if they were right associative). On the other hand, I can imagine a language oriented to multi-threaded programming that decreed that application of a single operator is always atomic--and when you compose two operators into a single one with the left-associative parse, the resulting fused operator is still a single, atomic operation, whereas just applying them successively wouldn't (necessarily) be.
Honestly, even that's kind of a stretch, but I can at least imagine it as a possibility.
I hate to shoot down a question that I myself asked, but having looked at the two other answers, would it be wrong to suggest that I've inadvertently asked a subjective question, and that in fact the interpretation of left-associative prefixes and right-associative postfixes is simply undefined?
Remembering that even notation as pervasive as expressions is built upon a handful of conventions, if there's an edge case that the conventions never took into account, then maybe, until some standards committee decides on a definition, it's better to simply pretend it doesn't exist.
I do not remember any left-associative prefix operators or right-associative postfix ones, but I can imagine that both could easily exist. They are not common because the natural way people think about operators is that the one closer to the operand applies first.
An easy example from the C#/C++ languages:
~-3 is equal to 2, but
-~3 is equal to 4
This is because those prefix operators are right-associative: for ~-3, the - operator is applied first, and then ~ is applied to its result, so the whole expression evaluates to 2.
Hypothetically, if those operators were left-associative, then for ~-3 the left-most operator ~ would be applied first, and then - would be applied to its result, so the whole expression would evaluate to 4.
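A small sketch of the two application orders just described (using Python's ~ and - on integers):
import operator

OPS = {'~': operator.invert, '-': operator.neg}

def apply_prefix_chain(ops, operand, leftmost_first):
    # ops are listed left to right, as they appear in the source text.
    value = operand
    for op in (ops if leftmost_first else reversed(ops)):
        value = OPS[op](value)
    return value

print(apply_prefix_chain(['~', '-'], 3, leftmost_first=False))  # 2, the usual ~-3
print(apply_prefix_chain(['~', '-'], 3, leftmost_first=True))   # 4, the hypothetical left-associative reading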
[EDIT] Answering Steve Jessop:
Steve said that the meaning of "left-associativity" is that +-1 is equivalent to (+-)1.
I do not agree with this, and think it is totally wrong. To better understand left-associativity, consider the following example.
Suppose I have a hypothetical programming language with left-associative prefix operators:
@ - multiplies its operand by 3
# - adds 7 to its operand
Then the construction @#5 in my language would be equal to (5*3)+7 == 22.
If my language were right-associative (as most usual languages are), then I would have (5+7)*3 == 36.
Please let me know if you have any questions.
Hypothetical example. A language has a prefix operator # and a postfix operator @ with the same precedence. An expression #x@ would be equal to (#x)@ if both operators are left-associative and to #(x@) if both operators are right-associative.

Resources