Why look ahead at most 1 input token? - parsing

I'm currently trying to implement LL parser but I have a question.
Need I to look ahead at most 1 input token for verify if the user's input
is syntactically correct or it's for another reason?

There are many different kinds of parsing algorithms. The one you're describing is called LL(1) and by definition it just uses one token of lookahead. However, there are other parsing algorithms that use more lookahead than this. For example, an LL(2) parser uses two tokens of lookahead, and an LL(*) parser has unbounded lookahead. There are grammars for which one token of lookahead isn't enough (that is, grammars that are LL(2) but not LL(1)). Here's an example:
S → n | n + S
Try working out why one token of lookahead isn't enough, but two tokens suffices.
The reason parsing algorithms try to keep the number of lookahead tokens low is for simplicity and efficiency. As the number of lookahead tokens increases, the size of the parsing table needed to drive the parser increases, as does the complexity of building those tables.

Related

Theory: LL(k) parser vs parser for LL(k) grammars

I'm concerned about the very important difference between the therms: "LL(k) parser" and "parser for LL(k) grammars". When a LL(1) backtracking parser is in question, it IS a parser for LL(k) grammars, because it can parse them, but its NOT LL(k) parser, because it does not use k tokens to look ahead from a single position in the grammar, but its exploring with backtracking the possible cases, regardless that it still uses k tokens to explore.
Am I right?
The question may break down to the way the look-ahead is performed. If the look-ahead is actually still processing the grammar with backtracking, that does not make it LL(k) parser. To be LL(k) parser the parser must not use the grammar with backtracking mechanism, because then it would be "LL(1) parser with backtracking that can parser LL(k) grammars".
Am I right again?
I think the difference is related to the expectation that LL(1) parser is using a constant time per token, and LL(k) parser is using at most k * constant (linear to the look-ahead) time per token, not an exponential time as it would be in the case of a backtracking parser.
Update 1: to simplify - per token, is the parsing LL(k) expected to run exponentially in respect to k or in a linear time in respect to k?
Update 2: I have changed it to LL(k) because the question is irrelevant to the range of which k is (integer or infinity).
An LL(k) parser needs to do the following at each point in the inner loop:
Collect the next k input symbols. Since this is done at each point in the input, this can be done in constant time by keeping the lookahead vector in a circular buffer.
If the top of the prediction stack is a terminal, then it is compared with the next input symbol; either both are discarded or an error is signalled. This is clearly constant time.
If the top of the prediction stack is a non-terminal, the action table is consulted, using the non-terminal, the current state and the current lookahead vector as keys. (Not all LL(k) parsers need to maintain a state; this is the most general formulation. But it doesn't make a difference to complexity.) This lookup can also be done in constant time, again by taking advantage of the incremental nature of the lookahead vector.
The prediction action is normally done by pushing the right-hand side of the selected production onto the stack. A naive implementation would take time proportional to the length of the right-hand side, which is not correlated with either the lookahead k nor the length of the input N, but rather is related to the size of the grammar itself. It's possible to avoid the variability of this work by simply pushing a reference to the right-hand side, which can be used as though it were the list of symbols (since the list can't change during the parse).
However, that's not the full story. Executing a prediction action does not consume an input, and it's possible -- even likely -- that multiple predictions will be made for a single input symbol. Again, the maximum number of predictions is only related to the grammar itself, not to k nor to N.
More specifically, since the same non-terminal cannot be predicted twice in the same place without violating the LL property, the total number of predictions cannot exceed the number of non-terminals in the grammar. Therefore, even if you do push the entire right-hand side onto the stack, the total number of symbols pushed between consecutive shift actions cannot exceed the size of the grammar. (Each right-hand side can be pushed at most once. In fact, only one right-hand side for a given non-terminal can be pushed, but it's possible that almost every non-terminal has only one right-hand side, so that doesn't reduce the asymptote.) If instead only a reference is pushed onto the stack, the number of objects pushed between consecutive shift actions -- that is, the number of predict actions between two consecutive shift actions -- cannot exceed the size of the non-terminal alphabet. (But, again, it's possible that |V| is O(|G|).
The linearity of LL(k) parsing was established, I believe, in Lewis and Stearns (1968), but I don't have that paper at hand right now so I'll refer you to the proof in Sippu & Soisalon-Soininen's Parsing Theory (1988), where it is proved in Chapter 5 for Strong LL(K) (as defined by Rosenkrantz & Stearns 1970), and in Chapter 8 for Canonical LL(K).
In short, the time the LL(k) algorithm spends between shifting two successive input symbols is expected to be O(|G|), which is independent of both k and N (and, of course, constant for a given grammar).
This does not really have any relation to LL(*) parsers, since an LL(*) parser does not just try successive LL(k) parses (which would not be possible, anyway). For the LL(*) algorithm presented by Terence Parr (which is the only reference I know of which defines what LL(*) means), there is no bound to the amount of time which could be taken between successive shift actions. The parser might expand the lookahead to the entire remaining input (which would, therefore, make the time complexity dependent on the total size of the input), or it might fail over to a backtracking algorithm, in which case it is more complicated to define what is meant by "processing an input symbol".
I suggest you to read the chapter 5.1 of Aho Ullman Volume 1.
https://dl.acm.org/doi/book/10.5555/578789
A LL(k) parser is a k-predictive algorithm (k is the lookahead integer >= 1).
A LL(k) parser can parse any LL(k) grammar. (chapter 5.1.2)
for all a, b you have a < b => LL(b) grammar is also a LL(a) grammar. But the reverse is not true.
A LL(k) parser is PREDICTIVE. So there is NO backtracking.
All LL(k) parsers are O(n) n is the length of the parsed sentence.
It is important to understand that a LL(3) parser do not parse faster than a LL(1). But the LL(3) parser can parse MORE grammars than the LL(1). (see the point #2 and #3)

How does LR parsing select a qualifying grammar production (to construct the parse tree from the leaves)?

I am reading a tutorial of the LR parsing. The tutorial uses an example grammar here:
S -> aABe
A -> Abc | b
B -> d
Then, to illustrate how the parsing algorithm works, the tutorial shows the process of parsing the word "abbcde" below.
I understand at each step of the algorithm, a qualifying production (namely a gramma rule, illustrated in column 2 in the table) is searched to match a segment of the string. But how does the LR parsing chooses among a set of qualifying productions (illustrate in column 3 in the table)?
An LR parse of a string traces out a rightmost derivation in reverse. In that sense, the ordering of the reductions applied is what you would get if you derived the string by always expanding out the rightmost nonterminal, then running that process backwards. (Try this out on your example - isn’t that neat?)
The specific mechanism by which LR parsers actually do this involves the use of a parsing automaton that tracks where within the grammar productions the parse happens to be, along with some lookahead information. There are several different flavors of LR parser (LR(0), SLR(1), LALR(1), LR(1), etc.), which differ on how the automaton is structured and how they use lookahead information. You may find it helpful to search for a tutorial on how these automata work, as that’s the heart of how LR parsers actually work.

Converting grammars to the most efficient parser

Is there an online algorithm which converts certain grammar to the most efficient parser possible?
For example: SLR/LR(k) such as k>=0
For the class of grammars you are discussing (xLR(k)), they are all linear time anyway, and it is impossible to do sublinear time if you have to examine every character.
If you insist on optimizing parse time, you should get a very fast LR parsing engine. LRStar used to be the cat's meow on this topic, but the guy behind it got zero reward from the world of "I want it for free" and pulled all instances of it off the net. You can settle for Bison.
Frankly most of your parsing time will be determined by how fast your parser can process individual characters, e.g., the lexer. Tune that first and you may discover there's no need to tune the parser.
First let's distinguish LR(k) grammars and LR(k) languages. A grammar may not be LR(1), but let's say, for example, LR(2). But the language it generates must have an LR(1) grammar -- and for that matter, it must have an LALR(1) grammar. The table size for such a grammar is essentially the same as for SLR(1) and is more powerful (all SLR(1) grammars are LALR(1) but not vice versa). So, there is really no reason not to use an LALR(1) parser generator if you are looking to do LR parsing.
Since parsing represents only a fraction of the compilation time in modern compilers when lexical analysis and code generation that contains peephole and global optimizations are taken into consideration, I would just pick a tool considering its entire set of features. You should also remember that one parser generator may take a little longer than another to analyze a grammar and to generate the parsing tables. But once that job is done, the table-driven parsing algorithm that will run in the actual compiler thousands of times should not vary significantly from one parser generator to another.
As far as tools for converting arbitrary grammars to LALR(1), for example (in theory this can be done), you could do a Google search (I have not done this). But since semantics are tied to the productions, I would want to have complete control over the grammar being used for parsing and would probably eschew such conversion tools.

Parsing cases that need much lookahead

Most parsing can be done by looking only at the next symbol (character for lexical analysis, token for parsing proper), and most of the remaining cases can be handled by looking at only one symbol after that.
Are there any practical cases - for programming languages or data formats in actual use - that need several or indefinitely many symbols of lookahead (or equivalently backtracking)?
As I recall, Fortran is one language in which you need a big lookahead buffer. Parsing Fortran requires (in theory) unbounded lookahead, although most implementations limit the length of a statement line, which puts a limit on the size of the lookahead buffer.
Also, see the selected answer for Why can't C++ be parsed with a LR(1) parser?. In particular, the quote:
"C++ grammar is ambiguous, context-dependent and potentially requires infinite lookahead to resolve some ambiguities".
Knuth proved that any LR(k) grammar can be mechanically transformed to a LR(1) one. Also, AFAIK, any LR(k) grammar can be parsed in time proportional to the length of the parsed string. As it includes LR(1), I don't see what use there would be for implementing LR(k) parsing with k > 1.
I never studied the LR(k)->LR(1) transformation, so it may be possible that it is not that practical for some cases, though.

Concurrent parsing algorithms

Are there existing parser algorithms (similar to LALR, SLR and LL) that can parse a single input, not just multiple inputs, in parallel?
Edit: Sorry, I wasn't really looking for research papers, more like, "There are compiler-compilers that generate concurrent parsers" or "This compiler for this language parses it in parallel"- real world examples.
There aren't any well known ones :-}
Much of the reason is the problem is described as parsing a string, presenting to the parser one token a time. That makes the problem sequential by definition, ugh.
One could imagine presenting the array of tokens to some parser all at once, and then have the parser parse substrings at various points across the array, stitching compatible trees for substrings together. The stitching process is likely to be complicated, but might be manageable if driven by an L(AL)R [better, a GLR] parser that swallowed nonterminals left-to-right after most of the parse trees for substrings were produced; think of this an an "accumulator".
[Shades, a quick Google search produces a 1990 Japanese paper on doing parallel GLR with what amounts to parallel Prolog]
You now have the problem of producing the array of tokens magically in parallel. Now you need a parallel lexer :-}
EDIT June 2013: I finally remembered McKeeman's 1982 paper on parallel LR parsing.
I'm working on a deterministic context free parsing algorithm with O(N) work complexity, O(log N) time complexity, fine-grained parallelism that is proportional to the length of the input string, and equivalent behavior to a LR parser. I'll be submitting it for peer review shortly.
The main idea is to process each character in the input stream independently, assume that it could match any rule, then piece together neighboring groups of characters until they unambiguously match a single rule. Once a rule is matched, it is filtered out by the algorithm. After all rules are matched, tokens are gathered together into a sequence.
There is some complexity involved in handling tokens with wildcards that could partially nest, and a post-processing step is needed to handle these and maintain the worst-case O(log N) complexity. This step probably is not needed in practice.

Resources