How to identify whether a grammar is LR(n), LL(n) - parsing

For a grammar that is not LL(1) or LR(1), how can one try to find out whether some number n exists such that the grammar is LL(n) or LR(n)?
You check if a grammar is LR(0) by looking at the canonical collection of LR(0) items. Then, assuming it wasn't LR(0), you can check if it is LR(1) by introducing the lookahead symbol. My simple reasoning tells me that, to check whether it is LR(2) or not, you probably have to make the lookahead contain the next two symbols instead of just one. For LR(3) you have to take three symbols into consideration etc.
Even if this is the case (though I doubt it), I am struggling to see how one can identify (or even get a hint at) an n, or the non-existence thereof, for which a specific grammar is LR(n) and/or LL(n), without checking incrementally from an arbitrary LR(m) upwards.

If a language is LR(k) for some k>1, then it is LR(1). (That's not true for a grammar, of course.) That is, if you have an LR(k) grammar for a language, then you can mechanically construct an LR(1) grammar which allows you to recover the original parse tree. This is not true of LL(k); LL(k) languages are a strict subset of LL(k+1) languages.
The test you propose will indeed let you decide whether a grammar is LR(k) for some given k (or LL(k)). Unfortunately, there's no way of figuring out the smallest possible value of k other than the successive search you propose, and there is no guarantee that the search will ever terminate.
Although the problem is hard (or impossible) in the general case, it can often be answered for specific grammars, by considering the possible valid successors of a grammar state which exhibits conflicts.
In most real-world grammars, there will only be a few conflicts, so manual examination of conflicting states is possible. In general terms, one needs to figure out the path which led to the conflicting state, and the possible continuations. In many cases it will be clear that the parsing conflict could be resolved with slightly more lookahead.
A large class of grammars where this will fail is the set of ambiguous grammars. An ambiguous grammar cannot be LR(k) (or LL(k)) for any k. Again, the question of whether a grammar is ambiguous is not decidable but effective heuristics exist, some of which are included in commercial products.
Again, it is often easy to find ambiguities in real-world grammars, either by visual inspection (as above), or by feeding a lot of valid texts into a GLR parser (such as the one produced by bison) until an ambiguity is reported. (Alternatively, you can enumerate valid texts from the grammar with a straightforward algorithm, and see if a text appears twice in the enumeration.)
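That enumeration idea can be sketched in a few lines. This is a toy illustration (the function name and the length bound are mine, not from any tool); it assumes every nonterminal derives at least one terminal symbol, so any sentential form longer than the bound can be pruned. Since the leftmost nonterminal is expanded at every step, each enumeration path is a distinct leftmost derivation, and a string produced more than once proves the grammar ambiguous:

```python
from collections import Counter, deque

def enumerate_strings(grammar, start, max_len):
    """Enumerate terminal strings by always expanding the leftmost
    nonterminal.  Every enumeration path is then a distinct leftmost
    derivation, so any string counted more than once proves ambiguity.

    `grammar` maps a nonterminal to a list of right-hand sides (tuples
    of symbols); a symbol is a nonterminal iff it is a key of `grammar`.
    Assumes every nonterminal derives at least one terminal, so forms
    longer than `max_len` can safely be pruned.
    """
    counts = Counter()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        for i, sym in enumerate(form):
            if sym in grammar:                 # leftmost nonterminal
                for rhs in grammar[sym]:
                    new = form[:i] + rhs + form[i + 1:]
                    if len(new) <= max_len:    # conservative prune
                        queue.append(new)
                break
        else:                                  # all terminals: a sentence
            counts[''.join(form)] += 1
    return counts
```

For the classic ambiguous grammar E → E + E | a, the string a+a+a is counted twice (one leftmost derivation per association), which reveals the ambiguity.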
Here are a couple of possibly relevant SO questions illustrating analysis techniques. I'm sure there are more.
A yacc shift/reduce conflict on an unambiguous grammar
Bison reduce/reduce situation
yacc shift-reduce for ambiguous lambda syntax
How to understand and fix conflicts in PLY

Related

Converting grammars to the most efficient parser

Is there an online algorithm which converts a given grammar to the most efficient parser possible?
For example: SLR/LR(k) such that k >= 0
For the class of grammars you are discussing (xLR(k)), parsing is linear time anyway, and it is impossible to do sublinear time if you have to examine every character.
If you insist on optimizing parse time, you should get a very fast LR parsing engine. LRStar used to be the cat's meow on this topic, but the guy behind it got zero reward from the world of "I want it for free" and pulled all instances of it off the net. You can settle for Bison.
Frankly most of your parsing time will be determined by how fast your parser can process individual characters, e.g., the lexer. Tune that first and you may discover there's no need to tune the parser.
First, let's distinguish LR(k) grammars from LR(k) languages. A grammar may not be LR(1), but, let's say, LR(2). But the language it generates must have an LR(1) grammar, and for that matter, it must have an LALR(1) grammar. The table size for such a grammar is essentially the same as for SLR(1), and LALR(1) is more powerful (all SLR(1) grammars are LALR(1) but not vice versa). So there is really no reason not to use an LALR(1) parser generator if you are looking to do LR parsing.
Since parsing represents only a fraction of the compilation time in modern compilers when lexical analysis and code generation that contains peephole and global optimizations are taken into consideration, I would just pick a tool considering its entire set of features. You should also remember that one parser generator may take a little longer than another to analyze a grammar and to generate the parsing tables. But once that job is done, the table-driven parsing algorithm that will run in the actual compiler thousands of times should not vary significantly from one parser generator to another.
As far as tools for converting arbitrary grammars to LALR(1), for example (in theory this can be done), you could do a Google search (I have not done this). But since semantics are tied to the productions, I would want to have complete control over the grammar being used for parsing and would probably eschew such conversion tools.

How easy is to find a string that leads to conflict in a SLR(1) parser compared to a LR(1)

It is known that SLR(1) parsers usually have fewer states than LR(1) parsers. But does this make it easier or harder to find a string that leads to a conflict in an SLR(1) parser compared to an LR(1) parser, and why?
Thank you in advance.
Let’s say you have some CFG and you build both an SLR(1) parser and an LR(1) parser for that grammar. If you don’t have any shift/reduce or reduce/reduce conflicts, then you’re done - there aren’t any strings that lead to conflicts. On the other hand, if such a conflict exists, then yes, there is a string that leads to a conflict. You can find such a string by working backwards through the automaton: find a path from the start state to the state with the shift/reduce or reduce/reduce conflict and write out the series of terminals and nonterminals that take you there. If it’s all terminals, great! You’ve got your string. If there are any nonterminals there, since an LR parser traces a rightmost derivation in reverse, pick the rightmost nonterminal in the sequence you found and repeat this process to expand it out further. Eventually you’ll get your string.
In that sense, once you have the automata constructed, the exact same procedure will find a string that can't be parsed. So the difficulty of finding a bad string basically boils down to building the SLR(1) versus the LR(1) automaton, and that's where the fact that LR(1) automata are a bit bigger than SLR(1) automata comes in. It'll probably take a bit longer to find out that a grammar isn't LR(1) than to find out that it isn't SLR(1), simply because it takes more time to build LR(1) parsers.
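The expansion step described above (replace nonterminals until only terminals remain) can be sketched minimally. The helper names below are hypothetical; real tools do this more carefully, but expanding each nonterminal on the path to its shortest terminal yield already gives a concrete input prefix that drives the parser into the conflicting state:

```python
def shortest_yields(grammar):
    """Compute, by fixpoint iteration, the shortest terminal string each
    nonterminal can derive.  `grammar` maps nonterminals to lists of
    right-hand sides (tuples of symbols)."""
    best = {}
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                # usable only once every nonterminal in rhs has a yield
                if all(s not in grammar or s in best for s in rhs):
                    cand = ''.join(best[s] if s in grammar else s for s in rhs)
                    if nt not in best or len(cand) < len(best[nt]):
                        best[nt] = cand
                        changed = True
    return best

def conflict_prefix(grammar, path):
    """Given the symbols labelling a path from the start state to a
    conflicting state, expand each nonterminal to its shortest terminal
    yield, producing a concrete input prefix reaching the conflict."""
    best = shortest_yields(grammar)
    return ''.join(best.get(s, s) for s in path)
```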

Deterministic Context-Free Grammar versus Context-Free Grammar?

I'm reading my notes for my comparative languages class and I'm a bit confused...
What is the difference between a context-free grammar and a deterministic context-free grammar? I'm specifically reading about how parsers are O(n^3) for CFGs but O(n) for DCFGs, and I don't really understand how the difference in time complexities could be that great (not to mention I'm still confused about what characteristics make a CFG a DCFG).
Thank you so much in advance!
Conceptually they are quite simple to understand. The context free grammars are those which can be expressed in BNF. The DCFGs are the subset for which a workable parser can be written.
In writing compilers we are only interested in DCFGs. The reason is that 'deterministic' means, roughly, that the next rule to be applied at any point in the parse is determined by the input so far and a finite amount of lookahead. Knuth invented LR(k) parsing back in the 1960s and proved it could handle any DCFG. Since then some refinements, especially LALR(1) and LL(1), have defined subclasses of grammars that can be parsed in limited memory, and techniques by which we can write them.
We also have techniques to derive parsers automatically from the BNF, if we know it's one of these grammars. Yacc, Bison and ANTLR are familiar examples.
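To make "determined by the input so far and a finite amount of lookahead" concrete, here is a toy LL(1) recursive-descent recognizer (the grammar is invented for illustration). The production to apply is always chosen from a single token of lookahead, which is why the parse runs in time linear in the input:

```python
def parse_list(tokens):
    """Deterministic (LL(1)) recursive-descent recognizer for the toy
    grammar  L -> '(' E ')' ,  E -> 'a' E | epsilon.  Each decision
    inspects one token of lookahead, so the parse is O(n)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(t):
        nonlocal pos
        if peek() != t:
            raise SyntaxError(f"expected {t!r} at position {pos}")
        pos += 1

    expect('(')
    while peek() == 'a':    # choose E -> 'a' E iff the lookahead is 'a',
        pos += 1            # otherwise choose E -> epsilon
    expect(')')
    return pos == len(tokens)
```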
I've never seen a parser for an NDCFG, but at any point in the parse it would potentially need to consider the whole of the input string and every possible parse that could be applied. It's not hard to see why that would get rather large and slow.
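By contrast, a general CFG parser such as CYK must consider every split point of every substring, which is exactly where the O(n^3) bound comes from. A standard textbook sketch for a grammar in Chomsky normal form (the example grammar, for a^n b^n, is my own choice):

```python
def cyk_recognize(rules, start, word):
    """CYK recognition for a grammar in Chomsky normal form: the classic
    O(n^3) algorithm for arbitrary context-free grammars.  `rules` is a
    list of (lhs, rhs) pairs where rhs is either a 1-tuple holding a
    terminal or a 2-tuple of nonterminals."""
    n = len(word)
    if n == 0:
        return False
    # T[i][l] holds the nonterminals that derive word[i:i+l]
    T = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(word):
        for lhs, rhs in rules:
            if rhs == (ch,):
                T[i][1].add(lhs)
    for l in range(2, n + 1):        # span length
        for i in range(n - l + 1):   # span start
            for k in range(1, l):    # split point: three nested loops, O(n^3)
                for lhs, rhs in rules:
                    if (len(rhs) == 2 and rhs[0] in T[i][k]
                            and rhs[1] in T[i + k][l - k]):
                        T[i][l].add(lhs)
    return start in T[0][n]
```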
I should point out that many real languages are imperfect, in that they are not entirely context free, not unambiguous or otherwise depart from the ideal DCFG. C/C++ is a good example, but there are many others. These languages are usually handled by special purpose rules such as semantic or syntactic predicates, special case backtracking or other 'tricks' with no effect on performance.
The comments point out that certain kinds of NDCFG are common and many tools provide a way to handle them. One common problem is ambiguity. It is relatively easy to parse an ambiguous grammar by introducing a simple local semantic rule, but of course this can only ever generate one of the possible parse trees. A generalised parser for an NDCFG would potentially produce all parse trees, and could perhaps allow those trees to be filtered on some arbitrary condition. I don't know of any such parsers.
Left recursion is not a feature of NDCFGs. It presents a particular challenge to the design of LL(k) parsers, but no problems for LR(k) parsers.

LR(k) to LR(1) grammar conversion

I am confused by the following quote from Wikipedia:
In other words, if a language was reasonable enough to allow an
efficient one-pass parser, it could be described by an LR(k) grammar.
And that grammar could always be mechanically transformed into an
equivalent (but larger) LR(1) grammar. So an LR(1) parsing method was,
in theory, powerful enough to handle any reasonable language. In
practice, the natural grammars for many programming languages are
close to being LR(1).[citation needed]
This means that a parser generator like bison is very powerful (since it can handle LR(k) grammars), provided one is able to convert an LR(k) grammar to an LR(1) grammar. Do some examples of this exist, or a recipe on how to do this? I'd like to know because I have a shift/reduce conflict in my grammar; I think it is an LR(2) grammar and would like to convert it to an LR(1) grammar. Side question: is C++ an unreasonable language, since I've read that bison-generated parsers cannot parse it?
For references on the general purpose algorithm to find a covering LR(1) grammar for an LR(k) grammar, see Real-world LR(k > 1) grammars?
The general purpose algorithm produces quite large grammars; in fact, I'm pretty sure that the resulting PDA is the same size as the LR(k) PDA would be. However, in particular cases it's possible to come up with simpler solutions. The general principle applies, though: you need to defer the shift/reduce decision by unconditionally shifting until the decision can be made with a single lookahead token.
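As an illustration of deferring the decision, consider this hypothetical toy grammar (not taken from the question). It is LR(2) but not LR(1), and a covering LR(1) grammar is obtained by shifting unconditionally past the point of the conflict:

```text
/* LR(2), not LR(1): after shifting x, the parser must choose between
   reducing A -> x and B -> x, but the next token is 'a' in both
   alternatives; only the *second* token (b or c) decides. */
S : A 'a' 'b' | B 'a' 'c' ;
A : 'x' ;
B : 'x' ;

/* Covering LR(1) grammar: shift unconditionally through 'x a' and
   decide with one token of lookahead (b => reduce X, c => reduce Y).
   The original parse tree is recovered by rewriting X as (A -> x) 'a'
   and Y as (B -> x) 'a'. */
S : X 'b' | Y 'c' ;
X : 'x' 'a' ;
Y : 'x' 'a' ;
```

The rewritten grammar trades the premature reduce decision for a reduce/reduce choice that a single lookahead token resolves.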
One example: Is C#'s lambda expression grammar LALR(1)?
Without knowing more details about your grammar, I can't really help more than that.
With regard to C++, the things that make it tricky to parse are the preprocessor and some corner cases in parsing (and lexing) template instantiations. The fact that the parse of an expression depends on the "kind" (not type) of a symbol (in the context in which the symbol occurs) makes precise parsing with bison complicated. [1] "Unreasonable" is a value judgement which I'm not comfortable making; certainly, tool support (like accurate syntax colourizers and tab-completers) would have been simple with a different grammar, but the evidence is that it is not that hard to write (or even read) good C++ code.
Notes:
[1] The classic tricky parse, which also applies to C, is (a)*b, which is a cast of a dereference if a names a type, and a multiplication otherwise. If you write it in the context c/(a)*b, it is clear that an AST cannot be constructed without knowing whether it's a cast or a product, since that affects the shape of the AST.
A more C++-specific issue is: x<y>(z) (or x<y<z>>(3)) which parse (and arguably tokenise) differently depending on whether x names a template or not.

Converting a context-free grammar into a LL(1) grammar

First off, I have checked similar questions and none has quite the information I seek.
I want to know if, given a context-free grammar, it is possible to:
Know if there exists or not an equivalent LL(1) grammar. Equivalent in the sense that it should generate the same language.
Convert the context-free grammar to an equivalent LL(1) grammar, given that one exists. The conversion should succeed if an equivalent LL(1) grammar exists. It is OK if it does not terminate when no equivalent exists.
If the answer to those questions is yes, are such algorithms or tools available somewhere? My own research has been fruitless.
Another answer mentions that the Dragon Book has algorithms to eliminate left recursion and to left-factor a context-free grammar. I have access to the book and checked it, but it is unclear to me whether the resulting grammar is guaranteed to be LL(1). The restrictions imposed on the context-free grammar (no null productions and no cycles) are agreeable to me.
From the university compiler courses I took, I remember that every LL(1) grammar is context-free, but the class of context-free grammars is much bigger than LL(1). There is no general algorithm to check whether a context-free grammar is equivalent to an LL(1) grammar, or to convert it to one (the problem is undecidable in general).
Applying the bag of tricks (removing left recursion, removing FIRST/FOLLOW conflicts, left-factoring, etc.) is similar to applying mathematical transformations when you want to integrate a function: you need experience, and it is sometimes very close to an art. The transformations are often inverses of each other.
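For instance, immediate left recursion, the most mechanical of those tricks, can be removed automatically. The sketch below handles only the immediate case (the indirect case needs the full Dragon Book ordering of nonterminals); the primed-name convention and tuple encoding are my own choices for illustration:

```python
def eliminate_immediate_left_recursion(nt, rhss):
    """Rewrite  A -> A alpha | beta  as  A -> beta A' ; A' -> alpha A' | eps.
    Handles only *immediate* left recursion.  Right-hand sides are
    tuples of symbols, with () standing for epsilon."""
    rec = [rhs[1:] for rhs in rhss if rhs and rhs[0] == nt]    # A -> A alpha
    base = [rhs for rhs in rhss if not rhs or rhs[0] != nt]    # A -> beta
    if not rec:
        return {nt: rhss}                                      # nothing to do
    new = nt + "'"                                             # fresh name A'
    return {
        nt: [beta + (new,) for beta in base],
        new: [alpha + (new,) for alpha in rec] + [()],
    }
```

Applied to the textbook example E → E + T | T, this yields E → T E' and E' → + T E' | ε.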
One reason why LR-type grammars are now used a lot for generated parsers is that they cover a much wider spectrum of context-free grammars than LL(1).
By the way, the C grammar, for example, can be expressed as LL(1), but C#'s cannot (the lambda expression x => x + 1 comes to mind, where you cannot decide whether you are seeing a parameter of a lambda or a known variable).

Resources