Should a merge failure stop the LR(1) to LALR(1) conversion

Let's say I have a set of LR(1) states and I want to try to convert it to LALR(1).
I did the first step of finding states that have the same LR(0) core, and now I'm doing the merge process. However, one such set of states can't be merged, because merging would introduce reduce/reduce conflict(s). Should I:
stop the conversion right now and say that the grammar that produced this state machine is not an LALR(1) grammar,
or continue merging the other candidate states and stop only if none of them can be merged?

The conflict is not going to go away with more merges, so you could stop immediately and report failure.
Most parser generators would continue to the end, though:
The user might appreciate knowing about all conflicts and not just the first one found, in order to debug their grammar;
Sometimes a simple heuristic to resolve conflicts succeeds in generating a usable parser. (Yacc, for example, first resolves using declared operator precedences, then by preferring shift to reduce, and finally by preferring the reduction which appears earlier in the grammar.)
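The "continue to the end and report everything" approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the state layout (a dict from core item to lookahead set) and all function names are made up for the example, and the remapping of transitions that a real LALR(1) construction also needs is left out.

```python
from collections import defaultdict

# Assumed layout: an LR(1) state is {core_item: set_of_lookaheads},
# where a core item is the tuple (lhs, rhs, dot_position).

def core(state):
    """The LR(0) core of a state: its items with the lookaheads dropped."""
    return frozenset(state)

def merge(states):
    """Union the lookaheads of all states that share the same core."""
    merged = defaultdict(lambda: defaultdict(set))
    for state in states:
        for item, lookaheads in state.items():
            merged[core(state)][item] |= lookaheads
    return [dict(items) for items in merged.values()]

def reduce_reduce_conflicts(state):
    """List (lookahead, item_a, item_b) for completed items that clash."""
    completed = [(item, la) for item, la in state.items()
                 if item[2] == len(item[1])]
    conflicts = []
    for i, (item_a, la_a) in enumerate(completed):
        for item_b, la_b in completed[i + 1:]:
            for token in sorted(la_a & la_b):
                conflicts.append((token, item_a, item_b))
    return conflicts

def lalr_conflicts(states):
    """Merge every candidate, then collect all conflicts instead of stopping."""
    report = []
    for state in merge(states):
        report.extend(reduce_reduce_conflicts(state))
    return report

# Two states from the textbook grammar
#   S -> a A d | b B d | a B e | b A e,  A -> c,  B -> c
A_DONE = ("A", ("c",), 1)
B_DONE = ("B", ("c",), 1)
after_a_c = {A_DONE: {"d"}, B_DONE: {"e"}}
after_b_c = {A_DONE: {"e"}, B_DONE: {"d"}}
conflicts = lalr_conflicts([after_a_c, after_b_c])
```

Neither state has a conflict on its own, but the merged state does, on both `d` and `e`. A generator built this way can print the whole `conflicts` list and then either give up or apply a resolution heuristic.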

Related

How to fix grammar with optional non-terminal?

I wrote a grammar for an LALR parser and I am stuck on an optional nonterminal. Consider, for example, C++ dereference, where you can write:
******expression;
Of course you can write:
expression;
And here is my problem: the dereference nonterminal is really optional, and this has such an impact on the grammar that the parser now sees it fitting (almost) everywhere because, well, it might be empty.
Is there a common pattern for rewriting the grammar to fix this?
I would also be grateful for pointers to a book or other resources that deal with "common problems & patterns when writing grammars".

First of all, the problem you are having is not the one you are claiming to have. Having a nullable (possibly empty) nonterminal does not mean that the parser will try to stick it everywhere. (I use the term “nullable” here to avoid confusion, because “optional” might refer to an optional occurrence of a nonterminal, as in x? where x is the nonterminal name). It just means that whenever you use that nonterminal in your grammar, the parser might skip over it or match with an empty word (details are according to the rules of the particular parsing algorithm, in your case LALR).
Secondly, the problem most probably is that the resulting grammar is ambiguous. My guess is that you used some kind of combination of right recursion for defining the nonterminal with the stars, and having an asterisk as a binary multiplication operator. (Feel free to update the question with a grammar fragment, then I might be able to offer more detailed help).
Thirdly, and mainly concerning your quest for general problems and patterns in grammars: people usually do not put the stars in one nonterminal and the expression in another, because ultimately you will want to transform your parse tree into an abstract syntax tree on which you probably intend to perform some calculations. In that case you would prefer a construction that says “dereference of a dereference of a dereference of an expression” rather than “three stars followed by an expression”. Again, the answer would have been less vague if you had provided more details.
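To make the "dereference of a dereference" shape concrete, here is a small sketch. It is a hand-written recursive-descent function rather than an LALR parser, and the rule and class names (`unary`, `Deref`, `Name`) are assumptions for illustration only. The point is the rule itself, `unary := '*' unary | NAME`, which has no empty alternative.

```python
from dataclasses import dataclass

@dataclass
class Name:
    ident: str

@dataclass
class Deref:
    operand: object

def parse_unary(tokens, pos=0):
    """unary := '*' unary | NAME. Returns (tree, next_position)."""
    if pos < len(tokens) and tokens[pos] == "*":
        # Each star wraps whatever the rest of the rule produces.
        operand, pos = parse_unary(tokens, pos + 1)
        return Deref(operand), pos
    if pos < len(tokens) and tokens[pos].isidentifier():
        return Name(tokens[pos]), pos + 1
    raise SyntaxError(f"unexpected input at position {pos}")

tree, _ = parse_unary(["*", "*", "p"])
plain, _ = parse_unary(["p"])
```

Here `tree` is `Deref(Deref(Name('p')))` and `plain` is just `Name('p')`, so the zero-star case falls out of the second alternative and nothing in the grammar is nullable.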

Why can't a compiler have a "shift/shift" conflict?

I am currently studying compilers and, as I understand it, in LR(0) there are cases where we have "shift/reduce" or "reduce/reduce" conflicts, but it's impossible to have "shift/shift" conflicts! Why can't we have a "shift/shift" conflict?

Shift/reduce conflicts occur when the parser can't tell whether to shift (push the next input token atop the parsing stack) or reduce (pop a series of terminals and nonterminals from the parsing stack and replace them with the nonterminal on the left-hand side of a production). A reduce/reduce conflict is when the parser knows to reduce, but can't tell which reduction to perform.
If you were to have a shift/shift conflict, the parser would know that it needed to push the next token onto its parsing stack, but wouldn't know how to do it. Since there is only one way to push the token onto the parsing stack, there generally cannot be any conflicts of this form.
That said, it's theoretically possible for a shift/shift conflict to exist if you had a strange setup in which there were two or more transitions leading out of a given parsing state that were labeled with the same terminal symbol. The conflict in that case would be whether to shift and go to one state or shift and go to another. This could happen if you tried to compress an automaton into fewer states and did so incorrectly, or if you were trying to build a nondeterministic parsing automaton. In practice, this would never happen.
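If you wanted to check a hand-built or compressed automaton for that strange case, a rough sketch could look like this. The edge format `(state, terminal, target)` and the function name are assumptions made up for the example.

```python
from collections import defaultdict

def shift_shift_conflicts(edges):
    """Find (state, terminal) pairs that lead to more than one target state.

    edges is an iterable of (state, terminal, target) triples.
    """
    targets = defaultdict(set)
    for state, terminal, target in edges:
        targets[(state, terminal)].add(target)
    return {key: sorted(dests)
            for key, dests in targets.items() if len(dests) > 1}

deterministic = [(0, "a", 1), (0, "b", 2), (1, "a", 1)]
# A faulty state compression might add a second "a" edge out of state 0.
broken = deterministic + [(0, "a", 3)]
```

`shift_shift_conflicts(deterministic)` is empty, while `shift_shift_conflicts(broken)` reports `{(0, 'a'): [1, 3]}`. In the usual LR construction the goto operation is a function of the state and the symbol, so a correctly built table always passes this check.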
Hope this helps!

LR(k) parsers, with k infinite, not restricted to deterministic context-free languages?

Is a theoretical LR parser with infinite lookahead capable of parsing (unambiguous) languages which can be described by a context-free grammar?
Normally LR(k) parsers are restricted to deterministic context-free languages. I think this means that there always has to be exactly one grammar rule that can be applied currently. Meaning within the current lookahead context not more than one possible way of parsing is allowed to occur. The book "Language Implementation Patterns" states that a "... parser is nondeterministic - it cannot determine which alternative to choose." if the lookahead sets overlap. In contrast a non-deterministic parser just chooses one way if there are multiple alternatives and then goes back to the decision point and chooses the next alternative if it is impossible at a certain point to continue with the decision previously made.
Wherever I read definitions of LR(k) parsers (like on Wikipedia or in the Dragon Book) I always read something like: "k is the number of lookahead tokens" or cases when "k > 1" but never if k can be infinite. Wouldn't an infinite lookahead be the same as trying all alternatives until one succeeds?
Could it be that k is assumed to be finite in order to (implicitly) distinguish LR(k) parsers from non-deterministic parsers?

You are raising several issues here that are difficult to answer in a short form. Nevertheless I will try.
First of all, what is "infinite lookahead"? There is no book that describes such a parser. If you have a clear idea of what it is, you need to describe it first. Only after that can we discuss the topic. For now, parsing theory discusses only LR(k) grammars where k is finite.
You wrote: "Normally LR(k) parsers are restricted to deterministic context-free languages. I think this means that there always has to be exactly one grammar rule that can be applied currently."
This is wrong. Grammars written for LR(k) parsers may have "grammar conflicts". The Dragon Book mentions them briefly without going into detail. Some grammars have no conflicts, and others do. When a grammar has no conflicts, there is ALWAYS only one applicable rule and the situation is relatively simple.
When a grammar has conflicts, more than one rule is applicable in certain cases. Classic parsers cannot help here. What makes matters worse is that some input statements may have a set of correct parsings, not just one. From the standpoint of grammar theory, all these parsings have the same value and importance.
You wrote: "The book "Language Implementation Patterns" states that a "... parser is nondeterministic - it cannot determine which alternative to choose.""
I have the impression that there is no definitive agreement on what "nondeterministic parser" means. I would tend to say that a nondeterministic parser just picks one of the alternatives at random and goes on.
In practice only two strategies for resolving conflicts are used. The first is conflict resolution in a callback handler. The callback handler is regular code. The programmer who writes it can check whatever they want in any way they want. This code only returns the result: what action to take. For the parser on top, the callback handler is a black box. There is no theory here.
The second approach is called "backtracking". The idea behind it is very simple: we do not know where to go, so we try all possible alternatives. There is nothing nondeterministic here. There are several different flavors of backtracking.
If this is not enough I can write a little more.

Nondeterminism means that, in order to produce the correct result(s!), a finite state machine reads a token and then has N>1 possible next states. You can recognize a nondeterministic FSM by a node that has more than one outgoing edge with the same label. Note that not every branch has to be valid, but the FSM can't pick just one. In practice you could fork here, resulting in N state machines, or you could follow one branch to the end and then come back and try the next one until every outgoing transition has been tested.
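The "fork" option can be sketched by tracking the set of all states the machine might be in. The transition-table format and the tiny example machine below are assumptions for illustration; the machine accepts strings over a and b that end in "ab".

```python
def accepts(transitions, start, accepting, tokens):
    """Simulate a nondeterministic FSM by following every branch at once."""
    current = {start}
    for token in tokens:
        # Each state forks into all of its targets for this token;
        # branches with no matching edge simply die out.
        current = {target
                   for state in current
                   for target in transitions.get((state, token), ())}
        if not current:
            return False
    return bool(current & accepting)

# State 0 has two outgoing edges labelled "a": that is the nondeterminism.
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
```

With this table, `accepts(nfa, 0, {2}, "aab")` is true and `accepts(nfa, 0, {2}, "aba")` is false. The backtracking option would explore the same branches one at a time instead of all together.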

If the grammar is ambiguous then there exists exactly one handle for each sentential form?

There can be two productions by which we can do the reduction. After giving precedence and associativity as required, there will be only one handle. So is this statement true?

This is partially true: a reduce/reduce conflict is usually resolved by specifying precedence or by letting the parser builder choose which rule to apply before the other.
This means that the conflict is resolved, but not that the parser is going to behave exactly as intended. It is worth studying what is causing the conflict and considering whether the grammar needs refactoring to express what you are trying to parse, or whether the automatic choice/precedence is enough.

If you have a grammar which has ambiguous rules, you get multiple interpretations. You don't have to insist that the grammar removes ambiguity; you can simply agree that something is ambiguous and parse it multiple ways:
fruit flies like an arrow.
The result of the parse is multiple interpretations.
Now, for such a language to be useful to a reader, either he has to be happy with the ambiguity, or you need to give him a way to resolve it. (In the example, I've decided for you that you are happy with the ambiguity, because I haven't given you a way to resolve it!) Or one can give the reader of something with ambiguous parses a way to choose which parse makes sense, so that he rejects the inappropriate parses.
I can do that for the above case by telling you that I mean "fruit => watermelon".
Computer grammars are no different, but most programmers don't want ambiguous code. So in general, language designers like to define unambiguous grammars. In practice, they don't succeed and you get funny language rules like, "If this could be interpreted ambiguously, then interpret it this way.".
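The same thing can be shown with a computer grammar. Below is a rough sketch that returns every parse of the deliberately ambiguous grammar `E -> E '-' E | NUMBER`. The grammar, the function name, and the tuple format of the trees are all assumptions chosen for the example.

```python
from functools import lru_cache

def all_parses(tokens):
    """Return every parse tree for the grammar E -> E '-' E | NUMBER."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def parse(start, end):
        trees = []
        # Alternative 1: a single NUMBER token.
        if end - start == 1 and tokens[start].isdigit():
            trees.append(tokens[start])
        # Alternative 2: split at any '-' and parse both sides.
        for mid in range(start + 1, end - 1):
            if tokens[mid] == "-":
                for left in parse(start, mid):
                    for right in parse(mid + 1, end):
                        trees.append(("-", left, right))
        return tuple(trees)

    return list(parse(0, len(tokens)))

trees = all_parses(["1", "-", "2", "-", "3"])
```

`trees` holds two interpretations, `('-', '1', ('-', '2', '3'))` and `('-', ('-', '1', '2'), '3')`, which evaluate to different numbers. A rule such as "subtraction is left-associative" is exactly the kind of "interpret it this way" rule that picks one of them.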

Parsing rules - how to make them play nice together

So I'm writing a Parser, where I favor flexibility over speed, and I want it to be easy to write grammars for, i.e. no tricky workaround rules (fake rules to solve conflicts etc., like you have to write in yacc/bison etc.)
There's a hand-coded Lexer with a fixed set of tokens (e.g. PLUS, DECIMAL, STRING_LIT, NAME, and so on). Right now there are three types of rules:
TokenRule: matches a particular token
SequenceRule: matches an ordered list of rules
GroupRule: matches any rule from a list
For example, let's say we have the TokenRule 'varAccess', which matches token NAME (roughly /[A-Za-z][A-Za-z0-9_]*/), and the SequenceRule 'assignment', which matches [expression, TokenRule(PLUS), expression].
Expression is a GroupRule matching either 'assignment' or 'varAccess' (the actual ruleset I'm testing with is a bit more complete, but that'll do for the example)
But now let's say I want to parse
var1 = var2
And let's say the Parser begins with rule Expression (the order in which they are defined shouldn't matter - priorities will be solved later). And let's say the GroupRule expression will first try 'assignment'. Then since 'expression' is the first rule to be matched in 'assignment', it will try to parse an expression again, and so on until the stack is filled up and the computer - as expected - simply gives up in a sparkly segfault.
So what I did is this: SequenceRules add themselves as 'leafs' to their first rule, and become non-root rules. Root rules are the rules that the parser tries first. When one of those is applied and matches, it tries to subapply each of its leafs, one by one, until one matches. Then it tries the leafs of the matching leaf, and so on, until nothing matches anymore.
So that it can parse expressions like
var1 = var2 = var3 = var4
Just right =) Now the interesting stuff. This code:
var1 = (var2 + var3)
Won't parse. What happens is, var1 gets parsed (varAccess), assign is sub-applied, it looks for an expression, tries 'parenthesis', begins, looks for an expression after the '(', finds var2, and then chokes on the '+' because it was expecting a ')'.
Why doesn't it match the 'var2 + var3'? (And yes, there's an 'add' SequenceRule, before you ask.) Because 'add' isn't a root rule (to avoid infinite recursion with the parse-expression-beginning-with-expression-etc.) and because leafs aren't tested in SequenceRules; otherwise it would parse things like
reader readLine() println()
as
reader (readLine() println())
(e.g. '1 = 3' is the expression expected by add, the leaf of varAccess a)
whereas we'd like it to be left-associative, e.g. parsing as
(reader readLine()) println()
So anyway, now we've got this problem that we should be able to parse expressions such as '1 + 2' within SequenceRules. What to do? Add a special case so that when a SequenceRule begins with a TokenRule, the GroupRules it contains are tested for leafs? Would that even make sense outside that particular example? Or should one be able to specify for each element of a SequenceRule whether it should be tested for leafs or not? Tell me what you think (other than throw away the whole system - that'll probably happen in a few months anyway)
P.S: Please, pretty please, don't answer something like "go read this 400-page book or you don't even deserve our time". If you feel the need to, just restrain yourself and go bash on reddit. Okay? Thanks in advance.

LL(k) parsers (top down recursive, whether automated or written by hand) require refactoring of your grammar to avoid left recursion, and often require special specifications of lookahead (e.g. ANTLR) to be able to handle k-token lookahead. Since grammars are complex, you get to discover k by experimenting, which is exactly the thing you wish to avoid.
YACC/LALR(1) grammars avoid the problem of left recursion, which is a big step forward. The bad news is that there are no real programming languages (other than Wirth's original PASCAL) that are LALR(1). Therefore you get to hack your grammar to change it from LR(k) to LALR(1), again forcing you to suffer the experiments that expose the strange cases, and hacking the grammar reduction logic to try to handle k-lookaheads when the parser generators (YACC, BISON, ... you name it) produce 1-lookahead parsers.
GLR parsers (http://en.wikipedia.org/wiki/GLR_parser) allow you to avoid almost all of this nonsense. If you can write a context free parser, under most practical circumstances, a GLR parser will parse it without further effort. That's an enormous relief when you try to write arbitrary grammars. And a really good GLR parser will directly produce a tree.
BISON has been enhanced to do GLR parsing, sort of. You still have to write complicated logic to produce your desired AST, and you have to worry about how to handle failed parsers and cleaning up/deleting their corresponding (failed) trees. The DMS Software Reengineering Toolkit provides standard GLR parsers for any context-free grammar, and automatically builds ASTs without any additional effort on your part; ambiguous trees are automatically constructed and can be cleaned up by post-parsing semantic analysis. We've used this to define 30+ language grammars including C and C++ (which is widely thought to be hard to parse [and it is almost impossible to parse with YACC] but is straightforward with real GLR); see C++ front end parser and AST builder based on DMS.
Bottom line: if you want to write grammar rules in a straightforward way, and get a parser to process them, use GLR parsing technology. Bison almost works. DMS really works.

My favourite parsing technique is to create a recursive-descent (RD) parser from a PEG grammar specification. Such parsers are usually very fast, simple, and flexible. One nice advantage is that you don't have to worry about a separate tokenization pass, and there is no need to squeeze the grammar into some LALR form. Some PEG libraries are listed [here][1].
Sorry, I know this falls under "throw away the system", but you are barely out of the gate with your problem, and switching to a PEG RD parser would eliminate your headaches now.
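For a feel of what that looks like on the examples from the question, here is a rough hand-written recursive-descent sketch. It is not generated by any PEG library, it uses a small regex tokenizer for brevity, and the rule names (`assignment`, `add`, `primary`) and the tuple-shaped trees are assumptions made for the example. Recursion on the right gives right-associative `=`, and a loop replaces the left-recursive rule for `+`.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9_]*|[=+()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"bad character at {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, got {self.peek()!r}")
        self.pos += 1

    def assignment(self):
        # assignment := add ('=' assignment)?   (right-associative)
        left = self.add()
        if self.peek() == "=":
            self.pos += 1
            return ("=", left, self.assignment())
        return left

    def add(self):
        # add := primary ('+' primary)*   (the loop gives left associativity)
        left = self.primary()
        while self.peek() == "+":
            self.pos += 1
            left = ("+", left, self.primary())
        return left

    def primary(self):
        # primary := '(' assignment ')' | NAME
        token = self.peek()
        if token == "(":
            self.pos += 1
            inner = self.assignment()
            self.expect(")")
            return inner
        if token is not None and token[0].isalpha():
            self.pos += 1
            return token
        raise SyntaxError(f"unexpected token {token!r}")

def parse(text):
    parser = Parser(text)
    tree = parser.assignment()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input: {parser.peek()!r}")
    return tree
```

`parse("var1 = var2 = var3")` nests to the right, `parse("var1 = (var2 + var3)")` handles the parenthesized sum, and `parse("a + b + c")` groups to the left. Note that the sketch does not check that the left side of `=` is something assignable.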
