Parsing: which method to choose?

I'm working on a compiler (for a language close to C) and I have to implement it in C. My main question is how to choose the right parsing method in order to be efficient while coding my compiler.
Here's my current grammar:
http://img11.hostingpics.net/pics/273965Capturedcran20130417192526.png
I was thinking about making a top-down parser LL(1) as described here: http://dragonbook.stanford.edu/lecture-notes/Stanford-CS143/07-Top-Down-Parsing.pdf
Could it be an efficient choice considering this grammar, knowing that I first have to remove the left-recursive rules? Do you have any other advice?
Thank you,
Mentinet

Lots of answers here, but they get things confused. Yes, there are LL and LR parsers, but that isn't really your choice.
You have a grammar. There are tools that automatically create a parser for you given a grammar. The venerable Yacc and Bison do this. They create an LR parser (LALR, actually). There are also tools that create an LL parser for you, like ANTLR. The downside of tools like this is that they are inflexible. Their automatically generated syntax error messages suck, error recovery is hard, and the older ones encourage you to factor your code in one way - which happens to be the wrong way. The right way is to have your parser spit out an Abstract Syntax Tree, and then have the compiler generate code from that. The tools want you to mix parser and compiler code.
When you are using automated tools like this, the differences in power between LL, LR and LALR really do matter. You can't "cheat" to extend their power. (Power in this case means being able to generate a parser for any valid context-free grammar. A valid context-free grammar is one that generates a unique, correct parse tree for every input, or correctly says it doesn't match the grammar.) We currently have no parser generator that can create a parser for every valid grammar. However, LR can handle more grammars than any other sort. Not being able to handle a grammar isn't a disaster, as you can rewrite the grammar in a form the parser generator can accept. However, it isn't always obvious how that should be done, and worse, it affects the Abstract Syntax Tree generated, which means weaknesses in the parser ripple down through the rest of your code - like the compiler.
The reason there are LL, LALR and LR parsers is that, a long time ago, the job of generating an LR parser was taxing for the computers of the day, both in terms of time and memory. (Note this is the time it takes to generate the parser, which only happens when you write it. The generated parser runs very quickly.) But that was a looong time ago. Generating an LR(1) parser takes far less than 1GB of RAM for a moderately complex language and on a modern computer takes under a second. For that reason you are far better off with an LR automatic parser generator, like Hyacc.
The other option is to write your own parser. In this case there is only one choice: an LL parser. When people here say writing LR is hard, they understate the case. It is near impossible for a human to manually create an LR parser. You might think this means that if you write your own parser you are constrained to use LL(1) grammars. But that isn't quite true. Since you are writing the code, you can cheat. You can look ahead an arbitrary number of symbols, and because you don't have to output anything till you are good and ready, the Abstract Syntax Tree doesn't have to match the grammar you are using. This ability to cheat makes up for all of the lost power between LL and LR(1), and often then some.
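To make the "cheating" concrete, here is a minimal sketch of a hand-written recursive-descent parser (in Python for brevity; the same shape carries over to C). Everything in it is hypothetical: the tiny statement grammar, the `Parser` class and the tuple-shaped AST nodes are assumptions for illustration, not the asker's grammar. It peeks two tokens ahead to tell an assignment from an expression statement, and it builds left-associative trees in a loop even though the grammar it follows has no left recursion.

```python
import re

# One token per match: an integer, an identifier, or any single character.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

def tokenize(text):
    tokens = []
    for num, ident, op in TOKEN_RE.findall(text):
        if num:
            tokens.append(("int", num))
        elif ident:
            tokens.append(("id", ident))
        elif op.strip():                      # skip stray whitespace
            tokens.append((op, op))
    tokens.append(("eof", ""))
    return tokens

class Parser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.pos = 0

    def peek(self, k=0):
        # Kind of the token k positions ahead (clamped to the eof token).
        return self.toks[min(self.pos + k, len(self.toks) - 1)][0]

    def next(self, kind=None):
        tok = self.toks[self.pos]
        if kind is not None and tok[0] != kind:
            raise SyntaxError("expected %r, got %r" % (kind, tok[1]))
        self.pos += 1
        return tok[1]

    def stmt(self):
        # The "cheat": two tokens of lookahead instead of one.
        if self.peek() == "id" and self.peek(1) == "=":
            name = self.next("id")
            self.next("=")
            value = self.exp()
            self.next(";")
            return ("assign", name, value)
        value = self.exp()
        self.next(";")
        return ("expstmt", value)

    def exp(self):
        # A loop instead of left recursion; the tree is still left-associative.
        left = self.term()
        while self.peek() in ("+", "-"):
            op = self.next()
            left = (op, left, self.term())
        return left

    def term(self):
        left = self.atom()
        while self.peek() in ("*", "/"):
            op = self.next()
            left = (op, left, self.atom())
        return left

    def atom(self):
        kind = self.peek()
        if kind == "int":
            return ("int", int(self.next()))
        if kind == "id":
            name = self.next()
            if self.peek() == "(":            # function call
                self.next("(")
                args = []
                if self.peek() != ")":
                    args.append(self.exp())
                    while self.peek() == ",":
                        self.next()
                        args.append(self.exp())
                self.next(")")
                return ("call", name, args)
            return ("var", name)
        if kind == "(":
            self.next()
            inner = self.exp()
            self.next(")")
            return inner
        raise SyntaxError("unexpected token %r" % (self.toks[self.pos][1],))
```

Calling `Parser("x = a - b - 2 * f(3);").stmt()` yields an `assign` node whose value nests as `(a - b) - (2 * f(3))`, which is the shape you want in the AST regardless of how the grammar had to be contorted to remove left recursion.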
Hand-written parsers have their downsides, of course. There is no guarantee that your parser actually matches your grammar, or for that matter no checking that your grammar is valid (i.e. recognises the language you think it does). They are longer, and they are even worse at encouraging you to mix parsing code with compile code. They are also obviously implemented in only one language, whereas parser generators often spit out their results in several different languages. Even if they don't, an LR parse table can be represented in a data structure containing only constants (say in JSON), and the actual parser is only 100 lines of code or so. But there are also upsides to hand-written parsers. Because you wrote the code, you know what is going on, so it is easier to do error recovery and generate sane error messages.
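As an illustration of the "table of constants plus a small driver" point, here is a minimal sketch of a table-driven LR parser. The grammar (E -> E '+' n | n), the hand-built SLR table and the names `ACTION`, `GOTO`, `RULES` and `lr_parse` are all assumptions made up for this example; a real generator would emit much larger tables, but the driver loop stays about this size.

```python
# Grammar (rule numbers used in the table):
#   1: E -> E '+' n
#   2: E -> n
# ACTION maps (state, token kind) to shift/reduce/accept; GOTO maps
# (state, non-terminal) to the next state. Both are plain constants.
ACTION = {
    (0, "n"): ("s", 2),
    (1, "+"): ("s", 3),
    (1, "$"): ("acc",),
    (2, "+"): ("r", 2), (2, "$"): ("r", 2),
    (3, "n"): ("s", 4),
    (4, "+"): ("r", 1), (4, "$"): ("r", 1),
}
GOTO = {(0, "E"): 1}
RULES = {1: ("E", 3), 2: ("E", 1)}   # rule -> (left-hand side, length of right-hand side)

def lr_parse(tokens):
    """Parse a list such as ["1", "+", "2"] and return a nested-tuple tree."""
    tokens = list(tokens) + ["$"]
    states = [0]          # stack of parser states
    values = []           # stack of tokens and subtrees, parallel to states[1:]
    pos = 0
    while True:
        tok = tokens[pos]
        kind = "n" if tok.isdigit() else tok
        act = ACTION.get((states[-1], kind))
        if act is None:
            raise SyntaxError("unexpected %r at position %d" % (tok, pos))
        if act[0] == "s":                       # shift
            states.append(act[1])
            values.append(tok)
            pos += 1
        elif act[0] == "r":                     # reduce
            lhs, length = RULES[act[1]]
            children = values[-length:]
            del values[-length:]
            del states[-length:]
            values.append((lhs,) + tuple(children))
            states.append(GOTO[(states[-1], lhs)])
        else:                                   # accept
            return values[0]
```

`lr_parse(["1", "+", "2", "+", "3"])` returns a left-nested tree, and the same driver works unchanged for any grammar once the three tables are swapped.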
In the end, the tradeoff often works like this:
For one-off jobs, you are far better off using an LR(1) parser generator. The generator will check your grammar, save you work, and modern ones spit out the Abstract Syntax Tree directly, which is exactly what you want.
For highly polished tools like mcc or gcc, use a hand written LL parser. You will be writing lots of unit tests to guard your back anyway, error recovery and error messages are much easier to get right, and they can recognise a larger class of languages.
The only other question I have is: why C? Compilers aren't generally time-critical code. There are very nice parsing packages out there that will allow you to get the job done in half the code if you are willing to have your compiler run a fair bit slower - my own Lrparsing for instance. Bear in mind that a "fair bit slower" here means "hardly noticeable to a human". I guess the answer is "the assignment I am working on specifies C". To give you an idea, here is how simple getting from your grammar to a parse tree becomes when you relax the requirement. This program:
#!/usr/bin/python
from lrparsing import *

class G(Grammar):
    Exp = Ref("Exp")
    int = Token(re='[0-9]+')
    id = Token(re='[a-zA-Z][a-zA-Z0-9_]*')
    ActArgs = List(Exp, ',', 1)
    FunCall = id + '(' + Opt(ActArgs) + ')'
    Exp = Prio(
        id | int | Tokens("[]", "False True") | Token('(') + List(THIS, ',', 1, 2) + ')' |
        Tokens("! -") + THIS,
        THIS << Tokens("* / %") << THIS,
        THIS << Tokens("+ -") << THIS,
        THIS << Tokens("== < > <= >= !=") << THIS,
        THIS << Tokens("&&") << THIS,
        THIS << Tokens("||") << THIS,
        THIS << Tokens(":") << THIS)
    Type = (
        Tokens("", "Int Bool") |
        Token('(') + THIS + ',' + THIS + ')' |
        Token('[') + THIS + ']')
    Stmt = (
        Token('{') + THIS * Many + '}' |
        Keyword("if") + '(' + Exp + ')' << THIS + Opt(Keyword('else') + THIS) |
        Keyword("while") + '(' + Exp + ')' + THIS |
        id + '=' + Exp + ';' |
        FunCall + ';' |
        Keyword('return') + Opt(Exp) + ';')
    FArgs = List(Type + id, ',', 1)
    RetType = Type | Keyword('void')
    VarDecl = Type + id + '=' + Exp + ';'
    FunDecl = (
        RetType + id + '(' + Opt(FArgs) + ')' +
        '{' + VarDecl * Many + Stmt * Some + '}')
    Decl = VarDecl | FunDecl
    Prog = Decl * Some
    COMMENTS = Token(re="/[*](?:[^*]|[*][^/])*[*]/") | Token(re="//[^\n\r]*")
    START = Prog

EXAMPLE = """\
Int factorial(Int n) {
    Int result = 1;
    while (n > 1) {
        result = result * n;
        n = n - 1;
    }
    return result;
}
"""
parse_tree = G.parse(EXAMPLE)
print(G.repr_parse_tree(parse_tree))
Produces this output:
(START (Prog (Decl (FunDecl
(RetType (Type 'Int'))
(id 'factorial') '('
(FArgs
(Type 'Int')
(id 'n')) ')' '{'
(VarDecl
(Type 'Int')
(id 'result') '='
(Exp (int '1')) ';')
(Stmt 'while' '('
(Exp
(Exp (id 'n')) '>'
(Exp (int '1'))) ')'
(Stmt '{'
(Stmt
(id 'result') '='
(Exp
(Exp (id 'result')) '*'
(Exp (id 'n'))) ';')
(Stmt
(id 'n') '='
(Exp
(Exp (id 'n')) '-'
(Exp (int '1'))) ';') '}'))
(Stmt 'return'
(Exp (id 'result')) ';') '}'))))

The most efficient way to build a parser is to use a tool whose sole purpose is to build parsers. They used to be called compiler compilers, but nowadays the focus has shifted (broadened) to language workbenches, which provide you with more aid in building your own language. For instance, almost any language workbench will provide you with IDE support and syntax highlighting for your language right off the bat, just by looking at a grammar. They also help immensely with debugging your grammar and your language (you didn’t expect left recursion to be the biggest of your problems, did you?).
Among the best currently supported and developing language workbenches one could name:
Rascal
Spoofax
MPS
MetaEdit+
Xtext
If you are really so inclined, or if you are considering writing a parser yourself just for amusement and experience, the best modern algorithms are SGLR, GLL and Packrat. Each one of those is the quintessence of algorithmic research that lasted half a century, so do not expect to understand them fully in a flash, and do not expect any good to come out of the first couple of “fixes” you’ll come up with. If you do come up with a nice improvement, though, do not hesitate to share your findings with the authors or publish them otherwise!

Thank you for all that advice, but we finally decided to build our own recursive-descent parser using exactly the same method as here: http://www.cs.binghamton.edu/~zdu/parsdemo/recintro.html
Indeed, we changed the grammar in order to remove the left-recursive rules, and because the grammar I showed in my first message isn't LL(1), we used our token list (made by our scanner) to look ahead further than one token. It seems to work quite well.
Now we have to build an AST within those recursive functions. Would you have any suggestions? Tips? Thank you very much.

The most efficient parsers are LR parsers, but LR parsers are a bit difficult to implement. You can go for the recursive-descent parsing technique, as it is easier to implement in C.

Related

Antlr Matlab grammar lexing conflict

I've been using the Antlr Matlab grammar from Antlr grammars
I found out I need to implement the ' Matlab operator. It is the complex conjugate transpose operator, used as such
result = input'
I tried a straightforward solution of adding it to unary_expression as an option postfix_expression '\''
However, this failed to parse when multiple of these operators were used on a single line.
Here's a significantly simplified version of the grammar, still exhibiting the exact problem:
grammar Grammar;
unary_expression
: IDENTIFIER
| unary_expression '\''
;
translation_unit : unary_expression CR ;
STRING_LITERAL : '\'' [a-z]* '\'' ;
IDENTIFIER : [a-zA-Z] ;
CR : [\r\n] + ;
Test cases, being parsed as translation_unit:
"x''\n" //fails getNumberOfSyntaxErrors returns 1
"x'\n" //passes
The failure also prints the message line 1:1 extraneous input '''' expecting CR to stderr.
The failure goes away if I either remove STRING_LITERAL, or change the * to +. Neither is a proper solution of course, as removing it is entirely off the table, and mandating non-empty strings is not quite correct, though I might be able to live with it. Also, forcing non-empty string does nothing to help the real use case, when the input is something like x' + y' instead of using the operator twice.
For some reason removing CR from the grammar and \n from the tests also makes the parsing run without problems, but yet again is not a useable solution.
What can I do to the grammar to make it work correctly? I'm assuming it's a problem with lexing specifically because removing STRING_LITERAL or making it unable to match '' makes it go away.
The lexer can never be made that context aware I think, but I don't know Matlab well enough to be sure. How could you check during tokenisation that these single quotes are operators:
x' + y';
while these are strings:
x = 'x' + ' + y';
?
Maybe you can do something similar as how in ECMAScript a / can be a division operator or a regex delimiter. In this grammar that is handled by a predicate in the lexer that uses some target code to check this.
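To show what such a context check could look like, here is a minimal sketch of a hand-written tokenizer (in Python, not ANTLR target code) that decides what a single quote means from the previous significant token. The rule it uses is an assumption for illustration: a quote directly after an identifier, a number, a closing bracket or another transpose is the transpose operator; anywhere else it opens a string. Real Matlab has more cases (doubled quotes inside strings, whitespace inside matrix brackets) that this sketch ignores, and the token names are made up.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<WS>[ \t]+)
  | (?P<CR>[\r\n]+)
  | (?P<ID>[A-Za-z]\w*)
  | (?P<NUM>\d+)
  | (?P<QUOTE>')
  | (?P<OP>.)
""", re.VERBOSE)

# Token kinds after which a quote is read as the transpose operator.
AFTER_OPERAND = {"ID", "NUM", "TRANSPOSE", "CLOSE"}

def tokenize(text):
    tokens = []
    pos = 0
    prev = None                      # kind of the last non-whitespace token
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        kind, value = m.lastgroup, m.group()
        if kind == "QUOTE":
            if prev in AFTER_OPERAND:
                kind = "TRANSPOSE"
            else:
                end = text.find("'", pos + 1)
                if end < 0:
                    raise SyntaxError("unterminated string at offset %d" % pos)
                kind, value = "STRING", text[pos:end + 1]
        elif kind == "OP" and value in ")]}":
            kind = "CLOSE"
        pos += len(value)
        if kind != "WS":
            tokens.append((kind, value))
            prev = kind
    return tokens
```

With this rule `x''` lexes as an identifier followed by two transposes, and `x = 'x' + ' + y';` lexes with two string tokens. In ANTLR the same decision would have to live in a lexer predicate or a custom `nextToken` override that remembers the previous token.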
If something like the above is not possible, I see no other way than to "promote" the creation of strings to the parser. That would mean removing STRING_LITERAL and introducing a parser rule that matches something like this:
string_literal
: QUOTE ~(QUOTE | CR)* QUOTE
;
// Needed to match characters inside strings
OTHER
: .
;
However, that will fail when a string like 'hi there' is encountered: the space in between hi and there will now be skipped by the WS rule. So WS should also be removed (spaces will then get matched by the OTHER rule). But now (of course) all spaces will litter the token stream and you'll have to account for them in all parser rules (not really a viable solution).
All in all: I don't see ANTLR as a suitable tool in this case. You might look into parser generators where there is no separation between tokenisation and parsing. Google for "PEG" and/or "scannerless parsing".

Does a priority declaration disambiguate between alternative lexicals?

In my previous question, there was a priority > declaration in the example. It turned out not to matter because the solution there did not actually invoke priority but rather avoided it by making the alternatives disjoint. In this question, I'm asking whether priority can be used to select one lexical production over another. In the example below, the language of the production WordInitialDigit is intentionally a subset of that of WordAny. The production Word looks like it should disambiguate between the two properly, but the resulting parse tree has an ambiguity node at the top. Is a priority declaration able to decide between different lexical reductions, or does it require there to be a basis of common lexical elements? Or something else?
The example is contrived (there are no actions in the grammar), but the situations it arises from are not. For example, I'd like to use something like this for error recovery, where I can recognize a natural boundary for a unit of syntax and write a production for it. This generic production would be the last element in a priority chain; if it reduces, it means that there was no valid parse. More generally, I need to be able to select lexical elements based on syntactic context. I had hoped, since Rascal is scannerless, that this would be seamless. Perhaps it is, though I don't see it at the moment.
I'm on the unstable branch, version 0.10.0.201807050853.
EDIT: This question is not about > for defining an expression grammar. The documentation for priority declarations talks mostly about expressions, but the very first sentence provides what looks like a perfectly clear definition:
Priority declarations define a partial ordering between the productions within a single non-terminal.
So the example has two productions, an ordering declared between them, and yet the parser is still generating an ambiguity node in the clear presence of a disambiguation rule. So to put a finer point on my question, it looks like I don't know which of two situations pertains. Either (1) if this isn't supposed to work, then there's a defect in the language definition as documented, a deficiency in error reporting of the compiler, and a language design decision that's somewhere between counter-intuitive and user-hostile. Or (2) if this is supposed to work, there's a defect in the compiler and/or parser (presumably because the focus was initially on expressions) and at some point the example will pass its tests.
module ssce
import analysis::grammars::Ambiguity;
import ParseTree;
import IO;
import String;
lexical WordChar = [0-9A-Za-z] ;
lexical Digit = [0-9] ;
lexical WordInitialDigit = Digit WordChar* !>> WordChar;
lexical WordAny = WordChar+ !>> WordChar;
syntax Word =
WordInitialDigit
> WordAny
;
test bool WordInitialDigit_0() = parseAccept( #Word, "4foo" );
test bool WordInitialDigit_1() = parseAccept( #WordInitialDigit, "4foo" );
test bool WordInitialDigit_2() = parseAccept( #WordAny, "4foo" );
bool verbose = false;
bool parseAccept( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return false;
}
catch Ambiguity(loc l, str a, str b):
{
if (verbose)
{
println("[Ambiguity] #<a>, \"<b>\"");
Tree tt = parse(begin, input, allowAmbiguity=true) ;
iprintln(tt);
list[Message] m = diagnose(tt) ;
println( ToString(m) );
}
fail;
}
return true;
}
bool parseReject( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return true;
}
return false;
}
str ToString( list[Message] msgs ) =
( ToString( msgs[0] ) | it + "\n" + ToString(m) | m <- msgs[1..] );
str ToString( Message msg)
{
switch(msg)
{
case error(str s, loc _): return "error: " + s;
case warning(str s, loc _): return "warning: " + s;
case info(str s, loc _): return "info: " + s;
}
return "";
}
Excellent questions.
TL;DR:
the rule priority mechanism is not capable of an algorithmic ordering of a non-terminal's alternatives. Although some kind of partial order is involved in the additional grammatical constraints that a priority declaration generates, there is no "trying" one rule first, before the other. So it simply can't do that. The good news is that the priority mechanism has a formal semantics independent of any parsing algorithm, it's just defined in terms of context-free grammar rules and reduction traces.
using ambiguous rules for error recovery or "robust parsing" is a good idea. However, if there are too many such rules, the parser will eventually start showing quadratic or even cubic behavior, and tree building after parsing might have even higher polynomials. I believe the generated parser algorithm should have a (parameterized) mode for error recovery rather than expressing this at the grammar level.
Accepting ambiguity at parse time, and filtering/choosing trees after parsing, is the recommended way to go.
All this talk of "ordering" in the documentation is misleading. Disambiguation is a minefield of confusing terminology. For now, I recommend this SLE paper which has some definitions: https://homepages.cwi.nl/~jurgenv/papers/SLE2013-1.pdf
Details
priority mechanism not capable of choosing among alternatives
The use of the > operator and left, right generates a partial order between mutually recursive rules, such as found in expression languages, and is limited to specific item positions in each rule: namely the left-most and right-most recursive positions which overlap. Rules which are lower in the hierarchy are not allowed to be grammatically expanded as "children" of rules which are higher in the hierarchy. So in E "*" E, neither E may be expanded to E "+" E if E "*" E > E "+" E.
The additional constraints do not choose for any E which alternative to try first. No, they simply disallow certain expansions, assuming the other expansion is still valid and thus the ambiguity is solved.
The reason for the limitation to specific positions is that for these positions the parser generator can "prove" that they will generate ambiguity, and thus filtering one of the two alternatives by disallowing certain nestings will not result in additional parse errors. (Consider a rule for array indexing: E "[" E "]", which should not have additional constraints for the second E.) This is a so-called "syntax-safe" disambiguation mechanism.
All in all it is a pretty weak mechanism algorithmically, and specifically tailored for mutually recursive combinator/expression-like languages. The end goal of the mechanism is to make sure we have to use only one non-terminal for the entire expression language, with the parse trees looking very much akin in shape to abstract syntax trees. Rascal inherited all these considerations from SDF, via SDF2, by the way.
Current implementations actually "factor" the grammar or the parse table in some fashion invisibly to get the same effect, as if somebody had factored the grammar completely; however, these implementations under the hood are very specific to the parsing algorithm in question. The GLR version is quite different from the GLL version, which again is quite different from the DataDependent version.
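A small sketch may help to see "disallowing expansions" as opposed to "trying alternatives in order". The Python below is only a model, not how Rascal is implemented: `trees` enumerates every binary tree for a flat operand/operator sequence, and `allowed` drops the trees in which a forbidden (parent, child) operator nesting occurs. The function names and the tuple tree shape are assumptions for this example.

```python
def trees(tokens):
    """All binary parse trees for a list like ["1", "*", "2", "+", "3"]."""
    if len(tokens) == 1:
        return [tokens[0]]
    result = []
    for i in range(1, len(tokens), 2):          # every operator can be the root
        for left in trees(tokens[:i]):
            for right in trees(tokens[i + 1:]):
                result.append((tokens[i], left, right))
    return result

def allowed(tree, forbidden):
    """True if no (parent operator, child operator) pair is in `forbidden`."""
    if not isinstance(tree, tuple):
        return True
    op, left, right = tree
    for child in (left, right):
        if isinstance(child, tuple) and (op, child[0]) in forbidden:
            return False
    return allowed(left, forbidden) and allowed(right, forbidden)

# E "*" E > E "+" E: a "+" node may not be a direct child of a "*" node.
FORBIDDEN = {("*", "+")}

candidates = trees(["1", "*", "2", "+", "3"])
survivors = [t for t in candidates if allowed(t, FORBIDDEN)]
```

Here `candidates` holds two trees and `survivors` only `("+", ("*", "1", "2"), "3")`. For `1 + 2 + 3` the same filter keeps both trees, because no priority pair is involved; that remaining ambiguity is what the associativity declarations are for. Nothing in this model ranks one alternative of a non-terminal above another, which is why a priority between two lexical alternatives such as WordInitialDigit and WordAny has no effect.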
Post-parse filtering
Of course any tree, including ambiguous parse forests produced by the parser, can be manipulated by Rascal programs using pattern matching, visit, etc. You could write any algorithm to remove the trees you want. However, this requires the entire forest to be constructed first. It's possible and often fast enough, but there is a faster alternative.
Since the tree is built in a bottom-up fashion from the parse graph after parsing, we can also apply "rewrite rules" during the construction of the tree, and remove certain alternatives.
For example:
Tree amb({Tree a, *Tree others}) = amb(others) when weDoNotWant(a);
Tree amb({Tree a}) = a;
The first rule would match on the ambiguity cluster for all trees, and remove all alternatives which weDoNotWant. The second rule removes the cluster if only one alternative is left and lets the last tree "win".
If you want to choose among alternatives:
Tree amb({Tree a, Tree b, *Tree others}) = amb({a, *others}) when weFindPreferable(a, b);
If you don't want to use Tree but a more specific non-terminal like Statement that should also work.
This example module uses #prefer tags in syntax definitions to "prefer" rules which have been tagged over the other rules, as post-parse rewrite rules:
https://github.com/usethesource/rascal/blob/master/src/org/rascalmpl/library/lang/sdf2/filters/PreferAvoid.rsc
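For readers who do not know Rascal, the effect of such rewrite rules can be modelled in a few lines of Python. This is a sketch under assumptions: `Amb` stands in for an ambiguity cluster, trees are plain tuples whose first element is the production name, and `unwanted` plays the role of weDoNotWant. It is not the Rascal API.

```python
class Amb:
    """Stand-in for an ambiguity cluster holding alternative trees."""
    def __init__(self, alternatives):
        self.alternatives = list(alternatives)

def resolve(tree, unwanted):
    """Bottom-up: drop unwanted alternatives, collapse single-alternative clusters."""
    if isinstance(tree, Amb):
        alts = [resolve(a, unwanted) for a in tree.alternatives]
        kept = [a for a in alts if not unwanted(a)]
        if len(kept) == 1:
            return kept[0]           # the last remaining tree "wins"
        return Amb(kept)             # still ambiguous (or empty if all were dropped)
    if isinstance(tree, tuple):
        return tuple(resolve(child, unwanted) for child in tree)
    return tree

# The cluster from the question: "4foo" derived both ways.
forest = ("Word", Amb([("WordInitialDigit", "4foo"), ("WordAny", "4foo")]))
result = resolve(forest, lambda t: isinstance(t, tuple) and t[0] == "WordAny")
```

`result` is `("Word", ("WordInitialDigit", "4foo"))`. The point of doing this during tree construction, as the rewrite rules do, is that the discarded alternatives never have to be built in full.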
Hacking around with additional lexical constraints
Next to priority disambiguation and post-parse rewriting, we still have the lexical level disambiguation mechanisms in the toolkit:
NT \ Keywords - rejecting finite (keyword) languages from a non-terminal
CC << NT, NT >> CC, CC !<< NT, NT !>> CC - follow and precede restrictions (where CC stands for character class and NT for non-terminal)
Solving other kinds of ambiguity apart from the operator precedence stuff can be tried with these, in particular if the length of different sub-sentences is shorter/longer between the different alternatives, !>> can do the "maximal munch" or "longest match" thing. So I was thinking out loud:
lexical C = A? B?;
where A is one lexical alternative and B is the other. With the proper !>> restrictions on A and !<< restrictions on B the grammar might be tricked into always wanting to put all characters in A, unless they don't fit into A as a language, in which case they would default to B.
The obvious/annoying advice
Think harder about an unambiguous and simpler grammar.
Sometimes this means abstracting and allowing more sentences in the grammar, avoiding use of the grammar for "type checking" the tree. It's often better to over-approximate the syntax of the language and then use (static) semantic analysis (over simpler trees) to get what you want, rather than staring at a complex ambiguous grammar.
A typical example: C blocks with declarations only at the start are much harder to define unambiguously than C blocks where declarations are allowed everywhere. And for a C90 mode, all you have to do is flag declarations which are not at the start of a block.
This particular example
lexical WordChar = [0-9A-Za-z] ;
lexical Digit = [0-9] ;
lexical WordInitialDigit = Digit WordChar* !>> WordChar;
lexical WordAny = WordChar+ !>> WordChar;
syntax Word =
WordInitialDigit
| [0-9] !<< WordAny // this would help!
;
wrap up
Great question, thanks for the patience. Hope this helps!
The > disambiguation mechanism is for recursive definitions, like for example an expression grammar.
So it's to solve the following ambiguity:
syntax E
= [0-9]+
| E "+" E
| E "-" E
;
The string 1 + 3 - 4 can be parsed as either 1 + (3 - 4) or (1 + 3) - 4.
The > gives an order to this grammar, which production should be at the top of the tree.
layout L = " "*;
syntax E
= [0-9]+
| E "+" E
> E "-" E
;
this now only allows the (1 + 3) - 4 tree.
To finish this story, how about 1 + 1 + 1? That could be 1 + (1 + 1) or (1 + 1) + 1.
This is what we have left, right, and non-assoc for. They define how recursion in the same production should be handled.
syntax E
= [0-9]+
| left E "+" E
> left E "-" E
;
will now enforce: (1 + 1) + 1.
When you take an operator precedence table, like for example this C operator precedence table, you can almost literally copy it.
Note that these two disambiguation features are not exactly opposite to each other. The first ambiguity could also have been solved by putting both productions in a left group like this:
syntax E
= [0-9]+
| left (
E "+" E
| E "-" E
)
;
As the left side of the tree is favored, 1 + 3 - 4 still parses as (1 + 3) - 4, but 1 - 3 + 4 now parses as (1 - 3) + 4 instead of 1 - (3 + 4). So it makes a difference, but it all depends on what you want.
More details can be found in the tutor pages on disambiguation

yacc shift-reduce for ambiguous lambda syntax

I'm writing a grammar for a toy language in Yacc (the one packaged with Go) and I have an expected shift-reduce conflict due to the following pseudo-issue. I have distilled the problem grammar down to the following.
start:
stmt_list
expr:
INT | IDENT | lambda | '(' expr ')' { $$ = $2 }
lambda:
'(' params ')' '{' stmt_list '}'
params:
expr | params ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
A lambda function looks something like this:
map((v) { v * 2 }, collection)
My parser emits:
conflicts: 1 shift/reduce
Given the input:
(a)
It correctly parses an expr by the '(' expr ')' rule. However given an input of:
(a) { a }
(Which would be a lambda for the identity function, returning its input). I get:
syntax error: unexpected '{'
This is because when (a) is read, the parser is choosing to reduce it as '(' expr ')', rather than consider it to be '(' params ')'. Given this conflict is a shift-reduce and not a reduce-reduce, I'm assuming this is solvable. I just don't know how to structure the grammar to support this syntax.
EDIT | It's ugly, but I'm considering defining a token so that the lexer can recognize the ')' '{' sequence and send it through as a single token to resolve this.
EDIT 2 | Actually, better still, I'll make lambdas require syntax like ->(a, b) { a * b} in the grammar, but have the lexer emit the -> rather than it being in the actual source code.
Your analysis is indeed correct; although the grammar is not ambiguous, it is impossible for the parser to decide with the input reduced to ( <expr> and with lookahead ) whether or not the expr should be reduced to params before shifting the ) or whether the ) should be shifted as part of a lambda. If the next token were visible, the decision could be made, so the grammar is LR(2), which is outside of the competence of go/yacc.
If you were using bison, you could easily solve this problem by requesting a GLR parser, but I don't believe that go/yacc provides that feature.
There is an LR(1) grammar for the language (there is always an LR(1) grammar corresponding to any LR(k) grammar for any value of k) but it is rather annoying to write by hand. The essential idea of the LR(k) to LR(1) transformation is to shift the reduction decisions k-1 tokens forward by accumulating k-1 tokens of context into each production. So in the case that k is 2, each production P: N → α will be replaced with productions TNU → Tα U for each T in FIRST(α) and each U in FOLLOW(N). [See Note 1] That leads to a considerable blow-up of non-terminals in any non-trivial grammar.
Rather than pursuing that idea, let me propose two much simpler solutions, both of which you seem to be quite close to.
First, in the grammar you present, the issue really is simply the need for a two-token lookahead when the two tokens are ){. That could easily be detected in the lexer, and leads to a solution which is still hacky but a simpler hack: Return ){ as a single token. You need to deal with intervening whitespace, etc., but it doesn't require retaining any context in the lexer. This has the added bonus that you don't need to define params as a list of exprs; they can just be a list of IDENT (if that's relevant; a comment suggests that it isn't).
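A minimal sketch of that lexer hack, written as a post-processing pass over an already tokenized stream so that intervening whitespace and comments are gone by the time it runs. The function name and the use of plain strings as tokens are assumptions for illustration; with go/yacc the same logic would sit inside the lexer's `Lex` method.

```python
def fuse_rparen_lbrace(tokens):
    """Replace each ')' immediately followed by '{' with a single '){' token."""
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] == ")" and i + 1 < len(tokens) and tokens[i + 1] == "{":
            out.append("){")
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

The lambda rule then reads '(' params '){' stmt_list '}', and one token of lookahead is enough because the parser sees either `)` or `){` after the parenthesized list.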
The alternative, which I think is a bit cleaner, is to extend the solution you already seem to be proposing: accept a little too much and reject the errors in a semantic action. In this case, you might do something like:
start:
stmt_list
expr:
INT
| IDENT
| lambda
| '(' expr_list ')'
{ // If $2 has more than one expr, report error
$$ = $2
}
lambda:
'(' expr_list ')' '{' stmt_list '}'
{ // If anything in expr_list is not a valid param, report error
$$ = make_lambda($2, $5)
}
expr_list:
expr | expr_list ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
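The two checks named in the action comments above can be sketched like this (in Python rather than Go, to keep it short). The AST node shape, a tuple whose first element is the node kind, and the helper name `paren_expr` are assumptions; `make_lambda` is the name used in the grammar, but its body here is hypothetical.

```python
def paren_expr(expr_list):
    """Action for '(' expr_list ')': only a single expression is legal here."""
    if len(expr_list) != 1:
        raise SyntaxError("parenthesized expression must contain exactly one expression")
    return expr_list[0]

def make_lambda(expr_list, body):
    """Action for the lambda rule: every list element must be a bare identifier."""
    params = []
    for node in expr_list:
        if node[0] != "ident":
            raise SyntaxError("lambda parameter must be an identifier, got %s" % node[0])
        params.append(node[1])
    return ("lambda", params, body)
```

The grammar accepts slightly too much, such as `(a, 1) { ... }` or `(a, b)` on its own, and these checks reject exactly that surplus with error messages that can be more specific than a generic syntax error.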
Notes
1. That's only an outline; the complete algorithm includes the mechanism to recover the original parse tree. If k is greater than 2 then T and U are strings in the FIRST_{k-1} and FOLLOW_{k-1} sets.
If it really is a shift-reduce conflict, and you want only the shift behavior, your parser generator may give you a way to prefer a shift vs. a reduce. This is classically how the conflict for grammar rules for "if-then-stmt" and "if-then-stmt-else-stmt" is resolved, when the if statement can also be a statement.
See http://www.gnu.org/software/bison/manual/html_node/Shift_002fReduce.html
You can get this effect two ways:
a) Count on the accidental behavior of the parsing engine.
If an LALR parser handles shifts first, and then reductions if there are no shifts, then you'll get this "prefer shift" for free. All the parser generator has to do is build the parse tables anyway, even if there is a detected conflict.
b) Enforce the accidental behavior. Design (or get) a parser generator that accepts "prefer shift on token T". Then one can suppress the ambiguity. One still has to implement the parsing engine as in a), but that's pretty easy.
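As a rough sketch of what "prefer shift" means at table-construction time: suppose the generator has already computed, for every (state, token) pair, the list of candidate actions. The function below is hypothetical (the name `build_action_table`, the tuple encoding of actions and the `prefer` parameter are all made up for this example); it keeps the table deterministic by picking the shift whenever a shift and a reduce compete, and it records where it did so.

```python
def build_action_table(candidates, prefer="shift"):
    """candidates: {(state, token): [("shift", state) | ("reduce", rule), ...]}"""
    table = {}
    conflicts = []
    for key, actions in candidates.items():
        if len(actions) == 1:
            table[key] = actions[0]
            continue
        shifts = [a for a in actions if a[0] == "shift"]
        reduces = [a for a in actions if a[0] == "reduce"]
        if shifts and reduces:
            conflicts.append(key)                # report it, but keep going
            table[key] = shifts[0] if prefer == "shift" else reduces[0]
        else:
            # reduce/reduce conflicts are not covered by this policy
            raise ValueError("unresolved conflict at %r" % (key,))
    return table, conflicts

# Dangling else: after "if e then s", seeing "else" allows both actions.
candidates = {
    (5, "else"): [("reduce", 3), ("shift", 6)],
    (5, "$"): [("reduce", 3)],
}
table, conflicts = build_action_table(candidates)
```

After this call `table[(5, "else")]` is the shift, so the else attaches to the nearest if, and `conflicts` lists the single place where the policy was applied.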
I think this is easier/cleaner than abusing the lexer to make strange tokens (and that doesn't always work anyway).
Obviously, you could make a preference for reductions to turn it the other way. With some extra hacking, you could make shift-vs-reduce specific to the state in which the conflict occurred; you can even make it specific to the pair of conflicting rules, but now the parsing engine needs to keep preference data around for nonterminals. That still isn't hard. Finally, you could add a predicate for each nonterminal which is called when a shift-reduce conflict is about to occur, and have it provide a decision.
The point is you don't have to accept "pure" LALR parsing; you can bend it easily in a variety of ways, if you are willing to modify the parser generator/engine a little bit. This gives a really good reason to understand how these tools work; then you can abuse them to your benefit.

SQL Parser Disambiguation

I have written a very simple SQL Parser for a very small subset of the language to handle a one time specific problem. I had to translate an extremely large amount of old SQL expressions into an intermediate form that could then possibly be brought into a business rule system. The initial attempt worked for about 80% of the existing data.
I looked at some commercial solutions but thought I could do this pretty easy based on some past experience and some reading. I hit a problem and decided to go and finish the task with a commercial solution, I know when to admit defeat. However I am still curious as to how to handle this or what I may have done wrong.
My initial solution was based on a simple recursive descent parser, found in many books and online articles, producing an Abstract Syntax Tree and then during the analysis phase, I would determine type differences and whether logical expressions were being mixed with algebraic expressions and such.
I referenced the ANTLR SQLite grammar by Bart Kiers
https://github.com/bkiers/sqlite-parser
I also referenced an online SQL grammar site
http://savage.net.au/SQL/
The main question is how to make the parser differentiate between the following
expr AND expr
BETWEEN expr AND expr
The problem I am encountering is when I hit the following unit test case
case when PP_ID between '009000' and '009999' then 'MA' when PP_ID between '001000' and '001999' then 'TL' else 'LA' end
The '009000' and '009999' is matched as a Binary Expression so the parser throws an error expecting the keyword AND but instead encounters THEN.
The online ANSI grammar actually breaks down expressions into finer grained productions and I suspect that is the proper approach. I am also wondering if my parser should detect if an expression is actually Boolean vs. Algebraic during the parse phase and not the semantic phase, and use that information to handle the above case.
I am sure I could brute force the solution but I want to learn the correct way to handle this.
Thanks for any help offered.
I also ran into this problem while developing a Jison (Bison) parser for SQLite, and solved it with two different rules in the grammar for binary operations: one for AND and one for BETWEEN (this is a Jison grammar):
%left BETWEEN   // AND is declared later, so it has higher priority than BETWEEN
%left AND
expr
    : expr AND expr        // Rule for AND
        { $$ = {op: 'AND', left: $1, right: $3}; }
    | expr BETWEEN expr    // Rule for BETWEEN: the right operand must be an AND node
        {
            if ($3.op != 'AND') throw new Error('Wrong syntax of BETWEEN AND');
            $$ = {op: 'BETWEEN', expr: $1, left: $3.left, right: $3.right};
        }
    ;
The parser then checks the right-hand expression and accepts it only if it is an AND operation. Maybe this approach can help you.
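To make the BETWEEN action above easier to follow outside of Jison, here is a minimal Python sketch of the same check. The function name `make_between` and the dict-based node layout are my own assumptions for illustration; they simply mirror the `$$ = {...}` objects in the grammar.

```python
def make_between(subject, right):
    """Mirror of the BETWEEN action: `right` must be the node built for `low AND high`."""
    if not (isinstance(right, dict) and right.get("op") == "AND"):
        raise SyntaxError("Wrong syntax of BETWEEN AND")
    # Unpack the AND node into the two bounds of the BETWEEN node.
    return {
        "op": "BETWEEN",
        "expr": subject,
        "left": right["left"],
        "right": right["right"],
    }


# The AND node the parser would have built for: '009000' AND '009999'
bounds = {"op": "AND", "left": "'009000'", "right": "'009999'"}
node = make_between("PP_ID", bounds)
```

One thing to watch for, if I read the precedence declarations correctly: with a left-associative AND, an input such as x BETWEEN 1 AND 2 AND y would probably reach the action as a nested AND chain, so the lower bound would itself be an AND node. The check on `op` alone does not catch that.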
In the ANTLR grammar I found the following rule (see the grammar by Bart Kiers):
expr
:
| expr K_AND expr
| expr K_NOT? K_BETWEEN expr K_AND expr
;
But I am not sure that it works properly.
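Since the question mentions a hand-written recursive descent parser, here is a minimal sketch of the "finer grained productions" idea in Python. It is not the asker's parser or the ANSI grammar: the tokenizer, the `Parser` class, and the tuple-shaped nodes are all assumptions made for illustration, and CASE/WHEN/THEN are left out. The point is only that the bounds of BETWEEN are parsed one level below AND, so they cannot swallow the AND keyword.

```python
import re

# Assumed token shapes: quoted strings, identifiers/keywords, integers, '='.
TOKEN = re.compile(r"\s*('[^']*'|[A-Za-z_][A-Za-z_0-9]*|\d+|=)")
KEYWORDS = {"AND", "OR", "BETWEEN"}


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        tok = m.group(1)
        tokens.append(tok.upper() if tok.upper() in KEYWORDS else tok)
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected}, got {tok}")
        self.i += 1
        return tok

    def parse_or(self):         # or_expr := and_expr (OR and_expr)*
        node = self.parse_and()
        while self.peek() == "OR":
            self.take()
            node = ("OR", node, self.parse_and())
        return node

    def parse_and(self):        # and_expr := predicate (AND predicate)*
        node = self.parse_predicate()
        while self.peek() == "AND":
            self.take()
            node = ("AND", node, self.parse_predicate())
        return node

    def parse_predicate(self):  # predicate := operand [BETWEEN operand AND operand | '=' operand]
        left = self.parse_operand()
        if self.peek() == "BETWEEN":
            self.take()
            low = self.parse_operand()   # operand level, so it stops before AND
            self.take("AND")             # this AND belongs to BETWEEN
            high = self.parse_operand()
            return ("BETWEEN", left, low, high)
        if self.peek() == "=":
            self.take()
            return ("=", left, self.parse_operand())
        return left

    def parse_operand(self):    # operand := identifier | literal
        tok = self.take()
        if tok in KEYWORDS or tok == "=":
            raise SyntaxError(f"unexpected {tok}")
        return tok


def parse(text):
    parser = Parser(tokenize(text))
    node = parser.parse_or()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()}")
    return node
```

With this layering, `parse("PP_ID between '009000' and '009999'")` yields a single BETWEEN node, and a following logical AND (as in `a between 1 and 2 and b = 3`) is still handled by `parse_and`. No type information is needed during parsing; the Boolean versus algebraic check can stay in the semantic phase.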

Bison: how to fix reduce/reduce conflict

Below is a Bison grammar which illustrates my problem. The actual grammar that I'm using is more complicated.
%glr-parser
%%
s : e | p '=' s;
p : fp | p ',' fp;
fp : 'x';
e : te | e ';' te;
te : fe | te ',' fe;
fe : 'x';
Some examples of input would be:
x
x = x
x,x = x,x
x,x = x;x
x,x,x = x,x;x,x
x = x,x = x;x
What I'm after is for the x's on the left side of an '=' to be parsed differently than those on the right. However, the set of legal "expressions" which may appear on the right of an '='-sign is larger than those on the left (because of the ';').
Bison prints the message (input file was test.y):
test.y: conflicts: 1 reduce/reduce.
There must be some way around this problem. In C, you have a similar situation. The program below passes through gcc with no errors.
int main(void) {
    int x;
    int *px;
    x;
    *px;
    *px = x = 1;
}
In this case, the 'px' and 'x' get treated differently depending on whether they appear to the left or right of an '='-sign.
You're using %glr-parser, so there's no need to "fix" the reduce/reduce conflict. Bison just tells you there is one so that you know your grammar might be ambiguous, in which case you might need to add ambiguity resolution with %dprec or %merge directives. But in your case, the grammar is not ambiguous, so you don't need to do anything.
A conflict is NOT an error; it's just an indication that your grammar is not LALR(1).
The reduce-reduce conflict in your grammar comes from the context:
... = ... x ,
At this point, the parser has to decide whether x is an fe or an fp, and it cannot know with one symbol of lookahead. Indeed, it cannot know with any finite lookahead: you could have any number of repetitions of "x ," following that point without encountering a =, a ; or the end of the input, any of which would reveal the answer.
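To see why no fixed lookahead is enough, here is a small Python sketch that counts how many tokens past the first x have to be inspected before the fp/fe question is settled. The function name and the token-list representation are assumptions for illustration, and the input is assumed to be well formed; this is not how Bison works internally.

```python
def tokens_until_decided(tokens, start=0):
    """Scan the comma list that begins at tokens[start] (an 'x').

    Returns (kind, lookahead): kind is 'fp' if the list turns out to be the
    left side of '=', otherwise 'fe'; lookahead is the number of tokens after
    the first 'x' that had to be inspected (end of input counts as one).
    """
    i = start + 1
    while i < len(tokens) and tokens[i] in ("x", ","):
        i += 1
    kind = "fp" if i < len(tokens) and tokens[i] == "=" else "fe"
    return kind, i - start


def comma_list_then(terminator, extra_items):
    """Build 'x , x , ... <terminator> x' with extra_items repetitions of ', x'."""
    return ["x"] + [",", "x"] * extra_items + [terminator, "x"]


for n in (0, 1, 5, 50):
    kind, lookahead = tokens_until_decided(comma_list_then("=", n))
    print(n, kind, lookahead)   # lookahead grows as 2 * n + 1
```

Each additional ", x" pushes the deciding token two positions further away, so an LR(k) parser fails for any fixed k, while a GLR parser just carries both possibilities along until the deciding token arrives.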
This is not quite the same as the C issue, which can be resolved with single symbol lookahead. However, the C example is a classic illustration of why SLR(1) grammars are less powerful than LALR(1) grammars -- it's used for that purpose in the dragon book -- and a similarly problematic grammar is an example of the difference between LALR(1) and LR(1); it can be found in the bison manual:
def: param_spec return_spec ',';
param_spec: type | name_list ':' type;
return_spec: type | name ':' type;
type: "id";
name: "id";
name_list: name | name ',' name_list;
(The bison manual explains how to resolve this issue for LALR(1) grammars, although using a GLR parser is always a possibility.)
The key to resolving such conflicts without using a GLR parser is to avoid forcing the parser to make premature decisions.
For example, it is traditional to distinguish syntactically between lvalues and rvalues, and some languages continue to do so. C and C++ do not, however; and this turns out to be an extremely powerful feature in C++ because it allows the definition of functions which can act as lvalues.
In C, I think it's just to simplify the grammar a bit: the C grammar allows any unary expression to appear on the left-hand side of an assignment operator, but unary expressions are actually a mix of lvalues (*v, v[expr]) and rvalues (sizeof v, f(expr)). The grammar could have distinguished between the two kinds of unary expressions, but it could not express the actual restriction, which is that only modifiable lvalues may appear on the left side of an assignment operator.
C++ allows an arbitrary expression to appear on the left-hand side of an assignment operator (although some need to be parenthesized); consequently, the following is totally legal:
(predicate(x) ? *some_pointer : some_variable) = 42;
In your case, you could resolve the conflict syntactically by replacing te with p, since both non-terminals produce the same set of derivations. That's probably not the general solution, unless it is really the case in your full grammar that left-side expressions are a strict subset of right-side expressions. In a full grammar, you might end up with three types of expression (left-only, right-only, common), which could considerably complicate the grammar, and leaving the resolution for semantic analysis might prove to be easier (and even, as in the case of C++, surprisingly useful).
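As a rough illustration of "not deciding too early", here is a hand-written Python sketch of the grammar with te replaced by p, that is, s : e | p '=' s; e : p | e ';' p; p : 'x' | p ',' 'x'. It is a recursive descent analogue rather than Bison output, and the function names and tuple-shaped results are assumptions for illustration. The comma list is parsed once, without committing to "left side" or "right side", and the next token then settles its role.

```python
def tokenize(text):
    # Every token in this toy language is a single non-space character.
    return [ch for ch in text if not ch.isspace()]


def _parse_list(tokens, pos):
    """p : 'x' | p ',' 'x'   Returns (number of x's, next position)."""
    if pos >= len(tokens) or tokens[pos] != "x":
        raise SyntaxError(f"expected 'x' at token {pos}")
    count, pos = 1, pos + 1
    while pos < len(tokens) and tokens[pos] == ",":
        if pos + 1 >= len(tokens) or tokens[pos + 1] != "x":
            raise SyntaxError(f"expected 'x' at token {pos + 1}")
        count, pos = count + 1, pos + 2
    return count, pos


def _parse_s(tokens, pos):
    """s : e | p '=' s   with   e : p | e ';' p"""
    count, pos = _parse_list(tokens, pos)        # role not decided yet
    if pos < len(tokens) and tokens[pos] == "=":
        # Only now do we know the list was an assignment target.
        rhs, pos = _parse_s(tokens, pos + 1)
        return ("assign", ("targets", count), rhs), pos
    groups = [count]                             # otherwise it starts an e
    while pos < len(tokens) and tokens[pos] == ";":
        count, pos = _parse_list(tokens, pos + 1)
        groups.append(count)
    return ("expr", groups), pos


def parse_s(text):
    tokens = tokenize(text)
    tree, pos = _parse_s(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at {pos}")
    return tree
```

For example, `parse_s("x,x = x;x")` returns `("assign", ("targets", 2), ("expr", [1, 1]))`, while `parse_s("x;x = x")` is rejected because a ';' list cannot be an assignment target. In Bison the same effect comes from the shared non-terminal: the parser reduces to p either way and lets the following '=' or ';' pick the enclosing rule.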

Resources