Does a priority declaration disambiguate between alternative lexicals? - rascal

In my previous question, there was a priority > declaration in the example. It turned out not to matter because the solution there did not actually invoke priority but rather avoided it by making the alternatives disjoint. In this question, I'm asking whether priority can be used to select one lexical production over another. In the example below, the language of the production WordInitialDigit is intentionally a subset of that of WordAny. The production Word looks like it should disambiguate between the two properly, but the resulting parse tree has an ambiguity node at the top. Is a priority declaration able to decide between different lexical reductions, or does it require there to be a basis of common lexical elements? Or something else?
The example is contrived (there are no actions in the grammar), but the situations it arises from are not. For example, I'd like to use something like this for error recovery, where I can recognize a natural boundary for a unit of syntax and write a production for it. This generic production would be the last element in a priority chain; if it reduces, it means that there was no valid parse. More generally, I need to be able to select lexical elements based on syntactic context. I had hoped, since Rascal is scannerless, that this would be seamless. Perhaps it is, though I don't see it at the moment.
I'm on the unstable branch, version 0.10.0.201807050853.
EDIT: This question is not about > for defining an expression grammar. The documentation for priority declarations talks mostly about expressions, but the very first sentence provides what looks like a perfectly clear definition:
Priority declarations define a partial ordering between the productions within a single non-terminal.
So the example has two productions, an ordering declared between them, and yet the parser is still generating an ambiguity node in the clear presence of a disambiguation rule. So to put a finer point on my question, it looks like I don't know which of two situations pertains. Either (1) if this isn't supposed to work, then there's a defect in the language definition as documented, a deficiency in error reporting of the compiler, and a language design decision that's somewhere between counter-intuitive and user-hostile. Or (2) if this is supposed to work, there's a defect in the compiler and/or parser (presumably because the focus was initially on expressions) and at some point the example will pass its tests.
module ssce
import analysis::grammars::Ambiguity;
import ParseTree;
import IO;
import String;
lexical WordChar = [0-9A-Za-z] ;
lexical Digit = [0-9] ;
lexical WordInitialDigit = Digit WordChar* !>> WordChar;
lexical WordAny = WordChar+ !>> WordChar;
syntax Word =
WordInitialDigit
> WordAny
;
test bool WordInitialDigit_0() = parseAccept( #Word, "4foo" );
test bool WordInitialDigit_1() = parseAccept( #WordInitialDigit, "4foo" );
test bool WordInitialDigit_2() = parseAccept( #WordAny, "4foo" );
bool verbose = false;
bool parseAccept( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return false;
}
catch Ambiguity(loc l, str a, str b):
{
if (verbose)
{
println("[Ambiguity] #<a>, \"<b>\"");
Tree tt = parse(begin, input, allowAmbiguity=true) ;
iprintln(tt);
list[Message] m = diagnose(tt) ;
println( ToString(m) );
}
fail;
}
return true;
}
bool parseReject( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return true;
}
return false;
}
str ToString( list[Message] msgs ) =
( ToString( msgs[0] ) | it + "\n" + ToString(m) | m <- msgs[1..] );
str ToString( Message msg)
{
switch(msg)
{
case error(str s, loc _): return "error: " + s;
case warning(str s, loc _): return "warning: " + s;
case info(str s, loc _): return "info: " + s;
}
return "";
}

Excellent questions.
TL;DR:
the rule priority mechanism is not capable of an algorithmic ordering of a non-terminal's alternatives. Although some kind of partial order is involved in the additional grammatical constraints that a priority declaration generates, there is no "trying" one rule first, before the other. So it simply can't do that. The good news is that the priority mechanism has a formal semantics independent of any parsing algorithm, it's just defined in terms of context-free grammar rules and reduction traces.
using ambiguous rules for error recovery or "robust parsing", is a good idea. However, if there are too many such rules, the parser will eventually start showing quadratic or even cubic behavior, and tree building after parsing might even have higher polynomials. I believe the generated parser algorithm should have a (parameterized) mode for error recovery rather then expressing this at the grammar level.
Accepting ambiguity at parse time, and filtering/choosing trees after parsing is the recommended way to go.
All this talk of "ordering" in the documentation is misleading. Disambiguation is minefield of confusing terminology. For now, I recommend this SLE paper which has some definitions: https://homepages.cwi.nl/~jurgenv/papers/SLE2013-1.pdf
Details
priority mechanism not capable of choosing among alternatives
The use of the > operator and left, right generates a partial order between mutually recursive rules, such as found in expression languages, and limited to specific item positions in each rule: namely the left-most and right-most recursive positions which overlap. Rules which are lower in the hierarchy are not allowed to be grammatically expanded as "children" of rules which are higher in the hierarchy. So in E "*" E, neither E may be expaned to E "+" E if E "*" E > E "+" E.
The additional constraints do not choose for any E which alternative to try first. No they simply disallow certain expansions, assuming the other expansion is still valid and thus the ambiguity is solved.
The reason for the limitation at specific positions is that for these positions the parser generator can "prove" that they will generate ambiguity, and thus filtering one of the two alternatives by disallowing certain nestings will not result in additional parse errors. (consider a rule for array indexing: E "[" E "]" which should not have additional constraints for the second E. This is a so-called "syntax-safe" disambiguation mechanism.
All and all it is a pretty weak mechanism algorithmically, and specifically tailored for mutually recursive combinator/expression-like languages. The end-goal of the mechanism is to make sure we use have to use only 1 non-terminal for the entire expression language, and the parse trees looking very much akin in shape to abstract syntax trees. Rascal inherited all these considerations from SDF, via SDF2, by the way.
Current implementations actually "factor" the grammar or the parse table in some fashion invisibly to get the same effect, as-if somebody would have factored the grammar completely; however these implementations under-the-hood are very specific to the parsing algorithm in question. the GLR version is quite different from the GLL version, which again is quite different from the DataDependent version.
Post-parse filtering
Of course any tree, including ambiguous parse forests produced by the parser, can be manipulated by Rascal programs using pattern matching, visit, etc. You could write any algorithm to remove the trees you want. However, this requires the entire forest to be constructed first. It's possible and often fast enough, but there is a faster alternative.
Since the tree is built in a bottom-up fashion from the parse graph after parsing, we can also apply "rewrite rules" during the construction of the tree, and remove certain alternatives.
For example:
Tree amb({Tree a, *Tree others}) = amb(others) when weDoNotWant(a);
Tree amb({Tree a}) = a;
This first rule would match on the ambiguity cluster for all trees, and remove all alternatives which weDoNotWant. The second rule removes the cluster if only one alternative is left and let's the last tree "win".
If you want to choose among alternatives:
Tree amb({Tree a, Tree b, *Tree others}) = amb({a, others} when weFindPeferable(a, b);
If you don't want to use Tree but a more specific non-terminal like Statement that should also work.
This example module uses #prefer tags in syntax definitions to "prefer" rules which have been tagged over the other rules, as post-parse rewrite rules:
https://github.com/usethesource/rascal/blob/master/src/org/rascalmpl/library/lang/sdf2/filters/PreferAvoid.rsc
Hacking around with additional lexical constraints
Next to priority disambiguation and post-parse rewriting, we still have the lexical level disambiguation mechanisms in the toolkit:
`NT \ Keywords" - rejecting finite (keyword) languages from a non-terminals
CC << NT, NT >> CC, CC !<< NT, NT !>> CC follow and preceede restrictions (where CC stands for character-class and NT for non-terminal)
Solving other kinds of ambiguity apart from the operator precedence stuff can be tried with these, in particular if the length of different sub-sentences is shorter/longer between the different alternatives, !>> can do the "maximal munch" or "longest match" thing. So I was thinking out loud:
lexical C = A? B?;
where A is one lexical alternative and B is the other. With the proper !>> restrictions on A and !<< restrictions on B the grammar might be tricked into always wanting to put all characters in A, unless they don't fit into A as a language, in which case they would default to B.
The obvious/annoying advice
Think harder about an unambiguous and simpler grammar.
Sometimes this means to abstract and allow more sentences in the grammar, avoiding use of the grammar for "type checking" the tree. It's often better to over-approximate the syntax of the language and then use (static) semantic analysis (over simpler trees) to get what you want, rather then staring at a complex ambiguous grammar.
A typical example: C blocks with declarations only at the start are much harder to define unambiguously then C blocks where declarations are allowed everywhere. And for a C90 mode, all you have to do is flag declarations which are not at the start of a block.
This particular example
lexical WordChar = [0-9A-Za-z] ;
lexical Digit = [0-9] ;
lexical WordInitialDigit = Digit WordChar* !>> WordChar;
lexical WordAny = WordChar+ !>> WordChar;
syntax Word =
WordInitialDigit
| [0-9] !<< WordAny // this would help!
;
wrap up
Great question, thanks for the patience. Hope this helps!

The > disambiguation mechanism is for recursive definitions, like for example a expression grammar.
So it's to solve the following ambiguity:
syntax E
= [0-9]+
| E "+" E
| E "-" E
;
The string 1 + 3 - 4 can not be parsed as 1 + (3 - 4) or (1 + 3) - 4.
The > gives an order to this grammar, which production should be at the top of the tree.
layout L = " "*;
syntax E
= [0-9]+
| E "+" E
> E "-" E
;
this now only allows the (1 + 3) - 4 tree.
To finish this story, how about 1 + 1 + 1? That could be 1 + (1 + 1) or (1 + 1) + 1.
This is what we have left, right, and non-assoc for. They define how recursion in the same production should be handled.
syntax E
= [0-9]+
| left E "+" E
> left E "-" E
;
will now enforce: 1 + (1 + 1).
When you take an operator precendence table, like for example this c operator precedance table you can almost literally copy them.
note that these two disambiguation features are not exactly opposite to each other. the first ambiguitity could also have been solved by putting both productions in a left group like this:
syntax E
= [0-9]+
| left (
E "+" E
| E "-" E
)
;
As the left side of the tree is favored, you will now get a different tree 1 + (3 - 4). So it makes a difference, but it all depends on what you want.
More details can be found in the tutor pages on disambiguation

Related

Using midaction rules in Lemon to interpret "let" expression

I'm trying to write a "toy" interpreter using Flex + Lemon that supports a very basic "let" syntax where a variable X is temporarily bound to an expression. For example, "letx 3 + 4 in x + 8" should evaluate to 15.
In essence, what I'd "like" the rule to say is:
expr(E) ::= LETX expr(N) IN expr(O). {
environment->X = N;
E = O;
}
But that won't work since O is evaluated before the X = N assignment is made.
I understand that the usual solution for this would be a mid-rule action. Lemon doesn't explicitly support this, but I've read elsewhere that would just be syntactic sugar in any event.
So I've tried to put together a mid-rule action that would do my assignment of X = N before interpreting O:
midruleaction ::= /* mid rule */. { environment->X = N; }
expr(E) ::= LETX expr(N) IN midruleaction expr(O). { E = O; }
But that won't work because there's no way for the midruleaction rule to access N, or at least none I can see in the lemon docs/examples.
I think I'm missing something here. I know I could build up a tree and then walk it in a second pass. And I might end up doing that, but I'd like to understand how to solve this more directly first.
Any suggestions?
It's really not a very scalable solution to evaluate immediately in a parser. See below.
It is true that mid-rule actions are (mostly) syntactic sugar. However, in most cases they are not syntactic sugar for "markers" (non-terminals with empty right-hand sides) but rather for non-terminals representing production prefixes. For example, you could write your letx rule like this:
expr(E) ::= letx_prefix IN expr(O). { E = O; }
letx_prefix ::= LETX expr(N). { environment->X = N; }
Or you could do this:
expr(E) ::= LETX assigned_expr IN expr(O). { E = O; }
assigned_expr ::= expr(N). { environment->X = N; }
The first one is the prefix desugaring; the second one is the one I'd use because I feel that it separates concerns better. The important point is that the environment->X = N; action requires access to the semantic values of the prefix of the RHS, so it must be part of a prefix rule (which includes at least the symbols whose semantic values are requires), rather than a marker, which has access to no semantic values at all.
Having said all that, immediate evaluation during parsing is a very limited strategy. It cannot cope with a large range of constructs which require deferred evaluation, such as loops and function definitions. It cannot cleanly cope with constructs which may suppress evaluation, such as conditionals and short-circuit operators. (These can be handled using MRAs and a stateful environment which contains an evaluation-suppressed flag, but that's very ugly.)
Another problem is that syntactically incorrect expressions may be partially evaluated before the syntax error is discovered, and it may not be immediately obvious to the user which parts of the expression have and have not been evaluated.
On the whole, you're better off building an easily-evaluated AST during the parse, and evaluating the AST when the parse successfully completes.

Left recursion parsing

Description:
While reading Compiler Design in C book I came across the following rules to describe a context-free grammar:
a grammar that recognizes a list of one or more statements, each of
which is an arithmetic expression followed by a semicolon. Statements are made up of a
series of semicolon-delimited expressions, each comprising a series of numbers
separated either by asterisks (for multiplication) or plus signs (for addition).
And here is the grammar:
1. statements ::= expression;
2. | expression; statements
3. expression ::= expression + term
4. | term
5. term ::= term * factor
6. | factor
7. factor ::= number
8. | (expression)
The book states that this recursive grammar has a major problem. The right hand side of several productions appear on the left-hand side as in production 3 (And this property is called left recursion) and certain parsers such as recursive-descent parser can't handle left-recursion productions. They just loop forever.
You can understand the problem by considering how the parser decides to apply a particular production when it is replacing a non-terminal that has more than one right hand side. The simple case is evident in Productions 7 and 8. The parser can choose which production to apply when it's expanding a factor by looking at the next input symbol. If this symbol is a number, then the compiler applies Production 7 and replaces the factor with a number. If the next input symbol was an open parenthesis, the parser
would use Production 8. The choice between Productions 5 and 6 cannot be solved in this way, however. In the case of Production 6, the right-hand side of term starts with a factor which, in tum, starts with either a number or left parenthesis. Consequently, the
parser would like to apply Production 6 when a term is being replaced and the next input symbol is a number or left parenthesis. Production 5-the other right-hand side-starts with a term, which can start with a factor, which can start with a number or left parenthesis, and these are the same symbols that were used to choose Production 6.
Question:
That second quote from the book got me completely lost. So by using an example of some statements as (for example) 5 + (7*4) + 14:
What's the difference between factor and term? using the same example
Why can't a recursive-descent parser handle left-recursion productions? (Explain second quote).
What's the difference between factor and term? using the same example
I am not giving the same example as it won't give you clear picture of what you have doubt about!
Given,
term ::= term * factor | factor
factor ::= number | (expression)
Now,suppose if I ask you to find the factors and terms in the expression 2*3*4.
Now,multiplication being left associative, will be evaluated as :-
(2*3)*4
As you can see, here (2*3) is the term and factor is 4(a number). Similarly you can extend this approach upto any level to draw the idea about term.
As per given grammar, if there's a multiplication chain in the given expression, then its sub-part,leaving a single factor, is a term ,which in turn yields another sub-part---the another term, leaving another single factor and so on. This is how expressions are evaluated.
Why can't a recursive-descent parser handle left-recursion productions? (Explain second quote).
Your second statement is quite clear in its essence. A recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the productions of the grammar.
It is said so because it's clear that recursive descent parser will go into infinite loop if the non-terminal keeps on expanding into itself.
Similarly, talking about a recursive descent parser,even with backtracking---When we try to expand a non-terminal, we may eventually find ourselves again trying to expand the same non-terminal without having consumed any input.
A-> Ab
Here,while expanding the non-terminal A can be kept on expanding into
A-> AAb -> AAAb -> ... -> infinite loop of A.
Hence, we prevent left-recursive productions while working with recursive-descent parsers.
The rule factor matches the string "1*3", the rule term does not (though it would match "(1*3)". In essence each rule represents one level of precedence. expression contains the operators with the lowest precedence, factor the second lowest and term the highest. If you're in term and you want to use an operator with lower precedence, you need to add parentheses.
If you implement a recursive descent parser using recursive functions, a rule like a ::= b "*" c | d might be implemented like this:
// Takes the entire input string and the index at which we currently are
// Returns the index after the rule was matched or throws an exception
// if the rule failed
parse_a(input, index) {
try {
after_b = parse_b(input, index)
after_star = parse_string("*", input, after_b)
after_c = parse_c(input, after_star)
return after_c
} catch(ParseFailure) {
// If one of the rules b, "*" or c did not match, try d instead
return parse_d(input, index)
}
}
Something like this would work fine (in practice you might not actually want to use recursive functions, but the approach you'd use instead would still behave similarly). Now, let's consider the left-recursive rule a ::= a "*" b | c instead:
parse_a(input, index) {
try {
after_a = parse_a(input, index)
after_star = parse_string("*", input, after_a)
after_b = parse_c(input, after_star)
return after_b
} catch(ParseFailure) {
// If one of the rules a, "*" or b did not match, try c instead
return parse_c(input, index)
}
}
Now the first thing that the function parse_a does is to call itself again at the same index. This recursive call will again call itself. And this will continue ad infinitum, or rather until the stack overflows and the whole program comes crashing down. If we use a more efficient approach instead of recursive functions, we'll actually get an infinite loop rather than a stack overflow. Either way we don't get the result we want.

bison: a specific number of recursions?

I've been writing a parser with flex and bison for a few weeks now and have ground to a halt on account of a double recursion, the definitions of which are similar for the first few rules. Bison always chooses the wrong path at one particular stage and crashes because the grammar doesn't fit. The bison code looks a little like this:
set :
TOKEN_ /* token */
QString
QString
Integer /* number of descrs (see below) */
M_op /*'M' optional*/
alts;
and
alts :
alt | alts alt ;
alt :
QString
pName_op /* empty | TOKEN1 QString */
deVal_op /* empty | TOKEN2 Integer */
descrs
;
and
descrs :
descr | descrs descr ;
descr :
QString
QString_op /* optional qstring */
Integer
D_op /* optional 'D' */
Bison stays in the descrs recursion and never exits it to progress to the next alt. The integer that is read in in the initial block, however, tells us how many instances of descr are going to come. So my question is this:
Is there a way of preparing bison for a specific number of instances of the recursion so that he can exit this recursion and enter the recursion "above"? I can access this integer in the C code, but I'm not aware of syntax for said move, something like a descrs : {for (int i=0;i<n;++i){descr}} (I'm aware that probably looks ridiculous)
Failing this, is there any other way around this problem?
Any input would be much appreciated. Thanks in advance.
A context-free grammar cannot be contingent on semantic information. Yet, that is precisely what you are seeking: you wish the value of a numeric token to be taken into account in the syntax of an expression.
As a request, that's not unreasonable or immoral; it's simply outside of the reach of context-free grammars. And bison is intended to create parsers for context-free grammars. So it's simply not the correct tool for this problem.
Having said that, it is possible to use bison in this manner, if you are using a reasonably recent version of bison which includes support for GLR grammars. Bison`s GLR support includes the option of using semantic predicates to control the parse. (See the bison manual for details.) A solution based on that mechanism is possible, and probably not too complicated.
Much easier -- if the grammar allows for it -- would be to use a top-down parser. Parsing a number and then that number of descrs would be trivial in a recursive-descent parser, for example.
The liberal use of FOO_op non-terminals in the grammar suggests that top-down parsing would not be problematic, but it is impossible to say for sure without seeing the entire grammar. Artificial non-terminals (like FOO_op) often cause shift-reduce conflicts in LR(1) languages, because they force an immediate shift/reduce decision to be made. In an LR(1) language, a production of the form: A → ω B? χ
would normally be rendered as the pair of productions A → ω B χ; A → ω χ, rather than the substitution Bop → B | ε; A → ω Bop χ, in order to avoid creating conflicts with other productions of the form C → ω ζ where FIRST(ζ) ∩ FIRST(B ∪ ω) ≠ ∅.

Bison: how to fix reduce/reduce conflict

Below is a a Bison grammar which illustrates my problem. The actual grammar that I'm using is more complicated.
%glr-parser
%%
s : e | p '=' s;
p : fp | p ',' fp;
fp : 'x';
e : te | e ';' te;
te : fe | te ',' fe;
fe : 'x';
Some examples of input would be:
x
x = x
x,x = x,x
x,x = x;x
x,x,x = x,x;x,x
x = x,x = x;x
What I'm after is for the x's on the left side of an '=' to be parsed differently than those on the right. However, the set of legal "expressions" which may appear on the right of an '='-sign is larger than those on the left (because of the ';').
Bison prints the message (input file was test.y):
test.y: conflicts: 1 reduce/reduce.
There must be some way around this problem. In C, you have a similar situation. The program below passes through gcc with no errors.
int main(void) {
int x;
int *px;
x;
*px;
*px = x = 1;
}
In this case, the 'px' and 'x' get treated differently depending on whether they appear to the left or right of an '='-sign.
You're using %glr-parser, so there's no need to "fix" the reduce/reduce conflict. Bison just tells you there is one, so that you know you grammar might be ambiguous, so you might need to add ambiguity resolution with %dprec or %merge directives. But in your case, the grammar is not ambiguous, so you don't need to do anything.
A conflict is NOT an error, its just an indication that your grammar is not LALR(1).
The reduce-reduce conflict in your grammar comes from the context:
... = ... x ,
At this point, the parser has to decide whether x is an fe or an fp, and it cannot know with one symbol lookahead. Indeed, it cannot know with any finite lookahead, you could have any number of repetitions of x , following that point without encountering a =, ; or the end of the input, any of which would reveal the answer.
This is not quite the same as the C issue, which can be resolved with single symbol lookahead. However, the C example is a classic illustration of why SLR(1) grammars are less powerful than LALR(1) grammars -- it's used for that purpose in the dragon book -- and a similarly problematic grammar is an example of the difference between LALR(1) and LR(1); it can be found in the bison manual (here):
def: param_spec return_spec ',';
param_spec: type | name_list ':' type;
return_spec: type | name ':' type;
type: "id";
name: "id";
name_list: name | name ',' name_list;
(The bison manual explains how to resolve this issue for LALR(1) grammars, although using a GLR grammar is always a possibility.)
The key to resolving such conflicts without using a GLR grammar is to avoid forcing the parser to make premature decisions.
For example, it is traditional to distinguish syntactically between lvalues and rvalues, and some languages continue to do so. C and C++ do not, however; and this turns out to be an extremely powerful feature in C++ because it allows the definition of functions which can act as lvalues.
In C, I think it's just to simplify the grammar a bit: the C grammar allows the result of any unary operator to appear on the left hand side of an assignment operator, but unary operators are actually a mix of lvalues (*v, v[expr]) and rvalues (sizeof v, f(expr)). The grammar could have distinguished between the two kinds of unary operators, but it could not resolve the actual restriction, which is that only modifiable lvalues may appears on the left side of an assignment operator.
C++ allows an arbitrary expression to appear on the left-hand side of an assignment operator (although some need to be parenthesized); consequently, the following is totally legal:
(predicate(x) ? *some_pointer : some_variable) = 42;
In your case, you could resolve the conflict syntactically by replacing te with p, since both non-terminals produce the same set of derivations. That's probably not the general solution, unless it is really the case in your full grammar that left-side expressions are a strict subset of right-side expressions. In a full grammar, you might end up with three types of expression (left-only, right-only, common), which could considerably complicated the grammar, and leaving the resolution for semantic analysis might prove to be easier (and even, as in the case of C++, surprisingly useful).

Parsing, which method choose?

I'm working on a compiler (language close to C) and I've to implement it in C. My main question is how to choose the right parsing method in order to be efficient while coding my compiler.
Here's my current grammar:
http://img11.hostingpics.net/pics/273965Capturedcran20130417192526.png
I was thinking about making a top-down parser LL(1) as described here: http://dragonbook.stanford.edu/lecture-notes/Stanford-CS143/07-Top-Down-Parsing.pdf
Could it be an efficient choice considering this grammar, knowing that I first have to remove the left recursive rules. Do you have any other advices?
Thank you,
Mentinet
Lots of answers here, but they get things confused. Yes, there are LL and LR parsers, but that isn't really your choice.
You have a grammar. There are tools that automatically create a parser for you given a grammar. The venerable Yacc and Bison do this. They create an LR parser (LALR, actually). There are also tools that create an LL parser for you, like ANTLR. The downsides of tools like this are they inflexible. Their automatically generated syntax error messages suck, error recovery is hard and the older ones encourage you to factor your code in one way - which happens to be the wrong way. The right way is to have your parser spit out an Abstract Syntax Tree, and then have the compiler generate code from that. The tools want you to mix parser and compiler code.
When you are using automated tools like this the differences in power between LL, LR and LALR really does matter. You can't "cheat" to extend their power. (Power in this case means being able to generate a parser for valid context free grammar. A valid context free grammar is one that generates a unique, correct parse tree for every input, or correctly says it doesn't match the grammar.) We currently have no parser generator that can create parser for every valid grammar. However LR can handle more grammars than any other sort. Not being able to handle a grammar isn't a disaster as you can re-write the grammar in a form the parser generator can accept. However, it isn't always obvious how that should be done, and worse it effects the Abstract Syntax Tree generated which means weaknesses in the parser ripple down through the rest of your code - like the compiler.
The reason there are LL, LALR and LR parsers is a long time ago, the job of generating a LR parser was taxing for a modern computer both in terms of time and memory. (Note this is the it takes the generate the parser, which only happens when you write it. The generated parser runs very quickly.) But that was a looong time ago. Generating a LR(1) parser takes far less than 1GB of RAM for a moderately complex language and on a modern computer takes under a second. For that reason you are far better off with an LR automatic parser generator, like Hyacc.
The other option is you write your own parser. In this case there is only one choice: an LL parser. When people here say writing LR is hard, they understate the case. It is near impossible for a human to manually create an LR parser. You might think this means if you write your own parser you are constrained to use LL(1) grammars. But that isn't quite true. Since you are writing the code, you can cheat. You can lookahead an arbitrary number of symbols, and because you don't have to output anything till you are good and ready the Abstract Syntax Tree doesn't have to match the grammar you are using. This ability to cheat makes up for all of lost power between LL and LR(1), and often then some.
Hand written parsers have their downsides of course. There is no guarantee that your parser actually matches your grammar, or for that matter no checking if your grammar is valid (ie recognises the language you think it does). They are longer, and they are even worse at encouraging you to mix parsing code with compile code. They are also obviously implemented in only one language, whereas a parser generator often spit out their results in several different languages. Even if they don't, an LR parse table can be represented in a data structure containing only constants (say in JSON), and the actual parser is only 100 lines of codes or so. But there are also upsides to hand written parser. Because you wrote the code, you know what is going on, so it is easier to do error recovery and generate sane error messages.
In the end, the tradeoff often works like this:
For one off jobs, you are far better using a LR(1) parser generator. The generator will check your grammar, save you work, and modern ones split out the Abstract Syntax Tree directly, which is exactly what you want.
For highly polished tools like mcc or gcc, use a hand written LL parser. You will be writing lots of unit tests to guard your back anyway, error recovery and error messages are much easier to get right, and they can recognise a larger class of languages.
The only other question I have is: why C? Compilers aren't generally time critical code. There are very nice parsing packages out there that will allow you to get the job done in 1/2 the code if you willing to have your compiler run a fair bit slower - my own Lrparsing for instance. Bear in mind a "fair bit slower" here means "hardly noticeable to a human". I guess the answer is "the assignment I am working on specifies C". To give you an idea, here is how simple getting from your grammar to parse tree becomes when you relax the requirement. This program:
#!/usr/bin/python
from lrparsing import *
class G(Grammar):
Exp = Ref("Exp")
int = Token(re='[0-9]+')
id = Token(re='[a-zA-Z][a-zA-Z0-9_]*')
ActArgs = List(Exp, ',', 1)
FunCall = id + '(' + Opt(ActArgs) + ')'
Exp = Prio(
id | int | Tokens("[]", "False True") | Token('(') + List(THIS, ',', 1, 2) + ')' |
Token("! -") + THIS,
THIS << Tokens("* / %") << THIS,
THIS << Tokens("+ -") << THIS,
THIS << Tokens("== < > <= >= !=") << THIS,
THIS << Tokens("&&") << THIS,
THIS << Tokens("||") << THIS,
THIS << Tokens(":") << THIS)
Type = (
Tokens("", "Int Bool") |
Token('(') + THIS + ',' + THIS + ')' |
Token('[') + THIS + ']')
Stmt = (
Token('{') + THIS * Many + '}' |
Keyword("if") + '(' + Exp + ')' << THIS + Opt(Keyword('else') + THIS) |
Keyword("while") + '(' + Exp + ')' + THIS |
id + '=' + Exp + ';' |
FunCall + ';' |
Keyword('return') + Opt(Exp) + ';')
FArgs = List(Type + id, ',', 1)
RetType = Type | Keyword('void')
VarDecl = Type + id + '=' + Exp + ';'
FunDecl = (
RetType + id + '(' + Opt(FArgs) + ')' +
'{' + VarDecl * Many + Stmt * Some + '}')
Decl = VarDecl | FunDecl
Prog = Decl * Some
COMMENTS = Token(re="/[*](?:[^*]|[*][^/])*[*]/") | Token(re="//[^\n\r]*")
START = Prog
EXAMPLE = """\
Int factorial(Int n) {
Int result = 1;
while (n > 1) {
result = result * n;
n = n - 1;
}
return result;
}
"""
parse_tree = G.parse(EXAMPLE)
print G.repr_parse_tree(parse_tree)
Produces this output:
(START (Prog (Decl (FunDecl
(RetType (Type 'Int'))
(id 'factorial') '('
(FArgs
(Type 'Int')
(id 'n')) ')' '{'
(VarDecl
(Type 'Int')
(id 'result') '='
(Exp (int '1')) ';')
(Stmt 'while' '('
(Exp
(Exp (id 'n')) '>'
(Exp (int '1'))) ')'
(Stmt '{'
(Stmt
(id 'result') '='
(Exp
(Exp (id 'result')) '*'
(Exp (id 'n'))) ';')
(Stmt
(id 'n') '='
(Exp
(Exp (id 'n')) '-'
(Exp (int '1'))) ';') '}'))
(Stmt 'return'
(Exp (id 'result')) ';') '}'))))
The most efficient way to build a parser is to use a specific tool which purpose of existance is to build parsers. They used to be called compiler compilers, but nowadays the focus has shifted (broadened) to language workbenches which provide you with more aid to build your own language. For instance, almost any language workbench would provide you with IDE support and syntax highlighting for your language right off the bat, just by looking at a grammar. They also help immensely with debugging your grammar and your language (you didn’t expect left recursion to be the biggest of your problems, did you?).
Among the best currently supported and developing language workbenches one could name:
Rascal
Spoofax
MPS
MetaEdit+
Xtext
If you really so inclined, or if you consider writing a parser yourself just for amusement and experience, the best modern algorithms are SGLR, GLL and Packrat. Each one of those is a quintessence of algorithmic research that lasted half a century, so do not expect to understand them fully in a flash, and do not expect any good to come out of the first couple of “fixes” you’ll come up with. If you do come up with a nice improvement, though, do not hesitate to share your findings with the authors or publish it otherwise!
Thank you for all those advices but we finally decided to build our own recursive-descent parser by using exactly the same method as here: http://www.cs.binghamton.edu/~zdu/parsdemo/recintro.html
Indeed, we changed the grammar in order to remove the left-recursive rules and because the grammar I showed in my first message isn't LL(1), we used our token list (made by our scanner) to proceed a lookahead which go further. It looks that it works quite well.
Now we have the build an AST within those recursive functions. Would you have any suggestions? Tips? Thank you very much.
The most efficient parsers are LR-Parsers and LR-parsers are bit difficult to implement .You can go for recursive descent parsing technique as it is easier to implement in C.

Resources