Can this be parsed by a LALR(1) parser? - parsing

I am writing a parser in Bison for a language which has the following constructs, among others:
self-dispatch: [identifier arguments]
dispatch: [expression . identifier arguments]
string slicing: expression[expression,expression] - similar to Python.
arguments is a comma-separated list of expressions, which can be empty too. All of the above are expressions on their own, too.
My problem is that I am not sure how to parse both [method [other_method]] and [someString[idx1, idx2].toInt] or if it is possible to do this at all with an LALR(1) parser.
To be more precise, let's take the following example: [a[b]] (call method a with the result of method b). When it reaches the state [a . [b]] (the lookahead is the second [), it won't know whether to reduce a (which has already been reduced to identifier) to expression because something like a[b,c] might follow (which could itself be reduced to expression and continue with the second construct from above) or to keep it identifier (and shift it) because a list of arguments will follow (such as [b] in this case).
Is this shift/reduce conflict due to the way I expressed this grammar or is it not possible to parse all of these constructs with an LALR(1) parser?
And, a more general question, how can one prove that a language is/is not parsable by a particular type of parser?

Assuming your grammar is unambiguous (which the part you describe appears to be) then your best bet is to specify a %glr-parser. Since in most cases, the correct parse will be forced after only a few tokens, the overhead should not be noticeable, and the advantage is that you do not need to complicate either the grammar or the construction of the AST.
The one downside is that bison cannot verify that the grammar is unambiguous -- in general, this is not possible -- and it is not easy to prove. If it turns out that some input is ambiguous, the GLR parser will generate an error, so a good test suite is important.
Proving that the language is not LR(1) would be tricky, and I suspect that it would be impossible because the language probably is recognizable with an LALR(1) parser. (Impossible to tell without seeing the entire grammar, though.) But parsing (outside of CS theory) needs to create a correct parse tree in order to be useful, and the sort of modifications required to produce an LR grammar will also modify the AST, requiring a post-parse fixup. The difficultly in creating a correct AST spring from the difference in precedence between
a[b[c],d]
and
[a[b[c],d]]
In the first (subset) case, b binds to its argument list [c] and the comma has lower precedence; in the end, b[c] and d are sibling children of the slice. In the second case (method invocation), the comma is part of the argument list and binds more tightly than the method application; b, [c] and d are siblings in a method application. But you cannot decide the shape of the parse tree until an arbitrarily long input (since d could be any expression).
That's all a bit hand-wavey since "precedence" is not formally definable, and there are hacks which could make it possible to adjust the tree. Since the LR property is not really composable, it is really possible to provide a more rigorous analysis. But regardless, the GLR parser is likely to be the simplest and most robust solution.
One small point for future reference: CFGs are not just a programming tool; they also serve the purpose of clearly communicating the grammar in question. Nirmally, if you want to describe your language, you are better off using a clear CFG than trying to describe informally. Of course, meaningful non-terminal names will help, and a few examples never hurt, but the essence of the grammar is in the formal description and omitting that makes it harder for others to "be helpful".

Related

Does context-sensitive tokenisation require multiple goal symbols in the lexical grammar?

According to the ECMAScript spec:
There are several situations where the identification of lexical input
elements is sensitive to the syntactic grammar context that is
consuming the input elements. This requires multiple goal symbols for
the lexical grammar.
Two such symbols are InputElementDiv and InputElementRegExp.
In ECMAScript, the meaning of / depends on the context in which it appears. Depending on the context, a / can either be a division operator, the start of a regex literal or a comment delimiter. The lexer cannot distinguish between a division operator and regex literal on its own, so it must rely on context information from the parser.
I'd like to understand why this requires the use of multiple goal symbols in the lexical grammar. I don't know much about language design so I don't know if this is due to some formal requirement of a grammar or if it's just convention.
Questions
Why not just use a single goal symbol like so:
InputElement ::
[...]
DivPunctuator
RegularExpressionLiteral
[...]
and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?
What are some other languages that use multiple goal symbols in their lexical grammar?
How would we classify the ECMAScript lexical grammar? It's not context-sensitive in the sense of the formal definition of a CSG (i.e. the LHS of its productions are not surrounded by a context of terminal and nonterminal symbols).
Saying that the lexical production is "sensitive to the syntactic grammar context that is consuming the input elements" does not make the grammar context-sensitive, in the formal-languages definition of that term. Indeed, there are productions which are "sensitive to the syntactic grammar context" in just about every non-trivial grammar. It's the essence of parsing: the syntactic context effectively provides the set of potentially expandable non-terminals, and those will differ in different syntactic contexts, meaning that, for example, in most languages a statement cannot be entered where an expression is expected (although it's often the case that an expression is one of the manifestations of a statement).
However, the difference does not involve different expansions for the same non-terminal. What's required in a "context-free" language is that the set of possible derivations of a non-terminal is the same set regardless of where that non-terminal appears. So the context can provide a different selection of non-terminals, but every non-terminal can be expanded without regard to its context. That is the sense in which the grammar is free of context.
As you note, context-sensitivity is usually abstracted in a grammar by a grammar with a pattern on the left-hand side rather than a single non-terminal. In the original definition, the context --everything other than the non-terminal to be expanded-- needed to be passed through the production untouched; only a single non-terminal could be expanded, but the possible expansions depend on the context, as indicated by the productions. Implicit in the above is that there are grammars which can be written in BNF which don't even conform to that rule for context-sensitivity (or some other equivalent rule). So it's not a binary division, either context-free or context-sensitive. It's possible for a grammar to be neither (and, since the empty context is still a context, any context-free grammar is also context-sensitive). The bottom line is that when mathematicians talk, the way they use words is sometimes unexpected. But it always has a clear underlying definition.
In formal language theory, there are not lexical and syntactic productions; just productions. If both the lexical productions and the syntactic productions are free of context, then the total grammar is free of context. From a practical viewpoint, though, combined grammars are harder to parse, for a variety of reasons which I'm not going to go into here. It turns out that it is somewhat easier to write the grammars for a language, and to parse them, with a division between lexical and syntactic parsers.
In the classic model, the lexical analysis is done first, so that the parser doesn't see individual characters. Rather, the syntactic analysis is done with an "alphabet" (in a very expanded sense) of "lexical tokens". This is very convenient -- it means, for example, that the lexical analysis can simply drop whitespace and comments, which greatly simplifies writing a syntactic grammar. But it also reduces generality, precisely because the syntactic parser cannot "direct" the lexical analyser to do anything. The lexical analyser has already done what it is going to do before the syntactic parser is aware of its needs.
If the parser were able to direct the lexical analyser, it would do so in the same way as it directs itself. In some productions, the token non-terminals would include InputElementDiv and while in other productions InputElementRegExp would be the acceptable non-terminal. As I noted, that's not context-sensitivity --it's just the normal functioning of a context-free grammar-- but it does require a modification to the organization of the program to allow the parser's goals to be taken into account by the lexical analyser. This is often referred to (by practitioners, not theorists) as "lexical feedback" and sometimes by terms which are rather less value neutral; it's sometimes considered a weakness in the design of the language, because the neatly segregated lexer/parser architecture is violated. C++ is a pretty intense example, and indeed there are C++ programs which are hard for humans to parse as well, which is some kind of indication. But ECMAScript does not really suffer from that problem; human beings usually distinguish between the division operator and the regexp delimiter without exerting any noticeable intellectual effort. And, while the lexical feedback required to implement an ECMAScript parser does make the architecture a little less tidy, it's really not a difficult task, either.
Anyway, a "goal symbol" in the lexical grammar is just a phrase which the authors of the ECMAScript reference decided to use. Those "goal symbols" are just ordinary lexical non-terminals, like any other production, so there's no difference between saying that there are "multiple goal symbols" and saying that the "parser directs the lexer to use a different production", which I hope addresses the question you asked.
Notes
The lexical difference in the two contexts is not just that / has a different meaning. If that were all that it was, there would be no need for lexical feedback at all. The problem is that the tokenization itself changes. If an operator is possible, then the /= in
a /=4/gi;
is a single token (a compound assignment operator), and gi is a single identifier token. But if a regexp literal were possible at that point (and it's not, because regexp literals cannot follow identifiers), then the / and the = would be separate tokens, and so would g and i.
Parsers which are built from a single set of productions are preferred by some programmers (but not the one who is writing this :-) ); they are usually called "scannerless parsers". In a scannerless parser for ECMAScript there would be no lexical feedback because there is no separate lexical analysis.
There really is a breach between the theoretical purity of formal language theory and the practical details of writing a working parser of a real-life programming language. The theoretical models are really useful, and it would be hard to write a parser without knowing something about them. But very few parsers rigidly conform to the model, and that's OK. Similarly, the things which are popularly calle "regular expressions" aren't regular at all, in a formal language sense; some "regular expression" operators aren't even context-free (back-references). So it would be a huge mistake to assume that some theoretical result ("regular expressions can be identified in linear time and constant space") is actually true of a "regular expression" library. I don't think parsing theory is the only branch of computer science which exhibits this dichotomy.
Why not just use a single goal symbol like so:
InputElement ::
...
DivPunctuator
RegularExpressionLiteral
...
and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?
Note that DivPunctuator and RegExLiteral aren't productions per se, rather they're nonterminals. And in this context, they're right-hand-sides (alternatives) in your proposed production for InputElement. So I'd rephrase your question as: Why not have the syntactic parser tell the lexical parser which of those two alternatives to use? (Or equivalently, which of those two to suppress.)
In the ECMAScript spec, there's a mechanism to accomplish this: grammatical parameters (explained in section 5.1.5).
E.g., you could define the parameter Div, where:
+Div means "a slash should be recognized as a DivPunctuator", and
~Div means "a slash should be recognized as the start of a RegExLiteral".
So then your production would become
InputElement[Div] ::
...
[+Div] DivPunctuator
[~Div] RegularExpressionLiteral
...
But notice that the syntactic parser still has to tell the lexical parser to use either InputElement[+Div] or InputElement[~Div] as the goal symbol, so you arrive back at the spec's current solution, modulo renaming.
What are some other languages that use multiple goal symbols in their lexical grammar?
I think most don't try to define a single symbol that derives all tokens (or input elements), let alone have to divide it up into variants like ECMAScript's InputElementFoo, so it might be difficult to find another language with something similar in its specification.
Instead, it's pretty common to simply define rules for the syntax of different kinds of tokens (e.g. Identifier, NumericLiteral) and then reference them from the syntactic productions. So that's kind of like having multiple lexical goal symbols, but not (I would say) in the sense you were asking about.
How would we classify the ECMAScript lexical grammar?
It's basically context-free, plus some extensions.

How do I convert PEG parser into ambiguous one?

As far as I understand, most languages are context-free with some exceptions. For instance, a * b may stand for type * pointer_declaration or multiplication in C++. Which one takes place depends on the context, the meaning of the first identifier. Another example is name production in VHDL
enum_literal ::= char_literal | identifer
physical_literal ::= [num] unit_identifier
func_call ::= func_identifier [parenthized_args]
array_indexing ::= arr_name (index_expr)
name ::= func_call | physical_literal | enum_litral | array_indexing
You see that syntactic forms are different but they can match if optional parameters are omitted, like f, does it stand for func_call, physical_literal, like 1 meter with optional amount 1 is implied, or enum_literal.
Talking to Scala plugin designers, I was educated to know that you build AST to re-evaluate it when dependencies change. There is no need to re-parse the file if you have its AST. AST also worth to display the file contents. But, AST is invalidated if grammar is context-sensitive (suppose that f was a function, defined in another file, but later user requalified it into enum literal or undefined). AST changes in this case. AST changes on whenever you change the dependencies. Another option, that I am asking to evaluate and let me know how to make it, is to build an ambiguous AST.
As far as I know, parser combinators are of PEG kind. They hide the ambiguity by returning you the first matched production and f would match a function call because it is the first alternative in my grammar. I am asking for a combinator that instead of falling back on the first success, it proceeds to the next alternative. In the end, it would return me a list of all matching alternatives. It would return me an ambiguity.
I do not know how would you display the ambiguous file contents tree to the user but it would eliminate the need to re-parse the dependent files. I would also be happy to know how modern language design solve this problem.
Once ambiguous node is parsed and ambiguity of results is returned, I would like the parser to converge because I would like to proceed parsing beyond the name and I do not want to parse to the end of file after every ambiguity. The situation is complicated by situations like f(10), which can be a function call with a single argument or a nullary function call, which return an array, which is indexed afterwards. So, f(10) can match name two ways, either as func_call directly or recursively, as arr_indexing -> name ~ (expr). So, it won't be ambiguity like several parallel rules, like fcall | literal. Some branches may be longer than 1 parser before re-converging, like fcall ~ (expr) | fcall.
How would you go about solving it? Is it possible to add ambiguating combinator to PEG?
First you claim that "most languages are context-free with some exceptions", this is not totally true. When designing a computer language, we mostly try to keep it as context-free as possible, since CFGs are the de-facto standard for that. It will ease a lot of work. This is not always feasible, though, and a lot[?] of languages depend on the semantic analysis phase to disambiguate any possible ambiguities.
Parser combinators do not use a formal model usually; PEGs, on the other hand, are a formalism for grammars, as are CFGs. On the last decade a few people have decided to use PEGs over CFGs due to two facts: PEGs are, by design, unambiguous, and they might always be parsed in linear time. A parser combinator library might use PEGs as underlying formalism, but might as well use CFGs or even none.
PEGs are attractive for designing computer languages because we usually do not want to handle ambiguities, which is something hard (or even impossible) to avoid when using CFGs. And, because of that, they might be parsed O(n) time by using dynamic programming (the so called packrat parser). It's not simple to "add ambiguities to them" for a few reasons, most importantly because the language they recognize depend on the fact that the options are deterministic, which is used for example when checking for lookahead. It isn't as simple as "just picking the first choice". For example, you could define a PEG:
S = "a" S "a" / "aa"
Which only parse sequences of N "a", where N is a power of 2. So it recognizes a sequence of 2, 4, 8, 16, 32, 64, etc, letter "a". By adding ambiguity, as a CFG would have, then you would recognize any even number of "a" (2, 4, 6, 8, 10, etc), which is a different language.
To answer your question,
How would you go about solving it? Is it possible to add ambiguating combinator to PEG?
First I must say that this is probably not a good idea. If you wish to keep ambiguity on the AST, you probably should use a CFG parser instead.
One could, for example, make a parser for PEGs which is similar to a parser for boolean grammars, but then our asymptotic parsing time would grow from O(n) to O(n3) by keeping all alternatives alive while keeping the same language. And we actually lose both good things about PEGs at once.
Another way would be to keep a packrat parser in memory, and transverse its table to handle the semantics from the AST. Not really a good idea either, since this would imply a large memory footprint.
Ideally, one should build an AST which already has information regarding possible ambiguities by changing the grammar structure. While this requires manual work, and usually isn't simple, you wouldn't have to go back a phase to check the grammar again.

How can a lexer extract a token in ambiguous languages?

I wish to understand how does a parser work. I learnt about the LL, LR(0), LR(1) parts, how to build, NFA, DFA, parse tables, etc.
Now the problem is, i know that a lexer should extract tokens only on the parser demand in some situation, when it's not possible to extract all the tokens in one separated pass. I don't exactly understand this kind of situation, so i'm open to any explanation about this.
The question now is, how should a lexer does its job ? should it base its recognition on the current "contexts", the current non-terminals supposed to be parsed ? is it something totally different ?
What about the GLR parsing : is it another case where a lexer could try different terminals, or is it only a syntactic business ?
I would also want to understand what it's related to, for example is it related to the kind of parsing technique (LL, LR, etc) or only the grammar ?
Thanks a lot
The simple answer is that lexeme extraction has to be done in context. What one might consider be lexemes in the language may vary considerably in different parts of the language. For example, in COBOL, the data declaration section has 'PIC' strings and location-sensitive level numbers 01-99 that do not appear in the procedure section.
The lexer thus to somehow know what part of the language is being processed, to know what lexemes to collect. This is often handled by having lexing states which each process some subset of the entire language set of lexemes (often with considerable overlap in the subset; e.g., identifiers tend to be pretty similar in my experience). These states form a high level finite state machine, with transitions between them when phase changing lexemes are encountered, e.g., the keywords that indicate entry into the data declaration or procedure section of the COBOL program. Modern languages like Java and C# minimize the need for this but most other languages I've encountered really need this kind of help in the lexer.
So-called "scannerless" parsers (you are thinking "GLR") work by getting rid of the lexer entirely; now there's no need for the lexer to produce lexemes, and no need to track lexical states :-} Such parsers work by simply writing the grammar down the level of individual characters; typically you find grammar rules that are the exact equivalent of what you'd write for a lexeme description. The question is then, why doesn't such a parser get confused as to which "lexeme" to produce? This is where the GLR part is useful. GLR parsers are happy to process many possible interpretations of the input ("locally ambiguous parses") as long as the choice gets eventually resolved. So what really happens in the case of "ambiguous tokens" is the the grammar rules for both "tokens" produce nonterminals for their respectives "lexemes", and the GLR parser continues to parse until one of the parsing paths dies out or the parser terminates with an ambiguous parse.
My company builds lots of parsers for languages. We use GLR parsers because they are very nice for handling complex languages; write the context-free grammar and you have a parser. We use lexical-state based lexeme extractors with the usual regular-expression specification of lexemes and lexical-state-transitions triggered by certain lexemes. We could arguably build scannerless GLR parsers (by making our lexers produce single characters as tokens :) but we find the efficiency of the state-based lexers to be worth the extra trouble.
As practical extensions, our lexers actually use push-down-stack automata for the high level state machine rather than mere finite state machines. This helps when one has high level FSA whose substates are identical, and where it is helpful for the lexer to manage nested structures (e.g, match parentheses) to manage a mode switch (e.g., when the parentheses all been matched).
A unique feature of our lexers: we also do a little tiny bit of what scannerless parsers do: sometimes when a keyword is recognized, our lexers will inject both a keyword and an identifier into the parser (simulates a scannerless parser with a grammar rule for each). The parser will of course only accept what it wants "in context" and simply throw away the wrong alternative. This gives us an easy to handle "keywords in context otherwise interpreted as identifiers", which occurs in many, many languages.
Ideally, the tokens themselves should be unambiguous; you should always be able to tokenise an input stream without the parser doing any additional work.
This isn't always so simple, so you have some tools to help you out:
Start conditions
A lexer action can change the scanner's start condition, meaning it can activate different sets of rules.
A typical example of this is string literal lexing; when you parse a string literal, the rules for tokenising usually become completely different to the language containing them. This is an example of an exclusive start condition.
You can separate ambiguous lexings if you can identify two separate start conditions for them and ensure the lexer enters them appropriately, given some preceding context.
Lexical tie-ins
This is a fancy name for carrying state in the lexer, and modifying it in the parser. If a certain action in your parser gets executed, it modifies some state in the lexer, which results in lexer actions returning different tokens. This should be avoided when necessary, because it makes your lexer and parser both more difficult to reason about, and makes some things (like GLR parsers) impossible.
The upside is that you can do things that would require significant grammar changes with relatively minor impact on the code; you can use information from the parse to influence the behaviour of the lexer, which in turn can come some way to solving your problem of what you see as an "ambiguous" grammar.
Logic, reasoning
It's probable that it is possible to lex it in one parse, and the above tools should come second to thinking about how you should be tokenising the input and trying to convert that into the language of lexical analysis. :)
The fact is, your input is comprised of tokens—whether you like it or not!—and all you need to do is find a way to make a program understand the rules you already know.

Practical consequences of formal grammar power?

Every undergraduate Intro to Compilers course reviews the commonly-implemented subsets of context-free grammars: LL(k), SLR(k), LALR(k), LR(k). We are also taught that for any given k, each of those grammars is a subset of the next.
What I've never seen is an explanation of what sorts of programming language syntactic features might require moving to a different language class. There's an obvious practical motivation for GLR parsers, namely, avoiding an unholy commingling of parser and symbol table when parsing C++. But what about the differences between the two "standard" classes, LL and LR?
Two questions:
What (general) syntactic constructions can be parsed with LR(k) but not LL(k')?
In what ways, if any, do those constructions manifest as desirable language constructs?
There's a plausible argument for reducing language power by making k as small as possible, because a language requiring many, many tokens of lookahead will be harder for humans to parse, as well as "harder" for machines to parse. Question (2) implicitly asks if the same reasoning ends up holding between classes, as well as within a class.
edit: Here's one example to illustrate the sorts of answers I'm looking for, but for regular languages instead of context-free:
When describing a regular language, one usually gets three operators: +, *, and ?. Now, you can remove + without reducing the power of the language; instead of writing x+, you write xx*, and the effect is the same. But if x is some big and hairy expression, the two xs are likely to diverge over time due to human forgetfulness, yielding a syntactically correct regular expression that doesn't match the original author's intent. Thus, even though adding + doesn't strictly add power, it does make the notation less error-prone.
Are there constructs with similar practical (human?) effects that must be "removed" when switching from LR to LL?
Parsing (I claim) is a bit like sorting: a problem that was the focus of a lot of thought in the early days of CS, leading to a set of well-understood solutions with some nice theoretical results.
My claim is that the picture that we get (or give, for those of us who teach) in a compilers class is, to some degree, a beautiful answer to the wrong question.
To answer your question more directly, an LL(1) grammar can't parse all kinds of things that you might want to parse; the "natural" formulation of an 'if' with an optional 'else', for instance.
But wait! Can't I reformulate my grammar as an LL(1) grammar and then patch up the source tree by walking over it afterward? Sure you can! To some degree, this is what makes the question of what kind of grammar your parser uses largely moot.
Also, back when I was an undergraduate (1990-94), whitespace-sensitive grammars were clearly the work of the Devil; now, Python and Haskell's designs are bringing whitespace-sensitivity back into the light. Also, Packrat parsing says "to heck with your theoretical purity: I'm just going to define a parser as a set of rules, and I don't care what class my grammar belongs to." (paraphrased)
In summary, I would agree with what I believe to be your implied suggestion: in 2009, a clear understanding of the difference between the classes LL(k) and LR(k) is less important in itself than the ability to formulate and debug a grammar that makes your parser generator happy.
The difference between LL and LR is primarily in the lookahead mechanism. People generally say that LR parsers carry more "context". To see this practically, consider a recursive grammar definition with S as the starting symbol:
A -> Ax | x
B -> Ay
C -> Az
S -> B | C
When k is a small fixed value, parsing a string like xxxxxxy is a task better suited to an LR parser. However, these days the popular LL parsers such as ANTLR do not restrict k to such small values and most people no longer care.
I hope this is more or less in line with your question. Of course Knuth showed that any unambiguous context-free language can be recognized by some LR(1) grammar. However, in practice we are also concerned with translation.
As a side note: You might also enjoy reading http://www.antlr.org/article/needlook.html.
This is by no means proven, but I have always questioned whether LR-like parsing is really similar to how the brain works when reading certain notations. For example, when reading an English sentence it is pretty obvious that we read from left-to-right. But, consider the pattern bellow:
. . . . . | . . . . .
I rather expect that with short patterns such as this one people do not literally read "dot dot dot dot dot bar dot dot dot dot dot" from left to right, but rather processes the pattern in parallel or at least in some kind of fuzzy iterative manner. In other words, I do not believe we necessarily read all patterns in a left-to-right manner with the kind of linear lookahead that a LL/LR parser employs.
Furthermore, if we can describe any context-free language using an LR(1) grammar then it is clear that simply recognizing a string is not the same as "understanding" it.
well, for one, Left recursive definitions are impossible in LL(k) grammars (as far as i know), don't know about others. This doesn't make itimpossible to define other things just a massive pain to do otherwise. For instance, putting together expressions can be easy in a left-recursive language (in pseudocode):
lexer rule expression = other rules
| expression
| '(' expression ')';
As far as syntactically useful things that can be made with left-recursion, um does simpler grammars count as syntactically useful?
The capabilities of a language are not limited by its syntax and grammar.
It's possible to define any language feature with an LL(k) grammar, it just might not be very readable to humans.

Parsing rules - how to make them play nice together

So I'm doing a Parser, where I favor flexibility over speed, and I want it to be easy to write grammars for, e.g. no tricky workaround rules (fake rules to solve conflicts etc, like you have to do in yacc/bison etc.)
There's a hand-coded Lexer with a fixed set of tokens (e.g. PLUS, DECIMAL, STRING_LIT, NAME, and so on) right now there are three types of rules:
TokenRule: matches a particular token
SequenceRule: matches an ordered list of rules
GroupRule: matches any rule from a list
For example, let's say we have the TokenRule 'varAccess', which matches token NAME (roughly /[A-Za-z][A-Za-z0-9_]*/), and the SequenceRule 'assignment', which matches [expression, TokenRule(PLUS), expression].
Expression is a GroupRule matching either 'assignment' or 'varAccess' (the actual ruleset I'm testing with is a bit more complete, but that'll do for the example)
But now let's say I want to parse
var1 = var2
And let's say the Parser begins with rule Expression (the order in which they are defined shouldn't matter - priorities will be solved later). And let's say the GroupRule expression will first try 'assignment'. Then since 'expression' is the first rule to be matched in 'assignment', it will try to parse an expression again, and so on until the stack is filled up and the computer - as expected - simply gives up in a sparkly segfault.
So what I did is - SequenceRules add themselves as 'leafs' to their first rule, and become non-roôt rules. Root rules are rules that the parser will first try. When one of those is applied and matches, it tries to subapply each of its leafs, one by one, until one matches. Then it tries the leafs of the matching leaf, and so on, until nothing matches anymore.
So that it can parse expressions like
var1 = var2 = var3 = var4
Just right =) Now the interesting stuff. This code:
var1 = (var2 + var3)
Won't parse. What happens is, var1 get parsed (varAccess), assign is sub-applied, it looks for an expression, tries 'parenthesis', begins, looks for an expression after the '(', finds var2, and then chokes on the '+' because it was expecting a ')'.
Why doesn't it match the 'var2 + var3' ? (and yes, there's an 'add' SequenceRule, before you ask). Because 'add' isn't a root rule (to avoid infinite recursion with the parse-expresssion-beginning-with-expression-etc.) and that leafs aren't tested in SequenceRules otherwise it would parse things like
reader readLine() println()
as
reader (readLine() println())
(e.g. '1 = 3' is the expression expected by add, the leaf of varAccess a)
whereas we'd like it to be left-associative, e.g. parsing as
(reader readLine()) println()
So anyway, now we've got this problem that we should be able to parse expression such as '1 + 2' within SequenceRules. What to do? Add a special case that when SequenceRules begin with a TokenRule, then the GroupRules it contains are tested for leafs? Would that even make sense outside that particular example? Or should one be able to specify in each element of a SequenceRule if it should be tested for leafs or not? Tell me what you think (other than throw away the whole system - that'll probably happen in a few months anyway)
P.S: Please, pretty please, don't answer something like "go read this 400pages book or you don't even deserve our time" If you feel the need to - just refrain yourself and go bash on reddit. Okay? Thanks in advance.
LL(k) parsers (top down recursive, whether automated or written by hand) require refactoring of your grammar to avoid left recursion, and often require special specifications of lookahead (e.g. ANTLR) to be able to handle k-token lookahead. Since grammars are complex, you get to discover k by experimenting, which is exactly the thing you wish to avoid.
YACC/LALR(1) grammars aviod the problem of left recursion, which is a big step forward. The bad news is that there are no real programming langauges (other than Wirth's original PASCAL) that are LALR(1). Therefore you get to hack your grammar to change it from LR(k) to LALR(1), again forcing you to suffer the experiments that expose the strange cases, and hacking the grammar reduction logic to try to handle K-lookaheads when the parser generators (YACC, BISON, ... you name it) produce 1-lookahead parsers.
GLR parsers (http://en.wikipedia.org/wiki/GLR_parser) allow you to avoid almost all of this nonsense. If you can write a context free parser, under most practical circumstances, a GLR parser will parse it without further effort. That's an enormous relief when you try to write arbitrary grammars. And a really good GLR parser will directly produce a tree.
BISON has been enhanced to do GLR parsing, sort of. You still have to write complicated logic to produce your desired AST, and you have to worry about how to handle failed parsers and cleaning up/deleting their corresponding (failed) trees. The DMS Software Reengineering Tookit provides standard GLR parsers for any context free grammar, and automatically builds ASTs without any additional effort on your part; ambiguous trees are automatically constructed and can be cleaned up by post-parsing semantic analyis. We've used this to do define 30+ language grammars including C, including C++ (which is widely thought to be hard to parse [and it is almost impossible to parse with YACC] but is straightforward with real GLR); see C+++ front end parser and AST builder based on DMS.
Bottom line: if you want to write grammar rules in a straightforward way, and get a parser to process them, use GLR parsing technology. Bison almost works. DMs really works.
My favourite parsing technique is to create recursive-descent (RD) parser from a PEG grammar specification. They are usually very fast, simple, and flexible. One nice advantage is you don't have to worry about separate tokenization passes, and worrying about squeezing the grammar into some LALR form is non-existent. Some PEG libraries are listed [here][1].
Sorry, I know this falls into throw away the system, but you are barely out of the gate with your problem and switching to a PEG RD parser, would just eliminate your headaches now.

Resources