Why bison does not convert grammar automatically? - parsing

I am learning lexer and parser, so I am reading this classical book : flex & bison (By John Levine, Publisher: O'Reilly Media).
An example is given that could not be parsed by bison :
phrase : cart_animal AND CART | work_animal AND PLOW
cart_animal-> HORSE | GOAT
work_animal -> HORSE | OX
I understand very well why it could not. Indeed, it requires TWO symbols of lookahead.
But, with a simple modification, it could be parsed :
phrase : cart_animal CART | work_animal PLOW
cart_animal-> HORSE AND | GOAT AND
work_animal -> HORSE AND | OX AND
I wonder why bison is not able to translate automatically grammar in simple cases like that ?

Because simple cases like that are pretty well all artificial, and in the case of real-life examples, it is difficult or impossible.
To be clear, if you have an LR(k) grammar with k>1 and you know the value of k, there is a mechanical transformation with which you can make an equivalent LR(1) grammar, and moreover you can, with some juggling, fix the reduction actions so that they have the same effect (at least, as long as they don't contain side effects). I don't know any parser generator which does that, in part because correctly translating the reduction actions will be tricky, and in part because the resulting LR(1) grammar is typically quite large, even for small values of k.
But, as I mentioned above, you need to know the value of k to perform this transformation, and it turns out that there is no algorithm which can take a grammar and tell you whether it is LR(k). So all you could do is try successively larger values of k until you find one which works, or you decide to give up.

Related

How do I convert PEG parser into ambiguous one?

As far as I understand, most languages are context-free with some exceptions. For instance, a * b may stand for type * pointer_declaration or multiplication in C++. Which one takes place depends on the context, the meaning of the first identifier. Another example is name production in VHDL
enum_literal ::= char_literal | identifer
physical_literal ::= [num] unit_identifier
func_call ::= func_identifier [parenthized_args]
array_indexing ::= arr_name (index_expr)
name ::= func_call | physical_literal | enum_litral | array_indexing
You see that syntactic forms are different but they can match if optional parameters are omitted, like f, does it stand for func_call, physical_literal, like 1 meter with optional amount 1 is implied, or enum_literal.
Talking to Scala plugin designers, I was educated to know that you build AST to re-evaluate it when dependencies change. There is no need to re-parse the file if you have its AST. AST also worth to display the file contents. But, AST is invalidated if grammar is context-sensitive (suppose that f was a function, defined in another file, but later user requalified it into enum literal or undefined). AST changes in this case. AST changes on whenever you change the dependencies. Another option, that I am asking to evaluate and let me know how to make it, is to build an ambiguous AST.
As far as I know, parser combinators are of PEG kind. They hide the ambiguity by returning you the first matched production and f would match a function call because it is the first alternative in my grammar. I am asking for a combinator that instead of falling back on the first success, it proceeds to the next alternative. In the end, it would return me a list of all matching alternatives. It would return me an ambiguity.
I do not know how would you display the ambiguous file contents tree to the user but it would eliminate the need to re-parse the dependent files. I would also be happy to know how modern language design solve this problem.
Once ambiguous node is parsed and ambiguity of results is returned, I would like the parser to converge because I would like to proceed parsing beyond the name and I do not want to parse to the end of file after every ambiguity. The situation is complicated by situations like f(10), which can be a function call with a single argument or a nullary function call, which return an array, which is indexed afterwards. So, f(10) can match name two ways, either as func_call directly or recursively, as arr_indexing -> name ~ (expr). So, it won't be ambiguity like several parallel rules, like fcall | literal. Some branches may be longer than 1 parser before re-converging, like fcall ~ (expr) | fcall.
How would you go about solving it? Is it possible to add ambiguating combinator to PEG?
First you claim that "most languages are context-free with some exceptions", this is not totally true. When designing a computer language, we mostly try to keep it as context-free as possible, since CFGs are the de-facto standard for that. It will ease a lot of work. This is not always feasible, though, and a lot[?] of languages depend on the semantic analysis phase to disambiguate any possible ambiguities.
Parser combinators do not use a formal model usually; PEGs, on the other hand, are a formalism for grammars, as are CFGs. On the last decade a few people have decided to use PEGs over CFGs due to two facts: PEGs are, by design, unambiguous, and they might always be parsed in linear time. A parser combinator library might use PEGs as underlying formalism, but might as well use CFGs or even none.
PEGs are attractive for designing computer languages because we usually do not want to handle ambiguities, which is something hard (or even impossible) to avoid when using CFGs. And, because of that, they might be parsed O(n) time by using dynamic programming (the so called packrat parser). It's not simple to "add ambiguities to them" for a few reasons, most importantly because the language they recognize depend on the fact that the options are deterministic, which is used for example when checking for lookahead. It isn't as simple as "just picking the first choice". For example, you could define a PEG:
S = "a" S "a" / "aa"
Which only parse sequences of N "a", where N is a power of 2. So it recognizes a sequence of 2, 4, 8, 16, 32, 64, etc, letter "a". By adding ambiguity, as a CFG would have, then you would recognize any even number of "a" (2, 4, 6, 8, 10, etc), which is a different language.
To answer your question,
How would you go about solving it? Is it possible to add ambiguating combinator to PEG?
First I must say that this is probably not a good idea. If you wish to keep ambiguity on the AST, you probably should use a CFG parser instead.
One could, for example, make a parser for PEGs which is similar to a parser for boolean grammars, but then our asymptotic parsing time would grow from O(n) to O(n3) by keeping all alternatives alive while keeping the same language. And we actually lose both good things about PEGs at once.
Another way would be to keep a packrat parser in memory, and transverse its table to handle the semantics from the AST. Not really a good idea either, since this would imply a large memory footprint.
Ideally, one should build an AST which already has information regarding possible ambiguities by changing the grammar structure. While this requires manual work, and usually isn't simple, you wouldn't have to go back a phase to check the grammar again.

LL parser grammar

I have this grammar below and trying to figure out if can be parsed using the LL parser? If not, please explain.
S --> ab | cB
A --> b | Bb
B --> aAb | cC
C --> cA | Aba
From what I understand the intersection of the two sets must be empty to pass the pairwise disjointness test.
But I am not sure where to begin and have been looking through my textbook and http://en.wikipedia.org/wiki/LL_parser#Parsing_procedure but can't quite understand or find any examples to follow along. I just need to see the procedure or steps to understand how to do other similar problems to this. Any help is appreciated.
Compute FIRST sets for all the non-terminals, and check to see if FIRST sets for the alternatives of a given non-terminal are all disjoint. If all are, it is LL, and if there are any non-terminals for which they are not it is not. If there are any ε rules, you'll need FOLLOW sets as well.
Computing FIRST1 sets is quite easy and will tell you if the grammar is LL(1). Computing FIRSTk sets is quite a bit more involved, but will tell you if the grammar is LL(k) for any specific k you calculate the FIRSTk sets for.

Parser library that can handle ambiguity

I'm looking for a mature parser library, either for Scala or Haskell.
The most important point is, that the library can handle ambiguity.
If an expression is ambiguous, I want every possible abstract syntax tree, that matches the expression.
Simple example: The expression a ⊗ b ⊗ c can be seen as (a ⊗ b) ⊗ c or a ⊗ (b ⊗ c), and I need both variants.
Thanks!
I feel like the old guy for remembering when Walder's papers like Comprehending Monads (the precursor to the do notation) were exciting and new. The idea is that you (to quote) replace a failure by a list of successes, meaning maintain a list of all the possible parses. At the end you normally just take the first match, but with this setup, you can take all of them.
These aren't all that efficient for a deterministic parser, which is why they're less in fashion, but they are what you need.
Have a look at polyparse, and in particular Text.ParserCombinators.HuttonMeijer and Text.ParserCombinators.HuttonMeijerWallace.
(Hutton & Meijer translated the parser library to Haskell (from Gofer) and Wallace added extra features.)
Make sure you check it out on simple cases like parsing "aaaa" with
testP = do
a <- many $ char 'a'
b <- many $ char 'a'
return (a,b)
to see if it has the semantics you seek.
You asked for mature. These libraries are part of pure functional programming's heritage! Having said that, I'd call parsec more mature, even though it's younger.
(Speculation: I don't think parsec can do what you want. Its standard choice combinator is deterministic. I haven't looked into tweaking or replacing that behaviour, and I wouldn't want to I'm afraid.)
This question immediately reminded me of the Yacc is dead / No, it's not debate from the end of 2010. The authors of the Yacc is dead paper provide a library in Scala (unmaintained), Haskell and Racket. In the Yacc is alive response, Russ Cox points out that the code runs in exponential time for ambiguous grammars.
It's well-known that it is possible to parse ambiguous grammars in O(n^3), although obviously it can take exponential time to enumerate all the parse trees in the case that there are exponentially many of them -- and there will be in the case of x1 + x2 + x3 ... + xn. bison implements the GLR algorithm which does so; unfortunately, while bison is certainly mature (if not actually moribund), it is written neither in Haskell nor in Scala.
Daniel Spiewak implemented a GLL parser in Scala IIRC, but last time I looked at it, it suffered from some performance issues. So I'm not sure that it could be described as mature, either.
I can't speak to how mature it is or give you any usage examples, but I've had the scala gll-combinators library open in a tab for a few days. It handles ambiguous grammars and looks pretty nifty.
At the end the the choice fell on the Syntax Definition Formalism (SDF2)
with an sdf table generator here
and JSGLR as parser generator.

What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree?

I've been reading a bit about how interpreters/compilers work, and one area where I'm getting confused is the difference between an AST and a CST. My understanding is that the parser makes a CST, hands it to the semantic analyzer which turns it into an AST. However, my understanding is that the semantic analyzer simply ensures that rules are followed. I don't really understand why it would actually make any changes to make it abstract rather than concrete.
Is there something that I'm missing about the semantic analyzer, or is the difference between an AST and CST somewhat artificial?
A concrete syntax tree represents the source text exactly in parsed form. In general, it conforms to the context-free grammar defining the source language.
However, the concrete grammar and tree have a lot of things that are necessary to make source text unambiguously parseable, but do not contribute to actual meaning. For example, to implement operator precedence, your CFG usually has several levels of expression components (term, factor, etc.), with the operators connecting them at the different levels (you add terms to get expressions, terms are composed of factors optionally multipled, etc.). To actually interpret or compile the language, however, you don't need this; you just need Expression nodes that have operators and operands. The abstract syntax tree is the result of simplifying the concrete syntax tree down to the things actually needed to represent the meaning of the program. This tree has a much simpler definition and is thus easier to process in the later stages of execution.
You usually don't need to actually build a concrete syntax tree. The action routines in your YACC (or Antlr, or Menhir, or whatever...) grammar can directly build the abstract syntax tree, so the concrete syntax tree only exists as a conceptual entity representing the parse structure of your source text.
A concrete syntax tree matches what the grammar rules say is the syntax. The purpose of the abstract syntax tree is have a "simple" representation of what's essential in "the syntax tree".
A real value in the AST IMHO is that it is smaller than the CST, and therefore takes less time to process. (You might say, who cares? But I work with a tool where we have
tens of millions of nodes live at once!).
Most parser generators that have any support for building syntax trees insist that you personally specify exactly how they get built under the assumption that your tree nodes will be "simpler" than the CST (and in that, they are generally right, as programmers are pretty lazy). Arguably it means you have to code fewer tree visitor functions, and that's valuable, too, in that it minimizes engineering energy. When you have 3500 rules (e.g., for COBOL) this matters. And this "simpler"ness leads to the good property of "smallness".
But having such ASTs creates a problem that wasn't there: it doesn't match the grammar, and now you have to mentally track both of them. And when there are 1500 AST nodes for a 3500 rule grammar, this matters a lot. And if the grammar evolves (they always do!), now you have two giant sets of things to keep in synch.
Another solution is to let the parser simply build CST nodes for you and just use those. This is a huge advantage when building the grammars: there's no need to invent 1500 special AST nodes to model 3500 grammar rules. Just think about the tree being isomorphic to the grammar. From the point of view of the grammar engineer this is completely brainless, which lets him focus on getting the grammar right and hacking at it to his heart's content. Arguably you have to write more node visitor rules, but that can be managed. More on this later.
What we do with the DMS Software Reengineering Toolkit is to automatically build a CST based on the results of a (GLR) parsing process. DMS then automatically constructs an "compressed" CST for space efficiency reasons, by eliminating non-value carrying terminals (keywords, punctation), semantically useless unary productions, and forming directly-indexable lists for grammar rule pairs that are list like:
L = e ;
L = L e ;
L2 = e2 ;
L2 = L2 ',' e2 ;
and a wide variety of variations of such forms. You think in terms of the grammar rules and the virtual CST; the tool operates on the compressed representation. Easy on your brain, faster/smaller at runtime.
Remarkably, the compressed CST built this way looks a lot an AST that you might have designed by hand (see link at end to examples). In particular, the compressed CST doesn't carry any nodes that are just concrete syntax.
There are minor bits of awkwardness: for example while the concrete nodes for '(' and ')' classically found in expression subgrammars are not in the tree, a "parentheses node" does appear in the compressed CST and has to be handled. A true AST would not have this. This seems like a pretty small price to pay for the convenience of not have to specify the AST construction, ever. And the documentation for the tree is always available and correct: the grammar is the documentation.
How do we avoid "extra visitors"? We don't entirely, but DMS provides an AST library that walks the AST and handles the differences between the CST and the AST transparently. DMS also offers an "attribute grammar" evaluator (AGE), which is a method for passing values computed at nodes up and down the tree; the AGE handles all the tree representation issues and so the tool engineer only worries about writing computations effectively directly on the grammar rules themselves. Finally, DMS also provides "surface-syntax" patterns, which allows code fragments from the grammar to used to find specific types of subtrees, without knowing most of the node types involved.
One of the other answers observes that if you want to build tools that can regenerate source, your AST will have to match the CST. That's not really right, but it is far easier to regenerate the source if you have CST nodes. DMS generates most of the prettyprinter automatically because it has access to both :-}
Bottom line: ASTs are good for small, both phyiscal and conceptual. Automated AST construction from the CST provides both, and lets you avoid the problem of tracking two different sets.
EDIT March 2015: Link to examples of CST vs. "AST" built this way
This is based on the Expression Evaluator grammar by Terrence Parr.
The grammar for this example:
grammar Expr002;
options
{
output=AST;
ASTLabelType=CommonTree; // type of $stat.tree ref etc...
}
prog : ( stat )+ ;
stat : expr NEWLINE -> expr
| ID '=' expr NEWLINE -> ^('=' ID expr)
| NEWLINE ->
;
expr : multExpr (( '+'^ | '-'^ ) multExpr)*
;
multExpr
: atom ('*'^ atom)*
;
atom : INT
| ID
| '('! expr ')'!
;
ID : ('a'..'z' | 'A'..'Z' )+ ;
INT : '0'..'9'+ ;
NEWLINE : '\r'? '\n' ;
WS : ( ' ' | '\t' )+ { skip(); } ;
Input
x=1
y=2
3*(x+y)
Parse Tree
The parse tree is a concrete representation of the input. The parse tree retains all of the information of the input. The empty boxes represent whitespace, i.e. end of line.
AST
The AST is an abstract representation of the input. Notice that parens are not present in the AST because the associations are derivable from the tree structure.
EDIT
For a more through explanation see Compilers and Compiler Generators pg. 23
This blog post may be helpful.
It seems to me that the AST "throws away" a lot of intermediate grammatical/structural information that wouldn't contribute to semantics. For example, you don't care that 3 is an atom is a term is a factor is a.... You just care that it's 3 when you're implementing the exponentiation expression or whatever.
The concrete syntax tree follows the rules of the grammar of the language. In the grammar, "expression lists" are typically defined with two rules
expression_list can be: expression
expression_list can be: expression, expression_list
Followed literally, these two rules gives a comb shape to any expression list that appears in the program.
The abstract syntax tree is in the form that's convenient for further manipulation. It represents things in a way that makes sense for someone that understand the meaning of programs, and not just the way they are written. The expression list above, which may be the list of arguments of a function, may conveniently be represented as a vector of expressions, since it's better for static analysis to have the total number of expression explicitly available and be able to access each expression by its index.
Simply, AST only contains semantics of the code, Parse tree/CST also includes information on how exactly code was written.
The concrete syntax tree contains all information like superfluous parenthesis and whitespace and comments, the abstract syntax tree abstracts away from this information.
NB: funny enough, when you implement a refactoring engine your AST will again contain all the concrete information, but you'll keep referring to it as an AST because that has become the standard term in the field (so one could say it has long ago lost its original meaning).
CST(Concrete Syntax Tree) is a tree representation of the Grammar(Rules of how the program should be written).
Depending on compiler architecture, it can be used by the Parser to produce an AST.
AST(Abstract Syntax Tree) is a tree representation of Parsed source, produced by the Parser part of the compiler. It stores information about tokens+grammar.
Depending on architecture of your compiler, The CST can be used to produce an AST. It is fair to say that CST evolves into AST. Or, AST is a richer CST.
More explanations can be found on this link: http://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees#id6
It is a difference which doesn't make a difference.
An AST is usually explained as a way to approximate the semantics of a programming language expression by throwing away lexical content. For example in a context free grammar you might write the following EBNF rule
term: atom (('*' | '/') term )*
whereas in the AST case you only use mul_rule and div_rule which expresses the proper arithmetic operations.
Can't those rules be introduced in the grammar in the first place? Of course. You can rewrite the above compact and abstract rule by breaking it into a more concrete rules used to mimic the mentioned AST nodes:
term: mul_rule | div_rule
mul_rule: atom ('*' term)*
div_rule: atom ('/' term)*
Now, when you think in terms of top-down parsing then the second term introduces a FIRST/FIRST conflict between mul_rule and div_rule something an LL(1) parser cannot deal with. The first rule form was the left factored version of the second one which effectively eliminated structure. You have to pay some prize for using LL(1) here.
So ASTs are an ad hoc supplement used to fix the deficiencies of grammars and parsers. The CST -> AST transformation is a refactoring move. No one ever bothered when an additional comma or colon is stored in the syntax tree. On the contrary some authors retrofit them into ASTs because they like to use ASTs for doing refactorings instead of maintaining various trees at the same time or write an additional inference engine. Programmers are lazy for good reasons. Actually they store even line and column information gathered by lexical analysis in ASTs for error reporting. Very abstract indeed.

Practical consequences of formal grammar power?

Every undergraduate Intro to Compilers course reviews the commonly-implemented subsets of context-free grammars: LL(k), SLR(k), LALR(k), LR(k). We are also taught that for any given k, each of those grammars is a subset of the next.
What I've never seen is an explanation of what sorts of programming language syntactic features might require moving to a different language class. There's an obvious practical motivation for GLR parsers, namely, avoiding an unholy commingling of parser and symbol table when parsing C++. But what about the differences between the two "standard" classes, LL and LR?
Two questions:
What (general) syntactic constructions can be parsed with LR(k) but not LL(k')?
In what ways, if any, do those constructions manifest as desirable language constructs?
There's a plausible argument for reducing language power by making k as small as possible, because a language requiring many, many tokens of lookahead will be harder for humans to parse, as well as "harder" for machines to parse. Question (2) implicitly asks if the same reasoning ends up holding between classes, as well as within a class.
edit: Here's one example to illustrate the sorts of answers I'm looking for, but for regular languages instead of context-free:
When describing a regular language, one usually gets three operators: +, *, and ?. Now, you can remove + without reducing the power of the language; instead of writing x+, you write xx*, and the effect is the same. But if x is some big and hairy expression, the two xs are likely to diverge over time due to human forgetfulness, yielding a syntactically correct regular expression that doesn't match the original author's intent. Thus, even though adding + doesn't strictly add power, it does make the notation less error-prone.
Are there constructs with similar practical (human?) effects that must be "removed" when switching from LR to LL?
Parsing (I claim) is a bit like sorting: a problem that was the focus of a lot of thought in the early days of CS, leading to a set of well-understood solutions with some nice theoretical results.
My claim is that the picture that we get (or give, for those of us who teach) in a compilers class is, to some degree, a beautiful answer to the wrong question.
To answer your question more directly, an LL(1) grammar can't parse all kinds of things that you might want to parse; the "natural" formulation of an 'if' with an optional 'else', for instance.
But wait! Can't I reformulate my grammar as an LL(1) grammar and then patch up the source tree by walking over it afterward? Sure you can! To some degree, this is what makes the question of what kind of grammar your parser uses largely moot.
Also, back when I was an undergraduate (1990-94), whitespace-sensitive grammars were clearly the work of the Devil; now, Python and Haskell's designs are bringing whitespace-sensitivity back into the light. Also, Packrat parsing says "to heck with your theoretical purity: I'm just going to define a parser as a set of rules, and I don't care what class my grammar belongs to." (paraphrased)
In summary, I would agree with what I believe to be your implied suggestion: in 2009, a clear understanding of the difference between the classes LL(k) and LR(k) is less important in itself than the ability to formulate and debug a grammar that makes your parser generator happy.
The difference between LL and LR is primarily in the lookahead mechanism. People generally say that LR parsers carry more "context". To see this practically, consider a recursive grammar definition with S as the starting symbol:
A -> Ax | x
B -> Ay
C -> Az
S -> B | C
When k is a small fixed value, parsing a string like xxxxxxy is a task better suited to an LR parser. However, these days the popular LL parsers such as ANTLR do not restrict k to such small values and most people no longer care.
I hope this is more or less in line with your question. Of course Knuth showed that any unambiguous context-free language can be recognized by some LR(1) grammar. However, in practice we are also concerned with translation.
As a side note: You might also enjoy reading http://www.antlr.org/article/needlook.html.
This is by no means proven, but I have always questioned whether LR-like parsing is really similar to how the brain works when reading certain notations. For example, when reading an English sentence it is pretty obvious that we read from left-to-right. But, consider the pattern bellow:
. . . . . | . . . . .
I rather expect that with short patterns such as this one people do not literally read "dot dot dot dot dot bar dot dot dot dot dot" from left to right, but rather processes the pattern in parallel or at least in some kind of fuzzy iterative manner. In other words, I do not believe we necessarily read all patterns in a left-to-right manner with the kind of linear lookahead that a LL/LR parser employs.
Furthermore, if we can describe any context-free language using an LR(1) grammar then it is clear that simply recognizing a string is not the same as "understanding" it.
well, for one, Left recursive definitions are impossible in LL(k) grammars (as far as i know), don't know about others. This doesn't make itimpossible to define other things just a massive pain to do otherwise. For instance, putting together expressions can be easy in a left-recursive language (in pseudocode):
lexer rule expression = other rules
| expression
| '(' expression ')';
As far as syntactically useful things that can be made with left-recursion, um does simpler grammars count as syntactically useful?
The capabilities of a language are not limited by its syntax and grammar.
It's possible to define any language feature with an LL(k) grammar, it just might not be very readable to humans.

Resources