OK, so here's a question: Given that Haskell allows you to define new operators with arbitrary operator precedence... how is it possible to actually parse Haskell source code?
You cannot know what operator precedences are set until you parse the source. But you cannot parse the source until you know the correct operator precedences. So... um, how?
Consider, for example, the expression
x *** y +++ z
Until we finish parsing the module, we don't know what other modules are imported, and hence what operators (and other identifiers) might be in scope. We certainly don't know their precedences yet. But the parser has to return something... should it return
(x *** y) +++ z
Or should it return
x *** (y +++ z)
The poor parser has no way to know. This can only be determined once you hunt down the import that brings (+++) and (***) into scope, load that file off disk, and discover what the operator precedences are. Clearly the parser itself isn't going to do all that I/O; a parser just turns a stream of characters into an AST.
Clearly somebody somewhere has figured out how to do this. But I can't work it out... Any hints?
Quoting the page on GHC trac for the parser:
Infix operators are parsed as if they were all left-associative. The
renamer uses the fixity declarations to re-associate the syntax tree.
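To make that two-phase approach concrete, here is a minimal sketch of the second pass, using made-up tuple-shaped tree nodes and a hypothetical fixities table (this is just the general idea, not GHC's actual code): the parser builds every infix application as if it were left-associative, and a later pass rotates the tree once the real precedences and associativities are known. Parenthesized subexpressions are assumed to arrive as opaque, already-fixed operands.

def mk_op_app(e1, op2, e2, fixities):
    # Attach the infix application `e1 op2 e2`, rotating if needed.
    # `e1` is already correctly associated; `e2` is a plain operand.
    if isinstance(e1, tuple) and e1[0] == "op":
        _, op1, e11, e12 = e1
        p1, a1 = fixities[op1]
        p2, a2 = fixities[op2]
        if p1 < p2 or (p1 == p2 and a1 == a2 == "right"):
            # op2 binds tighter (or ties as right-associative):
            # push it down into the right branch of op1
            return ("op", op1, e11, mk_op_app(e12, op2, e2, fixities))
        # a real implementation would report an error for equal
        # precedence with incompatible associativities
    return ("op", op2, e1, e2)

def reassociate(e, fixities):
    # Fix up a tree the parser built as if every operator were
    # left-associative.
    if isinstance(e, tuple) and e[0] == "op":
        _, op, left, right = e
        return mk_op_app(reassociate(left, fixities), op,
                         reassociate(right, fixities), fixities)
    return e

With a hypothetical table such as {"***": (9, "right"), "+++": (5, "right")}, the left-associative parse ("op", "+++", ("op", "***", "x", "y"), "z") is kept as (x *** y) +++ z; swap the two precedences and the same input is rotated to x *** (y +++ z).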
András Kovács's answer tells what's really done in GHC, but there's some history to this.
There was actually a somewhat hypothetical change from the Haskell 98 to the Haskell 2010 standard. In the former's BNF grammar, operator fixity and parsing were intertwined in such a way that you could in theory have some very strange interactions between the rules for fixity and the rules for when expressions and indentation blocks end. (For the latter two, the rules are essentially, "keep on going until you have to stop".)
In particular you could redefine a local operator and its fixity such that a use of it belonged in the redefining inner where block exactly ... when it didn't. So you got a parser paradox. I cannot find any of the old examples but this may be one:
let (+) = (Prelude.+)
    infix 9 +  -- make the inner + high precedence and non-associative
in 2 + 3 + 4
-- ^ this + cannot parse here as the inner operator, which means
--   the let ... in ... expression should end automatically first,
--   but then it's the standard +, and its fixity says it should parse
--   as part of the inner expression...
In Haskell 2010 they officially changed that so that operator fixities are determined in a separate stage after the parsing proper.
So why was this a hypothetical change? Because all the compiler writers already did it the Haskell 2010 way, and always had, for their own sanity.
Summarising the comments so far, it seems the possibilities are thus:
Return a parse tree where any infix operators are left as some kind of "list" structure, and then rearrange once precedences become known.
Pretend you know the operator precedences, and then rearrange the parse tree after the fact.
Do a first parse that only reads imports and fixity declarations, load the imports, and then do a full parse with known precedences.
Related
I'm trying out parser generators with Haskell, using Happy here. I used to use parser combinators before, such as Parsec, and one thing I can't achieve now is the dynamic addition (during execution) of new externally defined operators. For example, Haskell has some basic operators, but we can add more, giving them precedence and fixity. So I would like to know how to reproduce this with Happy, following the Haskell design (see the example code below to be parsed), or, if it is not easily feasible, whether it should instead be done with parser combinators.
-- Adding the new operator
infixl 5 ++
(++) :: [a] -> [a] -> [a]
[] ++ ys = ys
(x:xs) ++ ys = x : xs ++ ys
-- Using the new operator taking into consideration fixity and precedence during parsing
example = "Hello, " ++ "world!"
Haskell only allows a small, fixed set of precedence levels (0 through 9). So you don't strictly need a dynamic grammar; you could just write out the grammar using precedence-level token classes instead of individual operators, leaving the lexer with the problem of associating a given symbol with a given precedence level.
In effect, that moves the dynamic addition of operators into the lexer. It's a slightly uncomfortable design decision, although in some cases it may not be too difficult to implement. It's uncomfortable because it requires semantic feedback to the lexer; at a minimum, the lexer needs to consult the symbol table to figure out what type of token it is looking at. In the case of Haskell, at least, this is made more uncomfortable by the fact that fixity declarations are scoped, so in order to track fixity information the lexer would also need to understand the scoping rules.
In practice, most languages which allow program text to define operators and operator precedence work in precisely the same way the Haskell compiler does: expressions are parsed by the grammar into a simple list of items (where parenthesized subexpressions count as a single item), and in a later semantic analysis the list is rearranged into an actual tree taking into account precedence and associativity rules, using a simple version of the shunting yard algorithm. (It's a simple version because it doesn't need to deal with parenthesized subconstructs.)
There are several reasons for this design decision:
As mentioned above, for the lexer to figure out what the precedence of a symbol is (or even if the symbol is an operator with precedence) requires a close collaboration between the lexer and the parser, which many would say violates separation of concerns. Worse, it makes it difficult or impossible to use parsing technologies without a small fixed lookahead, such as GLR parsers.
Many languages have more precedence levels than Haskell. In some cases, even the number of precedence levels is not defined by the grammar. In Swift, for example, you can declare your own precedence levels, and you define a level not with a number but with a comparison to another previously defined level, leading to a partial order between precedence levels.
IMHO, that's actually a better design decision than Haskell's, in part because it avoids the ambiguity of a precedence level having both left- and right-associative operators, but more importantly because the relative precedence declarations both avoid magic numbers and allow the parser to flag the ambiguous use of operators from different modules. In other words, it does not force a precedence declaration to mechanically apply to any pair of totally unrelated operators; in this sense it makes operator declarations easier to compose.
The grammar is much simpler, and arguably easier to understand, since most people rely on precedence tables anyway rather than analysing grammar productions to figure out how operators interact with each other. In that sense, having precedence set by the grammar is more a distraction than documentation. See the C++ grammar as a good example of why precedence tables are easier to read than grammars.
On the other hand, as the C++ grammar also illustrates, a grammar is a lot more general than simple precedence declarations because it can express asymmetric precedences. (The grammar doesn't always express these gracefully, but they can be expressed.) A classic example of an asymmetric precedence is a lambda construct (λ ID expr) which binds very loosely to the right and very tightly to the left: the expected parse of a ∘ λ b b ∘ a does not ever consult the associativity of ∘ because the λ comes between them.
In practice, there is very little cost to building the tree later. The algorithm to build the tree is well-known, simple and cheap.
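As an illustration of that later tree-building pass, here is a minimal sketch of the simple shunting-yard variant described above (the flat-list representation and the fixity table are assumptions for this example, not any particular compiler's data structures); error handling, such as a clash between non-associative operators of equal precedence, is omitted:

def build_tree(items, fixity):
    # items alternates operands and operator names, e.g.
    # ["x", "***", "y", "+++", "z"]; parenthesized subexpressions are
    # assumed to arrive as single, already-built operands.
    # fixity maps an operator name to (precedence, associativity).
    operands, operators = [], []

    def reduce_top():
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    stream = iter(items)
    operands.append(next(stream))      # leading operand
    for op in stream:
        operand = next(stream)         # the operand following this operator
        prec, assoc = fixity[op]
        while operators:
            top_prec, _ = fixity[operators[-1]]
            # reduce while the stacked operator binds at least as tightly
            # (ties go to the earlier operator when we are left-associative)
            if top_prec > prec or (top_prec == prec and assoc == "left"):
                reduce_top()
            else:
                break
        operators.append(op)
        operands.append(operand)
    while operators:
        reduce_top()
    return operands[0]

For example, build_tree(["x", "***", "y", "+++", "z"], {"***": (9, "right"), "+++": (5, "right")}) returns ("+++", ("***", "x", "y"), "z").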
I'm making my own JavaScript-based programming language (yeah, it's crazy, but it's just for learning... maybe?). Well, I'm reading about parsers, and the first pass is to convert the source code into tokens, like:
if(x > 5)
return true;
Tokenizer to:
T_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_RPAREN ")"
T_IDENTIFIER "return"
T_TRUE "true"
T_TERMINATOR ";"
I don't know if my logic is correct so far. In my parser it goes a step further (or not?) and translates it to this (yeah, a multidimensional array):
T_IF "if"
T_EXPRESSION ...
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_CLOSURE ...
T_IDENTIFIER "return"
T_TRUE "true"
I have some questions:
Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of interpreted all the time.
After I tokenize, what exactly do I need to do? I'm really lost at this step!
Are there any good tutorials to learn how to do this?
Well, that's it. Bye!
Generally, you want to separate the functions of the tokeniser (also called a lexer) from other stages of your compiler or interpreter. The reason for this is basic modularity: each pass consumes one kind of thing (e.g., characters) and produces another one (e.g., tokens).
So you’ve converted your characters to tokens. Now you want to convert your flat list of tokens to meaningful nested expressions, and this is what is conventionally called parsing. For a JavaScript-like language, you should look into recursive descent parsing. For parsing expressions with infix operators of different precedence levels, Pratt parsing is very useful, and you can fall back on ordinary recursive descent parsing for special cases.
Just to give you a more concrete example based on your case, I’ll assume you can write two functions: accept(token) and expect(token), which test the next token in the stream you’ve created. You’ll make a function for each type of statement or expression in the grammar of your language. Here’s Pythonish pseudocode for a statement() function, for instance:
def statement():
    if accept("if"):
        x = expression()
        y = statement()
        return IfStatement(x, y)
    elif accept("return"):
        x = expression()
        return ReturnStatement(x)
    elif accept("{"):
        xs = []
        while True:
            xs.append(statement())
            if not accept(";"):
                break
        expect("}")
        return Block(xs)
    else:
        error("Invalid statement!")
This gives you what’s called an abstract syntax tree (AST) of your program, which you can then manipulate (optimisation and analysis), output (compilation), or run (interpretation).
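The statement() sketch above leans on an expression() helper. Purely as an illustration of the Pratt-style (precedence-climbing) approach mentioned earlier, it might look something like the following; the peek() and next_token() helpers, the PRECEDENCE table, and the node constructors are assumptions of this sketch, not part of the pseudocode above:

# Hypothetical table of binding powers for binary operators.
PRECEDENCE = {"==": 1, "<": 2, ">": 2, "+": 3, "-": 3, "*": 4, "/": 4}

def expression(min_prec=0):
    left = primary()                  # literal, identifier, or parenthesized expression
    while peek() in PRECEDENCE and PRECEDENCE[peek()] >= min_prec:
        op = next_token()             # consume the operator
        # parse the right-hand side with a higher minimum precedence, so
        # operators of equal precedence associate to the left
        right = expression(PRECEDENCE[op] + 1)
        left = BinaryExpression(op, left, right)
    return left

def primary():
    if accept("("):
        e = expression()
        expect(")")
        return e
    return Atom(next_token())         # identifier or literal, simplified

With that in place, adding a new binary operator is just a new entry in the table, and statement() can keep treating expressions as a black box.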
Most toolkits split the complete process into two separate parts:
lexer (aka. tokenizer)
parser (aka. grammar)
The tokenizer will split the input data into tokens. The parser will only operate on the token "stream" and build the structure.
Your question seems to be focused on the tokenizer. But your second solution mixes the grammar parser and the tokenizer into one step. Theoretically this is also possible, but for a beginner it is much easier to do it the same way most other tools/frameworks do: keep the steps separate.
As for your first solution: I would tokenize your example like this:
T_KEYWORD_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_LITERAL "5"
T_RPAREN ")"
T_KEYWORD_RET "return"
T_KEYWORD_TRUE "true"
T_TERMINATOR ";"
In most languages keywords cannot be used as method names, variable names and so on. This is reflected already on the tokenizer level (T_KEYWORD_IF, T_KEYWORD_RET, T_KEYWORD_TRUE).
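For illustration, a tokenizer producing a stream like the one above can be little more than a master regular expression plus a keyword table. This sketch is mine, not part of the answer; the token names simply follow the list above, and unknown characters are silently skipped:

import re

KEYWORDS = {"if": "T_KEYWORD_IF", "return": "T_KEYWORD_RET", "true": "T_KEYWORD_TRUE"}
TOKEN_SPEC = [
    ("T_LITERAL",    r"\d+"),
    ("WORD",         r"[A-Za-z_]\w*"),   # keyword or identifier, decided below
    ("T_GT",         r">"),
    ("T_LPAREN",     r"\("),
    ("T_RPAREN",     r"\)"),
    ("T_TERMINATOR", r";"),
    ("SKIP",         r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "WORD":
            # keywords are recognised here, so the parser never sees them
            # as ordinary identifiers
            kind = KEYWORDS.get(text, "T_IDENTIFIER")
        yield kind, text

# list(tokenize('if(x > 5) return true;')) produces the stream shown above.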
The next level would take this stream and - by applying a formal grammar - build some data structure (often called an AST - Abstract Syntax Tree) which might look like this:
IfStatement:
    Expression:
        BinaryOperator:
            Operator: T_GT
            LeftOperand:
                IdentifierExpression:
                    "x"
            RightOperand:
                LiteralExpression
                    5
    IfBlock
        ReturnStatement
            ReturnExpression
                LiteralExpression
                    "true"
    ElseBlock (empty)
The parsing itself is usually done with some framework. Implementing something like that by hand, and doing it efficiently, is usually done at a university over the better part of a semester. So you really should use some kind of framework.
The input for a grammar parser framework is usually a formal grammar in some kind of BNF. Your "if" part might look like this:
IfStatement: T_KEYWORD_IF T_LPAREN Expression T_RPAREN Statement ;
Expression: LiteralExpression | BinaryExpression | IdentifierExpression | ... ;
BinaryExpression: LeftOperand BinaryOperator RightOperand;
....
That's only to give you the idea. Parsing a real-world language like JavaScript correctly is not an easy task. But it's fun.
Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of interpreted all the time.
What's the original way? There are many different ways to implement languages. I think yours is actually fine; I once tried to build a language myself that translated to C#, the hack programming language. Many language compilers translate to an intermediate language; it's quite common.
After I tokenize, what exactly do I need to do? I'm really lost at this step!
After tokenizing, you need to parse it. Use some good lexer/parser framework, such as Boost.Spirit, or Coco, or whatever. There are hundreds of them. Or you can implement your own lexer, but that takes time and resources. There are many ways to parse code; I generally rely on recursive descent parsing.
Next you need to do code generation. That's the most difficult part in my opinion. There are tools for that too, but you can do it manually if you want to. I tried to do it in my project, but it was pretty basic and buggy; there's some helpful code here and here.
Are there any good tutorials to learn how to do this?
As I suggested earlier, use tools to do it. There are a lot of pretty good, well-documented parser frameworks. For further information, you can try asking some people who know about this stuff. #DeadMG, over at the Lounge C++, is building a programming language called "Wide". You may try consulting him.
Let's say I have this statement in a programming language:
if (0 < 1) then
print("Hello")
The lexer will translate it into:
keyword: if
num: 0
op: <
num: 1
keyword: then
keyword: print
string: "Hello"
The parser will then take the information (aka "Token Stream") and make this:
if:
    expression:
        <:
            0, 1
    then:
        print:
            "Hello"
I don't know if this will help or not, but I hope it does.
I am writing an interpreter for a language where functions can be used as operators. However, the function's content will only be known at runtime.
For that I considered two solutions:
Parsing is done at runtime, using the runtime information on the function
All user-defined operators use default values for precedence and associativity.
I chose the latter, as I see a number of advantages in parsing separately from execution.
Now it comes to implementation, and I am interested to see what options there are. My initial thought is a shift-reduce parser, but I have little experience in constructing parsers.
Example:
LHS op RHS : LHS * RHS /* define a binary operator 'op' */
var : 3 /* define a variable */
print 5 op var /* should print 15 */
LHS op RHS : LHS / RHS /* Re-define op */
print var op var /* Should print 1 */
In the last case, the parser will get from the lexer: " id id id id ". Only at runtime do I know that the 'op' id is an operator.
(Posting the results of the comments, as requested.)
Solution #1 is definitely ugly, complex to implement, and unneeded, I agree. Solution #2 is by far easier to implement and comprehend. You can also allow custom associativity and precedence for operators, as long as those are known statically. The main thing is that these facts are known at parse time.
As for the actual parsing, most parsers will work just fine, as any two expressions surrounding an id are an application of a custom infix operator (this is less true if you allow custom precedence and associativity; in that case you need an algorithm that can determine those on a per-operator basis at parse time). In either case, my personal favorite is a "Top Down Operator Precedence Parser", or Pratt parser. I found the following resources (ordered by usefulness to me, YMMV) describe it quite well:
Simple Top-Down Parsing in Python
Pratt Parsers: Expression Parsing Made Easy
Top Down Operator Precedence
Two properties of the algorithm make it suit this problem very well:
The lookup of associativity ("binding power") happens dynamically for each token, allowing the parser to let the user define precedence for their operators.
It's very simple to write by hand[*], and you'll probably have to do that, as such a degree of dynamism is beyond the scope of most (at least all that I know of) parser generators.
[*] I've personally written a parser for a very large subset of Pascal (lacking only case, multidimensional arrays and perhaps some obscure subtleties) in 500 lines of Python and 2-3 days of work. The rest is only missing because other parts of the software it's used in were more interesting at the time, and I didn't have a reason to implement it.
I've been reading a bit about how interpreters/compilers work, and one area where I'm getting confused is the difference between an AST and a CST. My understanding is that the parser makes a CST, hands it to the semantic analyzer which turns it into an AST. However, my understanding is that the semantic analyzer simply ensures that rules are followed. I don't really understand why it would actually make any changes to make it abstract rather than concrete.
Is there something that I'm missing about the semantic analyzer, or is the difference between an AST and CST somewhat artificial?
A concrete syntax tree represents the source text exactly in parsed form. In general, it conforms to the context-free grammar defining the source language.
However, the concrete grammar and tree have a lot of things that are necessary to make source text unambiguously parseable, but do not contribute to actual meaning. For example, to implement operator precedence, your CFG usually has several levels of expression components (term, factor, etc.), with the operators connecting them at the different levels (you add terms to get expressions, terms are composed of factors optionally multiplied, etc.). To actually interpret or compile the language, however, you don't need this; you just need Expression nodes that have operators and operands. The abstract syntax tree is the result of simplifying the concrete syntax tree down to the things actually needed to represent the meaning of the program. This tree has a much simpler definition and is thus easier to process in the later stages of execution.
You usually don't need to actually build a concrete syntax tree. The action routines in your YACC (or Antlr, or Menhir, or whatever...) grammar can directly build the abstract syntax tree, so the concrete syntax tree only exists as a conceptual entity representing the parse structure of your source text.
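To make the contrast concrete, here is what the two trees might look like for the input 2 * (3 + 4) under a typical expression/term/factor grammar; the nested-tuple node shapes are made up purely for this illustration:

# Concrete syntax tree: one node per grammar production, including the
# single-child chains (expression -> term -> factor -> number) and the
# literal parentheses.
cst = ("expression",
        ("term",
          ("term", ("factor", ("number", "2"))),
          "*",
          ("factor",
            "(",
            ("expression",
              ("expression", ("term", ("factor", ("number", "3")))),
              "+",
              ("term", ("factor", ("number", "4")))),
            ")")))

# Abstract syntax tree: just operators and operands; precedence levels and
# parentheses are now implicit in the shape of the tree.
ast = ("*", 2, ("+", 3, 4))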
A concrete syntax tree matches what the grammar rules say is the syntax. The purpose of the abstract syntax tree is to have a "simple" representation of what's essential in "the syntax tree".
A real value of the AST, IMHO, is that it is smaller than the CST and therefore takes less time to process. (You might say, who cares? But I work with a tool where we have tens of millions of nodes live at once!)
Most parser generators that have any support for building syntax trees insist that you personally specify exactly how they get built under the assumption that your tree nodes will be "simpler" than the CST (and in that, they are generally right, as programmers are pretty lazy). Arguably it means you have to code fewer tree visitor functions, and that's valuable, too, in that it minimizes engineering energy. When you have 3500 rules (e.g., for COBOL) this matters. And this "simpler"ness leads to the good property of "smallness".
But having such ASTs creates a problem that wasn't there: it doesn't match the grammar, and now you have to mentally track both of them. And when there are 1500 AST nodes for a 3500 rule grammar, this matters a lot. And if the grammar evolves (they always do!), now you have two giant sets of things to keep in synch.
Another solution is to let the parser simply build CST nodes for you and just use those. This is a huge advantage when building the grammars: there's no need to invent 1500 special AST nodes to model 3500 grammar rules. Just think about the tree being isomorphic to the grammar. From the point of view of the grammar engineer this is completely brainless, which lets him focus on getting the grammar right and hacking at it to his heart's content. Arguably you have to write more node visitor rules, but that can be managed. More on this later.
What we do with the DMS Software Reengineering Toolkit is to automatically build a CST based on the results of a (GLR) parsing process. DMS then automatically constructs a "compressed" CST for space-efficiency reasons, by eliminating non-value-carrying terminals (keywords, punctuation), semantically useless unary productions, and forming directly-indexable lists for grammar rule pairs that are list-like:
L = e ;
L = L e ;
L2 = e2 ;
L2 = L2 ',' e2 ;
and a wide variety of variations of such forms. You think in terms of the grammar rules and the virtual CST; the tool operates on the compressed representation. Easy on your brain, faster/smaller at runtime.
Remarkably, the compressed CST built this way looks a lot like an AST that you might have designed by hand (see link at end to examples). In particular, the compressed CST doesn't carry any nodes that are just concrete syntax.
There are minor bits of awkwardness: for example, while the concrete nodes for '(' and ')' classically found in expression subgrammars are not in the tree, a "parentheses node" does appear in the compressed CST and has to be handled. A true AST would not have this. This seems like a pretty small price to pay for the convenience of never having to specify the AST construction. And the documentation for the tree is always available and correct: the grammar is the documentation.
How do we avoid "extra visitors"? We don't entirely, but DMS provides an AST library that walks the AST and handles the differences between the CST and the AST transparently. DMS also offers an "attribute grammar" evaluator (AGE), which is a method for passing values computed at nodes up and down the tree; the AGE handles all the tree representation issues, so the tool engineer only worries about writing computations effectively directly on the grammar rules themselves. Finally, DMS also provides "surface-syntax" patterns, which allow code fragments from the grammar to be used to find specific types of subtrees, without knowing most of the node types involved.
One of the other answers observes that if you want to build tools that can regenerate source, your AST will have to match the CST. That's not really right, but it is far easier to regenerate the source if you have CST nodes. DMS generates most of the prettyprinter automatically because it has access to both :-}
Bottom line: ASTs are good because they are small, both physically and conceptually. Automated AST construction from the CST provides both, and lets you avoid the problem of tracking two different sets.
EDIT March 2015: Link to examples of CST vs. "AST" built this way
This is based on the Expression Evaluator grammar by Terence Parr.
The grammar for this example:
grammar Expr002;
options
{
output=AST;
ASTLabelType=CommonTree; // type of $stat.tree ref etc...
}
prog : ( stat )+ ;
stat : expr NEWLINE -> expr
| ID '=' expr NEWLINE -> ^('=' ID expr)
| NEWLINE ->
;
expr : multExpr (( '+'^ | '-'^ ) multExpr)*
;
multExpr
: atom ('*'^ atom)*
;
atom : INT
| ID
| '('! expr ')'!
;
ID : ('a'..'z' | 'A'..'Z' )+ ;
INT : '0'..'9'+ ;
NEWLINE : '\r'? '\n' ;
WS : ( ' ' | '\t' )+ { skip(); } ;
Input
x=1
y=2
3*(x+y)
Parse Tree
The parse tree is a concrete representation of the input. The parse tree retains all of the information of the input. The empty boxes represent whitespace, i.e. end of line.
AST
The AST is an abstract representation of the input. Notice that parens are not present in the AST because the associations are derivable from the tree structure.
EDIT
For a more thorough explanation see Compilers and Compiler Generators pg. 23
This blog post may be helpful.
It seems to me that the AST "throws away" a lot of intermediate grammatical/structural information that wouldn't contribute to semantics. For example, you don't care that 3 is an atom is a term is a factor is a.... You just care that it's 3 when you're implementing the exponentiation expression or whatever.
The concrete syntax tree follows the rules of the grammar of the language. In the grammar, "expression lists" are typically defined with two rules
expression_list can be: expression
expression_list can be: expression, expression_list
Followed literally, these two rules give a comb shape to any expression list that appears in the program.
The abstract syntax tree is in the form that's convenient for further manipulation. It represents things in a way that makes sense to someone who understands the meaning of programs, and not just the way they are written. The expression list above, which may be the list of arguments of a function, may conveniently be represented as a vector of expressions, since it's better for static analysis to have the total number of expressions explicitly available and to be able to access each expression by its index.
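For instance, a later pass might flatten the comb produced by those two rules into such a vector. A minimal sketch, assuming the CST nodes are tuples shaped after the two rules above (these shapes are made up for this example):

def flatten_expression_list(node):
    # node is either ("expression_list", expr)         -- first rule
    #             or ("expression_list", expr, rest)   -- second rule
    exprs = []
    while len(node) == 3:
        _, expr, node = node
        exprs.append(expr)
    exprs.append(node[1])
    return exprs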
Simply put, the AST only contains the semantics of the code, while the parse tree/CST also includes information on exactly how the code was written.
The concrete syntax tree contains all the information, like superfluous parentheses and whitespace and comments; the abstract syntax tree abstracts away from this information.
NB: funny enough, when you implement a refactoring engine your AST will again contain all the concrete information, but you'll keep referring to it as an AST because that has become the standard term in the field (so one could say it has long ago lost its original meaning).
A CST (Concrete Syntax Tree) is a tree representation of the grammar (the rules of how the program should be written).
Depending on the compiler architecture, it can be used by the parser to produce an AST.
An AST (Abstract Syntax Tree) is a tree representation of the parsed source, produced by the parser part of the compiler. It stores information about tokens plus grammar.
Depending on the architecture of your compiler, the CST can be used to produce an AST. It is fair to say that the CST evolves into the AST. Or, the AST is a richer CST.
More explanations can be found on this link: http://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees#id6
It is a difference which doesn't make a difference.
An AST is usually explained as a way to approximate the semantics of a programming language expression by throwing away lexical content. For example in a context free grammar you might write the following EBNF rule
term: atom (('*' | '/') term )*
whereas in the AST case you only use mul_rule and div_rule, which express the proper arithmetic operations.
Can't those rules be introduced in the grammar in the first place? Of course. You can rewrite the above compact and abstract rule by breaking it into more concrete rules that mimic the mentioned AST nodes:
term: mul_rule | div_rule
mul_rule: atom ('*' term)*
div_rule: atom ('/' term)*
Now, when you think in terms of top-down parsing, the second form introduces a FIRST/FIRST conflict between mul_rule and div_rule, something an LL(1) parser cannot deal with. The first rule form was the left-factored version of the second one, which effectively eliminated structure. You have to pay some price for using LL(1) here.
So ASTs are an ad hoc supplement used to fix the deficiencies of grammars and parsers. The CST -> AST transformation is a refactoring move. No one ever bothered when an additional comma or colon was stored in the syntax tree. On the contrary, some authors retrofit them into ASTs because they like to use ASTs for doing refactorings instead of maintaining various trees at the same time or writing an additional inference engine. Programmers are lazy for good reasons. They even store line and column information gathered by lexical analysis in ASTs, for error reporting. Very abstract indeed.
I've been mulling over creating a language that would be extremely well suited to creation of DSLs, by allowing definitions of functions that are infix, postfix, prefix, or even consist of multiple words. For example, you could define an infix multiplication operator as follows (where multiply(X,Y) is already defined):
a * b => multiply(a,b)
Or a postfix "squared" operator:
a squared => a * a
Or a C or Java-style ternary operator, which involves two keywords interspersed with variables:
a ? b : c => if a==true then b else c
Clearly there is plenty of scope for ambiguities in such a language, but if it is statically typed (with type inference), then most ambiguities could be eliminated, and those that remain could be considered a syntax error (to be corrected by adding brackets where appropriate).
Is there some reason I'm not seeing that would make this extremely difficult, impossible, or just a plain bad idea?
Edit: A number of people have pointed me to languages that may do this or something like this, but I'm actually interested in pointers to how I could implement my own parser for it, or problems I might encounter if doing so.
This is not too hard to do. You'll want to assign each operator a fixity (infix, prefix, or postfix) and a precedence. Make the precedence a real number; you'll thank me later. Operators of higher precedence bind more tightly than operators of lower precedence; at equal levels of precedence, you can require disambiguation with parentheses, but you'll probably prefer to permit some operators to be associative so you can write
x + y + z
without parentheses. Once you have a fixity, a precedence, and an associativity for each operator, you'll want to write an operator-precedence parser. This kind of parser is fairly simple to write; it scans tokens from left to right and uses one auxiliary stack. There is an explanation in the dragon book, but I have never found it very clear, in part because the dragon book describes a very general case of operator-precedence parsing. But I don't think you'll find it difficult.
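For what it's worth, here is a minimal sketch of the same idea in a recursive (Pratt-style) formulation, which leans on the call stack rather than an explicit auxiliary stack. Everything here is an assumption of the sketch: the three fixity tables keyed by operator name (with real-number precedences, as suggested above), the peek()/next_token()/expect() helpers, and the tuple-shaped result nodes. peek() is assumed to return an end-of-input marker that appears in none of the tables, and equal-precedence clashes, including the prefix/postfix case discussed just below, would be reported as errors in a real implementation rather than silently resolved:

PREFIX  = {"-": 7.0, "not": 2.0}                                  # name -> precedence
INFIX   = {"+": (4.0, "left"), "*": (5.0, "left"), "**": (6.0, "right")}
POSTFIX = {"squared": 8.0}

def parse_expression(min_prec=0.0, allow_equal=True):
    tok = next_token()
    if tok == "(":
        left = parse_expression()
        expect(")")
    elif tok in PREFIX:
        left = ("prefix", tok, parse_expression(PREFIX[tok], allow_equal=False))
    else:
        left = ("operand", tok)

    while True:
        tok = peek()
        if tok in POSTFIX:
            prec = POSTFIX[tok]
            if prec < min_prec or (prec == min_prec and not allow_equal):
                break
            next_token()
            left = ("postfix", tok, left)
        elif tok in INFIX:
            prec, assoc = INFIX[tok]
            if prec < min_prec or (prec == min_prec and not allow_equal):
                break
            next_token()
            # a left-associative operator must not pull another operator of
            # equal precedence into its right-hand side; a right-associative
            # one may
            right = parse_expression(prec, allow_equal=(assoc == "right"))
            left = ("infix", tok, left, right)
        else:
            break
    return left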
Another case you'll want to be careful of is when you have
prefix (e) postfix
where prefix and postfix have the same precedence. This case also requires parentheses for disambiguation.
My paper Unparsing Expressions with Prefix and Postfix Operators has an example parser in the back, and you can download the code, but it's written in ML, so its workings may not be obvious to the amateur. But the whole business of fixity and so on is explained in great detail.
What are you going to do about order of operations?
a * b squared
You might want to check out Scala which has a kind of unique approach to operators and methods.
Haskell has just what you're looking for.