YACC|BISON :How do I manipulate parse tree? - parsing

The goal of my application is to validate an sql code and generate,in the mean time, from that code a formatted one with some modification.For example this where clause :
where e.student_name= c.contact_name and ( c.address = " nefta"
or c.address=" tozeur ") and e.age <18
we will have as formatted output something like that :
where e.student_name= c.contact_name and (c.address=trim("nefta")
or c.address=trim("tozeur") ) and e.age <18
I hope I've explained my aim well
The problem is grammars may contain recursive rules which make the rewrite task unreliable ; for instance in my sql grammar i have this :
search_condition : search_condition OR search_condition{clbck_or}
| search_condition AND search_condition{clbck_and}
| NOT search_condition {clbck_not}
| '(' search_condition ')'{clbck__}
| predicate {clbck_pre}
;
Knowing that I specified a precedence priority to solve shift reduce problems
%left OR
%left AND
%left NOT
So back on the last example ; my clause where will be consumed this way:
c.address="nefta"or c.address="tozeur" -> search_condition
(c.address="nefta"or c.address="tozeur")->search_condition
e.student_name= c.contact_name and (c.address="nefta"or c.address="tozeur")-> search_condition
... and e.age<18-> search_condition
You can by the way understand that it's tough to rebuild the input stream referring to callbacks triggered by each reduction cause the order is not the same.
Any help for this problem ?

Your question is a bit vague, so I'm guessing that you actually print in your clbck_or (), etc. The "common" way to which wildplasser has alluded is to use "semantic values", i. e. (untested):
search_condition : search_condition OR search_condition{$$ = clbck_or($1, $3);}
| search_condition AND search_condition{$$ = clbck_and($1, $3);}
| NOT search_condition {$$ = clbck_not($2);}
| '(' search_condition ')'{$$ = clbck__($2);}
| predicate {$$ = clbck_pre($1);}
;
If you're using Bison, the manual has a fine example in the section "Infix Notation Calculator: `calc'". With strings and C, you will have to add some memory handling.

Bison is good at parsing, and with some manual help, good at building a custom syntax tree. After that, its up to you to do what you want with the tree. The good news is you can do whatever you want. The bad news is you still have to build a lot of machinery to do what you want. Your basic problem of regenerating source code is called "prettyprinting"; see my SO answer on how to prettyprint to understand what it takes to do this, including all the peccadillos of lexical syntax (you don't to lose the escapes in your literal strings, right?). You didn't at all address how to find the construct you wanted to change in the tree, or how you'd smash the tree to change it.
If you don't want to do all of that, then what you really want is a program transformation system, which is good at parsing, building a syntax tree for you (so you don't have to think about it, SQL is pretty big grammar), will let you find patterns in the tree in terms of SQL syntax you are used to, make tree changes without knowing much about the shape of the tree, and can finally regenerate valid source text by prettyprint as I describe in my answer link above. (A program transformation systems essentially includes a parser as a subroutine).
Our DMS Software Reengineering Toolkit is such a program transformation system. It has a set of predefined language definitions including SQL2011 and means for configuring for a particular dialect.
Using DMS source-to-source syntax rules, you could carry out the change in your example with the following rule:
domain SQL;
rule trim_c_members(f: identifier, s: string):condition->condition
= " c.\f = \s " -> " c.\f = trim(\s) ";
This is DMS Rule language (meta) syntax to describe a rewrite on ("domain") SQL code.
The rule has a name (because in complex application there's lot of rules) and it
as syntactic place holders "f" and "s"; it rewrites only conditions in the code.
The quotes are RSL meta-quotes; stuff inside is SQL with RSL metavariables "\f"
and "\s"; stuff outside is RSL rule syntax. What the rule says is,
"for any condition on a variable explicitly named 'c', with any field f,
if that field is compared by equality to some literal string, then replace
the literal string by 'trim' applied to the literal string".
I left out some code that basically says, "apply this rule to the entire tree, and don't apply it twice in the same place". That "strategy" is one of many built into DMS.
There's the question of how does the rule work. that is accomplished by DMS applying the SQL parser to the meta-quoted strings, to produce "pattern" syntax trees with placeholders where the metavariables are written. The left hand side pattern tree is then matched against the target tree with placeholder referring to subtrees; the right hand tree is spliced in where the left tree matched, and the placeholder subtrees transferred. So, you the programmer see surface sytax that you know and love; the tool works with trees and so it isn't confused by text.
Now, I don't think my rule matches exactly your intent, but that's partly because I can't guess your actual intent. You can write other rules if this isn't what you wanted.
This rule is purely driven by syntax; one can add a semantic predicate (not shown) if you want more complicated conditions to apply to the rule (e.g, the variable has to be ones only in certain scopes you define), and that gets messier to say. But it is much simpler and far easier to read than C code that climbs over the AST (notice you didn't see the AST here?) and tries to figure all this out.
The parsing and prettyprinting happens before and after rule application; there's a lot of machinery required to implement all that, but that machinery is built into DMS (e.g., it has something like [but more powerful] than Bison built in), and for predefined domains such as SQL, all the pretty printing works has been preconfigured, too.
If you want to get a better sense of what it takes to go full cycle with DMS (define your own language parser, define a pretty printer, define complicated rules), here's a nice and complete example of defining and symbolically simplifying calculus using DMS.

Related

ANTLR4 - Parse subset of a language (e.g. just query statements)

I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
statement
: queryStatement
| undefinedStatement
;
// ...
undefinedStatement
: (.)+?
;
// ...
UndefinedToken
: (.)+?
;
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!
Parsing a subpart of a grammar is super easy. Usually you have a top level rule which you call to parse the full input with the entire grammar.
For the subpart use the function that parses only a subrule like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
Keep in mind however, that subrules usually are not termined with the EOF token (as the top level rule should be). This will cause no syntax error if more than the subelement is in the token stream (the parser just stops when the subrule has matched completely). If that's a problem for you then add a copy of the subrule you wanna parse, give it a dedicated name and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
;
dataType: // type in sql_yacc.yy
type = (
...
Check the MySQL grammar for more details.
This general idea -- to parse the interesting bits of an input and ignore the sea of surrounding tokens -- is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be sea of uninspected input characters. You can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)

How to create a Recipe / Ingredient parser

Given a line such as
1 pound of Beef
I want to extract the ingredient. Initially im only interested in the ingredient name.
Ive looked at rubys famous time parser Chronic and like its use of regexs.
def self.scan_for_month_names(token)
scanner = {/^jan\.?(uary)?$/ => :january,
/^feb\.?(ruary)?$/ => :february,
/^mar\.?(ch)?$/ => :march,
/^apr\.?(il)?$/ => :april,
/^may$/ => :may,
/^jun\.?e?$/ => :june,
/^jul\.?y?$/ => :july,
/^aug\.?(ust)?$/ => :august,
/^sep\.?(tember)?$/ => :september,
/^oct\.?(ober)?$/ => :october,
/^nov\.?(ember)?$/ => :november,
/^dec\.?(ember)?$/ => :december}
scanner.keys.each do |scanner_item|
return Chronic::RepeaterMonthName.new(scanner[scanner_item]) if scanner_item =~ token.word
end
return nil
end
However in my case Id probably have to create over 300 regexs for each individual ingredient.
I'd also have to take into account of synonyms such as Cilantro & Corriander Leaf
Ive never done parsing before but is the use of regexs here still the best way to go. I cant think of any other reasonable alternative.
Firstly, I'm assuming that the ingredients don't always take the form of QUANTITY UNIT of INGREDIENT - otherwise, this would be a very trivial task (just copy the substring after of
This is a difficult problem - the solution will not be simple.
I think using regex may not be the best approach here:
As you mention, you'll have to write a lot of expressions for each
ingredient
Your list of possible ingredients will always be limited
by the regex list, and you can't detect new ingredients without
compiling more.
it will be very difficult to parse some ingredients(cheese, 1 pound (parmesan))
I think that natural language processing is the way to go here. You have unstructured input, but in a very restricted context.
Perhaps counter-intuitively, I think the best way to find the ingredient may very well be to not look for it - look for everything else instead. If you assume that a line will always have
a numeral (quantity)
a unit (pounds, teaspoons, etc)
a ingredient
and that it's pretty easy to detect numerals and units, it should be straightforward to recognize those first and then extract the ingredient.
If you use a part-of-speech tagger, it's easy to identify relevant words:
[('1', 'LS'), ('pound', 'NN'), ('of', 'IN'), ('Beef', 'NNP')]
From there, you may want to use a classifier. For that, you'll need to label the ingredients manually on a good quantity of lines (say, hundreds). Some possibly good features to use:
position of the word in the line
presence in a precomputed ingredient dictionary (possibly using some partial string matching metric like Levenshtein's
output of part-of-speech tagger
words immediately before and after (if you have an 'of' before the word, there's a high probability it's a ingredient
I'm sure you'll be able to find countless others after working on a few lines.
Finally, I expect that some lines will be very difficult to work on. 1 pound of parmesan cheese, 1 pound of emmentaler: you'd have to infer that the second ingredient is a cheese, too.
As to software, if you can choose the language to use, python has the fantastic Natural Language Toolkit. I can't vouch for toolkits in other languages, but maybe someone else will.
I think I would start by running a series of regex checks against each line, and adjust the parsed text as you go. For example (pseudocode):
First, check for instruction:
/^(add|fold in|stir in|etc...)/
If you found an instruction, remove it from the line, set a flag, and continue:
instruction = $1
this_line = this_line.substring(instruction.length())
If an instruction was found, check to see if there was a subsequent instruction (like "and cover" or "and set aside")
/\b(and\s)(.*)$/
If found, strip that and insert it before the next line of the recipe
instruction = instruction.substring(0, instuction.length - $1.length - $2.length)
splice $2 into the array of lines immediately following this one
Next, maybe you'll check for a preposition:
/((?in)to\s(.+)/
If found, you might use that to check for equipment names, bowls, measuring cups, etc.
Even if you don't use it, you can probably remove it from the string you're parsing, to improve your matching.
Finally, the real work is done with the text that's left:
Check against /^(\d+\s+(?a\s)?\w+)\s*(?of\s*)?(.+)$/
Which should give you $1 containing the unit of measure and $2 containing the ingredient.
Lather. Rinse. Repeat.
After that, do whatever magic your app does with this information.
First of all, I suggest doing some searching to see if someone else has already created a solution to this problem which is good enough for you to use, rather than reinventing the wheel.
For instance, you may find this project to be interesting. It uses machine learning to attempt to parse ingredient phrases, including type of ingredients and amounts.
Other interesting projects also come up when googling for "ingredient parser".
If you are really determined to write this yourself, then I suggest that you do some research into the category of software tools known as a "parser generator", which is a tool which will allow you to write the language you want to recognize in an abstract form (a "grammar"), and then will generate code in your language of choice which will parse text according to that grammar and will identify specific subconstructs within it it very efficiently (much more efficiently than could be done by hundreds of regular expression matches).
For instance, a grammar used as input to a parser generator might look something like this:
// I am making up the following syntax for demonstration purposes, but it illustrates the
// sort of things that one could specify in a grammar, and is not terribly different from
// the grammar languages that real parser generators use.
//
// Note that everything in the curly braces is code to be inserted into the generated parser.
// Each such code block will be invoked when the preceding parsing rule is matched.
%declare { bool organic=false; bool dried=false; bool smoked=false; }
INGREDIENT ::= "organic" INGREDIENT { organic=true; }
| INGREDIENT "(" "organic" ")" { organic=true; }
| "dried" INGREDIENT { dried=true; }
| "smoked" INGREDIENT { smoked=true; }
| AMOUNT "of" INGREDIENT
| INGREDIENT "(" AMOUNT ")"
| BASE_INGREDIENT
BASE_INGREDIENT ::= ( WORD )* {
doSomethingWithBaseIngredient(organic, dried, smoked, $BASE_INGREDIENT);
}
AMOUNT ::= NUMBER ( VOLUME_UNIT | WEIGHT_UNIT )
VOLUME_UNIT ::= "cup" | "liter"
WEIGHT_UNIT ::= "mg" | "kg" | "pound"
NUMBER ::= [0-9]+
WORD ::= [a-zA-Z]+
... and so forth.
The parser generator, when run, would take this grammar as input, and would generate code in your desired programming language as output. This code would parse input text according to the grammar and would also set variables and/or call functions of yours as desired when certain parsing rules are matched. The parsers generated by such tools often use special parsing techniques (often involving large tables, state machines, and so forth) to parse very efficiently in a single pass without having to do any more work than necessary, and avoiding backtracking when possible.
Some common examples of parser generators are lexx/yacc, bison, and Antlr. Many others exist. (Personally, I have gotten good results with Antlr in the past, and am particularly fond of the fact that it can generate parsers in many different programming languages.) Many of these parser generators are mostly intended for use by compiler writers, but that does not mean that they can't be used for other purposes, such as recognizing the various forms that ingredients in recipes take.
This article provides an overview of parser generators, and this article contains a table of various parser generators and their attributes (output languages, etc.) as well as links on where to find more.

Building a parser (Part I)

I'm making my own javascript-based programming language (yeah, it is crazy, but it's for learn only... maybe?). Well, I'm reading about parsers and the first pass is to convert the code source to tokens, like:
if(x > 5)
return true;
Tokenizer to:
T_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_RPAREN ")"
T_IDENTIFIER "return"
T_TRUE "true"
T_TERMINATOR ";"
I don't know if my logic is correct for that for while. On my parser it is even better (or not?) and translate to it (yeah, multidimensional array):
T_IF "if"
T_EXPRESSION ...
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_CLOSURE ...
T_IDENTIFIER "return"
T_TRUE "true"
I have some doubts:
Is my way better or worse that the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of interpreted all the time.
After I tokenizer, what I need do exactly? I'm really lost on this pass!
There are some good tutorial to learn how I can do it?
Well, is that. Bye!
Generally, you want to separate the functions of the tokeniser (also called a lexer) from other stages of your compiler or interpreter. The reason for this is basic modularity: each pass consumes one kind of thing (e.g., characters) and produces another one (e.g., tokens).
So you’ve converted your characters to tokens. Now you want to convert your flat list of tokens to meaningful nested expressions, and this is what is conventionally called parsing. For a JavaScript-like language, you should look into recursive descent parsing. For parsing expressions with infix operators of different precedence levels, Pratt parsing is very useful, and you can fall back on ordinary recursive descent parsing for special cases.
Just to give you a more concrete example based on your case, I’ll assume you can write two functions: accept(token) and expect(token), which test the next token in the stream you’ve created. You’ll make a function for each type of statement or expression in the grammar of your language. Here’s Pythonish pseudocode for a statement() function, for instance:
def statement():
if accept("if"):
x = expression()
y = statement()
return IfStatement(x, y)
elif accept("return"):
x = expression()
return ReturnStatement(x)
elif accept("{")
xs = []
while True:
xs.append(statement())
if not accept(";"):
break
expect("}")
return Block(xs)
else:
error("Invalid statement!")
This gives you what’s called an abstract syntax tree (AST) of your program, which you can then manipulate (optimisation and analysis), output (compilation), or run (interpretation).
Most toolkits split the complete process into two separate parts
lexer (aka. tokenizer)
parser (aka. grammar)
The tokenizer will split the input data into tokens. The parser will only operate on the token "stream" and build the structure.
Your question seems to be focused on the tokenizer. But your second solution mixes the grammar parser and the tokenizer into one step. Theoretically this is also possible but for a beginner it is much easier to do it the same way as most other tools/framework: keep the steps separate.
To your first solution: I would tokenize your example like this:
T_KEYWORD_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_LITARAL "5"
T_RPAREN ")"
T_KEYWORD_RET "return"
T_KEYWORD_TRUE "true"
T_TERMINATOR ";"
In most languages keywords cannot be used as method names, variable names and so on. This is reflected already on the tokenizer level (T_KEYWORD_IF, T_KEYWORD_RET, T_KEYWORD_TRUE).
The next level would take this stream and - by applying a formal grammar - would build some datastructure (often called AST - Abstract Syntax Tree) which might look like this:
IfStatement:
Expression:
BinaryOperator:
Operator: T_GT
LeftOperand:
IdentifierExpression:
"x"
RightOperand:
LiteralExpression
5
IfBlock
ReturnStatement
ReturnExpression
LiteralExpression
"true"
ElseBlock (empty)
Implementing the parser by hand is usually done by some frameworks. Implementing something like that by hand and efficiently is usually done at a university in the better part of a semester. So you really should use some kind of framework.
The input for a grammar parser framework is usually a formal grammar in some kind of BNF. Your "if" part migh look like this:
IfStatement: T_KEYWORD_IF T_LPAREN Expression T_RPAREN Statement ;
Expression: LiteralExpression | BinaryExpression | IdentifierExpression | ... ;
BinaryExpression: LeftOperand BinaryOperator RightOperand;
....
That's only to get the idea. Parsing a realworld-language like Javascript correctly is not an easy task. But funny.
Is my way better or worse that the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of interpreted all the time.
What's the original way ? There are many different ways to implement languages. I think yours is fine actually, I once tried to build a language myself that translated to C#, the hack programming language. Many language compilers translate to an intermediate language, it's quite common.
After I tokenizer, what I need do exactly? I'm really lost on this pass!
After tokenizing, you need to parse it. Use some good lexer / parser framework, such as the Boost.Spirit, or Coco, or whatever. There are hundreds of them. Or you can implement your own lexer, but that takes time and resources. There are many ways to parse code, I generally rely on recursive descent parsing.
Next you need to do Code Generation. That's the most difficult part in my opinion. There are tools for that too, but you can do it manually if you want to, I tried to do it in my project, but it was pretty basic and buggy, there's some helpful code here and here.
There are some good tutorial to learn how I can do it?
As I suggested earlier, use tools to do it. There are a lot of pretty good well-documented parser frameworks. For further information, you can try asking some people who know about this stuff. #DeadMG , over at the Lounge C++ is building a programming language called "Wide". You may try consulting him.
Let's say I have this statement in a programming language:
if (0 < 1) then
print("Hello")
The lexer will translate it into:
keyword: if
num: 0
op: <
num: 1
keyword: then
keyword: print
string: "Hello"
The parser will then take the information (aka "Token Stream") and make this:
if:
expression:
<:
0, 1
then:
print:
"Hello"
I don't know if this will help or not, but I hope it does.

What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree?

I've been reading a bit about how interpreters/compilers work, and one area where I'm getting confused is the difference between an AST and a CST. My understanding is that the parser makes a CST, hands it to the semantic analyzer which turns it into an AST. However, my understanding is that the semantic analyzer simply ensures that rules are followed. I don't really understand why it would actually make any changes to make it abstract rather than concrete.
Is there something that I'm missing about the semantic analyzer, or is the difference between an AST and CST somewhat artificial?
A concrete syntax tree represents the source text exactly in parsed form. In general, it conforms to the context-free grammar defining the source language.
However, the concrete grammar and tree have a lot of things that are necessary to make source text unambiguously parseable, but do not contribute to actual meaning. For example, to implement operator precedence, your CFG usually has several levels of expression components (term, factor, etc.), with the operators connecting them at the different levels (you add terms to get expressions, terms are composed of factors optionally multipled, etc.). To actually interpret or compile the language, however, you don't need this; you just need Expression nodes that have operators and operands. The abstract syntax tree is the result of simplifying the concrete syntax tree down to the things actually needed to represent the meaning of the program. This tree has a much simpler definition and is thus easier to process in the later stages of execution.
You usually don't need to actually build a concrete syntax tree. The action routines in your YACC (or Antlr, or Menhir, or whatever...) grammar can directly build the abstract syntax tree, so the concrete syntax tree only exists as a conceptual entity representing the parse structure of your source text.
A concrete syntax tree matches what the grammar rules say is the syntax. The purpose of the abstract syntax tree is have a "simple" representation of what's essential in "the syntax tree".
A real value in the AST IMHO is that it is smaller than the CST, and therefore takes less time to process. (You might say, who cares? But I work with a tool where we have
tens of millions of nodes live at once!).
Most parser generators that have any support for building syntax trees insist that you personally specify exactly how they get built under the assumption that your tree nodes will be "simpler" than the CST (and in that, they are generally right, as programmers are pretty lazy). Arguably it means you have to code fewer tree visitor functions, and that's valuable, too, in that it minimizes engineering energy. When you have 3500 rules (e.g., for COBOL) this matters. And this "simpler"ness leads to the good property of "smallness".
But having such ASTs creates a problem that wasn't there: it doesn't match the grammar, and now you have to mentally track both of them. And when there are 1500 AST nodes for a 3500 rule grammar, this matters a lot. And if the grammar evolves (they always do!), now you have two giant sets of things to keep in synch.
Another solution is to let the parser simply build CST nodes for you and just use those. This is a huge advantage when building the grammars: there's no need to invent 1500 special AST nodes to model 3500 grammar rules. Just think about the tree being isomorphic to the grammar. From the point of view of the grammar engineer this is completely brainless, which lets him focus on getting the grammar right and hacking at it to his heart's content. Arguably you have to write more node visitor rules, but that can be managed. More on this later.
What we do with the DMS Software Reengineering Toolkit is to automatically build a CST based on the results of a (GLR) parsing process. DMS then automatically constructs an "compressed" CST for space efficiency reasons, by eliminating non-value carrying terminals (keywords, punctation), semantically useless unary productions, and forming directly-indexable lists for grammar rule pairs that are list like:
L = e ;
L = L e ;
L2 = e2 ;
L2 = L2 ',' e2 ;
and a wide variety of variations of such forms. You think in terms of the grammar rules and the virtual CST; the tool operates on the compressed representation. Easy on your brain, faster/smaller at runtime.
Remarkably, the compressed CST built this way looks a lot an AST that you might have designed by hand (see link at end to examples). In particular, the compressed CST doesn't carry any nodes that are just concrete syntax.
There are minor bits of awkwardness: for example while the concrete nodes for '(' and ')' classically found in expression subgrammars are not in the tree, a "parentheses node" does appear in the compressed CST and has to be handled. A true AST would not have this. This seems like a pretty small price to pay for the convenience of not have to specify the AST construction, ever. And the documentation for the tree is always available and correct: the grammar is the documentation.
How do we avoid "extra visitors"? We don't entirely, but DMS provides an AST library that walks the AST and handles the differences between the CST and the AST transparently. DMS also offers an "attribute grammar" evaluator (AGE), which is a method for passing values computed at nodes up and down the tree; the AGE handles all the tree representation issues and so the tool engineer only worries about writing computations effectively directly on the grammar rules themselves. Finally, DMS also provides "surface-syntax" patterns, which allows code fragments from the grammar to used to find specific types of subtrees, without knowing most of the node types involved.
One of the other answers observes that if you want to build tools that can regenerate source, your AST will have to match the CST. That's not really right, but it is far easier to regenerate the source if you have CST nodes. DMS generates most of the prettyprinter automatically because it has access to both :-}
Bottom line: ASTs are good for small, both phyiscal and conceptual. Automated AST construction from the CST provides both, and lets you avoid the problem of tracking two different sets.
EDIT March 2015: Link to examples of CST vs. "AST" built this way
This is based on the Expression Evaluator grammar by Terrence Parr.
The grammar for this example:
grammar Expr002;
options
{
output=AST;
ASTLabelType=CommonTree; // type of $stat.tree ref etc...
}
prog : ( stat )+ ;
stat : expr NEWLINE -> expr
| ID '=' expr NEWLINE -> ^('=' ID expr)
| NEWLINE ->
;
expr : multExpr (( '+'^ | '-'^ ) multExpr)*
;
multExpr
: atom ('*'^ atom)*
;
atom : INT
| ID
| '('! expr ')'!
;
ID : ('a'..'z' | 'A'..'Z' )+ ;
INT : '0'..'9'+ ;
NEWLINE : '\r'? '\n' ;
WS : ( ' ' | '\t' )+ { skip(); } ;
Input
x=1
y=2
3*(x+y)
Parse Tree
The parse tree is a concrete representation of the input. The parse tree retains all of the information of the input. The empty boxes represent whitespace, i.e. end of line.
AST
The AST is an abstract representation of the input. Notice that parens are not present in the AST because the associations are derivable from the tree structure.
EDIT
For a more through explanation see Compilers and Compiler Generators pg. 23
This blog post may be helpful.
It seems to me that the AST "throws away" a lot of intermediate grammatical/structural information that wouldn't contribute to semantics. For example, you don't care that 3 is an atom is a term is a factor is a.... You just care that it's 3 when you're implementing the exponentiation expression or whatever.
The concrete syntax tree follows the rules of the grammar of the language. In the grammar, "expression lists" are typically defined with two rules
expression_list can be: expression
expression_list can be: expression, expression_list
Followed literally, these two rules gives a comb shape to any expression list that appears in the program.
The abstract syntax tree is in the form that's convenient for further manipulation. It represents things in a way that makes sense for someone that understand the meaning of programs, and not just the way they are written. The expression list above, which may be the list of arguments of a function, may conveniently be represented as a vector of expressions, since it's better for static analysis to have the total number of expression explicitly available and be able to access each expression by its index.
Simply, AST only contains semantics of the code, Parse tree/CST also includes information on how exactly code was written.
The concrete syntax tree contains all information like superfluous parenthesis and whitespace and comments, the abstract syntax tree abstracts away from this information.
NB: funny enough, when you implement a refactoring engine your AST will again contain all the concrete information, but you'll keep referring to it as an AST because that has become the standard term in the field (so one could say it has long ago lost its original meaning).
CST(Concrete Syntax Tree) is a tree representation of the Grammar(Rules of how the program should be written).
Depending on compiler architecture, it can be used by the Parser to produce an AST.
AST(Abstract Syntax Tree) is a tree representation of Parsed source, produced by the Parser part of the compiler. It stores information about tokens+grammar.
Depending on architecture of your compiler, The CST can be used to produce an AST. It is fair to say that CST evolves into AST. Or, AST is a richer CST.
More explanations can be found on this link: http://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees#id6
It is a difference which doesn't make a difference.
An AST is usually explained as a way to approximate the semantics of a programming language expression by throwing away lexical content. For example in a context free grammar you might write the following EBNF rule
term: atom (('*' | '/') term )*
whereas in the AST case you only use mul_rule and div_rule which expresses the proper arithmetic operations.
Can't those rules be introduced in the grammar in the first place? Of course. You can rewrite the above compact and abstract rule by breaking it into a more concrete rules used to mimic the mentioned AST nodes:
term: mul_rule | div_rule
mul_rule: atom ('*' term)*
div_rule: atom ('/' term)*
Now, when you think in terms of top-down parsing then the second term introduces a FIRST/FIRST conflict between mul_rule and div_rule something an LL(1) parser cannot deal with. The first rule form was the left factored version of the second one which effectively eliminated structure. You have to pay some prize for using LL(1) here.
So ASTs are an ad hoc supplement used to fix the deficiencies of grammars and parsers. The CST -> AST transformation is a refactoring move. No one ever bothered when an additional comma or colon is stored in the syntax tree. On the contrary some authors retrofit them into ASTs because they like to use ASTs for doing refactorings instead of maintaining various trees at the same time or write an additional inference engine. Programmers are lazy for good reasons. Actually they store even line and column information gathered by lexical analysis in ASTs for error reporting. Very abstract indeed.

Parsing rules - how to make them play nice together

So I'm doing a Parser, where I favor flexibility over speed, and I want it to be easy to write grammars for, e.g. no tricky workaround rules (fake rules to solve conflicts etc, like you have to do in yacc/bison etc.)
There's a hand-coded Lexer with a fixed set of tokens (e.g. PLUS, DECIMAL, STRING_LIT, NAME, and so on) right now there are three types of rules:
TokenRule: matches a particular token
SequenceRule: matches an ordered list of rules
GroupRule: matches any rule from a list
For example, let's say we have the TokenRule 'varAccess', which matches token NAME (roughly /[A-Za-z][A-Za-z0-9_]*/), and the SequenceRule 'assignment', which matches [expression, TokenRule(PLUS), expression].
Expression is a GroupRule matching either 'assignment' or 'varAccess' (the actual ruleset I'm testing with is a bit more complete, but that'll do for the example)
But now let's say I want to parse
var1 = var2
And let's say the Parser begins with rule Expression (the order in which they are defined shouldn't matter - priorities will be solved later). And let's say the GroupRule expression will first try 'assignment'. Then since 'expression' is the first rule to be matched in 'assignment', it will try to parse an expression again, and so on until the stack is filled up and the computer - as expected - simply gives up in a sparkly segfault.
So what I did is - SequenceRules add themselves as 'leafs' to their first rule, and become non-roôt rules. Root rules are rules that the parser will first try. When one of those is applied and matches, it tries to subapply each of its leafs, one by one, until one matches. Then it tries the leafs of the matching leaf, and so on, until nothing matches anymore.
So that it can parse expressions like
var1 = var2 = var3 = var4
Just right =) Now the interesting stuff. This code:
var1 = (var2 + var3)
Won't parse. What happens is, var1 get parsed (varAccess), assign is sub-applied, it looks for an expression, tries 'parenthesis', begins, looks for an expression after the '(', finds var2, and then chokes on the '+' because it was expecting a ')'.
Why doesn't it match the 'var2 + var3' ? (and yes, there's an 'add' SequenceRule, before you ask). Because 'add' isn't a root rule (to avoid infinite recursion with the parse-expresssion-beginning-with-expression-etc.) and that leafs aren't tested in SequenceRules otherwise it would parse things like
reader readLine() println()
as
reader (readLine() println())
(e.g. '1 = 3' is the expression expected by add, the leaf of varAccess a)
whereas we'd like it to be left-associative, e.g. parsing as
(reader readLine()) println()
So anyway, now we've got this problem that we should be able to parse expression such as '1 + 2' within SequenceRules. What to do? Add a special case that when SequenceRules begin with a TokenRule, then the GroupRules it contains are tested for leafs? Would that even make sense outside that particular example? Or should one be able to specify in each element of a SequenceRule if it should be tested for leafs or not? Tell me what you think (other than throw away the whole system - that'll probably happen in a few months anyway)
P.S: Please, pretty please, don't answer something like "go read this 400pages book or you don't even deserve our time" If you feel the need to - just refrain yourself and go bash on reddit. Okay? Thanks in advance.
LL(k) parsers (top down recursive, whether automated or written by hand) require refactoring of your grammar to avoid left recursion, and often require special specifications of lookahead (e.g. ANTLR) to be able to handle k-token lookahead. Since grammars are complex, you get to discover k by experimenting, which is exactly the thing you wish to avoid.
YACC/LALR(1) grammars aviod the problem of left recursion, which is a big step forward. The bad news is that there are no real programming langauges (other than Wirth's original PASCAL) that are LALR(1). Therefore you get to hack your grammar to change it from LR(k) to LALR(1), again forcing you to suffer the experiments that expose the strange cases, and hacking the grammar reduction logic to try to handle K-lookaheads when the parser generators (YACC, BISON, ... you name it) produce 1-lookahead parsers.
GLR parsers (http://en.wikipedia.org/wiki/GLR_parser) allow you to avoid almost all of this nonsense. If you can write a context free parser, under most practical circumstances, a GLR parser will parse it without further effort. That's an enormous relief when you try to write arbitrary grammars. And a really good GLR parser will directly produce a tree.
BISON has been enhanced to do GLR parsing, sort of. You still have to write complicated logic to produce your desired AST, and you have to worry about how to handle failed parsers and cleaning up/deleting their corresponding (failed) trees. The DMS Software Reengineering Tookit provides standard GLR parsers for any context free grammar, and automatically builds ASTs without any additional effort on your part; ambiguous trees are automatically constructed and can be cleaned up by post-parsing semantic analyis. We've used this to do define 30+ language grammars including C, including C++ (which is widely thought to be hard to parse [and it is almost impossible to parse with YACC] but is straightforward with real GLR); see C+++ front end parser and AST builder based on DMS.
Bottom line: if you want to write grammar rules in a straightforward way, and get a parser to process them, use GLR parsing technology. Bison almost works. DMs really works.
My favourite parsing technique is to create recursive-descent (RD) parser from a PEG grammar specification. They are usually very fast, simple, and flexible. One nice advantage is you don't have to worry about separate tokenization passes, and worrying about squeezing the grammar into some LALR form is non-existent. Some PEG libraries are listed [here][1].
Sorry, I know this falls into throw away the system, but you are barely out of the gate with your problem and switching to a PEG RD parser, would just eliminate your headaches now.

Resources