How to get ANTLR to output hierarchical ASTs? - parsing

I have a Lua grammar, (minor modifications to get it to output for C#, just namespace directives and a couple of option changes) and when I run it on some sample input, it gives me back a tree with a root "nil" node and as childs what looks to be a tokenized version of the input code. It looks like ANTLR's tree grammars operate on hierarchical trees and not "flat" ones, so I don't think I can use the output as-is.
Is there an easy fix for the grammar or does it need to be rewritten from the ground up?

Assuming your tree is just a 1 dimensional list of nodes, here's how you can create parent/sibling hierarchy:
In ANTLR there are two operators for AST creation:
! excludes the node (token) from the (sub)tree;
^ makes a node the root of a (sub)tree.
When no operator is provided, the nodes/tokens are added as children of the current root. This is probably what has happened to you: all you see is a one dimensional list of nodes/tokens.
An example:
grammar Exp;
options {output=AST;}
// ... some rules ...
addition
: Integer '+'^ Integer ';'!
;
Integer
: '0'
| '1'..'9' '0'..'9'*
;
The addition rule will create the following tree for the expression 6+9;:
+
/ \
/ \
6 9
As you can see: the + is the root (it has ^ after it), the numbers are tokens (they have no operator) and the semi colon is excluded (it has a ! after it).
For a detailed explanation, see chapter 7, Tree Construction, from The Definitive ANTLR Reference. I highly recommend you get a hold of a copy.
The question whether you should start from scratch is for you to decide. I'd just start with an empty grammar file and gradually add rules to it checking frequently to see if all works. Simply sprinkling some tree operators in an existing grammar may be pretty hard to debug: especially if you're not too familiar with ANTLR.
Best of luck!

Related

Extracting from .bib file with Raku (previously aka Perl 6)

I have this .bib file for reference management while writing my thesis in LaTeX:
#article{garg2017patch,
title={Patch testing in patients with suspected cosmetic dermatitis: A retrospective study},
author={Garg, Taru and Agarwal, Soumya and Chander, Ram and Singh, Aashim and Yadav, Pravesh},
journal={Journal of Cosmetic Dermatology},
year={2017},
publisher={Wiley Online Library}
}
#article{hauso2008neuroendocrine,
title={Neuroendocrine tumor epidemiology},
author={Hauso, Oyvind and Gustafsson, Bjorn I and Kidd, Mark and Waldum, Helge L and Drozdov, Ignat and Chan, Anthony KC and Modlin, Irvin M},
journal={Cancer},
volume={113},
number={10},
pages={2655--2664},
year={2008},
publisher={Wiley Online Library}
}
#article{siperstein1997laparoscopic,
title={Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases},
author={Siperstein, Allan E and Rogers, Stanley J and Hansen, Paul D and Gitomirsky, Alexis},
journal={Surgery},
volume={122},
number={6},
pages={1147--1155},
year={1997},
publisher={Elsevier}
}
If anyone wants to know what bib file is, you can find it detailed here.
I'd like to parse this with Perl 6 to extract the key along with the title like this:
garg2017patch: Patch testing in patients with suspected cosmetic dermatitis: A retrospective study
hauso2008neuroendocrine: Neuroendocrine tumor epidemiology
siperstein1997laparoscopic: Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases
Can you please help me to do this, maybe in two ways:
Using basic Perl 6
Using a Perl 6 Grammar
TL;DR
A complete and detailed answer that does just exactly as #Suman asks.
An introductory general answer to "I want to parse X. Can anyone help?"
A one-liner in a shell
I'll start with terse code that's perfect for some scenarios[1], and which someone might write if they're familiar with shell and Raku basics and in a hurry:
> raku -e 'for slurp() ~~ m:g / "#article\{" (<-[,]>+) \, \s+
"title=\{" (<-[}]>+) \} / -> $/ { put "$0: $1\n" }' < derm.bib
This produces precisely the output you specified:
garg2017patch: Patch testing in patients with suspected cosmetic dermatitis: A retrospective study
hauso2008neuroendocrine: Neuroendocrine tumor epidemiology
siperstein1997laparoscopic: Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases
Same single statement, but in a script
Skipping shell escapes and adding:
Whitespace.
Comments.
► use tio.run to run the code below
for slurp() # "slurp" (read all of) stdin and then
~~ m :global # match it "globally" (all matches) against
/ '#article{' (<-[,]>+) ',' \s+ # a "nextgen regex" that uses (`(...)`) to
'title={' (<-[}]>+) '}' / # capture the article id and title and then
-> $/ { put "$0: $1\n" } # for each article, print "article id: title".
Don't worry if the above still seems like pure gobbledygook. Later sections explain the above while also introducing code that's more general, clean, and readable.[2]
Four statements instead of one
my \input = slurp;
my \pattern = rule { '#article{' ( <-[,]>+ ) ','
'title={' ( <-[}]>+ ) }
my \articles = input .match: pattern, :global;
for articles -> $/ { put "$0: $1\n" }
my declares a lexical variable. Raku supports sigils at the start of variable names. But it also allows devs to "slash them out" as I have done.
my \pattern ...
my \pattern = rule { '#article{' ( <-[,]>+ ) ','
'title={' ( <-[}]>+ ) }
I've switched the pattern syntax from / ... / in the original one-liner to rule { ... }. I did this to:
Eliminate the risk of pathological backtracking
Classic regexes risk pathological backtracking. That's fine if you can just kill a program that's gone wild, but click the link to read how bad it can get! 🤪 We don't need backtracking to match the .bib format.
Communicate that the pattern is a rule
If you write a good deal of pattern matching code, you'll frequently want to use rule { ... }. A rule eliminates any risk of the classic regex problem just described (pathological backtracking), and has another superpower. I'll cover both aspects below, after first introducing the adverbs corresponding to those superpowers.
Raku regexes/rules can be (often are) used with "adverbs". These are convenient shortcuts that modify how patterns are applied.
I've already used an adverb in the earlier versions of this code. The "global" adverb (specified using :global or its shorthand alias :g) directs the matching engine to consume all of the input, generating a list of as many matches as it contains, instead of returning just the first match.
While there are shorthand aliases for adverbs, some are used so repeatedly that it's a lot tidier to bundle them up into distinct rule declarators. That's why I've used rule. It bundles up two adverbs appropriate for matching many data formats like .bib files:
:ratchet (alias :r)
:sigspace (alias :s)
Ratcheting (:r / :ratchet) tells the compiler that when an "atom" (a sub-pattern in a rule that is treated as one unit) has matched, there can be no going back on that. If an atom further on in the pattern in the same rule fails, then the whole rule immediately fails.
This eliminates any risk of the "pathological backtracking" discussed earlier.
Significant space handling (:s / :sigspace) tells the compiler that an atom followed by literal spacing that is in the pattern indicates that a "token" boundary pattern, aka ws should be appended to the atom.
Thus this adverb deals with tokenizing. Did you spot that I'd dropped the \s+ from the pattern compared to the original one in the one-liner? That's because :sigspace, which use of rule implies, takes care of that automatically:
say 'x#y x # y' ~~ m:g:s /x\#y/; # (「x#y」) <-- only one match
say 'x#y x # y' ~~ m:g /x \# y/; # (「x#y」) <-- only one match
say 'x#y x # y' ~~ m:g:s /x \# y/; # (「x#y」 「x # y」) <-- two matches
You might wonder why I've reverted to using / ... / to show these two examples. Turns out that while you can use rule { ... } with the .match method (described in the next section), you can't use rule with m. No problem; I just used :s instead to get the desired effect. (I didn't bother to use :r for ratcheting because it makes no difference for this pattern/input.)
To round out this dive into the difference between classic regexes (which can also be written regex { ... }) and rule rules, let me mention the other main option: token. The token declarator implies the :ratchet adverb, but not the :sigspace one. So it also eliminates the pathological backtracking risk of a regex (or / ... /) but, just like a regex, and unlike a rule, a token ignores whitespace used by a dev in writing out the rule's pattern.
my \articles = input .match: pattern, :global
This line uses the method form (.match) of the m routine used in the one-liner solution.
The result of a match when :global is used is a list of Match objects rather than just one. In this case we'll get three, corresponding to the three articles in the input file.
for articles -> $/ { put "$0: $1\n" }
This for statement successively binds a Match object corresponding to each of the three articles in your sample file to the symbol $/ inside the code block ({ ... }).
Per Raku doc on $/, "$/ is the match variable, so it usually contains objects of type Match.". It also provides some other conveniences; we take advantage of one of these conveniences related to numbered captures:
The pattern that was matched earlier contained two pairs of parentheses;
The overall Match object ($/) provides access to these two Positional captures via Positional subscripting (postfix []), so within the for's block, $/[0] and $/[1] provide access to the two Positional captures for each article;
Raku aliases $0 to $/[0] (and so on) for convenience, so most devs use the shorter syntax.
Interlude
This would be a good time to take a break. Maybe just a cuppa, or return here another day.
The last part of this answer builds up and thoroughly explains a grammar-based approach. Reading it may provide further insight into the solutions above and will show how to extend Raku's parsing to more complex scenarios.
But first...
A "boring" practical approach
I want to parse this with Raku. Can anyone help?
Raku may make writing parsers less tedious than with other tools. But less tedious is still tedious. And Raku parsing is currently slow.
In most cases, the practical answer when you want to parse well known formats and/or really big files is to find and use an existing parser. This might mean not using Raku at all, or using an existing Raku module, or using an existing non-Raku parser in Raku.
A suggested starting point is to search for the file format on modules.raku.org or raku.land. Look for a publicly shared parsing module already specifically packaged for Raku for the given file format. Then do some simple testing to see if you have a good solution.
At the time of writing there are no matches for 'bib'.
Even if you don't know C, there's almost certainly a 'bib' parsing C library already available that you can use. And it's likely to be the fastest solution. It's typically surprisingly easy to use an external library in your own Raku code, even if it's written in another programming language.
Using C libs is done using a feature called NativeCall. The doc I just linked may well be too much or too little, but please feel free to visit the freenode IRC channel #raku and ask for help. (Or post an SO question.) We're friendly folk. :)
If a C lib isn't right for a particular use case, then you can probably still use packages written in some other language such as Perl, Python, Ruby, Lua, etc. via their respective Inline::* language adapters.
The steps are:
Install a package (that's written in Perl, Python or whatever);
Make sure it runs on your system using a compiler of the language it's written for;
Install the appropriate Inline language adapter that lets Raku run packages in that other language;
Use the "foreign" package as if it were a Raku package containing exported Raku functions, classes, objects, values, etc.
(At least, that's the theory. Again, if you need help, please pop on the IRC channel or post an SO question.)
The Perl adapter is the most mature so I'll use that as an example. Let's say you use Perl's Text::BibTex packages and now wish to use Raku with that package. First, setup it up as it's supposed to be per its README. Then, in Raku, write something like:
use Text::BibTeX::BibFormat:from<Perl5>;
...
#blocks = $entry.format;
Explanation of these two lines:
The first line is how you tell Raku that you wish to load a Perl module.
(It won't work unless Inline::Perl5 is already installed and working. But it should be if you're using a popular Raku bundle. And if not, you should at least have the module installer zef so you can run zef install Inline::Perl5.)
The last line is just a mechanical Raku translation of the #blocks = $entry->format; line from the SYNOPSIS of the Perl package Text::BibTeX::BibFormat.
A Raku grammar / parser
OK. Enough "boring" practical advice. Let's now try have some fun creating a grammar based Raku parser good enough for the example from your question.
► use glot.io to run the code below
unit grammar bib;
rule TOP { <article>* }
rule article { '#article{' $<id>=<-[,]>+ ','
<kv-pairs>
'}'
}
rule kv-pairs { <kv-pair>* % ',' }
rule kv-pair { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }
With this grammar in place, we can now write something like:
die "Use CommaIDE?" unless bib .parsefile: 'derm.bib';
for $<article> -> $/ { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }
to generate exactly the same output as the previous solutions.
When a match or parse fails, by default Raku just returns Nil, which is, well, rather terse feedback.
There are several nice debugging options to figure out what's going on with a regex or grammar, but the best option by far is to use CommaIDE's Grammar-Live-View.
If you haven't already installed and used Comma, you're missing one of the best parts of using Raku. The features built in to the free version of Comma ("Community Edition") include outstanding grammar development / tracing / debugging tools.
Explanation of the 'bib' grammar
unit grammar bib;
The unit declarator is used at the start of a source file to tell Raku that the rest of the file declares a named package of code of a particular type.
The grammar keyword specifies a grammar. A grammar is like a class, but contains named "rules" -- not just named methods, but also named regexs, tokens, and rules. A grammar also inherits a bunch of general purpose rules from a base grammar.
rule TOP {
Unless you specify otherwise, parsing methods (.parse and .parsefile) that are called on a grammar start by calling the grammar's rule named TOP (declared with a rule, token, regex, or method declarator).
As a, er, rule of thumb, if you don't know if you should be using a rule, regex, token, or method for some bit of parsing, use a token. (Unlike regex patterns, tokens don't risk pathological backtracking.)
But I've used a rule. Like token patterns, rules also avoid the pathological backtracking risk. But, in addition rules interpret some whitespace in the pattern to be significant, in a natural manner. (See this SO answer for precise details.)
rules are typically appropriate towards the top of the parse tree. (Tokens are typically appropriate towards the leaves.)
rule TOP { <article>* }
The space at the end of the rule (between the * and pattern closing }) means the grammar will match any amount of whitespace at the end of the input.
<article> invokes another named rule in this grammar.
Because it looks like one should allow for any number of articles per bib file, I added a * (zero or more quantifier) at the end of <article>*.
rule article { '#article{' $<id>=<-[,]>+ ','
<kv-pairs>
'}'
}
If you compare this article pattern with the ones I wrote for the earlier Raku rules based solutions, you'll see various changes:
Rule in original one-liner
Rule in this grammar
Kept pattern as simple as possible.
Introduced <kv-pairs> and closing }
No attempt to echo layout of your input.
Visually echoes your input.
<[...]> is the Raku syntax for a character class, like[...] in traditional regex syntax. It's more powerful, but for now all you need to know is that the - in <-[,]> indicates negation, i.e. the same as the ^ in the [^,] syntax of ye olde regex. So <-[,]>+ attempts a match of one or more characters, none of which are ,.
$<id>=<-[,]>+ tells Raku to attempt to match the quantified "atom" on the right of the = (i.e. the <-[,]>+ bit) and store the results at the key <id> within the current match object. The latter will be hung from a branch of the parse tree; we'll get to precisely where later.
rule kv-pairs { <kv-pair>* % ',' }
This pattern illustrates one of several convenient Raku regex features. It declares you want to match zero or more kv-pairs separated by commas.
(In more detail, the % regex infix operator requires that matches of the quantified atom on its left are separated by the atom on its right.)
rule kv-pair { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }
The new bit here is '={' ~ '}'. This is another convenient regex feature. The regex Tilde operator parses a delimited structure (in this case one with a ={ opener and } closer) with the bit between the delimiters matching the quantified regex atom on the right of the closer. This confers several benefits but the main one is that error messages can be clearer.
I could have used the ~ approach in the /.../ regex in the one-liner, and vice-versa. But I wanted this grammar solution to continue the progression toward illustrating "better practice" idioms.
Constructing / deconstructing the parse tree
for $<article> { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }`
$<article>, $<id> etc. refer to named match objects that are stored somewhere in the "parse tree". But how did they get there? And exactly where is "there"?
Returning to the top of the grammar:
rule TOP {
If a .parse is successful, a single 'TOP' level match object is returned. (After a parse is complete the variable $/ is also bound to that top match object.) During parsing a tree will have been formed by hanging other match objects off this top match object, and then others hung off those, and so on.
Addition of match objects to a parse tree is done by adding either a single generated match object, or a list of them, to either a Positional (numbered) or Associative (named) capture of a "parent" match object. This process is explained below.
rule TOP { <article>* }
<article> invokes a match of the rule named article. An invocation of the rule <article> has two effects:
Raku tries to match the rule.
If it matches, Raku captures that match by generating a corresponding match object and adding it to the parse tree under the key <article> of the parent match object. (In this case the parent is the top match object.)
If the successfully matched pattern had been specified as just <article>, rather than as <article>*, then only one match would have been attempted, and only one value, a single match object, would have been generated and added under the key <article>.
But the pattern was <article>*, not merely <article>. So Raku attempts to match the article rule as many times as it can. If it matches at all, then a list of one or more match objects is stored as the value of the <article> key. (See my answer to "How do I access the captures within a match?" for a more detailed explanation.)
$<article> is short for $/<article>. It refers to the value stored under the <article> key of the current match object (which is stored in $/). In this case that value is a list of 3 match objects corresponding to the 3 articles in the input.
rule article { '#article{' $<id>=<-[,]>+ ','
Just as the top match object has several match objects hung off of it (the three captures of article matches that are stored under the top match object's <article> key), so too do each of those three article match objects have their own "child" match objects hanging off of them.
To see how that works, let's consider just the first of the three article match objects, the one corresponding to the text that starts "#article{garg2017patch,...". The article rule matches this article. As it's doing that matching, the $<id>=<-[,]>+ part tells Raku to store the match object corresponding to the id part of the article ("garg2017patch") under that article match object's <id> key.
Hopefully this is enough (quite possibly way too much!) and I can at last exhaustively (exhaustingly?) explain the last line of code, which, once again, was:
for $<article> -> $/ { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }`
At the level of the for, the variable $/ refers to the top of the parse tree generated by the parse that just completed. Thus $<article>, which is shorthand for $/<article>, refers to the list of three article match objects.
The for then iterates over that list, binding $/ within the lexical scope of the -> $/ { ... } block to each of those 3 article match objects in turn.
The $<id> bit is shorthand for $/<id>, which inside the block refers to the <id> key within the article match object that $/ has been bound to. In other words, $<id> inside the block is equivalent to $<article><id> outside the block.
The $<kv-pairs><kv-pair>[0]<value> follows the same scheme, albeit with more levels and a positional child (the [0]) in the midst of all the key (named/ associative) children.
(Note that there was no need for the article pattern to include a $<kv-pairs>=<kv-pairs> because Raku just presumes a pattern of the form <foo> should store its results under the key <foo>. If you wish to disable that, write a pattern with a non-alpha character as the first symbol. For example, use <.foo> if you want to have exactly the same matching effect as <foo> but just not store the matched input in the parse tree.)
Phew!
When the automatically generated parse tree isn't what you want
As if all the above were not enough, I need to mention one more thing.
The parse tree strongly reflects the tree structure of the grammar's rules calling one another from the top rule down to leaf rules. But the resulting structure is sometimes inconvenient.
Often one still wants a tree, but a simpler one, or perhaps some non-tree data structure.
The primary mechanism for generating exactly what you want from a parse, when the automatic results aren't suitable, is make. (This can be used in code blocks inside rules or factored out into Action classes that are separate from grammars.)
In turn, the primary use case for make is to generate a sparse tree of nodes hanging off the parse tree, such as an AST.
Footnotes
[1] Basic Raku is good for exploratory programming, spikes, one-offs, PoCs and other scenarios where the emphasis is on quickly producing working code that can be refactored later if need be.
[2] Raku's regexes/rules scale up to arbitrary parsing, as introduced in the latter half of this answer. This contrasts with past generations of regex which could not.[3]
[3] That said, ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ remains a great and relevant read. Not because Raku rules can't parse (X)HTML. In principle they can. But for a task as monumental as correctly handling full arbitrary in-the-wild XHTML I would strongly recommend you use an existing parser written expressly for that purpose. And this applies generally for existing formats; it's best not to reinvent the wheel. But the good news with Raku rules is that if you need to write a full parser, not just a bunch of regexes, you can do so, and it need not involve going insane!

Observer pattern usage in syntax tree parsing

I have detailed the specifications of the problem for reasons that will become clear after I ask my question, at the end. The program I am building is a parser in Java for a language with the following syntax (although this is not very relevant to the question):
<expr> ::= [<op> <expr> <expr>] | <symbol> | <value>
<symbol> ::= [a-zA-Z]+
<value> ::= [0-9]+
<op> ::= '+' | '*' | '==' | ‘<’
<assignment> ::= [= <symbol> <expr>]
<prog> ::= <assignment> |
[; <prog> <prog>] |
[if <expr> <prog> <prog>] |
[for <assignment> <expr> <assignment> <prog>] |
[assert <expr>] |
[return <expr>]`
This is an example of code in said language:
[; [= x 0] [; [if [== x 5] [= x 7] [= x [+ x 1]]] [return x]]]
Which is equivalent to:
x = 0;
if (x == 5)
x = 7;
else
x = x + 1;
return x;`
The code is guaranteed to be give in correct syntax; incorrectness of the given code is defined only by having:
a) An used variable (symbol) not previously declared (by declared meaning assigned something to it), even if the variable is used in a branch of an if or some other place that is never reached in the execution of the program;
b) Having a "return" instruction on each path the program could take, meaning the program cannot end without returning on any execution path it may take.
The requirements are that the program should parse the code.
My parser must:
a) Check for said correctness;
b) Parse the code and compute what is the returned value.
My take on this is:
1) Parse the code given into a tree of instructions and expressions;
2) Check for correctness by traversing the tree and seeing if a variable was declared in an upper scope before it was used;
3) Check for correctness by traversing the tree and seeing if any execution branch ends in a "return" instruction;
4) If all previous conditions hold, evaluate the returned value of the code by traversing the tree and remembering the value of all the variables in a HashMap or some other storage.
Now, my problem is that I must implement the parser using the Visitor and Observer design patterns. This is a key requirement for the project. I am fairly new to design patterns and only have a basic grasp about these two.
My question is: where should/can I use the Observer design patter in my parser?
It makes sense to use a Visitor to visit the nodes of the tree for steps 2, 3 and 4. I cannot figure out where I must use the Observer pattern, though.
Is there any place I can use it in my implementation? From my understanding, the Observer pattern takes care of data that can be read and modified by many "observers", the central idea being that an object modifying the data will announce the other objects that may be affected by the modification.
The main data being modified in my program is the tree and the HashMap in which I store the values for the variables. Both of there are accessed in a linear fashion, by only one thing. The tree is built one node at a time, and no other node, or object, for that matter, cares that a node is added or modified. In the evaluation phase, each node is visited and variables are added or modified in the hash table, but no object other than the current visitor from the current node cares about this. I suppose I can make each node an observer which upon observing a change does nothing, or something like that, forcing an Observer pattern, but that isn't really useful.
So, is there an obvious answer which I am completely missing? Is there a not so obvious one, but still giving an useful implementation of Observer? Can I use a half useful slightly forced Observer pattern somewhere in my algorithms, or is fully forced, completely useless way the only way to implement it? Is there a completely different way of approaching the problem which will allow me to use the Visitor and, more importantly, the Observer pattern?
Notes:
I am yet to implement the evaluation of the tree (steps 2, 3 and 4) with Visitor; I have only thought about how I should do it. I will implement it tomorrow and see if there is a way to use Observer somewhere, but having thought about how I could use it for a few hours, I still have no idea. I am hoping, though, that there is a way, which I haven't been able to discover but which will become clear after writing that part.
I apologize for writing so much. I couldn't summarize it better and still give details about the situation any better.
Also, I apologize if I am not clear in explanations. It is quite late, I have though about this for some hours and got tired, and I can't say I have a perfect grasp on the concepts. If anything is unclear or want further details on some matter, don't hesitate to ask. Also, don't hesitate in highlighting any mistakes or wrong paths in my judgement about the problem.
Here are some ideas how you could use well-known patterns and concepts to build an interpreter for your language:
Start processing an input stream by splitting it up into tokens ([;, =, x, 0, ], etc.). This first component (a.k.a. lexer, scanner, tokenizer) strips out irrelevant detail such as whitespace and produces tokens.
You could implement this as a simple state machine that is fed one character at a time. It can therefore be implemented as an observer of the input character sequence.
In the next stage (a.k.a. parsing) you build an abstract syntax tree (AST) from the generated tokens.
The parser is an observer of the tokenizer, i.e. it gets notified of any new tokens the tokenizer produces.(*) It is fed one token at a time. In the case of a fairly simple grammar, the scanner itself can also be some stack-based state machine. (For example, if it needs to match opening and closing brackets, it needs to be able to remember what context / state it was in outside the brackets, thus it'll probably use some kind of stack for context management.)
Once the parser has built an AST, perhaps almost all subsequent steps can be implemented using the visitor pattern. Every algorithm that traverses the AST, locates specific nodes or subtrees, or transforms (parts of) the AST can be modelled as a visitor. Some visitors will only model actions that do not return a value, others return a new AST node for a visited node such that a new transformed AST can be put together. For example:
Check whether the AST or any subtree of it describes a valid program fragment (visit a (sub-) tree, reduce it to a boolean or a list of errors).
Simplify / optimize the program (visit a (sub-) tree, generate a smaller or otherwise more optimal subtree, finally reassemble a new tree from the transformed subtrees).
Execute the AST (visit and interpret each node, execute some code accoeding to the node's meaning).
(*) Calling the parser an observer of the scanner is perhaps somewhat inaccurate. This Software Engineering SE post has a good summary of closely related design patterns. Could be that the scanner implements a strategy for dealing with the tokens.

YACC|BISON :How do I manipulate parse tree?

The goal of my application is to validate an sql code and generate,in the mean time, from that code a formatted one with some modification.For example this where clause :
where e.student_name= c.contact_name and ( c.address = " nefta"
or c.address=" tozeur ") and e.age <18
we will have as formatted output something like that :
where e.student_name= c.contact_name and (c.address=trim("nefta")
or c.address=trim("tozeur") ) and e.age <18
I hope I've explained my aim well
The problem is grammars may contain recursive rules which make the rewrite task unreliable ; for instance in my sql grammar i have this :
search_condition : search_condition OR search_condition{clbck_or}
| search_condition AND search_condition{clbck_and}
| NOT search_condition {clbck_not}
| '(' search_condition ')'{clbck__}
| predicate {clbck_pre}
;
Knowing that I specified a precedence priority to solve shift reduce problems
%left OR
%left AND
%left NOT
So back on the last example ; my clause where will be consumed this way:
c.address="nefta"or c.address="tozeur" -> search_condition
(c.address="nefta"or c.address="tozeur")->search_condition
e.student_name= c.contact_name and (c.address="nefta"or c.address="tozeur")-> search_condition
... and e.age<18-> search_condition
You can by the way understand that it's tough to rebuild the input stream referring to callbacks triggered by each reduction cause the order is not the same.
Any help for this problem ?
Your question is a bit vague, so I'm guessing that you actually print in your clbck_or (), etc. The "common" way to which wildplasser has alluded is to use "semantic values", i. e. (untested):
search_condition : search_condition OR search_condition{$$ = clbck_or($1, $3);}
| search_condition AND search_condition{$$ = clbck_and($1, $3);}
| NOT search_condition {$$ = clbck_not($2);}
| '(' search_condition ')'{$$ = clbck__($2);}
| predicate {$$ = clbck_pre($1);}
;
If you're using Bison, the manual has a fine example in the section "Infix Notation Calculator: `calc'". With strings and C, you will have to add some memory handling.
Bison is good at parsing, and with some manual help, good at building a custom syntax tree. After that, its up to you to do what you want with the tree. The good news is you can do whatever you want. The bad news is you still have to build a lot of machinery to do what you want. Your basic problem of regenerating source code is called "prettyprinting"; see my SO answer on how to prettyprint to understand what it takes to do this, including all the peccadillos of lexical syntax (you don't to lose the escapes in your literal strings, right?). You didn't at all address how to find the construct you wanted to change in the tree, or how you'd smash the tree to change it.
If you don't want to do all of that, then what you really want is a program transformation system, which is good at parsing, building a syntax tree for you (so you don't have to think about it, SQL is pretty big grammar), will let you find patterns in the tree in terms of SQL syntax you are used to, make tree changes without knowing much about the shape of the tree, and can finally regenerate valid source text by prettyprint as I describe in my answer link above. (A program transformation systems essentially includes a parser as a subroutine).
Our DMS Software Reengineering Toolkit is such a program transformation system. It has a set of predefined language definitions including SQL2011 and means for configuring for a particular dialect.
Using DMS source-to-source syntax rules, you could carry out the change in your example with the following rule:
domain SQL;
rule trim_c_members(f: identifier, s: string):condition->condition
= " c.\f = \s " -> " c.\f = trim(\s) ";
This is DMS Rule language (meta) syntax to describe a rewrite on ("domain") SQL code.
The rule has a name (because in complex application there's lot of rules) and it
as syntactic place holders "f" and "s"; it rewrites only conditions in the code.
The quotes are RSL meta-quotes; stuff inside is SQL with RSL metavariables "\f"
and "\s"; stuff outside is RSL rule syntax. What the rule says is,
"for any condition on a variable explicitly named 'c', with any field f,
if that field is compared by equality to some literal string, then replace
the literal string by 'trim' applied to the literal string".
I left out some code that basically says, "apply this rule to the entire tree, and don't apply it twice in the same place". That "strategy" is one of many built into DMS.
There's the question of how does the rule work. that is accomplished by DMS applying the SQL parser to the meta-quoted strings, to produce "pattern" syntax trees with placeholders where the metavariables are written. The left hand side pattern tree is then matched against the target tree with placeholder referring to subtrees; the right hand tree is spliced in where the left tree matched, and the placeholder subtrees transferred. So, you the programmer see surface sytax that you know and love; the tool works with trees and so it isn't confused by text.
Now, I don't think my rule matches exactly your intent, but that's partly because I can't guess your actual intent. You can write other rules if this isn't what you wanted.
This rule is purely driven by syntax; one can add a semantic predicate (not shown) if you want more complicated conditions to apply to the rule (e.g, the variable has to be ones only in certain scopes you define), and that gets messier to say. But it is much simpler and far easier to read than C code that climbs over the AST (notice you didn't see the AST here?) and tries to figure all this out.
The parsing and prettyprinting happens before and after rule application; there's a lot of machinery required to implement all that, but that machinery is built into DMS (e.g., it has something like [but more powerful] than Bison built in), and for predefined domains such as SQL, all the pretty printing works has been preconfigured, too.
If you want to get a better sense of what it takes to go full cycle with DMS (define your own language parser, define a pretty printer, define complicated rules), here's a nice and complete example of defining and symbolically simplifying calculus using DMS.

How should I construct and walk the AST output of an ANTLR3 grammar?

Documentation and general advice is that the abstract syntax tree should omit tokens that have no meaning. ("Record the meaningful input tokens (and only the meaningful tokens" - The Definitive ANTLR Reference) IE: In a C++ AST, you would omit the braces at the start and end of the class, as they have no meaning and are simply a mechanism for delineating class start and end for parsing purposes. I understand that, for the sake of rapidly and efficiently walking the tree, culling out useless token nodes like that is useful, but in order to appropriately colorize code, I would need that information, even if it doesn't contribute to the meaning of the code. A) Is there any reason why I shouldn't have the AST serve multiple purposes and choose not to omit said tokens?
It seems to me that what ANTLRWorks interpreter outputs is what I'm looking for. In the ANTLRWorks interpreter, it outputs a tree diagram where, for each rule matched, a node is made, as well as a child node for each token and/or subrule. The parse tree, I suppose it's called.
If manually walking a tree, wouldn't it be more useful to have nodes marking a rule? By having a node marking a rule, with it's subrules and tokens as children, a manual walker doesn't need to look ahead several nodes to know the context of what node it's on. Tree grammars seem redundant to me. Given a tree of AST nodes, the tree grammar "parses" the nodes all over again in order to produce some other output. B) Given that the parser grammar was responsible for generating correctly formed ASTs and given the inclusion of rule AST nodes, shouldn't a manual walker avoid the redundant AST node pattern matching of a tree grammar?
I fear I'm wildly misunderstanding the tree grammar mechanism's purpose. A tree grammar more or less defines a set of methods that will run through a tree, look for a pattern of nodes that matches the tree grammar rule, and execute some action based on that. I can't depend on forming my AST output based on what's neat and tidy for the tree grammar (omitting meaningless tokens for pattern matching speed) yet use the AST for color-coding even the meaningless tokens. I'm writing an IDE as well; I also can't write every possible AST node pattern that a plugin author might want to match, nor do I want to require them using ANTLR to write a tree grammar. In the case of plugin authors walking the tree for their own criteria, rule nodes would be rather useful to avoid needing pattern matching.
Thoughts? I know this "question" might push the limits of being an SO question, but I'm not sure how else to formulate my inquiries or where else to inquire.
Sion Sheevok wrote:
A) Is there any reason why I shouldn't have the AST serve multiple purposes and choose not to omit said tokens?
No, you mind as well keep them in there.
Sion Sheevok wrote:
It seems to me that what ANTLRWorks interpreter outputs is what I'm looking for. In the ANTLRWorks interpreter, it outputs a tree diagram where, for each rule matched, a node is made, as well as a child node for each token and/or subrule. The parse tree, I suppose it's called.
Correct.
Sion Sheevok wrote:
B) Given that the parser grammar was responsible for generating correctly formed ASTs and given the inclusion of rule AST nodes, shouldn't a manual walker avoid the redundant AST node pattern matching of a tree grammar?
Tree grammars are often used to mix custom code in to evaluate/interpret the input source. If you mix this code inside the parser grammar and there's some backtracking going on in the parser, this custom code could be executed more than it's supposed to. Walking the tree using a tree grammar is (if done properly) only possible in one way causing custom code to be executed just once.
But if a separate tree walker/iterator is necessary, there are two camps out there that advocate using tree grammars and others opt for walking the tree manually with a custom iterator. Both camps raise valid points about their preferred way of walking the AST. So there's no clear-cut way to do it in one specific way.
Sion Sheevok wrote:
Thoughts?
Since you're not evaluating/interpreting, you mind as well not use a tree grammar.
But to create a parse tree as ANTLRWorks does (which you have no access to, btw), you will need to mix AST rewrite rules inside your parser grammar though. Here's a Q&A that explains how to do that: How to output the AST built using ANTLR?
Good luck!

What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree?

I've been reading a bit about how interpreters/compilers work, and one area where I'm getting confused is the difference between an AST and a CST. My understanding is that the parser makes a CST, hands it to the semantic analyzer which turns it into an AST. However, my understanding is that the semantic analyzer simply ensures that rules are followed. I don't really understand why it would actually make any changes to make it abstract rather than concrete.
Is there something that I'm missing about the semantic analyzer, or is the difference between an AST and CST somewhat artificial?
A concrete syntax tree represents the source text exactly in parsed form. In general, it conforms to the context-free grammar defining the source language.
However, the concrete grammar and tree have a lot of things that are necessary to make source text unambiguously parseable, but do not contribute to actual meaning. For example, to implement operator precedence, your CFG usually has several levels of expression components (term, factor, etc.), with the operators connecting them at the different levels (you add terms to get expressions, terms are composed of factors optionally multipled, etc.). To actually interpret or compile the language, however, you don't need this; you just need Expression nodes that have operators and operands. The abstract syntax tree is the result of simplifying the concrete syntax tree down to the things actually needed to represent the meaning of the program. This tree has a much simpler definition and is thus easier to process in the later stages of execution.
You usually don't need to actually build a concrete syntax tree. The action routines in your YACC (or Antlr, or Menhir, or whatever...) grammar can directly build the abstract syntax tree, so the concrete syntax tree only exists as a conceptual entity representing the parse structure of your source text.
A concrete syntax tree matches what the grammar rules say is the syntax. The purpose of the abstract syntax tree is have a "simple" representation of what's essential in "the syntax tree".
A real value in the AST IMHO is that it is smaller than the CST, and therefore takes less time to process. (You might say, who cares? But I work with a tool where we have
tens of millions of nodes live at once!).
Most parser generators that have any support for building syntax trees insist that you personally specify exactly how they get built under the assumption that your tree nodes will be "simpler" than the CST (and in that, they are generally right, as programmers are pretty lazy). Arguably it means you have to code fewer tree visitor functions, and that's valuable, too, in that it minimizes engineering energy. When you have 3500 rules (e.g., for COBOL) this matters. And this "simpler"ness leads to the good property of "smallness".
But having such ASTs creates a problem that wasn't there: it doesn't match the grammar, and now you have to mentally track both of them. And when there are 1500 AST nodes for a 3500 rule grammar, this matters a lot. And if the grammar evolves (they always do!), now you have two giant sets of things to keep in synch.
Another solution is to let the parser simply build CST nodes for you and just use those. This is a huge advantage when building the grammars: there's no need to invent 1500 special AST nodes to model 3500 grammar rules. Just think about the tree being isomorphic to the grammar. From the point of view of the grammar engineer this is completely brainless, which lets him focus on getting the grammar right and hacking at it to his heart's content. Arguably you have to write more node visitor rules, but that can be managed. More on this later.
What we do with the DMS Software Reengineering Toolkit is to automatically build a CST based on the results of a (GLR) parsing process. DMS then automatically constructs an "compressed" CST for space efficiency reasons, by eliminating non-value carrying terminals (keywords, punctation), semantically useless unary productions, and forming directly-indexable lists for grammar rule pairs that are list like:
L = e ;
L = L e ;
L2 = e2 ;
L2 = L2 ',' e2 ;
and a wide variety of variations of such forms. You think in terms of the grammar rules and the virtual CST; the tool operates on the compressed representation. Easy on your brain, faster/smaller at runtime.
Remarkably, the compressed CST built this way looks a lot an AST that you might have designed by hand (see link at end to examples). In particular, the compressed CST doesn't carry any nodes that are just concrete syntax.
There are minor bits of awkwardness: for example while the concrete nodes for '(' and ')' classically found in expression subgrammars are not in the tree, a "parentheses node" does appear in the compressed CST and has to be handled. A true AST would not have this. This seems like a pretty small price to pay for the convenience of not have to specify the AST construction, ever. And the documentation for the tree is always available and correct: the grammar is the documentation.
How do we avoid "extra visitors"? We don't entirely, but DMS provides an AST library that walks the AST and handles the differences between the CST and the AST transparently. DMS also offers an "attribute grammar" evaluator (AGE), which is a method for passing values computed at nodes up and down the tree; the AGE handles all the tree representation issues and so the tool engineer only worries about writing computations effectively directly on the grammar rules themselves. Finally, DMS also provides "surface-syntax" patterns, which allows code fragments from the grammar to used to find specific types of subtrees, without knowing most of the node types involved.
One of the other answers observes that if you want to build tools that can regenerate source, your AST will have to match the CST. That's not really right, but it is far easier to regenerate the source if you have CST nodes. DMS generates most of the prettyprinter automatically because it has access to both :-}
Bottom line: ASTs are good for small, both phyiscal and conceptual. Automated AST construction from the CST provides both, and lets you avoid the problem of tracking two different sets.
EDIT March 2015: Link to examples of CST vs. "AST" built this way
This is based on the Expression Evaluator grammar by Terrence Parr.
The grammar for this example:
grammar Expr002;
options
{
output=AST;
ASTLabelType=CommonTree; // type of $stat.tree ref etc...
}
prog : ( stat )+ ;
stat : expr NEWLINE -> expr
| ID '=' expr NEWLINE -> ^('=' ID expr)
| NEWLINE ->
;
expr : multExpr (( '+'^ | '-'^ ) multExpr)*
;
multExpr
: atom ('*'^ atom)*
;
atom : INT
| ID
| '('! expr ')'!
;
ID : ('a'..'z' | 'A'..'Z' )+ ;
INT : '0'..'9'+ ;
NEWLINE : '\r'? '\n' ;
WS : ( ' ' | '\t' )+ { skip(); } ;
Input
x=1
y=2
3*(x+y)
Parse Tree
The parse tree is a concrete representation of the input. The parse tree retains all of the information of the input. The empty boxes represent whitespace, i.e. end of line.
AST
The AST is an abstract representation of the input. Notice that parens are not present in the AST because the associations are derivable from the tree structure.
EDIT
For a more through explanation see Compilers and Compiler Generators pg. 23
This blog post may be helpful.
It seems to me that the AST "throws away" a lot of intermediate grammatical/structural information that wouldn't contribute to semantics. For example, you don't care that 3 is an atom is a term is a factor is a.... You just care that it's 3 when you're implementing the exponentiation expression or whatever.
The concrete syntax tree follows the rules of the grammar of the language. In the grammar, "expression lists" are typically defined with two rules
expression_list can be: expression
expression_list can be: expression, expression_list
Followed literally, these two rules gives a comb shape to any expression list that appears in the program.
The abstract syntax tree is in the form that's convenient for further manipulation. It represents things in a way that makes sense for someone that understand the meaning of programs, and not just the way they are written. The expression list above, which may be the list of arguments of a function, may conveniently be represented as a vector of expressions, since it's better for static analysis to have the total number of expression explicitly available and be able to access each expression by its index.
Simply, AST only contains semantics of the code, Parse tree/CST also includes information on how exactly code was written.
The concrete syntax tree contains all information like superfluous parenthesis and whitespace and comments, the abstract syntax tree abstracts away from this information.
NB: funny enough, when you implement a refactoring engine your AST will again contain all the concrete information, but you'll keep referring to it as an AST because that has become the standard term in the field (so one could say it has long ago lost its original meaning).
CST(Concrete Syntax Tree) is a tree representation of the Grammar(Rules of how the program should be written).
Depending on compiler architecture, it can be used by the Parser to produce an AST.
AST(Abstract Syntax Tree) is a tree representation of Parsed source, produced by the Parser part of the compiler. It stores information about tokens+grammar.
Depending on architecture of your compiler, The CST can be used to produce an AST. It is fair to say that CST evolves into AST. Or, AST is a richer CST.
More explanations can be found on this link: http://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees#id6
It is a difference which doesn't make a difference.
An AST is usually explained as a way to approximate the semantics of a programming language expression by throwing away lexical content. For example in a context free grammar you might write the following EBNF rule
term: atom (('*' | '/') term )*
whereas in the AST case you only use mul_rule and div_rule which expresses the proper arithmetic operations.
Can't those rules be introduced in the grammar in the first place? Of course. You can rewrite the above compact and abstract rule by breaking it into a more concrete rules used to mimic the mentioned AST nodes:
term: mul_rule | div_rule
mul_rule: atom ('*' term)*
div_rule: atom ('/' term)*
Now, when you think in terms of top-down parsing then the second term introduces a FIRST/FIRST conflict between mul_rule and div_rule something an LL(1) parser cannot deal with. The first rule form was the left factored version of the second one which effectively eliminated structure. You have to pay some prize for using LL(1) here.
So ASTs are an ad hoc supplement used to fix the deficiencies of grammars and parsers. The CST -> AST transformation is a refactoring move. No one ever bothered when an additional comma or colon is stored in the syntax tree. On the contrary some authors retrofit them into ASTs because they like to use ASTs for doing refactorings instead of maintaining various trees at the same time or write an additional inference engine. Programmers are lazy for good reasons. Actually they store even line and column information gathered by lexical analysis in ASTs for error reporting. Very abstract indeed.

Resources