How can I traverse the parse tree generated by yacc? - parsing

Suppose I have given a valid arithmetic expression to my yacc file, and now I want to show what the parse tree looks like by traversing it in pre- or post-order. Is it possible to traverse the parse tree? I'm just a rookie in compiler design.

Only if you build the tree yourself while you are parsing. Bison/yacc won't do that for you.
However, the reduce actions are effectively a depth-first post-order walk, since the reductions happen bottom-up, left-to-right. So you can get something very similar to a parse tree walk by putting your "visit" code in each reduction rule.
But you're really better off learning how to create ASTs.
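To make the post-order point concrete, here is a small Python sketch (not yacc itself; `Node`, `mk`, and `visits` are made-up names) of what happens when each reduction action builds a node from its already-built children:

```python
class Node:
    def __init__(self, op, *children):
        self.op, self.children = op, children

visits = []

def mk(op, *children):
    # In yacc this body would live in a rule's { ... } action; by the time
    # it runs, the children ($1, $3, ...) are already fully built.
    visits.append(op)
    return Node(op, *children)

# The reductions for "1 + 2 * 3" fire innermost-first, so the "visit"
# order is exactly a post-order walk of the parse tree:
tree = mk('+', mk('1'), mk('*', mk('2'), mk('3')))
print(visits)  # ['1', '2', '3', '*', '+']
```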

LL top-down parser, from CST to AST

I am currently learning about syntax analysis, and more especially, top-down parsing.
I know the terminology and the difference from bottom-up LR parsers, and since top-down LL parsers are easier to implement by hand, I am looking forward to making my own.
I have seen two kinds of approach:
The recursive-descent one using a collection of recursive functions.
The stack-based and table-driven automaton as shown here on Wikipedia.
I am more interested in the latter, for its power and its elimination of call-stack recursion. However, I don't understand how to build the AST from the implicit parse tree.
This code example of a stack-based finite automaton shows the parser analyzing the input buffer, only giving a yes/no answer once the syntax has been accepted.
I have heard of stack annotations in order to build the AST, but I can't figure out how to implement them. Can someone provide a practical implementation of such a technique?
"Top-down" and "bottom-up" are excellent descriptions of the two parsing strategies, because they describe precisely how the syntax tree would be constructed if it were constructed. (You can also think of it as the traversal order over the implicit parse tree but here we're actually interested in real parse trees.)
It seems clear that there is an advantage to bottom-up tree construction. When it is time to add a node to the tree, you already know what its children are. You can construct the node fully-formed in one (functional) action. All the child information is right there waiting for you, so you can add semantic information to the node based on the semantic information of its children, even using the children in an order other than left-to-right.
By contrast, the top-down parser constructs the node without any children, and then needs to add each child in turn to the already constructed node. That's certainly possible, but it's a bit ugly. Also, the incremental nature of the node constructor means that semantic information attached to the node also needs to be computed incrementally, or deferred until the node is fully constructed.
In many ways, this is similar to the difference between evaluating expressions written in Reverse Polish Notation (RPN) from expressions written in (Forward) Polish Notation [Note 1]. RPN was invented precisely to ease evaluation, which is possible with a simple value stack. Forward Polish expressions can be evaluated, obviously: the easiest way is to use a recursive evaluator but in environments where the call stack can not be relied upon it is possible to do it using an operator stack, which effectively turns the expression into RPN on the fly.
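The value-stack idea behind RPN evaluation can be sketched in a few lines of Python (a toy evaluator, not tied to any particular parser):

```python
def eval_rpn(tokens):
    stack = []  # the value stack: operands wait here for their operator
    for tok in tokens:
        if tok in ('+', '-', '*', '/'):
            b, a = stack.pop(), stack.pop()  # right operand is on top
            stack.append({'+': a + b, '-': a - b,
                          '*': a * b, '/': a / b}[tok])
        else:
            stack.append(float(tok))
    return stack.pop()

print(eval_rpn("3 4 + 2 *".split()))  # 14.0 -- i.e. (3 + 4) * 2
```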
So that's probably the mechanism of choice for building syntax trees from top-down parsers as well. We add a "reduction" marker to the end of every right-hand side. Since the right-hand side is pushed onto the parser stack right-to-left, the marker at its end is pushed first.
We also need a value stack, to record the AST nodes (or semantic values) being constructed.
In the basic algorithm, we now have one more case. We start by popping the top of the parser stack, and then examine this object:
The top of the parser stack was a terminal. If the current input symbol is the same terminal, we remove the input symbol from the input, and push it (or its semantic value) onto the value stack.
The top of the parser stack was a marker. The associated reduction action is triggered, which will create the new AST node by popping an appropriate number of values from the value stack and combining them into a new AST node which is then pushed onto the value stack. (As a special case, the marker action for the augmented start symbol's unique production S' -> S $ causes the parse to be accepted, returning the (only) value in the value stack as the AST.)
The top of the parser stack was a non-terminal. We then identify the appropriate right-hand side using the current input symbol, and push that right-hand side (right-to-left) onto the parser stack.
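The three cases above can be sketched in Python for a tiny LL(1) grammar, E -> NUM | '(' E '+' E ')'. The grammar, names, and token encoding are all made up for illustration:

```python
def parse(tokens):
    tokens = list(tokens) + ['$']
    pos = 0
    values = []                 # value stack: tokens and finished AST nodes

    def mk_num():
        # marker for E -> NUM: the value stack top is the NUM token
        values.append(('num', values.pop()))

    def mk_add():
        # marker for E -> '(' E '+' E ')': pop the five values, keep two
        _rp = values.pop(); right = values.pop(); _plus = values.pop()
        left = values.pop(); _lp = values.pop()
        values.append(('+', left, right))

    stack = ['$', 'E']          # parser stack, top at the end of the list
    while stack:
        top = stack.pop()
        cur = tokens[pos]
        if callable(top):       # case: marker -> run the reduction action
            top()
        elif top == 'E':        # case: non-terminal -> pick a right-hand side
            if cur == '(':
                # marker pushed first (deepest), then the RHS right-to-left
                stack += [mk_add, ')', 'E', '+', 'E', '(']
            else:
                stack += [mk_num, 'NUM']
        else:                   # case: terminal -> match and shift its value
            if top == '$' and cur == '$':
                break
            kind = cur if cur in '()+$' else 'NUM'
            assert top == kind, f"expected {top}, got {cur}"
            values.append(cur)
            pos += 1
    return values[-1]

print(parse(['(', '1', '+', '(', '2', '+', '3', ')', ')']))
# ('+', ('num', '1'), ('+', ('num', '2'), ('num', '3')))
```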
You need to understand the underlying concept: the pushdown automaton. After you understand how to carry out the computation on paper with a pencil, you will be able to understand the multiple ways to implement the idea, via recursive descent or with an explicit stack. The ideas are the same: when you use recursive descent, you implicitly use the call stack that the program uses for execution, where the execution data is combined with the parsing automaton's data.
I suggest you start with the course taught by Ullman (automata) or the one by Dick Grune; the latter is the one best focused on parsing. (Grune's book is Parsing Techniques; look for the 2nd edition.)
For LR parsing, the essential thing is to understand the ideas of Earley; from these ideas, Don Knuth created the LR method.
For LL parsing, Grune's book is excellent, and Ullman presents the computation on paper: the mathematical background of parsing that is essential to know if you want to implement your own parsers.
Concerning the AST: this is the output of parsing. A parser can generate a parse tree that is then transformed into an AST, or it can generate and output the AST directly.

FParsec parse unordered clauses

I want to parse some grammar like the following
OUTPUT data
GROUPBY key
TO location
USING object
the order of the GROUPBY TO USING clauses is allowed to vary, but each clause can occur at most once.
Is there a convenient or built-in way to parse this in FParsec? I read some questions and answers that mention Haskell Parsec's permute. There doesn't seem to be a permute in FParsec. If this is the way to go, how would I go about building a permute in FParsec?
I don't think there's a permutation parser in FParsec. I see a few directions you could take it though.
In general, what #FuleSnabel suggests is pretty sound, and probably simplest to implement. Don't make the parser responsible for asserting the property that each clause appears at most once. Instead parse each clause separately, allowing for duplicates, then inspect the resulting AST and error out if your property doesn't hold.
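That "validate after parsing" idea is easy to sketch outside FParsec (plain Python; `check_clauses` and the pair encoding are made up for illustration):

```python
def check_clauses(clauses):
    """clauses: list of (keyword, value) pairs as produced by the parser,
    in whatever order and multiplicity the input happened to have."""
    seen = set()
    for kw, _ in clauses:
        if kw in seen:
            raise ValueError(f"duplicate clause: {kw}")
        seen.add(kw)
    return dict(clauses)

print(check_clauses([('TO', 'location'), ('GROUPBY', 'key'),
                     ('USING', 'object')]))
# {'TO': 'location', 'GROUPBY': 'key', 'USING': 'object'}
```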
You could generate all permutations of your parsers and combine them with choice. Obviously this approach doesn't scale, but for three parsers I'd say it's fair game.
You could write your own primitive for parsing using a collection of parsers applied in any order. This would be a variant of many where in each step you make a choice of a parser, then discard that parser. So in each step you choose from a shrinking list of parsers until you can't parse anymore, finally returning the results collected along the way.
You could use user state to keep track of parsers already used and fail if a parser would be used twice within the same context. Not sure if this would yield a particularly nice solution - haven't really tried it before.
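The "shrinking list of parsers" variant of many can also be sketched outside FParsec (plain Python; the parser representation, a function from `(tokens, pos)` to `(value, new_pos)` or `None`, is made up):

```python
def clause(keyword):
    # A toy clause parser: matches `keyword value` at the current position.
    def p(tokens, pos):
        if pos + 1 < len(tokens) and tokens[pos] == keyword:
            return (keyword, tokens[pos + 1]), pos + 2
        return None
    return p

def permute(parsers, tokens, pos=0):
    remaining = list(parsers)
    results = {}
    progressing = True
    while progressing and remaining:
        progressing = False
        for p in remaining:
            r = p(tokens, pos)
            if r is not None:
                (key, value), pos = r
                results[key] = value
                remaining.remove(p)   # each parser may succeed at most once
                progressing = True
                break
    return results, pos

toks = ['TO', 'location', 'GROUPBY', 'key', 'USING', 'object']
results, end = permute([clause('GROUPBY'), clause('TO'), clause('USING')], toks)
print(results)  # {'TO': 'location', 'GROUPBY': 'key', 'USING': 'object'}
```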

Converting Parse tree to Abstract Parse Tree

I have a parse tree that I wish to convert to an abstract parse tree.
I found examples online, but they are normally just simple addition.
I understand that I need to remove unnecessary information, but I don't know how to lay it out with regard to the repeat and until.
Is this the correct APT for the concrete parse tree?
There is no standard which defines a "correct" AST. ASTs are used instead of parse trees only to make life easier for the application which is going to process the tree.
In short, the decision is entirely yours, and you should make it on the basis of what you intend to do with the AST.
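One possible transformation, purely to illustrate that the choice is yours: drop punctuation leaves and collapse single-child chains. This is a Python sketch; the tuple encoding of the parse tree and all names are made up:

```python
PUNCT = {'(', ')', ';'}

def to_ast(tree):
    """tree: ('rule-name', child, ...) tuples; leaves are token strings."""
    if isinstance(tree, str):
        return None if tree in PUNCT else tree  # drop punctuation leaves
    kids = [a for a in (to_ast(c) for c in tree[1:]) if a is not None]
    if len(kids) == 1:                          # collapse single-child chains
        return kids[0]
    return (tree[0],) + tuple(kids)

# A deep concrete parse tree for "(3 + 4)" flattens considerably:
pt = ('expr', ('term', ('factor', '(',
        ('expr', ('term', ('factor', '3')), '+',
                 ('term', ('factor', '4'))), ')')))
print(to_ast(pt))  # ('expr', '3', '+', '4')
```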

Trying to understand lexers, parse trees and syntax trees

I'm reading the "Dragon Book" and I think I understand the main point of a lexer, parse tree and syntax tree and what errors they're typically supposed to catch (assuming we're using a context-free language), but I need someone to catch me if I'm wrong. My understanding is that a lexer simply tokenizes the input and catches errors that have to do with invalid constructs in code, such as a semi-colon being passed in a language that doesn't use semi-colons. The parse tree is used to verify that the syntax is followed and the code is in the correct order, and the syntax tree is used to actually evaluate the statements and expressions in the code and generate things like 3-address code or machine code. Are any of these correct?
Side-note: Are a concrete-syntax tree and a parse tree the same thing?
Side-Side note: When constructing the AST, does the entire program get built into one giant AST, or is a different AST constructed per statement/expression?
Strictly speaking, lexers are parsers too. The difference between lexers and parsers is in what they operate on. In the world of a lexer, everything is made of individual characters, which it then tokenizes by matching them to the regular grammar it understands. To a parser, the world is made up of tokens which it makes a syntax tree out of by matching them to the context-free grammar it understands. In this sense, they are both doing the same kind of thing but on different levels. In fact, you could build parsers on top of parsers, operating at higher and higher levels so one symbol in the highest level grammar could represent something incredibly complex at the lowest-level.
To your other questions:
Yes, a concrete syntax tree is a parse tree.
Generally, parsers make one tree for the entire file, since that represents a sentence in the CFG. This isn't necessarily always the case, though.
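The character-level half of this picture can be sketched with a regular-expression tokenizer (Python; the token names and patterns are arbitrary choices for illustration):

```python
import re

TOKEN_SPEC = [
    ('NUM',  r'\d+'),            # regular patterns, one per token kind
    ('ID',   r'[A-Za-z_]\w*'),
    ('OP',   r'[+\-*/=]'),
    ('SKIP', r'\s+'),            # whitespace: matched but discarded
]
MASTER = re.compile('|'.join(f'(?P<{n}>{p})' for n, p in TOKEN_SPEC))

def tokenize(text):
    # The lexer's whole world is characters; its output, a token stream,
    # is the world a parser would then operate on.
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != 'SKIP':
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize('x = 42 + y'))
# [('ID', 'x'), ('OP', '='), ('NUM', '42'), ('OP', '+'), ('ID', 'y')]
```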

How should I construct and walk the AST output of an ANTLR3 grammar?

Documentation and general advice is that the abstract syntax tree should omit tokens that have no meaning. ("Record the meaningful input tokens (and only the meaningful tokens)" - The Definitive ANTLR Reference) I.e., in a C++ AST, you would omit the braces at the start and end of the class, as they have no meaning and are simply a mechanism for delineating class start and end for parsing purposes. I understand that, for the sake of rapidly and efficiently walking the tree, culling out useless token nodes like that is useful, but in order to appropriately colorize code, I would need that information, even if it doesn't contribute to the meaning of the code. A) Is there any reason why I shouldn't have the AST serve multiple purposes and choose not to omit said tokens?
It seems to me that what ANTLRWorks interpreter outputs is what I'm looking for. In the ANTLRWorks interpreter, it outputs a tree diagram where, for each rule matched, a node is made, as well as a child node for each token and/or subrule. The parse tree, I suppose it's called.
If manually walking a tree, wouldn't it be more useful to have nodes marking a rule? By having a node marking a rule, with it's subrules and tokens as children, a manual walker doesn't need to look ahead several nodes to know the context of what node it's on. Tree grammars seem redundant to me. Given a tree of AST nodes, the tree grammar "parses" the nodes all over again in order to produce some other output. B) Given that the parser grammar was responsible for generating correctly formed ASTs and given the inclusion of rule AST nodes, shouldn't a manual walker avoid the redundant AST node pattern matching of a tree grammar?
I fear I'm wildly misunderstanding the tree grammar mechanism's purpose. A tree grammar more or less defines a set of methods that will run through a tree, look for a pattern of nodes that matches the tree grammar rule, and execute some action based on that. I can't depend on forming my AST output based on what's neat and tidy for the tree grammar (omitting meaningless tokens for pattern matching speed) yet use the AST for color-coding even the meaningless tokens. I'm writing an IDE as well; I also can't write every possible AST node pattern that a plugin author might want to match, nor do I want to require them using ANTLR to write a tree grammar. In the case of plugin authors walking the tree for their own criteria, rule nodes would be rather useful to avoid needing pattern matching.
Thoughts? I know this "question" might push the limits of being an SO question, but I'm not sure how else to formulate my inquiries or where else to inquire.
Sion Sheevok wrote:
A) Is there any reason why I shouldn't have the AST serve multiple purposes and choose not to omit said tokens?
No, you might as well keep them in there.
Sion Sheevok wrote:
It seems to me that what ANTLRWorks interpreter outputs is what I'm looking for. In the ANTLRWorks interpreter, it outputs a tree diagram where, for each rule matched, a node is made, as well as a child node for each token and/or subrule. The parse tree, I suppose it's called.
Correct.
Sion Sheevok wrote:
B) Given that the parser grammar was responsible for generating correctly formed ASTs and given the inclusion of rule AST nodes, shouldn't a manual walker avoid the redundant AST node pattern matching of a tree grammar?
Tree grammars are often used to mix custom code in to evaluate/interpret the input source. If you mix this code into the parser grammar and there's some backtracking going on in the parser, this custom code could be executed more often than it's supposed to be. Walking the tree using a tree grammar is (if done properly) only possible in one way, causing the custom code to be executed just once.
But if a separate tree walker/iterator is necessary, there are two camps out there that advocate using tree grammars and others opt for walking the tree manually with a custom iterator. Both camps raise valid points about their preferred way of walking the AST. So there's no clear-cut way to do it in one specific way.
Sion Sheevok wrote:
Thoughts?
Since you're not evaluating/interpreting, you might as well not use a tree grammar.
But to create a parse tree as ANTLRWorks does (which you have no access to, by the way), you will need to mix AST rewrite rules into your parser grammar. Here's a Q&A that explains how to do that: How to output the AST built using ANTLR?
Good luck!
