Antlr4 - Best way to switch between Listener and Visitor modes? - parsing

I'm currently in the process of planning out the structure of a language interpreter and I am realizing that I do not like the idea of exclusively using a Visitor or Listener tree traversal method.
Since both tree traversal methods have their merits, Ideally, I would like to use a mix of both:
Using a listener makes the most sense when traversing arbitrary language block definitions (function/class definitions, struct/enum-like definitions) especially when they can be nested.
Visitors seem to naturally lend themselves to situations such as expression evaluation, where the context is far more predictable, and result values can be returned back up the chain.
What is the most "correct" method to switch between the two traversal methods?
So far, my ideas are as follows:
Switch from Listener to Visitor for a portion of a parse tree
Say that, when the listener reaches the node "Foo", I want to handle its children more explicitly using a Visitor. One way I can think of doing this is:
Parse Tree walker calls enterFoo(ctx)
Create an instance of myFooVisitor
Explicitly visit children, store result, etc.
Set ctx.children = [] (or equivalent)
When enterFoo() returns, the parse tree walker sees that there are no more children for this node, and does not needlessly walk though all of Foo's children
Switch from Visitor to Listener for a portion of a parse tree
This is a little more obvious to me. Since tree traversal is explicitly controlled when using Visitors, switching over seems trivial.
visitFoo() gets called
Create an instance of a new parse tree walker and myFooListener
Start the walker using the listener as usual.

You seem to have gotten an idea where listeners and visitors are modes, kinda states you can switch. This is wrong.
Both listener and visitor are classes that allow you act on rule traversal. The listener does this by getting called from the parser during the parse process when a rule is "entered" or "left". There's no parse tree involved.
The visitor however uses a parse tree to walk over every node and call methods for them. You can override any of these methods to do any associated work. That doesn't necessary have to do with the evaluation result. You can use that independently.
ANTLR4 generates method bodies for each of your rule (both in listeners and visitors), which makes it easy for you to only implement those rules you are interested in.
Now that you know that listeners are used during parsing while visitors are used after parsing, it should be obvious that you cannot switch between them. And in fact switching wouldn't help much, since both classes do essentially the same (call methods for encountered rules).
I could probably give you more information if your question would actually contain what you want to achieve, not how.

Related

Antlr4 Is there a way to walk on a node with a listener

I am working with antlr4 to parse some inner language.
The way I do it is the standard way of creating a stream of tokens, then creating a parse tree with set tokens. And the last step is creating a tree.
Now that I have a tree I can use the walk command to walk over it.
The walk command and the parse tree being created are an extension of the BaseListener class.
For my project I use many lookaheads, I mostly use the visitors as a sort of look ahead.
A visitor function can be applied on a node from the parser tree. (Assuming you implement a visitor class)
I have tried creating a smaller listener class to act as my look ahead on a specific node. However, I'm not managing to figure out if this is even possible.
Or is there another / better / smarter way to create a look ahead?
To make the question seem shorter:
Can a listener class be used on a node in the antlr4 parser tree-like visitor can?
There’s no real reason to do anything fancy with a listener here. To handle your example of resolving the type of an expression, this can be done in a couple of steps. You’ll need a listener that builds a symbol table (presumably a scoped symbol table so that each scope in your parse tree has a symbol table and references to “parent” scopes it might search to resolve the data types of any variables.). Depending upon your needs, you can build these scoped symbol tables in your listener, pushing and popping them on a stack as you enter/exit scopes, or you may do this is a separate listener attaching a reference to scope nodes to your parse tree. This symbol table will be required for type resolution, so if you can reference variables in your source prior to encountering the declaration, you’d need too pass your tree with a listener building and retaining the symbol table structure.
With the symbol table structure available to your listener for resolving expression types. If you always use the exit* method overrides then all of the data types for any sub-expressions will have already been resolved. With all of the sub-expression types, you can infer the type of the expression you’re about to “exit”. Evaluating the type of child expressions should cover all of the “lookahead” you need to resolve an expression’s type.
There’s really nothing about this approach that rules out a visitor either, it would just be up to you to visit each child, determining it’s type, and then, at the end of the listener, resolve the expression type from the types of it’s children.

How to walk whole Parse Tree and print it's content with slight changes in ANTLR4?

So as stated in the title, my task is to traverse the Parse Tree generated for code written in Java (grammar is a standard Java grammar), print most of it unchanged and modify only some words, for example type declarations.
My current approach was to create ParseTreeListener and implement the logic in the enterEveryRule method, but unfortunately it doesn't appear to work even for basic printing. The output is very messy and there are a lot of repetitions, as if every node was visited multiple times.
My another try was to implement appropriate methods in BaseListener that would do the changes to the type declarations I need, but from there I see no possibility to print the rest of the code unchanged.
Looking forward to your help!
You could use ANTLR's string templates to produce code from the ASTs.
In general, you start with set of "standard" string templates that can regenerate source code corresponding to the underlying tree.
To get the effect you want, you judiciously choose the standard string templates on AST nodes where you don't want changes, and variant templates where you do want changes.
IMHO, it is better to modify the AST, and then simply apply the standard templates.

unifying model for 'static' / 'historical' /streaming data in F#

An API I use exposes data with different characteristics :
'Static' reference data, that is you ask for them, get one value which supposedly does not change
'historical' values, where you can query for a date range
'subscription' values, where you register yourself for receiving updates
From a semantic point of view, though, those fields are one and the same, and only the consumption mode differs. Reference data can be viewed as a constant function yielding the same result through time. Historical data is just streaming data that happened in the past.
I am trying to find a unifying model against which to program all the semantic of my queries, and distinguish it from its consumption mode.
That mean, the same quotation could evaluated in a "real-time" way which would turn fields into their appropriate IObservable form (when available), or in 'historical' way, which takes a 'clock' as an argument and yield values when ticked, or a 'reference' way, which just yield 1 value (still decorated by the historical date at which the query is ran..)
I am not sure which programming tools in F# would be the most natural fit for that purpose, but I am thinking of quotation, which I never really used.
Would it be well suited for such a task ?
you said it yourself: just go with IObservable
your static case is just OnNext with 1 value
in your historical case you OnNext for each value in your query (at once when a observer is registered)
and the subscription case is just the ordinary IObservable pattern - you don't need something like quotations for this.
I have done something very similar (not the static case, but the streaming and historical case), and IObservable is definitely the right tool for the job. In reality, IEnumerable and IObservable are dual and most things you can write for one you can also write for the other. However, a push based model (IObservable) is more flexible, and the operators you get as part of Rx are more complete and appropriate than those as part of regular IEnumerable LINQ.
Using quotations just means you need to build the above from scratch.
You'll find the following useful:
Observable.Return(value) for a single static value
list.ToObservable() for turning a historical enumerable to an observable
Direct IObservable for streaming values into IObservable
Also note that you can use virtual schedulers for ticking the observable if this helps (most of the above accept a scheduler). I imagine this is what you want for the historical case.
http://channel9.msdn.com/Series/Rx-Workshop/Rx-Workshop-Schedulers

Is it acceptable to store the previous state as a global variable?

One of the biggest problems with designing a lexical analyzer/parser combination is overzealousness in designing the analyzer. (f)lex isn't designed to have parser logic, which can sometimes interfere with the design of mini-parsers (by means of yy_push_state(), yy_pop_state(), and yy_top_state().
My goal is to parse a document of the form:
CODE1 this is the text that might appear for a 'CODE' entry
SUBCODE1 the CODE group will have several subcodes, which
may extend onto subsequent lines.
SUBCODE2 however, not all SUBCODEs span multiple lines
SUBCODE3 still, however, there are some SUBCODES that span
not only one or two lines, but any number of lines.
this makes it a challenge to use something like \r\n
as a record delimiter.
CODE2 Moreover, it's not guaranteed that a SUBCODE is the
only way to exit another SUBCODE's scope. There may
be CODE blocks that accomplish this.
In the end, I've decided that this section of the project is better left to the lexical analyzer, since I don't want to create a pattern that matches each line (and identifies continuation records). Part of the reason is that I want the lexical parser to have knowledge of the contents of each line, without incorporating its own tokenizing logic. That is to say, if I match ^SUBCODE[ ][ ].{71}\r\n (all records are blocked in 80-character records) I would not be able to harness the power of flex to tokenize the structured data residing in .{71}.
Given these constraints, I'm thinking about doing the following:
Entering a CODE1 state from the <INITIAL> start condition results
in calls to:
yy_push_state(CODE_STATE)
yy_push_state(CODE_CODE1_STATE)
(do something with the contents of the CODE1 state identifier, if such contents exist)
yy_push_state(SUBCODE_STATE) (to tell the analyzer to expect SUBCODE states belonging to the CODE_CODE1_STATE. This is where the analyzer begins to masquerade as a parser.
The <SUBCODE1_STATE> start condition is nested as follows: <CODE_STATE>{ <CODE_CODE1_STATE> { <SUBCODE_STATE>{ <SUBCODE1_STATE> { (perform actions based on the matching patterns) } } }. It also sets the global previous_state variable to yy_top_state(), to wit SUBCODE1_STATE.
Within <SUBCODE1_STATE>'s scope, \r\n will call yy_pop_state(). If a continuation record is present (which is a pattern at the highest scope against which all text is matched), yy_push_state(continuation_record_states[previous_state]) is called, bringing us back to the scope in 2. continuation_record_states[] maps each state with its continuation record state, which is used by the parser.
As you can see, this is quite complicated, which leads me to conclude that I'm massively over-complicating the task.
Questions
For states lacking an extremely clear token signifying the end of its scope, is my proposed solution acceptable?
Given that I want to tokenize the input using flex, is there any way to do so without start conditions?
The biggest problem I'm having is that each record (beginning with the (SUB)CODE prefix) is unique, but the information appearing after the (SUB)CODE prefix is not. Therefore, it almost appears mandatory to have multiple states like this, and the abstract CODE_STATE and SUBCODE_STATE states would act as groupings for each of the concrete SUBCODE[0-9]+_STATE and CODE[0-9]+_STATE states.
I would look at how the OMeta parser handles these things.

Why build an AST walker instead of having the nodes responsible for their own output?

Given an AST, what would be the reason behind making a Walker class that walks over the tree and does the output, as opposed to giving each Node class a compile() method and having it responsible for its own output?
Here are some examples:
Doctrine 2 (an ORM) uses a SQLWalker to walk over an AST and generate SQL from nodes.
Twig (a templating language) has the nodes output their own code (this is an if statement node).
Using a separate Walker for code generation avoids combinatorial explosion in the number of AST node classes as the number of target representations increases. When a Walker is responsible for code generation, you can retarget it to a different representation just by altering the Walker class. But when the AST nodes themselves are responsible for compilation, you need a different version of each node for each separate target representation.
Mostly because of old literature and available tools. Experimenting with both methods you can easily find that AST traversal produces very slow and convoluted code. Moreover, code separated from immediate syntax doesn't resemble it anymore. It's very much like supporting two synchronized code bases, which is always a bad idea. Debugging, maintenance become difficult.
Of course, it can be also difficult to process semantics on the nodes unless you have a well designed state machine. In fact you are never worse than having to traverse AST after the fact, because it's just one particular case of processing semantics on nodes.
You can often hear that AST traversal allows for implementation of multiple semantics for the same syntax. In reality you would never want that, not only because it's rarely needed, but also for performance reasons. And frankly, there is no difficulty in writing separate syntax for a different semantics. The results were always better when both designed together.
And finally, in every non-trivial task, get syntax parsed is the easiest part, getting semantics correct and process actions fast is a challenge. Focusing on AST is approaching the task backwards.
To have support for a feature that the "internal AST walker" doesn't have.
For example, there are several ways to trasnverse a "hierarchical" or "tre" structure,
like "walk thru the leafs first", or "walk thru the branches first".
Or if the nodes siblings have a sort index, and you want to "walk" / "visit" them decremantally by their index, instead of incrementally.
If the AST class or structure you have only works with one method, you may want to use another method using your custom "walker" / "visitor".

Resources