ANTLR Parse tree modification - parsing

I'm using ANTLR4 to create a parse tree for my grammar, and I want to modify certain nodes in the tree. This includes removing some nodes and inserting new ones. The purpose is optimization for the language I am writing. I have yet to find a solution to this problem. What would be the best way to go about this?

While there is currently no real support or tools for tree rewriting, it is very possible to do. It's not even that painful.
The ParseTreeListener or your MyBaseListener can be used with a ParseTreeWalker to walk your parse tree.
From here, you can remove nodes with ParserRuleContext.removeLastChild(). When doing this, however, you have to watch out for how ParseTreeWalker.walk is implemented:
public void walk(ParseTreeListener listener, ParseTree t) {
    if ( t instanceof ErrorNode) {
        listener.visitErrorNode((ErrorNode)t);
        return;
    }
    else if ( t instanceof TerminalNode) {
        listener.visitTerminal((TerminalNode)t);
        return;
    }
    RuleNode r = (RuleNode)t;
    enterRule(listener, r);
    int n = r.getChildCount();
    for (int i = 0; i<n; i++) {
        walk(listener, r.getChild(i));
    }
    exitRule(listener, r);
}
You must replace removed nodes with something if the walker has already visited the parents of those nodes, because the value of n is cached in the method above. I usually pick empty ParserRuleContext objects. This prevents the ParseTreeWalker from throwing a NullPointerException.
When adding nodes, make sure to set the mutable parent field on the ParserRuleContext to the new parent. Also, because of the cached n, a good strategy is to detect where the changes need to be made before the walk reaches that point, so the ParseTreeWalker walks over the new nodes in the same pass (otherwise you might need multiple passes).
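The effect of the cached n can be reproduced without ANTLR. The following is a toy model only: Node, CachedCountWalker and run are hypothetical stand-ins, not ANTLR classes, and the explicit exception stands in for the NullPointerException the real walker would hit. It removes a sibling while the walker is already inside the parent's loop, once with and once without an empty placeholder.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a parse tree node; NOT an ANTLR class.
class Node {
    final String name;
    Node parent;
    final List<Node> children = new ArrayList<>();

    Node(String name) { this.name = name; }

    Node add(Node child) {
        child.parent = this;   // keep the parent link in sync when adding
        children.add(child);
        return this;
    }

    // Like ParserRuleContext.getChild: null when the index is out of range.
    Node getChild(int i) {
        return i >= 0 && i < children.size() ? children.get(i) : null;
    }
}

class CachedCountWalker {
    interface Listener { void enter(Node n); }

    // Same shape as the walk method above: the child count is read once.
    static void walk(Listener listener, Node t) {
        listener.enter(t);
        int n = t.children.size();
        for (int i = 0; i < n; i++) {
            Node child = t.getChild(i);
            if (child == null) {
                throw new IllegalStateException("child " + i + " of " + t.name + " vanished");
            }
            walk(listener, child);
        }
    }

    // While visiting "a", removes the last child of the root ("b"),
    // optionally putting an empty placeholder node in its place.
    static List<String> run(boolean replaceWithPlaceholder) {
        Node root = new Node("root").add(new Node("a")).add(new Node("b"));
        List<String> visited = new ArrayList<>();
        walk(n -> {
            visited.add(n.name);
            if (n.name.equals("a")) {
                Node p = n.parent;
                p.children.remove(p.children.size() - 1);
                if (replaceWithPlaceholder) p.add(new Node("empty"));
            }
        }, root);
        return visited;
    }
}
```

In this model, run(true) finishes the walk and visits the placeholder in the same pass, while run(false) fails as soon as the loop reaches the index that no longer exists.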
Your pseudo code should look like this:
public void enterRewriteTarget(@NotNull MyParser.RewriteTargetContext ctx) {
    if (shouldRewrite(ctx)) {
        ArrayList<ParseTree> nodesReplaced = replaceNodes(ctx);
        addChildTo(ctx, createNewParentFor(nodesReplaced));
    }
}
I've used this method to write a transpiler that compiled a synchronous internal language into asynchronous javascript. It was pretty painful.

Another approach would be to write a ParseTreeVisitor that converts the tree back to a string. (This can be trivial in some cases, because you are only calling TerminalNode.getText() and concatenating the results in aggregateResult(..).)
You then add the modifications to this visitor so that the resulting string representation contains the modifications you try to achieve.
Then parse the string and you get a parse tree with the desired modifications.
This is certainly hackish in some ways, since you end up parsing twice. On the other hand, the solution does not rely on ANTLR implementation details.

I needed something similar for simple transformations. I ended up using a ParseTreeWalker and a custom ...BaseListener in which I overrode the enter... methods. Inside these methods, ParserRuleContext.children is available and can be manipulated.
class MyListener extends ...BaseListener {
    @Override
    public void enter...(...Context ctx) {
        super.enter...(ctx);
        ctx.children.add(...);
    }
}
new ParseTreeWalker().walk(new MyListener(), parseTree);

Related

How to set timeout for parser when it takes too long?

I'm using ANTLR4 in C# with the following code sample:
AntlrInputStream antlrStream = new AntlrInputStream(text);
MyLexer myLexer = new(new AntlrInputStream());
myLexer.SetInputStream(antlrStream);
CommonTokenStream myTokens = new CommonTokenStream(myLexer);
parser = new MyParser(myTokens)
{
    BuildParseTree = true,
};
IParseTree tree = parser.startRule();
The classes MyLexer and MyParser derive from the Lexer and Parser classes of Antlr4.Runtime and were auto-generated by ANTLR4.
In some rare cases, with specific text, startRule() takes forever and never finishes. I want to be able to set some kind of a "Timeout" for the parsing and throw an Exception.
What is the recommended way to do this?
I looked at this briefly a while back. You can essentially create a wrapper over the generated parser and override one of its methods. I use ANTLR with Kotlin, so excuse the Kotlin in the example below.
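import org.antlr.v4.runtime.ParserRuleContext
import org.antlr.v4.runtime.TokenStream

// YourParser is the parser class generated by ANTLR for your grammar.
class InterruptibleParser(input: TokenStream) : YourParser(input) {
    // Called on entry to every rule, so the interrupt flag is polled often.
    override fun enterRule(localctx: ParserRuleContext, state: Int, ruleIndex: Int) {
        if (Thread.interrupted()) {
            throw InterruptedException()
        }
        super.enterRule(localctx, state, ruleIndex)
    }
}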
I tried it with enterRule, consume, and getContext; I don't remember which of them gets called frequently enough.
With the above working, you can instantiate the parser and interrupt its thread after a certain amount of time. This will likely make parsing noticeably slower (maybe around 25% slower, if I'm remembering correctly). Anyway, hope this helps.
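The "interrupt its thread after a certain amount of time" step can be sketched on the JVM with an executor and a Future. This is a minimal sketch under assumptions: ParseWithTimeout, runWithTimeout and slowParse are hypothetical names, and slowParse is only a stand-in for a parse call that polls the interrupt flag the way the overridden enterRule does.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ParseWithTimeout {
    // Hypothetical stand-in for parser.startRule(): never finishes on its
    // own, but checks the interrupt flag on every iteration.
    static String slowParse() throws InterruptedException {
        while (true) {
            if (Thread.interrupted()) {
                throw new InterruptedException("parse cancelled");
            }
        }
    }

    // Runs the parse on a worker thread and gives up after timeoutMillis.
    static <T> T runWithTimeout(Callable<T> parse, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(parse);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // interrupts the worker thread
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The caller sees a TimeoutException when the limit is exceeded. The worker only stops because the parse itself checks for interruption; cancelling a task that never polls the flag would leave it running.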

vector<reference_wrapper> .. things going out of scope? how does it work?

Use case: I am converting data from a very old program of mine to a database-friendly format. There are parts where I have to do multiple passes over the old data, because in particular the keys have to exist before I can reference them in relationships. So I thought, why not put the incomplete parts in a vector of references during the first pass and return it from the working function, so I can easily use that vector to make the second pass over whatever is still incomplete. I like to avoid pointers when possible, so I looked into std::reference_wrapper<T>, which seems like exactly what I need .. except I don't understand its behavior at all.
I have both vector<OldData> old_data and vector<NewData> new_data as members of my conversion class. The converting member function essentially does:
//...
vector<reference_wrapper<NewData>> incomplete;
for(const auto& old_elem : old_data) {
    auto& new_ref = *new_data.insert(new_data.end(), convert(old_elem));
    if(is_incomplete(new_ref)) incomplete.push_back(ref(new_ref));
}
return incomplete;
However, incomplete is already broken immediately after the for loop. The program compiles, but crashes and produces gibberish. Now I don't know if I placed ref correctly, but this is only one of many tries where I tried to put it somewhere else, use push_back or emplace_back instead, etc. ..
Something seems to be going out of scope, but what? Both new_data and old_data are class members, incomplete also lives outside the loop, and according to the documentation, reference_wrapper is copyable.
Here's a simplified MWE that compiles, crashes, and produces gibberish:
// includes ..
using namespace std;
int main() {
    int N = 2; // works correctly for N = 1 without any other changes ... ???
    vector<string> strs;
    vector<reference_wrapper<string>> refs;
    for(int i = 0; i < N; ++i) {
        string& sref = ref(strs.emplace_back("a"));
        refs.push_back(sref);
    }
    for (const auto& r : refs) cout << r.get(); // crash & gibberish
}
This is g++ 10.2.0 with -std=c++17 if it means anything. Now I will probably just use pointers and be done, but I would like to understand what is going on here, documentation / search does not seem to help..
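The problem here is that std::vector may reallocate the memory for the entire vector whenever you add an element beyond its current capacity. A reallocation moves every element, so all existing references (and reference_wrappers) into that vector are invalidated. That is also why N = 1 appears to work: nothing is added after the only reference has been taken. You can resolve the problem by using std::list instead of std::vector, because inserting into a list does not invalidate references to its existing elements.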

Storing line number in ANTLR Parse Tree

Is there any way of storing line numbers in the created parse tree, using ANTLR 4? I came across this article, which does it, but I think it's for an older ANTLR version, because
parser.setASTFactory(factory);
does not seem to be applicable to ANTLR 4.
I am thinking of having something like
treenode.getLine()
just as we can have
treenode.getChild()
With ANTLR4, you normally implement either a listener or a visitor.
Both give you a context in which you can find the location of the tokens.
For example (with a visitor), I want to keep the location of an assignment defined by an uppercase identifier (UCASE_ID in my token definition).
The bit you're interested in is ...
ctx.UCASE_ID().getSymbol().getLine()
The visitor looks like ...
static class TypeAssignmentVisitor extends ASNBaseVisitor<TypeAssignment> {
    @Override
    public TypeAssignment visitTypeAssignment(TypeAssignmentContext ctx) {
        String reference = ctx.UCASE_ID().getText();
        // ANTLR lines are 1-based, columns are 0-based (hence the +1).
        int line = ctx.UCASE_ID().getSymbol().getLine();
        int column = ctx.UCASE_ID().getSymbol().getCharPositionInLine() + 1;
        Type type = ctx.type().accept(new TypeVisitor());
        TypeAssignment typeAssignment = new TypeAssignment();
        typeAssignment.setReference(reference);
        typeAssignment.setReferenceToken(new Token(line, column));
        typeAssignment.setType(type);
        return typeAssignment;
    }
}
I was new to ANTLR4 and found this useful for getting started with listeners and visitors:
https://github.com/JakubDziworski/AntlrListenerVisitorComparison/

Caching streams in Functional Reactive Programming

I have an application which is written entirely using the FRP paradigm and I think I am having performance issues due to the way that I am creating the streams. It is written in Haxe but the problem is not language specific.
For example, I have this function which returns a stream that resolves every time a config file is updated for that specific section like the following:
function getConfigSection(section:String) : Stream<Map<String, String>> {
    return configFileUpdated()
        .then(filterForSectionChanged(section))
        .then(readFile)
        .then(parseYaml);
}
In promhx, the reactive programming library I am using, each step of the chain should remember its last resolved value. However, I think that every time I call this function I am recreating the stream and reprocessing each step. This is a problem with the way I am using it rather than with the library.
Since this function is called everywhere, parsing the YAML every time it is needed is killing performance: it takes up over 50% of the CPU time according to profiling.
As a fix I have done something like the following using a Map stored as an instance variable that caches the streams:
function getConfigSection(section:String) : Stream<Map<String, String>> {
    var cachedStream = this._streamCache.get(section);
    if (cachedStream != null) {
        return cachedStream;
    }
    var stream = configFileUpdated()
        .filter(sectionFilter(section))
        .then(readFile)
        .then(parseYaml);
    this._streamCache.set(section, stream);
    return stream;
}
This might be a good solution to the problem but it doesn't feel right to me. I am wondering if anyone can think of a cleaner solution that maybe uses a more functional approach (closures etc.) or even an extension I can add to the stream like a cache function.
Another way I could do it is to create the streams before hand and store them in fields that can be accessed by consumers. I don't like this approach because I don't want to make a field for every config section, I like being able to call a function with a specific section and get a stream back.
I'd love any ideas that could give me a fresh perspective!
Well, I think one answer is to just abstract away the caching like so:
class Test {
    static function main() {
        var sideeffects = 0;
        var cached = memoize(function (x) return x + sideeffects++);
        cached(1);
        trace(sideeffects);//1
        cached(1);
        trace(sideeffects);//1
        cached(3);
        trace(sideeffects);//2
        cached(3);
        trace(sideeffects);//2
    }
    @:generic static function memoize<In, Out>(f:In->Out):In->Out {
        var m = new Map<In, Out>();
        return
            function (input:In)
                return switch m[input] {
                    case null: m[input] = f(input);
                    case output: output;
                }
    }
}
You may be able to find a more "functional" implementation for memoize down the road. But the important thing is that it is a separate thing now and you can use it at will.
You may choose to memoize(parseYaml) so that toggling two states in the file actually becomes very cheap after both have been parsed once. You can also tweak memoize to manage the cache size according to whatever strategy proves the most valuable.
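For readers outside Haxe, the same memoize idea can be written in Java. This is only a sketch: the class name Memo is made up for illustration, and the cache is a plain HashMap, so it is unbounded and not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class Memo {
    // Wraps f so that each distinct input is computed at most once;
    // later calls with the same input return the stored result.
    static <I, O> Function<I, O> memoize(Function<I, O> f) {
        Map<I, O> cache = new HashMap<>();
        return input -> {
            if (!cache.containsKey(input)) {
                cache.put(input, f.apply(input));
            }
            return cache.get(input);
        };
    }
}
```

Wrapping the stream-building function this way keeps the caching out of getConfigSection itself: a call such as Memo.memoize(section -> buildStream(section)), where buildStream is a hypothetical name for the uncached version, returns the same stream object for repeated requests of one section.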

How to map between push parser and pull parser

I have implemented a push parser that reads a data stream and emits tokens for selected content via a callback handler. This technique is also known as the observer pattern (with the callback handler being the observer) and is used, for instance, in SAX for parsing XML.
The contrary design pattern (is there a name for it?) is to pull the next data token as used for instance in XML parsing with StAX.
One can easily map to a push parser by looping a pull parser:
// push
parser.parse( callback: handler );
// pull
while( token = parser.next ) {
    handler(token)
}
But how do I map a push parser to a pull parser?
To adapt a push parser into a pull parser, you have to collect several (or all, depending on what is being parsed and the order in which elements are pushed) of the pushed items into Event objects, and then allow those Events to be pulled.
We can use XML as an example and adapt a SAXHandler into a StAX parser. We also have to implement the methods for XMLStreamReader for iterating over the StAX XMLEvents.
I've never used StAX but it looks like it stores the current state in the XMLStreamReader object. Each call to reader.next() updates the state and the returned values from reader.getName() and reader.getText() etc are updated accordingly.
We can do this in several ways, from parsing the entire thing into memory first and then iterating through what we've stored, to more complicated techniques such as using multiple threads to parse the XML and blocking the parser before the next tag until the user calls next().
For simplicity, I'll just show storing everything in memory:
class SAXHandler extends DefaultHandler implements XMLStreamReader {
    //Stax Event objects
    List<XMLEvent> events = new ArrayList<>();
    int counter=0;
    //Stax current tag name and text data updated with calls to next()
    private String name, text;
    //Triggered when the start of tag is found.
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes)
            throws SAXException {
        //create a new XMLEvent for the start of the new tag
        XMLEvent newEvent = ....
        events.add(newEvent);
    }
    //other SAX methods implemented similarly
    ...
Now for the StAX methods:
    @Override
    public XMLEvent next(){
        if( !hasNext() ){
            throw new NoSuchElementException();
        }
        //fetch the current event, then advance (so index 0 is not skipped)
        XMLEvent next = events.get(counter++);
        //update our content
        this.name = next.name;
        this.text = next.text;
        ...
        return next;
    }
    @Override
    public boolean hasNext(){
        return counter < events.size();
    }
    ...
    @Override
    public String getName(){
        return name;
    }
    @Override
    public String getText(){
        return text;
    }
}
Hope this helps
What I think you are looking for is Control Inversion, which is not easy in languages which are tied to a stacklike execution model.
C is not quite welded to execution stacks, so you could do this with the (deprecated) Posix getcontext/setcontext/makecontext, or slightly more portably with threads.
In other languages, it is easier, if no less mind-bending. See Scheme's call/cc primitive, this piece of Lua ancient history, or take a look at Python generators (although the latter is not able to invert control without help from the function whose control is to be inverted.)
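Both answers mention threads as a way to invert control. A minimal Java sketch of that idea follows, under stated assumptions: PullAdapter is a hypothetical class, the push parser is modelled as a function that receives the handler it should call for each token, and tokens are assumed to be non-null. The push parser runs on its own thread and blocks on a one-slot queue until the consumer asks for more.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Adapts a push-style parser into a pull-style Iterator.
class PullAdapter<T> implements Iterator<T> {
    private static final Object END = new Object();  // end-of-input marker
    private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1);
    private Object next;  // look-ahead slot; null means "not fetched yet"

    // pushParser is given the handler to call for every token it emits.
    PullAdapter(Consumer<Consumer<T>> pushParser) {
        Thread producer = new Thread(() -> {
            try {
                pushParser.accept(token -> put(token));
            } finally {
                put(END);
            }
        });
        producer.setDaemon(true);  // an abandoned iteration must not keep the JVM alive
        producer.start();
    }

    private void put(Object item) {
        try {
            queue.put(item);  // blocks while the single slot is still occupied
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take();  // blocks until the parser pushes something
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return next != END;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        T token = (T) next;
        next = null;
        return token;
    }
}
```

Usage would look like `new PullAdapter<String>(handler -> parser.parse(handler))`, with parser.parse standing for the push API from the question. Compared with storing everything in memory, the parser stays only a token or two ahead of the consumer, at the cost of a second thread.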
