Storing line numbers in an ANTLR parse tree

Is there any way of storing line numbers in the parse tree created by ANTLR 4? I came across this article, which does it, but I think it is for an older ANTLR version, because
parser.setASTFactory(factory);
does not seem to be applicable to ANTLR 4. I am thinking of having something like
treenode.getLine()
in the same way that we have
treenode.getChild()

With ANTLR 4, you normally implement either a listener or a visitor.
Both give you a context object in which you can find the locations of the tokens.
For example (with a visitor), I want to keep the location of an assignment defined by an uppercase identifier (UCASE_ID in my token definition).
The bit you're interested in is ...
ctx.UCASE_ID().getSymbol().getLine()
The visitor looks like ...
static class TypeAssignmentVisitor extends ASNBaseVisitor<TypeAssignment> {
    @Override
    public TypeAssignment visitTypeAssignment(TypeAssignmentContext ctx) {
        String reference = ctx.UCASE_ID().getText();
        int line = ctx.UCASE_ID().getSymbol().getLine();
        // getCharPositionInLine() is 0-based, hence the +1
        int column = ctx.UCASE_ID().getSymbol().getCharPositionInLine() + 1;
        Type type = ctx.type().accept(new TypeVisitor());
        TypeAssignment typeAssignment = new TypeAssignment();
        typeAssignment.setReference(reference);
        // Token here is a domain class, not org.antlr.v4.runtime.Token
        typeAssignment.setReferenceToken(new Token(line, column));
        typeAssignment.setType(type);
        return typeAssignment;
    }
}
I was new to ANTLR 4 and found this useful for getting started with listeners and visitors ...
https://github.com/JakubDziworski/AntlrListenerVisitorComparison/
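For comparison, a listener can get the same information from the context's start token. Here is a minimal sketch (assuming the same generated ASNBaseListener and TypeAssignmentContext names as in the visitor above):
static class TypeAssignmentListener extends ASNBaseListener {
    @Override
    public void enterTypeAssignment(TypeAssignmentContext ctx) {
        // Every ParserRuleContext exposes its first token via getStart()
        org.antlr.v4.runtime.Token start = ctx.getStart();
        int line = start.getLine();                     // lines are 1-based
        int column = start.getCharPositionInLine() + 1; // char positions are 0-based
        System.out.println("TypeAssignment at " + line + ":" + column);
    }
}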


CoreNLP: provide POS tags

I have text that is already tokenized, sentence-split, and POS-tagged.
I would like to use CoreNLP to additionally annotate lemmas (lemma), named entities (ner), constituency and dependency parses (parse), and coreference (dcoref).
Is there a combination of commandline options and option file specifications that makes this possible from the command line?
According to this question, I can ask the parser to view whitespace as delimiting tokens, and newlines as delimiting sentences by adding this to my properties file:
tokenize.whitespace = true
ssplit.eolonly = true
This works well, so all that remains is to specify to CoreNLP that I would like to provide POS tags too.
When using the Stanford Parser on its own, it seems to be possible to have it use existing POS tags, but copying that syntax into the invocation of CoreNLP doesn't seem to work. For example, this does not work:
java -cp *:./* -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props my-properties-file -outputFormat xml -outputDirectory my-output-dir -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory -file my-annotated-text.txt
While this question covers programmatic invocation, I'm invoking CoreNLP from the command line as part of a larger system, so I'm really asking whether it is possible to achieve this with command-line options.
I don't think this is possible with command-line options alone.
If you want, you can make a custom annotator and include it in your pipeline.
Here is some sample code:
package edu.stanford.nlp.pipeline;

import edu.stanford.nlp.ling.*;
import java.util.*;

public class ProvidedPOSTaggerAnnotator {

    public String tagSeparator;

    public ProvidedPOSTaggerAnnotator(String annotatorName, Properties props) {
        tagSeparator = props.getProperty(annotatorName + ".tagSeparator", "_");
    }

    public void annotate(Annotation annotation) {
        for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
            // the POS tag is the last separator-delimited part of the token
            String[] parts = token.word().split(tagSeparator);
            String posTag = parts[parts.length - 1];
            String[] wordParts = Arrays.copyOfRange(parts, 0, parts.length - 1);
            String tokenString = String.join(tagSeparator, wordParts);
            // set the word with the POS tag removed
            token.set(CoreAnnotations.TextAnnotation.class, tokenString);
            // set the POS
            token.set(CoreAnnotations.PartOfSpeechAnnotation.class, posTag);
        }
    }
}
This should work if you provide your tokens with POS tags attached by "_" (for example, The_DT). You can change the separator with the forcedpos.tagSeparator property.
If you add
customAnnotatorClass.forcedpos = edu.stanford.nlp.pipeline.ProvidedPOSTaggerAnnotator
to the properties file, include the above class in your CLASSPATH, and then include "forcedpos" in your list of annotators after "tokenize", you should be able to pass in your own POS tags.
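Putting the pieces together, the properties file might then look something like this (a sketch; the annotator list mirrors what the question asks for, with forcedpos standing in for the usual pos step, and the last line only needed if you deviate from the default "_"):
annotators = tokenize, ssplit, forcedpos, lemma, ner, parse, dcoref
customAnnotatorClass.forcedpos = edu.stanford.nlp.pipeline.ProvidedPOSTaggerAnnotator
tokenize.whitespace = true
ssplit.eolonly = true
forcedpos.tagSeparator = _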
I may clean this up some more and actually include it in future releases for people!
I have not had time to actually test this code out; if you try it out and find errors, please let me know and I'll fix it!

Dataflow: output a parameterized type to an Avro file

I have a pipeline that successfully outputs an Avro file as follows:
@DefaultCoder(AvroCoder.class)
class MyOutput_T_S {
    T foo;
    S bar;
    Boolean baz;
    public MyOutput_T_S() {}
}

@DefaultCoder(AvroCoder.class)
class T {
    String id;
    public T() {}
}

@DefaultCoder(AvroCoder.class)
class S {
    String id;
    public S() {}
}
...
PCollection<MyOutput_T_S> output = input.apply(myTransform);
output.apply(AvroIO.Write.to("/out").withSchema(MyOutput_T_S.class));
How can I reproduce this exact behavior, except with a parameterized output MyOutput<T, S> (where T and S are both Avro-codable using reflection)?
The main issue is that Avro reflection doesn't work for parameterized types. So based on these responses:
Setting Custom Coders & Handling Parameterized types
Using Avrocoder for Custom Types with Generics
1) I think I need to write a custom CoderFactory, but I am having difficulty figuring out exactly how this works (I'm having trouble finding examples). Oddly enough, a completely naive coder factory appears to let me run the pipeline and inspect proper output using DataflowAssert:
cr.registerCoder(MyOutput.class, new CoderFactory() {
    @Override
    public Coder<?> create(List<? extends Coder<?>> componentCoders) {
        Schema schema = new Schema.Parser().parse("{\"type\":\"record\","
            + "\"name\":\"MyOutput\","
            + "\"namespace\":\"mypackage\","
            + "\"fields\":[]}");
        return AvroCoder.of(MyOutput.class, schema);
    }

    @Override
    public List<Object> getInstanceComponents(Object value) {
        MyOutput<Object, Object> myOutput = (MyOutput<Object, Object>) value;
        List<Object> components = new ArrayList<>();
        return components;
    }
});
While I can successfully assert against the output now, I expect this will not cut it for writing to a file. I haven't figured out how I'm supposed to use the provided componentCoders to generate the correct schema, and if I try to just shove the schema of T or S into fields, I get:
java.lang.IllegalArgumentException: Unable to get field id from class null
2) Assuming I figure out how to encode MyOutput. What do I pass to AvroIO.Write.withSchema? If I pass either MyOutput.class or the schema I get type mismatch errors.
I think there are two questions (correct me if I am wrong):
How do I enable the coder registry to provide coders for various parameterizations of MyOutput<T, S>?
How do I write values of MyOutput<T, S> to a file using AvroIO.Write?
The first question is to be solved by registering a CoderFactory as in the linked question you found.
Your naive coder is probably allowing you to run the pipeline without issues because serialization is being optimized away. Certainly an Avro schema with no fields will result in those fields being dropped in a serialization+deserialization round trip.
But assuming you fill in the schema with the fields, your approach to CoderFactory#create looks right. I don't know the exact cause of the message java.lang.IllegalArgumentException: Unable to get field id from class null, but the call to AvroCoder.of(MyOutput.class, schema) should work for an appropriately assembled schema. If there is an issue with this, more details (such as the rest of the stack trace) would be helpful.
However, your override of CoderFactory#getInstanceComponents should return a list of values, one per type parameter of MyOutput. Like so:
@Override
public List<Object> getInstanceComponents(Object value) {
    MyOutput<Object, Object> myOutput = (MyOutput<Object, Object>) value;
    return ImmutableList.of(myOutput.foo, myOutput.bar);
}
The second question can be answered using some of the same support code as the first, but otherwise is independent. AvroIO.Write.withSchema always explicitly uses the provided schema. It does use AvroCoder under the hood, but this is actually an implementation detail. Providing a compatible schema is all that is necessary - such a schema will have to be composed for each value of T and S for which you want to output MyOutput<T, S>.
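For a concrete parameterization, one way to assemble such a compatible schema is with Avro's SchemaBuilder and ReflectData (a sketch; the helper name is mine, and the field names foo/bar/baz come from the question):
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.reflect.ReflectData;

// Build a record schema for MyOutput<T, S> from the reflected component schemas.
static Schema myOutputSchema(Class<?> tClass, Class<?> sClass) {
    Schema tSchema = ReflectData.get().getSchema(tClass);
    Schema sSchema = ReflectData.get().getSchema(sClass);
    return SchemaBuilder.record("MyOutput").namespace("mypackage")
        .fields()
        .name("foo").type(tSchema).noDefault()
        .name("bar").type(sSchema).noDefault()
        .name("baz").type().booleanType().noDefault()
        .endRecord();
}
The same helper could then feed both the CoderFactory above and the sink, e.g. output.apply(AvroIO.Write.to("/out").withSchema(myOutputSchema(T.class, S.class))).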

ANTLR Parse tree modification

I'm using ANTLR 4 to create a parse tree for my grammar, and what I want to do is modify certain nodes in the tree. This will include removing certain nodes and inserting new ones. The purpose behind this is optimization for the language I am writing. I have yet to find a solution to this problem. What would be the best way to go about it?
While there is currently no real support or tools for tree rewriting, it is very possible to do. It's not even that painful.
The ParseTreeListener or your MyBaseListener can be used with a ParseTreeWalker to walk your parse tree.
From here, you can remove nodes with ParserRuleContext.removeLastChild(); however, when doing this, you have to watch out for ParseTreeWalker.walk:
public void walk(ParseTreeListener listener, ParseTree t) {
    if ( t instanceof ErrorNode ) {
        listener.visitErrorNode((ErrorNode) t);
        return;
    }
    else if ( t instanceof TerminalNode ) {
        listener.visitTerminal((TerminalNode) t);
        return;
    }
    RuleNode r = (RuleNode) t;
    enterRule(listener, r);
    int n = r.getChildCount(); // child count is cached here
    for (int i = 0; i < n; i++) {
        walk(listener, r.getChild(i));
    }
    exitRule(listener, r);
}
If the walker has already visited the parents of removed nodes, you must replace those nodes with something; I usually pick empty ParserRuleContext objects (because of the cached value of n in the method above). This prevents the ParseTreeWalker from throwing an NPE.
When adding nodes, make sure to set the mutable parent on the ParserRuleContext to the new parent. Also, because of the cached n in the method above, a good strategy is to detect where the changes need to be made before the walk reaches the place where you want your changes to go, so the ParseTreeWalker will walk over them in the same pass (otherwise you might need multiple passes...).
Your pseudo code should look like this:
public void enterRewriteTarget(@NotNull MyParser.RewriteTargetContext ctx) {
    if (shouldRewrite(ctx)) {
        ArrayList<ParseTree> nodesReplaced = replaceNodes(ctx);
        addChildTo(ctx, createNewParentFor(nodesReplaced));
    }
}
I've used this method to write a transpiler that compiled a synchronous internal language into asynchronous JavaScript. It was pretty painful.
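For illustration, here is a minimal sketch of that swap-instead-of-remove strategy (the helper name is mine; children and parent are public fields on ParserRuleContext in the ANTLR 4 Java runtime):
import org.antlr.v4.runtime.ParserRuleContext;

// Swap a child for an empty placeholder rather than removing it, so the
// walker's cached child count (n above) stays valid and no NPE is thrown.
static void replaceWithPlaceholder(ParserRuleContext parent, int childIndex) {
    ParserRuleContext placeholder = new ParserRuleContext();
    placeholder.parent = parent;                  // keep the parent pointer consistent
    parent.children.set(childIndex, placeholder);
}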
Another approach would be to write a ParseTreeVisitor that converts the tree back to a string. (This can be trivial in some cases, because you are only calling TerminalNode.getText() and concatenating in aggregateResult(..).)
You then add the modifications to this visitor so that the resulting string representation contains the modifications you try to achieve.
Then parse the string and you get a parse tree with the desired modifications.
This is certainly hackish in some ways, since you parse the string twice. On the other hand, the solution does not rely on ANTLR implementation details.
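A minimal sketch of such a visitor (assuming the standard ANTLR 4 runtime; the naive single-space joining is the trivial case mentioned above):
import org.antlr.v4.runtime.tree.AbstractParseTreeVisitor;
import org.antlr.v4.runtime.tree.TerminalNode;

static class TreeToStringVisitor extends AbstractParseTreeVisitor<String> {
    @Override
    public String visitTerminal(TerminalNode node) {
        return node.getText() + " "; // naive spacing; adjust to your language's whitespace rules
    }

    @Override
    protected String defaultResult() {
        return "";
    }

    @Override
    protected String aggregateResult(String aggregate, String nextResult) {
        return aggregate + nextResult; // concatenate the children's text
    }
}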
I needed something similar for simple transformations. I ended up using a ParseTreeWalker and a custom ...BaseListener in which I overrode the enter... methods. Inside these methods, ParserRuleContext.children is available and can be manipulated.
class MyListener extends ...BaseListener {
    @Override
    public void enter...(...Context ctx) {
        super.enter...(ctx);
        ctx.children.add(...);
    }
}
new ParseTreeWalker().walk(new MyListener(), parseTree);

Is there anything like "CheckSum" in Dart (on Objects)?

For testing purposes I'm trying to design a way to verify that the results of statistical tests are identical across versions, platforms, and so on. There is a lot going on inside our collections of objects, including ints, nums, dates, Strings, and more.
In the end I want to 'know' that the whole set of instantiated objects sums to the same value (by doing something like adding up the checksums of all internal properties).
I can write low-level code for each internal value to return a checksum, but I was thinking that perhaps something like this already exists.
Thanks!
_swarmii
This sounds like you should be using the serialization library (install via Pub).
Here's a simple example to get you started:
import 'package:serialization/serialization.dart';

class Address {
  String street;
  int number;
}

main() {
  var address = new Address()
    ..number = 5
    ..street = 'Luumut';
  var serialization = new Serialization()
    ..addRuleFor(address);
  Map output = serialization.write(address, new SimpleJsonFormat());
  print(output);
}
Then, depending on what exactly you want to do, I'm sure you can fine-tune the code for your purpose.

Irony AST generation throws NullReferenceException

I'm getting started with Irony (version Irony_2012_03_15), but I pretty quickly got stuck when trying to generate an AST. Below is a completely stripped-down grammar that throws the exception:
[Language("myLang", "0.1", "Bla Bla")]
public class MyLang: Grammar {
public NModel()
: base(false) {
var number = TerminalFactory.CreateCSharpNumber("number");
var binExpr = new NonTerminal("binExpr", typeof(BinaryOperationNode));
var binOp = new NonTerminal("BinOp");
binExpr.Rule = number + binOp + number;
binOp.Rule = ToTerm("+");
RegisterOperators(1, "+");
//MarkTransient(binOp);
this.Root = binExpr;
this.LanguageFlags = Parsing.LanguageFlags.CreateAst; // if I uncomment this line it throws the error
}
}
As soon as I uncomment the last line, it throws a NullReferenceException in the grammar explorer or when I try to parse test input. The error is at AstBuilder.cs line 96:
parseNode.AstNode = config.DefaultNodeCreator();
DefaultNodeCreator is a delegate that has not been set.
I've tried setting things with MarkTransient etc., but no dice.
Can someone help me out here? I'm probably missing something obvious. I've looked for AST tutorials all over the web, but I can't seem to find an explanation of how this works.
Thanks in advance,
Gert-Jan
Once you set the LanguageFlags.CreateAst flag on the grammar, you must provide additional information about how to create the AST.
You're supposed to be able to set AstContext.Default*Type for the whole language, but that is currently bugged. Instead, for each node, do one of the following:
Set TermFlags.NoAstNode: Irony will ignore this node and its children.
Set AstConfig.NodeCreator: this is a delegate that can do the right thing.
Set AstConfig.NodeType to the type of the AST node. This type should be accessible, implement IAstInit, and have a public, no-parameter constructor. Accessible in this case means either public, or internal with the InternalsVisibleTo attribute.
To be honest, I was facing the same problem and did not understand Jay Bazuzi's answer, though it looks like a valid one (maybe it's outdated).
In case there's anyone like me: I just inherited my grammar from the Irony.Interpreter.InterpretedLanguageGrammar class, and it works. Also, for anyone trying to get the AST working, make sure your node classes are public. :-)
On top of Jay's and Erti-Chris's responses, this thread is also useful:
https://irony.codeplex.com/discussions/361018
The creator of Irony points out the relevant configuration code in InterpretedLanguageGrammar.BuildAst.
HTH
