How to map between push parser and pull parser?

I have implemented a push parser that reads a data stream and emits tokens on selected content via a callback handler. This abstract technique is also known as the observer pattern (the callback handler being the observer) and is used, for instance, in SAX for parsing XML.
The contrary design pattern (is there a name for it?) is to pull the next data token, as used for instance in XML parsing with StAX.
One can easily map a pull parser to a push parser by looping over it:
// push
parser.parse( callback: handler );
// pull
while( token = parser.next ) {
    handler(token);
}
But how do I map a push parser to a pull parser?

To adapt a push parser into a pull parser, you have to collect several callbacks (possibly all of them, depending on what is being parsed and the order of the elements being pushed) into Event objects, and then allow those Events to be pulled.
We can use XML as an example and adapt a SAXHandler into a StAX parser. We also have to implement the XMLStreamReader methods for iterating over the StAX XMLEvents.
I've never used StAX, but it looks like it stores the current state in the XMLStreamReader object. Each call to reader.next() updates the state, and the values returned from reader.getName(), reader.getText(), etc. are updated accordingly.
We can do this several ways, from parsing the entire document into memory first and then iterating over what we've stored, to more complicated techniques such as using multiple threads to parse the XML and blocking until the user calls next() (a sketch of that threaded variant follows the class below).
For simplicity, I'll just show storing everything in memory:
class SAXHandler extends DefaultHandler implements XMLStreamReader {
    // StAX Event objects collected from the SAX callbacks
    private final List<XMLEvent> events = new ArrayList<>();
    private int counter = 0;
    // StAX current tag name and text data, updated by calls to next()
    private String name, text;

    // Triggered when the start of a tag is found.
    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes) throws SAXException {
        // create a new XMLEvent for the start of the new tag
        XMLEvent newEvent = ....
        events.add(newEvent);
    }

    // other SAX methods implemented similarly
    ...
Now for the StAX methods:
    @Override
    public XMLEvent next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        XMLEvent next = events.get(counter);
        counter++;
        // update our current content
        this.name = next.name;
        this.text = next.text;
        ...
        return next;
    }

    @Override
    public boolean hasNext() {
        return counter < events.size();
    }

    ...

    @Override
    public String getName() {
        return name;
    }

    @Override
    public String getText() {
        return text;
    }
}
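The threaded variant mentioned above can be sketched with a blocking queue: the SAX callbacks run on a producer thread and hand events over, and next() blocks until the parser has produced one. A minimal sketch (the adapter class and its capacity are hypothetical, and end-of-stream signalling, e.g. a sentinel event, is omitted for brevity):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical adapter: the SAX parse runs on its own thread and is
// throttled by the bounded queue; the pulling side blocks in next().
class BlockingPullAdapter {
    private final BlockingQueue<XMLEvent> queue = new ArrayBlockingQueue<>(16);

    BlockingPullAdapter(Runnable saxParse) {
        // saxParse is assumed to call push(event) from each SAX callback
        new Thread(saxParse, "sax-producer").start();
    }

    // called from the SAX handler (producer thread); blocks while the queue is full
    void push(XMLEvent event) throws InterruptedException {
        queue.put(event);
    }

    // called by the pulling client; blocks until the next event is available
    XMLEvent next() throws InterruptedException {
        return queue.take();
    }
}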
Hope this helps

What I think you are looking for is Control Inversion, which is not easy in languages that are tied to a stack-like execution model.
C is not quite welded to execution stacks, so you could do this with the (deprecated) POSIX getcontext/setcontext/makecontext, or slightly more portably with threads.
In other languages it is easier, if no less mind-bending. See Scheme's call/cc primitive, this piece of Lua ancient history, or take a look at Python generators (although the latter cannot invert control without help from the function whose control is to be inverted).


How to pass data down the reactive chain

Whenever I need to pass data down the reactive chain I end up doing something like this:
public Mono<String> doFooAndPassDtoAsMono(Dto dto) {
    return Mono.just(dto)
        .flatMap(dtoMono -> {
            Mono<String> result = ...; // remote call returning a Mono
            return Mono.zip(Mono.just(dtoMono), result);
        })
        .flatMap(tup2 -> {
            // do something that requires foo and result and returns a Mono
            return doSomething(tup2.getT1().getFoo(), tup2.getT2());
        });
}
Given the below sample Dto class:
class Dto {
    private String foo;

    public String getFoo() {
        return this.foo;
    }
}
Because it often gets tedious to zip the data all the time to pass it down the chain (especially a few levels down) I was wondering if it's ok to simply reference the dto directly like so:
public Mono<String> doFooAndReferenceParam(Dto dto) {
    Mono<String> result = ...; // remote call returning a Mono
    return result.flatMap(res -> {
        // do something that requires foo and the remote result and returns a Mono
        return doSomething(dto.getFoo(), res);
    });
}
My concern about the second approach is that assuming a subscriber subscribes to this Mono on a thread pool would I need to guarantee that Dto is thread safe (the above example is simple because it just carries a String but what if it's not)?
Also, which one is considered "best practice"?
Based on what you have shared, you can simply do the following:
public Mono<String> doFooAndPassDtoAsMono(Dto dto) {
    return Mono.just(dto.getFoo());
}
The way you are using zip in the first option doesn't serve any purpose. Similarly, the second option will not work either: once the Mono is empty, the next flatMap will not be triggered.
The case is simple if:
- The reference data is available from the beginning (i.e. before the creation of the chain), and
- The chain is created for processing at most one event (i.e. starts with a Mono), and
- The reference data is immutable.
Then you can simply refer to the reference data in a parameter or local variable – just like in your second solution. This is completely okay, and there are no concurrency issues.
Using mutable data in reactive flows is strongly discouraged. If you had a mutable Dto class, you might still be able to use it (assuming proper synchronization) – but this will be very surprising to readers of your code.
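For illustration, here is a minimal immutable variant of the question's Dto (a final field set in the constructor, no setters); an instance like this can safely be captured in lambdas regardless of which thread the subscription runs on:

final class Dto {
    private final String foo;

    Dto(String foo) {
        this.foo = foo;
    }

    public String getFoo() {
        return foo;
    }
}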

How to create read transform using ParDo and DoFn in Apache Beam

According to the Apache Beam documentation, the recommended way to write simple sources is by using Read Transforms and ParDo. Unfortunately, the Apache Beam docs have let me down here.
I'm trying to write a simple unbounded data source which emits events using a ParDo, but the compiler keeps complaining about the input type of the DoFn object:
message: 'The method apply(PTransform<? super PBegin,OutputT>) in the type PBegin is not applicable for the arguments (ParDo.SingleOutput<PBegin,Event>)'
My attempt:
public class TestIO extends PTransform<PBegin, PCollection<Event>> {
    @Override
    public PCollection<Event> expand(PBegin input) {
        return input.apply(ParDo.of(new ReadFn()));
    }

    private static class ReadFn extends DoFn<PBegin, Event> {
        @ProcessElement
        public void process(@TimerId("poll") Timer pollTimer) {
            Event testEvent = new Event(...);
            // custom logic, this can happen infinitely
            for (...) {
                context.output(testEvent);
            }
        }
    }
}
A DoFn performs element-wise processing. As written, ParDo.of(new ReadFn()) will have type PTransform<PCollection<PBegin>, PCollection<Event>>. Specifically, the ReadFn indicates it takes an element of type PBegin and returns 0 or more elements of type Event.
Instead, you should use an actual Read operation. There are a variety provided. You can also use Create if you have a specific set of in-memory collections to use.
If you need to create a custom source you should use the Read transform. Since you're using timers, you likely want to create an Unbounded Source (a stream of elements).
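For example, a sketch using transforms that ship with the Beam SDK (Create for a bounded in-memory collection, GenerateSequence as one of the provided unbounded sources; class and package names are per recent Beam releases):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ReadExamples {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Bounded: a fixed in-memory collection via Create.
        PCollection<String> bounded = p.apply(Create.of("a", "b", "c"));

        // Unbounded: emits an increasing sequence of longs, one per second.
        PCollection<Long> unbounded =
                p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)));

        p.run();
    }
}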

JSR 352 :How to collect data from the Writer of each Partition of a Partitioned Step?

So, I have 2 partitions in a step which writes into a database. I want to record the number of rows written in each partition, get the sum, and print it to the log.
I was thinking of using a static variable in the Writer and using the Step Context/Job Context to get it in afterStep() of the Step Listener. However, when I tried it I got null. I am able to get these values in close() of the Reader.
Is this the right way to go about it? Or should I use a Partition Collector/Reducer/Analyzer?
I am using Java Batch on WebSphere Liberty, and I am developing in Eclipse.
I was thinking of using a static variable in the Writer and use Step Context/Job Context to get it in afterStep() of the Step Listener. However when I tried it I got null.
The ItemWriter might already be destroyed at this point, but I'm not sure.
Is this the right way to go about it?
Yes, it should be good enough. However, you need to ensure the total row count is shared across all partitions, because the batch runtime maintains a separate StepContext clone per partition. You should rather use the JobContext.
I think using PartitionCollector and PartitionAnalyzer is a good choice, too. The PartitionCollector interface has a method collectPartitionData() to collect data coming from its partition. Once collected, the batch runtime passes this data to the PartitionAnalyzer to analyze. Notice that there are:
- N PartitionCollectors per step (1 per partition)
- N StepContexts per step (1 per partition)
- 1 PartitionAnalyzer per step
The records written can be passed via the StepContext's transientUserData. Since the StepContext is reserved for its own step partition, the transient user data won't be overwritten by another partition.
Here's the implementation:
MyItemWriter:
@Inject
private StepContext stepContext;

@Override
public void writeItems(List<Object> items) throws Exception {
    // ... write the items ...
    Object userData = stepContext.getTransientUserData();
    int partRowCount = (userData != null ? (int) userData : 0) + items.size();
    stepContext.setTransientUserData(partRowCount);
}
MyPartitionCollector:
@Inject
private StepContext stepContext;

@Override
public Serializable collectPartitionData() throws Exception {
    // get transient user data
    Object userData = stepContext.getTransientUserData();
    int partRowCount = userData != null ? (int) userData : 0;
    return partRowCount;
}
MyPartitionAnalyzer:
private int rowCount = 0;

@Override
public void analyzeCollectorData(Serializable fromCollector) throws Exception {
    rowCount += (int) fromCollector;
    System.out.printf("%d rows processed (all partitions).%n", rowCount);
}
Reference: JSR 352 v1.0 Final Release.pdf
Let me offer a bit of an alternative to the accepted answer, and add some comments.
PartitionAnalyzer variant - use the analyzeStatus() method
Another technique would be to use analyzeStatus(), which only gets called once at the end of each entire partition and is passed the partition-level exit status:
public void analyzeStatus(BatchStatus batchStatus, String exitStatus)
In contrast, analyzeCollectorData, used in the answer above, gets called at the end of each chunk on each partition.
E.g.
public class MyItemWriteListener extends AbstractItemWriteListener {
    @Inject
    StepContext stepCtx;

    @Override
    public void afterWrite(List<Object> items) throws Exception {
        // update 'newCount' based on items.size()
        stepCtx.setExitStatus(Integer.toString(newCount));
    }
}
Obviously this only works if you weren't using the exit status for some other purpose. You can set the exit status from any artifact (though this freedom might be one more thing to have to keep track of).
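A sketch of the matching analyzer (implementing javax.batch.api.partition.PartitionAnalyzer; here each partition's exit status is assumed to hold its row count, as set by the listener above):

public class StatusSummingAnalyzer implements PartitionAnalyzer {
    private int totalRows = 0;

    @Override
    public void analyzeCollectorData(Serializable data) throws Exception {
        // unused in this variant
    }

    @Override
    public void analyzeStatus(BatchStatus batchStatus, String exitStatus) throws Exception {
        // exitStatus carries the partition's row count as a string
        if (exitStatus != null) {
            totalRows += Integer.parseInt(exitStatus);
        }
        System.out.println("Total rows (all partitions): " + totalRows);
    }
}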
Comments
The API is designed to facilitate an implementation dispatching individual partitions across JVMs (e.g. in Liberty you can see this here). But using a static ties you to a single JVM, so it's not a recommended approach.
Also note that both the JobContext and the StepContext are implemented in the "thread-local"-like fashion we see in batch.

ANTLR Parse tree modification

I'm using ANTLR4 to create a parse tree for my grammar, what I want to do is modify certain nodes in the tree. This will include removing certain nodes and inserting new ones. The purpose behind this is optimization for the language I am writing. I have yet to find a solution to this problem. What would be the best way to go about this?
While there is currently no real support or tools for tree rewriting, it is very possible to do. It's not even that painful.
The ParseTreeListener or your MyBaseListener can be used with a ParseTreeWalker to walk your parse tree.
From here, you can remove nodes with ParserRuleContext.removeLastChild(); however, when doing this, you have to watch out for ParseTreeWalker.walk:
public void walk(ParseTreeListener listener, ParseTree t) {
    if ( t instanceof ErrorNode) {
        listener.visitErrorNode((ErrorNode)t);
        return;
    }
    else if ( t instanceof TerminalNode) {
        listener.visitTerminal((TerminalNode)t);
        return;
    }
    RuleNode r = (RuleNode)t;
    enterRule(listener, r);
    int n = r.getChildCount();
    for (int i = 0; i<n; i++) {
        walk(listener, r.getChild(i));
    }
    exitRule(listener, r);
}
You must replace removed nodes with something if the walker has already visited their parents; I usually pick empty ParserRuleContext objects (this is because of the cached value of n in the method above). This prevents the ParseTreeWalker from throwing an NPE.
When adding nodes, make sure to set the mutable parent on the ParserRuleContext to the new parent. Also, because of the cached n in the method above, a good strategy is to detect where the changes need to be made before the walk reaches the place where you want your changes to go, so the ParseTreeWalker will walk over them in the same pass (otherwise you might need multiple passes).
Your pseudo code should look like this:
public void enterRewriteTarget(@NotNull MyParser.RewriteTargetContext ctx) {
    if (shouldRewrite(ctx)) {
        ArrayList<ParseTree> nodesReplaced = replaceNodes(ctx);
        addChildTo(ctx, createNewParentFor(nodesReplaced));
    }
}
I've used this method to write a transpiler that compiled a synchronous internal language into asynchronous javascript. It was pretty painful.
Another approach would be to write a ParseTreeVisitor that converts the tree back to a string. (This can be trivial in some cases, because you are only calling TerminalNode.getText() and concatenating in aggregateResult(..).)
You then add the modifications to this visitor so that the resulting string representation contains the modifications you try to achieve.
Then parse the string and you get a parse tree with the desired modifications.
This is certainly hackish in some ways, since you parse the string twice. On the other hand the solution does not rely on antlr implementation details.
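A minimal sketch of such a visitor (the class name is hypothetical; AbstractParseTreeVisitor is part of the ANTLR4 runtime):

import org.antlr.v4.runtime.tree.AbstractParseTreeVisitor;
import org.antlr.v4.runtime.tree.TerminalNode;

// Rebuilds source text by concatenating the text of all terminals;
// modifications would be made by overriding the relevant visit methods.
class SourceRebuilder extends AbstractParseTreeVisitor<String> {
    @Override
    public String visitTerminal(TerminalNode node) {
        return node.getText() + " ";
    }

    @Override
    protected String defaultResult() {
        return "";
    }

    @Override
    protected String aggregateResult(String aggregate, String nextResult) {
        return aggregate + nextResult;
    }
}

Calling new SourceRebuilder().visit(parseTree) then yields a string you can feed back into the parser.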
I needed something similar for simple transformations. I ended up using a ParseTreeWalker and a custom ...BaseListener where I overrode the enter... methods. Inside these methods, ParserRuleContext.children is available and can be manipulated.
class MyListener extends ...BaseListener {
    @Override
    public void enter...(...Context ctx) {
        super.enter...(ctx);
        ctx.children.add(...);
    }
}

new ParseTreeWalker().walk(new MyListener(), parseTree);

Clean up SAX Handler

I've made a SAX parser for parsing XML files with a number of different tags. For performance reasons I chose SAX over DOM, and I'm glad I did, because it works fast and well. The only issue I currently have is that the main class (which extends DefaultHandler) is a bit large and not very easy on the eyes. It contains a huge if/else-if block where I check the tag name, with some nested ifs for reading specific attributes. This block is located in the startElement method.
Is there any nice clean way to split this up? I would like to have a main class which reads the files, and then a handler for every tag. In this tag handler, I'd like to read the attributes for that tag, do something with them, and then go back to the main handler to read the next tag, which again gets redirected to the appropriate handler.
My main handler also has a few global Collection variables which gather information about all the documents I parse with it. Ideally, I would be able to add something to those collections from the tag handlers.
A code example would be very helpful, if possible. I read something on this site about a handler stack, but without a code example I was not able to reproduce it.
Thanks in advance :)
I suggest setting up a chain of SAX filters. A SAX filter is just like any other SAX Handler, except that it has another SAX handler to pass events into when it's done. They're frequently used to perform a sequence of transformations to an XML stream, but they can also be used to factor things the way you want.
You don't mention the language you're using, but you mention DefaultHandler so I'll assume Java. The first thing to do is to code up your filters. In Java, you do this by implementing XMLFilter (or, more simply, by subclassing XMLFilterImpl)
import java.util.Collection;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

public class TagOneFilter extends XMLFilterImpl {
    private Collection<Object> collectionOfStuff;

    public TagOneFilter(Collection<Object> collectionOfStuff) {
        this.collectionOfStuff = collectionOfStuff;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        if ("tagOne".equals(qName)) {
            // Interrogate the parameters and update collectionOfStuff
        }
        // Pass the event to downstream filters.
        if (getContentHandler() != null) {
            getContentHandler().startElement(uri, localName, qName, atts);
        }
    }
}
Next, your main class, which instantiates all of the filters and chains them together.
import java.util.ArrayList;
import java.util.Collection;

import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class Driver {
    public static void main(String[] args) throws Exception {
        Collection<Object> collectionOfStuff = new ArrayList<Object>();
        XMLReader parser = XMLReaderFactory.createXMLReader();

        TagOneFilter tagOneFilter = new TagOneFilter(collectionOfStuff);
        tagOneFilter.setParent(parser);
        TagTwoFilter tagTwoFilter = new TagTwoFilter(collectionOfStuff);
        tagTwoFilter.setParent(tagOneFilter);

        // Call parse() on the tail of the filter chain. This will finish
        // tying the filters together before executing the parse at the
        // XMLReader at the beginning.
        tagTwoFilter.parse(args[0]);

        // Now do something interesting with your collectionOfStuff.
    }
}
