Clean up SAX Handler - xml-parsing

I've made a SAX parser for parsing XML files with a number of different tags. For performance reasons I chose SAX over DOM, and I'm glad I did, because it's fast and works well. The only issue I currently have is that the main class (which extends DefaultHandler) is a bit large and not very easy on the eyes. It contains a huge if/else-if block where I check the tag name, with some nested ifs for reading specific attributes. This block is located in the startElement method.
Is there any nice clean way to split this up? I would like to have a main class which reads the files, and then a Handler for every tag. In this Tag Handler, I'd like to read the attributes for that tag, do something with them, and then go back to the main handler to read the next tag which again gets redirected to the appropriate handler.
My main handler also has a few global Collection variables, which gather information regarding all the documents I parse with it. Ideally, I would be able to add something to those collections from the Tag Handlers.
A code example would be very helpful, if possible. I read something on this site about a handler stack, but without a code example I was not able to reproduce it.
Thanks in advance :)

I suggest setting up a chain of SAX filters. A SAX filter is just like any other SAX Handler, except that it has another SAX handler to pass events into when it's done. They're frequently used to perform a sequence of transformations to an XML stream, but they can also be used to factor things the way you want.
You don't mention the language you're using, but since you mention DefaultHandler I'll assume Java. The first thing to do is to code up your filters. In Java, you do this by implementing XMLFilter (or, more simply, by subclassing XMLFilterImpl):
import java.util.Collection;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;
public class TagOneFilter extends XMLFilterImpl {

    private Collection<Object> collectionOfStuff;

    public TagOneFilter(Collection<Object> collectionOfStuff) {
        this.collectionOfStuff = collectionOfStuff;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        if ("tagOne".equals(qName)) {
            // Interrogate the parameters and update collectionOfStuff
        }
        // Pass the event to downstream filters.
        if (getContentHandler() != null) {
            getContentHandler().startElement(uri, localName, qName, atts);
        }
    }
}
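The TagTwoFilter used by the driver below would follow the same pattern; here is a minimal sketch, matching a hypothetical "tagTwo" element:

import java.util.Collection;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

public class TagTwoFilter extends XMLFilterImpl {

    private Collection<Object> collectionOfStuff;

    public TagTwoFilter(Collection<Object> collectionOfStuff) {
        this.collectionOfStuff = collectionOfStuff;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        if ("tagTwo".equals(qName)) {
            // Interrogate the attributes and update collectionOfStuff
        }
        // Pass the event to downstream filters.
        if (getContentHandler() != null) {
            getContentHandler().startElement(uri, localName, qName, atts);
        }
    }
}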
Next, your main class, which instantiates all of the filters and chains them together.
import java.util.ArrayList;
import java.util.Collection;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
public class Driver {

    public static void main(String[] args) throws Exception {
        Collection<Object> collectionOfStuff = new ArrayList<Object>();
        XMLReader parser = XMLReaderFactory.createXMLReader();

        TagOneFilter tagOneFilter = new TagOneFilter(collectionOfStuff);
        tagOneFilter.setParent(parser);
        TagTwoFilter tagTwoFilter = new TagTwoFilter(collectionOfStuff);
        tagTwoFilter.setParent(tagOneFilter);

        // Call parse() on the tail of the filter chain. This will finish
        // tying the filters together before executing the parse at the
        // XMLReader at the beginning.
        tagTwoFilter.parse(args[0]);

        // Now do something interesting with your collectionOfStuff.
    }
}


Add custom information to Spock Global Extension

I have configured a Spock Global Extension with a static ErrorListener class inside it. It works fine for test errors: I can catch the feature title and the errors if they happen. But how can I add some custom information to the listener?
For example, I have a test that calls some API. In case it fails, I want to add the request/response body to the listener (and report it later). Obviously I have the request/response inside the feature, or I can get it. How can I pass this information to the listener and read it later in the handling code?
package org.example
import groovy.json.JsonSlurper
import org.spockframework.runtime.AbstractRunListener
import org.spockframework.runtime.extension.AbstractGlobalExtension
import org.spockframework.runtime.model.ErrorInfo
import org.spockframework.runtime.model.IterationInfo
import org.spockframework.runtime.model.SpecInfo
import spock.lang.Specification
class OpenBrewerySpec extends Specification {

    def getBreweryTest() {
        def breweryText = new URL('https://api.openbrewerydb.org/breweries/1').text
        def breweryJson = new JsonSlurper().parseText(breweryText)
        //TODO catch breweryText for test result reporting if it is possible

        expect:
        breweryJson.country == 'United States'
    }

    def cleanup() {
        specificationContext.currentSpec.listeners
                .findAll { it instanceof TestResultExtension.ErrorListener }
                .each {
                    def errorInfo = (it as TestResultExtension.ErrorListener).errorInfo
                    if (errorInfo)
                        println "Test failure in feature '${specificationContext.currentIteration.name}', " +
                                "exception class ${errorInfo.exception.class.simpleName}"
                    else
                        println "Test passed in feature '${specificationContext.currentIteration.name}'"
                }
    }
}

class TestResultExtension extends AbstractGlobalExtension {

    @Override
    void visitSpec(SpecInfo spec) {
        spec.addListener(new ErrorListener())
    }

    static class ErrorListener extends AbstractRunListener {
        ErrorInfo errorInfo

        @Override
        void beforeIteration(IterationInfo iteration) {
            errorInfo = null
        }

        @Override
        void error(ErrorInfo error) {
            errorInfo = error
        }
    }
}
Create the file src/test/resources/META-INF/services/org.spockframework.runtime.extension.IGlobalExtension and put the string "org.example.TestResultExtension" in it to enable the extension.
I am pretty sure you found my solution here. Then you also know that it is designed to let you know, in a cleanup() method, whether the test succeeded or failed, because otherwise Spock does not make that information available. I do not understand why you deliberately omitted that information and posted a fragment instead of the whole method, or at least did not mention where your code snippet gets executed. That is not a helpful way of asking a question. Nobody would know except for me, because I am the author of this global extension.
So now, after having established that you are inside a cleanup() method, I can tell you: the information does not belong in the global extension, because in the cleanup() method you have access to information from the test, such as fields. Why don't you design your test in such a way that whatever information cleanup() needs is stored in a field, as you would normally do without using any global extensions? The latter is only meant to help you establish the error status (passed vs. failed) as such.
By the way, I even doubt that you need additional information in the cleanup() method at all, because its purpose is cleaning up, not reporting or logging anything. For that, Spock has a reporting system which you can also write extensions for.
Sorry for not being more specific in my answer, but your question is equally unspecific. It is an instance of the XY problem: explaining how you think you should do something instead of explaining what you want to achieve. Your sample code omits important details, e.g. the core test code as such.

flatMap vs map, basic explanation is ok, but what happens when my transformation function isn't synchronous by itself?

I like the basic explanations of complex Reactor concepts found all over the web, but they are not particularly useful in production code. Here is a piece of code I wrote which sends a message to Kafka using reactor-kafka + Spring Boot:
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.kafka.sender.KafkaSender;
import reactor.kafka.sender.SenderOptions;
import reactor.kafka.sender.SenderRecord;
import reactor.kafka.sender.SenderResult;
import java.util.Properties;
public class CallbackSender {

    private ObjectMapper objectMapper;
    private String topic;

    private static final Logger log = LoggerFactory.getLogger(CallbackSender.class.getName());

    private final KafkaSender<String, String> sender;

    public CallbackSender(ObjectMapper objectMapper, Properties senderProps, String topic) {
        this.sender = KafkaSender.create(SenderOptions.create(senderProps));
        this.objectMapper = objectMapper;
        this.topic = topic;
    }

    public Mono<SenderResult<String>> sendMessage(ProcessContext<? extends AbstractMessage> processContext) throws JsonProcessingException {
        ProducerRecord<String, String> producerRecord = new ProducerRecord<>(topic,
                objectMapper.writeValueAsString(processContext.getMessage()));
        SenderRecord<String, String, String> senderRecord = SenderRecord.create(producerRecord, processContext.getId());
        return sender.send(Flux.just(senderRecord))
                .doOnError(e -> log.error("Send failed", e))
                .last();
    }
}
What I can't grasp is what exactly the difference is between calling this.sendMessage via .map vs .flatMap from the outer pipeline. What is the point of the explanation that map applies a synchronous transformation to the emitted element, if my synchronous function is not really doing anything synchronous apart from basic field fetches?
Here the Kafka sender is already reactive and async, so it doesn't matter which one I use? Is that a correct assumption?
Is my code non-idiomatic?
Or, for this particular case, would wrapping everything I do inside .sendMessage in .flatMap just be a safe choice in case someone adds synchronous code in the future, i.e. sugar-safety syntax?
My understanding is that .map will simply prepare the pipeline in this case, which returns a Mono, and the subscriber of the outer calling pipeline will trigger the entire domino effect. Is that correct?
What I can't grasp is what exactly the difference is between calling this.sendMessage via .map vs .flatMap from the outer pipeline
map() applies a synchronous function (i.e. one "in-place" with no subscriptions or callbacks) and just returns the result as is. flatMap() applies an asynchronous transformer function, and unwraps the Publisher when done. So:
My understanding is that .map will simply prepare the pipeline in this case, which returns a Mono, and the subscriber of the outer calling pipeline will trigger the entire domino effect. Is that correct?
Yes, that's correct (if by "domino effect" you mean that the returning mono will be subscribed to and its result returned.)
What is the point of the explanation that map applies a synchronous transformation to the emitted element, if my synchronous function is not really doing anything synchronous apart from basic field fetches?
Quite simply, because that's what you've told it to do. There's nothing inherently asynchronous about setting up a publisher, just its execution once it's been subscribed to (which doesn't happen with a map() call.)
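To make the difference concrete, here is a minimal, self-contained sketch. The send method is a hypothetical stand-in for the question's sendMessage (which likewise returns a Mono); the point is that map only builds the inner Mono, while flatMap also subscribes to it and flattens the result:

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class MapVsFlatMapDemo {

    // Hypothetical async operation standing in for sendMessage(...): it only
    // describes the work; nothing runs until the Mono is subscribed to.
    static Mono<String> send(String payload) {
        return Mono.fromCallable(() -> "sent:" + payload);
    }

    public static void main(String[] args) {
        // map: the lambda itself is the synchronous transformation. It merely
        // builds a Mono, so the result is a Flux<Mono<String>> and no send
        // happens unless each inner Mono is subscribed to separately.
        Flux<Mono<String>> unflattened = Flux.just("a", "b").map(MapVsFlatMapDemo::send);

        // flatMap: each returned Mono is subscribed to and its result is merged
        // back into the outer pipeline, so the send actually executes.
        Flux<String> flattened = Flux.just("a", "b").flatMap(MapVsFlatMapDemo::send);

        flattened.subscribe(System.out::println); // prints sent:a and sent:b
    }
}

Applied to the question's pipeline, flatMap is what makes the send execute as part of the outer flow; map would leave you holding unsubscribed Monos.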

How to create read transform using ParDo and DoFn in Apache Beam

According to the Apache Beam documentation, the recommended way to write simple sources is by using Read transforms and ParDo. Unfortunately, the Apache Beam docs have let me down here.
I'm trying to write a simple unbounded data source which emits events using a ParDo but the compiler keeps complaining about the input type of the DoFn object:
message: 'The method apply(PTransform<? super PBegin,OutputT>) in the type PBegin is not applicable for the arguments (ParDo.SingleOutput<PBegin,Event>)'
My attempt:
public class TestIO extends PTransform<PBegin, PCollection<Event>> {

    @Override
    public PCollection<Event> expand(PBegin input) {
        return input.apply(ParDo.of(new ReadFn()));
    }

    private static class ReadFn extends DoFn<PBegin, Event> {
        @ProcessElement
        public void process(@TimerId("poll") Timer pollTimer) {
            Event testEvent = new Event(...);
            //custom logic, this can happen infinitely
            for (...) {
                context.output(testEvent);
            }
        }
    }
}
A DoFn performs element-wise processing. As written, ParDo.of(new ReadFn()) will have type PTransform<PCollection<PBegin>, PCollection<Event>>. Specifically, the ReadFn indicates it takes an element of type PBegin and returns 0 or more elements of type Event.
Instead, you should use an actual Read operation. There are a variety provided. You can also use Create if you have a specific set of in-memory collections to use.
If you need to create a custom source you should use the Read transform. Since you're using timers, you likely want to create an Unbounded Source (a stream of elements).
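For the in-memory case mentioned above, a minimal sketch of Create could look like this (plain strings rather than the question's Event type, to keep it self-contained):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class CreateExample {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());

        // Create.of turns an in-memory collection into a PCollection,
        // so no DoFn rooted at PBegin is needed.
        PCollection<String> events = pipeline.apply(Create.of("event-1", "event-2"));

        pipeline.run().waitUntilFinish();
    }
}

A genuinely unbounded stream of events, as in the question, would instead go through an unbounded Read, as described above.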

How to force log4j2 rolling file appender to roll over?

To the best of my knowledge, the RollingFileAppender in log4j2 will not roll over at the specified time (let's say, at the end of an hour), but at the first log event that arrives after the time threshold has been exceeded.
Is there a way to trigger an event that, on the one hand, will cause the file to roll over, and on the other, will not append to the log (or will append something trivial, like an empty string)?
No, there isn't any (built-in) way to do this. There are no background threads monitoring rollover time.
You could create a log4j2 plugin that implements org.apache.logging.log4j.core.appender.rolling.TriggeringPolicy (See the built-in TimeBasedTriggeringPolicy and SizeBasedTriggeringPolicy classes for sample code.)
If you configure your custom triggering policy, log4j2 will check for every log event whether it should trigger a rollover (so take care when implementing the isTriggeringEvent method to avoid impacting performance). Note that for your custom plugin to be picked up, you need to specify the package of your class in the packages attribute of the Configuration element of your log4j2.xml file.
Finally, if this works well for you and you think your solution may be useful to others too, consider contributing your custom triggering policy back to the log4j2 code base.
Following Remko's idea, I wrote the following code, and it's working.
package com.stony;
import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.appender.rolling.*;
import org.apache.logging.log4j.core.config.plugins.Plugin;
import org.apache.logging.log4j.core.config.plugins.PluginFactory;
@Plugin(name = "ForceTriggerPolicy", category = "Core")
public class ForceTriggerPolicy implements TriggeringPolicy {

    private static boolean isRolling;

    @Override
    public void initialize(RollingFileManager arg0) {
        setRolling(false);
    }

    @Override
    public boolean isTriggeringEvent(LogEvent arg0) {
        return isRolling();
    }

    public static boolean isRolling() {
        return isRolling;
    }

    public static void setRolling(boolean _isRolling) {
        isRolling = _isRolling;
    }

    @PluginFactory
    public static ForceTriggerPolicy createPolicy() {
        return new ForceTriggerPolicy();
    }
}
If you have access to the RollingFileAppender object, you can do something like:
rollingFileAppender.getManager().rollover();
Here you can see the manager class:
https://github.com/apache/logging-log4j2/blob/d368e294d631e79119caa985656d0ec571bd24f5/log4j-core/src/main/java/org/apache/logging/log4j/core/appender/rolling/RollingFileManager.java
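If you don't already hold a reference to the appender, it can be looked up from the current configuration. Here is a rough sketch; the appender name "RollingFile" is an assumption, so use whatever name your log4j2.xml defines:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.core.LoggerContext;
import org.apache.logging.log4j.core.appender.RollingFileAppender;

public class ForceRollover {
    public static void main(String[] args) {
        // Obtain the core LoggerContext and look up the appender by the name
        // used in the configuration ("RollingFile" is assumed here).
        LoggerContext context = (LoggerContext) LogManager.getContext(false);
        RollingFileAppender appender = context.getConfiguration().getAppender("RollingFile");
        if (appender != null) {
            appender.getManager().rollover();
        }
    }
}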

How to map between push parser and pull parser

I have implemented a push parser that reads a data stream and emits tokens on selected content via a callback handler. This abstract technique is also known as the observer pattern (with the callback handler acting as the observer) and is used, for instance, in SAX for parsing XML.
The opposite design pattern (is there a name for it?) is to pull the next data token, as used for instance in XML parsing with StAX.
One can easily map to a push parser by looping a pull parser:
// push
parser.parse( callback: handler );
// pull
while( token = parser.next ) {
    handler(token)
}
But how do I map a push parser to a pull parser?
To adapt a push parser into a pull parser, you have to collect several events (all of them? it depends on what is being parsed and on the order of the elements being pushed) into Event objects, and then allow those Events to be pulled.
We can use XML as an example and adapt a SAX handler into a StAX parser. We also have to implement the XMLStreamReader methods for iterating over the StAX XMLEvents.
I've never used StAX, but it looks like it stores the current state in the XMLStreamReader object. Each call to reader.next() updates the state, and the returned values from reader.getName(), reader.getText(), etc. are updated accordingly.
We can do this in several ways, from parsing the entire thing into memory first and then iterating through what we've stored, to more complicated techniques such as using multiple threads to parse the XML and blocking the read of the next tag until the user calls next() (a rough sketch of that approach follows after the code below).
For simplicity, I'll just show storing everything in memory:
class SAXHandler extends DefaultHandler implements XMLStreamReader {

    // StAX Event objects collected while SAX parses the document
    List<XMLEvent> events = new ArrayList<>();
    int counter = 0;

    // StAX current tag name and text data, updated with calls to next()
    private String name, text;

    @Override
    // Triggered when the start of a tag is found.
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes)
            throws SAXException {
        // create a new XMLEvent for the start of the new tag
        XMLEvent newEvent = ....
        events.add(newEvent);
    }

    // other SAX methods implemented similarly
    ...
Now for the StAX methods:
    @Override
    public XMLEvent next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        XMLEvent next = events.get(counter);
        counter++;
        // update our current content
        this.name = next.name;
        this.text = next.text;
        ...
        return next;
    }

    @Override
    public boolean hasNext() {
        return counter < events.size();
    }

    ...

    @Override
    public String getName() {
        return name;
    }

    @Override
    public String getText() {
        return text;
    }
}
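The thread-based variant mentioned above can be sketched with a blocking queue: the push side runs on its own thread and blocks when the consumer falls behind, while next() blocks until a token is available. This is a generic, hypothetical outline, not tied to the SAX/StAX types:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PushToPullAdapter {

    private static final String END = "<<END>>"; // sentinel marking end of input
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);

    // Called by the push side (e.g. from a handler callback on the parser thread).
    public void push(String token) throws InterruptedException {
        queue.put(token); // blocks when the consumer is slow
    }

    public void finish() throws InterruptedException {
        queue.put(END);
    }

    // Called by the pull side; blocks until the next token is available.
    // Returns null once the push side has finished.
    public String next() throws InterruptedException {
        String token = queue.take();
        return END.equals(token) ? null : token;
    }

    public static void main(String[] args) throws Exception {
        PushToPullAdapter adapter = new PushToPullAdapter();

        // Producer thread stands in for the push parser driving its callbacks.
        Thread producer = new Thread(() -> {
            try {
                for (String t : new String[] {"a", "b", "c"}) {
                    adapter.push(t);
                }
                adapter.finish();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer pulls tokens one at a time, StAX-style.
        for (String token = adapter.next(); token != null; token = adapter.next()) {
            System.out.println(token);
        }
        producer.join();
    }
}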
Hope this helps
What I think you are looking for is control inversion, which is not easy in languages that are tied to a stack-like execution model.
C is not quite welded to execution stacks, so you could do this with the (deprecated) POSIX getcontext/setcontext/makecontext, or slightly more portably with threads.
In other languages it is easier, if no less mind-bending. See Scheme's call/cc primitive, this piece of Lua ancient history, or take a look at Python generators (although the latter are not able to invert control without help from the function whose control is to be inverted).
