Saxon - s9api - setParameter as node and access in transformation

We are trying to add parameters to a transformation at runtime. The only way we have found so far is to set each parameter individually as an atomic value, not as a node. We don't know yet how to create a node for setParameter.
Current setParameter:
setParameter(new QName("TEST"), new XdmAtomicValue(24));
Expected setParameter:
<TempNode> <local>Value1</local> </TempNode>
We have searched and tried to create an XdmNode and an XdmItem.

If you want to create an XdmNode by parsing XML, the best way to do it is:
DocumentBuilder db = processor.newDocumentBuilder();
XdmNode node = db.build(new StreamSource(
new StringReader("<doc><elem/></doc>")));
You could also pass a string containing lexical XML as the parameter value, and then convert it to a tree by calling the XPath parse-xml() function.
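A minimal sketch of that string-plus-parse-xml() approach, assuming an XsltTransformer named transformer and a string stylesheet parameter named TempNode (both names are illustrative):
// Pass the lexical XML as an ordinary string parameter...
transformer.setParameter(new QName("TempNode"),
        new XdmAtomicValue("<TempNode><local>Value1</local></TempNode>"));
// ...and convert it to a tree inside the stylesheet:
//   <xsl:param name="TempNode" as="xs:string"/>
//   <xsl:variable name="tree" select="parse-xml($TempNode)"/>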
If you want to construct the XdmNode programmatically, there are a number of options:
DocumentBuilder.newBuildingStreamWriter() gives you an instance of BuildingStreamWriter, which extends XMLStreamWriter, and you can create the document by writing events to it using methods such as writeStartElement, writeCharacters, and writeEndElement; at the end, call getDocumentNode() on the BuildingStreamWriter, which gives you an XdmNode (a short sketch of this appears after the list of options below). This has the advantage that XMLStreamWriter is a standard API, though it's not actually a very nice one, because the documentation isn't very good and as a result implementations vary in their behaviour.
Another event-based API is Saxon's Push class; this differs from most push-based event APIs in that rather than having a flat sequence of methods like:
builder.startElement('x');
builder.characters('abc');
builder.endElement();
you have a nested sequence:
Element x = Document.elem('x');
x.text('abc');
x.close();
As mentioned by Martin, there is the "sapling" API: Saplings.doc().withChild(elem(...).withChild(elem(...))) etc. This API is rather radically different from anything you might be familiar with (though it's influenced by the LINQ API for tree construction on .NET), but once you've got used to it, it reads very well. The Sapling API constructs a very lightweight tree in memory (hence the name), and converts it to a fully-fledged XDM tree with a final call of SaplingDocument.toXdmNode().
If you're familiar with DOM, JDOM2, or XOM, you can construct a tree using any of those libraries and then convert it for use by Saxon. That's a bit convoluted and only really intended for applications that are already using a third-party tree model heavily (or for users who love these APIs and prefer them to anything else).
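A minimal sketch of the first of these options (BuildingStreamWriter), assuming Saxon s9api on the classpath and using the element names from the question:
import net.sf.saxon.s9api.*;

public class BuildNodeExample {
    public static void main(String[] args) throws Exception {
        Processor processor = new Processor(false);
        DocumentBuilder db = processor.newDocumentBuilder();
        BuildingStreamWriter writer = db.newBuildingStreamWriter();
        writer.writeStartDocument();
        writer.writeStartElement("TempNode");
        writer.writeStartElement("local");
        writer.writeCharacters("Value1");
        writer.writeEndElement();   // </local>
        writer.writeEndElement();   // </TempNode>
        writer.writeEndDocument();
        XdmNode node = writer.getDocumentNode();
        System.out.println(node);
    }
}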

In the Saxon Java s9api, you can construct temporary trees as SaplingNode/SaplingElement/SaplingDocument, see https://www.saxonica.com/html/documentation12/javadoc/net/sf/saxon/sapling/SaplingDocument.html and https://www.saxonica.com/html/documentation12/javadoc/net/sf/saxon/sapling/SaplingElement.html.
To give you a simple example constructing from a Map, as you seem to want to do:
Processor processor = new Processor();
Map<String, String> xsltParameters = new HashMap<>();
xsltParameters.put("foo", "value 1");
xsltParameters.put("bar", "value 2");
SaplingElement saplingElement = new SaplingElement("Test");
for (Map.Entry<String, String> param : xsltParameters.entrySet())
{
    saplingElement = saplingElement.withChild(new SaplingElement(param.getKey()).withText(param.getValue()));
}
XdmNode paramNode = saplingElement.toXdmNode(processor);
System.out.println(paramNode);
outputs e.g. <Test><bar>value 2</bar><foo>value 1</foo></Test>.
So the key is to understand that withChild() returns a new SaplingElement.
The code can be compacted using streams e.g.
XdmNode paramNode2 = Saplings.elem("root").withChild(
        xsltParameters
            .entrySet()
            .stream()
            .map(p -> Saplings.elem(p.getKey()).withText(p.getValue()))
            .collect(Collectors.toList())
            .toArray(SaplingElement[]::new))
    .toXdmNode(processor);
System.out.println(paramNode2);
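To close the loop on the original question, a hedged sketch of passing paramNode to a transformation (continuing the snippet above; the stylesheet and input file names and the parameter name are illustrative, and the stylesheet would declare a matching xsl:param):
XsltCompiler compiler = processor.newXsltCompiler();
XsltExecutable executable = compiler.compile(new StreamSource(new File("transform.xsl")));
XsltTransformer transformer = executable.load();
transformer.setParameter(new QName("TempNode"), paramNode); // node-valued parameter, usable as a tree in the stylesheet
transformer.setSource(new StreamSource(new File("input.xml")));
transformer.setDestination(processor.newSerializer(System.out));
transformer.transform();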

Related

Jena read hook not invoked upon duplicate import read

My problem will probably be explained better with code.
Consider the snippet below:
// First read
OntModel m1 = ModelFactory.createOntologyModel();
RDFDataMgr.read(m1,uri0);
m1.loadImports();
// Second read (from the same URI)
OntModel m2 = ModelFactory.createOntologyModel();
RDFDataMgr.read(m2,uri0);
m2.loadImports();
where uri0 points to a valid RDF file describing an ontology model with n imports.
and the following custom ReadHook (which has been set in advance):
@Override
public String beforeRead(Model model, String source, OntDocumentManager odm) {
    System.out.println("BEFORE READ CALLED: " + source);
    return source; // return the source unchanged so the read proceeds normally
}
Global FileManager and OntDocumentManager are used with the following settings:
processImports = true;
caching = true;
If I run the snippet above, the model will be read from uri0 and beforeRead will be invoked exactly n times (once for each import).
However, in the second read, beforeRead won't be invoked even once.
How, and what should I reset in order for Jena to invoke beforeRead in the second read as well?
What I have tried so far:
At first I thought it was due to caching being on, but turning it off or clearing it between the first and second read didn't do anything.
I have also tried removing all ignoredImport records from m1. Nothing changed.
I finally got to solve this. The problem was in ModelFactory.createOntologyModel(). Ultimately, this gets translated to ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_RDFS_INF, null).
All ontology models created with the static OntModelSpec.OWL_MEM_RDFS_INF will have their ImportsModelMaker and some of its other objects shared, which results in shared state. Apparently, this shared state prevented the read hook from being invoked again for the same imports.
This can be prevented by creating a custom, independent and non-static OntModelSpec instance and using it when creating an OntModel, for example:
new OntModelSpec( ModelFactory.createMemModelMaker(), new OntDocumentManager(), RDFSRuleReasonerFactory.theInstance(), ProfileRegistry.OWL_LANG );
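For example, a sketch of the original snippet with the fix applied (uri0 as in the question; note that the ReadHook would now need to be registered on these OntDocumentManager instances rather than the global one):
OntModelSpec spec1 = new OntModelSpec(ModelFactory.createMemModelMaker(), new OntDocumentManager(),
        RDFSRuleReasonerFactory.theInstance(), ProfileRegistry.OWL_LANG);
OntModel m1 = ModelFactory.createOntologyModel(spec1);
RDFDataMgr.read(m1, uri0);
m1.loadImports();   // beforeRead fires for each import

OntModelSpec spec2 = new OntModelSpec(ModelFactory.createMemModelMaker(), new OntDocumentManager(),
        RDFSRuleReasonerFactory.theInstance(), ProfileRegistry.OWL_LANG);
OntModel m2 = ModelFactory.createOntologyModel(spec2);
RDFDataMgr.read(m2, uri0);
m2.loadImports();   // fires again, since no state is shared with m1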

Implementing Jena Dataset provider over MongoDB

I have started to implement a set of classes that provide a direct interface to MongoDB for persistence, similar in spirit to the now-unmaintained SDB persistor implementation for RDBMS.
I am using the time-honored technique of creating the necessary concrete classes from the interfaces and doing a println in each method, therein allowing me to trace the execution. I have gotten all the way to where the engine is calling out to my cursor set up:
public ExtendedIterator<Triple> find(Node s, Node p, Node o) {
    System.out.println("+++ MongoGraph:extenditer:find(" + s + p + o + ")");
    // TBD Need to turn s,p,o into a match expression! Easy!
    MongoCursor cur = this.coll.find().iterator();
    ExtendedIterator<Triple> curs = new JenaMongoCursorIterator(cur);
    return curs;
}
Sadly, when I later call this:
while (rs.hasNext()) {
    QuerySolution soln = rs.nextSolution();
    System.out.println(soln);
}
It turns out rs.hasNext() is always false even though material is present in the MongoCursor (I can debug-print it in the find() method). Also, the trace print in the next() function of my concrete iterator JenaMongoCursorIterator (which extends NiceIterator, which I believe is OK) is never hit. In short, the basic setup seems good, but the engine never cranks the iterator returned by find().
Trying to use SDB as a guide is completely overwhelming for someone not intimately familiar with the software architecture. It's fully factored and filled with interfaces and factories, and although that is excellent, it is difficult to navigate.
Has anyone tried to create their own persistor implementation and if so, what are the basic steps to getting a "hello world" running? Hello World in this case is ANY implementation, non-optimized, that can call next() on something to produce a Triple.
TLDR: It is now working.
I was coding too eagerly, and JenaMongoCursorIterator contained a method hasNexl which of course did not override hasNext (with a t) in the default implementation of NiceIterator, which returns false.
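For reference, a minimal sketch of the corrected iterator shape (assuming current Jena and MongoDB driver packages; toTriple and the stored field names are hypothetical); using @Override would have turned the hasNexl typo into a compile error:
import com.mongodb.client.MongoCursor;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.util.iterator.NiceIterator;
import org.bson.Document;

class JenaMongoCursorIterator extends NiceIterator<Triple> {
    private final MongoCursor<Document> cur;

    JenaMongoCursorIterator(MongoCursor<Document> cur) {
        this.cur = cur;
    }

    @Override
    public boolean hasNext() {          // the method that was accidentally named hasNexl
        return cur.hasNext();
    }

    @Override
    public Triple next() {
        return toTriple(cur.next());    // toTriple is a hypothetical Document-to-Triple converter
    }

    private Triple toTriple(Document doc) {
        // hypothetical mapping from stored document fields to a Jena Triple
        return Triple.create(
                NodeFactory.createURI(doc.getString("s")),
                NodeFactory.createURI(doc.getString("p")),
                NodeFactory.createURI(doc.getString("o")));
    }
}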
This is the sort of problem that Eclipse, visual debugging, and tracing make a lot easier to resolve than regular jdb. jdb is fine if you know the software architecture pretty well, but if you don't, having multiple source files open and being able to mouse over variables and such provides a tremendous boost in the amount of context you can build up to home in on the problem.

How to Get Filename when using file pattern match in google-cloud-dataflow

Does someone know how to get the filename when using a file pattern match in google-cloud-dataflow?
I'm a newbie to Dataflow. How do I get the filename when using a file pattern match, in this way?
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*.txt"))
I'd like to know how I can detect the filenames, e.g. kinglear.txt, Hamlet.txt, etc.
If you would like to simply expand the filepattern and get a list of filenames matching it, you can use GcsIoChannelFactory.match("gs://dataflow-samples/shakespeare/*.txt") (see GcsIoChannelFactory).
If you would like to access the "current filename" from inside one of the DoFns downstream in your pipeline - that is currently not supported (though there are some workarounds - see below). It is a common feature request and we are still thinking how best to fit it into the framework in a natural, generic and high-performance way.
Some workarounds include:
Writing a pipeline like this (the tf-idf example uses this approach):
DoFn readFile = ...(takes a filename, reads the file and produces records)...
p.apply(Create.of(filenames))
.apply(ParDo.of(readFile))
.apply(the rest of your pipeline)
This has the downside that dynamic work rebalancing features won't work particularly well, because they currently apply at the level of Read PTransforms only, but not at the level of ParDos with high fan-out (like the one here, which would read a file and produce all records); and parallelization will only work to the level of files, but files will not be split into sub-ranges. At the scale of reading Shakespeare this is not an issue, but if you are reading a set of files of wildly different sizes, some extremely large, then it may become an issue.
Implementing your own FileBasedSource (javadoc, general documentation) which would return records of type something like Pair<String, T> where the String is the filename and the T is the record you're reading. In this case the framework would handle the filepattern matching for you, dynamic work rebalancing would work just fine, however it is up to you to write the reading logic in your FileBasedReader.
Both of these work-arounds are non-ideal, but depending on your requirements, one of them may do the trick for you.
Update based on latest SDK
Java (sdk 2.9.0):
Beam's TextIO readers do not give access to the filename itself; for these use cases we need to make use of FileIO to match the files and gain access to the information stored in the file name. Unlike TextIO, the reading of the file needs to be taken care of by the user in transforms downstream of the FileIO read. The result of a FileIO read is a PCollection of ReadableFile; the ReadableFile class contains the file name as metadata, which can be used along with the contents of the file.
FileIO's ReadableFile does have a convenience method readFullyAsUTF8String() which will read the entire file into a String object, but this will read the whole file into memory first. If memory is a concern, you can work directly with the file with utility classes like FileSystems (a streaming sketch follows the Java example below).
From: Document Link
PCollection<KV<String, String>> filesAndContents = p
    .apply(FileIO.match().filepattern("hdfs://path/to/*.gz"))
    // withCompression can be omitted - by default compression is detected from the filename.
    .apply(FileIO.readMatches().withCompression(GZIP))
    .apply(MapElements
        // uses static imports from TypeDescriptors
        .into(kvs(strings(), strings()))
        .via((ReadableFile f) -> KV.of(
            f.getMetadata().resourceId().toString(), f.readFullyAsUTF8String())));
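If memory is a concern, a hedged sketch of streaming a matched file line by line instead of calling readFullyAsUTF8String() (this would sit in a DoFn<FileIO.ReadableFile, KV<String, String>> downstream of FileIO.readMatches(); imports and the enclosing DoFn class are omitted):
// Streams the file through its channel rather than loading it fully into memory
@ProcessElement
public void processElement(ProcessContext c) throws IOException {
    FileIO.ReadableFile file = c.element();
    String name = file.getMetadata().resourceId().toString();
    try (BufferedReader reader = new BufferedReader(
            Channels.newReader(file.open(), StandardCharsets.UTF_8.name()))) {
        String line;
        while ((line = reader.readLine()) != null) {
            c.output(KV.of(name, line));
        }
    }
}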
Python (sdk 2.9.0):
For 2.9.0 for Python, you will need to collect the list of URIs from outside of the Dataflow pipeline and feed them in as a parameter to the pipeline, for example making use of FileSystems to read in the list of files via a glob pattern and then passing that to a PCollection for processing.
Once fileio (see PR https://github.com/apache/beam/pull/7791/) is available, the following code will also be an option for Python.
import apache_beam as beam
from apache_beam.io import fileio
with beam.Pipeline() as p:
    readable_files = (p
                      | fileio.MatchFiles('hdfs://path/to/*.txt')
                      | fileio.ReadMatches()
                      | beam.Reshuffle())
    files_and_contents = (readable_files
                          | beam.Map(lambda x: (x.metadata.path,
                                                x.read_utf8())))
One approach is to build a List<PCollection> where each entry corresponds to an input file, then use Flatten. For example, if you want to parse each line of a collection of files into a Foo object, you might do something like this:
public static class FooParserFn extends DoFn<String, Foo> {
    private String fileName;

    public FooParserFn(String fileName) {
        this.fileName = fileName;
    }

    @Override
    public void processElement(ProcessContext processContext) throws Exception {
        String line = processContext.element();
        // here you have access to both the line of text and the name of the file
        // from which it came.
    }
}

public static void main(String[] args) {
    ...
    List<String> inputFiles = ...;
    List<PCollection<Foo>> foosByFile =
        Lists.transform(inputFiles,
            new Function<String, PCollection<Foo>>() {
                @Override
                public PCollection<Foo> apply(String fileName) {
                    return p.apply(TextIO.Read.from(fileName))
                            .apply(ParDo.of(new FooParserFn(fileName)));
                }
            });
    PCollection<Foo> foos = PCollectionList.<Foo>empty(p).and(foosByFile).apply(Flatten.<Foo>pCollections());
    ...
}
One downside of this approach is that, if you have 100 input files, you'll also have 100 nodes in the Cloud Dataflow monitoring console. This makes it hard to tell what's going on. I'd be interested in hearing from the Google Cloud Dataflow people whether this approach is efficient.
I also had the 100 input files = 100 nodes on the Dataflow diagram when using code similar to @danvk's. I switched to an approach like this, which resulted in all the reads being combined into a single block that you can expand to drill down into each file/directory that was read. The job also ran faster using this approach rather than the Lists.transform approach in our use case.
GcsOptions gcsOptions = options.as(GcsOptions.class);
List<GcsPath> paths = gcsOptions.getGcsUtil().expand(GcsPath.fromUri(options.getInputFile()));
List<String> filesToProcess = paths.stream().map(item -> item.toString()).collect(Collectors.toList());

PCollectionList<SomeClass> pcl = PCollectionList.empty(p);
for (String fileName : filesToProcess) {
    pcl = pcl.and(
        p.apply("ReadAvroFile" + fileName, AvroIO.Read.named("ReadFromAvro")
                .from(fileName)
                .withSchema(SomeClass.class)
        )
        .apply(ParDo.of(new MyDoFn(fileName)))
    );
}

// flatten the PCollectionList, combining all the PCollections together
PCollection<SomeClass> flattenedPCollection = pcl.apply(Flatten.pCollections());
This might be a very late post for the above question, but I wanted to add an answer using Beam's bundled classes.
This could also be seen as extracted code from the solution provided by @Reza Rokni.
PCollection<String> listOfFilenames =
    pipe.apply(FileIO.match().filepattern("gs://apache-beam-samples/shakespeare/*"))
        .apply(FileIO.readMatches())
        .apply(
            MapElements.into(TypeDescriptors.strings())
                .via(
                    (FileIO.ReadableFile file) -> {
                        String f = file.getMetadata().resourceId().getFilename();
                        System.out.println(f);
                        return f;
                    }));
pipe.run().waitUntilFinish();
The above PCollection<String> will hold the list of files available under the provided directory.
I was struggling with the same use case while using a wildcard to read files from GCS, but I also needed to modify the collection based on the file name. The key is to use ReadFromTextWithFilename instead of ReadFromText. In Java you already have a way out and you can use:
String filename = context.element().getMetadata().resourceId().getCurrentDirectory().toString();
inside your processElement method.
But for Python the technique below will work:
-> Use beam.io.ReadFromTextWithFilename for reading the wildcard path from GCS.
-> As per the documentation, ReadFromTextWithFilename returns the file's name and the file's content.
Below is the code snippet:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class GetFileNameFromWildcard(beam.DoFn):
    def process(self, element, *args, **kwargs):
        file_path, content = element
        schema = ["id", "name", "mob", "email", "dept", "store"]
        store_name = file_path.split("/")[-2]
        content_list = content.split(",")
        content_list.append(store_name)
        out_dict = dict(zip(schema, content_list))
        print(out_dict)
        yield out_dict


def run():
    pipeline_options = PipelineOptions()
    with beam.Pipeline(options=pipeline_options) as p:
        # saving main session so that it can load global namespace on the Cloud Dataflow Worker
        init = p | 'Begin Pipeline With Initiator' >> beam.Create(
            ["pcollection initializer"]) | 'Read From GCS' >> beam.io.ReadFromTextWithFilename(
            "gs://<bkt-name>/20220826/*/dlp*", skip_header_lines=1) | beam.ParDo(
            GetFileNameFromWildcard()) | beam.io.WriteToText(
            'df_out.csv')

ANTLR Parse tree modification

I'm using ANTLR4 to create a parse tree for my grammar, and what I want to do is modify certain nodes in the tree. This will include removing certain nodes and inserting new ones. The purpose behind this is optimization for the language I am writing. I have yet to find a solution to this problem. What would be the best way to go about this?
While there is currently no real support or tools for tree rewriting, it is very possible to do. It's not even that painful.
The ParseTreeListener or your MyBaseListener can be used with a ParseTreeWalker to walk your parse tree.
From here, you can remove nodes with ParserRuleContext.removeLastChild(); however, when doing this, you have to watch out for ParseTreeWalker.walk:
public void walk(ParseTreeListener listener, ParseTree t) {
    if ( t instanceof ErrorNode) {
        listener.visitErrorNode((ErrorNode)t);
        return;
    }
    else if ( t instanceof TerminalNode) {
        listener.visitTerminal((TerminalNode)t);
        return;
    }
    RuleNode r = (RuleNode)t;
    enterRule(listener, r);
    int n = r.getChildCount();
    for (int i = 0; i<n; i++) {
        walk(listener, r.getChild(i));
    }
    exitRule(listener, r);
}
You must replace removed nodes with something if the walker has already visited the parents of those nodes; I usually pick empty ParserRuleContext objects (this is because of the cached value of n in the method above; a short sketch of the replacement appears after the pseudo code below). This prevents the ParseTreeWalker from throwing an NPE.
When adding nodes, make sure to set the mutable parent on the ParserRuleContext to the new parent. Also, because of the cached n in the method above, a good strategy is to detect where the changes need to be made before the walk reaches the point where you want your changes to go, so the ParseTreeWalker will walk over them in the same pass (otherwise you might need multiple passes...).
Your pseudo code should look like this:
public void enterRewriteTarget(@NotNull MyParser.RewriteTargetContext ctx){
    if(shouldRewrite(ctx)){
        ArrayList<ParseTree> nodesReplaced = replaceNodes(ctx);
        addChildTo(ctx, createNewParentFor(nodesReplaced));
    }
}
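As mentioned above, a small sketch of the "replace with an empty ParserRuleContext" idea (ctx and childIndex are illustrative; this assumes the ANTLR 4 Java runtime, where ParserRuleContext.children and RuleContext.parent are public and mutable):
// Overwrite the child in place so the walker's cached child count stays valid
ParserRuleContext placeholder = new ParserRuleContext();
placeholder.parent = ctx;                  // keep the parent link consistent
ctx.children.set(childIndex, placeholder); // instead of removing the node outright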
I've used this method to write a transpiler that compiled a synchronous internal language into asynchronous javascript. It was pretty painful.
Another approach would be to write a ParseTreeVisitor that converts the tree back to a string. (This can be trivial in some cases, because you are only calling TerminalNode.getText() and concatenating in aggregateResult(..).)
You then add the modifications to this visitor so that the resulting string representation contains the modifications you are trying to achieve.
Then parse the string and you get a parse tree with the desired modifications.
This is certainly hackish in some ways, since you parse the string twice. On the other hand the solution does not rely on antlr implementation details.
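For illustration, a hedged sketch of such a visitor using the generic AbstractParseTreeVisitor (a grammar-specific base visitor would work the same way):
import org.antlr.v4.runtime.tree.AbstractParseTreeVisitor;
import org.antlr.v4.runtime.tree.TerminalNode;

// Rebuilds an (approximate) source string by concatenating terminal text
class TextRebuildingVisitor extends AbstractParseTreeVisitor<String> {
    @Override
    public String visitTerminal(TerminalNode node) {
        return node.getText() + " ";   // naive spacing; a real implementation would be smarter
    }

    @Override
    protected String defaultResult() {
        return "";
    }

    @Override
    protected String aggregateResult(String aggregate, String nextResult) {
        return aggregate + nextResult; // concatenate child results
    }
}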
I needed something similar for simple transformations. I ended up using a ParseTreeWalker and a custom ...BaseListener where I overrode the enter... methods. Inside these methods, ParserRuleContext.children is available and can be manipulated.
class MyListener extends ...BaseListener {
    @Override
    public void enter...(...Context ctx) {
        super.enter...(ctx);
        ctx.children.add(...);
    }
}
new ParseTreeWalker().walk(new MyListener(), parseTree);

Is there anything like "CheckSum" in Dart (on Objects)?

For testing purposes I'm trying to design a way to verify that the results of statistical tests are identical across versions, platforms and such. There are a lot of things that go on that include ints, nums, dates, Strings and more inside our collections of Objects.
In the end I want to 'know' that the whole set of instantiated objects sums to the same value (by just doing something like adding the checkSum of all internal properties).
I can write low level code for each internal value to return a checkSum but I was thinking that perhaps something like this already exists.
Thanks!
_swarmii
This sounds like you should be using the serialization library (install via Pub).
Here's a simple example to get you started:
import 'dart:io';
import 'package:serialization/serialization.dart';

class Address {
  String street;
  int number;
}

main() {
  var address = new Address()
    ..number = 5
    ..street = 'Luumut';
  var serialization = new Serialization()
    ..addRuleFor(address);
  Map output = serialization.write(address, new SimpleJsonFormat());
  print(output);
}
Then depending on what you want to do exactly, I'm sure you can fine tune the code for your purpose.

Resources