How to add Apache Any23 RDF Statements to Apache Jena? - parsing

Basically, I use the Any23 distiller to extract RDF statements from files embedded with RDFa (The actual files where created by DBpedia Spotlight using the xhtml+xml output option). By using Any23 RDFa distiller I can extract the RDF statements (I also tried using Java-RDFa but I could only extract the prefixes!). However, when I try to pass the statements to a Jena model and print the results to the console, nothing happens!
This is the code I am using :
File myFile = new File("T1");
Any23 runner= new Any23();
DocumentSource source = new FileDocumentSource(myFile);
ByteArrayOutputStream outA = new ByteArrayOutputStream();
InputStream decodedInput=new ByteArrayInputStream(outA.toByteArray()); //convert the output stream to input so i can pass it to jena model
TripleHandler writer = new NTriplesWriter(outA);
try {
runner.extract(source, writer);
} finally {
writer.close();
}
String ttl = outA.toString("UTF-8");
System.out.println(ttl);
System.out.println();
System.out.println();
Model model = ModelFactory.createDefaultModel();
model.read(decodedInput, null, "N-TRIPLE");
model.write(System.out, "TURTLE"); // prints nothing!
Can anyone tell me what I have done wrong? Probably multiple things!
Is there any easy way i can extract the subjects of the RDF statements directly from any23 (bypassing Jena)?
As I am quite inexperienced in programming any help would be really appreciated!

You are calling
InputStream decodedInput=new ByteArrayInputStream(outA.toByteArray()) ;
before calling any23 to insert triples. At the point of the call, it's empty.
Move this after the try-catch block.

Related

Saxon - s9api - setParameter as node and access in transformation

we are trying to add parameters to a transformation at the runtime. The only possible way to do so, is to set every single parameter and not a node. We don't know yet how to create a node for the setParameter.
Current setParameter:
QName TEST XdmAtomicValue 24
Expected setParameter:
<TempNode> <local>Value1</local> </TempNode>
We searched and tried to create a XdmNode and XdmItem.
If you want to create an XdmNode by parsing XML, the best way to do it is:
DocumentBuilder db = processor.newDocumentBuilder();
XdmNode node = db.build(new StreamSource(
new StringReader("<doc><elem/></doc>")));
You could also pass a string containing lexical XML as the parameter value, and then convert it to a tree by calling the XPath parse-xml() function.
If you want to construct the XdmNode programmatically, there are a number of options:
DocumentBuilder.newBuildingStreamWriter() gives you an instance of BuildingStreamWriter which extends XmlStreamWriter, and you can create the document by writing events to it using methods such as writeStartElement, writeCharacters, writeEndElement; at the end call getDocumentNode() on the BuildingStreamWriter, which gives you an XdmNode. This has the advantage that XmlStreamWriter is a standard API, though it's not actually a very nice one, because the documentation isn't very good and as a result implementations vary in their behaviour.
Another event-based API is Saxon's Push class; this differs from most push-based event APIs in that rather than having a flat sequence of methods like:
builder.startElement('x');
builder.characters('abc');
builder.endElement();
you have a nested sequence:
Element x = Document.elem('x');
x.text('abc');
x.close();
As mentioned by Martin, there is the "sapling" API: Saplings.doc().withChild(elem(...).withChild(elem(...)) etc. This API is rather radically different from anything you might be familiar with (though it's influenced by the LINQ API for tree construction on .NET) but once you've got used to it, it reads very well. The Sapling API constructs a very light-weight tree in memory (hance the name), and converts it to a fully-fledged XDM tree with a final call of SaplingDocument.toXdmNode().
If you're familiar with DOM, JDOM2, or XOM, you can construct a tree using any of those libraries and then convert it for use by Saxon. That's a bit convoluted and only really intended for applications that are already using a third-party tree model heavily (or for users who love these APIs and prefer them to anything else).
In the Saxon Java s9api, you can construct temporary trees as SaplingNode/SaplingElement/SaplingDocument, see https://www.saxonica.com/html/documentation12/javadoc/net/sf/saxon/sapling/SaplingDocument.html and https://www.saxonica.com/html/documentation12/javadoc/net/sf/saxon/sapling/SaplingElement.html.
To give you a simple example constructing from a Map, as you seem to want to do:
Processor processor = new Processor();
Map<String, String> xsltParameters = new HashMap<>();
xsltParameters.put("foo", "value 1");
xsltParameters.put("bar", "value 2");
SaplingElement saplingElement = new SaplingElement("Test");
for (Map.Entry<String, String> param : xsltParameters.entrySet())
{
saplingElement = saplingElement.withChild(new SaplingElement(param.getKey()).withText(param.getValue()));
}
XdmNode paramNode = saplingElement.toXdmNode(processor);
System.out.println(paramNode);
outputs e.g. <Test><bar>value 2</bar><foo>value 1</foo></Test>.
So the key is to understand that withChild() returns a new SaplingElement.
The code can be compacted using streams e.g.
XdmNode paramNode2 = Saplings.elem("root").withChild(
xsltParameters
.entrySet()
.stream()
.map(p -> Saplings.elem(p.getKey()).withText(p.getValue()))
.collect(Collectors.toList())
.toArray(SaplingElement[]::new))
.toXdmNode(processor);
System.out.println(paramNode2);

Jena read hook not invoked upon duplicate import read

My problem will probably be explained better with code.
Consider the snippet below:
// First read
OntModel m1 = ModelFactory.createOntologyModel();
RDFDataMgr.read(m1,uri0);
m1.loadImports();
// Second read (from the same URI)
OntModel m2 = ModelFactory.createOntologyModel();
RDFDataMgr.read(m2,uri0);
m2.loadImports();
where uri0 points to a valid RDF file describing an ontology model with n imports.
and the following custom ReadHook (which has been set in advance):
#Override
public String beforeRead(Model model, String source, OntDocumentManager odm) {
System.out.println("BEFORE READ CALLED: " + source);
}
Global FileManager and OntDocumentManager are used with the following settings:
processImports = true;
caching = true;
If I run the snippet above, the model will be read from uri0 and beforeRead will be invoked exactly n times (once for each import).
However, in the second read, beforeRead won't be invoked even once.
How, and what should I reset in order for Jena to invoke beforeRead in the second read as well?
What I have tried so far:
At first I thought it was due to caching being on, but turning it off or clearing it between the first and second read didn't do anything.
I have also tried removing all ignoredImport records from m1. Nothing changed.
Finally got to solve this. The problem was in ModelFactory.createOntologyModel(). Ultimately, this gets translated to ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_RDFS_INF,null).
All ontology models created with the static OntModelSpec.OWL_MEM_RDFS_INF will have their ImportsModelMaker and some of its other objects shared, which results in a shared state. Apparently, this state has blocked the reading hook to be invoked twice for the same imports.
This can be prevented by creating a custom, independent and non-static OntModelSpec instance and using it when creating an OntModel, for example:
new OntModelSpec( ModelFactory.createMemModelMaker(), new OntDocumentManager(), RDFSRuleReasonerFactory.theInstance(), ProfileRegistry.OWL_LANG );

How can I push data written by a Serilog File Sink on the disk to another Serilog sink?

We are trying to load test our infrastructure of logstash/elastic. Since the actual logs are generated by a software that uses hardware, we are unable to simulate it at scale.
I am wondering if we can store the logs using file sink and later write a program that reads the log files and send data through the actual sink. Since, we are trying different setup, it would be great if we can swap different sinks for testing. Say http sink and elastic sink.
I thought of reading the json file one line at a time and then invoking Write method on the Logger. However I am not sure how to get the properties array from the json. Also, it would be great to hear if there are better alternatives in Serilog world for my needs.
Example parsing
var events= File.ReadAllLines(#"C:\20210520.json")
.Select(line => JsonConvert.DeserializeObject<dynamic>(line));
foreach (var o in objects)
{
DateTime timeStamp = o.Timestamp;
LogEventLevel level = o.Level;
string messageTemplate = o.MessageTemplate;
string exception = o.Exception;
var properties = (o.Properties as JObject);
List<object> parameters = new List<object>();
foreach (var property in properties)
{
if(messageTemplate.Contains(property.Key))
parameters.Add(property.Value.ToString());
}
logInstance.Write(level, messageTemplate, parameters.ToArray());
count++;
}
Example Json Event written to the file
{"Timestamp":"2021-05-20T13:15:49.5565372+10:00","Level":"Information","MessageTemplate":"Text dialog with {Title} and {Message} displayed, user selected {Selected}","Properties":{"Title":"Unload Device from Test","Message":"Please unload the tested device from test jig","Selected":"Methods.Option","SinkRepository":null,"SourceRepository":null,"TX":"TX2937-002 ","Host":"Host1","Session":"Host1-2021.05.20 13.12.44","Seq":87321,"ThreadId":3}}
UPDATE
Though this works for simple events,
it is not able to handle Context properties (there is a work around though using ForContext),
also it forces all the properties to be of type string and
not to mention that destucturing (#property) is not handled properly
If you can change the JSON format to Serilog.Formatting.Compact's CLEF format, then you can use Serilog.Formatting.Compact.Reader for this.
In the source app:
// dotnet add package Serilog.Formatting.Compact
Log.Logger = new LoggerConfiguration()
.WriteTo.File(new CompactJsonFormatter(), "./logs/myapp.clef")
.CreateLogger();
In the load tester:
// dotnet add package Serilog.Formatting.Compact.Reader
using (var target = new LoggerConfiguration()
.MinimumLevel.Verbose()
.WriteTo.Console()
.CreateLogger())
{
using (var file = File.OpenText("./logs/myapp.clef"))
{
var reader = new LogEventReader(file);
while (reader.TryRead(out var evt))
target.Write(evt);
}
}
Be aware though that load testing results won't be accurate for many sinks if you use repeated timestamps. You should consider re-mapping the events you read in to use current timestamps.
E.g. once you've loaded up evt:
var current = new LogEvent(DateTimeOffset.Now,
evt.Level,
evt.Exception,
evt.MessageTemplate,
evt.Properties);
target.Write(current);

How to read a file in src/main/resources in grails

This has been asked before, and I have tried each proposed solution, but all fail.
I have put a javascript file (hl.js) in myapp/src/main/resources
I have tried to read it with the following code taken from the "solutions":
1 - getRsourcesAsStream. returns null inputstream.
InputStream is = this.class.classLoader.getResourceAsStream("hl.js")
2 - getResource - returns null
File myFile = grailsApplication.mainContext.getResource("hl.js").file
3 - getResourceAsStream with classloader - returns null.
ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
InputStream is = classLoader.getResourceAsStream("hl.js");
Interestingly, if I do the following:
String fileNameAndPath = this.class.classLoader.getResource("hl.js").getFile()
System.out.println(fileNameAndPath);
File file = new File(fileNameAndPath)
InputStream is = file.newInputStream();
This prints out:
/Users/me/dev/grails_projects/myapp/src/main/resources/hl.js
But "is" is always null.
I an trying to get an input stream so I can evaluate the javascript via nashorn:
ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
engine.eval(is)
Grails 3.3.8
Any ideas?
Get the resource and open a stream on it.
def resource = this.class.classLoader.getResource('conf.json')
def path = resource.file // absolute file path
return resource.openStream() // input stream for the file
Source: https://www.damirscorner.com/blog/posts/20160313-AccessingApplicationFilesFromCodeInGrails.html
Well, I dont know why the solutions 1, 2 and 3 do not work, but I found a more long winded way which does work. The main issue is that there are lots of different implementations of eval(), and netbeans "go to declaration" has never worked (presumably some configuration issue in netbeans).
It turns out that the eval() version i happen to be using is expecting a Reader, where as the default documentation shows it needs in InputStream. Also, reader is not the same as InputStreamReader.
This is the solution I found:
import javax.script.ScriptEngine
import javax.script.ScriptEngineManager
import org.grails.core.io.ResourceLocator
ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
String fileNameAndPath = this.class.classLoader.getResource("hl.js").getFile()
System.out.println(fileNameAndPath);
File file = new File(fileNameAndPath)
System.out.println("exists: " + file.exists())
Reader reader = file.newReader();
engine.eval(reader)

SideInputs corrupt the data in DataFlow's Pipeline

I have a Dataflow pipeline (SDK 2.1.0, Apache Beam 2.2.0) which simply reads RDF (in N-Triples, so it's just text files) from GCS, transforms it somehow and writes it back to GCS, but in a different bucket. In this pipeline I employ side inputs which are three single files (one file per side input) and use them in a ParDo.
To work with RDF in Java I use Apache Jena, so each file is read into an instance of Model class. Since Dataflow doesn't have Coder for it, I developed it myself (RDFModelCoder, see below). It works fine in number of other pipelines I created.
The problem with this particular pipeline is when I add the side inputs, the execution fails with an exception indicating a corruption of the data, i.e. some garbage is added. Once I remove the side inputs, the pipeline finishes execution successfully.
The exception (it's thrown from RDFModelCoder, see below):
Caused by: org.apache.jena.atlas.RuntimeIOException: java.nio.charset.MalformedInputException: Input length = 1
at org.apache.jena.atlas.io.IO.exception(IO.java:233)
at org.apache.jena.atlas.io.CharStreamBuffered$SourceReader.fill(CharStreamBuffered.java:77)
at org.apache.jena.atlas.io.CharStreamBuffered.fillArray(CharStreamBuffered.java:154)
at org.apache.jena.atlas.io.CharStreamBuffered.advance(CharStreamBuffered.java:137)
at org.apache.jena.atlas.io.PeekReader.advanceAndSet(PeekReader.java:235)
at org.apache.jena.atlas.io.PeekReader.init(PeekReader.java:229)
at org.apache.jena.atlas.io.PeekReader.peekChar(PeekReader.java:151)
at org.apache.jena.atlas.io.PeekReader.makeUTF8(PeekReader.java:92)
at org.apache.jena.riot.tokens.TokenizerFactory.makeTokenizerUTF8(TokenizerFactory.java:48)
at org.apache.jena.riot.lang.RiotParsers.createParser(RiotParsers.java:57)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:198)
at org.apache.jena.riot.RDFParser.read(RDFParser.java:298)
at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:288)
at org.apache.jena.riot.RDFParser.parse(RDFParser.java:237)
at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:417)
at org.apache.jena.riot.RDFDataMgr.parseFromInputStream(RDFDataMgr.java:870)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:268)
at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:254)
at org.apache.jena.riot.adapters.RDFReaderRIOT.read(RDFReaderRIOT.java:69)
at org.apache.jena.rdf.model.impl.ModelCom.read(ModelCom.java:305)
And here you can see the garbage (at the end):
<http://example.com/typeofrepresentative/08> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#NamedIndividual> . ������** �����I��.�������������u�������
The pipeline:
val one = p.apply(TextIO.read().from(config.getString("source.one")))
.apply(Combine.globally(SingleValue()))
.apply(ParDo.of(ConvertToRDFModel(RDFLanguages.NTRIPLES)))
val two = p.apply(TextIO.read().from(config.getString("source.two")))
.apply(Combine.globally(SingleValue()))
.apply(ParDo.of(ConvertToRDFModel(RDFLanguages.NTRIPLES)))
val three = p.apply(TextIO.read().from(config.getString("source.three")))
.apply(Combine.globally(SingleValue()))
.apply(ParDo.of(ConvertToRDFModel(RDFLanguages.NTRIPLES)))
val sideInput = PCollectionList.of(one).and(two).and(three)
.apply(Flatten.pCollections())
.apply(View.asList())
p.apply(RDFIO.Read
.from(options.getSource())
.withSuffix(RDFLanguages.strLangNTriples))
.apply(ParDo.of(SparqlConstructETL(config, sideInput))
.withSideInputs(sideInput))
.apply(RDFIO.Write
.to(options.getDestination())
.withSuffix(RDFLanguages.NTRIPLES))
And just to provide the whole picture here are implementations of SingleValue and ConvertToRDFModel ParDos:
class SingleValue : SerializableFunction<Iterable<String>, String> {
override fun apply(input: Iterable<String>?): String {
if (input != null) {
return input.joinToString(separator = " ")
}
return ""
}
}
class ConvertToRDFModel(outputLang: Lang) : DoFn<String, Model>() {
private val lang: String = outputLang.name
#ProcessElement
fun processElement(c: ProcessContext?) {
if (c != null) {
val model = ModelFactory.createDefaultModel()
model.read(StringReader(c.element()), null, lang)
c.output(model)
}
}
}
The implementation of RDFModelCoder:
class RDFModelCoder(private val decodeLang: String = RDFLanguages.strLangNTriples,
private val encodeLang: String = RDFLanguages.strLangNTriples)
: AtomicCoder<Model>() {
private val LOG = LoggerFactory.getLogger(RDFModelCoder::class.java)
override fun decode(inStream: InputStream): Model {
val bytes = StreamUtils.getBytes(inStream)
val model = ModelFactory.createDefaultModel()
model.read(ByteArrayInputStream(bytes), null, decodeLang) // the exception is thrown from here
return model
}
override fun encode(value: Model, outStream: OutputStream?) {
value.write(outStream, encodeLang, null)
}
}
I checked the side input files multiple times, they're fine, they have UTF-8 encoding.
Most likely the error is in the implementation of RDFModelCoder. When implementing encode/decode one has to remember that the provided InputStream and OutputStream are not exclusively owned by the current instance being encoded/decoded. E.g. there might be more data in the InputStream after the encoded form of your current Model. When using StreamUtils.getBytes(inStream) you are grabbing both data of the current encoded Model and anything else that was in the stream.
Generally when writing a new Coder it's a good idea to only combine existing Coder's rather than hand-parsing the stream: that is less error-prone. I would suggest to convert the model to/from byte[] and use ByteArrayCoder.of() to encode/decode it.
Apache Jena provides the Elephas IO modules which have Hadoop IO support, since Beam supports Hadoop InputFormat IO you should be able to use that to read in your NTriples file.
This will likely be far more efficient since the NTriples support in Elephas is able to parallelise the IO and avoid caching the entire model into memory (in fact it won't use Model at all):
Configuration myHadoopConfiguration = new Configuration(false);
// Set Hadoop InputFormat, key and value class in configuration
myHadoopConfiguration.setClass("mapreduce.job.inputformat.class",
NTriplesInputFormat.class, InputFormat.class);
myHadoopConfiguration.setClass("key.class", LongWritable.class, Object.class);
myHadoopConfiguration.setClass("value.class", TripleWritable.class, Object.class);
// Set any other Hadoop config you might need
// Read data only with Hadoop configuration.
p.apply("read",
HadoopInputFormatIO.<LongWritable, TripleWritable>read()
.withConfiguration(myHadoopConfiguration);
Of course this may require you to refactor your overall pipeline somewhat.

Resources