Graph Serialization (Blank Node labels) - jena

I'm attempting to do some graph normalization following the URDNA2015 algorithm.
If I'm understanding the spec, blank nodes should have labels like _:c14nX, where X is an incrementing counter.
I can produce a graph whose blank nodes have these labels, but when serializing the graph to N-Triples the labels run through NodeFmtLib#encodeBNodeLabel, which performs some encoding -- at the very least it always prefixes the resulting label with 'B'. For example, c14n92 -> Bc14n92, or _:c14n92 -> BX5FX3Ac14n92 due to hex encoding.
My serialization code is very basic currently:
StringWriter sw = new StringWriter();
RDFDataMgr.write(sw, normalizedGraph, Lang.NTRIPLES);
What is the suggested way of having finer control over this serialization?
EDIT:
One approach I found that works, but I'm not sure if it is the recommended way:
RDFWriterRegistry.register(RDFFormat.NTRIPLES_UTF8, new CustomWriterGraphRIOTFactory());
Then implement a chain of classes that override:
WriterGraphRIOTFactory
NTriplesWriter
StreamRDFLib
WriterStreamRDFPlain
NodeFormatter
to ultimately get to a place of overriding formatBNode:
public class CustomNodeFormatter extends NodeFormatterNT {

    public CustomNodeFormatter(CharSpace charSpace) {
        super(charSpace);
    }

    @Override
    public void formatBNode(AWriter w, String label) {
        w.print(label);
    }
}

The Jena writers work on graphs, and a graph is a set of triples - unordered. Because triples can be deleted and re-added, order isn't easy to preserve even in a single-threaded program, since changes to the graph may reorder its hash tables.
If you are doing this from JSON-LD - Jena currently uses jsonld-java - check whether the JSON-LD parsing is in a consistent order and labelling.
If you want to respect the order of the syntax in other formats, look at parsing the file to a StreamRDF object (the parser output stream) together with a custom FactoryRDF, which controls the label used for each blank node - you could, for example, label them 1, 2, 3 at this point.
RDFParser.create().source(...).factory(FactoryRDF).parse(StreamRDF);
Note that when doing this at output time, without control of the input, the order of the output may change from run to run because blank nodes get different internal ids on each parser run.

Related

Can you generate a scene graph after a plant has been finalized?

I'm working on a project that requires me to add a model through Parser (which requires the plant to be of the same type as the array used) before setting the position of the model in said plant and taking distance queries. These queries only work when the query object generated from the scene graph is of type float.
I've run into a problem where setting the position doesn't work because the array in use is of type AutoDiff. A possible solution would be to convert the float plant to AutoDiff with plant.ToAutoDiff(), but this only creates a copy of the plant without coupling it to the scene graph (and in turn the query object) from which the queries are derived. Taking queries with a query object generated from the original plant would then fail to reflect the new position passed to the AutoDiff copy.
Is there a way to create a new scene graph from the already finalized symbolic copy of the original plant, so that I can perform the queries with it?
A couple of thoughts:
Don't just convert the plant to autodiff. Convert the whole diagram. That will give you a converted, connected network.
You're stuck with the current workflow. Presumably, your proximity geometries are specified in your parsed file (as <collision> tags). The parsing process is ephemeral. The declaration is consumed, passed through MultibodyPlant into SceneGraph. If there is no SceneGraph at parse time, all knowledge of the declared collision geometry is forgotten.
So, the typical workflow is:
Create a float-valued diagram.
Scalar convert it to an AutoDiff-valued diagram.
Keep both around to serve the different roles.
We don't have a tutorial that directly shows scalar converting an entire diagram, but it's akin to what is shown in this MultibodyPlant-specific tutorial. Just call ToScalarType() on the Diagram root.
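For what it's worth, here is a minimal pydrake sketch of that workflow, assuming a recent Drake (the model file name is a placeholder, and ToAutoDiffXd() is the AutoDiff-specific form of the scalar conversion mentioned above):
from pydrake.multibody.parsing import Parser
from pydrake.multibody.plant import AddMultibodyPlantSceneGraph
from pydrake.systems.framework import DiagramBuilder

builder = DiagramBuilder()
plant, scene_graph = AddMultibodyPlantSceneGraph(builder, time_step=0.0)
Parser(plant).AddModels("my_robot.urdf")  # hypothetical model file
plant.Finalize()
diagram = builder.Build()                 # float-valued diagram

diagram_ad = diagram.ToAutoDiffXd()       # scalar-convert the entire diagram
plant_ad = diagram_ad.GetSubsystemByName(plant.get_name())
scene_graph_ad = diagram_ad.GetSubsystemByName(scene_graph.get_name())
Keep both diagrams around: the float one for parsing and visualization, the AutoDiff one (via plant_ad and scene_graph_ad) for the gradient-taking distance queries.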

Gensim doc2vec produce more vectors than given documents, when I pass unique integer id as tags

I'm trying to make document vectors from the gensim example corpus using doc2vec.
I passed TaggedDocument which contains 9 docs and 9 tags.
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
idx = [0,1,2,3,4,5,6,7,100]
documents = [TaggedDocument(doc, [i]) for doc, i in zip(common_texts, idx)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
and it produces 101 doc-vectors rather than the 9 I expected.
What I want to know is:
How can I be sure that the tag I passed is attached to the right vector?
How did the vectors for the tags I didn't pass (8-99 in my case) come about? Were they computed as blanks?
If you use plain ints as your document-tags, then the Doc2Vec model will allocate enough doc-vectors for every int up to the highest int you provide - even if you don't use some of those ints.
This assumption, that all ints up to the highest declared are used, allows the code to avoid creating a redundant {tag -> slot} dictionary, saving a little memory. That specific potential savings is the main reason for supporting plain ints (rather than unique strings) as tag names.
Any such doc-vectors that are allocated but never subject to any training will be randomly initialized the same as the others - but never adjusted by training.
If you want to use plain int tag names, you should either be comfortable with this over-allocation, or make sure you use only contiguous int IDs from 0 to your max ID, with none unused. But unless your training data is very large, using unique string tags, and allowing the {tag -> slot} dictionary to be created, is straightforward and not too expensive in memory.
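For example, a minimal sketch of the string-tag approach on the same toy corpus (the doc-vectors live in model.dv in gensim 4.x; in older releases the attribute is model.docvecs):
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Unique string tags instead of plain ints, so a {tag -> slot} mapping is built.
tags = ['doc_%d' % i for i in [0, 1, 2, 3, 4, 5, 6, 7, 100]]
documents = [TaggedDocument(doc, [t]) for doc, t in zip(common_texts, tags)]
# (min_count=1 kept only to mirror the question's setup)
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

print(len(model.dv))        # 9 doc-vectors, one per document
print(model.dv['doc_100'])  # the vector trained for that exact tag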
(Separately: min_count=1 is almost always a bad idea in these algorithms, as discarding rare tokens tends to give better results than letting their thin example usages interfere with other training.)

Can Google dataflow GroupByKey handle hot keys?

Input is PCollection<KV<String,String>>
I have to write files by key, with each value of the KV group written as a line.
In order to group based on Key, I have 2 options :
1. GroupByKey --> PCollection<KV<String, Iterable<String>>>
2. Combine.perKey.withHotKeyFanout --> PCollection
where value String is accumulated Strings from all pairs.
(Combine.CombineFn<String, List<String>, CustomStringObJ>)
I can have a million records per key. The collection of keyed data is optimised using windows and triggers, but there can still be thousands of entries per key.
I worry that the maximum size of a String will cause issues if Combine.perKey.withHotKeyFanout is used to create a CustomStringObJ, which has a List<String> member, to be written to the file.
If we use GroupByKey, how do we handle hot keys?
You should use the GroupByKey approach rather than using Combine to concatenate a large string. The actual implementation (not unique to Dataflow) is that elements are shuffled according to their key, and in the output KV<K, Iterable<V>> the iterable of values is a lazy/streamed view over the elements shuffled to that key. No actual iterable is constructed - this is just as good as routing each element to the worker that owns each file and writing it directly.
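For illustration, a minimal Python-SDK sketch of that shape (the in-memory input and the local-file sink are placeholders; the same pattern applies to the Java KV<String, String> case):
import apache_beam as beam

def write_group(element):
    key, lines = element  # `lines` is a lazy iterable, not a materialized list
    # Placeholder sink: one local file per key; a real pipeline would write to
    # GCS or another distributed filesystem instead.
    with open('%s.txt' % key, 'w') as f:
        for line in lines:  # stream over the values; avoid list(lines)
            f.write(line + '\n')

with beam.Pipeline() as p:
    _ = (p
         | 'read' >> beam.Create([('a', 'x1'), ('a', 'x2'), ('b', 'y1')])  # stand-in input
         | 'group' >> beam.GroupByKey()  # -> (key, Iterable[str])
         | 'write' >> beam.Map(write_group))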
Your use of windows and triggers might actually force buffering and make this less efficient. You should only use event time windowing if it is part of your business case; it isn't a mechanism for controlling performance. Triggers are good for managing how data is batched up and sent downstream, but most useful for aggregations where triggering less frequently saves a lot of data volume. For a raw grouping of the elements, triggers tend to be less useful.

General principle to implement node-based workflow as seen in Unreal, Blender, Alteryx and the like?

This topic is difficult to Google, because of "node" (not node.js), and "graph" (no, I'm not trying to make charts).
Despite being a pretty well rounded and experienced developer, I can't piece together a mental model of how these sorts of editors get data in a sensible way, in a sensible order, from node to node. Especially in the Alteryx example, because a Sort module, for example, needs its entire upstream dataset before proceeding. And some nodes can send a single output to multiple downstream consumers.
I was able to understand trees and what not in my old data structures course back in the day, and successfully understand and adapt the basic graph concepts from https://www.python.org/doc/essays/graphs/ in a real project. But that was a static structure and data weren't being passed from node to node.
Where should I be starting, and/or what concept am I missing that I could use to implement something like this? Something to let users chain together some boxes to slice and dice text files or data records with basic operations like sort and join? I'm using C#, but the answer ought to be language-independent.
This paradigm is called dataflow programming; it works with streams of data that are passed from instruction to instruction to be processed.
Dataflow programs can be written in textual or visual form, and besides the software you have mentioned there are many programs that include some sort of dataflow language.
To create your own dataflow language you have to:
Create program modules or objects that represent your processing nodes and realize different kinds of data processing. Processing nodes usually have one or more data inputs and one or more data outputs, and implement some data-processing algorithm internally. Nodes may also have control inputs that determine how a given node processes its data. A typical dataflow algorithm calculates an output sample from one or many input stream values, as FIR filters do; however, a processing algorithm can also feed output values back into its inputs, as IIR filters do, or accumulate values in some way to calculate its output.
Create a standard API for passing data between processing nodes. It can differ for different kinds of data and control signals, but it must be standard, because processing nodes have to 'understand' each other. Data is usually passed as plain values. Control signals can be plain values, events, or a more advanced control language - depending on your needs.
Create the machinery to link your nodes and to pass data between them. You can build your own mechanism or use standard facilities such as pipes, message queues, etc. For example, this functionality can be implemented as a tree-like structure whose nodes are your processing nodes, each holding references to its downstream nodes and to the inputs that receive data from the current node's output.
Create some kind of node iterator that starts at the beginning of the dataflow graph and visits each processing node, where it:
provides next data input values
invokes node data processing methods
updates data output value
passes updated data output values to the inputs of downstream processing nodes
Create a tool for configuring node parameters and the links between them. It can be a simple text file edited with a text editor, or a sophisticated visual editor with a GUI for drawing the dataflow graph.
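A minimal sketch of these steps in Python (the names Node and run are illustrative, not from any particular framework): each node holds its processing function and references to its upstream nodes, the 'API' for passing data is plain Python values, and the iterator visits nodes in dependency order.
from graphlib import TopologicalSorter  # Python 3.9+

class Node:
    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func            # the node's processing algorithm
        self.inputs = list(inputs)  # upstream Node objects

def run(nodes):
    # The 'nodes iterator': evaluate each node after all of its inputs.
    order = TopologicalSorter({n: n.inputs for n in nodes}).static_order()
    results = {}
    for node in order:
        in_values = [results[up] for up in node.inputs]  # pull upstream outputs
        results[node] = node.func(*in_values)            # invoke node, store its output
    return results

# Chain a few boxes: two loaders, a sort, and a join.
load_a = Node('load_a', lambda: ['banana', 'apple'])
load_b = Node('load_b', lambda: ['cherry'])
sort_a = Node('sort', lambda rows: sorted(rows), inputs=[load_a])
join = Node('join', lambda a, b: a + b, inputs=[sort_a, load_b])

print(run([load_a, load_b, sort_a, join])[join])  # ['apple', 'banana', 'cherry']
In this model a Sort node simply consumes its whole upstream input before producing output, and a node whose output feeds several downstream consumers just appears in several inputs lists.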
Regarding your note about Sort module in Alteryx - perhaps data values are just accumulated inside this module and then sorted.
Here you can find an even more detailed description of dataflow programming languages.

Beam/Dataflow design pattern to enrich documents based on database queries

I'm evaluating Dataflow, and am trying to figure out if/how to do the following.
My apologies if anything below is trivial -- we're trying to wrap our heads around Dataflow before we make a decision on using Beam, or something else like Spark, etc.
General use case is for machine learning:
Ingesting documents which are individually processed.
In addition to easy-to-write transforms, we'd like to enrich each document based on queries against databases (that are largely key-value stores).
A simple example would be a gazetteer: decompose the text into ngrams, then check whether those ngrams reside in some database, and record (within a transformed version of the original doc) the entity identifiers the given phrases map to.
How to do this efficiently?
NAIVE (although possibly tricky with the serialization requirement?):
Each document could simply query the database individually (similar to Querying a relational database through Google DataFlow Transformer), but, given that most of these are simple key-value stores, it seems like there should be a more efficient way to do this (given the real problems with database query latency).
SCENARIO #1: Improved?:
Current strawman is to store the tables in Bigquery, pull them down (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py), and then use them as side inputs, that are used as key-value lookups within the per-doc function(s).
Key-value tables range from generally very small to not-huge (100s of MBs, maybe low GBs). Multiple CoGroupByKey with same key apache beam ("Side inputs can be arbitrarily large - there is no limit; we have seen pipelines successfully run using side inputs of 1+TB in size") suggests this is reasonable, at least from a size POV.
1) Does this make sense? Is this the "correct" design pattern for this scenario?
2) If this is a good design pattern...how do I actually implement this?
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L53 shows feeding the result to the document function as an AsList.
i) Presumably, AsDict is more appropriate here for the above use case? So I'd probably need to run some transformations first on the BigQuery output to separate it into (key, value) tuples, make sure that the keys are unique, and then use it as a side input.
ii) Then I need to use the side input in the function.
What I'm not clear on:
for both of these, how to manipulate the output coming off the BigQuery pull is murky to me. How would I accomplish (i) (assuming it is necessary)? Meaning, what does the data format look like (raw bytes? strings? is there a good example I can look at?)
Similarly, if AsDict is the correct way to pass it into the func, can I just reference it the way a dict is normally used in Python, e.g. side_input.get('blah')?
SCENARIO #2: Even more improved? (for specific cases):
The above scenario--if achievable--definitely does seem superior to continuous remote calls (given the simple key-value lookup), and would be very helpful for some of our scenarios. But if I take a scenario like the gazetteer lookup above...is there an even more optimized solution?
Something like, for every doc, writing out all the ngrams as keys, with values as the underlying indices (docid + indices within the doc), and then doing some sort of join between these ngrams and the phrases in our gazetteer...and then doing another set of transforms to recover the original docs (now with their new annotations).
I.e., let Beam handle all of the joins/lookups directly?
Theoretical advantage is that Beam may be a lot quicker in doing this than, for each doc, looping over all of the ngrams and doing a check if the ngram is in the side_input.
Other key issues:
3) If this is a good way to do things, is there any trick to making this work well in the streaming scenario? Text elsewhere suggests that the side input caching works more poorly outside the batch scenario. Right now, we're focused on batch, but streaming will become relevant in serving live predictions.
4) Any Beam-related reason to prefer Java>Python for any of the above? We've got a good amount of existing Python code to move to Dataflow, so would heavily prefer Python...but not sure if there are any hidden issues with Python in the above (e.g., I've noticed Python doesn't support certain features or I/O).
EDIT: Strawman? for the example ngram lookup scenario (should generalize strongly to general K:V lookup)
Phrases = get from bigquery
Docs (indexed by docid) (direct input from text or protobufs, e.g.)
Transform: phrases -> (phrase, entity) tuples
Transform: docs -> ngrams (phrase, docid, coordinates [in document])
CoGroupByKey key=phrase: (phrase, entity, docid, coords)
CoGroupByKey key=docid, group((phrase, entity, docid, coords), Docs)
Then we can iteratively finalize each doc, using the set of (phrase, entity, docid, coords) and each Doc
Regarding the scenarios for your pipeline:
Naive scenario
You are right that per-element querying of a database is undesirable.
If your key-value store is able to support low-latency lookups by reusing an open connection, you can define a global connection that is initialized once per worker instead of once per bundle. This should be acceptable if your k-v store supports efficient lookups over existing connections.
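As a sketch of that pattern in the Python SDK (make_kv_client and enrich_doc are placeholders for your own store client and enrichment logic):
import apache_beam as beam

class EnrichWithLookup(beam.DoFn):
    def setup(self):
        # Called once per DoFn instance on a worker, not once per element or bundle.
        self._client = make_kv_client()       # hypothetical connection factory

    def process(self, doc):
        value = self._client.get(doc['key'])  # hypothetical low-latency lookup
        yield enrich_doc(doc, value)

    def teardown(self):
        self._client.close()

# enriched_docs = documents_pcoll | beam.ParDo(EnrichWithLookup())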
Improved scenario
If that's not feasible, then BQ is a great way to keep and pull in your data.
You can definitely use AsDict side inputs, and simply go side_input[my_key] or side_input.get(my_key).
Your pipeline could look something like so:
# ReadDocuments and enrich_doc are placeholders for your own transforms.
import apache_beam as beam
from apache_beam.pvalue import AsDict

kv_query = "SELECT key, value FROM my:table.name"

p = beam.Pipeline()

documents_pcoll = p | ReadDocuments()

additional_data_pcoll = (p
    | beam.io.Read(beam.io.BigQuerySource(query=kv_query))
    # Make each row a key-value tuple.
    | 'format bq' >> beam.Map(lambda row: (row['key'], row['value'])))

enriched_docs = (documents_pcoll
    | 'join' >> beam.Map(lambda doc, kv: enrich_doc(doc, kv[doc['key']]),
                         kv=AsDict(additional_data_pcoll)))
Unfortunately, this has one shortcoming, and that's the fact that Python does not currently support arbitrarily large side inputs (it currently loads all of the K-V into a single Python dictionary). If your side-input data is large, then you'll want to avoid this option.
Note: this will change in the future, but we can't be sure when at the moment.
Further Improved
Another way of joining two datasets is to use CoGroupByKey. The loading of documents, and of K-V additional data should not change, but when joining, you'd do something like so:
# Turn the documents into key-value tuples as well.
documents_kv_pcoll = (documents_pcoll
    | 'format docs' >> beam.Map(lambda doc: (doc['key'], doc)))

enriched_docs = ({'docs': documents_kv_pcoll, 'additional_data': additional_data_pcoll}
    | beam.CoGroupByKey()
    # Each grouped element is (key, {'docs': [...], 'additional_data': [...]}).
    | 'enrich' >> beam.Map(lambda kv: enrich_doc(kv[1]['docs'][0],
                                                 kv[1]['additional_data'][0])))
CoGroupByKey will allow you to use arbitrarily large collections on either side.
Answering your questions
You can see an example of using BigQuery as a side input in the cookbook. As you can see there, the data comes parsed (I believe that it comes in their original data types, but it may come in string/unicode). Check the docs (or feel free to ask) if you need to know more.
Currently, Python streaming is in alpha, and it does not support side inputs; but it does support shuffle features such as CoGroupByKey. Your pipeline using CoGroupByKey should work well in streaming.
A reason to prefer Java over Python is that all these features work in Java (unlimited-size side inputs, streaming side inputs). But it seems that for your use case, Python may have all you need.
Note: The code snippets are approximate, but you should be able to debug them using the DirectRunner.
Feel free to ask for clarification, or to ask about other aspects if you feel like it'd help.
