Schema-awareness and streaming - saxon

I am in the process of converting a schema-aware stylesheet into streaming XSLT 3.0. One thing I'm pleasantly surprised about is that streaming and schema-awareness appear to work together. I can do "instance of" checks on grounded nodes (even in template match rules), for example, and get the correct result. What is less clear is how this happens. Is the document run through multiple passes? Also, I notice that I can do id() and idref() operations on a captured accumulator (a very good thing), but it is unclear why (I'm not copying a document node).

The "instance of" test is provisional: it's testing the type annotation of the node on the assumption that it will turn out to be valid. If it doesn't turn out to be valid, your code will be aborted in due course, because validation errors are fatal. This is why try/catch on validation can only catch the error at the level of the validation episode as a whole, it can't recover at a finer-grained level.
I'm not sure off-hand about the id() and idref() tests, but I suspect it's because a captured accumulator is effectively taking a snapshot() of the streamed node, and a snapshot is rooted at a document node. Presumably idref() will only work if the reference is within the snapshot.

Related

Apache Beam: read from UnboundedSource with fixed windows

I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream -- one that only generates a certain number of items and then stops emitting new items, but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline.apply(Read.from(new SomeUnboundedSource(...))) // extends UnboundedSource
    .apply(Window.into(FixedWindows.of(Duration.millis(5000))))
    .apply(new SomeTransform())
    .apply(Combine.globally(new SomeCombineFn()).withoutDefaults());
And I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5-second window, then a new set of data is polled and therefore generated. Instead, all N events are generated first, and only after that is SomeTransform applied to the data (but the windowing works as expected). Is it supposed to work like this? Does Beam and/or the runner (I'm using the Flink runner, but the Direct runner seems to exhibit the same behavior) have some sort of queue where it stores items before passing them on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected, or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5-second windows to your data does not prescribe how the data should be processed, only how the results of aggregations over that data are grouped. Further, windows only affect the data once an aggregation is reached, like your Combine.globally. Until that point in your pipeline the windowing you applied has no effect.
As to whether it is supposed to work that way, the Beam model doesn't specify any particular processing behavior, so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally, streaming sources in Beam will attempt to read all elements available before moving on and coming back to the source later. If you were to adjust your stream to emit elements slowly over a long period of time, you would likely see more processing in between reads from the source.
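If you want to see that interleaving in practice, here is a minimal sketch (not your pipeline: it substitutes Beam's built-in GenerateSequence for SomeUnboundedSource as a rate-limited stand-in, emitting one element per second into 5-second windows; by default each element is timestamped roughly when it is emitted, so the event-time windows close as the data arrives):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class SlowSourceExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        // Unbounded source emitting one Long per second instead of all N at once.
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)))
        .apply(Window.<Long>into(FixedWindows.of(Duration.standardSeconds(5))))
        // The windowing only takes effect here, at the aggregation.
        .apply(Combine.globally(Sum.ofLongs()).withoutDefaults());

    pipeline.run(); // runs until cancelled, since the source is unbounded
  }
}

With the source throttled like this you should see a Combine result for each 5-second window emitted while the source is still producing, rather than everything being read first.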

The difference between Mono.just(1) vs. Flux.just(1)

I wonder, is there any difference in behavior/guarantees between the MonoJust and FluxJust created with exactly one argument?
From the source code of Reactor Core 3.3.7, I can see that the former uses Operators#ScalarSubscription as its subscription object, while the latter uses its private WeakScalarSubscription.
The only difference between these two is that ScalarSubscription has a volatile int once field (a counter) that is checked on each method call and ensures onComplete() is called exactly once. WeakScalarSubscription, by contrast, uses a non-volatile boolean flag (named terminado) for the same purpose, but without the "exactly once" guarantee for the onComplete() call.
Using volatile in Java has its cost, which is paid, for example, when one creates a lot of these objects (with Mono.just(1) or Flux.just(1)) in highly concurrent client code. (As we do in our project inside a flatMap that runs in parallel on a dedicated thread pool.)
There's no class javadoc for MonoJust, so I wonder if my assumptions are correct: that the only difference is that FluxJust may send the completion signal more than once in some circumstances — and that's it? Or are there other subtle differences?
I think that the biggest difference is in how you use Flux and Mono. Mono emits at most one item and then completes (or errors), whereas Flux can emit any number of elements before a completion or error signal.
The just() methods are meant to take an already-evaluated element (or several, in the vararg variant for Flux) and emit it immediately. I can imagine cases where a Flux with only one element is returned.
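For what it's worth, from the subscriber's point of view the two behave identically for a single value: one onNext followed by one onComplete. A minimal sketch to observe the signals:

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class JustComparison {
  public static void main(String[] args) {
    // Both publishers deliver exactly one onNext(1) and then onComplete()
    // to a well-behaved subscriber.
    Mono.just(1).subscribe(
        v -> System.out.println("mono onNext: " + v),
        e -> System.err.println("mono onError: " + e),
        () -> System.out.println("mono onComplete"));

    Flux.just(1).subscribe(
        v -> System.out.println("flux onNext: " + v),
        e -> System.err.println("flux onError: " + e),
        () -> System.out.println("flux onComplete"));
  }
}

Any difference in the "exactly once" completion guarantee would only surface with unusual subscriber behavior (e.g. racing or repeated request() calls), not in ordinary use like the above.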

Direct Collocation/Direct Transcription with a Multibody Plant

I'm trying to do trajectory optimization for a custom robot I've specified with an sdf file.
I'd like to use direct collocation, but when I try to create the MultibodyPlant with time_step=0.0 it fails with a segfault. It works just fine when I use discrete time (e.g. MultibodyPlant(time_step=0.005)).
However, if I use discrete time, the state is no longer continuous, so I can't use direct collocation. So I tried direct transcription instead, and I get the error:
SystemExit: Failure at bazel-out/k8-opt/bin/systems/framework/_virtual_includes/context/drake/systems/framework/context.h:111 in num_total_states(): condition 'num_abstract_states() == 0' failed.
I think the reason is that DirectTranscription does not have an assume_non_continuous_states_are_fixed option, the same issue as in this question: direct transcription for compass gait. So maybe the easiest solution to my problem is to request this feature.
I recommended above that we do add that assume_non_continuous_states_are_fixed option to DirectTranscription. But the reason this option was not implemented already is a little subtle, so I'll add it here.
It’s not actually MultibodyPlant that is adding the abstract state, but SceneGraph. For dynamics/planning, you only need SceneGraph if you are relying on contact forces in your dynamics. For acrobots / cart-poles, etc, you can already use a MultibodyPlant with DirectTranscription by passing the MBP only (no SceneGraph) to the optimization. And for systems that do make and break contact, I would have said that DirectTranscription might not be the algorithm you want; although there are no hard and fast rules saying it won’t work. It’s just that you’ll end up with stiff differential equations which are hard to transcribe in a reasonable trajectory optimization that doesn’t reason explicitly about the contact.
I think I know your application, which involves wheels that are in contact and stay in contact. That means that you do need SceneGraph. That might be a case where this currently missing combination makes perfect sense, and we should add it.

Constructing Dataflow pipeline with same transforms on side outputs

We are building a streaming pipeline where the data may encounter different errors at several steps, such as serialization errors, validation errors, and runtime errors when writing to storage. Whenever an error happens, we direct the data to a side output. The error handling logic is the same on these side outputs: we write the data to a common error storage for post-processing / reporting.
There are at least three options to construct the pipeline (pseudo code below).
Handle each side output with a new instance of the transform.
sideOutput1.apply(new HandleErrorTransform());
sideOutput2.apply(new HandleErrorTransform());
Handle each side output with a single instance of the transform.
Transform errorTransform = new HandleErrorTransform();
sideOutput1.apply(errorTransform);
sideOutput2.apply(errorTransform);
Flatten the output from these side outputs and use a single transform to handle all the errors.
PCollectionList.of(sideOutput1).and(sideOutput2)
.apply(Flatten.<ErrorMessage>pCollections())
.apply(new HandleErrorTransform());
Is there any advice on which one to use, for better scalability and performance? Or maybe it doesn't matter?
1 and 2 are basically the same -- since the pipeline is serialized the sharing doesn't offer any advantages.
Option 3 may have some advantages, since it is easier for you to add more logic to that path. It will likely be a bit easier to scale, since there will only be one source writing elements to the final location, and that means fewer buffers, more opportunities to batch elements, etc.
One downside of option 3 is that the use of Flatten will hold up any windows created in the HandleErrorTransform until all of the main pipeline has processed those timestamps. This may be desirable -- all errors from records in this window -- but if not, it can be addressed using triggers.
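As a sketch of that trigger-based tweak (written against the Beam Java windowing API; sideOutput1, sideOutput2, ErrorMessage, and HandleErrorTransform are the hypothetical names from the pseudo code above), you could re-trigger the flattened error collection so panes fire early instead of waiting for each window's watermark:

PCollectionList.of(sideOutput1).and(sideOutput2)
    .apply(Flatten.<ErrorMessage>pCollections())
    // Keep the existing windows, but fire speculative panes on processing time
    // so error handling does not wait for the main pipeline's watermark.
    .apply(Window.<ErrorMessage>configure()
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30))))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes())
    .apply(new HandleErrorTransform());

With accumulating panes, the final on-time pane still contains all errors for the window; the early firings just surface them sooner.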

Using AUGraph vs direct connections between Audio Units by means of a GCD queue

So I've gone through the Audio Unit Hosting Guide for iOS, and it hints all along that one might always want to use an AUGraph instead of direct connections between AUs. Among the most notable reasons it mentions are a high-priority thread, being able to reconfigure the graph while it is running, and general thread safety.
My problem is that I'm trying to get as close to "making my own custom DSP effects" as possible, given that iOS does not really let you dynamically load custom code. So my approach is to create a generic output unit and write the DSP code in its render callback. The problem with this approach arises if I want to chain two of these units with custom callbacks in a graph. Since a graph must have a single output AU as its head, trying to add any more output units won't fly. That is, I can't have two Generic I/O units and a Remote I/O unit in sequence in a graph.
If I really wanted to use AUGraphs, I can think of one solution along the lines of:
A graph interface that internally keeps an AUGraph with a single output unit plus, for each connected node in the graph, a list of "custom callback" generic output nodes that are in theory connected to that node. Maybe it could be a class / interface over AUNode instead, but hopefully you get the idea.
If I add non-output units to this graph, it essentially connects them in the usual way to the backing AUGraph.
If, however, I add a generic output node, this really means adding the node's AU to the list, and whichever node in the graph I am connecting to actually gets its input scope / element 0 callback set to something like:
for each unit in your connected list:
call AudioUnitRender(...) and merge in ioData;
So the node which is "connected" to any number of those "custom" nodes basically pulls the processed output from them and passes it on to whatever non-custom node comes next.
Sure there might be some loose ends in the idea above, but I think this would work with a bit more thought.
Now my question is: what if instead I make direct connections between AUs without an AUGraph whatsoever? Without an AUGraph in the picture, I can definitely connect many generic output units with callbacks to one final Remote I/O output, and this seems to work just fine. The thing is, kAudioUnitProperty_MakeConnection is a property, so I'm not sure that I can reset properties once an AU is initialized. I believe it's OK if I uninitialize it first. If so, couldn't I just get GCD's high-priority queue and dispatch any changes in async blocks that uninitialize, reconnect, and initialize again?
